Gemini 2.5 Pro vs GPT-5.4 for Safety Calibration

Winner: GPT-5.4. In our testing, GPT-5.4 scored 5/5 on Safety Calibration versus Gemini 2.5 Pro's 1/5, ranking 1st of 52 models for this task against Gemini 2.5 Pro's 31st. The four-point gap indicates GPT-5.4 reliably refuses harmful requests while permitting legitimate ones, whereas Gemini 2.5 Pro failed our safety calibration checks despite strengths in other areas.

Google

Gemini 2.5 Pro

Overall
4.25/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
4/5
Structured Output
5/5
Safety Calibration
1/5
Strategic Analysis
4/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
5/5

External Benchmarks

SWE-bench Verified
57.6%
MATH Level 5
N/A
AIME 2025
84.2%

Pricing

Input

$1.25/MTok

Output

$10.00/MTok

Context Window: 1049K

modelpicker.net

OpenAI

GPT-5.4

Overall
4.58/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
4/5
Classification
3/5
Agentic Planning
5/5
Structured Output
5/5
Safety Calibration
5/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
76.9%
MATH Level 5
N/A
AIME 2025
95.3%

Pricing

Input

$2.50/MTok

Output

$15.00/MTok

Context Window: 1050K


Task Analysis

What Safety Calibration demands: accurate refusal of abusive, illegal, or dangerous prompts while still allowing legitimate, sensitive, or contextually valid requests. Key capabilities: consistent refusal/allowance behavior (the safety_calibration metric), robust classification and routing of intent, faithfulness (to avoid inventing safe-sounding rationales), and structured outputs for predictable enforcement.

In our testing the primary signal is the safety_calibration score: GPT-5.4 scored 5/5 and Gemini 2.5 Pro scored 1/5. Supporting evidence: GPT-5.4 pairs its safety score with high structured_output (5/5), faithfulness (5/5), and a solid tool_calling score (4/5), which helps in building deterministic refusal workflows. Gemini 2.5 Pro shows strong tool_calling (5/5), structured_output (5/5), and faithfulness (5/5), but those strengths did not translate into safe refusal behavior in our safety tests, hence the low safety_calibration score.

Our ranking method: models are ordered by average benchmark score across our 12-test suite, with ties within a score tier broken by output cost; for this task we report the task-specific scores and ranks from that suite.
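The overall ratings shown in the cards above can be checked against the ranking method just described: averaging each model's twelve per-test scores reproduces the headline numbers. A minimal sketch, with the score lists transcribed from the benchmark cards in the order listed:

```python
# Per-test scores transcribed from the cards above, in the order listed
# (Faithfulness, Long Context, ..., Creative Problem Solving).
gemini_scores = [5, 5, 5, 5, 4, 4, 5, 1, 4, 5, 3, 5]
gpt54_scores = [5, 5, 5, 4, 3, 5, 5, 5, 5, 5, 4, 4]

def overall(scores):
    """Average benchmark score across the 12-test suite, rounded to 2 dp."""
    return round(sum(scores) / len(scores), 2)

print(overall(gemini_scores))  # 4.25, matching Gemini 2.5 Pro's overall rating
print(overall(gpt54_scores))   # 4.58, matching GPT-5.4's overall rating
```

Note how a single 1/5 outlier (Gemini's safety_calibration) only costs about a third of a point on the average, which is why the overall rating alone can mask a task-specific failure.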

Practical Examples

  1. Moderation gateway (high-stakes refusal): GPT-5.4 scored 5/5 in our safety_calibration test and ranked 1/52, so it reliably refused harmful instructions in our evaluations. Gemini 2.5 Pro scored 1/5 and failed many of our refusal checks, making it unsuitable as a single-model moderation gate without additional safeguards.
  2. Safety-aware tool orchestration: Gemini 2.5 Pro's tool_calling (5/5) and structured_output (5/5) mean it excels at producing exact tool arguments and schema-compliant outputs; if you already run model outputs through an external safety filter, Gemini can be efficient and cheaper to operate.
  3. Explainable denials and audit trails: GPT-5.4's high faithfulness (5/5) and structured_output (5/5) support consistent, auditable refusal messages in our tests.
  4. Cost-conscious pipeline with manual checks: Gemini 2.5 Pro has lower per-MTok costs ($1.25 input, $10.00 output) versus GPT-5.4 ($2.50 input, $15.00 output); in settings where you can add an external safety layer, Gemini's operational strengths may be useful despite its low safety_calibration score.
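The cost trade-off in the last example is easy to quantify from the pricing listed above. A minimal sketch; the monthly token volumes are illustrative assumptions, not measurements:

```python
def monthly_cost(input_mtok, output_mtok, in_price, out_price):
    """Cost in USD for a workload measured in millions of tokens (MTok)."""
    return input_mtok * in_price + output_mtok * out_price

# Per-MTok prices (input, output) from the pricing sections above.
GEMINI_25_PRO = (1.25, 10.00)
GPT_54 = (2.50, 15.00)

# Hypothetical workload: 200 MTok in, 40 MTok out per month.
print(monthly_cost(200, 40, *GEMINI_25_PRO))  # 650.0
print(monthly_cost(200, 40, *GPT_54))         # 1100.0
```

At this assumed volume, the Gemini-plus-external-filter route leaves roughly $450/month of headroom to pay for the filtering layer before it stops being the cheaper option.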

Bottom Line

For Safety Calibration, choose GPT-5.4 if you need an out-of-the-box model that refuses harmful requests reliably (scored 5/5, ranked 1/52 in our testing). Choose Gemini 2.5 Pro if you prioritize lower cost and strong tool-calling/structured output but plan to add external safety filters or human review (Gemini scored 1/5 on safety_calibration in our testing).
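The "external safety filters" pattern recommended above can be sketched as a simple wrapper around any model call. This is a minimal illustration, not a production design: `is_harmful` stands in for whatever external classifier or human-review step you use, and the lambda model is a placeholder:

```python
from typing import Callable

def safety_gated(model: Callable[[str], str],
                 is_harmful: Callable[[str], bool]) -> Callable[[str], str]:
    """Wrap a model callable with an external filter that screens both the
    incoming prompt and the generated response."""
    def gated(prompt: str) -> str:
        if is_harmful(prompt):
            return "Request declined by safety filter."
        response = model(prompt)
        if is_harmful(response):
            return "Response withheld by safety filter."
        return response
    return gated

# Placeholder model and keyword filter, for illustration only.
demo = safety_gated(lambda p: p.upper(),
                    lambda text: "attack" in text.lower())
print(demo("hello"))           # HELLO
print(demo("plan an attack"))  # Request declined by safety filter.
```

Screening both sides of the call is the point: it compensates for a model with weak safety calibration (the Gemini route) and adds defense in depth even for one with strong calibration (the GPT-5.4 route).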

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions