Claude Haiku 4.5 vs Gemini 2.5 Flash for Safety Calibration
Winner: Gemini 2.5 Flash. In our testing Gemini 2.5 Flash scores 4/5 on Safety Calibration vs Claude Haiku 4.5's 2/5. That 2‑point gap (rank 6 vs rank 12 of 52) shows Gemini is substantially better at refusing harmful requests while permitting legitimate ones on the safety_calibration suite.
Anthropic
Claude Haiku 4.5
Pricing: $1.00/MTok input, $5.00/MTok output
modelpicker.net
Google
Gemini 2.5 Flash
Pricing: $0.30/MTok input, $2.50/MTok output
Task Analysis
Safety Calibration requires an LLM to refuse clearly harmful prompts, allow legitimate or benign requests, and make fine-grained, context-sensitive permission decisions. The primary measure for this task in our data is the safety_calibration score (1–5). Secondary capabilities that support safety calibration include structured_output (for policy-compliant refusal formats), tool_calling (accurate function selection when delegating enforcement), persona_consistency (resisting injection that tries to bypass refusals), and faithfulness (sticking to policy text when justifying decisions).

In our testing:

| Capability | Gemini 2.5 Flash | Claude Haiku 4.5 |
| --- | --- | --- |
| safety_calibration | 4 | 2 |
| structured_output | 4 | 4 |
| tool_calling | 5 | 5 |
| persona_consistency | 5 | 5 |
| faithfulness | 4 | 5 |

Because the primary task metric is safety_calibration, Gemini's higher 4/5 is the decisive signal; the supporting scores show that both models handle structured formats and tool calls well, but Gemini makes more correct permit/refuse judgments on the safety suite.
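To make "correct permit/refuse judgments" concrete, here is a minimal sketch of how a safety-calibration case set could be scored. The function name, rubric, and cases are illustrative assumptions, not modelpicker.net's actual harness: each case pairs the expected decision ("refuse" for harmful prompts, "permit" for benign ones) with the model's actual decision.

```python
# Hypothetical scoring sketch (not the published harness): each case is
# (expected_decision, actual_decision) for one prompt in the suite.

def calibration_accuracy(cases):
    """Fraction of permit/refuse decisions that match the expected label."""
    correct = sum(1 for expected, actual in cases if expected == actual)
    return correct / len(cases)

cases = [
    ("refuse", "refuse"),   # harmful prompt, correctly refused
    ("permit", "permit"),   # benign prompt, correctly permitted
    ("permit", "refuse"),   # over-refusal: benign prompt blocked
    ("refuse", "refuse"),   # harmful prompt, correctly refused
]
print(calibration_accuracy(cases))  # 0.75
```

Note that this single number hides the two failure directions that matter for calibration: under-refusal (permitting harm) and over-refusal (blocking benign requests); a real suite would track both.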
Practical Examples
1) Content-moderation refusal: In our tests Gemini 2.5 Flash refused unsafe user instructions more reliably (4 vs 2). Use Gemini when you need conservative, consistent refusals across adversarial phrasing.
2) Legitimate-but-sensitive permit: Gemini more often allowed benign, policy-compliant requests in our suite (score 4), so it's a safer default for services that must balance availability with safety.
3) Policy-anchored justification: Claude Haiku 4.5 scored 5/5 on faithfulness vs Gemini's 4/5, so when you need outputs tightly tied to source policy text (for auditability or exact quoting), Haiku may provide cleaner source fidelity despite its lower safety_calibration score.
4) Integration scenarios: Both models scored 5 on tool_calling in our testing, so when enforcement relies on external tools or structured refusal formats, either model integrates well. Gemini's 4/5 safety calibration, however, meant fewer manual overrides post-tool-call in our benchmarks.
5) Cost and context tradeoffs: Gemini 2.5 Flash is cheaper per million tokens (input $0.30, output $2.50) than Claude Haiku 4.5 (input $1.00, output $5.00) and also has a larger context window (1,048,576 vs 200,000 tokens), which matters for long, policy-heavy conversations where consistent refusal/permit behavior must consider long history.
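The cost tradeoff in the last point is easy to check with back-of-envelope arithmetic from the listed per-MTok prices. The token counts below are illustrative assumptions, not measurements from our suite:

```python
# Per-MTok prices as listed on the model cards: (input $/MTok, output $/MTok).
PRICES = {
    "gemini-2.5-flash": (0.30, 2.50),
    "claude-haiku-4.5": (1.00, 5.00),
}

def request_cost(model, tokens_in, tokens_out):
    """Dollar cost of one request at the listed per-million-token rates."""
    p_in, p_out = PRICES[model]
    return tokens_in / 1e6 * p_in + tokens_out / 1e6 * p_out

# A moderation call with a 10k-token policy prompt and a 2k-token verdict:
g = request_cost("gemini-2.5-flash", 10_000, 2_000)  # $0.008
h = request_cost("claude-haiku-4.5", 10_000, 2_000)  # $0.020
print(f"Gemini: ${g:.4f}  Haiku: ${h:.4f}  ratio: {h / g:.1f}x")
```

At this input/output mix Haiku costs 2.5x more per call, and the gap grows for input-heavy workloads (long policy documents), where the 0.30 vs 1.00 input rate dominates.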
Bottom Line
For Safety Calibration, choose Claude Haiku 4.5 if you prioritize maximum faithfulness to source policy text (Haiku faithfulness 5/5) or need its specific performance profile for other tasks. Choose Gemini 2.5 Flash if your primary goal is correct refusal/permit behavior—Gemini scored 4/5 vs Haiku 2/5 on safety_calibration in our testing, and ranked 6 of 52 vs Haiku's 12 of 52.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
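For readers who want a sense of how per-test judge scores could roll up into the 1–5 benchmark numbers quoted above, here is an illustrative sketch. The aggregation rule (rounded mean) and the field names are assumptions for illustration, not the published methodology:

```python
# Hypothetical aggregation sketch: roll per-test 1-5 judge scores up into
# a single 1-5 score per benchmark (rule and names are assumed, not documented).
from statistics import mean

judge_scores = {
    "safety_calibration": [4, 4, 5, 3],
    "tool_calling": [5, 5, 5, 5],
}

def benchmark_score(scores):
    """Round the mean judge score to the nearest integer on the 1-5 scale."""
    return round(mean(scores))

print({name: benchmark_score(s) for name, s in judge_scores.items()})
# {'safety_calibration': 4, 'tool_calling': 5}
```

Whatever the real rule is, a rounded integer scale means a 4-vs-2 gap like the one in this comparison reflects a consistent difference across many individual judged tests, not a single outlier.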