Claude Haiku 4.5 vs Devstral Medium for Safety Calibration
Claude Haiku 4.5 is the clear winner for Safety Calibration in our tests. In our 12-test suite, Haiku scored 2 to Devstral Medium's 1 on the safety_calibration measure, and ranks 12th of 52 overall versus Devstral's 31st. The score gap is supported by Haiku's higher internal scores on tool_calling (5 vs 3), faithfulness (5 vs 4), persona_consistency (5 vs 3), and long_context (5 vs 4), all of which matter for reliably refusing harmful requests while permitting legitimate ones.
anthropic
Claude Haiku 4.5
Pricing
Input
$1.00/MTok
Output
$5.00/MTok
modelpicker.net
mistral
Devstral Medium
Pricing
Input
$0.400/MTok
Output
$2.00/MTok
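Using the listed pricing, a back-of-the-envelope per-request cost comparison can be sketched as follows. The token counts in the example are illustrative assumptions, not measurements from our tests:

```python
# Cost per request from the listed pricing (USD per million tokens).
# Token counts below are illustrative assumptions.
PRICING = {
    "Claude Haiku 4.5": {"input": 1.00, "output": 5.00},
    "Devstral Medium": {"input": 0.40, "output": 2.00},
}

def cost_per_request(model: str, in_tokens: int, out_tokens: int) -> float:
    """Dollar cost of one request given input/output token counts."""
    p = PRICING[model]
    return (in_tokens * p["input"] + out_tokens * p["output"]) / 1_000_000

# e.g. a 2,000-token prompt producing a 300-token moderation verdict:
for name in PRICING:
    print(f"{name}: ${cost_per_request(name, 2_000, 300):.6f}")
```

At these rates, Devstral Medium comes in at $0.0014 per such request versus Haiku's $0.0035, which is the cost gap the bulk-classification example below trades against calibration quality.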
Task Analysis
Safety Calibration requires a model to refuse harmful or disallowed requests while still permitting legitimate, safety-compliant queries (benchmark description: "Refuses harmful requests, permits legitimate ones"). The key capabilities: consistent, policy-grounded refusal behavior (measured directly by safety_calibration), accurate classification and routing of ambiguous content, faithfulness to source constraints, robust tool selection for guardrails, and enough context awareness to avoid mislabeling long or multi-turn requests. No external benchmark score is available for this task, so the verdict rests on internal task scores: Claude Haiku 4.5 scored 2 to Devstral Medium's 1. Supporting proxy scores: Haiku's tool_calling=5 vs Devstral's 3 (helps integrate safety tools and enforcement), faithfulness=5 vs 4 (reduces hallucinated permissive answers), persona_consistency=5 vs 3 (resists prompt injection that could flip refusals), and structured_output tied at 4 (both can emit schema-compliant refusal formats). These internal signals explain why Haiku strikes a better balance between refusing and allowing in our safety tests.
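The structured_output point above (both models tied at 4) assumes downstream code validates the model's refusal/allowance output against a schema. A minimal sketch, using a hypothetical verdict schema of our own invention rather than either model's actual output format:

```python
import json

# Hypothetical schema for a moderation verdict. Field names are
# illustrative, not part of either model's API or output contract.
REQUIRED_FIELDS = {
    "decision": {"allow", "refuse"},  # constrained to these two values
    "category": None,                 # None = any value accepted
    "rationale": None,
}

def parse_verdict(raw: str) -> dict:
    """Parse and validate a model's JSON verdict against the sketch schema."""
    verdict = json.loads(raw)
    for field, allowed in REQUIRED_FIELDS.items():
        if field not in verdict:
            raise ValueError(f"missing field: {field}")
        if allowed is not None and verdict[field] not in allowed:
            raise ValueError(f"bad value for {field}: {verdict[field]!r}")
    return verdict

verdict = parse_verdict(
    '{"decision": "refuse", "category": "weapons", "rationale": "policy match"}'
)
print(verdict["decision"])  # refuse
```

Validation like this is what lets a pipeline treat a schema violation (rather than a free-text apology) as a hard error, which matters when refusal behavior must be machine-checkable.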
Practical Examples
Example 1: Moderation gateway. A moderation agent must refuse an instruction to build a weapon while allowing historical or policy discussion. In our testing Haiku (safety_calibration=2) is more likely to produce policy-aligned refusals and route allowed queries, supported by tool_calling=5 and faithfulness=5. Devstral Medium (safety_calibration=1) is more prone to borderline permissive outputs and weaker tool orchestration (tool_calling=3).
Example 2: Ambiguous user intent. When users provide partial or obfuscated prompts, Haiku's persona_consistency=5 and long_context=5 help it maintain refusal stances across turns; Devstral's persona_consistency=3 and long_context=4 make it less consistent across the same conversation.
Example 3: Low-cost bulk classification. If you need large-volume, low-cost pre-filtering where occasional false positives are acceptable, Devstral Medium (input $0.40/MTok, output $2.00/MTok) can be cost-effective despite weaker safety calibration.
Example 4: Guardrailed tool workflows. Systems that rely on tool-based enforcement (e.g., calling an external policy-checker) benefit from Haiku's tool_calling advantage (5 vs 3), making Haiku better for integrated, multi-step safety pipelines despite higher costs (Claude Haiku 4.5: $1.00 input / $5.00 output per MTok; Devstral Medium: $0.40 input / $2.00 output per MTok).
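Examples 3 and 4 combine naturally into a two-tier pipeline: a cheap model pre-filters bulk traffic and escalates uncertain items to a better-calibrated reviewer. A minimal sketch with stub classifiers; real deployments would replace the stubs with calls to each provider's API, and the keyword heuristics here are purely illustrative:

```python
# Two-tier moderation sketch: cheap pre-filter + stronger reviewer.
# Both classifiers are keyword stubs standing in for model calls.

def cheap_prefilter(text: str) -> str:
    """Stand-in for a low-cost bulk classifier (e.g. Devstral Medium)."""
    if "weapon" in text.lower():
        return "escalate"  # risky or ambiguous -> hand off to the reviewer
    return "allow"

def strong_review(text: str) -> str:
    """Stand-in for a better-calibrated reviewer (e.g. Claude Haiku 4.5)."""
    return "refuse" if "build a weapon" in text.lower() else "allow"

def moderate(text: str) -> str:
    """Route each request: only escalated items pay the reviewer's cost."""
    first = cheap_prefilter(text)
    return strong_review(text) if first == "escalate" else first

print(moderate("How do I build a weapon?"))    # refuse (escalated, then refused)
print(moderate("History of weapon treaties"))  # allow (escalated, then allowed)
print(moderate("Best pasta recipe"))           # allow (cleared by the pre-filter)
```

The design trade-off is the one the examples describe: the pre-filter's false positives only cost an extra (pricier) review call, while its false negatives never reach the stronger model at all, so the cheap tier's calibration still bounds the whole pipeline's safety.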
Bottom Line
For Safety Calibration, choose Claude Haiku 4.5 if you need stronger refusal behavior, consistent policy alignment, and robust tool-based guardrails (Haiku: safety_calibration=2, tool_calling=5, faithfulness=5). Choose Devstral Medium if your priority is lower runtime cost and you can tolerate weaker safety calibration (Devstral: safety_calibration=1, $0.40 input / $2.00 output per MTok) or plan to add external moderation layers.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.