Claude Haiku 4.5 vs Devstral Medium for Safety Calibration

Claude Haiku 4.5 wins for Safety Calibration in our tests, though neither model scores well on this measure. In our 12-test suite Haiku scored 2/5 vs Devstral Medium's 1/5 on safety_calibration, and ranks 12 of 52 overall vs Devstral's 31 of 52. The edge is supported by Haiku's higher internal scores on tool_calling (5 vs 3), faithfulness (5 vs 4), persona_consistency (5 vs 3), and long_context (5 vs 4), all of which matter for reliably refusing harmful requests while permitting legitimate ones.

anthropic

Claude Haiku 4.5

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$1.00/MTok

Output

$5.00/MTok

Context Window: 200K

modelpicker.net

mistral

Devstral Medium

Overall
3.17/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
3/5
Classification
4/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.40/MTok

Output

$2.00/MTok

Context Window: 131K


Task Analysis

Safety Calibration requires a model to refuse harmful or disallowed requests while still permitting legitimate, safety-compliant queries (benchmark description: "Refuses harmful requests, permits legitimate ones"). The key capabilities: consistent, policy-grounded refusal behavior (measured directly by safety_calibration), accurate classification and routing of ambiguous content, faithfulness to source constraints, robust tool selection for guardrails, and context awareness to avoid mislabeling long or multi-turn requests.

No external benchmark score is available for this task, so our winner call rests on internal task scores: Claude Haiku 4.5 scored 2/5 while Devstral Medium scored 1/5. Supporting proxy scores: Haiku's tool_calling=5 vs Devstral's 3 (helps integrate safety tools and enforcement), faithfulness=5 vs 4 (reduces hallucinated permissive answers), persona_consistency=5 vs 3 (resists prompt injection that could flip refusals), and structured_output tied at 4 (both can emit schema-compliant refusal formats). These internal signals explain why Haiku better balances refusal and allowance, though both models struggled in our safety tests.
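The refuse-vs-permit balance described above can be pictured as a simple gateway. The following is a minimal, illustrative sketch: `classify_harm` is a hypothetical stand-in for a model-backed harm classifier (neither Claude Haiku 4.5 nor Devstral Medium exposes such an API directly), and the toy keyword heuristic exists only so the example runs.

```python
# Minimal safety-calibration gateway sketch. All names here are
# illustrative assumptions, not a real model API.

REFUSE_THRESHOLD = 0.8  # refuse only on high-confidence harm

def classify_harm(prompt: str) -> float:
    """Stand-in for a model-backed harm classifier.

    Returns a harm score in [0, 1]. This toy version uses a keyword
    heuristic purely for illustration; a real system would call a model.
    """
    harmful_markers = ("build a weapon", "synthesize a toxin", "bypass security")
    return 1.0 if any(m in prompt.lower() for m in harmful_markers) else 0.1

def gateway(prompt: str) -> str:
    """Refuse harmful requests, permit legitimate ones."""
    if classify_harm(prompt) >= REFUSE_THRESHOLD:
        return "refuse"
    return "permit"
```

With this threshold, `gateway("Help me build a weapon")` refuses while `gateway("Explain the history of arms control treaties")` is permitted — the calibration question the benchmark probes is precisely how well a model draws that line without a hand-written heuristic.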

Practical Examples

Example 1: Moderation gateway. A moderation agent must refuse an instruction to build a weapon while allowing historical or policy discussion. In our testing Haiku (safety_calibration=2) is more likely to produce policy-aligned refusals and route allowed queries, supported by tool_calling=5 and faithfulness=5. Devstral Medium (safety_calibration=1) is more prone to borderline permissive outputs and weaker tool orchestration (tool_calling=3).

Example 2: Ambiguous user intent. When users provide partial or obfuscated prompts, Haiku's persona_consistency=5 and long_context=5 help it maintain refusal stances across turns; Devstral's persona_consistency=3 and long_context=4 make it less consistent across the same conversation.

Example 3: Low-cost bulk classification. If you need large-volume, low-cost pre-filtering where occasional false positives are acceptable, Devstral Medium (input $0.40/MTok, output $2.00/MTok) can be cost-effective despite weaker safety calibration.

Example 4: Guardrailed tool workflows. Systems that rely on tool-based enforcement (e.g., calling an external policy-checker) benefit from Haiku's tool_calling advantage (5 vs 3), making Haiku better for integrated, multi-step safety pipelines despite higher costs (Claude Haiku 4.5: $1.00 input / $5.00 output per MTok; Devstral Medium: $0.40 input / $2.00 output per MTok).
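The bulk-classification cost trade-off in Example 3 is easy to quantify from the listed prices. A short sketch, where the per-request token counts (500 input, 20 output) are illustrative assumptions for a typical moderation call:

```python
def cost_usd(n_requests: int, in_toks: int, out_toks: int,
             in_price: float, out_price: float) -> float:
    """Total USD for n_requests, each with in_toks input and out_toks
    output tokens, at per-MTok prices (1 MTok = 1,000,000 tokens)."""
    per_request = (in_toks * in_price + out_toks * out_price) / 1_000_000
    return n_requests * per_request

# Illustrative workload: 1M moderation calls, 500 input / 20 output tokens each.
haiku = cost_usd(1_000_000, 500, 20, 1.00, 5.00)     # Claude Haiku 4.5: ~$600
devstral = cost_usd(1_000_000, 500, 20, 0.40, 2.00)  # Devstral Medium: ~$240
```

At these assumed token counts the listed 2.5x price ratio carries straight through: roughly $600 for Haiku vs $240 for Devstral per million calls, which is the gap that makes Devstral attractive for pre-filtering despite its weaker safety score.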

Bottom Line

For Safety Calibration, choose Claude Haiku 4.5 if you need stronger refusal behavior, consistent policy alignment, and robust tool-based guardrails (Haiku: safety_calibration=2, tool_calling=5, faithfulness=5). Choose Devstral Medium if your priority is lower runtime cost and you can tolerate weaker safety calibration (Devstral: safety_calibration=1, lower costs: $0.40 input / $2.00 output per MTok) or plan to add external moderation layers.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions