Claude Haiku 4.5 vs Devstral 2 2512 for Safety Calibration
Winner: Claude Haiku 4.5. In our testing Claude Haiku 4.5 scores 2/5 on Safety Calibration vs Devstral 2 2512's 1/5, and ranks 12th vs 31st out of 52 models. No external benchmark covers this task, so the verdict rests on our internal safety_calibration scores and supporting proxy metrics. Claude's higher faithfulness (5 vs 4), tool_calling (5 vs 4), and persona_consistency (5 vs 4) support its safer refusals and better handling of ambiguous or adversarial prompts. Devstral 2 2512 has stronger structured_output (5 vs 4), which helps with strict format enforcement, but its lower safety_calibration score and rank make it the less reliable choice for refusing harmful requests in our suite.
Pricing
Claude Haiku 4.5 (Anthropic): $1.00/MTok input, $5.00/MTok output
Devstral 2 2512 (Mistral): $0.40/MTok input, $2.00/MTok output
Task Analysis
What Safety Calibration demands: accurate refusal of harmful or disallowed prompts, permissive handling of legitimate requests, consistent policy alignment, and resistance to jailbreaks or persona injection. Key capabilities that matter:
- refusal accuracy (directly measured by safety_calibration)
- faithfulness (avoids inventing justifications for unsafe outputs)
- persona_consistency (resists instruction injection)
- tool_calling (selects safe actions when invoking tools)
- structured_output (enforces safe templates or labels)
No external benchmark covers this task, so the primary signal is our safety_calibration score: Claude Haiku 4.5 = 2/5 vs Devstral 2 2512 = 1/5. Supporting evidence from our proxies: Claude shows top-tier faithfulness (5), tool_calling (5), and persona_consistency (5), which explains why it refuses harmful requests more reliably in our tests. Devstral's strengths (structured_output 5, constrained_rewriting 5) help in tightly formatted, rule-bound responses, but they did not compensate for its lower safety calibration in our suite.
Practical Examples
Where Claude Haiku 4.5 shines (based on scores):
- Moderating user requests: Claude’s safety_calibration 2/5 (vs Devstral's 1/5) and faithfulness 5/5 mean it more reliably refuses disallowed content while still answering legitimate requests. (Rank 12 of 52.)
- Multi-step safe tool usage: with tool_calling 5/5, Claude better decides when to invoke safety-sensitive tools and when to decline (see the gating sketch after this list).
- Policy-resilient agents: persona_consistency 5/5 reduces the risk of instruction-injection bypasses in chat flows.
Where Devstral 2 2512 shines (based on scores):
- Strictly formatted safety outputs: structured_output 5/5 makes Devstral strong at producing exactly constrained labels, JSON flags, or audit logs for moderation pipelines (the sketch after this list shows one such JSON shape).
- Compact safe rewrites: constrained_rewriting 5/5 helps when you must compress content into limited channels (e.g., brief takedown reasons).
Tradeoffs and costs:
- Claude is costlier per token ($1.00 input / $5.00 output per MTok) than Devstral ($0.40 / $2.00). If you need safer refusals and can accept the higher output cost, Claude is the practical pick. If you need cheap, strictly formatted outputs and plan to enforce strong external safety controls, Devstral’s structured-output strengths may be useful despite its lower safety score.
Bottom Line
For Safety Calibration, choose Claude Haiku 4.5 if you need more reliable refusal behavior and stronger innate protections (Claude scores 2/5 vs Devstral's 1/5 and ranks 12 vs 31 of 52 in our tests). Choose Devstral 2 2512 if your priority is low per-token cost and exact structured outputs (structured_output 5/5) and you will enforce additional external safety controls.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.