Claude Haiku 4.5 vs Devstral Small 1.1 for Safety Calibration
Tie: neither model wins. In our testing, both Claude Haiku 4.5 and Devstral Small 1.1 score 2/5 on Safety Calibration (ranked 12 of 52 for the task). Base safety calibration is equivalent, but Haiku 4.5 offers stronger supporting capabilities (faithfulness 5 vs 4, tool_calling 5 vs 4, long_context 5 vs 4) that make it easier to build safer, context-aware refusal/allow logic. Devstral Small 1.1 matches Haiku on safety_calibration (2/5) while being far cheaper ($0.10 input / $0.30 output per MTok vs Haiku's $1.00 / $5.00). Choose based on integration needs: Haiku for stronger underlying controls; Devstral for low-cost, high-volume gating where equivalent raw safety calibration suffices.
Claude Haiku 4.5 (Anthropic)
Pricing: $1.00/MTok input, $5.00/MTok output
Devstral Small 1.1 (Mistral)
Pricing: $0.10/MTok input, $0.30/MTok output
Task Analysis
What Safety Calibration demands: the model must reliably refuse harmful requests and permit legitimate ones. Key capabilities: accurate intent classification (routing/refusal), faithful adherence to constraints (no hallucinated justifications), robust instruction-following under adversarial prompts (persona resistance), and consistent behavior across long contexts and tool-driven flows. With no external benchmark available for this task, we lead with our internal task scores: both models score 2/5 on safety_calibration in our tests and are ranked 12 of 52. Supporting signals explain why the outcomes converge. Claude Haiku 4.5 scores higher on faithfulness (5 vs 4), tool_calling (5 vs 4), long_context (5 vs 4), persona_consistency (5 vs 2), and agentic_planning (5 vs 2), which suggests it is better suited for safety setups that depend on context, tool-driven checks, or multi-step policy logic. Devstral Small 1.1 provides the same base safety calibration score (2/5) and comparable structured_output and classification (both 4), which supports schema-based permit/refuse flows at much lower cost ($0.10 input / $0.30 output per MTok).
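The layered setup described above (classifier routing plus policy checks ahead of the model's own refusal behavior) can be sketched as a simple gate. This is a minimal illustration, not code from either vendor; the classifier, blocked-topic list, and function names are all hypothetical stand-ins for real components.

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    REFUSE = "refuse"
    ESCALATE = "escalate"  # route to human review when signals disagree

@dataclass
class GateResult:
    verdict: Verdict
    reason: str

# Hypothetical keyword policy standing in for a real policy engine.
BLOCKED_TOPICS = {"weapons", "malware"}

def classify_intent(text: str) -> str:
    """Toy stand-in for a model-based intent classifier."""
    lowered = text.lower()
    for topic in BLOCKED_TOPICS:
        if topic in lowered:
            return topic
    return "benign"

def safety_gate(text: str, model_refused: bool) -> GateResult:
    """Layered permit/refuse: policy check first, then the model's own signal."""
    intent = classify_intent(text)
    if intent != "benign":
        return GateResult(Verdict.REFUSE, f"policy: blocked topic '{intent}'")
    if model_refused:
        # Policy says benign but the model refused: disagreement -> escalate.
        return GateResult(Verdict.ESCALATE, "model refusal on benign-classified input")
    return GateResult(Verdict.ALLOW, "passed classifier and policy checks")
```

Layering like this is what the supporting scores (tool_calling, faithfulness) make easier: a model with weak base safety calibration can still sit inside a pipeline where independent checks catch its misses.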
Practical Examples
1) High-stakes content gateway with multi-step checks (Claude Haiku 4.5): Haiku's faithfulness 5 and tool_calling 5 mean you can chain a classifier plus a policy tool and keep consistent refusals across a 200,000-token context. In our testing Haiku scored 2/5 on raw safety calibration, but its higher supporting scores reduce integration risk when building layered safety controls.
2) Low-cost moderation for short-form user content (Devstral Small 1.1): Devstral also scores 2/5 on safety_calibration and matches Haiku on structured_output (4) and classification (4), so for high-volume, rule-based permit/refuse pipelines its lower cost ($0.10 input / $0.30 output per MTok) makes it a practical choice.
3) Persona-robust refusal under adversarial prompts (Claude Haiku 4.5): Haiku's persona_consistency of 5 vs Devstral's 2 suggests Haiku will be easier to harden against injection attacks that try to force unsafe replies.
4) Simple schema-based API that only needs consistent JSON refusals (either model): both score structured_output 4 and classification 4, so for fixed-format allow/deny responses they performed similarly in our tests.
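For the schema-based allow/deny pattern in example 4, the application side typically validates the model's fixed-format JSON reply before acting on it. A minimal sketch, assuming a hypothetical two-field schema (`decision`, `reason`) that is not part of either model's API:

```python
import json

# Hypothetical fixed schema for allow/deny responses.
REQUIRED_FIELDS = {"decision", "reason"}
VALID_DECISIONS = {"allow", "deny"}

def parse_gate_response(raw: str) -> dict:
    """Validate a model's JSON permit/refuse reply; raise on schema violations."""
    obj = json.loads(raw)
    missing = REQUIRED_FIELDS - obj.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if obj["decision"] not in VALID_DECISIONS:
        raise ValueError(f"invalid decision: {obj['decision']!r}")
    return obj

# Example: a well-formed deny reply passes validation.
reply = parse_gate_response('{"decision": "deny", "reason": "policy violation"}')
```

Failing closed on any schema violation (treating unparseable output as a deny) is the usual design choice when both models' structured_output scores are good but not perfect.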
Bottom Line
For Safety Calibration, choose Claude Haiku 4.5 if you need stronger supporting capabilities (faithfulness, tool calling, long context, persona consistency) to construct layered, context-aware refusal logic. Choose Devstral Small 1.1 if the same baseline safety calibration (both score 2/5 in our testing) suffices and cost matters: $0.10 input / $0.30 output per MTok vs Haiku's $1.00 / $5.00.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.