Claude Haiku 4.5 vs Devstral Small 1.1 for Safety Calibration

Tie: neither model wins. In our testing, both Claude Haiku 4.5 and Devstral Small 1.1 score 2/5 on Safety Calibration (ranked 12th of 52 for this task). Base safety calibration is equivalent, but Haiku 4.5 offers stronger supporting capabilities (Faithfulness 5 vs 4, Tool Calling 5 vs 4, Long Context 5 vs 4) that make it easier to build safer, context-aware refusal/allow logic. Devstral Small 1.1 matches Haiku on Safety Calibration while being far cheaper ($0.10 input / $0.30 output per MTok vs Haiku's $1.00 / $5.00). Choose based on integration needs: Haiku for stronger underlying controls; Devstral for low-cost, high-volume gating where equivalent raw safety calibration suffices.

Anthropic

Claude Haiku 4.5

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$1.00/MTok

Output

$5.00/MTok

Context Window: 200K

modelpicker.net

Mistral

Devstral Small 1.1

Overall
3.08/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
2/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
2/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.10/MTok

Output

$0.30/MTok

Context Window: 131K


Task Analysis

What Safety Calibration demands: the model must reliably refuse harmful requests while permitting legitimate ones. Key capabilities: accurate intent classification (routing and refusal), faithful adherence to constraints (no hallucinated justifications), robust instruction-following under adversarial prompts (persona resistance), and consistent behavior across long contexts and tool-driven flows. With no applicable external benchmark for this task, we lead with our internal task scores: both models score 2/5 on Safety Calibration and rank 12th of 52. Supporting signals explain why the outcomes converge. Claude Haiku 4.5 scores higher on Faithfulness (5 vs 4), Tool Calling (5 vs 4), Long Context (5 vs 4), Persona Consistency (5 vs 2), and Agentic Planning (5 vs 2), suggesting it is better suited to safety setups that depend on context, tool-driven checks, or multi-step policy logic. Devstral Small 1.1 delivers the same base Safety Calibration score (2/5) and comparable Structured Output and Classification (both 4/5), which supports schema-based permit/refuse flows at much lower cost ($0.10 input / $0.30 output per MTok).
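A schema-based permit/refuse flow like the one described above can be sketched as a strict validator around the model's reply. This is a hypothetical sketch, not either vendor's API: the verdict format (`decision`/`reason` fields) and the `parse_verdict` helper are our own illustration, and the key design choice is to fail closed, so any malformed or evasive model output is coerced to a deny.

```python
import json

# Hypothetical moderation verdict format the model is asked to emit as strict JSON.
# Anything outside this schema fails closed (deny), so a garbled or evasive
# reply can never be mistaken for an allow.
ALLOWED_DECISIONS = {"allow", "deny"}

def parse_verdict(raw: str) -> dict:
    """Validate a model's moderation reply; fail closed on any deviation."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return {"decision": "deny", "reason": "unparseable model output"}
    if verdict.get("decision") not in ALLOWED_DECISIONS:
        return {"decision": "deny", "reason": "unexpected decision value"}
    if not isinstance(verdict.get("reason"), str):
        return {"decision": "deny", "reason": "missing reason field"}
    return verdict

# A well-formed verdict passes through unchanged; free-form text is denied.
print(parse_verdict('{"decision": "allow", "reason": "benign recipe request"}'))
print(parse_verdict("Sure! Here's my answer:"))
```

Because both models score 4/5 on Structured Output, either should hold this fixed format reasonably well; the validator exists for the cases where they don't.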

Practical Examples

  1. High-stakes content gateway with multi-step checks: Claude Haiku 4.5. Haiku's Faithfulness 5 and Tool Calling 5 mean you can chain a classifier plus a policy tool and keep consistent refusals across a 200K-token context. In our testing Haiku scored 2/5 on raw safety calibration, but its higher supporting scores reduce integration risk when building layered safety controls.
  2. Low-cost moderation for short-form user content: Devstral Small 1.1. Devstral also scores 2/5 on Safety Calibration and matches Haiku on Structured Output (4/5) and Classification (4/5), so for high-volume, rule-based permit/refuse pipelines its lower cost ($0.10 input / $0.30 output per MTok) makes it a practical choice.
  3. Persona-robust refusal against adversarial prompts: Claude Haiku 4.5. Haiku's Persona Consistency of 5 vs Devstral's 2 suggests Haiku will be easier to harden against injection attacks that try to force unsafe replies.
  4. Simple schema-based API that only needs consistent JSON refusals: either model. Both score 4/5 on Structured Output and Classification, so for fixed-format allow/deny responses both performed similarly in our tests.
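The layered, cost-aware pattern behind examples 1 and 2 can be sketched as a two-tier gate: a cheap first-pass classifier (a Devstral-style model) decides clear cases, and only ambiguous ones escalate to a stronger reviewer (a Haiku-style model). Both model calls below are stand-in stubs of our own invention, not real API clients; the keyword heuristics and the 0.8 confidence threshold are illustrative assumptions.

```python
# Two-tier gating sketch: cheap model first, escalate only when unsure.
# cheap_classifier and strong_reviewer are stubs standing in for real
# model calls; swap in actual API clients in a real deployment.

def cheap_classifier(text: str) -> tuple[str, float]:
    """Stand-in for a low-cost model returning (label, confidence)."""
    blocked = {"exploit", "weapon"}
    if any(word in text.lower() for word in blocked):
        return "deny", 0.9
    # Questions are treated as less certain here, purely for illustration.
    return ("allow", 0.6) if "?" in text else ("allow", 0.95)

def strong_reviewer(text: str) -> str:
    """Stand-in for a higher-capability model used only on escalations."""
    return "deny" if "bypass" in text.lower() else "allow"

def gate(text: str, threshold: float = 0.8) -> str:
    label, confidence = cheap_classifier(text)
    if confidence >= threshold:
        return label              # cheap tier is confident: no escalation
    return strong_reviewer(text)  # ambiguous: pay for the stronger check
```

The economics follow from the pricing above: with the cheap tier at roughly 1/10th the input cost, escalating only low-confidence cases keeps the stronger model's price off the high-volume path.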

Bottom Line

For Safety Calibration, choose Claude Haiku 4.5 if you need stronger supporting capabilities (faithfulness, tool calling, long context, persona consistency) to construct layered, context-aware refusal logic. Choose Devstral Small 1.1 if you need the same baseline safety calibration (both score 2/5 in our testing) at much lower cost ($0.10 input / $0.30 output per MTok vs Haiku's $1.00 / $5.00).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions