Claude Haiku 4.5 vs Devstral Small 1.1 for Safety Calibration

Tie: neither model wins. In our testing, both Claude Haiku 4.5 and Devstral Small 1.1 score 2/5 on Safety Calibration (ranked 12th of 52 for this task). Base safety calibration is equivalent, but Haiku 4.5 offers stronger supporting capabilities (Faithfulness 5 vs 4, Tool Calling 5 vs 4, Long Context 5 vs 4) that make it easier to build safer, context-aware refusal/allow logic. Devstral Small 1.1 matches Haiku on Safety Calibration while being far cheaper ($0.10 input / $0.30 output per MTok vs Haiku's $1.00 / $5.00). Choose based on integration needs: Haiku for stronger underlying controls; Devstral for low-cost, high-volume gating where equivalent raw safety calibration suffices.

Anthropic

Claude Haiku 4.5

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$1.00/MTok

Output

$5.00/MTok

Context Window: 200K

modelpicker.net

Mistral

Devstral Small 1.1

Overall
3.08/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
2/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
2/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.10/MTok

Output

$0.30/MTok

Context Window: 131K


Task Analysis

What Safety Calibration demands: the model must reliably refuse harmful requests while permitting legitimate ones. Key capabilities: accurate intent classification (routing and refusal), faithful adherence to constraints (no hallucinated justifications), robust instruction-following under adversarial prompts (persona resistance), and consistent behavior across long contexts and tool-driven flows. With no applicable external benchmark for this task, we lead with our internal task scores: both models score 2/5 on Safety Calibration and rank 12th of 52. Supporting signals explain why the outcomes converge. Claude Haiku 4.5 scores higher on Faithfulness (5 vs 4), Tool Calling (5 vs 4), Long Context (5 vs 4), Persona Consistency (5 vs 2), and Agentic Planning (5 vs 2), suggesting it is better suited to safety setups that depend on context, tool-driven checks, or multi-step policy logic. Devstral Small 1.1 delivers the same base Safety Calibration score (2/5) and comparable Structured Output and Classification (both 4/5), which supports schema-based permit/refuse flows at much lower cost ($0.10 input / $0.30 output per MTok).
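A schema-based permit/refuse flow like the one described above can be sketched as a strict validator around the model's reply. This is a hypothetical sketch, not either vendor's API: the verdict format (`decision`/`reason` fields) and the `parse_verdict` helper are our own illustration, and the key design choice is to fail closed, so any malformed or evasive model output is coerced to a deny.

```python
import json

# Hypothetical moderation verdict format the model is asked to emit as strict JSON.
# Anything outside this schema fails closed (deny), so a garbled or evasive
# reply can never be mistaken for an allow.
ALLOWED_DECISIONS = {"allow", "deny"}

def parse_verdict(raw: str) -> dict:
    """Validate a model's moderation reply; fail closed on any deviation."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return {"decision": "deny", "reason": "unparseable model output"}
    if verdict.get("decision") not in ALLOWED_DECISIONS:
        return {"decision": "deny", "reason": "unexpected decision value"}
    if not isinstance(verdict.get("reason"), str):
        return {"decision": "deny", "reason": "missing reason field"}
    return verdict

# A well-formed verdict passes through unchanged; free-form text is denied.
print(parse_verdict('{"decision": "allow", "reason": "benign recipe request"}'))
print(parse_verdict("Sure! Here's my answer:"))
```

Because both models score 4/5 on Structured Output, either should hold this fixed format reasonably well; the validator exists for the cases where they don't.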

Practical Examples

  1. High-stakes content gateway with multi-step checks: Claude Haiku 4.5. Haiku's Faithfulness 5 and Tool Calling 5 mean you can chain a classifier plus a policy tool and keep consistent refusals across a 200K-token context. In our testing Haiku scored 2/5 on raw safety calibration, but its higher supporting scores reduce integration risk when building layered safety controls.
  2. Low-cost moderation for short-form user content: Devstral Small 1.1. Devstral also scores 2/5 on Safety Calibration and matches Haiku on Structured Output (4/5) and Classification (4/5), so for high-volume, rule-based permit/refuse pipelines its lower cost ($0.10 input / $0.30 output per MTok) makes it a practical choice.
  3. Persona-robust refusal against adversarial prompts: Claude Haiku 4.5. Haiku's Persona Consistency of 5 vs Devstral's 2 suggests Haiku will be easier to harden against injection attacks that try to force unsafe replies.
  4. Simple schema-based API that only needs consistent JSON refusals: either model. Both score 4/5 on Structured Output and Classification, so for fixed-format allow/deny responses both performed similarly in our tests.
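The layered, cost-aware pattern behind examples 1 and 2 can be sketched as a two-tier gate: a cheap first-pass classifier (a Devstral-style model) decides clear cases, and only ambiguous ones escalate to a stronger reviewer (a Haiku-style model). Both model calls below are stand-in stubs of our own invention, not real API clients; the keyword heuristics and the 0.8 confidence threshold are illustrative assumptions.

```python
# Two-tier gating sketch: cheap model first, escalate only when unsure.
# cheap_classifier and strong_reviewer are stubs standing in for real
# model calls; swap in actual API clients in a real deployment.

def cheap_classifier(text: str) -> tuple[str, float]:
    """Stand-in for a low-cost model returning (label, confidence)."""
    blocked = {"exploit", "weapon"}
    if any(word in text.lower() for word in blocked):
        return "deny", 0.9
    # Questions are treated as less certain here, purely for illustration.
    return ("allow", 0.6) if "?" in text else ("allow", 0.95)

def strong_reviewer(text: str) -> str:
    """Stand-in for a higher-capability model used only on escalations."""
    return "deny" if "bypass" in text.lower() else "allow"

def gate(text: str, threshold: float = 0.8) -> str:
    label, confidence = cheap_classifier(text)
    if confidence >= threshold:
        return label              # cheap tier is confident: no escalation
    return strong_reviewer(text)  # ambiguous: pay for the stronger check
```

The economics follow from the pricing above: with the cheap tier at roughly 1/10th the input cost, escalating only low-confidence cases keeps the stronger model's price off the high-volume path.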

Bottom Line

For Safety Calibration, choose Claude Haiku 4.5 if you need stronger supporting capabilities (faithfulness, tool calling, long context, persona consistency) to construct layered, context-aware refusal logic. Choose Devstral Small 1.1 if you need the same baseline safety calibration (both score 2/5 in our testing) at much lower cost ($0.10 input / $0.30 output per MTok vs Haiku's $1.00 / $5.00).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions