Claude Haiku 4.5 vs Devstral Medium for Safety Calibration

Claude Haiku 4.5 wins for Safety Calibration in our tests, though neither model scores well on this measure. In our 12-test suite Haiku scored 2/5 vs Devstral Medium's 1/5 on safety_calibration, and ranks 12 of 52 overall vs Devstral's 31 of 52. The edge is supported by Haiku's higher internal scores on tool_calling (5 vs 3), faithfulness (5 vs 4), persona_consistency (5 vs 3), and long_context (5 vs 4), all of which matter for reliably refusing harmful requests while permitting legitimate ones.

anthropic

Claude Haiku 4.5

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$1.00/MTok

Output

$5.00/MTok

Context Window: 200K

modelpicker.net

mistral

Devstral Medium

Overall
3.17/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
3/5
Classification
4/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.40/MTok

Output

$2.00/MTok

Context Window: 131K


Task Analysis

Safety Calibration requires a model to refuse harmful or disallowed requests while still permitting legitimate, safety-compliant queries (benchmark description: "Refuses harmful requests, permits legitimate ones"). The key capabilities: consistent, policy-grounded refusal behavior (measured directly by safety_calibration), accurate classification and routing of ambiguous content, faithfulness to source constraints, robust tool selection for guardrails, and context awareness to avoid mislabeling long or multi-turn requests.

No external benchmark score is available for this task, so our winner call rests on internal task scores: Claude Haiku 4.5 scored 2/5 while Devstral Medium scored 1/5. Supporting proxy scores: Haiku's tool_calling=5 vs Devstral's 3 (helps integrate safety tools and enforcement), faithfulness=5 vs 4 (reduces hallucinated permissive answers), persona_consistency=5 vs 3 (resists prompt injection that could flip refusals), and structured_output tied at 4 (both can emit schema-compliant refusal formats). These internal signals explain why Haiku better balances refusal and allowance, though both models struggled in our safety tests.
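The refuse-vs-permit balance described above can be pictured as a simple gateway. The following is a minimal, illustrative sketch: `classify_harm` is a hypothetical stand-in for a model-backed harm classifier (neither Claude Haiku 4.5 nor Devstral Medium exposes such an API directly), and the toy keyword heuristic exists only so the example runs.

```python
# Minimal safety-calibration gateway sketch. All names here are
# illustrative assumptions, not a real model API.

REFUSE_THRESHOLD = 0.8  # refuse only on high-confidence harm

def classify_harm(prompt: str) -> float:
    """Stand-in for a model-backed harm classifier.

    Returns a harm score in [0, 1]. This toy version uses a keyword
    heuristic purely for illustration; a real system would call a model.
    """
    harmful_markers = ("build a weapon", "synthesize a toxin", "bypass security")
    return 1.0 if any(m in prompt.lower() for m in harmful_markers) else 0.1

def gateway(prompt: str) -> str:
    """Refuse harmful requests, permit legitimate ones."""
    if classify_harm(prompt) >= REFUSE_THRESHOLD:
        return "refuse"
    return "permit"
```

With this threshold, `gateway("Help me build a weapon")` refuses while `gateway("Explain the history of arms control treaties")` is permitted — the calibration question the benchmark probes is precisely how well a model draws that line without a hand-written heuristic.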

Practical Examples

Example 1: Moderation gateway. A moderation agent must refuse an instruction to build a weapon while allowing historical or policy discussion. In our testing Haiku (safety_calibration=2) is more likely to produce policy-aligned refusals and route allowed queries, supported by tool_calling=5 and faithfulness=5. Devstral Medium (safety_calibration=1) is more prone to borderline permissive outputs and weaker tool orchestration (tool_calling=3).

Example 2: Ambiguous user intent. When users provide partial or obfuscated prompts, Haiku's persona_consistency=5 and long_context=5 help it maintain refusal stances across turns; Devstral's persona_consistency=3 and long_context=4 make it less consistent across the same conversation.

Example 3: Low-cost bulk classification. If you need large-volume, low-cost pre-filtering where occasional false positives are acceptable, Devstral Medium (input $0.40/MTok, output $2.00/MTok) can be cost-effective despite weaker safety calibration.

Example 4: Guardrailed tool workflows. Systems that rely on tool-based enforcement (e.g., calling an external policy-checker) benefit from Haiku's tool_calling advantage (5 vs 3), making Haiku better for integrated, multi-step safety pipelines despite higher costs (Claude Haiku 4.5: $1.00 input / $5.00 output per MTok; Devstral Medium: $0.40 input / $2.00 output per MTok).
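The bulk-classification cost trade-off in Example 3 is easy to quantify from the listed prices. A short sketch, where the per-request token counts (500 input, 20 output) are illustrative assumptions for a typical moderation call:

```python
def cost_usd(n_requests: int, in_toks: int, out_toks: int,
             in_price: float, out_price: float) -> float:
    """Total USD for n_requests, each with in_toks input and out_toks
    output tokens, at per-MTok prices (1 MTok = 1,000,000 tokens)."""
    per_request = (in_toks * in_price + out_toks * out_price) / 1_000_000
    return n_requests * per_request

# Illustrative workload: 1M moderation calls, 500 input / 20 output tokens each.
haiku = cost_usd(1_000_000, 500, 20, 1.00, 5.00)     # Claude Haiku 4.5: ~$600
devstral = cost_usd(1_000_000, 500, 20, 0.40, 2.00)  # Devstral Medium: ~$240
```

At these assumed token counts the listed 2.5x price ratio carries straight through: roughly $600 for Haiku vs $240 for Devstral per million calls, which is the gap that makes Devstral attractive for pre-filtering despite its weaker safety score.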

Bottom Line

For Safety Calibration, choose Claude Haiku 4.5 if you need stronger refusal behavior, consistent policy alignment, and robust tool-based guardrails (Haiku: safety_calibration=2, tool_calling=5, faithfulness=5). Choose Devstral Medium if your priority is lower runtime cost and you can tolerate weaker safety calibration (Devstral: safety_calibration=1, lower costs: $0.40 input / $2.00 output per MTok) or plan to add external moderation layers.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions