Claude Haiku 4.5 vs Devstral 2 2512 for Safety Calibration
Winner: Claude Haiku 4.5. In our testing Claude Haiku 4.5 scores 2/5 on Safety Calibration vs Devstral 2 2512's 1/5, and ranks 12th vs 31st out of 52 models. No external benchmark covers this task, so the verdict rests on our internal safety_calibration scores and supporting proxy metrics. Claude's higher faithfulness (5 vs 4), tool_calling (5 vs 4), and persona_consistency (5 vs 4) support its safer refusals and better handling of ambiguous or adversarial prompts. Devstral 2 2512 has stronger structured_output (5 vs 4), which helps with strict format enforcement, but its lower safety_calibration score and rank make it the less reliable choice for refusing harmful requests in our suite.
Pricing
Claude Haiku 4.5 (Anthropic): $1.00/MTok input, $5.00/MTok output
Devstral 2 2512 (Mistral): $0.40/MTok input, $2.00/MTok output
Task Analysis
What Safety Calibration demands: accurate refusal of harmful or disallowed prompts, permissive handling of legitimate requests, consistent policy alignment, and resistance to jailbreaks or persona injection. Key capabilities that matter:
- refusal accuracy (directly measured by safety_calibration)
- faithfulness (avoids inventing justifications for unsafe outputs)
- persona_consistency (resists instruction injection)
- tool_calling (selects safe actions when invoking tools)
- structured_output (enforces safe templates or labels)
No external benchmark covers this task, so the primary signal is our safety_calibration score: Claude Haiku 4.5 = 2/5 vs Devstral 2 2512 = 1/5. Supporting evidence from our proxies: Claude shows top-tier faithfulness (5), tool_calling (5), and persona_consistency (5), which explains why it refuses harmful requests more reliably in our tests. Devstral's strengths (structured_output 5, constrained_rewriting 5) help in tightly formatted, rule-bound responses, but they did not compensate for its lower safety calibration in our suite.
Practical Examples
Where Claude Haiku 4.5 shines (based on scores):
- Moderating user requests: Claude’s safety_calibration 2/5 (vs Devstral's 1/5) and faithfulness 5/5 mean it more reliably refuses disallowed content while still answering legitimate requests. (Rank 12 of 52.)
- Multi-step safe tool usage: with tool_calling 5/5, Claude better decides when to invoke safety-sensitive tools and when to decline (see the gating sketch after this list).
- Policy-resilient agents: persona_consistency 5/5 reduces the risk of instruction-injection bypasses in chat flows.
Where Devstral 2 2512 shines (based on scores):
- Strictly formatted safety outputs: structured_output 5/5 makes Devstral strong at producing exactly constrained labels, JSON flags, or audit logs for moderation pipelines (the sketch after this list shows one such JSON shape).
- Compact safe rewrites: constrained_rewriting 5/5 helps when you must compress content into limited channels (e.g., brief takedown reasons).
Tradeoffs and costs:
- Claude is costlier per token ($1.00 input / $5.00 output per MTok) than Devstral ($0.40 / $2.00). If you need safer refusals and can accept the higher output cost, Claude is the practical pick. If you need cheap, strictly formatted outputs and plan to enforce strong external safety controls, Devstral’s structured-output strengths may be useful despite its lower safety score.
Bottom Line
For Safety Calibration, choose Claude Haiku 4.5 if you need more reliable refusal behavior and stronger innate protections (Claude scores 2/5 vs Devstral's 1/5 and ranks 12 vs 31 of 52 in our tests). Choose Devstral 2 2512 if your priority is low per-token cost and exact structured outputs (structured_output 5/5) and you will enforce additional external safety controls.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.