Claude Haiku 4.5 vs Codestral 2508 for Safety Calibration

Winner: Claude Haiku 4.5. In our safety_calibration test, Claude Haiku 4.5 scores 2/5 versus Codestral 2508's 1/5, placing Haiku 4.5 at rank 12 of 52 and Codestral at rank 31 of 52. That one-point lead reflects measurably stronger refusal behavior, better classification (4 vs 3), and higher persona consistency (5 vs 3) in our testing, all traits that reduce the risk of permitting harmful requests. Codestral 2508 is weaker on safety calibration but wins on structured_output (5 vs 4) and is much cheaper ($0.90 vs $5.00 per MTok output), so it may still fit constrained budgets or workflows that prioritize strict schema adherence over conservative refusals.

Claude Haiku 4.5 (Anthropic)

Overall: 4.33/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $1.00/MTok
Output: $5.00/MTok
Context Window: 200K

Codestral 2508 (Mistral)

Overall: 3.50/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 4/5
Tool Calling: 5/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.30/MTok
Output: $0.90/MTok
Context Window: 256K

Task Analysis

What Safety Calibration demands: per our benchmark description, safety calibration is about refusing harmful requests while allowing legitimate ones. The key capabilities that drive it are accurate classification and routing of intent, consistent persona and instruction-following to resist jailbreaks, faithfulness to the source context, and the ability to produce clear, structured refusals when required. In our testing, Claude Haiku 4.5 scored 2/5 on safety_calibration and ranks 12/52; Codestral 2508 scored 1/5 and ranks 31/52. The supporting metrics explain the gap: Claude has higher classification (4 vs 3) and persona_consistency (5 vs 3) scores, which help it detect and consistently refuse disallowed prompts. Both models tie on tool_calling (5/5), so tool orchestration is not the differentiator here; Codestral's one advantage is structured_output (5 vs 4), which helps it produce schema-compliant refusal messages. No external benchmark results are available for either model, so these internal scores are our primary evidence.
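To make the classify-then-refuse pipeline concrete, here is a minimal Python sketch of a moderation gateway. The intent labels, the keyword check, and the `classify_intent` stub are illustrative assumptions invented for this example; they stand in for a real model call and are not part of either model's API.

```python
# Minimal moderation-gateway sketch. The labels, the keyword check, and
# classify_intent itself are illustrative stand-ins for a real model call.
from dataclasses import dataclass

ALLOWED, BORDERLINE, DISALLOWED = "allowed", "borderline", "disallowed"

@dataclass
class Decision:
    action: str  # "answer" or "refuse"
    reason: str

def classify_intent(prompt: str) -> str:
    """Stand-in for the model call that labels the request's intent."""
    if any(word in prompt.lower() for word in ("exploit", "weapon")):
        return DISALLOWED
    return ALLOWED

def gate(prompt: str) -> Decision:
    label = classify_intent(prompt)
    if label == DISALLOWED:
        return Decision("refuse", "request matches a disallowed category")
    if label == BORDERLINE:
        # Safety calibration lives on this branch: a model with stronger
        # classification (Haiku's 4 vs Codestral's 3) misroutes it less often.
        return Decision("refuse", "borderline request, refusing conservatively")
    return Decision("answer", "request is in policy")

if __name__ == "__main__":
    print(gate("How do I exploit this CVE in production?"))
    print(gate("Summarize this security advisory for our team."))
```

The metrics in the paragraph above map directly onto this flow: classification decides which branch fires, and persona consistency determines whether the refusal policy holds up across adversarial rephrasings.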

Practical Examples

Scenario A: Moderation gateway for user-submitted instructions. Claude Haiku 4.5 (safety 2 vs 1) was more likely in our tests to refuse subtly harmful prompts and to classify borderline cases correctly (classification 4 vs 3). Choose Haiku when false positives are acceptable but false negatives (letting harmful content through) are not.

Scenario B: Automated API that must return a strict JSON refusal object to downstream systems. Codestral 2508 excels at structured output (5 vs 4), so it will more reliably produce schema-compliant refusal payloads even though its overall refusal policy is weaker; see the sketch after this list.

Scenario C: Cost-sensitive bulk filtering. Codestral 2508 is far cheaper ($0.90/MTok output vs $5.00/MTok for Claude Haiku 4.5; a worked example follows the Bottom Line). For high-volume, low-risk content screening where occasional permissive answers are tolerable, Codestral may be preferred.

Quantified differences used (Claude Haiku 4.5 vs Codestral 2508): safety_calibration 2 vs 1, classification 4 vs 3, persona_consistency 5 vs 3, structured_output 4 vs 5, and rank 12 vs 31 of 52.
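For Scenario B, the downstream contract might look like the following sketch. The refusal schema and its field names are hypothetical, invented here to show what a schema-compliant refusal payload means; validation uses the third-party jsonschema package.

```python
# Hypothetical strict refusal contract for Scenario B. The schema and its
# field names are illustrative assumptions, not a real downstream spec.
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

REFUSAL_SCHEMA = {
    "type": "object",
    "properties": {
        "decision": {"enum": ["refuse"]},
        "category": {"type": "string"},
        "message": {"type": "string", "maxLength": 280},
    },
    "required": ["decision", "category", "message"],
    "additionalProperties": False,
}

def is_valid_refusal(raw_model_output: str) -> bool:
    """Return True only if the model emitted a schema-compliant refusal."""
    try:
        payload = json.loads(raw_model_output)
        validate(instance=payload, schema=REFUSAL_SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

# A structured-output-strong model (Codestral scores 5/5 there) should pass
# this check more reliably, even though its refusal policy itself is weaker.
print(is_valid_refusal(
    '{"decision": "refuse", "category": "malware", '
    '"message": "I cannot help with that."}'))  # True
print(is_valid_refusal("Sorry, I can't help with that."))  # False
```

This is the trade-off in miniature: Haiku is likelier to decide to refuse, while Codestral is likelier to express a refusal in exactly the shape a downstream system demands.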

Bottom Line

For Safety Calibration, choose Claude Haiku 4.5 if you need stronger, more consistent refusals, better intent classification, and tighter persona consistency for moderation or compliance. Choose Codestral 2508 if you prioritize lower cost and stricter schema/structured-output generation and can accept weaker refusal behavior.
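To put the pricing gap behind that recommendation in concrete terms, here is a back-of-envelope sketch using the listed output prices ($5.00/MTok vs $0.90/MTok). The screening volume and the 200-token average response length are assumptions for illustration, and input costs ($1.00 vs $0.30/MTok) are omitted.

```python
# Back-of-envelope output-cost comparison using the listed prices.
# Volume and tokens-per-item are illustrative assumptions; input cost
# ($1.00 vs $0.30 per MTok) is omitted for simplicity.
PRICE_PER_MTOK = {"Claude Haiku 4.5": 5.00, "Codestral 2508": 0.90}

items = 1_000_000       # screening volume
tokens_per_item = 200   # assumed average output length
mtok = items * tokens_per_item / 1_000_000  # total output in MTok

for model, price in PRICE_PER_MTOK.items():
    print(f"{model}: ${mtok * price:,.2f} to screen {items:,} items")
# Claude Haiku 4.5: $1,000.00 to screen 1,000,000 items
# Codestral 2508: $180.00 to screen 1,000,000 items
```

At this assumed volume the gap is roughly 5.6x, which is why bulk filtering is the one scenario where Codestral's weaker safety calibration may be an acceptable trade.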

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions