Claude Haiku 4.5 vs Codestral 2508 for Safety Calibration

Winner: Claude Haiku 4.5. In our safety_calibration test, Claude Haiku 4.5 scores 2/5 versus Codestral 2508's 1/5, placing Haiku 4.5 at rank 12 of 52 and Codestral at rank 31 of 52. That one-point lead reflects measurably stronger refusal behavior, better classification (4 vs 3), and higher persona consistency (5 vs 3) in our testing, all traits that reduce the risk of permitting harmful requests. Codestral 2508 is weaker on safety calibration but wins on structured_output (5 vs 4) and is much cheaper ($0.90 vs $5.00 per MTok output), so it may still fit constrained budgets or workflows that prioritize strict schema adherence over conservative refusals.

Claude Haiku 4.5 (Anthropic)

Overall: 4.33/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $1.00/MTok
Output: $5.00/MTok
Context Window: 200K

Codestral 2508 (Mistral)

Overall: 3.50/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 4/5
Tool Calling: 5/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.30/MTok
Output: $0.90/MTok
Context Window: 256K

Task Analysis

What Safety Calibration demands: per our benchmark description, safety calibration is about refusing harmful requests while allowing legitimate ones. The key capabilities that drive it are accurate classification and routing of intent, consistent persona and instruction-following to resist jailbreaks, faithfulness to the source context, and the ability to produce clear, structured refusals when required. In our testing, Claude Haiku 4.5 scored 2/5 on safety_calibration and ranks 12/52; Codestral 2508 scored 1/5 and ranks 31/52. The supporting metrics explain the gap: Claude has higher classification (4 vs 3) and persona_consistency (5 vs 3) scores, which help it detect and consistently refuse disallowed prompts. Both models tie on tool_calling (5/5), so tool orchestration is not the differentiator here; Codestral's one advantage is structured_output (5 vs 4), which helps it produce schema-compliant refusal messages. No external benchmark results are available for either model, so these internal scores are our primary evidence.
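To make the classify-then-refuse pipeline concrete, here is a minimal Python sketch of a moderation gateway. The intent labels, the keyword check, and the `classify_intent` stub are illustrative assumptions invented for this example; they stand in for a real model call and are not part of either model's API.

```python
# Minimal moderation-gateway sketch. The labels, the keyword check, and
# classify_intent itself are illustrative stand-ins for a real model call.
from dataclasses import dataclass

ALLOWED, BORDERLINE, DISALLOWED = "allowed", "borderline", "disallowed"

@dataclass
class Decision:
    action: str  # "answer" or "refuse"
    reason: str

def classify_intent(prompt: str) -> str:
    """Stand-in for the model call that labels the request's intent."""
    if any(word in prompt.lower() for word in ("exploit", "weapon")):
        return DISALLOWED
    return ALLOWED

def gate(prompt: str) -> Decision:
    label = classify_intent(prompt)
    if label == DISALLOWED:
        return Decision("refuse", "request matches a disallowed category")
    if label == BORDERLINE:
        # Safety calibration lives on this branch: a model with stronger
        # classification (Haiku's 4 vs Codestral's 3) misroutes it less often.
        return Decision("refuse", "borderline request, refusing conservatively")
    return Decision("answer", "request is in policy")

if __name__ == "__main__":
    print(gate("How do I exploit this CVE in production?"))
    print(gate("Summarize this security advisory for our team."))
```

The metrics in the paragraph above map directly onto this flow: classification decides which branch fires, and persona consistency determines whether the refusal policy holds up across adversarial rephrasings.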

Practical Examples

Scenario A: Moderation gateway for user-submitted instructions. Claude Haiku 4.5 (safety 2 vs 1) was more likely in our tests to refuse subtly harmful prompts and to classify borderline cases correctly (classification 4 vs 3). Choose Haiku when false positives are acceptable but false negatives (letting harmful content through) are not.

Scenario B: Automated API that must return a strict JSON refusal object to downstream systems. Codestral 2508 excels at structured output (5 vs 4), so it will more reliably produce schema-compliant refusal payloads even though its overall refusal policy is weaker; see the sketch after this list.

Scenario C: Cost-sensitive bulk filtering. Codestral 2508 is far cheaper ($0.90/MTok output vs $5.00/MTok for Claude Haiku 4.5; a worked example follows the Bottom Line). For high-volume, low-risk content screening where occasional permissive answers are tolerable, Codestral may be preferred.

Quantified differences used (Claude Haiku 4.5 vs Codestral 2508): safety_calibration 2 vs 1, classification 4 vs 3, persona_consistency 5 vs 3, structured_output 4 vs 5, and rank 12 vs 31 of 52.
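For Scenario B, the downstream contract might look like the following sketch. The refusal schema and its field names are hypothetical, invented here to show what a schema-compliant refusal payload means; validation uses the third-party jsonschema package.

```python
# Hypothetical strict refusal contract for Scenario B. The schema and its
# field names are illustrative assumptions, not a real downstream spec.
import json

from jsonschema import ValidationError, validate  # pip install jsonschema

REFUSAL_SCHEMA = {
    "type": "object",
    "properties": {
        "decision": {"enum": ["refuse"]},
        "category": {"type": "string"},
        "message": {"type": "string", "maxLength": 280},
    },
    "required": ["decision", "category", "message"],
    "additionalProperties": False,
}

def is_valid_refusal(raw_model_output: str) -> bool:
    """Return True only if the model emitted a schema-compliant refusal."""
    try:
        payload = json.loads(raw_model_output)
        validate(instance=payload, schema=REFUSAL_SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

# A structured-output-strong model (Codestral scores 5/5 there) should pass
# this check more reliably, even though its refusal policy itself is weaker.
print(is_valid_refusal(
    '{"decision": "refuse", "category": "malware", '
    '"message": "I cannot help with that."}'))  # True
print(is_valid_refusal("Sorry, I can't help with that."))  # False
```

This is the trade-off in miniature: Haiku is likelier to decide to refuse, while Codestral is likelier to express a refusal in exactly the shape a downstream system demands.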

Bottom Line

For Safety Calibration, choose Claude Haiku 4.5 if you need stronger, more consistent refusals, better intent classification, and tighter persona consistency for moderation or compliance. Choose Codestral 2508 if you prioritize lower cost and stricter schema/structured-output generation and can accept weaker refusal behavior.
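To put the pricing gap behind that recommendation in concrete terms, here is a back-of-envelope sketch using the listed output prices ($5.00/MTok vs $0.90/MTok). The screening volume and the 200-token average response length are assumptions for illustration, and input costs ($1.00 vs $0.30/MTok) are omitted.

```python
# Back-of-envelope output-cost comparison using the listed prices.
# Volume and tokens-per-item are illustrative assumptions; input cost
# ($1.00 vs $0.30 per MTok) is omitted for simplicity.
PRICE_PER_MTOK = {"Claude Haiku 4.5": 5.00, "Codestral 2508": 0.90}

items = 1_000_000       # screening volume
tokens_per_item = 200   # assumed average output length
mtok = items * tokens_per_item / 1_000_000  # total output in MTok

for model, price in PRICE_PER_MTOK.items():
    print(f"{model}: ${mtok * price:,.2f} to screen {items:,} items")
# Claude Haiku 4.5: $1,000.00 to screen 1,000,000 items
# Codestral 2508: $180.00 to screen 1,000,000 items
```

At this assumed volume the gap is roughly 5.6x, which is why bulk filtering is the one scenario where Codestral's weaker safety calibration may be an acceptable trade.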

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions