Claude Haiku 4.5 vs Codestral 2508 for Classification

Winner: Claude Haiku 4.5. In our testing, Claude Haiku 4.5 outscored Codestral 2508 on the Classification task (4 vs 3) and ranks 1 of 52 versus Codestral's 31 of 52. That one-point edge is backed by stronger safety calibration (2 vs 1), persona consistency (5 vs 3), and agentic planning (5 vs 4), all of which matter for accurate routing and category decisions; the two models tie at 5 on long context and faithfulness. Codestral 2508 wins on structured output (5 vs 4) and is materially cheaper ($0.90 vs $5.00 per MTok of output), so it remains a strong choice when strict schema compliance or cost is the priority.

Claude Haiku 4.5 (Anthropic)

Overall: 4.33/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $1.00/MTok
Output: $5.00/MTok
Context Window: 200K tokens

Codestral 2508 (Mistral)

Overall: 3.50/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 4/5
Tool Calling: 5/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.30/MTok
Output: $0.90/MTok
Context Window: 256K tokens

Task Analysis

What Classification demands: fast, repeatable mapping of inputs to categories or routes, with reliable, schema-compliant outputs and safe refusal behavior. Key capabilities:

1. Classification accuracy (our Classification test is the primary measure).
2. Structured output (JSON/schema adherence).
3. Faithfulness and context sensitivity (to avoid misrouting).
4. Safety calibration (correctly refusing harmful labels).
5. Cost/latency tradeoffs for high-volume inference.

Primary signal: our Classification score and task rank. Claude Haiku 4.5 scores 4 (rank 1 of 52) vs Codestral 2508 at 3 (rank 31 of 52). Supporting evidence from our proxies: Haiku leads on safety calibration (2 vs 1), persona consistency (5 vs 3), and agentic planning (5 vs 4), all helpful when classification depends on context or multi-step interpretation. Codestral leads on structured output (5 vs 4), an advantage when strict JSON/schema adherence is essential. Tool calling, faithfulness, and long context are ties (both models score 5) and will not differentiate basic routing workflows.
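To make the structured-output requirement concrete, here is a minimal validation sketch of the kind a routing pipeline would put behind either model. The label set, JSON field names, and the parse_classification helper are illustrative assumptions, not part of either vendor's API; the point is that strict validation turns the structured-output gap (5 vs 4) into a directly measurable retry rate.

```python
import json

# Hypothetical routing categories for a support-ticket classifier.
# The label set and helper below are illustrative assumptions, not
# part of either vendor's API.
ALLOWED_LABELS = {"billing", "technical", "account", "abuse", "other"}

def parse_classification(raw: str) -> str:
    """Validate one model response against the expected schema.

    Expects a JSON object like {"label": "billing", "confidence": 0.93}.
    Raises ValueError on any deviation, so a malformed output can be
    retried or sent to a fallback model instead of being misfiled.
    """
    obj = json.loads(raw)  # raises JSONDecodeError (a ValueError) on non-JSON
    if not isinstance(obj, dict):
        raise ValueError("response is valid JSON but not an object")
    label = obj.get("label")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"label {label!r} not in allowed set")
    confidence = obj.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence {confidence!r} out of range")
    return label

# A schema-compliant response passes; free text is rejected.
print(parse_classification('{"label": "billing", "confidence": 0.93}'))
try:
    parse_classification("This looks like a billing question.")
except ValueError as err:
    print("rejected:", err)
```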

Practical Examples

1. Customer-support routing: Haiku 4.5 (score 4) is preferable when intent is subtle, context matters across a long conversation, and safety/policy-aware refusals are required (safety calibration 2 vs 1).
2. High-throughput labeler with a strict JSON schema: Codestral 2508 offers better structured output (5 vs 4) and a much lower output cost ($0.90 vs $5.00 per MTok), reducing the inference bill while enforcing the schema.
3. Moderation triage: Haiku 4.5 reduces risky misclassifications because its persona consistency and safety calibration are stronger; both models tie on faithfulness (5/5), so neither is prone to inventing sources.
4. Edge or batch classification where price dominates: Codestral is roughly 5.56x cheaper per output MTok, making it the cost-efficient option if a one-point drop in our Classification score is acceptable (see the cost sketch after this list).
5. Multi-turn classification that needs tool orchestration: both models score 5/5 on tool calling, so either can select and sequence tools reliably; prefer Haiku when context-sensitive decisions matter, Codestral when schema compliance and cost constraints dominate.
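To ground the price comparison, the sketch below estimates a monthly bill from the listed rates. Only the per-MTok prices come from the pricing cards above; the request volume and token counts are illustrative assumptions.

```python
# Back-of-envelope monthly cost for a high-volume classifier.
# Per-MTok prices are from the pricing cards above; the volume and
# token counts are illustrative assumptions.
PRICES = {  # model: (input $/MTok, output $/MTok)
    "Claude Haiku 4.5": (1.00, 5.00),
    "Codestral 2508": (0.30, 0.90),
}

REQUESTS_PER_MONTH = 10_000_000  # assumed labeling volume
INPUT_TOKENS = 400               # assumed prompt + item text per request
OUTPUT_TOKENS = 30               # assumed short JSON label per request

for model, (in_rate, out_rate) in PRICES.items():
    input_cost = REQUESTS_PER_MONTH * INPUT_TOKENS / 1e6 * in_rate
    output_cost = REQUESTS_PER_MONTH * OUTPUT_TOKENS / 1e6 * out_rate
    print(f"{model}: ${input_cost + output_cost:,.0f}/month")

# The output-side ratio quoted above: $5.00 / $0.90 ≈ 5.56x
print(f"output price ratio: {5.00 / 0.90:.2f}x")
```

At these assumed volumes the gap is roughly $5,500 vs $1,470 per month, dominated by input tokens; the 5.56x figure applies to the output side alone.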

Bottom Line

For Classification, choose Claude Haiku 4.5 if you need higher accuracy in subtle, context-heavy routing, better safety calibration, and stronger persona/context handling (score 4 vs 3, rank 1 vs 31). Choose Codestral 2508 if you require top-tier structured output (5 vs 4) and much lower inference cost ($0.90 vs $5.00 per MTok of output), and you can accept a one-point drop on our Classification test.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions