Claude Haiku 4.5 vs Codestral 2508 for Classification
Winner: Claude Haiku 4.5. In our testing Claude Haiku 4.5 outscored Codestral 2508 on the Classification task (4 vs 3) and ranks 1 of 52 vs Codestral's 31 of 52. That 1-point edge reflects stronger safety calibration (2 vs 1), persona consistency (5 vs 3), and agentic planning (5 vs 4), capabilities that matter for accurate routing and category decisions; long_context and faithfulness are ties (both 5). Codestral 2508 wins on structured output (5 vs 4) and is materially cheaper ($0.90 vs $5.00 per output MTok), so it remains a strong choice when strict schema compliance or cost is the priority.
anthropic
Claude Haiku 4.5
Benchmark Scores
External Benchmarks
Pricing
Input
$1.00/MTok
Output
$5.00/MTok
modelpicker.net
mistral
Codestral 2508
Benchmark Scores
External Benchmarks
Pricing
Input
$0.300/MTok
Output
$0.900/MTok
Task Analysis
What Classification demands: fast, repeatable mapping of inputs to categories or routes, with reliable, schema-compliant outputs and safe refusal behavior. Key capabilities:
1) classification accuracy (our classification test is the primary measure)
2) structured output (JSON/schema adherence)
3) faithfulness and context sensitivity (to avoid misrouting)
4) safety calibration (correctly refusing harmful labels)
5) cost/latency tradeoffs for high-volume inference
Primary signal: our Classification score and task rank. Claude Haiku 4.5 scores 4 (task rank 1/52) vs Codestral 2508 at 3 (task rank 31/52). Supporting evidence from our proxies: Haiku leads on safety_calibration (2 vs 1), persona_consistency (5 vs 3), and agentic_planning (5 vs 4), all helpful when classification depends on context or multi-step interpretation. Codestral leads on structured_output (5 vs 4), making it advantageous when strict JSON/schema adherence is essential. Tool_calling, faithfulness, and long_context are ties (both models score 5) and will not differentiate basic routing workflows.
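Whichever model you pick, schema adherence is worth enforcing on your side rather than trusted blindly. Here is a minimal sketch of a validation gate for a classification reply; the category names, reply fields, and function name are illustrative assumptions, not part of either model's API.

```python
import json

# Hypothetical routing schema: these category names and reply fields are
# illustrative assumptions, not taken from either model's documentation.
ALLOWED_CATEGORIES = {"billing", "technical", "account", "other"}

def validate_classification(raw_reply: str) -> dict:
    """Parse a model's JSON classification reply and enforce the schema.

    Rejects malformed JSON, unknown categories, and out-of-range
    confidence values so downstream routing never sees a bad label.
    """
    data = json.loads(raw_reply)  # raises ValueError on malformed JSON
    if data.get("category") not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {data.get('category')!r}")
    conf = data.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        raise ValueError(f"confidence out of range: {conf!r}")
    return data
```

A gate like this narrows the practical gap on the structured_output proxy: a model that occasionally emits an off-schema label fails loudly instead of silently misrouting.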
Practical Examples
1) Customer-support routing: Haiku 4.5 (score 4) is preferable when intent is subtle, context matters across a long conversation, and safety/policy-aware refusals are required (safety_calibration 2 vs 1).
2) High-throughput labeler with a strict JSON schema: Codestral 2508 has better structured output (5 vs 4) and a much lower output cost ($0.90 vs $5.00 per MTok), reducing the inference bill while enforcing the schema.
3) Moderation triage: Haiku 4.5 reduces risky misclassifications because its persona_consistency and safety scores are stronger; both models tie on faithfulness (5), so neither is prone to inventing sources.
4) Edge or batch classification where price matters: Codestral is ~5.56x cheaper per output MTok (5.00 / 0.90 ≈ 5.56), making it the cost-efficient option if a 1-point drop in our classification score is acceptable.
5) Multi-turn classification that needs tool orchestration: both models score 5 on tool_calling, so either can select and sequence tools reliably; prefer Haiku if context-sensitive decisions matter, Codestral if schema and cost constraints dominate.
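The cost gap in example 4 is simple arithmetic on the published output prices above; the monthly token volume below is an assumed figure for illustration only.

```python
# Output prices per million tokens (MTok) from this comparison's pricing cards.
HAIKU_OUT_PER_MTOK = 5.00      # Claude Haiku 4.5
CODESTRAL_OUT_PER_MTOK = 0.90  # Codestral 2508

# Price ratio behind the "~5.56x cheaper" figure.
ratio = HAIKU_OUT_PER_MTOK / CODESTRAL_OUT_PER_MTOK

def monthly_output_cost(price_per_mtok: float, tokens_per_month: int) -> float:
    """Output-token spend in dollars for a given monthly volume."""
    return price_per_mtok * tokens_per_month / 1_000_000

# Assumed volume: 500M output tokens/month for a high-throughput labeler.
haiku_bill = monthly_output_cost(HAIKU_OUT_PER_MTOK, 500_000_000)
codestral_bill = monthly_output_cost(CODESTRAL_OUT_PER_MTOK, 500_000_000)
```

At that assumed volume the absolute gap is large enough that a 1-point score difference may be worth it for pure labeling workloads, while low-volume or accuracy-critical routing tilts back toward Haiku.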
Bottom Line
For Classification, choose Claude Haiku 4.5 if you need higher accuracy in subtle, context-heavy routing, better safety calibration, and stronger persona/context handling (score 4 vs 3, rank 1 vs 31). Choose Codestral 2508 if you require top-tier structured output (5 vs 4) and much lower inference cost ($0.90 vs $5.00 per output MTok), and you can accept a 1-point drop on our Classification test.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.