Claude Haiku 4.5 vs Devstral Medium for Classification

Winner: Claude Haiku 4.5. In our tests both models score 4/5 on Classification (accurate categorization and routing), but Claude Haiku 4.5 wins because it delivers stronger supporting capabilities that matter for robust classification pipelines: tool calling 5 vs 3, faithfulness 5 vs 4, long context 5 vs 4, and multilingual 5 vs 4. Those gaps make Haiku the better choice when you need reliable routing, downstream tool integration, or large-context classification. Devstral Medium ties on raw classification accuracy but is the lower-cost option ($0.40 vs $1.00 per MTok input; $2.00 vs $5.00 per MTok output) and is the right pick when budget and latency are the primary constraints.

anthropic

Claude Haiku 4.5

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$1.00/MTok

Output

$5.00/MTok

Context Window: 200K

modelpicker.net

mistral

Devstral Medium

Overall
3.17/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
3/5
Classification
4/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.400/MTok

Output

$2.00/MTok

Context Window: 131K


Task Analysis

What Classification demands: accurate, repeatable category labels or routing decisions; reliable adherence to structured output (JSON/schema); low hallucination (faithfulness); accurate tool selection and argument generation when driving downstream systems; and sometimes long-context and multilingual support for large or international datasets.

Our primary signal for this task is the internal classification score: both Claude Haiku 4.5 and Devstral Medium score 4/5 on the classification test in our 12-test suite. Because they tie on raw classification, we examine supporting metrics to break the tie. Claude Haiku 4.5 shows higher tool calling (5 vs 3), faithfulness (5 vs 4), long context (5 vs 4), and multilingual (5 vs 4), plus stronger persona consistency and agentic planning, attributes that reduce routing errors, improve schema adherence across languages, and enable safe, reliable integrations. Devstral Medium matches structured output (4 vs 4) but lags on tool integration and safety calibration (1 vs 2).

Cost and context window also matter: Claude Haiku 4.5 offers a 200,000-token context window versus Devstral Medium's 131,072, while Devstral is cheaper ($0.40 vs $1.00 per MTok input; $2.00 vs $5.00 per MTok output).
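As a concrete illustration of what "reliable adherence to structured output" means in a classification pipeline, the check below is a minimal sketch: it parses a model's JSON reply and rejects any label outside the schema so the caller can retry or fall back. The four-category ticket taxonomy is a hypothetical example, not part of our benchmark suite.

```python
import json

# Hypothetical routing taxonomy for illustration only.
ALLOWED_LABELS = {"billing", "technical", "account", "other"}

def validate_classification(raw: str) -> str:
    """Parse a model's JSON classification reply and enforce the schema.

    Returns the label if it is valid; raises ValueError otherwise so the
    pipeline can retry, escalate, or route to a fallback model.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"non-JSON reply: {exc}") from exc
    label = payload.get("label")
    if label not in ALLOWED_LABELS:
        raise ValueError(f"out-of-schema label: {label!r}")
    return label
```

A guard like this is cheap insurance regardless of which model you pick: higher faithfulness and structured-output scores reduce how often it trips, but it turns silent mislabels into explicit, retryable failures.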

Practical Examples

When Claude Haiku 4.5 shines for Classification:

  • Multilingual support and large context: classifying and routing support tickets across languages with a 200K-token history (Haiku multilingual 5 vs 4, long context 5 vs 4).
  • Tool-driven routing: a pipeline that needs accurate function selection and argument generation (Haiku tool calling 5 vs 3) to call downstream microservices for complaint triage.
  • High-trust classification: legal or medical triage where faithfulness reduces hallucinated labels (Haiku faithfulness 5 vs 4).

When Devstral Medium shines for Classification:

  • Bulk, cost-sensitive batch labeling: high-throughput classification where per-token cost matters (Devstral input $0.40 vs $1.00 and output $2.00 vs $5.00 per MTok), and you can accept simpler tool workflows.
  • Lightweight routing in single-language systems where long context and advanced tool integration are less critical; Devstral still scores 4/5 on classification and 4/5 on structured output, so it delivers reliable labels at lower cost.

Concrete numbers used: both models score 4/5 on our classification test; Haiku leads on tool calling (5 vs 3), faithfulness (5 vs 4), long context (5 vs 4), and multilingual (5 vs 4).
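To make the cost gap tangible, here is a back-of-the-envelope estimate using the listed per-MTok prices. The workload figures (1M tickets, roughly 500 input and 10 output tokens each) are illustrative assumptions, not benchmark data.

```python
# Published per-MTok prices from the comparison above ($/MTok).
HAIKU = {"input": 1.00, "output": 5.00}
DEVSTRAL = {"input": 0.40, "output": 2.00}

def batch_cost(prices, n_requests, in_tokens, out_tokens):
    """Total USD cost for a batch labeling job (tokens priced per million)."""
    mtok_in = n_requests * in_tokens / 1_000_000
    mtok_out = n_requests * out_tokens / 1_000_000
    return mtok_in * prices["input"] + mtok_out * prices["output"]

# Hypothetical workload: 1M tickets, ~500 input / ~10 output tokens each.
haiku_total = batch_cost(HAIKU, 1_000_000, 500, 10)        # 500 MTok in, 10 MTok out
devstral_total = batch_cost(DEVSTRAL, 1_000_000, 500, 10)
print(f"Haiku: ${haiku_total:.2f}, Devstral: ${devstral_total:.2f}")
# Devstral comes out 60% cheaper on this workload ($220 vs $550).
```

At this scale the per-token price dominates, which is why Devstral Medium is the sensible default for bulk labeling unless the workload also needs Haiku's stronger tool calling or multilingual handling.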

Bottom Line

For Classification, choose Claude Haiku 4.5 if you need robust routing with tool integration, high faithfulness, or large-context and multilingual classification (Haiku tool calling 5 vs 3, faithfulness 5 vs 4, long context 5 vs 4). Choose Devstral Medium if you need equal raw classification accuracy at lower cost and can accept weaker tool calling and slightly lower faithfulness (Devstral $0.40 vs $1.00 per MTok input; $2.00 vs $5.00 per MTok output).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions