Claude Haiku 4.5 vs Devstral Medium for Math
Winner: Claude Haiku 4.5. In our testing Claude Haiku 4.5 scored 5/5 on strategic_analysis vs Devstral Medium's 2/5, while both tie at 4/5 on structured_output. The decisive gap on strategic_analysis — the core skill for multi-step mathematical reasoning — makes Haiku the better Math model here. Note: an external MATH Level 5 benchmark is present in the dataset (Epoch AI) but neither model has a reported score, so the winner call is based on our internal tests.
Claude Haiku 4.5 (Anthropic)
Pricing: $1.00/MTok input, $5.00/MTok output
Devstral Medium (Mistral)
Pricing: $0.40/MTok input, $2.00/MTok output
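To put the price gap in concrete terms, here is a quick sketch of per-request cost at the listed rates. The token counts are hypothetical, chosen only to illustrate that Devstral comes out roughly 2.5x cheaper on a typical problem:

```python
# Per-request cost at the listed prices (USD per million tokens).
# Token counts below are hypothetical, for illustration only.
PRICES = {
    "Claude Haiku 4.5": {"input": 1.00, "output": 5.00},
    "Devstral Medium": {"input": 0.40, "output": 2.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the table prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a 2,000-token problem statement with a 1,500-token worked solution.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 1_500):.4f}")
# Claude Haiku 4.5: $0.0095
# Devstral Medium: $0.0038
```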
Task Analysis
What Math demands: precise stepwise reasoning, reliable numeric fidelity, and the ability to present answers in strict formats (e.g., exact expressions or JSON). For this task we treat strategic_analysis as the primary internal proxy (nuanced tradeoff reasoning with real numbers) and structured_output as a critical secondary capability (JSON/schema compliance).

External resource: the dataset includes MATH Level 5 (Epoch AI) as the authoritative external benchmark, but neither model has a reported score there, so we lead with our internal results.

Key supporting capabilities in our tests: faithfulness (avoids hallucinated numeric steps), tool_calling (selects and uses calculators or numeric tools), long_context (holds multi-step derivations across many tokens), agentic_planning (decomposes problems), and creative_problem_solving (finds non-obvious strategies).

Internal scores (Claude Haiku 4.5 / Devstral Medium):
- strategic_analysis: 5 / 2
- structured_output: 4 / 4
- faithfulness: 5 / 4
- tool_calling: 5 / 3
- long_context: 5 / 4
- agentic_planning: 5 / 4
- creative_problem_solving: 4 / 2

These scores explain why Haiku handles complex, multi-step math more reliably; parity on structured_output means both models can meet format requirements equally well.
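To make the tool_calling criterion concrete, here is a minimal sketch of exposing a calculator tool to a model via the Anthropic Python SDK. The model ID, tool name, and prompt are placeholder assumptions rather than our actual harness; the shape of the tools list follows Anthropic's published tool-use format:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical calculator tool: the model decides when to call it.
calculator_tool = {
    "name": "calculator",
    "description": "Evaluate an arithmetic expression and return the exact result.",
    "input_schema": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}

response = client.messages.create(
    model="claude-haiku-4-5",  # placeholder model ID
    max_tokens=1024,
    tools=[calculator_tool],
    messages=[{"role": "user", "content": "Compute 2**10 + 7*13 exactly."}],
)

# A strong tool_calling model emits a tool_use block instead of guessing digits.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)  # e.g. calculator {'expression': '2**10 + 7*13'}
```

The criterion rewards orchestration: routing exact arithmetic through the tool, then resuming the derivation with the returned value rather than a free-associated number.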
Practical Examples
Where Claude Haiku 4.5 shines (based on scores):
- Multi-step contest problems requiring tradeoffs or case analysis (strategic_analysis 5 vs 2); Haiku is more likely to select a correct decomposition strategy.
- Long derivations or proofs needing token continuity (long_context 5 vs 4) and faithful intermediate calculations (faithfulness 5 vs 4).
- Workflows that call external numeric tools or sequence functions (tool_calling 5 vs 3).

Where Devstral Medium is appropriate (based on scores and cost):
- Short to medium complexity math tasks with strict formatting needs: both models tie on structured_output (4/5), so Devstral can produce compliant JSON or answer templates.
- Cost-sensitive batch jobs where deep strategic reasoning is not required: Devstral is cheaper ($0.40 vs $1.00 input, $2.00 vs $5.00 output per MTok).

Concrete grounded differences: the 3-point strategic_analysis gap (5 vs 2) favors Haiku for reasoning-heavy math; tool_calling 5 vs 3 indicates Haiku is materially better at orchestrating calculator/tool steps; structured_output 4 vs 4 shows both meet format constraints equally.
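Since both models tie on structured_output, the practical question is whether your pipeline enforces the format. A minimal sketch, assuming the jsonschema package and a hypothetical answer template (final answer plus worked steps):

```python
import json
import jsonschema

# Hypothetical answer template for a math task: exact final answer plus worked steps.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},  # exact expression, e.g. "3*sqrt(2)"
        "steps": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["answer", "steps"],
    "additionalProperties": False,
}

def is_compliant(raw: str) -> bool:
    """Return True if a model's raw output parses as JSON and matches the schema."""
    try:
        jsonschema.validate(json.loads(raw), ANSWER_SCHEMA)
        return True
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return False

print(is_compliant('{"answer": "1031", "steps": ["2**10 = 1024", "1024 + 7 = 1031"]}'))  # True
print(is_compliant('{"answer": 1031}'))  # False: wrong type, missing steps
```

With a validator like this in the loop, either model's 4/5 structured_output is typically sufficient, since non-compliant responses can be rejected and retried.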
Bottom Line
For Math, choose Claude Haiku 4.5 if you need deep, multi-step mathematical reasoning, reliable symbolic/numeric fidelity, tool orchestration, or long derivations (strategic_analysis 5 vs 2, tool_calling 5 vs 3, faithfulness 5 vs 4). Choose Devstral Medium if you need cheaper inference ($0.40 vs $1.00 input, $2.00 vs $5.00 output per MTok), your problems are short or format-driven (structured_output tied at 4/5), and you can tolerate weaker strategic reasoning.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
For math tasks, we supplement our benchmark suite with MATH/AIME scores from Epoch AI, an independent research organization.