Claude Haiku 4.5 vs Devstral Small 1.1 for Math

Winner: Claude Haiku 4.5. In our testing, Claude Haiku 4.5 is clearly better for Math: strategic_analysis 5/5 vs Devstral Small 1.1's 2/5, plus 5/5 on faithfulness, tool_calling, and long_context. Both models tie on structured_output (4/5). Neither model has a public MATH Level 5 (Epoch AI) score in our dataset, so this verdict rests on our internal benchmarks and ranking data.

Anthropic

Claude Haiku 4.5

Overall: 4.33/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $1.00/MTok
Output: $5.00/MTok

Context Window: 200K

Mistral

Devstral Small 1.1

Overall: 3.08/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 2/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Persona Consistency: 2/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.10/MTok
Output: $0.30/MTok

Context Window: 131K

Task Analysis

Math demands reliable multi-step reasoning, faithful algebraic manipulation, long-context handling for proofs, precisely formatted outputs (LaTeX/JSON), and the ability to call external calculators or tools when needed. The external MATH Level 5 benchmark (Epoch AI) is the authoritative test for this task but has no scores for either model in our dataset, so we rely on our internal task-relevant metrics. On those metrics, Claude Haiku 4.5 scores 5/5 on strategic_analysis (tied for 1st of 54) and 5/5 on faithfulness (tied for 1st of 55), indicating stronger multi-step analytic reasoning and fewer hallucinated steps. Claude also scores 5/5 on tool_calling and long_context, which supports calculator use and long derivations. Devstral Small 1.1 scores 2/5 on strategic_analysis and 4/5 on structured_output, faithfulness, tool_calling and long_context: sufficient for many routine math tasks, but weaker on high-stakes multi-step reasoning and planning (agentic_planning 2/5). Structured_output is tied at 4/5, so formatting and schema adherence are comparable.
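To make the "precisely formatted outputs" requirement concrete, here is a minimal Python sketch of the kind of JSON schema a math-grading pipeline might enforce on model answers. The field names and the sample response are illustrative assumptions, not part of our benchmark suite.

```python
import json
from jsonschema import validate  # third-party: pip install jsonschema

# Illustrative schema for a math answer: a final result plus worked steps
# (e.g. plain text or LaTeX). Field names here are assumptions for this sketch.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "final_answer": {"type": "string"},
        "steps": {"type": "array", "items": {"type": "string"}, "minItems": 1},
        "used_calculator": {"type": "boolean"},
    },
    "required": ["final_answer", "steps"],
    "additionalProperties": False,
}

# Parse a hypothetical model response and reject anything that drifts from the schema.
response_text = '{"final_answer": "x = 3", "steps": ["2x + 1 = 7", "2x = 6", "x = 3"], "used_calculator": false}'
validate(instance=json.loads(response_text), schema=ANSWER_SCHEMA)
print("response conforms to the expected schema")
```

Both models score 4/5 on structured_output, so either should handle this kind of schema adherence in routine pipelines.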

Practical Examples

Where Claude Haiku 4.5 shines (based on our scores):

  • Olympiad-style multi-step problems and proofs: strategic_analysis 5/5 and long_context 5/5 let Claude hold long derivations and reason about tradeoffs across steps. (Claude strategic_analysis = 5/5, tied for 1st of 54.)
  • Tool-assisted numeric verification: tool_calling 5/5 improves calculator selection and argument accuracy when workflows call external math tools; see the sketch after this list. (Claude tool_calling = 5/5.)
  • High-fidelity explanations for teaching or publication: faithfulness 5/5 reduces hallucinated steps.

Where Devstral Small 1.1 shines:

  • Low-cost batch grading, exercise generation, or quick algebraic steps where deep strategic planning is not required: structured_output 4/5 and faithfulness 4/5 give reliable formatting at far lower cost. (Devstral structured_output = 4/5, faithfulness = 4/5.)
  • Embedded or on-device pipelines with tight compute/budget constraints: Devstral's listed prices are $0.10/$0.30 per MTok for input/output vs Claude Haiku 4.5's $1.00/$5.00, i.e. 10× cheaper on input and roughly 16.7× cheaper on output, making Devstral much cheaper for high-volume runs.

Quantified differences to guide choice: Claude beats Devstral by 3 points on strategic_analysis (5 vs 2) and by 1 point each on faithfulness (5 vs 4) and tool_calling (5 vs 4). Structured_output is tied at 4/5.
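As referenced in the tool-assisted verification bullet above, here is a minimal sketch of how a math workflow might expose a calculator tool to Claude Haiku 4.5 via the Anthropic Messages API. The tool schema, the prompt, and the exact model ID string are assumptions for illustration, not part of our test harness.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative calculator tool definition (names and schema are assumptions).
calculator_tool = {
    "name": "calculator",
    "description": "Evaluate an arithmetic expression and return the numeric result.",
    "input_schema": {
        "type": "object",
        "properties": {
            "expression": {"type": "string", "description": "e.g. '3**5 + 17*4'"},
        },
        "required": ["expression"],
    },
}

response = client.messages.create(
    model="claude-haiku-4-5",  # assumed model ID for Claude Haiku 4.5
    max_tokens=1024,
    tools=[calculator_tool],
    messages=[{"role": "user", "content": "Verify that 3^5 + 17*4 = 311."}],
)

# If the model chooses to verify numerically, the response contains a tool_use
# block; a high tool_calling score means the expression it passes is well-formed.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```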

Bottom Line

For Math, choose Claude Haiku 4.5 if you need reliable multi-step reasoning, long derivations, high faithfulness, or tool-assisted numeric verification (strategic_analysis 5/5, faithfulness 5/5, tool_calling 5/5). Choose Devstral Small 1.1 if you prioritize cost and throughput for routine algebra or exercise generation (structured_output 4/5, faithfulness 4/5): at the listed per-MTok prices it is 10× cheaper on input and roughly 16.7× cheaper on output.
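To make the price gap concrete, the following sketch reproduces the ratios from the per-MTok prices listed above; the batch size and per-problem token counts are assumptions chosen only for illustration.

```python
# Listed prices ($ per million tokens) from the cards above.
PRICES = {
    "Claude Haiku 4.5":   {"input": 1.00, "output": 5.00},
    "Devstral Small 1.1": {"input": 0.10, "output": 0.30},
}

# Assumed workload: 10,000 math problems, ~1,500 input and ~800 output tokens each.
N_PROBLEMS, IN_TOK, OUT_TOK = 10_000, 1_500, 800

for name, p in PRICES.items():
    cost = N_PROBLEMS * (IN_TOK * p["input"] + OUT_TOK * p["output"]) / 1_000_000
    print(f"{name}: ${cost:,.2f}")

# Ratios implied by the listed prices: 10x on input, ~16.7x on output.
print(PRICES["Claude Haiku 4.5"]["input"] / PRICES["Devstral Small 1.1"]["input"])    # 10.0
print(PRICES["Claude Haiku 4.5"]["output"] / PRICES["Devstral Small 1.1"]["output"])  # ~16.67
```

Under these assumed volumes the run costs about $55 with Claude Haiku 4.5 versus about $3.90 with Devstral Small 1.1, so the gap matters mainly at batch scale.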

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

For math tasks, we supplement our benchmark suite with MATH/AIME scores from Epoch AI, an independent research organization.

Frequently Asked Questions