Claude Haiku 4.5 vs Devstral Medium for Math
Winner: Claude Haiku 4.5. In our testing Claude Haiku 4.5 scored 5/5 on strategic_analysis vs Devstral Medium's 2/5, while both tie at 4/5 on structured_output. The decisive gap on strategic_analysis — the core skill for multi-step mathematical reasoning — makes Haiku the better Math model here. Note: an external MATH Level 5 benchmark is present in the dataset (Epoch AI) but neither model has a reported score, so the winner call is based on our internal tests.
Claude Haiku 4.5 (Anthropic)
Pricing: $1.00/MTok input, $5.00/MTok output
Devstral Medium (Mistral)
Pricing: $0.40/MTok input, $2.00/MTok output
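To put the price gap in concrete terms, here is a quick sketch of per-request cost at the listed rates. The token counts are hypothetical, chosen only to illustrate that Devstral comes out roughly 2.5x cheaper on a typical problem:

```python
# Per-request cost at the listed prices (USD per million tokens).
# Token counts below are hypothetical, for illustration only.
PRICES = {
    "Claude Haiku 4.5": {"input": 1.00, "output": 5.00},
    "Devstral Medium": {"input": 0.40, "output": 2.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the table prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a 2,000-token problem statement with a 1,500-token worked solution.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 1_500):.4f}")
# Claude Haiku 4.5: $0.0095
# Devstral Medium: $0.0038
```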
Task Analysis
What Math demands: precise stepwise reasoning, reliable numeric fidelity, and the ability to present answers in strict formats (e.g., exact expressions or JSON). For this task we treat strategic_analysis as the primary internal proxy (nuanced tradeoff reasoning with real numbers) and structured_output as a critical secondary capability (JSON/schema compliance).

External resource: the dataset includes MATH Level 5 (Epoch AI) as the authoritative external benchmark, but neither model has a reported score there, so we lead with our internal results.

Key supporting capabilities in our tests: faithfulness (avoids hallucinated numeric steps), tool_calling (selects and uses calculators or numeric tools), long_context (holds multi-step derivations across many tokens), agentic_planning (decomposes problems), and creative_problem_solving (finds non-obvious strategies).

Internal scores (Claude Haiku 4.5 / Devstral Medium):
- strategic_analysis: 5 / 2
- structured_output: 4 / 4
- faithfulness: 5 / 4
- tool_calling: 5 / 3
- long_context: 5 / 4
- agentic_planning: 5 / 4
- creative_problem_solving: 4 / 2

These scores explain why Haiku handles complex, multi-step math more reliably; parity on structured_output means both models can meet format requirements equally well.
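To make the tool_calling criterion concrete, here is a minimal sketch of exposing a calculator tool to a model via the Anthropic Python SDK. The model ID, tool name, and prompt are placeholder assumptions rather than our actual harness; the shape of the tools list follows Anthropic's published tool-use format:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical calculator tool: the model decides when to call it.
calculator_tool = {
    "name": "calculator",
    "description": "Evaluate an arithmetic expression and return the exact result.",
    "input_schema": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}

response = client.messages.create(
    model="claude-haiku-4-5",  # placeholder model ID
    max_tokens=1024,
    tools=[calculator_tool],
    messages=[{"role": "user", "content": "Compute 2**10 + 7*13 exactly."}],
)

# A strong tool_calling model emits a tool_use block instead of guessing digits.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)  # e.g. calculator {'expression': '2**10 + 7*13'}
```

The criterion rewards orchestration: routing exact arithmetic through the tool, then resuming the derivation with the returned value rather than a free-associated number.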
Practical Examples
Where Claude Haiku 4.5 shines (based on scores):
- Multi-step contest problems requiring tradeoffs or case analysis (strategic_analysis 5 vs 2); Haiku is more likely to select a correct decomposition strategy.
- Long derivations or proofs needing token continuity (long_context 5 vs 4) and faithful intermediate calculations (faithfulness 5 vs 4).
- Workflows that call external numeric tools or sequence functions (tool_calling 5 vs 3).

Where Devstral Medium is appropriate (based on scores and cost):
- Short to medium complexity math tasks with strict formatting needs: both models tie on structured_output (4/5), so Devstral can produce compliant JSON or answer templates.
- Cost-sensitive batch jobs where deep strategic reasoning is not required: Devstral is cheaper ($0.40 vs $1.00 input, $2.00 vs $5.00 output per MTok).

Concrete grounded differences: the 3-point strategic_analysis gap (5 vs 2) favors Haiku for reasoning-heavy math; tool_calling 5 vs 3 indicates Haiku is materially better at orchestrating calculator/tool steps; structured_output 4 vs 4 shows both meet format constraints equally.
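Since both models tie on structured_output, the practical question is whether your pipeline enforces the format. A minimal sketch, assuming the jsonschema package and a hypothetical answer template (final answer plus worked steps):

```python
import json
import jsonschema

# Hypothetical answer template for a math task: exact final answer plus worked steps.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},  # exact expression, e.g. "3*sqrt(2)"
        "steps": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["answer", "steps"],
    "additionalProperties": False,
}

def is_compliant(raw: str) -> bool:
    """Return True if a model's raw output parses as JSON and matches the schema."""
    try:
        jsonschema.validate(json.loads(raw), ANSWER_SCHEMA)
        return True
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return False

print(is_compliant('{"answer": "1031", "steps": ["2**10 = 1024", "1024 + 7 = 1031"]}'))  # True
print(is_compliant('{"answer": 1031}'))  # False: wrong type, missing steps
```

With a validator like this in the loop, either model's 4/5 structured_output is typically sufficient, since non-compliant responses can be rejected and retried.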
Bottom Line
For Math, choose Claude Haiku 4.5 if you need deep, multi-step mathematical reasoning, reliable symbolic/numeric fidelity, tool orchestration, or long derivations (strategic_analysis 5 vs 2, tool_calling 5 vs 3, faithfulness 5 vs 4). Choose Devstral Medium if you need cheaper inference ($0.40 vs $1.00 input, $2.00 vs $5.00 output per MTok), your problems are short or format-driven (structured_output tied at 4/5), and you can tolerate weaker strategic reasoning.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
For math tasks, we supplement our benchmark suite with MATH/AIME scores from Epoch AI, an independent research organization.