Claude Haiku 4.5 vs Devstral Small 1.1 for Math

Winner: Claude Haiku 4.5. In our testing, Claude Haiku 4.5 is clearly better for Math: strategic_analysis 5/5 vs Devstral Small 1.1's 2/5, plus 5/5 on faithfulness, tool_calling, and long_context. Both models tie on structured_output (4/5). Neither model has a public MATH Level 5 (Epoch AI) score in our dataset, so this verdict rests on our internal benchmarks and ranking data.

Anthropic

Claude Haiku 4.5

Overall: 4.33/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $1.00/MTok
Output: $5.00/MTok

Context Window: 200K

Mistral

Devstral Small 1.1

Overall: 3.08/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 2/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Persona Consistency: 2/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.10/MTok
Output: $0.30/MTok

Context Window: 131K

Task Analysis

Math demands reliable multi-step reasoning, faithful algebraic manipulation, long-context handling for proofs, precisely formatted outputs (LaTeX/JSON), and the ability to call external calculators or tools when needed. The external MATH Level 5 benchmark (Epoch AI) is the authoritative test for this task but has no scores for either model in our dataset, so we rely on our internal task-relevant metrics. On those metrics, Claude Haiku 4.5 scores 5/5 on strategic_analysis (tied for 1st of 54) and 5/5 on faithfulness (tied for 1st of 55), indicating stronger multi-step analytic reasoning and fewer hallucinated steps. Claude also scores 5/5 on tool_calling and long_context, which supports calculator use and long derivations. Devstral Small 1.1 scores 2/5 on strategic_analysis and 4/5 on structured_output, faithfulness, tool_calling and long_context: sufficient for many routine math tasks, but weaker on high-stakes multi-step reasoning and planning (agentic_planning 2/5). Structured_output is tied at 4/5, so formatting and schema adherence are comparable.
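To make the "precisely formatted outputs" requirement concrete, here is a minimal Python sketch of the kind of JSON schema a math-grading pipeline might enforce on model answers. The field names and the sample response are illustrative assumptions, not part of our benchmark suite.

```python
import json
from jsonschema import validate  # third-party: pip install jsonschema

# Illustrative schema for a math answer: a final result plus worked steps
# (e.g. plain text or LaTeX). Field names here are assumptions for this sketch.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "final_answer": {"type": "string"},
        "steps": {"type": "array", "items": {"type": "string"}, "minItems": 1},
        "used_calculator": {"type": "boolean"},
    },
    "required": ["final_answer", "steps"],
    "additionalProperties": False,
}

# Parse a hypothetical model response and reject anything that drifts from the schema.
response_text = '{"final_answer": "x = 3", "steps": ["2x + 1 = 7", "2x = 6", "x = 3"], "used_calculator": false}'
validate(instance=json.loads(response_text), schema=ANSWER_SCHEMA)
print("response conforms to the expected schema")
```

Both models score 4/5 on structured_output, so either should handle this kind of schema adherence in routine pipelines.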

Practical Examples

Where Claude Haiku 4.5 shines (based on our scores):

  • Olympiad-style multi-step problems and proofs: strategic_analysis 5/5 and long_context 5/5 let Claude hold long derivations and reason about tradeoffs across steps. (Claude strategic_analysis = 5/5, tied for 1st of 54.)
  • Tool-assisted numeric verification: tool_calling 5/5 improves calculator selection and argument accuracy when workflows call external math tools; see the sketch after this list. (Claude tool_calling = 5/5.)
  • High-fidelity explanations for teaching or publication: faithfulness 5/5 reduces hallucinated steps.

Where Devstral Small 1.1 shines:

  • Low-cost batch grading, exercise generation, or quick algebraic steps where deep strategic planning is not required: structured_output 4/5 and faithfulness 4/5 give reliable formatting at far lower cost. (Devstral structured_output = 4/5, faithfulness = 4/5.)
  • Embedded or on-device pipelines with tight compute/budget constraints: Devstral's listed prices are $0.10/$0.30 per MTok for input/output vs Claude Haiku 4.5's $1.00/$5.00, i.e. 10× cheaper on input and roughly 16.7× cheaper on output, making Devstral much cheaper for high-volume runs.

Quantified differences to guide choice: Claude beats Devstral by 3 points on strategic_analysis (5 vs 2) and by 1 point each on faithfulness (5 vs 4) and tool_calling (5 vs 4). Structured_output is tied at 4/5.
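As referenced in the tool-assisted verification bullet above, here is a minimal sketch of how a math workflow might expose a calculator tool to Claude Haiku 4.5 via the Anthropic Messages API. The tool schema, the prompt, and the exact model ID string are assumptions for illustration, not part of our test harness.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative calculator tool definition (names and schema are assumptions).
calculator_tool = {
    "name": "calculator",
    "description": "Evaluate an arithmetic expression and return the numeric result.",
    "input_schema": {
        "type": "object",
        "properties": {
            "expression": {"type": "string", "description": "e.g. '3**5 + 17*4'"},
        },
        "required": ["expression"],
    },
}

response = client.messages.create(
    model="claude-haiku-4-5",  # assumed model ID for Claude Haiku 4.5
    max_tokens=1024,
    tools=[calculator_tool],
    messages=[{"role": "user", "content": "Verify that 3^5 + 17*4 = 311."}],
)

# If the model chooses to verify numerically, the response contains a tool_use
# block; a high tool_calling score means the expression it passes is well-formed.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```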

Bottom Line

For Math, choose Claude Haiku 4.5 if you need reliable multi-step reasoning, long derivations, high faithfulness, or tool-assisted numeric verification (strategic_analysis 5/5, faithfulness 5/5, tool_calling 5/5). Choose Devstral Small 1.1 if you prioritize cost and throughput for routine algebra or exercise generation (structured_output 4/5, faithfulness 4/5): at the listed per-MTok prices it is 10× cheaper on input and roughly 16.7× cheaper on output.
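To make the price gap concrete, the following sketch reproduces the ratios from the per-MTok prices listed above; the batch size and per-problem token counts are assumptions chosen only for illustration.

```python
# Listed prices ($ per million tokens) from the cards above.
PRICES = {
    "Claude Haiku 4.5":   {"input": 1.00, "output": 5.00},
    "Devstral Small 1.1": {"input": 0.10, "output": 0.30},
}

# Assumed workload: 10,000 math problems, ~1,500 input and ~800 output tokens each.
N_PROBLEMS, IN_TOK, OUT_TOK = 10_000, 1_500, 800

for name, p in PRICES.items():
    cost = N_PROBLEMS * (IN_TOK * p["input"] + OUT_TOK * p["output"]) / 1_000_000
    print(f"{name}: ${cost:,.2f}")

# Ratios implied by the listed prices: 10x on input, ~16.7x on output.
print(PRICES["Claude Haiku 4.5"]["input"] / PRICES["Devstral Small 1.1"]["input"])    # 10.0
print(PRICES["Claude Haiku 4.5"]["output"] / PRICES["Devstral Small 1.1"]["output"])  # ~16.67
```

Under these assumed volumes the run costs about $55 with Claude Haiku 4.5 versus about $3.90 with Devstral Small 1.1, so the gap matters mainly at batch scale.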

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

For math tasks, we supplement our benchmark suite with MATH/AIME scores from Epoch AI, an independent research organization.

Frequently Asked Questions