Claude Haiku 4.5 vs Codestral 2508 for Math
Winner: Claude Haiku 4.5. Neither model has a public MATH Level 5 score (math_level_5, Epoch AI), so we base the verdict on our internal benchmarks. In our testing, Haiku 4.5 outperforms Codestral 2508 on the core Math subtest strategic_analysis (5 vs 2) and on creative_problem_solving (4 vs 2), and it also scores higher on agentic_planning (5 vs 4) and persona_consistency (5 vs 3). Codestral 2508 is stronger only at structured_output (5 vs 4). Because strategic analysis and creative problem solving matter most for mathematical reasoning, Haiku 4.5 is the clear pick for Math in our benchmarks, while Codestral 2508 is the economical choice when strict schema/format compliance is the top requirement.
Anthropic
Claude Haiku 4.5
Pricing
Input: $1.00/MTok
Output: $5.00/MTok
Mistral
Codestral 2508
Pricing
Input: $0.30/MTok
Output: $0.90/MTok
Task Analysis
What Math demands: precise multi-step reasoning, robust plan decomposition, faithful step-by-step derivations, and reliable adherence to requested output formats (for automatic grading or extraction). When an external benchmark exists we lead with it; the primary external measure for this task is MATH Level 5 (math_level_5, Epoch AI), but neither model has a published score there, so that signal is unavailable. Our internal tests therefore become the primary evidence. The two core tests for this task are strategic_analysis and structured_output. In our testing, Haiku 4.5 scores 5 on strategic_analysis vs Codestral 2508's 2, indicating much stronger high-level mathematical planning. Codestral 2508 scores 5 vs Haiku's 4 on structured_output, meaning it is marginally better at exact schema/format compliance. Supporting metrics: both models score 5 on tool_calling, faithfulness, and long_context, so neither is disadvantaged for long derivations or tool-assisted calculation. Haiku's advantages in strategic_analysis (5 vs 2), creative_problem_solving (4 vs 2), and agentic_planning (5 vs 4) explain its overall edge for pure mathematical reasoning in our suite.
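To make the comparison concrete, here is a minimal Python sketch of how the internal scores quoted above could be combined into a single Math verdict. The scores are the ones reported in this section; the weights are illustrative assumptions for this sketch, not our published methodology.

```python
# Sketch: combine the internal 1-5 scores quoted above into a weighted Math score.
# Scores come from this article; the weights are illustrative assumptions only.

scores = {
    "Claude Haiku 4.5": {"strategic_analysis": 5, "creative_problem_solving": 4,
                         "agentic_planning": 5, "structured_output": 4},
    "Codestral 2508":   {"strategic_analysis": 2, "creative_problem_solving": 2,
                         "agentic_planning": 4, "structured_output": 5},
}

# Hypothetical strategy-heavy weighting, since Math rewards planning and
# creative problem solving more than strict format compliance.
weights = {"strategic_analysis": 0.4, "creative_problem_solving": 0.3,
           "agentic_planning": 0.2, "structured_output": 0.1}

for model, s in scores.items():
    weighted = sum(weights[k] * s[k] for k in weights)
    print(f"{model}: {weighted:.2f} / 5")
# Claude Haiku 4.5: 4.60 / 5
# Codestral 2508: 2.70 / 5
```

Under this illustrative weighting Haiku 4.5 comes out well ahead; only a mix dominated by structured_output, the one test where Codestral 2508 leads, would tilt the verdict the other way.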
Practical Examples
When to pick Haiku 4.5 (examples tied to scores):
- Multi-step olympiad-style proofs or non-obvious solution strategies: Haiku 4.5 scored 5 on strategic_analysis vs Codestral 2508's 2 in our tests, so it is likelier to produce correct high-level plans and nontrivial solution paths.
- Error-checking, counterexample search, and alternative proofs: Haiku's creative_problem_solving score of 4 vs 2 means better generation of feasible alternate approaches.
- Long, decomposed solutions or failure recovery in multi-step problems: Haiku's agentic_planning score of 5 vs 4 supports task decomposition and robust step sequencing.
When to pick Codestral 2508 (examples tied to scores and cost):
- Strict answer extraction, automated graders, or JSON/CSV outputs where schema compliance is critical: Codestral 2508 scored 5 on structured_output vs Haiku's 4, so it has an edge producing exact machine-parseable outputs.
- High-volume, low-latency math pipelines (e.g., auto-grading thousands of short problems): Codestral is cheaper (input $0.30/MTok and output $0.90/MTok vs Haiku's $1.00 input and $5.00 output), so it lowers running costs substantially; see the cost sketch below.
- Short, format-sensitive tasks such as unit tests or fill-in-the-blank math question generation, where the primary requirement is format fidelity rather than deep strategy.
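For the cost point, here is a rough back-of-the-envelope sketch using the list prices above. The per-problem token counts (500 input, 200 output) and the batch size of 10,000 problems are assumptions chosen for illustration; actual usage will differ.

```python
# Hypothetical cost estimate for a high-volume auto-grading pipeline.
# Prices are the list prices quoted above; token counts are assumptions.

PRICES = {  # USD per million tokens: (input, output)
    "Claude Haiku 4.5": (1.00, 5.00),
    "Codestral 2508": (0.30, 0.90),
}

problems = 10_000
in_tokens, out_tokens = 500, 200  # assumed per short math problem

for model, (p_in, p_out) in PRICES.items():
    cost = problems * (in_tokens * p_in + out_tokens * p_out) / 1_000_000
    print(f"{model}: ${cost:.2f} for {problems:,} problems")
# Claude Haiku 4.5: $15.00 for 10,000 problems
# Codestral 2508: $3.30 for 10,000 problems
```

Even under these rough assumptions, Codestral 2508 comes out roughly 4-5x cheaper per graded problem at list prices, which is why it is the economical choice when format fidelity is the main requirement.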
Bottom Line
For Math, choose Claude Haiku 4.5 if you need superior mathematical planning, multi-step reasoning, or creative problem solving (strategic_analysis 5 vs 2; creative_problem_solving 4 vs 2 in our tests). Choose Codestral 2508 if you prioritize strict schema/format compliance and lower cost (structured_output 5 vs 4; input $0.30/MTok and output $0.90/MTok vs Haiku's $1.00/$5.00). Note: neither model has a published MATH Level 5 (math_level_5, Epoch AI) score, so external verification is unavailable.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
For math tasks, we supplement our benchmark suite with MATH/AIME scores from Epoch AI, an independent research organization.