Claude Haiku 4.5 vs DeepSeek V3.1 Terminus for Math

Winner: Claude Haiku 4.5. In our Math tests the two models tie on strategic analysis (5 vs 5), but Claude Haiku 4.5 wins the math-critical supporting metrics, faithfulness (5 vs 3) and tool calling (5 vs 3), which drive correct stepwise solutions and reliable use of external tools and math engines. DeepSeek V3.1 Terminus has the advantage on structured output (5 vs 4), but that single edge is outweighed by Haiku's higher faithfulness and tool-calling ability. Note: an external MATH Level 5 benchmark is listed, but no scores are available for either model, so our winner is based on the internal task-relevant benchmarks.

Anthropic

Claude Haiku 4.5

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$1.00/MTok

Output

$5.00/MTok

Context Window

200K


DeepSeek

DeepSeek V3.1 Terminus

Overall
3.75/5 (Strong)

Benchmark Scores

Faithfulness
3/5
Long Context
5/5
Multilingual
5/5
Tool Calling
3/5
Classification
3/5
Agentic Planning
4/5
Structured Output
5/5
Safety Calibration
1/5
Strategic Analysis
5/5
Persona Consistency
4/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.21/MTok

Output

$0.79/MTok

Context Window

164K


Task Analysis

External benchmark context: a MATH Level 5 external benchmark is listed for this comparison, but no scores are available for either model. Our primary evidence is therefore the internal task tests most relevant to Math (strategic analysis and structured output) plus the supporting capabilities.

What Math demands from an LLM: precise stepwise numeric reasoning, faithful adherence to the problem statement (no hallucinated premises), correct selection and sequencing of tools or calculators when required, and strict formatting when machine-readable outputs (JSON/LaTeX) are needed.

In our tests, both models score 5 on strategic analysis, so they match on nuanced numerical tradeoffs. DeepSeek V3.1 Terminus wins structured output (5 vs 4), which favors tasks that require exact JSON or format compliance. Claude Haiku 4.5 scores higher on faithfulness (5 vs 3) and tool calling (5 vs 3), indicating it is more likely to stick to the given premises and to invoke functions or calculators correctly; both are critical for correct math solutions. The compute-then-verify pattern the tool-calling test probes is sketched below.

Additional relevant differences: Haiku supports text+image->text input (useful for parsing diagrams) and has a larger context window (200,000 vs 163,840 tokens); DeepSeek is text->text only but has top structured-output performance. Cost also matters: Haiku's output price is higher ($5.00 vs $0.79 per MTok), so cost-sensitive workloads may favor DeepSeek despite its weaker faithfulness and tool-calling scores.
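To make the tool-calling requirement concrete, here is a minimal sketch of that compute-then-verify pattern, with a local dispatcher standing in for a real model loop. The tool names (evaluate_expression, verify_equals), their schemas, and the dispatcher are illustrative assumptions for this sketch, not either vendor's API.

```python
import ast
import operator

# Hypothetical tool schemas in the common JSON function-schema shape.
# A math-capable model is expected to pick the right tool and sequence
# calls (compute first, then verify) instead of guessing at arithmetic.
TOOLS = [
    {
        "name": "evaluate_expression",
        "description": "Exactly evaluate an arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
    {
        "name": "verify_equals",
        "description": "Check a candidate answer against a computed value.",
        "parameters": {
            "type": "object",
            "properties": {
                "candidate": {"type": "number"},
                "computed": {"type": "number"},
            },
            "required": ["candidate", "computed"],
        },
    },
]

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg}

def _safe_eval(node):
    """Evaluate a parsed arithmetic expression without exec/eval."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_safe_eval(node.left), _safe_eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_safe_eval(node.operand))
    raise ValueError("unsupported expression")

def dispatch(name, args):
    """Local stand-in for the tool-execution side of the loop."""
    if name == "evaluate_expression":
        return _safe_eval(ast.parse(args["expression"], mode="eval").body)
    if name == "verify_equals":
        return abs(args["candidate"] - args["computed"]) < 1e-9
    raise KeyError(name)

# The call sequence a faithful model should produce for "compute 3^5 - 17":
value = dispatch("evaluate_expression", {"expression": "3**5 - 17"})
assert dispatch("verify_equals", {"candidate": 226, "computed": value})
```

The tool-calling benchmark rewards exactly this discipline: choosing the right function, passing well-formed arguments, and checking the result before committing to an answer.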

Practical Examples

  1. Long, multi-step proof that must avoid hallucination: choose Claude Haiku 4.5. Evidence: faithfulness 5 vs 3 and strategic analysis 5 vs 5. Haiku is likelier to keep intermediate steps consistent and correctly reference earlier derivations.
  2. Exact machine-readable answers (a required JSON schema, autograders, or downstream parsers): choose DeepSeek V3.1 Terminus. Evidence: structured output 5 vs 4; DeepSeek is stronger at strict format compliance. See the validation sketch after this list.
  3. Problems requiring external calculators or sequenced tool calls (e.g., compute, then verify, then format): choose Claude Haiku 4.5. Evidence: tool calling 5 vs 3; Haiku is better at selecting and sequencing function calls.
  4. Image-based geometry or diagram parsing: prefer Claude Haiku 4.5 because it supports text+image->text; DeepSeek is text->text only.
  5. Budgeted large-scale grading pipelines where structured JSON answers are the sole requirement: DeepSeek V3.1 Terminus may be preferable because its output cost is far lower ($0.79 vs $5.00 per MTok) and it scores 5 on structured output.
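To illustrate scenario 2, here is a minimal stdlib-only sketch of the strict answer contract an autograder might enforce. The field names ("answer", "steps") are assumptions for this example, not a schema either model requires.

```python
import json

# Illustrative answer contract for an autograded math pipeline.
# The field names and types are assumptions for this sketch.
REQUIRED = {"answer": (int, float, str), "steps": list}

def validate_answer(raw: str) -> dict:
    """Parse a model reply and enforce the contract strictly.

    Structured-output quality is exactly this: the reply must be pure
    JSON (no prose wrapper) with the agreed fields and types.
    """
    obj = json.loads(raw)  # raises ValueError if any prose surrounds the JSON
    for field, types in REQUIRED.items():
        if field not in obj:
            raise ValueError(f"missing field: {field}")
        if not isinstance(obj[field], types):
            raise ValueError(f"bad type for {field}: {type(obj[field]).__name__}")
    return obj

reply = '{"answer": "226", "steps": ["3^5 = 243", "243 - 17 = 226"]}'
print(validate_answer(reply)["answer"])  # -> 226
```

A model that scores 5 on structured output clears this kind of gate reliably; one that occasionally wraps the JSON in explanatory prose fails it outright.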

Bottom Line

For Math, choose Claude Haiku 4.5 if you need higher faithfulness, reliable tool/calculator invocation, image-based diagram parsing, or more robust stepwise correctness (faithfulness 5 vs 3; tool calling 5 vs 3). Choose DeepSeek V3.1 Terminus if your priority is strict machine-readable output and lower per-token output cost (structured output 5 vs 4; $0.79 vs $5.00 per MTok output).
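To make the cost side of that tradeoff concrete, here is a back-of-envelope calculation from the listed output prices. The per-solution token count and job size are assumed workload figures, not measurements.

```python
# Back-of-envelope output-cost comparison from the listed prices.
# 400 output tokens per solution and 100k solutions are assumptions.
TOKENS_PER_SOLUTION = 400
SOLUTIONS = 100_000

for model, usd_per_mtok in [("Claude Haiku 4.5", 5.00),
                            ("DeepSeek V3.1 Terminus", 0.79)]:
    cost = SOLUTIONS * TOKENS_PER_SOLUTION / 1_000_000 * usd_per_mtok
    print(f"{model}: ${cost:,.2f} for {SOLUTIONS:,} solutions")
# -> $200.00 vs $31.60: roughly a 6.3x gap on output spend alone.
```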

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
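For readers curious what 1-5 LLM-judge scoring looks like mechanically, here is a generic sketch of the pattern. The rubric wording and the score parsing are illustrative assumptions, not our actual test harness.

```python
import re

# Generic shape of an LLM-as-judge rubric prompt; the wording is an
# illustrative assumption, not the actual harness prompt.
JUDGE_PROMPT = """You are grading a model response on {benchmark}.
Score it from 1 (fails the task) to 5 (flawless), judging only
{criteria}. Reply with a single line: SCORE: <1-5>."""

def parse_score(judge_reply: str) -> int:
    """Extract the 1-5 score and reject anything out of range."""
    match = re.search(r"SCORE:\s*([1-5])\b", judge_reply)
    if not match:
        raise ValueError("judge reply did not contain a valid score")
    return int(match.group(1))

print(JUDGE_PROMPT.format(benchmark="tool calling",
                          criteria="function selection and argument accuracy"))
print(parse_score("SCORE: 4"))  # -> 4
```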

For math tasks, we supplement our benchmark suite with MATH/AIME scores from Epoch AI, an independent research organization.

Frequently Asked Questions