Claude Haiku 4.5 vs Claude Opus 4.6 for Math
Winner: Claude Opus 4.6. External math benchmarks and our internal scores on key creative and safety dimensions both favor Opus 4.6, making it the better choice for advanced mathematical reasoning. On external benchmarks Opus scores 78.7% on SWE-bench Verified and 94.4% on AIME 2025 (per Epoch AI), and in our testing it scores 5/5 on creative_problem_solving and 5/5 on safety_calibration versus Haiku 4.5's 4/5 and 2/5. Haiku 4.5 remains compelling for cost-sensitive workflows ($1/$5 per MTok input/output vs Opus's $5/$25) and matches Opus on strategic_analysis, tool_calling, faithfulness, and long_context in our internal tests, but Opus holds the edge for demanding math tasks.
Claude Haiku 4.5 (Anthropic)
Pricing: Input $1.00/MTok · Output $5.00/MTok

Claude Opus 4.6 (Anthropic)
Pricing: Input $5.00/MTok · Output $25.00/MTok
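To make the 5x pricing gap concrete, here is a minimal cost sketch in Python using the per-MTok prices listed above; the batch size and per-request token counts are illustrative assumptions, not measurements from our tests.

```python
# Back-of-the-envelope cost comparison using the per-MTok prices above.
PRICES_PER_MTOK = {
    # model: (input $/MTok, output $/MTok)
    "Claude Haiku 4.5": (1.00, 5.00),
    "Claude Opus 4.6": (5.00, 25.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request for the given model."""
    in_price, out_price = PRICES_PER_MTOK[model]
    return (input_tokens / 1_000_000) * in_price + (output_tokens / 1_000_000) * out_price

# Illustrative batch: 10,000 math questions, ~1,500 input / ~800 output tokens each.
for model in PRICES_PER_MTOK:
    total = 10_000 * request_cost(model, 1_500, 800)
    print(f"{model}: ${total:,.2f} for the batch")
```

At these assumed volumes the batch costs roughly $55 on Haiku 4.5 versus $275 on Opus 4.6, which is the trade-off the sections below weigh against Opus's stronger math signals.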
Task Analysis
What Math demands: precise stepwise reasoning, reliable symbolic manipulation, adherence to structured output, sustained multi-step context, and the ability to refuse or flag invalid or problematic requests. An authoritative external math benchmark would normally be the primary signal, but no MATH Level 5 scores were available for either model, so that benchmark cannot decide the comparison. The supplementary external measures we do have favor Opus: SWE-bench Verified 78.7% and AIME 2025 94.4% (Epoch AI). Internally, both models tie at 5/5 for strategic_analysis, tool_calling, faithfulness, and long_context, all crucial for math. Opus wins where creative_problem_solving (5 vs 4) and safety_calibration (5 vs 2) matter, such as open-ended olympiad problems and safe handling of adversarial prompts. Haiku's strengths (5/5 faithfulness, 5/5 tool_calling, and lower cost) make it an efficient option for high-throughput calculation, but it lacks the external math benchmark results available for Opus.
Practical Examples
1) Contest-style olympiad problems (AIME-level): Opus 4.6 is preferable; it scores 94.4% on AIME 2025 (Epoch AI) and 5/5 on creative_problem_solving in our tests, so it handles non-obvious multi-step insight better than Haiku (4/5).
2) Long derivations and mixed code/math work: Opus offers a 1,000,000-token context window and 128,000 max output tokens vs Haiku's 200,000 / 64,000, letting it manage very long proofs or workbook-length transcripts.
3) High-volume, cost-sensitive numeric tasks: Haiku 4.5 costs far less ($1/$5 per MTok input/output vs Opus's $5/$25) and still scores 5/5 on tool_calling and faithfulness, so it is efficient for routine symbolic evaluation, batch question answering, and applications that trade a small drop in creative insight for a much lower bill.
4) Safety-sensitive classroom or automated-grading settings: Opus scores 5/5 on safety_calibration vs Haiku's 2/5 in our tests, making it better at refusing malformed or unsafe requests where correct refusal behavior matters.
5) Classification and routing of problem types: Haiku edges Opus on classification (4 vs 3), so for rapid triage of problem types Haiku may be slightly more consistent in our tests; a routing sketch follows this list.
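As a rough illustration of the triage split in item 5, here is a hedged sketch using the Anthropic Python SDK; the model ID strings and the keyword-based routing rule are placeholder assumptions for illustration, not a production recommendation.

```python
# Illustrative router: send routine math to Haiku, contest-style problems to Opus.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

HAIKU_ID = "claude-haiku-4-5"  # placeholder ID; check Anthropic's model docs
OPUS_ID = "claude-opus-4-6"    # placeholder ID; check Anthropic's model docs

def looks_contest_level(problem: str) -> bool:
    """Naive, demo-only triage rule: flag olympiad-style wording."""
    keywords = ("prove", "olympiad", "aime", "find all", "show that")
    text = problem.lower()
    return any(k in text for k in keywords)

def solve(problem: str) -> str:
    model = OPUS_ID if looks_contest_level(problem) else HAIKU_ID
    response = client.messages.create(
        model=model,
        max_tokens=2_000,
        messages=[{"role": "user", "content": problem}],
    )
    return response.content[0].text

print(solve("Compute 17 * 23."))                   # routed to Haiku
print(solve("Prove that sqrt(2) is irrational."))  # routed to Opus
```

In practice a cheap classifier call to Haiku (its stronger suit per our classification scores) could replace the keyword rule, but the split itself (cheap model for volume, expensive model for hard problems) is the point.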
Bottom Line
For Math, choose Claude Haiku 4.5 if you need an efficient, lower-cost model for high-throughput calculations, routine symbolic steps, or classification/triage ($1/$5 per MTok input/output vs Opus's $5/$25) and you can accept a small loss in creative insight and safety calibration. Choose Claude Opus 4.6 if you need the strongest math performance in this comparison: Opus has external benchmark support (SWE-bench Verified 78.7% and AIME 2025 94.4% per Epoch AI), higher creative_problem_solving (5 vs 4), better safety calibration (5 vs 2), and much larger context and output capacity for long proofs and contest-style problems.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
For math tasks, we supplement our benchmark suite with MATH/AIME scores from Epoch AI, an independent research organization.