Claude Haiku 4.5 vs R1 0528 for Math

Winner: R1 0528. On the authoritative external test MATH Level 5 (Epoch AI), R1 0528 scores 96.6%, while Claude Haiku 4.5 has no MATH Level 5 result in our data. Because the external benchmark is the primary signal for Math performance, R1 0528 is the clear choice for competition-level mathematical reasoning. Internal proxies support R1's strength (tool_calling 5, long_context 5, faithfulness 5, structured_output 4). Claude Haiku 4.5 shows strengths in strategic_analysis (5) and tool_calling (5) in our internal tests but lacks the external verification needed to beat R1 on Math.

Anthropic

Claude Haiku 4.5

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $1.00/MTok
Output: $5.00/MTok

Context Window: 200K

modelpicker.net

DeepSeek

R1 0528

Overall
4.50/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 4/5
Strategic Analysis: 4/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: 96.6%
AIME 2025: 66.4%

Pricing

Input: $0.50/MTok
Output: $2.15/MTok

Context Window: 164K

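The per-token prices in the two cards can be compared with simple arithmetic. A minimal sketch, using the listed per-MTok rates; the workload size is an illustrative assumption, not from our data:

```python
# Per-million-token (MTok) prices from the comparison cards above.
PRICES = {
    "Claude Haiku 4.5": {"input": 1.00, "output": 5.00},
    "R1 0528": {"input": 0.50, "output": 2.15},
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one job: tokens times per-MTok rate, scaled to tokens."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Illustrative workload: 2M input tokens, 500K output tokens.
for model in PRICES:
    print(f"{model}: ${job_cost(model, 2_000_000, 500_000):.2f}")
```

At these rates the illustrative workload costs $4.50 on Claude Haiku 4.5 versus roughly $2.08 on R1 0528, so R1 is a little under half the price for this mix of input and output.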

Task Analysis

What Math demands: precise multi-step reasoning, exact symbolic and numeric computation, strict structured output for proofs or solutions, and the ability to hold long derivations in context. The external MATH Level 5 score (Epoch AI) is the primary measure for this task in our data; it evaluates contest-style, high-difficulty problems. R1 0528's 96.6% on MATH Level 5 (Epoch AI) is therefore the main evidence of superior Math capability.

Supporting internal metrics: R1 0528 scores 5/5 on tool_calling (accurate function selection and arguments), 5/5 on long_context (retrieval at 30K+ tokens), and 5/5 on faithfulness (sticks to the source), with structured_output 4/5 and constrained_rewriting 4/5; these traits explain why it succeeds on hard, structured math problems.

Claude Haiku 4.5 lacks an external MATH Level 5 entry in our data, but internally it scores 5/5 on strategic_analysis, 5/5 on tool_calling, and 5/5 on both faithfulness and long_context, indicators that it can handle nuanced tradeoffs and long derivations, though we have no external contest confirmation.

Practical Examples

R1 0528 (96.6% on MATH Level 5, Epoch AI):
- Solving contest-style multi-step problems under strict answer formats (e.g., AIME-style numeric answers); the external score shows high reliability here.
- Long, multi-part derivations where maintaining intermediate variables and returning exact numeric results matters; long_context 5 and faithfulness 5 support this.
- Structured solutions with JSON or LaTeX-like output where format compliance is important; structured_output 4 helps ensure adherence.

Claude Haiku 4.5 (no external MATH Level 5 score in our data):
- Nuanced tradeoff and reasoning tasks that require strategic analysis of methods (strategic_analysis 5), useful for choosing solution approaches.
- Interactive tutoring sessions with long contexts and stepwise explanations (long_context 5, tool_calling 5, faithfulness 5).
- Rapid iteration where concise reformulation and tool use are needed; however, the lack of external MATH Level 5 verification means contest-grade performance is unconfirmed in our data.

Concrete score-grounded contrasts: R1 0528's external 96.6% on MATH Level 5 is the decisive advantage for competition math. Internally both models tie at 5/5 for tool_calling, long_context, and faithfulness, but Claude leads on strategic_analysis (5 vs R1's 4), which favors method selection over contest answer accuracy.
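When format compliance matters, as in the structured-solution use cases above, a lightweight validator can reject non-conforming answers before any grading happens. A minimal sketch, assuming a hypothetical JSON solution format with `steps` and `answer` fields; AIME answers are integers from 0 to 999:

```python
import json

def validate_aime_answer(raw: str) -> int:
    """Parse a JSON solution and enforce the AIME answer-format contract.

    Expects an object like {"steps": [...], "answer": 204}. AIME answers
    must be integers in 0-999; anything else raises ValueError.
    """
    obj = json.loads(raw)
    answer = obj.get("answer")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(answer, int) or isinstance(answer, bool):
        raise ValueError("answer must be an integer")
    if not 0 <= answer <= 999:
        raise ValueError("AIME answers are integers from 0 to 999")
    return answer

print(validate_aime_answer('{"steps": ["..."], "answer": 204}'))
```

A conforming solution passes through unchanged; a float, string, or out-of-range value fails fast, which is exactly the kind of adherence the structured_output score is probing.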

Bottom Line

For Math, choose Claude Haiku 4.5 if you need a model that excels at strategic analysis, long-context interactive explanation, and fast iteration (high internal scores in strategic_analysis, tool_calling, faithfulness). Choose R1 0528 if you need competition-grade, verified problem solving — R1 scores 96.6% on MATH Level 5 (Epoch AI) and ranks 5th among 52 models for Math in our data.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
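As a sanity check on the cards above, each model's overall score is consistent with a simple mean of its 12 internal benchmark scores. This averaging rule is our inference from the numbers, not a documented formula:

```python
# Internal 1-5 benchmark scores copied from the comparison cards above,
# in card order: faithfulness, long context, multilingual, tool calling,
# classification, agentic planning, structured output, safety calibration,
# strategic analysis, persona consistency, constrained rewriting,
# creative problem solving.
claude_haiku_45 = [5, 5, 5, 5, 4, 5, 4, 2, 5, 5, 3, 4]
r1_0528 = [5, 5, 5, 5, 4, 5, 4, 4, 4, 5, 4, 4]

def overall(scores: list[int]) -> float:
    """Mean of the 12 benchmark scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)

print(overall(claude_haiku_45))  # matches the 4.33/5 shown above
print(overall(r1_0528))          # matches the 4.50/5 shown above
```

The one large gap in the internal suite is safety calibration (2 vs 4), which accounts for most of the 0.17-point difference in the overall scores.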

For math tasks, we supplement our benchmark suite with MATH/AIME scores from Epoch AI, an independent research organization.

Frequently Asked Questions