Claude Sonnet 4.6 vs R1 0528 for Math
R1 0528 is the Math winner. On the external MATH Level 5 benchmark (Epoch AI), R1 0528 scores 96.6%, while Claude Sonnet 4.6 has no MATH Level 5 result in our dataset. Because the external MATH Level 5 score is the primary evidence for Math performance, R1 0528 is the clear pick for raw Math problem-solving per Epoch AI. That said, Sonnet 4.6 posts higher internal scores on related proxies (AIME 2025: 85.8 vs 66.4; SWE-bench Verified: 75.2, where R1 has no entry) and leads on strategic_analysis (5 vs 4) and creative_problem_solving (5 vs 4). Use R1 when MATH Level 5 performance is the priority; consider Sonnet for contest-style or strategic reasoning workflows, where our AIME and internal strategic scores favor it.
Claude Sonnet 4.6 (Anthropic)
Pricing: input $3.00/MTok, output $15.00/MTok

R1 0528 (DeepSeek)
Pricing: input $0.50/MTok, output $2.15/MTok
Task Analysis
What Math demands: accurate multi-step symbolic reasoning, stable step-by-step derivations, the ability to follow and produce structured solutions, long-context handling for multi-part proofs, and reliable faithfulness (no hallucinated steps). When an external benchmark exists, it is the primary measure: on MATH Level 5 (Epoch AI), R1 0528 scores 96.6%, and this is the main signal we use to call the winner for Math.

Supporting internal signals: both models score 5 on long_context, 5 on faithfulness, and 5 on tool_calling (useful for tool-assisted computation).

Differences that explain the nuance: Claude Sonnet 4.6 scores 85.8 on AIME 2025 (our internal proxy) and 75.2 on SWE-bench Verified (coding-adjacent math), plus top marks in strategic_analysis (5) and creative_problem_solving (5), indicating strong multi-step tradeoff reasoning and inventive solutions. R1 0528 has a lower internal AIME 2025 score (66.4) but the decisive MATH Level 5 result (96.6%), along with good internal marks in structured_output (4) and tool_calling (5).

Two practical notes. R1 is a reasoning model that spends output budget on reasoning tokens and can return empty responses on structured output unless configured with a high max_completion_tokens, which can affect short, strictly formatted prompts (see the sketch below). Claude Sonnet 4.6 offers a far larger context window (1,000,000 tokens) and explicit structured_outputs support, which matters for very long derivations or multimodal math workflows.
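Because this quirk surfaces as a silently empty completion, it is worth guarding against explicitly. Below is a minimal sketch assuming an OpenAI-compatible R1 endpoint via the openai Python SDK; the base URL, model id, prompt, and token budget are placeholders we introduce for illustration, not values from this comparison.

```python
# Minimal sketch: guard against empty completions from a reasoning model,
# assuming an OpenAI-compatible endpoint. All identifiers below are
# placeholders, not values from this comparison.
from openai import OpenAI

client = OpenAI(base_url="https://example-r1-provider/v1", api_key="...")

resp = client.chat.completions.create(
    model="r1-0528",  # placeholder model id
    messages=[{
        "role": "user",
        "content": 'Solve x^2 - 5x + 6 = 0. Reply as JSON: {"roots": [...]}',
    }],
    # Reasoning models spend part of the completion budget on reasoning
    # tokens before emitting the visible answer; a tight cap can leave the
    # visible content empty. (Some providers name this max_completion_tokens.)
    max_tokens=8192,
)

content = resp.choices[0].message.content
if not content:
    # An empty visible output usually means the budget was consumed by
    # reasoning tokens: retry with a larger cap rather than treating it
    # as a model failure.
    raise RuntimeError("Empty completion; raise the max token budget and retry.")
print(content)
```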
Practical Examples
1) High-stakes MATH Level 5-style problem sets (external benchmark alignment): R1 0528. It scores 96.6% on MATH Level 5 (Epoch AI), so expect the most consistent success on that specific benchmark.
2) AIME-like contest practice and step-by-step olympiad reasoning: Claude Sonnet 4.6. Its AIME 2025 internal score is 85.8 vs 66.4 for R1, so Sonnet is stronger on our AIME proxy and may produce better contest-style derivations or creative solution strategies.
3) Very long derivations, notebooks, or multimodal math (image→text): Claude Sonnet 4.6. A 1,000,000-token context window and text+image→text modality help with long proofs or image-based problems.
4) Cost-sensitive bulk grading or automated problem solvers: R1 0528. It costs $0.50/MTok input and $2.15/MTok output vs $3.00 input / $15.00 output for Claude Sonnet 4.6, making Sonnet roughly 7× more expensive by our priceRatio (see the worked cost sketch below).
5) Strict JSON solutions or short structured outputs: both tie on structured_output (4/5), but R1 may return empty responses on structured output unless given a high max-completion-token budget, so plan prompt and config accordingly.
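To make the price gap concrete, here is a back-of-the-envelope cost comparison in Python. Only the per-MTok prices come from the listings above; the job size and per-problem token counts are illustrative assumptions.

```python
# Back-of-the-envelope cost comparison for a bulk grading job.
# Prices ($/MTok) are from the listings above; the workload is hypothetical.
PRICES = {
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
    "r1-0528": {"input": 0.50, "output": 2.15},
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total dollar cost for a job at the given token volumes."""
    p = PRICES[model]
    return (p["input"] * input_tokens + p["output"] * output_tokens) / 1_000_000

# Hypothetical workload: 10,000 problems, ~1k input and ~2k output tokens each.
tokens_in, tokens_out = 10_000 * 1_000, 10_000 * 2_000
sonnet = job_cost("claude-sonnet-4.6", tokens_in, tokens_out)  # $330.00
r1 = job_cost("r1-0528", tokens_in, tokens_out)                # $48.00
print(f"Sonnet: ${sonnet:.2f}, R1: ${r1:.2f}, ratio: {sonnet / r1:.1f}x")  # ~6.9x
```

At output-heavy volumes like these, the blended ratio stays near 7× because the output-price gap ($15.00 vs $2.15) dominates the bill; input-heavy workloads drift closer to the 6× input-price gap.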
Bottom Line
For Math, choose R1 0528 if you need top performance on MATH Level 5 problems (96.6%, Epoch AI) or are cost-sensitive. Choose Claude Sonnet 4.6 if you need stronger AIME-style performance (85.8 on AIME 2025 in our tests), a massive context window or multimodal math workflows, or stronger strategic and creative reasoning, despite the higher cost.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
For math tasks, we supplement our benchmark suite with MATH/AIME scores from Epoch AI, an independent research organization.