Claude Sonnet 4.6 vs GPT-5.4 for Math
Winner: GPT-5.4. On the external contest benchmark we have (AIME 2025, via Epoch AI), GPT-5.4 scores 95.3% to Claude Sonnet 4.6's 85.8%, a 9.5-point margin. GPT-5.4 also scores higher on structured output in our tests (5 vs 4), which matters for exact solution formatting. Claude Sonnet 4.6 is stronger in tool calling (5 vs 4) and creative problem solving (5 vs 4), but it trails on the primary contest-style metrics available to us, so GPT-5.4 is the clear pick for high-stakes math problem solving in our benchmarks.
Claude Sonnet 4.6 (Anthropic)
Pricing: $3.00/MTok input, $15.00/MTok output
modelpicker.net
GPT-5.4 (OpenAI)
Pricing: $2.50/MTok input, $15.00/MTok output
Task Analysis
What Math demands: precise multi-step reasoning, exact numeric and symbolic manipulation, reliable step-by-step justification, and strict formatting of final answers (e.g., short-form contest answers, LaTeX, or JSON). External contest benchmarks are the primary signal for math performance when available. The math_level_5 external benchmark exists in our dataset, but neither model has a recorded score on it, so we rely on other external measures (AIME 2025 and SWE-bench Verified, both via Epoch AI) plus our internal proxies. On those measures, GPT-5.4 scores 95.3% on AIME 2025 vs Claude Sonnet 4.6's 85.8%, and 76.9% vs 75.2% on SWE-bench Verified. Among internal proxies, structured_output (JSON/format adherence) is 5 for GPT-5.4 vs 4 for Sonnet; strategic_analysis (nuanced numeric tradeoffs) is 5 for both; and tool_calling (useful for external computation or calculators) is 5 for Sonnet vs 4 for GPT-5.4. Read these numbers together: GPT-5.4 leads on contest and formatting metrics, while Sonnet 4.6 can be preferable when external tool orchestration or exploratory idea generation is central.
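As an illustration of what strict answer formatting looks like in practice, here is a minimal Python sketch that validates a model's JSON response against an AIME-style schema (AIME answers are integers from 0 to 999). The schema and function name are hypothetical examples, not part of our actual test harness:

```python
import json

def validate_contest_answer(raw: str) -> int:
    """Parse a model's JSON response and enforce an AIME-style schema:
    a "final_answer" field holding an integer in [0, 999].
    Hypothetical schema for illustration only."""
    payload = json.loads(raw)          # raises ValueError on malformed JSON
    answer = payload["final_answer"]   # raises KeyError if the field is missing
    if not isinstance(answer, int) or not 0 <= answer <= 999:
        raise ValueError(f"not a valid AIME answer: {answer!r}")
    return answer

# A well-formed response passes; out-of-range or string answers raise.
answer = validate_contest_answer('{"final_answer": 204, "reasoning": "..."}')
```

A model with stronger format adherence fails this kind of check less often, which is why the structured_output proxy matters for math pipelines.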
Practical Examples
- AIME-style contest problems: GPT-5.4 (95.3% on AIME 2025 via Epoch AI vs Sonnet's 85.8%) will more reliably produce correct final answers in compact contest format.
- Multi-step proofs requiring precise output formatting: GPT-5.4 (structured_output 5 vs Sonnet's 4) adheres better to required output schemas or concise numeric answers.
- Large symbolic derivations with external computation: Claude Sonnet 4.6 is advantageous when you need tool workflows (tool_calling 5 vs 4); Sonnet better selects and sequences functions and arguments in our tests.
- Exploratory problem solving and brainstorming multiple solution paths: Sonnet 4.6 (creative_problem_solving 5 vs 4) generates more varied approaches.
- Batch grading or programmatic answer extraction: GPT-5.4's stronger structured_output and higher AIME score reduce post-processing fixes.
- Mixed-modality tasks (images or diagrams): both models support image-to-text; GPT-5.4 also accepts files, which may help when problems are provided as PDFs.
Bottom Line
For Math, choose Claude Sonnet 4.6 if you need superior tool orchestration and exploratory solution generation (tool_calling 5, creative_problem_solving 5). Choose GPT-5.4 if you need contest-level accuracy and strict output formatting: it scores 95.3% on AIME 2025 (Epoch AI) vs Sonnet's 85.8%, and rates higher on structured output (5 vs 4) in our tests.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
For math tasks, we supplement our benchmark suite with MATH/AIME scores from Epoch AI, an independent research organization.