Claude Sonnet 4.6 vs GPT-5 for Coding

Claude Sonnet 4.6 wins for Coding, but the margin is narrow. On SWE-bench Verified (Epoch AI) — the authoritative benchmark for real-world software engineering tasks — Sonnet 4.6 scores 75.2% versus GPT-5's 73.6%, a 1.6-point gap. That puts Sonnet 4.6 at rank 4 of 12 evaluated models and GPT-5 at rank 6. The gap is real but not dramatic; both models are strong, and both sit above the field median of 70.8% on SWE-bench Verified.

What tips the balance more decisively is the supporting infrastructure: in our testing, Sonnet 4.6 scores 5/5 on both tool calling and agentic planning, compared to GPT-5's 3/5 on both. For coding workflows that involve agents, IDEs, or multi-step task execution, Sonnet 4.6's advantage compounds well beyond the headline SWE-bench gap.

GPT-5 does outperform on math, scoring 98.1% on MATH Level 5 and 91.4% on AIME 2025 versus Sonnet 4.6's 85.8% on AIME 2025 (MATH Level 5 was not evaluated for Sonnet 4.6 in our data). If your coding work is heavily algorithmic or mathematical, that's a meaningful counterpoint. Sonnet 4.6 is also meaningfully more expensive at $15/MTok output versus GPT-5's $10/MTok, so the coding edge comes at a 50% output cost premium.

Anthropic

Claude Sonnet 4.6

Overall
4.67/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
5/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
5/5

External Benchmarks

SWE-bench Verified
75.2%
MATH Level 5
N/A
AIME 2025
85.8%

Pricing

Input

$3.00/MTok

Output

$15.00/MTok

Context Window: 1000K

modelpicker.net

OpenAI

GPT-5

Overall
4.50/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
5/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
73.6%
MATH Level 5
98.1%
AIME 2025
91.4%

Pricing

Input

$1.25/MTok

Output

$10.00/MTok

Context Window: 400K


Task Analysis

Coding demands several distinct capabilities from an LLM: generating correct, idiomatic code; debugging by tracing logic across files; reviewing code for security and style issues; and — increasingly — operating as an autonomous agent inside a codebase. The most direct measure of real-world software engineering performance is SWE-bench Verified (Epoch AI), which tests models on actual GitHub issues requiring code changes across real repositories. On that benchmark, Sonnet 4.6 scores 75.2% and GPT-5 scores 73.6% in our testing — both above the median of 70.8% across the 12 models evaluated.

Our internal proxy scores provide context for why Sonnet 4.6 pulls ahead. Tool calling — which underlies any agentic coding workflow, from running tests to calling linters to navigating file systems — scores 5/5 for Sonnet 4.6 versus 3/5 for GPT-5 in our testing, with Sonnet 4.6 tied for 1st among 53 models and GPT-5 ranked 19th. Agentic planning, which governs how a model decomposes a multi-file refactor or end-to-end feature build, follows the same pattern: 5/5 for Sonnet 4.6, 3/5 for GPT-5. Structured output — critical for code review pipelines, CI integrations, and JSON-based tool interfaces — scores 4/5 for Sonnet 4.6 versus 2/5 for GPT-5 (rank 45 of 53 for GPT-5 on that dimension).

GPT-5's strength is on the mathematical end of coding. Its 98.1% on MATH Level 5 (rank 1 of 14 evaluated) and 91.4% on AIME 2025 (rank 6 of 23) point to superior symbolic and numerical reasoning — relevant for algorithm design, competitive programming, and numerical methods work. GPT-5 also uses reasoning tokens, which can help on harder multi-step problems.

Sonnet 4.6 carries a 1M-token context window versus GPT-5's 400K, and scores 5/5 on long-context retrieval in our testing. For large codebase navigation — reading an entire monorepo or scanning across many files — that's a practical ceiling difference.

Practical Examples

Autonomous agent fixing a GitHub issue: This is exactly what SWE-bench Verified measures. Sonnet 4.6's 75.2% vs GPT-5's 73.6% means Sonnet 4.6 successfully resolves more real repository issues end-to-end. The tool calling gap (5/5 vs 3/5 in our testing) amplifies this: Sonnet 4.6 is more reliable at selecting the right functions, sequencing tool calls correctly, and recovering from failures — all critical when an agent needs to run tests, read files, and apply patches in sequence.
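The loop this workflow relies on (pick a tool, run it, notice failures, keep going) can be sketched in a few lines. Everything below is a hypothetical illustration: the tool names, the scripted "decisions" standing in for model output, and the error handling are our own assumptions, not any vendor's agent API.

```python
# Minimal sketch of the tool-call loop an SWE-bench-style agent runs.
# The tools dict and the scripted decisions are hypothetical stand-ins
# for real model output; the LLM call itself is out of scope here.

def run_agent(decisions, tools, max_steps=10):
    """Execute a sequence of (tool_name, args) calls, logging failures."""
    log = []
    for name, args in decisions[:max_steps]:
        try:
            result = tools[name](**args)
        except Exception as exc:
            # A reliable agent notices the failure and continues;
            # a real one would re-plan from the error message.
            log.append((name, f"error: {exc}"))
            continue
        log.append((name, result))
    return log

# Hypothetical tools a coding agent might expose.
tools = {
    "read_file": lambda path: f"<contents of {path}>",
    "run_tests": lambda: "2 passed, 0 failed",
}

decisions = [
    ("read_file", {"path": "src/app.py"}),
    ("run_tests", {}),
]
print(run_agent(decisions, tools))
```

The benchmark gap shows up in how often a model emits the right `(name, args)` pairs in the right order; the harness itself is the easy part.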

Structured code review pipeline: A CI/CD system that calls an LLM to output JSON-structured review comments (severity, file, line, recommendation) will hit GPT-5's structured output weakness more directly. GPT-5 scores 2/5 on structured output in our testing (rank 45 of 53), while Sonnet 4.6 scores 4/5. Malformed outputs from GPT-5 in this context mean broken pipelines or silent failures.
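A defensive consumer for such a pipeline might look like the sketch below. The field names (severity, file, line, recommendation) come from the example above; the validator itself is a minimal sketch of our own, not any CI product's API, and it fails loudly rather than silently when the model's JSON is malformed.

```python
import json

# Hypothetical schema for one review comment (our own field names).
REQUIRED = {"severity": str, "file": str, "line": int, "recommendation": str}

def validate_review(raw):
    """Parse model output; raise instead of silently dropping bad JSON."""
    try:
        comments = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model returned malformed JSON: {exc}") from exc
    for c in comments:
        for field, typ in REQUIRED.items():
            if not isinstance(c.get(field), typ):
                raise ValueError(f"bad or missing field {field!r}: {c}")
    return comments

good = ('[{"severity": "high", "file": "auth.py", "line": 42, '
        '"recommendation": "hash the token"}]')
print(validate_review(good)[0]["severity"])  # high
```

A wrapper like this turns a structured-output weakness into visible retries instead of broken downstream jobs.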

Large codebase refactor: Sonnet 4.6's 1M-token context window versus GPT-5's 400K means you can load more of a codebase into a single context. Sonnet 4.6 also scores 5/5 on long-context retrieval in our testing. For a large-scale migration or dependency audit, GPT-5 may require chunking that Sonnet 4.6 handles in one pass.
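The chunking a smaller window forces can be sketched as below, assuming a crude four-characters-per-token estimate (real tokenizers differ, and real pipelines would use the provider's token counter).

```python
# Rough sketch: greedily pack source files into context-sized chunks.

def chunk_files(files, window_tokens):
    """Group (path, text) pairs into chunks under a token budget."""
    chunks, current, used = [], [], 0
    for path, text in files:
        est = len(text) // 4 + 1  # crude chars/4 token estimate
        if current and used + est > window_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(path)
        used += est
    if current:
        chunks.append(current)
    return chunks

files = [("a.py", "x" * 4000), ("b.py", "y" * 4000), ("c.py", "z" * 4000)]
# A large window fits everything in one pass; a tiny budget forces splits,
# and every split is a chance to lose cross-file context.
print(chunk_files(files, 400_000))  # [['a.py', 'b.py', 'c.py']]
print(chunk_files(files, 1_500))    # [['a.py'], ['b.py'], ['c.py']]
```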

Algorithm design for a math-heavy problem: GPT-5's AIME 2025 score of 91.4% versus Sonnet 4.6's 85.8% represents a 5.6-point gap on hard competition math. For implementing numerical algorithms, cryptographic primitives, or performance-critical computational geometry, GPT-5's stronger mathematical reasoning is the relevant edge.

Cost-sensitive high-volume code generation: GPT-5 costs $10/MTok on output versus Sonnet 4.6's $15/MTok — a 50% premium for Sonnet 4.6. At scale, generating boilerplate, docstrings, or unit tests across a large codebase, GPT-5's lower price narrows the ROI case for Sonnet 4.6's modest SWE-bench lead.
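The arithmetic is simple enough to sketch. The prices are the listed output rates; the 50M-token monthly volume is a hypothetical illustration, not a measured workload.

```python
# Back-of-envelope output-cost comparison at the listed rates.
SONNET_OUT = 15.00  # $/MTok output, Claude Sonnet 4.6
GPT5_OUT = 10.00    # $/MTok output, GPT-5

def output_cost(millions_of_tokens, price_per_mtok):
    """Dollar cost of generating the given output volume."""
    return millions_of_tokens * price_per_mtok

# e.g. a hypothetical 50M output tokens of boilerplate/tests per month:
tokens_m = 50
print(output_cost(tokens_m, SONNET_OUT))  # 750.0
print(output_cost(tokens_m, GPT5_OUT))    # 500.0, i.e. Sonnet costs 50% more
```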

Bottom Line

For Coding, choose Claude Sonnet 4.6 if you're building or using agentic coding tools, need reliable tool calling and multi-step task execution, work with large codebases that benefit from a 1M-token context window, or depend on structured output in code review pipelines — and the $15/MTok output cost is acceptable for the quality ceiling. Choose GPT-5 if your coding work is math-heavy (algorithms, numerical methods, competitive programming), you're operating at a cost scale where the $10/MTok output price matters, or the 1.6-point SWE-bench gap doesn't justify a 50% output cost increase for your specific use case.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

For coding tasks, we supplement our benchmark suite with SWE-bench scores from Epoch AI, an independent research organization.

Frequently Asked Questions