Claude Sonnet 4.6 vs GPT-5.4 for Coding
Winner: GPT-5.4. On the primary external measure for Coding (SWE-bench Verified, via Epoch AI), GPT-5.4 scores 76.9% vs Claude Sonnet 4.6's 75.2%, a 1.7-point lead. That margin is small, so the race is close: in our internal tests Sonnet 4.6 outperforms GPT-5.4 on tool calling (5 vs 4) and creative problem solving (5 vs 4), while GPT-5.4 leads on structured output (5 vs 4) and constrained rewriting (4 vs 3). Treat the external SWE-bench result as the primary signal for correctness, and use the internal scores to decide on workflow tradeoffs.
Claude Sonnet 4.6 (Anthropic)
Pricing: input $3.00/MTok, output $15.00/MTok

GPT-5.4 (OpenAI)
Pricing: input $2.50/MTok, output $15.00/MTok
Task Analysis
What Coding demands: code correctness, exact schema/format output (JSON or test harnesses), reliable tool calling (tests, linters, repo actions), long-context reasoning across multi-file codebases, faithfulness to source code, and the ability to compress or rewrite code under constraints. Primary benchmark evidence: on SWE-bench Verified (via Epoch AI), the authoritative external test in our data, GPT-5.4 scores 76.9% vs Claude Sonnet 4.6's 75.2%. That 1.7-point external gap is small but decides the overall call. Supporting internal signals (our 1–5 proxies): structured output (JSON/schema) favors GPT-5.4 (5 vs Sonnet's 4); tool calling (function selection, argument accuracy, sequencing) favors Sonnet (5 vs GPT-5.4's 4); constrained rewriting favors GPT-5.4 (4 vs Sonnet's 3); long context and faithfulness tie at 5 for both. Also note context windows and costs: Sonnet 4.6 has a 1,000,000-token context window at $3.00/MTok input; GPT-5.4 has a 1,050,000-token window at $2.50/MTok input; both charge $15.00/MTok for output. These concrete scores explain why GPT-5.4 narrowly wins overall while Sonnet remains preferable for some engineering workflows.
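The per-MTok pricing above translates directly into per-request dollar costs. A minimal Python sketch, using the listed prices; the token counts in the example are hypothetical round numbers for illustration:

```python
# Per-million-token (MTok) prices as listed in this comparison, in dollars.
PRICES = {
    "Claude Sonnet 4.6": {"input": 3.00, "output": 15.00},
    "GPT-5.4":           {"input": 2.50, "output": 15.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request: (tokens / 1e6) * price-per-MTok, summed."""
    p = PRICES[model]
    return (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]

# Hypothetical workload: send an 800k-token repo slice, get a 20k-token patch.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 800_000, 20_000):.2f}")
# Claude Sonnet 4.6: $2.70
# GPT-5.4: $2.30
```

At this input-heavy shape, GPT-5.4's lower input price saves about 15% per request; for output-heavy workloads the gap disappears, since output prices match.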
Practical Examples
1. Single-function correctness and test-passing code generation: GPT-5.4 wins on SWE-bench Verified (76.9% vs 75.2%) and has the higher structured-output score (5 vs 4), making it better for exact JSON responses, unit-test-ready snippets, and API payloads.
2. Multi-file refactors with automated tool steps (run tests, apply patch, open PR): Claude Sonnet 4.6 shines in our testing at tool calling (5 vs GPT-5.4's 4), with agentic planning tied (both 5), making it stronger at accurate function selection and argument sequencing in tool-driven workflows.
3. Tight-character or minified solutions and hard compression tasks: GPT-5.4 leads on constrained rewriting (4 vs 3), so it handles compact rewrites and concise submissions better.
4. Very long-context codebases and large patch generation: both models score 5 on long context and offer 1M+ token windows (Sonnet 4.6: 1,000,000; GPT-5.4: 1,050,000), so either suits large-repo tasks.
5. Cost-sensitive batch input (many files sent as context): GPT-5.4 has the lower input price ($2.50 vs Sonnet's $3.00 per MTok) while output prices match ($15.00/MTok), which matters for large-context workflows.
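To make the structured-output criterion concrete, here is a minimal sketch of the kind of strict check an "exact JSON response" task implies. The required fields and the sample replies are hypothetical illustrations, not taken from either model's API:

```python
import json

# Hypothetical required shape for a model's JSON reply in a coding task:
# {"file": str, "patch": str, "tests_pass": bool}
REQUIRED = {"file": str, "patch": str, "tests_pass": bool}

def is_exact_json(reply: str) -> bool:
    """True only if the reply is pure JSON with exactly the required keys
    and value types -- the strictness a high structured-output score implies."""
    try:
        obj = json.loads(reply)
    except json.JSONDecodeError:
        return False  # e.g. prose wrapped around the JSON fails the parse
    if not isinstance(obj, dict) or set(obj) != set(REQUIRED):
        return False  # missing or extra keys fail
    return all(isinstance(obj[key], typ) for key, typ in REQUIRED.items())

print(is_exact_json('{"file": "a.py", "patch": "fix", "tests_pass": true}'))  # True
print(is_exact_json('Sure! {"file": "a.py"}'))                                # False
```

A grader this strict rejects any preamble, trailing commentary, or schema drift, which is why a one-point gap on the structured-output proxy can matter for pipelines that parse model replies mechanically.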
Bottom Line
For Coding, choose Claude Sonnet 4.6 if you prioritize tool calling, iterative multi-step dev workflows, and integration with agentic sequences (tool calling: 5 in our tests). Choose GPT-5.4 if you want the slightly higher external correctness on SWE-bench Verified (76.9% vs 75.2%, via Epoch AI), stronger structured-output/schema fidelity (5 vs 4), better constrained rewriting (4 vs 3), and a marginally lower input price ($2.50 vs $3.00 per MTok).
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
For coding tasks, we supplement our benchmark suite with SWE-bench scores from Epoch AI, an independent research organization.