Claude Sonnet 4.6 vs GPT-5.4 for Coding
Winner: GPT-5.4. On the primary external measure for Coding (SWE-bench Verified, via Epoch AI), GPT-5.4 scores 76.9% vs Claude Sonnet 4.6's 75.2%, a 1.7-point lead. That margin is small, so the race is close: in our internal tests Sonnet 4.6 outperforms GPT-5.4 on tool calling (5 vs 4) and creative problem solving (5 vs 4), while GPT-5.4 leads on structured output (5 vs 4) and constrained rewriting (4 vs 3). Treat the external SWE-bench result as the primary signal for correctness, and use the internal scores to decide on workflow tradeoffs.
Claude Sonnet 4.6 (Anthropic)
Pricing: input $3.00/MTok, output $15.00/MTok

GPT-5.4 (OpenAI)
Pricing: input $2.50/MTok, output $15.00/MTok
Task Analysis
What Coding demands: code correctness, exact schema/format output (JSON or test harnesses), reliable tool calling (tests, linters, repo actions), long-context reasoning across multi-file codebases, faithfulness to source code, and the ability to compress or rewrite code under constraints. Primary benchmark evidence: on SWE-bench Verified (via Epoch AI), the authoritative external test in our data, GPT-5.4 scores 76.9% vs Claude Sonnet 4.6's 75.2%. That 1.7-point external gap is small but decides the overall call. Supporting internal signals (our 1–5 proxies): structured output (JSON/schema) favors GPT-5.4 (5 vs Sonnet's 4); tool calling (function selection, argument accuracy, sequencing) favors Sonnet (5 vs GPT-5.4's 4); constrained rewriting favors GPT-5.4 (4 vs Sonnet's 3); long context and faithfulness tie at 5 for both. Also note context windows and costs: Sonnet 4.6 has a 1,000,000-token context window at $3.00/MTok input; GPT-5.4 has a 1,050,000-token window at $2.50/MTok input; both charge $15.00/MTok for output. These concrete scores explain why GPT-5.4 narrowly wins overall while Sonnet remains preferable for some engineering workflows.
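The per-MTok pricing above translates directly into per-request dollar costs. A minimal Python sketch, using the listed prices; the token counts in the example are hypothetical round numbers for illustration:

```python
# Per-million-token (MTok) prices as listed in this comparison, in dollars.
PRICES = {
    "Claude Sonnet 4.6": {"input": 3.00, "output": 15.00},
    "GPT-5.4":           {"input": 2.50, "output": 15.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request: (tokens / 1e6) * price-per-MTok, summed."""
    p = PRICES[model]
    return (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]

# Hypothetical workload: send an 800k-token repo slice, get a 20k-token patch.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 800_000, 20_000):.2f}")
# Claude Sonnet 4.6: $2.70
# GPT-5.4: $2.30
```

At this input-heavy shape, GPT-5.4's lower input price saves about 15% per request; for output-heavy workloads the gap disappears, since output prices match.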
Practical Examples
1. Single-function correctness and test-passing code generation: GPT-5.4 wins on SWE-bench Verified (76.9% vs 75.2%) and has the higher structured-output score (5 vs 4), making it better for exact JSON responses, unit-test-ready snippets, and API payloads.
2. Multi-file refactors with automated tool steps (run tests, apply patch, open PR): Claude Sonnet 4.6 shines in our testing at tool calling (5 vs GPT-5.4's 4), with agentic planning tied (both 5), making it stronger at accurate function selection and argument sequencing in tool-driven workflows.
3. Tight-character or minified solutions and hard compression tasks: GPT-5.4 leads on constrained rewriting (4 vs 3), so it handles compact rewrites and concise submissions better.
4. Very long-context codebases and large patch generation: both models score 5 on long context and offer 1M+ token windows (Sonnet 4.6: 1,000,000; GPT-5.4: 1,050,000), so either suits large-repo tasks.
5. Cost-sensitive batch input (many files sent as context): GPT-5.4 has the lower input price ($2.50 vs Sonnet's $3.00 per MTok) while output prices match ($15.00/MTok), which matters for large-context workflows.
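To make the structured-output criterion concrete, here is a minimal sketch of the kind of strict check an "exact JSON response" task implies. The required fields and the sample replies are hypothetical illustrations, not taken from either model's API:

```python
import json

# Hypothetical required shape for a model's JSON reply in a coding task:
# {"file": str, "patch": str, "tests_pass": bool}
REQUIRED = {"file": str, "patch": str, "tests_pass": bool}

def is_exact_json(reply: str) -> bool:
    """True only if the reply is pure JSON with exactly the required keys
    and value types -- the strictness a high structured-output score implies."""
    try:
        obj = json.loads(reply)
    except json.JSONDecodeError:
        return False  # e.g. prose wrapped around the JSON fails the parse
    if not isinstance(obj, dict) or set(obj) != set(REQUIRED):
        return False  # missing or extra keys fail
    return all(isinstance(obj[key], typ) for key, typ in REQUIRED.items())

print(is_exact_json('{"file": "a.py", "patch": "fix", "tests_pass": true}'))  # True
print(is_exact_json('Sure! {"file": "a.py"}'))                                # False
```

A grader this strict rejects any preamble, trailing commentary, or schema drift, which is why a one-point gap on the structured-output proxy can matter for pipelines that parse model replies mechanically.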
Bottom Line
For Coding, choose Claude Sonnet 4.6 if you prioritize tool calling, iterative multi-step dev workflows, and integration with agentic sequences (tool calling: 5 in our tests). Choose GPT-5.4 if you want the slightly higher external correctness on SWE-bench Verified (76.9% vs 75.2%, via Epoch AI), stronger structured-output/schema fidelity (5 vs 4), better constrained rewriting (4 vs 3), and a marginally lower input price ($2.50 vs $3.00 per MTok).
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
For coding tasks, we supplement our benchmark suite with SWE-bench scores from Epoch AI, an independent research organization.