R1 0528 vs GPT-5.4 for Coding

GPT-5.4 is the winner for Coding. On the external SWE-bench Verified benchmark (Epoch AI), GPT-5.4 scores 76.9%, while R1 0528 has no SWE-bench score available; that external result is the primary signal for Coding. Our internal tests support the verdict: GPT-5.4 scores 5/5 on structured_output and 4/5 on tool_calling, while R1 0528, despite 5/5 on tool_calling, has a documented quirk (empty_on_structured_output) that produces empty responses on structured-output tasks and yields a score of 0 on our Coding task. Context windows and costs also matter: GPT-5.4 has a much larger context window (1,050,000 tokens) but higher prices (input $2.50/MTok, output $15.00/MTok). R1 0528 is far cheaper (input $0.50/MTok, output $2.15/MTok), but its empty structured-output responses make it unsuitable for strict code-generation pipelines that require machine-readable output. Choose GPT-5.4 when correctness, strict JSON/schema compliance, and third-party benchmark performance matter; consider R1 0528 only when cost and interactive tool calling are the priority and you can avoid strict structured-output tasks.

deepseek

R1 0528

Overall: 4.50/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 4/5
Strategic Analysis: 4/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: 96.6%
AIME 2025: 66.4%

Pricing

Input: $0.50/MTok
Output: $2.15/MTok

Context Window: 164K tokens

modelpicker.net

openai

GPT-5.4

Overall: 4.58/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 3/5
Agentic Planning: 5/5
Structured Output: 5/5
Safety Calibration: 5/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: 76.9%
MATH Level 5: N/A
AIME 2025: 95.3%

Pricing

Input: $2.50/MTok
Output: $15.00/MTok

Context Window: 1050K tokens


Task Analysis

What Coding demands: code generation, debugging, and code review require (1) strict structured_output (JSON or schema-compliant diffs) for automation, (2) accurate tool_calling (function selection and argument correctness) for linters, formatters, and CI, (3) long_context to ingest large repos, (4) faithfulness and strategic_analysis for correct fixes and trade-offs, and (5) safety_calibration to avoid producing insecure code. The primary external benchmark is SWE-bench Verified (Epoch AI), on which GPT-5.4 scores 76.9%; it is the authoritative indicator for software-engineering tasks here. Our internal proxies explain why: GPT-5.4 scores 5/5 on structured_output, strategic_analysis, and safety_calibration in our testing, which aligns with the SWE-bench result. R1 0528 scores 5/5 on tool_calling and long_context, but it has a critical quirk, empty_on_structured_output, plus a requirement for large completion-token budgets (min_max_completion_tokens), which breaks tasks that need immediate machine-readable output. Do not conflate internal 1–5 scores with external percentages; SWE-bench is the primary external signal for Coding.
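The tool_calling requirement above can be illustrated with a minimal dispatch guard: a pipeline accepts a model-issued function call only if the name is registered and every argument has the declared type. The tool registry and tool names below are hypothetical stand-ins, not part of either model's API.

```python
# Minimal sketch of strict tool-call validation for a coding pipeline.
# The registry below is hypothetical; a real pipeline would map entries
# to actual linters, formatters, or test runners.

TOOLS = {
    "run_linter": {"path": str, "fix": bool},
    "run_tests": {"path": str},
}

def validate_tool_call(name, args):
    """Accept a model-issued tool call only if the tool is registered
    and every argument matches the declared type."""
    if name not in TOOLS:
        return False, f"unknown tool: {name}"
    schema = TOOLS[name]
    if set(args) != set(schema):
        return False, f"argument mismatch for {name}"
    for key, expected in schema.items():
        if not isinstance(args[key], expected):
            return False, f"bad type for {key}"
    return True, "ok"

print(validate_tool_call("run_linter", {"path": "src/app.py", "fix": True}))
print(validate_tool_call("run_linter", {"path": "src/app.py"}))  # rejected
```

A gate like this is what makes a 5/5 tool_calling score actionable: malformed calls fail fast instead of reaching the CI runner.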

Practical Examples

Where GPT-5.4 shines (grounded in scores):

  • CI automation: You need strict JSON diffs or schema-compliant code change descriptions for automated pipelines. GPT-5.4 scores 5/5 on structured_output in our testing and 76.9% on SWE-bench Verified (Epoch AI), making it the safer pick for machine-parseable outputs.
  • Large-repo code review: You must process many files or long contexts. GPT-5.4's context window is 1,050,000 tokens, and it scored 5/5 on strategic_analysis and safety_calibration in our tests, which helps with nuanced trade-offs and secure fixes.
  • Benchmark-sensitive selection: If external benchmark ranking matters (SWE-bench), GPT-5.4 is the clear choice (76.9%).
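As a rough way to reason about the large-repo bullet above, a repo's token footprint can be estimated before choosing a model. The 4-characters-per-token ratio is a common heuristic, not a real tokenizer count, and the file sizes are illustrative.

```python
# Sketch: estimate whether a set of source files fits a context window,
# using the rough ~4 characters-per-token heuristic (an approximation,
# not an actual tokenizer).

GPT_5_4_WINDOW = 1_050_000   # tokens, from the spec card above
R1_0528_WINDOW = 163_840

def estimated_tokens(char_counts):
    return sum(char_counts) // 4

def fits(char_counts, window, reserve=8_000):
    """Leave `reserve` tokens for the prompt and the model's reply."""
    return estimated_tokens(char_counts) <= window - reserve

repo = [1_200_000, 800_000, 600_000]   # file sizes in characters
print(fits(repo, GPT_5_4_WINDOW))      # True  (~650k estimated tokens)
print(fits(repo, R1_0528_WINDOW))      # False
```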

Where R1 0528 shines (grounded in scores & quirks):

  • Cost-sensitive pair-programming and iterative tool-driven workflows: R1 0528's costs are far lower (input $0.50/MTok, output $2.15/MTok) and it scores 5/5 on tool_calling in our testing. Good when you call linters, test runners, or custom tools frequently and can tolerate non-strict outputs.
  • Long-context interactive sessions where you don't need strict JSON outputs: R1's context window is 163,840 tokens and it scores 5/5 on long_context and faithfulness in our tests. Useful for exploratory debugging or long conversations where machine-readable schema output is not required.
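To make the cost comparison concrete, per-request cost at the listed prices can be computed directly; the token counts in the example are illustrative.

```python
# Sketch: per-request cost at the per-million-token prices listed in
# the spec cards above. Token counts are illustrative.

PRICES = {                       # (input $/MTok, output $/MTok)
    "R1 0528": (0.50, 2.15),
    "GPT-5.4": (2.50, 15.00),
}

def request_cost(model, input_tokens, output_tokens):
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# A typical review request: 50k tokens of code in, 2k tokens of reply out.
for model in PRICES:
    print(model, round(request_cost(model, 50_000, 2_000), 4))
# R1 0528 -> $0.0293 per request, GPT-5.4 -> $0.155 per request
```

At these illustrative volumes R1 0528 is roughly 5x cheaper per request, which is the arithmetic behind the cost-sensitive recommendation.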

Concrete failure mode to expect with R1 0528: In our testing, R1 returned empty responses on structured-output tasks (the empty_on_structured_output quirk), which yields a score of 0 on Coding workflows that require schema compliance. This makes it unreliable for CI steps that consume LLM outputs programmatically.
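A pipeline consuming structured output can defend against this failure mode explicitly: treat an empty or malformed response as a hard failure rather than passing it downstream. The required keys below are an illustrative diff schema, not either model's actual output format.

```python
import json

# Sketch: guard a CI step against empty or malformed structured output.
# REQUIRED_KEYS is an illustrative schema for a code-change object.

REQUIRED_KEYS = {"file", "patch"}

def parse_structured_output(raw):
    """Return the parsed object, or raise with a clear reason so the CI
    step fails loudly instead of consuming garbage."""
    if not raw or not raw.strip():
        raise ValueError("empty response (empty_on_structured_output)")
    obj = json.loads(raw)            # raises on non-JSON output
    missing = REQUIRED_KEYS - obj.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return obj

good = parse_structured_output('{"file": "a.py", "patch": "--- a.py"}')
print(good["file"])  # a.py
```

With a guard like this, the quirk surfaces as an explicit pipeline failure you can retry or route around, instead of a silent bad merge.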

Bottom Line

For Coding, choose R1 0528 if you need a low-cost model (input $0.50/MTok, output $2.15/MTok), rely heavily on interactive tool calling, and can either avoid strict structured JSON outputs or post-process freeform responses. Choose GPT-5.4 if you need benchmarked correctness on software tasks (76.9% on SWE-bench Verified, per Epoch AI), reliable structured output (5/5 in our tests), large-context review (1,050,000-token window), and can tolerate higher costs (input $2.50/MTok, output $15.00/MTok).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
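For what it's worth, the 4.50/5 and 4.58/5 overall figures in the cards above are consistent with a simple unweighted mean of the twelve 1–5 benchmark scores; whether that is the actual aggregation formula is an assumption.

```python
# Sketch: the overall scores above match a plain mean of the twelve
# internal benchmark scores. That the site uses an unweighted mean is
# an assumption, not a documented formula.

r1_scores  = [5, 5, 5, 5, 4, 5, 4, 4, 4, 5, 4, 4]  # from the R1 0528 card
gpt_scores = [5, 5, 5, 4, 3, 5, 5, 5, 5, 5, 4, 4]  # from the GPT-5.4 card

def overall(scores):
    return round(sum(scores) / len(scores), 2)

print(overall(r1_scores))   # 4.5
print(overall(gpt_scores))  # 4.58
```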

For coding tasks, we supplement our benchmark suite with SWE-bench scores from Epoch AI, an independent research organization.

Frequently Asked Questions