R1 0528 vs GPT-5.4 for Creative Problem Solving

Winner: GPT-5.4. Both models score 4/5 on Creative Problem Solving in our 12-test suite and share rank 9 of 52, but GPT-5.4 wins on practical grounds: it scores higher on structured_output (5 vs 4), strategic_analysis (5 vs 4), and safety_calibration (5 vs 4), and it avoids R1 0528's reported quirks (R1 can return empty responses on structured_output and agentic_planning and requires large min/max completion-token settings). Those differences make GPT-5.4 more reliable at producing non-obvious, specific, feasible ideas that must be delivered in precise formats or weighed against tradeoffs. R1 0528 remains a strong, much cheaper alternative when tool orchestration and low cost matter, but for a dependable Creative Problem Solving workflow that needs formatted plans, tradeoff reasoning, and stricter safety behavior, choose GPT-5.4.

                          R1 0528 (deepseek)   GPT-5.4 (openai)
Overall                   4.50/5 (Strong)      4.58/5 (Strong)

Benchmark Scores
Faithfulness              5/5                  5/5
Long Context              5/5                  5/5
Multilingual              5/5                  5/5
Tool Calling              5/5                  4/5
Classification            4/5                  3/5
Agentic Planning          5/5                  5/5
Structured Output         4/5                  5/5
Safety Calibration        4/5                  5/5
Strategic Analysis        4/5                  5/5
Persona Consistency       5/5                  5/5
Constrained Rewriting     4/5                  4/5
Creative Problem Solving  4/5                  4/5

External Benchmarks
SWE-bench Verified        N/A                  76.9%
MATH Level 5              96.6%                N/A
AIME 2025                 66.4%                95.3%

Pricing
Input                     $0.50/MTok           $2.50/MTok
Output                    $2.15/MTok           $15.00/MTok

Context Window            164K                 1,050K
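To make the pricing gap concrete, here is a minimal worked example in Python; the 2M-input/0.5M-output job size is an illustrative assumption, while the per-MTok rates come from the table above.

```python
# Worked cost comparison using the per-MTok rates above (USD).
def job_cost(input_mtok: float, output_mtok: float,
             input_rate: float, output_rate: float) -> float:
    """Total cost of a job given token volumes (in millions) and $/MTok rates."""
    return input_mtok * input_rate + output_mtok * output_rate

# Illustrative job: 2M input tokens, 0.5M output tokens.
r1_cost  = job_cost(2.0, 0.5, 0.50, 2.15)    # 1.00 + 1.075 ≈ $2.08
gpt_cost = job_cost(2.0, 0.5, 2.50, 15.00)   # 5.00 + 7.50  = $12.50
print(f"R1 0528: ${r1_cost:.2f}  GPT-5.4: ${gpt_cost:.2f}")  # roughly 6x cheaper on R1
```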

Task Analysis

What Creative Problem Solving requires: non-obvious, specific, feasible ideas plus clear tradeoffs, stepwise plans, and often precise formatted deliverables. Key capabilities: strategic_analysis to weigh options, structured_output for reproducible plans and prototypes, tool_calling and agentic_planning for actionable sequences, faithfulness to stick to constraints, safety_calibration to avoid risky suggestions, and long_context when solutions draw on broad context.

In our testing, both R1 0528 and GPT-5.4 score 4/5 on creative_problem_solving and tie at rank 9/52, but the supporting signals diverge. GPT-5.4 scores 5 on structured_output, strategic_analysis, and safety_calibration: strengths that directly reduce iteration when you need format-compliant, risk-aware designs. R1 0528 scores 5 on tool_calling and agentic_planning and is stronger on classification (4 vs 3), indicating it is effective at selecting and sequencing functions or routes to solutions. Note R1's documented quirks: it can return empty responses on structured_output and agentic_planning and requires high minimum completion tokens, which undermines its structured_output and tooling advantages in some short-turn workflows (a defensive calling pattern is sketched below).

External results add nuance: R1 scores 96.6% on MATH Level 5 and 66.4% on AIME 2025 (both Epoch AI); GPT-5.4 scores 76.9% on SWE-bench Verified and 95.3% on AIME 2025 (both Epoch AI). Those points show R1's edge on some high-level math tasks and GPT-5.4's edge on competitive math and code-resolution benchmarks, both relevant depending on the creative problem domain.
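A minimal sketch of one such defensive pattern, assuming an OpenAI-compatible chat endpoint; the base URL, model id, and token budget are illustrative assumptions rather than documented values:

```python
# Sketch: retry wrapper for models that can return empty responses on
# structured-output/agentic turns (assumes an OpenAI-compatible API).
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")  # placeholder endpoint/key

def robust_complete(prompt: str, retries: int = 3) -> str:
    """Call the model with a generous completion budget; retry if output is empty."""
    for _ in range(retries):
        resp = client.chat.completions.create(
            model="deepseek-reasoner",  # placeholder model id
            messages=[{"role": "user", "content": prompt}],
            max_tokens=8192,  # large completion budget, per the reported quirk
        )
        text = resp.choices[0].message.content or ""
        if text.strip():
            return text
    raise RuntimeError("empty response on every attempt")
```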

Practical Examples

When to pick GPT-5.4 (practical scenarios):

  • Generating a formatted product design spec or JSON-ready prototype where output must match a schema: structured_output 5 (GPT-5.4) vs 4 (R1 0528) reduces rework; a schema-validation sketch follows this list.
  • Exploring tradeoffs between cost, speed, and quality for a strategic plan: strategic_analysis 5 vs 4 favors GPT-5.4 for clear numerical tradeoffs.
  • Producing safe, policy-sensitive creative ideas (medical, legal, regulated): safety_calibration 5 vs 4 favors GPT-5.4.
  • Working with huge context or multimodal inputs (files/images): GPT-5.4's 1,050,000-token context window and multimodal support (text+image+file→text) aid complex, context-rich ideation.

When to pick R1 0528 (practical scenarios):

  • Cost-sensitive rapid brainstorming or large-scale automated tool invocation: R1's input/output costs are far lower (input $0.50/MTok, output $2.15/MTok) than GPT-5.4's (input $2.50/MTok, output $15.00/MTok), and R1 scores 5 on tool_calling.
  • Creative tasks that benefit from strong faithfulness and multilingual or persona stability: R1 scores 5 on faithfulness, persona_consistency, and multilingual.
  • Math-heavy problems in the MATH Level 5 style: R1 scores 96.6% on MATH Level 5 (Epoch AI), a demonstrable advantage for that subdomain.

Caveats: R1's reported quirk of returning empty responses on structured_output and agentic_planning can break workflows that rely on immediate JSON or stepwise plans unless you configure large completion-token budgets and avoid those structured output modes.
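A minimal sketch of that schema guard, assuming the deliverable is plain JSON; the schema fields and helper name are illustrative, not part of either model's API:

```python
# Sketch: reject a creative-design deliverable that drifts from the schema.
import json
from jsonschema import validate  # pip install jsonschema

SPEC_SCHEMA = {
    "type": "object",
    "properties": {
        "idea": {"type": "string"},
        "tradeoffs": {"type": "array", "items": {"type": "string"}},
        "steps": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["idea", "tradeoffs", "steps"],
}

def accept_spec(raw: str) -> dict:
    """Parse model output and enforce the schema; raises on empty or malformed output."""
    spec = json.loads(raw)  # json.JSONDecodeError on empty or non-JSON output
    validate(instance=spec, schema=SPEC_SCHEMA)  # jsonschema.ValidationError on drift
    return spec
```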

Bottom Line

For Creative Problem Solving, choose R1 0528 if you need a much cheaper model with top-tier tool_calling (5/5), strong faithfulness, and excellent multilingual/persona consistency, provided you can avoid structured_output and short-turn agentic workflows or accommodate R1's need for large completion-token budgets. Choose GPT-5.4 if you need reliably formatted plans, stronger tradeoff reasoning, and stricter safety behavior (structured_output 5 vs 4; strategic_analysis 5 vs 4; safety_calibration 5 vs 4), and you can accept the higher cost (input $2.50/MTok, output $15.00/MTok).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
