R1 0528 vs GPT-5.4 for Writing

Winner: GPT-5.4. Both models score 4/5 on our Writing task, but GPT-5.4 has the practical edge for content creation: it scores higher on structured_output (5 vs 4), safety_calibration (5 vs 4), and strategic_analysis (5 vs 4). Those advantages matter for repeatable templates, compliance, and nuanced editing. R1 0528 matches or exceeds GPT-5.4 on persona_consistency (5), faithfulness (5), long_context (5), and tool_calling (5 vs GPT-5.4's 4), and it is far cheaper (input $0.50 vs $2.50 per MTok; output $2.15 vs $15.00 per MTok). However, R1 0528 has an operational quirk: it can return empty responses on structured_output and constrained_rewriting tasks, which undermines reliability in many writing workflows. Given equal task scores but better reliability and structured-output behavior, GPT-5.4 is the safer, more predictable choice for Writing; pick R1 0528 when cost or tool-driven pipelines are the priority and you can avoid constrained or structured outputs.

deepseek

R1 0528

Overall
4.50/5 Strong

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
4/5
Strategic Analysis
4/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
96.6%
AIME 2025
66.4%

Pricing

Input

$0.50/MTok

Output

$2.15/MTok

Context Window 164K

modelpicker.net

openai

GPT-5.4

Overall
4.58/5 Strong

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
4/5
Classification
3/5
Agentic Planning
5/5
Structured Output
5/5
Safety Calibration
5/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
76.9%
MATH Level 5
N/A
AIME 2025
95.3%

Pricing

Input

$2.50/MTok

Output

$15.00/MTok

Context Window 1050K


Task Analysis

What Writing demands: blog posts, marketing copy, and content creation require creative_problem_solving (ideas and hooks), constrained_rewriting (headlines and summaries within tight limits), persona_consistency and tone control (brand voice), faithfulness (staying on brief), long_context handling (multi-section drafts and research), safety_calibration (avoiding harmful claims), and structured_output when delivering templates or JSON metadata. External benchmarks are not available for this task, so we rely on our internal metrics. On those proxies, both models score 4/5 on Writing and match on creative_problem_solving (4) and constrained_rewriting (4). GPT-5.4 pulls ahead on structured_output (5 vs 4) and safety_calibration (5 vs 4), which reduces retries and post-processing for formatted outputs and risky content. R1 0528 excels at tool_calling (5 vs GPT-5.4's 4) and matches GPT-5.4 on persona_consistency (5), faithfulness (5), multilingual (5), and long_context (5). Important operational factors: GPT-5.4 supports text, image, and file inputs, a 1,050,000-token context window, and an explicit maximum output of 128,000 tokens; R1 0528 supports a 163,840-token window but has documented quirks (it can return empty responses on structured_output and constrained_rewriting, its reasoning tokens consume the output budget, and it needs a high max_completion_tokens). These behavioral and cost differences determine which model is better for specific writing workflows.
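The routing implication above can be sketched as a small dispatcher. This is a hypothetical helper for illustration, not part of either vendor's API; the model names, quirked modes, and the cost-sensitivity assumption all come from the scores and caveats described on this page.

```python
# Hypothetical routing sketch based on the scores and quirks described above.
# Model identifiers are informal labels for this page's two models, not
# official API names.

# Modes where R1 0528 is documented to sometimes return empty responses.
R1_QUIRKED_MODES = {"structured_output", "constrained_rewriting"}

def pick_model(mode: str, cost_sensitive: bool = True) -> str:
    """Route a writing job to a model given the internal benchmark
    dimension it stresses, e.g. "structured_output" or "tool_calling".
    """
    if mode in R1_QUIRKED_MODES:
        # GPT-5.4 scores 5/5 on structured_output and avoids the
        # empty-response quirk, so it gets all template/JSON work.
        return "gpt-5.4"
    if cost_sensitive or mode == "tool_calling":
        # R1 0528 matches GPT-5.4 on the remaining writing proxies
        # and is far cheaper per token.
        return "r1-0528"
    return "gpt-5.4"
```

In practice you would also cap R1 0528's jobs with a generous max_completion_tokens, since its reasoning tokens consume the output budget.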

Practical Examples

Where GPT-5.4 shines: 1) Template-driven marketing: exporting campaign copy plus JSON metadata reliably; structured_output 5 vs R1's 4 means fewer format failures. 2) Compliance-sensitive content: safety_calibration 5 vs 4 means fewer unsafe or blocked outputs and less human review. 3) Long-form briefs with images and files: the 1,050,000-token context window and multimodal inputs let you iterate across long research and attachments without stitching. Where R1 0528 shines: 1) Cost-sensitive bulk copy: input $0.50 vs $2.50 and output $2.15 vs $15.00 per MTok make R1 far cheaper for high-volume generation. 2) Tool-integrated pipelines: tool_calling 5 vs GPT-5.4's 4; R1 was better at selecting and sequencing functions in our tests. 3) Brand voice and faithfulness: persona_consistency 5 and faithfulness 5 match GPT-5.4 while keeping costs low. Important caveat: R1 0528's quirk of returning empty responses on structured_output and constrained_rewriting can break headline compression and JSON template outputs unless you avoid those modes or allocate large completion budgets. Use GPT-5.4 when you need robust structured outputs and safety; use R1 0528 when tool integration and cost are the priority and you can avoid the quirked modes.
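As a back-of-the-envelope check on the bulk-copy claim, the per-job cost at this page's list prices works out as follows. The token counts in the example are illustrative, not measured.

```python
# Cost comparison at the list prices quoted on this page ($ per million tokens).
PRICES = {
    "r1-0528": {"input": 0.50, "output": 2.15},
    "gpt-5.4": {"input": 2.50, "output": 15.00},
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one generation job at list prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: 1,000 product blurbs, each ~500 input and ~400 output tokens
# (500K input + 400K output tokens total).
r1_cost = job_cost("r1-0528", 500_000, 400_000)   # ~$1.11
gpt_cost = job_cost("gpt-5.4", 500_000, 400_000)  # ~$7.25
```

At these prices the same batch costs roughly 6.5x more on GPT-5.4, which is the gap driving the cost-sensitive recommendation, before accounting for any retries R1 0528's quirked modes might require.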

Bottom Line

For Writing, choose R1 0528 if you need much lower cost (input $0.50 / output $2.15 per MTok) and stronger tool calling for integrated pipelines, and you can avoid structured-output or constrained-rewriting workflows. Choose GPT-5.4 if you need reliable structured outputs, stronger safety calibration, strategic analysis, multimodal context support, and fewer operational surprises; it's the safer pick for templates, compliance, and long, repeatable content workflows.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions