R1 0528 vs GPT-5.4 for Classification
Winner: R1 0528. On our Classification task, R1 scores 4 vs GPT-5.4's 3 on the 1–5 scale, and R1 ranks 1/52 while GPT-5.4 ranks 31/52. No external benchmark is available for this task, so the winner call is based on our internal results. R1's advantages include a top classification score, best-in-class tool_calling (5), and strong multilingual and long-context handling (both 5). GPT-5.4 is stronger at structured_output (5 vs R1's 4) and safety_calibration (5 vs 4), so it is preferable when strict JSON/schema compliance or stricter refusal behavior is required. Note a known R1 quirk: it can return empty responses on structured_output tasks and requires a high max-completion-token setting; that can change the practical choice for schema-first workflows.
deepseek · R1 0528
Pricing: Input $0.500/MTok, Output $2.15/MTok
modelpicker.net

openai · GPT-5.4
Pricing: Input $2.50/MTok, Output $15.00/MTok
Task Analysis
What Classification demands: precise label selection, correct routing decisions, consistent structured outputs for downstream systems, strong tool selection when routing (function calls), and faithfulness to the input. Because no external benchmark covers this task, our internal classification score is the primary signal: R1 0528 scores 4 vs GPT-5.4's 3 on our 1–5 classification test, and R1 ranks 1/52 vs GPT-5.4's 31/52. Supporting internal metrics explain why: R1 has tool_calling 5 (better at selecting and sequencing functions), faithfulness 5, multilingual 5, and long_context 5, all valuable for classification across languages and long inputs. GPT-5.4's structured_output is 5 (better JSON/schema reliability) and its safety_calibration is 5 (better refusal/allow behavior). One crucial quirk: R1 lists empty_on_structured_output as true and needs a high max-completion-token setting, which undermines its structured-output reliability despite its high classification score. Also factor in cost and context window: R1's context window is 163,840 tokens while GPT-5.4's is 1,050,000 tokens, and the larger window helps when classifying massive documents.
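The empty-structured-output quirk above can be mitigated with a simple escalation loop that retries with a larger completion budget until parseable JSON comes back. A minimal sketch, assuming an OpenAI-compatible chat endpoint behind a hypothetical `call_model` helper (stubbed here so the example is self-contained; the simulated behavior and the 4096-token threshold are illustrative, not measured):

```python
import json

def call_model(prompt, max_completion_tokens):
    """Hypothetical stand-in for an OpenAI-compatible chat call.
    Returns the raw text of the reply; simulates R1 0528 returning
    an empty string when the completion budget is too low."""
    if max_completion_tokens < 4096:
        return ""  # simulated empty structured output
    return '{"label": "billing", "confidence": 0.93}'

def classify_with_retry(prompt, budgets=(1024, 4096, 8192)):
    """Escalate max completion tokens until we get parseable JSON."""
    for budget in budgets:
        raw = call_model(prompt, max_completion_tokens=budget)
        if not raw.strip():
            continue  # empty structured output: retry with a bigger budget
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: also retry
    raise RuntimeError("no valid structured output after all retries")

result = classify_with_retry("Classify this ticket: 'I was charged twice.'")
```

In production the retry loop would wrap a real client call; the point is that a schema-first pipeline around R1 needs this guard, whereas GPT-5.4's structured_output 5 makes it largely unnecessary.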
Practical Examples
High-volume routing (win for R1 0528): For enterprise routing that calls functions to forward tickets, R1's classification 4 and tool_calling 5 mean more accurate label-to-action mapping and lower per-call cost (input $0.50/MTok, output $2.15/MTok) versus GPT-5.4 (input $2.50/MTok, output $15.00/MTok).
Multilingual customer triage (win for R1 0528): R1's multilingual 5 and faithfulness 5 make it better at consistent labeling across languages.
Strict JSON schema labeling (win for GPT-5.4): If you need guaranteed JSON compliance and schema adherence, GPT-5.4's structured_output 5 and structured_output rank of 1 are superior, and R1 may return empty structured outputs despite its higher overall classification score.
Safety-sensitive routing (win for GPT-5.4): For content that requires tight refusal logic or safe routing, GPT-5.4's safety_calibration 5 is preferable to R1's 4.
Massive-document classification (win for GPT-5.4): When classifying across million-token contexts, GPT-5.4's 1,050,000-token window outpaces R1's 163,840 tokens.
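The label-to-action mapping behind the high-volume routing scenario can be sketched as a plain dispatch table that turns a predicted label into a function call. The labels and handler names below are hypothetical; a real deployment would route to ticketing-system APIs instead of returning strings:

```python
def forward_to_billing(ticket):
    # Placeholder for a real "forward to billing queue" function call.
    return f"billing <- {ticket}"

def forward_to_support(ticket):
    # Placeholder for the default human-facing support queue.
    return f"support <- {ticket}"

# Hypothetical label-to-handler table; each entry maps one
# classification label to the action the model's tool call triggers.
HANDLERS = {
    "billing": forward_to_billing,
    "technical": forward_to_support,
}

def route(ticket, predicted_label, fallback=forward_to_support):
    """Dispatch a ticket; unknown labels fall back to the support queue."""
    handler = HANDLERS.get(predicted_label, fallback)
    return handler(ticket)
```

This is where a tool_calling score of 5 pays off: the cheaper per-call pricing only helps if the model reliably picks the right handler, since every misrouted label turns into a wrong function call downstream.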
Bottom Line
For Classification, choose R1 0528 if you need the highest raw classification accuracy on our 1–5 test (4 vs 3), superior tool calling (5), multilingual and long-context labeling, and much lower input/output costs. Choose GPT-5.4 if you require strict structured_output/JSON compliance, stronger safety calibration, or classification across extremely large contexts where a 1,050,000-token window matters.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.