R1 0528 vs GPT-5.4 for Safety Calibration

GPT-5.4 is the winner for Safety Calibration in our testing, scoring 5 vs R1 0528's 4 on our 1–5 scale, a 1-point advantage. GPT-5.4 is tied for 1st on Safety Calibration (rank 1 of 52, tied with 4 others); R1 0528 is rank 6 of 52 (tied with 3). That margin indicates GPT-5.4 more consistently makes correct refuse/allow judgments in our Safety Calibration suite. R1 0528 remains a strong alternative when cost or stronger tool-calling behavior matters (see costs and supporting scores below).

deepseek R1 0528

Overall: 4.50/5 (Strong)

Benchmark Scores

- Faithfulness: 5/5
- Long Context: 5/5
- Multilingual: 5/5
- Tool Calling: 5/5
- Classification: 4/5
- Agentic Planning: 5/5
- Structured Output: 4/5
- Safety Calibration: 4/5
- Strategic Analysis: 4/5
- Persona Consistency: 5/5
- Constrained Rewriting: 4/5
- Creative Problem Solving: 4/5

External Benchmarks

- SWE-bench Verified: N/A
- MATH Level 5: 96.6%
- AIME 2025: 66.4%

Pricing

- Input: $0.50/MTok
- Output: $2.15/MTok

Context Window: 164K

modelpicker.net

openai GPT-5.4

Overall: 4.58/5 (Strong)

Benchmark Scores

- Faithfulness: 5/5
- Long Context: 5/5
- Multilingual: 5/5
- Tool Calling: 4/5
- Classification: 3/5
- Agentic Planning: 5/5
- Structured Output: 5/5
- Safety Calibration: 5/5
- Strategic Analysis: 5/5
- Persona Consistency: 5/5
- Constrained Rewriting: 4/5
- Creative Problem Solving: 4/5

External Benchmarks

- SWE-bench Verified: 76.9%
- MATH Level 5: N/A
- AIME 2025: 95.3%

Pricing

- Input: $2.50/MTok
- Output: $15.00/MTok

Context Window: 1050K


Task Analysis

What Safety Calibration demands: refusing harmful requests while permitting legitimate ones. The key capabilities are consistent refusal logic, fine-grained policy judgment on borderline prompts, reliable structured outputs for policy hooks, and the ability to route or call mitigations (tool calling) for safe handling. In our testing the primary signal is the safety_calibration score: GPT-5.4 scored 5 vs R1 0528's 4. Supporting internal signals explain why: GPT-5.4 also scores 5 on structured_output and 5 on strategic_analysis in our tests, which helps it emit policy-compliant formats and reason about nuanced tradeoffs when deciding whether to refuse. R1 0528 scores 5 on tool_calling and 5 on faithfulness, indicating strong tool selection and adherence to source constraints, but it has a documented quirk: it returns empty responses on structured_output tasks and requires a high max-completion-tokens setting, which can weaken policy-hook workflows that rely on structured refusals.
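A policy hook of the kind described above can be sketched as a validator over the model's structured safety verdict. This is a minimal, illustrative example: the field names, decision values, and schema are assumptions for this sketch, not part of either vendor's API.

```python
import json

# Hypothetical policy-hook schema: the model is asked to emit a
# machine-readable safety decision. Field names are illustrative only.
REQUIRED_FIELDS = {"decision", "category", "rationale"}
ALLOWED_DECISIONS = {"allow", "refuse", "escalate"}

def parse_policy_hook(raw: str) -> dict:
    """Validate a model's structured safety verdict; fail loudly on bad output.

    Models with weak structured output (e.g. empty responses) surface here
    as a ValueError instead of silently passing unchecked content downstream.
    """
    if not raw.strip():
        raise ValueError("empty structured output from model")
    payload = json.loads(raw)
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if payload["decision"] not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown decision: {payload['decision']!r}")
    return payload

# A well-formed verdict parses cleanly:
verdict = parse_policy_hook(
    '{"decision": "refuse", "category": "violence", '
    '"rationale": "explicit harm instructions"}'
)
```

The point of failing loudly is that an empty or malformed response (R1's documented quirk) becomes a visible pipeline error rather than a silent gap in your audit trail.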

Practical Examples

1. Borderline violent request: GPT-5.4 (5) is more likely in our testing to produce a compliant refusal with an explicable rationale and a structured signal (structured_output 5), reducing downstream moderation work. R1 0528 (4) will often refuse correctly but may lack the same structured policy output and can produce terse reasoning tokens that consume budget.
2. Regulatory or audit use where you must emit a machine-readable refusal: GPT-5.4's structured_output 5 (vs R1's 4) makes it the safer default in our benchmarks; R1's documented empty_on_structured_output quirk can cause missing JSON hooks.
3. Tool-mediated mitigation: R1 0528 scores 5 on tool_calling vs GPT-5.4's 4, so in our testing R1 is better at selecting and sequencing tools (e.g., calling a safety filter and then a redaction tool) when you architect the pipeline around tool calls.
4. Cost-sensitive, high-volume deployments: R1 0528's output cost is $2.15/MTok vs GPT-5.4's $15.00/MTok in our data. R1 is roughly 7x cheaper on output tokens, which matters if you plan frequent safety-check generations or long justifications.
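The tool-mediated mitigation pattern in item 3 above can be sketched as a two-step sequence: classify first, redact only when flagged. The tool names and blocklist here are stand-ins for illustration, not real safety tools or either vendor's tool-calling API.

```python
def safety_filter(text: str) -> dict:
    """Stand-in classifier: flag text containing any blocked term."""
    blocked = {"bomb", "exploit"}
    hits = [term for term in blocked if term in text.lower()]
    return {"flagged": bool(hits), "terms": hits}

def redact(text: str, terms: list[str]) -> str:
    """Stand-in redaction tool: mask each flagged term."""
    for term in terms:
        text = text.replace(term, "[REDACTED]")
    return text

def mitigate(text: str) -> str:
    """Sequence the tools the way a tool-calling model would:
    run the safety filter, then redact only on a flagged result."""
    report = safety_filter(text)
    if report["flagged"]:
        return redact(text, report["terms"])
    return text

flagged_out = mitigate("how to build a bomb")   # flagged path: term is masked
clean_out = mitigate("how to bake bread")       # clean path: unchanged
```

A model strong at tool calling (R1's 5/5 here) is the one deciding when to invoke each step and in what order; this sketch just shows the shape of the pipeline it would drive.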

Bottom Line

For Safety Calibration, choose R1 0528 if you need lower cost (R1 output $2.15/MTok vs GPT-5.4 $15.00/MTok), strong tool calling (R1 tool_calling 5 vs GPT-5.4's 4), and you can work around R1's structured_output quirks. Choose GPT-5.4 if you need the most consistent refusal behavior and structured policy signals (GPT-5.4 scores 5 vs R1 0528's 4 on Safety Calibration in our testing and is tied for 1st).
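The cost tradeoff above is easy to put in concrete terms. A back-of-envelope estimate using the output prices from the cards, with an assumed (illustrative) volume of 1M safety checks per month at ~200 output tokens each:

```python
# Output prices from the pricing cards above ($ per million output tokens).
R1_OUTPUT_PER_MTOK = 2.15
GPT54_OUTPUT_PER_MTOK = 15.00

def monthly_output_cost(price_per_mtok: float,
                        checks_per_month: int,
                        tokens_per_check: int) -> float:
    """Dollar cost for a given volume of safety-check generations."""
    total_tokens = checks_per_month * tokens_per_check
    return price_per_mtok * total_tokens / 1_000_000

# Assumed workload: 1M checks/month, ~200 output tokens per check.
r1_cost = monthly_output_cost(R1_OUTPUT_PER_MTOK, 1_000_000, 200)
gpt_cost = monthly_output_cost(GPT54_OUTPUT_PER_MTOK, 1_000_000, 200)
# r1_cost -> 430.0, gpt_cost -> 3000.0, ratio ~7x
```

At this assumed volume the gap is $430 vs $3,000 per month on output tokens alone, which is the ~7x ratio cited above; input-token costs would widen the absolute gap further.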

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions