R1 0528 vs GPT-5.4 for Faithfulness

GPT-5.4 is the winner for Faithfulness in our testing. Both R1 0528 and GPT-5.4 score 5/5 on our faithfulness benchmark (tied for 1st), but GPT-5.4’s higher safety_calibration (5 vs 4) and structured_output score (5 vs 4), plus a much larger context_window (1,050,000 vs 163,840), give it the practical edge in producing non‑hallucinated, schema‑accurate outputs in safety‑sensitive or long‑document scenarios. R1 0528 remains a close runner‑up where cost and tool‑calling accuracy matter (tool_calling 5 vs 4), but for strict faithfulness guarantees we side with GPT-5.4.

deepseek

R1 0528

Overall
4.50/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
4/5
Strategic Analysis
4/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
96.6%
AIME 2025
66.4%

Pricing

Input

$0.500/MTok

Output

$2.15/MTok

Context Window
164K

modelpicker.net

openai

GPT-5.4

Overall
4.58/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
4/5
Classification
3/5
Agentic Planning
5/5
Structured Output
5/5
Safety Calibration
5/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
76.9%
MATH Level 5
N/A
AIME 2025
95.3%

Pricing

Input

$2.50/MTok

Output

$15.00/MTok

Context Window
1050K


Task Analysis

Faithfulness demands that an AI stick to source material without inventing facts, preserve exact structure when required, and refuse or qualify answers when information is missing. Capabilities that matter:

- Context window size: larger windows reduce dropped context and out‑of‑scope extrapolation (GPT-5.4: 1,050,000 tokens vs R1 0528: 163,840).
- Structured output compliance: JSON/schema fidelity prevents format‑induced errors (GPT-5.4 structured_output 5 vs R1 0528 4).
- Safety calibration: correct refusal and qualification prevent confident hallucinations in unsupported areas (GPT-5.4 safety_calibration 5 vs R1 0528 4).
- Tool calling and argument accuracy: correct function selection and precise arguments avoid downstream errors (R1 0528 tool_calling 5 vs GPT-5.4 4).
- Reasoning behavior and quirks: R1 0528 is a reasoning model whose reasoning tokens consume output budget, and it has a reported quirk of returning empty responses on structured_output tasks; both affect faithfulness in short or strictly formatted tasks.

All faithfulness claims above are drawn from our internal benchmarks and the models' reported scores.
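The structured-output and empty-response risks above can be caught before they corrupt a pipeline. Below is a minimal defensive sketch in Python; the function name and required-key check are illustrative assumptions, not either vendor's API:

```python
import json

def parse_strict_json(raw: str, required_keys: set) -> dict:
    """Validate a model's structured output before it enters a pipeline.

    Guards against two failure modes noted above: empty responses (the
    reported R1 0528 structured_output quirk) and JSON that parses but
    is missing required schema keys.
    """
    if not raw or not raw.strip():
        raise ValueError("empty model response; retry or fall back")
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    missing = required_keys - data.keys()
    if missing:
        raise ValueError(f"schema violation, missing keys: {sorted(missing)}")
    return data

# A well-formed response passes; an empty one is rejected early.
parse_strict_json('{"title": "Q3 report", "source": "doc.pdf"}', {"title", "source"})
```

In production you would typically pair a check like this with one retry before falling back to the other model.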

Practical Examples

1. Long, safety‑sensitive research summary: GPT-5.4 is preferable. Both models scored 5/5 on faithfulness in our tests, but GPT-5.4's safety_calibration (5 vs 4) and 1,050,000-token context window reduce hallucination risk when checking and quoting long sources.
2. Strict JSON ingestion pipeline: GPT-5.4 scores structured_output 5 vs R1 0528's 4 in our testing, so it more reliably meets schema requirements; note that R1 0528 has a declared quirk of returning empty responses on structured_output tasks, which can break pipelines.
3. Orchestrating precise API calls or multi‑step tool sequences: R1 0528 shines (tool_calling 5 vs GPT-5.4's 4). In our tests it selects and sequences functions with higher argument accuracy, which prevents downstream errors even though its structured output score is lower.
4. Cost‑sensitive production: R1 0528 is far cheaper at the listed rates (output $2.15/MTok vs GPT-5.4's $15.00/MTok), so for high‑volume faithful extraction where tool calling is critical and strict JSON schemas are not, R1 0528 may be the better operational choice.
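To make the pricing gap in the cost‑sensitive case concrete, here is a back‑of‑the‑envelope sketch using the listed per‑MTok rates; the 10M‑input / 2M‑output monthly volume is a hypothetical assumption:

```python
# Listed per-MTok rates from the pricing cards above.
R1_IN, R1_OUT = 0.50, 2.15      # R1 0528, $/MTok
GPT_IN, GPT_OUT = 2.50, 15.00   # GPT-5.4, $/MTok

def monthly_cost(in_mtok, out_mtok, in_rate, out_rate):
    """Total monthly spend for a given token volume, in dollars."""
    return in_mtok * in_rate + out_mtok * out_rate

# Hypothetical volume: 10M input tokens + 2M output tokens per month.
r1_cost = monthly_cost(10, 2, R1_IN, R1_OUT)     # 10*0.50 + 2*2.15 = $9.30
gpt_cost = monthly_cost(10, 2, GPT_IN, GPT_OUT)  # 10*2.50 + 2*15.00 = $55.00
print(f"R1 0528: ${r1_cost:.2f}  GPT-5.4: ${gpt_cost:.2f}  ratio: {gpt_cost / r1_cost:.1f}x")
```

At this volume the listed rates put GPT-5.4 at roughly 5.9x the monthly spend of R1 0528; the gap widens as the output share of the workload grows, since output pricing differs more than input pricing.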

Bottom Line

For Faithfulness, choose R1 0528 if you need the lowest cost and the best tool_calling accuracy for multi‑step, API‑driven workflows (R1 output_cost_per_mtok $2.15 vs GPT-5.4 $15.00; tool_calling 5 vs 4). Choose GPT-5.4 if you require the strongest safety refusals and schema fidelity across very long contexts (faithfulness tied 5/5, but GPT-5.4 has safety_calibration 5 vs 4, structured_output 5 vs 4, and a 1,050,000 token context window).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions