Claude Sonnet 4.6 vs R1 0528 for Constrained Rewriting
R1 0528 is the better choice for Constrained Rewriting in our testing. It scores 4/5 to Claude Sonnet 4.6's 3/5 on the constrained_rewriting test and ranks 6th versus Sonnet's 31st out of 52 models. That one-point lead, backed by R1's higher task rank, makes R1 the winner. Caveat: R1 has a documented quirk that can return empty responses on constrained_rewriting unless you set a high max_completion_tokens and account for its reasoning-token behavior; if you need guaranteed non-empty, schema-compliant outputs without extra tuning, Claude Sonnet 4.6 is the safer (but much more expensive) fallback.
Pricing (modelpicker.net)
- Claude Sonnet 4.6 (Anthropic): $3.00/MTok input, $15.00/MTok output
- R1 0528 (DeepSeek): $0.500/MTok input, $2.15/MTok output
Task Analysis
Constrained Rewriting (compression within hard character limits) demands tight length control, high faithfulness to the source, reliable adherence to output constraints (including schema/format when required), and efficient token usage for short outputs. No external benchmark covers this task, so our internal constrained_rewriting scores are the primary signal: R1 0528 scores 4/5 (task rank 6/52) vs Claude Sonnet 4.6 at 3/5 (task rank 31/52). Supporting signals: both models score 5/5 on faithfulness and 4/5 on structured_output, so both can preserve meaning and follow formats in most cases.

The operational differences matter more. R1's quirks note says it "returns empty responses on structured_output, constrained_rewriting, and agentic_planning — reasoning tokens consume output budget on short tasks" and that it requires a high max_completion_tokens (minimum 1000). Claude Sonnet 4.6 supports structured_outputs and verbosity in its parameter set and does not list the same empty-response quirk, but it costs substantially more ($15/MTok output vs R1's $2.15) and its constrained_rewriting score is lower. Weigh the internal 1–5 task scores against these reliability and cost details when trading raw capability against predictability.
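Because R1's reasoning tokens share the completion budget, short rewriting requests need headroom. The sketch below builds an OpenAI-compatible chat payload that enforces the documented 1000-token floor; the model name, default budget, and helper name are illustrative assumptions, not vendor guidance.

```python
def build_r1_request(prompt: str, max_completion_tokens: int = 4000) -> dict:
    """Build an OpenAI-compatible chat payload for R1 0528 (hypothetical helper).

    Clamps max_completion_tokens to the documented floor of 1000 so that
    reasoning tokens cannot consume the entire output budget on short
    rewriting tasks.
    """
    MIN_MAX_COMPLETION_TOKENS = 1000  # from R1's quirks note
    if max_completion_tokens < MIN_MAX_COMPLETION_TOKENS:
        max_completion_tokens = MIN_MAX_COMPLETION_TOKENS
    return {
        "model": "deepseek-r1-0528",  # assumed model identifier
        "messages": [{"role": "user", "content": prompt}],
        "max_completion_tokens": max_completion_tokens,
    }
```

Passing a smaller budget (say, 200) is silently raised to 1000, which is the safer failure mode for automated pipelines.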
Practical Examples
1. High-volume push-notification compression to 140 chars: R1 0528 is preferable. It scores 4/5 on constrained_rewriting and has a much lower output cost ($2.15/MTok vs $15), reducing operating expense at scale. Set max_completion_tokens above R1's minimum (>=1000) and test for empty outputs.
2. One-off legal boilerplate compressed into a strict JSON schema: Claude Sonnet 4.6 is the safer pick if you need guaranteed, non-empty, schema-compliant output without tuning. Sonnet lists structured_outputs among its supported parameters and does not carry R1's empty-response quirk, but expect a higher cost ($15/MTok output).
3. Mixed-media source (image + text) compressed into a short caption: Claude Sonnet 4.6 supports text+image->text input (R1 is text->text), so Sonnet may be necessary when the input includes images. For pure text compression, R1 wins on our constrained_rewriting score.
4. Tight short-task pipelines where reasoning tokens matter: R1 can spend its budget on reasoning tokens and return empty outputs on short tasks. In automated pipelines, allocate a large max_completion_tokens or avoid structured_output flags; otherwise Sonnet's more predictable behavior is preferable despite its lower constrained_rewriting score.
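For automated pipelines like those above, a simple output guard catches both failure modes: empty responses (R1's documented quirk) and outputs that break the hard character limit. A minimal sketch; the function name and retry policy are assumptions, not part of either vendor's API.

```python
def accept_rewrite(output, char_limit: int) -> bool:
    """Return True only if a rewrite is non-empty and within the hard limit.

    Rejects None and whitespace-only responses (R1's empty-output quirk)
    and any output longer than char_limit. Callers should retry with a
    larger max_completion_tokens, or fall back to another model, on False.
    """
    if output is None or not output.strip():
        return False
    return len(output) <= char_limit
```

In practice you would retry an empty response with a larger completion budget before falling back to a costlier model such as Sonnet.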
Bottom Line
For Constrained Rewriting, choose R1 0528 if you need higher measured compression performance and much lower output costs (R1 scores 4 vs Sonnet 4.6's 3 on our test and costs $2.15 vs $15 per mTok) and you can configure high max_completion_tokens to avoid empty responses. Choose Claude Sonnet 4.6 if you require guaranteed non-empty, schema-compliant outputs, image->text input support, or you prefer a model without R1’s empty-response quirk and are willing to pay ~7× more per output token.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.