Claude Sonnet 4.6 vs R1 0528 for Constrained Rewriting
R1 0528 is the better choice for Constrained Rewriting in our testing. It scores 4/5 to Claude Sonnet 4.6's 3/5 on the constrained_rewriting test and ranks 6th versus Sonnet's 31st out of 52 models. That one-point lead, backed by R1's higher task rank, makes R1 the winner. Caveat: R1 has a documented quirk that can return empty responses on constrained_rewriting unless you set a high max_completion_tokens and account for its reasoning-token behavior; if you need guaranteed non-empty, schema-compliant outputs without extra tuning, Claude Sonnet 4.6 is the safer (but much more expensive) fallback.
Pricing (modelpicker.net)
- Claude Sonnet 4.6 (Anthropic): $3.00/MTok input, $15.00/MTok output
- R1 0528 (DeepSeek): $0.500/MTok input, $2.15/MTok output
Task Analysis
Constrained Rewriting (compression within hard character limits) demands tight length control, high faithfulness to the source, reliable adherence to output constraints (including schema/format when required), and efficient token usage for short outputs. No external benchmark covers this task, so our internal constrained_rewriting scores are the primary signal: R1 0528 scores 4/5 (task rank 6/52) vs Claude Sonnet 4.6 at 3/5 (task rank 31/52). Supporting signals: both models score 5/5 on faithfulness and 4/5 on structured_output, so both can preserve meaning and follow formats in most cases.

The operational differences matter more. R1's quirks note says it "returns empty responses on structured_output, constrained_rewriting, and agentic_planning — reasoning tokens consume output budget on short tasks" and that it requires a high max_completion_tokens (minimum 1000). Claude Sonnet 4.6 supports structured_outputs and verbosity in its parameter set and does not list the same empty-response quirk, but it costs substantially more ($15/MTok output vs R1's $2.15) and its constrained_rewriting score is lower. Weigh the internal 1–5 task scores against these reliability and cost details when trading raw capability against predictability.
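Because R1's reasoning tokens share the completion budget, short rewriting requests need headroom. The sketch below builds an OpenAI-compatible chat payload that enforces the documented 1000-token floor; the model name, default budget, and helper name are illustrative assumptions, not vendor guidance.

```python
def build_r1_request(prompt: str, max_completion_tokens: int = 4000) -> dict:
    """Build an OpenAI-compatible chat payload for R1 0528 (hypothetical helper).

    Clamps max_completion_tokens to the documented floor of 1000 so that
    reasoning tokens cannot consume the entire output budget on short
    rewriting tasks.
    """
    MIN_MAX_COMPLETION_TOKENS = 1000  # from R1's quirks note
    if max_completion_tokens < MIN_MAX_COMPLETION_TOKENS:
        max_completion_tokens = MIN_MAX_COMPLETION_TOKENS
    return {
        "model": "deepseek-r1-0528",  # assumed model identifier
        "messages": [{"role": "user", "content": prompt}],
        "max_completion_tokens": max_completion_tokens,
    }
```

Passing a smaller budget (say, 200) is silently raised to 1000, which is the safer failure mode for automated pipelines.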
Practical Examples
1. High-volume push-notification compression to 140 chars: R1 0528 is preferable. It scores 4/5 on constrained_rewriting and has a much lower output cost ($2.15/MTok vs $15), reducing operating expense at scale. Set max_completion_tokens above R1's minimum (>=1000) and test for empty outputs.
2. One-off legal boilerplate compressed into a strict JSON schema: Claude Sonnet 4.6 is the safer pick if you need guaranteed, non-empty, schema-compliant output without tuning. Sonnet lists structured_outputs among its supported parameters and does not carry R1's empty-response quirk, but expect a higher cost ($15/MTok output).
3. Mixed-media source (image + text) compressed into a short caption: Claude Sonnet 4.6 supports text+image->text input (R1 is text->text), so Sonnet may be necessary when the input includes images. For pure text compression, R1 wins on our constrained_rewriting score.
4. Tight short-task pipelines where reasoning tokens matter: R1 can spend its budget on reasoning tokens and return empty outputs on short tasks. In automated pipelines, allocate a large max_completion_tokens or avoid structured_output flags; otherwise Sonnet's more predictable behavior is preferable despite its lower constrained_rewriting score.
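For automated pipelines like those above, a simple output guard catches both failure modes: empty responses (R1's documented quirk) and outputs that break the hard character limit. A minimal sketch; the function name and retry policy are assumptions, not part of either vendor's API.

```python
def accept_rewrite(output, char_limit: int) -> bool:
    """Return True only if a rewrite is non-empty and within the hard limit.

    Rejects None and whitespace-only responses (R1's empty-output quirk)
    and any output longer than char_limit. Callers should retry with a
    larger max_completion_tokens, or fall back to another model, on False.
    """
    if output is None or not output.strip():
        return False
    return len(output) <= char_limit
```

In practice you would retry an empty response with a larger completion budget before falling back to a costlier model such as Sonnet.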
Bottom Line
For Constrained Rewriting, choose R1 0528 if you need higher measured compression performance and much lower output costs (R1 scores 4 vs Sonnet 4.6's 3 on our test and costs $2.15 vs $15 per mTok) and you can configure high max_completion_tokens to avoid empty responses. Choose Claude Sonnet 4.6 if you require guaranteed non-empty, schema-compliant outputs, image->text input support, or you prefer a model without R1’s empty-response quirk and are willing to pay ~7× more per output token.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.