R1 0528 vs GPT-5.4 for Constrained Rewriting

GPT-5.4 is the better choice for Constrained Rewriting. In our testing both models scored 4/5 and share rank 6 of 52 on this task, but R1 0528 has a documented quirk: it can return empty responses on constrained_rewriting tasks and requires a large completion budget (min_max_completion_tokens: 1000). That functional failure makes GPT-5.4 the reliable winner despite R1 0528's much lower pricing (input $0.50/MTok and output $2.15/MTok vs GPT-5.4's $2.50/MTok and $15.00/MTok).

deepseek

R1 0528

Overall
4.50/5 Strong

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
4/5
Strategic Analysis
4/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
96.6%
AIME 2025
66.4%

Pricing

Input

$0.50/MTok

Output

$2.15/MTok

Context Window
164K

modelpicker.net

openai

GPT-5.4

Overall
4.58/5 Strong

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
4/5
Classification
3/5
Agentic Planning
5/5
Structured Output
5/5
Safety Calibration
5/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
76.9%
MATH Level 5
N/A
AIME 2025
95.3%

Pricing

Input

$2.50/MTok

Output

$15.00/MTok

Context Window
1050K


Task Analysis

Constrained Rewriting (defined in our benchmarks as "compression within hard character limits") demands precise length control, faithfulness to the source, structured-output compliance when a format is required, and predictable completion behavior under tight token budgets. In our testing both R1 0528 and GPT-5.4 scored 4/5 on constrained_rewriting and are tied at rank 6 of 52, so raw task accuracy is comparable. The supporting signals are what separate them: GPT-5.4 scores higher on structured_output (5 vs R1's 4) and safety_calibration (5 vs 4), which helps it produce format-adherent, policy-safe truncations and refusals. R1 0528 posts strong tool_calling (5) and classification (4) scores and offers a 163,840-token context window, but its documented quirks ("empty_on_structured_output", the note that it "Returns empty responses on structured_output, constrained_rewriting, and agentic_planning", and "needs_high_max_completion_tokens") are failure modes that hit constrained rewriting directly, since the task depends on short, exact outputs.

Practical Examples

GPT-5.4 (winner): Rewriting a legal paragraph into a 280-character SMS while preserving mandatory clauses and returning a JSON flag for any omitted clauses. In our testing GPT-5.4's structured_output 5/5 and safety_calibration 5/5 make it reliable for strict format and compliance needs, and its 1,050,000-token context window accommodates large source documents.

R1 0528 (cost-efficient alternative): Bulk-compressing long product descriptions across multiple languages, where you can supply a high completion budget and tolerate reasoning-token overhead. R1 0528 is much cheaper (input $0.50/MTok, output $2.15/MTok) and scored 5/5 on tool_calling, persona_consistency, and multilingual in our tests. In constrained rewriting workflows that expect short, deterministic outputs, however, R1 0528 may return empty outputs unless you configure a very high max_completion_tokens budget (it documents min_max_completion_tokens: 1000) and account for its reasoning-token consumption, making it risky for one-shot short compression tasks.
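The SMS example above amounts to a hard-limit check plus a JSON contract. A minimal validation sketch follows; the "text"/"omitted_clauses" schema is a hypothetical prompt contract, not part of either vendor's API.

```python
# Validate a model's constrained-rewrite reply: the prompt (hypothetically)
# asks for JSON with a "text" field holding the <=280-char rewrite and an
# "omitted_clauses" boolean flagging dropped mandatory clauses.
import json

def validate_sms_rewrite(raw: str, limit: int = 280) -> dict:
    """Parse the model's JSON reply and enforce the hard character limit.
    Raises ValueError on a length violation, json.JSONDecodeError on
    malformed JSON, KeyError if "text" is missing."""
    payload = json.loads(raw)
    text = payload["text"]
    if len(text) > limit:
        raise ValueError(f"rewrite is {len(text)} chars, limit is {limit}")
    return {"text": text, "omitted_clauses": bool(payload.get("omitted_clauses"))}
```

A guard like this is worth running regardless of which model you pick, since both scored 4/5 (not 5/5) on constrained_rewriting and neither guarantees the limit is met.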

Bottom Line

For Constrained Rewriting, choose R1 0528 if you need a low-cost option for large-scale, multilingual compression and can set a high max_completion_tokens budget and tolerate its reasoning-token output behavior. Choose GPT-5.4 if you need reliable, format-adherent, policy-safe compressed outputs out of the box: it avoids R1 0528's empty-output quirk and carries stronger structured-output and safety signals.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions