Claude Sonnet 4.6 vs GPT-5.4 for Constrained Rewriting

Winner: GPT-5.4. In our Constrained Rewriting benchmark, GPT-5.4 scores 4/5 to Claude Sonnet 4.6's 3/5 (task rank: GPT-5.4 = 6 of 52; Sonnet 4.6 = 31 of 52). That one-point lead, together with GPT-5.4's top Structured Output score (5 vs Sonnet's 4), makes it the better choice for reliably compressing content to hard character limits while preserving fidelity.

Anthropic

Claude Sonnet 4.6

Overall
4.67/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
5/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
5/5

External Benchmarks

SWE-bench Verified
75.2%
MATH Level 5
N/A
AIME 2025
85.8%

Pricing

Input

$3.00/MTok

Output

$15.00/MTok

Context Window: 1,000K

modelpicker.net

OpenAI

GPT-5.4

Overall
4.58/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
4/5
Classification
3/5
Agentic Planning
5/5
Structured Output
5/5
Safety Calibration
5/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
76.9%
MATH Level 5
N/A
AIME 2025
95.3%

Pricing

Input

$2.50/MTok

Output

$15.00/MTok

Context Window: 1,050K


Task Analysis

Constrained Rewriting demands exact compression to hard character limits while preserving meaning and required elements. The key capabilities are accurate character/byte budgeting (format and length control), faithfulness to the source content, and reliable structured output when the compressed text must fit a schema; long-context support also helps when compressing large source documents. In our testing, the primary task signal is the Constrained Rewriting score (GPT-5.4 = 4, Claude Sonnet 4.6 = 3). Supporting signals: GPT-5.4 scores 5 on Structured Output vs Sonnet's 4 (which helps enforce strict length and format rules), while both models score 5 on Faithfulness (both preserve source material in our tests) and 5 on Long Context (useful for compressing long inputs). Sonnet scores higher on Tool Calling (5 vs GPT-5.4's 4), which can help if your pipeline relies on an external length checker or iterative tool-based compression, but raw on-model constrained-rewriting performance favors GPT-5.4 in our suite.
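The character/byte budgeting point deserves emphasis: a rewrite that fits a character limit can still overflow a byte limit once non-ASCII text is encoded. A minimal illustrative sketch (plain Python, no model calls; the example string is hypothetical):

```python
# Character limits and byte limits diverge for non-ASCII text.
# Some delivery channels cap bytes on the wire, not characters,
# so a "compliant" rewrite can still be rejected downstream.

def fits_char_budget(text: str, limit: int) -> bool:
    """True if the text fits a hard character limit."""
    return len(text) <= limit

def fits_byte_budget(text: str, limit: int, encoding: str = "utf-8") -> bool:
    """True if the encoded text fits a hard byte limit."""
    return len(text.encode(encoding)) <= limit

copy = "Café déjà-vu: 2-for-1 till Friday"
print(len(copy), len(copy.encode("utf-8")))  # 33 characters, 36 bytes
```

Here the same string passes a 33-character budget but fails a 33-byte one, which is exactly the gap a length-validation step should catch.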

Practical Examples

  1. Tight marketing copy (exact 280-char ad): GPT-5.4 (4/5) is more likely, in our tests, to produce compliant, meaning-preserving 280-character copy while adhering to format constraints, thanks to its Structured Output score of 5.
  2. SMS / push-notification conversion where schema and exact length matter: GPT-5.4's 4/5 Constrained Rewriting and higher Structured Output score reduce format retries.
  3. Batch-compressing long documents into fixed-size abstracts: both models score 5 on Long Context, so either can handle long inputs; GPT-5.4 still outscored Sonnet on the constrained task (4 vs 3).
  4. Tool-assisted pipelines that call a length-checker function: Claude Sonnet 4.6's Tool Calling = 5 and Creative Problem Solving = 5 make it a strong choice when you intend to orchestrate iterative external checks (Sonnet produced better tool-calling behavior in our tests).
  5. Cost/context note: Claude Sonnet 4.6 input costs $3.00/MTok vs GPT-5.4's $2.50/MTok; both charge $15.00/MTok for output and offer at least 1M-token context windows (Sonnet: 1,000,000; GPT-5.4: 1,050,000), so budget and extremely long inputs should be weighed alongside accuracy.
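The tool-assisted pipeline in item 4 can be sketched as a validate-and-retry loop. This is an assumed shape, not either vendor's API: `call_model` below is a hypothetical stand-in you would replace with a real client, and `stub_model` exists only so the sketch runs:

```python
from typing import Callable

def rewrite_to_limit(
    source: str,
    limit: int,
    call_model: Callable[[str], str],
    max_retries: int = 3,
) -> str:
    """Ask the model to compress `source` to `limit` characters, feeding the
    measured overflow back into the prompt until the draft complies."""
    prompt = f"Rewrite in at most {limit} characters, keeping key facts:\n{source}"
    draft = call_model(prompt)
    for _ in range(max_retries):
        overflow = len(draft) - limit
        if overflow <= 0:
            return draft
        # External length check drives the next prompt (tool-style iteration).
        prompt = (
            f"Your draft was {overflow} characters over the {limit}-character "
            f"limit. Shorten it further without dropping key facts:\n{draft}"
        )
        draft = call_model(prompt)
    raise ValueError(f"no compliant rewrite within {max_retries} retries")

# Deterministic stub standing in for a real model, for demonstration only.
def stub_model(prompt: str) -> str:
    text = prompt.rsplit("\n", 1)[-1]
    return text[:100]  # a real model would compress, not truncate
```

Measuring length in your own code rather than trusting the model is the point of the loop: it converts a soft "please be brief" instruction into a hard, checkable contract.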

Bottom Line

For Constrained Rewriting, choose GPT-5.4 if you need the best on-model performance for strict character-limit compression and schema adherence (our tests: 4/5 vs 3/5; Structured Output 5 vs 4). Choose Claude Sonnet 4.6 if your workflow uses external tool calls or iterative programmatic length checks (Sonnet: Tool Calling 5), or if you value Sonnet's higher Creative Problem Solving score (5 vs 4) in complex multi-step compression pipelines.
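If cost factors into the decision, the listed per-MTok rates make the comparison straightforward to estimate. A minimal sketch using the prices from the cards above (the token counts are hypothetical):

```python
def job_cost_usd(input_tokens: int, output_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Cost of one batch at per-million-token (MTok) rates."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: compress 10M input tokens down to 1M output tokens.
sonnet = job_cost_usd(10_000_000, 1_000_000, in_price=3.00, out_price=15.00)
gpt = job_cost_usd(10_000_000, 1_000_000, in_price=2.50, out_price=15.00)
print(sonnet, gpt)  # 45.0 vs 40.0
```

Because compression jobs are input-heavy by nature, the $0.50/MTok input-price gap compounds at scale even though the output rates are identical.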

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions