Claude Sonnet 4.6 vs Grok 4 for Constrained Rewriting
Winner: Grok 4. In our testing on the Constrained Rewriting benchmark (compression within hard character limits), Grok 4 scores 4 to Claude Sonnet 4.6's 3, a one-point advantage on our 1–5 scale. Grok ranks 6th of 52 models for this task; Sonnet ranks 31st. The two models are priced identically per MTok, so Grok's higher constrained_rewriting score and better task rank make it the stronger choice when strict length-limited compression is the primary requirement.
Claude Sonnet 4.6 (Anthropic)
Pricing: $3.00/MTok input, $15.00/MTok output

Grok 4 (xAI)
Pricing: $3.00/MTok input, $15.00/MTok output
Task Analysis
Constrained Rewriting requires producing compact, accurate compressions that obey hard character limits while preserving meaning and required content. The key capabilities are precise brevity (no extraneous words), faithfulness to source facts, structured-output adherence when a particular format is required, and robust long-context handling to locate and compress the relevant material.

No external benchmarks cover this task directly, so our verdict rests on the models' internal constrained_rewriting scores and supporting proxies. In our testing Grok 4 scores 4 on constrained_rewriting vs Claude Sonnet 4.6's 3. Supporting evidence: both models score 5 on faithfulness and 5 on long_context (so both preserve meaning and handle long inputs), and both score 4 on structured_output (schema/format compliance). The differences that explain the gap lie elsewhere: Sonnet outperforms on creative_problem_solving (5 vs 3), tool_calling (5 vs 4), agentic_planning (5 vs 3), and safety_calibration (5 vs 2), which favors iterative, multi-step, or safety-sensitive edits, but those strengths do not translate into superior raw compression performance in our constrained_rewriting tests. Grok's higher constrained_rewriting score indicates it more reliably produces tight compressions under hard character limits in our test set.
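To make the task concrete, here is a minimal sketch of the acceptance check a constrained-rewriting harness implies: the rewrite must fit the hard character limit and the required facts must survive compression. The function name and the substring-based faithfulness proxy are our own illustration, not part of the benchmark; a real harness would use a stronger semantic check.

```python
def check_rewrite(rewrite: str, char_limit: int, required_facts: list[str]) -> tuple[bool, list[str]]:
    """Validate a constrained rewrite against a hard character limit
    and a set of facts that must survive compression.

    Illustrative sketch only: substring matching is a weak stand-in
    for a real faithfulness check.
    """
    problems = []
    if len(rewrite) > char_limit:
        problems.append(f"over limit: {len(rewrite)} > {char_limit} chars")
    for fact in required_facts:
        if fact.lower() not in rewrite.lower():
            problems.append(f"missing required fact: {fact!r}")
    return (not problems, problems)


# Example: a product blurb that must stay under 280 characters
# while keeping two required specs.
ok, issues = check_rewrite(
    "Ultralight trail shoe, 210 g, 4 mm drop, recycled mesh upper.",
    char_limit=280,
    required_facts=["210 g", "4 mm drop"],
)
```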
Practical Examples
1) Short product descriptions for a mobile UI (hard 280-character limit): Grok 4 (score 4) is likelier in our testing to hit tight limits while retaining required specs; Claude Sonnet 4.6 (score 3) may preserve nuance better but exceeds strict limits more often. A retry loop like the sketch after this list catches the overruns either model produces.
2) Legal clause compression where safety and exact preservation matter: both models score 5 on faithfulness, but Sonnet's safety_calibration of 5 (vs Grok's 2) and higher creative_problem_solving (5 vs 3) make it the safer choice for risk-averse teams that will perform additional review.
3) Batch-file compression workflows: Grok supports text+image+file->text input and has a 256k-token context window; Sonnet has a 1,000,000-token context window and text+image->text input. If your source material arrives as many attached files or requires large single-document context, Sonnet's headroom helps retain more source content before compression, but on pure constrained_rewriting metrics Grok still scores higher in our tests.
4) Tool-assisted multi-pass compression: Sonnet's tool_calling of 5 (vs Grok's 4) and agentic_planning of 5 (vs 3) indicate it is stronger for scripted, multi-step pipelines that call external tools to iteratively shorten text, even though Grok is better at single-pass hard-limit compression in our benchmark. See the sketch below.
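The following sketch shows the retry pattern examples 1 and 4 describe: a single compression attempt, re-prompted with explicit length feedback until the output fits the hard limit. `call_model` is a placeholder for whichever prompt-in, text-out client you use (neither vendor's SDK is shown), and the prompt wording and retry budget are our own assumptions, not a tested recipe.

```python
from typing import Callable

def compress_to_limit(
    text: str,
    char_limit: int,
    call_model: Callable[[str], str],
    max_passes: int = 3,
) -> str:
    """Multi-pass compression under a hard character limit.

    `call_model` stands in for any LLM client; swap in the vendor
    call of your choice.
    """
    draft = call_model(
        f"Rewrite the following in at most {char_limit} characters, "
        f"preserving all facts:\n\n{text}"
    )
    for _ in range(max_passes):
        if len(draft) <= char_limit:
            return draft
        # Feed the overrun back so the model knows how much to cut.
        draft = call_model(
            f"Your draft is {len(draft)} characters; the hard limit is "
            f"{char_limit}. Shorten it without dropping facts:\n\n{draft}"
        )
    # Still over budget: fail loudly rather than truncate, since blind
    # truncation would break faithfulness.
    raise ValueError(f"could not fit {char_limit} chars in {max_passes} passes")
```

A model that scores higher on single-pass hard-limit compression needs fewer trips through this loop, which is why the constrained_rewriting score matters even in tool-assisted pipelines.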
Bottom Line
For Constrained Rewriting, choose Grok 4 if your primary need is single-pass, tight compression that reliably meets hard character limits (Grok scores 4 vs Sonnet's 3 and ranks 6th vs 31st in our testing). Choose Claude Sonnet 4.6 if you value stronger safety calibration, multi-step tool-assisted workflows, or creative rephrasing where iterative planning and refusal behavior matter (Sonnet scores higher on tool_calling, agentic_planning, creative_problem_solving, and safety_calibration). The two models are priced identically per MTok, so pick based on whether raw compression accuracy (Grok) or broader workflow safety and iteration (Sonnet) matters more.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.