Question 1

How much better is Grok 4 at Constrained Rewriting in your tests?

Accepted Answer

In our testing Grok 4 scores 4 on constrained_rewriting vs Claude Sonnet 4.6's 3 — a 1-point margin. Grok ranks 6th of 52 for this task, Sonnet ranks 31st.

Question 2

Did you use any external benchmark to decide the winner?

Accepted Answer

No. externalBenchmark is null in the payload, so the verdict is based on our internal constrained_rewriting scores and supporting proxy metrics in the provided data.

Question 3

Both models have high faithfulness and long-context scores — why not pick Sonnet 4.6?

Accepted Answer

Both models score 5 on faithfulness and long_context, but our constrained_rewriting tests measure compression within hard character limits specifically. Grok 4 scored higher on that task in our tests (4 vs 3), which is why it wins for pure constrained rewriting despite Sonnet's strengths in other areas.

Question 4

Are there cost differences to consider between the two?

Accepted Answer

In the provided payload both models have the same input_cost_per_mtok (3) and output_cost_per_mtok (15), so pricing per mTok is equal according to the data we were given.

Question 5

When should I still pick Claude Sonnet 4.6 for compression tasks?

Accepted Answer

Choose Claude Sonnet 4.6 when constrained rewriting is part of a larger safety-sensitive or multi-step workflow that leverages strong tool_calling (Sonnet 5 vs Grok 4), agentic_planning (5 vs 3), or high safety_calibration (5 vs 2). Sonnet is better in our testing for those broader, iterative scenarios even though it scores lower on single-pass constrained_rewriting.

Claude Sonnet 4.6 vs Grok 4 for Constrained Rewriting

Claude Sonnet 4.6

Grok 4

Task Analysis

Practical Examples

Bottom Line

How We Test

Frequently Asked Questions