Claude Sonnet 4.6 vs Grok 4 for Creative Writing

Winner: Claude Sonnet 4.6. In our Creative Writing suite Sonnet 4.6 scores 4.33 vs Grok 4's 4.00 — a 0.33-point advantage. Sonnet earns 5/5 on creative_problem_solving, 5/5 on safety_calibration, and 5/5 on persona_consistency in our testing, which translates to stronger ideation, safer handling of sensitive prompts, and more reliable voice/character maintenance. Grok 4 is competitive for constrained_rewriting (4 vs Sonnet's 3) and matches Sonnet on long-context and several format-oriented metrics, but overall Sonnet's higher creative problem-solving and safety scores make it the better pick for most fiction and storytelling workflows.

anthropic

Claude Sonnet 4.6

Overall
4.67/5 — Strong

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
5/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
5/5

External Benchmarks

SWE-bench Verified
75.2%
MATH Level 5
N/A
AIME 2025
85.8%

Pricing

Input

$3.00/MTok

Output

$15.00/MTok

Context Window: 1000K

modelpicker.net

xai

Grok 4

Overall
4.08/5 — Strong

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
3/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
3/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$3.00/MTok

Output

$15.00/MTok

Context Window: 256K


Task Analysis

What Creative Writing demands: ideation of non-obvious plots and scenes, consistent character voice, safe handling of sensitive themes, and sometimes strict-length rewrites (microfiction/ad copy). Our Creative Writing task is driven by three benchmarks: creative_problem_solving (idea quality), persona_consistency (voice maintenance), and constrained_rewriting (compression within hard limits). No external benchmark covers this task, so our 3-test suite is the primary signal. In our testing Sonnet 4.6 leads on creative_problem_solving (5 vs 3) and safety_calibration (5 vs 2), supporting superior brainstorming, risk-aware content filtering, and stable character work. Grok 4 scores higher on constrained_rewriting (4 vs 3), so it handles hard character limits and tight editorial compression more reliably. Both models score 5 on long_context, so either can handle large drafts, but Sonnet's ideation and safety strengths decide our verdict.
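As a quick check, the reported task scores are consistent with an unweighted mean of the three driving benchmarks. The page does not state the weighting, so the equal-weight aggregation below is an assumption; the per-benchmark scores come from the cards above.

```python
# Scores on the three benchmarks that drive the Creative Writing task
# (taken from the score cards above).
scores = {
    "Claude Sonnet 4.6": {
        "creative_problem_solving": 5,
        "persona_consistency": 5,
        "constrained_rewriting": 3,
    },
    "Grok 4": {
        "creative_problem_solving": 3,
        "persona_consistency": 5,
        "constrained_rewriting": 4,
    },
}

def task_score(benchmarks: dict) -> float:
    """Unweighted mean of the benchmark scores, rounded to 2 dp (assumed)."""
    return round(sum(benchmarks.values()) / len(benchmarks), 2)

for model, b in scores.items():
    print(model, task_score(b))
# Claude Sonnet 4.6 → 4.33, Grok 4 → 4.0
```

Flipping a single benchmark by one point moves the task score by about 0.33, which is exactly the gap between the two models here.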

Practical Examples

Where Claude Sonnet 4.6 shines (based on our scores):

  • Worldbuilding and plot ideation: Sonnet's 5/5 creative_problem_solving produces more non-obvious, feasible story directions when you need multiple distinct arcs.
  • Maintaining complex character voice across long drafts: persona_consistency 5 and long_context 5 help Sonnet keep tone and backstory coherent over tens of thousands of tokens.
  • Handling sensitive or boundary-pushing themes safely: safety_calibration 5 reduces unsafe outputs while permitting legitimate creative exploration.

Where Grok 4 shines (based on our scores):

  • Microfiction, ad copy, and strict-length edits: constrained_rewriting 4 vs Sonnet's 3 — Grok more reliably compresses and preserves intent under hard character caps.
  • Format-focused editing and structured rewrites: Grok ties Sonnet on structured_output (4) and matches on long_context (5), so it's good when you need precise format adherence plus extended context.

Concrete comparison point: Sonnet's creative_problem_solving 5 vs Grok's 3 means Sonnet is substantially better for ideation-heavy tasks; Grok's constrained_rewriting 4 vs Sonnet's 3 means Grok is measurably better for tight compression tasks.

Bottom Line

For Creative Writing, choose Claude Sonnet 4.6 if you need superior ideation, robust persona consistency, and safer handling of sensitive themes (task score 4.33, rank 5/52). Choose Grok 4 if your priority is strict-length rewrites, tight editorial compression, or format-bound microcontent (task score 4.00, rank 28/52).
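Note that the cards list identical pricing for both models ($3.00/MTok input, $15.00/MTok output), so cost is not a tiebreaker. A minimal per-request cost sketch, with illustrative token counts:

```python
INPUT_PRICE = 3.00    # USD per million input tokens (from the cards above)
OUTPUT_PRICE = 15.00  # USD per million output tokens (from the cards above)

def job_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the listed per-MTok rates."""
    return (input_tokens / 1e6) * INPUT_PRICE + (output_tokens / 1e6) * OUTPUT_PRICE

# Hypothetical long-draft revision: 50K tokens of manuscript in, 5K tokens out.
print(round(job_cost(50_000, 5_000), 3))  # 0.225 — same for either model
```

At these rates the choice comes down to quality fit, not budget; only the context window (1000K vs 256K) differs on the resource side.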

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
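The overall scores shown on the cards are consistent with an unweighted mean of the twelve 1–5 benchmark scores. The page does not state the aggregation method, so equal weighting is an assumption; the lists below copy the twelve scores from each card.

```python
# All twelve benchmark scores from each card above, in card order.
sonnet = [5, 5, 5, 5, 4, 5, 4, 5, 5, 5, 3, 5]  # Claude Sonnet 4.6
grok   = [5, 5, 5, 4, 4, 3, 4, 2, 5, 5, 4, 3]  # Grok 4

def overall(scores: list) -> float:
    """Unweighted mean of the twelve judge scores, rounded to 2 dp (assumed)."""
    return round(sum(scores) / len(scores), 2)

print(overall(sonnet), overall(grok))  # 4.67 4.08
```

This also shows why a task verdict can differ from the overall ranking: the task score weights only the three relevant benchmarks, so a model can win overall yet lose a task, or vice versa.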

Frequently Asked Questions