Claude Sonnet 4.6 vs GPT-5.4 for Creative Writing

Winner: Claude Sonnet 4.6. In our testing, Sonnet 4.6 edges GPT-5.4 on the Creative Writing subtest that matters most, creative_problem_solving (5 vs 4), while persona_consistency and long_context are tied. Both models score 4.333/5 on our three-test Creative Writing suite and rank 5th of 52, but Sonnet's stronger idea generation gives it a practical edge for storytelling and original concept work. GPT-5.4 retains advantages in constrained_rewriting (4 vs 3) and structured_output (5 vs 4), so it is the better choice when tight length limits or strict format compliance matter.

Anthropic

Claude Sonnet 4.6

Overall
4.67/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
5/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
5/5

External Benchmarks

SWE-bench Verified
75.2%
MATH Level 5
N/A
AIME 2025
85.8%

Pricing

Input

$3.00/MTok

Output

$15.00/MTok

Context Window: 1000K

modelpicker.net

OpenAI

GPT-5.4

Overall
4.58/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
4/5
Classification
3/5
Agentic Planning
5/5
Structured Output
5/5
Safety Calibration
5/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
76.9%
MATH Level 5
N/A
AIME 2025
95.3%

Pricing

Input

$2.50/MTok

Output

$15.00/MTok

Context Window: 1050K

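The pricing cards above make per-request cost easy to compare. A minimal sketch of the arithmetic, using an illustrative workload of 50K input / 5K output tokens (an assumption for the example, not a benchmark figure):

```python
# Per-request cost at the listed rates ($/MTok), from the pricing cards above.
PRICES = {
    "Claude Sonnet 4.6": (3.00, 15.00),  # (input, output) $/MTok
    "GPT-5.4": (2.50, 15.00),
}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of one request: tokens scaled by the per-million-token rate."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

for model in PRICES:
    print(model, round(request_cost(model, 50_000, 5_000), 4))
# Claude Sonnet 4.6 0.225
# GPT-5.4 0.2
```

With identical output pricing, GPT-5.4's lower input rate only matters for prompt-heavy workloads; for long generations the two models cost nearly the same.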

Task Analysis

Creative Writing demands: strong idea generation, consistent voice/persona across scenes, coherent long-context handling for multi-chapter work, faithful adherence to prompts, and sometimes tight constrained rewriting (flash fiction, microcopy) or strict formatting (scripts, submission metadata). Because no external benchmark is present for this task, our verdict is based on our internal three-test suite (creative_problem_solving, persona_consistency, constrained_rewriting). In our testing Claude Sonnet 4.6 scores 5 on creative_problem_solving vs GPT-5.4’s 4; persona_consistency is 5 for both; constrained_rewriting is 3 for Sonnet vs 4 for GPT-5.4. Both models score 5 on long_context and 5 on faithfulness in our tests. These component differences explain the practical tradeoffs: Sonnet generates more non-obvious, specific story ideas, while GPT-5.4 is better at compressing or strictly formatting output under hard limits. All quoted scores are from our testing on the 3 subtests for Creative Writing.
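The quoted 4.333/5 tie follows directly from averaging the three subtest scores above. A quick sketch of that arithmetic:

```python
# Creative Writing suite scores from our testing (3 subtests, 1-5 scale).
sonnet_46 = {"creative_problem_solving": 5, "persona_consistency": 5, "constrained_rewriting": 3}
gpt_54    = {"creative_problem_solving": 4, "persona_consistency": 5, "constrained_rewriting": 4}

def suite_avg(scores):
    """Unweighted mean of subtest scores, rounded to 3 decimals."""
    return round(sum(scores.values()) / len(scores), 3)

print(suite_avg(sonnet_46))  # 4.333
print(suite_avg(gpt_54))     # 4.333
```

The averages tie because Sonnet's +1 on creative_problem_solving exactly offsets its -1 on constrained_rewriting, which is why the component scores, not the aggregate, drive the recommendation.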

Practical Examples

  1. High-concept novel brainstorming: Sonnet 4.6 is the better choice. In our testing Sonnet's creative_problem_solving is 5 vs GPT-5.4's 4, so Sonnet produces more non-obvious, feasible plot hooks and character beats.
  2. Long serialized fiction with consistent character voice: tie. Both score 5 on persona_consistency and 5 on long_context in our testing, so either model maintains voice and handles 30K+ token contexts reliably.
  3. Flash fiction or Twitter-length serialized scenes with strict character limits: GPT-5.4 is the better choice. GPT-5.4 scores 4 on constrained_rewriting vs Sonnet's 3, meaning it is measurably stronger at compressing prose while preserving meaning within hard character limits.
  4. Formatted deliverables (screenplay with exact JSON metadata, or publisher-ready front matter): GPT-5.4 has the advantage. GPT-5.4 scores 5 on structured_output vs Sonnet's 4 in our tests, so it adheres better to precise schemas and formatting constraints.
  5. Iterative development and agentic workflows (e.g., multi-step rewriting using tools): Sonnet 4.6 shows higher tool_calling (5 vs 4) in our testing, which supports more accurate function/agent selection during iterative creative workflows.

Bottom Line

For Creative Writing, choose Claude Sonnet 4.6 if you prioritize original idea generation and wide-ranging, iterative storytelling (Sonnet edges GPT-5.4 on creative_problem_solving: 5 vs 4). Choose GPT-5.4 if you need strict length compression, tight constrained rewrites, or exact formatted outputs (GPT-5.4 wins constrained_rewriting 4 vs 3 and structured_output 5 vs 4). Note: both score 4.333/5 on our 3-test Creative Writing suite and tie at rank 5 of 52, so either model will handle general storytelling well.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
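The overall scores on the cards above (4.67 and 4.58) appear to be simple means of the twelve subtest scores; a sketch reproducing them from the listed values, assuming unweighted averaging:

```python
# The 12 subtest scores (1-5 each) in card order: faithfulness, long_context,
# multilingual, tool_calling, classification, agentic_planning, structured_output,
# safety_calibration, strategic_analysis, persona_consistency,
# constrained_rewriting, creative_problem_solving.
sonnet_46 = [5, 5, 5, 5, 4, 5, 4, 5, 5, 5, 3, 5]
gpt_54    = [5, 5, 5, 4, 3, 5, 5, 5, 5, 5, 4, 4]

def overall(scores):
    """Unweighted mean of the 12 subtest scores, rounded to 2 decimals."""
    return round(sum(scores) / len(scores), 2)

print(overall(sonnet_46))  # 4.67
print(overall(gpt_54))     # 4.58
```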

Frequently Asked Questions