Claude Haiku 4.5 vs Claude Opus 4.6 for Creative Writing

Winner: Claude Opus 4.6. In our testing, Opus posts a Creative Writing task score of 4.33 vs Haiku's 4.00 and ranks 5th vs 28th. Opus's higher creative_problem_solving (5 vs 4) and safety_calibration (5 vs 2) explain its edge; many other axes tie (persona_consistency, long_context, tool_calling, faithfulness, structured_output). Haiku is substantially cheaper ($1.00 vs $5.00/MTok input; $5.00 vs $25.00/MTok output) and matches Opus on persona and long-context handling, making it the value pick for lower-cost workflows.

Anthropic

Claude Haiku 4.5

Overall: 4.33/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $1.00/MTok
Output: $5.00/MTok
Context Window: 200K tokens

Anthropic

Claude Opus 4.6

Overall: 4.58/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 3/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 5/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 5/5

External Benchmarks

SWE-bench Verified: 78.7%
MATH Level 5: N/A
AIME 2025: 94.4%

Pricing

Input: $5.00/MTok
Output: $25.00/MTok
Context Window: 1M tokens

Task Analysis

What Creative Writing demands: originality, coherent long-form narrative, consistent character voice, sensible plot problem-solving, and safe handling of sensitive prompts. No external benchmark data is available for this task, so our decision rests on our internal task score and component tests. Primary signals: creative_problem_solving (idea generation and plot invention), persona_consistency (maintaining character voice), long_context (handling chapters and serialized drafts), constrained_rewriting (compression and edits), and safety_calibration (refusing harmful requests while allowing legitimate creative content). In our testing, Opus scores 5 on creative_problem_solving and 5 on safety_calibration versus Haiku's 4 and 2, while both models score 5 on persona_consistency and long_context and tie on structured_output (4). Those differences, especially Opus's stronger creative problem solving and safety calibration, drive its win for Creative Writing.
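
If your workflow weights these signals differently, you can re-rank the two models yourself from the component scores above. Here is a minimal Python sketch; the weights are illustrative placeholders, not the weighting behind our published task scores:

```python
# Component scores for Creative Writing, taken from the cards above (1-5 scale).
SCORES = {
    "claude-haiku-4.5": {
        "creative_problem_solving": 4, "persona_consistency": 5,
        "long_context": 5, "constrained_rewriting": 3, "safety_calibration": 2,
    },
    "claude-opus-4.6": {
        "creative_problem_solving": 5, "persona_consistency": 5,
        "long_context": 5, "constrained_rewriting": 3, "safety_calibration": 5,
    },
}

def weighted_score(model: str, weights: dict[str, float]) -> float:
    """Weighted mean of the component scores; weights need not sum to 1."""
    total = sum(weights.values())
    return sum(SCORES[model][k] * w for k, w in weights.items()) / total

# Example: a drafting pipeline that prizes voice and budget over safety review.
weights = {"persona_consistency": 3, "long_context": 2,
           "creative_problem_solving": 2, "constrained_rewriting": 1,
           "safety_calibration": 1}
for model in SCORES:
    print(f"{model}: {weighted_score(model, weights):.2f}")
```

Under these example weights Haiku closes much of the gap, which is the point: how decisive Opus's safety and ideation edge is depends on how much your pipeline leans on those axes.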

Practical Examples

Where Claude Opus 4.6 shines (use Opus when):

  • You need a serialized novel or screenplay that reuses and edits 100k+ token context: Opus has a 1,000,000 token window and scores 5 on long_context (tied with Haiku) and 5 on creative_problem_solving.
  • You require complex plot fixes or non-obvious story beats: Opus scores 5 vs Haiku’s 4 on creative_problem_solving in our tests, so it produces more specific feasible ideas.
  • You must handle sensitive themes responsibly: Opus scores 5 on safety_calibration vs Haiku's 2, reducing unsafe outputs in our testing.

Where Claude Haiku 4.5 shines (use Haiku when):

  • You want fast, low-cost drafting and iteration: Haiku costs $1.00/MTok for input and $5.00/MTok for output versus Opus's $5.00 and $25.00, roughly 5x cheaper across the board (see the worked cost example after the Bottom Line).
  • You need strong persona and coherent voice at lower cost: Haiku scores 5 on persona_consistency (tied with Opus) and 5 on long_context, so it maintains characters across long drafts while saving budget.
  • You want good tool integration or structured outputs without the premium cost: Haiku ties Opus on tool_calling (5) and structured_output (4) in our tests. A sketch of a two-tier draft/polish workflow follows this list.
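
A common pattern is to combine the two: iterate cheaply on Haiku, then escalate the final or sensitive pass to Opus. Here is a minimal sketch using the Anthropic Python SDK; the model ID strings are assumptions, so verify them against Anthropic's current model list:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Assumed model IDs -- check Anthropic's published model list for the exact strings.
DRAFT_MODEL = "claude-haiku-4-5"
FINAL_MODEL = "claude-opus-4-6"

def generate(prompt: str, final_pass: bool = False) -> str:
    """Route cheap iteration to Haiku; escalate the polished pass to Opus."""
    response = client.messages.create(
        model=FINAL_MODEL if final_pass else DRAFT_MODEL,
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

draft = generate("Outline a three-act heist novella set in 1920s Lisbon.")
final = generate(f"Revise this outline for tighter plot logic:\n\n{draft}",
                 final_pass=True)
```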

Bottom Line

For Creative Writing, choose Claude Haiku 4.5 if you need high-quality persona consistency and long-context drafting at much lower cost ($1.00 input / $5.00 output per MTok). Choose Claude Opus 4.6 if you require stronger idea generation and safer handling of sensitive material (creative_problem_solving 5 vs 4; safety_calibration 5 vs 2) and can justify the higher price ($5.00 input / $25.00 output per MTok) for a larger context window.
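
To put the pricing gap in concrete terms, here is a worked example with hypothetical token counts (a 50K-token manuscript in, a 10K-token revision out):

```python
# (input $/MTok, output $/MTok) from the pricing cards above.
PRICING = {"claude-haiku-4.5": (1.00, 5.00), "claude-opus-4.6": (5.00, 25.00)}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = PRICING[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

for model in PRICING:
    print(f"{model}: ${job_cost(model, 50_000, 10_000):.2f}")
# claude-haiku-4.5: $0.10
# claude-opus-4.6: $0.50  (5x, matching the per-token ratio)
```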

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
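
As an illustration of the scoring step (a sketch of the pattern, not our actual harness), an LLM judge can be as simple as a rubric prompt that returns a single integer; the judge model ID below is an assumption:

```python
import re
import anthropic

client = anthropic.Anthropic()

RUBRIC = ("Score the RESPONSE from 1 (poor) to 5 (excellent) for how well it "
          "completes the TASK. Reply with a single integer and nothing else.")

def judge(task: str, response: str, judge_model: str = "claude-opus-4-6") -> int:
    """Return a 1-5 score from a judge model (model ID is an assumption)."""
    msg = client.messages.create(
        model=judge_model,
        max_tokens=8,
        messages=[{"role": "user",
                   "content": f"{RUBRIC}\n\nTASK:\n{task}\n\nRESPONSE:\n{response}"}],
    )
    match = re.search(r"[1-5]", msg.content[0].text)
    if match is None:
        raise ValueError("Judge did not return a score in range 1-5")
    return int(match.group())
```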

Frequently Asked Questions