Claude Haiku 4.5 vs R1 for Creative Writing

R1 is the stronger creative writing model. In our testing, R1 scores 4.67 out of 5 on our Creative Writing task composite (ranked 1st of 52 models), compared to Claude Haiku 4.5's 4.0 (ranked 28th of 52). That two-thirds-of-a-point gap is meaningful because the task is scored across three tests: creative problem solving, persona consistency, and constrained rewriting. R1 wins two of the three outright, scoring 5/5 on creative problem solving vs. Haiku 4.5's 4/5 and 4/5 on constrained rewriting vs. Haiku 4.5's 3/5; the two models tie on persona consistency at 5/5. No external benchmark covers creative writing specifically, so we rely on our internal task scores, and they tell a consistent story: R1 leads on the dimensions that matter most for fiction and storytelling. The win is clear.

Claude Haiku 4.5 (Anthropic)

Overall: 4.33/5 (Strong)

Benchmark Scores
Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 4/5

External Benchmarks
SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing
Input: $1.00/MTok
Output: $5.00/MTok
Context Window: 200K

R1 (DeepSeek)

Overall: 4.00/5 (Strong)

Benchmark Scores
Faithfulness: 5/5
Long Context: 4/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 2/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 5/5

External Benchmarks
SWE-bench Verified: N/A
MATH Level 5: 93.1%
AIME 2025: 53.3%

Pricing
Input: $0.70/MTok
Output: $2.50/MTok
Context Window: 64K

Task Analysis

Creative writing demands three things from an LLM: the ability to generate non-obvious, original ideas (creative problem solving); the ability to maintain a consistent voice, character, or narrator across a piece (persona consistency); and the ability to write within hard constraints, such as a word count, a form, or a specific tone, without losing quality (constrained rewriting). These are the three tests we used to build the Creative Writing task score.

R1 scores 5/5 on creative problem solving in our testing, tied for 1st with 7 other models out of 54 tested, while Claude Haiku 4.5 scores 4/5, placing it in a group of 21 models sharing that score. On constrained rewriting, R1 scores 4/5 (rank 6 of 53) while Haiku 4.5 scores 3/5 (rank 31 of 53). That is the decisive split: R1 not only generates more imaginative ideas, it also executes within formal constraints more reliably. Persona consistency is a wash: both models score 5/5, tied for 1st among 53 tested models, so neither model will drift character mid-story. R1's edge comes from ideation depth and formal discipline, not from voice stability.

There are no external creative-writing benchmarks to cross-check against. R1 does have external math results (93.1% on MATH Level 5, 53.3% on AIME 2025, per Epoch AI), but those scores are not relevant to creative writing performance and are noted here only for completeness. The task score gap, 4.67 vs. 4.0, is the primary signal; the sketch below reproduces it.
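The composite is consistent with an unweighted mean of the three test scores. Here is a minimal sketch that reproduces the two numbers; the unweighted-mean aggregation is our assumption, since the exact formula is not stated here:

```python
# Reproduce the Creative Writing composites, assuming (not confirmed
# in the text) that the composite is an unweighted mean of the three
# 1-5 test scores, rounded to two decimals.
scores = {
    "R1": {
        "creative_problem_solving": 5,
        "persona_consistency": 5,
        "constrained_rewriting": 4,
    },
    "Claude Haiku 4.5": {
        "creative_problem_solving": 4,
        "persona_consistency": 5,
        "constrained_rewriting": 3,
    },
}

for model, tests in scores.items():
    composite = sum(tests.values()) / len(tests)
    print(f"{model}: {composite:.2f}/5")

# R1: 4.67/5
# Claude Haiku 4.5: 4.00/5
```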

Practical Examples

Short story generation: R1's 5/5 creative problem solving score means it will propose less predictable narrative hooks, character motivations, and plot resolutions. Haiku 4.5 at 4/5 is still capable but more likely to reach for familiar story beats. If you are asking a model to draft an opening chapter with a surprising premise, R1 has a demonstrated edge in our testing.

Constrained forms (sonnets, flash fiction under 100 words, structured haiku sequences): R1 scores 4/5 on constrained rewriting vs. Haiku 4.5's 3/5. In practice, this means R1 is more likely to hit a hard word count without sacrificing coherence, or to maintain a rhyme scheme without forcing awkward syntax. Haiku 4.5's 3/5 sits just below the median (the p50 for constrained rewriting across the 53 tested models is 4) and will more often require a revision pass.

Character-driven roleplay or serialized fiction: Both models score 5/5 on persona consistency, so either can maintain a character's voice and resist injection across a long exchange. This is a genuine tie; choose either for this use case without concern.

Multilingual creative writing: Both models also score 5/5 on multilingual output, so writing fiction in French, Spanish, or other languages is equally strong on both.

Cost consideration: Claude Haiku 4.5 costs $5.00 per million output tokens versus R1's $2.50, so R1 is half the price for output, which compounds when generating long-form fiction (see the cost sketch below). R1's 16,000 max output token limit is lower than Haiku 4.5's 64,000, however, so for very long-form pieces (novellas, multi-chapter drafts), Haiku 4.5 has a structural advantage in a single generation.
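To make the output-price gap and the token ceilings concrete, here is a back-of-the-envelope sketch. Only the per-token prices and output ceilings come from the cards above; the project size (a 50,000-word novella at roughly 1.3 tokens per word) is a hypothetical assumption:

```python
import math

# Prices ($/1M output tokens) and output ceilings are from the cards
# above; the project size is a hypothetical assumption.
MODELS = {
    "Claude Haiku 4.5": {"price": 5.00, "max_output": 64_000},
    "R1": {"price": 2.50, "max_output": 16_000},
}

words = 50_000
tokens = int(words * 1.3)  # rough English tokens-per-word estimate

for name, spec in MODELS.items():
    cost = tokens / 1_000_000 * spec["price"]
    calls = math.ceil(tokens / spec["max_output"])
    print(f"{name}: {tokens:,} output tokens ≈ ${cost:.2f} "
          f"across ≥{calls} call(s)")

# Claude Haiku 4.5: 65,000 output tokens ≈ $0.33 across ≥2 call(s)
# R1: 65,000 output tokens ≈ $0.16 across ≥5 call(s)
```

The absolute dollar amounts are small either way; the more practical difference for long-form work is the call count, since R1's 16,000-token ceiling forces more chunking and stitching.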

Bottom Line

For Creative Writing, choose R1 if you want the highest-scoring creative writing model in our suite: it ranks 1st of 52 with a 4.67 task score, excels at original ideation and constrained forms, and costs half as much per output token ($2.50/MTok vs. $5.00/MTok). R1's 16,000 max output token ceiling is a real limitation for book-length generation in a single call, and its reasoning-token quirks (a minimum completion-token budget and the need to set max_completion_tokens high; see the sketch below) require some API configuration care. Choose Claude Haiku 4.5 if you need very long single-pass outputs (up to 64,000 output tokens) or if you are building a pipeline that relies on tool calling (5/5 vs. R1's 4/5), agentic planning (5/5 vs. 4/5), or classification (4/5 vs. R1's 2/5). Haiku 4.5 is also the better choice if your creative writing workflow integrates structured data, retrieval, or long-context source material (5/5 long context vs. R1's 4/5). For pure creative writing quality, R1 wins. For creative writing inside a broader agentic or tool-augmented system, Haiku 4.5 holds its own.
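As an illustration of that configuration point, here is a minimal sketch of calling R1 through an OpenAI-compatible client with a generous completion budget. The base URL, model identifier, and parameter choice are assumptions; check your provider's documentation for the exact values:

```python
# Minimal sketch: calling R1 via an OpenAI-compatible endpoint with a
# generous completion budget so reasoning tokens don't starve the story.
# ASSUMPTIONS: base_url, model name, and the token-cap parameter are
# illustrative; your provider's values may differ.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek-reasoner",  # assumed model identifier for R1
    messages=[
        {"role": "user",
         "content": "Write a 300-word flash fiction about a lighthouse."},
    ],
    max_tokens=16_000,  # set the cap high: reasoning tokens count against it
)

print(response.choices[0].message.content)
```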

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
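For readers curious what "scored 1–5 by an LLM judge" looks like mechanically, here is a hypothetical sketch; the rubric wording, judge model, and client setup are illustrative only, not our actual harness:

```python
# Hypothetical sketch of an LLM-judge scoring call. The rubric, judge
# model, and client setup are illustrative, not the real harness.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "Score the candidate's response from 1 to 5 against the test "
    "criteria. 5 = fully meets them; 1 = fails them. "
    "Reply with a single integer."
)

def judge(test_description: str, candidate_response: str) -> int:
    """Ask the judge model for a 1-5 integer score."""
    completion = client.chat.completions.create(
        model="gpt-4o",  # illustrative judge model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user",
             "content": f"Test: {test_description}\n\n"
                        f"Response: {candidate_response}"},
        ],
    )
    return int(completion.choices[0].message.content.strip())
```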

Frequently Asked Questions