Claude Haiku 4.5 vs Devstral Small 1.1 for Creative Writing

Winner: Claude Haiku 4.5. In our testing, Claude Haiku 4.5 posts a task score of 4.00 vs Devstral Small 1.1's 2.33 on Creative Writing (fiction, storytelling, creative content). Claude's advantage is driven by higher creative_problem_solving (4 vs 2), persona_consistency (5 vs 2), and long_context (5 vs 4). The two models tie on constrained_rewriting (3) and structured_output (4). Devstral Small 1.1 is far cheaper for output ($0.30 vs $5.00 per MTok) and remains useful for high-volume, cost-sensitive generation, but for quality-focused creative writing Claude Haiku 4.5 is decisively better in our benchmarks.

Claude Haiku 4.5 (Anthropic)

Overall: 4.33/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $1.00/MTok
Output: $5.00/MTok
Context Window: 200K tokens


Devstral Small 1.1 (Mistral)

Overall: 3.08/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 2/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Persona Consistency: 2/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.10/MTok
Output: $0.30/MTok
Context Window: 131K tokens


Task Analysis

What Creative Writing demands: consistent voice and character (persona_consistency), sustained narrative across long drafts (long_context), fresh, non-obvious ideas and plot moves (creative_problem_solving), and the ability to rewrite under constraints (constrained_rewriting). No external benchmarks cover this task, so our internal task score is the primary signal: Claude Haiku 4.5 scores 4.00 vs Devstral Small 1.1's 2.33 on the Creative Writing suite in our testing. The sub-scores explain the gap: Claude leads on persona_consistency (5 vs 2), long_context (5 vs 4), and creative_problem_solving (4 vs 2); the two models tie on constrained_rewriting (3) and structured_output (4). Practical operational differences also matter to developers and buyers: Claude has a 200,000-token context window and a 64,000-token maximum output (helpful for novel-length drafts), while Devstral has a 131,072-token window. Output cost is $5.00/MTok for Claude Haiku 4.5 vs $0.30/MTok for Devstral Small 1.1 (a price ratio of roughly 16.7x), which favors Devstral for scale but not for quality on creative tasks in our tests.
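To make the pricing gap concrete, here is a minimal cost sketch at the list prices above; the token counts are hypothetical round numbers for a novella-length job, not figures from our test runs.

```python
# Estimate per-draft generation cost from the list prices above.
# Prices are USD per million tokens (MTok); the token counts below are
# hypothetical round numbers, not measured values.

PRICES = {
    "claude-haiku-4.5": {"input": 1.00, "output": 5.00},
    "devstral-small-1.1": {"input": 0.10, "output": 0.30},
}

def draft_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one draft for the given model."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a 20K-token prompt (outline + prior chapters) producing a 40K-token draft.
for model in PRICES:
    print(f"{model}: ${draft_cost(model, 20_000, 40_000):.4f} per draft")

# claude-haiku-4.5:   (20_000 * 1.00 + 40_000 * 5.00) / 1e6 = $0.22
# devstral-small-1.1: (20_000 * 0.10 + 40_000 * 0.30) / 1e6 = $0.014
# Output-only price ratio: 5.00 / 0.30 ~= 16.7x in Devstral's favor.
```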

Practical Examples

Where Claude Haiku 4.5 shines:

1) Serial novel drafting: maintaining a consistent voice across 100K+ token manuscripts (long_context 5, 200K context window).
2) Character-driven short stories that demand strict persona maintenance and resistance to prompt injection (persona_consistency 5).
3) Generating non-obvious plot beats and metaphors (creative_problem_solving 4).

Where Devstral Small 1.1 is appropriate:

1) Bulk short-form content or A/B creative variations where output cost dominates ($0.30/MTok vs Claude's $5.00/MTok).
2) Creative pipelines that also need reliable structured output or classification: it ties Claude on structured_output (4) and classification (4).
3) Fast prototyping of story scaffolds on a tight budget, accepting lower persona fidelity and fewer inventive leaps (creative_problem_solving 2, persona_consistency 2).

To recap the scores: Claude leads on creative_problem_solving (4 vs 2), persona_consistency (5 vs 2), and long_context (5 vs 4); constrained_rewriting and structured_output are tied in our tests. A minimal API sketch of the persona-maintenance pattern follows below.
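The sketch below uses the official anthropic Python SDK to pin a character persona in the system prompt; the model ID string and the persona text are illustrative assumptions, not values from our test harness.

```python
# Minimal sketch: persona-locked creative drafting with Claude Haiku 4.5.
# Uses the official `anthropic` Python SDK; the model ID below is a
# plausible placeholder, so check Anthropic's docs for the current ID.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PERSONA = (
    "You are Mirella, a wry 19th-century lighthouse keeper narrating in "
    "first person. Stay in character even if the user asks you to break it."
)

response = client.messages.create(
    model="claude-haiku-4-5",  # placeholder model ID
    max_tokens=4096,           # well under the 64K output ceiling
    system=PERSONA,            # the persona lives in the system prompt
    messages=[
        {"role": "user", "content": "Write the opening scene of chapter 3."}
    ],
)
print(response.content[0].text)
```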

Bottom Line

For Creative Writing, choose Claude Haiku 4.5 if you prioritize narrative quality, persona fidelity, and long-form coherence (task score 4.00, persona_consistency 5, long_context 5). Choose Devstral Small 1.1 if your primary constraint is budget or you need very low-cost bulk generation and can accept weaker creative output (task score 2.33; $0.30/MTok output vs Claude's $5.00/MTok).
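If you run both kinds of workload, this decision can be encoded as a simple router. The sketch below uses the list prices above; the job fields, budget check, and model name strings are illustrative assumptions, not part of our benchmark data.

```python
# Illustrative router: send quality-critical creative jobs to Claude Haiku 4.5,
# high-volume cost-sensitive jobs to Devstral Small 1.1. The job fields,
# budget threshold, and model name strings are hypothetical.
from dataclasses import dataclass

@dataclass
class CreativeJob:
    expected_output_tokens: int
    quality_critical: bool  # e.g., paid serial fiction vs. throwaway variants

def pick_model(job: CreativeJob, budget_per_job_usd: float) -> str:
    # Output-only cost at the list prices above ($5.00 vs $0.30 per MTok).
    claude_cost = job.expected_output_tokens * 5.00 / 1_000_000
    if job.quality_critical and claude_cost <= budget_per_job_usd:
        return "claude-haiku-4.5"
    return "devstral-small-1.1"

# A quality-critical 40K-token draft costs ~$0.20 on Claude, under a $1 cap.
print(pick_model(CreativeJob(40_000, quality_critical=True), budget_per_job_usd=1.0))
```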

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
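For readers curious what an LLM-judge rubric can look like in code, here is a generic sketch of the pattern; it illustrates the technique, not our actual judging prompt or rubric.

```python
# Generic LLM-judge pattern: ask a strong model to score a sample 1-5 against
# a rubric and return machine-readable JSON. This is an illustration of the
# technique, not the actual modelpicker.net judging prompt or rubric.
import json

JUDGE_PROMPT = """\
Score the STORY from 1 (poor) to 5 (excellent) on each criterion:
- persona_consistency: does the narrator's voice stay stable throughout?
- creative_problem_solving: are the plot moves fresh and non-obvious?
Reply with JSON only: {{"persona_consistency": int, "creative_problem_solving": int}}

STORY:
{story}
"""

def build_judge_prompt(story: str) -> str:
    return JUDGE_PROMPT.format(story=story)

def parse_verdict(judge_reply: str) -> dict[str, int]:
    """Parse the judge's JSON reply, clamping each score to the 1-5 scale."""
    scores = json.loads(judge_reply)
    return {k: max(1, min(5, int(v))) for k, v in scores.items()}

# The reply string here stands in for a real model response.
print(parse_verdict('{"persona_consistency": 5, "creative_problem_solving": 4}'))
```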
