Claude Haiku 4.5 vs Devstral Medium for Creative Writing

Winner: Claude Haiku 4.5. In our testing, Claude Haiku 4.5 scores 4.00 on the Creative Writing task versus Devstral Medium's 2.667, a gap of 1.333. Haiku 4.5 outperforms Devstral Medium on creative problem solving (4 vs 2), persona consistency (5 vs 3), and long context (5 vs 4), the core dimensions for fiction, voice, and extended narratives. Devstral Medium is cheaper ($0.40 vs $1.00 input and $2.00 vs $5.00 output per MTok) and capable for short edits or structured formats, but it lost decisively on our Creative Writing tests.

Benchmark Scores

| Benchmark                | Claude Haiku 4.5 (Anthropic) | Devstral Medium (Mistral) |
|--------------------------|------------------------------|---------------------------|
| Overall                  | 4.33/5 (Strong)              | 3.17/5 (Usable)           |
| Faithfulness             | 5/5                          | 4/5                       |
| Long Context             | 5/5                          | 4/5                       |
| Multilingual             | 5/5                          | 4/5                       |
| Tool Calling             | 5/5                          | 3/5                       |
| Classification           | 4/5                          | 4/5                       |
| Agentic Planning         | 5/5                          | 4/5                       |
| Structured Output        | 4/5                          | 4/5                       |
| Safety Calibration       | 2/5                          | 1/5                       |
| Strategic Analysis       | 5/5                          | 2/5                       |
| Persona Consistency      | 5/5                          | 3/5                       |
| Constrained Rewriting    | 3/5                          | 3/5                       |
| Creative Problem Solving | 4/5                          | 2/5                       |

External Benchmarks

SWE-bench Verified, MATH Level 5, and AIME 2025: N/A for both models.

Pricing and Context

| Model            | Input       | Output      | Context Window |
|------------------|-------------|-------------|----------------|
| Claude Haiku 4.5 | $1.00/MTok  | $5.00/MTok  | 200K           |
| Devstral Medium  | $0.40/MTok  | $2.00/MTok  | 131K           |

Task Analysis

Creative Writing demands sustained persona consistency, robust long-context handling, non-obvious idea generation (creative problem solving), and tidy constrained rewriting when needed. Our Creative Writing score averages three tests: creative problem solving, persona consistency, and constrained rewriting. Claude Haiku 4.5 leads on creative problem solving (4 vs 2) and persona consistency (5 vs 3), and the two models tie on constrained rewriting (3 vs 3). These internal scores map directly onto the task: persona consistency (keeping character voice and resisting injection), long context (retrieval across 30K+ tokens), and creative problem solving (feasible, original plot and scene ideas) are the decisive capabilities. Cost and parameter support matter operationally: Claude Haiku 4.5 offers a 200K-token context window and additional supported parameters (e.g., include_reasoning and structured outputs) that help with long-form narrative control; Devstral Medium's 131K window is smaller, and it lags on creativity and persona in our benchmarks.
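To make the operational point concrete, here is a minimal sketch of a long-form, persona-pinned drafting call using the Anthropic Python SDK. The model ID, prompts, and file name are illustrative assumptions, not values from our benchmark harness:

```python
# Minimal sketch: long-form drafting with a fixed persona via the Anthropic
# Python SDK. Model ID, prompts, and file name are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A system prompt pins the narrator's voice; the 200K context window lets you
# pass the full outline plus prior chapters in a single request.
persona = (
    "You are the narrator of a serialized noir novel. First person, past "
    "tense, dry humor. Never break character."
)

response = client.messages.create(
    model="claude-haiku-4-5",  # assumed model ID; check the current docs
    max_tokens=4096,
    system=persona,
    messages=[
        {
            "role": "user",
            "content": "Draft chapter 12 from this outline:\n"
            + open("outline.txt").read(),
        },
    ],
)
print(response.content[0].text)
```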

Practical Examples

Where Claude Haiku 4.5 shines (based on score gaps):

  • Serial novel drafting: keeps voice across long chapters (long context 5 vs 4) and maintains character consistency (persona consistency 5 vs 3).
  • Plot brainstorming and non-obvious twists: generates feasible, specific ideas (creative problem solving 4 vs 2).
  • Complex rewrites, such as turning a 30K-token outline into scene-by-scene beats: the large context window and structured-output support carry the job.

Where Devstral Medium is appropriate (given its strengths and cost):

  • Short-form fiction or micro-stories where budget matters: lower input/output costs ($0.40/$2.00 vs $1.00/$5.00 per MTok) reduce spend.
  • Structured templates and format adherence: both models score 4 on structured output, so Devstral handles JSON and format constraints as well as Haiku; see the validation sketch after this list.
  • Fast prototyping of many short variants: solid classification and format behavior (classification 4), though weaker at sustained voice and deep creativity.
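Because the two models tie on structured output, the deciding factor for template work is often the harness around the model rather than the model itself. Here is a minimal sketch of that harness, with an assumed beat schema (the field names and types are illustrative); it parses and shape-checks the JSON either model returns before a draft is accepted:

```python
# Sketch: validate scene-beat JSON returned by either model before use.
# The schema (field names and types) is an illustrative assumption.
import json

REQUIRED_BEAT_FIELDS = {"scene": int, "goal": str, "conflict": str, "outcome": str}

def parse_beats(raw: str) -> list[dict]:
    """Parse and shape-check a JSON array of scene beats; raise on any drift."""
    beats = json.loads(raw)
    if not isinstance(beats, list):
        raise ValueError("expected a JSON array of beats")
    for beat in beats:
        for field, typ in REQUIRED_BEAT_FIELDS.items():
            if not isinstance(beat.get(field), typ):
                raise ValueError(f"beat missing or mistyped field: {field}")
    return beats

# A well-formed response passes; anything malformed fails loudly, which is
# the point of pairing a format-capable model with a strict check.
sample = (
    '[{"scene": 1, "goal": "introduce the heist", '
    '"conflict": "the safecracker backs out", "outcome": "new recruit"}]'
)
print(parse_beats(sample))
```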

Bottom Line

For Creative Writing, choose Claude Haiku 4.5 if you need sustained voice, long-form drafts, and stronger idea generation (it scored 4.00 vs Devstral Medium's 2.667 on our task). Choose Devstral Medium if budget per token is the priority and you work mainly on short-form pieces, structured templates, or many low-cost iterations (input $0.40 vs $1.00 and output $2.00 vs $5.00 per MTok).
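The budget claim is easy to sanity-check with the listed prices. A quick back-of-the-envelope cost sketch; the token counts for the example job are illustrative assumptions:

```python
# Back-of-the-envelope cost comparison using the listed per-MTok prices.
# Token counts below are illustrative assumptions for one drafting job.
PRICES = {  # (input $/MTok, output $/MTok) from the pricing table above
    "Claude Haiku 4.5": (1.00, 5.00),
    "Devstral Medium": (0.40, 2.00),
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for one request at the listed per-million-token rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens / 1e6) * in_rate + (output_tokens / 1e6) * out_rate

# Assumed job: a 30K-token outline in, an 8K-token chapter out.
for model in PRICES:
    print(f"{model}: ${job_cost(model, 30_000, 8_000):.3f}")
# Claude Haiku 4.5: $0.070 / Devstral Medium: $0.028
```

At these rates Devstral Medium is roughly 2.5x cheaper per job; whether that offsets the 4.00 vs 2.667 quality gap depends on how many drafts you expect to discard.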

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
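For illustration, a minimal sketch of what a 1-5 judge prompt can look like; the rubric wording here is an assumption, not our production prompt:

```python
# Sketch of the 1-5 LLM-judge scoring step. The rubric text is an
# illustrative assumption, not the production prompt.
JUDGE_PROMPT = """You are grading a model's answer to a benchmark task.
Score it on a 1-5 scale:
  5 = fully correct and well executed
  3 = partially correct or uneven
  1 = wrong or off-task
Respond with only the integer score.

Task: {task}
Model answer: {answer}"""

def build_judge_prompt(task: str, answer: str) -> str:
    """Fill the rubric template for one (task, answer) pair."""
    return JUDGE_PROMPT.format(task=task, answer=answer)
```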

Frequently Asked Questions