Claude Haiku 4.5 vs Codestral 2508 for Writing

Winner: Claude Haiku 4.5. In our testing for Writing (blog posts, marketing copy, content creation), Claude Haiku 4.5 posts a task score of 3.5 vs Codestral 2508's 2.5, a clear 1.0-point advantage. The gap is driven by Haiku's higher creative_problem_solving (4 vs 2) and persona_consistency (5 vs 3) scores in our suite. Codestral 2508 does outperform on structured_output (5 vs 4) and is substantially cheaper ($0.90/MTok output vs Haiku's $5.00/MTok), making it the better choice when strict format adherence and cost per generation are the priorities. All benchmark claims are based on our testing.

Claude Haiku 4.5 (Anthropic)

Overall: 4.33/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $1.00/MTok
Output: $5.00/MTok
Context Window: 200K tokens

Codestral 2508 (Mistral)

Overall: 3.50/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 4/5
Tool Calling: 5/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.30/MTok
Output: $0.90/MTok
Context Window: 256K tokens

Task Analysis

What Writing demands: high-quality writing requires creative idea generation, a consistent brand/persona voice, the ability to compress or rewrite within limits, faithfulness to source facts, and handling of long contexts (research, briefs). On this task our top-level signal is the internal task score: Claude Haiku 4.5 = 3.5 vs Codestral 2508 = 2.5.

The primary contributors in our tests were creative_problem_solving and constrained_rewriting: Haiku leads on creative_problem_solving (4 vs 2), while constrained_rewriting is tied (3 vs 3). As a supporting proxy, Haiku's persona_consistency of 5 vs Codestral's 3 explains its better voice control and brand consistency in our samples. Both models tie on long_context (5/5) and faithfulness (5/5), so long-form drafts and fidelity to source material are comparable. Codestral's structured_output advantage (5 vs 4) makes it stronger for exact-format outputs (JSON, SEO metadata, templates). No external benchmark in our set covers Writing directly, so this verdict rests on our internal scores and task results.
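
To make the persona_consistency point concrete, here is a minimal sketch of pinning a brand voice in a system prompt with the Anthropic Python SDK. It is an illustration, not our test harness: the model ID, the brand, and the style rules are all assumptions you should adapt.

```python
# Minimal brand-voice sketch using the Anthropic Python SDK.
# The model ID, brand name, and style rules below are illustrative
# assumptions -- check Anthropic's docs for current identifiers.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

BRAND_VOICE = (
    "You are the voice of Acme Co (a hypothetical brand): warm, direct, "
    "second person, no exclamation marks, one idea per sentence."
)

message = client.messages.create(
    model="claude-haiku-4-5",  # assumed model ID
    max_tokens=1024,
    system=BRAND_VOICE,        # persona pinned once, reused across every post
    messages=[
        {"role": "user", "content": "Draft a 150-word blog intro on onboarding."}
    ],
)
print(message.content[0].text)
```

Reusing one system prompt across every request is the usual way to hold a voice steady over many drafts, which is the behavior the persona_consistency score rewards.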

Practical Examples

Where Claude Haiku 4.5 shines (based on our scores):

  • Marketing campaign ideation: Haiku's creative_problem_solving score (4 vs 2) translates into more non-obvious, specific campaign angles and hooks.
  • Brand-voice blog series: persona_consistency of 5 vs 3 yields a more consistent tone across posts and less voice drift.
  • Strategy-led content: strategic_analysis of 5 vs 2 supports nuanced tradeoff framing for product positioning and CTAs.

Where Codestral 2508 shines (based on our scores and pricing):

  • High-volume content pipelines that require strict schemas: structured_output of 5 vs 4 produces cleaner JSON/metadata and better template compliance (see the JSON-mode sketch after this list).
  • Cost-sensitive bulk generation: output cost is $0.90/MTok for Codestral vs $5.00/MTok for Haiku, so Codestral cuts per-piece cost for large batches.
  • Long-form drafting and source fidelity: both tie on long_context (5) and faithfulness (5), so either model handles long briefs and staying true to source material equally well in our tests.
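
Since strict schemas are Codestral's strongest writing use case, here is a minimal sketch of schema-constrained metadata generation with the Mistral Python SDK (v1). The model identifier and the metadata fields are assumptions, and JSON-mode availability should be confirmed for Codestral in Mistral's docs.

```python
# Schema-constrained SEO-metadata sketch using the Mistral Python SDK (v1).
# The model name and the field list are illustrative assumptions.
import json
import os

from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

resp = client.chat.complete(
    model="codestral-2508",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": (
                "Return SEO metadata for a post about onboarding as JSON with "
                "keys: title (<=60 chars), description (<=155 chars), slug."
            ),
        }
    ],
    response_format={"type": "json_object"},  # request strict JSON output
)

metadata = json.loads(resp.choices[0].message.content)
print(metadata["title"], metadata["slug"])
```

In a production pipeline you would validate `metadata` against the schema and retry on failure; the structured_output score is essentially a proxy for how rarely that retry fires.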

Bottom Line

For Writing, choose Claude Haiku 4.5 if you need stronger creative ideation, a consistent brand/persona voice, and higher-level planning (it scores 3.5 vs 2.5 on the Writing task in our tests). Choose Codestral 2508 if you need lower-cost, high-throughput content that must strictly follow schemas/templates (structured_output of 5 vs Haiku's 4, and a lower output cost: $0.90/MTok vs $5.00/MTok).
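
To size that cost difference, here is a back-of-the-envelope calculation at the listed output rates. The 1,200-token average post length is a hypothetical assumption, and input-token cost is ignored.

```python
# Rough per-post output cost at the listed rates.
# ASSUMPTION: an average post is ~1,200 output tokens; input cost ignored.
HAIKU_OUT = 5.00      # $/MTok, Claude Haiku 4.5 output
CODESTRAL_OUT = 0.90  # $/MTok, Codestral 2508 output
TOKENS_PER_POST = 1_200

def cost_per_post(rate_per_mtok: float, tokens: int = TOKENS_PER_POST) -> float:
    """Dollar cost of one post's output tokens at a $/MTok rate."""
    return rate_per_mtok * tokens / 1_000_000

print(f"Haiku:     ${cost_per_post(HAIKU_OUT):.4f}/post")      # $0.0060
print(f"Codestral: ${cost_per_post(CODESTRAL_OUT):.4f}/post")  # $0.0011
```

At these rates both models are cheap in absolute terms; the roughly 5.5x ratio only starts to matter at high volume.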

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
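
For a sense of what a judged test looks like, the sketch below shows the general shape of a 1–5 rubric call. Our actual rubrics and judge model are not published here, so every string in it is hypothetical.

```python
# Hypothetical shape of a 1-5 LLM-judge call; the rubric text and the
# judge model ID are illustrative only, not our published methodology.
import anthropic

JUDGE_RUBRIC = (
    "Score the candidate text from 1 to 5 for persona consistency. "
    "5 = voice is identical across samples; 1 = voice drifts or "
    "contradicts the brief. Reply with a single integer."
)

client = anthropic.Anthropic()
verdict = client.messages.create(
    model="claude-haiku-4-5",  # assumed judge model, for illustration
    max_tokens=8,
    system=JUDGE_RUBRIC,
    messages=[{"role": "user", "content": "<candidate text goes here>"}],
)
print(verdict.content[0].text)  # e.g. "4"
```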

Frequently Asked Questions