Claude Haiku 4.5 vs Devstral 2 2512 for Constrained Rewriting

Winner: Devstral 2 2512. In our testing, Devstral scores 5/5 on Constrained Rewriting versus Claude Haiku 4.5's 3/5, a clear two-point margin. Devstral ties for 1st on this task and also posts the higher Structured Output score (5 vs 4), which directly supports reliable compression into hard character limits. Claude Haiku 4.5 is stronger on Faithfulness (5 vs 4) and Tool Calling (5 vs 4), so it preserves source content better, but it lags Devstral at tight-format compression.

Anthropic

Claude Haiku 4.5

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $1.00/MTok
Output: $5.00/MTok

Context Window: 200K tokens


Mistral

Devstral 2 2512

Overall
4.00/5 (Strong)

Benchmark Scores

Faithfulness: 4/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 4/5
Persona Consistency: 4/5
Constrained Rewriting: 5/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.40/MTok
Output: $2.00/MTok

Context Window: 262K tokens


Task Analysis

What Constrained Rewriting demands: precise compression within hard character limits while preserving meaning and required structure. Key capabilities: strict structured-output compliance (format and schema adherence), faithfulness to the source content, predictable token use for exact length control, and long-context handling to avoid losing context when compressing long inputs. No external benchmark covers this task, so the winner rests on our internal task scores: Devstral 2 2512 scores 5 on Constrained Rewriting and 5 on Structured Output; Claude Haiku 4.5 scores 3 and 4, respectively. Supporting signals: Devstral's Faithfulness is 4 (vs Claude's 5), and both models score 5 on Long Context. In short, Devstral's higher Structured Output score and top task rank explain its edge at strict compression, while Claude's higher Faithfulness makes it the pick when preserving nuance matters more than squeezing into a limit.
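To make the length-control requirement concrete, here is a minimal sketch of a validate-and-retry harness that enforces a hard character cap. It is illustrative only: `call_model` is a hypothetical stand-in for whichever provider SDK you use, and the prompts, attempt budget, and truncation fallback are our assumptions, not part of the test suite.

```python
from typing import Callable

def rewrite_within_limit(
    call_model: Callable[[str], str],  # hypothetical: wraps your provider SDK
    source: str,
    max_chars: int = 160,
    max_attempts: int = 3,
) -> str:
    """Compress `source` to at most `max_chars` characters, re-prompting on overshoot."""
    prompt = (
        f"Rewrite the following in at most {max_chars} characters, "
        f"preserving the key facts:\n\n{source}"
    )
    candidate = ""
    for _ in range(max_attempts):
        candidate = call_model(prompt).strip()
        if len(candidate) <= max_chars:
            return candidate
        # Feed the overshoot back so the model knows how much more to cut.
        prompt = (
            f"Your draft was {len(candidate)} characters; the hard limit is "
            f"{max_chars}. Shorten it further:\n\n{candidate}"
        )
    # Fallback: truncate at a word boundary so the cap is never violated.
    return candidate[:max_chars].rsplit(" ", 1)[0]
```

The design point: the limit is enforced by the harness, never trusted to the model. A higher Constrained Rewriting score mostly translates into fewer retry round-trips, which matters for latency and cost at volume.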

Practical Examples

1. Mobile push notifications (<=160 characters): Devstral 2 2512 is the better choice. Its 5/5 Constrained Rewriting and 5/5 Structured Output scores give more reliable, repeatable compression into an exact character cap; a harness-side retry loop like the sketch above keeps the cap hard regardless of model choice.
2. SMS summaries of legal clauses where every phrase must be preserved: Claude Haiku 4.5 is preferable when fidelity matters. It scores 5 on Faithfulness vs Devstral's 4, so it better preserves precise wording, even though it struggles more to hit tight limits.
3. CSV field compression for UI display: Devstral's Structured Output score of 5 vs Claude's 4 means it will more reliably produce exact-length, schema-compliant cells.
4. Persona-aware microcopy (brand voice plus a character limit): Claude's Persona Consistency of 5 helps retain voice, but expect more re-prompting to reach a hard limit, since its Constrained Rewriting score is 3 versus Devstral's 5.
5. Cost-sensitive bulk rewriting: Devstral is also cheaper ($0.40/MTok input and $2.00/MTok output vs Claude's $1.00 and $5.00), so for high-volume constrained rewriting it is both more accurate and more economical; a worked cost sketch follows this list.
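On example 5, the price gap compounds at volume. Here is a back-of-envelope sketch using the per-MTok prices from the cards above; the per-job token counts are assumptions chosen for illustration.

```python
# Per-MTok prices from the pricing cards above: (input $/MTok, output $/MTok).
PRICES = {
    "Claude Haiku 4.5": (1.00, 5.00),
    "Devstral 2 2512": (0.40, 2.00),
}

def batch_cost(model: str, jobs: int, in_tokens: int = 400, out_tokens: int = 60) -> float:
    """USD cost for `jobs` rewrites at the assumed per-job token counts."""
    in_price, out_price = PRICES[model]
    return jobs * (in_tokens * in_price + out_tokens * out_price) / 1_000_000

for name in PRICES:
    print(f"{name}: ${batch_cost(name, jobs=100_000):.2f} per 100k rewrites")
# Claude Haiku 4.5: $70.00 per 100k rewrites
# Devstral 2 2512: $28.00 per 100k rewrites
```

Retries shift the math further: a model that needs two attempts to hit the cap roughly doubles its output spend, so the Constrained Rewriting score and the price column compound in the same direction here.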

Bottom Line

For Constrained Rewriting, choose Devstral 2 2512 if you need reliable compression into hard character limits, exact schema/format adherence, and lower cost per token; it scores 5 vs Claude's 3 in our testing. Choose Claude Haiku 4.5 if preserving exact source wording and persona fidelity matters more than hitting a strict character cap; it scores 5 on both Faithfulness and Persona Consistency but only 3 on Constrained Rewriting.
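To ground the 'exact schema/format adherence' point, here is a hedged sketch of what structured output buys you in a pipeline like example 3: the character cap is expressed as a JSON Schema constraint, so a local validator (or a provider's schema-enforcing endpoint, where available) rejects oversized fields deterministically instead of your UI discovering them. The schema, field name, and cap are illustrative assumptions.

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

# Illustrative contract: one compressed cell for a CSV/UI display,
# with the hard length cap baked in as a schema constraint.
CELL_SCHEMA = {
    "type": "object",
    "properties": {
        "display_text": {"type": "string", "maxLength": 40},
    },
    "required": ["display_text"],
    "additionalProperties": False,
}

try:
    validate({"display_text": "Plan renews automatically on Jan 1"}, CELL_SCHEMA)
    print("within limit")
except ValidationError as err:
    print(f"rejected: {err.message}")
```

This complements rather than replaces the retry loop sketched earlier: the schema catches violations deterministically, and a model's Structured Output score predicts how often that catch has to fire.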

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions