Claude Haiku 4.5 vs Devstral Medium for Persona Consistency

Claude Haiku 4.5 is the clear winner for Persona Consistency in our testing. It scores 5 versus Devstral Medium's 3 (a 2-point gap). In our suite Haiku ranks 1st for this task (rank 1 of 52) while Devstral ranks 45th of 52. Haiku's stronger supporting capabilities — long-context (5 vs 4), tool-calling (5 vs 3), and faithfulness (5 vs 4) — explain its ability to maintain character and resist prompt injection. Devstral Medium is cheaper per output token (2 vs 5) but is weaker on the specific robustness and retention measures that matter for persona consistency.

anthropic

Claude Haiku 4.5

Overall
4.33/5Strong

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$1.00/MTok

Output

$5.00/MTok

Context Window200K

modelpicker.net

mistral

Devstral Medium

Overall
3.17/5Usable

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
3/5
Classification
4/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.400/MTok

Output

$2.00/MTok

Context Window131K

modelpicker.net

Task Analysis

Persona Consistency demands: (1) sustained memory of a character or role across turns, (2) resistance to prompt-injection or style-drift, and (3) accurate, on-character responses under long contexts. Key capabilities from our benchmarks that matter: long_context (retrieval and coherence across 30K+ tokens), tool_calling (correct sequencing and argument boxing when tools or persona state are exposed), faithfulness (sticking to source persona without hallucination), and safety_calibration (refusing harmful persona-altering injections). In our testing, Claude Haiku 4.5 scores 5 on persona_consistency and also scores 5 on long_context, tool_calling, and faithfulness — a profile that directly supports persona retention and injection resistance. Devstral Medium scores 3 on persona_consistency with lower tool_calling (3) and slightly lower long_context and faithfulness (4 each), which helps explain its weaker result. There is no external benchmark for this task in the payload, so our internal taskScore (5 vs 3) is the primary signal.

Practical Examples

Scenario 1 — Multi-session character assistant: You need a role-playing customer support persona that preserves backstory over a 100K-token transcript. Claude Haiku 4.5 (persona 5, long_context 5) is better positioned to keep details consistent across long history. Scenario 2 — Agent interacting with tools while keeping persona: If prompts call external tools or inject new instructions, Haiku's tool_calling 5 and persona 5 reduce the risk of the agent adopting injected roles. Scenario 3 — Cost-sensitive bulk generation: If you must generate many persona-consistent messages at minimum cost and can tolerate occasional drift, Devstral Medium is cheaper (output cost 2 per mTok vs Haiku 5) and may be acceptable for short-turn, lower-risk personas. Scenario 4 — Tight prompt-control with structured outputs: Both models support structured_outputs; Haiku's higher faithfulness (5 vs 4) means JSON or schema outputs are less likely to include off-persona content in our tests. All example comparisons are grounded in our measured scores: persona_consistency 5 vs 3, tool_calling 5 vs 3, long_context 5 vs 4, faithfulness 5 vs 4, and output cost 5 vs 2 (Haiku vs Devstral).

Bottom Line

For Persona Consistency, choose Claude Haiku 4.5 if you need robust, long-context character maintenance and resistance to prompt injection (scores: persona 5, long_context 5, tool_calling 5). Choose Devstral Medium if your primary constraint is cost or short-turn throughput and you can accept weaker persona robustness (scores: persona 3, lower tool_calling and long-context), noting Devstral’s lower output cost per mTok (2 vs Haiku's 5).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions