Claude Haiku 4.5 vs Devstral 2 2512 for Persona Consistency
Winner: Claude Haiku 4.5. In our testing Claude Haiku 4.5 scores 5 vs Devstral 2 2512's 4 on the persona_consistency benchmark and ranks 1st vs 38th of 52. Claude's advantages in faithfulness (5 vs 4), tool_calling (5 vs 4), and higher persona_consistency rank make it better at maintaining character and resisting injection. Devstral 2 2512 remains competitive when you need stronger structured_output (5 vs 4) or lower cost, but for strict persona maintenance Claude Haiku 4.5 is the definitive choice in our tests.
anthropic
Claude Haiku 4.5
Benchmark Scores
External Benchmarks
Pricing
Input
$1.00/MTok
Output
$5.00/MTok
modelpicker.net
mistral
Devstral 2 2512
Benchmark Scores
External Benchmarks
Pricing
Input
$0.400/MTok
Output
$2.00/MTok
modelpicker.net
Task Analysis
Persona Consistency requires two things: (1) staying 'in character' across turns and resisting prompt-injection or role-reset attempts, and (2) preserving factual and stylistic constraints while executing tasks. Capabilities that matter: faithfulness (sticking to the stated persona), robustness to injection (measured by persona_consistency), long_context handling (maintaining persona across long conversations), tool_calling (accurate function selection without deviating from persona), and safety_calibration (avoiding harmful persona behavior). No external benchmark applies for this task in the payload; our internal persona_consistency score is the primary evidence. Claude Haiku 4.5 scores 5 on persona_consistency, faithfulness 5, tool_calling 5, and long_context 5 in our tests — indicating strong, consistent role adherence and resistance to injection across long histories. Devstral 2 2512 scores 4 on persona_consistency with strengths in structured_output (5) and constrained_rewriting (5), but slightly lower faithfulness (4) and tool_calling (4), and weaker safety_calibration (1 vs Claude's 2). These internal results explain why Claude performs better on persona tasks in our suite.
Practical Examples
- Customer-support chatbot maintaining a persona (friendly, brief answers) across a long session: Claude Haiku 4.5 (persona_consistency 5, faithfulness 5, long_context 5, tool_calling 5) will better preserve tone and resist injected prompts that try to change its role. Devstral 2 2512 (persona_consistency 4) can still perform well but may require stricter prompt engineering.
- Roleplaying assistant that must follow complex persona rules and call backend tools without breaking character: Claude Haiku 4.5's higher tool_calling (5 vs 4) and persona_consistency (5 vs 4) reduce accidental persona drift when invoking functions. Devstral 2 2512 will produce cleaner structured outputs (structured_output 5 vs 4), helpful when you need rigid JSON responses tied to a persona, but you may need extra guardrails to prevent injection.
- Memory-heavy multi-session assistant that must recall and act on persona history: both models have strong long_context (5), but Claude Haiku 4.5's top persona_consistency score means fewer role-resets over repeated turns. If budget is a priority, Devstral 2 2512 is significantly cheaper (input 0.4 vs 1, output 2 vs 5 per mTok) and can be tuned for consistent output with additional instruction scaffolding.
- Safety-sensitive persona (refusing harmful prompts while staying in character): Claude Haiku 4.5 has higher safety_calibration (2 vs Devstral's 1) and should refuse inappropriate role requests more reliably in our tests, though neither scores high in absolute terms and additional safety layers are recommended.
Bottom Line
For Persona Consistency, choose Claude Haiku 4.5 if you need the strongest out-of-the-box resistance to persona drift and injection (scores 5 vs 4) and higher faithfulness/tool_calling in our testing. Choose Devstral 2 2512 if you need cheaper inference and superior structured_output (5 vs 4) and are willing to add prompt safeguards to reach the same persona reliability.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.