Question 1

How big is the persona-consistency gap between the two models?

Accepted Answer

In our testing Claude Haiku 4.5 scored 5/5 vs Devstral Small 1.1 at 2/5 — a 3-point gap. Task ranks reflect this: Haiku is rank 1 of 52, Devstral Small is rank 51 of 52.

Question 2

Can tuning or prompt engineering close the gap?

Accepted Answer

Prompt engineering may reduce some drift for Devstral Small 1.1, but our scores show Haiku 4.5 has stronger baseline abilities (long_context 5, faithfulness 5) that better preserve persona under adversarial or multi-turn conditions.

Question 3

What about cost tradeoffs?

Accepted Answer

Devstral Small 1.1 is substantially cheaper in our data (input_cost_per_mtok: 0.1, output_cost_per_mtok: 0.3) compared with Claude Haiku 4.5 (input_cost_per_mtok: 1, output_cost_per_mtok: 5). For low-risk automation where persona fidelity is secondary, Devstral Small can save a lot; for high-fidelity persona use-cases, Haiku’s higher accuracy justifies the cost.

Question 4

Are there other internal signals that explain Haiku’s lead?

Accepted Answer

Yes. Beyond persona_consistency (5), Claude Haiku 4.5 also scores 5 on long_context, 5 on faithfulness, and 5 on tool_calling in our tests — capabilities that help maintain persona over long interactions and when external tools or structured responses are involved. Devstral Small’s supporting scores are lower in these areas.

Claude Haiku 4.5 vs Devstral Small 1.1 for Persona Consistency

Claude Haiku 4.5

Devstral Small 1.1

Task Analysis

Practical Examples

Bottom Line

How We Test

Frequently Asked Questions