Claude Haiku 4.5 vs Devstral 2 2512 for Faithfulness

Winner: Claude Haiku 4.5. In our testing Claude Haiku 4.5 scores 5/5 on Faithfulness versus Devstral 2 2512's 4/5, a clear 1-point advantage. Haiku ranks 1st of 52 models for Faithfulness while Devstral ranks 33rd. The gap is explained by Haiku's stronger Tool Calling (5 vs 4) and Persona Consistency (5 vs 4) alongside equal Long Context scores (5 vs 5); Devstral's edge in Structured Output (5 vs 4) helps with strict format adherence but does not close the faithfulness gap.

Anthropic

Claude Haiku 4.5

Overall: 4.33/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $1.00/MTok
Output: $5.00/MTok
Context Window: 200K

Mistral

Devstral 2 2512

Overall: 4.00/5 (Strong)

Benchmark Scores

Faithfulness: 4/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 4/5
Persona Consistency: 4/5
Constrained Rewriting: 5/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.40/MTok
Output: $2.00/MTok
Context Window: 262K

Task Analysis

What Faithfulness demands: produce outputs that stick to the source material, avoid invented facts, and preserve factual relationships when summarizing, citing, or transforming content. The capabilities that matter most are reliable long-context retrieval (to access the source), tool calling or grounding (to fetch and quote the source verbatim), persona consistency (to avoid injected claims), and structured output (to enforce exact formats). No external benchmarks are available for this comparison, so our internal faithfulness test is the primary measure. In our testing Claude Haiku 4.5 scores 5/5 on the Faithfulness task while Devstral 2 2512 scores 4/5. Supporting diagnostics: Haiku leads on Tool Calling (5 vs 4) and Persona Consistency (5 vs 4), both relevant to refusing to invent or hallucinate claims; Long Context is equal (5 vs 5), so retrieval distance is not the differentiator. Devstral's Structured Output is stronger (5 vs 4), meaning it will better follow strict JSON/schema constraints even when its content faithfulness is slightly weaker.
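As a rough illustration of the grounding pattern described above, the sketch below constrains a model to answer only from supplied source text and to quote its evidence. It uses the Anthropic Python SDK; the model ID, prompt wording, and the summarization question are illustrative assumptions, not part of our test harness.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SOURCE = """...paste the document(s) the answer must stay faithful to..."""

message = client.messages.create(
    model="claude-haiku-4-5",  # placeholder model ID; check the current name in Anthropic's docs
    max_tokens=1024,
    system=(
        "Answer using ONLY the text inside <source> tags. "
        "Quote supporting passages verbatim. "
        "If the source does not contain the answer, say so instead of guessing."
    ),
    messages=[{
        "role": "user",
        "content": f"<source>{SOURCE}</source>\n\nSummarize the obligations of each party.",
    }],
)
print(message.content[0].text)
```

The same prompt structure works with any provider; the point is that the source is delimited explicitly and the instructions forbid answering from outside it.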

Practical Examples

1. Multi-source summarization for legal excerpts: Choose Claude Haiku 4.5. In our tests Haiku's Faithfulness score of 5 plus its Tool Calling score of 5 mean it is less likely to introduce unsupported claims when synthesizing multiple long documents.
2. API-driven citation generation (strict JSON schema): Choose Devstral 2 2512 when exact format matters. Devstral's Structured Output score of 5 helps it produce schema-compliant citations, but expect a slightly higher risk of minor hallucinations (Faithfulness 4 vs Haiku's 5).
3. Long-context fact retrieval (30K+ tokens): Both models score 5 on Long Context, so either can retrieve distant evidence; prefer Haiku if the priority is factual fidelity (5 vs 4).
4. Cost-sensitive faithful pipelines: Devstral is cheaper ($0.40/MTok input, $2.00/MTok output) than Haiku ($1.00/MTok input, $5.00/MTok output). If you can tolerate one fewer faithfulness point to save on inference cost, Devstral is the practical choice; see the worked cost sketch below.
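To make the cost trade-off in example 4 concrete, here is a back-of-the-envelope calculation using the listed prices. The per-request token volumes (60K input, 1K output) are illustrative assumptions; substitute your own workload numbers.

```python
# Per-million-token prices from the comparison above (USD)
HAIKU = {"input": 1.00, "output": 5.00}
DEVSTRAL = {"input": 0.40, "output": 2.00}

def request_cost(prices, input_tokens, output_tokens):
    """Cost in USD for a single request at the given per-MTok prices."""
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000

# Assumed workload: a long-context summarization call with 60K input and 1K output tokens
for name, prices in (("Claude Haiku 4.5", HAIKU), ("Devstral 2 2512", DEVSTRAL)):
    print(f"{name}: ${request_cost(prices, 60_000, 1_000):.4f} per request")

# Haiku:    (60,000 * 1.00 + 1,000 * 5.00) / 1e6 = $0.0650 per request
# Devstral: (60,000 * 0.40 + 1,000 * 2.00) / 1e6 = $0.0260 per request
```

At these assumed volumes Devstral costs roughly 40% of what Haiku does per request; whether that savings is worth one faithfulness point depends on how costly an unsupported claim is in your application.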

Bottom Line

For Faithfulness, choose Claude Haiku 4.5 if you need the highest fidelity to source material (5/5 in our tests, ranked 1st). Choose Devstral 2 2512 if you need strict structured output at a lower inference cost (Structured Output 5/5, $0.40/MTok input, $2.00/MTok output) and can accept a slightly lower faithfulness score (4/5).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
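For readers who want to run a similar faithfulness check on their own outputs, the sketch below shows one generic way to ask a judge model for a 1-5 score and parse the reply. It is not modelpicker.net's actual rubric or harness; the prompt wording is an assumption, and `call_judge` is a hypothetical stand-in you would wire to your own LLM client (for example, the API snippet earlier in this page).

```python
import re

JUDGE_PROMPT = """You are grading an answer for faithfulness to a source document.
Score 1-5: 5 = every claim is supported by the source; 1 = mostly unsupported or invented.
<source>{source}</source>
<answer>{answer}</answer>
Reply with the score only, e.g. "4"."""

def parse_score(reply: str) -> int:
    """Pull the first digit 1-5 out of the judge's reply."""
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"No score found in judge reply: {reply!r}")
    return int(match.group())

def call_judge(prompt: str) -> str:
    # Hypothetical placeholder: replace with a call to whichever judge model you use.
    raise NotImplementedError("wire this to your judge model")

# Example usage:
# score = parse_score(call_judge(JUDGE_PROMPT.format(source=src, answer=ans)))
```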

Frequently Asked Questions