Claude Haiku 4.5 vs Devstral Medium for Faithfulness

Winner: Claude Haiku 4.5. In our testing Claude Haiku 4.5 scores 5/5 for Faithfulness vs Devstral Medium's 4/5, and ranks 1 of 52 vs 33 of 52. That 1‑point lead is supported by Haiku's stronger tool_calling (5 vs 3), long_context (5 vs 4), and persona_consistency (5 vs 3), all of which reduce hallucination risk in grounded tasks. Devstral Medium is competent (4/5) but trails on function selection and very long‑context retrieval. No external benchmark is available for this task; our internal scores are the basis for this verdict.

anthropic

Claude Haiku 4.5

Overall
4.33/5 Strong

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$1.00/MTok

Output

$5.00/MTok

Context Window: 200K

modelpicker.net

mistral

Devstral Medium

Overall
3.17/5 Usable

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
3/5
Classification
4/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.40/MTok

Output

$2.00/MTok

Context Window: 131K


Task Analysis

Faithfulness demands strict adherence to source material, minimal invention, correct citation or structured output, and reliable retrieval across long contexts. The primary measure here is our internal faithfulness test: Claude Haiku 4.5 = 5, Devstral Medium = 4 (rank 1 vs 33 of 52). The key capabilities that drive faithfulness in our suite are tool_calling (selecting the right retrieval/lookup functions and returning accurate arguments), long_context handling (retrieving facts from very large inputs), structured_output (format and schema fidelity), and persona_consistency (resisting prompt injection that causes stray assertions). In our data Claude Haiku 4.5 outperforms Devstral Medium on tool_calling (5 vs 3) and long_context (5 vs 4), while both match on structured_output (4). Those internal proxies explain why Haiku stays closer to sources in multi-step, tool-augmented, or extremely long‑context workflows.

Practical Examples

Where Claude Haiku 4.5 shines (faithfulness):

  • Long legal or regulatory summaries: Haiku's 200,000‑token context and long_context=5 reduce omission and invented claims compared with Devstral (131,072 tokens, long_context=4).
  • Tool‑augmented retrieval pipelines: Haiku's tool_calling=5 means it more reliably selects and populates the correct search or citation functions, lowering hallucination risk relative to Devstral (tool_calling=3).
  • Structured, provenance‑sensitive outputs: Haiku (faithfulness=5, structured_output=4) is the better choice when you need exact adherence to source text and consistent attribution.

Where Devstral Medium is appropriate (faithfulness):

  • Shorter document synthesis or routing tasks: Devstral Medium scores 4/5 on faithfulness and matches Haiku's structured_output=4, making it a solid, lower‑cost choice for concise summaries or classification.
  • Cost‑sensitive retrieval with moderate context: Devstral's lower input/output rates (input $0.40 per MTok, output $2 per MTok) keep bills down while delivering acceptable faithfulness for many workflows.

Concrete numbers to ground choices: Haiku input/output costs = $1 / $5 per MTok with a 200,000‑token context window; Devstral Medium input/output costs = $0.40 / $2 per MTok with a 131,072‑token context window. In our tests, Haiku scores faithfulness 5, tool_calling 5, long_context 5; Devstral scores faithfulness 4, tool_calling 3, long_context 4.
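To make the pricing gap concrete, here is a minimal sketch of the per-request cost math under the listed rates. The workload size (50K input tokens, 2K output tokens per request) is a hypothetical example, not a measured figure:

```python
# Per-MTok rates as listed above; workload sizes are illustrative assumptions.
def request_cost(in_tokens: int, out_tokens: int, in_rate: float, out_rate: float) -> float:
    """Dollar cost of one request at per-million-token (MTok) rates."""
    return (in_tokens / 1_000_000) * in_rate + (out_tokens / 1_000_000) * out_rate

# Hypothetical workload: 50K input tokens, 2K output tokens per request.
haiku = request_cost(50_000, 2_000, in_rate=1.00, out_rate=5.00)      # Claude Haiku 4.5
devstral = request_cost(50_000, 2_000, in_rate=0.40, out_rate=2.00)   # Devstral Medium

print(f"Haiku:    ${haiku:.3f}/request")
print(f"Devstral: ${devstral:.3f}/request")
# At these rates Haiku costs 2.5x more per request for the same token counts.
```

At any token mix, Haiku's rates are exactly 2.5x Devstral's on both input and output, so the ratio holds regardless of workload shape; the question is whether the faithfulness and long-context gains justify that multiplier.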

Bottom Line

For Faithfulness, choose Claude Haiku 4.5 if you need the highest fidelity to source material across very long documents or tool‑augmented pipelines and you can accept higher costs (input $1 / output $5 per MTok). Choose Devstral Medium if you need a cost‑efficient option that still scores well on faithfulness (4/5) for shorter documents, faster development, or budgeted production (input $0.40 / output $2 per MTok).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions