Claude Haiku 4.5 vs Devstral Small 1.1 for Faithfulness
Winner: Claude Haiku 4.5. In our testing, Haiku scores 5/5 on Faithfulness vs Devstral Small 1.1 at 4/5, and ranks 1st of 52 models vs Devstral's 33rd. Supporting signals: Haiku has higher tool_calling (5 vs 4) and long_context (5 vs 4) scores, which correlate with fewer hallucinations in long-context or tool-assisted source grounding. Devstral Small 1.1 is substantially cheaper (output $0.30/MTok vs Haiku's $5.00/MTok) but loses a full faithfulness point in our suite.
Claude Haiku 4.5 (anthropic)
Pricing: $1.00/MTok input, $5.00/MTok output

Devstral Small 1.1 (mistral)
Pricing: $0.10/MTok input, $0.30/MTok output
Task Analysis
What Faithfulness demands: reliably stick to the provided source material, avoid adding unsupported facts, and preserve factual structure and attributions. The key capabilities are accurate retrieval from long contexts, precise tool interactions (search/QA agents), and structured output that mirrors source content. In our testing, the primary signal for this task is the faithfulness score itself: Claude Haiku 4.5 scored 5/5 on the faithfulness test in our 12-test suite; Devstral Small 1.1 scored 4/5. Supporting proxies from our data point the same way: Haiku's tool_calling is 5 vs Devstral's 4 (better tool selection and argument accuracy in our tests), Haiku's long_context is 5 vs Devstral's 4 (a 200,000-token window vs 131,072), and structured_output is 4 for both (comparable JSON/schema adherence). These proxies explain why Haiku is likelier to produce faithful outputs on long documents and multi-step retrieval workflows; Devstral is competent but shows a measurable drop in the same supporting dimensions.
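To make the "stick to the source" requirement concrete, here is a minimal sketch of a grounded-summarization call with an explicit faithfulness constraint baked into the prompt. The `call_model` helper is hypothetical (a stand-in for whatever provider client you use); the prompt pattern, not the API, is the point.

```python
# Minimal grounded-summarization sketch. `call_model` is a hypothetical
# helper standing in for your provider's client (Anthropic, Mistral, etc.);
# the faithfulness-constraining prompt pattern is what matters here.

GROUNDED_PROMPT = """You are a summarizer. Use ONLY the source below.
- Do not add facts that are not stated in the source.
- Preserve names, numbers, and attributions exactly as written.
- If the source does not answer something, say "not stated in source".

<source>
{source}
</source>

Summarize the source in {n_bullets} bullet points."""


def grounded_summary(call_model, model_id: str, source: str, n_bullets: int = 5) -> str:
    """Ask `model_id` for a summary constrained to the source text."""
    prompt = GROUNDED_PROMPT.format(source=source, n_bullets=n_bullets)
    return call_model(model=model_id, prompt=prompt)
```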
Practical Examples
1) Long legal or research brief summarization: Claude Haiku 4.5 (faithfulness 5, long_context 5, 200,000-token window) minimizes omission and hallucination on 30k–100k-token inputs in our tests. Devstral Small 1.1 (faithfulness 4, long_context 4, 131,072-token window) performs well but is one point lower on faithfulness and may require chunking or extra verification (see the chunking sketch after this list).
2) Tool-enabled source grounding (retrieval + synthesis): Haiku's tool_calling 5 vs Devstral's 4 in our testing means Haiku more reliably sequences tool calls and preserves source attributions when chaining search and summarization.
3) High-volume, cost-sensitive pipelines that can tolerate small drops in faithfulness: Devstral Small 1.1 is dramatically cheaper (input $0.10/MTok, output $0.30/MTok) than Haiku (input $1.00/MTok, output $5.00/MTok); for routine classification or short-document extraction where our measured faithfulness gap is acceptable, Devstral is the cost-effective choice.
4) Structured-data extraction and JSON compliance: both models score 4 on structured_output in our tests, so for schema-constrained extraction either model produces comparable format adherence; prefer Haiku when fidelity to content matters most.
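Where Devstral's smaller window forces chunking (example 1), a simple token-budgeted splitter covers most summarize-then-merge pipelines. This sketch approximates token counts at roughly 4 characters per token, which is a rough heuristic, not Devstral's actual tokenizer.

```python
# Token-budgeted chunker for models with smaller context windows
# (e.g. ~131k tokens). Uses a rough ~4 chars/token heuristic as a
# stand-in for the model's real tokenizer -- an assumption, not fact.

CHARS_PER_TOKEN = 4  # crude average for English prose


def chunk_by_token_budget(text: str, max_tokens: int = 100_000) -> list[str]:
    """Split `text` into pieces that each fit under `max_tokens`,
    breaking on paragraph boundaries where possible.
    (A single over-budget paragraph is kept whole; split further if needed.)"""
    budget_chars = max_tokens * CHARS_PER_TOKEN
    chunks, current, current_len = [], [], 0
    for para in text.split("\n\n"):
        # +2 accounts for the paragraph separator re-inserted on join
        if current and current_len + len(para) + 2 > budget_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += len(para) + 2
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```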
Bottom Line
For Faithfulness, choose Claude Haiku 4.5 if you need the highest stick-to-source reliability on long documents or tool-assisted retrieval workflows (Haiku: faithfulness 5, long_context 5, tool_calling 5; output $5.00/MTok). Choose Devstral Small 1.1 if you prioritize cost and throughput and can accept a modest drop in faithfulness (Devstral: faithfulness 4, long_context 4, tool_calling 4; output $0.30/MTok).
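That decision rule can be expressed as a small router. The model IDs and thresholds below are illustrative assumptions; the prices ($/MTok) and faithfulness scores are the ones from the comparison above.

```python
# Illustrative model router encoding the bottom-line guidance above.
# Model IDs and thresholds are assumptions for this sketch; prices
# ($/MTok) and faithfulness scores come from the comparison above.

MODELS = {
    "claude-haiku-4.5": {"faithfulness": 5, "window": 200_000,
                         "in_per_mtok": 1.00, "out_per_mtok": 5.00},
    "devstral-small-1.1": {"faithfulness": 4, "window": 131_072,
                           "in_per_mtok": 0.10, "out_per_mtok": 0.30},
}


def pick_model(doc_tokens: int, needs_max_fidelity: bool) -> str:
    """Prefer Haiku for long or fidelity-critical jobs; otherwise
    take the cheaper Devstral."""
    if needs_max_fidelity or doc_tokens > MODELS["devstral-small-1.1"]["window"]:
        return "claude-haiku-4.5"
    return "devstral-small-1.1"


def estimated_cost(model: str, in_tokens: int, out_tokens: int) -> float:
    """Dollar cost estimate from the per-MTok prices above."""
    m = MODELS[model]
    return (in_tokens * m["in_per_mtok"] + out_tokens * m["out_per_mtok"]) / 1_000_000
```

For instance, `pick_model(150_000, False)` routes to Haiku purely on window size, while a 50k-token summarization that tolerates the one-point faithfulness gap routes to Devstral at roughly one-tenth the input cost.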
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
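For readers curious about the shape of that scoring step, here is a sketch of a 1–5 LLM-judge rubric call in the spirit of the suite described above; the rubric wording and the `call_model` helper are illustrative assumptions, not our exact harness.

```python
# Sketch of a 1-5 LLM-judge scoring step, in the spirit of the suite
# described above. The rubric text and `call_model` helper are
# illustrative assumptions, not the production harness.

JUDGE_PROMPT = """Score the RESPONSE for faithfulness to the SOURCE on a 1-5 scale:
5 = every claim is supported by the source; 1 = mostly unsupported.
Reply with a single integer.

<source>{source}</source>
<response>{response}</response>"""


def judge_faithfulness(call_model, judge_id: str, source: str, response: str) -> int:
    """Return the judge's 1-5 faithfulness score for `response`."""
    raw = call_model(model=judge_id, prompt=JUDGE_PROMPT.format(
        source=source, response=response))
    return int(raw.strip())
```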