Claude Haiku 4.5 vs Devstral Small 1.1 for Faithfulness

Winner: Claude Haiku 4.5. In our testing, Haiku scores 5/5 on Faithfulness vs Devstral Small 1.1's 4/5, and ranks 1st of 52 models vs Devstral's 33rd. Supporting signals: Haiku scores higher on tool calling (5 vs 4) and long context (5 vs 4), both of which correlate with fewer hallucinations on long or tool-assisted source grounding. Devstral Small 1.1 is substantially cheaper (output $0.30/MTok vs Haiku's $5.00/MTok) but loses a full faithfulness point in our suite.

anthropic

Claude Haiku 4.5

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $1.00/MTok
Output: $5.00/MTok

Context Window: 200K

modelpicker.net

mistral

Devstral Small 1.1

Overall
3.08/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 2/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Persona Consistency: 2/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.10/MTok
Output: $0.30/MTok

Context Window: 131K


Task Analysis

What Faithfulness demands: reliably sticking to the provided source material, avoiding unsupported facts, and preserving factual structure and attributions. Key capabilities that matter: accurate retrieval from long contexts, precise tool interactions (search/QA agents), and structured output that mirrors source content. In our testing, the primary signal for this task is the faithfulness score itself: Claude Haiku 4.5 scored 5/5 on the faithfulness test in our 12-test suite; Devstral Small 1.1 scored 4/5. Supporting proxies from our data: Haiku's tool calling is 5 vs Devstral's 4 (better tool selection and argument accuracy in our tests), Haiku's long context is 5 vs Devstral's 4 (a 200,000-token window vs 131,072), and structured output is equal at 4 for both (comparable JSON/schema adherence). Together, these proxies explain why Haiku is likelier to produce faithful outputs on long documents and multi-step retrieval workflows; Devstral is competent but shows a measurable drop across the same supporting dimensions.

Practical Examples

  1. Long legal or research brief summarization: Claude Haiku 4.5 (faithfulness 5, long context 5, context window 200,000) minimizes omission and hallucination on 30K–100K token inputs in our tests. Devstral Small 1.1 (faithfulness 4, long context 4, window 131,072) performs well but is one point lower on faithfulness and may require chunking or extra verification.
  2. Tool-enabled source grounding (retrieval + synthesis): Haiku's tool calling of 5 vs Devstral's 4 in our testing means Haiku more reliably sequences tool calls and preserves source attributions when chaining search and summarization.
  3. High-volume, cost-sensitive pipelines that can tolerate small drops in faithfulness: Devstral Small 1.1 is dramatically cheaper (input $0.10/MTok, output $0.30/MTok) than Haiku (input $1.00/MTok, output $5.00/MTok); in routine classification or short-document extraction where our measured faithfulness gap is acceptable, Devstral is cost-effective.
  4. Structured-data extraction and JSON compliance: both models score 4 on structured output in our tests, so for schema-constrained extraction either model produces comparable format adherence; prefer Haiku when fidelity to content matters most.
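The chunk-and-verify workflow mentioned in example 1 can be sketched minimally. This is an illustrative outline only: the chunk size and overlap below are hypothetical defaults, not tuned values, and the actual summarization call is left out.

```python
def chunk_text(text: str, chunk_size: int = 8000, overlap: int = 500) -> list[str]:
    """Split a long document into overlapping character chunks so a model
    with a smaller context window can process it piecewise. The overlap
    preserves sentences that straddle a chunk boundary."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap  # step back by the overlap
    return chunks
```

In practice you would summarize each chunk separately, then verify every claim in the merged summary against its source chunk before accepting it, which is the "extra verification" cost a lower-faithfulness model incurs.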

Bottom Line

For Faithfulness, choose Claude Haiku 4.5 if you need the highest stick-to-source reliability on long documents or tool-assisted retrieval workflows (Haiku: faithfulness 5, long context 5, tool calling 5; output $5.00/MTok). Choose Devstral Small 1.1 if you prioritize cost and throughput and can accept a modest drop in faithfulness (Devstral: faithfulness 4, long context 4, tool calling 4; output $0.30/MTok).
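The price gap behind that trade-off is easy to quantify. A minimal sketch using the per-million-token prices from the cards above; the 20K-input / 1K-output workload is a hypothetical example, not a measured average:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Dollar cost of one request, given per-million-token (MTok) prices."""
    return input_tokens / 1e6 * input_price + output_tokens / 1e6 * output_price

# Claude Haiku 4.5: $1.00/MTok input, $5.00/MTok output
haiku = request_cost(20_000, 1_000, 1.00, 5.00)      # $0.0250 per request
# Devstral Small 1.1: $0.10/MTok input, $0.30/MTok output
devstral = request_cost(20_000, 1_000, 0.10, 0.30)   # $0.0023 per request
```

At this workload Devstral is roughly 11x cheaper per request, which is the margin you weigh against the one-point faithfulness drop.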

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions