Claude Haiku 4.5 vs Devstral Small 1.1 for Faithfulness

Winner: Claude Haiku 4.5. In our testing, Haiku scores 5/5 on Faithfulness vs Devstral Small 1.1's 4/5, and ranks 1st of 52 models vs Devstral's 33rd. Supporting signals: Haiku scores higher on tool calling (5 vs 4) and long context (5 vs 4), both of which correlate with fewer hallucinations on long or tool-assisted source grounding. Devstral Small 1.1 is substantially cheaper (output $0.30/MTok vs Haiku's $5.00/MTok) but loses a full faithfulness point in our suite.

anthropic

Claude Haiku 4.5

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $1.00/MTok
Output: $5.00/MTok

Context Window: 200K

modelpicker.net

mistral

Devstral Small 1.1

Overall
3.08/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 2/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Persona Consistency: 2/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.10/MTok
Output: $0.30/MTok

Context Window: 131K


Task Analysis

What Faithfulness demands: reliably sticking to the provided source material, avoiding unsupported facts, and preserving factual structure and attributions. Key capabilities that matter: accurate retrieval from long contexts, precise tool interactions (search/QA agents), and structured output that mirrors source content. In our testing, the primary signal for this task is the faithfulness score itself: Claude Haiku 4.5 scored 5/5 on the faithfulness test in our 12-test suite; Devstral Small 1.1 scored 4/5. Supporting proxies from our data: Haiku's tool calling is 5 vs Devstral's 4 (better tool selection and argument accuracy in our tests), Haiku's long context is 5 vs Devstral's 4 (a 200,000-token window vs 131,072), and structured output is equal at 4 for both (comparable JSON/schema adherence). Together, these proxies explain why Haiku is likelier to produce faithful outputs on long documents and multi-step retrieval workflows; Devstral is competent but shows a measurable drop across the same supporting dimensions.

Practical Examples

  1. Long legal or research brief summarization: Claude Haiku 4.5 (faithfulness 5, long context 5, context window 200,000) minimizes omission and hallucination on 30K–100K token inputs in our tests. Devstral Small 1.1 (faithfulness 4, long context 4, window 131,072) performs well but is one point lower on faithfulness and may require chunking or extra verification.
  2. Tool-enabled source grounding (retrieval + synthesis): Haiku's tool calling of 5 vs Devstral's 4 in our testing means Haiku more reliably sequences tool calls and preserves source attributions when chaining search and summarization.
  3. High-volume, cost-sensitive pipelines that can tolerate small drops in faithfulness: Devstral Small 1.1 is dramatically cheaper (input $0.10/MTok, output $0.30/MTok) than Haiku (input $1.00/MTok, output $5.00/MTok); in routine classification or short-document extraction where our measured faithfulness gap is acceptable, Devstral is cost-effective.
  4. Structured-data extraction and JSON compliance: both models score 4 on structured output in our tests, so for schema-constrained extraction either model produces comparable format adherence; prefer Haiku when fidelity to content matters most.
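The chunk-and-verify workflow mentioned in example 1 can be sketched minimally. This is an illustrative outline only: the chunk size and overlap below are hypothetical defaults, not tuned values, and the actual summarization call is left out.

```python
def chunk_text(text: str, chunk_size: int = 8000, overlap: int = 500) -> list[str]:
    """Split a long document into overlapping character chunks so a model
    with a smaller context window can process it piecewise. The overlap
    preserves sentences that straddle a chunk boundary."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap  # step back by the overlap
    return chunks
```

In practice you would summarize each chunk separately, then verify every claim in the merged summary against its source chunk before accepting it, which is the "extra verification" cost a lower-faithfulness model incurs.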

Bottom Line

For Faithfulness, choose Claude Haiku 4.5 if you need the highest stick-to-source reliability on long documents or tool-assisted retrieval workflows (Haiku: faithfulness 5, long context 5, tool calling 5; output $5.00/MTok). Choose Devstral Small 1.1 if you prioritize cost and throughput and can accept a modest drop in faithfulness (Devstral: faithfulness 4, long context 4, tool calling 4; output $0.30/MTok).
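The price gap behind that trade-off is easy to quantify. A minimal sketch using the per-million-token prices from the cards above; the 20K-input / 1K-output workload is a hypothetical example, not a measured average:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Dollar cost of one request, given per-million-token (MTok) prices."""
    return input_tokens / 1e6 * input_price + output_tokens / 1e6 * output_price

# Claude Haiku 4.5: $1.00/MTok input, $5.00/MTok output
haiku = request_cost(20_000, 1_000, 1.00, 5.00)      # $0.0250 per request
# Devstral Small 1.1: $0.10/MTok input, $0.30/MTok output
devstral = request_cost(20_000, 1_000, 0.10, 0.30)   # $0.0023 per request
```

At this workload Devstral is roughly 11x cheaper per request, which is the margin you weigh against the one-point faithfulness drop.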

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions