Claude Haiku 4.5 vs Devstral Medium for Faithfulness

Winner: Claude Haiku 4.5. In our testing Claude Haiku 4.5 scores 5/5 for Faithfulness vs Devstral Medium's 4/5, and ranks 1 of 52 vs 33 of 52. That 1‑point lead is supported by Haiku's stronger tool_calling (5 vs 3), long_context (5 vs 4), and persona_consistency (5 vs 3), all of which reduce hallucination risk in grounded tasks. Devstral Medium is competent (4/5) but trails on function selection and very long‑context retrieval. No external benchmark is available for this task; our internal scores are the basis for this verdict.

anthropic

Claude Haiku 4.5

Overall
4.33/5 Strong

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$1.00/MTok

Output

$5.00/MTok

Context Window: 200K

modelpicker.net

mistral

Devstral Medium

Overall
3.17/5 Usable

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
3/5
Classification
4/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.40/MTok

Output

$2.00/MTok

Context Window: 131K


Task Analysis

Faithfulness demands strict adherence to source material, minimal invention, correct citation or structured output, and reliable retrieval across long contexts. The primary measure here is our internal faithfulness test: Claude Haiku 4.5 = 5, Devstral Medium = 4 (rank 1 vs 33 of 52). The key capabilities that drive faithfulness in our suite are tool_calling (selecting the right retrieval/lookup functions and returning accurate arguments), long_context handling (retrieving facts from very large inputs), structured_output (format and schema fidelity), and persona_consistency (resisting prompt injection that causes stray assertions). In our data Claude Haiku 4.5 outperforms Devstral Medium on tool_calling (5 vs 3) and long_context (5 vs 4), while both match on structured_output (4). Those internal proxies explain why Haiku stays closer to sources in multi-step, tool-augmented, or extremely long‑context workflows.

Practical Examples

Where Claude Haiku 4.5 shines (faithfulness):

  • Long legal or regulatory summaries: Haiku's 200,000‑token context and long_context=5 reduce omission and invented claims compared with Devstral (131,072 tokens, long_context=4).
  • Tool‑augmented retrieval pipelines: Haiku's tool_calling=5 means it more reliably selects and populates the correct search or citation functions, lowering hallucination risk relative to Devstral (tool_calling=3).
  • Structured, provenance‑sensitive outputs: Haiku (faithfulness=5, structured_output=4) is the better choice when you need exact adherence to source text and consistent attribution.

Where Devstral Medium is appropriate (faithfulness):

  • Shorter document synthesis or routing tasks: Devstral Medium scores 4/5 on faithfulness and matches Haiku's structured_output=4, making it a solid, lower‑cost choice for concise summaries or classification.
  • Cost‑sensitive retrieval with moderate context: Devstral's lower input/output rates (input $0.40 per MTok, output $2 per MTok) keep bills down while delivering acceptable faithfulness for many workflows.

Concrete numbers to ground choices: Haiku input/output costs = $1 / $5 per MTok with a 200,000‑token context window; Devstral Medium input/output costs = $0.40 / $2 per MTok with a 131,072‑token context window. In our tests, Haiku scores faithfulness 5, tool_calling 5, long_context 5; Devstral scores faithfulness 4, tool_calling 3, long_context 4.
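To make the pricing gap concrete, here is a minimal sketch of the per-request cost math under the listed rates. The workload size (50K input tokens, 2K output tokens per request) is a hypothetical example, not a measured figure:

```python
# Per-MTok rates as listed above; workload sizes are illustrative assumptions.
def request_cost(in_tokens: int, out_tokens: int, in_rate: float, out_rate: float) -> float:
    """Dollar cost of one request at per-million-token (MTok) rates."""
    return (in_tokens / 1_000_000) * in_rate + (out_tokens / 1_000_000) * out_rate

# Hypothetical workload: 50K input tokens, 2K output tokens per request.
haiku = request_cost(50_000, 2_000, in_rate=1.00, out_rate=5.00)      # Claude Haiku 4.5
devstral = request_cost(50_000, 2_000, in_rate=0.40, out_rate=2.00)   # Devstral Medium

print(f"Haiku:    ${haiku:.3f}/request")
print(f"Devstral: ${devstral:.3f}/request")
# At these rates Haiku costs 2.5x more per request for the same token counts.
```

At any token mix, Haiku's rates are exactly 2.5x Devstral's on both input and output, so the ratio holds regardless of workload shape; the question is whether the faithfulness and long-context gains justify that multiplier.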

Bottom Line

For Faithfulness, choose Claude Haiku 4.5 if you need the highest fidelity to source material across very long documents or tool‑augmented pipelines and you can accept higher costs (input $1 / output $5 per MTok). Choose Devstral Medium if you need a cost‑efficient option that still scores well on faithfulness (4/5) for shorter documents, faster development, or budgeted production (input $0.40 / output $2 per MTok).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions