Claude Haiku 4.5 vs Codestral 2508 for Faithfulness

Winner: Claude Haiku 4.5. In our testing both Claude Haiku 4.5 and Codestral 2508 score 5/5 on the Faithfulness task (sticking to source material without hallucinating). Because faithfulness in production depends not only on raw fidelity but also on maintaining a consistent voice and appropriate refusals, Claude Haiku 4.5 has the practical edge: persona_consistency 5 vs 3 and safety_calibration 2 vs 1 (our scores). Codestral 2508 wins on structured_output (5 vs 4), so it may be preferable when strict schema adherence is the dominant requirement, but for general faithfulness under real-world conversational and safety constraints, Claude Haiku 4.5 is the better choice.

anthropic

Claude Haiku 4.5

Overall
4.33/5 Strong

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$1.00/MTok

Output

$5.00/MTok

Context Window: 200K

modelpicker.net

mistral

Codestral 2508

Overall
3.50/5 Strong

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
4/5
Tool Calling
5/5
Classification
3/5
Agentic Planning
4/5
Structured Output
5/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.30/MTok

Output

$0.90/MTok

Context Window: 256K


Task Analysis

What faithfulness demands: the model must stick to provided source material, avoid inventing facts, cite or preserve original phrasing when required, and refuse or flag requests that would force fabrication. Capabilities that matter:

- long_context: accurate retrieval across long inputs
- tool_calling: consulting external sources or instruments reliably
- structured_output: keeping citations and provenance in machine-readable form
- persona_consistency: avoiding injection-driven drift that creates false attributions
- safety_calibration: declining or hedging when the source is insufficient

In our testing both models achieved the top task score (5/5) on Faithfulness. The practical differences show up downstream: Claude Haiku 4.5 scores higher on persona_consistency (5 vs 3) and safety_calibration (2 vs 1), which reduces the risk of subtle hallucinations during multi-turn dialogue or when users press for unsupported claims. Codestral 2508 scores higher on structured_output (5 vs 4), which supports strict schema compliance and machine-parseable citations. Both tie on tool_calling (5) and long_context (5), so raw retrieval and function-invocation behavior were equal on our suite.

Practical Examples

  1. Legal-document fidelity (long multi-page source + conversational Q&A): choose Claude Haiku 4.5. Both models scored 5/5 on Faithfulness, but Haiku's persona_consistency of 5 and safety_calibration of 2 reduce the risk of the model inventing contractual obligations during follow-up prompts.
  2. Automated JSON citations for news ingestion (strict schema required): choose Codestral 2508. Codestral's structured_output is 5 vs Haiku's 4, so it more reliably emits compliant JSON and preserves source fields.
  3. Developer-facing toolchain that calls external validators (tool calling + structured output): both models scored 5 on tool_calling and 5 on Faithfulness; pick Codestral 2508 where cost and strict schema adherence matter ($0.30/MTok input, $0.90/MTok output).
  4. Customer support that must avoid confident hallucinations while maintaining persona (multi-turn): Claude Haiku 4.5 is preferable (persona_consistency 5 vs 3; $1.00/MTok input, $5.00/MTok output).
  5. Bulk programmatic checks over very long contexts: both models score 5 on long_context, so choose based on structured-output needs and cost tradeoffs rather than raw faithfulness.
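The strict-schema citation case above can be enforced with a post-hoc check rather than trusted to the model alone. Below is a minimal sketch of such a validator; the field names ("claim", "source_url", "quote") are illustrative assumptions, not a schema from either model's documentation.

```python
import json

# Required provenance fields for each citation (assumed for illustration).
REQUIRED_FIELDS = {"claim", "source_url", "quote"}

def validate_citations(raw: str) -> list[str]:
    """Return a list of problems found in a model's citation JSON output."""
    try:
        citations = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(citations, list):
        return ["expected a top-level JSON array of citations"]
    problems = []
    for i, item in enumerate(citations):
        missing = REQUIRED_FIELDS - set(item)
        if missing:
            problems.append(f"citation {i} missing fields: {sorted(missing)}")
    return problems

output = '[{"claim": "X", "source_url": "https://example.com", "quote": "..."}]'
print(validate_citations(output))  # → []
```

A check like this catches schema drift regardless of which model produced the output, which narrows the practical gap between a structured_output score of 4 and 5.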

Bottom Line

For Faithfulness, choose Claude Haiku 4.5 if you need the safest conversational fidelity and consistent attribution across multi-turn dialogs (Haiku: faithfulness 5, persona_consistency 5 vs Codestral's 3; safety_calibration 2 vs 1). Choose Codestral 2508 if strict machine-readable outputs and lower inference cost are your priorities (Codestral: structured_output 5 vs Haiku's 4; $0.30/MTok input, $0.90/MTok output). Note that both models score 5/5 on our Faithfulness test itself.
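The cost tradeoff above is easy to quantify from the listed per-MTok prices. The sketch below computes the cost of a single faithfulness-style request (a long source document plus a short answer); the token counts are an assumed example, not measurements.

```python
# USD per million tokens (input, output), taken from the pricing cards above.
PRICES = {
    "Claude Haiku 4.5": (1.00, 5.00),
    "Codestral 2508": (0.30, 0.90),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the listed per-MTok rates."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# Example: a 50K-token source document and a 2K-token answer.
for model in PRICES:
    print(model, round(request_cost(model, 50_000, 2_000), 4))
```

At these assumed volumes the per-request costs come out to roughly $0.06 for Claude Haiku 4.5 and $0.017 for Codestral 2508, which is the sort of gap that matters mainly for high-throughput pipelines.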

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions