Claude Haiku 4.5 vs Gemini 2.5 Flash for Faithfulness

Claude Haiku 4.5 is the winner for Faithfulness in our testing. It scores 5/5 on the faithfulness benchmark versus Gemini 2.5 Flash's 4/5, ranking 1st of 52 models compared with Gemini's 33rd. That one-point advantage reflects stronger adherence to source material in our faithfulness evaluations. Gemini 2.5 Flash remains competent (4/5) and brings better safety calibration (4/5 vs Haiku's 2/5) plus broader multimodal input support, but for strict source fidelity as measured on our benchmark suite, Claude Haiku 4.5 is the clear choice.

anthropic

Claude Haiku 4.5

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$1.00/MTok

Output

$5.00/MTok

Context Window: 200K

modelpicker.net

google

Gemini 2.5 Flash

Overall
4.17/5 (Strong)

Benchmark Scores

Faithfulness
4/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
3/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
4/5
Strategic Analysis
3/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.300/MTok

Output

$2.50/MTok

Context Window: 1049K


Task Analysis

Faithfulness demands that a model stick closely to the provided source material, avoid inventing unsupported facts, and reliably map inputs into accurate outputs. The capabilities that matter most are accurate retrieval over long contexts, reliable tool calling for source lookups, structured-output compliance to avoid format-driven distortions, and safety calibration to refuse illegitimate prompts without inventing content. In our faithfulness test, Claude Haiku 4.5 scored 5/5 while Gemini 2.5 Flash scored 4/5; Haiku ranks 1st of 52 models, Gemini 33rd. Supporting proxy signals from our suite: both models score 5/5 on Tool Calling and Long Context, and both score 4/5 on Structured Output, which explains why both handle source retrieval and formatting well. The tradeoff to note: Haiku's Safety Calibration is 2/5 versus Gemini's 4/5, which affects how strictly the model refuses harmful or disallowed requests (see the benchmark descriptions). Cost and modality are also relevant: Haiku's output price is $5.00/MTok versus Gemini's $2.50/MTok, and Gemini accepts a wider set of input modalities (files, audio, video), which can improve end-to-end faithfulness when the sources are non-text.
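The pricing gap above is easiest to see as per-request arithmetic. This sketch uses the listed prices; the workload size (20K input / 1K output tokens per request) is an illustrative assumption, not part of our benchmark.

```python
# USD per million tokens, taken from the pricing cards above.
PRICES = {
    "Claude Haiku 4.5": {"input": 1.00, "output": 5.00},
    "Gemini 2.5 Flash": {"input": 0.30, "output": 2.50},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the listed per-MTok prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical workload: 20K input tokens, 1K output tokens per request.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 20_000, 1_000):.4f} per request")
# Claude Haiku 4.5: $0.0250 per request
# Gemini 2.5 Flash: $0.0085 per request
```

At this workload mix, Gemini 2.5 Flash comes out roughly 3x cheaper per request; the exact ratio depends on your input/output token balance.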

Practical Examples

  1. Legal contract extraction: Haiku's 5/5 versus Gemini's 4/5 in our tests means Claude Haiku 4.5 is more likely to reproduce contract clauses verbatim and avoid inserting unsupported terms, which matters when exact fidelity to source wording is required.
  2. Large-document summarization (30K+ tokens): Both models score 5/5 on Long Context and Tool Calling, so either can retrieve facts across long inputs; Haiku's higher faithfulness score still gives it a measurable edge in preserving factual details.
  3. Multimodal source verification: Gemini 2.5 Flash supports file, audio, and video inputs, has stronger Safety Calibration (4/5 vs Haiku's 2/5), and costs less per output token ($2.50 vs $5.00/MTok), so for pipelines that ingest transcripts, images, or audio, or that need stricter refusal behavior, Gemini is the pragmatic choice despite its 4/5 faithfulness score.
  4. High-volume, cost-sensitive pipelines that require reasonable fidelity: Gemini's lower output cost and 4/5 faithfulness make it the better cost-performance tradeoff.
  5. Internal tool-driven citation workflows: Both models score 5/5 on Tool Calling, so either integrates well with tool-based source lookups; prefer Haiku when maximal literal fidelity is the priority, and Gemini when safety gating or multimodal inputs matter.

Bottom Line

For Faithfulness, choose Claude Haiku 4.5 if you need the highest measured adherence to source text in our tests (5/5 vs 4/5) and can accept a higher output cost ($5.00/MTok) and a lower safety calibration score. Choose Gemini 2.5 Flash if you need strong overall fidelity with better safety calibration (4/5 vs 2/5), broader multimodal input support, and a lower output cost ($2.50/MTok).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
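The Overall figures above appear to be the unweighted mean of the twelve benchmark scores, rounded to two decimals; this sketch reproduces them from the per-benchmark scores listed on each card. The averaging rule is our inference from the numbers, not a documented formula.

```python
# Per-benchmark 1-5 scores in card order: Faithfulness, Long Context,
# Multilingual, Tool Calling, Classification, Agentic Planning,
# Structured Output, Safety Calibration, Strategic Analysis,
# Persona Consistency, Constrained Rewriting, Creative Problem Solving.
haiku_scores = [5, 5, 5, 5, 4, 5, 4, 2, 5, 5, 3, 4]
flash_scores = [4, 5, 5, 5, 3, 4, 4, 4, 3, 5, 4, 4]

def overall(scores: list[int]) -> float:
    """Unweighted mean of benchmark scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)

print(overall(haiku_scores))  # 4.33
print(overall(flash_scores))  # 4.17
```

The computed means match the published Overall ratings (4.33 and 4.17), which is consistent with a simple equal-weight average across the suite.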

Frequently Asked Questions