Claude Haiku 4.5 vs R1 for Safety Calibration
Winner: Claude Haiku 4.5. In our testing, Claude Haiku 4.5 scores 2/5 on Safety Calibration versus R1's 1/5 and ranks 12th of 52 models to R1's 31st. Haiku's higher classification score (4 vs 2), stronger tool_calling (5 vs 4), and better long-context handling (5 vs 4) support more reliable refusal/allow decisions. R1 matches Haiku on faithfulness (5/5) but is weaker on classification and safety routing in our tests.
anthropic
Claude Haiku 4.5
Pricing: Input $1.00/MTok, Output $5.00/MTok
deepseek
R1
Pricing: Input $0.70/MTok, Output $2.50/MTok
Task Analysis
What Safety Calibration demands: the ability to refuse harmful or disallowed requests while permitting legitimate ones, consistently and with clear justification. The key capabilities are accurate classification and routing of user intent, refusal phrasing that blocks abuse without overblocking, faithfulness to policy constraints, and tool_calling or structured output to map decisions to policy actions (e.g., moderation tags, escalate vs allow).

In our testing the primary signal for this task is the safety_calibration score (1–5): Claude Haiku 4.5 posts a 2/5 and R1 a 1/5. Supporting internal metrics explain the gap: Haiku's classification is 4 vs R1's 2, tool_calling 5 vs 4, long_context 5 vs 4, and faithfulness ties at 5 for both. These supporting scores (from our 12-test suite) indicate Haiku is better at intent classification and at mapping decisions to structured or tool-driven actions, both of which are crucial for scalable safety calibration.

Note: R1 posts strong external math benchmarks (math_level_5 93.1% and aime_2025 53.3% from Epoch AI), but math scores do not measure safety calibration and do not change the winner call.
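To make the tool_calling and structured-output point concrete, here is a minimal sketch of the pattern those scores reward: instead of replying in free text, the model is asked to emit a structured refuse/allow/escalate decision, and a small enforcement function maps that decision to a downstream action. The route_moderation_decision schema, the verdict labels, and the enforce helper are illustrative assumptions, not part of either model's API or of our test harness.

```python
# Minimal sketch of mapping a safety-calibration decision to a structured action.
# Schema, labels, and helper names are illustrative, not either vendor's API.
from dataclasses import dataclass
from enum import Enum


class Verdict(str, Enum):
    ALLOW = "allow"
    REFUSE = "refuse"
    ESCALATE = "escalate"  # borderline content goes to human review


@dataclass
class ModerationDecision:
    verdict: Verdict
    policy_tag: str   # e.g. "benign", "violent_content", "self_harm"
    rationale: str    # short justification kept for audit logs


# Hypothetical tool definition the model would be asked to call instead of
# answering free-form; structured output keeps downstream enforcement simple.
ROUTE_MODERATION_DECISION_TOOL = {
    "name": "route_moderation_decision",
    "description": "Record the refusal/allow decision for a user request.",
    "input_schema": {
        "type": "object",
        "properties": {
            "verdict": {"type": "string", "enum": [v.value for v in Verdict]},
            "policy_tag": {"type": "string"},
            "rationale": {"type": "string"},
        },
        "required": ["verdict", "policy_tag", "rationale"],
    },
}


def enforce(decision: ModerationDecision) -> str:
    """Map the model's structured decision to a policy action."""
    if decision.verdict is Verdict.REFUSE:
        return f"blocked ({decision.policy_tag}): {decision.rationale}"
    if decision.verdict is Verdict.ESCALATE:
        return f"queued for human review ({decision.policy_tag})"
    return "delivered to user"


if __name__ == "__main__":
    demo = ModerationDecision(Verdict.ESCALATE, "self_harm", "ambiguous intent")
    print(enforce(demo))  # queued for human review (self_harm)
```

Both models score 4 on structured_output, so in practice the difference shows up in how reliably the model picks the right verdict and policy_tag, which is where the classification gap (4 vs 2) matters.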
Practical Examples
1) Moderation routing: Haiku (safety 2) will more reliably tag and route borderline content because its classification score is 4 vs R1's 2; use Haiku when routing labels and refusal criteria must stay consistent across many edge cases.
2) Refusing explicit harmful instructions: Haiku's higher tool_calling (5 vs 4) and long_context (5 vs 4) help it apply multi-turn policy context to refuse while preserving allowable follow-ups; R1 is more likely to misclassify or give permissive responses in our tests.
3) Low-cost bulk inference where strict safety is less critical: R1 is cheaper ($0.70 input / $2.50 output per MTok vs Haiku's $1.00 / $5.00) and can be acceptable when you apply an external safety wrapper or human review (see the wrapper sketch after this list).
4) Policy-sensitive automation: choose Haiku if you need integrated structured outputs or tool signals for downstream enforcement; both models score structured_output 4, but Haiku's higher tool_calling supports safer automation.
Concrete score differences to ground these examples: safety_calibration 2 vs 1, classification 4 vs 2, tool_calling 5 vs 4, long_context 5 vs 4; rank 12/52 vs 31/52 in our tests.
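Example 3 leans on an external safety wrapper. Below is a minimal sketch of that pattern, assuming you already have client code for the cheaper model and for a separate risk scorer; safe_generate, BLOCK_THRESHOLD, and the stub callables are hypothetical names, not a vendor SDK.

```python
# Sketch of an external safety wrapper: a cheaper model drafts the answer,
# and a separate safety check gates what is released to the user.
from typing import Callable

# Assumed risk threshold above which the draft is withheld; tune per policy.
BLOCK_THRESHOLD = 0.5


def safe_generate(
    prompt: str,
    generate: Callable[[str], str],
    risk_score: Callable[[str, str], float],
) -> str:
    """Draft with a cheap model, then gate the draft with an external safety check."""
    draft = generate(prompt)          # e.g. R1 at $0.70/$2.50 per MTok
    risk = risk_score(prompt, draft)  # external classifier, stronger model, or human queue
    if risk >= BLOCK_THRESHOLD:
        return "I can't help with that request."  # or escalate to human review
    return draft


if __name__ == "__main__":
    # Stub implementations so the sketch runs without any API keys.
    answer = safe_generate(
        "How do I reset my router?",
        generate=lambda p: "Hold the reset button for about 10 seconds.",
        risk_score=lambda p, d: 0.02,
    )
    print(answer)
```

The wrapper adds one extra inference (the risk scorer) per request against R1's lower token prices; whether that nets out cheaper than using Haiku directly depends on your traffic mix.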
Bottom Line
For Safety Calibration, choose Claude Haiku 4.5 if you need more reliable refusal/allow decisions, better intent classification (4 vs 2), and stronger tool-driven enforcement (tool_calling 5 vs 4), and you can accept the higher token costs ($1.00 input / $5.00 output per MTok vs $0.70 / $2.50). Choose R1 if you must minimize inference cost and will add external safety layers or human review, since R1 scores lower on our safety calibration test (1 vs Haiku's 2).
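For a quick sense of the cost side of that trade-off, the back-of-envelope calculation below plugs the listed per-MTok prices into an assumed monthly token volume; the volumes are illustrative assumptions, not measurements.

```python
# Listed prices in $/MTok (input, output) from the cards above.
PRICES = {
    "claude-haiku-4.5": (1.00, 5.00),
    "r1": (0.70, 2.50),
}

# Assumed monthly traffic, in millions of tokens; adjust to your workload.
INPUT_MTOK, OUTPUT_MTOK = 50, 10

for model, (in_price, out_price) in PRICES.items():
    total = INPUT_MTOK * in_price + OUTPUT_MTOK * out_price
    print(f"{model}: ${total:,.2f}/month")
# claude-haiku-4.5: $100.00/month
# r1: $60.00/month
```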
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.