Claude Haiku 4.5 vs R1 0528 for Safety Calibration
R1 0528 is the clear winner for Safety Calibration in our testing. On the safety_calibration benchmark, R1 0528 scores 4 to Claude Haiku 4.5's 2 (rank 6 vs. rank 12 of 52). That gap indicates R1 0528 more reliably refuses harmful prompts while permitting legitimate ones. Both models tie on faithfulness (5) and tool_calling (5), so R1 0528's higher safety_calibration score is the decisive factor.
Pricing (modelpicker.net)
Model                          Input         Output
Claude Haiku 4.5 (Anthropic)   $1.00/MTok    $5.00/MTok
R1 0528 (DeepSeek)             $0.50/MTok    $2.15/MTok
Task Analysis
Safety Calibration requires correctly refusing harmful requests while allowing legitimate ones; the key capabilities are calibrated refusal thresholds, precise classification/routing, and faithfulness to policy. Because no external benchmark covers this task, we rely on our internal safety_calibration scores: Claude Haiku 4.5 = 2, R1 0528 = 4. Supporting signals: both models score 5 on faithfulness and 5 on tool_calling in our tests (useful for enforcing policy via tools), and both score 4 on structured_output, though R1 0528 has a documented quirk of returning empty responses on short structured_output tasks, which can affect some safety pipelines. These internal scores are the primary evidence for the verdict.
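To make the two error modes concrete, here is a minimal Python sketch of the trade-off safety calibration measures: over-refusal (benign prompts blocked) versus under-refusal (harmful prompts allowed). This is not part of our benchmark harness; the function name and input shape are hypothetical.

```python
def calibration_report(decisions):
    """decisions: list of (should_refuse, did_refuse) boolean pairs
    from a labeled prompt set. Returns the two error rates that
    safety calibration trades off against each other."""
    harmful = [d for s, d in decisions if s]       # prompts that should be refused
    benign = [d for s, d in decisions if not s]    # prompts that should be allowed
    missed = sum(1 for d in harmful if not d)      # harmful prompt allowed (under-refusal)
    over = sum(1 for d in benign if d)             # benign prompt refused (over-refusal)
    return {
        "under_refusal_rate": missed / len(harmful) if harmful else 0.0,
        "over_refusal_rate": over / len(benign) if benign else 0.0,
    }
```

A well-calibrated model keeps both rates low; a score of 2 vs. 4 on our benchmark reflects how far each model drifts toward one of these failure modes.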
Practical Examples
- High-volume content moderation (chat or comments): R1 0528 (score 4) is preferable — in our testing it more consistently refused clearly harmful prompts while allowing borderline but legitimate queries. Claude Haiku 4.5 (score 2) showed more permissive behavior in the same tests.
- Moderation with strict post-processing and tools: both models score 5 on tool_calling and 5 on faithfulness, so either can integrate with enforcement tooling; R1 0528's higher safety score means fewer borderline outputs to filter downstream, and its output pricing ($2.15/MTok vs. Haiku's $5.00/MTok) reduces operational expense at high throughput.
- JSON-labeled safety outputs for automated pipelines: both models have structured_output = 4, but R1 0528 has a known quirk (returns empty responses on structured_output in short tasks). In pipelines that require compact, guaranteed JSON labels, Claude Haiku 4.5 may be more predictable despite a lower safety_calibration score; however, Haiku’s lower safety score means you’ll need stronger downstream checks.
- Edge cases and adversarial prompt attempts: R1 0528’s rank (6 of 52) and score (4) indicate better resistance to adversarial elicitation in our tests versus Haiku (rank 12, score 2).
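If R1 0528's empty-response quirk on short structured_output tasks is a concern, a pipeline can retry, fail over to a second model, and fail closed when no valid label is obtained. A minimal Python sketch, with hypothetical callables standing in for the model APIs and a hypothetical `{"verdict": ..., "reason": ...}` label schema:

```python
import json

def classify_with_fallback(primary_call, fallback_call, prompt, max_retries=1):
    """Request a JSON safety label; retry, then fall back to a second
    model, if a response is empty or unparseable (the quirk noted for
    R1 0528 on short structured-output tasks)."""
    attempts = [primary_call] * (1 + max_retries) + [fallback_call]
    for call in attempts:
        raw = call(prompt)
        if not raw or not raw.strip():
            continue  # empty response: retry or fall back
        try:
            label = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: retry or fall back
        if label.get("verdict") in ("allow", "refuse"):
            return label
    # Fail closed: treat undecidable inputs as refusals.
    return {"verdict": "refuse", "reason": "classifier_unavailable"}
```

Failing closed matters here: with a lower-scoring safety model in the loop, an unavailable classifier should default to refusal rather than letting unlabeled content through.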
Bottom Line
For Safety Calibration, choose R1 0528 if you need reliable refusals of harmful requests, lower output pricing ($2.15/MTok vs. $5.00/MTok), and fewer borderline outputs to filter. Choose Claude Haiku 4.5 if you prioritize its larger 200k-token context window or need more predictable structured output on short tasks, but plan to add stricter post-filters, since it scored 2 on safety_calibration in our testing.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.