Claude Haiku 4.5 vs Claude Sonnet 4.6 for Safety Calibration
Winner: Claude Sonnet 4.6. In our testing, Sonnet scores 5/5 on Safety Calibration versus Haiku's 2/5, placing Sonnet tied for 1st and Haiku at rank 12 of 52. A three-point gap is decisive for safety-sensitive workloads. Note: no external benchmark covers this task in the payload, so this verdict rests entirely on our internal safety_calibration results.
Pricing (per million tokens)

| Model | Input | Output |
| --- | --- | --- |
| Claude Haiku 4.5 (Anthropic) | $1.00/MTok | $5.00/MTok |
| Claude Sonnet 4.6 (Anthropic) | $3.00/MTok | $15.00/MTok |
Task Analysis
Safety Calibration demands reliably refusing harmful requests while permitting legitimate ones. The capabilities that matter most are accurate intent classification, robust refusal phrasing, selective permissiveness on borderline cases, and consistent policy adherence across prompts and contexts. In our testing the primary evidence is the safety_calibration score: Sonnet 4.6 = 5/5, Haiku 4.5 = 2/5. Supporting signals from our internal suite add context: both models score 5/5 on faithfulness and tool_calling (helpful for integrations that route or log refusals), and both score 4/5 on structured_output (useful for standardized refusal messages). Sonnet's higher creative_problem_solving score (5 vs. 4) suggests it offers safer, context-appropriate alternatives and mitigation language more effectively, which reinforces its top safety_calibration result.
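The structured_output scores matter in practice because refusals are easiest to audit when every model emits them in one fixed schema. A minimal sketch of such a refusal record follows; the field names and helper are illustrative assumptions, not part of either model's API:

```python
import json
from typing import Optional

def make_refusal_record(request_id: str, category: str,
                        alternative: Optional[str]) -> str:
    """Serialize a refusal into a fixed, auditable JSON shape.

    Hypothetical schema: downstream logging and human-review tools can
    key on "decision" without parsing free-form model text.
    """
    record = {
        "request_id": request_id,
        "decision": "refuse",
        "category": category,              # e.g. "illicit-instructions"
        "safe_alternative": alternative,   # mitigation text, or None
    }
    return json.dumps(record, sort_keys=True)
```

A pipeline would ask the model for these fields via its structured-output mode, then validate and log the result verbatim.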
Practical Examples
1. Content-moderation pipeline: Sonnet 4.6 (5/5) is the safer default for automated pre-filtering and final refusal messaging, with fewer false permits and more consistent refusal templates than Haiku (2/5).
2. Interactive assistants handling edge-case requests (self-harm, illicit instructions): Sonnet's 5/5 indicates it more reliably refuses harmful inputs while offering safe alternatives; Haiku's 2/5 signals a higher risk of permitting harmful framing or failing to provide appropriate mitigation.
3. Cost-sensitive batch auditing: if you need a lower-cost model to triage obviously harmful vs. benign content before human review, Haiku ($1.00/MTok input, $5.00/MTok output) can serve as a low-cost filter, but expect noisier refusals and more human oversight.
4. Tooled workflows and logging: both models score 5/5 on tool_calling and faithfulness, so integrating either into a moderation pipeline with deterministic routing and audit logs is feasible; Sonnet simply gives stronger, more consistent refusal behavior per our safety_calibration scores.
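The cost-sensitive triage pattern above can be sketched as a two-tier router: a cheap model resolves confident verdicts, and everything borderline escalates to the stronger model. The classifier stubs below are placeholders (real code would call the respective model APIs), and the labels, confidence field, and threshold are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class TriageResult:
    label: str         # "allow", "refuse", or "borderline"
    confidence: float  # 0.0-1.0, as reported by the triage model

def cheap_triage(text: str) -> TriageResult:
    # Stub for a low-cost model (Haiku-class). A real implementation
    # would make an API call; keyword rules here just make it runnable.
    if "recipe" in text:
        return TriageResult("allow", 0.95)
    if "weapon" in text:
        return TriageResult("refuse", 0.90)
    return TriageResult("borderline", 0.40)

def strong_review(text: str) -> str:
    # Stub for a stronger, better-calibrated model (Sonnet-class).
    return "refuse" if "exploit" in text else "allow"

def route(text: str, threshold: float = 0.8) -> str:
    """Auto-resolve confident triage verdicts; escalate the rest."""
    triage = cheap_triage(text)
    if triage.label != "borderline" and triage.confidence >= threshold:
        return triage.label
    return strong_review(text)
```

The design choice is the threshold: lowering it routes more traffic to the cheap tier (cutting cost), while raising it sends more borderline cases to the stronger model, which is where the safety_calibration gap between the two tiers actually matters.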
Bottom Line
For Safety Calibration, choose Claude Haiku 4.5 if you must prioritize cost and can tolerate weaker automated refusal performance (2/5) backed by additional human review. Choose Claude Sonnet 4.6 if safety is critical and you need the most reliable automated refusal and safe-alternative behavior in our tests (5/5), accepting its higher price ($3.00/MTok input, $15.00/MTok output versus Haiku's $1.00/$5.00).
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.