Claude Sonnet 4.6 vs R1 0528 for Safety Calibration

Winner: Claude Sonnet 4.6. In our testing, Claude Sonnet 4.6 scores 5/5 for Safety Calibration versus R1 0528's 4/5, ranking 1st versus 6th out of 52 models on this task. That one-point gap reflects measurably better refusal behavior and safer permissioning on the safety_calibration test. R1 0528 is competent (4/5) and matches Claude on tool_calling and faithfulness, but Claude's top safety score, its 1,000,000-token context window, and its higher scores on related axes (strategic_analysis 5 vs 4, creative_problem_solving 5 vs 4) make it the clear choice when strict safety gating is required. Note: no external benchmark exists for this task in our data, so this verdict is based on our internal task scores and supporting metrics.

anthropic

Claude Sonnet 4.6

Overall
4.67/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
5/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
5/5

External Benchmarks

SWE-bench Verified
75.2%
MATH Level 5
N/A
AIME 2025
85.8%

Pricing

Input

$3.00/MTok

Output

$15.00/MTok

Context Window

1000K

modelpicker.net

deepseek

R1 0528

Overall
4.50/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
4/5
Strategic Analysis
4/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
96.6%
AIME 2025
66.4%

Pricing

Input

$0.500/MTok

Output

$2.15/MTok

Context Window

164K


Task Analysis

What Safety Calibration demands: the ability to refuse harmful or disallowed requests while permitting legitimate ones, with minimal false positives and clear, safe alternatives. Key capabilities: accurate refusal detection, nuanced justification for refusals, faithfulness (to avoid hallucinated safety claims), reliable tool calling and structured output for automated enforcement, and robust long-context handling when safety rules depend on prior conversation.

No external benchmark covers this task in our data, so the primary signal is our task score: Claude Sonnet 4.6 = 5, R1 0528 = 4. Supporting evidence: both models score 5 on faithfulness and tool_calling, which helps implement enforcement flows, but R1 0528's documented quirks (it can return empty responses on structured_output and constrained_rewriting tests, and it emits separate reasoning tokens) can undermine automated safety pipelines that rely on structured refusals or short outputs. Claude's scores of 5 on agentic_planning and long_context further support complex, reproducible safety gating across extended dialogs.
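A pipeline that depends on structured refusals needs to defend against exactly the empty-response failure mode described above. Here is a minimal sketch; the refusal-record schema (`decision`/`reason` keys) is a hypothetical convention for illustration, not part of either vendor's API:

```python
import json

# Hypothetical refusal record a moderation pipeline might expect:
# {"decision": "refuse" | "allow", "reason": "<short explanation>"}
REQUIRED_KEYS = {"decision", "reason"}
VALID_DECISIONS = {"refuse", "allow"}

def parse_refusal_record(raw: str) -> dict:
    """Validate a model's structured moderation output.

    Raises ValueError on the failure modes discussed above:
    empty responses and malformed or incomplete JSON.
    """
    if not raw or not raw.strip():
        # The empty-structured-output quirk noted for R1 0528.
        raise ValueError("empty model response")
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"unparseable refusal record: {exc}") from exc
    if not REQUIRED_KEYS <= record.keys():
        raise ValueError(f"missing keys: {REQUIRED_KEYS - record.keys()}")
    if record["decision"] not in VALID_DECISIONS:
        raise ValueError(f"unknown decision: {record['decision']!r}")
    return record
```

Route `ValueError` to a fail-closed path (treat the request as refused and log it for review) so that a model quirk never silently allows content through.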

Practical Examples

  1. High-assurance moderation pipeline: Claude Sonnet 4.6 (5/5) refuses clearly harmful prompts, provides concise safe explanations, and supports long-context policy checks across a 1,000,000-token window. Use Claude when you need consistent, auditable refusals.
  2. Cost-sensitive moderation at scale: R1 0528 (4/5) delivers good refusal behavior for many cases at lower cost ($0.50/MTok input, $2.15/MTok output), but watch for gaps: its quirks can produce empty structured outputs, breaking automated JSON-based refusal logs.
  3. Tool-integrated enforcement: both models score 5 on tool_calling and faithfulness, so they can select enforcement actions reliably; Claude's absence of empty-structured-output quirks makes it more robust for systems that expect machine-readable refusal records.
  4. Edge cases and adversarial prompts: Claude's top safety_calibration score and tied top scores on related axes (agentic_planning, persona_consistency) indicate fewer false negatives on adversarial attempts; R1 may require additional wrapper checks or extra engineering effort to match that behavior.
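The "additional wrapper checks" in example 4 can be as simple as a retry-then-fail-closed loop around the model call. A sketch under stated assumptions: `query_model` is a placeholder for whatever client call you use, and the retry count is an illustrative default:

```python
def moderate_with_fallback(prompt, query_model, max_retries=2):
    """Call the model, retrying empty outputs, then fail closed.

    Failing closed (treating an unusable response as a refusal)
    guards against the empty-output quirk breaking enforcement.
    """
    for attempt in range(1, max_retries + 2):
        raw = query_model(prompt)
        if raw and raw.strip():
            return {"ok": True, "output": raw, "attempts": attempt}
    # Still empty after all retries: fail closed.
    return {"ok": False, "output": None, "attempts": max_retries + 1}
```

The design choice here is deliberate: a wrapper that fails open (allowing content when the model response is unusable) would turn a formatting quirk into a safety gap.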

Bottom Line

For Safety Calibration, choose Claude Sonnet 4.6 if you need the strongest, most consistent refusal behavior and robust long-context safety checks (5/5 vs 4/5; ranks 1 vs 6 of 52). Choose R1 0528 if lower per-token cost ($0.50/MTok input, $2.15/MTok output) is the priority and you can accept its 4/5 safety score plus engineering workarounds for its structured_output quirks.
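To make the budget trade-off concrete, the per-request arithmetic at the listed prices looks like this (the token counts in the example are illustrative assumptions, not measured workloads):

```python
PRICES = {  # USD per million tokens, from the cards above
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
    "r1-0528": {"input": 0.50, "output": 2.15},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the listed per-MTok prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a 2,000-token prompt producing a 300-token refusal record.
claude = request_cost("claude-sonnet-4.6", 2_000, 300)  # $0.0105
r1 = request_cost("r1-0528", 2_000, 300)                # $0.001645
```

At these illustrative sizes R1 0528 is roughly 6x cheaper per request, which is the gap you are weighing against its 4/5 safety score and the wrapper logic needed around its quirks.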

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions