Question 1

Is R1 0528 better than o3?

Accepted Answer

In our testing R1 0528 wins more head-to-head benchmarks (3 vs 2). R1 wins classification (4 vs 3), long_context (5 vs 4), and safety_calibration (4 vs 1). o3 wins structured_output (5 vs 4) and strategic_analysis (5 vs 4).

Question 2

Which model is cheaper to run?

Accepted Answer

R1 0528 is substantially cheaper: input $0.50/1k and output $2.15/1k vs o3 at $2/1k input and $8/1k output. At a 50/50 in/out mix that's ≈ $2.65/1k for R1 vs $10.00/1k for o3 (1M tokens → ~$2,650 vs ~$10,000).

Question 3

Which is better for coding and SWE-bench?

Accepted Answer

On SWE-bench Verified (Epoch AI), o3 is scored in the payload at 62.3% (rank 9 of 12). R1 does not have a SWE-bench Verified score in the payload, so based on available third-party data o3 is the stronger candidate for our coding benchmark.

Question 4

Which is better for math and contest problems?

Accepted Answer

According to Epoch AI benchmarks in the payload, o3 scores 97.8% on math_level_5 vs R1 96.6%; on AIME 2025 o3 scores 83.9% vs R1 66.4%. These external scores show o3 has an advantage on higher-difficulty math tasks.

Question 5

Are there any quirks to watch for if I choose R1 0528?

Accepted Answer

Yes — the payload flags R1 returning empty responses on structured_output, constrained_rewriting, and agentic_planning in some cases; it uses reasoning tokens that consume output budget on short tasks and requires high max completion tokens (min_max_completion_tokens: 1000). Test schema-based and short-output flows before production.

Question 6

Which model is safer at refusing harmful requests?

Accepted Answer

In our safety_calibration test R1 scores 4 vs o3's 1, indicating R1 better balances refusing harmful prompts while permitting legitimate ones in our benchmarks.

R1 0528 vs o3

R1 0528

o3

Benchmark Analysis

Pricing Analysis

Real-World Cost Comparison

Bottom Line

How We Test

Frequently Asked Questions