R1 0528 vs GPT-4.1

For most production API use cases where price and strong agentic/tool performance matter, R1 0528 is the better pick: it wins more benchmarks in our 12-test suite and costs far less per token. GPT-4.1 still wins at strategic analysis and constrained rewriting, and offers multimodal input plus a 1,047,576-token context window; choose it when those capabilities matter despite the higher cost.

deepseek R1 0528

Overall: 4.50/5 (Strong)

Benchmark Scores: Faithfulness 5/5 · Long Context 5/5 · Multilingual 5/5 · Tool Calling 5/5 · Classification 4/5 · Agentic Planning 5/5 · Structured Output 4/5 · Safety Calibration 4/5 · Strategic Analysis 4/5 · Persona Consistency 5/5 · Constrained Rewriting 4/5 · Creative Problem Solving 4/5

External Benchmarks: SWE-bench Verified N/A · MATH Level 5 96.6% · AIME 2025 66.4%

Pricing: Input $0.500/MTok · Output $2.15/MTok

Context Window: 164K

modelpicker.net

openai GPT-4.1

Overall: 4.25/5 (Strong)

Benchmark Scores: Faithfulness 5/5 · Long Context 5/5 · Multilingual 5/5 · Tool Calling 5/5 · Classification 4/5 · Agentic Planning 4/5 · Structured Output 4/5 · Safety Calibration 1/5 · Strategic Analysis 5/5 · Persona Consistency 5/5 · Constrained Rewriting 5/5 · Creative Problem Solving 3/5

External Benchmarks: SWE-bench Verified 48.5% · MATH Level 5 83.0% · AIME 2025 38.3%

Pricing: Input $2.00/MTok · Output $8.00/MTok

Context Window: 1048K


Benchmark Analysis

Across our 12-test suite, R1 0528 wins 3 tests, GPT-4.1 wins 2, and 7 tie. Details:

- Creative problem solving: R1 4 vs GPT-4.1 3 (R1 ranks 9 of 54, GPT-4.1 ranks 30). Expect R1 to produce more feasible, non-obvious ideas in our prompts.
- Safety calibration: R1 4 vs GPT-4.1 1 (R1 ranks 6 of 55, GPT-4.1 ranks 32). R1 refused harmful requests more reliably in our tests.
- Agentic planning: R1 5 vs GPT-4.1 4 (R1 tied for 1st, GPT-4.1 ranks 16th). R1 was better at decomposition and failure recovery in our agent-style tasks.
- Strategic analysis: GPT-4.1 5 vs R1 4 (GPT-4.1 tied for 1st, R1 ranks 27th). GPT-4.1 handled nuanced tradeoffs and numeric reasoning better in our scenarios.
- Constrained rewriting: GPT-4.1 5 vs R1 4 (GPT-4.1 tied for 1st, R1 ranks 6th). GPT-4.1 is stronger when tight character limits and exact compressions matter.

Ties (structured output, tool calling, faithfulness, classification, long context, persona consistency, multilingual) mean both models produced equivalent scores on those tasks in our tests; for example, both scored 5/5 on long context and persona consistency, and both tied for top ranks on tool calling.

External benchmarks (Epoch AI) supplement this picture: on MATH Level 5, R1 scores 96.6% vs GPT-4.1's 83.0%; on AIME 2025, R1 scores 66.4% vs 38.3%. GPT-4.1 reports 48.5% on SWE-bench Verified; R1 has no SWE-bench value in the payload.

Practical context: R1 shines for agentic workflows, safer refusals, creative tasks, and higher math performance in our tests; GPT-4.1 shines for strategic tradeoff reasoning and ultra-precise constrained rewriting, and adds multimodal I/O and a much larger context window (1,047,576 vs R1's 163,840 tokens). Note two R1 quirks from the payload: it "returns empty responses on structured_output, constrained_rewriting, and agentic_planning" and it "uses reasoning tokens", which can affect short-task output budgets. Test these paths before production.
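Given the empty-response quirk noted above, a minimal defensive sketch may help. It assumes an OpenAI-style chat response dict; the exact shape of your client's response object may differ, and the `reasoning_content` field name is an assumption about reasoning-model output:

```python
# Sketch: guard against R1-style empty answers, assuming an OpenAI-style
# response dict. A reasoning model can spend its output budget on reasoning
# tokens and return an empty final answer; the caller should then retry
# with a larger max-output budget.

def extract_answer(response: dict) -> "str | None":
    """Return the final answer text, or None if the answer came back empty."""
    message = response["choices"][0]["message"]
    content = (message.get("content") or "").strip()
    return content or None

# Normal response: answer text is present alongside reasoning tokens.
ok = {"choices": [{"message": {"content": "42", "reasoning_content": "..."}}]}
# Degenerate response: reasoning tokens only, empty answer.
empty = {"choices": [{"message": {"content": "", "reasoning_content": "..."}}]}

print(extract_answer(ok))     # "42"
print(extract_answer(empty))  # None -> retry or raise the output budget
```

The same guard is cheap to run on every model, so it does not need R1-specific branching.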

| Benchmark | R1 0528 | GPT-4.1 |
|---|---|---|
| Faithfulness | 5/5 | 5/5 |
| Long Context | 5/5 | 5/5 |
| Multilingual | 5/5 | 5/5 |
| Tool Calling | 5/5 | 5/5 |
| Classification | 4/5 | 4/5 |
| Agentic Planning | 5/5 | 4/5 |
| Structured Output | 4/5 | 4/5 |
| Safety Calibration | 4/5 | 1/5 |
| Strategic Analysis | 4/5 | 5/5 |
| Persona Consistency | 5/5 | 5/5 |
| Constrained Rewriting | 4/5 | 5/5 |
| Creative Problem Solving | 4/5 | 3/5 |
| Summary | 3 wins | 2 wins |
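The win/tie tally follows mechanically from the per-test scores; a short sketch, using the scores from the table above:

```python
# Sketch: reproduce the head-to-head summary from the per-test scores.
# Each entry is (R1 0528 score, GPT-4.1 score) on a 1-5 scale.

SCORES = {
    "Faithfulness": (5, 5), "Long Context": (5, 5), "Multilingual": (5, 5),
    "Tool Calling": (5, 5), "Classification": (4, 4),
    "Agentic Planning": (5, 4), "Structured Output": (4, 4),
    "Safety Calibration": (4, 1), "Strategic Analysis": (4, 5),
    "Persona Consistency": (5, 5), "Constrained Rewriting": (4, 5),
    "Creative Problem Solving": (4, 3),
}

r1_wins = sum(a > b for a, b in SCORES.values())
gpt_wins = sum(b > a for a, b in SCORES.values())
ties = sum(a == b for a, b in SCORES.values())
print(r1_wins, gpt_wins, ties)  # 3 2 7
```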

Pricing Analysis

Pricing in the payload is per MTok (1M tokens). Using a 50/50 input/output token split as a practical example, R1 0528 (input $0.50 / output $2.15 per MTok) costs $1.325 per 1M total tokens; GPT-4.1 (input $2.00 / output $8.00 per MTok) costs $5.00 per 1M total tokens. Scale impact: at 10M tokens/month, R1 costs about $13.25 vs about $50.00 for GPT-4.1; at 100M tokens/month, about $132.50 vs about $500.00. Who should care: any high-volume app, startups with tight margins, or teams embedding models in heavy automation. The roughly 3.8x cost gap on a 50/50 traffic mix makes R1 materially cheaper at scale.
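The blended-cost arithmetic above can be sketched directly; prices come from the cards, while the 50/50 split is an illustrative assumption (real traffic is rarely 50/50, so adjust `input_share` to your own mix):

```python
# Sketch: blended cost under an assumed input/output token split.

PRICES = {  # $ per 1M tokens (MTok)
    "R1 0528": {"input": 0.50, "output": 2.15},
    "GPT-4.1": {"input": 2.00, "output": 8.00},
}

def blended_cost(model: str, total_tokens: int, input_share: float = 0.5) -> float:
    """Dollar cost for total_tokens split input_share / (1 - input_share)."""
    p = PRICES[model]
    mtok = total_tokens / 1_000_000
    return mtok * (input_share * p["input"] + (1 - input_share) * p["output"])

for model in PRICES:
    # Monthly cost at 10M tokens with a 50/50 mix.
    print(f"{model}: ${blended_cost(model, 10_000_000):.2f}")
```

Output-heavy workloads (low `input_share`) widen the absolute gap, since the output-price difference ($2.15 vs $8.00) is larger than the input-price difference.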

Real-World Cost Comparison

| Task | R1 0528 | GPT-4.1 |
|---|---|---|
| Chat response | $0.0012 | $0.0044 |
| Blog post | $0.0046 | $0.017 |
| Document batch | $0.117 | $0.440 |
| Pipeline run | $1.18 | $4.40 |

Bottom Line

Choose R1 0528 if: you operate at scale and need a dramatically lower cost per token (input $0.50 / output $2.15 per MTok), or need top agentic planning, tool calling, safer refusals, strong creative problem solving, or superior MATH Level 5 and AIME performance per our tests.

Choose GPT-4.1 if: you need the best strategic analysis and constrained rewriting in our suite, multimodal I/O (text + image + file to text), or a far larger context window (1,047,576 tokens), and are willing to pay roughly 3.8x more on a 50/50 token mix for those capabilities.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions