R1 vs R1 0528
R1 0528 is the practical pick for most teams: it wins 5 of our 12 benchmarks and is cheaper ($0.50 input / $2.15 output per MTok). R1 still beats R1 0528 on strategic analysis and creative problem solving (5 vs 4 on both tests), so choose R1 when those two capabilities are mission-critical despite its higher price (about 16% more per output token and 40% more per input token).
- DeepSeek R1: input $0.70/MTok, output $2.50/MTok
- DeepSeek R1 0528: input $0.50/MTok, output $2.15/MTok
Benchmark Analysis
In our 12-test suite, R1 0528 wins 5 tests, R1 wins 2, and the remaining 5 are ties. Detailed comparison (all scores are from our testing):
- Tool calling: R1 0528 5 vs R1 4. R1 0528 is tied for 1st of 54 on tool calling, which matters for function selection, argument accuracy, and sequencing. Use R1 0528 for tool-driven apps.
- Classification: R1 0528 4 vs R1 2. R1 0528 is tied for 1st of 53; R1 ranks 51 of 53, making it a weak choice for routing and class-label tasks in our tests.
- Long context: R1 0528 5 vs R1 4. R1 0528 is tied for 1st of 55; expect better retrieval and coherence past 30k tokens with R1 0528.
- Safety calibration: R1 0528 4 vs R1 1. R1 0528 ranks 6 of 55 in our tests vs R1's 32 of 55; it is better at refusing harmful requests while permitting legitimate ones.
- Agentic planning: R1 0528 5 vs R1 4. R1 0528 is tied for 1st of 54; it is better at goal decomposition and recovery.
- Strategic analysis: R1 5 vs R1 0528 4. R1 is tied for 1st of 54; prefer R1 when fine-grained tradeoff reasoning with numbers is required.
- Creative problem solving: R1 5 vs R1 0528 4. R1 is tied for 1st; it produces more non-obvious, specific, feasible ideas in our tests.
- Ties (structured output, constrained rewriting, faithfulness, persona consistency, multilingual): both models scored equal; for example, both score 5 on persona consistency and faithfulness.

External math benchmarks (Epoch AI): on MATH Level 5, R1 scores 93.1% vs 96.6% for R1 0528 (R1 0528 ranks 5 of 14, R1 ranks 8 of 14). On AIME 2025, R1 scores 53.3% vs 66.4% for R1 0528 (R1 0528 ranks 16 of 23, R1 ranks 17 of 23). We cite these Epoch AI results as supplementary evidence that R1 0528 is stronger on higher-difficulty math tasks in these external measures.

Overall: R1 0528 is the better fit for tool-driven, long-context, safety-sensitive, and agentic workflows; R1 is the better pick for strategic numeric reasoning and creative ideation in our tests.
Pricing Analysis
Per-token list prices: R1 is $0.70 input / $2.50 output per MTok; R1 0528 is $0.50 input / $2.15 output per MTok. Assuming a 50/50 input:output split, monthly costs are:
- 1B tokens: R1 = $1,600; R1 0528 = $1,325 (save $275/month)
- 10B tokens: R1 = $16,000; R1 0528 = $13,250 (save $2,750/month)
- 100B tokens: R1 = $160,000; R1 0528 = $132,500 (save $27,500/month)

If your app is output-heavy (more output tokens than input), the output-rate gap ($2.50 vs $2.15 per MTok) magnifies savings: R1 costs $2,500 per 1B output tokens vs $2,150 for R1 0528, a $350 difference per 1B output tokens. Teams with large volumes (10B+ tokens/month) or tight margins should prefer R1 0528 for cost efficiency; teams that need the specific strengths where R1 wins may accept the roughly 16% higher output rate ($2.50/$2.15 ≈ 1.16). The sketch in the next section makes the arithmetic explicit.
Real-World Cost Comparison
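To make the numbers above concrete, here is a minimal cost-model sketch in Python. The prices are the list prices quoted in this article; the 50/50 split and the monthly volumes match the table's assumptions, and the names (`PRICES`, `monthly_cost`) are our own illustration, not any vendor API.

```python
# Minimal cost model for the comparison above. Prices are the $/MTok list
# prices from this article; the function and names are illustrative only.
PRICES = {
    "R1": {"input": 0.70, "output": 2.50},
    "R1 0528": {"input": 0.50, "output": 2.15},
}

def monthly_cost(model: str, total_mtok: float, output_share: float = 0.5) -> float:
    """Dollar cost for total_mtok million tokens at the given output share."""
    p = PRICES[model]
    return total_mtok * ((1 - output_share) * p["input"] + output_share * p["output"])

for mtok in (1_000, 10_000, 100_000):  # 1B, 10B, 100B tokens per month
    r1, r1_0528 = monthly_cost("R1", mtok), monthly_cost("R1 0528", mtok)
    print(f"{mtok:>7,} MTok: R1 ${r1:,.0f} vs R1 0528 ${r1_0528:,.0f} "
          f"(save ${r1 - r1_0528:,.0f}/month)")

# Output-heavy workloads widen the gap: at 80% output tokens the savings on
# 1B tokens grow from $275/month (50/50 split) to about $320/month.
heavy = (monthly_cost("R1", 1_000, output_share=0.8)
         - monthly_cost("R1 0528", 1_000, output_share=0.8))
print(f"80% output, 1B tokens: save ${heavy:,.0f}/month")
```

Swapping in your real input:output ratio is the main thing to change; at a 50/50 split the savings understate what an output-heavy app would see.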
Bottom Line
Choose R1 0528 if: you need best-in-class tool calling, long-context coherence (tied for 1st), stronger safety calibration, agentic planning, and lower cost ($0.50 input / $2.15 output per MTok). It also posts higher external math scores (96.6% on MATH Level 5 and 66.4% on AIME 2025, per Epoch AI). Choose R1 if: your product demands top-tier strategic analysis or creative problem solving (R1 scored 5 vs 4 on both in our tests) and you will accept the higher per-token price ($0.70 input / $2.50 output per MTok, about 16% more per output token) for those strengths.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
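As a purely hypothetical illustration of what such a scoring loop can look like (the rubric text, the `call_judge` stub, and the function names below are our placeholders, not the actual harness; the methodology page has the real details):

```python
# Hypothetical sketch of a 1-5 LLM-judge scoring loop. The judge call is
# stubbed out; the real prompts, judge model, and rubrics live in the
# methodology, not in this sketch.
from statistics import mean

RUBRIC = ("Score the response from 1 (fails the task) to 5 (flawless), "
          "judging only this benchmark's criteria.")

def call_judge(benchmark: str, prompt: str, response: str) -> int:
    """Placeholder: a real harness would send RUBRIC plus the case to a judge model."""
    raise NotImplementedError

def score_benchmark(benchmark: str, cases: list[tuple[str, str]]) -> float:
    # Each case is (prompt, model_response); the benchmark score is the
    # mean of the per-case 1-5 judge scores.
    scores = [call_judge(benchmark, prompt, response) for prompt, response in cases]
    assert all(1 <= s <= 5 for s in scores), "judge must return 1-5"
    return mean(scores)
```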