R1 0528 vs Devstral Small 1.1
R1 0528 is the clear performance winner in our testing, outscoring Devstral Small 1.1 on 10 of 12 benchmarks — with particularly large gaps on agentic planning (5 vs 2), creative problem solving (4 vs 2), and strategic analysis (4 vs 2). Devstral Small 1.1 wins zero benchmarks outright and ties on two (structured output and classification), but its output cost of $0.30/M tokens vs R1 0528's $2.15/M makes it a viable choice for high-volume, narrow tasks where those two categories are sufficient. For most developers and teams, R1 0528 earns its premium on anything requiring reasoning, planning, or multi-step agentic work.
Pricing at a glance (modelpicker.net):
- R1 0528 (deepseek): $0.50/MTok input, $2.15/MTok output
- Devstral Small 1.1 (mistral): $0.10/MTok input, $0.30/MTok output
Benchmark Analysis
R1 0528 wins 10 of 12 benchmarks in our testing; Devstral Small 1.1 wins none and ties on two.
Where R1 0528 dominates:
- Agentic planning: 5/5 (tied for 1st of 54 models) vs Devstral Small 1.1's 2/5 (53rd of 54). This is the largest practical gap — goal decomposition and failure recovery are core to any multi-step AI workflow, and Devstral Small 1.1 ranks near the bottom of all tested models.
- Tool calling: 5/5 (tied for 1st of 54) vs 4/5 (18th of 54). R1 0528 edges ahead on function selection and argument accuracy — meaningful for complex agentic pipelines.
- Creative problem solving: 4/5 (9th of 54) vs 2/5 (47th of 54). Devstral Small 1.1 ranks in the bottom 15% of all models on generating non-obvious, feasible ideas.
- Strategic analysis: 4/5 (27th of 54) vs 2/5 (44th of 54). Both sit in the middle or lower tiers here, but R1 0528 is substantially ahead on nuanced tradeoff reasoning.
- Persona consistency: 5/5 (tied 1st of 53) vs 2/5 (51st of 53). Devstral Small 1.1 is near the bottom — relevant for any chatbot or character-driven application.
- Faithfulness: 5/5 (tied 1st of 55) vs 4/5 (34th of 55). R1 0528 is more reliable at sticking to source material without hallucinating.
- Long context: 5/5 (tied 1st of 55) vs 4/5 (38th of 55). R1 0528 has a larger context window (163,840 vs 131,072 tokens) and outperforms on retrieval at 30K+ tokens.
- Safety calibration: 4/5 (6th of 55, only 4 models share this score) vs 2/5 (12th of 55, 20 models share this score). R1 0528 is substantially better calibrated on refusing harmful requests while permitting legitimate ones.
- Constrained rewriting: 4/5 (6th of 53) vs 3/5 (31st of 53).
- Multilingual: 5/5 (tied 1st of 55) vs 4/5 (36th of 55).
Where the models tie:
- Structured output: Both score 4/5 — both rank 26th of 54 in our testing. JSON schema compliance is equivalent here.
- Classification: Both score 4/5, tied for 1st of 53 models. Routing and categorization tasks are a genuine Devstral Small 1.1 strength.
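Since the two models tie on structured output, the practical question for a high-volume pipeline is whether the cheaper model's JSON stays schema-compliant on your data. A minimal compliance check you might run over sampled responses; the field names and types here are hypothetical, not part of either model's benchmark:

```python
import json

# Hypothetical expected schema for a classification response.
REQUIRED_FIELDS = {"category": str, "confidence": float}

def is_schema_compliant(raw: str) -> bool:
    """Return True if the response parses as JSON and matches the expected fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        field in data and isinstance(data[field], expected)
        for field, expected in REQUIRED_FIELDS.items()
    )

# A well-formed response passes; a truncated one fails.
good = '{"category": "billing", "confidence": 0.92}'
bad = '{"category": "billing", "conf'
print(is_schema_compliant(good), is_schema_compliant(bad))  # True False
```

Measuring the compliance rate on a few hundred sampled outputs is a cheap way to confirm the tie holds on your own workload before committing to either model.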
External benchmarks (Epoch AI): R1 0528 scores 96.6% on MATH Level 5 (rank 5 of 14 models tested) and 66.4% on AIME 2025 (rank 16 of 23 models tested). No external benchmark scores are available in our data for Devstral Small 1.1. These figures place R1 0528 solidly in the top tier for competition-level math, though it trails the very top scorers on AIME 2025.
Important quirk for R1 0528: The model can return empty responses on structured output, constrained rewriting, and agentic planning tasks unless max completion tokens is set high enough — reasoning tokens consume output budget on short tasks. Factor this into your integration work.
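Because reasoning tokens draw from the same completion budget, one simple guard is to provision max completion tokens as the expected answer length plus a reasoning headroom. A sketch of that budgeting logic; the headroom and floor values are illustrative assumptions, not vendor guidance:

```python
def safe_max_tokens(expected_answer_tokens: int,
                    reasoning_headroom: int = 2048,
                    floor: int = 1024) -> int:
    """Budget completion tokens so hidden reasoning can't starve the answer.

    reasoning_headroom and floor are illustrative defaults, not vendor numbers.
    """
    return max(floor, expected_answer_tokens + reasoning_headroom)

# Even a short structured-output task gets room for reasoning tokens:
print(safe_max_tokens(200))  # 2248
```

Pass the result as the max completion tokens parameter in your API call; the key point is that the limit should scale with reasoning overhead, not with the visible answer alone.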
Pricing Analysis
R1 0528 costs $0.50/M input and $2.15/M output tokens. Devstral Small 1.1 costs $0.10/M input and $0.30/M output, roughly 5x cheaper on input and 7x cheaper on output. At 1M output tokens/month, that's $2.15 vs $0.30, a $1.85 difference that's negligible for most teams. At 10M output tokens/month, the gap grows to $18.50, still manageable for most production workloads. At 100M output tokens/month, R1 0528 costs $215 vs Devstral Small 1.1's $30, a $185/month (roughly $2,200/year) difference that starts to matter for cost-sensitive pipelines. At that scale, Devstral Small 1.1 is compelling for pipelines limited to classification or structured output tasks, where both models score identically (4/5 in our testing). For reasoning-heavy or agentic workloads, R1 0528's benchmark advantages are hard to route around regardless of cost.
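The output-side arithmetic above is straightforward to reproduce. A small sketch using the per-million prices from this comparison:

```python
# Per-million-token output prices from the comparison above (USD).
PRICES = {"R1 0528": 2.15, "Devstral Small 1.1": 0.30}

def monthly_output_cost(model: str, tokens_per_month: float) -> float:
    """Output-side cost in dollars for a given monthly token volume."""
    return PRICES[model] * tokens_per_month / 1_000_000

for volume in (1_000_000, 10_000_000, 100_000_000):
    r1 = monthly_output_cost("R1 0528", volume)
    dev = monthly_output_cost("Devstral Small 1.1", volume)
    print(f"{volume:>11,} tokens/mo: ${r1:,.2f} vs ${dev:,.2f} (diff ${r1 - dev:,.2f})")
```

Note this covers output tokens only; a full estimate would add the input side ($0.50/M vs $0.10/M) weighted by your prompt-to-completion ratio.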
Bottom Line
Choose R1 0528 if: You're building agentic systems, multi-step pipelines, or any application requiring planning, reasoning, or failure recovery. R1 0528 scores 5/5 on agentic planning (tied for 1st of 54 models in our testing) vs Devstral Small 1.1's 2/5 (53rd of 54). Also choose R1 0528 for multilingual applications, long-context retrieval, persona-driven chatbots, or any task where hallucination risk is high — it leads on faithfulness, persona consistency, and safety calibration. Be aware: you'll need to set high max completion tokens in your API calls, as R1 0528's reasoning tokens can exhaust output budgets on short tasks.
Choose Devstral Small 1.1 if: Your workload is dominated by classification or structured JSON output — the two areas where Devstral Small 1.1 matches R1 0528's scores exactly at 7x lower output cost. Devstral Small 1.1 is a 24B parameter model purpose-built for software engineering agents, so if your pipeline is narrowly scoped to code-related routing, categorization, or schema-compliant output at scale, the cost savings at 100M+ tokens/month are substantial. It also has no reported quirks around empty responses or token budget management, making it simpler to integrate.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.