R1 0528 vs Devstral Medium

R1 0528 is the clear choice for most workloads, winning 10 of 12 benchmarks in our testing against Devstral Medium and tying the remaining two. Devstral Medium wins zero benchmarks outright, though it costs marginally less ($0.40/$2.00 input/output per MTok vs $0.50/$2.15). The price premium for R1 0528 (25% on input, 7.5% on output) is hard to argue against given the performance gap, particularly on agentic planning, tool calling, and reasoning-heavy tasks.

R1 0528 (DeepSeek)

Overall: 4.50/5 (Strong)

Benchmark Scores
Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 4/5
Strategic Analysis: 4/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 4/5

External Benchmarks
SWE-bench Verified: N/A
MATH Level 5: 96.6%
AIME 2025: 66.4%

Pricing
Input: $0.50/MTok
Output: $2.15/MTok
Context Window: 164K

modelpicker.net

Devstral Medium (Mistral)

Overall: 3.17/5 (Usable)

Benchmark Scores
Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 3/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks
SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing
Input: $0.40/MTok
Output: $2.00/MTok
Context Window: 131K


Benchmark Analysis

R1 0528 dominates this comparison, winning 10 of 12 benchmarks in our testing; Devstral Medium wins none, and the two tie on structured output and classification.

Agentic planning & tool calling: R1 0528 scores 5/5 on both, ranking tied for 1st among 54 models on tool calling and tied for 1st among 54 on agentic planning in our tests. Devstral Medium scores 3/5 on tool calling (rank 47 of 54) and 4/5 on agentic planning (rank 16 of 54). For any workflow involving function calls, multi-step agent loops, or goal decomposition, R1 0528 has a substantial edge.

Reasoning & analysis: R1 0528 scores 4/5 on strategic analysis (rank 27 of 54) and 4/5 on creative problem solving (rank 9 of 54, tied with 20 others). Devstral Medium scores 2/5 on both — rank 44 of 54 on strategic analysis and rank 47 of 54 on creative problem solving. These are significant gaps: strategic analysis tests nuanced tradeoff reasoning with real numbers, and creative problem solving tests non-obvious, feasible ideas. Devstral Medium is near the bottom of the field on both.

Math benchmarks (Epoch AI): R1 0528 scores 96.6% on MATH Level 5 (rank 5 of the 14 models with reported scores) and 66.4% on AIME 2025 (rank 16 of 23). Devstral Medium has no external benchmark scores in our data. The 96.6% MATH Level 5 result places R1 0528 well above the 94.15% median across models with reported scores, confirming strong mathematical reasoning.

Long context & faithfulness: R1 0528 scores 5/5 on both (tied for 1st among 55 models on each). Devstral Medium scores 4/5 on faithfulness (rank 34 of 55) and 4/5 on long context (rank 38 of 55). For retrieval at 30K+ tokens and staying grounded in source material, R1 0528 is more reliable.

Safety calibration: R1 0528 scores 4/5, rank 6 of 55 — one of only 4 models at this level in our tests. Devstral Medium scores 1/5, rank 32 of 55. This is the starkest gap in the comparison. For applications that need a model to correctly refuse harmful requests while permitting legitimate ones, R1 0528 is substantially more calibrated in our testing.

Persona consistency & multilingual: R1 0528 scores 5/5 on persona consistency (tied for 1st among 53 models) and 5/5 on multilingual (tied for 1st among 55). Devstral Medium scores 3/5 on persona consistency (rank 45 of 53) and 4/5 on multilingual (rank 36 of 55).

Ties: Both models score 4/5 on structured output and 4/5 on classification, sharing the same rank on each (26 of 54 and tied for 1st of 53, respectively).

Important caveat: R1 0528 is a reasoning model that can return empty responses on structured output, constrained rewriting, and agentic planning tasks if max completion tokens are not set high enough, as reasoning tokens consume the output budget. This is a real integration concern — developers must set high max completion token values to avoid silent failures.
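One way to guard against those silent empty responses is to size the completion budget explicitly before each request. A minimal sketch in Python; the helper name and the 4096-token headroom are illustrative assumptions, not values from our testing:

```python
def completion_budget(expected_answer_tokens: int, reasoning_headroom: int = 4096) -> int:
    """Size the max completion tokens for a reasoning model.

    Reasoning models like R1 0528 spend hidden reasoning tokens out of the
    same budget as the visible answer, so the cap must cover both. A cap
    sized only for the answer can be consumed entirely by reasoning,
    leaving an empty visible response.
    """
    return expected_answer_tokens + reasoning_headroom


# e.g. a structured-output task expected to emit ~800 tokens of JSON
budget = completion_budget(800)  # 4896: 800 visible + 4096 reasoning headroom
```

The resulting budget would then be passed as the max completion token parameter of whatever client library you use. The right headroom varies by task; erring low fails silently rather than raising an error, so err high.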

Benchmark                   R1 0528    Devstral Medium
Faithfulness                5/5        4/5
Long Context                5/5        4/5
Multilingual                5/5        4/5
Tool Calling                5/5        3/5
Classification              4/5        4/5
Agentic Planning            5/5        4/5
Structured Output           4/5        4/5
Safety Calibration          4/5        1/5
Strategic Analysis          4/5        2/5
Persona Consistency         5/5        3/5
Constrained Rewriting       4/5        3/5
Creative Problem Solving    4/5        2/5
Summary                     10 wins    0 wins

Pricing Analysis

R1 0528 costs $0.50/MTok input and $2.15/MTok output. Devstral Medium costs $0.40/MTok input and $2.00/MTok output. That's a $0.10 input and $0.15 output gap per million tokens, small in absolute terms. At 1M output tokens/month, R1 0528 costs $2.15 vs Devstral Medium's $2.00: a $0.15 difference, essentially negligible. At 10M output tokens, the gap grows to $1.50/month; at 100M tokens, it's $15/month, plus $10/month for every 100M input tokens. For high-volume production workloads pushing 100M+ output tokens monthly, the cost difference becomes a real line item, but the modest price premium (25% on input, 7.5% on output) means teams would need to be very cost-sensitive before the savings from Devstral Medium justify its weaker benchmark performance. For most developers and product teams, the capability gap far outweighs the price difference.
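The arithmetic above is easy to reproduce for your own traffic mix. A quick sketch using the listed prices; the 100M-input/100M-output split is an illustrative assumption:

```python
def monthly_cost(input_mtok: float, output_mtok: float,
                 in_price: float, out_price: float) -> float:
    """Monthly spend in dollars, given traffic in millions of tokens (MTok)."""
    return input_mtok * in_price + output_mtok * out_price


# Per-MTok prices from the comparison above; assume 100M input + 100M output per month.
r1_0528 = monthly_cost(100, 100, in_price=0.50, out_price=2.15)   # $265/month
devstral = monthly_cost(100, 100, in_price=0.40, out_price=2.00)  # $240/month
gap = r1_0528 - devstral                                          # $25/month
```

Swap in your own token volumes to see where the gap stops being noise for your budget.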

Real-World Cost Comparison

Task              R1 0528    Devstral Medium
Chat response     $0.0012    $0.0011
Blog post         $0.0046    $0.0042
Document batch    $0.117     $0.108
Pipeline run      $1.18      $1.08

Bottom Line

Choose R1 0528 if you're building agentic systems, need reliable tool calling, handle math-heavy or reasoning-intensive tasks, require strong safety calibration, or work across languages: it outperforms Devstral Medium on 10 of 12 benchmarks at only a modest price premium (25% on input, 7.5% on output). Be prepared for the integration quirk: R1 0528 requires a high max completion token setting or it will return empty responses on certain task types.

Choose Devstral Medium if you're at very high output volumes where small per-token savings add up (roughly $15 per 100M output tokens), your workload is limited to classification or structured output (where the two models tie), and you have no need for agentic pipelines, multilingual output, or safety-calibrated responses. Devstral Medium's positioning as a code generation and agentic reasoning model does not translate to top benchmark scores in our testing; teams with coding-heavy workflows should weigh that carefully before committing.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions