R1 0528 vs Devstral Medium

R1 0528 is the clear choice for most workloads, winning 10 of 12 benchmarks in our testing against Devstral Medium and tying the remaining two. Devstral Medium wins zero benchmarks outright, though it costs marginally less ($0.40/$2.00 input/output per MTok vs $0.50/$2.15). The price premium for R1 0528 (25% on input, 7.5% on output) is hard to argue against given the performance gap, particularly on agentic planning, tool calling, and reasoning-heavy tasks.

R1 0528 (DeepSeek)

Overall: 4.50/5 (Strong)

Benchmark Scores
Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 4/5
Strategic Analysis: 4/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 4/5

External Benchmarks
SWE-bench Verified: N/A
MATH Level 5: 96.6%
AIME 2025: 66.4%

Pricing
Input: $0.50/MTok
Output: $2.15/MTok
Context Window: 164K

modelpicker.net

Devstral Medium (Mistral)

Overall: 3.17/5 (Usable)

Benchmark Scores
Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 3/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks
SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing
Input: $0.40/MTok
Output: $2.00/MTok
Context Window: 131K


Benchmark Analysis

R1 0528 dominates this comparison, winning 10 of 12 benchmarks in our testing; Devstral Medium wins none, and the two tie on structured output and classification.

Agentic planning & tool calling: R1 0528 scores 5/5 on both, ranking tied for 1st among 54 models on tool calling and tied for 1st among 54 on agentic planning in our tests. Devstral Medium scores 3/5 on tool calling (rank 47 of 54) and 4/5 on agentic planning (rank 16 of 54). For any workflow involving function calls, multi-step agent loops, or goal decomposition, R1 0528 has a substantial edge.

Reasoning & analysis: R1 0528 scores 4/5 on strategic analysis (rank 27 of 54) and 4/5 on creative problem solving (rank 9 of 54, tied with 20 others). Devstral Medium scores 2/5 on both — rank 44 of 54 on strategic analysis and rank 47 of 54 on creative problem solving. These are significant gaps: strategic analysis tests nuanced tradeoff reasoning with real numbers, and creative problem solving tests non-obvious, feasible ideas. Devstral Medium is near the bottom of the field on both.

Math benchmarks (Epoch AI): R1 0528 scores 96.6% on MATH Level 5 (rank 5 of the 14 models with reported scores) and 66.4% on AIME 2025 (rank 16 of 23). Devstral Medium has no external benchmark scores in our data. The 96.6% MATH Level 5 result places R1 0528 well above the 94.15% median across models with reported scores, confirming strong mathematical reasoning.

Long context & faithfulness: R1 0528 scores 5/5 on both (tied for 1st among 55 models on each). Devstral Medium scores 4/5 on faithfulness (rank 34 of 55) and 4/5 on long context (rank 38 of 55). For retrieval at 30K+ tokens and staying grounded in source material, R1 0528 is more reliable.

Safety calibration: R1 0528 scores 4/5, rank 6 of 55 — one of only 4 models at this level in our tests. Devstral Medium scores 1/5, rank 32 of 55. This is the starkest gap in the comparison. For applications that need a model to correctly refuse harmful requests while permitting legitimate ones, R1 0528 is substantially more calibrated in our testing.

Persona consistency & multilingual: R1 0528 scores 5/5 on persona consistency (tied for 1st among 53 models) and 5/5 on multilingual (tied for 1st among 55). Devstral Medium scores 3/5 on persona consistency (rank 45 of 53) and 4/5 on multilingual (rank 36 of 55).

Ties: Both models score 4/5 on structured output and 4/5 on classification, sharing the same rank on each (26 of 54 and tied for 1st of 53, respectively).

Important caveat: R1 0528 is a reasoning model that can return empty responses on structured output, constrained rewriting, and agentic planning tasks if max completion tokens are not set high enough, as reasoning tokens consume the output budget. This is a real integration concern — developers must set high max completion token values to avoid silent failures.
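One way to guard against those silent empty responses is to size the completion budget explicitly before each request. A minimal sketch in Python; the helper name and the 4096-token headroom are illustrative assumptions, not values from our testing:

```python
def completion_budget(expected_answer_tokens: int, reasoning_headroom: int = 4096) -> int:
    """Size the max completion tokens for a reasoning model.

    Reasoning models like R1 0528 spend hidden reasoning tokens out of the
    same budget as the visible answer, so the cap must cover both. A cap
    sized only for the answer can be consumed entirely by reasoning,
    leaving an empty visible response.
    """
    return expected_answer_tokens + reasoning_headroom


# e.g. a structured-output task expected to emit ~800 tokens of JSON
budget = completion_budget(800)  # 4896: 800 visible + 4096 reasoning headroom
```

The resulting budget would then be passed as the max completion token parameter of whatever client library you use. The right headroom varies by task; erring low fails silently rather than raising an error, so err high.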

Benchmark                   R1 0528    Devstral Medium
Faithfulness                5/5        4/5
Long Context                5/5        4/5
Multilingual                5/5        4/5
Tool Calling                5/5        3/5
Classification              4/5        4/5
Agentic Planning            5/5        4/5
Structured Output           4/5        4/5
Safety Calibration          4/5        1/5
Strategic Analysis          4/5        2/5
Persona Consistency         5/5        3/5
Constrained Rewriting       4/5        3/5
Creative Problem Solving    4/5        2/5
Summary                     10 wins    0 wins

Pricing Analysis

R1 0528 costs $0.50/MTok input and $2.15/MTok output. Devstral Medium costs $0.40/MTok input and $2.00/MTok output. That's a $0.10 input and $0.15 output gap per million tokens, small in absolute terms. At 1M output tokens/month, R1 0528 costs $2.15 vs Devstral Medium's $2.00: a $0.15 difference, essentially negligible. At 10M output tokens, the gap grows to $1.50/month; at 100M tokens, it's $15/month, plus $10/month for every 100M input tokens. For high-volume production workloads pushing 100M+ output tokens monthly, the cost difference becomes a real line item, but the modest price premium (25% on input, 7.5% on output) means teams would need to be very cost-sensitive before the savings from Devstral Medium justify its weaker benchmark performance. For most developers and product teams, the capability gap far outweighs the price difference.
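The arithmetic above is easy to reproduce for your own traffic mix. A quick sketch using the listed prices; the 100M-input/100M-output split is an illustrative assumption:

```python
def monthly_cost(input_mtok: float, output_mtok: float,
                 in_price: float, out_price: float) -> float:
    """Monthly spend in dollars, given traffic in millions of tokens (MTok)."""
    return input_mtok * in_price + output_mtok * out_price


# Per-MTok prices from the comparison above; assume 100M input + 100M output per month.
r1_0528 = monthly_cost(100, 100, in_price=0.50, out_price=2.15)   # $265/month
devstral = monthly_cost(100, 100, in_price=0.40, out_price=2.00)  # $240/month
gap = r1_0528 - devstral                                          # $25/month
```

Swap in your own token volumes to see where the gap stops being noise for your budget.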

Real-World Cost Comparison

Task              R1 0528    Devstral Medium
Chat response     $0.0012    $0.0011
Blog post         $0.0046    $0.0042
Document batch    $0.117     $0.108
Pipeline run      $1.18      $1.08

Bottom Line

Choose R1 0528 if you're building agentic systems, need reliable tool calling, handle math-heavy or reasoning-intensive tasks, require strong safety calibration, or work across languages: it outperforms Devstral Medium on 10 of 12 benchmarks at only a modest price premium (25% on input, 7.5% on output). Be prepared for the integration quirk: R1 0528 requires a high max completion token setting or it will return empty responses on certain task types.

Choose Devstral Medium if you're at very high output volumes where small per-token savings add up (roughly $15 per 100M output tokens), your workload is limited to classification or structured output (where the two models tie), and you have no need for agentic pipelines, multilingual output, or safety-calibrated responses. Devstral Medium's positioning as a code generation and agentic reasoning model does not translate to top benchmark scores in our testing; teams with coding-heavy workflows should weigh that carefully before committing.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions