Question 1

Is Devstral Medium better than Grok 3 Mini?

Accepted Answer

In our testing across 12 benchmarks, Grok 3 Mini outperforms Devstral Medium on 8 tests, with Devstral Medium winning only 1 (agentic planning, scoring 4 vs 3). Three benchmarks are tied. Grok 3 Mini also costs 4x less on output tokens ($0.50 vs $2.00/MTok), making it the stronger choice for most use cases.

Question 2

Which is cheaper — Devstral Medium or Grok 3 Mini?

Accepted Answer

Grok 3 Mini is significantly cheaper. It costs $0.30/MTok input and $0.50/MTok output. Devstral Medium costs $0.40/MTok input and $2.00/MTok output. The output price difference (4x) is the key number — at 10M output tokens/month, that's $5,000 vs $20,000. At 100M tokens, the gap reaches $150,000/month.

Question 3

Which is better for coding and agentic tasks?

Accepted Answer

It depends on the task. Devstral Medium wins on agentic planning in our testing (4 vs 3), and its description positions it as a code generation and agentic reasoning model. However, Grok 3 Mini scores 5 on tool calling vs Devstral Medium's 3 — tool calling is foundational for most agentic workflows involving function calls and API orchestration. If your agentic system relies heavily on tool use, Grok 3 Mini has a clearer advantage in our benchmarks.

Question 4

Which handles long documents better?

Accepted Answer

Grok 3 Mini scores 5 on long context in our testing (tied for 1st among 55 models), while Devstral Medium scores 4 (rank 38 of 55). Both share a 131,072-token context window, but Grok 3 Mini demonstrates better retrieval accuracy at 30K+ tokens in our benchmarks.

Question 5

Does Grok 3 Mini support reasoning tokens?

Accepted Answer

Yes. The payload shows Grok 3 Mini has a 'uses_reasoning_tokens' quirk and supports an 'include_reasoning' parameter, meaning raw thinking traces are accessible via the API. Devstral Medium does not have this feature listed in its supported parameters.

Question 6

Which model is safer and more reliable for production?

Accepted Answer

Grok 3 Mini scores 2 on safety calibration in our testing (rank 12 of 55), while Devstral Medium scores 1 (rank 32 of 55). Neither model is in the top tier on this dimension — the 75th percentile across all 55 tested models is also a score of 2 — but Devstral Medium's score of 1 places it among the lowest performers. For production applications requiring reliable refusal of harmful requests, Grok 3 Mini is the safer option of the two.

Devstral Medium vs Grok 3 Mini

Devstral Medium

Grok 3 Mini

Benchmark Analysis

Pricing Analysis

Real-World Cost Comparison

Bottom Line

How We Test

Frequently Asked Questions