Devstral Medium vs Grok 4.20
In our 12-test suite, Grok 4.20 is the practical winner for agents, long-context retrieval, and high-fidelity outputs, taking 9 of 12 benchmarks. Devstral Medium matches its classification and agentic-planning scores at roughly one-third the per-token cost, so pick it when price is the dominant constraint.
| Model | Provider | Input price | Output price |
|---|---|---|---|
| Devstral Medium | Mistral | $0.400/MTok | $2.00/MTok |
| Grok 4.20 | xAI | $2.00/MTok | $6.00/MTok |
Benchmark Analysis
Across our 12-test suite Grok 4.20 dominates: it wins 9 benchmarks, Devstral Medium wins 0, and 3 tests tie (classification, safety calibration, agentic planning). Test-by-test (scoreA = Devstral, scoreB = Grok), with interpretation:

- Tool calling: 3 vs 5. Grok is tied for 1st of 54 (the best-in-class group); Devstral ranks 47 of 54. For agents and function selection, Grok's 5 means more accurate function choice and argument sequencing in our tests.
- Faithfulness: 4 vs 5. Grok is tied for 1st of 55; Devstral is mid-pack (rank 34). Expect fewer hallucinations from Grok in our testing.
- Long context: 4 vs 5. Grok tied for 1st of 55; Devstral ranks 38. For retrieval over 30K+ tokens, Grok performed better in our runs.
- Structured output: 4 vs 5. Grok tied for 1st of 54; Devstral ranks 26. Grok was better at strict JSON/schema adherence in our tests.
- Strategic analysis: 2 vs 5. Grok tied for 1st; Devstral ranks 44. Grok handled nuanced tradeoffs and numeric reasoning far better in our evaluations.
- Constrained rewriting: 3 vs 4. Grok ranks 6th; Devstral ranks 31st. Grok compresses to hard limits more reliably.
- Creative problem solving: 2 vs 4. Grok ranks 9th; Devstral ranks 47th. Grok produced more feasible, non-obvious ideas on our tasks.
- Persona consistency: 3 vs 5. Grok tied for 1st; Devstral ranks 45. Grok kept role/character fidelity better.
- Multilingual: 4 vs 5. Grok tied for 1st; Devstral ranks 36. Grok showed stronger non-English parity.
- Classification: 4 vs 4 (tie). Both tied for 1st alongside many other models; classification/routing performance is comparable in our tests.
- Agentic planning: 4 vs 4 (tie). Both models scored equally on goal decomposition and recovery in our suite.
- Safety calibration: 1 vs 1 (tie). Both scored poorly (rank 32 of 55); neither reliably refuses harmful requests while permitting legitimate ones.

Overall, Grok's wins concentrate where agents, retrieval, and strict formats matter; Devstral matches on basic classification and planning but lags on tool calling, faithfulness, and long context.
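For reference, here are the raw 1–5 scores behind that tally, transcribed from the list above into a form you can re-check:

```python
# Scores from the test-by-test list above: (Devstral Medium, Grok 4.20).
SCORES = {
    "tool_calling": (3, 5), "faithfulness": (4, 5), "long_context": (4, 5),
    "structured_output": (4, 5), "strategic_analysis": (2, 5),
    "constrained_rewriting": (3, 4), "creative_problem_solving": (2, 4),
    "persona_consistency": (3, 5), "multilingual": (4, 5),
    "classification": (4, 4), "agentic_planning": (4, 4),
    "safety_calibration": (1, 1),
}

grok_wins = sum(b > a for a, b in SCORES.values())
devstral_wins = sum(a > b for a, b in SCORES.values())
ties = sum(a == b for a, b in SCORES.values())
print(grok_wins, devstral_wins, ties)  # -> 9 0 3
```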
Pricing Analysis
From the pricing cards above: Devstral Medium charges $0.40 (input) / $2.00 (output) per million tokens; Grok 4.20 charges $2.00 (input) / $6.00 (output) per million tokens. At equal input and output volumes, a 1M/1M-token month costs ~$2.40 on Devstral vs ~$8.00 on Grok (a $5.60 difference). At 10M/10M: ~$24 vs ~$80. At 100M/100M: ~$240 vs ~$800. The price ratio of roughly 0.33 reflects Devstral being about one-third the per-token cost. High-volume deployments (10M+ tokens/month) and cost-sensitive startups should prioritize Devstral Medium; teams that need better tool calling, faithfulness, long context, multilingual parity, and structured output should budget for Grok.
Real-World Cost Comparison
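As a concrete sketch of the arithmetic above, here is a minimal cost calculator. Prices are taken from the pricing cards; it assumes flat, linear per-token billing with no caching, batching, or volume discounts:

```python
# Simplified cost model for the two models compared above.
# Prices are per million tokens (MTok), from the pricing cards;
# real bills may differ (caching, batch discounts, rate tiers).

PRICES = {  # model -> (input $/MTok, output $/MTok)
    "devstral-medium": (0.40, 2.00),
    "grok-4.20": (2.00, 6.00),
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """USD cost for one month, volumes given in millions of tokens."""
    in_price, out_price = PRICES[model]
    return input_mtok * in_price + output_mtok * out_price

for volume in (1, 10, 100):  # 1M/1M, 10M/10M, 100M/100M token months
    d = monthly_cost("devstral-medium", volume, volume)
    g = monthly_cost("grok-4.20", volume, volume)
    print(f"{volume}M/{volume}M: Devstral ${d:,.2f} vs Grok ${g:,.2f} "
          f"(delta ${g - d:,.2f})")
```

Running this reproduces the figures in the Pricing Analysis: $2.40 vs $8.00, $24 vs $80, and $240 vs $800.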
Bottom Line
Choose Devstral Medium if you need a lower-cost model ($0.40 input / $2.00 output per MTok) for high-volume classification, basic agentic planning, or budget-constrained production where tool-calling fidelity and top-tier long-context retrieval are not critical.

Choose Grok 4.20 if you prioritize accurate tool calling, stronger faithfulness, long-context retrieval, structured outputs, multilingual parity, and better strategic/creative reasoning, and can absorb the higher cost ($2.00 input / $6.00 output per MTok).

Note that both models scored equally on classification and agentic planning in our tests, and both scored low on safety calibration; plan safeguards accordingly.
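If you run both models behind a single interface, the decision criteria above can be encoded as a simple router. A hypothetical sketch (the model identifiers and task labels are illustrative placeholders, not a real API):

```python
# Hypothetical task router encoding the decision criteria above.
# Model identifiers and task labels are illustrative placeholders.

GROK_STRENGTHS = {
    "tool_calling", "faithfulness", "long_context", "structured_output",
    "strategic_analysis", "constrained_rewriting",
    "creative_problem_solving", "persona_consistency", "multilingual",
}
TIED_TASKS = {"classification", "agentic_planning"}

def pick_model(task: str, cost_sensitive: bool = False) -> str:
    """Route a task to a model based on the benchmark results above."""
    if task in TIED_TASKS:
        return "devstral-medium"   # equal scores at ~1/3 the cost
    if task in GROK_STRENGTHS and not cost_sensitive:
        return "grok-4.20"         # clear quality win in our tests
    return "devstral-medium"       # price is the dominant constraint
```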
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
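For readers who want a feel for the mechanics, here is a minimal sketch of 1–5 LLM-judge scoring. The rubric wording and the judge callable are assumptions for illustration, not our actual harness:

```python
import re

# Minimal sketch of 1-to-5 LLM-judge scoring. The rubric text and the
# `judge` callable are illustrative assumptions, not our real pipeline.

RUBRIC = """Score the candidate response from 1 (fails the task) to 5
(flawless). Reply with a single integer and nothing else.

Task: {task}
Candidate response: {response}"""

def parse_score(judge_reply: str) -> int:
    """Extract the first digit 1-5 from the judge's reply."""
    match = re.search(r"[1-5]", judge_reply)
    if not match:
        raise ValueError(f"unscorable judge reply: {judge_reply!r}")
    return int(match.group())

def score(task: str, response: str, judge) -> int:
    """`judge` is any callable that sends a prompt to an LLM and
    returns its text reply (provider-specific; assumed here)."""
    return parse_score(judge(RUBRIC.format(task=task, response=response)))
```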