Devstral Medium vs Grok 4.20

In our 12-test suite Grok 4.20 is the practical winner for agents, long-context retrieval, and high-fidelity outputs, winning 9 of 12 benchmarks. Devstral Medium offers the same classification and agentic-planning scores at roughly one-third the per-token cost, so pick it when price is the dominant constraint.

mistral

Devstral Medium

Overall
3.17/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
3/5
Classification
4/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.400/MTok

Output

$2.00/MTok

Context Window: 131K

modelpicker.net

xai

Grok 4.20

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
4/5
Structured Output
5/5
Safety Calibration
1/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$2.00/MTok

Output

$6.00/MTok

Context Window: 2000K


Benchmark Analysis

Across our 12-test suite Grok 4.20 dominates: it wins 9 benchmarks, Devstral Medium wins 0, and 3 tests tie (classification, safety calibration, agentic planning). Test by test, with Devstral Medium's score listed first and Grok 4.20's second:

- Tool calling: 3 vs 5. Grok is tied for 1st of 54 (best-in-class group); Devstral ranks 47 of 54. For agents and function selection, Grok's 5 means more accurate function choice and argument sequencing in our tests.
- Faithfulness: 4 vs 5. Grok is tied for 1st of 55; Devstral is mid-pack (rank 34). Expect fewer hallucinations with Grok, based on our testing.
- Long context: 4 vs 5. Grok is tied for 1st of 55; Devstral ranks 38. For retrieval over 30K+ tokens, Grok performed better in our runs.
- Structured output: 4 vs 5. Grok is tied for 1st of 54; Devstral ranks 26. Grok was better at strict JSON/schema adherence in our tests.
- Strategic analysis: 2 vs 5. Grok is tied for 1st; Devstral ranks 44. Grok handled nuanced tradeoffs and numeric reasoning far better in our evaluations.
- Constrained rewriting: 3 vs 4. Grok ranks 6; Devstral ranks 31. Grok compresses to hard limits more reliably.
- Creative problem solving: 2 vs 4. Grok ranks 9; Devstral ranks 47. Grok produced more feasible, non-obvious ideas in our tasks.
- Persona consistency: 3 vs 5. Grok is tied for 1st; Devstral ranks 45. Grok kept role and character fidelity better.
- Multilingual: 4 vs 5. Grok is tied for 1st; Devstral ranks 36. Grok showed stronger non-English parity.
- Classification: 4 vs 4 (tie). Both are tied for 1st with many other models; classification and routing tasks are comparable in our tests.
- Agentic planning: 4 vs 4 (tie). Both models scored equally on goal decomposition and recovery in our suite.
- Safety calibration: 1 vs 1 (tie). Both models scored poorly at safety calibration in our tests (rank 32 of 55); neither reliably refuses harmful requests while permitting legitimate ones.

Overall, Grok’s wins concentrate where agents, retrieval, and strict formats matter; Devstral matches basic classification and planning performance but lags on tool-calling, faithfulness, and long-context.

Benchmark | Devstral Medium | Grok 4.20
Faithfulness | 4/5 | 5/5
Long Context | 4/5 | 5/5
Multilingual | 4/5 | 5/5
Tool Calling | 3/5 | 5/5
Classification | 4/5 | 4/5
Agentic Planning | 4/5 | 4/5
Structured Output | 4/5 | 5/5
Safety Calibration | 1/5 | 1/5
Strategic Analysis | 2/5 | 5/5
Persona Consistency | 3/5 | 5/5
Constrained Rewriting | 3/5 | 4/5
Creative Problem Solving | 2/5 | 4/5
Summary | 0 wins | 9 wins
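
The win/tie tally and the overall scores can be re-derived from the twelve per-benchmark scores. The following is a minimal sketch, not the site's actual scoring code: the names `SCORES`, `tally`, and `overall` are hypothetical, and treating the overall score as an unweighted mean is an assumption (it happens to reproduce the 3.17 and 4.33 shown on the cards).

```python
# Per-benchmark scores copied from the comparison table above,
# as (Devstral Medium, Grok 4.20) pairs.
SCORES = {
    "Faithfulness": (4, 5),
    "Long Context": (4, 5),
    "Multilingual": (4, 5),
    "Tool Calling": (3, 5),
    "Classification": (4, 4),
    "Agentic Planning": (4, 4),
    "Structured Output": (4, 5),
    "Safety Calibration": (1, 1),
    "Strategic Analysis": (2, 5),
    "Persona Consistency": (3, 5),
    "Constrained Rewriting": (3, 4),
    "Creative Problem Solving": (2, 4),
}


def tally(scores):
    """Return (first-model wins, second-model wins, ties)."""
    first = sum(1 for a, b in scores.values() if a > b)
    second = sum(1 for a, b in scores.values() if b > a)
    ties = sum(1 for a, b in scores.values() if a == b)
    return first, second, ties


def overall(scores, index):
    """Unweighted mean of one model's scores (assumed aggregation)."""
    values = [pair[index] for pair in scores.values()]
    return round(sum(values) / len(values), 2)


if __name__ == "__main__":
    print("Devstral wins, Grok wins, ties:", tally(SCORES))
    print("Devstral overall:", overall(SCORES, 0))
    print("Grok overall:", overall(SCORES, 1))
```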

Pricing Analysis

Devstral Medium charges $0.40 (input) / $2.00 (output) per million tokens (MTok); Grok 4.20 charges $2.00 (input) / $6.00 (output) per MTok. If you run equal input and output volumes, a 1M/1M token month costs Devstral about $2.40 vs Grok about $8.00 (difference $5.60). At 10M/10M: Devstral about $24 vs Grok about $80. At 100M/100M: Devstral about $240 vs Grok about $800. The listed price ratio (0.3333) matches the output-token prices ($2.00 / $6.00); on input tokens Devstral is one-fifth the price ($0.40 / $2.00), and at equal input and output volumes the blended ratio is 0.30. High-volume deployments (10M+ tokens/month) and cost-sensitive startups should prioritize Devstral Medium; teams that need better tool calling, faithfulness, long-context, multilingual and structured output should budget for Grok.
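
The monthly figures can be checked with a few lines of arithmetic. This is a rough sketch that assumes the per-MTok (per million tokens) prices shown on the pricing cards; `PRICES_PER_MTOK` and `monthly_cost` are illustrative names, not part of any vendor API.

```python
# Prices in USD per million tokens (MTok), taken from the pricing cards.
PRICES_PER_MTOK = {
    "Devstral Medium": {"input": 0.40, "output": 2.00},
    "Grok 4.20": {"input": 2.00, "output": 6.00},
}


def monthly_cost(model, input_mtok, output_mtok):
    """Cost in USD for a month with the given input/output volume in MTok."""
    price = PRICES_PER_MTOK[model]
    return input_mtok * price["input"] + output_mtok * price["output"]


if __name__ == "__main__":
    # Equal input and output volumes, as in the scenarios discussed above.
    for volume in (1, 10, 100):
        devstral = monthly_cost("Devstral Medium", volume, volume)
        grok = monthly_cost("Grok 4.20", volume, volume)
        print(
            f"{volume}M/{volume}M: Devstral ${devstral:,.2f} "
            f"vs Grok ${grok:,.2f} (ratio {devstral / grok:.2f})"
        )
```

Real workloads are rarely balanced between input and output, so passing your own input and output volumes separately gives a more useful estimate than the equal-volume scenarios.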

Real-World Cost Comparison

Task | Devstral Medium | Grok 4.20
Chat response | $0.0011 | $0.0034
Blog post | $0.0042 | $0.013
Document batch | $0.108 | $0.340
Pipeline run | $1.08 | $3.40
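
Per-task costs like these follow from the per-MTok prices and the token counts of a single request. The sketch below uses hypothetical token counts (200 input, 500 output): the source does not state the actual task sizes, and these values were chosen only because they are consistent with the chat-response row. `request_cost` is an illustrative helper, not a published formula.

```python
def request_cost(price_in, price_out, input_tokens, output_tokens):
    """USD cost of one request, given prices in USD per million tokens."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Hypothetical request size: an assumption, not stated in the source.
    input_tokens, output_tokens = 200, 500

    devstral = request_cost(0.40, 2.00, input_tokens, output_tokens)
    grok = request_cost(2.00, 6.00, input_tokens, output_tokens)
    print(f"Devstral Medium: ${devstral:.4f}")
    print(f"Grok 4.20: ${grok:.4f}")
```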

Bottom Line

Choose Devstral Medium if: you need a lower-cost model (input $0.40 / output $2.00 per MTok) for high-volume classification, basic agentic planning, or budget-constrained production where tool-calling fidelity and top-tier long-context are not critical.

Choose Grok 4.20 if: you prioritize accurate tool calling, stronger faithfulness, long-context retrieval, structured outputs, multilingual parity, and better strategic/creative reasoning, and can absorb the higher cost (input $2.00 / output $6.00 per MTok).

Note that both models scored equally on classification and agentic planning in our tests, and both scored low on safety calibration, so plan safeguards accordingly.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions