Devstral Medium vs Grok 4.20
In our 12-test suite, Grok 4.20 is the practical winner for agents, long-context retrieval, and high-fidelity outputs, taking 9 of 12 benchmarks. Devstral Medium matches its classification and agentic-planning scores at roughly one-third the per-token cost, so pick it when price is the dominant constraint.
| Model | Provider | Input price | Output price |
|---|---|---|---|
| Devstral Medium | Mistral | $0.400/MTok | $2.00/MTok |
| Grok 4.20 | xAI | $2.00/MTok | $6.00/MTok |
Benchmark Analysis
Across our 12-test suite Grok 4.20 dominates: it wins 9 benchmarks, Devstral Medium wins 0, and 3 tests tie (classification, safety calibration, agentic planning). Test-by-test (scoreA = Devstral, scoreB = Grok), with interpretation:

- Tool calling: 3 vs 5. Grok is tied for 1st of 54 (the best-in-class group); Devstral ranks 47 of 54. For agents and function selection, Grok's 5 means more accurate function choice and argument sequencing in our tests.
- Faithfulness: 4 vs 5. Grok is tied for 1st of 55; Devstral is mid-pack (rank 34). Expect fewer hallucinations from Grok in our testing.
- Long context: 4 vs 5. Grok tied for 1st of 55; Devstral ranks 38. For retrieval over 30K+ tokens, Grok performed better in our runs.
- Structured output: 4 vs 5. Grok tied for 1st of 54; Devstral ranks 26. Grok was better at strict JSON/schema adherence in our tests.
- Strategic analysis: 2 vs 5. Grok tied for 1st; Devstral ranks 44. Grok handled nuanced tradeoffs and numeric reasoning far better in our evaluations.
- Constrained rewriting: 3 vs 4. Grok ranks 6th; Devstral ranks 31st. Grok compresses to hard limits more reliably.
- Creative problem solving: 2 vs 4. Grok ranks 9th; Devstral ranks 47th. Grok produced more feasible, non-obvious ideas on our tasks.
- Persona consistency: 3 vs 5. Grok tied for 1st; Devstral ranks 45. Grok kept role/character fidelity better.
- Multilingual: 4 vs 5. Grok tied for 1st; Devstral ranks 36. Grok showed stronger non-English parity.
- Classification: 4 vs 4 (tie). Both tied for 1st alongside many other models; classification/routing performance is comparable in our tests.
- Agentic planning: 4 vs 4 (tie). Both models scored equally on goal decomposition and recovery in our suite.
- Safety calibration: 1 vs 1 (tie). Both scored poorly (rank 32 of 55); neither reliably refuses harmful requests while permitting legitimate ones.

Overall, Grok's wins concentrate where agents, retrieval, and strict formats matter; Devstral matches on basic classification and planning but lags on tool calling, faithfulness, and long context.
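For reference, here are the raw 1–5 scores behind that tally, transcribed from the list above into a form you can re-check:

```python
# Scores from the test-by-test list above: (Devstral Medium, Grok 4.20).
SCORES = {
    "tool_calling": (3, 5), "faithfulness": (4, 5), "long_context": (4, 5),
    "structured_output": (4, 5), "strategic_analysis": (2, 5),
    "constrained_rewriting": (3, 4), "creative_problem_solving": (2, 4),
    "persona_consistency": (3, 5), "multilingual": (4, 5),
    "classification": (4, 4), "agentic_planning": (4, 4),
    "safety_calibration": (1, 1),
}

grok_wins = sum(b > a for a, b in SCORES.values())
devstral_wins = sum(a > b for a, b in SCORES.values())
ties = sum(a == b for a, b in SCORES.values())
print(grok_wins, devstral_wins, ties)  # -> 9 0 3
```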
Pricing Analysis
From the pricing cards above: Devstral Medium charges $0.40 (input) / $2.00 (output) per million tokens; Grok 4.20 charges $2.00 (input) / $6.00 (output) per million tokens. At equal input and output volumes, a 1M/1M-token month costs ~$2.40 on Devstral vs ~$8.00 on Grok (a $5.60 difference). At 10M/10M: ~$24 vs ~$80. At 100M/100M: ~$240 vs ~$800. The price ratio of roughly 0.33 reflects Devstral being about one-third the per-token cost. High-volume deployments (10M+ tokens/month) and cost-sensitive startups should prioritize Devstral Medium; teams that need better tool calling, faithfulness, long context, multilingual parity, and structured output should budget for Grok.
Real-World Cost Comparison
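As a concrete sketch of the arithmetic above, here is a minimal cost calculator. Prices are taken from the pricing cards; it assumes flat, linear per-token billing with no caching, batching, or volume discounts:

```python
# Simplified cost model for the two models compared above.
# Prices are per million tokens (MTok), from the pricing cards;
# real bills may differ (caching, batch discounts, rate tiers).

PRICES = {  # model -> (input $/MTok, output $/MTok)
    "devstral-medium": (0.40, 2.00),
    "grok-4.20": (2.00, 6.00),
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """USD cost for one month, volumes given in millions of tokens."""
    in_price, out_price = PRICES[model]
    return input_mtok * in_price + output_mtok * out_price

for volume in (1, 10, 100):  # 1M/1M, 10M/10M, 100M/100M token months
    d = monthly_cost("devstral-medium", volume, volume)
    g = monthly_cost("grok-4.20", volume, volume)
    print(f"{volume}M/{volume}M: Devstral ${d:,.2f} vs Grok ${g:,.2f} "
          f"(delta ${g - d:,.2f})")
```

Running this reproduces the figures in the Pricing Analysis: $2.40 vs $8.00, $24 vs $80, and $240 vs $800.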
Bottom Line
Choose Devstral Medium if you need a lower-cost model ($0.40 input / $2.00 output per MTok) for high-volume classification, basic agentic planning, or budget-constrained production where tool-calling fidelity and top-tier long-context retrieval are not critical.

Choose Grok 4.20 if you prioritize accurate tool calling, stronger faithfulness, long-context retrieval, structured outputs, multilingual parity, and better strategic/creative reasoning, and can absorb the higher cost ($2.00 input / $6.00 output per MTok).

Note that both models scored equally on classification and agentic planning in our tests, and both scored low on safety calibration; plan safeguards accordingly.
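If you run both models behind a single interface, the decision criteria above can be encoded as a simple router. A hypothetical sketch (the model identifiers and task labels are illustrative placeholders, not a real API):

```python
# Hypothetical task router encoding the decision criteria above.
# Model identifiers and task labels are illustrative placeholders.

GROK_STRENGTHS = {
    "tool_calling", "faithfulness", "long_context", "structured_output",
    "strategic_analysis", "constrained_rewriting",
    "creative_problem_solving", "persona_consistency", "multilingual",
}
TIED_TASKS = {"classification", "agentic_planning"}

def pick_model(task: str, cost_sensitive: bool = False) -> str:
    """Route a task to a model based on the benchmark results above."""
    if task in TIED_TASKS:
        return "devstral-medium"   # equal scores at ~1/3 the cost
    if task in GROK_STRENGTHS and not cost_sensitive:
        return "grok-4.20"         # clear quality win in our tests
    return "devstral-medium"       # price is the dominant constraint
```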
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
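For readers who want a feel for the mechanics, here is a minimal sketch of 1–5 LLM-judge scoring. The rubric wording and the judge callable are assumptions for illustration, not our actual harness:

```python
import re

# Minimal sketch of 1-to-5 LLM-judge scoring. The rubric text and the
# `judge` callable are illustrative assumptions, not our real pipeline.

RUBRIC = """Score the candidate response from 1 (fails the task) to 5
(flawless). Reply with a single integer and nothing else.

Task: {task}
Candidate response: {response}"""

def parse_score(judge_reply: str) -> int:
    """Extract the first digit 1-5 from the judge's reply."""
    match = re.search(r"[1-5]", judge_reply)
    if not match:
        raise ValueError(f"unscorable judge reply: {judge_reply!r}")
    return int(match.group())

def score(task: str, response: str, judge) -> int:
    """`judge` is any callable that sends a prompt to an LLM and
    returns its text reply (provider-specific; assumed here)."""
    return parse_score(judge(RUBRIC.format(task=task, response=response)))
```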