Devstral Medium vs Grok 4
Grok 4 is the stronger model across nearly every dimension in our testing, winning 9 of 12 benchmarks outright and tying 2 more — its advantages on strategic analysis (5 vs 2), faithfulness (5 vs 4), and multilingual (5 vs 4) are particularly meaningful. Devstral Medium's only win is agentic planning (4 vs 3), which matters for autonomous workflow tasks. At $15/M output tokens versus $2/M for Devstral Medium, Grok 4 costs 7.5x more on the output side — a gap that's hard to justify unless you specifically need its reasoning depth or multimodal capabilities.
Devstral Medium (Mistral)
Pricing: $0.40/MTok input, $2.00/MTok output
modelpicker.net
Grok 4 (xAI)
Pricing: $3.00/MTok input, $15.00/MTok output
Benchmark Analysis
Across our 12-test suite, Grok 4 wins 9 benchmarks, Devstral Medium wins 1, and they tie on 2.
Where Grok 4 wins clearly:
- Strategic analysis: 5 vs 2. Grok 4 ties for 1st among 54 models; Devstral Medium ranks 44th. This is the largest gap in the comparison and means real differences in nuanced tradeoff reasoning and decision-support tasks.
- Faithfulness: 5 vs 4. Grok 4 ties for 1st among 55 models; Devstral Medium ranks 34th. Fewer hallucinations and better source adherence — critical for RAG applications and summarization.
- Persona consistency: 5 vs 3. Grok 4 ties for 1st among 53 models; Devstral Medium ranks 45th. A two-point gap here suggests Devstral Medium struggles to maintain character under pressure, which limits its usefulness in chatbot or roleplay applications.
- Multilingual: 5 vs 4. Grok 4 ties for 1st among 55 models; Devstral Medium ranks 36th. With a median (p50) of 5 on this test, Grok 4 matches the typical score at the ceiling, while Devstral Medium's 4 sits just below it.
- Tool calling: 4 vs 3. Grok 4 ranks 18th of 54; Devstral Medium ranks 47th. For agentic workflows dependent on accurate function selection and argument passing, this gap is operationally significant.
- Long context: 5 vs 4. Grok 4 ties for 1st among 55 models; Devstral Medium ranks 38th. Grok 4 also has a 256K context window vs Devstral Medium's 131K, roughly double the capacity.
- Safety calibration: 2 vs 1. Neither model excels here; Grok 4 ranks 12th of 55 while Devstral Medium ranks 32nd. Grok 4 matches the p50 of 2, while Devstral Medium's score of 1 falls below it and puts it in the bottom quartile.
- Constrained rewriting: 4 vs 3. Grok 4 ranks 6th of 53; Devstral Medium ranks 31st.
- Creative problem solving: 3 vs 2. Grok 4 ranks 30th of 54; Devstral Medium ranks 47th.
Where Devstral Medium wins:
- Agentic planning: 4 vs 3. Devstral Medium ranks 16th of 54; Grok 4 ranks 42nd. This is a meaningful reversal — Devstral Medium is built specifically for code generation and agentic reasoning, and this score reflects that. For goal decomposition and multi-step autonomous task execution, Devstral Medium outperforms Grok 4 in our tests.
Ties:
- Structured output: Both score 4, both rank 26th of 54. JSON schema compliance is equivalent.
- Classification: Both score 4, both tie for 1st of 53. Routing and categorization tasks are a wash.
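The win/loss/tie tally can be reproduced from the per-benchmark scores listed above. The following is a minimal sketch, not the site's scoring code: the dictionary layout and the helper names `tally` and `largest_gap` are assumptions made for illustration, and only the scores themselves come from the comparison.

```python
# Scores copied from the comparison above as (Grok 4, Devstral Medium), on a 1-5 scale.
SCORES = {
    "strategic analysis": (5, 2),
    "faithfulness": (5, 4),
    "persona consistency": (5, 3),
    "multilingual": (5, 4),
    "tool calling": (4, 3),
    "long context": (5, 4),
    "safety calibration": (2, 1),
    "constrained rewriting": (4, 3),
    "creative problem solving": (3, 2),
    "agentic planning": (3, 4),
    "structured output": (4, 4),
    "classification": (4, 4),
}


def tally(scores):
    """Count benchmark wins for each model, plus ties (hypothetical helper)."""
    result = {"grok_4": 0, "devstral_medium": 0, "tie": 0}
    for grok, devstral in scores.values():
        if grok > devstral:
            result["grok_4"] += 1
        elif devstral > grok:
            result["devstral_medium"] += 1
        else:
            result["tie"] += 1
    return result


def largest_gap(scores):
    """Return the benchmark with the biggest absolute score difference."""
    return max(scores, key=lambda name: abs(scores[name][0] - scores[name][1]))


print(tally(SCORES))
print(largest_gap(SCORES))
```

Keeping the scores in one table makes it easy to re-run the tally if a benchmark is rescored.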
Pricing Analysis
Devstral Medium costs $0.40/M input and $2.00/M output tokens. Grok 4 costs $3.00/M input and $15.00/M output tokens, which is 7.5x more expensive on both input and output. At 1M output tokens/month, that's $2 vs $15, a $13 difference. At 10M output tokens/month, you're paying $20 vs $150. At 100M output tokens/month, the gap is $200 vs $1,500. For high-volume applications (bulk document processing, large-scale classification pipelines, or cost-sensitive consumer products), Devstral Medium's pricing is a genuine advantage. Grok 4's price premium makes sense for lower-volume, high-stakes tasks: legal analysis, strategic research, or multimodal workflows where Grok 4's image and file input support (not available on Devstral Medium, according to the model data) adds real capability. Note also that Grok 4 uses reasoning tokens, which can inflate actual output costs beyond the base rate.
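The arithmetic behind these figures is simple enough to check directly. The sketch below assumes flat per-token pricing with no volume discounts, and it ignores Grok 4's reasoning tokens, so real Grok 4 bills may run higher. The names `PRICES`, `monthly_cost`, and `price_ratio` are hypothetical; only the four prices come from the text.

```python
# USD per million tokens, copied from the pricing figures above.
PRICES = {
    "Devstral Medium": {"input": 0.40, "output": 2.00},
    "Grok 4": {"input": 3.00, "output": 15.00},
}


def monthly_cost(model, input_mtok=0.0, output_mtok=0.0):
    """Cost in USD for a month's traffic; volumes are in millions of tokens."""
    price = PRICES[model]
    return round(price["input"] * input_mtok + price["output"] * output_mtok, 2)


def price_ratio(kind):
    """How many times more Grok 4 costs than Devstral Medium for 'input' or 'output'."""
    return round(PRICES["Grok 4"][kind] / PRICES["Devstral Medium"][kind], 2)


# Output-only cost at the three monthly volumes discussed above.
for volume in (1, 10, 100):
    devstral = monthly_cost("Devstral Medium", output_mtok=volume)
    grok = monthly_cost("Grok 4", output_mtok=volume)
    print(f"{volume}M output tokens: ${devstral:,.2f} vs ${grok:,.2f}")

print("input ratio:", price_ratio("input"))
print("output ratio:", price_ratio("output"))
```

Passing a non-zero `input_mtok` gives a blended estimate for workloads where prompts dominate, such as long-document summarization.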
Bottom Line
Choose Devstral Medium if your primary use case is agentic planning and multi-step autonomous workflows — it scores 4 vs Grok 4's 3 in our testing and ranks 16th of 54 models on that benchmark. It's also the right choice for high-volume, cost-sensitive applications: at $2/M output tokens, you can run 7.5x the volume for the same budget. It handles structured output and classification as well as Grok 4 at a fraction of the price.
Choose Grok 4 if you need strong strategic analysis (5/5, tied for 1st), high faithfulness for RAG or summarization pipelines (5/5), reliable multilingual output (5/5), or multimodal inputs (image and file support, which Devstral Medium does not offer according to the model data). Grok 4's 256K context window is also roughly twice Devstral Medium's 131K, useful for very long document workflows. The $15/M output cost is justified when task quality, reasoning depth, or modality support directly affects outcomes.
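One way to turn these recommendations into a repeatable rule is to pick the cheapest model that clears a minimum score on every benchmark a task depends on. This is a hypothetical decision rule, not part of the site's methodology; the `MODELS` table and the `cheapest_meeting` helper are assumptions, and only the scores and output prices are taken from the comparison.

```python
# Scores (1-5) for the benchmarks named in the recommendations above;
# output_price is USD per million output tokens.
MODELS = {
    "Devstral Medium": {
        "output_price": 2.00,
        "scores": {
            "agentic planning": 4, "structured output": 4, "classification": 4,
            "strategic analysis": 2, "faithfulness": 4, "multilingual": 4,
        },
    },
    "Grok 4": {
        "output_price": 15.00,
        "scores": {
            "agentic planning": 3, "structured output": 4, "classification": 4,
            "strategic analysis": 5, "faithfulness": 5, "multilingual": 5,
        },
    },
}


def cheapest_meeting(requirements):
    """Return the cheapest model meeting every minimum score, or None if none does.

    `requirements` maps a benchmark name to the minimum acceptable score.
    """
    candidates = [
        name
        for name, info in MODELS.items()
        if all(info["scores"].get(bench, 0) >= minimum
               for bench, minimum in requirements.items())
    ]
    return min(candidates, key=lambda name: MODELS[name]["output_price"], default=None)


print(cheapest_meeting({"classification": 4}))
print(cheapest_meeting({"strategic analysis": 5}))
print(cheapest_meeting({"agentic planning": 4, "faithfulness": 5}))
```

The last call returns `None` because neither model meets both thresholds, which signals that a workload may need to be split across models or have its requirements relaxed.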
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.