Devstral Medium vs Grok 3
Grok 3 is the stronger model on nearly every benchmark in our testing, winning 10 of 12 tests, with the largest gaps in strategic analysis (5 vs 2) and persona consistency (5 vs 3). Devstral Medium wins zero benchmarks outright and ties on classification and constrained rewriting. However, Devstral Medium's output cost of $2/MTok vs Grok 3's $15/MTok makes it 7.5x cheaper on output, the dimension that matters most at scale. That makes it a viable choice when budget is the primary constraint and you need basic structured output and faithfulness.
Devstral Medium (Mistral)
Pricing: $0.40/MTok input, $2.00/MTok output
modelpicker.net
Grok 3 (xAI)
Pricing: $3.00/MTok input, $15.00/MTok output
Benchmark Analysis
Across our 12-test benchmark suite, Grok 3 outperforms Devstral Medium on 10 tests, with the two models tying on constrained rewriting and classification. Devstral Medium wins zero tests outright.
Where Grok 3 leads:
- Strategic analysis: Grok 3 scores 5 vs Devstral Medium's 2 — the widest gap in this comparison. Grok 3 ties for 1st among 54 models on nuanced tradeoff reasoning with real numbers. Devstral Medium ranks 44th of 54, placing it firmly in the bottom tier on this task.
- Agentic planning: Grok 3 scores 5 vs Devstral Medium's 4. Grok 3 ties for 1st among 54 models on goal decomposition and failure recovery, sharing that top score with 14 others. Devstral Medium ranks 16th of 54 — respectable, but a meaningful step behind.
- Persona consistency: Grok 3 scores 5 vs Devstral Medium's 3. Grok 3 ties for 1st among 53 models; Devstral Medium ranks 45th of 53. This matters significantly for chatbot and assistant applications.
- Faithfulness: Grok 3 scores 5 vs Devstral Medium's 4. Grok 3 ties for 1st among 55 models on sticking to source material without hallucinating. For RAG applications and document-grounded tasks, this is a concrete advantage.
- Long context: Grok 3 scores 5 vs Devstral Medium's 4. Grok 3 ties for 1st among 55 models on retrieval accuracy at 30K+ tokens. Devstral Medium ranks 38th of 55.
- Multilingual: Grok 3 scores 5 vs Devstral Medium's 4. Grok 3 ties for 1st among 55 models; Devstral Medium ranks 36th of 55.
- Structured output: Grok 3 scores 5 vs Devstral Medium's 4. Grok 3 ties for 1st among 54 models on JSON schema compliance. Devstral Medium ranks 26th of 54 — at the median, not below it, but clearly trailing.
- Tool calling: Grok 3 scores 4 vs Devstral Medium's 3. Grok 3 ranks 18th of 54 on function selection and argument accuracy; Devstral Medium ranks 47th of 54. For LLM-powered integrations and API orchestration, this gap is operationally significant.
- Safety calibration: Grok 3 scores 2 vs Devstral Medium's 1. Neither model excels here: Grok 3 ranks 12th of 55 and Devstral Medium ranks 32nd of 55, and both scores sit below the midpoint of the 1–5 scale.
- Creative problem solving: Grok 3 scores 3 vs Devstral Medium's 2. Grok 3 ranks 30th of 54; Devstral Medium ranks 47th of 54.
Where models tie:
- Classification: Both models score 4 and both tie for 1st among 53 models — the score at which 30 models converge. For routing and categorization tasks, neither has an edge.
- Constrained rewriting: Both score 3 and both rank 31st of 53 (22 models share this score). Compression under hard character limits is a weak point for both.
The benchmark picture is clear: Grok 3 is the stronger model on 10 of 12 tests in our testing and never trails, with the most decisive advantages in strategic analysis (a 3-point gap) and persona consistency (a 2-point gap). Every other lead is a single point.
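The tally above can be checked with a short script. This is a minimal sketch that hard-codes the scores listed in this section; the names `SCORES`, `tally`, and `gaps_by_size` are illustrative and not part of any published tooling.

```python
# Scores (1-5) as listed in the analysis above: (Grok 3, Devstral Medium).
SCORES = {
    "strategic analysis": (5, 2),
    "agentic planning": (5, 4),
    "persona consistency": (5, 3),
    "faithfulness": (5, 4),
    "long context": (5, 4),
    "multilingual": (5, 4),
    "structured output": (5, 4),
    "tool calling": (4, 3),
    "safety calibration": (2, 1),
    "creative problem solving": (3, 2),
    "classification": (4, 4),
    "constrained rewriting": (3, 3),
}


def tally(scores):
    """Return (Grok 3 wins, Devstral Medium wins, ties)."""
    grok_wins = sum(1 for g, d in scores.values() if g > d)
    devstral_wins = sum(1 for g, d in scores.values() if d > g)
    ties = sum(1 for g, d in scores.values() if g == d)
    return grok_wins, devstral_wins, ties


def gaps_by_size(scores):
    """Return (test, gap) pairs sorted from the widest Grok 3 lead to the narrowest."""
    pairs = ((name, g - d) for name, (g, d) in scores.items())
    return sorted(pairs, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print(tally(SCORES))            # (10, 0, 2)
    print(gaps_by_size(SCORES)[:2])  # the two widest gaps
```

Running it confirms 10 wins for Grok 3, none for Devstral Medium, and 2 ties, with strategic analysis and persona consistency as the two widest gaps.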
Pricing Analysis
The cost gap between these two models is substantial. Devstral Medium charges $0.40/MTok on input and $2/MTok on output. Grok 3 charges $3/MTok on input and $15/MTok on output, 7.5x higher on both.
At real-world volumes, that math compounds quickly:
- 1M output tokens/month: Devstral Medium costs $2; Grok 3 costs $15. Difference: $13.
- 10M output tokens/month: Devstral Medium costs $20; Grok 3 costs $150. Difference: $130.
- 100M output tokens/month: Devstral Medium costs $200; Grok 3 costs $1,500. Difference: $1,300.
For individual developers or small teams running low-to-moderate workloads, the raw dollar difference is modest. But for production systems processing tens of millions of tokens monthly (document pipelines, classification systems, or agentic loops), the $1,300/month gap at 100M output tokens becomes a meaningful budget line item, before input tokens are counted. Teams that need Grok 3's superior performance on strategic analysis, faithfulness, or long-context retrieval will likely find it worth the premium. Teams running high-volume classification or structured extraction tasks, where Devstral Medium scores competitively (tied for 1st on classification, 4 vs 5 on structured output), should weigh whether the performance delta justifies the 7.5x cost multiplier.
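The monthly figures above follow directly from the per-MTok prices. A minimal sketch of that arithmetic, assuming volumes are given in millions of tokens; `PRICES`, `monthly_cost`, and `price_ratio` are illustrative names, and `Decimal` is used only to keep the dollar amounts exact.

```python
from decimal import Decimal

# Prices in USD per million tokens (MTok), as quoted in this comparison.
PRICES = {
    "Devstral Medium": {"input": Decimal("0.40"), "output": Decimal("2.00")},
    "Grok 3": {"input": Decimal("3.00"), "output": Decimal("15.00")},
}


def monthly_cost(model, output_mtok, input_mtok=0):
    """Return the monthly USD cost for volumes given in millions of tokens."""
    price = PRICES[model]
    return price["input"] * input_mtok + price["output"] * output_mtok


def price_ratio(kind):
    """Return how many times more Grok 3 charges for 'input' or 'output'."""
    return PRICES["Grok 3"][kind] / PRICES["Devstral Medium"][kind]


if __name__ == "__main__":
    for volume in (1, 10, 100):
        cheap = monthly_cost("Devstral Medium", volume)
        premium = monthly_cost("Grok 3", volume)
        print(f"{volume}M output tokens: ${cheap} vs ${premium}, "
              f"difference ${premium - cheap}")
```

Passing a non-zero `input_mtok` adds input spend to the same estimate. Because both prices differ by the same 7.5x factor, the ratio between the two bills stays at 7.5x whatever the input/output mix.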
Bottom Line
Choose Devstral Medium if:
- Your primary concern is cost at scale, and the $1,300/month savings at 100M output tokens is material to your budget.
- Your use case centers on classification, where Devstral Medium ties Grok 3 for 1st in our testing, or on structured extraction, where it scores 4 to Grok 3's 5, and you don't need the premium performance Grok 3 brings elsewhere.
- You're building a high-volume pipeline (document processing, data labeling, routing) where the 7.5x cost difference outweighs the one-point gaps on tasks like structured output, long context, or faithfulness.
- You're prototyping or in early development and want to control API spend before committing to a production model.
Choose Grok 3 if:
- You need reliable agentic workflows: Grok 3 scores 5 on agentic planning vs Devstral Medium's 4, and 4 on tool calling vs Devstral Medium's 3 (18th vs 47th of 54 in our tests).
- Faithfulness and RAG accuracy are critical — Grok 3 scores 5 vs Devstral Medium's 4, tying for 1st in our testing on source fidelity.
- Your application needs strong persona consistency (5 vs 3) — Devstral Medium ranks 45th of 53 on this test, making it a poor choice for chatbot or assistant products.
- You're analyzing complex strategy or business problems — the 5 vs 2 gap on strategic analysis in our testing is too wide to overlook for serious analytical use cases.
- Multilingual support matters — Grok 3 scores 5 vs Devstral Medium's 4, tying for 1st among 55 models.
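One way to apply the guidance above is to set a minimum score per task and take the cheapest model that clears every bar. The sketch below is an illustrative decision rule, not our methodology; `MODELS` and `cheapest_meeting` are hypothetical names, and the data is the scores and output prices reported in this comparison.

```python
from typing import Dict, Optional

# Output price (USD/MTok) and 1-5 scores as reported in this comparison.
MODELS = {
    "Devstral Medium": {
        "output_price": 2.00,
        "scores": {
            "strategic analysis": 2, "agentic planning": 4,
            "persona consistency": 3, "faithfulness": 4,
            "long context": 4, "multilingual": 4,
            "structured output": 4, "tool calling": 3,
            "safety calibration": 1, "creative problem solving": 2,
            "classification": 4, "constrained rewriting": 3,
        },
    },
    "Grok 3": {
        "output_price": 15.00,
        "scores": {
            "strategic analysis": 5, "agentic planning": 5,
            "persona consistency": 5, "faithfulness": 5,
            "long context": 5, "multilingual": 5,
            "structured output": 5, "tool calling": 4,
            "safety calibration": 2, "creative problem solving": 3,
            "classification": 4, "constrained rewriting": 3,
        },
    },
}


def cheapest_meeting(minimums: Dict[str, int]) -> Optional[str]:
    """Return the cheapest model whose score meets every minimum, or None."""
    candidates = [
        name for name, info in MODELS.items()
        if all(info["scores"][task] >= floor for task, floor in minimums.items())
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda name: MODELS[name]["output_price"])


if __name__ == "__main__":
    print(cheapest_meeting({"classification": 4, "structured output": 4}))
    print(cheapest_meeting({"persona consistency": 4}))
```

A classification and extraction pipeline that accepts a 4 resolves to Devstral Medium, while any requirement of 4 or higher on persona consistency or strategic analysis resolves to Grok 3. A requirement neither model meets, such as a 3 on safety calibration, returns `None`.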
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.