Devstral Medium vs Grok 3 Mini
Grok 3 Mini is the stronger all-around choice in our testing, winning 8 of 12 benchmarks — including tool calling, faithfulness, persona consistency, and long context — while costing 4x less on output tokens ($0.50 vs $2.00 per MTok). Devstral Medium's only clear win is agentic planning (4 vs 3), making it worth considering specifically for multi-step autonomous workflows. For most developers and consumers, Grok 3 Mini delivers more capability at a fraction of the price.
Devstral Medium (Mistral)
Pricing: $0.40/MTok input, $2.00/MTok output
modelpicker.net
Grok 3 Mini (xAI)
Pricing: $0.30/MTok input, $0.50/MTok output
Benchmark Analysis
Across our 12-test suite, Grok 3 Mini wins 8 benchmarks, Devstral Medium wins 1, and 3 are tied. Here's the breakdown:
Tool Calling (5 vs 3): Grok 3 Mini scores 5, tied for 1st among 54 models. Devstral Medium scores 3, ranked 47th of 54. This is a decisive gap — for any agentic workflow requiring function selection and argument accuracy, Grok 3 Mini is materially better in our testing.
Faithfulness (5 vs 4): Grok 3 Mini ties for 1st of 55 models; Devstral Medium ranks 34th. In RAG pipelines or any task where sticking to source material matters, Grok 3 Mini produces fewer hallucinations in our tests.
Persona Consistency (5 vs 3): Grok 3 Mini ties for 1st of 53 models; Devstral Medium ranks 45th. A two-point gap here is significant for chatbot or assistant deployments where maintaining character under adversarial prompts matters.
Long Context (5 vs 4): Grok 3 Mini ties for 1st of 55 models; Devstral Medium ranks 38th. Both models share a 131,072-token context window, but Grok 3 Mini retrieves more accurately at 30K+ tokens in our testing.
Strategic Analysis (3 vs 2): Grok 3 Mini scores 3, ranked 36th; Devstral Medium scores 2, ranked 44th. Neither model excels here, but Grok 3 Mini is the stronger of the two.
Constrained Rewriting (4 vs 3): Grok 3 Mini ranks 6th of 53; Devstral Medium ranks 31st. Compression tasks with hard character limits go to Grok 3 Mini.
Creative Problem Solving (3 vs 2): Grok 3 Mini ranks 30th; Devstral Medium ranks 47th. Devstral Medium sits in the bottom tier on this dimension.
Safety Calibration (2 vs 1): Grok 3 Mini ranks 12th of 55; Devstral Medium ranks 32nd. Neither model exceeds the 75th-percentile score of 2, and Devstral Medium's score of 1, the lowest on the 1–5 scale, places it among the weakest performers in our testing.
Agentic Planning (4 vs 3): This is Devstral Medium's only win. It scores 4 (rank 16 of 54) vs Grok 3 Mini's 3 (rank 42 of 54). For goal decomposition and failure recovery in autonomous agents, Devstral Medium has a measurable edge.
Structured Output, Classification, Multilingual (tied): Both models score 4 on each, at identical ranks (26th, tied for 1st, and 36th, respectively). These are non-differentiators.
Note: Neither model has external benchmark scores (SWE-bench Verified, AIME 2025, MATH Level 5) in our dataset.
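The win/loss/tie count can be reproduced from the scores listed above. The following is a minimal sketch, not our scoring pipeline: the dictionary keys and the `tally` helper are hypothetical names chosen for illustration, and the numbers are copied by hand from the breakdown.

```python
# Scores as (Devstral Medium, Grok 3 Mini), copied from the breakdown above.
# Key names are illustrative, not identifiers from any real dataset.
SCORES = {
    "tool_calling": (3, 5),
    "faithfulness": (4, 5),
    "persona_consistency": (3, 5),
    "long_context": (4, 5),
    "strategic_analysis": (2, 3),
    "constrained_rewriting": (3, 4),
    "creative_problem_solving": (2, 3),
    "safety_calibration": (1, 2),
    "agentic_planning": (4, 3),
    "structured_output": (4, 4),
    "classification": (4, 4),
    "multilingual": (4, 4),
}


def tally(scores):
    """Count per-model wins and ties from (devstral, grok) score pairs."""
    result = {"devstral_medium": 0, "grok_3_mini": 0, "tied": 0}
    for devstral, grok in scores.values():
        if devstral > grok:
            result["devstral_medium"] += 1
        elif grok > devstral:
            result["grok_3_mini"] += 1
        else:
            result["tied"] += 1
    return result


print(tally(SCORES))
```

Running it yields 8 wins for Grok 3 Mini, 1 for Devstral Medium, and 3 ties, matching the summary at the top of this section.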
Pricing Analysis
Grok 3 Mini costs $0.30/MTok input and $0.50/MTok output. Devstral Medium costs $0.40/MTok input and $2.00/MTok output: 33% more on input and 4x more on output, which is where most production costs accumulate. At 1M output tokens/month, Grok 3 Mini costs $0.50 vs Devstral Medium's $2.00, a negligible $1.50 difference. At 10M tokens the gap is $15/month ($5 vs $20), and at 100M tokens it is $150/month ($50 vs $200), or $1,800 per year. The gap becomes material only at billions of tokens: at 10B output tokens/month it reaches $15,000/month ($5,000 vs $20,000), and at 100B it reaches $150,000/month ($50,000 vs $200,000). Developers running very high-volume pipelines (chatbots, document processing, API-heavy agentic systems) should factor this heavily. The only scenario where Devstral Medium's premium makes sense is if your workload centers on agentic planning, where it scores 4 vs Grok 3 Mini's 3 in our testing.
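The per-MTok arithmetic is easy to check for your own volumes. This is a small sketch using the list prices quoted above; the `monthly_cost` function and the sample volumes are illustrative assumptions, and real bills may differ with caching, batching, or provider-specific discounts.

```python
# List prices in USD per million tokens (MTok), as quoted in this comparison.
PRICES_PER_MTOK = {
    "Devstral Medium": {"input": 0.40, "output": 2.00},
    "Grok 3 Mini": {"input": 0.30, "output": 0.50},
}


def monthly_cost(model, input_tokens, output_tokens):
    """Return the USD cost for the given monthly input and output token counts."""
    price = PRICES_PER_MTOK[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Output-only comparison at a few hypothetical monthly volumes.
for volume in (1_000_000, 10_000_000, 100_000_000, 10_000_000_000):
    devstral = monthly_cost("Devstral Medium", 0, volume)
    grok = monthly_cost("Grok 3 Mini", 0, volume)
    gap = devstral - grok
    print(f"{volume:>14,} output tokens: ${devstral:,.2f} vs ${grok:,.2f} (gap ${gap:,.2f})")
```

Pass a non-zero `input_tokens` value to include prompt costs, where the price difference between the two models is much smaller.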
Bottom Line
Choose Grok 3 Mini if you need strong tool calling, high faithfulness, reliable persona consistency, or solid long-context retrieval — which covers the majority of API use cases including agentic pipelines, RAG systems, and chat applications. The 4x output cost advantage makes it the default choice at any meaningful volume.
Choose Devstral Medium if your primary use case is agentic planning — specifically multi-step goal decomposition and failure recovery — where it scores 4 vs Grok 3 Mini's 3 in our testing, and that capability gap justifies paying $2.00 vs $0.50 per MTok on output. Devstral Medium's description positions it as a code generation and agentic reasoning model developed with All Hands AI, so if your workflow is deeply agentic-coding-focused, it may warrant evaluation despite the cost premium.
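One way to apply this guidance to a specific workload is to weight the benchmarks you care about and compare weighted totals. The sketch below is a hypothetical decision aid, not part of our methodology: the `pick` function, the key names, and the example weights are all assumptions, and it deliberately ignores price.

```python
# Scores as (Devstral Medium, Grok 3 Mini) for the benchmarks where the models
# differ, plus one tied benchmark, copied from the analysis in this comparison.
SCORES = {
    "tool_calling": (3, 5),
    "faithfulness": (4, 5),
    "persona_consistency": (3, 5),
    "long_context": (4, 5),
    "agentic_planning": (4, 3),
    "classification": (4, 4),
}


def pick(weights):
    """Return the model with the higher weighted score total, or "tie".

    `weights` maps benchmark names to non-negative integer importance values.
    """
    devstral = sum(w * SCORES[name][0] for name, w in weights.items())
    grok = sum(w * SCORES[name][1] for name, w in weights.items())
    if devstral > grok:
        return "Devstral Medium"
    if grok > devstral:
        return "Grok 3 Mini"
    return "tie"


# A workload that only cares about planning favors Devstral Medium.
print(pick({"agentic_planning": 1}))

# A RAG-style chatbot weighting (hypothetical) favors Grok 3 Mini.
print(pick({"faithfulness": 5, "long_context": 3, "persona_consistency": 2}))
```

Note how quickly the balance shifts once tool calling enters the mix: with planning weighted 3 and tool calling weighted 2, the weighted totals already favor Grok 3 Mini.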
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.