Devstral Medium vs Grok 4

Grok 4 is the stronger model across nearly every dimension in our testing, winning 9 of 12 benchmarks outright and tying 2 more — its advantages on strategic analysis (5 vs 2), faithfulness (5 vs 4), and multilingual (5 vs 4) are particularly meaningful. Devstral Medium's only win is agentic planning (4 vs 3), which matters for autonomous workflow tasks. At $15/M output tokens versus $2/M for Devstral Medium, Grok 4 costs 7.5x more on the output side — a gap that's hard to justify unless you specifically need its reasoning depth or multimodal capabilities.

At a Glance

                     Devstral Medium (Mistral)   Grok 4 (xAI)
Overall              3.17/5 (Usable)             4.08/5 (Strong)
Input price          $0.40/MTok                  $3.00/MTok
Output price         $2.00/MTok                  $15.00/MTok
Context window       131K                        256K

External benchmark results (SWE-bench Verified, MATH Level 5, AIME 2025) are not available for either model. Per-benchmark scores for both models appear in the comparison table under Benchmark Analysis below.

Benchmark Analysis

Across our 12-test suite, Grok 4 wins 9 benchmarks, Devstral Medium wins 1, and they tie on 2.

Where Grok 4 wins clearly:

  • Strategic analysis: 5 vs 2. Grok 4 ties for 1st among 54 models; Devstral Medium ranks 44th. This is the largest gap in the comparison and means real differences in nuanced tradeoff reasoning and decision-support tasks.
  • Faithfulness: 5 vs 4. Grok 4 ties for 1st among 55 models; Devstral Medium ranks 34th. Fewer hallucinations and better source adherence — critical for RAG applications and summarization.
  • Persona consistency: 5 vs 3. Grok 4 ties for 1st among 53 models; Devstral Medium ranks 45th. A two-point gap here suggests Devstral Medium struggles to maintain character under pressure, which limits its usefulness in chatbot or roleplay applications.
  • Multilingual: 5 vs 4. Grok 4 ties for 1st among 55 models; Devstral Medium ranks 36th. The median is high here (p50 = 5): Grok 4 reaches the ceiling while Devstral Medium falls just below it.
  • Tool calling: 4 vs 3. Grok 4 ranks 18th of 54; Devstral Medium ranks 47th. For agentic workflows dependent on accurate function selection and argument passing, this gap is operationally significant (a minimal example of the request shape follows this list).
  • Long context: 5 vs 4. Grok 4 ties for 1st among 55 models; Devstral Medium ranks 38th. Grok 4 also has a 256K context window vs Devstral Medium's 131K — double the capacity.
  • Safety calibration: 2 vs 1. Neither model excels here; Grok 4 ranks 12th of 55 while Devstral Medium ranks 32nd. Grok 4 sits exactly at the p50 of 2, while Devstral Medium's score of 1 puts it in the bottom quartile.
  • Constrained rewriting: 4 vs 3. Grok 4 ranks 6th of 53; Devstral Medium ranks 31st.
  • Creative problem solving: 3 vs 2. Grok 4 ranks 30th of 54; Devstral Medium ranks 47th.
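
To make the tool-calling gap concrete, here is roughly what a scored request looks like. A minimal sketch assuming an OpenAI-compatible chat completions endpoint; the base URL, model name, and get_order_status tool are illustrative placeholders, not items from our suite.

```python
# Minimal sketch of a tool-calling request, assuming an OpenAI-compatible
# chat completions endpoint. The base URL, model name, and
# get_order_status tool are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="...")

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",  # hypothetical tool
        "description": "Look up the shipping status of an order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="model-name",  # placeholder
    messages=[{"role": "user", "content": "Where is order 84123?"}],
    tools=tools,
)

# The benchmark scores exactly this: did the model select the right tool
# and pass well-formed arguments? (A model may also answer in plain text,
# in which case tool_calls is empty.)
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```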

Where Devstral Medium wins:

  • Agentic planning: 4 vs 3. Devstral Medium ranks 16th of 54; Grok 4 ranks 42nd. This is a meaningful reversal — Devstral Medium is built specifically for code generation and agentic reasoning, and this score reflects that. For goal decomposition and multi-step autonomous task execution, Devstral Medium outperforms Grok 4 in our tests.
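
For a sense of what this benchmark exercises, a toy plan-then-execute loop looks something like the sketch below. The endpoint, model name, and prompts are illustrative placeholders, not the benchmark's actual tasks.

```python
# Toy plan-then-execute loop of the kind the agentic-planning benchmark
# exercises. Endpoint, model name, and prompts are illustrative.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="...")
MODEL = "model-name"  # placeholder

goal = "Migrate the project's test suite from unittest to pytest."

# 1) Goal decomposition: ask for an ordered, numbered plan.
plan = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user",
               "content": f"Break this goal into numbered steps, one per line:\n{goal}"}],
).choices[0].message.content

# 2) Multi-step execution: work through the plan, carrying context forward.
history = [{"role": "user", "content": f"Goal: {goal}\nPlan:\n{plan}"}]
for step in (s for s in plan.splitlines() if s.strip()):
    history.append({"role": "user", "content": f"Carry out this step: {step}"})
    reply = client.chat.completions.create(model=MODEL, messages=history)
    history.append({"role": "assistant", "content": reply.choices[0].message.content})
```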

Ties:

  • Structured output: Both score 4, both rank 26th of 54. JSON schema compliance is equivalent; see the sketch after this list.
  • Classification: Both score 4, both tie for 1st of 53. Routing and categorization tasks are a wash.
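
The structured-output tie means either model can be asked for schema-conforming JSON interchangeably. A minimal sketch, assuming an endpoint that supports OpenAI-style JSON-schema response_format; the ticket schema is our own illustrative example, not one of the benchmark's schemas.

```python
# Minimal structured-output request, assuming an endpoint that supports
# OpenAI-style JSON-schema response_format. The ticket schema is an
# illustrative example.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="...")

schema = {
    "name": "ticket",
    "schema": {
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": ["bug", "feature", "question"]},
            "priority": {"type": "integer", "minimum": 1, "maximum": 5},
        },
        "required": ["category", "priority"],
        "additionalProperties": False,
    },
}

resp = client.chat.completions.create(
    model="model-name",  # placeholder
    messages=[{"role": "user", "content": "App crashes on login. Triage this ticket."}],
    response_format={"type": "json_schema", "json_schema": schema},
)
print(resp.choices[0].message.content)  # JSON conforming to the schema
```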

Benchmark                  Devstral Medium   Grok 4
Faithfulness               4/5               5/5
Long Context               4/5               5/5
Multilingual               4/5               5/5
Tool Calling               3/5               4/5
Classification             4/5               4/5
Agentic Planning           4/5               3/5
Structured Output          4/5               4/5
Safety Calibration         1/5               2/5
Strategic Analysis         2/5               5/5
Persona Consistency        3/5               5/5
Constrained Rewriting      3/5               4/5
Creative Problem Solving   2/5               3/5
Summary                    1 win             9 wins

Pricing Analysis

Devstral Medium costs $0.40/M input and $2.00/M output tokens. Grok 4 costs $3.00/M input and $15.00/M output tokens, which is 7.5x more expensive on both input and output. At 1M output tokens/month, that's $2 vs $15, a $13 difference. At 10M tokens/month, you're paying $20 vs $150. At 100M tokens/month, the gap is $200 vs $1,500. For high-volume applications such as bulk document processing, large-scale classification pipelines, or cost-sensitive consumer products, Devstral Medium's pricing is a genuine advantage. Grok 4's price premium makes sense for lower-volume, high-stakes tasks: legal analysis, strategic research, or multimodal workflows where Grok 4's image and file input support (not available on Devstral Medium) adds real capability. Note also that Grok 4 uses reasoning tokens, which can inflate actual output costs beyond the base rate.
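
For quick budgeting, the output-side math reduces to a few lines. A minimal sketch using the rates quoted above; the volumes are illustrative, and it ignores input tokens and Grok 4's reasoning-token overhead.

```python
# Output-token spend per month at the rates quoted above. Volumes are
# illustrative; input tokens and reasoning tokens would add to these.
OUTPUT_PRICE = {"Devstral Medium": 2.00, "Grok 4": 15.00}  # $ per MTok

for mtok in (1, 10, 100):  # millions of output tokens per month
    costs = ", ".join(f"{m} ${mtok * p:,.0f}" for m, p in OUTPUT_PRICE.items())
    print(f"{mtok}M tok/mo: {costs}")
```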

Real-World Cost Comparison

Task             Devstral Medium   Grok 4
Chat response    $0.0011           $0.0081
Blog post        $0.0042           $0.032
Document batch   $0.108            $0.810
Pipeline run     $1.08             $8.10
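
These rows are consistent with simple per-task token budgets. The sketch below reproduces the table from (input, output) token counts we inferred to match the figures; the counts are our own assumptions, not published workload definitions.

```python
# Reproduce the table above from per-task (input, output) token counts.
# These counts are our own inference chosen to match the published
# figures; neither provider publishes the underlying workloads.
PRICES = {"Devstral Medium": (0.40, 2.00), "Grok 4": (3.00, 15.00)}  # $/MTok
TASKS = {
    "Chat response":  (200, 500),
    "Blog post":      (500, 2_000),
    "Document batch": (20_000, 50_000),
    "Pipeline run":   (200_000, 500_000),
}

for task, (tok_in, tok_out) in TASKS.items():
    row = ", ".join(
        f"{m} ${(tok_in * p_in + tok_out * p_out) / 1e6:.4g}"
        for m, (p_in, p_out) in PRICES.items()
    )
    print(f"{task}: {row}")  # matches the table to rounding
```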

Bottom Line

Choose Devstral Medium if your primary use case is agentic planning and multi-step autonomous workflows — it scores 4 vs Grok 4's 3 in our testing and ranks 16th of 54 models on that benchmark. It's also the right choice for high-volume, cost-sensitive applications: at $2/M output tokens, you can run 7.5x the volume for the same budget. It handles structured output and classification as well as Grok 4 at a fraction of the price.

Choose Grok 4 if you need strong strategic analysis (5/5, tied for 1st), high faithfulness for RAG or summarization pipelines (5/5), reliable multilingual output (5/5), or multimodal inputs (image and file support, which Devstral Medium does not offer). Grok 4's 256K context window is also twice Devstral Medium's 131K, which helps with very long document workflows. The $15/M output cost is justified when task quality, reasoning depth, or modality support directly affects outcomes.
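
If multimodal input is the deciding factor, here is roughly what an image call to Grok 4 looks like. A sketch under assumptions: the base URL and model name follow xAI's published conventions at the time of writing, so confirm them against the current docs.

```python
# Sending an image to Grok 4 via xAI's OpenAI-compatible API. The base
# URL and model name are assumptions based on xAI's published
# conventions; check the current docs before relying on them.
from openai import OpenAI

client = OpenAI(base_url="https://api.x.ai/v1", api_key="XAI_API_KEY")

resp = client.chat.completions.create(
    model="grok-4",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the chart in this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```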

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
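
For readers who want the shape of that judging step, a stripped-down version looks like the sketch below; the rubric wording and judge model are placeholders, not our production configuration.

```python
# Stripped-down shape of the 1-5 judging step. The rubric wording and
# judge model are placeholders, not our production configuration.
import re
from openai import OpenAI

client = OpenAI(api_key="...")

def judge(task: str, answer: str) -> int:
    rubric = (
        "Score the answer from 1 (unusable) to 5 (excellent) for the task.\n"
        f"Task: {task}\nAnswer: {answer}\n"
        "Reply with the score as a single digit."
    )
    reply = client.chat.completions.create(
        model="judge-model",  # placeholder
        messages=[{"role": "user", "content": rubric}],
    ).choices[0].message.content
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 1  # conservative fallback
```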

Frequently Asked Questions