Devstral Medium vs Grok Code Fast 1

Grok Code Fast 1 wins this matchup outright: it scores higher than Devstral Medium on 6 of 12 benchmarks in our testing and ties the remaining 6, leaving Devstral Medium with no wins. It also costs less: $0.20 input / $1.50 output per MTok versus Devstral Medium's $0.40 / $2.00. The best Devstral Medium manages is parity; it never pulls ahead. That makes Grok Code Fast 1 the stronger choice for most coding and agentic workloads, at a lower price.

Devstral Medium (Mistral)

Overall: 3.17/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 3/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.40/MTok
Output: $2.00/MTok
Context Window: 131K


Grok Code Fast 1 (xAI)

Overall: 3.67/5 (Strong)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 3/5
Persona Consistency: 4/5
Constrained Rewriting: 3/5
Creative Problem Solving: 3/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.20/MTok
Output: $1.50/MTok
Context Window: 256K


Benchmark Analysis

Across our 12-test benchmark suite, Grok Code Fast 1 wins 6 tests outright and ties the remaining 6. Devstral Medium wins none.

Where Grok Code Fast 1 wins clearly:

  • Agentic planning (5 vs 4): Grok Code Fast 1 scores 5/5, tied for 1st among 54 models in our testing. Devstral Medium scores 4/5, ranked 16th of 54. For multi-step task execution and autonomous coding agents, this gap matters — 5 represents the top tier while 4 is solid mid-pack.
  • Tool calling (4 vs 3): Grok Code Fast 1 scores 4/5 (rank 18 of 54), Devstral Medium scores 3/5 (rank 47 of 54). A score of 3 places Devstral Medium near the bottom of the field on function selection and argument accuracy, a meaningful gap for API-integrated or tool-augmented workflows (see the sketch after this list).
  • Persona consistency (4 vs 3): Grok Code Fast 1 ranks 38 of 53; Devstral Medium ranks 45 of 53. Both are below median, but Devstral Medium's 3/5 is notably weaker for chatbot or roleplay applications.
  • Strategic analysis (3 vs 2): Both are below the median (p50 = 4), but Devstral Medium's 2/5 puts it at rank 44 of 54 — near the bottom. Grok Code Fast 1 scores 3/5 at rank 36. Neither excels at nuanced tradeoff reasoning, but Devstral Medium struggles more.
  • Creative problem solving (3 vs 2): Same pattern — Grok Code Fast 1 scores 3/5 (rank 30 of 54), Devstral Medium scores 2/5 (rank 47 of 54). Generating non-obvious, feasible ideas is a weak point for Devstral Medium.
  • Safety calibration (2 vs 1): Grok Code Fast 1 scores 2/5 (rank 12 of 55), Devstral Medium scores 1/5 (rank 32 of 55). Neither is strong here — the p75 is only 2, meaning most models score low — but Devstral Medium's 1/5 is the floor of our scale.
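
To make the tool-calling gap concrete, here is a minimal sketch of the kind of check such a benchmark exercises: did the model pick a registered function, and are its arguments complete and well-typed? The tool names and schemas are hypothetical, and this is not our actual harness.

```python
# Illustrative only: hypothetical tool schemas and a simple validator for a
# model-proposed call. It checks function selection and argument accuracy.
import json

TOOLS = {
    "search_issues": {"repo": str, "query": str},
    "create_branch": {"repo": str, "name": str},
}

def check_tool_call(raw: str) -> list[str]:
    """Return a list of problems with a model's proposed tool call."""
    errors: list[str] = []
    call = json.loads(raw)
    name, args = call.get("name"), call.get("arguments", {})
    if name not in TOOLS:
        return [f"unknown tool: {name!r}"]  # wrong function selection
    for param, expected in TOOLS[name].items():
        if param not in args:
            errors.append(f"missing argument: {param}")
        elif not isinstance(args[param], expected):
            errors.append(f"wrong type for {param}: expected {expected.__name__}")
    for param in args:
        if param not in TOOLS[name]:
            errors.append(f"unexpected argument: {param}")  # hallucinated arg
    return errors

# A clean call passes with no errors:
print(check_tool_call(
    '{"name": "search_issues", "arguments": {"repo": "acme/api", "query": "timeout"}}'
))  # -> []
```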

Where they tie:

  • Structured output (4 vs 4): Both score 4/5, tied at rank 26 of 54. Solid JSON schema compliance from both.
  • Faithfulness (4 vs 4): Both score 4/5 at rank 34 of 55. Neither hallucinates frequently in our tests.
  • Classification (4 vs 4): Both tied for 1st among 53 models, the most crowded top tier in our suite. Strong routing accuracy from both.
  • Long context (4 vs 4): Both score 4/5 at rank 38 of 55. Adequate retrieval at 30K+ tokens, though not top-tier.
  • Constrained rewriting (3 vs 3): Both rank 31 of 53. Mid-pack compression performance.
  • Multilingual (4 vs 4): Both rank 36 of 55. Consistent non-English quality from both.

The pattern is clear: where Devstral Medium diverges from Grok Code Fast 1, it diverges downward. Its weakest results — tool calling at rank 47, creative problem solving at rank 47, safety calibration at rank 32 with a 1/5 score — are liabilities for production deployments. Grok Code Fast 1's standout is agentic planning at 5/5, tied for 1st, which aligns directly with its described strength as a coding agent model. Note that neither model has been tested on our suite's external benchmarks (SWE-bench Verified, AIME 2025, MATH Level 5) as of this report.

| Benchmark | Devstral Medium | Grok Code Fast 1 |
| --- | --- | --- |
| Faithfulness | 4/5 | 4/5 |
| Long Context | 4/5 | 4/5 |
| Multilingual | 4/5 | 4/5 |
| Tool Calling | 3/5 | 4/5 |
| Classification | 4/5 | 4/5 |
| Agentic Planning | 4/5 | 5/5 |
| Structured Output | 4/5 | 4/5 |
| Safety Calibration | 1/5 | 2/5 |
| Strategic Analysis | 2/5 | 3/5 |
| Persona Consistency | 3/5 | 4/5 |
| Constrained Rewriting | 3/5 | 3/5 |
| Creative Problem Solving | 2/5 | 3/5 |
| Summary | 0 wins | 6 wins |

Pricing Analysis

Grok Code Fast 1 is cheaper on both dimensions: $0.20/MTok input and $1.50/MTok output versus Devstral Medium's $0.40/MTok input and $2.00/MTok output. That is half the input price and 25% less on output. At 1M output tokens/month you pay $1.50 vs $2.00, a $0.50 gap; at 10M tokens it's $15 vs $20; and at 100M tokens the gap widens to $150 vs $200 per month on output alone. The input discount is even larger in relative terms: 100M input tokens cost $20 with Grok Code Fast 1 versus $40 with Devstral Medium. For high-volume agentic pipelines or code generation tools running millions of tokens monthly, Grok Code Fast 1 is the clear cost winner. The only reason to pay Devstral Medium's premium would be a specific workflow where its parity scores are a hard requirement, and the data shows no such edge exists.
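
The arithmetic is easy to sanity-check yourself. A minimal sketch, with prices taken from the cards above and an illustrative monthly volume:

```python
# Cost sanity-check for the figures above. Prices ($/MTok) come from the
# comparison cards; the 100M/100M monthly volume is an illustrative example.
PRICES = {
    "Devstral Medium": (0.40, 2.00),   # (input, output) $ per million tokens
    "Grok Code Fast 1": (0.20, 1.50),
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Total monthly spend for a given volume, in millions of tokens."""
    inp, out = PRICES[model]
    return input_mtok * inp + output_mtok * out

# 100M input + 100M output tokens per month:
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 100, 100):,.2f}")
# Devstral Medium: $240.00 vs Grok Code Fast 1: $170.00, i.e. the $20 input
# and $50 output savings cited above, $70/month in total.
```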

Real-World Cost Comparison

| Task | Devstral Medium | Grok Code Fast 1 |
| --- | --- | --- |
| Chat response | $0.0011 | <$0.001 |
| Blog post | $0.0042 | $0.0031 |
| Document batch | $0.108 | $0.079 |
| Pipeline run | $1.08 | $0.79 |

Bottom Line

Choose Grok Code Fast 1 if: you're building agentic coding workflows, need reliable tool calling (4/5 at rank 18 vs 3/5 at rank 47), or are running high-volume pipelines where the lower cost ($0.20/$1.50 vs $0.40/$2.00 per MTok) compounds into real savings. Its 5/5 agentic planning score, tied for 1st among 54 models in our testing, makes it the better pick for autonomous agents that decompose goals and recover from failures. At 100M tokens/month in each direction, you save $50 on output and $20 on input vs Devstral Medium.

Choose Devstral Medium if: you have a specific integration requirement tied to the Mistral ecosystem, need the supported parameters it offers (frequency_penalty, presence_penalty, seed), or your workload is dominated by tasks where both models tie — classification, faithfulness, structured output, or long context. Be aware you're paying more for no benchmark advantage. Devstral Medium's 1/5 safety calibration score is also worth flagging if your deployment has content moderation requirements.
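
If those sampling parameters are the deciding factor, the call shape is straightforward. A hedged sketch assuming an OpenAI-compatible endpoint; the base_url and model identifier below are placeholders, so check your provider's documentation for actual parameter support:

```python
# Sketch of passing frequency_penalty, presence_penalty, and seed through an
# OpenAI-compatible client. base_url, api_key, and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.example/v1", api_key="...")

resp = client.chat.completions.create(
    model="devstral-medium",   # placeholder model identifier
    messages=[{"role": "user", "content": "Refactor this function for clarity."}],
    frequency_penalty=0.2,     # discourage verbatim repetition
    presence_penalty=0.1,      # nudge the model toward new tokens
    seed=42,                   # best-effort reproducible sampling
)
print(resp.choices[0].message.content)
```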

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
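
For readers curious what rubric-based LLM judging looks like in practice, here is a simplified sketch. It is not our actual judge prompt, model, or harness; the rubric wording, judge model name, and score parsing are assumptions:

```python
# Simplified 1-5 rubric scoring with an LLM judge. Illustrative only: the
# prompt, judge model, and parsing below are assumptions, not our harness.
import re
from openai import OpenAI

client = OpenAI()  # judge endpoint; set base_url/api_key as needed

JUDGE_PROMPT = """You are grading a model response against a rubric.
Task: {task}
Response: {response}
Score the response from 1 (unusable) to 5 (flawless).
Reply with only the integer."""

def judge_score(task: str, response: str) -> int:
    """Ask the judge model for a 1-5 score; fail closed to 1 if unparseable."""
    out = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(task=task, response=response)}],
    )
    match = re.search(r"[1-5]", out.choices[0].message.content)
    return int(match.group()) if match else 1
```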

Frequently Asked Questions