Devstral Medium vs Gemma 4 31B

In our testing, Gemma 4 31B is the clear pick for most teams: it wins 10 of 12 benchmarks and is materially cheaper per token. Devstral Medium ties on classification and long context but does not win any benchmark in our suite; pick Devstral only if you must use Mistral's offering despite the higher cost.

Mistral

Devstral Medium

Overall
3.17/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 3/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.400/MTok
Output: $2.00/MTok
Context Window: 131K

modelpicker.net

Google

Gemma 4 31B

Overall
4.42/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 4/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 5/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.130/MTok
Output: $0.380/MTok
Context Window: 262K

Benchmark Analysis

Summary from our 12-test suite: Gemma 4 31B wins 10 tests, Devstral Medium wins 0, and there are 2 ties (classification and long context). Scores (Devstral vs Gemma) and what they imply:

1) Tool calling (3 vs 5): Gemma wins and is tied for 1st with 16 others out of 54. In our tests Gemma was more reliable at selecting functions, sequencing calls, and filling arguments in tool-driven workflows.
2) Faithfulness (4 vs 5): Gemma wins and is tied for 1st of 55. Expect fewer hallucinations and closer adherence to source material with Gemma in our tests.
3) Structured output (4 vs 5): Gemma wins and is tied for 1st of 54 on JSON/schema compliance. Gemma better matches strict format requirements.
4) Strategic analysis (2 vs 5): Gemma wins and is tied for 1st of 54. Gemma handled nuanced and numeric tradeoff reasoning far better in our scenarios.
5) Constrained rewriting (3 vs 4): Gemma wins and ranks 6th of 53. Gemma compresses and rewrites to tight limits more reliably.
6) Creative problem solving (2 vs 4): Gemma wins and ranks 9th of 54. Gemma produced more specific, feasible ideas in our creative prompts.
7) Agentic planning (4 vs 5): Gemma wins and is tied for 1st of 54. Gemma better decomposes goals and recovers from failures.
8) Persona consistency (3 vs 5): Gemma wins and is tied for 1st of 53. Gemma kept character and resisted injection better in our tests.
9) Multilingual (4 vs 5): Gemma wins and is tied for 1st of 55. Gemma produced higher-quality non-English output in our samples.
10) Safety calibration (1 vs 2): Gemma wins and ranks 12th of 55. Gemma made more appropriate allow/refuse choices in our safety prompts, although both models score low here.
11) Classification (4 vs 4): tie. Both models are tied for 1st with 29 others out of 53 tested, so both are good at routing and categorization in our suite.
12) Long context (4 vs 4): tie. Both models rank 38th of 55, indicating similar retrieval accuracy at 30K+ tokens in our experiments.

Practical takeaway: Gemma outperforms Devstral across tool-driven, reasoning, format-sensitive, and multilingual tasks in our testing, and also leads on safety calibration, where both models score low (1/5 vs 2/5). No external benchmark scores (SWE-bench Verified, MATH Level 5, AIME 2025) are available for either model.

| Benchmark | Devstral Medium | Gemma 4 31B |
|---|---|---|
| Faithfulness | 4/5 | 5/5 |
| Long Context | 4/5 | 4/5 |
| Multilingual | 4/5 | 5/5 |
| Tool Calling | 3/5 | 5/5 |
| Classification | 4/5 | 4/5 |
| Agentic Planning | 4/5 | 5/5 |
| Structured Output | 4/5 | 5/5 |
| Safety Calibration | 1/5 | 2/5 |
| Strategic Analysis | 2/5 | 5/5 |
| Persona Consistency | 3/5 | 5/5 |
| Constrained Rewriting | 3/5 | 4/5 |
| Creative Problem Solving | 2/5 | 4/5 |
| Summary | 0 wins | 10 wins |
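
The win/tie tally can be recomputed directly from the twelve scores. The sketch below is illustrative only: the scores are copied from the table, the function names `tally` and `mean_score` are made up for this example, and the assumption that the overall rating is an unweighted mean of the twelve scores is ours (it is consistent with the 3.17 and 4.42 figures on the cards, but the page does not state how the overall rating is derived).

```python
# Scores copied from the comparison table: (Devstral Medium, Gemma 4 31B).
SCORES = {
    "Faithfulness": (4, 5),
    "Long Context": (4, 4),
    "Multilingual": (4, 5),
    "Tool Calling": (3, 5),
    "Classification": (4, 4),
    "Agentic Planning": (4, 5),
    "Structured Output": (4, 5),
    "Safety Calibration": (1, 2),
    "Strategic Analysis": (2, 5),
    "Persona Consistency": (3, 5),
    "Constrained Rewriting": (3, 4),
    "Creative Problem Solving": (2, 4),
}


def tally(scores):
    """Return (Devstral wins, Gemma wins, ties) across all benchmarks."""
    devstral_wins = sum(1 for d, g in scores.values() if d > g)
    gemma_wins = sum(1 for d, g in scores.values() if g > d)
    ties = sum(1 for d, g in scores.values() if d == g)
    return devstral_wins, gemma_wins, ties


def mean_score(scores, index):
    """Unweighted mean of one model's scores.

    ASSUMPTION: a simple mean is only a guess at how the overall rating is built.
    """
    values = [pair[index] for pair in scores.values()]
    return sum(values) / len(values)


if __name__ == "__main__":
    print(tally(SCORES))                    # (Devstral wins, Gemma wins, ties)
    print(round(mean_score(SCORES, 0), 2))  # Devstral Medium
    print(round(mean_score(SCORES, 1), 2))  # Gemma 4 31B
```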

Pricing Analysis

Listed unit prices: Devstral Medium charges $0.40 per input MTok and $2.00 per output MTok; Gemma 4 31B charges $0.13 per input MTok and $0.38 per output MTok. The output price ratio (Devstral output ÷ Gemma output) is 5.263. Using a simple 50/50 input/output token split as an illustrative example, per-month costs are:
• 1M tokens: Devstral ≈ $1.20 vs Gemma ≈ $0.255.
• 10M tokens: Devstral ≈ $12.00 vs Gemma ≈ $2.55.
• 100M tokens: Devstral ≈ $120 vs Gemma ≈ $25.50.
Under this split Devstral costs about 4.7 times as much as Gemma at any volume, so the absolute gap grows linearly with usage: roughly $94.50 per month at 100M tokens and about $945 per month at 1B tokens. Cost-sensitive products, consumer apps, and startups should prefer Gemma for price-performance. The Devstral price is only defensible if you have non-cost constraints (vendor requirements, existing contracts) or specific integration needs not captured in these benchmarks.
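
To check the blended monthly cost for your own traffic mix, the arithmetic can be sketched as below. The prices are the listed per-MTok rates from the pricing cards; the function name `monthly_cost` and the `input_share` parameter are hypothetical names introduced for this example, and the 50/50 default simply mirrors the illustrative split used above.

```python
# Listed prices in USD per million tokens (MTok), copied from the pricing cards.
PRICES = {
    "Devstral Medium": {"input": 0.40, "output": 2.00},
    "Gemma 4 31B": {"input": 0.13, "output": 0.38},
}


def monthly_cost(model, total_tokens, input_share=0.5):
    """USD cost of total_tokens, with input_share of them billed as input tokens."""
    price = PRICES[model]
    input_mtok = total_tokens * input_share / 1_000_000
    output_mtok = total_tokens * (1 - input_share) / 1_000_000
    return input_mtok * price["input"] + output_mtok * price["output"]


if __name__ == "__main__":
    for volume in (1_000_000, 10_000_000, 100_000_000):
        devstral = monthly_cost("Devstral Medium", volume)
        gemma = monthly_cost("Gemma 4 31B", volume)
        print(f"{volume:>11,} tokens: Devstral ${devstral:,.2f} vs Gemma ${gemma:,.3f}")
```

Real workloads are rarely split 50/50; output-heavy traffic widens the gap toward the 5.263 output price ratio, while input-heavy traffic narrows it toward the input price ratio of about 3.1.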

Real-World Cost Comparison

| Task | Devstral Medium | Gemma 4 31B |
|---|---|---|
| Chat response | $0.0011 | <$0.001 |
| Blog post | $0.0042 | <$0.001 |
| Document batch | $0.108 | $0.022 |
| Pipeline run | $1.08 | $0.216 |
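
The page does not state the token counts behind each task, so per-task figures like these can only be recomputed under an assumed token mix. The sketch below uses hypothetical input/output token counts chosen purely for illustration (they are not from the source, although they yield figures in line with the table at the listed prices); `TASKS` and `task_cost` are made-up names for this example.

```python
# Listed prices in USD per million tokens (MTok), copied from the pricing cards.
PRICES = {
    "Devstral Medium": {"input": 0.40, "output": 2.00},
    "Gemma 4 31B": {"input": 0.13, "output": 0.38},
}

# HYPOTHETICAL (input_tokens, output_tokens) per task; not stated by the source.
TASKS = {
    "Chat response": (200, 500),
    "Blog post": (500, 2_000),
    "Document batch": (20_000, 50_000),
    "Pipeline run": (200_000, 500_000),
}


def task_cost(model, input_tokens, output_tokens):
    """USD cost of one task at the model's listed per-MTok prices."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


if __name__ == "__main__":
    for task, (tokens_in, tokens_out) in TASKS.items():
        devstral = task_cost("Devstral Medium", tokens_in, tokens_out)
        gemma = task_cost("Gemma 4 31B", tokens_in, tokens_out)
        print(f"{task:<15} Devstral ${devstral:.4f}  Gemma ${gemma:.4f}")
```

Swapping in token counts measured from your own workload gives a more meaningful comparison than any generic task profile.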

Bottom Line

Choose Gemma 4 31B if you need the best value and higher accuracy on tool calling, strategic reasoning, structured outputs, multilingual tasks, and safety calibration: it wins 10 of 12 tests in our suite and costs far less per token. Choose Devstral Medium only if you have a firm constraint to use Mistral's model or special integration reasons; it ties on classification and long context but does not win any benchmark in our testing and costs substantially more.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions