Devstral Medium vs Gemma 4 31B
In our testing, Gemma 4 31B is the clear pick for most teams: it wins 10 of 12 benchmarks and is materially cheaper per token. Devstral Medium ties on classification and long-context but does not win any benchmark in our suite; pick Devstral only if you must use Mistral's offering despite the higher cost.
Devstral Medium (Mistral)
Pricing: $0.40/MTok input, $2.00/MTok output

Gemma 4 31B
Pricing: $0.13/MTok input, $0.38/MTok output
Benchmark Analysis
Summary from our 12-test suite: Gemma 4 31B wins 10 tests, Devstral Medium wins none, and two are ties (classification, long_context). Scores below are listed Devstral → Gemma, with what each result implies:
1) tool_calling (3 vs 5): Gemma wins and is tied for 1st (with 16 others) of 54 models on tool calling; it is more reliable at selecting functions, sequencing calls, and filling arguments in real tool-driven workflows.
2) faithfulness (4 vs 5): Gemma wins and is tied for 1st of 55; expect fewer hallucinations and closer adherence to source material.
3) structured_output (4 vs 5): Gemma wins and is tied for 1st of 54 on JSON/schema compliance; it matches strict format requirements more consistently.
4) strategic_analysis (2 vs 5): Gemma wins and is tied for 1st of 54; it handles nuanced tradeoff reasoning and numeric tradeoffs far better in our scenarios.
5) constrained_rewriting (3 vs 4): Gemma wins and ranks 6th of 53; it compresses and rewrites to tight limits more reliably.
6) creative_problem_solving (2 vs 4): Gemma wins and ranks 9th of 54; it produced more specific, feasible ideas in our creative prompts.
7) agentic_planning (4 vs 5): Gemma wins and is tied for 1st of 54; it decomposes goals and recovers from failures better.
8) persona_consistency (3 vs 5): Gemma wins and is tied for 1st of 53; it kept character and resisted prompt injection better in our tests.
9) multilingual (4 vs 5): Gemma wins and is tied for 1st of 55; it produced higher-quality non-English output in our samples.
10) safety_calibration (1 vs 2): Gemma wins and ranks 12th of 55; it made more appropriate allow/refuse choices in our safety prompts.
11) classification (4 vs 4): tie; both models are tied for 1st (with 29 others) of 53, so both handle routing and categorization well in our suite.
12) long_context (4 vs 4): tie; both rank 38th of 55, indicating similar retrieval accuracy at 30K+ tokens in our experiments.
Practical takeaway: Gemma outperforms Devstral across tool-driven, reasoning, format-sensitive, multilingual, and safety-sensitive tasks in our testing. External benchmark scores are not available for either model.
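As a quick check on the headline tally, the short Python sketch below (an illustrative verification only, using the Devstral → Gemma score pairs listed above) reproduces the 10-0-2 win/tie count:

    # Devstral → Gemma scores from the benchmark analysis above (1–5 scale).
    scores = {
        "tool_calling": (3, 5), "faithfulness": (4, 5), "structured_output": (4, 5),
        "strategic_analysis": (2, 5), "constrained_rewriting": (3, 4),
        "creative_problem_solving": (2, 4), "agentic_planning": (4, 5),
        "persona_consistency": (3, 5), "multilingual": (4, 5),
        "safety_calibration": (1, 2), "classification": (4, 4), "long_context": (4, 4),
    }

    gemma_wins = sum(g > d for d, g in scores.values())      # Gemma scored higher
    devstral_wins = sum(d > g for d, g in scores.values())   # Devstral scored higher
    ties = sum(d == g for d, g in scores.values())           # equal scores
    print(gemma_wins, devstral_wins, ties)  # -> 10 0 2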
Pricing Analysis
Raw unit prices: Devstral Medium charges $0.40 per million input tokens (MTok) and $2.00 per million output tokens; Gemma 4 31B charges $0.13/MTok input and $0.38/MTok output. The output-price ratio (Devstral ÷ Gemma) is about 5.3. Using a simple 50/50 input/output token split as an illustrative example, monthly costs are:
• 1M tokens: Devstral ≈ $1.20 vs Gemma ≈ $0.26
• 10M tokens: Devstral ≈ $12.00 vs Gemma ≈ $2.55
• 100M tokens: Devstral ≈ $120.00 vs Gemma ≈ $25.50
The absolute amounts are small at these volumes, but the gap scales linearly with usage, so high-volume apps see a roughly 4–5x cost difference compound month after month; cost-sensitive products, consumer apps, and startups should prefer Gemma for price-performance. The Devstral price is only defensible if you have non-cost constraints (vendor requirements, existing contracts) or specific integration needs not captured in these benchmarks.
Real-World Cost Comparison
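The illustrative monthly figures above can be reproduced with a short Python sketch. The 50/50 input/output split and the monthly_cost helper are assumptions for illustration, not measured usage:

    # Per-million-token (MTok) prices in USD, from the pricing section above.
    PRICES = {
        "Devstral Medium": {"input": 0.40, "output": 2.00},
        "Gemma 4 31B": {"input": 0.13, "output": 0.38},
    }

    def monthly_cost(model: str, total_tokens: int, input_share: float = 0.5) -> float:
        """Blended USD cost for total_tokens at the given input/output split (assumed 50/50)."""
        p = PRICES[model]
        mtok = total_tokens / 1_000_000
        return mtok * (input_share * p["input"] + (1 - input_share) * p["output"])

    for volume in (1_000_000, 10_000_000, 100_000_000):
        for model in PRICES:
            print(f"{model}: {volume:,} tokens/month -> ${monthly_cost(model, volume):,.2f}")

Adjusting input_share lets you model workloads that are input-heavy (e.g., long-context retrieval) or output-heavy (e.g., long-form generation).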
Bottom Line
Choose Gemma 4 31B if you need the best value and higher accuracy on tooling, strategic reasoning, structured outputs, multilingual tasks, and safety calibration: it wins 10 of 12 tests in our suite and costs far less per token. Choose Devstral Medium only if you have a firm constraint to use Mistral's model or specific integration reasons; it ties on classification and long-context but does not win any benchmark in our testing and costs substantially more.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.