Devstral Small 1.1 vs Gemma 4 31B
On our 12-test suite, Gemma 4 31B is the clear winner for most production use cases: it wins 9 of 12 benchmarks, notably tool calling (5 vs 4) and strategic analysis (5 vs 2). Devstral Small 1.1 is the cheaper option and ties Gemma on classification, safety calibration, and long context, so pick it if cost is the priority or your workload is text-only.
mistral
Devstral Small 1.1
Pricing
Input: $0.100/MTok
Output: $0.300/MTok
modelpicker.net
Gemma 4 31B
Pricing
Input: $0.130/MTok
Output: $0.380/MTok
Benchmark Analysis
Across our 12-test suite, Gemma 4 31B wins 9 of 12 tasks: structured output 5 vs 4 (Gemma tied for 1st of 54), strategic analysis 5 vs 2 (Gemma tied for 1st; Devstral ranks 44 of 54), tool calling 5 vs 4 (Gemma tied for 1st of 54), faithfulness 5 vs 4 (Gemma tied for 1st of 55), persona consistency 5 vs 2 (Gemma tied for 1st; Devstral ranks 51 of 53), agentic planning 5 vs 2 (Gemma tied for 1st; Devstral ranks 53 of 54), multilingual 5 vs 4 (Gemma tied for 1st), creative problem solving 4 vs 2 (Gemma ranks 9th; Devstral ranks 47th), and constrained rewriting 4 vs 3 (Gemma ranks 6th; Devstral ranks 31st). Devstral does not win any benchmark outright in our tests; it ties Gemma on classification (both 4, tied for 1st), long context (both 4), and safety calibration (both 2).

What this means for real tasks: Gemma is substantially stronger for tool-driven workflows, multi-step planning, and situations that require tight, schema-compliant outputs. Devstral remains competitive for classification and long-context handling in text-only scenarios but trails on agentic planning and persona fidelity.

Note also Gemma's larger context window (262,144 tokens vs Devstral's 131,072) and multimodal input (text+image+video -> text) versus Devstral's text-only modality. These factors can matter even when the long-context scores tie.
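The win/tie count can be checked directly from the scores quoted above. The following is a minimal sketch, not part of our test harness: the `SCORES` table and the `tally` helper are illustrative names, and the numbers are simply the (Gemma, Devstral) pairs listed in this section.

```python
# (Gemma 4 31B, Devstral Small 1.1) scores as quoted in the analysis above.
# The dictionary and helper below are illustrative, not part of any real API.
SCORES = {
    "structured output": (5, 4),
    "strategic analysis": (5, 2),
    "tool calling": (5, 4),
    "faithfulness": (5, 4),
    "persona consistency": (5, 2),
    "agentic planning": (5, 2),
    "multilingual": (5, 4),
    "creative problem solving": (4, 2),
    "constrained rewriting": (4, 3),
    "classification": (4, 4),
    "long context": (4, 4),
    "safety calibration": (2, 2),
}


def tally(scores):
    """Return (wins for first model, wins for second model, ties)."""
    first = sum(1 for a, b in scores.values() if a > b)
    second = sum(1 for a, b in scores.values() if b > a)
    ties = sum(1 for a, b in scores.values() if a == b)
    return first, second, ties


if __name__ == "__main__":
    gemma_wins, devstral_wins, ties = tally(SCORES)
    print(f"Gemma wins: {gemma_wins}, Devstral wins: {devstral_wins}, ties: {ties}")
```

Run as a script, this reports 9 wins for Gemma, none for Devstral, and 3 ties, which matches the summary above.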
Pricing Analysis
Prices are quoted per MTok, that is, per one million tokens. Devstral Small 1.1: input $0.10/MTok, output $0.30/MTok. Gemma 4 31B: input $0.13/MTok, output $0.38/MTok. Assuming a 50/50 split of input and output tokens, 1,000,000 tokens cost about $0.20 on Devstral (500k input = $0.05 + 500k output = $0.15) and about $0.255 on Gemma (500k input = $0.065 + 500k output = $0.19). At scale that is about $2.00 vs $2.55 for 10M tokens/month, and about $20.00 vs $25.50 for 100M tokens/month. The price ratio of 0.789 corresponds to the output prices ($0.30 / $0.38): Devstral is about 21% cheaper on output and about 22% cheaper on this blended basis. The relative gap is the same at any volume, so high-volume apps, multi-tenant SaaS, and analytics pipelines are where the difference accumulates; smaller projects, or those that need Gemma's stronger capabilities, may justify the higher spend.
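To re-run this arithmetic for a different volume or input/output mix, a small helper is enough. This is a minimal sketch under one stated assumption: MTok means one million tokens, as in the price cards at the top of the page. The function name `blended_cost_usd` and the `PRICES` table are hypothetical and exist only for illustration.

```python
# Listed prices in USD per MTok (assumed to mean per 1,000,000 tokens).
PRICES = {
    "Devstral Small 1.1": {"input": 0.10, "output": 0.30},
    "Gemma 4 31B": {"input": 0.13, "output": 0.38},
}

TOKENS_PER_MTOK = 1_000_000


def blended_cost_usd(total_tokens, input_price, output_price, input_share=0.5):
    """Cost in USD for total_tokens, split between input and output.

    input_share is the fraction of tokens billed at the input price;
    the remainder is billed at the output price.
    """
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    total = input_tokens * input_price + output_tokens * output_price
    return total / TOKENS_PER_MTOK


if __name__ == "__main__":
    for volume in (1_000_000, 10_000_000, 100_000_000):
        for model, price in PRICES.items():
            cost = blended_cost_usd(volume, price["input"], price["output"])
            print(f"{model:>20} @ {volume:>11,} tokens: ${cost:,.3f}")
```

Changing `input_share` shows how the gap moves with the workload: a prompt-heavy pipeline (high input share) pays closer to the input prices, where the relative difference between the two models is slightly larger.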
Bottom Line
Choose Devstral Small 1.1 if: you need a lower-cost, text-only model for high-volume classification or long-context text tasks and want to save about 21% on token spend at the listed prices. Choose Gemma 4 31B if: you need best-in-class tool calling, agentic planning, strategic analysis, persona consistency, multimodal inputs, or strict structured-output reliability. Gemma wins 9 of 12 tests in our suite and is worth the extra cost for those needs.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.