Devstral Small 1.1 vs Gemma 4 31B

On our 12-test suite, Gemma 4 31B is the clear winner for most production use cases: it wins 9 of 12 benchmarks, most decisively agentic planning and strategic analysis (both 5 vs 2), plus tool calling (5 vs 4). Devstral Small 1.1 is the cheaper option with comparable classification and tied safety and long-context scores, so pick it if cost is the priority or a text-only model is all you need.

Mistral

Devstral Small 1.1

Overall
3.08/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
2/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
2/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.100/MTok

Output

$0.300/MTok

Context Window: 131K

modelpicker.net

Google

Gemma 4 31B

Overall
4.42/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
4/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
5/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.130/MTok

Output

$0.380/MTok

Context Window: 262K


Benchmark Analysis

Across our 12-test suite, Gemma 4 31B wins the majority of tasks:

Structured output: 5 vs 4 (Gemma tied for 1st of 54)
Strategic analysis: 5 vs 2 (Gemma tied for 1st; Devstral ranks 44 of 54)
Tool calling: 5 vs 4 (Gemma tied for 1st of 54)
Faithfulness: 5 vs 4 (Gemma tied for 1st of 55)
Persona consistency: 5 vs 2 (Gemma tied for 1st; Devstral ranks 51 of 53)
Agentic planning: 5 vs 2 (Gemma tied for 1st; Devstral ranks 53 of 54)
Multilingual: 5 vs 4 (Gemma tied for 1st)
Creative problem solving: 4 vs 2 (Gemma rank 9 vs Devstral rank 47)
Constrained rewriting: 4 vs 3 (Gemma rank 6 vs Devstral rank 31)

Devstral doesn't win any benchmark outright in our tests; it ties Gemma on classification (both 4/5, tied for 1st), long context (both 4/5), and safety calibration (both 2/5).

What this means for real tasks: Gemma is substantially stronger for tool-driven workflows, multi-step planning, and situations requiring tight, schema-compliant outputs. Devstral remains competitive for classification and long-context work in text-only scenarios but trails on agentic planning and persona fidelity. Note also Gemma's larger context window (262,144 vs Devstral's 131,072 tokens) and multimodal input (text, image, and video to text) versus Devstral's text-only modality, factors that can matter even when long-context scores tie.

Benchmark                   Devstral Small 1.1   Gemma 4 31B
Faithfulness                4/5                  5/5
Long Context                4/5                  4/5
Multilingual                4/5                  5/5
Tool Calling                4/5                  5/5
Classification              4/5                  4/5
Agentic Planning            2/5                  5/5
Structured Output           4/5                  5/5
Safety Calibration          2/5                  2/5
Strategic Analysis          2/5                  5/5
Persona Consistency         2/5                  5/5
Constrained Rewriting       3/5                  4/5
Creative Problem Solving    2/5                  4/5
Summary                     0 wins               9 wins
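The head-to-head tally above is easy to verify with a short script. This is a sketch using the scores from the table (the dictionary and variable names are illustrative, not the site's actual scoring code); note that the simple per-benchmark mean also reproduces the overall ratings shown on the cards (3.08 and 4.42).

```python
# Per-benchmark scores from the comparison table: (Devstral Small 1.1, Gemma 4 31B).
scores = {
    "Faithfulness": (4, 5),
    "Long Context": (4, 4),
    "Multilingual": (4, 5),
    "Tool Calling": (4, 5),
    "Classification": (4, 4),
    "Agentic Planning": (2, 5),
    "Structured Output": (4, 5),
    "Safety Calibration": (2, 2),
    "Strategic Analysis": (2, 5),
    "Persona Consistency": (2, 5),
    "Constrained Rewriting": (3, 4),
    "Creative Problem Solving": (2, 4),
}

devstral_wins = sum(d > g for d, g in scores.values())
gemma_wins = sum(g > d for d, g in scores.values())
ties = sum(d == g for d, g in scores.values())
print(f"Devstral wins: {devstral_wins}, Gemma wins: {gemma_wins}, ties: {ties}")
# Devstral wins: 0, Gemma wins: 9, ties: 3

# Simple mean over the 12 benchmarks matches the overall card ratings.
devstral_avg = sum(d for d, _ in scores.values()) / len(scores)  # 3.08
gemma_avg = sum(g for _, g in scores.values()) / len(scores)     # 4.42
```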

Pricing Analysis

Prices are quoted per million tokens (MTok). Devstral Small 1.1: input $0.10/MTok, output $0.30/MTok. Gemma 4 31B: input $0.13/MTok, output $0.38/MTok. Assuming a 50/50 split of input and output tokens, 1,000,000 tokens cost: Devstral ≈ $0.20 (500K input = $0.05 + 500K output = $0.15), Gemma ≈ $0.255 (500K input = $0.065 + 500K output = $0.19). At scale that's Devstral ≈ $2.00 vs Gemma ≈ $2.55 for 10M tokens/month, and Devstral ≈ $20.00 vs Gemma ≈ $25.50 for 100M tokens/month. The blended price ratio (≈0.78) makes Devstral roughly 21% cheaper overall. High-volume apps, multi-tenant SaaS, and analytics pipelines should care about the difference; smaller projects, or those that need Gemma's stronger capabilities, may justify the higher spend.
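The blended-cost arithmetic above can be sketched as a small helper. The function name, signature, and 50/50 default split are illustrative assumptions, not an API from the site:

```python
def blended_cost(tokens: int, input_price: float, output_price: float,
                 input_share: float = 0.5) -> float:
    """Cost in USD for `tokens` total tokens, given per-million-token (MTok)
    prices and the fraction of tokens that are input (rest are output)."""
    input_tokens = tokens * input_share
    output_tokens = tokens - input_tokens
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Per-MTok prices from the cards above, 50/50 input/output split.
devstral = blended_cost(1_000_000, 0.10, 0.30)  # ≈ $0.20 per 1M tokens
gemma = blended_cost(1_000_000, 0.13, 0.38)     # ≈ $0.255 per 1M tokens
print(f"ratio: {devstral / gemma:.3f}")          # ≈ 0.784, i.e. ~21% cheaper
```

Because the ratio is scale-invariant, the ~21% saving holds whether you run 1M or 100M tokens a month; only the absolute dollar gap grows.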

Real-World Cost Comparison

Task              Devstral Small 1.1   Gemma 4 31B
Chat response     <$0.001              <$0.001
Blog post         <$0.001              <$0.001
Document batch    $0.017               $0.022
Pipeline run      $0.170               $0.216

Bottom Line

Choose Devstral Small 1.1 if: you need a lower-cost, text-only model for high-volume classification or long-context text tasks and want to save roughly 21% on token spend. Choose Gemma 4 31B if: you need best-in-class tool calling, agentic planning, strategic analysis, persona consistency, multimodal inputs, or strict structured-output reliability; Gemma wins 9 of 12 tests in our suite and is worth the extra cost for those needs.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions