Devstral Medium vs Gemma 4 26B A4B

Gemma 4 26B A4B is the clear winner across our benchmarks, beating Devstral Medium on 8 of 12 tests, tying on the remaining 4, and losing none, all at roughly one-sixth the output cost ($0.35 vs $2.00 per million tokens). Devstral Medium's case rests on its positioning as a purpose-built code generation and agentic reasoning model, but our benchmark data does not show that positioning translating into higher scores on the tests we ran. Given the price-to-performance gap, Gemma 4 26B A4B is the default choice unless your workflow demands something Devstral Medium specifically provides.

Mistral

Devstral Medium

Overall: 3.17/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 3/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.400/MTok
Output: $2.00/MTok
Context Window: 131K

modelpicker.net

Google

Gemma 4 26B A4B

Overall: 4.25/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.080/MTok
Output: $0.350/MTok
Context Window: 262K


Benchmark Analysis

Across our 12-test suite, Gemma 4 26B A4B wins 8 benchmarks, ties 4, and loses none. Devstral Medium wins zero and ties 4.

Tool Calling (5 vs 3): Gemma 4 26B A4B scores 5/5 (tied for 1st with 16 other models out of 54 tested); Devstral Medium scores 3/5 (rank 47 of 54). This is a significant gap for agentic and automation use cases where function selection and argument accuracy are critical.

Strategic Analysis (5 vs 2): Gemma 4 26B A4B scores 5/5 (tied for 1st with 25 others out of 54); Devstral Medium scores 2/5 (rank 44 of 54). In our testing, Devstral Medium placed near the bottom on nuanced tradeoff reasoning with real numbers.

Creative Problem Solving (4 vs 2): Gemma 4 26B A4B scores 4/5 (rank 9 of 54); Devstral Medium scores 2/5 (rank 47 of 54). A 2-point gap here means Devstral Medium struggles to generate non-obvious, specific, and feasible ideas.

Faithfulness (5 vs 4): Gemma 4 26B A4B scores 5/5 (tied for 1st with 32 others out of 55); Devstral Medium scores 4/5 (rank 34 of 55). Both are solid, but Gemma 4 26B A4B is at the ceiling on sticking to source material without hallucinating.

Long Context (5 vs 4): Gemma 4 26B A4B scores 5/5 (tied for 1st with 36 others out of 55) and has a 262,144-token context window; Devstral Medium scores 4/5 (rank 38 of 55) with a 131,072-token context window. The doubling of context capacity plus the higher score makes Gemma 4 26B A4B the clear pick for document-heavy workflows.

Multilingual (5 vs 4): Gemma 4 26B A4B scores 5/5 (tied for 1st with 34 others); Devstral Medium scores 4/5 (rank 36 of 55).

Persona Consistency (5 vs 3): Gemma 4 26B A4B scores 5/5 (tied for 1st with 36 others out of 53); Devstral Medium scores 3/5 (rank 45 of 53). A 2-point gap that matters for chatbot and assistant deployments.

Structured Output (5 vs 4): Gemma 4 26B A4B scores 5/5 (tied for 1st with 24 others out of 54); Devstral Medium scores 4/5 (rank 26 of 54). Both are competent at JSON schema compliance, but Gemma 4 26B A4B is at the top tier.

Ties (Constrained Rewriting, Classification, Safety Calibration, Agentic Planning): Both models score identically. On agentic planning, both score 4/5 (rank 16 of 54). On safety calibration, both score 1/5 (rank 32 of 55); neither model excels at refusing harmful requests while permitting legitimate ones, placing both below the field median of 2/5. On classification, both score 4/5 (tied for 1st with 29 others out of 53). On constrained rewriting, both score 3/5 (rank 31 of 53).

Gemma 4 26B A4B also supports multimodal input (text + image + video), which Devstral Medium does not — Devstral Medium is text-only.

Benchmark                | Devstral Medium | Gemma 4 26B A4B
Faithfulness             | 4/5             | 5/5
Long Context             | 4/5             | 5/5
Multilingual             | 4/5             | 5/5
Tool Calling             | 3/5             | 5/5
Classification           | 4/5             | 4/5
Agentic Planning         | 4/5             | 4/5
Structured Output        | 4/5             | 5/5
Safety Calibration       | 1/5             | 1/5
Strategic Analysis       | 2/5             | 5/5
Persona Consistency      | 3/5             | 5/5
Constrained Rewriting    | 3/5             | 3/5
Creative Problem Solving | 2/5             | 4/5
Summary                  | 0 wins          | 8 wins
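The summary row and the overall averages can be reproduced directly from the per-benchmark scores; a minimal Python sketch (scores copied from the table above):

```python
# Per-benchmark scores from the comparison table:
# (Devstral Medium, Gemma 4 26B A4B), each on a 1-5 scale.
scores = {
    "Faithfulness": (4, 5),
    "Long Context": (4, 5),
    "Multilingual": (4, 5),
    "Tool Calling": (3, 5),
    "Classification": (4, 4),
    "Agentic Planning": (4, 4),
    "Structured Output": (4, 5),
    "Safety Calibration": (1, 1),
    "Strategic Analysis": (2, 5),
    "Persona Consistency": (3, 5),
    "Constrained Rewriting": (3, 3),
    "Creative Problem Solving": (2, 4),
}

# Head-to-head tally: a win is strictly higher on that benchmark.
gemma_wins = sum(g > d for d, g in scores.values())
ties = sum(g == d for d, g in scores.values())
devstral_wins = sum(d > g for d, g in scores.values())

# Overall scores are the simple averages across the 12 benchmarks.
devstral_avg = sum(d for d, _ in scores.values()) / len(scores)
gemma_avg = sum(g for _, g in scores.values()) / len(scores)

print(f"Gemma wins {gemma_wins}, ties {ties}, Devstral wins {devstral_wins}")
print(f"Overall: {devstral_avg:.2f} vs {gemma_avg:.2f}")
```

Running this recovers the 8/4/0 split and the 3.17 vs 4.25 overall scores shown on the cards.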

Pricing Analysis

Devstral Medium costs $0.40/MTok input and $2.00/MTok output. Gemma 4 26B A4B costs $0.08/MTok input and $0.35/MTok output, making it 5x cheaper on input and 5.7x cheaper on output. At 1B output tokens/month, that's $2,000 vs $350, a $1,650 monthly difference. At 10B output tokens/month, the gap widens to $16,500 per month ($20,000 vs $3,500). At 100B output tokens/month, a volume a large production API integration can reach, you're looking at $200,000 vs $35,000, a $165,000 monthly difference that easily justifies engineering time spent evaluating which model fits your use case. For individual developers or small teams, even the low-volume tier makes Gemma 4 26B A4B the obvious cost-efficient pick. The only scenario where Devstral Medium's higher price makes sense is if it demonstrates a capability advantage on your specific workload, which our benchmarks do not show at a broad level.
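The cost arithmetic reduces to a flat per-MTok rate; a minimal sketch, where the prices come from the Pricing sections above and the monthly volumes are illustrative assumptions:

```python
# Output prices in dollars per million tokens (MTok),
# from the Pricing sections above.
DEVSTRAL_OUTPUT = 2.00
GEMMA_OUTPUT = 0.35

def monthly_output_cost(tokens_per_month: float, price_per_mtok: float) -> float:
    """Dollar cost of a month's output tokens at a flat $/MTok rate."""
    return tokens_per_month / 1_000_000 * price_per_mtok

# Illustrative monthly output volumes (assumed), in tokens.
for volume in (1e9, 10e9, 100e9):
    devstral = monthly_output_cost(volume, DEVSTRAL_OUTPUT)
    gemma = monthly_output_cost(volume, GEMMA_OUTPUT)
    print(f"{volume:,.0f} tokens/mo: ${devstral:,.0f} vs ${gemma:,.0f} "
          f"(difference ${devstral - gemma:,.0f}/mo)")
```

The loop prints the same $2,000-vs-$350 progression discussed above; swap in your own volumes to project your workload.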

Real-World Cost Comparison

Task           | Devstral Medium | Gemma 4 26B A4B
Chat response  | $0.0011         | <$0.001
Blog post      | $0.0042         | <$0.001
Document batch | $0.108          | $0.019
Pipeline run   | $1.08           | $0.191
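A per-task estimate combines input and output token counts with the two per-MTok rates; a hypothetical sketch, where the token counts are our own illustrative assumptions, not the exact profiles behind the table:

```python
# Prices as ($/MTok input, $/MTok output), from the Pricing sections above.
PRICES = {
    "Devstral Medium": (0.40, 2.00),
    "Gemma 4 26B A4B": (0.08, 0.35),
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task given its input and output token counts."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# Assumed profile for a short chat response: ~300 tokens in, ~500 out.
for model in PRICES:
    print(f"{model}: ${task_cost(model, 300, 500):.4f}")
```

Even a rough profile like this shows why every Gemma 4 26B A4B row in the table lands at a fraction of the Devstral Medium figure.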

Bottom Line

Choose Gemma 4 26B A4B if: you need the best benchmark-per-dollar ratio across general tasks; your workflows involve tool calling, strategic analysis, long documents (up to 262K tokens), or multimodal inputs (images and video); you are building production applications where output costs at scale are a primary concern; or you need strong persona consistency for assistant or chatbot products. At $0.35/MTok output, it is one of the most capable low-cost options in our tested set.

Choose Devstral Medium if: your specific production workload involves code generation and agentic software engineering tasks (per its product positioning as a Mistral + All Hands AI collaboration), and you have validated through your own testing that it outperforms Gemma 4 26B A4B on your target tasks. Our benchmarks do not show a general advantage for Devstral Medium, but domain-specific evaluation on your own codebase is always the final arbiter. Be prepared to pay a 5.7x output cost premium if you go this route.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions