Devstral Medium vs Gemma 4 26B A4B
Gemma 4 26B A4B is the clear winner across our benchmarks, outscoring Devstral Medium on 8 of 12 tests, tying on 4, and losing none, and it does so at roughly one-sixth the output cost ($0.35 vs $2.00 per million tokens). Devstral Medium's product description positions it as purpose-built for code generation and agentic reasoning, but our benchmark data does not show that positioning translating into higher scores on the tests we ran. At this price-to-performance gap, Gemma 4 26B A4B is the default choice unless your workflow demands something Devstral Medium specifically provides.
Devstral Medium (Mistral)
Pricing: $0.40/MTok input, $2.00/MTok output
modelpicker.net
Gemma 4 26B A4B
Pricing: $0.08/MTok input, $0.35/MTok output
Benchmark Analysis
Across our 12-test suite, Gemma 4 26B A4B wins 8 benchmarks, ties 4, and loses none. Devstral Medium wins zero and ties 4.
Tool Calling (5 vs 3): Gemma 4 26B A4B scores 5/5 (tied for 1st with 16 other models out of 54 tested); Devstral Medium scores 3/5 (rank 47 of 54). This is a significant gap for agentic and automation use cases where function selection and argument accuracy are critical.
Strategic Analysis (5 vs 2): Gemma 4 26B A4B scores 5/5 (tied for 1st with 25 others out of 54); Devstral Medium scores 2/5 (rank 44 of 54). In our testing, Devstral Medium placed near the bottom on nuanced tradeoff reasoning with real numbers.
Creative Problem Solving (4 vs 2): Gemma 4 26B A4B scores 4/5 (rank 9 of 54); Devstral Medium scores 2/5 (rank 47 of 54). In our testing, the 2-point gap suggests Devstral Medium struggles to generate non-obvious, specific, and feasible ideas.
Faithfulness (5 vs 4): Gemma 4 26B A4B scores 5/5 (tied for 1st with 32 others out of 55); Devstral Medium scores 4/5 (rank 34 of 55). Both are solid, but Gemma 4 26B A4B is at the ceiling on sticking to source material without hallucinating.
Long Context (5 vs 4): Gemma 4 26B A4B scores 5/5 (tied for 1st with 36 others out of 55) and has a 262,144-token context window; Devstral Medium scores 4/5 (rank 38 of 55) with a 131,072-token context window. The doubling of context capacity plus the higher score makes Gemma 4 26B A4B the clear pick for document-heavy workflows.
Multilingual (5 vs 4): Gemma 4 26B A4B scores 5/5 (tied for 1st with 34 others out of 55); Devstral Medium scores 4/5 (rank 36 of 55).
Persona Consistency (5 vs 3): Gemma 4 26B A4B scores 5/5 (tied for 1st with 36 others out of 53); Devstral Medium scores 3/5 (rank 45 of 53). A 2-point gap that matters for chatbot and assistant deployments.
Structured Output (5 vs 4): Gemma 4 26B A4B scores 5/5 (tied for 1st with 24 others out of 54); Devstral Medium scores 4/5 (rank 26 of 54). Both are competent at JSON schema compliance, but Gemma 4 26B A4B is at the top tier.
Ties (Constrained Rewriting, Classification, Safety Calibration, Agentic Planning): Both models score identically. On agentic planning, both score 4/5 (rank 16 of 54). On safety calibration, both score 1/5 (rank 32 of 55); neither model excels at refusing harmful requests while permitting legitimate ones, and both sit below the field median of 2/5. On classification, both score 4/5 (tied for 1st with 29 others out of 53). On constrained rewriting, both score 3/5 (rank 31 of 53).
Gemma 4 26B A4B also supports multimodal input (text + image + video), which Devstral Medium does not — Devstral Medium is text-only.
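The win/tie/loss tally can be reproduced from the per-test scores listed in this section. The following is a minimal sketch, not our actual scoring pipeline: the `SCORES` table and the `tally` function are illustrative names, and the agentic planning entry assumes both models scored 4 out of 5.

```python
# Per-test scores as (Devstral Medium, Gemma 4 26B A4B), copied from the
# analysis in this section. The dictionary layout is illustrative only.
SCORES = {
    "tool calling": (3, 5),
    "strategic analysis": (2, 5),
    "creative problem solving": (2, 4),
    "faithfulness": (4, 5),
    "long context": (4, 5),
    "multilingual": (4, 5),
    "persona consistency": (3, 5),
    "structured output": (4, 5),
    "constrained rewriting": (3, 3),
    "classification": (4, 4),
    "safety calibration": (1, 1),
    "agentic planning": (4, 4),  # assumed 4/5 for both; tied either way
}


def tally(scores: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Count how many tests each model wins and how many are tied."""
    result = {"devstral_wins": 0, "gemma_wins": 0, "ties": 0}
    for devstral, gemma in scores.values():
        if gemma > devstral:
            result["gemma_wins"] += 1
        elif devstral > gemma:
            result["devstral_wins"] += 1
        else:
            result["ties"] += 1
    return result


if __name__ == "__main__":
    print(tally(SCORES))
```

Because the tally counts only which model scored higher, it ignores the size of each gap; a 1-point win and a 3-point win count the same.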
Pricing Analysis
Devstral Medium costs $0.40/MTok input and $2.00/MTok output. Gemma 4 26B A4B costs $0.08/MTok input and $0.35/MTok output, making it 5x cheaper on input and 5.7x cheaper on output. At 1M output tokens/month, that's $2.00 vs $0.35, a $1.65 monthly difference. At 10M output tokens/month, the gap widens to $16.50 per month ($20.00 vs $3.50). At 100M output tokens/month, a plausible volume for a production API integration, you're looking at $200 vs $35, a $165 monthly difference (about $1,980 per year). The absolute savings scale linearly with volume, so they become material only for high-throughput deployments, but the 5.7x ratio holds at every scale. For individual developers or small teams, the dollar amounts at the 1M-token tier are small, yet Gemma 4 26B A4B is still the cost-efficient pick because it is cheaper and scores at least as well on every test we ran. The only scenario where Devstral Medium's higher price makes sense is if it demonstrates a capability advantage on your specific workload, which our benchmarks do not show at a broad level.
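To check output-cost figures at your own volume, the arithmetic is simply tokens divided by one million, multiplied by the per-MTok price. The sketch below uses the output prices quoted in this article; the function names and the sample volumes are illustrative, and it ignores input tokens, caching, and any volume discounts.

```python
# Output prices in USD per million tokens, as quoted in this article.
OUTPUT_PRICE_PER_MTOK = {
    "Devstral Medium": 2.00,
    "Gemma 4 26B A4B": 0.35,
}


def monthly_output_cost(tokens_per_month: int, price_per_mtok: float) -> float:
    """Return the monthly output cost in USD for a given token volume."""
    return tokens_per_month / 1_000_000 * price_per_mtok


def compare(tokens_per_month: int) -> dict[str, float]:
    """Return each model's monthly output cost plus the difference, in USD."""
    costs = {
        name: round(monthly_output_cost(tokens_per_month, price), 2)
        for name, price in OUTPUT_PRICE_PER_MTOK.items()
    }
    costs["difference"] = round(
        costs["Devstral Medium"] - costs["Gemma 4 26B A4B"], 2
    )
    return costs


if __name__ == "__main__":
    # Sample volumes: 1M, 10M, and 100M output tokens per month.
    for volume in (1_000_000, 10_000_000, 100_000_000):
        print(f"{volume:>11,} tokens/month -> {compare(volume)}")
```

Swapping in your own monthly token count and adding an input-price term would give a fuller estimate for workloads with long prompts.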
Bottom Line
Choose Gemma 4 26B A4B if: you need the best benchmark-per-dollar ratio across general tasks; your workflows involve tool calling, strategic analysis, long documents (up to 262K tokens), or multimodal inputs (images and video); you are building production applications where output costs at scale are a primary concern; or you need strong persona consistency for assistant or chatbot products. At $0.35/MTok output, it is one of the most capable low-cost options in our tested set.
Choose Devstral Medium if: your specific production workload involves code generation and agentic software engineering tasks (per its product positioning as a Mistral + All Hands AI collaboration), and you have validated through your own testing that it outperforms Gemma 4 26B A4B on your target tasks. Our benchmarks do not show a general advantage for Devstral Medium, but domain-specific evaluation on your own codebase is always the final arbiter. Be prepared to pay a 5.7x output cost premium if you go this route.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.