Devstral Small 1.1 vs Gemma 4 26B A4B
Gemma 4 26B A4B is the stronger general-purpose model, winning 9 of 12 benchmarks in our testing — including tool calling (5 vs 4), agentic planning (4 vs 2), strategic analysis (5 vs 2), and long context (5 vs 4). Devstral Small 1.1's only win is safety calibration (2 vs 1), where it handles harmful-request refusal more reliably. The pricing difference is marginal — Gemma 4 26B A4B costs $0.08/$0.35 per million tokens vs Devstral's $0.10/$0.30, so neither model carries a significant cost premium over the other.
Pricing at a glance:
Devstral Small 1.1 (Mistral): $0.10/MTok input, $0.30/MTok output
Gemma 4 26B A4B: $0.08/MTok input, $0.35/MTok output
Benchmark Analysis
Gemma 4 26B A4B wins 9 of 12 benchmarks, ties 2, and loses 1 in our testing. Here's the breakdown:
Tool Calling (5 vs 4): Gemma scores 5/5, tied for 1st among 54 models. Devstral scores 4/5 at rank 18 of 54. For agentic workflows that depend on accurate function selection and argument passing, Gemma is the stronger choice.
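To make the stakes concrete, here's the shape of a single tool-calling check. This is a minimal sketch against an OpenAI-compatible endpoint (how both models are commonly served), not our actual harness; the base URL, model identifier, and get_ticket_status tool are illustrative assumptions:

```python
from openai import OpenAI

# Hypothetical OpenAI-compatible endpoint; both models are commonly served this way.
client = OpenAI(base_url="https://your-provider.example/v1", api_key="...")

tools = [{
    "type": "function",
    "function": {
        "name": "get_ticket_status",  # illustrative tool, not from our benchmark
        "description": "Look up the status of a support ticket by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"ticket_id": {"type": "string"}},
            "required": ["ticket_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="devstral-small-1.1",  # assumed model identifier
    messages=[{"role": "user", "content": "What's the status of ticket TCK-4211?"}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:  # a passing response selects the right tool with well-formed arguments
    call = msg.tool_calls[0]
    print(call.function.name)       # expect: get_ticket_status
    print(call.function.arguments)  # expect: {"ticket_id": "TCK-4211"}
```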
Agentic Planning (4 vs 2): This is the sharpest gap. Gemma scores 4/5 at rank 16 of 54; Devstral scores 2/5 at rank 53 of 54, second to last. Despite being positioned as purpose-built for software engineering agents, Devstral sits at the bottom of our field on goal decomposition and failure recovery.
Strategic Analysis (5 vs 2): Gemma scores 5/5, tied for 1st among 54 models. Devstral scores 2/5 at rank 44 of 54. For tasks requiring nuanced tradeoff reasoning with real numbers, Gemma is dramatically better in our tests.
Structured Output (5 vs 4): Gemma scores 5/5, tied for 1st among 54 models. Devstral scores 4/5, rank 26 of 54. Both are solid, but Gemma's perfect score gives it an edge for strict JSON schema compliance.
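"Strict JSON schema compliance" here means the raw model output must parse and validate with no missing or extra fields. Here's a minimal sketch of that kind of check using the jsonschema library; the sentiment schema is an invented example, not one of our test cases:

```python
import json
from jsonschema import validate, ValidationError

# Invented example schema: a sentiment label plus a bounded confidence score.
SCHEMA = {
    "type": "object",
    "properties": {
        "sentiment": {"enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["sentiment", "confidence"],
    "additionalProperties": False,  # strict: extra keys fail the check
}

def is_schema_compliant(raw_output: str) -> bool:
    """True only if the model's raw text parses as JSON and validates against SCHEMA."""
    try:
        validate(instance=json.loads(raw_output), schema=SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

assert is_schema_compliant('{"sentiment": "positive", "confidence": 0.92}')
assert not is_schema_compliant('{"sentiment": "positive"}')  # missing required key
```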
Faithfulness (5 vs 4): Gemma scores 5/5, tied for 1st among 55 models. Devstral scores 4/5 at rank 34 of 55. For RAG pipelines and summarization tasks where hallucination is costly, Gemma is more reliable in our tests.
Long Context (5 vs 4): Gemma scores 5/5, tied for 1st among 55 models, and also offers a 262,144-token context window versus Devstral's 131,072. Devstral scores 4/5 at rank 38 of 55. The context window difference doubles Gemma's ceiling for large document workloads.
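To see what that ceiling means in practice, here's a back-of-envelope fit check. It's a sketch assuming the rough 4-characters-per-token heuristic for English text; real counts depend on each model's tokenizer, and the model keys and filename are illustrative:

```python
# Context windows from the comparison above, in tokens.
CONTEXT_WINDOWS = {
    "gemma-4-26b-a4b": 262_144,     # illustrative key
    "devstral-small-1.1": 131_072,  # illustrative key
}

def fits_in_context(text: str, model: str, reserve_for_output: int = 4_096) -> bool:
    """Rough fit check using ~4 chars/token; real tokenizers vary by model."""
    estimated_tokens = len(text) // 4
    return estimated_tokens + reserve_for_output <= CONTEXT_WINDOWS[model]

# Example: a ~700KB text file is roughly 175K tokens, which fits Gemma's
# window but overflows Devstral's.
with open("big_report.txt") as f:  # hypothetical document
    doc = f.read()
for model in CONTEXT_WINDOWS:
    print(model, "fits" if fits_in_context(doc, model) else "needs chunking")
```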
Persona Consistency (5 vs 2): Gemma scores 5/5, tied for 1st among 53 models. Devstral scores 2/5 at rank 51 of 53, near the bottom of the field. For chatbot, assistant, or roleplay applications, Devstral struggles significantly in our testing.
Multilingual (5 vs 4): Gemma scores 5/5, tied for 1st among 55 models. Devstral scores 4/5 at rank 36 of 55. Both are capable, but Gemma delivers more consistent non-English output in our tests.
Creative Problem Solving (4 vs 2): Gemma scores 4/5 at rank 9 of 54. Devstral scores 2/5 at rank 47 of 54. Gemma generates substantially more novel and feasible ideas in our testing.
Ties: both models score identically on Classification (4 vs 4) and Constrained Rewriting (3 vs 3).
Safety Calibration (2 vs 1): Devstral's only win. It scores 2/5 at rank 12 of 55, while Gemma scores 1/5 at rank 32 of 55. The median across all 55 models is 2, so Devstral sits at the median while Gemma falls below it. Neither model excels here, but Devstral is noticeably better at refusing harmful requests without over-refusing legitimate ones in our tests.
Modality note: Gemma 4 26B A4B accepts text, image, and video input. Devstral Small 1.1 is text-only.
Pricing Analysis
Devstral Small 1.1 costs $0.10 input / $0.30 output per million tokens. Gemma 4 26B A4B costs $0.08 input / $0.35 output per million tokens. For input-heavy workloads (e.g., long document processing), Gemma is 20% cheaper at $0.08 vs $0.10 per million input tokens, a $2 saving per 100M input tokens. For output-heavy workloads (e.g., code generation, long-form writing), Devstral is cheaper at $0.30 vs $0.35 per million output tokens, a $5 saving per 100M output tokens. At 1M tokens/month, the difference is cents either way. At 10M tokens/month, the gap is at most about 50 cents, and mixed input/output ratios can nearly cancel it out. Even at 100M tokens/month it tops out around $5, still small relative to typical API budgets. Cost should not drive this decision; benchmark performance should.
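If you want to project costs for your own traffic mix, the arithmetic is easy to script. A minimal sketch using the prices above; the model keys and the 3:1 input:output example split are illustrative:

```python
# USD per million tokens, from the pricing above.
PRICES = {
    "devstral-small-1.1": {"input": 0.10, "output": 0.30},  # illustrative key
    "gemma-4-26b-a4b": {"input": 0.08, "output": 0.35},     # illustrative key
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Cost in USD for a month's traffic, with volumes in millions of tokens."""
    p = PRICES[model]
    return input_mtok * p["input"] + output_mtok * p["output"]

# Example: 10M tokens/month at a 3:1 input:output split (7.5M in, 2.5M out).
# At this mix the two models end up within a few cents of each other.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 7.5, 2.5):.2f}")
```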
Bottom Line
Choose Gemma 4 26B A4B if you need a capable general-purpose model for agentic workflows, tool-calling pipelines, long-context document processing, multilingual applications, or any task requiring strategic reasoning or creative problem solving. Its 262K context window, multimodal input support (text, image, video), and top-tier scores on 9 of 12 benchmarks in our testing make it the default choice for most use cases.
Choose Devstral Small 1.1 if safety calibration is a hard requirement — it scores 2/5 vs Gemma's 1/5 in our testing, making it meaningfully better at refusing harmful requests while allowing legitimate ones. It's also worth considering if your workload is extremely output-heavy and the $0.05/million output token savings matters at scale, or if you specifically want a model positioned for software engineering agent tasks and are testing against real-world coding benchmarks not yet reflected in our suite. However, its near-last-place agentic planning score (rank 53 of 54 in our tests) is a significant caveat for autonomous coding agent deployments.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.