Devstral Small 1.1 vs Gemma 4 31B

On our 12-test suite, Gemma 4 31B is the clear winner for most production use cases: it wins 9 of 12 benchmarks, most decisively agentic planning and strategic analysis (both 5 vs 2), plus tool calling (5 vs 4). Devstral Small 1.1 is the cheaper option with comparable classification and tied safety and long-context scores, so pick it if cost is the priority or a text-only model is all you need.

Mistral

Devstral Small 1.1

Overall
3.08/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
2/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
2/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.100/MTok

Output

$0.300/MTok

Context Window: 131K

modelpicker.net

Google

Gemma 4 31B

Overall
4.42/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
4/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
5/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.130/MTok

Output

$0.380/MTok

Context Window: 262K


Benchmark Analysis

Across our 12-test suite, Gemma 4 31B wins the majority of tasks:

Structured output: 5 vs 4 (Gemma tied for 1st of 54)
Strategic analysis: 5 vs 2 (Gemma tied for 1st; Devstral ranks 44 of 54)
Tool calling: 5 vs 4 (Gemma tied for 1st of 54)
Faithfulness: 5 vs 4 (Gemma tied for 1st of 55)
Persona consistency: 5 vs 2 (Gemma tied for 1st; Devstral ranks 51 of 53)
Agentic planning: 5 vs 2 (Gemma tied for 1st; Devstral ranks 53 of 54)
Multilingual: 5 vs 4 (Gemma tied for 1st)
Creative problem solving: 4 vs 2 (Gemma rank 9 vs Devstral rank 47)
Constrained rewriting: 4 vs 3 (Gemma rank 6 vs Devstral rank 31)

Devstral doesn't win any benchmark outright in our tests; it ties Gemma on classification (both 4/5, tied for 1st), long context (both 4/5), and safety calibration (both 2/5).

What this means for real tasks: Gemma is substantially stronger for tool-driven workflows, multi-step planning, and situations requiring tight, schema-compliant outputs. Devstral remains competitive for classification and long-context work in text-only scenarios but trails on agentic planning and persona fidelity. Note also Gemma's larger context window (262,144 vs Devstral's 131,072 tokens) and multimodal input (text, image, and video to text) versus Devstral's text-only modality, factors that can matter even when long-context scores tie.

Benchmark                   Devstral Small 1.1   Gemma 4 31B
Faithfulness                4/5                  5/5
Long Context                4/5                  4/5
Multilingual                4/5                  5/5
Tool Calling                4/5                  5/5
Classification              4/5                  4/5
Agentic Planning            2/5                  5/5
Structured Output           4/5                  5/5
Safety Calibration          2/5                  2/5
Strategic Analysis          2/5                  5/5
Persona Consistency         2/5                  5/5
Constrained Rewriting       3/5                  4/5
Creative Problem Solving    2/5                  4/5
Summary                     0 wins               9 wins
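The head-to-head tally above is easy to verify with a short script. This is a sketch using the scores from the table (the dictionary and variable names are illustrative, not the site's actual scoring code); note that the simple per-benchmark mean also reproduces the overall ratings shown on the cards (3.08 and 4.42).

```python
# Per-benchmark scores from the comparison table: (Devstral Small 1.1, Gemma 4 31B).
scores = {
    "Faithfulness": (4, 5),
    "Long Context": (4, 4),
    "Multilingual": (4, 5),
    "Tool Calling": (4, 5),
    "Classification": (4, 4),
    "Agentic Planning": (2, 5),
    "Structured Output": (4, 5),
    "Safety Calibration": (2, 2),
    "Strategic Analysis": (2, 5),
    "Persona Consistency": (2, 5),
    "Constrained Rewriting": (3, 4),
    "Creative Problem Solving": (2, 4),
}

devstral_wins = sum(d > g for d, g in scores.values())
gemma_wins = sum(g > d for d, g in scores.values())
ties = sum(d == g for d, g in scores.values())
print(f"Devstral wins: {devstral_wins}, Gemma wins: {gemma_wins}, ties: {ties}")
# Devstral wins: 0, Gemma wins: 9, ties: 3

# Simple mean over the 12 benchmarks matches the overall card ratings.
devstral_avg = sum(d for d, _ in scores.values()) / len(scores)  # 3.08
gemma_avg = sum(g for _, g in scores.values()) / len(scores)     # 4.42
```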

Pricing Analysis

Prices are quoted per million tokens (MTok). Devstral Small 1.1: input $0.10/MTok, output $0.30/MTok. Gemma 4 31B: input $0.13/MTok, output $0.38/MTok. Assuming a 50/50 split of input and output tokens, 1,000,000 tokens cost: Devstral ≈ $0.20 (500K input = $0.05 + 500K output = $0.15), Gemma ≈ $0.255 (500K input = $0.065 + 500K output = $0.19). At scale that's Devstral ≈ $2.00 vs Gemma ≈ $2.55 for 10M tokens/month, and Devstral ≈ $20.00 vs Gemma ≈ $25.50 for 100M tokens/month. The blended price ratio (≈0.78) makes Devstral roughly 21% cheaper overall. High-volume apps, multi-tenant SaaS, and analytics pipelines should care about the difference; smaller projects, or those that need Gemma's stronger capabilities, may justify the higher spend.
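The blended-cost arithmetic above can be sketched as a small helper. The function name, signature, and 50/50 default split are illustrative assumptions, not an API from the site:

```python
def blended_cost(tokens: int, input_price: float, output_price: float,
                 input_share: float = 0.5) -> float:
    """Cost in USD for `tokens` total tokens, given per-million-token (MTok)
    prices and the fraction of tokens that are input (rest are output)."""
    input_tokens = tokens * input_share
    output_tokens = tokens - input_tokens
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Per-MTok prices from the cards above, 50/50 input/output split.
devstral = blended_cost(1_000_000, 0.10, 0.30)  # ≈ $0.20 per 1M tokens
gemma = blended_cost(1_000_000, 0.13, 0.38)     # ≈ $0.255 per 1M tokens
print(f"ratio: {devstral / gemma:.3f}")          # ≈ 0.784, i.e. ~21% cheaper
```

Because the ratio is scale-invariant, the ~21% saving holds whether you run 1M or 100M tokens a month; only the absolute dollar gap grows.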

Real-World Cost Comparison

Task              Devstral Small 1.1   Gemma 4 31B
Chat response     <$0.001              <$0.001
Blog post         <$0.001              <$0.001
Document batch    $0.017               $0.022
Pipeline run      $0.170               $0.216

Bottom Line

Choose Devstral Small 1.1 if: you need a lower-cost, text-only model for high-volume classification or long-context text tasks and want to save roughly 21% on token spend. Choose Gemma 4 31B if: you need best-in-class tool calling, agentic planning, strategic analysis, persona consistency, multimodal inputs, or strict structured-output reliability; Gemma wins 9 of 12 tests in our suite and is worth the extra cost for those needs.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions