Gemma 4 31B vs Mistral Small 3.1 24B
For most production apps (structured output, tool-based agents, multilingual chat), Gemma 4 31B is the better pick: it wins 11 of 12 benchmarks in our suite and supports tool calling. Mistral Small 3.1 24B beats Gemma only on long-context retrieval, and it is significantly more expensive ($0.35 in / $0.56 out per MTok vs Gemma's $0.13 / $0.38), so Gemma is the stronger price-performance choice unless you specifically need the long-context advantage.
Gemma 4 31B
Pricing: $0.130/MTok input, $0.380/MTok output
Mistral Small 3.1 24B
Pricing: $0.350/MTok input, $0.560/MTok output
Benchmark Analysis
Summary: Gemma 4 31B wins 11 benchmarks in our 12-test suite; Mistral Small 3.1 24B wins only long context. Detailed walk-through by test (Gemma score vs Mistral score, with ranking context and practical meaning):
- tool calling: Gemma 5 vs Mistral 1. Gemma is tied for 1st (with 16 others of 54) for correct function selection, argument accuracy and sequencing. Mistral ranks 53 of 54 and is flagged with the no_tool calling=true quirk, so it is effectively unsuitable for tool-based agent workflows in our tests.
- strategic analysis: Gemma 5 vs Mistral 3. Gemma is tied for 1st (with 25 others of 54) on nuanced tradeoff reasoning; expect better numeric tradeoffs and multi-step decision advice from Gemma.
- structured output: Gemma 5 vs Mistral 4. Gemma is tied for 1st (with 24 others) on schema/JSON compliance; Mistral ranks 26 of 54. Use Gemma when strict format adherence is required.
- faithfulness: Gemma 5 vs Mistral 4. Gemma is tied for 1st (with 32 others of 55) on sticking to source material; this reduces hallucination risk relative to Mistral in our tests.
- classification: Gemma 4 vs Mistral 3. Gemma is tied for 1st (with 29 others of 53), so routing and categorization tasks were more accurate in our testing.
- persona consistency: Gemma 5 vs Mistral 2. Gemma is tied for 1st (with 36 others of 53); Mistral ranks 51 of 53, meaning Gemma better maintains character and resists prompt injection in role-based chat.
- multilingual: Gemma 5 vs Mistral 4. Gemma is tied for 1st (with 34 others of 55); expect higher-quality non-English outputs from Gemma in our suite.
- agentic planning: Gemma 5 vs Mistral 3. Gemma is tied for 1st (with 14 others of 54) for decomposition, fallback and recovery, which matters for multi-step agents.
- constrained rewriting: Gemma 4 vs Mistral 3. Gemma ranks 6 of 53 (25 models share this score) on tight-character rewrites, so it is better at strict-length edits.
- creative problem solving: Gemma 4 vs Mistral 2. Gemma ranks 9 of 54 (21 models share) on producing non-obvious, feasible ideas; Mistral scored lower here.
- safety calibration: Gemma 2 vs Mistral 1. Gemma ranks 12 of 55 (20 share) and Mistral ranks 32 of 55; Gemma is better at refusing harmful requests while permitting legitimate ones in our tests.
- long context: Gemma 4 vs Mistral 5. This is Mistral's only win; Mistral is tied for 1st (with 36 others of 55) for retrieval accuracy at 30K+ tokens. If your workload is heavy on long-context retrieval, Mistral has the edge.
Practical takeaway: Gemma dominates in agentic, structured, multilingual and safety-sensitive tasks and is also cheaper. Mistral’s single advantage is long-context retrieval accuracy.
Pricing Analysis
Costs per MTok: Gemma 4 31B input $0.13 / output $0.38; Mistral Small 3.1 24B input $0.35 / output $0.56. Gemma's output tokens cost about 0.68x Mistral's ($0.38 / $0.56 ≈ 0.679), and at a balanced 50/50 input/output split the blended cost is about 0.56x. If your workload is dominated by very long-context reads (30K+ tokens) and you can accept the lack of tool calling, Mistral's premium may be defensible; otherwise Gemma gives better value.
Real-World Cost Comparison
Assuming a 50/50 split of input and output tokens, monthly cost examples: • 1B tokens (1,000 MTok): Gemma ≈ $255; Mistral ≈ $455. • 10B tokens (10,000 MTok): Gemma ≈ $2,550; Mistral ≈ $4,550. • 100B tokens (100,000 MTok): Gemma ≈ $25,500; Mistral ≈ $45,500. High-volume inference, large-scale chatbots, and API-driven products that generate lots of output tokens should care about this gap: at 100B tokens per month the difference is roughly $20k/month.
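The per-MTok arithmetic behind these figures can be sketched in a few lines. Prices come from this comparison; the 50/50 input/output split and the volume tiers are the same assumptions used in the examples, and the function and variable names are ours, not any vendor API:

```python
# Estimate monthly inference cost from per-MTok prices, assuming a
# configurable input/output token split (default 50/50).
PRICES = {
    "Gemma 4 31B": {"input": 0.13, "output": 0.38},          # $/MTok
    "Mistral Small 3.1 24B": {"input": 0.35, "output": 0.56},
}

def monthly_cost(model: str, total_mtok: float, input_share: float = 0.5) -> float:
    """Cost in USD for `total_mtok` million tokens at the given input share."""
    p = PRICES[model]
    return total_mtok * (input_share * p["input"] + (1 - input_share) * p["output"])

# Reproduce the volume tiers above (MTok per month).
for volume in (1_000, 10_000, 100_000):
    g = monthly_cost("Gemma 4 31B", volume)
    m = monthly_cost("Mistral Small 3.1 24B", volume)
    print(f"{volume:>7,} MTok: Gemma ${g:,.0f} vs Mistral ${m:,.0f}")
```

Adjusting `input_share` shows how the gap widens for output-heavy workloads, since the output-price gap ($0.38 vs $0.56) is smaller in ratio terms than the input-price gap ($0.13 vs $0.35).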
Bottom Line
Choose Gemma 4 31B if:
- You need tool calling / agent workflows (Gemma tool calling 5 vs Mistral 1; Mistral has the no_tool calling quirk).
- You require strict structured output, high faithfulness, persona consistency, multilingual support, or strategic analysis (Gemma wins these tests and often ranks tied for 1st).
- You run high-volume inference and want lower per-token costs (Gemma $0.13 in / $0.38 out vs Mistral $0.35 / $0.56).
Choose Mistral Small 3.1 24B if:
- Your primary need is best-in-class long-context retrieval at 30K+ tokens (Mistral long context 5 vs Gemma 4; Mistral tied for 1st).
- You can tolerate no tool calling and higher costs for that specific long-context advantage.
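The bottom-line guidance reduces to a two-question decision rule, sketched below. The function name and flags are illustrative, not an API; the logic simply encodes this comparison's findings (Mistral wins only when the workload is long-context heavy and tool calling is not required):

```python
# Illustrative decision rule from this comparison's results.
def pick_model(needs_tool_calling: bool, long_context_heavy: bool) -> str:
    # Mistral's single benchmark win is long-context retrieval, and it
    # cannot serve tool-calling workloads, so it is chosen only when the
    # workload is long-context heavy AND tool calling is not needed.
    if long_context_heavy and not needs_tool_calling:
        return "Mistral Small 3.1 24B"
    return "Gemma 4 31B"

print(pick_model(needs_tool_calling=True, long_context_heavy=True))   # prints "Gemma 4 31B"
print(pick_model(needs_tool_calling=False, long_context_heavy=True))  # prints "Mistral Small 3.1 24B"
```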
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.