Gemma 4 31B vs Mistral Medium 3.1

For most product and developer use cases, Gemma 4 31B is the better pick: it wins more of our benchmarks (4 vs 2) and is far cheaper ($0.38 vs $2.00 per million output tokens). Mistral Medium 3.1 wins on long context and constrained rewriting (useful for very long documents and tight character-limited compression) but comes at a substantially higher per-token price.

Gemma 4 31B (Google)

Overall: 4.42/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 4/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 5/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.130/MTok
Output: $0.380/MTok
Context Window: 262K tokens


Mistral Medium 3.1 (Mistral)

Overall: 4.25/5 (Strong)

Benchmark Scores

Faithfulness: 4/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 5/5
Creative Problem Solving: 3/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.400/MTok
Output: $2.00/MTok
Context Window: 131K tokens


Benchmark Analysis

Below are our 12 per-test comparisons (all scores are from our own testing). Where applicable we cite each model's rank within our pool of 52–55 evaluated models and note what the result means in practice.

  • structured output: Gemma 4 31B 5 vs Mistral 4. Gemma is tied for 1st (with 24 other models) while Mistral sits at rank 26 of 54. In our testing Gemma is the more reliable choice for JSON/schema compliance and format adherence; a minimal schema-check sketch follows this list.
  • creative problem solving: Gemma 4 vs Mistral 3. Gemma ranks 9 of 54 (21-model tie) vs Mistral rank 30; expect Gemma to generate more specific, feasible ideas in our suite of creative tasks.
  • tool calling: Gemma 5 vs Mistral 4. Gemma is tied for 1st (with 16 others) vs Mistral at rank 18 — Gemma selects and sequences functions more accurately in our tool-calling tests.
  • faithfulness: Gemma 5 vs Mistral 4. Gemma is tied for 1st (with 32 others) while Mistral ranks 34 of 55 — Gemma better sticks to source material in our testing.
  • long context: Gemma 4 vs Mistral 5. Mistral ties for 1st (with 36 models) and Gemma ranks 38 of 55; Mistral is the clear winner for retrieval and accuracy at 30K+ token contexts in our tests.
  • constrained rewriting: Gemma 4 vs Mistral 5. Mistral ties for 1st (with 4 others) — Mistral performs better compressing content into strict character limits in our suite.
  • strategic analysis: tie 5/5. Both models are tied for 1st (with 25 other models) on nuanced tradeoff reasoning in our tests.
  • classification: tie at 4/5. Both are tied for 1st (with 29 others); both perform well for routing and categorization in our testing.
  • persona consistency: tie 5/5. Both tied for 1st (with 36 others) — both maintain character and resist injection in our suite.
  • agentic planning: tie 5/5. Both tied for 1st — both decompose goals and recover from failures effectively in our tests.
  • multilingual: tie 5/5. Both tied for 1st — equivalent quality on non-English outputs in our testing.
  • safety calibration: tie at 2/5. Both rank 12 of 55 (many models share this score) and show comparable refusal/permit behavior in our tests.
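
To make the structured-output comparison concrete, here is a minimal sketch of the kind of check such a test implies: request JSON mode from an OpenAI-compatible chat endpoint and validate the reply against a schema. The gateway URL, environment variables, model id, and schema here are illustrative assumptions, not our actual harness.

```python
# Minimal sketch: ask a model for JSON and validate it against a schema.
# The endpoint, env vars, and schema below are assumptions for illustration;
# substitute whatever OpenAI-compatible gateway and model id you use.
import json
import os

import requests
from jsonschema import validate, ValidationError

SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "priority": {"type": "integer", "minimum": 1, "maximum": 5},
    },
    "required": ["title", "priority"],
}

def request_json(model: str, prompt: str) -> dict:
    """Call an OpenAI-compatible /v1/chat/completions endpoint in JSON mode."""
    resp = requests.post(
        os.environ["LLM_BASE_URL"] + "/v1/chat/completions",  # assumed gateway
        headers={"Authorization": f"Bearer {os.environ['LLM_API_KEY']}"},
        json={
            "model": model,  # e.g. your Gemma or Mistral deployment id
            "messages": [{"role": "user", "content": prompt}],
            "response_format": {"type": "json_object"},
        },
        timeout=60,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["choices"][0]["message"]["content"])

def schema_compliant(model: str, prompt: str) -> bool:
    """Return True if the model's JSON reply satisfies SCHEMA."""
    try:
        validate(instance=request_json(model, prompt), schema=SCHEMA)
        return True
    except (ValidationError, json.JSONDecodeError, KeyError):
        return False
```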

Takeaway: In our testing Gemma 4 31B wins the practical engineering categories of structured output, tool calling, faithfulness, and creative problem solving (important for production APIs and schema-driven apps). Mistral Medium 3.1 wins two narrower but important areas: long context and constrained rewriting (very long-document retrieval and tight compression). The remaining capabilities (strategic analysis, classification, persona consistency, multilingual, agentic planning, safety calibration) are ties.

Benchmark                   Gemma 4 31B   Mistral Medium 3.1
Faithfulness                5/5           4/5
Long Context                4/5           5/5
Multilingual                5/5           5/5
Tool Calling                5/5           4/5
Classification              4/5           4/5
Agentic Planning            5/5           5/5
Structured Output           5/5           4/5
Safety Calibration          2/5           2/5
Strategic Analysis          5/5           5/5
Persona Consistency         5/5           5/5
Constrained Rewriting       4/5           5/5
Creative Problem Solving    4/5           3/5
Summary                     4 wins        2 wins

Pricing Analysis

List prices are per MTok (million tokens): Gemma 4 31B input $0.13, output $0.38; Mistral Medium 3.1 input $0.40, output $2.00. For 1M input + 1M output tokens that works out to $0.51 for Gemma ($0.13 + $0.38) versus $2.40 for Mistral ($0.40 + $2.00), roughly a 4.7x gap. At 10M tokens each of input and output per month the totals scale to $5.10 (Gemma) vs $24.00 (Mistral); at 100M each, $51 vs $240. The gap matters for any high-volume product (chat fleets, multi-user apps, heavy inference pipelines). Teams with tight budgets or large-scale deployments should prefer Gemma for cost efficiency; teams that specifically need Mistral's wins (long-context retrieval or extreme compression in constrained rewrites) may accept the higher spend.
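
A few lines of Python reproduce that arithmetic; the even input/output split and the volumes are assumptions you can swap for your own traffic profile.

```python
# Monthly cost sketch from per-million-token list prices.
# The traffic volumes below are assumptions; substitute your own.
PRICES = {  # USD per million tokens: (input, output)
    "Gemma 4 31B": (0.13, 0.38),
    "Mistral Medium 3.1": (0.40, 2.00),
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Cost in USD for the given millions of input/output tokens per month."""
    price_in, price_out = PRICES[model]
    return input_mtok * price_in + output_mtok * price_out

for model in PRICES:
    # Example: 100M input + 100M output tokens per month (assumed split).
    print(f"{model}: ${monthly_cost(model, 100, 100):,.2f}/month")
# Gemma 4 31B: $51.00/month
# Mistral Medium 3.1: $240.00/month
```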

Real-World Cost Comparison

Task              Gemma 4 31B   Mistral Medium 3.1
Chat response     <$0.001       $0.0011
Blog post         <$0.001       $0.0042
Document batch    $0.022        $0.108
Pipeline run      $0.216        $1.08
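
The per-task figures above are just token counts multiplied by the per-million rates. As an illustration, the sketch below prices a short chat exchange; the 500 input / 450 output token sizes are assumed for illustration, not our benchmark's exact workload definitions.

```python
# Per-task cost = tokens / 1e6 * price-per-MTok, summed over input and output.
# Token counts below are assumed example sizes for a short chat exchange.
PRICES = {"Gemma 4 31B": (0.13, 0.38), "Mistral Medium 3.1": (0.40, 2.00)}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price_in, price_out = PRICES[model]
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

for model in PRICES:
    print(f"{model}: ${task_cost(model, 500, 450):.4f} per chat response")
# Gemma 4 31B: $0.0002 per chat response
# Mistral Medium 3.1: $0.0011 per chat response
```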

Bottom Line

Choose Gemma 4 31B if: you need cheaper inference at scale ($0.38 vs $2.00 per million output tokens), reliable JSON/schema output, stronger tool calling, and higher faithfulness; for example, product chatbots with function calls, schema-driven APIs, and cost-sensitive fleets. Choose Mistral Medium 3.1 if: your primary need is the highest accuracy on very long contexts and constrained rewriting (long-document retrieval, summarizing or compressing large texts into strict character limits) and you can absorb the higher per-token cost.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions