Gemma 4 31B vs Mistral Medium 3.1

For most product and developer use cases, Gemma 4 31B is the better pick: it wins more of our benchmarks (4 vs 2) and is far cheaper ($0.38 vs $2.00 per million output tokens). Mistral Medium 3.1 wins on long context and constrained rewriting (useful for very long documents and tight character-limited compression) but comes at a substantially higher per-token price.

Gemma 4 31B (Google)

Overall: 4.42/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 4/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 5/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.130/MTok
Output: $0.380/MTok
Context Window: 262K tokens


Mistral Medium 3.1 (Mistral)

Overall: 4.25/5 (Strong)

Benchmark Scores

Faithfulness: 4/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 5/5
Creative Problem Solving: 3/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.400/MTok
Output: $2.00/MTok
Context Window: 131K tokens


Benchmark Analysis

Below are our 12 per-test comparisons (all scores are from our own testing). Where applicable we cite each model's rank within our pool of 52–55 evaluated models and note what the result means in practice.

  • structured output: Gemma 4 31B 5 vs Mistral 4. Gemma is tied for 1st (with 24 other models) while Mistral sits at rank 26 of 54. In our testing Gemma is the more reliable choice for JSON/schema compliance and format adherence; a minimal schema-check sketch follows this list.
  • creative problem solving: Gemma 4 vs Mistral 3. Gemma ranks 9 of 54 (21-model tie) vs Mistral rank 30; expect Gemma to generate more specific, feasible ideas in our suite of creative tasks.
  • tool calling: Gemma 5 vs Mistral 4. Gemma is tied for 1st (with 16 others) vs Mistral at rank 18 — Gemma selects and sequences functions more accurately in our tool-calling tests.
  • faithfulness: Gemma 5 vs Mistral 4. Gemma is tied for 1st (with 32 others) while Mistral ranks 34 of 55 — Gemma better sticks to source material in our testing.
  • long context: Gemma 4 vs Mistral 5. Mistral ties for 1st (with 36 models) and Gemma ranks 38 of 55; Mistral is the clear winner for retrieval and accuracy at 30K+ token contexts in our tests.
  • constrained rewriting: Gemma 4 vs Mistral 5. Mistral ties for 1st (with 4 others) — Mistral performs better compressing content into strict character limits in our suite.
  • strategic analysis: tie 5/5. Both models are tied for 1st (with 25 other models) on nuanced tradeoff reasoning in our tests.
  • classification: tie at 4/5. Both are tied for 1st (with 29 others); both perform well for routing and categorization in our testing.
  • persona consistency: tie 5/5. Both tied for 1st (with 36 others) — both maintain character and resist injection in our suite.
  • agentic planning: tie 5/5. Both tied for 1st — both decompose goals and recover from failures effectively in our tests.
  • multilingual: tie 5/5. Both tied for 1st — equivalent quality on non-English outputs in our testing.
  • safety calibration: tie at 2/5. Both rank 12 of 55 (many models share this score) and show comparable refusal/permit behavior in our tests.
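
To make the structured-output comparison concrete, here is a minimal sketch of the kind of check such a test implies: request JSON mode from an OpenAI-compatible chat endpoint and validate the reply against a schema. The gateway URL, environment variables, model id, and schema here are illustrative assumptions, not our actual harness.

```python
# Minimal sketch: ask a model for JSON and validate it against a schema.
# The endpoint, env vars, and schema below are assumptions for illustration;
# substitute whatever OpenAI-compatible gateway and model id you use.
import json
import os

import requests
from jsonschema import validate, ValidationError

SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "priority": {"type": "integer", "minimum": 1, "maximum": 5},
    },
    "required": ["title", "priority"],
}

def request_json(model: str, prompt: str) -> dict:
    """Call an OpenAI-compatible /v1/chat/completions endpoint in JSON mode."""
    resp = requests.post(
        os.environ["LLM_BASE_URL"] + "/v1/chat/completions",  # assumed gateway
        headers={"Authorization": f"Bearer {os.environ['LLM_API_KEY']}"},
        json={
            "model": model,  # e.g. your Gemma or Mistral deployment id
            "messages": [{"role": "user", "content": prompt}],
            "response_format": {"type": "json_object"},
        },
        timeout=60,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["choices"][0]["message"]["content"])

def schema_compliant(model: str, prompt: str) -> bool:
    """Return True if the model's JSON reply satisfies SCHEMA."""
    try:
        validate(instance=request_json(model, prompt), schema=SCHEMA)
        return True
    except (ValidationError, json.JSONDecodeError, KeyError):
        return False
```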

Takeaway: In our testing Gemma 4 31B wins the practical engineering categories of structured output, tool calling, faithfulness, and creative problem solving (important for production APIs and schema-driven apps). Mistral Medium 3.1 wins two narrower but important areas: long context and constrained rewriting (very long-document retrieval and tight compression). The remaining capabilities (strategic analysis, classification, persona consistency, multilingual, agentic planning, safety calibration) are ties.

Benchmark                   Gemma 4 31B   Mistral Medium 3.1
Faithfulness                5/5           4/5
Long Context                4/5           5/5
Multilingual                5/5           5/5
Tool Calling                5/5           4/5
Classification              4/5           4/5
Agentic Planning            5/5           5/5
Structured Output           5/5           4/5
Safety Calibration          2/5           2/5
Strategic Analysis          5/5           5/5
Persona Consistency         5/5           5/5
Constrained Rewriting       4/5           5/5
Creative Problem Solving    4/5           3/5
Summary                     4 wins        2 wins

Pricing Analysis

List prices are per MTok (million tokens): Gemma 4 31B input $0.13, output $0.38; Mistral Medium 3.1 input $0.40, output $2.00. For 1M input + 1M output tokens that works out to $0.51 for Gemma ($0.13 + $0.38) versus $2.40 for Mistral ($0.40 + $2.00), roughly a 4.7x gap. At 10M tokens each of input and output per month the totals scale to $5.10 (Gemma) vs $24.00 (Mistral); at 100M each, $51 vs $240. The gap matters for any high-volume product (chat fleets, multi-user apps, heavy inference pipelines). Teams with tight budgets or large-scale deployments should prefer Gemma for cost efficiency; teams that specifically need Mistral's wins (long-context retrieval or extreme compression in constrained rewrites) may accept the higher spend.
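
A few lines of Python reproduce that arithmetic; the even input/output split and the volumes are assumptions you can swap for your own traffic profile.

```python
# Monthly cost sketch from per-million-token list prices.
# The traffic volumes below are assumptions; substitute your own.
PRICES = {  # USD per million tokens: (input, output)
    "Gemma 4 31B": (0.13, 0.38),
    "Mistral Medium 3.1": (0.40, 2.00),
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Cost in USD for the given millions of input/output tokens per month."""
    price_in, price_out = PRICES[model]
    return input_mtok * price_in + output_mtok * price_out

for model in PRICES:
    # Example: 100M input + 100M output tokens per month (assumed split).
    print(f"{model}: ${monthly_cost(model, 100, 100):,.2f}/month")
# Gemma 4 31B: $51.00/month
# Mistral Medium 3.1: $240.00/month
```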

Real-World Cost Comparison

Task              Gemma 4 31B   Mistral Medium 3.1
Chat response     <$0.001       $0.0011
Blog post         <$0.001       $0.0042
Document batch    $0.022        $0.108
Pipeline run      $0.216        $1.08
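
The per-task figures above are just token counts multiplied by the per-million rates. As an illustration, the sketch below prices a short chat exchange; the 500 input / 450 output token sizes are assumed for illustration, not our benchmark's exact workload definitions.

```python
# Per-task cost = tokens / 1e6 * price-per-MTok, summed over input and output.
# Token counts below are assumed example sizes for a short chat exchange.
PRICES = {"Gemma 4 31B": (0.13, 0.38), "Mistral Medium 3.1": (0.40, 2.00)}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price_in, price_out = PRICES[model]
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

for model in PRICES:
    print(f"{model}: ${task_cost(model, 500, 450):.4f} per chat response")
# Gemma 4 31B: $0.0002 per chat response
# Mistral Medium 3.1: $0.0011 per chat response
```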

Bottom Line

Choose Gemma 4 31B if: you need cheaper inference at scale ($0.38 vs $2.00 per million output tokens), reliable JSON/schema output, stronger tool calling, and higher faithfulness; for example, product chatbots with function calls, schema-driven APIs, and cost-sensitive fleets. Choose Mistral Medium 3.1 if: your primary need is the highest accuracy on very long contexts and constrained rewriting (long-document retrieval, summarizing or compressing large texts into strict character limits) and you can absorb the higher per-token cost.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions