Codestral 2508 vs Gemma 4 31B

Gemma 4 31B is the stronger general-purpose model, winning 8 of 12 benchmarks in our testing against Codestral 2508's single win and three ties. Codestral 2508's one clear edge is long-context retrieval (5/5 vs 4/5), plus it was purpose-built for coding tasks like fill-in-the-middle and test generation — making it worth considering for high-frequency code completion pipelines specifically. The cost calculus strongly favors Gemma 4 31B: output costs $0.38/M tokens vs $0.90/M for Codestral 2508, a 2.4x premium that requires a compelling reason to pay.

Mistral

Codestral 2508

Overall: 3.50/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 4/5
Tool Calling: 5/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.300/MTok
Output: $0.900/MTok

Context Window: 256K

modelpicker.net

Google

Gemma 4 31B

Overall: 4.42/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 4/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 5/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.130/MTok
Output: $0.380/MTok

Context Window: 262K


Benchmark Analysis

Across our 12-test suite, Gemma 4 31B wins 8 benchmarks, Codestral 2508 wins 1, and they tie on 3.

Where Gemma 4 31B leads:

  • Strategic analysis: 5/5 (tied for 1st of 54) vs Codestral 2508's 2/5 (rank 44 of 54). This is the widest gap in the comparison — a full 3-point difference. For tasks involving tradeoff reasoning with real numbers, Gemma 4 31B is in a different league.
  • Creative problem solving: 4/5 (rank 9 of 54) vs 2/5 (rank 47 of 54). Codestral 2508 sits near the bottom of our tested models here; Gemma 4 31B performs well above median.
  • Agentic planning: 5/5 (tied for 1st of 54) vs 4/5 (rank 16 of 54). Gemma 4 31B's top score here matters for multi-step AI workflows — goal decomposition and failure recovery are critical for autonomous agents.
  • Multilingual: 5/5 (tied for 1st of 55) vs 4/5 (rank 36 of 55). Gemma 4 31B handles non-English output at the highest tier; Codestral 2508 is above median but not elite.
  • Persona consistency: 5/5 (tied for 1st of 53) vs 3/5 (rank 45 of 53). A meaningful gap for chatbot and assistant applications.
  • Constrained rewriting: 4/5 (rank 6 of 53) vs 3/5 (rank 31 of 53). Compression tasks under hard character limits favor Gemma 4 31B.
  • Classification: 4/5 (tied for 1st of 53) vs 3/5 (rank 31 of 53). Routing and categorization tasks go to Gemma 4 31B.
  • Safety calibration: 2/5 (rank 12 of 55) vs 1/5 (rank 32 of 55). Neither model excels here — both score below the median (p50 = 2) — but Codestral 2508's score of 1/5 is at the floor of our scale, meaning it struggles to balance refusals with legitimate requests.

Where Codestral 2508 leads:

  • Long context: 5/5 (tied for 1st of 55) vs 4/5 (rank 38 of 55). Codestral 2508's retrieval accuracy at 30K+ tokens is at the top tier; Gemma 4 31B drops a point here despite having a comparable 262K token context window.

Ties (both score identically):

  • Tool calling: Both score 5/5, tied for 1st of 54. For function selection, argument accuracy, and sequencing, these models are equivalent in our testing.
  • Structured output: Both score 5/5, tied for 1st of 54. JSON schema compliance is a strength for both.
  • Faithfulness: Both score 5/5, tied for 1st of 55. Neither model hallucinates in our source-adherence tests.

Codestral 2508's 256K context window and specialization in FIM and code correction (per its description) are real differentiators for coding-specific pipelines — but our general benchmark suite shows Gemma 4 31B as the more rounded performer across the full task spectrum. Note that Gemma 4 31B also supports image and video input alongside text (text+image+video->text), while Codestral 2508 is text-only.

Benchmark                   Codestral 2508   Gemma 4 31B
Faithfulness                5/5              5/5
Long Context                5/5              4/5
Multilingual                4/5              5/5
Tool Calling                5/5              5/5
Classification              3/5              4/5
Agentic Planning            4/5              5/5
Structured Output           5/5              5/5
Safety Calibration          1/5              2/5
Strategic Analysis          2/5              5/5
Persona Consistency         3/5              5/5
Constrained Rewriting       3/5              4/5
Creative Problem Solving    2/5              4/5
Summary                     1 win            8 wins
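The win/tie tallies above follow directly from the per-benchmark scores. A minimal sketch (scores transcribed from the table; the dictionaries and variable names are ours, not part of modelpicker's tooling):

```python
# Per-benchmark scores from the comparison table above (each out of 5).
codestral = {"Faithfulness": 5, "Long Context": 5, "Multilingual": 4,
             "Tool Calling": 5, "Classification": 3, "Agentic Planning": 4,
             "Structured Output": 5, "Safety Calibration": 1,
             "Strategic Analysis": 2, "Persona Consistency": 3,
             "Constrained Rewriting": 3, "Creative Problem Solving": 2}
gemma = {"Faithfulness": 5, "Long Context": 4, "Multilingual": 5,
         "Tool Calling": 5, "Classification": 4, "Agentic Planning": 5,
         "Structured Output": 5, "Safety Calibration": 2,
         "Strategic Analysis": 5, "Persona Consistency": 5,
         "Constrained Rewriting": 4, "Creative Problem Solving": 4}

# Count head-to-head outcomes across the 12 benchmarks.
gemma_wins = sum(gemma[b] > codestral[b] for b in gemma)
codestral_wins = sum(codestral[b] > gemma[b] for b in gemma)
ties = sum(gemma[b] == codestral[b] for b in gemma)

print(gemma_wins, codestral_wins, ties)  # 8 1 3
```

The same dictionaries reproduce the overall averages: 42/12 = 3.50 for Codestral 2508 and 53/12 ≈ 4.42 for Gemma 4 31B.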

Pricing Analysis

Codestral 2508 is priced at $0.30/M input tokens and $0.90/M output tokens. Gemma 4 31B costs $0.13/M input and $0.38/M output. At 1M output tokens/month, Codestral 2508 costs $0.90 vs Gemma 4 31B's $0.38 — a $0.52 difference that barely registers. At 10M output tokens/month, the gap widens to $5.20 ($9.00 vs $3.80). At 100M output tokens/month — a realistic scale for production coding assistants or chat products — you're paying $90 for Codestral 2508 vs $38 for Gemma 4 31B, a $52/month difference per 100M tokens. Developers running high-volume code completion with FIM (fill-in-the-middle) workflows may find Codestral 2508's specialization justifies the premium. For most other use cases, paying 2.4x more for a model that loses on 8 of 12 benchmarks requires a specific, demonstrable performance need.
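The volume arithmetic above can be sketched as a quick cost estimator. Prices come from the pricing cards on this page; the function name and the input/output split are our own illustrative choices, not an official calculator:

```python
# Published prices in USD per million tokens (from the pricing cards above).
PRICES = {
    "Codestral 2508": {"input": 0.30, "output": 0.90},
    "Gemma 4 31B":    {"input": 0.13, "output": 0.38},
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Estimated monthly spend, with volumes given in millions of tokens."""
    p = PRICES[model]
    return input_mtok * p["input"] + output_mtok * p["output"]

# Output-token comparison at the monthly volumes discussed above.
for mtok in (1, 10, 100):
    c = monthly_cost("Codestral 2508", 0, mtok)
    g = monthly_cost("Gemma 4 31B", 0, mtok)
    print(f"{mtok:>3}M output tokens/month: ${c:.2f} vs ${g:.2f} "
          f"(gap ${c - g:.2f})")
```

At 100M output tokens/month this prints the $90 vs $38 figure from the paragraph above; in practice input tokens usually dominate completion workloads, so the real gap depends on your input/output ratio.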

Real-World Cost Comparison

Task             Codestral 2508   Gemma 4 31B
Chat response    <$0.001          <$0.001
Blog post        $0.0020          <$0.001
Document batch   $0.051           $0.022
Pipeline run     $0.510           $0.216

Bottom Line

Choose Codestral 2508 if your primary workload is high-frequency coding tasks — specifically fill-in-the-middle completion, code correction, or test generation — where Mistral's coding specialization may deliver advantages not captured by our general benchmark suite. Also prefer it if long-context retrieval accuracy (5/5 in our testing) is your single most critical requirement and you can justify the 2.4x output cost premium.

Choose Gemma 4 31B if you need a capable general-purpose model for anything beyond narrow code completion: strategic analysis (5/5 vs 2/5), agentic planning (5/5 vs 4/5), creative problem solving (4/5 vs 2/5), multilingual output, or persona-consistent chat applications. At $0.38/M output tokens — less than half Codestral 2508's $0.90/M — it's also the obvious pick for cost-sensitive production deployments. The addition of image and video input support makes Gemma 4 31B the only option here for multimodal workflows.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions