Codestral 2508 vs Gemma 4 31B
Gemma 4 31B is the stronger general-purpose model, winning 8 of 12 benchmarks in our testing against Codestral 2508's single win and three ties. Codestral 2508's one clear edge is long-context retrieval (5/5 vs 4/5), plus it was purpose-built for coding tasks like fill-in-the-middle and test generation — making it worth considering for high-frequency code completion pipelines specifically. The cost calculus strongly favors Gemma 4 31B: output costs $0.38/M tokens vs $0.90/M for Codestral 2508, a 2.4x premium that requires a compelling reason to pay.
Pricing
Codestral 2508 (Mistral): $0.30/MTok input, $0.90/MTok output
Gemma 4 31B: $0.13/MTok input, $0.38/MTok output
Benchmark Analysis
Across our 12-test suite, Gemma 4 31B wins 8 benchmarks, Codestral 2508 wins 1, and they tie on 3.
Where Gemma 4 31B leads:
- Strategic analysis: 5/5 (tied for 1st of 54) vs Codestral 2508's 2/5 (rank 44 of 54). This is the widest gap in the comparison — a full 3-point difference. For tasks involving tradeoff reasoning with real numbers, Gemma 4 31B is in a different league.
- Creative problem solving: 4/5 (rank 9 of 54) vs 2/5 (rank 47 of 54). Codestral 2508 sits near the bottom of our tested models here; Gemma 4 31B performs well above median.
- Agentic planning: 5/5 (tied for 1st of 54) vs 4/5 (rank 16 of 54). Gemma 4 31B's top score here matters for multi-step AI workflows — goal decomposition and failure recovery are critical for autonomous agents.
- Multilingual: 5/5 (tied for 1st of 55) vs 4/5 (rank 36 of 55). Gemma 4 31B handles non-English output at the highest tier; Codestral 2508 is above median but not elite.
- Persona consistency: 5/5 (tied for 1st of 53) vs 3/5 (rank 45 of 53). A meaningful gap for chatbot and assistant applications.
- Constrained rewriting: 4/5 (rank 6 of 53) vs 3/5 (rank 31 of 53). Compression tasks under hard character limits favor Gemma 4 31B.
- Classification: 4/5 (tied for 1st of 53) vs 3/5 (rank 31 of 53). Routing and categorization tasks go to Gemma 4 31B.
- Safety calibration: 2/5 (rank 12 of 55) vs 1/5 (rank 32 of 55). Neither model excels here; both score at or below the median (p50 = 2), and Codestral 2508's 1/5 sits at the floor of our scale, meaning it struggles to balance refusals against legitimate requests.
Where Codestral 2508 leads:
- Long context: 5/5 (tied for 1st of 55) vs 4/5 (rank 38 of 55). Codestral 2508's retrieval accuracy at 30K+ tokens is at the top tier; Gemma 4 31B drops a point here despite having a comparable 262K token context window.
Ties (both score identically):
- Tool calling: Both score 5/5, tied for 1st of 54. For function selection, argument accuracy, and sequencing, these models are equivalent in our testing.
- Structured output: Both score 5/5, tied for 1st of 54. JSON schema compliance is a strength for both.
- Faithfulness: Both score 5/5, tied for 1st of 55. Neither model hallucinates in our source-adherence tests.
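Structured-output compliance of the kind these ties measure can be checked mechanically. A minimal sketch in Python; the schema and sample responses here are hypothetical illustrations, not items from our actual suite:

```python
import json

# Hypothetical extraction schema: required keys and their expected types.
SCHEMA = {"name": str, "priority": int, "tags": list}

def complies(raw: str) -> bool:
    """Return True if `raw` is valid JSON with exactly the schema's keys and types."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict) or set(obj) != set(SCHEMA):
        return False
    return all(isinstance(obj[k], t) for k, t in SCHEMA.items())

print(complies('{"name": "ticket-42", "priority": 2, "tags": ["bug"]}'))  # True
print(complies('{"name": "ticket-42"}'))  # False: missing required keys
```

A check like this is strict on structure but says nothing about semantic quality, which is why schema compliance and faithfulness are scored separately.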
Codestral 2508's 256K context window and specialization in FIM and code correction (per its description) are real differentiators for coding-specific pipelines — but our general benchmark suite shows Gemma 4 31B as the more rounded performer across the full task spectrum. Note that Gemma 4 31B also supports image and video input alongside text (text+image+video->text), while Codestral 2508 is text-only.
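For teams weighing that FIM specialization, the request shape is worth seeing. A sketch of a fill-in-the-middle payload, assuming field names along the lines of Mistral's FIM completion API (the model identifier and exact fields are assumptions; verify against current Mistral documentation before use):

```python
import json

def fim_payload(model: str, before: str, after: str, max_tokens: int = 64) -> dict:
    """Build a fill-in-the-middle request body: the model completes the gap
    between `before` (code preceding the cursor) and `after` (code following it).
    Field names assume Mistral's FIM completion API; confirm against current docs."""
    return {
        "model": model,
        "prompt": before,
        "suffix": after,
        "max_tokens": max_tokens,
    }

payload = fim_payload(
    "codestral-2508",  # hypothetical model identifier
    "def mean(xs):\n    return ",
    " / len(xs)\n",
)
print(json.dumps(payload, indent=2))
```

The key property is that the model conditions on both sides of the gap, which is what makes FIM completion different from plain left-to-right generation.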
Pricing Analysis
Codestral 2508 is priced at $0.30/M input tokens and $0.90/M output tokens. Gemma 4 31B costs $0.13/M input and $0.38/M output. At 1M output tokens/month, Codestral 2508 costs $0.90 vs Gemma 4 31B's $0.38 — a $0.52 difference that barely registers. At 10M output tokens/month, the gap widens to $5.20 ($9.00 vs $3.80). At 100M output tokens/month — a realistic scale for production coding assistants or chat products — you're paying $90 for Codestral 2508 vs $38 for Gemma 4 31B, a $52/month difference per 100M tokens. Developers running high-volume code completion with FIM (fill-in-the-middle) workflows may find Codestral 2508's specialization justifies the premium. For most other use cases, paying 2.4x more for a model that loses on 8 of 12 benchmarks requires a specific, demonstrable performance need.
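The figures above follow directly from the per-million-token prices; a minimal helper (rates hard-coded from this page, output tokens only, ignoring input costs) reproduces them:

```python
# Output price per million tokens, as listed on this page.
PRICE_PER_M = {"codestral-2508": 0.90, "gemma-4-31b": 0.38}

def monthly_cost(model: str, output_tokens: int) -> float:
    """Monthly output-token cost in dollars at the listed per-million rate."""
    return PRICE_PER_M[model] * output_tokens / 1_000_000

for volume in (1_000_000, 10_000_000, 100_000_000):
    c = monthly_cost("codestral-2508", volume)
    g = monthly_cost("gemma-4-31b", volume)
    print(f"{volume:>11,} tokens: ${c:.2f} vs ${g:.2f} (diff ${c - g:.2f})")
```

At any volume the ratio is fixed at roughly 2.4x; only the absolute dollar gap changes with scale.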
Bottom Line
Choose Codestral 2508 if your primary workload is high-frequency coding tasks — specifically fill-in-the-middle completion, code correction, or test generation — where Mistral's coding specialization may deliver advantages not captured by our general benchmark suite. Also prefer it if long-context retrieval accuracy (5/5 in our testing) is your single most critical requirement and you can justify the 2.4x output cost premium.
Choose Gemma 4 31B if you need a capable general-purpose model for anything beyond narrow code completion: strategic analysis (5/5 vs 2/5), agentic planning (5/5 vs 4/5), creative problem solving (4/5 vs 2/5), multilingual output, or persona-consistent chat applications. At $0.38/M output tokens — less than half Codestral 2508's $0.90/M — it's also the obvious pick for cost-sensitive production deployments. The addition of image and video input support makes Gemma 4 31B the only option here for multimodal workflows.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.