Devstral Medium vs Gemini 3.1 Pro Preview

Gemini 3.1 Pro Preview is the clear choice for high-performance, multimodal, and agentic AI workflows: it wins 11 of 12 benchmarks in our testing and scores 95.6% on AIME 2025 (Epoch AI). Devstral Medium is the cost-efficient alternative: about 6× cheaper and the better pick when classification accuracy and tight budgets matter.

Devstral Medium (Mistral)

Overall: 3.17/5 (Usable)

Benchmark Scores: Faithfulness 4/5, Long Context 4/5, Multilingual 4/5, Tool Calling 3/5, Classification 4/5, Agentic Planning 4/5, Structured Output 4/5, Safety Calibration 1/5, Strategic Analysis 2/5, Persona Consistency 3/5, Constrained Rewriting 3/5, Creative Problem Solving 2/5

External Benchmarks: SWE-bench Verified N/A, MATH Level 5 N/A, AIME 2025 N/A

Pricing: Input $0.400/MTok, Output $2.00/MTok

Context Window: 131K tokens


Gemini 3.1 Pro Preview (Google)

Overall: 4.33/5 (Strong)

Benchmark Scores: Faithfulness 5/5, Long Context 5/5, Multilingual 5/5, Tool Calling 4/5, Classification 2/5, Agentic Planning 5/5, Structured Output 5/5, Safety Calibration 2/5, Strategic Analysis 5/5, Persona Consistency 5/5, Constrained Rewriting 4/5, Creative Problem Solving 5/5

External Benchmarks: SWE-bench Verified N/A, MATH Level 5 N/A, AIME 2025 95.6%

Pricing: Input $2.00/MTok, Output $12.00/MTok

Context Window: 1,049K tokens


Benchmark Analysis

Summary of our 12-test head-to-head (scores are from our testing, on a 1–5 scale). Gemini wins 11 of the 12 tests; Devstral wins one:

  • Structured Output 5 vs 4 (Gemini tied for 1st with 24 other models out of 54 tested; Devstral ranks 26 of 54, with 27 models sharing that score). This test measures JSON/schema compliance, and Gemini is stronger at strict format adherence; a minimal illustration of that kind of check appears after the comparison table below.
  • Strategic Analysis 5 vs 2 (Gemini tied for 1st with 25 other models out of 54 tested): Gemini handles nuanced tradeoffs and numeric reasoning in strategy tasks far better in our tests.
  • Constrained Rewriting 4 vs 3 (Gemini ranks 6 of 53, with 25 models sharing that score): Gemini is better at tight compression and rewrite tasks.
  • Creative Problem Solving 5 vs 2 (Gemini tied for 1st with 7 other models out of 54 tested): Gemini produces more original, feasible ideas in our prompts.
  • Tool Calling 4 vs 3 (Gemini ranks 18 of 54, with 29 models sharing that score; Devstral ranks 47 of 54, with 6 models sharing that score): Gemini selects and sequences functions more accurately in our tests.
  • Faithfulness 5 vs 4 (Gemini tied for 1st with 32 other models out of 55 tested): Gemini adheres to source material better in our testing.
  • Long Context 5 vs 4 (Gemini tied for 1st with 36 other models out of 55 tested): Gemini outperforms for retrieval and reasoning at 30K+ token contexts.
  • Safety Calibration 2 vs 1 (Gemini ranks 12 of 55, with 20 models sharing that score): Gemini refuses more harmful requests while permitting legitimate ones more reliably in our tests.
  • Persona Consistency 5 vs 3 (Gemini tied for 1st with 36 other models out of 53 tested): Gemini better maintains character and resists prompt injection in chat-style workloads.
  • Agentic Planning 5 vs 4 (Gemini tied for 1st with 14 other models out of 54 tested): Gemini decomposes goals and recovers from failures more robustly in our agentic planning prompts.
  • Multilingual 5 vs 4 (Gemini tied for 1st with 34 other models out of 55 tested): Gemini produces higher-quality non-English outputs in our tests.
  • Devstral wins Classification 4 vs 2 (Devstral tied for 1st with 29 other models out of 53 tested; Gemini ranks 51 of 53, with 3 models sharing that score): Devstral is stronger at routing and categorization tasks in our benchmark scenarios.

One external data point: Gemini scores 95.6% on AIME 2025 (Epoch AI), which we cite as a third-party indicator of strong competition-level mathematical reasoning; Devstral has no published AIME 2025 score in our data. Overall, Gemini's many tied-for-1st placements indicate consistently best-in-class behavior on format, reasoning, creativity, long context, and agentic tests in our suite, while Devstral delivers a clear cost advantage and better classification in our testing.
Benchmark                | Devstral Medium | Gemini 3.1 Pro Preview
Faithfulness             | 4/5             | 5/5
Long Context             | 4/5             | 5/5
Multilingual             | 4/5             | 5/5
Tool Calling             | 3/5             | 4/5
Classification           | 4/5             | 2/5
Agentic Planning         | 4/5             | 5/5
Structured Output        | 4/5             | 5/5
Safety Calibration       | 1/5             | 2/5
Strategic Analysis       | 2/5             | 5/5
Persona Consistency      | 3/5             | 5/5
Constrained Rewriting    | 3/5             | 4/5
Creative Problem Solving | 2/5             | 5/5
Summary                  | 1 win           | 11 wins
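
To make the structured-output criterion concrete, here is a minimal sketch of the kind of JSON/schema-compliance check described above. It is illustrative only and not our scoring harness; the schema, the is_compliant helper, and the sample outputs are hypothetical.

```python
# Illustrative only: a minimal JSON/schema-compliance check of the kind the
# structured-output test measures. Not modelpicker.net's actual harness.
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# Hypothetical schema a prompt might ask the model to follow.
SCHEMA = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["sentiment", "confidence"],
    "additionalProperties": False,
}

def is_compliant(model_output: str) -> bool:
    """True if the raw model output parses as JSON and matches the schema."""
    try:
        validate(instance=json.loads(model_output), schema=SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

print(is_compliant('{"sentiment": "positive", "confidence": 0.92}'))        # True
print(is_compliant('Sure! Here is the JSON: {"sentiment": "positive"}'))     # False
```

A model that wraps its JSON in chatty preamble or drops required fields fails this kind of check, which is the behavior the structured-output score penalizes.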

Pricing Analysis

Prices are quoted per million tokens (MTok). Summing the input and output rates gives a combined figure of $0.40 + $2.00 = $2.40 per million tokens for Devstral Medium and $2.00 + $12.00 = $14.00 per million tokens for Gemini 3.1 Pro Preview, roughly a 6× gap. At 10M tokens per month of both input and output, that works out to about $24 for Devstral vs $140 for Gemini; at 100M each, about $240 vs $1,400. The ~6× price gap matters for high-volume production use (10M+ tokens/mo), startups, and any team optimizing cost of inference; it is less critical for low-volume research or feature-prototype work, where Gemini's top-tier capabilities may justify the premium.
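
For readers who want to reproduce the arithmetic, a minimal sketch follows. The per-MTok prices match the cards above; the 10M-token monthly volumes for input and output are illustrative.

```python
# Sanity check of the pricing arithmetic above. Prices are USD per million
# tokens (MTok); token volumes below are illustrative, not usage data.
PRICES = {  # model: (input $/MTok, output $/MTok)
    "Devstral Medium": (0.40, 2.00),
    "Gemini 3.1 Pro Preview": (2.00, 12.00),
}

def monthly_cost(model: str, input_tokens: float, output_tokens: float) -> float:
    """Cost in USD for a given monthly input/output token volume."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# 10M input tokens and 10M output tokens per month:
for model in PRICES:
    print(model, round(monthly_cost(model, 10e6, 10e6), 2))
# Devstral Medium 24.0
# Gemini 3.1 Pro Preview 140.0
```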

Real-World Cost Comparison

Task           | Devstral Medium | Gemini 3.1 Pro Preview
Chat response  | $0.0011         | $0.0064
Blog post      | $0.0042         | $0.025
Document batch | $0.108          | $0.640
Pipeline run   | $1.08           | $6.40

Bottom Line

Choose Devstral Medium if: you have strict cost constraints or very high token volumes (about $2.40 per million tokens at combined input + output rates vs $14.00 for Gemini), your primary tasks are classification/routing, or you need solid coding and agentic reasoning at a much lower price point. Choose Gemini 3.1 Pro Preview if: you need top-tier performance across structured output, creative problem solving, long-context retrieval, agentic planning, or multimodal inputs (Gemini supports text+image+file+audio+video→text), or you value the stronger faithfulness and safety calibration shown in our tests (Gemini wins 11 of 12 benchmarks).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
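
As a rough illustration of the 1–5 LLM-judge scoring mentioned above, the sketch below shows what such a loop can look like. It is hypothetical, not our actual harness; the call_judge_model stub and the rubric wording are assumptions made for the example.

```python
# Illustrative LLM-as-judge scoring loop (hypothetical, not our real harness).
import re

RUBRIC = """Score the RESPONSE against the TASK on a 1-5 scale:
5 = fully correct and complete, 4 = minor issues, 3 = usable with gaps,
2 = significant problems, 1 = unusable. Reply with the number only."""

def call_judge_model(prompt: str) -> str:
    # Stand-in for a real chat-completion call; wire up your own client here.
    # Returning a canned reply keeps the sketch runnable without an API key.
    return "4"

def judge(task: str, response: str) -> int:
    """Ask the judge model for a 1-5 score and parse the first digit it returns."""
    reply = call_judge_model(f"{RUBRIC}\n\nTASK:\n{task}\n\nRESPONSE:\n{response}")
    match = re.search(r"[1-5]", reply)
    if not match:
        raise ValueError(f"Judge returned no score: {reply!r}")
    return int(match.group())

print(judge("Summarize the document in two sentences.", "It compares two models..."))  # 4
```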

Frequently Asked Questions