Devstral Medium vs Gemini 2.5 Pro

Gemini 2.5 Pro is the better pick for accuracy-heavy and long-context workloads: it wins 8 of 12 benchmarks in our tests, notably tool-calling, faithfulness, and long-context. Devstral Medium is the cost-efficient alternative—it ties on classification and delivers lower per-token pricing if you need high throughput and can accept weaker tool-calling and long-context performance.

mistral

Devstral Medium

Overall
3.17/5 Usable

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
3/5
Classification
4/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.400/MTok

Output

$2.00/MTok

Context Window: 131K

modelpicker.net

google

Gemini 2.5 Pro

Overall
4.25/5 Strong

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
4/5
Structured Output
5/5
Safety Calibration
1/5
Strategic Analysis
4/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
5/5

External Benchmarks

SWE-bench Verified
57.6%
MATH Level 5
N/A
AIME 2025
84.2%

Pricing

Input

$1.25/MTok

Output

$10.00/MTok

Context Window: 1,049K


Benchmark Analysis

Across our 12-test suite, Gemini 2.5 Pro wins 8 benchmarks, Devstral Medium wins none outright, and 4 tests tie. Test-by-test (Devstral score vs Gemini score):

  • Structured output: 4 vs 5 — Gemini wins; Gemini is tied for 1st on structured_output per our rankings ("tied for 1st with 24 other models"). This matters when you need strict JSON/schema compliance.
  • Strategic analysis: 2 vs 4 — Gemini wins; Gemini ranks 27 of 54 on strategic analysis, so it handles nuanced numeric tradeoffs substantially better in our tests.
  • Creative problem solving: 2 vs 5 — Gemini wins; Gemini is tied for 1st here, so it produces more non-obvious, feasible ideas in our evaluation.
  • Tool calling: 3 vs 5 — Gemini wins; Gemini is tied for 1st on tool_calling ("tied for 1st with 16 other models"), which correlates to more accurate function selection and argument sequencing.
  • Faithfulness: 4 vs 5 — Gemini wins; Gemini is tied for 1st on faithfulness, reducing hallucination risk in source-driven tasks.
  • Long context: 4 vs 5 — Gemini wins; Gemini is tied for 1st on long_context and also has a much larger context window (1,048,576 vs 131,072), so it performs better on retrieval and continuity across 30K+ tokens.
  • Persona consistency: 3 vs 5 — Gemini wins; Gemini is tied for 1st on persona_consistency, so it resists injection and maintains character in our tests.
  • Multilingual: 4 vs 5 — Gemini wins; Gemini is tied for 1st on multilingual in our ranking set.

Ties (no winner): constrained_rewriting 3/3, classification 4/4, safety_calibration 1/1, agentic_planning 4/4 — these represent parity in our suite. Notably, Devstral ties for 1st on classification per its ranking ("tied for 1st with 29 other models"), so for routing and categorization tasks it compares well.

External benchmarks: Gemini scores 57.6% on SWE-bench Verified (Epoch AI) and 84.2% on AIME 2025 (Epoch AI); Devstral has no SWE-bench or AIME external scores in the payload. We cite those Epoch AI results as supplementary evidence of Gemini's coding and math capabilities.
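The structured-output point above is easy to make concrete: when a model must emit strict JSON, validating the response before use is what catches non-compliant output. A minimal, dependency-free sketch — the field contract and sample response here are hypothetical, not from either model's API:

```python
import json

# Hypothetical contract: fields a routing pipeline expects from the model.
REQUIRED_FIELDS = {"category": str, "confidence": float}

def parse_strict(raw: str) -> dict:
    """Parse a model response and enforce a minimal field/type contract."""
    data = json.loads(raw)  # raises ValueError on non-JSON output
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"field {field!r} missing or not {expected_type.__name__}")
    return data

result = parse_strict('{"category": "billing", "confidence": 0.92}')
```

A model that scores higher on structured output simply trips this kind of guard less often, which means fewer retries in production.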
| Benchmark | Devstral Medium | Gemini 2.5 Pro |
| --- | --- | --- |
| Faithfulness | 4/5 | 5/5 |
| Long Context | 4/5 | 5/5 |
| Multilingual | 4/5 | 5/5 |
| Tool Calling | 3/5 | 5/5 |
| Classification | 4/5 | 4/5 |
| Agentic Planning | 4/5 | 4/5 |
| Structured Output | 4/5 | 5/5 |
| Safety Calibration | 1/5 | 1/5 |
| Strategic Analysis | 2/5 | 4/5 |
| Persona Consistency | 3/5 | 5/5 |
| Constrained Rewriting | 3/5 | 3/5 |
| Creative Problem Solving | 2/5 | 5/5 |
| Summary | 0 wins | 8 wins |

Pricing Analysis

Per the payload, Devstral Medium costs $0.40 input / $2.00 output per MTok (million tokens); Gemini 2.5 Pro costs $1.25 input / $10.00 output per MTok. Assuming a balanced 50/50 input/output split, 1M tokens cost: Devstral = 0.5 × $0.40 + 0.5 × $2.00 = $1.20; Gemini = 0.5 × $1.25 + 0.5 × $10.00 = $5.625. Scaled to 10M tokens: Devstral ≈ $12; Gemini ≈ $56.25. At 100M tokens: Devstral ≈ $120; Gemini ≈ $562.50. In short, Devstral runs at roughly 20% of Gemini's cost, matching the priceRatio (0.2) in the payload. Teams doing high-volume inference (chat services, bulk generation) should care about this gap; teams prioritizing top-ranked capability in tool calling, long-context, faithfulness, or multilingual output may find Gemini worth the premium.
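The blended-rate math above can be sketched as a small helper. Prices are USD per million tokens; the 50/50 input/output split is an assumption about workload shape, not a property of either model:

```python
def blended_cost(total_tokens: int, input_price: float, output_price: float,
                 input_share: float = 0.5) -> float:
    """Estimate USD cost for a run, given per-MTok prices and an input/output split."""
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens * (1 - input_share)
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# 1M tokens at a 50/50 split:
devstral = blended_cost(1_000_000, 0.40, 2.00)   # → 1.20
gemini = blended_cost(1_000_000, 1.25, 10.00)    # → 5.625
```

Tilting `input_share` toward input-heavy workloads (long documents in, short answers out) narrows the absolute gap, since input tokens are the cheaper side for both models.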

Real-World Cost Comparison

| Task | Devstral Medium | Gemini 2.5 Pro |
| --- | --- | --- |
| Chat response | $0.0011 | $0.0053 |
| Blog post | $0.0042 | $0.021 |
| Document batch | $0.108 | $0.525 |
| Pipeline run | $1.08 | $5.25 |

Bottom Line

Choose Devstral Medium if: you need the lowest per-token cost at scale ($0.40 input / $2.00 output per MTok) and can accept weaker performance on tool calling, long context, faithfulness, persona consistency, and creative problem solving. Good for high-volume inference where classification parity and lower cost matter most.

Choose Gemini 2.5 Pro if: you prioritize top-tier tool calling, faithfulness, long-context handling, persona consistency, creative problem solving, or multilingual quality (Gemini wins 8 of 12 tests and is tied for 1st on several key axes) and can pay the premium ($1.25 input / $10.00 output per MTok). Gemini also accepts multimodal inputs and offers a much larger context window (1,048,576 vs 131,072 tokens).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions