Devstral Medium vs Gemini 2.5 Pro
Gemini 2.5 Pro is the better pick for accuracy-heavy and long-context workloads: it wins 8 of 12 benchmarks in our tests, notably tool-calling, faithfulness, and long-context. Devstral Medium is the cost-efficient alternative—it ties on classification and delivers lower per-token pricing if you need high throughput and can accept weaker tool-calling and long-context performance.
Devstral Medium (Mistral)
Pricing: $0.40/MTok input, $2.00/MTok output

Gemini 2.5 Pro
Pricing: $1.25/MTok input, $10.00/MTok output
Benchmark Analysis
Across our 12-test suite, Gemini 2.5 Pro wins 8 benchmarks, Devstral Medium wins none outright, and 4 tests tie. Test-by-test (Devstral score vs Gemini score):
- Structured output: 4 vs 5 — Gemini wins; Gemini is tied for 1st on structured_output in our rankings (with 24 other models). This matters when you need strict JSON/schema compliance.
- Strategic analysis: 2 vs 4 — Gemini wins; Gemini ranks 27 of 54, so it handles nuanced numeric tradeoffs substantially better in our tests.
- Creative problem solving: 2 vs 5 — Gemini wins; Gemini is tied for 1st here, so it produces more non-obvious, feasible ideas in our evaluation.
- Tool calling: 3 vs 5 — Gemini wins; Gemini is tied for 1st on tool_calling (with 16 other models), which correlates with more accurate function selection and argument sequencing.
- Faithfulness: 4 vs 5 — Gemini wins; Gemini is tied for 1st on faithfulness, reducing hallucination risk in source-driven tasks.
- Long context: 4 vs 5 — Gemini wins; Gemini is tied for 1st on long_context and also has a much larger context window (1,048,576 vs 131,072), so it performs better on retrieval and continuity across 30K+ tokens.
- Persona consistency: 3 vs 5 — Gemini wins; Gemini is tied for 1st on persona_consistency, so it resists injection and maintains character in our tests.
- Multilingual: 4 vs 5 — Gemini wins; Gemini is tied for 1st on multilingual outputs in our ranking set.

Ties (no winner): constrained_rewriting 3/3, classification 4/4, safety_calibration 1/1, agentic_planning 4/4 — these represent parity in our suite. Notably, Devstral ties for 1st on classification (with 29 other models), so it compares well for routing and categorization tasks.

External benchmarks: Gemini scores 57.6% on SWE-bench Verified (Epoch AI) and 84.2% on AIME 2025 (Epoch AI); Devstral has no SWE-bench or AIME external scores in the payload. We cite those Epoch AI results as supplementary evidence for Gemini's coding/math capabilities.
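The headline tally (8 Gemini wins, 0 Devstral wins, 4 ties) can be reproduced directly from the per-test scores quoted above. A minimal sketch — the score table below is transcribed from this analysis, not pulled from any API:

```python
# Per-test scores as quoted in the analysis above: (Devstral, Gemini), 1-5 scale.
scores = {
    "structured_output": (4, 5),
    "strategic_analysis": (2, 4),
    "creative_problem_solving": (2, 5),
    "tool_calling": (3, 5),
    "faithfulness": (4, 5),
    "long_context": (4, 5),
    "persona_consistency": (3, 5),
    "multilingual": (4, 5),
    "constrained_rewriting": (3, 3),
    "classification": (4, 4),
    "safety_calibration": (1, 1),
    "agentic_planning": (4, 4),
}

gemini_wins = sum(1 for d, g in scores.values() if g > d)
devstral_wins = sum(1 for d, g in scores.values() if d > g)
ties = sum(1 for d, g in scores.values() if d == g)

print(gemini_wins, devstral_wins, ties)  # → 8 0 4
```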
Pricing Analysis
Per the payload, Devstral Medium costs $0.40 input / $2.00 output per MTok (million tokens); Gemini 2.5 Pro costs $1.25 input / $10.00 output per MTok. Using a balanced 50/50 input/output split, 1M tokens (0.5 MTok input + 0.5 MTok output) costs: Devstral = 0.5 × $0.40 + 0.5 × $2.00 = $1.20; Gemini = 0.5 × $1.25 + 0.5 × $10.00 = $5.625. Scale that to 10M tokens: Devstral ≈ $12; Gemini ≈ $56.25. At 100M tokens: Devstral ≈ $120; Gemini ≈ $562.50. In short, Devstral runs at roughly 21% of Gemini's cost, consistent with the priceRatio (0.2) in the payload. Teams doing high-volume inference (chat services, bulk generation) should care about this gap; teams prioritizing top-ranked capability in tool calling, long context, faithfulness, or multilingual output may find Gemini worth the premium.
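The arithmetic above generalizes to any volume and input/output mix. A minimal sketch — `blended_cost` and the 50/50 split are illustrative assumptions, not part of any billing API:

```python
def blended_cost(total_tokens: float, input_price: float, output_price: float,
                 input_share: float = 0.5) -> float:
    """Dollar cost for total_tokens, given $/MTok prices and an input-token share."""
    mtok = total_tokens / 1_000_000  # MTok = million tokens
    return mtok * (input_share * input_price + (1 - input_share) * output_price)

# Prices from the payload ($/MTok): Devstral 0.40 in / 2.00 out; Gemini 1.25 in / 10.00 out.
for total in (1e6, 10e6, 100e6):
    devstral = blended_cost(total, 0.40, 2.00)
    gemini = blended_cost(total, 1.25, 10.00)
    print(f"{total:>12,.0f} tokens: Devstral ${devstral:,.2f} vs Gemini ${gemini:,.2f}")
```

Shifting `input_share` toward 1.0 (retrieval-heavy, short answers) widens the gap further, since the input-price ratio (0.40 vs 1.25) is steeper than the blended ratio.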
Bottom Line
Choose Devstral Medium if: you need the lowest per-token cost at scale ($0.40 input / $2.00 output per MTok) and can accept weaker performance on tool calling, long context, faithfulness, persona consistency, and creative problem solving. Good for high-volume inference where classification parity and lower costs matter. Choose Gemini 2.5 Pro if: you prioritize top-tier tool calling, faithfulness, long-context handling, persona consistency, creative problem solving, or multilingual quality (Gemini wins 8 of 12 tests and is tied for 1st on several key axes), and you can pay the premium ($1.25 input / $10.00 output per MTok). Gemini also supports multimodal inputs and a much larger context window (1,048,576 vs 131,072 tokens).
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.