Devstral Medium vs Gemini 3.1 Pro Preview
Gemini 3.1 Pro Preview is the clear choice for high-performance, multimodal and agentic AI workflows — it wins 11 of 12 benchmarks in our testing and scores 95.6% on AIME 2025 (Epoch AI). Devstral Medium is the cost-efficient alternative: about 6× cheaper and the better pick when classification accuracy and tight budgets matter.
Mistral
Devstral Medium
Pricing
Input
$0.40/MTok
Output
$2.00/MTok
modelpicker.net
Gemini 3.1 Pro Preview
Pricing
Input
$2.00/MTok
Output
$12.00/MTok
Benchmark Analysis
Summary of our 12-test head-to-head (scores are from our testing, 1–5 scale):
Gemini wins 11 tests:
- Structured_output 5 vs 4 (Gemini "tied for 1st with 24 other models out of 54 tested"; Devstral "rank 26 of 54 (27 models share this score)"). This dimension measures JSON/schema compliance; Gemini is stronger at strict format adherence.
- Strategic_analysis 5 vs 2 (Gemini "tied for 1st with 25 other models out of 54 tested"): in our tests, Gemini handles nuanced tradeoffs and numeric reasoning in strategy tasks far better.
- Constrained_rewriting 4 vs 3 (Gemini "rank 6 of 53 (25 models share this score)"): better at tight compression/rewrite tasks.
- Creative_problem_solving 5 vs 2 (Gemini "tied for 1st with 7 other models out of 54 tested"): Gemini produces more original, feasible ideas in our prompts.
- Tool_calling 4 vs 3 (Gemini "rank 18 of 54 (29 models share this score)"; Devstral "rank 47 of 54 (6 models share this score)"): Gemini selects and sequences functions more accurately in our tests.
- Faithfulness 5 vs 4 (Gemini "tied for 1st with 32 other models out of 55 tested"): Gemini adheres to source material better in our testing.
- Long_context 5 vs 4 (Gemini "tied for 1st with 36 other models out of 55 tested"): Gemini outperforms for retrieval and reasoning at 30K+ token contexts.
- Safety_calibration 2 vs 1 (Gemini "rank 12 of 55 (20 models share this score)"): Gemini refuses more harmful requests while permitting legitimate ones more reliably in our tests.
- Persona_consistency 5 vs 3 (Gemini "tied for 1st with 36 other models out of 53 tested"): Gemini better maintains character and resists injection in chat-style workloads.
- Agentic_planning 5 vs 4 (Gemini "tied for 1st with 14 other models out of 54 tested"): Gemini decomposes goals and recovers from failures more robustly in our agentic planning prompts.
- Multilingual 5 vs 4 (Gemini "tied for 1st with 34 other models out of 55 tested"): Gemini produces higher-quality non-English outputs in our tests.
- Classification is Devstral's one win, 4 vs 2 (Devstral "tied for 1st with 29 other models out of 53 tested"; Gemini "rank 51 of 53 (3 models share this score)"): Devstral is stronger at routing/categorization tasks in our benchmark scenarios.

External data: Gemini scores 95.6% on AIME 2025 (Epoch AI), which we cite as a third-party indicator of strong mathematical/competition reasoning; Devstral has no AIME score in our external data. Overall, Gemini's many "tied for 1st" positions indicate consistently best-in-class behavior on format, reasoning, creativity, long context, and agentic tests in our suite, while Devstral delivers a clear cost advantage and better classification in our testing.
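To make the structured_output dimension concrete, here is a minimal sketch of the kind of schema-compliance check it represents (the schema and sample replies are hypothetical illustrations, not our actual test fixtures):

```python
import json

# Hypothetical schema: required keys and their expected Python types.
SCHEMA = {"name": str, "score": float, "tags": list}

def complies(raw: str) -> bool:
    """True if `raw` parses as JSON and matches SCHEMA exactly (no missing
    or extra keys, every value of the expected type)."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict) or set(obj) != set(SCHEMA):
        return False
    return all(isinstance(obj[key], typ) for key, typ in SCHEMA.items())

print(complies('{"name": "demo", "score": 4.5, "tags": ["a"]}'))  # compliant
print(complies('{"name": "demo", "score": "high"}'))  # wrong type, missing key
```

A model that reliably emits replies passing this kind of check scores higher on strict format adherence.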
Pricing Analysis
Prices are per million tokens (MTok). Summing the input and output rates, processing 1M input tokens plus 1M output tokens costs: Devstral Medium $0.40 + $2.00 = $2.40; Gemini 3.1 Pro Preview $2.00 + $12.00 = $14.00. At 10M tokens/month each way, that is roughly $24 for Devstral vs $140 for Gemini; at 100M tokens/month, roughly $240 vs $1,400. This ~6× price gap matters for high-volume production use (10M+ tokens/mo), startups, and any team optimizing cost of inference; it's less critical for low-volume research or feature-prototype work where Gemini's top-tier capabilities may justify the premium.
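This arithmetic can be sketched as a small estimator (list prices hard-coded from the pricing tables above; the monthly volumes in the example are illustrative assumptions):

```python
# Rough monthly cost estimator. Prices are USD per million tokens (MTok),
# taken from the pricing tables above.
PRICES = {
    "Devstral Medium": {"input": 0.40, "output": 2.00},
    "Gemini 3.1 Pro Preview": {"input": 2.00, "output": 12.00},
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Cost in USD for the given monthly volumes, in millions of tokens."""
    rates = PRICES[model]
    return input_mtok * rates["input"] + output_mtok * rates["output"]

# Example: 10M input + 10M output tokens per month.
devstral = monthly_cost("Devstral Medium", 10, 10)       # 4 + 20  = 24.0
gemini = monthly_cost("Gemini 3.1 Pro Preview", 10, 10)  # 20 + 120 = 140.0
print(f"Devstral ${devstral:,.2f} vs Gemini ${gemini:,.2f} "
      f"({gemini / devstral:.1f}x)")
```

Real bills depend on the actual input/output split of your workload, which is usually far from 50/50.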
Bottom Line
Choose Devstral Medium if: you have strict cost constraints or very high token volumes (combined input+output list price of ~$2.40 vs Gemini's $14.00 per million tokens), your primary tasks are classification/routing, or you need solid code/agentic reasoning at a much lower price point. Choose Gemini 3.1 Pro Preview if: you need top-tier performance across structured output, creative problem solving, long-context retrieval, agentic planning, or multimodal inputs (Gemini supports text+image+file+audio+video→text), or you value the superior faithfulness and safety calibration shown in our tests (Gemini wins 11 of 12 benchmarks).
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.