Devstral Medium vs Gemini 2.5 Pro

Gemini 2.5 Pro is the better pick for accuracy-heavy and long-context workloads: it wins 8 of 12 benchmarks in our tests, notably tool-calling, faithfulness, and long-context. Devstral Medium is the cost-efficient alternative—it ties on classification and delivers lower per-token pricing if you need high throughput and can accept weaker tool-calling and long-context performance.

mistral

Devstral Medium

Overall
3.17/5 Usable

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
3/5
Classification
4/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.400/MTok

Output

$2.00/MTok

Context Window: 131K

modelpicker.net

google

Gemini 2.5 Pro

Overall
4.25/5 Strong

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
4/5
Structured Output
5/5
Safety Calibration
1/5
Strategic Analysis
4/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
5/5

External Benchmarks

SWE-bench Verified
57.6%
MATH Level 5
N/A
AIME 2025
84.2%

Pricing

Input

$1.25/MTok

Output

$10.00/MTok

Context Window: 1,049K


Benchmark Analysis

Across our 12-test suite, Gemini 2.5 Pro wins 8 benchmarks, Devstral Medium wins none outright, and 4 tests tie. Test-by-test (Devstral score vs Gemini score):

  • Structured output: 4 vs 5 — Gemini wins; Gemini is tied for 1st on structured_output per our rankings ("tied for 1st with 24 other models"). This matters when you need strict JSON/schema compliance.
  • Strategic analysis: 2 vs 4 — Gemini wins; Gemini ranks 27 of 54 on strategic analysis, so it handles nuanced numeric tradeoffs substantially better in our tests.
  • Creative problem solving: 2 vs 5 — Gemini wins; Gemini is tied for 1st here, so it produces more non-obvious, feasible ideas in our evaluation.
  • Tool calling: 3 vs 5 — Gemini wins; Gemini is tied for 1st on tool_calling ("tied for 1st with 16 other models"), which correlates to more accurate function selection and argument sequencing.
  • Faithfulness: 4 vs 5 — Gemini wins; Gemini is tied for 1st on faithfulness, reducing hallucination risk in source-driven tasks.
  • Long context: 4 vs 5 — Gemini wins; Gemini is tied for 1st on long_context and also has a much larger context window (1,048,576 vs 131,072), so it performs better on retrieval and continuity across 30K+ tokens.
  • Persona consistency: 3 vs 5 — Gemini wins; Gemini is tied for 1st on persona_consistency, so it resists injection and maintains character in our tests.
  • Multilingual: 4 vs 5 — Gemini wins; Gemini is tied for 1st on multilingual in our ranking set.

Ties (no winner): constrained_rewriting 3/3, classification 4/4, safety_calibration 1/1, agentic_planning 4/4 — these represent parity in our suite. Notably, Devstral ties for 1st on classification per its ranking ("tied for 1st with 29 other models"), so for routing and categorization tasks it compares well.

External benchmarks: Gemini scores 57.6% on SWE-bench Verified (Epoch AI) and 84.2% on AIME 2025 (Epoch AI); Devstral has no SWE-bench or AIME external scores in the payload. We cite those Epoch AI results as supplementary evidence of Gemini's coding and math capabilities.
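The structured-output point above is easy to make concrete: when a model must emit strict JSON, validating the response before use is what catches non-compliant output. A minimal, dependency-free sketch — the field contract and sample response here are hypothetical, not from either model's API:

```python
import json

# Hypothetical contract: fields a routing pipeline expects from the model.
REQUIRED_FIELDS = {"category": str, "confidence": float}

def parse_strict(raw: str) -> dict:
    """Parse a model response and enforce a minimal field/type contract."""
    data = json.loads(raw)  # raises ValueError on non-JSON output
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"field {field!r} missing or not {expected_type.__name__}")
    return data

result = parse_strict('{"category": "billing", "confidence": 0.92}')
```

A model that scores higher on structured output simply trips this kind of guard less often, which means fewer retries in production.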
| Benchmark | Devstral Medium | Gemini 2.5 Pro |
| --- | --- | --- |
| Faithfulness | 4/5 | 5/5 |
| Long Context | 4/5 | 5/5 |
| Multilingual | 4/5 | 5/5 |
| Tool Calling | 3/5 | 5/5 |
| Classification | 4/5 | 4/5 |
| Agentic Planning | 4/5 | 4/5 |
| Structured Output | 4/5 | 5/5 |
| Safety Calibration | 1/5 | 1/5 |
| Strategic Analysis | 2/5 | 4/5 |
| Persona Consistency | 3/5 | 5/5 |
| Constrained Rewriting | 3/5 | 3/5 |
| Creative Problem Solving | 2/5 | 5/5 |
| Summary | 0 wins | 8 wins |

Pricing Analysis

Per the payload, Devstral Medium costs $0.40 input / $2.00 output per MTok (million tokens); Gemini 2.5 Pro costs $1.25 input / $10.00 output per MTok. Assuming a balanced 50/50 input/output split, 1M tokens cost: Devstral = 0.5 × $0.40 + 0.5 × $2.00 = $1.20; Gemini = 0.5 × $1.25 + 0.5 × $10.00 = $5.625. Scaled to 10M tokens: Devstral ≈ $12; Gemini ≈ $56.25. At 100M tokens: Devstral ≈ $120; Gemini ≈ $562.50. In short, Devstral runs at roughly 20% of Gemini's cost, matching the priceRatio (0.2) in the payload. Teams doing high-volume inference (chat services, bulk generation) should care about this gap; teams prioritizing top-ranked capability in tool calling, long-context, faithfulness, or multilingual output may find Gemini worth the premium.
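The blended-rate math above can be sketched as a small helper. Prices are USD per million tokens; the 50/50 input/output split is an assumption about workload shape, not a property of either model:

```python
def blended_cost(total_tokens: int, input_price: float, output_price: float,
                 input_share: float = 0.5) -> float:
    """Estimate USD cost for a run, given per-MTok prices and an input/output split."""
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens * (1 - input_share)
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# 1M tokens at a 50/50 split:
devstral = blended_cost(1_000_000, 0.40, 2.00)   # → 1.20
gemini = blended_cost(1_000_000, 1.25, 10.00)    # → 5.625
```

Tilting `input_share` toward input-heavy workloads (long documents in, short answers out) narrows the absolute gap, since input tokens are the cheaper side for both models.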

Real-World Cost Comparison

| Task | Devstral Medium | Gemini 2.5 Pro |
| --- | --- | --- |
| Chat response | $0.0011 | $0.0053 |
| Blog post | $0.0042 | $0.021 |
| Document batch | $0.108 | $0.525 |
| Pipeline run | $1.08 | $5.25 |

Bottom Line

Choose Devstral Medium if: you need the lowest per-token cost at scale ($0.40 input / $2.00 output per MTok) and can accept weaker performance on tool calling, long context, faithfulness, persona consistency, and creative problem solving. Good for high-volume inference where classification parity and lower cost matter most.

Choose Gemini 2.5 Pro if: you prioritize top-tier tool calling, faithfulness, long-context handling, persona consistency, creative problem solving, or multilingual quality (Gemini wins 8 of 12 tests and is tied for 1st on several key axes) and can pay the premium ($1.25 input / $10.00 output per MTok). Gemini also accepts multimodal inputs and offers a much larger context window (1,048,576 vs 131,072 tokens).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions