Devstral Small 1.1 vs Gemini 2.5 Pro
For most production use cases that prioritize reasoning, long-context retrieval, tool calling, and faithfulness, Gemini 2.5 Pro is the better choice (it wins 9 of our 12 benchmarks). Devstral Small 1.1 is significantly cheaper and wins on safety calibration, making it a strong pick for cost-sensitive deployments or for workloads where safer refusals matter more than top-tier reasoning.
Pricing at a glance (USD per million tokens):

| Model | Input | Output |
| --- | --- | --- |
| Devstral Small 1.1 (Mistral) | $0.10 | $0.30 |
| Gemini 2.5 Pro | $1.25 | $10.00 |
Benchmark Analysis
Summary of our 12-test head-to-head (scores are our 1–5 internal tests; rankings reference the tested pool):
- Gemini wins (9 tests):
  - structured_output 5 vs 4 (Gemini tied for 1st of 54)
  - long_context 5 vs 4 (Gemini tied for 1st of 55)
  - faithfulness 5 vs 4 (Gemini tied for 1st of 55)
  - tool_calling 5 vs 4 (Gemini tied for 1st of 54)
  - creative_problem_solving 5 vs 2 (Gemini tied for 1st of 54)
  - strategic_analysis 4 vs 2 (Gemini ranks 27 of 54)
  - persona_consistency 5 vs 2 (Gemini tied for 1st of 53; Devstral ranks 51 of 53)
  - agentic_planning 4 vs 2 (Gemini ranks 16 of 54; Devstral ranks 53 of 54)
  - multilingual 5 vs 4 (Gemini tied for 1st of 55)

  Practical meaning: Gemini's higher scores and top ranks for long_context and tool_calling indicate it will better handle retrieval over 30k+ token contexts and select the correct function with correct arguments in agent workflows (see the sketch after this list). Its faithfulness and creative_problem_solving scores imply fewer hallucinations and more useful brainstorming on hard problems.
- Devstral wins (1 test): safety_calibration 2 vs 1 (Devstral ranks 12 of 55, tied with 19 others). This means Devstral more often refuses harmful prompts appropriately in our tests.
- Ties (2 tests): classification 4 vs 4 (both tied for 1st with many models), constrained_rewriting 3 vs 3 (both rank ~31). So for straightforward categorization both models perform equally well in our suite.
- External benchmarks for Gemini (supplementary): on SWE-bench Verified, Gemini scores 57.6%; on AIME 2025, it scores 84.2% (both figures from Epoch AI). These external results align with Gemini's strengths on coding and math-heavy reasoning tasks in our internal suite.
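To make the tool_calling criterion concrete, here is a minimal, generic sketch of the kind of check such a test performs: did the model pick the expected function and supply well-typed arguments? The get_weather schema and the example call are hypothetical illustrations, not the actual modelpicker.net harness.

```python
# Generic sketch of a tool_calling check: given a model's proposed tool call
# (name + JSON arguments), verify it selected the right function and supplied
# well-typed arguments. Schema and example call are made up for illustration.
import json

TOOL_SCHEMA = {
    "name": "get_weather",
    "required": {"city": str, "unit": str},
}

def check_tool_call(raw_call: str, expected_name: str) -> bool:
    """Return True if the model picked the expected tool with valid arguments."""
    call = json.loads(raw_call)
    if call.get("name") != expected_name:
        return False
    args = call.get("arguments", {})
    return all(
        field in args and isinstance(args[field], ftype)
        for field, ftype in TOOL_SCHEMA["required"].items()
    )

# Example: a well-formed call passes the check.
print(check_tool_call(
    '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}',
    "get_weather",
))  # True
```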
Pricing Analysis
Price per million tokens (as provided): Devstral Small 1.1 at $0.10 input / $0.30 output; Gemini 2.5 Pro at $1.25 input / $10.00 output. Assuming a 50/50 split of input vs output tokens, the blended cost per 1M total tokens is Devstral ≈ $0.20 (0.5 × $0.10 + 0.5 × $0.30) and Gemini ≈ $5.625 (0.5 × $1.25 + 0.5 × $10.00).

Scaling to monthly volumes at the same 50/50 split:
- 1M tokens: Devstral $0.20 vs Gemini $5.63
- 10M tokens: Devstral $2.00 vs Gemini $56.25
- 100M tokens: Devstral $20.00 vs Gemini $562.50

Who should care: high-throughput services (chatbots, background indexing, analytics pipelines) will see large absolute savings with Devstral; research or mission-critical apps that require Gemini's higher benchmark performance may justify the higher spend.
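For readers who want to reproduce the arithmetic above, the sketch below computes the blended cost under the same assumptions. The 50/50 input/output split and the monthly volumes are illustrative assumptions, not measured traffic; the prices are the list prices quoted in this comparison.

```python
# Blended-cost sketch: assumes the same illustrative 50/50 input/output split
# used above. Prices are USD per million tokens as quoted in this comparison.

PRICES = {  # (input price, output price) per million tokens
    "Devstral Small 1.1": (0.10, 0.30),
    "Gemini 2.5 Pro": (1.25, 10.00),
}

def blended_cost_per_mtok(input_price: float, output_price: float,
                          input_share: float = 0.5) -> float:
    """Cost of one million total tokens at the given input/output mix."""
    return input_share * input_price + (1.0 - input_share) * output_price

for model, (inp, out) in PRICES.items():
    per_mtok = blended_cost_per_mtok(inp, out)
    for volume in (1, 10, 100):  # monthly volume, in millions of tokens
        print(f"{model}: {volume}M tokens/month ≈ ${per_mtok * volume:,.2f}")
```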
Bottom Line
Choose Devstral Small 1.1 if: you need the lowest possible inference cost at scale ($0.10 input / $0.30 output per million tokens), you run very high token volumes (10M–100M+ per month), or your product prioritizes stricter safety refusals over top-tier long-context reasoning.

Choose Gemini 2.5 Pro if: you need best-in-class long-context retrieval, reliable tool calling, and higher faithfulness and creative problem solving (it wins 9 of 12 tests and is tied for 1st on long_context, tool_calling, and faithfulness), and you can absorb the higher cost ($1.25 input / $10.00 output per million tokens) for higher-quality outputs.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
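As a rough illustration of that scoring step (not the actual modelpicker.net harness), the sketch below shows a generic 1–5 LLM-judge loop; run_model, judge, and the rubric text are hypothetical stand-ins for whatever candidate model, judge model, and per-benchmark rubric are actually used.

```python
# Generic 1-5 LLM-judge scoring sketch. `run_model` and `judge` are
# hypothetical callables wrapping API calls to the candidate model and the
# judge model; the rubric here is a placeholder, not the real one.
from typing import Callable

RUBRIC = ("Score the response from 1 (fails the task) to 5 (fully correct "
          "and well-formed). Reply with a single integer.")

def score_case(prompt: str,
               run_model: Callable[[str], str],
               judge: Callable[[str], str]) -> int:
    """Run one benchmark case and return the judge's 1-5 score."""
    response = run_model(prompt)
    verdict = judge(f"{RUBRIC}\n\nTask:\n{prompt}\n\nResponse:\n{response}")
    score = int(verdict.strip())
    return min(max(score, 1), 5)  # clamp defensively to the 1-5 scale
```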