Devstral Small 1.1 vs Gemini 3.1 Pro Preview

Gemini 3.1 Pro Preview is the clear winner on breadth — it outscores Devstral Small 1.1 on 9 of 12 benchmarks in our testing, including agentic planning (5 vs 2), strategic analysis (5 vs 2), and creative problem solving (5 vs 2). Devstral Small 1.1 holds one win: classification (4 vs 2), which matters for routing and categorization pipelines. The price gap is extreme — output tokens cost $0.30/M on Devstral Small 1.1 versus $12/M on Gemini 3.1 Pro Preview, a 40x difference that makes the calculus entirely about whether you need frontier-level reasoning or can scope your task narrowly enough for a cheaper model.

Mistral
Devstral Small 1.1

Overall: 3.08/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 2/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Persona Consistency: 2/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.100/MTok
Output: $0.300/MTok

Context Window: 131K

modelpicker.net

Google
Gemini 3.1 Pro Preview

Overall: 4.33/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 2/5
Agentic Planning: 5/5
Structured Output: 5/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 5/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: 95.6%

Pricing

Input: $2.00/MTok
Output: $12.00/MTok

Context Window: 1,049K


Benchmark Analysis

Across our 12-test suite, Gemini 3.1 Pro Preview wins 9 benchmarks, Devstral Small 1.1 wins 1, and they tie on 2.

Where Gemini 3.1 Pro Preview dominates:

  • Agentic planning: 5 vs 2. Devstral Small 1.1 ranks 53rd of 54 on this test — near the bottom of all models we've tested. Gemini 3.1 Pro Preview ties for 1st with 14 others. For multi-step agent workflows requiring goal decomposition and failure recovery, this is a critical gap.
  • Strategic analysis: 5 vs 2. Gemini 3.1 Pro Preview ties for 1st with 25 other models; Devstral Small 1.1 ranks 44th of 54. Real-world implication: nuanced tradeoff reasoning and analysis tasks are not where Devstral Small 1.1 should be deployed.
  • Creative problem solving: 5 vs 2. Gemini 3.1 Pro Preview ties for 1st with 7 others; Devstral Small 1.1 ranks 47th of 54. Generating non-obvious, feasible ideas is a clear Gemini 3.1 Pro Preview strength.
  • Persona consistency: 5 vs 2. Devstral Small 1.1 ranks 51st of 53 — one of its weakest scores. Chatbot and character applications should avoid it.
  • Faithfulness: 5 vs 4. Both score reasonably, but Gemini 3.1 Pro Preview ties for 1st with 32 others while Devstral Small 1.1 ranks 34th of 55.
  • Long context: 5 vs 4. Gemini 3.1 Pro Preview ties for 1st with 36 others and has a 1,048,576-token context window vs Devstral Small 1.1's 131,072. The context window difference alone is meaningful for document-heavy workflows.
  • Multilingual: 5 vs 4. Both are competent, but Gemini 3.1 Pro Preview ties for 1st with 34 others.
  • Constrained rewriting: 4 vs 3. A meaningful gap for compression-heavy editorial tasks.
  • Structured output: 5 vs 4. Gemini 3.1 Pro Preview ties for 1st with 24 others; Devstral Small 1.1 ranks 26th of 54. Both are capable here, but Gemini 3.1 Pro Preview has a slight edge on JSON schema compliance.
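Structured-output scores like these largely come down to whether a model's JSON reliably parses and conforms to a schema. As a minimal stdlib-only sketch of the kind of check involved (the schema and the sample outputs here are hypothetical, not drawn from our test suite):

```python
import json

# Hypothetical schema: required keys and their expected Python types.
SCHEMA = {"title": str, "priority": int, "tags": list}

def validate(raw: str) -> tuple[bool, str]:
    """Parse a model's raw output and check it against SCHEMA."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"
    if not isinstance(obj, dict):
        return False, "top-level value is not an object"
    for key, typ in SCHEMA.items():
        if key not in obj:
            return False, f"missing key: {key}"
        if not isinstance(obj[key], typ):
            return False, f"wrong type for {key}"
    return True, "ok"

good = '{"title": "Fix login bug", "priority": 2, "tags": ["auth"]}'
bad = '{"title": "Fix login bug", "priority": "high"}'
print(validate(good))  # (True, 'ok')
print(validate(bad))   # fails on the priority field
```

In practice a schema-compliance benchmark runs many such prompts and counts the fraction of outputs that pass a check like this.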

Where Devstral Small 1.1 wins:

  • Classification: 4 vs 2. This is Devstral Small 1.1's clearest win and a real differentiator. It ties for 1st with 29 other models out of 53 tested; Gemini 3.1 Pro Preview ranks 51st of 53. Routing, tagging, and categorization tasks are the one domain where Devstral Small 1.1 clearly beats Gemini 3.1 Pro Preview in our testing.
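A score pattern like this (cheap model strong at classification, expensive model strong at reasoning) is what makes two-tier routing attractive: the small model labels each request, and only hard categories get escalated. A sketch under assumed names; `classify` is a stand-in for a Devstral Small 1.1 call, and the `ESCALATE` routing table is illustrative, not from our testing:

```python
# Categories that get escalated to the expensive model
# (illustrative routing policy, not a recommendation from the source).
ESCALATE = {"strategic_analysis", "agentic_planning"}

def classify(text: str) -> str:
    """Stand-in for a cheap-model classification call.

    A real pipeline would call Devstral Small 1.1 here; this keyword
    heuristic exists only so the sketch runs end to end.
    """
    if "tradeoff" in text or "roadmap" in text:
        return "strategic_analysis"
    if "plan" in text and "step" in text:
        return "agentic_planning"
    return "faq"

def route(text: str) -> str:
    """Pick a model tier based on the cheap classifier's label."""
    label = classify(text)
    return "gemini-3.1-pro-preview" if label in ESCALATE else "devstral-small-1.1"

print(route("What are the tradeoffs in our Q3 roadmap?"))  # escalated
print(route("How do I reset my password?"))                # stays on the cheap tier
```

The economics only work if the classifier is both cheap and accurate, which is exactly the combination Devstral Small 1.1's 4/5 classification score at $0.30/M output suggests.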

Ties:

  • Tool calling: both score 4, both rank 18th of 54 (tied with 28 others). Equivalent for function-calling pipelines.
  • Safety calibration: both score 2, both rank 12th of 55 (tied with 19 others). Neither model stands out here.
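Since the two models tie on tool calling, the deciding factor for function-calling pipelines is usually cost rather than capability; the dispatch side of the pipeline looks the same either way. A minimal sketch with hypothetical tools (the tool names and the `{"name": ..., "arguments": {...}}` payload shape are illustrative, not any vendor's exact wire format):

```python
import json

# Registry of callable tools (hypothetical examples).
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
    "add": lambda a, b: a + b,
}

def dispatch(tool_call_json: str):
    """Execute a model-emitted tool call of the form
    {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        raise ValueError(f"unknown tool: {call['name']}")
    return fn(**call["arguments"])

print(dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}'))  # 5
```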

External benchmark (Epoch AI): Gemini 3.1 Pro Preview scores 95.6% on AIME 2025, ranking 2nd of 23 models tested — placing it among the top math reasoning models by that external measure. No AIME 2025 score is available for Devstral Small 1.1 in our data.

Benchmark | Devstral Small 1.1 | Gemini 3.1 Pro Preview
Faithfulness | 4/5 | 5/5
Long Context | 4/5 | 5/5
Multilingual | 4/5 | 5/5
Tool Calling | 4/5 | 4/5
Classification | 4/5 | 2/5
Agentic Planning | 2/5 | 5/5
Structured Output | 4/5 | 5/5
Safety Calibration | 2/5 | 2/5
Strategic Analysis | 2/5 | 5/5
Persona Consistency | 2/5 | 5/5
Constrained Rewriting | 3/5 | 4/5
Creative Problem Solving | 2/5 | 5/5
Summary | 1 win | 9 wins

Pricing Analysis

Devstral Small 1.1 is priced at $0.10/M input and $0.30/M output. Gemini 3.1 Pro Preview costs $2.00/M input and $12.00/M output: 20x more on input, 40x more on output. At 1B output tokens/month, that's $300 vs $12,000. At 10B tokens, $3,000 vs $120,000. At 100B tokens, $30,000 vs $1,200,000. The gap at the top of that range, roughly $1.17M per month, is a budget line, not a rounding error. Note also that Gemini 3.1 Pro Preview uses reasoning tokens (flagged in the payload), which can significantly inflate billed output in complex workflows, meaning real costs may exceed the headline rates.

Devstral Small 1.1 is priced for high-volume, narrow-task deployments. Gemini 3.1 Pro Preview is priced for tasks where the cost of a wrong answer (in a business decision, a complex agent workflow, or a multimodal pipeline) exceeds the cost of the tokens. Teams running classification pipelines, code routing, or structured extraction at volume should take the price gap seriously.
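The listed per-million rates make this arithmetic easy to reproduce. A small sketch; the 1.3x reasoning-token multiplier is an illustrative assumption, not a measured figure:

```python
def monthly_cost(input_tok, output_tok, in_rate, out_rate, reasoning_multiplier=1.0):
    """Monthly spend in dollars, given token volumes and $/M-token rates.

    reasoning_multiplier inflates billed output tokens for models that
    bill hidden reasoning tokens as output (illustrative assumption).
    """
    billed_out = output_tok * reasoning_multiplier
    return (input_tok * in_rate + billed_out * out_rate) / 1_000_000

# (input $/M, output $/M) from the pricing section.
DEVSTRAL = (0.10, 0.30)
GEMINI = (2.00, 12.00)

# 1B output tokens/month, ignoring input for a like-for-like comparison.
print(monthly_cost(0, 1_000_000_000, *DEVSTRAL))  # 300.0
print(monthly_cost(0, 1_000_000_000, *GEMINI))    # 12000.0
print(monthly_cost(0, 1_000_000_000, *GEMINI, reasoning_multiplier=1.3))
```

The last line shows why the reasoning-token caveat matters: a 30% inflation on billed output turns $12,000/month into $15,600/month at the same nominal volume.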

Real-World Cost Comparison

Task | Devstral Small 1.1 | Gemini 3.1 Pro Preview
Chat response | <$0.001 | $0.0064
Blog post | <$0.001 | $0.025
Document batch | $0.017 | $0.640
Pipeline run | $0.170 | $6.40
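These per-task figures follow directly from the per-token rates once you fix a token budget per task. The budgets below are our own illustrative assumptions chosen to reproduce the table (e.g. a chat response as roughly 200 input and 500 output tokens); they are not published by either vendor:

```python
# (input $/M, output $/M) from the pricing section.
RATES = {"devstral": (0.10, 0.30), "gemini": (2.00, 12.00)}

# Illustrative (input, output) token budgets per task, chosen to
# match the cost table; not vendor-published figures.
TASKS = {
    "chat_response": (200, 500),
    "blog_post": (500, 2_000),
    "document_batch": (20_000, 50_000),
    "pipeline_run": (200_000, 500_000),
}

def task_cost(model: str, task: str) -> float:
    """Dollar cost of one task run for the given model."""
    in_rate, out_rate = RATES[model]
    in_tok, out_tok = TASKS[task]
    return (in_tok * in_rate + out_tok * out_rate) / 1_000_000

print(round(task_cost("gemini", "chat_response"), 4))   # 0.0064
print(round(task_cost("gemini", "document_batch"), 3))  # 0.64
print(round(task_cost("devstral", "pipeline_run"), 3))  # 0.17
```

Because output tokens dominate every budget here, the per-task ratio stays close to the 40x output-price gap rather than the 20x input gap.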

Bottom Line

Choose Devstral Small 1.1 if: you are running high-volume classification, routing, or tagging pipelines where cost efficiency is critical and the task is narrow; your budget cannot absorb $12/M output tokens at scale; you need a capable structured-output or tool-calling model for well-defined workflows where agentic reasoning and creative analysis are not required; or you need text-to-text at volume with predictable costs.

Choose Gemini 3.1 Pro Preview if: you need a capable agentic system that can decompose complex goals and recover from failures (scored 5 vs 2 in our testing); your workflows involve strategic analysis, creative problem solving, or long-document reasoning; you need multimodal input (image, audio, video, file) — Devstral Small 1.1 is text-only; you require a 1M-token context window for large codebases or document sets; or you're building a system where the cost of failure is high enough to justify 40x higher output token pricing.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions