Devstral Small 1.1 vs Gemini 3 Flash Preview

Gemini 3 Flash Preview is the stronger general-purpose AI, winning 10 of 12 benchmarks in our testing — including agentic planning, tool calling, strategic analysis, and creative problem solving — and scoring 75.4% on SWE-bench Verified (Epoch AI, rank 3 of 12). Devstral Small 1.1 edges it out only on safety calibration (2 vs 1 in our tests), and matches it on classification. The tradeoff is steep: Gemini 3 Flash Preview costs $0.50/$3.00 per million input/output tokens versus Devstral Small 1.1's $0.10/$0.30 — a 10x output cost premium that matters at scale.

Mistral

Devstral Small 1.1

Overall: 3.08/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 2/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Persona Consistency: 2/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.100/MTok
Output: $0.300/MTok

Context Window: 131K

modelpicker.net

Google

Gemini 3 Flash Preview

Overall: 4.50/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 5/5

External Benchmarks

SWE-bench Verified: 75.4%
MATH Level 5: N/A
AIME 2025: 92.8%

Pricing

Input: $0.500/MTok
Output: $3.00/MTok

Context Window: 1049K


Benchmark Analysis

Gemini 3 Flash Preview outscores Devstral Small 1.1 on 10 of 12 benchmarks in our testing, with a tie on classification and a Devstral Small 1.1 advantage only on safety calibration.

Tool Calling (5 vs 4): Gemini 3 Flash Preview ties for 1st among 54 models; Devstral Small 1.1 ranks 18th among 54, tied with 28 others. For agentic workflows that depend on accurate function selection and argument sequencing, this gap is meaningful.
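
What "accurate function selection and argument sequencing" means in practice: the harness must call exactly the function the model names, with exactly the arguments it supplies. A minimal dispatch sketch (the tool names and call format here are illustrative, not any provider's actual API):

```python
# Hypothetical tool registry; a real agent would register actual functions.
TOOLS = {
    "get_weather": lambda city: f"sunny in {city}",
    "get_time": lambda tz: f"12:00 {tz}",
}

def dispatch(call: dict) -> str:
    """`call` mimics a model's tool-call output: {"name": ..., "arguments": {...}}.

    A model that picks the wrong name or malforms the arguments fails here,
    which is exactly what the tool-calling benchmark probes.
    """
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

print(dispatch({"name": "get_weather", "arguments": {"city": "Oslo"}}))
```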

Agentic Planning (5 vs 2): This is Devstral Small 1.1's worst result — it ranks 53rd of 54 models, tied with just one other. Gemini 3 Flash Preview ties for 1st among 54. If you're building autonomous agents that need goal decomposition and failure recovery, Devstral Small 1.1 is a poor fit.

Structured Output (5 vs 4): Gemini 3 Flash Preview ties for 1st among 54; Devstral Small 1.1 ranks 26th, tied with 26 others. Both are competent at JSON schema compliance, but Gemini 3 Flash Preview is more reliable.
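
Whichever model you pick, schema compliance is cheap to verify before trusting structured output downstream. A minimal validation sketch using only the standard library (the schema and field names are illustrative assumptions, not from either model's API):

```python
import json

# Illustrative schema: required keys and their expected Python types.
EXPECTED = {"name": str, "score": int, "tags": list}

def validate(raw: str) -> bool:
    """Return True only if `raw` parses as JSON and matches EXPECTED exactly."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict) or set(obj) != set(EXPECTED):
        return False
    return all(isinstance(obj[key], typ) for key, typ in EXPECTED.items())

print(validate('{"name": "demo", "score": 4, "tags": ["a"]}'))  # True
print(validate('{"name": "demo"}'))                             # False
```

In production you would likely reach for a full JSON Schema validator, but even a check this small catches the common failure modes (truncated JSON, missing keys, wrong types).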

Strategic Analysis (5 vs 2): Gemini 3 Flash Preview ties for 1st among 54; Devstral Small 1.1 ranks 44th. For nuanced tradeoff reasoning with real-world numbers, the gap is severe.

Creative Problem Solving (5 vs 2): Gemini 3 Flash Preview ties for 1st among 54; Devstral Small 1.1 ranks 47th of 54, tied with 7 others. Devstral Small 1.1 generates more conventional output on open-ended creative tasks.

Long Context (5 vs 4): Gemini 3 Flash Preview ties for 1st among 55; Devstral Small 1.1 ranks 38th. Gemini 3 Flash Preview also has an 8x larger context window (1,048,576 tokens vs 131,072), making it better suited for retrieval tasks over very long documents.
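
The window gap is easy to sanity-check before dispatching a long document. A rough sketch using the two context sizes from this comparison and a chars-per-token heuristic (the 4-characters-per-token ratio is a common rule of thumb, not an exact tokenizer):

```python
DEVSTRAL_CTX = 131_072    # Devstral Small 1.1 context window, tokens
GEMINI_CTX = 1_048_576    # Gemini 3 Flash Preview context window, tokens

def fits(text: str, window: int, chars_per_token: float = 4.0) -> bool:
    """Rough pre-check: estimate token count from character count."""
    return len(text) / chars_per_token <= window

doc = "x" * 2_000_000  # ~500K estimated tokens
print(fits(doc, DEVSTRAL_CTX))  # False: overflows the 131K window
print(fits(doc, GEMINI_CTX))    # True: fits in the 1M window
```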

Faithfulness (5 vs 4): Gemini 3 Flash Preview ties for 1st among 55; Devstral Small 1.1 ranks 34th.

Persona Consistency (5 vs 2): Gemini 3 Flash Preview ties for 1st among 53; Devstral Small 1.1 ranks 51st of 53. For chatbot and role-based applications, Devstral Small 1.1 struggles to maintain character.

Multilingual (5 vs 4): Gemini 3 Flash Preview ties for 1st among 55; Devstral Small 1.1 ranks 36th.

Constrained Rewriting (4 vs 3): Gemini 3 Flash Preview ranks 6th of 53; Devstral Small 1.1 ranks 31st.

Classification (4 vs 4): Both score 4/5, tying for 1st among 53 models alongside 29 others. Neither has an edge here.

Safety Calibration (2 vs 1): Devstral Small 1.1's only win. It scores 2 and ranks 12th of 55 (tied with 19 others); Gemini 3 Flash Preview scores 1 and ranks 32nd of 55. Both scores are below the field median of 2, so neither model excels at refusing harmful requests while permitting legitimate ones — Devstral Small 1.1 is just less bad.

External Benchmarks (Epoch AI): Gemini 3 Flash Preview scores 75.4% on SWE-bench Verified, placing it 3rd of 12 models with this data point — above the field median of 70.8%. It also scores 92.8% on AIME 2025, ranking 5th of 23 models and well above the median of 83.9%. Devstral Small 1.1 has no published external benchmark data for direct comparison on these dimensions.

Benchmark                  Devstral Small 1.1   Gemini 3 Flash Preview
Faithfulness               4/5                  5/5
Long Context               4/5                  5/5
Multilingual               4/5                  5/5
Tool Calling               4/5                  5/5
Classification             4/5                  4/5
Agentic Planning           2/5                  5/5
Structured Output          4/5                  5/5
Safety Calibration         2/5                  1/5
Strategic Analysis         2/5                  5/5
Persona Consistency        2/5                  5/5
Constrained Rewriting      3/5                  4/5
Creative Problem Solving   2/5                  5/5
Summary                    1 win                10 wins

Pricing Analysis

Devstral Small 1.1 costs $0.10 per million input tokens and $0.30 per million output tokens. Gemini 3 Flash Preview costs $0.50 input and $3.00 output: 5x more expensive on input, 10x more on output. At 1M output tokens/month, the gap is a negligible $2.70 per month. At 10M output tokens/month, you're paying $360 vs $36 annually. At 1B output tokens/month, Gemini 3 Flash Preview costs $36,000 per year versus $3,600 for Devstral Small 1.1, a $32,400 annual gap. Developers running high-volume pipelines where Devstral Small 1.1's capabilities are sufficient should take that cost gap seriously. For low-to-medium volume use cases, or for tasks where Gemini 3 Flash Preview's substantially higher benchmark scores translate to better output quality, the premium is easier to justify.
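
The break-even arithmetic is easy to reproduce for your own traffic mix. A small sketch using the per-MTok prices quoted in this comparison (the function and model keys are our own naming, not an API):

```python
# Per-MTok prices from this comparison: (input $/MTok, output $/MTok).
PRICES = {
    "devstral-small-1.1": (0.10, 0.30),
    "gemini-3-flash-preview": (0.50, 3.00),
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Monthly cost in dollars for a volume given in millions of tokens."""
    in_price, out_price = PRICES[model]
    return input_mtok * in_price + output_mtok * out_price

# Example: 20M input + 10M output tokens per month.
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, input_mtok=20, output_mtok=10):.2f}/month")
```

Multiply by 12 (and by your real volumes) to see whether the 10x output premium is noise or a budget line.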

Real-World Cost Comparison

Task             Devstral Small 1.1   Gemini 3 Flash Preview
Chat response    <$0.001              $0.0016
Blog post        <$0.001              $0.0063
Document batch   $0.017               $0.160
Pipeline run     $0.170               $1.60

Bottom Line

Choose Devstral Small 1.1 if: you're running high-volume text pipelines (10M+ output tokens/month) where cost dominates, your tasks fall into classification or structured data extraction where it holds its own, you need the comparatively better safety calibration of the two (though neither scores well), or you're specifically constrained to text-in/text-out workflows. Also consider it if Gemini 3 Flash Preview's safety calibration score of 1 is a hard blocker for your use case.

Choose Gemini 3 Flash Preview if: you're building agentic systems — it scores 5/5 on both tool calling and agentic planning vs Devstral Small 1.1's 4 and 2. Choose it for multi-modal inputs (it accepts text, image, file, audio, and video), for tasks requiring strategic reasoning (5 vs 2 in our testing), for long-context retrieval over documents exceeding 131K tokens, for creative and open-ended generation, or for multi-turn chat where persona consistency matters (5 vs 2). Its 75.4% SWE-bench Verified score (Epoch AI, rank 3 of 12) also makes it a strong candidate for coding assistance at production quality. The 10x output cost premium is justified when task complexity demands it — but evaluate whether your volume makes that math work.
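
The decision criteria above can be condensed into a routing sketch. The task labels and volume threshold below are illustrative assumptions drawn from this comparison, not part of the benchmark data:

```python
def pick_model(task: str, output_mtok_per_month: float) -> str:
    """Route to the cheaper model where it holds its own, else the stronger one."""
    # Tasks where Devstral Small 1.1 matched or came close in our scores.
    cheap_is_enough = {"classification", "structured_output", "extraction"}
    if task in cheap_is_enough and output_mtok_per_month >= 10:
        return "devstral-small-1.1"     # 10x cheaper output at high volume
    return "gemini-3-flash-preview"     # stronger on 10 of 12 benchmarks

print(pick_model("classification", 100))     # devstral-small-1.1
print(pick_model("agentic_planning", 100))   # gemini-3-flash-preview
```

A real router would weigh more signals (latency, modality, context length), but the core tradeoff — capability premium versus volume cost — fits in a few lines.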

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions