Devstral Small 1.1 vs Gemini 3 Flash Preview
Gemini 3 Flash Preview is the stronger general-purpose AI, winning 10 of 12 benchmarks in our testing — including agentic planning, tool calling, strategic analysis, and creative problem solving — and scoring 75.4% on SWE-bench Verified (Epoch AI, rank 3 of 12). Devstral Small 1.1 edges it out only on safety calibration (2 vs 1 in our tests), and matches it on classification. The tradeoff is steep: Gemini 3 Flash Preview costs $0.50/$3.00 per million input/output tokens versus Devstral Small 1.1's $0.10/$0.30 — a 10x output cost premium that matters at scale.
Devstral Small 1.1 (Mistral)
Pricing: $0.10/MTok input · $0.30/MTok output

Gemini 3 Flash Preview
Pricing: $0.50/MTok input · $3.00/MTok output

Source: modelpicker.net
Benchmark Analysis
Gemini 3 Flash Preview outscores Devstral Small 1.1 across 10 of 12 benchmarks in our testing, with ties on classification and a Devstral Small 1.1 advantage only on safety calibration.
Tool Calling (5 vs 4): Gemini 3 Flash Preview ties for 1st among 54 models; Devstral Small 1.1 ranks 18th among 54, tied with 28 others. For agentic workflows that depend on accurate function selection and argument sequencing, this gap is meaningful.
Agentic Planning (5 vs 2): This is Devstral Small 1.1's worst result — it ranks 53rd of 54 models, tied with just one other. Gemini 3 Flash Preview ties for 1st among 54. If you're building autonomous agents that need goal decomposition and failure recovery, Devstral Small 1.1 is a poor fit.
Structured Output (5 vs 4): Gemini 3 Flash Preview ties for 1st among 54; Devstral Small 1.1 ranks 26th, tied with 26 others. Both are competent at JSON schema compliance, but Gemini 3 Flash Preview is more reliable.
Strategic Analysis (5 vs 2): Gemini 3 Flash Preview ties for 1st among 54; Devstral Small 1.1 ranks 44th. For nuanced tradeoff reasoning with real-world numbers, the gap is severe.
Creative Problem Solving (5 vs 2): Gemini 3 Flash Preview ties for 1st among 54; Devstral Small 1.1 ranks 47th of 54, tied with 7 others. Devstral Small 1.1 generates more conventional output on open-ended creative tasks.
Long Context (5 vs 4): Gemini 3 Flash Preview ties for 1st among 55; Devstral Small 1.1 ranks 38th. Gemini 3 Flash Preview also has an 8x larger context window (1,048,576 tokens vs 131,072), making it better suited for retrieval tasks over very long documents.
Faithfulness (5 vs 4): Gemini 3 Flash Preview ties for 1st among 55; Devstral Small 1.1 ranks 34th.
Persona Consistency (5 vs 2): Gemini 3 Flash Preview ties for 1st among 53; Devstral Small 1.1 ranks 51st of 53. For chatbot and role-based applications, Devstral Small 1.1 struggles to maintain character.
Multilingual (5 vs 4): Gemini 3 Flash Preview ties for 1st among 55; Devstral Small 1.1 ranks 36th.
Constrained Rewriting (4 vs 3): Gemini 3 Flash Preview ranks 6th of 53; Devstral Small 1.1 ranks 31st.
Classification (4 vs 4): Both tie for 1st among 53 models, tied with 29 others. Neither has an edge here.
Safety Calibration (2 vs 1): Devstral Small 1.1's only win. It scores 2 and ranks 12th of 55 (tied with 19 others); Gemini 3 Flash Preview scores 1 and ranks 32nd of 55. Devstral Small 1.1's score merely matches the field median of 2, and Gemini 3 Flash Preview's falls below it, so neither model excels at refusing harmful requests while permitting legitimate ones; Devstral Small 1.1 is just less bad.
External Benchmarks (Epoch AI): Gemini 3 Flash Preview scores 75.4% on SWE-bench Verified, placing it 3rd of 12 models with this data point — above the field median of 70.8%. It also scores 92.8% on AIME 2025, ranking 5th of 23 models and well above the median of 83.9%. Devstral Small 1.1 has no external benchmark data available for direct comparison on these dimensions.
Pricing Analysis
Devstral Small 1.1 costs $0.10 per million input tokens and $0.30 per million output tokens. Gemini 3 Flash Preview costs $0.50 input and $3.00 output: 5x more expensive on input, 10x more on output. At 1M output tokens/month, the gap is $2.70 a month, which is negligible. At 10M output tokens/month, you're paying $360 vs $36 annually, a $324 difference. At 100M output tokens/month, Gemini 3 Flash Preview costs $3,600 a year vs $360 for Devstral Small 1.1, and at 1B output tokens/month the gap grows to roughly $32,400 annually. Developers running high-volume pipelines where Devstral Small 1.1's capabilities are sufficient should take that cost gap seriously. For low-to-medium-volume use cases, or for tasks where Gemini 3 Flash Preview's substantially higher benchmark scores translate to better output quality, the premium is easier to justify.
Real-World Cost Comparison
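The per-token rates above make the volume math easy to run for yourself. The sketch below uses the listed prices; the model names and monthly volumes are illustrative, and you should substitute your own input/output token counts.

```python
# Monthly cost comparison from per-million-token rates.
# Rates are the published prices quoted in this article;
# volumes below are hypothetical examples.

PRICES = {  # USD per million tokens: (input, output)
    "Devstral Small 1.1": (0.10, 0.30),
    "Gemini 3 Flash Preview": (0.50, 3.00),
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """USD cost for one month, with volumes given in millions of tokens."""
    in_rate, out_rate = PRICES[model]
    return input_mtok * in_rate + output_mtok * out_rate

for output_mtok in (1, 10, 100):
    cheap = monthly_cost("Devstral Small 1.1", 0, output_mtok)
    pricey = monthly_cost("Gemini 3 Flash Preview", 0, output_mtok)
    print(f"{output_mtok:>4}M output tokens/month: "
          f"${cheap:,.2f} vs ${pricey:,.2f} "
          f"(gap ${12 * (pricey - cheap):,.2f}/year)")
```

At 10M output tokens a month the annual gap is a few hundred dollars; the 10x multiplier only starts to dominate budgets at hundreds of millions of output tokens per month.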
Bottom Line
Choose Devstral Small 1.1 if: you're running high-volume text pipelines (10M+ output tokens/month) where cost dominates, your tasks fall into classification or structured data extraction where it holds its own, you need reasonable safety calibration behavior, or you're specifically constrained to text-in/text-out workflows. Also consider it if Gemini 3 Flash Preview's weaknesses (safety calibration score of 1) are a hard blocker for your use case.
Choose Gemini 3 Flash Preview if: you're building agentic systems — it scores 5/5 on both tool calling and agentic planning vs Devstral Small 1.1's 4 and 2. Choose it for multi-modal inputs (it accepts text, image, file, audio, and video), for tasks requiring strategic reasoning (5 vs 2 in our testing), for long-context retrieval over documents exceeding 131K tokens, for creative and open-ended generation, or for multi-turn chat where persona consistency matters (5 vs 2). Its 75.4% SWE-bench Verified score (Epoch AI, rank 3 of 12) also makes it a strong candidate for coding assistance at production quality. The 10x output cost premium is justified when task complexity demands it — but evaluate whether your volume makes that math work.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.