Devstral 2 2512 vs Gemini 3 Flash Preview

For most production agentic, coding, and reasoning tasks we pick Gemini 3 Flash Preview: it wins 7 of 12 benchmarks and scores 5 on tool_calling, agentic_planning, and faithfulness in our tests. Devstral 2 2512 is the better value if cost or constrained rewriting matters: it wins constrained_rewriting, and its output tokens cost about 33% less ($2 vs $3 per million).

mistral

Devstral 2 2512

Overall: 4.00/5 (Strong)

Benchmark Scores

Faithfulness: 4/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 4/5
Persona Consistency: 4/5
Constrained Rewriting: 5/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.400/MTok
Output: $2.00/MTok
Context Window: 262K

modelpicker.net

google

Gemini 3 Flash Preview

Overall: 4.50/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 5/5

External Benchmarks

SWE-bench Verified: 75.4%
MATH Level 5: N/A
AIME 2025: 92.8%

Pricing

Input: $0.500/MTok
Output: $3.00/MTok
Context Window: 1049K


Benchmark Analysis

Summary of head-to-head results in our 12-test suite: Gemini 3 Flash Preview wins 7 tests, Devstral 2 2512 wins 1, and 4 tests tie.

Where Gemini wins:
strategic_analysis (5 vs 4): Gemini is tied for 1st (rank 1 of 54).
creative_problem_solving (5 vs 4): Gemini is tied for 1st.
tool_calling (5 vs 4): Gemini is tied for 1st with 16 other models.
faithfulness (5 vs 4): Gemini is tied for 1st (rank 1 of 55).
classification (4 vs 3): Gemini is tied for 1st with 29 other models.
persona_consistency (5 vs 4): Gemini is tied for 1st.
agentic_planning (5 vs 4): Gemini is tied for 1st.

Devstral's win: constrained_rewriting (5 vs 4). Devstral is tied for 1st with 4 other models out of 53 tested, which matters for strict-length compression or tight character-limit tasks.

Ties: structured_output (5 each), long_context (5 each), multilingual (5 each), and safety_calibration (1 each).

Practical meaning: Gemini's higher scores and top-tier ranks on tool_calling, agentic_planning, and faithfulness translate to more accurate function selection, stronger multi-step goal decomposition, and fewer deviations from source material in our tests. Devstral matches Gemini on long context and structured output and outperforms it on constrained rewriting, so it is preferable for tasks requiring tight output compression.

External benchmarks (supplementary): Gemini 3 Flash Preview scores 75.4% on SWE-bench Verified (Epoch AI) and 92.8% on AIME 2025 (Epoch AI), which supports its coding and high-level math performance. No external scores are available for Devstral.

Safety: both models score 1 on safety_calibration and rank 32 of 55 in our tests; neither reliably refuses disallowed or harmful prompts in our suite.

Benchmark | Devstral 2 2512 | Gemini 3 Flash Preview
Faithfulness | 4/5 | 5/5
Long Context | 5/5 | 5/5
Multilingual | 5/5 | 5/5
Tool Calling | 4/5 | 5/5
Classification | 3/5 | 4/5
Agentic Planning | 4/5 | 5/5
Structured Output | 5/5 | 5/5
Safety Calibration | 1/5 | 1/5
Strategic Analysis | 4/5 | 5/5
Persona Consistency | 4/5 | 5/5
Constrained Rewriting | 5/5 | 4/5
Creative Problem Solving | 4/5 | 5/5
Summary | 1 win | 7 wins
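The win/tie counts and the Overall figures can be re-derived from the table. The sketch below copies the twelve scores from the table and tallies them. Treating Overall as an unweighted mean is an assumption: it reproduces 4.00 and 4.50, but the page does not state how Overall is computed. The names `SCORES`, `tally`, and `mean_score` are illustrative only.

```python
# Scores copied from the comparison table, as (Devstral 2 2512, Gemini 3 Flash Preview).
SCORES = {
    "faithfulness": (4, 5),
    "long_context": (5, 5),
    "multilingual": (5, 5),
    "tool_calling": (4, 5),
    "classification": (3, 4),
    "agentic_planning": (4, 5),
    "structured_output": (5, 5),
    "safety_calibration": (1, 1),
    "strategic_analysis": (4, 5),
    "persona_consistency": (4, 5),
    "constrained_rewriting": (5, 4),
    "creative_problem_solving": (4, 5),
}


def tally(scores):
    """Return (devstral_wins, gemini_wins, ties) across all benchmarks."""
    devstral_wins = sum(1 for d, g in scores.values() if d > g)
    gemini_wins = sum(1 for d, g in scores.values() if g > d)
    ties = sum(1 for d, g in scores.values() if d == g)
    return devstral_wins, gemini_wins, ties


def mean_score(scores, index):
    """Unweighted mean for one model (ASSUMPTION: this is how Overall is derived)."""
    return sum(pair[index] for pair in scores.values()) / len(scores)


print(tally(SCORES))                                   # (1, 7, 4)
print(mean_score(SCORES, 0), mean_score(SCORES, 1))    # 4.0 4.5
```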

Pricing Analysis

Per-million-token pricing (input/output): Devstral 2 2512 charges $0.40 input / $2.00 output; Gemini 3 Flash Preview charges $0.50 input / $3.00 output. Assuming a 50/50 split of input and output tokens, the cost at three monthly volumes is:

1M tokens: Devstral $1.20 vs Gemini $1.75 (Gemini +$0.55, +45.8%)
10M tokens: Devstral $12.00 vs Gemini $17.50 (Gemini +$5.50)
100M tokens: Devstral $120.00 vs Gemini $175.00 (Gemini +$55.00)

If your workload is output-heavy (e.g., mostly generation), the gap is larger: for 1M output-only tokens Devstral costs $2.00 vs Gemini $3.00 (+$1.00 per million, or 1.5x). Teams generating tens of millions of tokens should care: the difference scales linearly with volume, and Gemini costs roughly 1.46x more under a balanced split.
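The arithmetic above can be reproduced with a small helper. This is a minimal sketch using the listed per-million-token prices; the function name `blended_cost` and the `output_share` parameter are illustrative assumptions, not part of any provider API.

```python
# Prices in USD per million tokens, as listed in the pricing section.
PRICES = {
    "Devstral 2 2512": {"input": 0.40, "output": 2.00},
    "Gemini 3 Flash Preview": {"input": 0.50, "output": 3.00},
}


def blended_cost(model, total_tokens, output_share=0.5):
    """Cost in USD for total_tokens, of which a fraction output_share is output."""
    price = PRICES[model]
    millions = total_tokens / 1_000_000
    per_million = (1 - output_share) * price["input"] + output_share * price["output"]
    return millions * per_million


# Compare the two models at input-only, balanced, and output-only mixes.
for share in (0.0, 0.5, 1.0):
    d = blended_cost("Devstral 2 2512", 1_000_000, share)
    g = blended_cost("Gemini 3 Flash Preview", 1_000_000, share)
    print(f"output share {share:.0%}: Devstral ${d:.2f}, Gemini ${g:.2f}, ratio {g / d:.2f}x")
```

Under these prices the Gemini-to-Devstral cost ratio runs from 1.25x for input-only traffic to 1.50x for output-only traffic, with the balanced split at about 1.46x.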

Real-World Cost Comparison

Task | Devstral 2 2512 | Gemini 3 Flash Preview
Chat response | $0.0011 | $0.0016
Blog post | $0.0042 | $0.0063
Document batch | $0.108 | $0.160
Pipeline run | $1.08 | $1.60
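Per-task figures like these follow from input and output token counts multiplied by the listed prices. The page does not publish the workload behind each row, so the token counts in the sketch below are hypothetical: they were chosen only because they land within rounding of the table's figures. Substitute your own measurements.

```python
# Prices in USD per million tokens, as listed in the pricing section.
PRICES = {
    "Devstral 2 2512": {"input": 0.40, "output": 2.00},
    "Gemini 3 Flash Preview": {"input": 0.50, "output": 3.00},
}

# HYPOTHETICAL (input_tokens, output_tokens) per task. These are assumptions,
# not values published by the source; replace them with real usage data.
ASSUMED_TASKS = {
    "Chat response": (200, 500),
    "Blog post": (500, 2_000),
    "Document batch": (20_000, 50_000),
    "Pipeline run": (200_000, 500_000),
}


def task_cost(model, input_tokens, output_tokens):
    """Cost in USD of one task given its input and output token counts."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


for task, (tokens_in, tokens_out) in ASSUMED_TASKS.items():
    d = task_cost("Devstral 2 2512", tokens_in, tokens_out)
    g = task_cost("Gemini 3 Flash Preview", tokens_in, tokens_out)
    print(f"{task}: Devstral ${d:.4f}, Gemini ${g:.4f}")
```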

Bottom Line

Choose Devstral 2 2512 if: you need a lower-cost option (about 31% cheaper under a balanced input/output split, $1.20 vs $1.75 per million tokens, and about 33% cheaper on output), you require strong constrained_rewriting or top-tier structured output within a 262K-token context window, and you prioritize cost-efficiency at scale.

Choose Gemini 3 Flash Preview if: you need best-in-suite tool calling, agentic planning, classification, faithfulness, and creative problem solving (it wins 7 of 12 benchmarks and is tied for 1st on each of them), require multimodal inputs or a very large context window (1,048,576 tokens), and can justify the ~1.46x cost under balanced token usage.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions