Devstral Small 1.1 vs Gemini 3.1 Flash Lite Preview

Gemini 3.1 Flash Lite Preview wins 9 of 12 benchmarks in our testing, outscoring Devstral Small 1.1 on strategic analysis, structured output, agentic planning, creative problem solving, faithfulness, safety calibration, persona consistency, multilingual, and constrained rewriting. Devstral Small 1.1's only benchmark win is classification, where it scores 4/5 vs Gemini 3.1 Flash Lite Preview's 3/5. The tradeoff is real: Devstral Small 1.1 costs $0.10/$0.30 per million tokens (input/output) vs $0.25/$1.50 for Gemini 3.1 Flash Lite Preview — making the latter 5x more expensive on output at significantly higher quality across most dimensions.

Mistral

Devstral Small 1.1

Overall: 3.08/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 2/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Persona Consistency: 2/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.100/MTok
Output: $0.300/MTok
Context Window: 131K tokens

modelpicker.net

Google

Gemini 3.1 Flash Lite Preview

Overall: 4.42/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 4/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 5/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.250/MTok
Output: $1.50/MTok
Context Window: 1049K tokens


Benchmark Analysis

Across our 12-test suite, Gemini 3.1 Flash Lite Preview wins 9 benchmarks, Devstral Small 1.1 wins 1, and they tie on 2.

Where Gemini 3.1 Flash Lite Preview leads:

  • Structured output (5 vs 4): Gemini ties for 1st among 54 models; Devstral ranks 26th. For production pipelines relying on JSON schema compliance, this gap matters.
  • Strategic analysis (5 vs 2): Gemini ties for 1st among 54 models; Devstral ranks 44th — a severe gap. Complex tradeoff reasoning, business analysis, and nuanced decision-making tasks strongly favor Gemini.
  • Agentic planning (4 vs 2): Gemini ranks 16th of 54; Devstral ranks dead last at 53rd of 54. This is a disqualifying weakness for Devstral in any agentic workflow — goal decomposition and failure recovery are foundational to autonomous AI pipelines.
  • Creative problem solving (4 vs 2): Gemini ranks 9th of 54; Devstral ranks 47th. Non-obvious ideation and lateral thinking tasks go to Gemini by a wide margin.
  • Faithfulness (5 vs 4): Gemini ties for 1st among 55 models; Devstral ranks 34th. For RAG applications or summarization where hallucination risk is high, Gemini is the safer choice.
  • Safety calibration (5 vs 2): Gemini ties for 1st among 55 models; Devstral ranks 12th but scores only 2/5 — at the median (p50: 2). Gemini correctly refuses harmful requests while permitting legitimate ones at the highest level in our testing.
  • Persona consistency (5 vs 2): Gemini ties for 1st among 53 models; Devstral ranks 51st. For chatbot or character applications, Devstral is a poor fit.
  • Multilingual (5 vs 4): Gemini ties for 1st among 55 models; Devstral ranks 36th. Non-English applications should strongly prefer Gemini.
  • Constrained rewriting (4 vs 3): Gemini ranks 6th of 53; Devstral ranks 31st. Compression tasks with hard character limits favor Gemini.
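The structured-output gap matters most wherever a pipeline hard-fails on malformed model responses. A minimal sketch of such a validation gate, using only the Python standard library (the required keys here are hypothetical, standing in for whatever schema your pipeline expects):

```python
import json

# Hypothetical schema for a routing/classification task
REQUIRED_KEYS = {"label", "confidence"}

def accept(raw: str) -> bool:
    """Return True only if the model response is valid JSON with the expected shape."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()

print(accept('{"label": "billing", "confidence": 0.92}'))   # True
print(accept('Sure! Here is the JSON: {"label": "billing"}'))  # False — prose wrapper breaks parsing
```

A model that scores lower on structured output fails this kind of gate more often, which translates directly into retries and added cost.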

Where Devstral Small 1.1 leads:

  • Classification (4 vs 3): Devstral ties for 1st among 53 models; Gemini ranks 31st. This is Devstral's clearest competitive advantage — routing, tagging, and categorization tasks.

Ties:

  • Tool calling (4 vs 4): Both rank 18th of 54, with 29 models sharing this score. No meaningful difference in function-calling reliability.
  • Long context (4 vs 4): Both rank 38th of 55. Both handle 30K+ token retrieval comparably — though Gemini's 1,048,576-token context window dwarfs Devstral's 131,072 tokens, which may matter for very long documents even if the benchmark score is equal.

Benchmark | Devstral Small 1.1 | Gemini 3.1 Flash Lite Preview
Faithfulness | 4/5 | 5/5
Long Context | 4/5 | 4/5
Multilingual | 4/5 | 5/5
Tool Calling | 4/5 | 4/5
Classification | 4/5 | 3/5
Agentic Planning | 2/5 | 4/5
Structured Output | 4/5 | 5/5
Safety Calibration | 2/5 | 5/5
Strategic Analysis | 2/5 | 5/5
Persona Consistency | 2/5 | 5/5
Constrained Rewriting | 3/5 | 4/5
Creative Problem Solving | 2/5 | 4/5
Summary | 1 win | 9 wins
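The summary tally can be reproduced directly from the per-benchmark scores; a quick sketch:

```python
# Per-benchmark scores: (Devstral Small 1.1, Gemini 3.1 Flash Lite Preview)
scores = {
    "Faithfulness": (4, 5),
    "Long Context": (4, 4),
    "Multilingual": (4, 5),
    "Tool Calling": (4, 4),
    "Classification": (4, 3),
    "Agentic Planning": (2, 4),
    "Structured Output": (4, 5),
    "Safety Calibration": (2, 5),
    "Strategic Analysis": (2, 5),
    "Persona Consistency": (2, 5),
    "Constrained Rewriting": (3, 4),
    "Creative Problem Solving": (2, 4),
}

devstral_wins = sum(d > g for d, g in scores.values())
gemini_wins = sum(g > d for d, g in scores.values())
ties = sum(d == g for d, g in scores.values())
print(devstral_wins, gemini_wins, ties)  # 1 9 2
```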

Pricing Analysis

Devstral Small 1.1 costs $0.10/M input tokens and $0.30/M output tokens. Gemini 3.1 Flash Lite Preview costs $0.25/M input and $1.50/M output — 2.5x more expensive on input and 5x more on output. At 1M output tokens/month, that's $0.30 vs $1.50 — a $1.20 difference that's negligible. At 10M output tokens, it's $3 vs $15 — a $12 gap that's still manageable. At 100M output tokens/month, you're looking at $30 vs $150 per month, or $360 vs $1,800 annually — a $1,440 difference that starts to demand justification, and one that scales linearly from there. For high-volume pipelines where classification is the primary task (Devstral Small 1.1's one benchmark win), the cheaper model makes a compelling case. For agentic workflows, multilingual products, or applications requiring high faithfulness and safety calibration, Gemini 3.1 Flash Lite Preview's quality premium buys real capability gains. Developers running cost-sensitive batch workloads on text-only inputs should weigh the 5x output cost gap heavily; those building multimodal applications will find Gemini 3.1 Flash Lite Preview is the only option here, as Devstral Small 1.1 supports text-only input and output.
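The break-even arithmetic above is easy to reproduce for your own traffic mix; a minimal sketch using the published rates (the volumes are hypothetical):

```python
def monthly_cost(input_mtok: float, output_mtok: float,
                 input_rate: float, output_rate: float) -> float:
    """Cost in USD for one month of usage; volumes in millions of tokens."""
    return input_mtok * input_rate + output_mtok * output_rate

# Published rates in $ per million tokens: (input, output)
DEVSTRAL = (0.10, 0.30)
GEMINI = (0.25, 1.50)

# Output-only comparison at 100M output tokens/month
d = monthly_cost(0, 100, *DEVSTRAL)  # ≈ $30/month
g = monthly_cost(0, 100, *GEMINI)    # ≈ $150/month
print(d, g, (g - d) * 12)            # monthly costs and the annual gap
```

Plugging in real input/output ratios matters: agentic workloads are output-heavy, which is exactly where the 5x multiplier bites.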

Real-World Cost Comparison

Task | Devstral Small 1.1 | Gemini 3.1 Flash Lite Preview
Chat response | <$0.001 | <$0.001
Blog post | <$0.001 | $0.0031
Document batch | $0.017 | $0.080
Pipeline run | $0.170 | $0.800

Bottom Line

Choose Devstral Small 1.1 if: Your primary workload is classification, routing, or categorization at high volume — it ties for 1st on our classification benchmark and costs 5x less on output tokens. It's also the only viable option if your stack requires text-in/text-out pipelines at the lowest possible API cost and quality on classification is your north star metric. Be aware that agentic planning (rank 53/54) and persona consistency (rank 51/53) are genuine weaknesses.

Choose Gemini 3.1 Flash Lite Preview if: You need a general-purpose model that performs well across the board — especially for agentic workflows (rank 16/54 vs Devstral's 53/54), strategic analysis (rank 1/54), multilingual output (rank 1/55), or applications where safety calibration and faithfulness matter. Gemini 3.1 Flash Lite Preview also supports multimodal input (text, image, file, audio, video), making it the only option here for non-text inputs. The 5x output cost premium is justified for any use case beyond high-volume classification.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
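The overall scores on the cards above appear to be the arithmetic mean of the twelve 1–5 benchmark scores (an inference from the published numbers, not a documented formula); a quick check:

```python
# Scores in card order: faithfulness, long context, multilingual, tool calling,
# classification, agentic planning, structured output, safety calibration,
# strategic analysis, persona consistency, constrained rewriting, creative problem solving
devstral = [4, 4, 4, 4, 4, 2, 4, 2, 2, 2, 3, 2]
gemini   = [5, 4, 5, 4, 3, 4, 5, 5, 5, 5, 4, 4]

def overall(scores: list[int]) -> float:
    return round(sum(scores) / len(scores), 2)

print(overall(devstral), overall(gemini))  # 3.08 4.42
```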

Frequently Asked Questions