Devstral Small 1.1 vs GPT-5.2

GPT-5.2 is the practical winner for most production AI tasks, scoring higher on agentic planning, safety, long context, faithfulness, and creative problem solving in our 12-test suite. Devstral Small 1.1 is the cost-efficient alternative: a much lower price ($0.40/MTok vs $15.75/MTok, combined input + output) and a reasonable pick when budget or high request volume dominates requirements.

Mistral

Devstral Small 1.1

Overall
3.08/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
2/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
2/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.100/MTok

Output

$0.300/MTok

Context Window: 131K

modelpicker.net

OpenAI

GPT-5.2

Overall
4.67/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
5/5
Structured Output
4/5
Safety Calibration
5/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
5/5

External Benchmarks

SWE-bench Verified
73.8%
MATH Level 5
N/A
AIME 2025
96.1%

Pricing

Input

$1.75/MTok

Output

$14.00/MTok

Context Window: 400K


Benchmark Analysis

Overview: Across our 12-test suite, GPT-5.2 wins 9 categories, Devstral Small 1.1 wins 0, and 3 are ties. Scores (Devstral vs GPT-5.2):

  • Agentic planning: 2 vs 5. GPT-5.2 wins and ranks tied for 1st (shared with 14 other models), so expect stronger goal decomposition and failure recovery in multi-step agents. Devstral's 2 (rank 53 of 54) indicates weaker decomposition.
  • Safety calibration: 2 vs 5. GPT-5.2 wins (tied for 1st), better refusing harmful requests while allowing legitimate ones in our tests; Devstral's 2 is comparatively low.
  • Long context: 4 vs 5. GPT-5.2 wins and ties for 1st; Devstral's 4 (rank 38 of 55) still handles long context but trails GPT-5.2's retrieval accuracy at 30K+ tokens.
  • Faithfulness: 4 vs 5. GPT-5.2 wins (tied for 1st), so expect fewer hallucinations in source-driven tasks; Devstral's 4 indicates reasonable adherence but not top-tier.
  • Creative problem solving: 2 vs 5. GPT-5.2 wins and ties for 1st; Devstral scored 2, so GPT-5.2 produces more novel, specific, and feasible ideas in our tests.
  • Strategic analysis: 2 vs 5. GPT-5.2 wins (tied for 1st), valuable when nuanced tradeoffs and numeric reasoning matter.
  • Constrained rewriting: 3 vs 4. GPT-5.2 wins (rank 6 of 53) and is better at tight character-limited rewrites; Devstral's 3 is middling.
  • Persona consistency: 2 vs 5. GPT-5.2 wins and ties for 1st; Devstral ranks poorly (rank 51 of 53), so GPT-5.2 better maintains a role or persona in dialogue and resists prompt injection.
  • Multilingual: 4 vs 5. GPT-5.2 wins (tied for 1st); Devstral's 4 is decent but behind on non-English parity.

Ties (both models score 4): structured output, tool calling, and classification. The two models match on JSON/schema compliance, function selection and argument sequencing, and categorization tasks, and rank at similar positions for those tests.

External benchmarks (via Epoch AI): GPT-5.2 scores 73.8% on SWE-bench Verified (rank 5 of 12) and 96.1% on AIME 2025 (rank 1 of 23); these external measures corroborate GPT-5.2's strength on coding and hard-math tasks. Devstral Small 1.1 has no external SWE-bench or AIME scores listed.
Benchmark                 Devstral Small 1.1  GPT-5.2
Faithfulness              4/5                 5/5
Long Context              4/5                 5/5
Multilingual              4/5                 5/5
Tool Calling              4/5                 4/5
Classification            4/5                 4/5
Agentic Planning          2/5                 5/5
Structured Output         4/5                 4/5
Safety Calibration        2/5                 5/5
Strategic Analysis        2/5                 5/5
Persona Consistency       2/5                 5/5
Constrained Rewriting     3/5                 4/5
Creative Problem Solving  2/5                 5/5
Summary                   0 wins              9 wins
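The win/tie tally above can be reproduced directly from the per-benchmark scores; here is a quick Python sanity check (score pairs copied from the table):

```python
# Head-to-head tally from the 12-test suite: (Devstral score, GPT-5.2 score).
scores = {
    "Faithfulness": (4, 5),
    "Long Context": (4, 5),
    "Multilingual": (4, 5),
    "Tool Calling": (4, 4),
    "Classification": (4, 4),
    "Agentic Planning": (2, 5),
    "Structured Output": (4, 4),
    "Safety Calibration": (2, 5),
    "Strategic Analysis": (2, 5),
    "Persona Consistency": (2, 5),
    "Constrained Rewriting": (3, 4),
    "Creative Problem Solving": (2, 5),
}

devstral_wins = sum(d > g for d, g in scores.values())
gpt_wins = sum(g > d for d, g in scores.values())
ties = sum(d == g for d, g in scores.values())

print(devstral_wins, gpt_wins, ties)  # 0 9 3
```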

Pricing Analysis

Per the listed prices, Devstral Small 1.1 charges $0.10 input + $0.30 output = $0.40 per MTok combined; GPT-5.2 charges $1.75 input + $14.00 output = $15.75 per MTok combined. At 1B tokens/month (1,000 MTok): Devstral = $400; GPT-5.2 = $15,750. At 10B tokens (10,000 MTok): Devstral = $4,000; GPT-5.2 = $157,500. At 100B tokens (100,000 MTok): Devstral = $40,000; GPT-5.2 = $1,575,000. The price ratio is ~0.0254, so Devstral costs ~2.5% of GPT-5.2 per MTok. High-volume apps, startups, and cost-constrained deployments should care most about this gap; teams needing top-tier safety, long-context, agentic planning, or math/engineering performance may justify GPT-5.2's much higher spend.
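The volume math follows directly from the per-MTok rates on the pricing cards; a minimal sketch, treating the combined rate as the sum of the input and output price per MTok:

```python
# Combined per-MTok rate = input rate + output rate (from the pricing cards).
DEVSTRAL = 0.10 + 0.30   # $0.40/MTok combined
GPT_5_2 = 1.75 + 14.00   # $15.75/MTok combined

def monthly_cost(mtok_per_month: float, combined_rate: float) -> float:
    """Monthly spend at a given volume, in dollars."""
    return mtok_per_month * combined_rate

for volume in (1_000, 10_000, 100_000):  # MTok/month (1B, 10B, 100B tokens)
    print(volume, monthly_cost(volume, DEVSTRAL), monthly_cost(volume, GPT_5_2))

# Price ratio: Devstral costs about 2.5% of GPT-5.2 per MTok.
print(round(DEVSTRAL / GPT_5_2, 4))  # 0.0254
```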

Real-World Cost Comparison

Task            Devstral Small 1.1  GPT-5.2
Chat response   <$0.001             $0.0073
Blog post       <$0.001             $0.029
Document batch  $0.017              $0.735
Pipeline run    $0.170              $7.35
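Per-task figures like these come from multiplying token counts by each model's input and output rates. A sketch of that calculation, where the token counts are illustrative assumptions (not the site's actual task definitions), chosen to land near the chat-response row above:

```python
# ($/MTok input, $/MTok output) from the pricing section.
RATES = {
    "Devstral Small 1.1": (0.10, 0.30),
    "GPT-5.2": (1.75, 14.00),
}

def task_cost(model: str, in_tokens: int, out_tokens: int) -> float:
    """Dollar cost of one task from its input/output token counts."""
    rate_in, rate_out = RATES[model]
    return (in_tokens * rate_in + out_tokens * rate_out) / 1_000_000

# Assumed chat turn: ~300 tokens in, ~480 tokens out (hypothetical sizes).
print(task_cost("GPT-5.2", 300, 480))             # in the ballpark of $0.0073
print(task_cost("Devstral Small 1.1", 300, 480))  # well under $0.001
```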

Bottom Line

Choose Devstral Small 1.1 if: you need a far lower-cost model for high-volume text-to-text pipelines, are building budget-conscious engineering agents, or can accept weaker agentic planning, safety, and long-context performance in exchange for $0.40/MTok and a 131,072-token context window. Choose GPT-5.2 if: you prioritize best-in-class agentic planning, safety calibration, long-context fidelity, faithfulness, creative problem solving, and multilingual and advanced-math performance (GPT-5.2 scores 5 where Devstral scores 2–4), and you can justify $15.75/MTok for substantially higher quality plus broader modality and context support (text, image, and file inputs).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions