Devstral Small 1.1 vs GPT-5.2
GPT-5.2 is the practical winner for most production AI tasks, scoring higher on agentic planning, safety, long-context, faithfulness, and creative problem solving in our 12-test suite. Devstral Small 1.1 is the cost-efficient alternative: its combined input+output rate is far lower ($0.40/MTok vs $15.75/MTok), making it a reasonable pick when budget or high request volume dominates requirements.
Devstral Small 1.1 (Mistral)
Pricing: $0.10/MTok input, $0.30/MTok output
(Source: modelpicker.net)
GPT-5.2 (OpenAI)
Pricing: $1.75/MTok input, $14.00/MTok output
Benchmark Analysis
Overview: Across our 12-test suite, GPT-5.2 wins 9 categories, Devstral Small 1.1 wins 0, and 3 are ties. Scores (Devstral vs GPT-5.2):
- Agentic planning: 2 vs 5 — GPT-5.2 wins, tied for 1st with 14 other models, so expect stronger goal decomposition and failure recovery in multi-step agents. Devstral’s 2 (rank 53 of 54) indicates weaker decomposition.
- Safety calibration: 2 vs 5 — GPT-5.2 wins (tied for 1st), better refusing harmful requests while allowing legitimate ones in our tests; Devstral’s 2 is comparatively low despite a nominal rank of 12 of 55, a position shared with many models.
- Long-context: 4 vs 5 — GPT-5.2 wins and ties for 1st (long-context rank tied for 1st); Devstral’s 4 (rank 38 of 55) still handles long context but trails GPT-5.2’s retrieval accuracy at 30K+ tokens.
- Faithfulness: 4 vs 5 — GPT-5.2 wins (tied for 1st), so fewer hallucinations in source-driven tasks; Devstral’s 4 indicates reasonable adherence but not top-tier.
- Creative problem solving: 2 vs 5 — GPT-5.2 wins and ties for 1st (creative problem solving rank 1); Devstral scored 2, so GPT-5.2 produces more novel, specific, feasible ideas in our tests.
- Strategic analysis: 2 vs 5 — GPT-5.2 wins (tied for 1st), valuable when nuanced tradeoffs and numeric reasoning matter.
- Constrained rewriting: 3 vs 4 — GPT-5.2 wins (rank 6 of 53), better at tight character-limited rewrites; Devstral’s 3 is middling.
- Persona consistency: 2 vs 5 — GPT-5.2 wins and ties for 1st; Devstral ranks poorly (rank 51 of 53), so GPT-5.2 better maintains role/persona in dialogue and resists prompt injection.
- Multilingual: 4 vs 5 — GPT-5.2 wins (tied for 1st); Devstral’s 4 is decent but behind on non-English parity.

Ties (both models score 4): structured output, tool calling, and classification — the models match on JSON/schema compliance, function selection and argument sequencing, and categorization tasks, and the rankings place both at similar positions for those tests.

External benchmarks (Epoch AI): GPT-5.2 scores 73.8% on SWE-bench Verified (rank 5 of 12) and 96.1% on AIME 2025 (rank 1 of 23) — these external measures corroborate GPT-5.2’s strength on coding and hard-math tasks. Devstral has no external SWE-bench or AIME scores in the payload.
Pricing Analysis
Per the payload, Devstral Small 1.1 charges $0.10/MTok input + $0.30/MTok output ($0.40/MTok combined); GPT-5.2 charges $1.75/MTok input + $14.00/MTok output ($15.75/MTok combined). Assuming equal input and output volumes: at 1,000 MTok each per month (about 1B tokens each way), Devstral = $400 and GPT-5.2 = $15,750; at 10,000 MTok each, $4,000 vs $157,500; at 100,000 MTok each, $40,000 vs $1,575,000. The payload’s priceRatio of ~0.0214 matches the output-price ratio ($0.30 / $14.00); on combined rates, Devstral costs about 2.5% of GPT-5.2 per MTok. High-volume apps, startups, and cost-constrained deployments should care most about this gap; teams needing top-tier safety, long-context, agentic planning, or math/engineering performance may justify GPT-5.2’s much higher spend.
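The monthly totals above can be reproduced with a short script. This is a minimal sketch: the per-MTok rates come from the pricing data above, while the `RATES` table and `monthly_cost` helper are illustrative names, not part of any API.

```python
# Illustrative cost calculator; rates are USD per MTok, taken from the payload above.
RATES = {
    "Devstral Small 1.1": {"input": 0.10, "output": 0.30},
    "GPT-5.2": {"input": 1.75, "output": 14.00},
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """USD cost for one month, given input/output volume in millions of tokens."""
    r = RATES[model]
    return input_mtok * r["input"] + output_mtok * r["output"]

# 1,000 MTok (~1B tokens) each of input and output per month:
for model in RATES:
    print(f"{model}: ${monthly_cost(model, 1000, 1000):,.2f}")
```

Varying the input/output split shifts the totals toward the cheaper input rate, so output-heavy workloads feel the gap between the two models most.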
Bottom Line
Choose Devstral Small 1.1 if: you need a far lower-cost model for high-volume text-to-text pipelines, are building budget-conscious engineering agents, or can accept weaker agentic planning, safety, and long-context performance in exchange for a $0.40/MTok combined rate and a 131,072-token context window. Choose GPT-5.2 if: you prioritize best-in-class agentic planning, safety calibration, long-context fidelity, faithfulness, creative problem solving, and multilingual/advanced-math performance (GPT-5.2 scores 5 where Devstral scores 2–4 across those tests), and you can justify $15.75/MTok combined for substantially higher quality and broader modality and context support (text + image + file input).
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.