Devstral Small 1.1 vs GPT-5.1

In our testing GPT-5.1 is the better all-purpose model: it wins 8 of our 12 benchmarks and outperforms Devstral Small 1.1 on long-context, faithfulness, creative problem solving, and multilingual tasks. Devstral Small 1.1 is the cost-efficient alternative: it matches GPT-5.1 on structured output, classification, and tool calling at a small fraction of the price.

Devstral Small 1.1 (Mistral)

Overall: 3.08/5 (Usable)

Benchmark Scores
  • Faithfulness: 4/5
  • Long Context: 4/5
  • Multilingual: 4/5
  • Tool Calling: 4/5
  • Classification: 4/5
  • Agentic Planning: 2/5
  • Structured Output: 4/5
  • Safety Calibration: 2/5
  • Strategic Analysis: 2/5
  • Persona Consistency: 2/5
  • Constrained Rewriting: 3/5
  • Creative Problem Solving: 2/5

External Benchmarks
  • SWE-bench Verified: N/A
  • MATH Level 5: N/A
  • AIME 2025: N/A

Pricing
  • Input: $0.100/MTok
  • Output: $0.300/MTok

Context Window: 131K

modelpicker.net

GPT-5.1 (OpenAI)

Overall: 4.25/5 (Strong)

Benchmark Scores
  • Faithfulness: 5/5
  • Long Context: 5/5
  • Multilingual: 5/5
  • Tool Calling: 4/5
  • Classification: 4/5
  • Agentic Planning: 4/5
  • Structured Output: 4/5
  • Safety Calibration: 2/5
  • Strategic Analysis: 5/5
  • Persona Consistency: 5/5
  • Constrained Rewriting: 4/5
  • Creative Problem Solving: 4/5

External Benchmarks
  • SWE-bench Verified: 68.0%
  • MATH Level 5: N/A
  • AIME 2025: 88.6%

Pricing
  • Input: $1.25/MTok
  • Output: $10.00/MTok

Context Window: 400K


Benchmark Analysis

Walkthrough of our 12-test suite (model scores are from our testing):

  • Ties: structured output (both 4), tool calling (both 4), classification (both 4), safety calibration (both 2). For schema/JSON tasks and function selection, both models perform equivalently in our tests.
  • GPT-5.1 wins: faithfulness 5 vs 4 (tied for 1st of 55 models), long context 5 vs 4 (tied for 1st of 55), creative problem solving 4 vs 2 (9th of 54), multilingual 5 vs 4 (tied for 1st of 55), persona consistency 5 vs 2 (tied for 1st of 53), agentic planning 4 vs 2 (16th of 54), strategic analysis 5 vs 2 (tied for 1st of 54), constrained rewriting 4 vs 3 (6th of 53). These wins make GPT-5.1 measurably stronger at maintaining factual fidelity (lower hallucination risk), retrieval and reasoning over very long contexts, multilingual parity, character consistency, multi-step planning, and nuanced tradeoff reasoning.
  • Devstral Small 1.1 has no outright wins in our 12-test comparison; it ties on several practical engineering tasks (structured output, tool calling, classification). That explains why Devstral is attractive for engineering agents that need reliable schema adherence and lower-cost bulk inference.
  • External benchmarks (Epoch AI): GPT-5.1 scores 68.0% on SWE-bench Verified (rank 7 of 12) and 88.6% on AIME 2025 (rank 7 of 23). Devstral Small 1.1 has no external SWE-bench or AIME scores in our data. Treat these external results as supplementary evidence that GPT-5.1 is strong on coding and math benchmarks.
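For the schema-adherence tasks where the two models tie, a minimal sketch of the kind of check an engineering agent might run on model output (the schema and sample outputs below are illustrative, not from either model's API):

```python
import json

# Expected shape of the model's JSON output: key -> required Python type.
# This schema is a made-up example for illustration.
SCHEMA = {"name": str, "priority": int, "tags": list}

def conforms(raw: str, schema: dict) -> bool:
    """Return True if `raw` parses as JSON with exactly the expected
    keys, each holding a value of the expected type."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if set(obj) != set(schema):
        return False
    return all(isinstance(obj[k], t) for k, t in schema.items())

good = '{"name": "deploy", "priority": 2, "tags": ["infra"]}'
bad = '{"name": "deploy", "priority": "high"}'  # wrong type, missing key

print(conforms(good, SCHEMA))  # True
print(conforms(bad, SCHEMA))   # False
```

In a production agent you would typically reach for a full JSON Schema validator instead, but the retry-on-failure loop around a check like this is the same.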
Benchmark                  Devstral Small 1.1  GPT-5.1
Faithfulness               4/5                 5/5
Long Context               4/5                 5/5
Multilingual               4/5                 5/5
Tool Calling               4/5                 4/5
Classification             4/5                 4/5
Agentic Planning           2/5                 4/5
Structured Output          4/5                 4/5
Safety Calibration         2/5                 2/5
Strategic Analysis         2/5                 5/5
Persona Consistency        2/5                 5/5
Constrained Rewriting      3/5                 4/5
Creative Problem Solving   2/5                 4/5
Summary                    0 wins              8 wins
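The overall ratings line up with a simple mean of the twelve per-test scores; a quick check (scores copied from the table above, in table order):

```python
# Per-test scores in table order: Faithfulness ... Creative Problem Solving.
devstral = [4, 4, 4, 4, 4, 2, 4, 2, 2, 2, 3, 2]
gpt51    = [5, 5, 5, 4, 4, 4, 4, 2, 5, 5, 4, 4]

def overall(scores):
    """Mean of the per-test scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)

print(overall(devstral))  # 3.08
print(overall(gpt51))     # 4.25
```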

Pricing Analysis

Devstral Small 1.1 charges $0.10/MTok for input and $0.30/MTok for output; GPT-5.1 charges $1.25/MTok for input and $10.00/MTok for output. Assuming a 50/50 split between input and output tokens, the blended rate is about $0.20/MTok for Devstral versus $5.63/MTok for GPT-5.1 — roughly 28× more. Monthly costs at that split: for 1M total tokens, Devstral ≈ $0.20 vs GPT-5.1 ≈ $5.63; for 100M tokens, ≈ $20 vs ≈ $563; for 1B tokens, ≈ $200 vs ≈ $5,625. The absolute dollar gap means cost-sensitive, high-volume applications (chatbots, automated classification pipelines, large batch inference) should prefer Devstral. Teams that need multimodal inputs, extreme long context, or top-tier reasoning should budget for GPT-5.1 despite the much higher cost.
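The blended-cost arithmetic is simple enough to sketch (prices from the cards above; the 50/50 input/output split is an assumption you should tune to your workload):

```python
# $/MTok (input, output) as listed in the pricing cards.
PRICES = {
    "Devstral Small 1.1": (0.10, 0.30),
    "GPT-5.1": (1.25, 10.00),
}

def monthly_cost(model: str, total_mtok: float, input_share: float = 0.5) -> float:
    """Dollar cost for `total_mtok` million tokens, split between
    input and output by `input_share`."""
    inp, out = PRICES[model]
    return total_mtok * (input_share * inp + (1 - input_share) * out)

for mtok in (1, 10, 100):
    print(f"{mtok}M tokens: Devstral ${monthly_cost('Devstral Small 1.1', mtok):.2f} "
          f"vs GPT-5.1 ${monthly_cost('GPT-5.1', mtok):.2f}")
```

If your workload is input-heavy (e.g. long documents in, short answers out), raise `input_share` and the gap narrows, since the models' input prices differ by 12.5× but their output prices by 33×.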

Real-World Cost Comparison

Task             Devstral Small 1.1  GPT-5.1
Chat response    <$0.001             $0.0053
Blog post        <$0.001             $0.021
Document batch   $0.017              $0.525
Pipeline run     $0.170              $5.25
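To estimate costs for your own tasks, multiply token counts by the listed per-MTok prices; a sketch with illustrative token counts (these are guesses, not the counts behind the table above):

```python
# $/MTok (input, output) from the pricing cards above.
PRICES = {"Devstral Small 1.1": (0.10, 0.30), "GPT-5.1": (1.25, 10.00)}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task with the given token counts."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# A hypothetical chat turn: ~400 tokens in, ~500 tokens out.
print(f"${task_cost('GPT-5.1', 400, 500):.4f}")             # $0.0055
print(f"${task_cost('Devstral Small 1.1', 400, 500):.4f}")  # $0.0002
```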

Bottom Line

Choose Devstral Small 1.1 if: you run high-volume, cost-sensitive automation (classification, schema-constrained outputs, tool-call orchestration) and need a model with a 131,072-token context window for text-only workloads — you save orders of magnitude on inference costs. Choose GPT-5.1 if: you need the best faithfulness, multimodal inputs (text+image+file), extreme long-context (400,000 tokens), stronger multilingual and creative reasoning, or external-benchmarked coding/math performance — accept much higher per-token costs for higher capability.
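The decision criteria above can be encoded as a toy routing function; the flags and the 131,072-token threshold come from this comparison, and the rule itself is a simplification:

```python
DEVSTRAL_CONTEXT = 131_072  # Devstral Small 1.1's context window, in tokens

def pick_model(context_tokens: int, needs_multimodal: bool,
               needs_top_reasoning: bool) -> str:
    """Route a request: fall back to GPT-5.1 only when the job exceeds
    what the cheaper model can handle."""
    if (needs_multimodal or needs_top_reasoning
            or context_tokens > DEVSTRAL_CONTEXT):
        return "GPT-5.1"
    return "Devstral Small 1.1"

print(pick_model(8_000, False, False))    # Devstral Small 1.1
print(pick_model(200_000, False, False))  # GPT-5.1
```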

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
