Devstral Small 1.1 vs GPT-5

GPT-5 wins 10 of 12 benchmarks in our testing, outscoring Devstral Small 1.1 across agentic planning, strategic analysis, creative problem solving, and multilingual tasks by significant margins. Devstral Small 1.1 wins zero benchmarks outright and ties on two (classification and safety calibration), making GPT-5 the clear capability winner. The central tradeoff is cost: at $0.30/MTok output vs GPT-5's $10/MTok, Devstral Small 1.1 is 33x cheaper on output — a difference that matters enormously at scale.

At a Glance

                       Devstral Small 1.1 (Mistral)   GPT-5 (OpenAI)
Overall                3.08/5 (Usable)                4.50/5 (Strong)
Input price            $0.10/MTok                     $1.25/MTok
Output price           $0.30/MTok                     $10.00/MTok
Context window         131K                           400K
SWE-bench Verified     N/A                            73.6%
MATH Level 5           N/A                            98.1%
AIME 2025              N/A                            91.4%

Benchmark Analysis

In our 12-test benchmark suite, GPT-5 outscores Devstral Small 1.1 on 10 tests and ties on 2. Devstral wins none.

Where GPT-5 dominates:

  • Agentic planning: GPT-5 scores 5/5 (tied 1st of 54) vs Devstral's 2/5 (rank 53 of 54, near the bottom of the field). This is the starkest gap. Agentic planning measures goal decomposition and failure recovery — critical for autonomous coding agents, multi-step pipelines, and LLM orchestration.
  • Strategic analysis: GPT-5 scores 5/5 (tied 1st of 54) vs Devstral's 2/5 (rank 44 of 54). This measures nuanced tradeoff reasoning with real numbers — relevant for research, business analysis, and decision-support applications.
  • Creative problem solving: GPT-5 scores 4/5 (rank 9 of 54) vs Devstral's 2/5 (rank 47 of 54), another near-bottom result for Devstral.
  • Persona consistency: GPT-5 scores 5/5 (tied 1st of 53) vs Devstral's 2/5 (rank 51 of 53 — second to last). This matters for chatbots and character-consistent applications.
  • Tool calling: GPT-5 scores 5/5 (tied 1st of 54) vs Devstral's 4/5 (rank 18 of 54). Both are above the median (4/5), but GPT-5 reaches the ceiling. Tool calling governs function selection, argument accuracy, and sequencing — directly impacting agentic and API-connected workflows.
  • Faithfulness: GPT-5 scores 5/5 (tied 1st of 55) vs Devstral's 4/5 (rank 34 of 55). GPT-5 is less likely to hallucinate or drift from source material.
  • Long context: GPT-5 scores 5/5 (tied 1st of 55) vs Devstral's 4/5 (rank 38 of 55). GPT-5 also has a substantially larger context window: 400,000 tokens vs Devstral's 131,072.
  • Structured output: GPT-5 scores 5/5 (tied 1st of 54) vs Devstral's 4/5 (rank 26 of 54). Both exceed the median but GPT-5 reaches the ceiling on JSON schema compliance.
  • Constrained rewriting: GPT-5 scores 4/5 (rank 6 of 53) vs Devstral's 3/5 (rank 31 of 53).
  • Multilingual: GPT-5 scores 5/5 (tied 1st of 55) vs Devstral's 4/5 (rank 36 of 55).
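Tool calling, the benchmark where both models clear the median, measures whether a model picks the right function and fills its arguments correctly. As a rough illustration of what that exercises, here is a hypothetical tool definition in the widely used OpenAI-style `tools` format (the function name and schema are invented for this sketch, not either vendor's actual API surface), plus the checks a well-formed tool call must pass:

```python
import json

# Hypothetical tool definition in the OpenAI-style "tools" format.
# The function name, description, and parameters are illustrative only.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_ticket_status",
            "description": "Look up the status of a support ticket by ID.",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticket_id": {
                        "type": "string",
                        "description": "Ticket identifier, e.g. 'T-1042'.",
                    },
                },
                "required": ["ticket_id"],
            },
        },
    }
]

# A well-formed tool call from the model: correct function selected,
# arguments parse as JSON and match the declared schema.
model_tool_call = {
    "name": "get_ticket_status",
    "arguments": json.dumps({"ticket_id": "T-1042"}),
}

args = json.loads(model_tool_call["arguments"])
schema = tools[0]["function"]["parameters"]
assert model_tool_call["name"] == tools[0]["function"]["name"]
assert all(key in schema["properties"] for key in args)
assert all(field in args for field in schema["required"])
print("tool call well-formed:", args)
```

A 4/5 model will usually get this right; the failures that separate 4/5 from 5/5 tend to show up in argument accuracy and multi-call sequencing rather than simple function selection.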

Where they tie:

  • Classification: Both score 4/5, both tied for 1st of 53 models alongside 29 others. No meaningful difference here.
  • Safety calibration: Both score 2/5, both at rank 12 of 55 alongside 20 other models. Neither excels at refusing harmful requests while permitting legitimate ones — a shared weakness.

External benchmarks (Epoch AI): GPT-5 scores 73.6% on SWE-bench Verified (rank 6 of 12 models tested), placing it solidly above the median of 70.8% among models with this score. On MATH Level 5, GPT-5 scores 98.1% — rank 1 of 14 models tested, the sole holder of that score, well above the median of 94.15%. On AIME 2025, GPT-5 scores 91.4% (rank 6 of 23), above the 83.9% median. Devstral Small 1.1 has no external benchmark scores in our data. These third-party results independently corroborate GPT-5's strength in mathematical reasoning and code-related tasks.

Benchmark                  Devstral Small 1.1   GPT-5
Faithfulness               4/5                  5/5
Long Context               4/5                  5/5
Multilingual               4/5                  5/5
Tool Calling               4/5                  5/5
Classification             4/5                  4/5
Agentic Planning           2/5                  5/5
Structured Output          4/5                  5/5
Safety Calibration         2/5                  2/5
Strategic Analysis         2/5                  5/5
Persona Consistency        2/5                  5/5
Constrained Rewriting      3/5                  4/5
Creative Problem Solving   2/5                  4/5
Summary                    0 wins               10 wins

Pricing Analysis

Devstral Small 1.1 costs $0.10/MTok input and $0.30/MTok output. GPT-5 costs $1.25/MTok input and $10.00/MTok output — 12.5x more expensive on input and 33x more expensive on output. In practice: at 1M output tokens/month, you pay $0.30 for Devstral Small 1.1 vs $10.00 for GPT-5. At 10M output tokens, that's $3 vs $100. At 100M output tokens — a realistic volume for a production app or enterprise pipeline — Devstral costs $30 vs GPT-5's $1,000. GPT-5 also consumes reasoning tokens (billed as output and flagged in the response payload), so token usage on complex tasks can be substantially higher than the base rate suggests, widening the cost gap further. Developers running high-throughput pipelines, batch processing, or cost-sensitive SaaS products should take the 33x output cost difference seriously. GPT-5's pricing is justified only when its capability advantages — particularly on agentic planning (5 vs 2), strategic analysis (5 vs 2), and creative problem solving (4 vs 2) — directly drive business value in your use case.
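The arithmetic above is easy to reproduce for your own volumes. A minimal sketch, with output prices hard-coded from this page (GPT-5's reasoning tokens, which bill as output, are ignored here, so its real costs skew higher):

```python
# Output pricing from this comparison, in USD per million tokens (MTok).
PRICE_PER_MTOK = {"devstral-small-1.1": 0.30, "gpt-5": 10.00}

def monthly_output_cost(model: str, output_tokens: int) -> float:
    """Cost in USD for the given number of output tokens."""
    return output_tokens / 1_000_000 * PRICE_PER_MTOK[model]

for volume in (1_000_000, 10_000_000, 100_000_000):
    devstral = monthly_output_cost("devstral-small-1.1", volume)
    gpt5 = monthly_output_cost("gpt-5", volume)
    print(f"{volume:>11,} tokens: ${devstral:,.2f} vs ${gpt5:,.2f}")
```

Swap in your own expected monthly output volume to see where the 33x multiplier stops being a rounding error and starts being a budget line.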

Real-World Cost Comparison

Task             Devstral Small 1.1   GPT-5
Chat response    <$0.001              $0.0053
Blog post        <$0.001              $0.021
Document batch   $0.017               $0.525
Pipeline run     $0.170               $5.25

Bottom Line

Choose Devstral Small 1.1 if: Cost is a primary constraint and your use case maps to its strengths — structured output (4/5), tool calling (4/5), classification (4/5), and long context (4/5). It's a viable choice for high-volume pipelines that need solid JSON generation, function calling, or document classification at 33x lower output cost. It's also worth considering if you're experimenting, prototyping, or building on a limited budget where GPT-5's capabilities aren't needed.
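If you do route high-volume JSON generation to the cheaper model, it is worth validating every response before it enters your pipeline, since a 4/5 structured-output score still implies occasional schema drift. A minimal sketch using only the standard library (the required fields here are a hypothetical classification schema, not anything from our benchmark):

```python
import json

# Hypothetical schema for a classification response: field name -> type.
REQUIRED_FIELDS = {"label": str, "confidence": float}

def parse_model_json(raw: str) -> dict:
    """Parse a model response, enforcing required fields and their types."""
    data = json.loads(raw)
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"bad type for field: {field}")
    return data

# A well-formed response passes...
print(parse_model_json('{"label": "billing", "confidence": 0.92}'))
# ...while a malformed one is rejected before it reaches downstream code.
try:
    parse_model_json('{"label": "billing"}')
except ValueError as err:
    print("rejected:", err)
```

A guard like this also gives you a cheap place to measure the failure rate in production, which is ultimately what decides whether the 33x savings holds up for your workload.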

Choose GPT-5 if: You need strong agentic behavior (5/5 vs 2/5), reliable multi-step planning, strategic reasoning (5/5 vs 2/5), creative problem solving (4/5 vs 2/5), or persona-consistent chat (5/5 vs 2/5). Its 400K context window dwarfs Devstral's 131K, making it the clear choice for long-document tasks. It also leads on SWE-bench Verified at 73.6% and tops the field on MATH Level 5 at 98.1% (Epoch AI). For production applications where output quality directly affects user experience or business outcomes, GPT-5's benchmark advantage is hard to argue with — provided the cost fits your budget.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions