Devstral Small 1.1 vs GPT-5 Mini

GPT-5 Mini is the stronger general-purpose AI, winning 10 of 12 benchmarks in our testing — including decisive advantages on strategic analysis (5 vs 2), agentic planning (4 vs 2), and persona consistency (5 vs 2). Devstral Small 1.1 wins only one benchmark outright — tool calling (4 vs 3) — making it a narrow specialist rather than a broad competitor. The tradeoff is significant: GPT-5 Mini's output costs $2.00/MTok versus Devstral Small 1.1's $0.30/MTok, so the decision is whether GPT-5 Mini's across-the-board quality advantage justifies paying roughly 6.7x more on output.

Mistral

Devstral Small 1.1

Overall: 3.08/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 2/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Persona Consistency: 2/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.100/MTok
Output: $0.300/MTok
Context Window: 131K

modelpicker.net

OpenAI

GPT-5 Mini

Overall: 4.33/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 3/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 3/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: 64.7%
MATH Level 5: 97.8%
AIME 2025: 86.7%

Pricing

Input: $0.250/MTok
Output: $2.00/MTok
Context Window: 400K


Benchmark Analysis

GPT-5 Mini wins 10 of 12 benchmarks in our testing; Devstral Small 1.1 wins 1; they tie on 1.

Where GPT-5 Mini leads:

  • Strategic analysis: GPT-5 Mini scores 5/5 (tied for 1st among 54 models) vs Devstral Small 1.1's 2/5 (rank 44 of 54). This is the largest gap in the dataset — a model scoring 2 on nuanced tradeoff reasoning will struggle with business analysis, competitive assessments, and any task requiring weighing real-world constraints.
  • Agentic planning: GPT-5 Mini scores 4/5 (rank 16 of 54) vs Devstral Small 1.1's 2/5 (rank 53 of 54 — near the bottom of the field). For goal decomposition and multi-step task execution, Devstral Small 1.1 is a poor fit.
  • Persona consistency: GPT-5 Mini scores 5/5 (tied for 1st among 53 models) vs Devstral Small 1.1's 2/5 (rank 51 of 53). Critical for chatbot or character-based applications.
  • Creative problem solving: GPT-5 Mini scores 4/5 (rank 9 of 54) vs Devstral Small 1.1's 2/5 (rank 47 of 54). Devstral Small 1.1 is below the 25th percentile on this dimension.
  • Faithfulness: GPT-5 Mini scores 5/5 (tied for 1st among 55 models) vs Devstral Small 1.1's 4/5 (rank 34 of 55). Both are solid, but GPT-5 Mini reaches the top tier.
  • Long context: GPT-5 Mini scores 5/5 (tied for 1st among 55 models) vs Devstral Small 1.1's 4/5 (rank 38 of 55). GPT-5 Mini also supports a 400K context window vs Devstral Small 1.1's 131K — a practical advantage for large document workflows.
  • Multilingual: GPT-5 Mini scores 5/5 (tied for 1st among 55 models) vs Devstral Small 1.1's 4/5 (rank 36 of 55).
  • Structured output: GPT-5 Mini scores 5/5 (tied for 1st among 54 models) vs Devstral Small 1.1's 4/5 (rank 26 of 54). Both are capable, but GPT-5 Mini edges ahead on JSON schema compliance.
  • Constrained rewriting: GPT-5 Mini scores 4/5 (rank 6 of 53) vs Devstral Small 1.1's 3/5 (rank 31 of 53).
  • Safety calibration: GPT-5 Mini scores 3/5 (rank 10 of 55) vs Devstral Small 1.1's 2/5 (rank 12 of 55). Both sit above the 75th percentile on this dimension given the dataset's low median (p50 = 2), but GPT-5 Mini is the stronger performer here.

Where Devstral Small 1.1 leads:

  • Tool calling: Devstral Small 1.1 scores 4/5 (rank 18 of 54) vs GPT-5 Mini's 3/5 (rank 47 of 54). This is Devstral Small 1.1's clearest advantage — function selection, argument accuracy, and sequencing. GPT-5 Mini ranks near the bottom third of models tested on this dimension.

Tie:

  • Classification: Both score 4/5, tied for 1st with 29 other models among 53 tested. No meaningful difference here.

External benchmarks (Epoch AI):

GPT-5 Mini has external benchmark data that Devstral Small 1.1 lacks in this payload. On SWE-bench Verified, GPT-5 Mini scores 64.7%, below the dataset median of 70.8% (rank 8 of the 12 models with scores in our dataset), placing it in the lower half of tracked models on this test. On MATH Level 5, GPT-5 Mini scores 97.8% (rank 2 of 14, above the dataset median of 94.15%). On AIME 2025, GPT-5 Mini scores 86.7% (rank 9 of 23, near the dataset median of 83.9%). No comparable external benchmark data exists for Devstral Small 1.1 in this payload.

Benchmark | Devstral Small 1.1 | GPT-5 Mini
Faithfulness | 4/5 | 5/5
Long Context | 4/5 | 5/5
Multilingual | 4/5 | 5/5
Tool Calling | 4/5 | 3/5
Classification | 4/5 | 4/5
Agentic Planning | 2/5 | 4/5
Structured Output | 4/5 | 5/5
Safety Calibration | 2/5 | 3/5
Strategic Analysis | 2/5 | 5/5
Persona Consistency | 2/5 | 5/5
Constrained Rewriting | 3/5 | 4/5
Creative Problem Solving | 2/5 | 4/5
Summary | 1 win | 10 wins

Pricing Analysis

Devstral Small 1.1 costs $0.10/MTok input and $0.30/MTok output. GPT-5 Mini costs $0.25/MTok input and $2.00/MTok output — 2.5x more on input and 6.7x more on output.

At 1M output tokens/month: Devstral Small 1.1 costs $0.30 vs GPT-5 Mini's $2.00 — a $1.70 difference that barely registers.

At 10M output tokens/month: $3.00 vs $20.00 — a $17 gap. Still manageable for most teams.

At 100M output tokens/month: $300 vs $2,000 — a $1,700/month difference that becomes a real budget line item. At this scale, if your workload skews toward tasks where Devstral Small 1.1 performs competitively (tool calling, classification, structured output), the cost argument for Devstral Small 1.1 becomes compelling.

GPT-5 Mini also uses reasoning tokens (flagged in the payload), which can inflate token counts on complex requests — worth accounting for in high-volume cost projections. Developers building latency-sensitive or cost-constrained pipelines at scale should weigh whether GPT-5 Mini's quality advantages on strategic analysis, creative problem solving, and long context are worth the output cost premium for their specific workload.
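The volume math above can be sketched as a simple projection. This is a back-of-envelope sketch, not a billing calculator: the 2:1 input-to-output ratio and the 1.5x reasoning-token multiplier for GPT-5 Mini are illustrative assumptions, not measured figures; substitute your own observed ratios.

```python
# Back-of-envelope monthly cost projection for the two models.
# Prices are $/MTok from the pricing section; the input volume and the
# reasoning-token multiplier are ASSUMPTIONS for illustration only.

def monthly_cost(input_mtok, output_mtok, input_price, output_price,
                 reasoning_multiplier=1.0):
    """Return monthly cost in dollars.

    Volumes are in millions of tokens (MTok), prices in $/MTok.
    reasoning_multiplier scales billed output tokens to account for
    hidden reasoning tokens on models that emit them.
    """
    return (input_mtok * input_price
            + output_mtok * reasoning_multiplier * output_price)

# 100M output tokens/month, assuming 2x as many input tokens (assumption).
devstral = monthly_cost(200, 100, 0.10, 0.30)
gpt5_mini = monthly_cost(200, 100, 0.25, 2.00, reasoning_multiplier=1.5)

print(f"Devstral Small 1.1: ${devstral:,.2f}/month")   # $50.00
print(f"GPT-5 Mini:         ${gpt5_mini:,.2f}/month")  # $350.00
```

With input tokens and a reasoning-token overhead included, the effective gap at scale can exceed the headline 6.7x output ratio, which is why the multiplier belongs in any high-volume projection.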

Real-World Cost Comparison

Task | Devstral Small 1.1 | GPT-5 Mini
Chat response | <$0.001 | $0.0010
Blog post | <$0.001 | $0.0041
Document batch | $0.017 | $0.105
Pipeline run | $0.170 | $1.05
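Per-task figures like these follow directly from token counts and per-MTok prices. The token counts below are illustrative assumptions, not the exact workloads behind the table, so the results will not match its rows precisely.

```python
# Illustrative per-task cost estimate: (tokens x $/MTok) / 1,000,000.
# Token counts are ASSUMPTIONS chosen for illustration; the table's
# exact workload definitions are not published here.

PRICES = {  # $/MTok, from the pricing section
    "Devstral Small 1.1": {"input": 0.10, "output": 0.30},
    "GPT-5 Mini":         {"input": 0.25, "output": 2.00},
}

def task_cost(model, input_tokens, output_tokens):
    """Return the cost in dollars of a single request."""
    p = PRICES[model]
    return (input_tokens * p["input"]
            + output_tokens * p["output"]) / 1_000_000

# A chat response: ~500 input tokens, ~400 output tokens (assumption).
for model in PRICES:
    print(f"{model}: ${task_cost(model, 500, 400):.4f}")
```

Scaling the same arithmetic up by batch size reproduces the document-batch and pipeline-run pattern: output-heavy tasks widen the gap, because that is where the 6.7x price ratio applies.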

Bottom Line

Choose Devstral Small 1.1 if: Your primary workload is tool-calling-heavy pipelines — agentic code execution, API orchestration, or function-routing tasks where its score of 4/5 (vs GPT-5 Mini's 3/5, rank 47 of 54) gives it a real edge. It's also the right call if you're running high output volumes (100M+ tokens/month) where GPT-5 Mini's $2.00/MTok output cost becomes a significant expense, and your tasks fall within the narrow set where Devstral Small 1.1 competes (classification, structured output, faithfulness). Devstral Small 1.1 is described as purpose-built for software engineering agents — if that matches your use case, the cost savings are meaningful.

Choose GPT-5 Mini if: You need a capable general-purpose AI across a wide range of tasks. It wins 10 of 12 benchmarks in our testing, with especially large advantages on strategic analysis (5 vs 2), agentic planning (4 vs 2), creative problem solving (4 vs 2), and persona consistency (5 vs 2). Its 400K context window is roughly three times Devstral Small 1.1's 131K, a practical edge for large-document work. It also supports image and file inputs, where Devstral Small 1.1 is text-only. At lower to mid volumes, the $1.70/MTok output premium is easy to justify for the quality differential across most task types.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions