Devstral Small 1.1 vs GPT-4.1 Mini

GPT-4.1 Mini is the stronger general-purpose model, winning 7 of 12 benchmarks in our testing against Devstral Small 1.1's 1 win and 4 ties — with meaningful advantages in agentic planning, strategic analysis, persona consistency, and long-context retrieval. Devstral Small 1.1 edges ahead only on classification, where it ties for 1st among 53 models. At $0.10/$0.30 per million tokens versus GPT-4.1 Mini's $0.40/$1.60, Devstral Small 1.1 is roughly 5x cheaper on output — making it a credible option when classification or structured output is the primary workload and budget is the primary constraint.

Devstral Small 1.1 (Mistral)

Overall: 3.08/5 (Usable)
Benchmark scores: Faithfulness 4/5 · Long Context 4/5 · Multilingual 4/5 · Tool Calling 4/5 · Classification 4/5 · Agentic Planning 2/5 · Structured Output 4/5 · Safety Calibration 2/5 · Strategic Analysis 2/5 · Persona Consistency 2/5 · Constrained Rewriting 3/5 · Creative Problem Solving 2/5
External benchmarks: SWE-bench Verified N/A · MATH Level 5 N/A · AIME 2025 N/A
Pricing: $0.10/MTok input, $0.30/MTok output
Context window: 131K tokens

GPT-4.1 Mini (OpenAI)

Overall: 3.92/5 (Strong)
Benchmark scores: Faithfulness 4/5 · Long Context 5/5 · Multilingual 5/5 · Tool Calling 4/5 · Classification 3/5 · Agentic Planning 4/5 · Structured Output 4/5 · Safety Calibration 2/5 · Strategic Analysis 4/5 · Persona Consistency 5/5 · Constrained Rewriting 4/5 · Creative Problem Solving 3/5
External benchmarks: SWE-bench Verified N/A · MATH Level 5 87.3% · AIME 2025 44.7%
Pricing: $0.40/MTok input, $1.60/MTok output
Context window: 1,048K tokens

Benchmark Analysis

Across our 12-test benchmark suite, GPT-4.1 Mini wins 7 tests outright, Devstral Small 1.1 wins 1, and the two tie on 4.

Where GPT-4.1 Mini wins:

  • Long context (5 vs 4): GPT-4.1 Mini ties for 1st among 55 models; Devstral Small 1.1 ranks 38th. For workloads requiring accurate retrieval at 30K+ tokens, this is a significant gap.
  • Persona consistency (5 vs 2): GPT-4.1 Mini ties for 1st among 53 models; Devstral Small 1.1 ranks 51st — near the bottom of tested models. This matters for chatbots, roleplay, and any system prompt that must hold under adversarial input.
  • Agentic planning (4 vs 2): GPT-4.1 Mini ranks 16th of 54; Devstral Small 1.1 ranks 53rd — second to last. Goal decomposition and failure recovery are critical for autonomous agent workflows, and this gap is stark.
  • Strategic analysis (4 vs 2): GPT-4.1 Mini ranks 27th of 54; Devstral Small 1.1 ranks 44th. Complex tradeoff reasoning favors GPT-4.1 Mini substantially.
  • Constrained rewriting (4 vs 3): GPT-4.1 Mini ranks 6th of 53; Devstral Small 1.1 ranks 31st. Compression within hard limits is notably better on GPT-4.1 Mini.
  • Creative problem solving (3 vs 2): GPT-4.1 Mini ranks 30th of 54; Devstral Small 1.1 ranks 47th. Neither model excels here, but GPT-4.1 Mini is clearly ahead.
  • Multilingual (5 vs 4): GPT-4.1 Mini ties for 1st among 55 models; Devstral Small 1.1 ranks 36th. Non-English use cases strongly favor GPT-4.1 Mini.

Where Devstral Small 1.1 wins:

  • Classification (4 vs 3): Devstral Small 1.1 ties for 1st among 53 models; GPT-4.1 Mini ranks 31st. This is Devstral's clearest differentiator — it categorizes and routes inputs as well as any model in our suite.

Ties (both score the same):

  • Structured output (4/4): Both rank 26th of 54, tied with 26 other models. JSON schema compliance is equivalent; see the sketch after this list for what this test exercises.
  • Tool calling (4/4): Both rank 18th of 54, tied with 28 other models. Function selection and argument accuracy are on par.
  • Faithfulness (4/4): Both rank 34th of 55. Neither model has an edge on staying grounded to source material.
  • Safety calibration (2/2): Both rank 12th of 55, tied with 19 other models. The field median is also 2, so both models are merely average here rather than ahead of the pack.
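
For a concrete sense of what the structured output test exercises, here is a minimal sketch of a JSON-schema-constrained request using the OpenAI Python SDK. The ticket schema and field names are hypothetical examples, not part of our test suite; a Devstral deployment would go through Mistral's analogous structured-output support.

    # Hypothetical sketch of a schema-constrained request, the kind of call
    # the structured output benchmark exercises. The schema and field names
    # are illustrative assumptions, not our actual test prompts.
    from openai import OpenAI

    client = OpenAI()
    schema = {
        "name": "support_ticket",
        "schema": {
            "type": "object",
            "properties": {
                "category": {"type": "string"},
                "priority": {"type": "integer", "minimum": 1, "maximum": 5},
            },
            "required": ["category", "priority"],
            "additionalProperties": False,
        },
    }
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": "Classify: checkout page returns a 500 error."}],
        response_format={"type": "json_schema", "json_schema": schema},
    )
    print(resp.choices[0].message.content)  # e.g. {"category": "bug", "priority": 4}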

Third-party benchmark context: GPT-4.1 Mini scores 87.3% on MATH Level 5 (9th of the 14 models with reported scores) and 44.7% on AIME 2025 (18th of 23), according to Epoch AI data. Devstral Small 1.1 has no external benchmark scores in our data. GPT-4.1 Mini's description notes a 45.1% score on hard coding tasks; Devstral Small 1.1 is described as purpose-built for software engineering agents, fine-tuned from Mistral Small 3.1 in collaboration with All Hands AI, but our suite does not include a direct SWE-bench score for it.

Benchmark | Devstral Small 1.1 | GPT-4.1 Mini
Faithfulness | 4/5 | 4/5
Long Context | 4/5 | 5/5
Multilingual | 4/5 | 5/5
Tool Calling | 4/5 | 4/5
Classification | 4/5 | 3/5
Agentic Planning | 2/5 | 4/5
Structured Output | 4/5 | 4/5
Safety Calibration | 2/5 | 2/5
Strategic Analysis | 2/5 | 4/5
Persona Consistency | 2/5 | 5/5
Constrained Rewriting | 3/5 | 4/5
Creative Problem Solving | 2/5 | 3/5
Summary | 1 win | 7 wins

Pricing Analysis

Devstral Small 1.1 costs $0.10/M input and $0.30/M output tokens. GPT-4.1 Mini costs $0.40/M input and $1.60/M output: 4x more expensive on input and more than 5x more expensive on output. In practice, output cost dominates most workloads. At 1M output tokens/month, GPT-4.1 Mini costs $1.60 vs $0.30 for Devstral Small 1.1, a $1.30 difference that's negligible for most teams. Scale to 10M output tokens and it's $16 vs $3, a $13 gap that is still modest. At 100M output tokens/month, the scale of a production API serving thousands of users, GPT-4.1 Mini runs $160 vs $30, a $130/month difference that starts to matter in budget planning. For high-volume inference pipelines where GPT-4.1 Mini's broader capabilities aren't needed, Devstral Small 1.1 offers real savings. For most individual developers or small teams, the cost gap won't be the deciding factor.
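
The arithmetic is easy to verify. Below is a minimal sketch using only the published per-million-token rates quoted above; it considers output tokens alone, matching the volumes in this paragraph.

    # Reproduces the monthly cost figures above from the quoted rates.
    PRICES = {  # USD per million tokens: (input, output)
        "Devstral Small 1.1": (0.10, 0.30),
        "GPT-4.1 Mini": (0.40, 1.60),
    }

    def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
        """Monthly spend in USD for the given token volumes."""
        rate_in, rate_out = PRICES[model]
        return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000

    for volume in (1_000_000, 10_000_000, 100_000_000):
        gpt = monthly_cost("GPT-4.1 Mini", 0, volume)
        dev = monthly_cost("Devstral Small 1.1", 0, volume)
        print(f"{volume:>12,} output tokens/month: ${gpt:,.2f} vs ${dev:,.2f}")
    #    1,000,000 output tokens/month: $1.60 vs $0.30
    #   10,000,000 output tokens/month: $16.00 vs $3.00
    #  100,000,000 output tokens/month: $160.00 vs $30.00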

Real-World Cost Comparison

Task | Devstral Small 1.1 | GPT-4.1 Mini
Chat response | <$0.001 | <$0.001
Blog post | <$0.001 | $0.0034
Document batch | $0.017 | $0.088
Pipeline run | $0.170 | $0.880
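
These per-task figures are consistent with simple token budgets. The counts in the sketch below are our own assumptions, chosen because they reproduce the published numbers; modelpicker.net does not document the underlying token counts.

    # Back-of-the-envelope check of the cost table above. The per-task token
    # counts are assumptions that happen to reproduce the published figures;
    # they are not documented values.
    RATES = {  # USD per million tokens: (input, output)
        "Devstral Small 1.1": (0.10, 0.30),
        "GPT-4.1 Mini": (0.40, 1.60),
    }
    TASKS = {  # assumed (input_tokens, output_tokens) per task
        "Blog post": (500, 2_000),
        "Document batch": (20_000, 50_000),
        "Pipeline run": (200_000, 500_000),
    }

    for task, (tok_in, tok_out) in TASKS.items():
        cells = []
        for model, (rate_in, rate_out) in RATES.items():
            cost = (tok_in * rate_in + tok_out * rate_out) / 1_000_000
            cells.append(f"{model} ${cost:.4f}")
        print(f"{task}: " + " | ".join(cells))
    # Blog post: Devstral Small 1.1 $0.0007 | GPT-4.1 Mini $0.0034
    # Document batch: Devstral Small 1.1 $0.0170 | GPT-4.1 Mini $0.0880
    # Pipeline run: Devstral Small 1.1 $0.1700 | GPT-4.1 Mini $0.8800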

Bottom Line

Choose Devstral Small 1.1 if: Your primary workload is classification or routing — it ties for 1st among 53 models on that benchmark in our testing, and does so at $0.30/M output tokens. It's also a reasonable fit for structured output and tool calling pipelines where cost matters and you can live with weaker performance on reasoning, planning, and long-context tasks. If you're running high-volume classification inference and every dollar counts, it's the clear pick for that specific job.

Choose GPT-4.1 Mini if: You need a capable general-purpose model. It wins 7 of 12 benchmarks in our testing, with strong leads on agentic planning (4 vs 2, ranking 16th vs 53rd of 54), persona consistency (5 vs 2, ranking 1st vs 51st of 53), long-context retrieval (5 vs 4, ranking 1st of 55), and multilingual output. It also supports image and file inputs, which Devstral Small 1.1 does not in our data. For developer agents, customer-facing chatbots, multilingual apps, or anything requiring sustained reasoning over long documents, GPT-4.1 Mini's benchmark profile is considerably stronger. The 5x output cost premium is justified unless your workload maps narrowly to Devstral's strengths.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
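
For illustration only, an LLM-judge scoring call generally looks like the sketch below. The judge model, rubric wording, and score parsing are hypothetical stand-ins, not our actual harness.

    # Hypothetical LLM-as-judge sketch. The judge model, rubric, and parsing
    # are illustrative assumptions, not the real evaluation pipeline.
    import re
    from openai import OpenAI

    client = OpenAI()

    def judge_score(task: str, response: str) -> int:
        """Ask a judge model for a 1-5 score and parse out the integer."""
        rubric = (
            "You are grading a model response. Score it from 1 (unusable) to "
            "5 (excellent) for correctness and instruction-following. "
            "Reply with the integer only.\n\n"
            f"Task: {task}\n\nResponse: {response}"
        )
        reply = client.chat.completions.create(
            model="gpt-4.1",  # assumed judge model
            messages=[{"role": "user", "content": rubric}],
        )
        match = re.search(r"[1-5]", reply.choices[0].message.content)
        return int(match.group()) if match else 1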

Frequently Asked Questions