Devstral Small 1.1 vs GPT-4.1 Mini
GPT-4.1 Mini is the stronger general-purpose model, winning 7 of 12 benchmarks in our testing against Devstral Small 1.1's 1 win and 4 ties — with meaningful advantages in agentic planning, strategic analysis, persona consistency, and long-context retrieval. Devstral Small 1.1 edges ahead only on classification, where it ties for 1st among 53 models. At $0.10/$0.30 per million tokens versus GPT-4.1 Mini's $0.40/$1.60, Devstral Small 1.1 is roughly 5x cheaper on output — making it a credible option when classification or structured output is the primary workload and budget is the primary constraint.
Pricing at a glance (modelpicker.net):
- Devstral Small 1.1 (Mistral): $0.10/MTok input, $0.30/MTok output
- GPT-4.1 Mini (OpenAI): $0.40/MTok input, $1.60/MTok output
Benchmark Analysis
Across our 12-test benchmark suite, GPT-4.1 Mini wins 7 tests outright, Devstral Small 1.1 wins 1, and the two tie on 4.
Where GPT-4.1 Mini wins:
- Long context (5 vs 4): GPT-4.1 Mini ties for 1st among 55 models; Devstral Small 1.1 ranks 38th. For workloads requiring accurate retrieval at 30K+ tokens, this is a significant gap.
- Persona consistency (5 vs 2): GPT-4.1 Mini ties for 1st among 53 models; Devstral Small 1.1 ranks 51st — near the bottom of tested models. This matters for chatbots, roleplay, and any system prompt that must hold under adversarial input.
- Agentic planning (4 vs 2): GPT-4.1 Mini ranks 16th of 54; Devstral Small 1.1 ranks 53rd — second to last. Goal decomposition and failure recovery are critical for autonomous agent workflows, and this gap is stark.
- Strategic analysis (4 vs 2): GPT-4.1 Mini ranks 27th of 54; Devstral Small 1.1 ranks 44th. Complex tradeoff reasoning favors GPT-4.1 Mini substantially.
- Constrained rewriting (4 vs 3): GPT-4.1 Mini ranks 6th of 53; Devstral Small 1.1 ranks 31st. Compression within hard limits is notably better on GPT-4.1 Mini.
- Creative problem solving (3 vs 2): GPT-4.1 Mini ranks 30th of 54; Devstral Small 1.1 ranks 47th. Neither model excels here, but GPT-4.1 Mini is clearly ahead.
- Multilingual (5 vs 4): GPT-4.1 Mini ties for 1st among 55 models; Devstral Small 1.1 ranks 36th. Non-English use cases strongly favor GPT-4.1 Mini.
Where Devstral Small 1.1 wins:
- Classification (4 vs 3): Devstral Small 1.1 ties for 1st among 53 models; GPT-4.1 Mini ranks 31st. This is Devstral's clearest differentiator — it categorizes and routes inputs as well as any model in our suite.
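Classification strength like this maps naturally onto router-style workloads: ask the model for exactly one label, then match its reply against a known set. A minimal sketch, assuming an OpenAI-compatible chat-completions request body; the model id and the label-matching helper are illustrative assumptions, not part of any provider's API:

```python
LABELS = ["billing", "bug_report", "feature_request", "other"]

def build_request(text: str, labels: list[str]) -> dict:
    """Chat-completions request body asking the model to emit one label."""
    return {
        "model": "devstral-small-1.1",  # hypothetical id; check your provider's model catalog
        "temperature": 0,
        "messages": [
            {"role": "system",
             "content": "Classify the user message. Reply with exactly one of: "
                        + ", ".join(labels)},
            {"role": "user", "content": text},
        ],
    }

def parse_label(reply: str, labels: list[str], default: str = "other") -> str:
    """Map a (possibly chatty) model reply back onto a known label."""
    reply = reply.strip().lower()
    for label in labels:
        if label in reply:
            return label
    return default

body = build_request("I was charged twice this month", LABELS)
# POST `body` to your provider's /v1/chat/completions endpoint, then parse:
print(parse_label("Billing", LABELS))  # -> billing
```

Pinning `temperature` to 0 and constraining the reply to a fixed label set keeps routing deterministic; the fallback `default` catches replies that match no label.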
Ties (both score the same):
- Structured output (4/4): Both rank 26th of 54, tied with 26 other models. JSON schema compliance is equivalent.
- Tool calling (4/4): Both rank 18th of 54, tied with 28 other models. Function selection and argument accuracy are on par.
- Faithfulness (4/4): Both rank 34th of 55. Neither model has an edge on staying grounded to source material.
- Safety calibration (2/2): Both rank 12th of 55, tied with 19 other models — matching the field median of 2, which means both are merely average here.
Third-party benchmark context: GPT-4.1 Mini scores 87.3% on MATH Level 5 (rank 9 of 14 models with reported scores) and 44.7% on AIME 2025 (rank 18 of 23) according to Epoch AI data. Devstral Small 1.1 has no external benchmark scores in our data. GPT-4.1 Mini's description notes it scores 45.1% on hard coding tasks; Devstral Small 1.1 is described as purpose-built for software engineering agents, fine-tuned from Mistral Small 3.1 in collaboration with All Hands AI — but our benchmark suite does not include a direct SWE-bench score for it.
Pricing Analysis
Devstral Small 1.1 costs $0.10/M input and $0.30/M output tokens. GPT-4.1 Mini costs $0.40/M input and $1.60/M output — 4x more expensive on input, and more than 5x more expensive on output. In practice, output cost dominates most workloads. At 1M output tokens/month, GPT-4.1 Mini costs $1.60 vs $0.30 for Devstral Small 1.1 — a $1.30 difference that's negligible for most teams. Scale to 10M output tokens and the gap becomes $16 vs $3, still modest. At 100M output tokens/month — the scale of a production API serving thousands of users — GPT-4.1 Mini runs $160 vs $30, a $130/month difference that starts to matter in budget planning. For high-volume inference pipelines where GPT-4.1 Mini's broader capabilities aren't needed, Devstral Small 1.1 offers real savings. For most individual developers or small teams, the cost gap won't be the deciding factor.
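The arithmetic above can be sketched as a quick cost estimator. Prices are hardcoded from the figures in this comparison; this is a simplified model that ignores prompt caching, batch discounts, and any provider-side minimums:

```python
# USD per 1M tokens, from the pricing figures above.
PRICES = {
    "devstral-small-1.1": {"input": 0.10, "output": 0.30},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate monthly spend in USD for a given token volume."""
    p = PRICES[model]
    return (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]

# Output-only comparison at 100M output tokens/month:
print(f"${monthly_cost('gpt-4.1-mini', 0, 100_000_000):.2f}")        # $160.00
print(f"${monthly_cost('devstral-small-1.1', 0, 100_000_000):.2f}")  # $30.00
```

Feeding your own input/output token split into `monthly_cost` gives a closer estimate than the output-only figures, since chat workloads often consume far more input than output tokens.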
Bottom Line
Choose Devstral Small 1.1 if: Your primary workload is classification or routing — it ties for 1st among 53 models on that benchmark in our testing, and does so at $0.30/M output tokens. It's also a reasonable fit for structured output and tool calling pipelines where cost matters and you can live with weaker performance on reasoning, planning, and long-context tasks. If you're running high-volume classification inference and every dollar counts, it's the clear pick for that specific job.
Choose GPT-4.1 Mini if: You need a capable general-purpose model. It wins 7 of 12 benchmarks in our testing, with strong leads on agentic planning (4 vs 2, ranking 16th vs 53rd of 54), persona consistency (5 vs 2, ranking 1st vs 51st of 53), long-context retrieval (5 vs 4, ranking 1st of 55), and multilingual output. It also supports image and file inputs, which Devstral Small 1.1 does not per our data. For developer agents, customer-facing chatbots, multilingual apps, or anything requiring sustained reasoning over long documents, GPT-4.1 Mini's benchmark profile is considerably stronger. The 5x output cost premium is justified unless your workload maps narrowly to Devstral's strengths.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.