Devstral Small 1.1 vs GPT-5.4 Mini

GPT-5.4 Mini is the stronger general-purpose model, winning 9 of 12 benchmarks in our testing — including strategic analysis (5 vs 2), agentic planning (4 vs 2), and persona consistency (5 vs 2) — with no benchmark where Devstral Small 1.1 pulls ahead. The tradeoff is cost: GPT-5.4 Mini runs $4.50/M output tokens versus Devstral Small 1.1's $0.30/M, a 15x gap that matters enormously at scale. Devstral Small 1.1 is a viable alternative only if your workload is cost-sensitive, text-only, and concentrated in areas where both models tie — tool calling, classification, and safety calibration.

Mistral

Devstral Small 1.1

Overall: 3.08/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 2/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Persona Consistency: 2/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.100/MTok
Output: $0.300/MTok
Context Window: 131K

modelpicker.net

OpenAI

GPT-5.4 Mini

Overall: 4.33/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.750/MTok
Output: $4.50/MTok
Context Window: 400K


Benchmark Analysis

Neither model wins on safety calibration — both score 2/5 in our testing, tied at rank 12 of 55 alongside 18 other models. That's below the 75th percentile for the field, so neither should be deployed in high-stakes refusal-sensitive contexts without additional guardrails.

On classification and tool calling, the two models tie at 4/5 apiece — sharing rank 1 of 53 on classification and rank 18 of 54 on tool calling. For structured routing tasks or function-calling pipelines that don't require reasoning depth, Devstral Small 1.1 matches GPT-5.4 Mini at a fraction of the price.

From there, GPT-5.4 Mini pulls ahead on every remaining benchmark:

  • Structured output (JSON schema compliance): GPT-5.4 Mini scores 5/5 (tied for 1st of 54), Devstral Small 1.1 scores 4/5 (rank 26 of 54). The gap is meaningful for API-heavy applications where malformed JSON causes downstream failures.

  • Long context (retrieval at 30K+ tokens): GPT-5.4 Mini scores 5/5 (tied for 1st of 55), Devstral Small 1.1 scores 4/5 (rank 38 of 55). GPT-5.4 Mini also has a 400K-token context window vs. Devstral's 131K, compounding the advantage for document-heavy tasks.

  • Faithfulness (sticking to source without hallucination): GPT-5.4 Mini scores 5/5 (tied for 1st of 55), Devstral Small 1.1 scores 4/5 (rank 34 of 55). In RAG pipelines or summarization, this difference translates to fewer hallucinated facts.

  • Constrained rewriting (compression within hard limits): GPT-5.4 Mini scores 4/5 (rank 6 of 53), Devstral Small 1.1 scores 3/5 (rank 31 of 53). A clear win for copy editing, headline generation, and similar tasks.

  • Creative problem solving: GPT-5.4 Mini scores 4/5 (rank 9 of 54), Devstral Small 1.1 scores 2/5 (rank 47 of 54). Devstral Small 1.1 is near the bottom of the field here.

  • Strategic analysis (nuanced tradeoff reasoning): GPT-5.4 Mini scores 5/5 (tied for 1st of 54), Devstral Small 1.1 scores 2/5 (rank 44 of 54). One of the starkest gaps in this comparison.

  • Agentic planning (goal decomposition, failure recovery): GPT-5.4 Mini scores 4/5 (rank 16 of 54), Devstral Small 1.1 scores 2/5 (rank 53 of 54 — nearly last). For autonomous agent workflows, Devstral Small 1.1's score here is a hard disqualifier.

  • Persona consistency (character maintenance, injection resistance): GPT-5.4 Mini scores 5/5 (tied for 1st of 53), Devstral Small 1.1 scores 2/5 (rank 51 of 53). Nearly last in the field.

  • Multilingual: GPT-5.4 Mini scores 5/5 (tied for 1st of 55), Devstral Small 1.1 scores 4/5 (rank 36 of 55). Non-English deployments strongly favor GPT-5.4 Mini.
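Whichever model you pick, the structured-output gap above is worth guarding against in code: even a 5/5 model can occasionally emit malformed JSON. A minimal parse-or-fallback wrapper (the function name here is ours, not part of either API) keeps one bad response from crashing a pipeline:

```python
import json
from typing import Any, Optional

def parse_model_json(raw: str) -> Optional[Any]:
    """Parse a model's JSON reply; return None instead of raising on malformed output."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None  # caller can retry the request or fall back

# Well-formed output parses; an unquoted value (a common failure mode) does not.
print(parse_model_json('{"label": "positive"}'))  # {'label': 'positive'}
print(parse_model_json('{"label": positive}'))    # None
```

In practice the difference between a 4/5 and a 5/5 schema-compliance score shows up as how often this wrapper returns None and triggers a retry.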

Benchmark | Devstral Small 1.1 | GPT-5.4 Mini
Faithfulness | 4/5 | 5/5
Long Context | 4/5 | 5/5
Multilingual | 4/5 | 5/5
Tool Calling | 4/5 | 4/5
Classification | 4/5 | 4/5
Agentic Planning | 2/5 | 4/5
Structured Output | 4/5 | 5/5
Safety Calibration | 2/5 | 2/5
Strategic Analysis | 2/5 | 5/5
Persona Consistency | 2/5 | 5/5
Constrained Rewriting | 3/5 | 4/5
Creative Problem Solving | 2/5 | 4/5
Summary | 0 wins | 9 wins

Pricing Analysis

Devstral Small 1.1 costs $0.10/M input tokens and $0.30/M output tokens. GPT-5.4 Mini costs $0.75/M input and $4.50/M output — 7.5x more expensive on input and 15x more on output.

At 1M output tokens/month: Devstral Small 1.1 costs $0.30 vs GPT-5.4 Mini's $4.50 — a $4.20 difference that's negligible for most teams.

At 10M output tokens/month: $3.00 vs $45.00 — a $42 gap that starts to matter for startups on tight margins.

At 100M output tokens/month: $30 vs $450 — a $420/month gap that is a real line item in any engineering budget.
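The volume arithmetic above is simple enough to sketch as a back-of-envelope calculator (prices taken from this page; the function name is our own):

```python
def monthly_output_cost(output_mtok: float, price_per_mtok: float) -> float:
    """Monthly output-token cost in dollars, given volume in millions of tokens."""
    return output_mtok * price_per_mtok

DEVSTRAL_OUT = 0.30    # $/M output tokens, Devstral Small 1.1
GPT54_MINI_OUT = 4.50  # $/M output tokens, GPT-5.4 Mini

for volume in (1, 10, 100):  # millions of output tokens per month
    gap = (monthly_output_cost(volume, GPT54_MINI_OUT)
           - monthly_output_cost(volume, DEVSTRAL_OUT))
    print(f"{volume}M tok/mo: gap ${gap:,.2f}")
```

Note this covers output tokens only; input tokens add a smaller 7.5x gap on top.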

The pricing gap is most relevant to high-volume API consumers running inference at scale: content pipelines, customer support automation, code generation tools. For developers making occasional API calls or running low-volume experiments, the absolute dollar difference is small enough that GPT-5.4 Mini's benchmark advantages likely justify the cost. GPT-5.4 Mini also accepts image and file inputs (text + image + file → text) alongside its 400,000-token context window, whereas Devstral Small 1.1 is text-only with a 131,072-token window — so for multimodal use cases there is no direct substitution.

Real-World Cost Comparison

Task | Devstral Small 1.1 | GPT-5.4 Mini
Chat response | <$0.001 | $0.0024
Blog post | <$0.001 | $0.0094
Document batch | $0.017 | $0.240
Pipeline run | $0.170 | $2.40
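Per-task figures like these blend input and output pricing. A sketch of that calculation, using this page's prices but with illustrative token counts of our own (not the site's exact workload definitions):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Dollar cost of one request; prices are in $/M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Assumed "chat response" shape: ~500 input tokens, ~400 output tokens.
devstral = request_cost(500, 400, 0.10, 0.30)    # well under a tenth of a cent
gpt54mini = request_cost(500, 400, 0.75, 4.50)   # roughly a fifth of a cent
```

At these shapes both models cost fractions of a cent per call, which is why the gap only matters once calls number in the millions.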

Bottom Line

Choose Devstral Small 1.1 if: your workload is text-only, high-volume, and cost is the dominant constraint; you need tool calling or classification at scale and can tolerate weaker reasoning elsewhere; and your output volume exceeds 10M tokens/month where the 15x output cost gap becomes a budget line item.

Choose GPT-5.4 Mini if: you need multimodal inputs (images, files); your use case involves agentic planning, strategic analysis, or persona-consistent chatbots — areas where Devstral Small 1.1 scores near the bottom of the field; you're processing long documents and need a 400K-token context window; or you need maximum faithfulness in RAG and summarization pipelines. For most developers and most use cases, GPT-5.4 Mini's benchmark advantages are substantial enough that the price premium is justified unless volume is extreme.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions