Devstral Small 1.1 vs GPT-4o-mini

For most production chat and agentic workflows, pick GPT-4o-mini: it wins more benchmark categories in our testing (safety calibration 4 vs 2, persona consistency 4 vs 2, agentic planning 3 vs 2) and supports multimodal inputs. Choose Devstral Small 1.1 when cost and faithfulness matter — it costs ~50% less on typical token mixes and scores higher on faithfulness (4 vs 3).

Devstral Small 1.1 (Mistral)

Overall: 3.08/5 (Usable)

Benchmark Scores

  • Faithfulness: 4/5
  • Long Context: 4/5
  • Multilingual: 4/5
  • Tool Calling: 4/5
  • Classification: 4/5
  • Agentic Planning: 2/5
  • Structured Output: 4/5
  • Safety Calibration: 2/5
  • Strategic Analysis: 2/5
  • Persona Consistency: 2/5
  • Constrained Rewriting: 3/5
  • Creative Problem Solving: 2/5

External Benchmarks

  • SWE-bench Verified: N/A
  • MATH Level 5: N/A
  • AIME 2025: N/A

Pricing

  • Input: $0.100/MTok
  • Output: $0.300/MTok
  • Context Window: 131K

modelpicker.net

GPT-4o-mini (OpenAI)

Overall: 3.42/5 (Usable)

Benchmark Scores

  • Faithfulness: 3/5
  • Long Context: 4/5
  • Multilingual: 4/5
  • Tool Calling: 4/5
  • Classification: 4/5
  • Agentic Planning: 3/5
  • Structured Output: 4/5
  • Safety Calibration: 4/5
  • Strategic Analysis: 2/5
  • Persona Consistency: 4/5
  • Constrained Rewriting: 3/5
  • Creative Problem Solving: 2/5

External Benchmarks

  • SWE-bench Verified: N/A
  • MATH Level 5: 52.6%
  • AIME 2025: 6.9%

Pricing

  • Input: $0.150/MTok
  • Output: $0.600/MTok
  • Context Window: 128K

Benchmark Analysis

Across our 12-test suite the two models tie on eight tasks, GPT-4o-mini wins three, and Devstral Small 1.1 wins one. Detailed walkthrough (our scores):

  • Wins for GPT-4o-mini: safety calibration 4 vs 2 (GPT rank 6 of 55; Devstral rank 12 of 55), meaning GPT-4o-mini was substantially better at refusing harmful requests while permitting legitimate ones in our testing; persona consistency 4 vs 2 (GPT rank 38 of 53; Devstral rank 51 of 53), where GPT-4o-mini better maintains character and resists injection; and agentic planning 3 vs 2 (GPT rank 42 of 54; Devstral rank 53 of 54), where GPT-4o-mini showed stronger goal decomposition and failure recovery in our agentic tests.
  • Win for Devstral Small 1.1: faithfulness 4 vs 3 (Devstral rank 34 of 55; GPT rank 52 of 55). In practice this means Devstral is more likely to stick to source material and avoid mild hallucinations on our tasks.
  • Ties (equal scores): structured output 4/4, strategic analysis 2/2, constrained rewriting 3/3, creative problem solving 2/2, tool calling 4/4, classification 4/4, long context 4/4, multilingual 4/4. For these, both models performed similarly in our testing: e.g., both tied for top classification performance (tied for 1st with 29 other models), both matched on tool calling and long-context retrieval at 30K+ tokens.
  • External math benchmarks: GPT-4o-mini posts 52.6% on MATH Level 5 and 6.9% on AIME 2025 (Epoch AI). These are supplementary, task-specific math results reported by Epoch AI, not our internal 1–5 scores.

In short: GPT-4o-mini is stronger where safety, persona consistency, and resilient planning matter; Devstral is measurably more faithful and substantially cheaper. Many common tasks (classification, tool calling, long context, multilingual output) were ties in our tests, so cost and modality become the deciding factors for those use cases.
Benchmark                   Devstral Small 1.1   GPT-4o-mini
Faithfulness                4/5                  3/5
Long Context                4/5                  4/5
Multilingual                4/5                  4/5
Tool Calling                4/5                  4/5
Classification              4/5                  4/5
Agentic Planning            2/5                  3/5
Structured Output           4/5                  4/5
Safety Calibration          2/5                  4/5
Strategic Analysis          2/5                  2/5
Persona Consistency         2/5                  4/5
Constrained Rewriting       3/5                  3/5
Creative Problem Solving    2/5                  2/5
Summary                     1 win                3 wins
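The Summary row can be reproduced directly from the scores above; a minimal Python sketch, with each pair copied from the table (Devstral first):

```python
# Score pairs from the comparison table: (Devstral Small 1.1, GPT-4o-mini).
scores = {
    "Faithfulness": (4, 3), "Long Context": (4, 4), "Multilingual": (4, 4),
    "Tool Calling": (4, 4), "Classification": (4, 4), "Agentic Planning": (2, 3),
    "Structured Output": (4, 4), "Safety Calibration": (2, 4),
    "Strategic Analysis": (2, 2), "Persona Consistency": (2, 4),
    "Constrained Rewriting": (3, 3), "Creative Problem Solving": (2, 2),
}
devstral_wins = sum(d > g for d, g in scores.values())
gpt_wins = sum(g > d for d, g in scores.values())
ties = sum(d == g for d, g in scores.values())
print(f"Devstral wins: {devstral_wins}, GPT-4o-mini wins: {gpt_wins}, ties: {ties}")
# Devstral wins: 1, GPT-4o-mini wins: 3, ties: 8
```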

Pricing Analysis

Prices are quoted per million tokens (MTok), not per thousand. Devstral Small 1.1: input $0.10/MTok, output $0.30/MTok. GPT-4o-mini: input $0.15/MTok, output $0.60/MTok. Assuming a 50/50 split of input and output tokens, the blended rates are $0.20/MTok for Devstral and $0.375/MTok for GPT-4o-mini, so Devstral is roughly 47% cheaper. Monthly totals scale linearly: at 10M tokens, Devstral costs $2.00 vs $3.75 for GPT-4o-mini; at 100M tokens, $20 vs $37.50; at 1B tokens, $200 vs $375. At these rates the absolute savings only become material for teams running hundreds of millions of tokens per month. If your app is latency- or feature-constrained rather than token-cost constrained, GPT-4o-mini's higher price may be justifiable for its safety and persona gains.
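The blended-rate arithmetic above can be sketched in a few lines of Python; the 50/50 input/output split is the same assumption stated in the text, and `monthly_cost` is a hypothetical helper name, not part of any API:

```python
# Blended cost per MTok, then total cost for a given monthly token volume.
# Assumption (from the text): a 50/50 split of input and output tokens.
def monthly_cost(tokens_millions: float, input_per_mtok: float,
                 output_per_mtok: float, input_share: float = 0.5) -> float:
    blended = input_share * input_per_mtok + (1 - input_share) * output_per_mtok
    return tokens_millions * blended

for volume in (10, 100, 1000):  # tokens per month, in millions
    devstral = monthly_cost(volume, 0.10, 0.30)
    gpt = monthly_cost(volume, 0.15, 0.60)
    print(f"{volume}M tokens: Devstral ${devstral:,.2f} vs GPT-4o-mini ${gpt:,.2f}")
```

Adjust `input_share` to match your workload; output-heavy apps widen the gap further, since the output-price ratio ($0.30 vs $0.60) is 2x while the input ratio is 1.5x.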

Real-World Cost Comparison

Task              Devstral Small 1.1   GPT-4o-mini
Chat response     <$0.001              <$0.001
Blog post         <$0.001              $0.0013
Document batch    $0.017               $0.033
Pipeline run      $0.170               $0.330

Bottom Line

Choose Devstral Small 1.1 if: you need lower token costs at scale (roughly half the blended token price on a 50/50 input/output split), higher faithfulness (4 vs 3), or a model tuned for software-engineering agent workflows, per Mistral's description. Choose GPT-4o-mini if: you prioritize safety and persona consistency (safety calibration 4 vs 2, persona consistency 4 vs 2), need better agentic planning (3 vs 2), or require multimodal inputs (text+image+file → text). If your workload is classification, tool calling, long-context retrieval, or multilingual output, both models tied in our tests, so pick by cost, modality, or integration requirements.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
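The overall scores shown on the cards appear to be the simple mean of the twelve 1–5 benchmark scores (3.08 and 3.42 both match this); a quick check, under that averaging assumption:

```python
# Benchmark scores in card order; overall score assumed to be the simple mean.
devstral = [4, 4, 4, 4, 4, 2, 4, 2, 2, 2, 3, 2]    # sums to 37
gpt4o_mini = [3, 4, 4, 4, 4, 3, 4, 4, 2, 4, 3, 2]  # sums to 41
print(round(sum(devstral) / len(devstral), 2))      # 3.08
print(round(sum(gpt4o_mini) / len(gpt4o_mini), 2))  # 3.42
```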

Frequently Asked Questions