Devstral Small 1.1 vs GPT-5.4

For most production and research use cases that demand long context, safety, faithfulness, and agentic planning, GPT-5.4 is the better pick in our testing. Devstral Small 1.1 is the cost-focused choice: it beats GPT-5.4 only on classification, but delivers huge savings (about 2% of GPT-5.4's pricing), so pick it when volume and budget dominate your requirements.

mistral

Devstral Small 1.1

Overall
3.08/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
2/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
2/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.100/MTok

Output

$0.300/MTok

Context Window: 131K

modelpicker.net

openai

GPT-5.4

Overall
4.58/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
4/5
Classification
3/5
Agentic Planning
5/5
Structured Output
5/5
Safety Calibration
5/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
76.9%
MATH Level 5
N/A
AIME 2025
95.3%

Pricing

Input

$2.50/MTok

Output

$15.00/MTok

Context Window: 1050K


Benchmark Analysis

We ran a 12-test suite, each test scored 1–5, and report results below as "in our testing." Win/tie summary: GPT-5.4 wins 10 tests, Devstral Small 1.1 wins 1, and they tie on 1. Detailed comparison (Devstral score → GPT-5.4 score):

  • Persona consistency: 2 → 5. GPT-5.4 ranks tied 1st of 53 for persona consistency; Devstral ranks 51 of 53. This matters for dialogue agents that must maintain a character or role.
  • Safety calibration: 2 → 5. GPT-5.4 is tied for 1st of 55 on safety calibration; Devstral is rank 12 of 55. In practice GPT-5.4 refuses harmful requests and permits legitimate ones much more reliably in our tests.
  • Structured output: 4 → 5. GPT-5.4 is tied for 1st of 54; Devstral sits mid-pack (rank 26). For JSON/schema compliance, GPT-5.4 is more reliable.
  • Classification: 4 → 3. Devstral wins (tied for 1st with many models out of 53); choose Devstral when accurate routing or categorization is the priority.
  • Tool calling: 4 → 4 (tie). Both scored equally on function selection and argument accuracy in our tests.
  • Long context: 4 → 5. GPT-5.4 is tied for 1st of 55 on long context; Devstral ranks 38. For retrieval or summarization across 30K+ tokens, GPT-5.4 is clearly stronger.
  • Faithfulness: 4 → 5. GPT-5.4 is tied for 1st of 55; expect fewer hallucinations from GPT-5.4 in our tests.
  • Constrained rewriting: 3 → 4. GPT-5.4 ranks 6 of 53; it handles hard character limits better in our evaluation.
  • Creative problem solving: 2 → 4. GPT-5.4 ranks 9 of 54; it produced more non-obvious, feasible ideas in our runs.
  • Strategic analysis: 2 → 5. GPT-5.4 tied for 1st of 54, showing much stronger numeric tradeoff reasoning in our tests.
  • Agentic planning: 2 → 5. GPT-5.4 tied for 1st of 54; it decomposes goals and plans failure recovery more effectively in our trials.
  • Multilingual: 4 → 5. GPT-5.4 tied for 1st of 55; it produced higher-quality non-English outputs in our sampling.

External benchmarks: GPT-5.4 scores 76.9% on SWE-bench Verified and 95.3% on AIME 2025 according to Epoch AI; we cite these as supplementary, third-party evidence of its coding and math strengths. Devstral Small 1.1 has no published SWE-bench or AIME scores.

Overall interpretation: GPT-5.4 is markedly stronger across safety, long-context, planning, and reasoning tasks in our testing; Devstral is viable where classification accuracy and minimal cost are the top constraints.
| Benchmark | Devstral Small 1.1 | GPT-5.4 |
|---|---|---|
| Faithfulness | 4/5 | 5/5 |
| Long Context | 4/5 | 5/5 |
| Multilingual | 4/5 | 5/5 |
| Tool Calling | 4/5 | 4/5 |
| Classification | 4/5 | 3/5 |
| Agentic Planning | 2/5 | 5/5 |
| Structured Output | 4/5 | 5/5 |
| Safety Calibration | 2/5 | 5/5 |
| Strategic Analysis | 2/5 | 5/5 |
| Persona Consistency | 2/5 | 5/5 |
| Constrained Rewriting | 3/5 | 4/5 |
| Creative Problem Solving | 2/5 | 4/5 |
| Summary | 1 win | 10 wins |
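The win/tie tally can be reproduced directly from the per-benchmark scores. A minimal sketch in Python (score pairs copied from the table above; the dictionary layout is our own illustration, not part of the test harness):

```python
# Per-benchmark scores out of 5: (Devstral Small 1.1, GPT-5.4).
scores = {
    "Faithfulness": (4, 5),
    "Long Context": (4, 5),
    "Multilingual": (4, 5),
    "Tool Calling": (4, 4),
    "Classification": (4, 3),
    "Agentic Planning": (2, 5),
    "Structured Output": (4, 5),
    "Safety Calibration": (2, 5),
    "Strategic Analysis": (2, 5),
    "Persona Consistency": (2, 5),
    "Constrained Rewriting": (3, 4),
    "Creative Problem Solving": (2, 4),
}

# Tally wins and ties by comparing each score pair.
devstral_wins = sum(d > g for d, g in scores.values())
gpt_wins = sum(g > d for d, g in scores.values())
ties = sum(d == g for d, g in scores.values())
print(devstral_wins, gpt_wins, ties)  # → 1 10 1
```

Classification is Devstral's single win; tool calling is the lone tie.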

Pricing Analysis

Per the listed pricing: Devstral Small 1.1 charges $0.10 input / $0.30 output per million tokens (MTok); GPT-5.4 charges $2.50 input / $15.00 output per MTok. At a 50/50 input/output split, 1M tokens/month costs $0.20 on Devstral vs $8.75 on GPT-5.4; 10M tokens, $2.00 vs $87.50; 100M tokens, $20.00 vs $875.00. If all tokens are outputs (worst case for cost), 1M tokens costs $0.30 on Devstral vs $15.00 on GPT-5.4. Either way the price ratio is roughly 0.02 (Devstral ≈ 2% of GPT-5.4). Who should care: high-volume services, startups, and cost-sensitive APIs will find Devstral's price compelling; teams requiring top-tier safety, long-context reasoning, or mission-critical fidelity should budget for GPT-5.4's substantially higher cost.
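A quick sanity check on these figures. The sketch below assumes the per-MTok prices from the cards above; the `monthly_cost` helper and the 50/50 split are illustrative, not part of any provider API:

```python
def monthly_cost(total_tokens, input_price, output_price, input_share=0.5):
    """Blended monthly cost in dollars; prices are per million tokens."""
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens * (1 - input_share)
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# 1M tokens/month at a 50/50 input/output split.
devstral = monthly_cost(1_000_000, 0.10, 0.30)  # → 0.20
gpt54 = monthly_cost(1_000_000, 2.50, 15.00)    # → 8.75
print(devstral, gpt54, round(devstral / gpt54, 3))  # ratio → 0.023
```

Scaling `total_tokens` to 10M or 100M multiplies both costs linearly, so the ≈2% ratio holds at any volume.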

Real-World Cost Comparison

| Task | Devstral Small 1.1 | GPT-5.4 |
|---|---|---|
| Chat response | <$0.001 | $0.0080 |
| Blog post | <$0.001 | $0.031 |
| Document batch | $0.017 | $0.800 |
| Pipeline run | $0.170 | $8.00 |

Bottom Line

Choose Devstral Small 1.1 if: you operate at high token volumes and need the lowest cost (≈ 2% of GPT-5.4's pricing), your workloads emphasize classification or inexpensive chat/utility tasks, or you must hit tight budget envelopes (examples: high-QPS classification APIs, telemetry tagging, low-cost assistants). Choose GPT-5.4 if: you need top-tier safety calibration, long-context retrieval and summarization (tied for 1st on long context), agentic planning and strategic analysis, or multilingual parity, or you rely on third-party coding/math benchmarks (76.9% SWE-bench Verified, 95.3% AIME 2025 per Epoch AI).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions