Devstral Small 1.1 vs GPT-5.4
For most production and research use cases that demand long context, safety, faithfulness, and agentic planning, GPT-5.4 is the better pick in our testing. Devstral Small 1.1 is the cost-focused choice: it beats GPT-5.4 only on classification, but at roughly 2% of GPT-5.4's pricing, it is the right pick when volume and budget dominate requirements.
Devstral Small 1.1 (Mistral)
Pricing: $0.100/MTok input, $0.300/MTok output
modelpicker.net
GPT-5.4 (OpenAI)
Pricing: $2.50/MTok input, $15.00/MTok output
Benchmark Analysis
We ran a 12-test suite (each test scored 1–5); all results below are from our own testing. Summary: GPT-5.4 wins 10 tests, Devstral Small 1.1 wins 1, and 1 is a tie. Detailed comparison (Devstral score → GPT-5.4 score):
- Persona consistency: 2 → 5. GPT-5.4 ranks tied 1st of 53 for persona consistency; Devstral ranks 51 of 53. This matters for dialogue agents that must maintain a character or role.
- Safety calibration: 2 → 5. GPT-5.4 is tied for 1st of 55 on safety calibration; Devstral is rank 12 of 55. In practice GPT-5.4 refuses harmful requests and permits legitimate ones much more reliably in our tests.
- Structured output: 4 → 5. GPT-5.4 is tied for 1st of 54; Devstral sits mid-pack (rank 26). For JSON/schema compliance, GPT-5.4 is more reliable.
- Classification: 4 → 3. Devstral wins (tied for 1st with many models out of 53); choose Devstral when accurate routing or categorization is the priority.
- Tool calling: 4 → 4 (tie). Both scored equally on function selection and argument accuracy in our tests.
- Long context: 4 → 5. GPT-5.4 is tied for 1st of 55 on long context; Devstral ranks 38. For retrieval or summarization across 30K+ tokens, GPT-5.4 is clearly stronger.
- Faithfulness: 4 → 5. GPT-5.4 is tied for 1st of 55; expect fewer hallucinations from GPT-5.4 in our tests.
- Constrained rewriting: 3 → 4. GPT-5.4 ranks 6 of 53; it handles hard character limits better in our evaluation.
- Creative problem solving: 2 → 4. GPT-5.4 ranks 9 of 54; it produced more non-obvious, feasible ideas in our runs.
- Strategic analysis: 2 → 5. GPT-5.4 tied for 1st of 54, showing much stronger numeric tradeoff reasoning in our tests.
- Agentic planning: 2 → 5. GPT-5.4 tied for 1st of 54; it decomposes goals and plans failure recovery more effectively in our trials.
- Multilingual: 4 → 5. GPT-5.4 tied for 1st of 55; it produced higher-quality non-English outputs in our sampling.

External benchmarks: GPT-5.4 also scores 76.9% on SWE-bench Verified and 95.3% on AIME 2025 according to Epoch AI; we cite these as supplementary, third-party evidence of its coding and math strengths. Devstral has no external SWE/AIME scores in the payload.

Overall interpretation: GPT-5.4 is markedly stronger across safety, long-context, planning, and reasoning tasks in our testing; Devstral is viable where classification accuracy plus minimal cost are the top constraints.
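The win/tie summary follows mechanically from the per-test scores above. A minimal sketch of the tally (scores as listed, one `(Devstral, GPT-5.4)` pair per test):

```python
# Per-test scores from our 12-test suite: (Devstral Small 1.1, GPT-5.4).
SCORES = {
    "persona consistency": (2, 5),
    "safety calibration": (2, 5),
    "structured output": (4, 5),
    "classification": (4, 3),
    "tool calling": (4, 4),
    "long context": (4, 5),
    "faithfulness": (4, 5),
    "constrained rewriting": (3, 4),
    "creative problem solving": (2, 4),
    "strategic analysis": (2, 5),
    "agentic planning": (2, 5),
    "multilingual": (4, 5),
}

devstral_wins = sum(1 for d, g in SCORES.values() if d > g)
gpt_wins = sum(1 for d, g in SCORES.values() if g > d)
ties = sum(1 for d, g in SCORES.values() if d == g)
print(gpt_wins, devstral_wins, ties)  # 10 1 1
```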
Pricing Analysis
Costs from the payload: Devstral Small 1.1 charges $0.10 input / $0.30 output per million tokens (MTok); GPT-5.4 charges $2.50 input / $15.00 output per MTok. At a 50/50 input/output split: 1M tokens/month costs Devstral $0.20 and GPT-5.4 $8.75; 10M tokens costs Devstral $2.00 vs GPT-5.4 $87.50; 100M tokens costs Devstral $20 vs GPT-5.4 $875. If all tokens are outputs (worst case for cost): 1M tokens = Devstral $0.30 vs GPT-5.4 $15.00. The payload reports a priceRatio of 0.02 (Devstral ≈ 2% of GPT-5.4), which aligns with these figures. Who should care: high-volume services, startups, and cost-sensitive APIs will find Devstral's price compelling; teams requiring top-tier safety calibration, long-context reasoning, or mission-critical fidelity should budget for GPT-5.4's substantially higher cost.
Real-World Cost Comparison
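The arithmetic above can be sketched as a small cost calculator using the per-MTok prices from the payload. The function name and the 50/50 input/output split in the usage lines are illustrative assumptions, not part of any API:

```python
# Per-MTok prices from the payload (dollars per million tokens).
PRICES_PER_MTOK = {
    "Devstral Small 1.1": {"input": 0.10, "output": 0.30},
    "GPT-5.4": {"input": 2.50, "output": 15.00},
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the monthly cost in dollars for a given token mix."""
    p = PRICES_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 1M tokens/month at a 50/50 input/output split:
print(round(monthly_cost("Devstral Small 1.1", 500_000, 500_000), 2))  # 0.2
print(round(monthly_cost("GPT-5.4", 500_000, 500_000), 2))             # 8.75
```

Scaling is linear, so the 10M and 100M figures in the analysis are just this result multiplied by 10 and 100.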
Bottom Line
Choose Devstral Small 1.1 if: you operate at high token volumes and need the lowest cost (Devstral ≈ 2% of GPT-5.4 pricing by the payload), your workloads emphasize classification or inexpensive chat/utility tasks, or you must hit tight budget envelopes (examples: high-QPS classification APIs, telemetry tagging, low-cost assistants).

Choose GPT-5.4 if: you need top-tier safety calibration, long-context retrieval and summarization (tied 1st for long context), agentic planning and strategic analysis, multilingual parity, or you rely on third-party coding/math benchmarks (76.9% SWE-bench Verified, 95.3% AIME 2025 per Epoch AI).
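The decision rule above can be distilled into a simple, illustrative picker; the requirement flags below are names we invented for this sketch, not part of either vendor's API:

```python
def pick_model(needs_long_context: bool = False,
               needs_safety_calibration: bool = False,
               needs_agentic_planning: bool = False,
               cost_dominates: bool = False) -> str:
    """Illustrative routing rule distilled from our test results."""
    # GPT-5.4 led 10 of 12 tests, including every safety, planning,
    # and long-context test, so capability needs take precedence.
    if needs_long_context or needs_safety_calibration or needs_agentic_planning:
        return "GPT-5.4"
    # Devstral won only classification, but runs at ~2% of GPT-5.4's price.
    if cost_dominates:
        return "Devstral Small 1.1"
    return "GPT-5.4"  # default to the stronger generalist

print(pick_model(cost_dominates=True))  # Devstral Small 1.1
```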
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.