Devstral Small 1.1 vs GPT-4o
For production agent and multimodal use cases that prioritize persona consistency, agentic planning, or creative problem solving, GPT-4o is the better pick in our testing. Devstral Small 1.1 is the cost-saving choice: it wins safety calibration in our benchmarks, ties on many core tasks, and costs roughly 3% as much as GPT-4o.
Pricing (per the payload, via modelpicker.net):
- Devstral Small 1.1 (Mistral): input $0.100/MTok, output $0.300/MTok
- GPT-4o (OpenAI): input $2.50/MTok, output $10.00/MTok
Benchmark Analysis
We compared both models across our 12-test suite (scores 1–5). Summary of where each model wins in our testing: GPT-4o wins creative_problem_solving (3 vs 2), persona_consistency (5 vs 2), and agentic_planning (4 vs 2). Devstral Small 1.1 wins safety_calibration (2 vs 1). The remaining eight tests are ties. Detailed walk-through (scoreA = Devstral, scoreB = GPT-4o):
- persona_consistency: 2 vs 5 — GPT-4o is substantially stronger in maintaining character/resisting injection in our tests; GPT-4o’s persona_consistency ranks tied for 1st of 53 models, while Devstral ranks 51 of 53. This matters for bots, roleplay agents, and systems that rely on strict persona behavior.
- safety_calibration: 2 vs 1 — Devstral edges GPT-4o in our safety calibration test (Devstral rank 12 of 55 vs GPT-4o rank 32 of 55). If rejecting harmful prompts while allowing legitimate ones is a priority, Devstral performed better in our runs.
- structured_output: 4 vs 4 (tie) — both models score 4 on JSON/schema compliance; both rank mid-table (rank 26 of 54). Use either for schema-constrained outputs but validate outputs in production.
- classification: 4 vs 4 (tie) — both models tied for 1st with many others (tied with 29 models), so both are strong for routing and labeling in our testing.
- tool_calling: 4 vs 4 (tie) — both handle function selection/arguments comparably (rank 18 of 54). Expect similar reliability for basic tool-invocation logic.
- long_context: 4 vs 4 (tie) — both scored 4 for retrieval at 30K+ tokens and share the same rank; note the payload lists a 131,072-token context window for Devstral and a 128,000-token window for GPT-4o.
- faithfulness: 4 vs 4 (tie) — both scored 4 and rank similarly (around rank 34), indicating comparable adherence to source material in our tests.
- constrained_rewriting: 3 vs 3 (tie) — both perform similarly compressing content into strict limits.
- creative_problem_solving: 2 vs 3 — GPT-4o outperforms Devstral for non-obvious, feasible idea generation in our testing (GPT-4o rank 30 of 54 vs Devstral rank 47 of 54).
- strategic_analysis: 2 vs 2 (tie) — both are similar on nuanced tradeoff reasoning in our suite.
- agentic_planning: 2 vs 4 — GPT-4o is stronger at goal decomposition and recovery in our experiments (GPT-4o rank 16 of 54 vs Devstral rank 53 of 54), which matters for multi-step agent workflows.
- multilingual: 4 vs 4 (tie) — both deliver comparable non-English quality per our tests.

External benchmarks: GPT-4o has third-party scores recorded: SWE-bench Verified 31% (Epoch AI), MATH Level 5 53.3% (Epoch AI), and AIME 2025 6.4% (Epoch AI). Devstral Small 1.1 has no external benchmark entries in the payload. These numbers add task-specific context (coding/math) from Epoch AI but do not override our internal 12-test results.
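The structured_output caveat above ("validate outputs in production") can be sketched as a minimal stdlib guard. This is an illustrative pattern, not either vendor's API; the SCHEMA keys (`label`, `confidence`) are hypothetical placeholders for whatever fields your prompt requests:

```python
import json

# Hypothetical schema: required keys and their expected Python types.
SCHEMA = {"label": str, "confidence": float}

def validate_model_output(raw: str) -> dict:
    """Parse a model's JSON reply and check it against SCHEMA.

    Raises ValueError on malformed JSON, missing keys, or wrong types,
    so a retry/fallback path can catch one exception type.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    for key, expected in SCHEMA.items():
        if key not in data:
            raise ValueError(f"missing required key: {key!r}")
        if not isinstance(data[key], expected):
            raise ValueError(f"{key!r} should be {expected.__name__}")
    return data

# A well-formed reply passes; a truncated or off-schema one raises.
print(validate_model_output('{"label": "spam", "confidence": 0.93}'))
```

Since both models rank mid-table on schema compliance here, a guard like this (or a full JSON Schema validator) is worth keeping in the loop regardless of which model you pick.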
Pricing Analysis
Per the payload prices: Devstral Small 1.1 charges $0.10/MTok input and $0.30/MTok output; GPT-4o charges $2.50/MTok input and $10.00/MTok output. Assuming 1M input tokens + 1M output tokens (equal I/O), Devstral costs $0.10 + $0.30 = $0.40; GPT-4o costs $2.50 + $10.00 = $12.50. At 10M input + 10M output tokens/month those totals become $4.00 vs $125.00; at 100M each, $40 vs $1,250. The payload's priceRatio is 0.03 (Devstral ≈ 3% of GPT-4o's cost). Teams with high-volume workloads (millions of tokens per month), tight budgets, or predictable structured tasks should care deeply about the gap; teams needing stronger persona consistency, agentic planning, or multimodal inputs may justify GPT-4o's roughly 30× higher cost.
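The arithmetic above reduces to a one-line cost function; the token volumes below are the payload's equal-I/O example, not a measured workload:

```python
def cost_usd(input_mtok: float, output_mtok: float,
             input_price: float, output_price: float) -> float:
    """USD cost given token volumes in millions and $/MTok prices."""
    return input_mtok * input_price + output_mtok * output_price

# Payload prices ($/MTok), 1M input + 1M output tokens each.
devstral = cost_usd(1, 1, 0.10, 0.30)
gpt4o = cost_usd(1, 1, 2.50, 10.00)

print(f"Devstral: ${devstral:.2f}")        # $0.40
print(f"GPT-4o:   ${gpt4o:.2f}")           # $12.50
print(f"price ratio: {devstral / gpt4o:.3f}")  # ≈ 0.032, i.e. ~3%
```

Because cost scales linearly with volume, the ratio (~3%) holds at any monthly token count; only the absolute gap grows.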
Bottom Line
Choose Devstral Small 1.1 if: you must minimize inference cost at scale (≈ $0.40 per 1M input + 1M output tokens), need a high-throughput classifier or structured-output engine, and can accept weaker persona consistency and agentic planning. Choose GPT-4o if: you need stronger persona consistency (5 vs 2), better agentic planning (4 vs 2), multimodal inputs (text+image+file→text in the payload), or higher creative problem-solving capacity, and you can absorb much higher inference spend ($12.50 per 1M input + 1M output tokens).
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.