Devstral Medium vs GPT-4o

GPT-4o outscores Devstral Medium on creative problem solving (3 vs 2), tool calling (4 vs 3), and persona consistency (5 vs 3) in our testing, making it the stronger general-purpose model. However, Devstral Medium ties GPT-4o on 9 of 12 benchmarks while costing 80% less on output tokens ($2.00 vs $10.00 per million) — a meaningful tradeoff for cost-sensitive workloads. For developers running high-volume pipelines where those three differentiating capabilities are not critical, Devstral Medium delivers equivalent results at a fraction of the cost.

Devstral Medium (Mistral)

Overall: 3.17/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 3/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.40/MTok
Output: $2.00/MTok
Context Window: 131K

GPT-4o (OpenAI)

Overall: 3.50/5 (Strong)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 3/5

External Benchmarks

SWE-bench Verified: 31.0%
MATH Level 5: 53.3%
AIME 2025: 6.4%

Pricing

Input: $2.50/MTok
Output: $10.00/MTok
Context Window: 128K

Benchmark Analysis

Across our 12-test suite, GPT-4o wins 3 benchmarks and Devstral Medium wins 0, with 9 ties. Neither model dominates — but GPT-4o holds clear edges in specific areas.

Where GPT-4o wins:

  • Persona consistency: GPT-4o scores 5 vs Devstral Medium's 3 — tied for 1st among 53 models vs rank 45 of 53. This is a significant gap. For chatbot products, roleplay applications, or any workflow requiring a stable AI character that resists prompt injection, GPT-4o is meaningfully better in our testing.
  • Tool calling: GPT-4o scores 4 vs Devstral Medium's 3 — rank 18 of 54 vs rank 47 of 54. Function selection, argument accuracy, and sequencing all factor into this score. For agentic workflows that chain multiple tool calls, GPT-4o's advantage here is practically significant — tool calling errors compound across steps (see the sketch after this list).
  • Creative problem solving: GPT-4o scores 3 vs Devstral Medium's 2 — rank 30 of 54 vs rank 47 of 54. Neither model excels here (the median across all 54 models is 4), but GPT-4o produces more novel, feasible ideas in our non-obvious ideation tests.
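
To make "errors compound across steps" concrete: if each tool call succeeds independently with probability p, an n-step chain succeeds with probability p^n. A minimal sketch in Python (the per-call success rates are illustrative assumptions, not measured accuracies for either model):

    # Illustrative compounding of tool-calling errors across a chain.
    # Per-call success rates here are assumptions, not measured figures
    # for Devstral Medium or GPT-4o.
    def chain_success(per_call: float, steps: int) -> float:
        """Probability that every call in an n-step tool chain succeeds."""
        return per_call ** steps

    for p in (0.95, 0.90, 0.80):
        for n in (3, 5, 10):
            print(f"per-call {p:.0%}, {n:2d} steps -> chain {chain_success(p, n):.0%}")

At 90% per-call accuracy, a five-step chain completes only ~59% of the time, which is why a one-point gap on this benchmark matters more than it looks.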

Where both models tie (9 of 12 tests):

  • Agentic planning (both 4/5, rank 16 of 54): Strong goal decomposition and failure recovery from both models.
  • Structured output (both 4/5, rank 26 of 54): Solid JSON schema compliance from both — above the median.
  • Classification (both 4/5, tied for 1st of 53): Both models match the top tier on routing and categorization tasks.
  • Long context (both 4/5, rank 38 of 55): Above the 25th percentile, though not top-tier, on retrieval at 30K+ tokens. Both models offer ~128–131K context windows.
  • Faithfulness (both 4/5, rank 34 of 55): Both stick to source material reliably — relevant for summarization and RAG pipelines.
  • Multilingual (both 4/5, rank 36 of 55): Equivalent non-English output quality in our tests.
  • Constrained rewriting (both 3/5, rank 31 of 53): Neither model excels at compression within hard character limits — both sit at the median.
  • Strategic analysis (both 2/5, rank 44 of 54): A shared weakness — both score below the median on nuanced tradeoff reasoning with real numbers.
  • Safety calibration (both 1/5, rank 32 of 55): Both score at the 25th percentile on refusing harmful requests while permitting legitimate ones. A concern if safety guardrails are critical for your deployment.

External benchmarks (Epoch AI): Third-party benchmark data is available for GPT-4o. On SWE-bench Verified (real GitHub issue resolution), GPT-4o scores 31% — ranking 12th of 12 models with scores in our dataset, below the 25th-percentile benchmark of 61.1%. On MATH Level 5 (competition math), it scores 53.3% — rank 12 of 14, well below the median of 94.2%. On AIME 2025, it scores 6.4% — rank 22 of 23, well below the median of 83.9%. These are Epoch AI figures, not our internal tests, and no external benchmark scores are available for Devstral Medium. What the GPT-4o external scores tell us: GPT-4o is not a top-tier math or coding model by these third-party measures, and its SWE-bench result suggests developers should not rely on it for complex autonomous code repair tasks.

Benchmark | Devstral Medium | GPT-4o
Faithfulness | 4/5 | 4/5
Long Context | 4/5 | 4/5
Multilingual | 4/5 | 4/5
Tool Calling | 3/5 | 4/5
Classification | 4/5 | 4/5
Agentic Planning | 4/5 | 4/5
Structured Output | 4/5 | 4/5
Safety Calibration | 1/5 | 1/5
Strategic Analysis | 2/5 | 2/5
Persona Consistency | 3/5 | 5/5
Constrained Rewriting | 3/5 | 3/5
Creative Problem Solving | 2/5 | 3/5
Summary | 0 wins | 3 wins

Pricing Analysis

Devstral Medium costs $0.40/M input and $2.00/M output. GPT-4o costs $2.50/M input and $10.00/M output — 6.25× more on input and 5× more on output. At 1M output tokens/month, the bill is $2 vs $10 — a trivial $8 gap. At 10M output tokens, it's $20 vs $100 — an $80/month difference worth noticing. At 100M output tokens (a serious production workload), Devstral Medium saves $800/month on output alone, or roughly $9,600/year. Input costs follow the same pattern: 100M input tokens cost $40 with Devstral Medium vs $250 with GPT-4o. For consumer apps, the difference is negligible. For high-throughput API users — document processing, code review pipelines, classification at scale — Devstral Medium's pricing is a genuine operational advantage, especially on the 9 benchmarks where both models score identically.
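
The arithmetic generalizes to any token volume. A minimal sketch in Python: the helper name and the demo workload are ours, but the per-token rates are the published prices above.

    # Published per-million-token rates from the Pricing section.
    RATES = {
        "devstral-medium": {"input": 0.40, "output": 2.00},
        "gpt-4o": {"input": 2.50, "output": 10.00},
    }

    def monthly_cost(model: str, input_tokens: float, output_tokens: float) -> float:
        """Monthly API spend in dollars for a given token volume."""
        r = RATES[model]
        return input_tokens / 1e6 * r["input"] + output_tokens / 1e6 * r["output"]

    # The 100M-input / 100M-output production workload discussed above:
    for model in RATES:
        print(f"{model}: ${monthly_cost(model, 100e6, 100e6):,.2f}/month")
    # devstral-medium: $240.00/month
    # gpt-4o: $1,250.00/month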

Real-World Cost Comparison

Task | Devstral Medium | GPT-4o
Chat response | $0.0011 | $0.0055
Blog post | $0.0042 | $0.021
Document batch | $0.108 | $0.550
Pipeline run | $1.08 | $5.50
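
Reusing the monthly_cost helper from the sketch above, the chat-response row is reproduced if you assume roughly 200 input and 500 output tokens per response. Those token counts are our illustration; the site does not publish its per-task workload sizes.

    # Assumed workload: ~200 input / ~500 output tokens per chat response.
    # Token counts are our assumption, not modelpicker.net's figures.
    print(f"${monthly_cost('devstral-medium', 200, 500):.4f}")  # $0.0011
    print(f"${monthly_cost('gpt-4o', 200, 500):.4f}")           # $0.0055

The other rows scale the same way from larger token volumes.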

Bottom Line

Choose Devstral Medium if: You are building high-volume text pipelines — classification, document processing, structured data extraction, RAG — where the 9 tied benchmarks cover your use cases and you want to pay $2.00 vs $10.00 per million output tokens. At 100M tokens/month, that's $9,600/year in savings with no benchmark regression on those tasks. Also consider it if your workflow involves agentic planning or faithfulness-sensitive tasks where both models score equally.

Choose GPT-4o if: Your application depends on reliable tool calling (4 vs 3 in our tests, rank 18 vs 47 of 54), persona consistency (5 vs 3, rank 1 vs 45 of 53), or creative ideation (3 vs 2). Chatbots with defined characters, multi-step agentic systems that chain tool calls, and creative applications are the clearest cases where GPT-4o's higher scores justify its 5× output cost premium. GPT-4o also accepts image and file inputs alongside text, while Devstral Medium is text-only — a decisive factor if your application processes visual content.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions