Devstral Medium vs GPT-4o

GPT-4o outscores Devstral Medium on creative problem solving (3 vs 2), tool calling (4 vs 3), and persona consistency (5 vs 3) in our testing, making it the stronger general-purpose model. However, Devstral Medium ties GPT-4o on 9 of 12 benchmarks while costing 80% less on output tokens ($2.00 vs $10.00 per million) — a meaningful tradeoff for cost-sensitive workloads. For developers running high-volume pipelines where those three differentiating capabilities are not critical, Devstral Medium delivers equivalent results at a fraction of the cost.

Devstral Medium (Mistral)

Overall: 3.17/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 3/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.40/MTok
Output: $2.00/MTok
Context Window: 131K

GPT-4o (OpenAI)

Overall: 3.50/5 (Strong)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 3/5

External Benchmarks

SWE-bench Verified: 31.0%
MATH Level 5: 53.3%
AIME 2025: 6.4%

Pricing

Input: $2.50/MTok
Output: $10.00/MTok
Context Window: 128K

Benchmark Analysis

Across our 12-test suite, GPT-4o wins 3 benchmarks and Devstral Medium wins 0, with 9 ties. Neither model dominates — but GPT-4o holds clear edges in specific areas.

Where GPT-4o wins:

  • Persona consistency: GPT-4o scores 5 vs Devstral Medium's 3 — tied for 1st among 53 models vs rank 45 of 53. This is a significant gap. For chatbot products, roleplay applications, or any workflow requiring a stable AI character that resists prompt injection, GPT-4o is meaningfully better in our testing.
  • Tool calling: GPT-4o scores 4 vs Devstral Medium's 3 — rank 18 of 54 vs rank 47 of 54. Function selection, argument accuracy, and sequencing all factor into this score. For agentic workflows that chain multiple tool calls, GPT-4o's advantage here is practically significant — tool calling errors compound across steps (see the sketch after this list).
  • Creative problem solving: GPT-4o scores 3 vs Devstral Medium's 2 — rank 30 of 54 vs rank 47 of 54. Neither model excels here (the median across all 54 models is 4), but GPT-4o produces more novel, feasible ideas in our non-obvious ideation tests.
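
To make "errors compound across steps" concrete: if each tool call succeeds independently with probability p, an n-step chain succeeds with probability p^n. A minimal sketch in Python (the per-call success rates are illustrative assumptions, not measured accuracies for either model):

    # Illustrative compounding of tool-calling errors across a chain.
    # Per-call success rates here are assumptions, not measured figures
    # for Devstral Medium or GPT-4o.
    def chain_success(per_call: float, steps: int) -> float:
        """Probability that every call in an n-step tool chain succeeds."""
        return per_call ** steps

    for p in (0.95, 0.90, 0.80):
        for n in (3, 5, 10):
            print(f"per-call {p:.0%}, {n:2d} steps -> chain {chain_success(p, n):.0%}")

At 90% per-call accuracy, a five-step chain completes only ~59% of the time, which is why a one-point gap on this benchmark matters more than it looks.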

Where both models tie (9 of 12 tests):

  • Agentic planning (both 4/5, rank 16 of 54): Strong goal decomposition and failure recovery from both models.
  • Structured output (both 4/5, rank 26 of 54): Solid JSON schema compliance from both — above the median.
  • Classification (both 4/5, tied for 1st of 53): Both models match the top tier on routing and categorization tasks.
  • Long context (both 4/5, rank 38 of 55): Above the 25th percentile, though not top-tier, on retrieval at 30K+ tokens. Both models offer ~128–131K context windows.
  • Faithfulness (both 4/5, rank 34 of 55): Both stick to source material reliably — relevant for summarization and RAG pipelines.
  • Multilingual (both 4/5, rank 36 of 55): Equivalent non-English output quality in our tests.
  • Constrained rewriting (both 3/5, rank 31 of 53): Neither model excels at compression within hard character limits — both sit at the median.
  • Strategic analysis (both 2/5, rank 44 of 54): A shared weakness — both score below the median on nuanced tradeoff reasoning with real numbers.
  • Safety calibration (both 1/5, rank 32 of 55): Both score at the 25th percentile on refusing harmful requests while permitting legitimate ones. A concern if safety guardrails are critical for your deployment.

External benchmarks (Epoch AI): Third-party benchmark data is available for GPT-4o. On SWE-bench Verified (real GitHub issue resolution), GPT-4o scores 31% — ranking 12th of 12 models with scores in our dataset, below the 25th-percentile benchmark of 61.1%. On MATH Level 5 (competition math), it scores 53.3% — rank 12 of 14, well below the median of 94.2%. On AIME 2025, it scores 6.4% — rank 22 of 23, well below the median of 83.9%. These are Epoch AI figures, not our internal tests, and no external benchmark scores are available for Devstral Medium. What the GPT-4o external scores tell us: GPT-4o is not a top-tier math or coding model by these third-party measures, and its SWE-bench result suggests developers should not rely on it for complex autonomous code repair tasks.

Benchmark | Devstral Medium | GPT-4o
Faithfulness | 4/5 | 4/5
Long Context | 4/5 | 4/5
Multilingual | 4/5 | 4/5
Tool Calling | 3/5 | 4/5
Classification | 4/5 | 4/5
Agentic Planning | 4/5 | 4/5
Structured Output | 4/5 | 4/5
Safety Calibration | 1/5 | 1/5
Strategic Analysis | 2/5 | 2/5
Persona Consistency | 3/5 | 5/5
Constrained Rewriting | 3/5 | 3/5
Creative Problem Solving | 2/5 | 3/5
Summary | 0 wins | 3 wins

Pricing Analysis

Devstral Medium costs $0.40/M input and $2.00/M output. GPT-4o costs $2.50/M input and $10.00/M output — 6.25× more on input and 5× more on output. At 1M output tokens/month, the bill is $2 vs $10 — a trivial $8 gap. At 10M output tokens, it's $20 vs $100 — an $80/month difference worth noticing. At 100M output tokens (a serious production workload), Devstral Medium saves $800/month on output alone, or roughly $9,600/year. Input costs follow the same pattern: 100M input tokens cost $40 with Devstral Medium vs $250 with GPT-4o. For consumer apps, the difference is negligible. For high-throughput API users — document processing, code review pipelines, classification at scale — Devstral Medium's pricing is a genuine operational advantage, especially on the 9 benchmarks where both models score identically.
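
The arithmetic generalizes to any token volume. A minimal sketch in Python: the helper name and the demo workload are ours, but the per-token rates are the published prices above.

    # Published per-million-token rates from the Pricing section.
    RATES = {
        "devstral-medium": {"input": 0.40, "output": 2.00},
        "gpt-4o": {"input": 2.50, "output": 10.00},
    }

    def monthly_cost(model: str, input_tokens: float, output_tokens: float) -> float:
        """Monthly API spend in dollars for a given token volume."""
        r = RATES[model]
        return input_tokens / 1e6 * r["input"] + output_tokens / 1e6 * r["output"]

    # The 100M-input / 100M-output production workload discussed above:
    for model in RATES:
        print(f"{model}: ${monthly_cost(model, 100e6, 100e6):,.2f}/month")
    # devstral-medium: $240.00/month
    # gpt-4o: $1,250.00/month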

Real-World Cost Comparison

Task | Devstral Medium | GPT-4o
Chat response | $0.0011 | $0.0055
Blog post | $0.0042 | $0.021
Document batch | $0.108 | $0.550
Pipeline run | $1.08 | $5.50
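
Reusing the monthly_cost helper from the sketch above, the chat-response row is reproduced if you assume roughly 200 input and 500 output tokens per response. Those token counts are our illustration; the site does not publish its per-task workload sizes.

    # Assumed workload: ~200 input / ~500 output tokens per chat response.
    # Token counts are our assumption, not modelpicker.net's figures.
    print(f"${monthly_cost('devstral-medium', 200, 500):.4f}")  # $0.0011
    print(f"${monthly_cost('gpt-4o', 200, 500):.4f}")           # $0.0055

The other rows scale the same way from larger token volumes.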

Bottom Line

Choose Devstral Medium if: You are building high-volume text pipelines — classification, document processing, structured data extraction, RAG — where the 9 tied benchmarks cover your use cases and you want to pay $2.00 vs $10.00 per million output tokens. At 100M tokens/month, that's $9,600/year in savings with no benchmark regression on those tasks. Also consider it if your workflow involves agentic planning or faithfulness-sensitive tasks where both models score equally.

Choose GPT-4o if: Your application depends on reliable tool calling (4 vs 3 in our tests, rank 18 vs 47 of 54), persona consistency (5 vs 3, rank 1 vs 45 of 53), or creative ideation (3 vs 2). Chatbots with defined characters, multi-step agentic systems that chain tool calls, and creative applications are the clearest cases where GPT-4o's higher scores justify its 5× output cost premium. GPT-4o also accepts image and file inputs alongside text, while Devstral Medium is text-only — a decisive factor if your application processes visual content.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions