Devstral Small 1.1 vs GPT-5

GPT-5 wins 10 of 12 benchmarks in our testing, outscoring Devstral Small 1.1 across agentic planning, strategic analysis, creative problem solving, and multilingual tasks by significant margins. Devstral Small 1.1 wins zero benchmarks outright and ties on two (classification and safety calibration), making GPT-5 the clear capability winner. The central tradeoff is cost: at $0.30/MTok output vs GPT-5's $10/MTok, Devstral Small 1.1 is 33x cheaper on output — a difference that matters enormously at scale.

At a Glance

                       Devstral Small 1.1 (Mistral)   GPT-5 (OpenAI)
Overall                3.08/5 (Usable)                4.50/5 (Strong)
Input price            $0.10/MTok                     $1.25/MTok
Output price           $0.30/MTok                     $10.00/MTok
Context window         131K                           400K
SWE-bench Verified     N/A                            73.6%
MATH Level 5           N/A                            98.1%
AIME 2025              N/A                            91.4%

Benchmark Analysis

In our 12-test benchmark suite, GPT-5 outscores Devstral Small 1.1 on 10 tests and ties on 2. Devstral wins none.

Where GPT-5 dominates:

  • Agentic planning: GPT-5 scores 5/5 (tied 1st of 54) vs Devstral's 2/5 (rank 53 of 54, near the bottom of the field). This is the starkest gap. Agentic planning measures goal decomposition and failure recovery — critical for autonomous coding agents, multi-step pipelines, and LLM orchestration.
  • Strategic analysis: GPT-5 scores 5/5 (tied 1st of 54) vs Devstral's 2/5 (rank 44 of 54). This measures nuanced tradeoff reasoning with real numbers — relevant for research, business analysis, and decision-support applications.
  • Creative problem solving: GPT-5 scores 4/5 (rank 9 of 54) vs Devstral's 2/5 (rank 47 of 54), another near-bottom result for Devstral.
  • Persona consistency: GPT-5 scores 5/5 (tied 1st of 53) vs Devstral's 2/5 (rank 51 of 53 — second to last). This matters for chatbots and character-consistent applications.
  • Tool calling: GPT-5 scores 5/5 (tied 1st of 54) vs Devstral's 4/5 (rank 18 of 54). Both are above the median (4/5), but GPT-5 reaches the ceiling. Tool calling governs function selection, argument accuracy, and sequencing — directly impacting agentic and API-connected workflows.
  • Faithfulness: GPT-5 scores 5/5 (tied 1st of 55) vs Devstral's 4/5 (rank 34 of 55). GPT-5 is less likely to hallucinate or drift from source material.
  • Long context: GPT-5 scores 5/5 (tied 1st of 55) vs Devstral's 4/5 (rank 38 of 55). GPT-5 also has a substantially larger context window: 400,000 tokens vs Devstral's 131,072.
  • Structured output: GPT-5 scores 5/5 (tied 1st of 54) vs Devstral's 4/5 (rank 26 of 54). Both exceed the median but GPT-5 reaches the ceiling on JSON schema compliance.
  • Constrained rewriting: GPT-5 scores 4/5 (rank 6 of 53) vs Devstral's 3/5 (rank 31 of 53).
  • Multilingual: GPT-5 scores 5/5 (tied 1st of 55) vs Devstral's 4/5 (rank 36 of 55).
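Tool calling, the benchmark where both models clear the median, measures whether a model picks the right function and fills its arguments correctly. As a rough illustration of what that exercises, here is a hypothetical tool definition in the widely used OpenAI-style `tools` format (the function name and schema are invented for this sketch, not either vendor's actual API surface), plus the checks a well-formed tool call must pass:

```python
import json

# Hypothetical tool definition in the OpenAI-style "tools" format.
# The function name, description, and parameters are illustrative only.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_ticket_status",
            "description": "Look up the status of a support ticket by ID.",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticket_id": {
                        "type": "string",
                        "description": "Ticket identifier, e.g. 'T-1042'.",
                    },
                },
                "required": ["ticket_id"],
            },
        },
    }
]

# A well-formed tool call from the model: correct function selected,
# arguments parse as JSON and match the declared schema.
model_tool_call = {
    "name": "get_ticket_status",
    "arguments": json.dumps({"ticket_id": "T-1042"}),
}

args = json.loads(model_tool_call["arguments"])
schema = tools[0]["function"]["parameters"]
assert model_tool_call["name"] == tools[0]["function"]["name"]
assert all(key in schema["properties"] for key in args)
assert all(field in args for field in schema["required"])
print("tool call well-formed:", args)
```

A 4/5 model will usually get this right; the failures that separate 4/5 from 5/5 tend to show up in argument accuracy and multi-call sequencing rather than simple function selection.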

Where they tie:

  • Classification: Both score 4/5, both tied for 1st of 53 models alongside 29 others. No meaningful difference here.
  • Safety calibration: Both score 2/5, both at rank 12 of 55 alongside 20 other models. Neither excels at refusing harmful requests while permitting legitimate ones — a shared weakness.

External benchmarks (Epoch AI): GPT-5 scores 73.6% on SWE-bench Verified (rank 6 of 12 models tested), placing it solidly above the median of 70.8% among models with this score. On MATH Level 5, GPT-5 scores 98.1% — rank 1 of 14 models tested, the sole holder of that score, well above the median of 94.15%. On AIME 2025, GPT-5 scores 91.4% (rank 6 of 23), above the 83.9% median. Devstral Small 1.1 has no external benchmark scores in our data. These third-party results independently corroborate GPT-5's strength in mathematical reasoning and code-related tasks.

Benchmark                  Devstral Small 1.1   GPT-5
Faithfulness               4/5                  5/5
Long Context               4/5                  5/5
Multilingual               4/5                  5/5
Tool Calling               4/5                  5/5
Classification             4/5                  4/5
Agentic Planning           2/5                  5/5
Structured Output          4/5                  5/5
Safety Calibration         2/5                  2/5
Strategic Analysis         2/5                  5/5
Persona Consistency        2/5                  5/5
Constrained Rewriting      3/5                  4/5
Creative Problem Solving   2/5                  4/5
Summary                    0 wins               10 wins

Pricing Analysis

Devstral Small 1.1 costs $0.10/MTok input and $0.30/MTok output. GPT-5 costs $1.25/MTok input and $10.00/MTok output — 12.5x more expensive on input and 33x more expensive on output. In practice: at 1M output tokens/month, you pay $0.30 for Devstral Small 1.1 vs $10.00 for GPT-5. At 10M output tokens, that's $3 vs $100. At 100M output tokens — a realistic volume for a production app or enterprise pipeline — Devstral costs $30 vs GPT-5's $1,000. GPT-5 also consumes reasoning tokens (billed as output and flagged in the response payload), so token usage on complex tasks can be substantially higher than the base rate suggests, widening the cost gap further. Developers running high-throughput pipelines, batch processing, or cost-sensitive SaaS products should take the 33x output cost difference seriously. GPT-5's pricing is justified only when its capability advantages — particularly on agentic planning (5 vs 2), strategic analysis (5 vs 2), and creative problem solving (4 vs 2) — directly drive business value in your use case.
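The arithmetic above is easy to reproduce for your own volumes. A minimal sketch, with output prices hard-coded from this page (GPT-5's reasoning tokens, which bill as output, are ignored here, so its real costs skew higher):

```python
# Output pricing from this comparison, in USD per million tokens (MTok).
PRICE_PER_MTOK = {"devstral-small-1.1": 0.30, "gpt-5": 10.00}

def monthly_output_cost(model: str, output_tokens: int) -> float:
    """Cost in USD for the given number of output tokens."""
    return output_tokens / 1_000_000 * PRICE_PER_MTOK[model]

for volume in (1_000_000, 10_000_000, 100_000_000):
    devstral = monthly_output_cost("devstral-small-1.1", volume)
    gpt5 = monthly_output_cost("gpt-5", volume)
    print(f"{volume:>11,} tokens: ${devstral:,.2f} vs ${gpt5:,.2f}")
```

Swap in your own expected monthly output volume to see where the 33x multiplier stops being a rounding error and starts being a budget line.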

Real-World Cost Comparison

Task             Devstral Small 1.1   GPT-5
Chat response    <$0.001              $0.0053
Blog post        <$0.001              $0.021
Document batch   $0.017               $0.525
Pipeline run     $0.170               $5.25

Bottom Line

Choose Devstral Small 1.1 if: Cost is a primary constraint and your use case maps to its strengths — structured output (4/5), tool calling (4/5), classification (4/5), and long context (4/5). It's a viable choice for high-volume pipelines that need solid JSON generation, function calling, or document classification at 33x lower output cost. It's also worth considering if you're experimenting, prototyping, or building on a limited budget where GPT-5's capabilities aren't needed.
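If you do route high-volume JSON generation to the cheaper model, it is worth validating every response before it enters your pipeline, since a 4/5 structured-output score still implies occasional schema drift. A minimal sketch using only the standard library (the required fields here are a hypothetical classification schema, not anything from our benchmark):

```python
import json

# Hypothetical schema for a classification response: field name -> type.
REQUIRED_FIELDS = {"label": str, "confidence": float}

def parse_model_json(raw: str) -> dict:
    """Parse a model response, enforcing required fields and their types."""
    data = json.loads(raw)
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"bad type for field: {field}")
    return data

# A well-formed response passes...
print(parse_model_json('{"label": "billing", "confidence": 0.92}'))
# ...while a malformed one is rejected before it reaches downstream code.
try:
    parse_model_json('{"label": "billing"}')
except ValueError as err:
    print("rejected:", err)
```

A guard like this also gives you a cheap place to measure the failure rate in production, which is ultimately what decides whether the 33x savings holds up for your workload.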

Choose GPT-5 if: You need strong agentic behavior (5/5 vs 2/5), reliable multi-step planning, strategic reasoning (5/5 vs 2/5), creative problem solving (4/5 vs 2/5), or persona-consistent chat (5/5 vs 2/5). Its 400K context window dwarfs Devstral's 131K, making it the clear choice for long-document tasks. It also leads on SWE-bench Verified at 73.6% and tops the field on MATH Level 5 at 98.1% (Epoch AI). For production applications where output quality directly affects user experience or business outcomes, GPT-5's benchmark advantage is hard to argue with — provided the cost fits your budget.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions