Devstral 2 2512 vs GPT-5

GPT-5 is the stronger general-purpose choice, winning 7 of 12 benchmarks including tool calling, faithfulness, and strategic analysis. Devstral 2 2512 wins constrained rewriting and costs far less ($0.40/MTok input, $2.00/MTok output), making it the better fit for cost-sensitive deployments.

mistral

Devstral 2 2512

Overall
4.00/5 (Strong)

Benchmark Scores

Faithfulness
4/5
Long Context
5/5
Multilingual
5/5
Tool Calling
4/5
Classification
3/5
Agentic Planning
4/5
Structured Output
5/5
Safety Calibration
1/5
Strategic Analysis
4/5
Persona Consistency
4/5
Constrained Rewriting
5/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.40/MTok

Output

$2.00/MTok

Context Window: 262K

modelpicker.net

openai

GPT-5

Overall
4.50/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
5/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
73.6%
MATH Level 5
98.1%
AIME 2025
91.4%

Pricing

Input

$1.25/MTok

Output

$10.00/MTok

Context Window: 400K


Benchmark Analysis

Summary from our 12-test suite: GPT-5 wins 7 benchmarks, Devstral 2 2512 wins 1, and 4 are ties.

Detailed walk-through:

- Strategic analysis: Devstral 2 = 4 vs GPT-5 = 5. GPT-5 wins; it is tied for 1st in strategic_analysis (with 25 other models), meaning better nuanced tradeoff reasoning for tasks that need real-world decisions.
- Tool calling: Devstral 2 = 4 vs GPT-5 = 5. GPT-5 wins and is tied for 1st on tool_calling, so it is stronger at function selection, argument accuracy, and sequencing in our tests.
- Faithfulness: Devstral 2 = 4 vs GPT-5 = 5. GPT-5 wins and is tied for 1st in faithfulness, indicating fewer deviations from source material in our runs.
- Classification: Devstral 2 = 3 vs GPT-5 = 4. GPT-5 wins and is tied for 1st in classification among tested models, so routing and categorization tasks were more accurate with GPT-5.
- Agentic planning: Devstral 2 = 4 vs GPT-5 = 5. GPT-5 wins and is tied for 1st on agentic_planning, suggesting stronger goal decomposition and failure recovery in our benchmarks.
- Persona consistency: Devstral 2 = 4 vs GPT-5 = 5. GPT-5 wins and is tied for 1st on persona_consistency, so it better maintains character and resists injection in our tests.
- Safety calibration: Devstral 2 = 1 vs GPT-5 = 2. GPT-5 wins (Devstral ranks 32 of 55; GPT-5 ranks 12 of 55), meaning GPT-5 more reliably refuses harmful prompts while permitting legitimate ones in our evaluation.
- Constrained rewriting: Devstral 2 = 5 vs GPT-5 = 4. Devstral 2 wins and is tied for 1st in constrained_rewriting, so it's the better choice when strict compression or hard character limits are mandatory.
- Structured output: both score 5, a tie. Both models tie for top ranks on structured_output, so JSON/schema adherence is equally strong in our tests.
- Creative problem solving, long context, multilingual: all ties (both score 4/5 or 5/5 depending on task). Both models are capable for idea generation, very long contexts (tied for 1st in long_context), and non-English output.

External benchmarks: GPT-5 scores 73.6% on SWE-bench Verified, 98.1% on MATH Level 5, and 91.4% on AIME 2025 (all via Epoch AI). Devstral 2 2512 has no published external SWE/MATH/AIME scores in our data. These external results reinforce GPT-5's strength on coding and high-level math.

Practical meaning: pick GPT-5 when tool calling, faithfulness, classification, agentic planning, safety, or math-level accuracy matter; pick Devstral 2 2512 where constrained rewriting and lower inference cost dominate.

Benchmark                | Devstral 2 2512 | GPT-5
Faithfulness             | 4/5             | 5/5
Long Context             | 5/5             | 5/5
Multilingual             | 5/5             | 5/5
Tool Calling             | 4/5             | 5/5
Classification           | 3/5             | 4/5
Agentic Planning         | 4/5             | 5/5
Structured Output        | 5/5             | 5/5
Safety Calibration       | 1/5             | 2/5
Strategic Analysis       | 4/5             | 5/5
Persona Consistency      | 4/5             | 5/5
Constrained Rewriting    | 5/5             | 4/5
Creative Problem Solving | 4/5             | 4/5
Summary                  | 1 win           | 7 wins
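The win/tie tally above can be reproduced from the per-benchmark scores. A minimal sketch (scores copied from this comparison; variable names are ours):

```python
# Head-to-head tally from the 12 benchmark scores in the table above.
# Each entry maps benchmark -> (Devstral 2 2512 score, GPT-5 score).
scores = {
    "Faithfulness": (4, 5),
    "Long Context": (5, 5),
    "Multilingual": (5, 5),
    "Tool Calling": (4, 5),
    "Classification": (3, 4),
    "Agentic Planning": (4, 5),
    "Structured Output": (5, 5),
    "Safety Calibration": (1, 2),
    "Strategic Analysis": (4, 5),
    "Persona Consistency": (4, 5),
    "Constrained Rewriting": (5, 4),
    "Creative Problem Solving": (4, 4),
}

devstral_wins = sum(d > g for d, g in scores.values())
gpt5_wins = sum(g > d for d, g in scores.values())
ties = sum(d == g for d, g in scores.values())
print(devstral_wins, gpt5_wins, ties)  # 1 7 4
```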

Pricing Analysis

Pricing (per MTok): Devstral 2 2512 = $0.40 input, $2.00 output; GPT-5 = $1.25 input, $10.00 output. At realistic monthly volumes, assuming a 50/50 split of input vs output tokens:

- 1M total tokens (500K input + 500K output): Devstral ≈ $1.20; GPT-5 ≈ $5.63.
- 10M total tokens: Devstral ≈ $12.00; GPT-5 ≈ $56.25.
- 100M total tokens: Devstral ≈ $120.00; GPT-5 ≈ $562.50.

The gap scales linearly: under the 50/50 assumption, GPT-5 costs about 4.69× more at every volume. High-volume products (APIs, customer-facing chatbots, bulk inference) will feel this difference immediately; teams prioritizing top-tier tool calling, faithfulness, or advanced reasoning may accept GPT-5's higher cost, while cost-sensitive use cases should evaluate Devstral 2 2512 first.
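The per-MTok arithmetic is easy to check yourself. A minimal sketch (prices from this comparison; the `monthly_cost` helper and the 50/50 input/output split are our assumptions): at that split, 1M tokens costs about $1.20 on Devstral 2 2512 vs about $5.63 on GPT-5, a ratio of roughly 4.69×.

```python
# Prices in dollars per million tokens (MTok), as quoted above:
# (input $/MTok, output $/MTok)
PRICES = {
    "Devstral 2 2512": (0.40, 2.00),
    "GPT-5": (1.25, 10.00),
}

def monthly_cost(model: str, total_tokens: float, input_share: float = 0.5) -> float:
    """Dollar cost for total_tokens, split input_share input / rest output."""
    inp, out = PRICES[model]
    return (total_tokens * input_share * inp
            + total_tokens * (1 - input_share) * out) / 1_000_000

for volume in (1e6, 10e6, 100e6):
    d = monthly_cost("Devstral 2 2512", volume)
    g = monthly_cost("GPT-5", volume)
    print(f"{volume / 1e6:>5.0f}M tokens: Devstral ${d:,.2f} vs GPT-5 ${g:,.2f} ({g / d:.2f}x)")
```

Adjust `input_share` to match your workload; retrieval-heavy apps (mostly input tokens) narrow the gap, while generation-heavy apps widen it, since the output-price ratio ($10.00 vs $2.00) is the larger of the two.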

Real-World Cost Comparison

Task           | Devstral 2 2512 | GPT-5
Chat response  | $0.0011         | $0.0053
Blog post      | $0.0042         | $0.021
Document batch | $0.108          | $0.525
Pipeline run   | $1.08           | $5.25

Bottom Line

Choose Devstral 2 2512 if:

- You need the lowest inference spend at scale ($0.40/MTok input, $2.00/MTok output) and constrained rewriting (5/5 in our tests) is important.
- Your app runs very high token volumes and cost per token is the primary constraint.

Choose GPT-5 if:

- You prioritize tool calling, faithfulness, classification, strategic analysis, agentic planning, or better safety calibration (GPT-5 wins 7 of 12 benchmarks in our tests).
- You need the strongest external coding/math results: GPT-5 scores 73.6% on SWE-bench Verified, 98.1% on MATH Level 5, and 91.4% on AIME 2025 (Epoch AI).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions