Devstral Small 1.1 vs o3
o3 is the clear winner on breadth and depth, outscoring Devstral Small 1.1 on 9 of 12 benchmarks in our testing, with decisive leads on agentic planning (5 vs 2) and strategic analysis (5 vs 2) and a narrower edge in tool calling (5 vs 4). Devstral Small 1.1 wins only on classification (4 vs 3) and safety calibration (2 vs 1), and ties on long context. The cost gap is severe: o3 outputs tokens at $8/M versus Devstral Small 1.1's $0.30/M, a roughly 27x premium that only justifies itself for high-stakes tasks where o3's reasoning depth is genuinely required.
Pricing at a Glance
- Devstral Small 1.1 (Mistral): $0.10/MTok input, $0.30/MTok output
- o3 (OpenAI): $2.00/MTok input, $8.00/MTok output
Benchmark Analysis
Our 12-test benchmark suite gives o3 a decisive win: it scores higher on 9 tests, Devstral Small 1.1 wins 2, and they tie on 1.
Where o3 leads:
- Agentic planning: o3 scores 5, Devstral Small 1.1 scores 2. o3 ties for 1st among 54 models; Devstral Small 1.1 ranks 53rd of 54. This is the starkest gap — goal decomposition and failure recovery are core to autonomous agent workflows, and Devstral Small 1.1 is near the bottom of the field here.
- Strategic analysis: o3 scores 5, Devstral Small 1.1 scores 2. o3 ties for 1st of 54; Devstral Small 1.1 ranks 44th. For nuanced tradeoff reasoning with real numbers, o3 is in a different tier.
- Tool calling: o3 scores 5, Devstral Small 1.1 scores 4. o3 ties for 1st of 54; Devstral Small 1.1 shares a mid-field score (rank 18 of 54). In function-calling pipelines, o3's edge matters for complex argument sequencing.
- Creative problem solving: o3 scores 4 (rank 9 of 54), Devstral Small 1.1 scores 2 (rank 47 of 54). For non-obvious, feasible ideation, Devstral Small 1.1 is well below the field median.
- Persona consistency: o3 scores 5 (tied 1st of 53), Devstral Small 1.1 scores 2 (rank 51 of 53). Critical for chatbot and roleplay applications.
- Structured output: o3 scores 5 (tied 1st of 54), Devstral Small 1.1 scores 4 (tied rank 26 of 54). o3's JSON schema compliance is among the best tested.
- Faithfulness: o3 scores 5 (tied 1st of 55), Devstral Small 1.1 scores 4 (rank 34 of 55). With a field median of 5, o3 matches the top tier while Devstral Small 1.1 falls just below the median.
- Constrained rewriting: o3 scores 4 (rank 6 of 53), Devstral Small 1.1 scores 3 (rank 31 of 53). o3 compresses within hard character limits more reliably.
- Multilingual: o3 scores 5 (tied 1st of 55), Devstral Small 1.1 scores 4 (rank 36 of 55). Both are above average, but o3 hits the ceiling.
Where Devstral Small 1.1 leads:
- Classification: Devstral Small 1.1 scores 4 (tied 1st of 53 — shared with 29 others), o3 scores 3 (rank 31 of 53). This is a genuine win: Devstral Small 1.1 matches the field's best classifiers while o3 falls below the median.
- Safety calibration: Devstral Small 1.1 scores 2 (rank 12 of 55), o3 scores 1 (rank 32 of 55). Neither model excels here — the field median is 2 and the p75 is also 2 — but Devstral Small 1.1 edges o3 out.
Tied:
- Long context: Both score 4, both rank 38 of 55. Equivalent retrieval accuracy at 30K+ tokens.
External benchmarks (Epoch AI data for o3): On SWE-bench Verified, o3 scores 62.3%, placing it 9th of 12 models with scores in our dataset — near the bottom quartile (p25 = 61.1%) for that test. On MATH Level 5, o3 scores 97.8% (rank 2 of 14, tied with 2 others) — near the top of the tested field and well above the median of 94.2%. On AIME 2025, o3 scores 83.9% (rank 12 of 23), exactly at the field median. Devstral Small 1.1 has no external benchmark scores in our dataset. These third-party figures suggest o3 is a strong competition math model but not a standout on real-world GitHub issue resolution.
Pricing Analysis
Devstral Small 1.1 costs $0.10/M input and $0.30/M output. o3 costs $2.00/M input and $8.00/M output. At 1B output tokens/month, that's $300 vs $8,000: a $7,700 monthly difference. At 10B output tokens, you're looking at $3,000 vs $80,000. At 100B output tokens, $30,000 vs $800,000. The cost ratio (Devstral Small 1.1's output tokens cost under 4% of o3's) means the price-performance calculation depends entirely on task sensitivity. For high-volume coding pipelines, classification workloads, or any application where Devstral Small 1.1's scores are sufficient, the savings are enormous. For one-off complex reasoning tasks, such as deep strategic analysis, multi-step agentic workflows, or high-stakes tool-calling chains, o3's premium may be worth absorbing. Developers routing at scale should treat o3 as a specialist model for tasks that genuinely require top-tier reasoning, not a general-purpose default.
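The arithmetic above is easy to sanity-check in a few lines. This is a minimal sketch, not any provider's billing API: the price table and the `monthly_cost` helper are our own illustrative names, using the per-million-token rates quoted in this section.

```python
# Illustrative cost comparison using the published per-MTok rates.
# PRICES and monthly_cost are hypothetical names, not a real API.

PRICES = {  # USD per million tokens: (input, output)
    "devstral-small-1.1": (0.10, 0.30),
    "o3": (2.00, 8.00),
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Monthly spend in USD for a volume expressed in millions of tokens."""
    price_in, price_out = PRICES[model]
    return input_mtok * price_in + output_mtok * price_out

# 1B output tokens = 1,000 MTok; input volume set to zero for simplicity.
cheap = monthly_cost("devstral-small-1.1", 0, 1_000)
premium = monthly_cost("o3", 0, 1_000)
print(f"Devstral: ${cheap:,.0f}  o3: ${premium:,.0f}  delta: ${premium - cheap:,.0f}")
```

Running this reproduces the $300 vs $8,000 comparison; swapping in real input/output volumes from your own logs gives a quick budget estimate before committing to either model.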
Bottom Line
Choose Devstral Small 1.1 if: You're running high-volume classification or routing pipelines where its top-tier classification score (4, tied 1st of 53) is sufficient, and cost is a primary constraint. At $0.30/M output tokens, it's one of the cheapest capable models available. It also suits applications where its slight edge in safety calibration matters and the task doesn't demand deep reasoning or agentic behavior. Its 131K context window and support for structured outputs, tool calling, and temperature control make it practical for many standard API workflows.
Choose o3 if: Your application involves agentic planning, complex multi-step tool use, strategic analysis, or creative reasoning — areas where o3 scores 5 and Devstral Small 1.1 scores 2. o3's multimodal input (text + image + file) also opens use cases Devstral Small 1.1 can't address. Its 200K context window and 100K max output tokens give it more headroom for long documents. At $8/M output, it's a deliberate investment — justified for tasks where reasoning quality directly affects outcomes, not for bulk inference.
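The "specialist, not default" routing policy described above can be sketched in a few lines. This is an assumption-laden illustration: the task labels and the `pick_model` function are hypothetical names we chose, not part of either provider's SDK.

```python
# Hypothetical routing sketch: send only reasoning-heavy task types to the
# premium model and default everything else to the cheap one. Task labels
# and function names are our own illustrative choices.

PREMIUM_TASKS = {"agentic_planning", "strategic_analysis", "complex_tool_chain"}

def pick_model(task_type: str) -> str:
    """Route a labeled task to a model tier based on the benchmark gaps above."""
    if task_type in PREMIUM_TASKS:
        return "o3"  # reasoning depth directly affects the outcome
    return "devstral-small-1.1"  # default for classification, routing, bulk work

print(pick_model("classification"))    # devstral-small-1.1
print(pick_model("agentic_planning"))  # o3
```

In practice the task label would come from a lightweight classifier or explicit caller metadata; the point is that the expensive model sits behind an allowlist rather than serving as the fallback.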
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.