Devstral Small 1.1 vs GPT-4.1
GPT-4.1 is the better choice for most production use cases that demand faithfulness, long-context reasoning, persona consistency, and advanced planning — it wins 9 of 12 benchmarks in our tests. Devstral Small 1.1 is substantially cheaper and wins only safety_calibration in our suite, so choose it when cost is the primary constraint and the task tolerates lower strategic and planning performance.
Devstral Small 1.1 (Mistral): $0.10/MTok input, $0.30/MTok output
GPT-4.1 (OpenAI): $2.00/MTok input, $8.00/MTok output
Benchmark Analysis
Summary of our 12-benchmark comparison; all scores below are from our own testing:
- GPT-4.1 wins (9): strategic_analysis 5 vs 2, constrained_rewriting 5 vs 3, creative_problem_solving 3 vs 2, tool_calling 5 vs 4, faithfulness 5 vs 4, long_context 5 vs 4, persona_consistency 5 vs 2, agentic_planning 4 vs 2, multilingual 5 vs 4. These wins include top-tier ranks: GPT-4.1 is tied for 1st on faithfulness, long_context, persona_consistency, classification, strategic_analysis, constrained_rewriting, and tool_calling, placing it among the best performers in our pool on tasks that require source accuracy, character maintenance, and retrieval from 30K+-token contexts.
- Devstral Small 1.1 wins (1): safety_calibration 2 vs GPT-4.1’s 1 — Devstral ranks 12 of 55 on safety_calibration in our tests while GPT-4.1 ranks 32 of 55. That means Devstral was more likely in our tests to correctly refuse harmful requests while allowing legitimate ones.
- Ties (2): structured_output 4/4 (both rank ~26/54) and classification 4/4 (both tied for 1st with many models). For JSON/schema adherence and routing tasks, both models perform equivalently in our suite.
- Rankings context: Devstral’s low ranks (e.g., persona_consistency rank 51 of 53, agentic_planning rank 53 of 54) indicate it struggles to maintain persona and to decompose goals compared with the field. GPT-4.1’s top ranks on long_context (tied for 1st) and faithfulness (tied for 1st) imply better behavior on long-document retrieval and sticking to source material.
- External benchmarks (Epoch AI): GPT-4.1 scores 48.5% on SWE-bench Verified, 83% on MATH Level 5, and 38.3% on AIME 2025. These third-party coding and math results supplement our internal suite and show GPT-4.1's relative standing, but they do not factor into our internal ranking procedure.
Practical meaning: pick GPT-4.1 when you need reliable long-context answers, high faithfulness, complex planning, or multilingual parity. Pick Devstral Small 1.1 when you must minimize per-token cost and can accept weaker strategic analysis, persona maintenance, and planning.
Pricing Analysis
Prices (per MTok): Devstral Small 1.1 = $0.10 input / $0.30 output; GPT-4.1 = $2.00 input / $8.00 output. Assuming a 50/50 split of input/output tokens: at 1B total tokens/month (1,000 MTok), Devstral ≈ $200 and GPT-4.1 ≈ $5,000. At 10B tokens: Devstral ≈ $2,000, GPT-4.1 ≈ $50,000. At 100B tokens: Devstral ≈ $20,000, GPT-4.1 ≈ $500,000 (the sketch in the next section reproduces these figures). The gap works out to 20x–27x depending on the input/output mix (25x at a 50/50 split), so it matters most for high-volume products (APIs, consumer apps, automation pipelines) where GPT-4.1's extra capabilities must justify the far higher monthly bill; small teams, prototypes, and cost-sensitive deployments will favor Devstral Small 1.1.
Real-World Cost Comparison
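To sanity-check the figures above, here is a minimal Python sketch. Only the per-MTok prices come from the listings; the helper name, the 50/50 split default, and the chosen volumes are illustrative assumptions, not part of our harness.

```python
# Minimal sketch to reproduce the monthly-cost figures above.
# Only the per-MTok prices come from the listings; the helper name,
# the 50/50 split default, and the volumes are illustrative assumptions.

def monthly_cost(total_mtok: float, input_price: float, output_price: float,
                 input_share: float = 0.5) -> float:
    """Dollar cost for total_mtok million tokens at the given per-MTok prices."""
    input_mtok = total_mtok * input_share
    output_mtok = total_mtok * (1.0 - input_share)
    return input_mtok * input_price + output_mtok * output_price

for volume in (1_000, 10_000, 100_000):  # MTok/month, i.e. 1B, 10B, 100B tokens
    devstral = monthly_cost(volume, 0.10, 0.30)
    gpt41 = monthly_cost(volume, 2.00, 8.00)
    print(f"{volume:>7,} MTok: Devstral ${devstral:,.0f} vs GPT-4.1 ${gpt41:,.0f}"
          f" ({gpt41 / devstral:.0f}x)")
```

At a 50/50 split the ratio is a flat 25x at every volume, which is why the decision hinges on whether GPT-4.1's capability wins justify the multiplier, not on how much you scale.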
Bottom Line
- Choose Devstral Small 1.1 if you need a cost-efficient model for high-volume text tasks where structured output and basic classification suffice, and you can accept weaker strategic analysis, persona consistency, tool calling, and long-context retrieval.
- Choose GPT-4.1 if you need the highest faithfulness, 1M-token context work, stronger tool calling and agentic planning, persona consistency, or robust multilingual and constrained-rewriting capability, and you can justify the much higher per-token cost.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
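For illustration only, here is a minimal sketch of what a 1–5 LLM-judge scoring step could look like, assuming an OpenAI-compatible chat API. The rubric prompt, judge model, and function are hypothetical stand-ins, not modelpicker.net's actual harness.

```python
# Illustrative only: a minimal 1-5 LLM-judge scoring step, assuming an
# OpenAI-compatible chat API. The rubric prompt, judge model, and function
# are hypothetical stand-ins, not modelpicker.net's actual harness.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are grading a model's response to a benchmark task.\n"
    "Task:\n{task}\n\nResponse:\n{response}\n\n"
    "Reply with a single integer from 1 (fails the task) to 5 (excellent)."
)

def judge_score(task: str, response: str, judge_model: str = "gpt-4o") -> int:
    """Ask the judge model for a 1-5 score and parse the integer it returns."""
    reply = client.chat.completions.create(
        model=judge_model,
        temperature=0,  # deterministic grading
        messages=[{"role": "user",
                   "content": RUBRIC.format(task=task, response=response)}],
    )
    return int(reply.choices[0].message.content.strip())
```

A real harness would add retries, parse validation, and multiple judge samples per response; this sketch only shows the shape of the scoring call.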