Devstral Small 1.1 vs Llama 4 Maverick

For general-purpose chat, persona-driven apps, and creative/agentic tasks, Llama 4 Maverick is the better pick in our 12-test suite (3 wins to Devstral's 2). Devstral Small 1.1 wins at classification and tool calling and is roughly 50% cheaper, making it the pragmatic choice for high-volume, tool-integrated engineering workflows.

Mistral

Devstral Small 1.1

Overall
3.08/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
2/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
2/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.100/MTok

Output

$0.300/MTok

Context Window: 131K

modelpicker.net

Meta

Llama 4 Maverick

Overall
3.36/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Classification
3/5
Agentic Planning
3/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
3/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.150/MTok

Output

$0.600/MTok

Context Window: 1,049K


Benchmark Analysis

Summary of our 12-test suite (scores are our 1–5 ratings):

Devstral Small 1.1 wins: classification (4 vs 3), where it is tied for 1st with 29 others out of 53 models tested, and tool calling (4 vs 0), where Devstral ranks 18 of 54. Llama's tool-calling run failed with a 429 rate limit on OpenRouter during testing (noted as likely transient in the payload), so its 0 may understate its real capability. These Devstral wins matter for routing, triage, and agent function selection in production agents.

Llama 4 Maverick wins: persona consistency (5 vs 2), where it is tied for 1st with 36 others out of 53; creative problem solving (3 vs 2); and agentic planning (3 vs 2). Those victories translate to stronger role-play/chat stability, ideation quality, and goal decomposition in our tests.

Ties (equal scores in our testing): structured output (4/4), strategic analysis (2/2), constrained rewriting (3/3), faithfulness (4/4), long context (4/4), safety calibration (2/2), and multilingual (4/4).

Notable context and feature differences that affected results: Llama 4 Maverick offers a 1,048,576-token context window vs Devstral's 131,072, and Llama accepts multimodal input (text+image→text). Both scored 4 on our long-context test, however, indicating parity on retrieval accuracy at 30K+ tokens in our implementation.

In short: Devstral is stronger where function selection and classification matter; Llama is stronger where persona integrity, creative problem solving, and planning matter.
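The win/tie tally above follows directly from the per-benchmark scores. A minimal sketch of that tally, with the scores copied from this comparison's table (Llama's rate-limited tool-calling run recorded as 0):

```python
# Per-benchmark scores (1–5) from this comparison's 12-test suite,
# as (Devstral Small 1.1, Llama 4 Maverick) pairs.
scores = {
    "faithfulness":             (4, 4),
    "long_context":             (4, 4),
    "multilingual":             (4, 4),
    "tool_calling":             (4, 0),  # Llama's run hit a 429 rate limit
    "classification":           (4, 3),
    "agentic_planning":         (2, 3),
    "structured_output":        (4, 4),
    "safety_calibration":       (2, 2),
    "strategic_analysis":       (2, 2),
    "persona_consistency":      (2, 5),
    "constrained_rewriting":    (3, 3),
    "creative_problem_solving": (2, 3),
}

devstral_wins = [name for name, (d, l) in scores.items() if d > l]
llama_wins    = [name for name, (d, l) in scores.items() if l > d]
ties          = [name for name, (d, l) in scores.items() if d == l]

print(len(devstral_wins), len(llama_wins), len(ties))  # 2 3 7
```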

Benchmark | Devstral Small 1.1 | Llama 4 Maverick
Faithfulness | 4/5 | 4/5
Long Context | 4/5 | 4/5
Multilingual | 4/5 | 4/5
Tool Calling | 4/5 | 0/5
Classification | 4/5 | 3/5
Agentic Planning | 2/5 | 3/5
Structured Output | 4/5 | 4/5
Safety Calibration | 2/5 | 2/5
Strategic Analysis | 2/5 | 2/5
Persona Consistency | 2/5 | 5/5
Constrained Rewriting | 3/5 | 3/5
Creative Problem Solving | 2/5 | 3/5
Summary | 2 wins | 3 wins

Pricing Analysis

Unit costs are quoted per million tokens (MTok): Devstral Small 1.1 = $0.10 input + $0.30 output; Llama 4 Maverick = $0.15 input + $0.60 output. Assuming an equal input/output split, the blended rate is $0.20/MTok for Devstral vs $0.375/MTok for Llama, making Devstral roughly 50% cheaper. At that blend, monthly costs: 10M tokens → Devstral $2.00 vs Llama $3.75; 100M → $20.00 vs $37.50; 1B → $200 vs $375. The nearly 2x cost gap matters for product teams with sustained high-volume inference (10M+ tokens/month), budget-conscious startups, or any application where cost per query is a primary constraint. If you run low-volume prototypes or prioritize persona and multimodal features, the higher Llama cost can be justified; if you operate at scale and need classification/tool reliability, Devstral's price advantage reduces operating spend materially.
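The blended-rate arithmetic above can be sketched as follows. The 50/50 input/output split is an assumption for illustration, not measured traffic; adjust `input_share` to match your workload:

```python
def monthly_cost(tokens: int, input_per_mtok: float, output_per_mtok: float,
                 input_share: float = 0.5) -> float:
    """Estimated dollar cost for `tokens` total tokens at the given per-MTok rates."""
    mtok = tokens / 1_000_000
    blended = input_share * input_per_mtok + (1 - input_share) * output_per_mtok
    return mtok * blended

# Published per-MTok rates (input, output) from this comparison.
devstral = (0.10, 0.30)
llama = (0.15, 0.60)

for volume in (10_000_000, 100_000_000, 1_000_000_000):
    d = monthly_cost(volume, *devstral)
    l = monthly_cost(volume, *llama)
    print(f"{volume:>13,} tokens/mo: Devstral ${d:,.2f} vs Llama ${l:,.2f}")
```

A workload that is mostly output (e.g. long generations from short prompts) widens the gap, since Devstral's output rate is half of Llama's.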

Real-World Cost Comparison

Task | Devstral Small 1.1 | Llama 4 Maverick
Chat response | <$0.001 | <$0.001
Blog post | <$0.001 | $0.0013
Document batch | $0.017 | $0.033
Pipeline run | $0.170 | $0.330

Bottom Line

Choose Devstral Small 1.1 if you need lower-cost inference at scale, reliable classification (score 4; tied for 1st of 53), and robust tool calling (score 4; rank 18 of 54) for software-engineering agent workflows. Choose Llama 4 Maverick if your priority is persona-driven chat, creative ideation, or agentic planning (persona consistency 5 vs 2; creative problem solving 3 vs 2; agentic planning 3 vs 2), or if you need multimodal input and a very large context window (1,048,576 tokens). If budget is tight and you expect 10M+ tokens/month, Devstral's roughly 50% lower per-token cost will materially reduce spend; if accuracy on persona and planning tasks is mission-critical, Llama's higher cost is worth accepting.
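The decision rule above can be expressed as a small routing heuristic. This is a hypothetical sketch, not an API from either vendor; the function name, parameters, and the 131,072-token threshold (Devstral's context limit) encode this comparison's recommendation:

```python
def pick_model(needs_multimodal: bool, persona_critical: bool,
               context_tokens: int) -> str:
    """Hypothetical router encoding this comparison's bottom line."""
    # Llama 4 Maverick: multimodal input, persona consistency 5/5,
    # and a 1,048,576-token context window.
    if needs_multimodal or persona_critical or context_tokens > 131_072:
        return "llama-4-maverick"
    # Devstral Small 1.1: ~50% cheaper per token, stronger
    # classification and tool calling in our tests.
    return "devstral-small-1.1"
```

For example, a high-volume classification pipeline with short prompts routes to Devstral, while a long-document persona chat routes to Llama.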

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions