Devstral Small 1.1 vs Llama 4 Maverick

For general-purpose chat, persona-driven apps, and creative/agentic tasks, Llama 4 Maverick is the better pick in our 12-test suite (3 wins to Devstral's 2). Devstral Small 1.1 wins at classification and tool calling and is roughly 50% cheaper, making it the pragmatic choice for high-volume, tool-integrated engineering workflows.

Mistral

Devstral Small 1.1

Overall
3.08/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
2/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
2/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.100/MTok

Output

$0.300/MTok

Context Window: 131K

modelpicker.net

Meta

Llama 4 Maverick

Overall
3.36/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Classification
3/5
Agentic Planning
3/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
3/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.150/MTok

Output

$0.600/MTok

Context Window: 1,049K


Benchmark Analysis

Summary of our 12-test suite (scores are our 1–5 ratings):

Devstral Small 1.1 wins: classification (4 vs 3), where it is tied for 1st with 29 others out of 53 models tested, and tool calling (4 vs 0), where Devstral ranks 18 of 54. Llama's tool-calling run failed with a 429 rate limit on OpenRouter during testing (noted as likely transient in the payload), so its 0 may understate its real capability. These Devstral wins matter for routing, triage, and agent function selection in production agents.

Llama 4 Maverick wins: persona consistency (5 vs 2), where it is tied for 1st with 36 others out of 53; creative problem solving (3 vs 2); and agentic planning (3 vs 2). Those victories translate to stronger role-play/chat stability, ideation quality, and goal decomposition in our tests.

Ties (equal scores in our testing): structured output (4/4), strategic analysis (2/2), constrained rewriting (3/3), faithfulness (4/4), long context (4/4), safety calibration (2/2), and multilingual (4/4).

Notable context and feature differences that affected results: Llama 4 Maverick offers a 1,048,576-token context window vs Devstral's 131,072, and Llama accepts multimodal input (text+image→text). Both scored 4 on our long-context test, however, indicating parity on retrieval accuracy at 30K+ tokens in our implementation.

In short: Devstral is stronger where function selection and classification matter; Llama is stronger where persona integrity, creative problem solving, and planning matter.
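The win/tie tally above follows directly from the per-benchmark scores. A minimal sketch of that tally, with the scores copied from this comparison's table (Llama's rate-limited tool-calling run recorded as 0):

```python
# Per-benchmark scores (1–5) from this comparison's 12-test suite,
# as (Devstral Small 1.1, Llama 4 Maverick) pairs.
scores = {
    "faithfulness":             (4, 4),
    "long_context":             (4, 4),
    "multilingual":             (4, 4),
    "tool_calling":             (4, 0),  # Llama's run hit a 429 rate limit
    "classification":           (4, 3),
    "agentic_planning":         (2, 3),
    "structured_output":        (4, 4),
    "safety_calibration":       (2, 2),
    "strategic_analysis":       (2, 2),
    "persona_consistency":      (2, 5),
    "constrained_rewriting":    (3, 3),
    "creative_problem_solving": (2, 3),
}

devstral_wins = [name for name, (d, l) in scores.items() if d > l]
llama_wins    = [name for name, (d, l) in scores.items() if l > d]
ties          = [name for name, (d, l) in scores.items() if d == l]

print(len(devstral_wins), len(llama_wins), len(ties))  # 2 3 7
```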

Benchmark | Devstral Small 1.1 | Llama 4 Maverick
Faithfulness | 4/5 | 4/5
Long Context | 4/5 | 4/5
Multilingual | 4/5 | 4/5
Tool Calling | 4/5 | 0/5
Classification | 4/5 | 3/5
Agentic Planning | 2/5 | 3/5
Structured Output | 4/5 | 4/5
Safety Calibration | 2/5 | 2/5
Strategic Analysis | 2/5 | 2/5
Persona Consistency | 2/5 | 5/5
Constrained Rewriting | 3/5 | 3/5
Creative Problem Solving | 2/5 | 3/5
Summary | 2 wins | 3 wins

Pricing Analysis

Unit costs are quoted per million tokens (MTok): Devstral Small 1.1 = $0.10 input + $0.30 output; Llama 4 Maverick = $0.15 input + $0.60 output. Assuming an equal input/output split, the blended rate is $0.20/MTok for Devstral vs $0.375/MTok for Llama, making Devstral roughly 50% cheaper. At that blend, monthly costs: 10M tokens → Devstral $2.00 vs Llama $3.75; 100M → $20.00 vs $37.50; 1B → $200 vs $375. The nearly 2x cost gap matters for product teams with sustained high-volume inference (10M+ tokens/month), budget-conscious startups, or any application where cost per query is a primary constraint. If you run low-volume prototypes or prioritize persona and multimodal features, the higher Llama cost can be justified; if you operate at scale and need classification/tool reliability, Devstral's price advantage reduces operating spend materially.
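The blended-rate arithmetic above can be sketched as follows. The 50/50 input/output split is an assumption for illustration, not measured traffic; adjust `input_share` to match your workload:

```python
def monthly_cost(tokens: int, input_per_mtok: float, output_per_mtok: float,
                 input_share: float = 0.5) -> float:
    """Estimated dollar cost for `tokens` total tokens at the given per-MTok rates."""
    mtok = tokens / 1_000_000
    blended = input_share * input_per_mtok + (1 - input_share) * output_per_mtok
    return mtok * blended

# Published per-MTok rates (input, output) from this comparison.
devstral = (0.10, 0.30)
llama = (0.15, 0.60)

for volume in (10_000_000, 100_000_000, 1_000_000_000):
    d = monthly_cost(volume, *devstral)
    l = monthly_cost(volume, *llama)
    print(f"{volume:>13,} tokens/mo: Devstral ${d:,.2f} vs Llama ${l:,.2f}")
```

A workload that is mostly output (e.g. long generations from short prompts) widens the gap, since Devstral's output rate is half of Llama's.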

Real-World Cost Comparison

Task | Devstral Small 1.1 | Llama 4 Maverick
Chat response | <$0.001 | <$0.001
Blog post | <$0.001 | $0.0013
Document batch | $0.017 | $0.033
Pipeline run | $0.170 | $0.330

Bottom Line

Choose Devstral Small 1.1 if you need lower-cost inference at scale, reliable classification (score 4; tied for 1st of 53), and robust tool calling (score 4; rank 18 of 54) for software-engineering agent workflows. Choose Llama 4 Maverick if your priority is persona-driven chat, creative ideation, or agentic planning (persona consistency 5 vs 2; creative problem solving 3 vs 2; agentic planning 3 vs 2), or if you need multimodal input and a very large context window (1,048,576 tokens). If budget is tight and you expect 10M+ tokens/month, Devstral's roughly 50% lower per-token cost will materially reduce spend; if accuracy on persona and planning tasks is mission-critical, Llama's higher cost is worth accepting.
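The decision rule above can be expressed as a small routing heuristic. This is a hypothetical sketch, not an API from either vendor; the function name, parameters, and the 131,072-token threshold (Devstral's context limit) encode this comparison's recommendation:

```python
def pick_model(needs_multimodal: bool, persona_critical: bool,
               context_tokens: int) -> str:
    """Hypothetical router encoding this comparison's bottom line."""
    # Llama 4 Maverick: multimodal input, persona consistency 5/5,
    # and a 1,048,576-token context window.
    if needs_multimodal or persona_critical or context_tokens > 131_072:
        return "llama-4-maverick"
    # Devstral Small 1.1: ~50% cheaper per token, stronger
    # classification and tool calling in our tests.
    return "devstral-small-1.1"
```

For example, a high-volume classification pipeline with short prompts routes to Devstral, while a long-document persona chat routes to Llama.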

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions