mistral

Devstral Small 1.1

Devstral Small 1.1 is a 24-billion parameter open-weight language model built specifically for software engineering agents. It was developed in collaboration with All Hands AI and fine-tuned from Mistral Small 3.1 for agentic code tasks — not general-purpose reasoning, writing, or analysis. At $0.10/MTok input and $0.30/MTok output, it is one of the cheapest models in the tested set, but it ranks 51st out of 52 overall in our 12-test benchmark suite with an average score of 3.08. The benchmark suite covers general capabilities (strategic analysis, faithfulness, persona consistency, etc.); Devstral Small 1.1's training focus on software engineering means it performs well below its peers on most of these dimensions. Within the mistral lineup, it is the budget software engineering option — Devstral 2 2512 ($2/MTok output, avg 4.0) and Devstral Medium ($2/MTok output, avg 3.17) represent higher-capability alternatives for code-centric tasks.

Performance

In our 12-test benchmark suite, Devstral Small 1.1 ranks 51st out of 52 models with an average score of 3.08. These results reflect a general-capability assessment, not a software engineering evaluation. Its strongest scores are classification (4/5, tied for 1st with 29 other models out of 53), tool calling (4/5, rank 18 of 54), structured output (4/5, rank 26 of 54), multilingual (4/5, rank 36 of 55), faithfulness (4/5, rank 34 of 55), and long context (4/5, rank 38 of 55). The weak areas are significant: persona consistency (2/5, rank 51 of 53, second to last), agentic planning (2/5, rank 53 of 54, second to last), creative problem solving (2/5, rank 47 of 54), strategic analysis (2/5, rank 44 of 54), constrained rewriting (3/5, rank 31 of 53), and safety calibration (2/5, rank 12 of 55). These weaknesses are consistent with a model fine-tuned for software engineering, where general reasoning, persona maintenance, and strategic analysis are not target use cases.
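The 3.08 average can be verified directly from the twelve per-test scores reported above:

```python
# Per-test scores for Devstral Small 1.1 from our suite (1-5 scale).
scores = {
    "faithfulness": 4, "long_context": 4, "multilingual": 4,
    "tool_calling": 4, "classification": 4, "structured_output": 4,
    "constrained_rewriting": 3, "agentic_planning": 2,
    "safety_calibration": 2, "strategic_analysis": 2,
    "persona_consistency": 2, "creative_problem_solving": 2,
}

average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")  # 3.08
```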

Pricing

Devstral Small 1.1 costs $0.10 per million input tokens and $0.30 per million output tokens. At 10 million output tokens monthly, output cost is $3. At 100 million output tokens, output cost is $30. This is one of the lowest output prices in the tested set. However, given the model's rank of 51st out of 52 on our general benchmark suite, the low price reflects the model's narrow specialization rather than broad capability. For general-purpose workloads at similar cost, Gemma 4 26B A4B ($0.35/MTok output, avg 4.25) or Llama 3.3 70B Instruct ($0.32/MTok output, avg 3.5) deliver significantly better general-purpose results. Devstral Small 1.1's cost advantage is meaningful primarily for teams running high-volume software engineering agent workloads where its specialized training is directly applicable.
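The monthly figures follow directly from the per-token rates; a minimal cost helper using the listed prices:

```python
# Published rates for Devstral Small 1.1 (USD per million tokens).
INPUT_PER_MTOK = 0.10
OUTPUT_PER_MTOK = 0.30

def output_cost(output_tokens: int) -> float:
    """Output-side cost in USD for a given monthly token volume."""
    return output_tokens / 1_000_000 * OUTPUT_PER_MTOK

print(output_cost(10_000_000))   # $3 at 10M output tokens/month
print(output_cost(100_000_000))  # $30 at 100M output tokens/month
```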


Overall
3.08/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
2/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
2/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.100/MTok

Output

$0.300/MTok

Context Window: 131K tokens

modelpicker.net

Real-World Costs

Chat response: <$0.001
Blog post: <$0.001
Document batch: $0.017
Pipeline run: $0.170

Pricing vs Performance

Output cost per million tokens (log scale) vs average score across our 12 internal benchmarks


Try It

from openai import OpenAI

# Devstral Small 1.1 is served through OpenRouter's OpenAI-compatible API.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

response = client.chat.completions.create(
    model="mistralai/devstral-small",
    messages=[
        {"role": "user", "content": "Hello, Devstral Small 1.1!"}
    ],
)

print(response.choices[0].message.content)

Recommendation

Devstral Small 1.1 should be evaluated on software engineering benchmarks — not on our general-purpose suite, where it ranks 51st out of 52. The model is designed for agentic code tasks: code generation, repository navigation, test writing, and automated bug fixing in agentic pipelines. Teams building software engineering agents at low cost may find it performs well on those tasks despite its low general-purpose ranking. Teams looking for a general-purpose model at this price point should look elsewhere — Llama 3.3 70B Instruct at $0.32/MTok output scores 3.5 average on our suite versus 3.08 for Devstral Small 1.1, and Gemma 4 26B A4B at $0.35/MTok output scores 4.25. For more capable software engineering agents, Devstral 2 2512 ($2/MTok output, avg 4.0) and Devstral Medium ($2/MTok output, avg 3.17) offer alternatives.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.