mistral

Devstral Medium

Devstral Medium is a code generation and agentic reasoning model. It accepts text-only inputs and targets software development workflows. At $0.40/M input and $2.00/M output with a 131,072 token context window, its pricing matches Mistral Medium 3.1. In our 12-test benchmark suite, Devstral Medium ranks 50th out of 52 active models — near the bottom overall. This reflects our benchmark composition: our tests favor general reasoning, multilingual performance, and safety calibration. Devstral Medium is a specialized coding model, so these results should be interpreted in that context. Classification (4/5, tied for 1st) and agentic planning (4/5, rank 16 of 54) are its strongest areas in our testing.

Performance

In our 12-test general-purpose benchmark suite, Devstral Medium ranks 50th out of 52 active models. Its strongest areas are classification (4/5, tied for 1st with 29 other models out of 53), agentic planning (4/5, rank 16 of 54), faithfulness (4/5, rank 34 of 55), and structured output (4/5, rank 26 of 54). Weaker areas include tool calling (3/5, rank 47 of 54), creative problem solving (2/5, rank 47 of 54), strategic analysis (2/5, rank 44 of 54), and safety calibration (1/5, rank 32 of 55). Persona consistency scored 3/5 (rank 45 of 53) and constrained rewriting 3/5 (rank 31 of 53). These results reflect our general-purpose test suite, not code-specific evaluations.

Pricing

Devstral Medium costs $0.40 per million input tokens and $2.00 per million output tokens. At 10 million output tokens per month, that is $20; at 100 million output tokens, $200 (input tokens are billed separately). At the same $2.00/M output price, Mistral Medium 3.1 scores significantly higher on our general-purpose benchmarks (rank 15 vs rank 50 of 52). For teams where general-purpose benchmark performance matters, Mistral Medium 3.1 offers more breadth at the same cost. For code-specific workflows, Devstral Medium's positioning as a specialized code model may justify the trade-off.
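The arithmetic above is easy to sanity-check. A minimal sketch, using only the listed $2.00/M output price and the hypothetical monthly volumes from this section:

```python
# Devstral Medium output pricing: $2.00 per million output tokens
OUTPUT_PRICE_PER_MTOK = 2.00

def monthly_output_cost(output_tokens: int) -> float:
    """Cost in USD for a given monthly output-token volume."""
    return output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK

print(monthly_output_cost(10_000_000))   # 10M output tokens/month -> 20.0
print(monthly_output_cost(100_000_000))  # 100M output tokens/month -> 200.0
```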


Devstral Medium

Overall
3.17/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 3/5
Classification: 4/5
Agentic Planning: 4/5
Structured Output: 4/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5
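Assuming the 3.17/5 overall score is an unweighted mean of the twelve benchmark scores above (our reading; the page does not state the aggregation explicitly), the figure can be reproduced in a few lines:

```python
# Twelve benchmark scores from the table above (assumed unweighted)
scores = {
    "Faithfulness": 4, "Long Context": 4, "Multilingual": 4,
    "Tool Calling": 3, "Classification": 4, "Agentic Planning": 4,
    "Structured Output": 4, "Safety Calibration": 1,
    "Strategic Analysis": 2, "Persona Consistency": 3,
    "Constrained Rewriting": 3, "Creative Problem Solving": 2,
}

overall = sum(scores.values()) / len(scores)
print(round(overall, 2))  # 3.17
```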

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.40/MTok
Output: $2.00/MTok
Context window: 131K tokens


Real-World Costs

Chat response: $0.0011
Blog post: $0.0042
Document batch: $0.108
Pipeline run: $1.08
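These per-task figures follow directly from the listed prices. A minimal estimator sketch — the token volumes below are hypothetical values we chose to match the table, not measured usage; real workloads will differ:

```python
# Devstral Medium pricing (USD per million tokens)
INPUT_PRICE = 0.40
OUTPUT_PRICE = 2.00

def task_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for one task at the listed prices."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000

# Assumed token volumes (illustrative only)
print(round(task_cost(250, 500), 4))          # chat response  -> 0.0011
print(round(task_cost(500, 2_000), 4))        # blog post      -> 0.0042
print(round(task_cost(20_000, 50_000), 3))    # document batch -> 0.108
print(round(task_cost(200_000, 500_000), 2))  # pipeline run   -> 1.08
```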

Pricing vs Performance

Output cost per million tokens (log scale) vs average score across our 12 internal benchmarks


Try It

from openai import OpenAI

# Devstral Medium is served through OpenRouter's OpenAI-compatible API.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # replace with your OpenRouter API key
)

response = client.chat.completions.create(
    model="mistralai/devstral-medium",
    messages=[
        {"role": "user", "content": "Hello, Devstral Medium!"}
    ],
)

print(response.choices[0].message.content)

Recommendation

Devstral Medium is positioned for code generation and coding agent use cases. Our general-purpose benchmarks show it at rank 50 of 52, so teams evaluating it for general text tasks should look at higher-ranking options. If your workflow is specifically code-focused — particularly around agentic coding assistants or code review pipelines — its classification (4/5) and agentic planning (4/5) scores are more relevant. For the same $2.00/M output price, Mistral Medium 3.1 delivers significantly broader performance across our benchmark suite.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions