mistral

Devstral 2 2512

Devstral 2 2512 is an open-source agentic coding model built on a 123-billion-parameter dense transformer. It accepts text-only inputs and specializes in code generation and agentic software workflows. At $0.40/M input and $2.00/M output with a 262,144 token context window, it offers one of the larger context windows in its price bracket. In our 12-test benchmark suite, Devstral 2 2512 ranks 28th out of 52 active models — a solid middle-tier position. Its general-purpose strengths are constrained rewriting (5/5, tied for 1st with 4 other models), structured output (5/5, tied for 1st with 24 others), long context (5/5), and multilingual (5/5). These results reflect our general benchmarks, not code-specific evaluations.

Performance

Devstral 2 2512 ranks 28th out of 52 active models overall. Top strengths in our testing: constrained rewriting (5/5, tied for 1st with 4 other models out of 53), structured output (5/5, tied for 1st with 24 others out of 54), long context (5/5, tied for 1st with 36 others out of 55), and multilingual (5/5, tied for 1st with 34 others). Agentic planning scored 4/5 (rank 16 of 54), tool calling 4/5 (rank 18 of 54), and creative problem solving 4/5 (rank 9 of 54). Weaker areas: safety calibration scored 1/5 (rank 32 of 55, well below the field median of 2/5), classification 3/5 (rank 31 of 53), and persona consistency 4/5 but ranked 38 of 53. These benchmark results are from our general-purpose test suite and may not fully reflect performance on code-specific tasks.
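A 5/5 structured-output score is most useful when the pipeline actually checks the JSON it gets back. A minimal sketch of that validation step (the function and field names here are hypothetical illustrations, not part of our benchmark):

```python
import json

def validate_reply(reply: str, required: set) -> dict:
    """Parse a model's JSON reply and confirm the required keys are present."""
    data = json.loads(reply)
    missing = required - data.keys()
    if missing:
        raise ValueError(f"reply is missing keys: {sorted(missing)}")
    return data

# Hypothetical schema-constrained reply from the model
reply = '{"language": "Python", "framework": "Flask"}'
parsed = validate_reply(reply, {"language", "framework"})
print(parsed["framework"])
```

A check like this turns a high structured-output score into a guarantee your downstream code can rely on.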

Pricing

Devstral 2 2512 costs $0.40 per million input tokens and $2.00 per million output tokens. At 10 million output tokens per month, that is $20; at 100 million output tokens, $200, before input costs. This pricing is identical to Devstral Medium and Mistral Medium 3.1, but Devstral 2 2512 ranks 28th of 52 on our general benchmarks — significantly better than Devstral Medium (rank 50) and somewhat below Mistral Medium 3.1 (rank 15). Because the weights are open, teams can also self-host rather than use API access, giving pricing flexibility beyond the $2.00/M output rate. The 262K context window is twice Mistral Medium 3.1's 131K, which matters for large-codebase ingestion.
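The arithmetic above generalizes to any usage mix. A small helper, with the per-million-token rates from this section hardcoded as assumptions:

```python
def monthly_cost(input_tokens: int, output_tokens: int,
                 input_rate: float = 0.40, output_rate: float = 2.00) -> float:
    """USD cost at Devstral 2 2512's quoted per-million-token rates."""
    return (input_tokens / 1_000_000) * input_rate \
         + (output_tokens / 1_000_000) * output_rate

print(monthly_cost(0, 10_000_000))   # 20.0, the $20/month figure above
print(monthly_cost(0, 100_000_000))  # 200.0
```

Change the rate defaults to compare against other models at the same volume.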


Overall
4.00/5 (Strong)

Benchmark Scores

Faithfulness
4/5
Long Context
5/5
Multilingual
5/5
Tool Calling
4/5
Classification
3/5
Agentic Planning
4/5
Structured Output
5/5
Safety Calibration
1/5
Strategic Analysis
4/5
Persona Consistency
4/5
Constrained Rewriting
5/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.40/MTok

Output

$2.00/MTok

Context Window

262K

modelpicker.net

Real-World Costs

Chat response: $0.0011
Blog post: $0.0042
Document batch: $0.108
Pipeline run: $1.08

Pricing vs Performance

Output cost per million tokens (log scale) vs average score across our 12 internal benchmarks


Try It

from openai import OpenAI

# Devstral 2 2512 is reachable through OpenRouter's OpenAI-compatible API.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # replace with your OpenRouter API key
)

response = client.chat.completions.create(
    model="mistralai/devstral-2512",
    messages=[
        {"role": "user", "content": "Hello, Devstral 2 2512!"}
    ],
)

print(response.choices[0].message.content)
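Tool calling (4/5 in our testing) uses the same endpoint. A hedged sketch of an OpenAI-style tool definition you could pass via the `tools` parameter of `chat.completions.create` — `run_tests` is a hypothetical helper for an agentic coding loop, not a real API:

```python
# Hypothetical tool schema in the OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical helper, named for illustration
        "description": "Run the project's test suite and return pass/fail output.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Directory to test"},
            },
            "required": ["path"],
        },
    },
}]

# Would be passed alongside messages, e.g.:
# client.chat.completions.create(model="mistralai/devstral-2512",
#                                messages=messages, tools=tools)
print(tools[0]["function"]["name"])
```

The model decides when to emit a tool call; your harness executes it and feeds the result back as a `tool` role message.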

Recommendation

Devstral 2 2512 is a solid choice for code-focused agentic pipelines, structured output generation, and long-context code analysis. Its 5/5 scores on constrained rewriting and structured output make it one of the stronger options at $2.00/M output for JSON generation and schema-constrained tasks. Open-weight availability is a meaningful differentiator for teams with compliance or self-hosting requirements. Avoid it for safety-critical applications (1/5 safety calibration) or use cases where classification accuracy is critical (3/5, rank 31 of 53). For general-purpose text tasks at the same price, Mistral Medium 3.1 scores higher across more benchmark dimensions.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions