mistral

Codestral 2508

Codestral 2508 is Mistral's code-specialized model, designed for low-latency, high-frequency coding tasks including fill-in-the-middle, code correction, and test generation. At $0.30 input / $0.90 output per million tokens, it is priced below most general-purpose models while targeting a focused use case: code-related workflows where faithfulness to source, reliable tool invocation, and structured output matter more than creative reasoning or strategic analysis. In our testing, it ranked 43rd out of 52 models overall — but that aggregate rank masks strong performance on the benchmarks most relevant to coding assistants.

Performance

Codestral 2508's three strongest benchmarks in our testing are tool calling (5/5, tied for 1st with 16 other models out of 54 tested), faithfulness (5/5, tied for 1st with 32 other models out of 55 tested), and structured output (5/5, tied for 1st with 24 other models out of 54 tested). Long context also scored 5/5 (tied for 1st with 36 other models out of 55 tested). These are the four dimensions most directly relevant to agentic coding — accurate function calls, reliable source adherence, schema-compliant output, and retrieval in large codebases. The model's notable weaknesses are strategic analysis (2/5, rank 44 of 54) and creative problem solving (2/5, rank 47 of 54), both of which fall well below the field median. Safety calibration scored 1/5 (rank 32 of 55). Overall rank: 43 out of 52 tested models.

Pricing

Codestral 2508 costs $0.30 per million input tokens and $0.90 per million output tokens. At 1 million output tokens/month, that is $0.90; at 10 million output tokens, $9.00. It undercuts GPT-4o ($10/MTok output), GPT-4.1 ($8/MTok output), and even Mistral Large 3 2512 ($1.50/MTok output) while outscoring all three on faithfulness and tool calling in our tests. For code-focused applications — where you run many short completions with tool calls — the economics are favorable. The 256,000-token context window accommodates large codebases without tiered pricing penalties.
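The cost arithmetic above is straightforward to sketch. A minimal helper, using only the published $0.30/$0.90 per-million-token prices (the function name is ours, not part of any API):

```python
# Cost model from the prices above: $0.30/MTok input, $0.90/MTok output.
INPUT_PER_MTOK = 0.30
OUTPUT_PER_MTOK = 0.90

def codestral_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated Codestral 2508 cost in USD for a given token volume."""
    return (input_tokens / 1e6) * INPUT_PER_MTOK + (output_tokens / 1e6) * OUTPUT_PER_MTOK

# The 10M-output-tokens/month figure from the text:
print(f"${codestral_cost(0, 10_000_000):.2f}")  # $9.00
```

Because there is no tiered pricing on the 256K context window, the same linear formula holds regardless of prompt size.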


Overall: 3.50/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 4/5
Tool Calling: 5/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 2/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.300/MTok
Output: $0.900/MTok
Context Window: 256K

modelpicker.net

Real-World Costs

Chat response: <$0.001
Blog post: $0.0020
Document batch: $0.051
Pipeline run: $0.510

Pricing vs Performance

Output cost per million tokens (log scale) vs average score across our 12 internal benchmarks


Try It

from openai import OpenAI

# Codestral 2508 is served through OpenRouter's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # replace with your OpenRouter API key
)

response = client.chat.completions.create(
    model="mistralai/codestral-2508",
    messages=[
        {"role": "user", "content": "Hello, Codestral 2508!"}
    ],
)

print(response.choices[0].message.content)

Recommendation

Codestral 2508 is the right choice for developers building code-specific pipelines where tool calling accuracy, faithfulness to source, and structured output are the primary requirements. At $0.90/MTok output, it delivers 5/5 on all three of those dimensions and supports fill-in-the-middle use cases. It is not a good fit for general-purpose assistants, strategic analysis tasks, or creative work — scoring 2/5 on both strategic analysis and creative problem solving. Developers who need a single model to cover both code and reasoning should evaluate models with more balanced scores, such as Mistral Medium 3.1 (avg 4.25, $2.00/MTok output).
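For the tool-calling pipelines recommended above, the usual pattern is an OpenAI-style tool schema plus a local dispatch table that executes whatever function the model requests. A minimal sketch, assuming OpenRouter passes OpenAI-format `tools` through to the provider (check provider docs); the `run_tests` helper is hypothetical:

```python
import json

# OpenAI-style tool schema (assumed to be passed through by OpenRouter).
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical helper for a coding agent
        "description": "Run the project's test suite and return results.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def run_tests(path: str) -> str:
    """Stub implementation; a real agent would shell out to the test runner."""
    return json.dumps({"path": path, "passed": True})

# Map tool names the model may emit to local functions.
DISPATCH = {"run_tests": run_tests}

def handle_tool_call(name: str, arguments: str) -> str:
    """Execute the tool the model requested; arguments arrive as a JSON string."""
    args = json.loads(arguments)
    return DISPATCH[name](**args)

# The model returns calls like this in response.choices[0].message.tool_calls:
print(handle_tool_call("run_tests", '{"path": "tests/"}'))
```

The returned JSON string is appended to the conversation as a `tool`-role message so the model can continue the loop.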

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions