Mistral

Mistral offers a 12-model lineup focused on developer and enterprise use: coding assistants, agentic systems, and multimodal apps. In our testing the provider averages 3.56/5 across our 12-test suite. Mistral sits between the premium research-scale vendors and the low-cost open models: it emphasizes long context, structured output, and specialized code/agent models rather than chasing the top average scores posted by Anthropic or OpenAI.

Models: 12
Cheapest Output: $0.10/mtok
Avg Score: 3.56/5
Price Range: $0.10–$2.00/mtok (output)

Model Lineup

Mistral’s models fall into three practical tiers in our data: flagship, mid/high-performance, and budget/specialist.

Flagship:
- Mistral Large 3 2512: described by the provider as “most capable”; 262,144-token context; input $0.50/mtok, output $1.50/mtok.

Mid / high-performance:
- Mistral Medium 3.1: best average score in our tests (4.25); multimodal; 131,072-token context; input $0.40/mtok, output $2.00/mtok.
- Devstral 2 2512: large coding/agent model; 262,144-token context; input $0.40/mtok, output $2.00/mtok.
- Codestral 2508: coding specialist with low latency; 256,000-token context; input $0.30/mtok, output $0.90/mtok.

Budget / specialist:
- Mistral Small 4: input $0.15/mtok, output $0.60/mtok.
- Ministral family, for cost-sensitive, vision-capable, or lightweight deployments: Ministral 3 14B ($0.20/$0.20), Ministral 3 8B ($0.15/$0.15), Ministral 3 3B ($0.10/$0.10).
- Devstral Medium ($0.40/$2.00) and Devstral Small 1.1 ($0.10/$0.30): code/agent workflows at intermediate prices.

In short: choose Mistral Large 3 2512 when provider-rated capability and architecture are the priority; Mistral Medium 3.1 for the best average benchmark score in our testing; the Codestral/Devstral models for production coding assistants; and Ministral or Mistral Small variants when cost or a small footprint matters most. A minimal sketch of this selection logic follows.
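To make that guidance concrete, here is a minimal sketch of the selection logic in Python. Everything in it is illustrative: the lowercase model identifiers are hypothetical slugs (not confirmed API model IDs), the workload labels are our own, and the prices simply restate the per-million-token figures above.

```python
# Illustrative model chooser based on the tier guidance above.
# Model slugs are hypothetical; prices are USD per million tokens
# as listed in this article and may change.

PRICES = {  # (input $/mtok, output $/mtok)
    "mistral-large-3-2512": (0.50, 1.50),
    "mistral-medium-3.1": (0.40, 2.00),
    "codestral-2508": (0.30, 0.90),
    "devstral-2-2512": (0.40, 2.00),
    "mistral-small-4": (0.15, 0.60),
    "ministral-3-3b": (0.10, 0.10),
}

def choose_model(workload: str) -> str:
    """Map a coarse workload label to a suggested model."""
    table = {
        "max-capability": "mistral-large-3-2512",  # provider's flagship
        "best-benchmark": "mistral-medium-3.1",    # top avg score (4.25) in our tests
        "coding-assistant": "codestral-2508",      # low-latency code specialist
        "agentic-coding": "devstral-2-2512",       # large code/agent model
        "budget": "ministral-3-3b",                # cheapest listed option
    }
    return table.get(workload, "mistral-small-4")  # mid-cost default

if __name__ == "__main__":
    for w in ("coding-assistant", "budget", "unknown"):
        m = choose_model(w)
        print(f"{w:>18} -> {m} (in/out $/mtok: {PRICES[m]})")
```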

Strengths and Weaknesses

Strengths (our testing): Mistral models score highly on long-context handling (shared p50 = 5), structured output (many models with p50 ≈ 4), and multilingual ability (shared p50 = 5). The code- and agent-focused models (Codestral and the Devstral family) show strong tool calling, structured output, and faithfulness in our task suite (a request sketch appears at the end of this section).

Weaknesses (our testing): safety calibration is a recurring low point across the lineup (many models score 1–2), and creative problem solving is uneven (several models at 2–3). Across the provider, the average score is 3.56/5 on our 12-test suite: below Anthropic and OpenAI (4.67) and Google/DeepSeek (4.5), but slightly above Meta’s listed average (3.5).

External context: across our dataset, the median SWE-bench Verified score is 70.8%, the median MATH Level 5 score is 94.15%, and the median AIME 2025 score is 83.9% (Epoch AI). Those numbers are dataset-level, not Mistral-specific, so treat them as supplementary reference when weighing Mistral’s code/math strengths.
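Structured output is typically requested through the API’s JSON mode. The sketch below uses plain HTTP against Mistral’s chat-completions endpoint; the endpoint path and the response_format flag follow Mistral’s public API docs as we understand them, but treat the model alias and the exact flag as assumptions to verify against current documentation.

```python
# Minimal sketch: requesting JSON-structured output from a Mistral chat model.
# Assumes Mistral's documented chat-completions endpoint and JSON mode;
# verify the model alias and flag against current API docs before relying on this.
import json
import os

import requests

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "mistral-large-latest",  # placeholder alias; pin a version in production
        "messages": [
            {"role": "system",
             "content": 'Reply only with a JSON object: {"sentiment": str, "score": float}.'},
            {"role": "user",
             "content": "The new release is fast but the docs are thin."},
        ],
        "response_format": {"type": "json_object"},  # ask the API to enforce valid JSON
    },
    timeout=30,
)
resp.raise_for_status()
content = resp.json()["choices"][0]["message"]["content"]
print(json.loads(content))  # parsed dict, e.g. {"sentiment": "mixed", "score": 0.6}
```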

Pricing

Mistral’s observed output pricing runs $0.10–$2.00 per million tokens across the lineup. That sits well below the premium providers in our competitor set (Anthropic’s Claude Sonnet 4.6 at $15.00/mtok output; OpenAI’s GPT-5.2 at $14.00/mtok) and is competitive with mid-tier vendors (Google Gemini 3 Flash Preview at $3.00; DeepSeek R1 0528 at $2.15). Mistral is therefore mid-range: considerably cheaper than top-tier models but generally more expensive than the lowest-cost open releases (Meta’s Llama 3.3 70B Instruct at $0.32 and other Meta Llama variants at $0.30). Use Mistral when you need long-context or structured-output capability without paying top-tier rates; a worked cost example follows.
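As a worked example of the per-million-token arithmetic, the snippet below estimates one request’s cost from the prices listed above; the request sizes are invented for illustration.

```python
# Worked example: per-request cost from per-million-token prices.
# Prices (USD per 1M tokens) are taken from this article; request sizes are invented.

def request_cost(input_tokens: int, output_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Cost in USD for one request, given $/1M-token rates."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Mistral Large 3 2512: $0.50 in / $1.50 out per million tokens.
cost = request_cost(input_tokens=8_000, output_tokens=1_000,
                    in_price=0.50, out_price=1.50)
print(f"${cost:.4f}")  # 8000/1e6*0.50 + 1000/1e6*1.50 = $0.0055
```

At these rates, an 8K-input/1K-output request against the flagship costs about half a cent, which is the practical meaning of “mid-range” here.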

Pricing vs Performance

[Scatter chart: output cost per million tokens (log scale) vs. average score across our 12 internal benchmarks; Mistral models highlighted against other models.]

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
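For a feel for the mechanics, here is a heavily simplified sketch of an LLM-as-judge scoring loop. It is not our actual harness: the judge prompt, the judge_model callable, and the integer parsing are illustrative stand-ins.

```python
# Simplified sketch of an LLM-as-judge scoring loop (illustrative, not our harness).
# `judge_model` stands in for any chat-completion call that returns a string.
import re
from statistics import mean
from typing import Callable

JUDGE_PROMPT = """Rate the candidate answer from 1 (poor) to 5 (excellent)
for the task below. Reply with a single integer.

Task: {task}
Candidate answer: {answer}"""

def score_answer(task: str, answer: str, judge_model: Callable[[str], str]) -> int:
    """Ask the judge model for a 1-5 rating and parse the first integer."""
    reply = judge_model(JUDGE_PROMPT.format(task=task, answer=answer))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group())

def average_score(scores: list[int]) -> float:
    """Provider-level average, as reported in this article (e.g. 3.56/5)."""
    return round(mean(scores), 2)

if __name__ == "__main__":
    stub = lambda prompt: "4"  # stand-in for a real API call
    s = score_answer("Summarize the text.", "A short summary.", stub)
    print(s, average_score([s, 3, 4]))  # 4 3.67
```

In the real suite, each of the 12 benchmarks contributes one such 1–5 score per model, and the provider average (e.g. 3.56/5) is taken over those scores.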
