mistral

Mistral Large 3 2512

Mistral Large 3 2512 is a large-scale multimodal model that accepts text and image inputs. At $0.50/M input and $1.50/M output with a 262,144-token context window, it offers a relatively low output cost for a model billed as the most capable in its provider's lineup. In our 12-test benchmark suite, it ranks 38th out of 52 active models, below the field median. Its strongest results are structured output (5/5, tied for 1st with 24 others), faithfulness (5/5, tied for 1st with 32 others), and multilingual (5/5). The overall rank is dragged down by weak scores in several dimensions, notably safety calibration (1/5) and persona consistency (3/5, rank 45 of 53). At the same $1.50/M output price, Gemini 3.1 Flash Lite Preview ranks 8th overall, a significant performance gap.

Performance

Mistral Large 3 2512 ranks 38th out of 52 active models in our overall benchmark average. Its top three strengths are structured output (5/5, tied for 1st with 24 other models out of 54), faithfulness (5/5, tied for 1st with 32 others out of 55), and multilingual (5/5, tied for 1st with 34 others out of 55). Agentic planning and tool calling both scored 4/5. Its significant weaknesses are safety calibration at 1/5 (rank 32 of 55, below the field median of 2/5), persona consistency at 3/5 (rank 45 of 53, near the bottom), constrained rewriting at 3/5 (rank 31 of 53), and classification at 3/5 (rank 31 of 53). Creative problem solving also scored 3/5 (rank 30 of 54), and long context scored 4/5 but ranked only 38 of 55.

Pricing

Mistral Large 3 2512 costs $0.50 per million input tokens and $1.50 per million output tokens. At 10 million output tokens per month, that is $15; at 100 million, $150. The $1.50/M output price is low for a flagship-tier model, but it does not buy the best performance at that price point: Gemini 3.1 Flash Lite Preview, also $1.50/M output, ranks 8th of 52, while Grok Code Fast 1 at the same price ties Mistral Large 3 2512 at 38th. For teams that specifically need image input support alongside its structured output and faithfulness strengths, Mistral Large 3 2512 may justify its place. Otherwise, the $1.50/M output price point buys better overall performance elsewhere in our dataset.
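The arithmetic above can be sketched as a small helper. This is a hypothetical utility using the rates listed on this page, not part of any provider API:

```python
# Monthly cost estimate for Mistral Large 3 2512 at this page's listed rates.
INPUT_PER_MTOK = 0.50   # dollars per million input tokens
OUTPUT_PER_MTOK = 1.50  # dollars per million output tokens

def monthly_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated dollar cost for a month's token usage."""
    return (input_tokens / 1e6) * INPUT_PER_MTOK + (output_tokens / 1e6) * OUTPUT_PER_MTOK

# 10M output tokens (ignoring input) comes to $15, matching the figure above.
print(monthly_cost(0, 10_000_000))  # 15.0
```

In practice input tokens usually dominate volume, so the blended cost per request depends heavily on your prompt-to-completion ratio.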

Mistral Large 3 2512

Overall
3.67/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 4/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 4/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 3/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.50/MTok
Output: $1.50/MTok
Context Window: 262K


Real-World Costs

Chat response: <$0.001
Blog post: $0.0033
Document batch: $0.085
Pipeline run: $0.850

Pricing vs Performance

Output cost per million tokens (log scale) vs average score across our 12 internal benchmarks


Try It

from openai import OpenAI

# Point the OpenAI SDK at OpenRouter's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # replace with your own key
)

# Standard chat completion request against the Mistral Large 3 2512 slug.
response = client.chat.completions.create(
    model="mistralai/mistral-large-2512",
    messages=[
        {"role": "user", "content": "Hello, Mistral Large 3 2512!"}
    ],
)

print(response.choices[0].message.content)

Recommendation

Mistral Large 3 2512 is best suited to applications that specifically need image input support plus strong faithfulness (5/5) and structured output (5/5): for example, document extraction from images with structured JSON output. Its 262K context window is substantial. For general-purpose use at $1.50/M output, however, Gemini 3.1 Flash Lite Preview ranks dramatically higher (8th vs 38th) at the same output price, with broader multimodal support. Avoid Mistral Large 3 2512 for safety-critical applications (1/5 safety calibration), persona-dependent products (3/5, rank 45 of 53), and tasks requiring classification accuracy (3/5, rank 31 of 53).
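As a sketch of that document-extraction use case, the snippet below builds a request combining an image input with JSON-mode output for an OpenAI-compatible chat API. It only constructs the payload; the model slug, the placeholder image URL, and `response_format` support are assumptions to verify against your provider's documentation:

```python
import json

# Hypothetical document-extraction request: one image plus an instruction to
# return structured JSON. Built as a plain dict so it can be inspected here,
# or unpacked into any OpenAI-compatible client call.
payload = {
    "model": "mistralai/mistral-large-2512",      # slug assumed from the example above
    "response_format": {"type": "json_object"},   # JSON mode, where supported
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract the invoice number, date, and total as JSON."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/invoice.png"}},  # placeholder
            ],
        }
    ],
}

print(json.dumps(payload, indent=2))
```

Pairing JSON mode with an explicit field list in the prompt is what makes the model's 5/5 structured output and faithfulness scores relevant here: the constraint is carried both in the prompt and in the request options.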

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions