Devstral Small 1.1 vs Mistral Large 3 2512

Mistral Large 3 2512 is the stronger general-purpose model, winning 7 of 12 benchmarks in our testing — including structured output (5 vs 4), strategic analysis (4 vs 2), faithfulness (5 vs 4), and agentic planning (4 vs 2). Devstral Small 1.1 wins only on classification (4 vs 3) and safety calibration (2 vs 1), while the two tie on tool calling, long context, and constrained rewriting. At $0.30/M output tokens versus $1.50/M, Devstral Small 1.1 costs 80% less — a real factor for high-volume pipelines where you're willing to accept lower scores on reasoning and planning tasks.

Devstral Small 1.1 (Mistral)

Overall: 3.08/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 2/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Persona Consistency: 2/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.100/MTok
Output: $0.300/MTok
Context Window: 131K

modelpicker.net

Mistral Large 3 2512 (Mistral)

Overall: 3.67/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 4/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 1/5
Strategic Analysis: 4/5
Persona Consistency: 3/5
Constrained Rewriting: 3/5
Creative Problem Solving: 3/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.500/MTok
Output: $1.50/MTok
Context Window: 262K


Benchmark Analysis

Across our 12-test internal benchmark suite (scored 1–5), Mistral Large 3 2512 wins 7 tests, Devstral Small 1.1 wins 2, and they tie on 3.

Where Mistral Large 3 2512 wins:

  • Structured output: 5 vs 4. Mistral Large 3 2512 ties for 1st among 54 models on JSON schema compliance; Devstral Small 1.1 ranks 26th of 54. For applications dependent on reliable schema adherence, this gap matters.
  • Strategic analysis: 4 vs 2. Mistral Large 3 2512 ranks 27th of 54; Devstral Small 1.1 ranks 44th of 54. A two-point gap on nuanced tradeoff reasoning is significant for analytical applications.
  • Creative problem solving: 3 vs 2. Mistral Large 3 2512 ranks 30th of 54; Devstral Small 1.1 ranks 47th of 54 — near the bottom of the field.
  • Faithfulness: 5 vs 4. Mistral Large 3 2512 ties for 1st among 55 models; Devstral Small 1.1 ranks 34th. If your application requires staying close to source material without hallucinating, this is a substantial difference.
  • Persona consistency: 3 vs 2. Both score poorly — Mistral Large 3 2512 ranks 45th of 53, Devstral Small 1.1 ranks 51st of 53. Neither model is recommended for character-maintenance tasks.
  • Agentic planning: 4 vs 2. Mistral Large 3 2512 ranks 16th of 54; Devstral Small 1.1 ranks 53rd of 54 — last place (tied with one other model). This is the starkest gap: goal decomposition and failure recovery are core weaknesses of Devstral Small 1.1.
  • Multilingual: 5 vs 4. Mistral Large 3 2512 ties for 1st among 55 models; Devstral Small 1.1 ranks 36th. Non-English deployments should default to Mistral Large 3 2512.

Where Devstral Small 1.1 wins:

  • Classification: 4 vs 3. Devstral Small 1.1 ties for 1st among 53 models (30 models share this score); Mistral Large 3 2512 ranks 31st of 53. For routing and categorization tasks, Devstral Small 1.1 actually outperforms at one-fifth the cost.
  • Safety calibration: 2 vs 1. Both models score poorly here — Devstral Small 1.1 ranks 12th of 55 (tied with 19 others); Mistral Large 3 2512 ranks 32nd of 55. Neither handles the refuse/permit balance well, but Devstral Small 1.1 is less miscalibrated.

Where they tie:

  • Tool calling: both score 4/5, both rank 18th of 54. Function selection and argument accuracy are equally strong.
  • Long context: both score 4/5, both rank 38th of 55. Retrieval at 30K+ tokens is equivalent, though Mistral Large 3 2512's 262K context window is double Devstral Small 1.1's 131K.
  • Constrained rewriting: both score 3/5, both rank 31st of 53. Compression under hard limits is a shared weakness.

One structural note: Mistral Large 3 2512 accepts image input in addition to text; Devstral Small 1.1 is text-only. This is not a benchmark score — it's a capability gate that makes the models non-equivalent for multimodal tasks regardless of scores.

Benchmark                   Devstral Small 1.1   Mistral Large 3 2512
Faithfulness                4/5                  5/5
Long Context                4/5                  4/5
Multilingual                4/5                  5/5
Tool Calling                4/5                  4/5
Classification              4/5                  3/5
Agentic Planning            2/5                  4/5
Structured Output           4/5                  5/5
Safety Calibration          2/5                  1/5
Strategic Analysis          2/5                  4/5
Persona Consistency         2/5                  3/5
Constrained Rewriting       3/5                  3/5
Creative Problem Solving    2/5                  3/5
Summary                     2 wins               7 wins
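The summary row can be reproduced mechanically. A minimal Python sketch, using only the scores from the table above:

```python
# Head-to-head scores: (Devstral Small 1.1, Mistral Large 3 2512)
SCORES = {
    "Faithfulness": (4, 5),
    "Long Context": (4, 4),
    "Multilingual": (4, 5),
    "Tool Calling": (4, 4),
    "Classification": (4, 3),
    "Agentic Planning": (2, 4),
    "Structured Output": (4, 5),
    "Safety Calibration": (2, 1),
    "Strategic Analysis": (2, 4),
    "Persona Consistency": (2, 3),
    "Constrained Rewriting": (3, 3),
    "Creative Problem Solving": (2, 3),
}

# Count per-benchmark wins and ties for each model.
devstral_wins = sum(d > m for d, m in SCORES.values())
mistral_wins = sum(m > d for d, m in SCORES.values())
ties = sum(d == m for d, m in SCORES.values())
print(devstral_wins, mistral_wins, ties)  # 2 7 3
```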

Pricing Analysis

Devstral Small 1.1 costs $0.10/M input and $0.30/M output. Mistral Large 3 2512 costs $0.50/M input and $1.50/M output, exactly 5x more on both dimensions. At 1B output tokens/month, that's $300 vs $1,500: a $1,200 gap. At 10B tokens/month, the difference grows to $12,000. At 100B tokens/month, you're looking at $120,000 more per month for Mistral Large 3 2512. For developers running classification pipelines, structured extraction jobs, or tool-calling workflows where the two models tie, Devstral Small 1.1 is the obvious cost choice. But for agentic applications, strategic analysis tasks, or multilingual deployments where Mistral Large 3 2512 scores meaningfully higher, the premium buys real capability. Beyond price, modality matters: Mistral Large 3 2512 accepts image input alongside text, while Devstral Small 1.1 is text-only, a binary difference that may decide the question before pricing enters the picture.
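The per-volume arithmetic is a one-line multiplication. A minimal sketch using the output-token prices quoted above (input cost excluded; the model names and price table are taken from this comparison):

```python
# Output-token prices, in dollars per million tokens.
PRICE_PER_MTOK = {"Devstral Small 1.1": 0.30, "Mistral Large 3 2512": 1.50}

def monthly_output_cost(model: str, output_tokens: float) -> float:
    """Dollar cost of a month's output-token volume for the given model."""
    return PRICE_PER_MTOK[model] * output_tokens / 1_000_000

for volume in (1e9, 10e9, 100e9):
    small = monthly_output_cost("Devstral Small 1.1", volume)
    large = monthly_output_cost("Mistral Large 3 2512", volume)
    print(f"{volume:,.0f} tokens: ${small:,.0f} vs ${large:,.0f} "
          f"(gap ${large - small:,.0f})")
```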

Real-World Cost Comparison

Task             Devstral Small 1.1   Mistral Large 3 2512
Chat response    <$0.001              <$0.001
Blog post        <$0.001              $0.0033
Document batch   $0.017               $0.085
Pipeline run     $0.170               $0.850

Bottom Line

Choose Devstral Small 1.1 if: you are building a high-volume classification or routing pipeline (it ties for 1st on classification in our tests while costing 80% less), you need reliable tool calling or long-context retrieval at lower cost (tied scores, 5x cheaper), your workload is text-only and English-primary, or cost at scale is the binding constraint (saving $120,000+/month at 100B output tokens is real).

Choose Mistral Large 3 2512 if: you are building agentic systems (it scores 4 vs 2 on agentic planning, ranking 16th vs 53rd of 54 models — Devstral Small 1.1 is near last place here), you need reliable structured output for complex schema compliance (5 vs 4, tied for 1st), your application handles multiple languages (5 vs 4, tied for 1st multilingual), you need faithfulness to source material (5 vs 4, tied for 1st), you need image input alongside text, or your context requirements exceed 131K tokens (Mistral Large 3 2512 offers 262K). The 5x cost premium is justified for any of these use cases — but not for classification or pure tool-calling pipelines where Devstral Small 1.1 matches or beats it.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions