Devstral Small 1.1 vs Mistral Large 3 2512
Mistral Large 3 2512 is the stronger general-purpose model, winning 7 of 12 benchmarks in our testing — including structured output (5 vs 4), strategic analysis (4 vs 2), faithfulness (5 vs 4), and agentic planning (4 vs 2). Devstral Small 1.1 wins only on classification (4 vs 3) and safety calibration (2 vs 1), while the two tie on tool calling, long context, and constrained rewriting. At $0.30/M output tokens versus $1.50/M, Devstral Small 1.1 costs 80% less — a real factor for high-volume pipelines where you're willing to accept lower scores on reasoning and planning tasks.
Both models are served by Mistral. Published pricing:

| Model | Input | Output |
| --- | --- | --- |
| Devstral Small 1.1 | $0.10/MTok | $0.30/MTok |
| Mistral Large 3 2512 | $0.50/MTok | $1.50/MTok |
Benchmark Analysis
Across our 12-test internal benchmark suite (scored 1–5), Mistral Large 3 2512 wins 7 tests, Devstral Small 1.1 wins 2, and they tie on 3.
Where Mistral Large 3 2512 wins:
- Structured output: 5 vs 4. Mistral Large 3 2512 ties for 1st among 54 models on JSON schema compliance; Devstral Small 1.1 ranks 26th of 54. For applications dependent on reliable schema adherence, this gap matters (a minimal example of the kind of check involved follows this list).
- Strategic analysis: 4 vs 2. Mistral Large 3 2512 ranks 27th of 54; Devstral Small 1.1 ranks 44th of 54. A two-point gap on nuanced tradeoff reasoning is significant for analytical applications.
- Creative problem solving: 3 vs 2. Mistral Large 3 2512 ranks 30th of 54; Devstral Small 1.1 ranks 47th of 54 — near the bottom of the field.
- Faithfulness: 5 vs 4. Mistral Large 3 2512 ties for 1st among 55 models; Devstral Small 1.1 ranks 34th. If your application requires staying close to source material without hallucinating, this is a substantial difference.
- Persona consistency: 3 vs 2. Both score poorly — Mistral Large 3 2512 ranks 45th of 53, Devstral Small 1.1 ranks 51st of 53. Neither model is recommended for character-maintenance tasks.
- Agentic planning: 4 vs 2. Mistral Large 3 2512 ranks 16th of 54; Devstral Small 1.1 ranks 53rd of 54, last place (tied with one other model). This is the starkest gap: goal decomposition and failure recovery are core weaknesses of Devstral Small 1.1.
- Multilingual: 5 vs 4. Mistral Large 3 2512 ties for 1st among 55 models; Devstral Small 1.1 ranks 36th. Non-English deployments should default to Mistral Large 3 2512.
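To make the structured-output gap concrete, here is a minimal sketch of the kind of pass/fail check a schema-compliance test performs: parse the raw model response as JSON, then validate it against a target schema. The invoice schema and the `call_model` helper are illustrative placeholders, not pieces of our harness; the `jsonschema` package is one standard way to run the validation.

```python
# Illustrative only: the pass/fail criterion behind a schema-compliance score.
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# Toy target schema, not from our test suite.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_id": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
                "required": ["description", "amount"],
            },
        },
    },
    "required": ["invoice_id", "total", "line_items"],
}

def schema_compliant(raw_response: str) -> bool:
    """True if the model's raw text parses as JSON and satisfies the schema."""
    try:
        payload = json.loads(raw_response)
        validate(instance=payload, schema=INVOICE_SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

# Usage: schema_compliant(call_model(prompt))  # call_model is hypothetical
```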
Where Devstral Small 1.1 wins:
- Classification: 4 vs 3. Devstral Small 1.1 ties for 1st among 53 models (30 models share this score); Mistral Large 3 2512 ranks 31st of 53. For routing and categorization tasks, Devstral Small 1.1 actually outperforms at one-fifth the cost.
- Safety calibration: 2 vs 1. Both models score poorly here — Devstral Small 1.1 ranks 12th of 55 (tied with 19 others); Mistral Large 3 2512 ranks 32nd of 55. Neither handles the refuse/permit balance well, but Devstral Small 1.1 is less miscalibrated.
Where they tie:
- Tool calling: both score 4/5, both rank 18th of 54. Function selection and argument accuracy are equally strong (see the check sketched after this list).
- Long context: both score 4/5, both rank 38th of 55. Retrieval at 30K+ tokens is equivalent, though Mistral Large 3 2512's 262K context window is double Devstral Small 1.1's 131K.
- Constrained rewriting: both score 3/5, both rank 31st of 53. Compression under hard limits is a shared weakness.
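The tool-calling tie is easier to interpret with a concrete check in view. A minimal sketch, assuming the common JSON-schema style of tool definition that most chat APIs share; the `get_weather` tool and the expected-call values are made up for illustration:

```python
import json

# Illustrative tool spec the model would be prompted with.
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

def call_is_correct(tool_call: dict, expected_name: str,
                    expected_args: dict) -> bool:
    """True if the model selected the expected function and its JSON
    arguments contain the expected key/value pairs -- the two things
    a tool-calling benchmark grades."""
    if tool_call.get("name") != expected_name:
        return False
    try:
        args = json.loads(tool_call.get("arguments", "{}"))
    except json.JSONDecodeError:
        return False
    return all(args.get(k) == v for k, v in expected_args.items())

# A well-formed call passes; a wrong function or malformed JSON fails.
good = {"name": "get_weather", "arguments": '{"city": "Paris"}'}
assert call_is_correct(good, "get_weather", {"city": "Paris"})
```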
One structural note: Mistral Large 3 2512 accepts image input in addition to text; Devstral Small 1.1 is text-only. This is not a benchmark score — it's a capability gate that makes the models non-equivalent for multimodal tasks regardless of scores.
Pricing Analysis
Devstral Small 1.1 costs $0.10/M input and $0.30/M output. Mistral Large 3 2512 costs $0.50/M input and $1.50/M output, exactly 5x more on both dimensions. At 1B output tokens/month, that's $300 vs $1,500: a $1,200 gap. At 10B tokens/month, the difference grows to $12,000. At 100B tokens/month, you're looking at $120,000 more per month for Mistral Large 3 2512.

For developers running classification pipelines, structured extraction jobs, or tool-calling workflows where the two models tie, Devstral Small 1.1 is the obvious cost choice. But for agentic applications, strategic analysis tasks, or multilingual deployments where Mistral Large 3 2512 scores meaningfully higher, the premium buys real capability. Price is not the only dividing line, either: Mistral Large 3 2512 accepts image input alongside text, while Devstral Small 1.1 is text-only, a binary difference that may decide the question before pricing enters the picture.
Real-World Cost Comparison
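The arithmetic behind the figures above is easy to sanity-check. A minimal sketch using the published per-million-token rates; the monthly volumes are examples, not usage data:

```python
# Rates in dollars per million tokens, from the pricing table above.
RATES = {
    "Devstral Small 1.1":   {"input": 0.10, "output": 0.30},
    "Mistral Large 3 2512": {"input": 0.50, "output": 1.50},
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Dollars per month for a volume given in millions of tokens."""
    r = RATES[model]
    return input_mtok * r["input"] + output_mtok * r["output"]

# Output-token-only view, matching the text (1B tokens = 1,000 MTok).
for out_mtok in (1_000, 10_000, 100_000):
    small = monthly_cost("Devstral Small 1.1", 0, out_mtok)
    large = monthly_cost("Mistral Large 3 2512", 0, out_mtok)
    print(f"{out_mtok:,} MTok out: ${small:,.0f} vs ${large:,.0f} "
          f"(gap: ${large - small:,.0f})")
# 1,000 MTok out: $300 vs $1,500 (gap: $1,200)
# 10,000 MTok out: $3,000 vs $15,000 (gap: $12,000)
# 100,000 MTok out: $30,000 vs $150,000 (gap: $120,000)
```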
Bottom Line
Choose Devstral Small 1.1 if: you are building a high-volume classification or routing pipeline (it ties for 1st on classification in our tests while costing 80% less), you need reliable tool calling or long-context retrieval at lower cost (tied scores, 5x cheaper), your workload is text-only and English-primary, or cost at scale is the binding constraint (saving $120,000+/month at 100B output tokens is real).
Choose Mistral Large 3 2512 if: you are building agentic systems (it scores 4 vs 2 on agentic planning, ranking 16th vs 53rd of 54 models — Devstral Small 1.1 is near last place here), you need reliable structured output for complex schema compliance (5 vs 4, tied for 1st), your application handles multiple languages (5 vs 4, tied for 1st multilingual), you need faithfulness to source material (5 vs 4, tied for 1st), you need image input alongside text, or your context requirements exceed 131K tokens (Mistral Large 3 2512 offers 262K). The 5x cost premium is justified for any of these use cases — but not for classification or pure tool-calling pipelines where Devstral Small 1.1 matches or beats it.
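If you want this guidance as an executable rule of thumb, here is a hypothetical router that encodes it. The task labels and model ID strings are placeholders, not real API identifiers:

```python
def pick_model(task: str, *, multilingual: bool = False,
               needs_images: bool = False, context_tokens: int = 0) -> str:
    """Pick between the two models using the decision rules above."""
    # Capability gates first: scores are irrelevant if the task can't run.
    if needs_images or context_tokens > 131_000:
        return "mistral-large-3-2512"
    # Devstral ties for 1st on classification at one-fifth the cost.
    if task in {"classification", "routing"}:
        return "devstral-small-1.1"
    # Large wins decisively on planning, analysis, schemas, and faithfulness.
    if task in {"agentic-planning", "strategic-analysis",
                "structured-output", "faithfulness-critical"}:
        return "mistral-large-3-2512"
    # Multilingual deployments default to Large (5 vs 4).
    if multilingual:
        return "mistral-large-3-2512"
    # On ties (tool calling, long context, rewriting), take the cheaper model.
    return "devstral-small-1.1"
```

The ordering matters: capability gates (image input, context length) are checked before any score-based tradeoff, mirroring the point above that a binary capability difference decides the question before pricing enters.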
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.