Devstral Small 1.1 vs Grok 3 Mini
Grok 3 Mini is the practical winner for agents, assistants, and long-context workflows: it wins 8 of 12 benchmarks (tool calling, faithfulness, long context, persona, and more). Devstral Small 1.1 is the cost-conscious choice: it ties on classification and structured output while costing materially less per MTok.
Pricing
- Devstral Small 1.1 (Mistral): $0.10/MTok input, $0.30/MTok output
- Grok 3 Mini (xAI): $0.30/MTok input, $0.50/MTok output
Benchmark Analysis
Across our 12-test suite, Grok 3 Mini wins eight categories, Devstral Small 1.1 wins none, and four tie. Detailed walk-through (scores shown as Devstral / Grok):
- Tool calling: 4 vs 5 — Grok wins and is tied for 1st on our tool_calling ranking ("tied for 1st with 16 other models"), which matters for function selection, argument accuracy and sequencing in agent pipelines.
- Faithfulness: 4 vs 5 — Grok wins and ranks tied for 1st on faithfulness; expect fewer source hallucinations for tasks that must stick closely to input text.
- Long context: 4 vs 5 — Grok wins and is tied for 1st on long_context; better retrieval and coherence when working with 30K+ token contexts.
- Persona consistency: 2 vs 5 — Grok wins and is tied for 1st; better at maintaining character and resisting injection attacks for chat agents.
- Agentic planning: 2 vs 3 — Grok wins (Grok ranks 42 of 54), which translates to better goal decomposition and failure recovery in planners.
- Strategic analysis: 2 vs 3 — Grok wins; higher scores mean clearer tradeoff reasoning for numeric or multi-step decisions.
- Creative problem solving: 2 vs 3 — Grok wins; stronger at producing specific, feasible ideas.
- Constrained rewriting: 3 vs 4 — Grok wins (rank 6 of 53); better at tight format rewriting and compression.
- Structured output: 4 vs 4 — tie; both handle JSON/schema compliance comparably (Devstral rank 26, Grok rank 26).
- Classification: 4 vs 4 — tie; both are high-performing here (Devstral is tied for 1st with 29 others).
- Safety calibration: 2 vs 2 — tie; similar refusal/permissive behavior in our tests.
- Multilingual: 4 vs 4 — tie; both produce comparable non-English outputs.

Practical meaning: Grok is the stronger choice where correctness under tool use, source fidelity, and very long context matter. Devstral matches Grok on classification and structured-output tasks while costing far less, but it lags on persona, long-context, and faithfulness metrics (e.g., Devstral's persona_consistency rank is 51 of 53).
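The win/tie tallies above can be verified with a short sketch (scores transcribed from the list, Devstral first):

```python
# Benchmark scores from the walk-through above, (Devstral, Grok), 1-5 scale.
scores = {
    "tool_calling": (4, 5),
    "faithfulness": (4, 5),
    "long_context": (4, 5),
    "persona_consistency": (2, 5),
    "agentic_planning": (2, 3),
    "strategic_analysis": (2, 3),
    "creative_problem_solving": (2, 3),
    "constrained_rewriting": (3, 4),
    "structured_output": (4, 4),
    "classification": (4, 4),
    "safety_calibration": (2, 2),
    "multilingual": (4, 4),
}

grok_wins = sum(1 for d, g in scores.values() if g > d)
devstral_wins = sum(1 for d, g in scores.values() if d > g)
ties = sum(1 for d, g in scores.values() if d == g)
print(grok_wins, devstral_wins, ties)  # 8 0 4
```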
Pricing Analysis
At listed rates, Devstral Small 1.1 charges $0.10 input / $0.30 output per MTok; Grok 3 Mini charges $0.30 input / $0.50 output per MTok. Assuming a 50/50 split of input vs output tokens, the blended rate is $0.20 per MTok for Devstral and $0.40 per MTok for Grok, so monthly costs scale as: 1M total tokens -> Devstral ≈ $0.20, Grok ≈ $0.40; 10M -> Devstral ≈ $2, Grok ≈ $4; 100M -> Devstral ≈ $20, Grok ≈ $40; 1B -> Devstral ≈ $200, Grok ≈ $400. The Grok bill is roughly double Devstral's under this usage pattern. Teams with high-volume production workloads, embedded assistants, or tight margins should prefer Devstral for cost savings. Teams that need the wins Grok provides (tool calling, long context, faithfulness, persona) should budget for roughly 2x the token cost.
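The cost arithmetic above can be sketched as a small helper; the 50/50 input/output split is an assumption you should replace with your own traffic mix:

```python
def monthly_cost(total_tokens, input_price, output_price, input_share=0.5):
    """Estimated bill in dollars; prices are $ per million tokens (MTok)."""
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens * (1 - input_share)
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Rates from the comparison above, assuming a 50/50 input/output split.
for total in (1_000_000, 10_000_000, 100_000_000):
    devstral = monthly_cost(total, 0.10, 0.30)
    grok = monthly_cost(total, 0.30, 0.50)
    print(f"{total:>11,} tokens: Devstral ${devstral:,.2f} vs Grok ${grok:,.2f}")
```

Shifting `input_share` upward (e.g., RAG workloads with long prompts and short answers) narrows the gap in absolute dollars but Grok stays roughly 2–3x more expensive at these rates.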
Bottom Line
Choose Devstral Small 1.1 if you need a lower-cost model for high-volume classification, schema/JSON outputs, or cost-sensitive production where its ties on classification and structured output are sufficient (Devstral: $0.10 input / $0.30 output per MTok). Choose Grok 3 Mini if you need best-in-suite behavior for tool calling, faithfulness, long-context coherence, persona consistency, or stronger agentic planning, and can accept roughly 2x the token cost (Grok: $0.30 input / $0.50 output per MTok).
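The decision rule above can be expressed as a hypothetical helper (the capability set is transcribed from this comparison; `pick_model` is illustrative, not a real API):

```python
# Categories Grok 3 Mini wins in this comparison (hypothetical helper data).
GROK_WINS = {
    "tool_calling", "faithfulness", "long_context", "persona_consistency",
    "agentic_planning", "strategic_analysis", "creative_problem_solving",
    "constrained_rewriting",
}

def pick_model(required_capabilities):
    """Pick Grok 3 Mini only when a required capability is one it wins;
    otherwise default to the cheaper Devstral Small 1.1."""
    if GROK_WINS & set(required_capabilities):
        return "Grok 3 Mini"
    return "Devstral Small 1.1"

print(pick_model({"classification", "structured_output"}))  # Devstral Small 1.1
print(pick_model({"tool_calling", "long_context"}))         # Grok 3 Mini
```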
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.