Devstral Small 1.1 vs Grok 3 Mini

Grok 3 Mini is the practical winner for agents, assistants, and long-context workflows: it wins 8 of 12 benchmarks (tool calling, faithfulness, long context, persona). Devstral Small 1.1 is the cost-conscious choice: it ties on classification and structured output while costing materially less per MTok.

Devstral Small 1.1 (Mistral)

Overall: 3.08/5 (Usable)

Benchmark Scores
Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 2/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Persona Consistency: 2/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks
SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing
Input: $0.100/MTok
Output: $0.300/MTok

Context Window: 131K

modelpicker.net

Grok 3 Mini (xAI)

Overall: 3.92/5 (Strong)

Benchmark Scores
Faithfulness: 5/5
Long Context: 5/5
Multilingual: 4/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 3/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 3/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 3/5

External Benchmarks
SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing
Input: $0.300/MTok
Output: $0.500/MTok

Context Window: 131K


Benchmark Analysis

Across our 12-test suite, Grok 3 Mini wins 8 categories, Devstral Small 1.1 wins none, and four tests tie. Detailed walk-through (scores shown as Devstral / Grok):

  • Tool calling: 4 vs 5 — Grok wins and is tied for 1st on our tool_calling ranking (alongside 16 other models), which matters for function selection, argument accuracy, and sequencing in agent pipelines.
  • Faithfulness: 4 vs 5 — Grok wins and ranks tied for 1st on faithfulness; expect fewer source hallucinations for tasks that must stick closely to input text.
  • Long context: 4 vs 5 — Grok wins and is tied for 1st on long_context; better retrieval and coherence when working with 30K+ token contexts.
  • Persona consistency: 2 vs 5 — Grok wins and is tied for 1st; better at maintaining character and resisting injection attacks for chat agents.
  • Agentic planning: 2 vs 3 — Grok wins (rank 42 of 54), which translates to better goal decomposition and failure recovery in planners.
  • Strategic analysis: 2 vs 3 — Grok wins; higher scores mean clearer tradeoff reasoning for numeric or multi-step decisions.
  • Creative problem solving: 2 vs 3 — Grok wins; stronger at producing specific, feasible ideas.
  • Constrained rewriting: 3 vs 4 — Grok wins (rank 6 of 53); better at tight format rewriting and compression.
  • Structured output: 4 vs 4 — tie; both handle JSON/schema compliance comparably (Devstral rank 26, Grok rank 26).
  • Classification: 4 vs 4 — tie; both are high-performing here (Devstral is tied for 1st with 29 others).
  • Safety calibration: 2 vs 2 — tie; similar refusal/permissive behavior in our tests.
  • Multilingual: 4 vs 4 — tie; both produce comparable non-English outputs.

Practical meaning: Grok is the stronger choice where correctness under tool use, source fidelity, and very long context matter. Devstral matches Grok on classification and structured-output tasks while costing far less, but it lags on persona, long-context, and faithfulness metrics (e.g., Devstral ranks 51 of 53 on persona_consistency).
Benchmark                   Devstral Small 1.1   Grok 3 Mini
Faithfulness                4/5                  5/5
Long Context                4/5                  5/5
Multilingual                4/5                  4/5
Tool Calling                4/5                  5/5
Classification              4/5                  4/5
Agentic Planning            2/5                  3/5
Structured Output           4/5                  4/5
Safety Calibration          2/5                  2/5
Strategic Analysis          2/5                  3/5
Persona Consistency         2/5                  5/5
Constrained Rewriting       3/5                  4/5
Creative Problem Solving    2/5                  3/5
Summary                     0 wins               8 wins
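As a sanity check on the tallies above, a short script can recount wins and ties from the score table (the dictionary below simply transcribes the table; the counting logic is an illustrative sketch, not part of the published methodology):

```python
# Scores from the 12-benchmark comparison: (Devstral Small 1.1, Grok 3 Mini).
scores = {
    "Faithfulness": (4, 5),
    "Long Context": (4, 5),
    "Multilingual": (4, 4),
    "Tool Calling": (4, 5),
    "Classification": (4, 4),
    "Agentic Planning": (2, 3),
    "Structured Output": (4, 4),
    "Safety Calibration": (2, 2),
    "Strategic Analysis": (2, 3),
    "Persona Consistency": (2, 5),
    "Constrained Rewriting": (3, 4),
    "Creative Problem Solving": (2, 3),
}

# Count rows where each model strictly outscores the other.
devstral_wins = sum(d > g for d, g in scores.values())
grok_wins = sum(g > d for d, g in scores.values())
ties = sum(d == g for d, g in scores.values())

print(devstral_wins, grok_wins, ties)  # prints: 0 8 4
```

Running it reproduces the summary row: 0 wins for Devstral, 8 for Grok, 4 ties.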

Pricing Analysis

Per the listed pricing, Devstral Small 1.1 charges $0.10 input / $0.30 output per MTok; Grok 3 Mini charges $0.30 input / $0.50 output per MTok. Assuming a 50/50 split of input vs output tokens, blended costs are: 1M total tokens -> Devstral ≈ $0.20, Grok ≈ $0.40; 10M -> Devstral ≈ $2, Grok ≈ $4; 100M -> Devstral ≈ $20, Grok ≈ $40. The Grok bill is roughly double Devstral's under this usage pattern. Teams with high-volume production workloads, embedded assistants, or tight margins should prefer Devstral for cost savings. Teams that need the wins Grok provides (tool calling, long context, faithfulness, persona) should budget for roughly 2x the token cost.
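The blended-cost arithmetic above can be sketched as a small helper. The per-MTok rates come from the pricing cards; the `monthly_cost` function name and the 50/50 input/output split default are assumptions for illustration:

```python
# Published per-MTok rates (USD per million tokens).
PRICES = {
    "Devstral Small 1.1": {"input": 0.10, "output": 0.30},
    "Grok 3 Mini": {"input": 0.30, "output": 0.50},
}

def monthly_cost(model: str, total_tokens: int, input_share: float = 0.5) -> float:
    """Blended cost for a month's tokens, given a fixed input/output split."""
    p = PRICES[model]
    mtok = total_tokens / 1_000_000
    return mtok * (input_share * p["input"] + (1 - input_share) * p["output"])

for volume in (1_000_000, 10_000_000, 100_000_000):
    d = monthly_cost("Devstral Small 1.1", volume)
    g = monthly_cost("Grok 3 Mini", volume)
    print(f"{volume:>11,} tokens: Devstral ${d:,.2f} vs Grok ${g:,.2f}")
```

Adjusting `input_share` shows the ratio holds across realistic splits: because Grok costs more on both sides, the bill stays roughly 2x Devstral's regardless of the mix.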

Real-World Cost Comparison

Task             Devstral Small 1.1   Grok 3 Mini
Chat response    <$0.001              <$0.001
Blog post        <$0.001              $0.0011
Document batch   $0.017               $0.031
Pipeline run     $0.170               $0.310

Bottom Line

Choose Devstral Small 1.1 if you need a lower-cost model for high-volume classification, schema/JSON outputs, or cost-sensitive production where ties on classification and structured output are sufficient (Devstral: $0.10 input / $0.30 output per MTok). Choose Grok 3 Mini if you need best-in-suite behavior for tool calling, faithfulness, long-context coherence, persona consistency, or stronger agentic planning, and can accept roughly 2x the token cost (Grok: $0.30 input / $0.50 output per MTok).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions