Devstral Small 1.1 vs Grok 4

Grok 4 is the better pick for demanding, long‑context, or multimodal workflows, winning 8 of 12 benchmarks in our tests (long_context, faithfulness, strategic_analysis, and more). Devstral Small 1.1 trails on most quality metrics but costs roughly 2% as much (about 50× cheaper), making it a strong choice for cost-sensitive, text-only use cases.

Mistral

Devstral Small 1.1

Overall: 3.08/5 (Usable)

Benchmark Scores

Faithfulness: 4/5
Long Context: 4/5
Multilingual: 4/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 2/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 2/5
Persona Consistency: 2/5
Constrained Rewriting: 3/5
Creative Problem Solving: 2/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.100/MTok
Output: $0.300/MTok
Context Window: 131K


xAI

Grok 4

Overall: 4.08/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 4/5
Agentic Planning: 3/5
Structured Output: 4/5
Safety Calibration: 2/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 3/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $3.00/MTok
Output: $15.00/MTok
Context Window: 256K


Benchmark Analysis

Summary of our 12-test suite (scores shown as Devstral Small 1.1 vs Grok 4):

  • strategic_analysis: 2 vs 5 — Grok wins and is tied for 1st (alongside 25 other models). This matters for tasks that require nuanced tradeoff reasoning with real numbers; Devstral's 2 (rank 44/54) signals weaker performance on those analyses.
  • constrained_rewriting: 3 vs 4 — Grok wins (rank 6/53). Use Grok for tight character-limit compression; Devstral performs worse (rank 31/53).
  • creative_problem_solving: 2 vs 3 — Grok wins (rank 30/54). Grok generates more non‑obvious, feasible ideas in our tests.
  • faithfulness: 4 vs 5 — Grok wins and is tied for 1st (rank 1/55). For tasks that must stick to sources and avoid hallucination, Grok is measurably better.
  • long_context: 4 vs 5 — Grok wins and is tied for 1st on long-context (rank 1/55). This aligns with its 256k context window versus Devstral's 131k; Grok is stronger on retrieval at 30K+ tokens.
  • persona_consistency: 2 vs 5 — Grok wins (tied for 1st). In our tests Grok better resists persona injection and maintains character.
  • agentic_planning: 2 vs 3 — Grok wins (rank 42/54). Grok shows stronger decomposition and failure-recovery ability in our suite.
  • multilingual: 4 vs 5 — Grok wins and ties for 1st (rank 1/55). Grok is best for equivalent-quality non‑English output in our tests.

Ties (no clear winner in our tests): structured_output 4/4 (both rank 26), tool_calling 4/4 (both rank 18), classification 4/4 (both tied for 1st), safety_calibration 2/2 (both rank 12). Tool-calling parity means both models select functions and arguments similarly in our tests; classification parity means both reach top scores on routing and categorization. Safety calibration is low for both (score 2), so neither is a standout for safety-sensitive refusal or permissiveness behavior.

What these scores mean for real tasks: Grok 4 is objectively stronger on high-demand work (strategy, faithfulness, long-context retrieval, multilingual output, persona stability) and also accepts multimodal inputs. Devstral Small 1.1 is competent at classification, structured output, and standard tool calling but trails on strategy, faithfulness, and persona. The rank positions tell the same story: Grok often sits among the top performers (multiple rank-1 ties), while Devstral lands in the lower half on several high-value dimensions (agentic_planning rank 53, persona_consistency rank 51).
Benchmark | Devstral Small 1.1 | Grok 4
Faithfulness | 4/5 | 5/5
Long Context | 4/5 | 5/5
Multilingual | 4/5 | 5/5
Tool Calling | 4/5 | 4/5
Classification | 4/5 | 4/5
Agentic Planning | 2/5 | 3/5
Structured Output | 4/5 | 4/5
Safety Calibration | 2/5 | 2/5
Strategic Analysis | 2/5 | 5/5
Persona Consistency | 2/5 | 5/5
Constrained Rewriting | 3/5 | 4/5
Creative Problem Solving | 2/5 | 3/5
Summary | 0 wins | 8 wins
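
To make the summary row reproducible, here is a minimal Python sketch that tallies head-to-head wins from the scores above; the dictionary simply transcribes the table, and the names are ours for illustration, not our actual scoring harness:

```python
# Tally head-to-head wins from the per-benchmark scores in the table above.
# Each value is (Devstral Small 1.1, Grok 4), scored out of 5.
SCORES = {
    "faithfulness": (4, 5), "long_context": (4, 5), "multilingual": (4, 5),
    "tool_calling": (4, 4), "classification": (4, 4), "agentic_planning": (2, 3),
    "structured_output": (4, 4), "safety_calibration": (2, 2),
    "strategic_analysis": (2, 5), "persona_consistency": (2, 5),
    "constrained_rewriting": (3, 4), "creative_problem_solving": (2, 3),
}

devstral_wins = sum(d > g for d, g in SCORES.values())
grok_wins = sum(g > d for d, g in SCORES.values())
ties = sum(d == g for d, g in SCORES.values())
print(devstral_wins, grok_wins, ties)  # -> 0 8 4
```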

Pricing Analysis

Costs are listed per MTok (1 million tokens). We assume a 50/50 split of input and output tokens to illustrate real usage. Devstral Small 1.1 charges $0.10 input + $0.30 output per MTok, a blended $0.20/MTok; at 1B tokens/month (500 MTok in, 500 MTok out) that comes to $200/month ($50 input + $150 output), at 10B tokens $2,000/month, and at 100B tokens $20,000/month. Grok 4 charges $3.00 input + $15.00 output per MTok, a blended $9.00/MTok; with the same split, 1B tokens/month costs $9,000 ($1,500 input + $7,500 output), 10B costs $90,000, and 100B costs $900,000. The price ratio of 0.02 reflects this gap: Devstral runs at roughly 2% of Grok's sticker cost. High-volume consumers (teams sending hundreds of millions to billions of tokens per month) and cost-sensitive startups have the most to gain; only teams that need Grok 4's higher scores, multimodal inputs, or 256K context window can justify the much higher spend.
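
A minimal sketch of this arithmetic, assuming the per-MTok prices and 50/50 split above (the model keys and helper function are illustrative, not a billing API):

```python
# Back-of-the-envelope monthly cost at the listed per-MTok (per 1M token)
# prices. The 50/50 input/output split mirrors the assumption above.
PRICES = {
    "devstral-small-1.1": {"input": 0.10, "output": 0.30},
    "grok-4": {"input": 3.00, "output": 15.00},
}

def monthly_cost(model: str, tokens_per_month: float, input_share: float = 0.5) -> float:
    """Estimated monthly spend in USD for a total token volume."""
    p = PRICES[model]
    input_tokens = tokens_per_month * input_share
    output_tokens = tokens_per_month * (1.0 - input_share)
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 1_000_000_000):,.2f}/mo at 1B tokens")
# devstral-small-1.1: $200.00/mo at 1B tokens
# grok-4: $9,000.00/mo at 1B tokens
```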

Real-World Cost Comparison

Task | Devstral Small 1.1 | Grok 4
Chat response | <$0.001 | $0.0081
Blog post | <$0.001 | $0.032
Document batch | $0.017 | $0.810
Pipeline run | $0.170 | $8.10
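
These per-task figures fall out of the same per-MTok arithmetic once you pick a token budget per task. The budgets below are our guesses (they happen to reproduce the chat-response row), not numbers published with the table:

```python
# Per-task cost at the listed per-MTok prices: (input, output) USD per 1M tokens.
PRICES = {"devstral-small-1.1": (0.10, 0.30), "grok-4": (3.00, 15.00)}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single task."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Hypothetical chat-response budget: ~200 input / ~500 output tokens.
print(f"{task_cost('grok-4', 200, 500):.4f}")              # 0.0081
print(f"{task_cost('devstral-small-1.1', 200, 500):.5f}")  # 0.00017 (< $0.001)
```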

Bottom Line

Choose Devstral Small 1.1 if: you must minimize inference cost (≈ $200/mo at 1B tokens with a 50/50 I/O split), your workloads are text-only, and you need solid classification, structured output, or standard tool calling at a low price. Choose Grok 4 if: you need top-tier long-context retrieval, strict faithfulness, stronger strategic reasoning, multilingual parity, or multimodal inputs (image/file support) and can accept the much higher cost (≈ $9,000/mo at 1B tokens with the same I/O mix).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions