Codestral 2508 vs Grok 4.20

Grok 4.20 is the better pick for assistant-style, multilingual, and strategic-reasoning use cases, winning 6 of 12 benchmarks in our testing. Codestral 2508 is the budget choice for high-throughput coding workflows: it ties Grok on tooling, structure, faithfulness, and long context but costs ~15% as much.

Mistral

Codestral 2508

Overall
3.50/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
4/5
Tool Calling
5/5
Classification
3/5
Agentic Planning
4/5
Structured Output
5/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.300/MTok

Output

$0.900/MTok

Context Window: 256K

modelpicker.net

xAI

Grok 4.20

Overall
4.33/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
4/5
Structured Output
5/5
Safety Calibration
1/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$2.00/MTok

Output

$6.00/MTok

Context Window: 2000K (2M tokens)


Benchmark Analysis

Overview: In our 12-test suite Grok 4.20 wins 6 categories, Codestral 2508 wins 0, and 6 are ties. Below we compare each test (score shown as Codestral → Grok), cite ranks, and explain practical impact.

  • strategic_analysis: 2 → 5. Grok scores 5, tied for 1st of 54 models (with 25 others); Codestral scores 2 (rank 44 of 54). For tasks needing nuanced tradeoffs or numeric decision-making, Grok is substantially stronger in our tests.

  • constrained_rewriting: 3 → 4. Grok wins (rank 6 of 53) vs Codestral (rank 31). Grok is better at tight-length rewrites and hard character-limited transformations.

  • creative_problem_solving: 2 → 4. Grok (rank 9 of 54) produces more feasible, non-obvious ideas in our testing; Codestral lagged on originality and depth.

  • classification: 3 → 4. Grok ties for 1st in classification (with 29 others) while Codestral sits mid-pack (rank 31). For routing or tagging pipelines, Grok is more reliable in our tests.

  • persona_consistency: 3 → 5. Grok is tied for 1st (with 36 others), meaning it better maintains character and resists injection in our prompts; Codestral’s 3 places it much lower.

  • multilingual: 4 → 5. Grok ties for 1st (with 34 others); Codestral scores 4. For non-English quality and parity, Grok has the advantage.

Ties (both models score the same):

  • structured_output: 5 → 5 (both tied for 1st). Both models reliably follow JSON/schema constraints in our testing.
  • tool_calling: 5 → 5 (both tied for 1st). Both select functions and sequence tool args accurately on our tests.
  • faithfulness: 5 → 5 (tied for 1st). Both stick to source material and avoid hallucination in our testing.
  • long_context: 5 → 5 (both tied for 1st). Both handle 30K+ token retrieval scenarios equally well in our tests.
  • safety_calibration: 1 → 1 (both rank 32 of 55). Both models score low, erring on the conservative side in our safety calibration tests.
  • agentic_planning: 4 → 4 (both rank 16 of 54). Both are comparable at goal decomposition and recovery.

Practical interpretation: Codestral’s strengths (ties on tool_calling, structured_output, faithfulness, and long_context) align with real coding tasks: schema outputs, FIM, and code-correction workflows will be reliable and low-latency. Grok’s clear wins in strategic_analysis, persona_consistency, creative_problem_solving, and multilingual make it better for complex reasoning assistants, multi-language products, and applications needing consistent personas or creative responses. All benchmark claims above come from our internal 12-test suite and the rankings shown above.
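Even when both models tie at 5/5 on structured_output, production pipelines should validate every response rather than trust schema adherence. A minimal sketch of that pattern, using only the standard library (the field names and helper are hypothetical, not part of our test harness):

```python
import json

# Hypothetical downstream contract: the fields a pipeline expects
# from either model's JSON reply, with their required types.
REQUIRED_FIELDS = {"label": str, "confidence": float}

def validate_reply(raw: str) -> dict:
    """Parse a model reply and check required fields and types.

    Raises ValueError on any violation so the caller can retry the request.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
    return data

# A well-formed reply passes; a malformed one raises and can be retried.
parsed = validate_reply('{"label": "billing", "confidence": 0.92}')
```

The same check works regardless of which model produced the reply, which keeps a Codestral-to-Grok (or reverse) migration low-risk for structured pipelines.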

Benchmark                | Codestral 2508 | Grok 4.20
Faithfulness             | 5/5            | 5/5
Long Context             | 5/5            | 5/5
Multilingual             | 4/5            | 5/5
Tool Calling             | 5/5            | 5/5
Classification           | 3/5            | 4/5
Agentic Planning         | 4/5            | 4/5
Structured Output        | 5/5            | 5/5
Safety Calibration       | 1/5            | 1/5
Strategic Analysis       | 2/5            | 5/5
Persona Consistency      | 3/5            | 5/5
Constrained Rewriting    | 3/5            | 4/5
Creative Problem Solving | 2/5            | 4/5
Summary                  | 0 wins         | 6 wins

Pricing Analysis

Using the listed prices and assuming a 50/50 split of input and output tokens: Codestral 2508 averages $0.60 per 1M total tokens ($0.30/MTok input + $0.90/MTok output → $0.15 + $0.45), while Grok 4.20 averages $4.00 per 1M total tokens ($2.00/MTok input + $6.00/MTok output → $1.00 + $3.00). At 10M tokens/month that is $6.00 (Codestral) vs $40.00 (Grok); at 100M it is $60.00 vs $400.00. The price ratio is 0.15: Codestral costs ~15% of what Grok does. Teams with large-volume, latency-sensitive coding workloads or tight budgets should care most about the gap; teams needing the best strategic reasoning, multilingual support, or persona consistency may accept Grok’s higher price.
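The blended-cost arithmetic above generalizes to any input/output mix. A small sketch, using the prices from the cards above; the 50/50 split is an assumption you should replace with your own traffic profile:

```python
# Prices in dollars per 1M tokens, from the comparison above.
PRICES = {
    "codestral-2508": {"input": 0.30, "output": 0.90},
    "grok-4.20": {"input": 2.00, "output": 6.00},
}

def monthly_cost(model: str, input_tokens: float, output_tokens: float) -> float:
    """Dollar cost for a given monthly token volume."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 10M tokens/month at an assumed 50/50 input/output split:
codestral = monthly_cost("codestral-2508", 5e6, 5e6)  # 6.0
grok = monthly_cost("grok-4.20", 5e6, 5e6)            # 40.0
```

Shifting the mix toward output tokens widens the absolute gap, since both models price output at 3x their input rate; the 0.15 price ratio itself is unchanged at a 50/50 split.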

Real-World Cost Comparison

Task           | Codestral 2508 | Grok 4.20
Chat response  | <$0.001        | $0.0034
Blog post      | $0.0020        | $0.013
Document batch | $0.051         | $0.340
Pipeline run   | $0.510         | $3.40

Bottom Line

Choose Codestral 2508 if you need a cost-efficient, high-throughput coding model that ties Grok on tool calling, structured outputs, faithfulness, and long-context handling; it is ideal for FIM, code correction, test generation, and high-volume deployments where cost matters. Choose Grok 4.20 if your priority is strong strategic reasoning, multilingual parity, persona consistency, or constrained rewriting: Grok wins 6 of 12 benchmarks in our testing and ranks at or near the top for those tasks, but expect roughly 6.7x higher token costs (assuming the 50/50 input/output split used above).

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions