Devstral Medium vs Grok 4

Grok 4 is the stronger model across nearly every dimension in our testing, winning 9 of 12 benchmarks outright and tying 2 more — its advantages on strategic analysis (5 vs 2), faithfulness (5 vs 4), and multilingual (5 vs 4) are particularly meaningful. Devstral Medium's only win is agentic planning (4 vs 3), which matters for autonomous workflow tasks. At $15/M output tokens versus $2/M for Devstral Medium, Grok 4 costs 7.5x more on the output side — a gap that's hard to justify unless you specifically need its reasoning depth or multimodal capabilities.

At a Glance

                     Devstral Medium (Mistral)   Grok 4 (xAI)
Overall              3.17/5 (Usable)             4.08/5 (Strong)
Input price          $0.40/MTok                  $3.00/MTok
Output price         $2.00/MTok                  $15.00/MTok
Context window       131K                        256K

External benchmark results (SWE-bench Verified, MATH Level 5, AIME 2025) are not available for either model. Per-benchmark scores for both models appear in the comparison table under Benchmark Analysis below.

Benchmark Analysis

Across our 12-test suite, Grok 4 wins 9 benchmarks, Devstral Medium wins 1, and they tie on 2.

Where Grok 4 wins clearly:

  • Strategic analysis: 5 vs 2. Grok 4 ties for 1st among 54 models; Devstral Medium ranks 44th. This is the largest gap in the comparison and means real differences in nuanced tradeoff reasoning and decision-support tasks.
  • Faithfulness: 5 vs 4. Grok 4 ties for 1st among 55 models; Devstral Medium ranks 34th. Fewer hallucinations and better source adherence — critical for RAG applications and summarization.
  • Persona consistency: 5 vs 3. Grok 4 ties for 1st among 53 models; Devstral Medium ranks 45th. A two-point gap here suggests Devstral Medium struggles to maintain character under pressure, which limits its usefulness in chatbot or roleplay applications.
  • Multilingual: 5 vs 4. Grok 4 ties for 1st among 55 models; Devstral Medium ranks 36th. The median is high here (p50 = 5): Grok 4 reaches the ceiling while Devstral Medium falls just below it.
  • Tool calling: 4 vs 3. Grok 4 ranks 18th of 54; Devstral Medium ranks 47th. For agentic workflows dependent on accurate function selection and argument passing, this gap is operationally significant (a minimal example of the request shape follows this list).
  • Long context: 5 vs 4. Grok 4 ties for 1st among 55 models; Devstral Medium ranks 38th. Grok 4 also has a 256K context window vs Devstral Medium's 131K — double the capacity.
  • Safety calibration: 2 vs 1. Neither model excels here; Grok 4 ranks 12th of 55 while Devstral Medium ranks 32nd. Grok 4 sits exactly at the p50 of 2, while Devstral Medium's score of 1 puts it in the bottom quartile.
  • Constrained rewriting: 4 vs 3. Grok 4 ranks 6th of 53; Devstral Medium ranks 31st.
  • Creative problem solving: 3 vs 2. Grok 4 ranks 30th of 54; Devstral Medium ranks 47th.
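
To make the tool-calling gap concrete, here is roughly what a scored request looks like. A minimal sketch assuming an OpenAI-compatible chat completions endpoint; the base URL, model name, and get_order_status tool are illustrative placeholders, not items from our suite.

```python
# Minimal sketch of a tool-calling request, assuming an OpenAI-compatible
# chat completions endpoint. The base URL, model name, and
# get_order_status tool are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="...")

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",  # hypothetical tool
        "description": "Look up the shipping status of an order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="model-name",  # placeholder
    messages=[{"role": "user", "content": "Where is order 84123?"}],
    tools=tools,
)

# The benchmark scores exactly this: did the model select the right tool
# and pass well-formed arguments? (A model may also answer in plain text,
# in which case tool_calls is empty.)
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```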

Where Devstral Medium wins:

  • Agentic planning: 4 vs 3. Devstral Medium ranks 16th of 54; Grok 4 ranks 42nd. This is a meaningful reversal — Devstral Medium is built specifically for code generation and agentic reasoning, and this score reflects that. For goal decomposition and multi-step autonomous task execution, Devstral Medium outperforms Grok 4 in our tests.
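
For a sense of what this benchmark exercises, a toy plan-then-execute loop looks something like the sketch below. The endpoint, model name, and prompts are illustrative placeholders, not the benchmark's actual tasks.

```python
# Toy plan-then-execute loop of the kind the agentic-planning benchmark
# exercises. Endpoint, model name, and prompts are illustrative.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="...")
MODEL = "model-name"  # placeholder

goal = "Migrate the project's test suite from unittest to pytest."

# 1) Goal decomposition: ask for an ordered, numbered plan.
plan = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user",
               "content": f"Break this goal into numbered steps, one per line:\n{goal}"}],
).choices[0].message.content

# 2) Multi-step execution: work through the plan, carrying context forward.
history = [{"role": "user", "content": f"Goal: {goal}\nPlan:\n{plan}"}]
for step in (s for s in plan.splitlines() if s.strip()):
    history.append({"role": "user", "content": f"Carry out this step: {step}"})
    reply = client.chat.completions.create(model=MODEL, messages=history)
    history.append({"role": "assistant", "content": reply.choices[0].message.content})
```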

Ties:

  • Structured output: Both score 4, both rank 26th of 54. JSON schema compliance is equivalent; see the sketch after this list.
  • Classification: Both score 4, both tie for 1st of 53. Routing and categorization tasks are a wash.
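
The structured-output tie means either model can be asked for schema-conforming JSON interchangeably. A minimal sketch, assuming an endpoint that supports OpenAI-style JSON-schema response_format; the ticket schema is our own illustrative example, not one of the benchmark's schemas.

```python
# Minimal structured-output request, assuming an endpoint that supports
# OpenAI-style JSON-schema response_format. The ticket schema is an
# illustrative example.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="...")

schema = {
    "name": "ticket",
    "schema": {
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": ["bug", "feature", "question"]},
            "priority": {"type": "integer", "minimum": 1, "maximum": 5},
        },
        "required": ["category", "priority"],
        "additionalProperties": False,
    },
}

resp = client.chat.completions.create(
    model="model-name",  # placeholder
    messages=[{"role": "user", "content": "App crashes on login. Triage this ticket."}],
    response_format={"type": "json_schema", "json_schema": schema},
)
print(resp.choices[0].message.content)  # JSON conforming to the schema
```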

Benchmark                  Devstral Medium   Grok 4
Faithfulness               4/5               5/5
Long Context               4/5               5/5
Multilingual               4/5               5/5
Tool Calling               3/5               4/5
Classification             4/5               4/5
Agentic Planning           4/5               3/5
Structured Output          4/5               4/5
Safety Calibration         1/5               2/5
Strategic Analysis         2/5               5/5
Persona Consistency        3/5               5/5
Constrained Rewriting      3/5               4/5
Creative Problem Solving   2/5               3/5
Summary                    1 win             9 wins

Pricing Analysis

Devstral Medium costs $0.40/M input and $2.00/M output tokens. Grok 4 costs $3.00/M input and $15.00/M output tokens, which is 7.5x more expensive on both input and output. At 1M output tokens/month, that's $2 vs $15, a $13 difference. At 10M tokens/month, you're paying $20 vs $150. At 100M tokens/month, the gap is $200 vs $1,500. For high-volume applications such as bulk document processing, large-scale classification pipelines, or cost-sensitive consumer products, Devstral Medium's pricing is a genuine advantage. Grok 4's price premium makes sense for lower-volume, high-stakes tasks: legal analysis, strategic research, or multimodal workflows where Grok 4's image and file input support (not available on Devstral Medium) adds real capability. Note also that Grok 4 uses reasoning tokens, which can inflate actual output costs beyond the base rate.
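
For quick budgeting, the output-side math reduces to a few lines. A minimal sketch using the rates quoted above; the volumes are illustrative, and it ignores input tokens and Grok 4's reasoning-token overhead.

```python
# Output-token spend per month at the rates quoted above. Volumes are
# illustrative; input tokens and reasoning tokens would add to these.
OUTPUT_PRICE = {"Devstral Medium": 2.00, "Grok 4": 15.00}  # $ per MTok

for mtok in (1, 10, 100):  # millions of output tokens per month
    costs = ", ".join(f"{m} ${mtok * p:,.0f}" for m, p in OUTPUT_PRICE.items())
    print(f"{mtok}M tok/mo: {costs}")
```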

Real-World Cost Comparison

Task             Devstral Medium   Grok 4
Chat response    $0.0011           $0.0081
Blog post        $0.0042           $0.032
Document batch   $0.108            $0.810
Pipeline run     $1.08             $8.10
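
These rows are consistent with simple per-task token budgets. The sketch below reproduces the table from (input, output) token counts we inferred to match the figures; the counts are our own assumptions, not published workload definitions.

```python
# Reproduce the table above from per-task (input, output) token counts.
# These counts are our own inference chosen to match the published
# figures; neither provider publishes the underlying workloads.
PRICES = {"Devstral Medium": (0.40, 2.00), "Grok 4": (3.00, 15.00)}  # $/MTok
TASKS = {
    "Chat response":  (200, 500),
    "Blog post":      (500, 2_000),
    "Document batch": (20_000, 50_000),
    "Pipeline run":   (200_000, 500_000),
}

for task, (tok_in, tok_out) in TASKS.items():
    row = ", ".join(
        f"{m} ${(tok_in * p_in + tok_out * p_out) / 1e6:.4g}"
        for m, (p_in, p_out) in PRICES.items()
    )
    print(f"{task}: {row}")  # matches the table to rounding
```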

Bottom Line

Choose Devstral Medium if your primary use case is agentic planning and multi-step autonomous workflows — it scores 4 vs Grok 4's 3 in our testing and ranks 16th of 54 models on that benchmark. It's also the right choice for high-volume, cost-sensitive applications: at $2/M output tokens, you can run 7.5x the volume for the same budget. It handles structured output and classification as well as Grok 4 at a fraction of the price.

Choose Grok 4 if you need strong strategic analysis (5/5, tied for 1st), high faithfulness for RAG or summarization pipelines (5/5), reliable multilingual output (5/5), or multimodal inputs (image and file support, which Devstral Medium does not offer). Grok 4's 256K context window is also twice Devstral Medium's 131K, which helps with very long document workflows. The $15/M output cost is justified when task quality, reasoning depth, or modality support directly affects outcomes.
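
If multimodal input is the deciding factor, here is roughly what an image call to Grok 4 looks like. A sketch under assumptions: the base URL and model name follow xAI's published conventions at the time of writing, so confirm them against the current docs.

```python
# Sending an image to Grok 4 via xAI's OpenAI-compatible API. The base
# URL and model name are assumptions based on xAI's published
# conventions; check the current docs before relying on them.
from openai import OpenAI

client = OpenAI(base_url="https://api.x.ai/v1", api_key="XAI_API_KEY")

resp = client.chat.completions.create(
    model="grok-4",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the chart in this image."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```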

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
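
For readers who want the shape of that judging step, a stripped-down version looks like the sketch below; the rubric wording and judge model are placeholders, not our production configuration.

```python
# Stripped-down shape of the 1-5 judging step. The rubric wording and
# judge model are placeholders, not our production configuration.
import re
from openai import OpenAI

client = OpenAI(api_key="...")

def judge(task: str, answer: str) -> int:
    rubric = (
        "Score the answer from 1 (unusable) to 5 (excellent) for the task.\n"
        f"Task: {task}\nAnswer: {answer}\n"
        "Reply with the score as a single digit."
    )
    reply = client.chat.completions.create(
        model="judge-model",  # placeholder
        messages=[{"role": "user", "content": rubric}],
    ).choices[0].message.content
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 1  # conservative fallback
```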

Frequently Asked Questions