Devstral Medium vs Mistral Small 3.2 24B

For most developer and API use cases, Mistral Small 3.2 24B is the practical winner: it beats Devstral Medium on tool calling and constrained rewriting while costing roughly one-tenth as much per token. Devstral Medium wins only on classification in our tests and may still be chosen when that single metric matters, but it comes at substantially higher input/output cost.

Mistral

Devstral Medium

Overall
3.17/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
3/5
Classification
4/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.400/MTok

Output

$2.00/MTok

Context Window: 131K

modelpicker.net

Mistral

Mistral Small 3.2 24B

Overall
3.25/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Tool Calling
4/5
Classification
3/5
Agentic Planning
4/5
Structured Output
4/5
Safety Calibration
1/5
Strategic Analysis
2/5
Persona Consistency
3/5
Constrained Rewriting
4/5
Creative Problem Solving
2/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.075/MTok

Output

$0.200/MTok

Context Window: 128K


Benchmark Analysis

Across our 12-test suite, the comparison breaks down as follows (scores are our 1–5 proxies).

Devstral Medium (A) wins classification (A=4 vs B=3); in our testing it is tied for 1st on classification with 29 other models out of 53 tested, which matters for routing and categorization tasks.

Mistral Small 3.2 24B (B) wins constrained rewriting (A=3 vs B=4) and tool calling (A=3 vs B=4). For tool calling, B ranks 18 of 54 (many models share that score) versus A's 47 of 54, a meaningful advantage for function selection, argument accuracy, and call sequencing. On constrained rewriting, B ranks 6 of 53 versus A's 31 of 53, so B is substantially better when you need tight character limits or strict compression.

The remaining tests tie: structured output (4/4, both rank 26 of 54), strategic analysis (2/2, both rank ~44), creative problem solving (2/2), faithfulness (4/4), long context (4/4), safety calibration (1/1), persona consistency (3/3), agentic planning (4/4), and multilingual (4/4). Long-context parity (both score 4) means neither model has a distinct edge for retrieval at 30K+ tokens in our suite. Safety calibration is low for both (1/5), so both models can be permissive on harmful requests in our tests.

In short: B is better for function calling and tight-format rewriting, A is slightly better at classification, and most other capabilities are effectively tied in our testing.

Benchmark | Devstral Medium | Mistral Small 3.2 24B
Faithfulness | 4/5 | 4/5
Long Context | 4/5 | 4/5
Multilingual | 4/5 | 4/5
Tool Calling | 3/5 | 4/5
Classification | 4/5 | 3/5
Agentic Planning | 4/5 | 4/5
Structured Output | 4/5 | 4/5
Safety Calibration | 1/5 | 1/5
Strategic Analysis | 2/5 | 2/5
Persona Consistency | 3/5 | 3/5
Constrained Rewriting | 3/5 | 4/5
Creative Problem Solving | 2/5 | 2/5
Summary | 1 win | 2 wins

Pricing Analysis

Costs are materially different. Per the listed rates, Devstral Medium charges $0.40 input / $2.00 output per million tokens; Mistral Small 3.2 24B charges $0.075 input / $0.20 output per million. Assuming a 50/50 input/output token split: at 1M tokens/mo, Devstral costs $1.20 (input $0.20 + output $1.00) vs Mistral's $0.14 (input $0.04 + output $0.10). At 10M tokens/mo: Devstral $12.00 vs Mistral $1.38. At 100M tokens/mo: Devstral $120.00 vs Mistral $13.75. The gap is roughly 9× at a 50/50 split (10× on output tokens alone), so teams with sustained high-volume inference, such as startups, SaaS products, or heavy batch workflows, should prefer Mistral Small 3.2 24B unless Devstral's specific classification edge justifies the cost.
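The arithmetic above can be sketched in a few lines. This is a rough estimator using the per-MTok rates from this comparison (not official pricing pages) and an assumed 50/50 input/output split; adjust `input_share` to match your real traffic mix.

```python
# Rough monthly-cost estimator. Rates are the $/MTok figures from this
# comparison (assumptions, not official pricing): (input, output).
RATES = {
    "Devstral Medium": (0.400, 2.00),
    "Mistral Small 3.2 24B": (0.075, 0.200),
}

def monthly_cost(model: str, tokens_per_month: float, input_share: float = 0.5) -> float:
    """Estimate monthly spend in dollars for a given total token volume."""
    in_rate, out_rate = RATES[model]
    millions = tokens_per_month / 1_000_000
    return millions * (input_share * in_rate + (1 - input_share) * out_rate)

for volume in (1_000_000, 10_000_000, 100_000_000):
    dev = monthly_cost("Devstral Medium", volume)
    small = monthly_cost("Mistral Small 3.2 24B", volume)
    print(f"{volume / 1_000_000:>5.0f}M tokens/mo: Devstral ${dev:,.2f} vs Small ${small:,.2f}")
```

At a 50/50 split this reproduces the figures above ($1.20 vs ~$0.14 per million tokens); note that an output-heavy workload pushes the ratio toward 10×, while an input-heavy one pulls it toward 5×.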

Real-World Cost Comparison

Task | Devstral Medium | Mistral Small 3.2 24B
Chat response | $0.0011 | <$0.001
Blog post | $0.0042 | <$0.001
Document batch | $0.108 | $0.011
Pipeline run | $1.08 | $0.115

Bottom Line

Choose Mistral Small 3.2 24B if: you need low-cost production inference, accurate function/tool calling, or reliable constrained rewriting (it wins tool calling and constrained rewriting and is far cheaper at $0.075 input / $0.20 output per million tokens). Choose Devstral Medium if: your product prioritizes classification quality (it scores 4 vs 3 and is tied for 1st on classification in our tests) and you can justify roughly 10× higher input/output spend for that single advantage. If you need a balanced generalist at lower cost, pick Mistral Small 3.2 24B; if classification routing is mission-critical and worth the expense, pick Devstral Medium.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions