Claude Haiku 4.5 vs Devstral Medium for Coding
Winner: Claude Haiku 4.5. In our testing Claude Haiku 4.5 outperforms Devstral Medium on the Coding task primarily because it scores higher on tool_calling (5 vs 3), faithfulness (5 vs 4), long_context (5 vs 4), and agentic_planning (5 vs 4). Both models tie on structured_output (4), but Haiku's superior tool selection, argument accuracy, and larger 200k-token context window make it the stronger LLM for code generation, debugging, and multi-file review. Note: the payload lists SWE-bench Verified as an external benchmark, but both models' scores are null, so this verdict rests on our internal benchmarks and model metadata. Expect Haiku to cost more (output $5 vs $2 per MTok; roughly 2.5x the price overall).
Claude Haiku 4.5 (Anthropic)
Pricing: $1.00/MTok input, $5.00/MTok output
Devstral Medium (Mistral)
Pricing: $0.40/MTok input, $2.00/MTok output
Task Analysis
What Coding demands: reliable tool calling (function selection and correct arguments), strict structured output (JSON/patch formats), faithfulness to source code, long-context handling for multi-file projects, agentic planning for decomposition and debugging, and moderate safety calibration to avoid unsafe code. On our task subset (structured_output and tool_calling are the explicit tests), tool_calling is the primary functional measure: it tests function selection, argument accuracy, and sequencing.

An external SWE-bench Verified score would be the ideal primary signal, but the payload lists SWE-bench Verified with null scores for both models, so we rely on our internal 1–5 proxies. In our testing Haiku scores 5 on tool_calling and 4 on structured_output; Devstral scores 3 on tool_calling and 4 on structured_output.

Supporting strengths for Haiku: faithfulness=5 (less likely to hallucinate code), long_context=5 (200k-token window vs Devstral's 131k), and agentic_planning=5 (better decomposition and recovery). Devstral's advantages are cost efficiency ($0.40 vs $1.00 input, $2.00 vs $5.00 output per MTok) and competitive structured_output support. These internal metrics explain why Haiku is better at complex, tool-integrated coding workflows while Devstral remains attractive for lower-cost code generation.
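To make the tool_calling measure concrete, here is a minimal sketch of the kind of check such a benchmark item implies. The `run_linter` schema and the scoring function are hypothetical illustrations (not our actual test harness): the model must pick the right function and supply well-typed arguments.

```python
import json

# Hypothetical tool schema, in the style of common function-calling APIs.
TOOL_SCHEMA = {
    "name": "run_linter",
    "parameters": {
        "path": "string",   # file to lint
        "fix": "boolean",   # apply autofixes in place
    },
}

def check_tool_call(raw_model_output: str) -> bool:
    """Score one tool_calling item: right function, well-typed arguments."""
    call = json.loads(raw_model_output)
    if call.get("name") != TOOL_SCHEMA["name"]:
        return False  # wrong function selected
    args = call.get("arguments", {})
    expected = {"path": str, "fix": bool}
    return all(isinstance(args.get(k), t) for k, t in expected.items())

# A correct call passes; a call with a mistyped argument fails.
good = '{"name": "run_linter", "arguments": {"path": "src/app.py", "fix": false}}'
bad = '{"name": "run_linter", "arguments": {"path": "src/app.py", "fix": "yes"}}'
```

A 5-vs-3 gap on this axis means Haiku more consistently produces calls like `good` above across multi-step sequences, not just single invocations.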
Practical Examples
Where Claude Haiku 4.5 shines (based on score gaps):
- Multi-file refactor or large repo code review: Haiku’s long_context=5 (200k tokens) vs Devstral’s 4 (131k) keeps more context in-session for consistent edits.
- Tool-driven workflows (CI, linters, code-execution, API calls): Haiku tool_calling=5 vs Devstral=3—Haiku is measurably better at selecting and sequencing functions and providing accurate arguments.
- Debugging and root-cause analysis: faithfulness 5 vs 4 and agentic_planning 5 vs 4 mean Haiku gives more accurate, stepwise debugging plans in our tests.
Where Devstral Medium shines:
- Cost-constrained batch generation or prototyping: Devstral's input/output prices ($0.40 / $2.00 per MTok) are materially lower than Haiku's ($1.00 / $5.00), so you can generate more code for the same budget.
- Simple codegen and format adherence: both models tie on structured_output=4, so for single-file templates, snippets, or strict JSON patches Devstral provides equivalent format reliability at lower cost.
Other practical notes grounded in metadata: Haiku accepts text+image->text inputs (useful if you supply screenshots or diagrams), while Devstral is text->text only.
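The cost gap can be made concrete with a quick back-of-the-envelope calculation from the listed per-MTok prices. The token counts below are illustrative, not measurements:

```python
# Listed prices in dollars per million tokens (MTok).
PRICES = {
    "haiku": {"input": 1.00, "output": 5.00},
    "devstral": {"input": 0.40, "output": 2.00},
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one job at the listed per-MTok rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Illustrative batch: 2M input tokens, 1M output tokens.
haiku = job_cost("haiku", 2_000_000, 1_000_000)        # 2*$1.00 + 1*$5.00 = $7.00
devstral = job_cost("devstral", 2_000_000, 1_000_000)  # 2*$0.40 + 1*$2.00 = $2.80
```

On this mix Haiku costs 2.5x more ($7.00 vs $2.80), matching the ~2.5x overall price ratio noted in the verdict.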
Bottom Line
For Coding, choose Claude Haiku 4.5 if you need robust tool calling, high faithfulness, and very large-context sessions (large refactors, tool-integrated CI workflows, complex debugging). Choose Devstral Medium if budget is the primary constraint and you need solid structured output for single-file code generation or mass template production at lower cost.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.
For coding tasks, we supplement our benchmark suite with SWE-bench scores from Epoch AI, an independent research organization.