Question 1

How big is the creative gap between these models in our tests?

Accepted Answer

Claude Haiku 4.5 scores 4/5 on Creative Problem Solving in our testing, while Devstral Medium scores 2/5. That 2-point gap is also reflected in the task ranks: Haiku ranks 9th of 52 and Devstral ranks 46th of 52.

Question 2

Will Devstral’s lower cost make it the better choice for most teams?

Accepted Answer

Not for tasks where ideation quality matters. Devstral Medium is cheaper ($0.40 input / $2.00 output per million tokens) and suits budget-constrained prototyping or code-linked workflows, but Haiku’s higher creative and strategic scores justify its higher cost when you need non-obvious, feasible ideas and reliable multi-step execution.
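To make the pricing concrete, here is a minimal Python sketch of the per-request arithmetic for Devstral Medium, using only the rates quoted above. The function name `estimate_cost` and the token counts are hypothetical, chosen for illustration; no Haiku rates appear in this answer, so none are assumed here.

```python
# Devstral Medium rates quoted above, in USD per million tokens.
DEVSTRAL_INPUT_PER_MTOK = 0.40
DEVSTRAL_OUTPUT_PER_MTOK = 2.00


def estimate_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Return the USD cost of one request, given per-million-token rates."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical brainstorming request: 50,000 tokens in, 10,000 tokens out.
cost = estimate_cost(
    50_000, 10_000, DEVSTRAL_INPUT_PER_MTOK, DEVSTRAL_OUTPUT_PER_MTOK
)
print(f"Estimated Devstral Medium cost: ${cost:.2f}")
```

The same function works for any other model once you substitute its published rates, which makes it easy to weigh a price difference against the score gap for your own workload.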

Question 3

How do tool-calling and strategic analysis affect creative outcomes?

Accepted Answer

In our testing, tool_calling and strategic_analysis correlate with idea feasibility and execution. Haiku’s 5/5 tool_calling and 5/5 strategic_analysis help produce ideas that are both novel and implementable; Devstral’s lower scores (tool_calling 3/5, strategic_analysis 2/5) make its suggestions less likely to include accurate function sequencing or thorough tradeoff reasoning.

Question 4

Does context window matter for Creative Problem Solving?

Accepted Answer

Yes. Longer context helps keep constraints, prior research, and iteration history available. Claude Haiku 4.5 has a 200,000-token window and scores 5/5 on long_context in our tests; Devstral Medium has a 131,072-token window and scores 4/5 on long_context, which is adequate but less robust for very long, evolving briefs.
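As a rough illustration of the window sizes above, the sketch below checks whether a brief of a given size fits in each model's context window. The helper `fits_in_window`, the reserved-output allowance, and the 150,000-token brief are hypothetical; real token counts depend on each model's tokenizer.

```python
# Context window sizes quoted above, in tokens.
CONTEXT_WINDOWS = {
    "Claude Haiku 4.5": 200_000,
    "Devstral Medium": 131_072,
}


def fits_in_window(model, prompt_tokens, reserved_output_tokens=0):
    """Return True if the prompt plus reserved output fits the model's window."""
    return prompt_tokens + reserved_output_tokens <= CONTEXT_WINDOWS[model]


# Hypothetical evolving brief: 150,000 tokens of constraints and history,
# with 4,000 tokens reserved for the model's reply.
results = {
    model: fits_in_window(model, 150_000, reserved_output_tokens=4_000)
    for model in CONTEXT_WINDOWS
}
for model, ok in results.items():
    print(f"{model}: {'fits' if ok else 'needs trimming or summarizing'}")
```

A brief of that size fits the 200,000-token window but exceeds the 131,072-token one, which is the situation where iteration history would have to be trimmed or summarized first.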

Question 5

Are there scenarios where Devstral is preferable despite lower creative scores?

Accepted Answer

Yes: when cost constraints are strict, when outputs must be quickly converted into code, or when you prioritize agentic planning over creative depth. Devstral Medium's description highlights code and agentic reasoning strengths, and it scores 4/5 on agentic_planning in our tests.

Claude Haiku 4.5 vs Devstral Medium for Creative Problem Solving
