Claude Haiku 4.5 vs DeepSeek V3.1 Terminus for Structured Output
Winner: DeepSeek V3.1 Terminus. In our testing on the Structured Output task (JSON schema compliance and format adherence), DeepSeek scores 5 to Claude Haiku 4.5's 4, a clear one-point advantage. No external benchmark covers this task, so the verdict rests on our internal structured-output scores and task ranks: DeepSeek is tied for 1st (rank 1 of 52) while Claude Haiku ranks 26th of 52. That said, Claude Haiku is stronger on tool calling (5 vs 3) and faithfulness (5 vs 3), which matter when structured outputs must be generated alongside tool use or under strict source fidelity.
Pricing
Claude Haiku 4.5 (Anthropic): input $1.00/MTok, output $5.00/MTok
DeepSeek V3.1 Terminus (DeepSeek): input $0.21/MTok, output $0.79/MTok
Source: modelpicker.net
Task Analysis
Structured Output demands strict JSON schema compliance, deterministic field ordering where required, predictable types (strings, numbers, arrays), and reliable error handling when inputs violate schema constraints. Our structured-output test measures JSON schema compliance and format adherence. With no external benchmark available, the primary signal is our internal task score: DeepSeek V3.1 Terminus = 5, Claude Haiku 4.5 = 4. DeepSeek's 5 reflects superior raw schema adherence and format reliability in our runs; Claude Haiku's 4 reflects very good compliance with occasional formatting or minor type issues.

Supporting scores matter for real integrations: Claude Haiku scores 5 on both tool_calling and faithfulness, which helps when the model must call functions and reflect exact source values inside structured outputs. DeepSeek scores 3 on both, so while it is the better pure schema-compliance engine in our tests, it may require extra orchestration if your pipeline depends on function calls or source-fidelity checks.

Also note modality and limits: Claude Haiku supports text+image->text with a 200,000-token context window and a 64,000-token maximum output; DeepSeek is text->text with a 163,840-token window. Both expose structured_outputs as a supported parameter.
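The kind of schema check described above can be sketched with a small stdlib-only validator. The payments-style field names are illustrative assumptions, not part of either model's API; a production pipeline would more likely use a full JSON Schema library.

```python
import json

# Illustrative schema: each field maps to the tuple of accepted Python types.
SCHEMA = {"transaction_id": (str,), "amount": (int, float), "currency": (str,)}

def validate(raw: str) -> list[str]:
    """Parse model output and report type/shape violations against SCHEMA."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = [f"missing field: {k}" for k in SCHEMA if k not in data]
    errors += [
        f"{k}: expected {' or '.join(t.__name__ for t in types)}, "
        f"got {type(data[k]).__name__}"
        for k, types in SCHEMA.items()
        if k in data and not isinstance(data[k], types)
    ]
    errors += [f"unexpected field: {k}" for k in data if k not in SCHEMA]
    return errors

# The string-typed amount noted above is exactly what this catches:
print(validate('{"transaction_id": "tx_1", "amount": "42.00", "currency": "USD"}'))
# -> ['amount: expected int or float, got str']
```

A check like this is what "one extra validation pass" amounts to in practice: cheap to run on every response, and it turns silent type drift into an actionable error list.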
Practical Examples
1) API response JSON for a payments system: DeepSeek V3.1 Terminus (structured_output 5) produced perfectly schema-compliant JSON in our tests, minimizing downstream parsing errors; Claude Haiku 4.5 (4) required one extra validation pass to catch a type mismatch (string vs number).
2) Tool-enriched structured output (run a tool, embed its results in JSON): Claude Haiku 4.5 excels (tool_calling 5, faithfulness 5); in our testing it reliably populated tool-returned IDs and preserved source values, while DeepSeek (tool_calling 3, faithfulness 3) sometimes mis-ordered tool arguments or needed additional prompt scaffolding.
3) Multimodal annotated outputs (image->JSON): Claude Haiku 4.5 supports text+image->text and in our runs handled image annotations while keeping JSON mostly compliant; DeepSeek is text->text only, so it cannot directly ingest images.
4) High-volume, low-cost throughput: DeepSeek's output is far cheaper ($0.79/MTok vs Claude Haiku's $5.00/MTok), making it the practical choice for large-scale schema generation where tool integration and source fidelity are secondary.
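The "extra validation pass" and "additional prompt scaffolding" mentioned above usually take the form of a validate-and-retry loop. This is a minimal sketch, assuming hypothetical `generate` and `validate` callables that wrap your model API and schema checker; neither is a real SDK function.

```python
import json

def generate_with_retry(generate, validate, max_attempts=3):
    """Call a model, validate its JSON output, and retry with error feedback.

    `generate(feedback)` returns the model's raw text (feedback is None on the
    first call); `validate(data)` returns a list of problem descriptions.
    """
    feedback = None
    for _ in range(max_attempts):
        raw = generate(feedback)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as e:
            feedback = f"Output was not valid JSON: {e}. Return only JSON."
            continue
        problems = validate(data)
        if not problems:
            return data
        feedback = "Fix these schema violations: " + "; ".join(problems)
    raise ValueError(f"no schema-compliant output after {max_attempts} attempts")

# Simulated model that returns a string-typed amount first, then corrects itself.
responses = iter(['{"amount": "42"}', '{"amount": 42}'])
result = generate_with_retry(
    generate=lambda fb: next(responses),
    validate=lambda d: [] if isinstance(d.get("amount"), (int, float))
                       else ["amount must be a number"],
)
print(result)  # {'amount': 42}
```

Feeding the violation list back into the next prompt is the cheap form of the orchestration DeepSeek may need when tool calls or source fidelity are involved.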
Bottom Line
For Structured Output, choose DeepSeek V3.1 Terminus if you need the most reliable JSON schema compliance and the lowest output cost (task score 5 vs 4; output cost $0.79/MTok vs Haiku's $5.00/MTok). Choose Claude Haiku 4.5 if your structured outputs must be generated alongside tool calls or strict source-faithful values (tool_calling 5, faithfulness 5), or if you need multimodal (text+image->text) input and a larger maximum output token allowance.
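The cost gap compounds quickly at scale. A back-of-envelope calculation at the listed output rates (the batch size and tokens-per-record figures are illustrative assumptions):

```python
# Output-token rates from the pricing above, in dollars per million tokens.
RATES = {"DeepSeek V3.1 Terminus": 0.79, "Claude Haiku 4.5": 5.00}

def output_cost(model: str, output_tokens: int) -> float:
    """Dollar cost of generating `output_tokens` output tokens with `model`."""
    return RATES[model] * output_tokens / 1_000_000

# Assumed workload: 10M JSON records at ~200 output tokens each.
tokens = 10_000_000 * 200
for model in RATES:
    print(f"{model}: ${output_cost(model, tokens):,.2f}")
```

At that volume the roughly 6x rate difference is the difference between a four-figure and a five-figure bill, which is why raw schema compliance per dollar favors DeepSeek when the other capabilities are not needed.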
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.