deepseek

DeepSeek V3.1

DeepSeek V3.1 is a large hybrid reasoning model (671B total parameters, 37B active) that supports both thinking and non-thinking modes via prompt templates. At $0.15 input / $0.75 output per million tokens, it is one of the most affordable production models in our tested field. In our 12-benchmark suite it ranked 31st of 52 models with an average score of 3.92, yet posted top-tier (5/5) results on creative problem solving, faithfulness, structured output, long context, and persona consistency. Its 32,768-token context window is small compared to peers. The model supports tool calling, structured outputs, and reasoning parameters.

Performance

DeepSeek V3.1's top benchmark scores in our testing include: creative problem solving (5/5, tied for 1st with 7 other models out of 54 tested), faithfulness (5/5, tied for 1st with 32 other models out of 55 tested), structured output (5/5, tied for 1st with 24 other models out of 54 tested), persona consistency (5/5, tied for 1st with 36 other models out of 53 tested), and long context (5/5, tied for 1st with 36 other models out of 55 tested). Notable weaknesses: tool calling scored 3/5 (rank 47 of 54) and safety calibration scored 1/5 (rank 32 of 55). The tool calling weakness is significant for agentic workflows that rely on reliable function invocation. Overall rank: 31 out of 52 tested models.

Pricing

DeepSeek V3.1 costs $0.15 per million input tokens and $0.75 per million output tokens, among the lowest in the tested model pool (output range: $0.10–$25). At 1 million output tokens per month that is $0.75; at 10 million, $7.50. Within the DeepSeek lineup, it undercuts R1 ($2.50/MTok output, 4.0 avg) while scoring similarly, and is significantly cheaper than R1 0528 ($2.15/MTok output, 4.5 avg). It is priced nearly identically to DeepSeek V3.1 Terminus ($0.79/MTok output) but outscores it on our benchmarks (3.92 vs. 3.75 avg). For developers prioritizing cost per unit of quality, DeepSeek V3.1 is one of the strongest value options in the tested field.
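The per-month figures above are simple linear arithmetic on the published rates. A minimal sketch (rates taken from this page; `monthly_cost` is our own helper name):

```python
# Rates from the pricing section above.
INPUT_PER_MTOK = 0.15   # USD per million input tokens
OUTPUT_PER_MTOK = 0.75  # USD per million output tokens

def monthly_cost(input_tokens: int, output_tokens: int) -> float:
    """Blended USD cost for one month of usage at DeepSeek V3.1 rates."""
    return (input_tokens / 1e6) * INPUT_PER_MTOK + (output_tokens / 1e6) * OUTPUT_PER_MTOK

print(monthly_cost(0, 1_000_000))   # 0.75
print(monthly_cost(0, 10_000_000))  # 7.5
```

Input tokens add to this at $0.15/MTok, so a workload that reads far more than it writes can still be dominated by input cost.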


Overall
3.92/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
5/5
Multilingual
4/5
Tool Calling
3/5
Classification
3/5
Agentic Planning
4/5
Structured Output
5/5
Safety Calibration
1/5
Strategic Analysis
4/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
5/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.150/MTok

Output

$0.750/MTok

Context Window

33K


Real-World Costs

Chat response: <$0.001
Blog post: $0.0016
Document batch: $0.041
Pipeline run: $0.405

Pricing vs Performance

Output cost per million tokens (log scale) vs average score across our 12 internal benchmarks


Try It

from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint; substitute your own key.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

response = client.chat.completions.create(
    model="deepseek/deepseek-chat-v3.1",
    messages=[
        {"role": "user", "content": "Hello, DeepSeek V3.1!"}
    ],
)

print(response.choices[0].message.content)
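Since structured output is one of the model's 5/5 benchmarks, the snippet above extends naturally to JSON mode. A minimal sketch, assuming OpenRouter passes the OpenAI-style `response_format` parameter through to this model (`build_request` is a hypothetical helper, not part of any SDK):

```python
import json

def build_request(prompt: str) -> dict:
    """Assemble chat-completion kwargs that request a JSON-object reply."""
    return {
        "model": "deepseek/deepseek-chat-v3.1",
        "messages": [
            # JSON mode typically requires mentioning JSON in the prompt.
            {"role": "user", "content": f"{prompt} Respond with a JSON object."}
        ],
        # Assumption: the provider honors OpenAI-style JSON mode.
        "response_format": {"type": "json_object"},
    }

if __name__ == "__main__":
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="YOUR_OPENROUTER_KEY",
    )
    response = client.chat.completions.create(
        **build_request("List three strengths of DeepSeek V3.1.")
    )
    print(json.loads(response.choices[0].message.content))
```

Even with JSON mode enabled, it is worth validating the parsed object against your expected schema before using it downstream.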

Recommendation

DeepSeek V3.1 is an excellent value pick for developers and teams who need strong creative, faithful, and structured output at the lowest price point in the field. At $0.75/MTok output with 5/5 on faithfulness, creative problem solving, structured output, and persona consistency, it competes on those dimensions with models costing 3–10x more. Avoid it for agentic workflows that depend on tool calling: it scored 3/5 (rank 47 of 54), so function-invocation reliability is below the field median. The 32,768-token context window also limits it for long-document tasks; models with larger contexts are better suited to document-level retrieval work.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions