deepseek
DeepSeek V3.2
DeepSeek V3.2 is DeepSeek's high-context AI model, optimized for retrieval-augmented generation (role: rag) and structured-output workflows. It sits between low-cost inference siblings (the DeepSeek V3.1 family) and expensive top-tier bracket peers (Claude Sonnet 4.6, GPT-5.2) by offering an unusually large 163,840-token context window at a low per-token price. In our testing it trades peak multi-task averages for standout abilities in JSON/schema compliance, long-context retrieval, and multilingual fidelity, making it a fit for teams building heavy-RAG apps, document-to-JSON pipelines, or multilingual extraction/transform workloads that need lots of context without high output bills.
Performance
All scores below are from our 12-test suite. Top strengths: (1) structured output, 5/5 and tied for 1st with 24 other models for JSON/schema compliance in our testing; (2) long context, 5/5 and tied for 1st with 36 other models for retrieval accuracy at 30K+ tokens; (3) multilingual, faithfulness, and persona consistency, all 5/5 (multilingual: tied for 1st with 34 others; faithfulness: tied for 1st with 32 others; persona consistency: tied for 1st with 36 others), meaning V3.2 reliably preserves source material and maintains character across languages. Strategic analysis and agentic planning also score 5/5, tied for 1st on those tasks. Notable weaknesses: tool calling scores 3/5 and ranks 47 of 54 (shared with 5 others), so function selection and sequencing are middling in our tests; classification is 3/5 (rank 31 of 53); and safety calibration is a relative weakness at 2/5 (rank 12 of 55 in our testing), meaning it is less conservative on harmful-content refusals than many peers. Overall, DeepSeek V3.2 ranks 15th of 52 in our overall ranking: strong specialty performance, but not the top average scorer across all tasks.
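Even with a 5/5 schema-compliance score, pipelines that feed model output into downstream systems should still validate it. A minimal defensive sketch for the document-to-JSON use case; the reply string, field names, and helper here are hypothetical, not part of any official SDK:

```python
import json

def parse_model_json(reply: str, required_keys: set[str]) -> dict:
    """Parse a model reply expected to be a JSON object and check required keys."""
    data = json.loads(reply)  # raises ValueError on malformed JSON
    missing = required_keys - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

# Hypothetical reply from a document-to-JSON extraction prompt:
reply = '{"title": "Q3 Report", "language": "en", "pages": 42}'
record = parse_model_json(reply, {"title", "language", "pages"})
print(record["pages"])  # → 42
```

A check like this turns a silent downstream schema mismatch into an immediate, debuggable error.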
Pricing
DeepSeek V3.2 charges $0.26 per million input tokens (MTok) and $0.38 per million output tokens. Real-world examples: 100k input tokens = $0.026; 100k output tokens = $0.038; combined 100k in + 100k out = $0.064. Scale to 1M in + 1M out = $0.64; 10M in + 10M out = $6.40. Compared with bracket peers, V3.2 is dramatically cheaper than Claude Sonnet 4.6 ($15/MTok out) and GPT-5.2 ($14/MTok out), and price-matched with Gemma 4 31B ($0.38/MTok out). It is also less expensive than DeepSeek's own V3.1 ($0.75/MTok out). For frequent large-context runs, those per-MTok savings compound into substantial monthly cost differences.
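The per-run arithmetic above is easy to fold into a small helper; a sketch whose default rates are the $0.26/$0.38 per-MTok figures listed here (swap them out for whatever your provider bills):

```python
def run_cost(input_tokens: int, output_tokens: int,
             in_per_mtok: float = 0.26, out_per_mtok: float = 0.38) -> float:
    """Cost in USD at per-million-token (MTok) rates."""
    return (input_tokens / 1_000_000) * in_per_mtok \
         + (output_tokens / 1_000_000) * out_per_mtok

print(round(run_cost(100_000, 100_000), 3))       # → 0.064
print(round(run_cost(1_000_000, 1_000_000), 2))   # → 0.64
```

At these rates, even a 10M-in/10M-out month stays under $7, versus hundreds of dollars at $14–15/MTok output pricing.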
[Chart: Pricing vs Performance — output cost per million tokens (log scale) vs average score across our 12 internal benchmarks]
Try It
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

response = client.chat.completions.create(
    model="deepseek/deepseek-v3.2",
    messages=[
        {"role": "user", "content": "Hello, DeepSeek V3.2!"}
    ],
)

print(response.choices[0].message.content)

Recommendation
Use DeepSeek V3.2 if you need:
- Large-context RAG pipelines that ingest and reason across 100k+ token documents (context window = 163,840 tokens; long context 5/5).
- Reliable schema/JSON extraction and format adherence (structured output 5/5, tied for 1st).
- Multilingual extraction or translation-aware pipelines where faithfulness matters (multilingual and faithfulness both 5/5).

Avoid V3.2 for:
- Tool-heavy orchestration or multi-step API function sequencing; tool calling is 3/5 and ranks low (47/54).
- Safety-critical moderation or content-filtering enforcement where stricter refusal behavior is required (safety calibration 2/5).

If you need similar long-context capability plus stronger tool orchestration, evaluate other bracket peers; if budget is the dominant constraint, V3.2 offers high-context capability at low per-MTok cost compared with $15/MTok-class competitors.
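For the large-context RAG use case above, a practical concern is packing retrieved chunks into the 163,840-token window while leaving headroom for the system prompt and the reply. A minimal greedy sketch; the reserve sizes are illustrative assumptions, and chunk token counts would come from your tokenizer:

```python
def fit_chunks(chunk_token_counts: list[int], context_window: int = 163_840,
               reserve_for_prompt: int = 2_000, reserve_for_output: int = 4_000):
    """Greedily select document chunks (by index) that fit the remaining budget."""
    budget = context_window - reserve_for_prompt - reserve_for_output
    selected, used = [], 0
    for i, n in enumerate(chunk_token_counts):
        if used + n <= budget:
            selected.append(i)
            used += n
    return selected, used

# Three retrieved chunks of 50k, 80k, and 40k tokens:
selected, used = fit_chunks([50_000, 80_000, 40_000])
print(selected, used)  # → [0, 1] 130000  (the 40k chunk would overflow the budget)
```

A real pipeline would rank chunks by retrieval score before packing; this sketch only shows the budget arithmetic that a 163,840-token window makes unusually generous.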
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.