Claude Sonnet 4.6 vs Gemini 3.1 Flash Lite Preview

In our testing, Claude Sonnet 4.6 is the better pick for complex developer and long-context workflows: it wins 5 of our 12 benchmarks, including tool calling (5 vs 4) and long context (5 vs 4). Gemini 3.1 Flash Lite Preview trades some quality for a much lower price ($0.25/MTok input, $1.50/MTok output) and wins constrained rewriting and structured output.

Anthropic

Claude Sonnet 4.6

Overall
4.67/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 5/5
Multilingual: 5/5
Tool Calling: 5/5
Classification: 4/5
Agentic Planning: 5/5
Structured Output: 4/5
Safety Calibration: 5/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 3/5
Creative Problem Solving: 5/5

External Benchmarks

SWE-bench Verified: 75.2%
MATH Level 5: N/A
AIME 2025: 85.8%

Pricing

Input: $3.00/MTok

Output: $15.00/MTok

Context Window: 1,000K tokens

modelpicker.net

Google

Gemini 3.1 Flash Lite Preview

Overall
4.42/5 (Strong)

Benchmark Scores

Faithfulness: 5/5
Long Context: 4/5
Multilingual: 5/5
Tool Calling: 4/5
Classification: 3/5
Agentic Planning: 4/5
Structured Output: 5/5
Safety Calibration: 5/5
Strategic Analysis: 5/5
Persona Consistency: 5/5
Constrained Rewriting: 4/5
Creative Problem Solving: 4/5

External Benchmarks

SWE-bench Verified: N/A
MATH Level 5: N/A
AIME 2025: N/A

Pricing

Input: $0.25/MTok

Output: $1.50/MTok

Context Window: 1,049K tokens


Benchmark Analysis

Overview: across our 12-test suite, Claude Sonnet 4.6 wins 5 tests, Gemini 3.1 Flash Lite Preview wins 2, and 5 are ties. Details (in our testing):

  • Tool calling: Sonnet 4.6 scores 5 vs Gemini's 4. Sonnet is tied for 1st with 16 other models out of 54, indicating better function selection, argument accuracy, and sequencing for agentic flows.
  • Long context: Sonnet 4.6 scores 5 vs Gemini's 4. Sonnet is tied for 1st of 55 (shared with 36 others) while Gemini ranks 38 of 55; Sonnet is meaningfully stronger at retrieval and coherence past 30K tokens.
  • Agentic planning: Sonnet 4.6 scores 5 vs Gemini's 4. Sonnet is tied for 1st (with 14 others), showing better goal decomposition and failure recovery in our tests.
  • Classification: Sonnet 4.6 scores 4 vs Gemini's 3. Sonnet is tied for 1st (with 29 others), with more reliable routing and categorization in our runs.
  • Creative problem solving: Sonnet 4.6 scores 5 vs Gemini's 4. Sonnet is tied for 1st (with 7 others) and is stronger at non-obvious, feasible ideas.
  • Structured output: Gemini scores 5 vs Sonnet's 4. Gemini is tied for 1st (with 24 others), with better JSON/schema compliance and format adherence in our tests.
  • Constrained rewriting: Gemini scores 4 vs Sonnet's 3. Gemini ranks 6 of 53 (shared with 25 models) vs Sonnet's rank of 31; Gemini compresses text and obeys hard character limits more reliably.
  • Ties: strategic analysis, faithfulness, safety calibration, persona consistency, and multilingual (5/5 for both models on each). Ties indicate comparable behavior on nuanced reasoning, fidelity to source material, safety refusals, persona maintenance, and non-English output quality.

Supplementary external benchmarks (attributed): beyond our internal suite, Sonnet 4.6 scores 75.2% on SWE-bench Verified and 85.8% on AIME 2025 (Epoch AI), placing it competitively on external code and math measures. No external benchmark scores are available for Gemini 3.1 Flash Lite Preview.

Practical meaning: pick Sonnet when you need best-effort tool orchestration, very long contexts, and agentic planning; pick Gemini when strict structured outputs or constrained rewrites matter and cost/throughput are primary constraints.
Benchmark                   Claude Sonnet 4.6    Gemini 3.1 Flash Lite Preview
Faithfulness                5/5                  5/5
Long Context                5/5                  4/5
Multilingual                5/5                  5/5
Tool Calling                5/5                  4/5
Classification              4/5                  3/5
Agentic Planning            5/5                  4/5
Structured Output           4/5                  5/5
Safety Calibration          5/5                  5/5
Strategic Analysis          5/5                  5/5
Persona Consistency         5/5                  5/5
Constrained Rewriting       3/5                  4/5
Creative Problem Solving    5/5                  4/5
Summary                     5 wins               2 wins
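The win/loss/tie tally in the summary row falls straight out of the per-benchmark scores. A minimal sketch (the dictionary keys are shorthand labels for this illustration, not official benchmark identifiers):

```python
# Per-benchmark scores copied from the comparison table (each out of 5).
claude = {"faithfulness": 5, "long_context": 5, "multilingual": 5,
          "tool_calling": 5, "classification": 4, "agentic_planning": 5,
          "structured_output": 4, "safety_calibration": 5,
          "strategic_analysis": 5, "persona_consistency": 5,
          "constrained_rewriting": 3, "creative_problem_solving": 5}
gemini = {"faithfulness": 5, "long_context": 4, "multilingual": 5,
          "tool_calling": 4, "classification": 3, "agentic_planning": 4,
          "structured_output": 5, "safety_calibration": 5,
          "strategic_analysis": 5, "persona_consistency": 5,
          "constrained_rewriting": 4, "creative_problem_solving": 4}

# Tally head-to-head results across the 12 benchmarks.
claude_wins = sum(claude[k] > gemini[k] for k in claude)   # 5
gemini_wins = sum(gemini[k] > claude[k] for k in claude)   # 2
ties = sum(claude[k] == gemini[k] for k in claude)         # 5
```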

Pricing Analysis

Raw token unit costs (per 1M tokens): Claude Sonnet 4.6 charges $3.00 input / $15.00 output; Gemini 3.1 Flash Lite Preview charges $0.25 input / $1.50 output (roughly a 10× price ratio). Cost examples per 1M tokens: if all 1M tokens are output, Claude costs $15.00 and Gemini $1.50; if all 1M are input, Claude costs $3.00 and Gemini $0.25. For a 50/50 input/output split on 1M tokens, Claude costs $9.00 and Gemini $0.875. Costs scale linearly: 10M tokens (50/50) is $90 on Claude vs $8.75 on Gemini; 100M tokens is $900 vs $87.50. Who should care: teams doing high-volume, cost-sensitive inference (logs, simple chat, high-throughput APIs) will prefer Gemini's roughly $0.88 per 1M tokens (50/50); teams running long-context engineering, agentic workflows, or priority coding workloads where Sonnet's wins matter should budget for the ~10× higher token cost.
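The arithmetic above generalizes to any token mix. A minimal sketch, assuming the published per-MTok rates (the model keys here are illustrative labels, not official API identifiers):

```python
# Published rates in dollars per million tokens (MTok), taken from the
# pricing sections above.
RATES = {
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
    "gemini-3.1-flash-lite-preview": {"input": 0.25, "output": 1.50},
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost for a given input/output token mix (rates are per 1M tokens)."""
    r = RATES[model]
    return (input_tokens * r["input"] + output_tokens * r["output"]) / 1_000_000

# 1M tokens at a 50/50 input/output split:
print(cost_usd("claude-sonnet-4.6", 500_000, 500_000))             # 9.0
print(cost_usd("gemini-3.1-flash-lite-preview", 500_000, 500_000))  # 0.875
```

At this split Gemini comes out about 10× cheaper per token, matching the price ratio noted above.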

Real-World Cost Comparison

Task              Claude Sonnet 4.6    Gemini 3.1 Flash Lite Preview
Chat response     $0.0081              <$0.001
Blog post         $0.032               $0.0031
Document batch    $0.810               $0.080
Pipeline run      $8.10                $0.800

Bottom Line

Choose Claude Sonnet 4.6 if you run developer-centric or agentic AI workloads that need top tool calling, long-context coherence, and stronger coding/math performance (Sonnet wins tool calling, long context, agentic planning, creative problem solving, and classification). Budget for roughly 10× higher token costs. Choose Gemini 3.1 Flash Lite Preview if you need a much lower per-token price ($0.25/MTok input, $1.50/MTok output), high throughput, and stronger structured output and constrained rewriting: ideal for production APIs, schema-strict responses, or cost-sensitive pipelines.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions