Gemma 4 31B

Gemma 4 31B is a 30.7-billion-parameter dense multimodal model from Google DeepMind, built for production workloads that require strong agentic capability and structured reasoning. It accepts text, image, and video inputs with a 262,144-token context window and up to 131,072 output tokens, an exceptionally large generation budget. At $0.13/MTok input and $0.38/MTok output, it delivers benchmark performance that ranks 8th out of 52 tested models at a price well below most high-scoring alternatives. A configurable reasoning mode (enabled or disabled via supported parameters) lets developers trade latency against analytical depth. In the budget-to-mid tier, Gemma 4 31B competes directly with DeepSeek V3.2: both cost $0.38/MTok on output, but Gemma 4 31B scores higher in our testing (avg 4.42 vs 4.25). Among open-weight-style models at this price, it is the top performer in our test set.
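The reasoning toggle is exposed per provider; a minimal sketch of how it could be passed through an OpenAI-compatible request, assuming OpenRouter's `reasoning` request field (the field name and its behavior for this model are assumptions to verify against the provider's docs):

```python
def build_request(prompt: str, reasoning_enabled: bool) -> dict:
    """Build a chat-completions payload, optionally disabling reasoning.

    The "reasoning" field follows OpenRouter's request convention; whether
    this specific model honors it is an assumption, not confirmed here.
    """
    payload = {
        "model": "google/gemma-4-31b-it",
        "messages": [{"role": "user", "content": prompt}],
    }
    if not reasoning_enabled:
        # Skip extended reasoning for lower latency on simple requests.
        payload["reasoning"] = {"enabled": False}
    return payload
```

With the OpenAI SDK, such a payload would be sent by unpacking it into `client.chat.completions.create(...)`, with the extra field passed via `extra_body`.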

Performance

In our 12-test benchmark suite, Gemma 4 31B ranks 8th out of 52 models with an average score of 4.42. It posts top-tier scores across most dimensions: tool calling (5/5, tied for 1st with 16 other models out of 54), agentic planning (5/5, tied for 1st with 14 others out of 54), strategic analysis (5/5, tied for 1st with 25 others out of 54), faithfulness (5/5, tied for 1st with 32 others out of 55), persona consistency (5/5, tied for 1st with 36 others out of 53), multilingual (5/5, tied for 1st with 34 others out of 55), and structured output (5/5, tied for 1st with 24 others out of 54). Creative problem solving (4/5, rank 9 of 54), constrained rewriting (4/5, rank 6 of 53), and classification (4/5) are solid. The weaker dimensions: long context (4/5, rank 38 of 55), slightly below the top tier, and safety calibration (2/5, rank 12 of 55), indicating above-average permissiveness.

Pricing

Gemma 4 31B costs $0.13 per million input tokens and $0.38 per million output tokens. At 10 million output tokens monthly, the output-side cost is $3.80; at 100 million, $38. This positions it as one of the most cost-effective high-performing models available. For comparison, DeepSeek V3.2 at the identical $0.38/MTok output averages 4.25 in our testing, against 4.42 for Gemma 4 31B. Gemma 4 26B A4B (a sparse variant) costs $0.35/MTok output and scores 4.25. Among models scoring in the same band (4.25 to 4.42), the next cheapest alternative after Gemma 4 31B is Mistral Medium 3.1 at $2/MTok output, more than 5x the output cost. Claude Haiku 4.5 at $5/MTok output scores 4.33 but costs over 13x more per output token.
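The per-volume figures above follow directly from the listed rates; a small helper makes the arithmetic explicit (the function name is ours, the prices are from this page):

```python
# Published rates for Gemma 4 31B, in USD per million tokens.
INPUT_PER_MTOK = 0.13
OUTPUT_PER_MTOK = 0.38

def monthly_cost(input_tokens: int, output_tokens: int) -> float:
    """Blended monthly API cost in USD for a given token volume."""
    return (input_tokens * INPUT_PER_MTOK
            + output_tokens * OUTPUT_PER_MTOK) / 1_000_000

# 10M and 100M output tokens match the figures quoted above.
print(f"${monthly_cost(0, 10_000_000):.2f}")   # $3.80
print(f"${monthly_cost(0, 100_000_000):.2f}")  # $38.00
```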

Gemma 4 31B

Overall
4.42/5 (Strong)

Benchmark Scores

Faithfulness
5/5
Long Context
4/5
Multilingual
5/5
Tool Calling
5/5
Classification
4/5
Agentic Planning
5/5
Structured Output
5/5
Safety Calibration
2/5
Strategic Analysis
5/5
Persona Consistency
5/5
Constrained Rewriting
4/5
Creative Problem Solving
4/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.130/MTok

Output

$0.380/MTok

Context Window: 262K

modelpicker.net

Real-World Costs

Chat response: <$0.001
Blog post: <$0.001
Document batch: $0.022
Pipeline run: $0.216

Pricing vs Performance

Output cost per million tokens (log scale) vs average score across our 12 internal benchmarks


Try It

from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint, so the standard
# OpenAI SDK works with only the base URL and API key changed.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

response = client.chat.completions.create(
    model="google/gemma-4-31b-it",
    messages=[
        {"role": "user", "content": "Hello, Gemma 4 31B!"}
    ],
)

print(response.choices[0].message.content)

Recommendation

Gemma 4 31B is one of the strongest value propositions in the tested model set. Teams building agentic systems will benefit from its 5/5 agentic planning and 5/5 tool calling scores; at $0.38/MTok output, this combination is unmatched in cost-efficiency among models we've tested. It is a clear choice for structured data extraction (5/5 structured output), multilingual applications (5/5 multilingual), and document analysis requiring faithfulness (5/5), and a solid option for classification pipelines (4/5 classification). The 5/5 strategic analysis score makes it viable for nuanced tradeoff reasoning that most budget models cannot handle. The primary concern is safety calibration (2/5, rank 12 of 55): this model is more permissive than most tested models and should not be deployed in consumer-facing contexts without additional content safeguards. For teams that do not need video input or the large 131K output budget, Gemma 4 26B A4B at $0.35/MTok output is a slightly cheaper alternative with a 4.25 average score.
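For the agentic use case, a minimal sketch of the tool-calling plumbing, assuming the standard OpenAI-compatible tool format that OpenRouter forwards (the `get_order_status` tool and its handler are illustrative, not part of any real API):

```python
import json

# One tool declared in the OpenAI-compatible JSON-schema format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of an order by ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

def dispatch(tool_call: dict) -> dict:
    """Route a model-emitted tool call to a local handler."""
    args = json.loads(tool_call["function"]["arguments"])
    if tool_call["function"]["name"] == "get_order_status":
        # Stubbed lookup; a real system would query a database here.
        return {"order_id": args["order_id"], "status": "shipped"}
    raise ValueError("unknown tool")
```

Passing `tools=TOOLS` to `client.chat.completions.create(...)` lets the model emit `tool_calls`, which a loop like `dispatch` then executes before returning results to the model as `tool`-role messages.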

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.