
Llama 4 Maverick

Llama 4 Maverick is a high-capacity multimodal model from Meta, built on a mixture-of-experts (MoE) architecture with 128 experts and 17 billion active parameters per forward pass. At $0.15 input / $0.60 output per million tokens, it offers one of the largest context windows in the tested field (1,048,576 tokens) at a very low price. In our 11-benchmark suite (tool calling was rate-limited during testing), it ranked 47th out of 52 models with an average score of 3.36, delivering strong persona consistency but weak strategic analysis and safety calibration.

Performance

Llama 4 Maverick's strongest benchmark in our testing is persona consistency (5/5, tied for 1st with 36 other models of the 53 tested). All other scores fall at or below the field median: faithfulness 4/5 (rank 34 of 55), multilingual 4/5 (rank 36 of 55), structured output 4/5 (rank 26 of 54), and long context 4/5 (rank 38 of 55). Weaknesses include strategic analysis (2/5, rank 44 of 54), safety calibration (2/5), and agentic planning (3/5, rank 42 of 54). Tool calling was not scored and was excluded from the average, due to a rate-limit error during testing (a transient issue per the model's quirks data). Overall rank: 47 of 52 tested models.

Pricing

Llama 4 Maverick costs $0.15 per million input tokens and $0.60 per million output tokens: 1 million output tokens per month costs $0.60, and 10 million costs $6.00. The 1,048,576-token context window is a standout feature at this price point; most models with million-token contexts cost significantly more. Within the Meta lineup, Llama 3.3 70B Instruct is both cheaper ($0.32/MTok output) and slightly stronger on our benchmarks (avg 3.50). For applications that require very long context at low cost and can tolerate below-median general benchmark performance, Llama 4 Maverick's pricing is compelling.
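The per-volume figures above follow from simple linear pricing. A minimal sketch of the arithmetic (rates hard-coded from this page; the function name is our own):

```python
def maverick_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost at $0.15/MTok input, $0.60/MTok output."""
    IN_RATE, OUT_RATE = 0.15, 0.60  # USD per million tokens
    return input_tokens / 1e6 * IN_RATE + output_tokens / 1e6 * OUT_RATE

print(maverick_cost(0, 1_000_000))   # 1M output tokens -> 0.6
print(maverick_cost(0, 10_000_000))  # 10M output tokens -> 6.0
```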

Meta

Llama 4 Maverick

Overall
3.36/5 (Usable)

Benchmark Scores

Faithfulness
4/5
Long Context
4/5
Multilingual
4/5
Classification
3/5
Agentic Planning
3/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
2/5
Persona Consistency
5/5
Constrained Rewriting
3/5
Creative Problem Solving
3/5
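The 3.36 overall average can be reproduced from the eleven scored benchmarks above (tool calling excluded); a quick sketch:

```python
# The 11 scored benchmarks (tool calling excluded due to rate limiting).
scores = {
    "faithfulness": 4, "long_context": 4, "multilingual": 4,
    "classification": 3, "agentic_planning": 3, "structured_output": 4,
    "safety_calibration": 2, "strategic_analysis": 2,
    "persona_consistency": 5, "constrained_rewriting": 3,
    "creative_problem_solving": 3,
}
avg = sum(scores.values()) / len(scores)
print(round(avg, 2))  # 3.36
```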

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
N/A
AIME 2025
N/A

Pricing

Input

$0.150/MTok

Output

$0.600/MTok

Context Window: 1049K

modelpicker.net

Real-World Costs

Chat response: <$0.001
Blog post: $0.0013
Document batch: $0.033
Pipeline run: $0.330

Pricing vs Performance

Output cost per million tokens (log scale) vs average score across our 12 internal benchmarks


Try It

from openai import OpenAI

# Llama 4 Maverick is available through OpenRouter's OpenAI-compatible API.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # replace with your OpenRouter API key
)

response = client.chat.completions.create(
    model="meta-llama/llama-4-maverick",
    messages=[
        {"role": "user", "content": "Hello, Llama 4 Maverick!"}
    ],
)

print(response.choices[0].message.content)
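With a 1,048,576-token window, it can be worth checking whether a large prompt is likely to fit before sending it. A rough pre-flight sketch using a ~4-characters-per-token heuristic (an approximation, not the model's actual tokenizer):

```python
CONTEXT_WINDOW = 1_048_576  # Llama 4 Maverick's context window, in tokens

def fits_in_context(text: str, reserve_for_output: int = 4096) -> bool:
    """Coarse check: estimate tokens at ~4 chars each, reserve room for output."""
    estimated_tokens = len(text) // 4
    return estimated_tokens + reserve_for_output <= CONTEXT_WINDOW

print(fits_in_context("hello " * 100))  # True: a tiny prompt fits easily
```

For precise counts, a real tokenizer should replace the heuristic; this only guards against gross overruns.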

Recommendation

Llama 4 Maverick is a specialized fit for applications that require extremely long context (1M tokens) at the lowest possible price, where persona consistency matters and where strategic analysis and complex reasoning are not critical requirements. It is not recommended as a general-purpose model — its 3.36 average score across 11 benchmarks places it near the bottom of the tested field. For comparable pricing with stronger benchmark performance, Llama 3.3 70B Instruct ($0.32/MTok output, avg 3.50) offers better general results at even lower cost.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.

Frequently Asked Questions