meta

Llama 3.3 70B Instruct

Llama 3.3 70B Instruct is a pretrained and instruction-tuned 70-billion-parameter multilingual text model from Meta. It is a text-in, text-out model with a 131,072-token context window and a maximum output of 16,384 tokens. At $0.10/MTok input and $0.32/MTok output, it is one of the cheapest models by output cost in the tested set. However, it ranks 43rd out of 52 overall with an average score of 3.5 — toward the lower end of the performance distribution. Within Meta's lineup in our test set, its only sibling is Llama 4 Maverick ($0.60/MTok output, avg 3.36), which scores slightly lower at a higher price. Llama 3.3 70B Instruct competes most directly with budget text-only alternatives like Ministral 3 8B 2512 ($0.15/MTok output, avg 3.67) and Mistral Small 3.2 24B ($0.20/MTok output, avg 3.25) — both cheaper and in the same general performance tier.
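The window and output limits above imply a prompt budget. A minimal sketch, assuming output tokens count against the context window (providers differ on this):

```python
# Token budget implied by the limits quoted above. The assumption that
# output shares the 131,072-token window is illustrative; check your
# provider's documentation.
CONTEXT_WINDOW = 131_072
MAX_OUTPUT = 16_384

def max_prompt_tokens(reserved_output: int = MAX_OUTPUT) -> int:
    """Prompt budget when `reserved_output` tokens are held back for the reply."""
    return CONTEXT_WINDOW - reserved_output

print(max_prompt_tokens())   # 114688 tokens left for the prompt
print(max_prompt_tokens(0))  # 131072 if no output is reserved
```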

Performance

In our 12-test benchmark suite, Llama 3.3 70B Instruct ranks 43rd out of 52 models with an average score of 3.5. The standout score is long context at 5/5 (tied for 1st with 36 other models out of 55 tested). Classification (4/5, tied for 1st with 29 others out of 53), tool calling (4/5, rank 18 of 54), faithfulness (4/5, rank 34 of 55), multilingual (4/5, rank 36 of 55), and structured output (4/5, rank 26 of 54) are mid-range. Notable weaknesses: persona consistency (3/5, rank 45 of 53), agentic planning (3/5, rank 42 of 54), strategic analysis (3/5, rank 36 of 54), constrained rewriting (3/5, rank 31 of 53), creative problem solving (3/5, rank 30 of 54), and safety calibration (2/5, rank 12 of 55). External math benchmarks from Epoch AI show MATH Level 5 at 41.6% (rank 14 of 14 — last among tested models) and AIME 2025 at 5.1% (rank 23 of 23 — last among tested models). Math reasoning is a clear weak point.

Pricing

Llama 3.3 70B Instruct costs $0.10 per million input tokens and $0.32 per million output tokens. At 10 million output tokens monthly, output cost is $3.20. At 100 million output tokens, output cost is $32. At $0.32/MTok output, it is among the cheapest models available, though its average benchmark score of 3.5 places it in the lower tier of tested models. Gemma 4 26B A4B at $0.35/MTok output scores 4.25 — substantially higher performance for marginally more cost. Ministral 3 8B 2512 at $0.15/MTok output scores 3.67, slightly outperforming Llama 3.3 70B Instruct at less than half the output price. For teams primarily optimizing for lowest possible cost on text-only workloads, Llama 3.3 70B Instruct offers a price point that few models undercut.
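The arithmetic above can be sketched as a small cost estimator. A minimal sketch, using the listed rates; the `estimate_cost` helper is illustrative, not an official pricing API:

```python
# Listed rates for Llama 3.3 70B Instruct (USD per million tokens).
INPUT_PER_MTOK = 0.10
OUTPUT_PER_MTOK = 0.32

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Total USD cost for a given token volume at the rates above."""
    return (input_tokens / 1_000_000) * INPUT_PER_MTOK \
         + (output_tokens / 1_000_000) * OUTPUT_PER_MTOK

# The figures quoted above: output cost only, at 10M and 100M tokens/month.
print(f"${estimate_cost(0, 10_000_000):.2f}")   # $3.20
print(f"${estimate_cost(0, 100_000_000):.2f}")  # $32.00
```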


Overall
3.50/5 Strong

Benchmark Scores

Faithfulness
4/5
Long Context
5/5
Multilingual
4/5
Tool Calling
4/5
Classification
4/5
Agentic Planning
3/5
Structured Output
4/5
Safety Calibration
2/5
Strategic Analysis
3/5
Persona Consistency
3/5
Constrained Rewriting
3/5
Creative Problem Solving
3/5

External Benchmarks

SWE-bench Verified
N/A
MATH Level 5
41.6%
AIME 2025
5.1%

Pricing

Input

$0.100/MTok

Output

$0.320/MTok

Context Window

131K

modelpicker.net

Real-World Costs

Chat response: <$0.001
Blog post: <$0.001
Document batch: $0.018
Pipeline run: $0.180

Pricing vs Performance

Output cost per million tokens (log scale) vs average score across our 12 internal benchmarks


Try It

from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API, so the standard client
# works with only the base URL and API key changed.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct",
    messages=[
        {"role": "user", "content": "Hello, Llama 3.3 70B Instruct!"}
    ],
)

print(response.choices[0].message.content)

Recommendation

Llama 3.3 70B Instruct is best suited for simple, high-volume text tasks where cost is the primary constraint: basic classification, long-context retrieval on structured documents, and straightforward question-answering workloads. Its 5/5 long context score is a genuine strength. Teams requiring stronger reasoning, persona handling, or agentic capability should look at budget alternatives that score higher overall — Gemma 4 26B A4B at $0.35/MTok output (avg 4.25) costs only $0.03/MTok more but delivers dramatically better performance across most dimensions. Llama 3.3 70B Instruct is not recommended for math-heavy tasks (ranked last in both external math benchmarks in our data), complex agentic workflows (rank 42 of 54 on agentic planning), or persona-critical applications (rank 45 of 53). The 2/5 safety calibration also limits consumer-facing use without external filtering.

How We Test

We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. Each test is scored 1–5 by an LLM judge. Read our full methodology.