Meta
Meta offers a small Llama-centered lineup focused on cost-efficient multimodal and long-context capability. In our testing their three active models target developers and teams who need large context windows and image+text inputs at low output cost. Their market position is budget pricing with mid-range benchmark performance across our 12-test suite.
Models
3
Cheapest Output
$0.30/mTok
Avg Score
3.40/5
Price Range
$0.30–$0.60/mTok
Model Lineup
Meta exposes three models in our dataset: Llama 3.3 70B Instruct, Llama 4 Maverick, and Llama 4 Scout. We tier them by average benchmark score. Llama 3.3 70B Instruct is the top performer (average score 3.5) and the practical flagship for text-heavy, long-context, and tool-calling workflows. Llama 4 Maverick is the high-capacity multimodal option, with a 1,048,576-token context window, text+image input, and the lineup's highest output cost ($0.60/mTok); use it when you need extreme context alongside image inputs. Llama 4 Scout is the budget multimodal pick ($0.30/mTok output, 327,680-token context window) for cost-sensitive long-context or image+text tasks. Per-model output pricing: Llama 4 Scout $0.30/mTok, Llama 3.3 70B Instruct $0.32/mTok, Llama 4 Maverick $0.60/mTok. One quirk: Llama 4 Maverick hit a transient 429 rate limit on a tool-calling test via OpenRouter during one run (a retry sketch follows below).
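The 429 quirk above is easy to work around with client-side backoff. Below is a minimal sketch of a tool-calling request with retries, assuming OpenRouter's OpenAI-compatible chat completions endpoint and the meta-llama/llama-4-maverick model slug; the backoff values and the get_weather tool are illustrative, not part of our test harness.

```python
# Minimal sketch: retry a chat completion on transient 429s from OpenRouter.
import os
import time

import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def chat_with_retry(payload: dict, max_retries: int = 4) -> dict:
    """POST a chat completion, backing off exponentially on HTTP 429."""
    headers = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}
    for attempt in range(max_retries + 1):
        resp = requests.post(OPENROUTER_URL, headers=headers, json=payload, timeout=120)
        if resp.status_code == 429 and attempt < max_retries:
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... then retry
            continue
        resp.raise_for_status()  # raises on the final 429 too
        return resp.json()

result = chat_with_retry({
    "model": "meta-llama/llama-4-maverick",
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    # Tool definitions use the OpenAI-compatible schema OpenRouter accepts.
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
})
print(result["choices"][0]["message"])
```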
Strengths and Weaknesses
Strengths — In our testing Meta's models excel at long-context and multimodal scenarios: Llama 3.3 and Llama 4 Scout both score 5 on long context, and Llama 4 Maverick and Scout accept text+image input with very large context windows (327,680 and 1,048,576 tokens respectively; see the request sketch after this section). Llama 3.3 shows strong tooling support (tool calling 4) and is the highest-scoring model in the lineup (average 3.5). Cost is a clear strength: output prices of $0.30–$0.60/mTok sit substantially below every competitor in our comparison.

Weaknesses — Safety calibration is low across the lineup (all three models score 2 in our tests), and strategic analysis is uneven (as low as 2 for Maverick and Scout). On third-party math benchmarks, Llama 3.3 scored 41.6% on MATH Level 5 and 5.1% on AIME 2025 (Epoch AI), so olympiad-level math remains weak relative to top-tier competitors in our dataset. Overall, Meta offers cost-effective multimodal long-context models but trails the highest-scoring providers on average (3.40 vs roughly 4.67 for Anthropic and OpenAI).
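For the multimodal strength above, here is a minimal sketch of an image+text request to Llama 4 Scout, again assuming OpenRouter's OpenAI-compatible API and the meta-llama/llama-4-scout slug; the image URL is a placeholder.

```python
# Minimal sketch: send a mixed image+text prompt to Llama 4 Scout.
import os

import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "meta-llama/llama-4-scout",
        "messages": [{
            "role": "user",
            # Mixed content parts: one text block, one image block.
            "content": [
                {"type": "text", "text": "Summarize the chart in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```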
Pricing
Meta's output pricing ($0.30–$0.60 per mTok across the three models) is far below the major incumbents in our competitor comparison. For reference, competitor output costs in our dataset are: Anthropic $15, OpenAI $14, Google $3, Deepseek $2.15, xAI $6, and Mistral $2 per mTok. That positions Meta as a budget provider on cost. However, their provider average benchmark score (3.40) trails competitors like Anthropic (4.67) and OpenAI (4.67), so Meta trades some raw benchmark performance for a large cost advantage. The worked example below shows how quickly that advantage compounds at volume.
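This is a purely illustrative cost calculation using the per-mTok output prices quoted above; the token volume is a made-up example, not a measured workload.

```python
# Worked example: output cost for a fixed volume of generated tokens,
# using the per-mTok output prices from this page.
OUTPUT_COST_PER_MTOK = {
    "Llama 4 Scout": 0.30,
    "Llama 3.3 70B Instruct": 0.32,
    "Llama 4 Maverick": 0.60,
    "Anthropic": 15.00,
    "OpenAI": 14.00,
    "Google": 3.00,
}

tokens_generated = 5_000_000  # 5M output tokens
for name, price in OUTPUT_COST_PER_MTOK.items():
    cost = price * tokens_generated / 1_000_000
    print(f"{name}: ${cost:.2f}")
# Llama 4 Scout comes to $1.50 vs Anthropic's $75.00 at this volume: a 50x gap.
```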
Pricing vs Performance
Chart: output cost per million tokens (log scale) plotted against average score across our 12 internal benchmarks.
How We Test
We test every model against our 12-benchmark suite covering tool calling, agentic planning, creative problem solving, safety calibration, and more. An LLM judge scores each test from 1 to 5. Read our full methodology. A simplified sketch of such a scoring loop appears below.
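The following is a hypothetical sketch of the judge-and-average loop described above, not our actual harness: the judge callable, rubric text, and score parsing are all assumptions for illustration.

```python
# Hypothetical sketch: grade each benchmark transcript 1-5 with an LLM
# judge, then average across the suite for a model's overall score.
import re
import statistics

def judge_score(transcript: str, rubric: str, judge) -> int:
    """Ask a judge model (any str -> str callable) for a 1-5 integer score."""
    reply = judge(f"Rubric:\n{rubric}\n\nTranscript:\n{transcript}\n\n"
                  "Reply with a single integer score from 1 to 5.")
    match = re.search(r"[1-5]", reply)
    if not match:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group())

def average_score(transcripts: dict[str, str], rubrics: dict[str, str], judge) -> float:
    """Mean judge score across all benchmarks for one model."""
    return statistics.mean(
        judge_score(transcripts[name], rubrics[name], judge) for name in transcripts
    )
```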