How We Test

Why Another Benchmark?

Public benchmarks like MMLU and HumanEval test academic capabilities. We test what developers actually use LLMs for: generating JSON, calling tools, compressing text to a character limit, maintaining persona across turns, and recovering from failed plans. Our benchmarks use real-world prompts, not synthetic ones.

12 Test Categories

Structured Output

JSON schema compliance and format adherence

Strategic Analysis

Nuanced tradeoff reasoning with real numbers

Constrained Rewriting

Compression within hard character limits

Creative Problem Solving

Non-obvious, specific, feasible ideas

Tool Calling

Function selection, argument accuracy, sequencing

Faithfulness

Sticking to source material without hallucinating

Classification

Accurate categorization and routing

Long Context

Retrieval accuracy at 30K+ tokens

Safety Calibration

Refusing harmful requests while permitting legitimate ones

Persona Consistency

Maintaining character and resisting prompt injection

Agentic Planning

Goal decomposition and failure recovery

Multilingual

Equivalent quality output in non-English languages
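To make the Structured Output category concrete, here is a minimal sketch of a schema-compliance scorer. The schema, field names, and 1-3 rubric below are illustrative stand-ins, not the actual test harness:

```python
import json

# Hypothetical schema for illustration: required fields and their expected types.
SCHEMA = {"name": str, "score": float, "tags": list}

def score_structured_output(raw: str) -> int:
    """Score a model response 1-3 on JSON schema compliance (illustrative rubric)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return 1  # not even valid JSON
    if not isinstance(data, dict):
        return 1  # valid JSON but not an object
    missing = [k for k in SCHEMA if k not in data]
    wrong_type = [k for k, t in SCHEMA.items()
                  if k in data and not isinstance(data[k], t)]
    if missing or wrong_type:
        return 2  # parseable, but fields are missing or mistyped
    return 3  # fully compliant: every required field, every type correct
```

The same parse-then-validate pattern generalizes to any category with a mechanically checkable constraint, such as the hard character limits in Constrained Rewriting.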

Scoring

Each model response is scored 1-3 by an LLM judge (Claude Sonnet 4.6):

  • 1 (Weak): Didn't follow instructions, thin or unusable output
  • 2 (Usable): Followed structure, reasonable output, needs human editing
  • 3 (Strong): Complete, insightful, hand-to-a-stakeholder quality

Overall grade is the average of per-test scores: 1.0-1.6 = Weak, 1.7-2.3 = Usable, 2.4-3.0 = Strong.
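The band mapping can be sketched as follows. This assumes the mean is rounded to one decimal place before banding, which is our reading of the 1.0-1.6 / 1.7-2.3 / 2.4-3.0 cut points (otherwise a mean like 1.65 would fall between bands):

```python
def overall_grade(scores: list[int]) -> str:
    """Map the mean of per-test 1-3 scores onto the Weak/Usable/Strong bands."""
    avg = round(sum(scores) / len(scores), 1)  # one decimal, so every mean lands in a band
    if avg <= 1.6:
        return "Weak"
    if avg <= 2.3:
        return "Usable"
    return "Strong"
```

For example, a model scoring 3, 3, 2, 2 across four tests averages 2.5 and grades Strong.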

Freshness

Pricing is refreshed nightly from OpenRouter and LiteLLM APIs. New models are automatically benchmarked when detected. Quirks (parameter requirements, JSON mode support, tool calling) are probed automatically before benchmarking.
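A sketch of the conversion step in the nightly refresh, run here against a sample shaped like OpenRouter's `GET /api/v1/models` response (prices arrive as per-token strings; the model IDs and prices below are illustrative, and the real job fetches over the network rather than using an inline sample):

```python
# Sample payload shaped like OpenRouter's /api/v1/models response; values are illustrative.
SAMPLE = {
    "data": [
        {"id": "example/model-a", "pricing": {"prompt": "0.000003", "completion": "0.000015"}},
        {"id": "example/model-b", "pricing": {"prompt": "0.0000005", "completion": "0.0000015"}},
    ]
}

def usd_per_million(payload: dict) -> dict:
    """Convert per-token string prices into (input, output) USD per million tokens."""
    table = {}
    for model in payload["data"]:
        p = model["pricing"]
        table[model["id"]] = (float(p["prompt"]) * 1e6, float(p["completion"]) * 1e6)
    return table
```

Normalizing both OpenRouter and LiteLLM entries into the same per-million-token table is what lets prices from the two sources be compared directly.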

Data Sources

  • Pricing: OpenRouter API + LiteLLM model registry
  • Benchmarks: BYOM engine via OpenRouter
  • Quirk detection: Automated probe suite (9 tests per model)
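The probe suite can be organized as a small harness that runs each check and records pass/fail. The probe names and the stub-inspecting probes below are hypothetical stand-ins; real probes would issue live API calls:

```python
from typing import Callable

# Hypothetical probes for illustration; a real suite would call the model's API.
def supports_json_mode(client) -> bool:
    return getattr(client, "json_mode", False)

def accepts_temperature(client) -> bool:
    return getattr(client, "temperature_ok", True)

PROBES: list = [
    ("json_mode", supports_json_mode),
    ("temperature", accepts_temperature),
    # ...a full suite would register all nine probes here
]

def detect_quirks(client) -> dict:
    """Run every registered probe, treating a probe that raises as a failure."""
    results = {}
    for name, probe in PROBES:
        try:
            results[name] = bool(probe(client))
        except Exception:
            results[name] = False
    return results
```

Running the harness before benchmarking means a model that, say, lacks JSON mode is exercised through prompting rather than failing every Structured Output test on an API error.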