MMLU
What Is It?
MMLU (Massive Multitask Language Understanding) is a standardized test that measures how well an AI model answers multiple-choice questions across 57 academic and professional subjects, including history, law, medicine, mathematics, and computer science. Think of it as a comprehensive university entrance exam covering nearly every discipline, administered all at once. A model that scores well on MMLU has demonstrated broad factual recall and reasoning ability across domains, not just depth in one area. Because the benchmark spans so many fields, it has become one of the most widely cited signals of general AI capability.
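To make the format concrete, here is a minimal sketch of how an MMLU-style score is computed: plain accuracy over four-option multiple-choice items, optionally broken down by subject. The item fields and the `model_answer` callable are hypothetical stand-ins for illustration, not the official evaluation harness.

```python
# Minimal MMLU-style scoring sketch: accuracy over multiple-choice items.
# Item fields and `model_answer` are assumed/hypothetical, not the real harness.
from collections import defaultdict

def mmlu_accuracy(items, model_answer):
    """items: iterable of dicts with 'subject', 'question', 'choices', 'answer'.
    model_answer: callable(question, choices) -> chosen index (0-3)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["subject"]] += 1
        if model_answer(item["question"], item["choices"]) == item["answer"]:
            correct[item["subject"]] += 1
    # Per-subject accuracy plus the overall micro-average across all items.
    per_subject = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_subject
```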
Why It Matters
MMLU scores tell you whether a model has the general knowledge base to handle unpredictable, cross-domain tasks — the kind of work that shows up in real jobs. If you need an AI for legal research, medical summarization, or technical writing, a model with a strong MMLU score is more likely to have the underlying knowledge those tasks require. However, MMLU is a multiple-choice benchmark, so it measures knowledge retrieval and reasoning more than open-ended generation or coding ability. Use it as one signal among several: a model with high MMLU but weak tool-calling scores may know a lot but struggle to act on that knowledge in an agentic workflow. On ModelPicker, we track both internal benchmark scores (on a 1–5 scale) and third-party external benchmarks to give you a fuller picture of where each model actually excels.
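One way to treat MMLU as a signal rather than a verdict is to fold it into a weighted composite alongside task-specific scores. The sketch below is a hedged illustration of that idea; the model record, benchmark names, scales, and weights are invented assumptions, not ModelPicker's actual methodology.

```python
# Hedged sketch: combine benchmarks on different scales into one composite.
# Benchmark names, scores, and weights are illustrative, not real data.
def composite_score(model, weights):
    """model: dict of benchmark -> (score, max_score); weights: benchmark -> float."""
    normalized = {name: score / max_score for name, (score, max_score) in model.items()}
    return sum(weights.get(name, 0.0) * value for name, value in normalized.items())

candidate = {
    "mmlu": (86.4, 100.0),        # external benchmark, percentage scale
    "tool_calling": (3.0, 5.0),   # internal benchmark, 1-5 scale
    "structured_output": (4.0, 5.0),
}
# An agentic workflow might weight tool calling above raw knowledge.
print(composite_score(candidate, {"mmlu": 0.3, "tool_calling": 0.5, "structured_output": 0.2}))
```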
How It Applies
MMLU is not one of ModelPicker's 12 internal benchmark tests — our suite focuses on task-specific capabilities like tool calling, structured output, faithfulness, and multilingual performance. However, external benchmarks we reference (sourced from Epoch AI) cover related territory: the MATH Level 5 benchmark, for example, maps closely to the STEM reasoning that MMLU's science and math sections probe. Across the 52 models we track from 8 providers, MATH Level 5 scores range widely, with a median of 94.15% and a 25th percentile of 73.25%, showing meaningful separation between models even at the top tier. When a model profile on ModelPicker references strong general reasoning, our internal classification and strategic analysis scores (median 4/5 across tracked models) are the closest proxies for the broad knowledge MMLU is designed to measure. Input costs across our tracked models range from $0.05 to $5.00 per million tokens, so MMLU-caliber performance is available at very different price points; it is worth checking our benchmarks to find where value peaks.
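For readers who want to reproduce the kind of spread and value statistics quoted above, the sketch below computes a median and 25th percentile with Python's statistics module, then adds a naive score-per-dollar view. The score and cost lists are invented for illustration; the real figures come from Epoch AI results and provider pricing.

```python
# Sketch of the spread statistics cited above. Scores and costs are made-up
# sample values, not the actual Epoch AI results or provider prices.
import statistics

scores = [98.2, 96.5, 94.15, 91.0, 85.3, 73.25, 60.1, 41.8]
median = statistics.median(scores)
q1 = statistics.quantiles(scores, n=4)[0]  # first quartile = 25th percentile
print(f"median={median:.2f}%, 25th percentile={q1:.2f}%")

# Naive value view: benchmark score per dollar of input cost ($/M tokens).
costs = [5.00, 3.00, 2.50, 1.10, 0.60, 0.25, 0.10, 0.05]
value = [s / c for s, c in zip(scores, costs)]
best = max(range(len(value)), key=value.__getitem__)
print(f"best value: score {scores[best]}% at ${costs[best]}/M input tokens")
```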