Hallucination
What Is It?
Hallucination occurs when an AI model produces output that sounds plausible and confident but is factually incorrect, made up, or unsupported by any real source. Think of it like a student who doesn't know the answer but writes a convincing-sounding essay anyway — the grammar is perfect, the tone is authoritative, and the facts are wrong. Hallucinations range from subtle errors (a slightly wrong date, a misattributed quote) to wholesale fabrications (citing a study that doesn't exist, inventing a legal precedent). The problem is structural: LLMs generate text by predicting likely next tokens, not by retrieving verified facts — so confident delivery and factual accuracy are not the same thing.
Why It Matters
Hallucination risk is one of the most consequential factors in model selection for any task where accuracy matters. If you are using an AI to draft legal documents, summarize medical research, generate code, or answer customer questions, a model that hallucinates 5% of the time is not 95% useful — it is potentially dangerous, because you cannot easily tell which 5% is wrong. Two of our internal benchmark dimensions bear directly on this: faithfulness (does the model stay grounded in provided source material?) and safety calibration (does the model acknowledge uncertainty rather than confabulate?). In our testing across 52 models, faithfulness scores cluster high — with a median of 5/5 and a 25th percentile of 4/5 — but safety calibration tells a starker story: the median is just 2/5, meaning most models will assert false confidence rather than say "I don't know." That gap is the hallucination risk in practice. For developers building RAG pipelines or document QA systems, faithfulness scores are a direct proxy for how reliably a model will stick to what it was given. For consumers, safety calibration predicts whether a model will hedge appropriately or present fiction as fact.
How It Applies
On ModelPicker, hallucination risk maps to two benchmark dimensions you will see on every model profile. Faithfulness (scored 1–5) tests whether a model accurately represents provided source material without inventing details. Safety calibration (scored 1–5) tests whether a model expresses appropriate uncertainty when it does not know something, rather than generating a confident but wrong answer. Across the 52 models we track, faithfulness has a 25th-percentile score of 4/5 — most models do reasonably well when grounded in source text. Safety calibration is the weak point: the 25th percentile is 1/5 and the median is just 2/5, meaning the majority of models we test lean toward false confidence over honest uncertainty. When you filter or compare models on ModelPicker, sorting by safety calibration score is the most direct way to surface models that are less likely to hallucinate in open-ended, fact-dependent tasks. Pairing that with a high faithfulness score gives you the strongest signal for low-hallucination deployments.
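The filter-and-sort advice above can be sketched in a few lines. This is a minimal illustration, not ModelPicker's actual API: the model names and scores below are hypothetical, and the ranking rule (safety calibration first, then faithfulness) simply mirrors the guidance in this section.

```python
# Hypothetical model profiles carrying the two dimensions discussed
# above. All names and scores are illustrative, not real benchmark data.
models = [
    {"name": "model-a", "faithfulness": 5, "safety_calibration": 2},
    {"name": "model-b", "faithfulness": 4, "safety_calibration": 4},
    {"name": "model-c", "faithfulness": 5, "safety_calibration": 1},
    {"name": "model-d", "faithfulness": 3, "safety_calibration": 5},
]

def low_hallucination_rank(profile):
    # Sort by safety calibration first (the weak point across models),
    # breaking ties with faithfulness, per the advice above.
    return (profile["safety_calibration"], profile["faithfulness"])

ranked = sorted(models, key=low_hallucination_rank, reverse=True)

# Shortlist models that score well on BOTH dimensions — the strongest
# signal for low-hallucination deployments.
shortlist = [
    m["name"]
    for m in ranked
    if m["safety_calibration"] >= 4 and m["faithfulness"] >= 4
]
```

With these illustrative scores, `model-d` ranks first on safety calibration alone, but only `model-b` clears the combined threshold — which is exactly the point: a single high score on one dimension is not enough.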