
Why Do Large Language Models Hallucinate?
Large Language Model (LLM) hallucinations stem from three main factors: data quality issues, model training methodologies, and architectural limitations.
Sajjad Khazipura
Amazon Employee
Published May 13, 2025
Following our previous exploration of what LLM hallucinations are, we now tackle the pressing question: why do these systems sometimes generate convincing yet entirely fabricated information? The better we understand this malady, the greater the likelihood of discovering remedies. Generally speaking, LLM hallucinations are a class of problems that arise from a combination of data quality issues, model training methodologies, and architectural limitations. Let's unpack the factors contributing to hallucinations, grounded in research and informed by industry practices and real-world examples.
Data quality issues in training data manifest themselves as LLM hallucinations. If the model is trained on inaccurate, outdated, or biased data, it will produce outputs that reflect those inaccuracies. For example, if an LLM is trained on social media data that contains misinformation, it may generate false statements. The axiom "garbage in, garbage out" holds true for LLMs.
Misinformation in social media that winds its way into training datasets has led to a rise in imitative falsehoods, where LLMs generate convincing but false statements based on the data they have been trained on. The proclivity of LLMs to learn whatever is in their data also leads them to perpetuate and amplify socio-cultural biases present in the training data, producing stereotypical responses or hallucinations. A 2024 Stanford study1 evaluating cultural bias in popular LLMs found that they surfaced racial biases. Also, given that the vast majority of internet data is in English, these models default to Western-centric perspectives.
LLMs struggle with topics underrepresented in their training data, leading to knowledge gaps. Infrequent or rare facts in training data lead to higher hallucination rates. Models struggle with long-tail knowledge, where entities appearing <10% of the time in training show 3× higher hallucination rates compared to common entities2. For example, a model trained primarily on general web content might hallucinate when asked about niche academic subjects whose content sits behind paywalls, or about regional dialects that general LLM training corpora barely cover. The ROOTS dataset, designed for multilingual training, addresses this by curating 1.6TB of text across 59 languages, yet gaps persist, especially for the long tail of low-resource languages like Tulu, Hokkien, Yoruba, or Guarani.
When real-world data is hard to acquire on account of regulation and costs, synthetic data is used in domains like healthcare (e.g., generating synthetic patient records to preserve privacy). Overreliance on such data risks model collapse, where models trained on AI-generated data degrade over generations.
Lastly, large language models, which are architecturally dependent on predicting the next word from surface-level statistical patterns in text, often generate plausible-sounding but factually incorrect answers because they rely on correlations rather than genuine understanding of, or reasoning about, the underlying facts2. This makes it difficult for them to distinguish true knowledge from spurious correlations.
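To see how correlation-driven prediction differs from fact retrieval, here is a minimal sketch of a toy bigram model. The corpus, counts, and probabilities are all made up for illustration, and no real LLM works this simply, but the core point carries over: the model scores continuations by how often they co-occur, not by whether they are true.

```python
# Minimal sketch (illustrative only): a toy bigram "language model" that
# predicts the next word purely from co-occurrence counts. It has no notion
# of truth, only of which words tend to follow which.
from collections import Counter, defaultdict

corpus = (
    "paris is the capital of france . "
    "lyon is the capital of gastronomy . "   # figurative usage in the data
    "paris is the city of light ."
).split()

# Count word -> next-word transitions.
transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def next_word_distribution(word):
    """Return P(next | word) estimated from raw counts."""
    counts = transitions[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_distribution("of"))
# "france", "gastronomy", and "light" each get probability 1/3 after "of",
# purely because those patterns appear in the data, not because they are facts.
```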
Recognizing this need for high-quality data, new technologies and startups have emerged for curating data. They address various stages of the data supply chain that supports LLM development, with tools and services for data labelling, data curation, identifying knowledge gaps, and improving data quality through hybrid datasets that blend real-world and synthetic data with human oversight to minimize hallucination risks. Finally, newer data-centric AI approaches such as Confident Learning3 and Contrastive Learning4 are designed to identify label errors, characterize label noise, disambiguate between similar and dissimilar data points, and improve model training on datasets with imperfect labels.
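The following is a minimal, from-scratch sketch of the intuition behind Confident Learning on made-up data: flag examples whose given label disagrees with a per-class confidence threshold estimated from out-of-sample predicted probabilities. All numbers and thresholds here are illustrative; production implementations (such as the cleanlab library) are considerably more involved.

```python
# Simplified sketch of the core Confident Learning idea on toy data:
# compare each example's given label against per-class confidence thresholds
# computed from out-of-sample predicted probabilities.
import numpy as np

# Hypothetical out-of-sample predicted probabilities for 6 examples, 2 classes.
pred_probs = np.array([
    [0.95, 0.05],
    [0.90, 0.10],
    [0.15, 0.85],
    [0.20, 0.80],
    [0.97, 0.03],   # given label says class 1, model strongly says class 0
    [0.10, 0.90],
])
given_labels = np.array([0, 0, 1, 1, 1, 1])

# Per-class threshold: average predicted probability of class k
# over examples actually labelled k (their "self-confidence").
n_classes = pred_probs.shape[1]
thresholds = np.array([
    pred_probs[given_labels == k, k].mean() for k in range(n_classes)
])

# An example is a suspected label error if another class clears its threshold
# while the given label does not.
suspect = []
for i, y in enumerate(given_labels):
    confident_classes = np.where(pred_probs[i] >= thresholds)[0]
    if len(confident_classes) and y not in confident_classes:
        suspect.append(i)

print("Suspected label errors at indices:", suspect)  # expect index 4
```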
Even if the data problems were addressed and one had access to curated, high-quality data, training methodologies introduce vulnerabilities.
Classical supervised learning regimes invite the curse of hard labels, i.e., labels in which each data point is assigned a single, definitive class: an example either belongs to a class or it does not. LLMs are typically trained to predict a single "correct" next word, a method that ignores linguistic ambiguity. For instance, when completing the sentence "The nurse prepared the _," both "medicine" and "bandages" are valid, but models penalized for "wrong" guesses learn to project certainty on account of the hard labels. This fosters overconfidence, turning gaps in knowledge into fabricated answers.
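A quick numerical sketch of why this matters, using toy probabilities that do not come from any real model: under a hard one-hot target, the distribution that confidently bets on a single word gets the lower loss, while under a softened target the distribution that spreads probability across the valid continuations does.

```python
# Toy comparison of hard vs. softened targets for "The nurse prepared the _".
import numpy as np

vocab = ["medicine", "bandages", "syringe", "banana"]

def cross_entropy(target, probs):
    return float(-np.sum(target * np.log(probs)))

# Two hypothetical model distributions over the next word.
confident = np.array([0.90, 0.04, 0.04, 0.02])   # bets everything on "medicine"
hedged    = np.array([0.45, 0.45, 0.08, 0.02])   # acknowledges both valid words

hard_target = np.array([1.0, 0.0, 0.0, 0.0])     # only "medicine" counts
soft_target = np.array([0.5, 0.4, 0.1, 0.0])     # credit for both valid words

for name, p in [("confident", confident), ("hedged", hedged)]:
    print(name,
          "hard-label loss:", round(cross_entropy(hard_target, p), 3),
          "soft-label loss:", round(cross_entropy(soft_target, p), 3))
# The hard label rewards the overconfident distribution; the softened target
# rewards the distribution that admits the context has several valid answers.
```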
Another key challenge with LLM training regimes is that large language models perform lossy compression of knowledge during training, approximating patterns from their training data rather than storing exact information. This process prioritizes efficiency and generalization at the cost of precise fidelity, leading to trade-offs in factual accuracy. For example, the phrase "Paris is the capital of France" is stored as a probabilistic association between "Paris" and "capital," not as an explicit fact.
The ordering of training data also significantly impacts LLMs' propensity to hallucinate, as evidenced by controlled studies of learning dynamics and knowledge acquisition. If a model sees easy, common facts first, it learns those well but may struggle with rare or unusual facts later, leading to more hallucinations about them; shuffling the order helps the model learn more evenly and can reduce hallucinations, though it may slow learning. If similar facts are grouped together, the model memorizes them quickly but may get confused when it later encounters new or different information, causing more hallucinations, whereas spreading out similar facts helps the model generalize and reduces the chance of making things up. And if new information is always presented after old information (as in chronological ordering), the model can "forget" earlier facts and start fabricating answers about the past; mixing old and new facts during training helps it retain both.
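The sketch below illustrates these ordering choices on placeholder "facts"; no model is actually trained here, and the stream and batch construction are purely illustrative.

```python
# Illustrative data-ordering choices: grouped vs. shuffled streams, plus a
# simple replay scheme that mixes older facts into newer batches.
import random

random.seed(0)

common_facts = [f"common_fact_{i}" for i in range(6)]
rare_facts   = [f"rare_fact_{i}" for i in range(3)]
new_facts    = [f"new_fact_{i}" for i in range(3)]

# 1. Grouped / chronological ordering: common facts first, rare and new last.
grouped_stream = common_facts + rare_facts + new_facts

# 2. Shuffled ordering: every kind of fact appears throughout training.
shuffled_stream = common_facts + rare_facts + new_facts
random.shuffle(shuffled_stream)

# 3. Replay mixing: each batch of new facts is padded with a sample of older
#    facts so earlier knowledge keeps being revisited.
def batches_with_replay(new, old, replay_k=2, batch_size=4):
    step = batch_size - replay_k
    for i in range(0, len(new), step):
        fresh = new[i:i + step]
        if not fresh:
            break
        yield fresh + random.sample(old, replay_k)

print(grouped_stream)
print(shuffled_stream)
print(list(batches_with_replay(new_facts, common_facts + rare_facts)))
```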
Reinforcement learning from human feedback (RLHF) aligns models with human preferences, but it can also backfire. In pursuit of "helpfulness," models learn sycophancy: agreeing with users even when they are incorrect. For example, Anthropic's 2023 research7 found models would endorse flawed medical advice if users insisted, prioritizing agreeability over accuracy.
To mitigate hallucinations caused by training methodologies, focus on curating high-quality, diverse datasets and integrating graph-based retrieval-augmented generation (RAG) and its variants, whose explicit knowledge representation helps ground model outputs in factual sources. Embedding structured verification steps into the training process further reduces the likelihood of hallucinated responses.
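As an illustration of the grounding-plus-verification idea at inference time, here is a hedged sketch in which retrieve_facts, generate_answer, and is_supported are hypothetical stand-ins for a graph-RAG retriever, an LLM call, and an entailment-style checker; they are stubbed so the control flow runs end to end.

```python
# Hedged sketch of retrieval-grounded generation with a verification gate.
# All three helpers are hypothetical stubs, not real library calls.
from typing import List

def retrieve_facts(question: str) -> List[str]:
    # Stand-in for a knowledge-graph / vector-store lookup.
    return ["Paris is the capital of France."]

def generate_answer(question: str, facts: List[str]) -> str:
    # Stand-in for an LLM prompted with the question plus retrieved facts.
    return "The capital of France is Paris."

def is_supported(answer: str, facts: List[str]) -> bool:
    # Stand-in for a verification step (entailment model, rule check, etc.).
    return any("Paris" in f and "capital" in f for f in facts)

def grounded_answer(question: str) -> str:
    facts = retrieve_facts(question)
    answer = generate_answer(question, facts)
    if not facts or not is_supported(answer, facts):
        # Abstain rather than risk a hallucinated answer.
        return "I don't have enough grounded information to answer that."
    return answer

print(grounded_answer("What is the capital of France?"))
```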
Transformer-based LLMs have limited working memory (context windows of, e.g., 4k-32k tokens), so older context gets pushed out, leading to factual drift. The attention mechanism struggles with long sequences as attention weights become diluted, leading to incoherent or invented content. Similarly, tokenizers may split rare words into meaningless subwords, causing semantic distortion (e.g., "cardiomegaly" → "cardio" + "megaly"). Poor vector representations in embedding layers can cluster distinct concepts (e.g., "Paris" ≈ "France" in vector space despite different meanings). Lastly, architectural constraints force models to answer even when they are not certain of the answer, leading to fabrication. These errors can "snowball" into gibberish as each inferred token conditions on earlier mistakes and the error accumulates.
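The tokenizer fragmentation is easy to observe directly. The snippet below assumes the Hugging Face transformers package is installed (it downloads the GPT-2 tokenizer files on first run); the exact split depends on the learned BPE vocabulary, so treat the output as illustrative.

```python
# Observe how a subword tokenizer fragments a rare clinical term.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for word in ["heart", "cardiomegaly"]:
    pieces = tokenizer.tokenize(word)
    print(f"{word!r} -> {pieces} ({len(pieces)} token(s))")
# Common words typically map to a single token, while rare terms are broken
# into fragments that carry little meaning on their own.
```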
Model decoding strategies and architectural limits amplify errors during text generation. Higher "temperature" settings in stochastic sampling boost creativity but also hallucinations. For instance, setting temperature=0.7 for brainstorming might yield novel ideas, but the same setting in legal document drafting could invent non-existent precedents.
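Here is a small sketch of how temperature reshapes the sampling distribution over a toy vocabulary; the logits are invented for illustration. At low temperature the top token dominates, while at high temperature improbable tokens are drawn far more often.

```python
# Temperature-scaled softmax sampling over toy logits.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Paris", "Lyon", "Marseille", "Atlantis"]   # toy vocabulary
logits = np.array([4.0, 2.0, 1.5, 0.5])              # made-up model scores

def sample(logits, temperature, n=1000):
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    draws = rng.choice(len(logits), size=n, p=probs)
    return {vocab[i]: int((draws == i).sum()) for i in range(len(vocab))}

print("T=0.2:", sample(logits, 0.2))   # almost always the top token
print("T=1.5:", sample(logits, 1.5))   # the implausible "Atlantis" appears far more often
```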
LLMs prioritize fluency over factual fidelity, especially in long responses. A model summarizing a 10-page report might omit critical details and invent connections to maintain coherence, a phenomenon researchers term faithfulness hallucination. Over extended interactions, LLMs may also lose coherence or alignment, known as context drift. Smaller LLMs often lack the knowledge depth to calibrate confidence accurately, leading to overconfidence in their inferences.
When language models predict the next word in a sentence, they use a mathematical function called softmax to turn their predictions into probabilities for each word in the vocabulary. The softmax function uses a "hidden state" (a summary of what the model knows so far) to decide how likely each word is. But this hidden state is much smaller than the vocabulary, limiting its expressiveness. Because of this size difference, the model cannot represent all possible patterns of language, a limitation known as the softmax bottleneck. It is like trying to describe a detailed picture using only a few crayons: you just can't capture all the details. As a result, the model may not be able to assign the right probabilities to words, especially when there are many possible correct answers. Even a well-trained model cannot fully use what it has learned because of this bottleneck.
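A toy numerical illustration of the bottleneck, with made-up dimensions: when the hidden size d is much smaller than the vocabulary size V, the matrix of logits across many contexts has rank at most d, so the model cannot realize arbitrary per-context next-word distributions.

```python
# Rank limitation behind the softmax bottleneck, on made-up dimensions.
import numpy as np

rng = np.random.default_rng(0)
d, V, n_contexts = 8, 1000, 200          # hidden size << vocabulary size

H = rng.normal(size=(n_contexts, d))     # hidden states for 200 contexts
W = rng.normal(size=(d, V))              # output embedding / softmax weights

logits = H @ W                           # (200, 1000) matrix of scores
print("logit matrix shape:", logits.shape)
print("rank of logit matrix:", np.linalg.matrix_rank(logits))  # at most d = 8
# A fully flexible family of context-specific distributions could require rank
# up to min(200, 1000); the low-rank logits are the "few crayons" in the analogy.
```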
This post builds on ongoing research into AI reliability and attempts to highlight the various pathways leading to LLM hallucinations. While hallucinations remain a challenge, advances in data quality, training frameworks, and decoding logic are steadily closing the gap between fluency and reliability in the short term. In the medium to longer term, Subbarao Kambhampati's LLM-Modulo framework8, the use of world models to construct neuro-symbolic architectures, and Yann LeCun's Joint Embedding Predictive Architecture9 hold promise. Narrow task domains may benefit from neuro-symbolic architectures that augment LLMs with domain-specific knowledge graphs for training, semantic query augmentation, and verification. Lastly, Bayesian causal models offer another promising pathway, in narrow task-specific domains, to mitigate the risk of hallucinations by combining probabilistic reasoning with causal inference.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.