Reliable AI - Powering Trustworthy Enterprise Intelligence

Hallucinations in Large Language Models (LLMs) are not a manifestation of bugs, but a consequence of the next-token prediction architecture. Users of LLMs need to incorporate hallucination mitigation strategies in order to deliver reliable, trustworthy inferences.

Sajjad Khazipura
Amazon Employee
Published Oct 5, 2024
Authored by: Sajjad H Khazipura, Amlan Chakraborty and Jonathan Cham
Welcome to the first in a series of conversations on Reliable AI. As businesses turn to AI to unlock productivity, the need for reliable and trustworthy AI systems has never been greater. AI has the potential to transform how organizations operate, but only if these systems can be trusted to deliver accurate, reliable and consistent results. By prioritizing these capabilities, enterprises can safeguard their reputation and earn their customers' trust.
In this blog post and subsequent posts, we will focus on the topic of accuracy in light of LLM hallucinations - the occasional factual inaccuracies and logical inconsistencies that occur in the inferences delivered. Such hallucinations are a source of concern because of the potential for reputational damage and financial liability, accompanied by an erosion of customer trust.
Enterprises such as DPD and Air Canada, among many others, have been challenged by LLM hallucinations leading to poor customer experiences. Technology companies such as ServiceNow, Salesforce, and Microsoft, to name a few, are dealing with LLM hallucinations using various mitigation mechanisms.
Forrester analyst Rowan Curran said concerns about erratic AI model behavior are causing most organizations to keep applications internal until reliability improves. Ritu Jyoti, group vice president for worldwide AI at International Data Corp., said: “It (hallucination) comes up with 100% of the clients I speak to. If something goes wrong, it can be very detrimental to an organization.” Gartner analyst Avivah Litan points to the unpredictability of such hallucination occurrences.
The origins of LLM hallucinations may be traced to the next-token prediction mechanism that underlies LLM architectures. Knowledge gaps in an LLM's understanding of a subject are one of the key reasons for these hallucinations. Techniques such as Retrieval Augmented Generation (RAG) are useful in mitigating this risk; while RAG does reduce the risk of hallucinations, it does not bring the risk down to zero. Efforts to characterize such failures point towards a lack of math, reasoning, and planning skills beyond the knowledge gaps identified. In-context learning, multi-shot chains of thought, and augmented prompting techniques can mitigate these risks, but then it is the prompting that drives the outcomes, and that does not generalize well.
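To make the prompting techniques concrete, here is a minimal few-shot chain-of-thought prompt: a couple of worked examples steer the model toward showing its intermediate arithmetic before answering. The wording and the pricing scenarios are invented for illustration.

```python
# A minimal few-shot chain-of-thought prompt (illustrative scenarios only):
# the worked examples encourage the model to reason step by step before answering.
prompt = """Q: A subscription costs $12/month with a 25% annual discount. What is the yearly price?
A: The monthly total is 12 * 12 = 144. The discount removes 0.25 * 144 = 36. Final answer: $108.

Q: An order of 3 items at $19 each ships for $7. What is the total?
A: The items cost 3 * 19 = 57. Adding shipping gives 57 + 7 = 64. Final answer: $64.

Q: A license costs $30 per seat for 14 seats with a $50 setup fee. What is the total?
A:"""
```

Because the improvement comes from the prompt itself, the gains tend not to generalize beyond the pattern the examples demonstrate.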
The conversation on hallucination mitigation would not be complete without discussing a means of measuring the propensity for hallucinations among LLMs. A number of benchmarks and corresponding leaderboards exist; prominent among them are Vectara's HHEM evaluation model and the HHEM-based leaderboard at Hugging Face. Likewise, Galileo has built and maintains a hallucination index for leading models. Leading open-source evaluation benchmarks include HaluEval and another from Hugging Face. The Vectara leaderboard illustrates the point that no LLM is immune from the risk of hallucinations, even though the propensity varies.
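The same idea can be applied to your own outputs: score whether a generated answer is actually entailed by the source it was supposed to be grounded in. The sketch below is a minimal stand-in that uses an off-the-shelf natural language inference cross-encoder via the Hugging Face pipeline API; it is not the HHEM model or the HaluEval protocol, and the model name and example texts are assumptions for illustration.

```python
from transformers import pipeline

# Stand-in factual-consistency check: does the source entail the generated answer?
# (Illustrative only; HHEM and HaluEval use their own models and evaluation protocols.)
nli = pipeline("text-classification", model="cross-encoder/nli-deberta-v3-base", top_k=None)

source = "Our refund policy allows returns within 30 days of purchase."
answer = "Returns are accepted for up to 90 days."

# The pipeline accepts a premise/hypothesis pair via text and text_pair.
scores = nli({"text": source, "text_pair": answer})
print(scores)  # a dominant 'contradiction' score flags a likely hallucination
```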
Aside from the conventional techniques prevalent today - such as improved prompt engineering, in-context learning, zero-shot and few-shot chains of thought, self-reflection, and RLHF - we explore both existing and emerging techniques below.
LLM in the Loop and Ensemble Methods
A common and prevalent design pattern is to use another LLM, from a different family than the LLM used for inferencing, to act as a verifier, judge, or critic of the inferences delivered. This is an effective technique that helps minimize hallucinations, but we are still dependent on another next-token prediction engine.
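A minimal sketch of this pattern on Amazon Bedrock is shown below, using the Converse API with one model as the generator and a model from a different family as the judge. The model IDs, prompts, and the VALID/FLAG convention are assumptions for illustration; substitute models enabled in your own account and region.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Example model IDs; substitute models enabled in your account and region.
GENERATOR_MODEL = "anthropic.claude-3-sonnet-20240229-v1:0"
JUDGE_MODEL = "mistral.mistral-large-2402-v1:0"

def invoke(model_id: str, prompt: str) -> str:
    """Send a single-turn prompt through the Bedrock Converse API."""
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

question = "What is the refund window for online orders?"
draft = invoke(GENERATOR_MODEL, question)

# A model from a different family critiques the draft for unsupported claims.
verdict = invoke(
    JUDGE_MODEL,
    f"Question: {question}\nDraft answer: {draft}\n"
    "Does the draft contain claims that are not supported by the question or provided context? "
    "Reply VALID, or FLAG followed by a one-sentence reason.",
)
print(verdict)
```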
Ensembles of LLMs can likewise mitigate the risk of hallucinations; common patterns include the following (a minimal voting sketch follows the list):
  • Model Voting where the majority vote wins
  • Model Stacking wherein the outputs of multiple models are fed into a secondary model to choose the right answer
  • Employing Diverse Models for cross verification
  • Prompt Chaining in which Model A generates an output, Model B verifies and comments on the output and Model C delivers the final answer
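Here is a minimal sketch of the model-voting pattern, under the assumption that each model is wrapped in a simple callable mapping a prompt to a text answer; the normalize helper and the NO_CONSENSUS convention are hypothetical details added for illustration.

```python
from collections import Counter
from typing import Callable, List

def normalize(text: str) -> str:
    # Light normalization so trivially different phrasings can still match.
    return text.strip().lower().rstrip(".")

def majority_vote(question: str, models: List[Callable[[str], str]]) -> str:
    """Query several independent models and return the most common answer.

    `models` is any list of callables mapping a prompt to a text answer,
    e.g. thin wrappers over different providers' APIs.
    """
    answers = [normalize(model(question)) for model in models]
    winner, count = Counter(answers).most_common(1)[0]
    if count == 1:
        # No agreement at all: treat as low confidence and escalate for review.
        return "NO_CONSENSUS"
    return winner
```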
A strategy used by a leading ISV customer involves a cascade of progressively more capable models. In this approach, the least capable model is used to generate the inference and the quality of the inference is scored. If the model output does not meet the quality threshold, the inference process is repeated using the next, larger and more capable model in the cascade. This process of inference and quality checks continues until the desired quality goals are met.
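A sketch of that cascade, assuming the models are ordered from cheapest to most capable and that a separate quality scorer (for example an LLM judge or a consistency model) returns a value between 0 and 1, might look like this; the threshold is an arbitrary placeholder.

```python
from typing import Callable, List, Tuple

def cascade_inference(
    prompt: str,
    models: List[Callable[[str], str]],   # ordered from cheapest to most capable
    score: Callable[[str, str], float],   # quality scorer returning a value in [0, 1]
    threshold: float = 0.8,               # placeholder quality bar
) -> Tuple[str, int]:
    """Try progressively more capable models until the output clears the quality bar."""
    answer = ""
    for tier, model in enumerate(models):
        answer = model(prompt)
        if score(prompt, answer) >= threshold:
            return answer, tier          # good enough at this tier; stop early
    # Even the most capable model fell short: return its answer but flag it for review.
    return answer, len(models) - 1
```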
Retrieval Augmented Generation, Re-Rankers & Rank Fusion
Aside from serving as a context expansion tool leveraging external knowledge, we are seeing the emergence of variants of the RAG architecture such as Contextual RAG from Anthropic, which adds additional document context to the document chunks to aid better inferencing. Amazon OpenSearch Service hybrid search and sparse-dense vector retrievals aim to exploit the best of semantic and keyword search, followed by rank fusion and reranking of results to deliver improved accuracy and relevance. Late Interaction Models such as ColBERT aim to improve document rankings. Lastly, emerging paradigms such as Corrective RAG (CRAG) aim to evaluate the quality of retrievals, with the outcomes driving an expanded scope of retrieval. Taking it one step further is Dynamic Retrieval Augmented Generation, which adapts to the dynamic information needs of LLMs.
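To illustrate the fusion step, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge keyword and vector result lists. Note that OpenSearch's hybrid query has its own score normalization and combination pipeline, so this is a conceptual illustration rather than that implementation; the document IDs and the constant k=60 are illustrative defaults.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(result_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked document-ID lists (e.g. keyword + vector results) with RRF.

    Each document earns 1 / (k + rank) from every list it appears in; summing
    the contributions rewards documents that rank well under both retrievers.
    """
    scores: Dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: the keyword and vector retrievers partially disagree; RRF promotes the overlap.
keyword_hits = ["doc-7", "doc-2", "doc-9"]
vector_hits = ["doc-2", "doc-4", "doc-7"]
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))  # doc-2 and doc-7 rise to the top
```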
LLMs Augmented with World Models and GraphRAG
World Models describe the environment in which AI agents operate. This knowledge may manifest itself as Knowledge Graphs, physics-based models, and other symbolic representations such as code, rules, workflows, etc.
Explicit knowledge representation via Knowledge Graphs can help mitigate the risk of hallucinations. Such graph models can be combined with LLMs to surface explicit knowledge that may be used to (see the sketch after this list):
a) Train, fine-tune and continually pre-train LLMs with explicit knowledge
b) Run inference time verification by comparing the inferences with knowledge encoded in Graphs
c) Augment the user's query semantically so that it is more explicit
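As a concrete illustration of (b), inference-time verification, the sketch below checks claims extracted from a generated answer against a small set of knowledge-graph triples. The product facts are invented, and the extraction of (subject, predicate, object) triples from free text is assumed to be handled by a separate component.

```python
# Invented facts standing in for triples exported from an enterprise knowledge graph.
KNOWLEDGE = {
    ("acme_pro", "max_warranty_months", "24"),
    ("acme_pro", "ships_to", "eu"),
}

def verify_claims(claimed_triples):
    """Label generated claims as supported, contradicted, or unverifiable."""
    report = []
    for subj, pred, obj in claimed_triples:
        if (subj, pred, obj) in KNOWLEDGE:
            report.append((subj, pred, obj, "supported"))
        elif any(s == subj and p == pred for s, p, _ in KNOWLEDGE):
            report.append((subj, pred, obj, "contradicted"))   # the graph records a different value
        else:
            report.append((subj, pred, obj, "unverifiable"))   # a knowledge gap, not a contradiction
    return report

# Suppose a claim extractor turned the LLM's answer into triples:
print(verify_claims([("acme_pro", "max_warranty_months", "36")]))  # flagged as contradicted
```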
Such systems are frequently referred to as Neuro-Symbolic systems as they combine neural networks (LLMs) with symbolic knowledge representation.
GraphRAG converts the input corpus into an explicit knowledge graph, clusters the key topics into related communities and runs classical RAG tasks over such knowledge for improved inferences.
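The clustering step can be pictured with a toy graph and NetworkX's modularity-based community detection; the entities and edges below are invented, and in a real GraphRAG pipeline an LLM performs the entity and relation extraction, and each community is then summarized to serve as a retrieval unit.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy entity-relationship graph standing in for one extracted from a corpus.
G = nx.Graph()
G.add_edges_from([
    ("InvoiceService", "PaymentGateway"),
    ("PaymentGateway", "FraudRules"),
    ("Warehouse", "ShippingPartner"),
    ("ShippingPartner", "ReturnsDesk"),
])

# Cluster the graph into topical communities; each community would then be
# summarized, and those summaries become the units retrieved at query time.
for i, community in enumerate(greedy_modularity_communities(G)):
    print(f"community {i}: {sorted(community)}")
```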
Building semantic models of the world is labor intensive, and efforts are underway to automatically construct such knowledge bases and ontologies. Amazon, Google, Microsoft, Cyc, Diffbot, and organizations such as Wikipedia are building and curating knowledge represented as knowledge graphs for use by AI assistants such as Alexa, Google Home, etc.
LLMs and World Models augmented with Reasoners
LLM inferences may be channeled through logical reasoning engines for logical consistency checks and/or factual consistency checks against known knowledge modeled as World Models/Knowledge Graphs. Logical reasoning may also be applied to verify factual and logical consistency across multiple sources of truth.
Essential to reasoning are mathematical and logic tools such as Formal Logic, Theorem Provers, SAT Solvers, SMT Solvers, and Bayesian Methods, which may be applied to reason over multiple sources of knowledge to establish factual accuracy and logical consistency.
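As a small illustration, the sketch below uses the Z3 SMT solver (the z3-solver Python package) to check whether claims extracted from an answer are consistent with a pair of domain rules; the shipping-policy rules and the extracted claims are invented for the example.

```python
from z3 import Bool, Implies, Not, Solver, unsat

# Domain rules encoded as propositional constraints (a miniature world model):
# premium customers always get free shipping, and free shipping excludes a shipping fee.
premium, free_shipping, shipping_fee = Bool("premium"), Bool("free_shipping"), Bool("shipping_fee")
rules = [Implies(premium, free_shipping), Implies(free_shipping, Not(shipping_fee))]

# Claims extracted from an LLM answer: "the premium customer owes a shipping fee".
claims = [premium, shipping_fee]

solver = Solver()
solver.add(rules + claims)
if solver.check() == unsat:
    print("Answer is logically inconsistent with the world model; flag or regenerate.")
else:
    print("No inconsistency detected by the solver.")
```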
Large Reasoner Models (LRMs)
The recent launch of Large Reasoner Models such as OpenAI o1 promised some welcome news. These models are believed to possess enhanced reasoning delivered through test-time chains of thought and train-time reinforcement learning. However, they rest on the same LLM foundation of a next-token predictor and are therefore exposed to the same drawbacks discussed earlier.
LLM Modulo Framework for Planning
This is a planning framework proposed by researchers at Arizona State University that combines the strengths of large language models (LLMs) and external verifiers, critics, and judges in a tight, bidirectional protocol, providing a more integrated neuro-symbolic approach.
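The sketch below captures the generate-test-critique spirit of that proposal in a simplified loop; it is not the authors' reference implementation. The generator is assumed to be any LLM callable, and each critic is an external, sound verifier that either accepts a candidate or returns feedback that is folded into the next prompt.

```python
from typing import Callable, List, Tuple

def llm_modulo_loop(
    task: str,
    generate: Callable[[str], str],                          # LLM proposes a candidate plan/answer
    critics: List[Callable[[str, str], Tuple[bool, str]]],   # external verifiers: (accepted, feedback)
    max_rounds: int = 5,
) -> str:
    """Simplified generate-test loop: the LLM only proposes, external critics dispose."""
    prompt = task
    candidate = ""
    for _ in range(max_rounds):
        candidate = generate(prompt)
        feedback = [msg for ok, msg in (critic(task, candidate) for critic in critics) if not ok]
        if not feedback:
            return candidate  # every critic accepted the candidate
        # Fold the critiques back into the prompt for the next round.
        prompt = f"{task}\n\nPrevious attempt:\n{candidate}\n\nFix these issues:\n" + "\n".join(feedback)
    return candidate  # best effort after max_rounds; callers should treat it as unverified
```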
In our upcoming posts, we'll dive deeper into topics ranging from the nature of hallucinations and why they occur, to evaluation benchmarks and hallucination measurement techniques, steps to mitigate the risks they pose, and real-world case studies - and from time to time we'll peer over the horizon at frontier research.
Stay tuned for more in this series on Reliable AI.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
