On the Performance of Large Language Models, Hallucinations, and Mitigation Techniques - PART 3
Retrieval-Augmented Generation (RAG) and Low-Rank Adaptation (LoRA) techniques for data-related hallucination mitigation
Published Aug 6, 2024
There are a number of mitigation techniques whose effectiveness depends on the type and root cause of the hallucination. A taxonomy breakdown is presented in the following table:
In Part 3 of the series, I will focus on two main techniques that can be used to mitigate data-related hallucinations, especially the ones related to knowledge boundaries: Retrieval-Augmented Generation (RAG) and Low-Rank Adaptation (LoRA).
In this section, Retrieval-Augmented Generation (RAG) for knowledge-intensive NLP tasks is evaluated as a mechanism to mitigate data-related hallucinations (Lewis et al., 2020). Pre-trained Large Language Models (LLMs) have demonstrated the ability to acquire a significant depth of knowledge from data, functioning as a parameterized implicit knowledge base without external memory. However, these models come with drawbacks: they face challenges in expanding or revising their memory, struggle to transparently explain their predictions, and may generate "hallucinations." Hybrid models that integrate parametric memory with non-parametric (e.g., retrieval-based) memories offer a potential solution to these issues. This approach allows for direct revision and expansion of knowledge, as well as the inspection and interpretation of accessed information.
A model with Retrieval-Augmented Generation (RAG) capabilities consists of a parametric memory, a pre-trained seq2seq transformer, and a non-parametric memory, a dense vector index of an external corpus (e.g., Wikipedia, enterprise documents, best-practice procedures, etc.), accessible through a pre-trained neural retriever. These components are integrated into an end-to-end trained probabilistic model, as depicted in Figure 3. The architecture incorporates a Dense Passage Retriever (DPR) that provides latent documents conditioned on the input. The seq2seq model then conditions on these latent documents, along with the input, to generate the output. Document selection is achieved through marginalization and a top-K approximation, either on a per-output basis (assuming the same document is responsible for all tokens) or a per-token basis (where different documents are responsible for different tokens). By utilizing pre-trained access mechanisms, the capability to access knowledge is readily available without additional training. The subsequent sections of this review will elaborate on the methodology proposed by this architecture.
As shown in Figure 3, the RAG model uses the input sequence x to retrieve text documents z and uses them as additional context when generating the target sequence y. The model has two components: (i) a retriever p_η(z | x) with parameters η that returns (top-K truncated) distributions over text passages given a query x, and (ii) a generator p_θ(y_i | x, z, y_{1:i-1}) parameterized by θ that generates the current token based on the previous i-1 tokens y_{1:i-1}, the original input x, and a retrieved passage z.
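To make the retriever component concrete, the sketch below shows a toy dense retriever in Python: documents and queries are embedded into a shared vector space, p(z | x) is taken proportional to the exponentiated inner product between query and passage embeddings, and only the top-K passages are kept. The random embeddings here are a stand-in for the pre-trained DPR bi-encoder used in the actual RAG architecture.

```python
import numpy as np

def top_k_retrieve(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 5):
    """Toy DPR-style retriever: score passages by inner product with the query.

    query_emb: (d,) dense query embedding q(x)
    doc_embs:  (N, d) dense passage embeddings d(z)
    Returns the indices of the top-k passages and the softmax-normalized
    probabilities p(z | x) restricted to that truncated set.
    """
    scores = doc_embs @ query_emb            # inner-product relevance scores
    top_idx = np.argsort(scores)[::-1][:k]   # top-K approximation
    top_scores = scores[top_idx]
    probs = np.exp(top_scores - top_scores.max())
    probs /= probs.sum()                     # p(z | x) over the top-K passages
    return top_idx, probs

# Example with random embeddings standing in for a pre-trained DPR encoder.
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(100, 768))
query_emb = rng.normal(size=768)
idx, p_z_given_x = top_k_retrieve(query_emb, doc_embs, k=5)
```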
For the end-to-end training of the retriever and generator, the retrieved document is regarded as a latent variable. This methodology introduces two models that employ different marginalization techniques over the latent documents to generate a distribution over the produced text: (i) RAG-Sequence, where the model utilizes the same document to predict each target token, and (ii) RAG-Token, which has the capability to predict each target token based on different documents.
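Formally, following Lewis et al. (2020), the two variants marginalize over the retrieved documents as follows, with η the retriever parameters and θ the generator parameters:

```latex
% RAG-Sequence: one retrieved document is used for the whole output sequence
p_{\text{RAG-Sequence}}(y \mid x) \approx
  \sum_{z \in \operatorname{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)
  \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})

% RAG-Token: a different document may be marginalized over for each token
p_{\text{RAG-Token}}(y \mid x) \approx
  \prod_{i=1}^{N} \sum_{z \in \operatorname{top-}k(p_\eta(\cdot \mid x))}
  p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})
```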
Figure 3 shows the architecture: a pre-trained retriever (Query Encoder + Document Index) is combined with a pre-trained seq2seq model (Generator) and fine-tuned end-to-end.
Performance evaluation of the proposed RAG architecture highlights the benefits of combining parametric and non-parametric memory for knowledge-intensive tasks, enabling automation of tasks that humans could not reasonably be expected to perform without access to an external knowledge source. The RAG models achieve state-of-the-art results on open Natural Questions and strongly outperform approaches that use specialised pre-training. For knowledge-intensive generation, the RAG-based models generate responses that are more factual, specific, and diverse than the LLM baseline results, thus showing promising mitigation of data-related hallucinations. The RAG paper evaluated in this section concludes that the non-parametric memory can be replaced to update the model's knowledge as the world evolves.
Numerous applications in natural language processing depend on customizing a single large-scale, pre-trained language model for various downstream tasks. Typically, this adaptation is achieved through fine-tuning, a process that updates all parameters of the pre-trained model. However, a significant drawback of fine-tuning is that the resulting model contains as many parameters as the original model.
Efficient fine-tuning techniques aim to adapt LLMs to downstream tasks by optimizing a small fraction of parameters in multiple ways, i.e., addition-based, specification-based, and reparameterization-based. Addition-based methods introduce extra trainable parameters or modules not present in the original model. Specification-based methods specify certain inherent model parameters to be tuned while freezing others. Reparameterization methods transform model weights into more parameter-efficient forms for tuning. The key idea is that model adaptation is low-rank, so weights can be reparameterized into low-rank factors or a low-dimensional subspace, which is the technique evaluated in this section: Low-Rank Adaptation (LoRA).
LoRA (Hu et al., 2021) enables efficient adaptation of LLMs using low-rank updates, with DeepSpeed (Rasley et al., 2020) as the training backbone. The key insight of LoRA is that the actual change in LLMs' weights required for new task adaptation lies in a low-dimensional subspace. Specifically, for a pre-trained weight matrix W0, the authors model the adapted weight matrix as W0 + ΔW, where ΔW is a low-rank update. ΔW is parameterized as ΔW = BA, where A and B are much smaller trainable matrices. The rank r of ΔW is chosen to be much smaller than the dimensions of W0.
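Concretely, for a pre-trained weight matrix W0 of shape d × k, the update is constrained to a rank-r decomposition, and the forward pass simply adds the low-rank term:

```latex
W = W_0 + \Delta W = W_0 + BA,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)

h = W_0 x + \Delta W x = W_0 x + B A x
```

In the paper, A is initialized with a random Gaussian and B with zeros, so ΔW = 0 at the start of training, and ΔW x is scaled by α/r, where α is a constant.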
The concept behind this approach is to refrain from directly training all parameters in W0. Instead, the authors opt to train the low-dimensional matrices A and B, as shown in Figure 4, which indirectly trains the update to W0 in a low-rank subspace of directions that matter for the downstream task. Consequently, this method involves substantially fewer trainable parameters compared to complete fine-tuning. In the case of GPT-3, LoRA achieves a 10,000-fold reduction in trainable parameters and a 3-fold decrease in memory usage compared to full fine-tuning.
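As an illustration, below is a minimal PyTorch sketch of a LoRA-augmented linear layer (class and parameter names are illustrative, not taken from the LoRA codebase): the frozen weight W0 stays untouched while only the small matrices A and B receive gradients.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA layer: y = x @ W0.T + (alpha / r) * x @ A.T @ B.T"""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)          # freeze the pre-trained W0
        # Low-rank factors: B is zero-initialized so training starts exactly from W0.
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def merge(self) -> None:
        """Fold BA into W0 for deployment, so inference adds no extra latency."""
        self.base.weight += self.scaling * (self.B @ self.A)

layer = LoRALinear(768, 768, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only A and B (2 * 8 * 768 values) are trainable
```

The merge step foreshadows the "no inference latency" property discussed among the advantages below.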
LoRA possesses several key advantages:
Reduced storage requirements: A pre-trained model can be shared and used to build many small LoRA modules for different tasks, and tasks can be switched efficiently by replacing the matrices A and B in Figure 4. This significantly reduces the storage requirement and task-switching overhead.
Training efficiency: LoRA makes training more efficient and lowers the required hardware and compute power by roughly a factor of three, since there is no need to calculate gradients or maintain optimizer states for most parameters; instead, the injected, much smaller low-rank matrices are optimized.
Linear design: the simple linear design allows the trainable matrices to be merged with the frozen weights when deployed, introducing no inference latency compared to a fully fine-tuned model (see the usage sketch below).
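In practice, adapters like this are rarely written by hand. The sketch below shows how a LoRA adapter might be attached to a Hugging Face model, assuming the transformers and peft packages; exact argument names may vary across versions, and "facebook/opt-350m" and the target module names are only illustrative choices.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model; any causal LM checkpoint could be used here.
base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

lora_config = LoraConfig(
    r=8,                                   # rank of the update matrices A and B
    lora_alpha=16,                         # scaling factor alpha
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of total parameters

# After training, the low-rank update can be folded back into the base weights,
# so the deployed model has no additional inference latency.
merged_model = model.merge_and_unload()
```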
Despite their impressive performance on diverse tasks, large language models (LLMs) still struggle with tasks requiring rich world knowledge, implying the difficulty of encoding a wealth of world knowledge in their parametric memory. LLMs also struggle with less popular or domain-specific knowledge. Thus, enterprise applications leveraging LLMs to accomplish tasks in a certain domain can potentially generate sentences or words out of context and present this information back to the user as fact (i.e., hallucinations). This poses a real challenge for LLM users unable to differentiate fact from fiction, particularly in highly regulated industries such as Finance, Insurance, and Healthcare, where specificity and context are vital.
LLMs are currently 'stuck' in the time frame in which they were trained. If you ask for a summary of the latest insurance rates for New York, you won't get anything useful back, or worse, you will be presented with rates from 2021.
Because of their size and complexity, today's LLMs do not revisit their sources on a regular basis, so the difficulty of obtaining up-to-date information may limit some aspects of commercial use.
LLMs are currently probabilistic and generative, meaning that if the exact same question were asked 10 times, the answer may very well change each time. This poses a significant challenge for financial services companies operating in highly regulated or protocol-driven environments, in which predictability of outcome is crucial for compliance, or simply for repeatability.
Figure 5 shows a series of techniques that can be employed to improve LLM performance. In this review, Retrieval-Augmented Generation (RAG) and fine-tuning using LoRA were evaluated. In order to implement RAG, an application would need to leverage a vector database such as Pinecone, where document embeddings can be created and persisted. From a cost point of view, RAG requires less investment compared to pre-training and fine-tuning.
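As a rough sketch of what such a RAG pipeline can look like in application code (the embedding model choice and the llm_generate helper below are illustrative placeholders, and the in-memory array stands in for a managed vector store such as Pinecone, which follows the same embed, upsert, and query pattern):

```python
from sentence_transformers import SentenceTransformer  # assumed embedding library
import numpy as np

# 1. Embed the document corpus once and persist the vectors.
#    In production these vectors would be upserted into a vector database
#    such as Pinecone; a small in-memory array keeps this sketch self-contained.
docs = [
    "Policy NY-2024-01: standard auto insurance rates for New York, effective 2024.",
    "Claims must be filed within 30 days of the incident.",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embs = embedder.encode(docs, normalize_embeddings=True)

def answer(question: str, k: int = 2) -> str:
    # 2. Retrieve the top-k most relevant passages for the question.
    q_emb = embedder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(doc_embs @ q_emb)[::-1][:k]
    context = "\n".join(docs[i] for i in top)
    # 3. Ground the generator in the retrieved context.
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm_generate(prompt)  # hypothetical call to the chosen LLM
```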
Pre-training and fine-tuning enormous language models is prohibitively expensive in terms of the hardware required and the storage cost for hosting independent instances for different tasks. The LoRA technique offers an efficient adaptation strategy with superior performance for domain adaptation; compared to RAG, its cost and complexity are higher, but it achieves better accuracy.
This literature review has explored the topic of Large Language Models (LLMs), focusing on the instances where trust in such models may be compromised. The analysis delved deeply into the phenomenon of hallucination within language models, unraveling its nuanced causes and proposing effective mitigation strategies, notably through techniques like fine-tuning using LoRA and RAG. Despite acknowledging the existence of certain limitations and challenges associated with LLMs, it is imperative to underscore the immense potential inherent in these models. Through optimization strategies, particularly in achieving domain adaptation and mitigating hallucinations, LLMs can indeed be tailored for robust and reliable enterprise-grade applications.