On the Performance of Large Language Models, Hallucinations, and Mitigation Techniques - PART 1

This multi-part study evaluates the performance of Large Language Models (LLMs), presents a taxonomy of hallucinations, and assesses various mitigation techniques.

Published Aug 6, 2024

Introduction

Natural Language Processing (NLP) research has been increasing the size of Language Models (LMs) beyond billions of parameters and, most recently, into the trillions. Given the stochastic nature of these large models, there is a risk that humans take nonsensical or unfaithful responses from the models (i.e., hallucinations) as facts. Some language models also pick up subtle biases and toxic expression patterns from their training data, which can lead to biased and harmful outputs. This review aims to understand the performance limitations of large language models, understand hallucinations, and examine how to mitigate this phenomenon. First, we review what Large Language Models are, using BERT as a reference ("Rogers et al. (2020)"). Then, we look into parametric memories and when not to trust language models ("Mallen et al. (2022)"). Next, in Part 2, we take a deeper look into one of those scenarios in which a model should not be trusted and evaluate the concept of hallucinations, their causes, and the different types of hallucinations ("Huang et al. (2023)").
Part 3, the final part of this review, will evaluate how to mitigate hallucinations using two techniques: Retrieval-Augmented Generation ("Lewis et al. (2020)") and fine-tuning ("Hu et al. (2021)"), and will provide some analysis of the trade-off between complexity and performance.

Background - Large Language Models: How BERT Works

Before delving into the causes of hallucination, let's first review the concept of Large Language Models (LLMs) ("Rogers et al. (2020)"). Typically, LLMs refer to a series of general-purpose models that leverage the Transformer-based language model architecture and undergo extensive training on massive textual corpora, with notable examples such as BERT, GPT-3, LLaMA, and more. For this literature review, I will focus on BERT to cover the basic concepts.
Essentially, BERT is a stack of Transformer encoder layers ("Vaswani et al. (2017)") with multiple self-attention "heads". For every input token in a sequence, each head computes query (Q), key (K), and value (V) vectors, which are used to create a weighted representation. The outputs of all heads in the same layer are combined and run through a fully connected layer. Each layer is wrapped with a skip connection and followed by layer normalization. See Figure 1 for a reference architecture.
Figure 1: BERT: Multi-Head attention, with Input and Output Encoder components
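To make the attention mechanism concrete, here is a minimal NumPy sketch of multi-head self-attention. The weight matrices are random stand-ins for BERT's learned parameters; the real model additionally uses biases, dropout, attention masking, skip connections, and layer normalization:

```python
# Minimal sketch of multi-head self-attention (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, n_heads, rng):
    """X: (seq_len, d_model) token representations for one sequence."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Random stand-ins for the learned projection matrices.
    W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(4))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)   # scaled dot-product
        heads.append(softmax(scores) @ V[:, sl])           # weighted sum of values
    return np.concatenate(heads, axis=-1) @ W_o            # combine heads, project

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 768))   # 5 tokens, BERT-base hidden size
out = multi_head_self_attention(X, n_heads=12, rng=rng)
print(out.shape)                    # (5, 768)
```

In BERT-base, the hidden size is 768 and there are 12 heads, so each head operates in a 64-dimensional subspace.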
The basic implementation of BERT consists of two stages: pre-training and fine-tuning. Pre-training uses two self-supervised tasks: masked language modeling (MLM, prediction of randomly masked input tokens) and next sentence prediction (NSP, predicting whether two input sentences are adjacent to each other). For fine-tuning, one or more fully connected layers are typically added on top of the final encoder layer.
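Assuming the Hugging Face `transformers` library, a minimal sketch of that fine-tuning setup looks like this; `BertForSequenceClassification` simply places a fully connected classification head on top of the final encoder layer (the head is randomly initialized until fine-tuned):

```python
# Sketch of the fine-tuning setup: a classification head on top of BERT.
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # shape (1, 2); untrained head until fine-tuned
print(logits)
```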
The input representations are computed as follows: Each word in the input is first tokenized into word-pieces, and then three embedding layers (token, position, and segment) are combined to obtain a fixed-length vector.
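A small sketch of that input pipeline, assuming the Hugging Face tokenizer for `bert-base-uncased`: words are split into word-pieces, segment ids distinguish sentence A from sentence B, positions are simply 0..seq_len-1, and BERT sums the three corresponding embedding lookups element-wise:

```python
# Word-piece tokenization plus the three input signals BERT combines.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("Hallucinations mislead users.", "Mitigation helps.")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))  # [CLS] ... [SEP] ... [SEP]
print(enc["token_type_ids"])   # segment ids: 0 for sentence A, 1 for sentence B
# Position ids are 0..seq_len-1; BERT adds token, position, and segment embeddings.
```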
An important capability of BERT is its capacity to capture syntactic, semantic, and world knowledge encoded in its weights (i.e., parametric knowledge). Syntactic analysis involves analyzing the grammatical syntax of a sentence to understand its meaning. In the case of semantic analysis, a language model understands the meaning of a text by analyzing the text as a whole rather than just looking at individual words; the context in which a word is used is very important here. World knowledge refers to facts and concepts the model acquires through self-supervised training on extensive textual corpora.
Let's review each of these.

Syntactic Knowledge

BERT representations are hierarchical rather than linear (see Figure 2, taken from the original paper). A syntactic tree structure can be used to represent word-order information. BERT embeddings also encode information about parts of speech, syntactic chunks, and roles. Enough syntactic information seems to be captured in the token embeddings themselves to recover syntactic trees. Further model testing showed the following:
- Classifiers could not recover the labels of distant parent nodes in syntactic trees.
- Syntactic structure is not directly encoded in self-attention weights, but syntactic information can be recovered from BERT token representations (see the probe sketch below).
- Transformation matrices successfully recovered syntactic dependencies in PennTreebank data from BERT’s token embeddings.
Figure 2: syntactic knowledge: words sharing syntactic subtrees have larger impact on each other
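As a concrete (toy) illustration of the probing idea, the sketch below fits a linear classifier on frozen BERT token representations to predict a coarse part-of-speech label. It assumes the Hugging Face `transformers` library and scikit-learn, and uses a tiny hand-labeled word list in place of a real treebank:

```python
# Toy probing classifier on frozen BERT token representations.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tiny hand-labeled examples (word, POS); a real probe uses a treebank.
data = [("dogs", "NOUN"), ("run", "VERB"), ("cats", "NOUN"),
        ("sleep", "VERB"), ("birds", "NOUN"), ("fly", "VERB")]

feats, labels = [], []
for word, tag in data:
    enc = tokenizer(f"the {word} there", return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, 768)
    feats.append(hidden[2].numpy())                  # representation of `word`
    labels.append(tag)

probe = LogisticRegression(max_iter=1000).fit(feats, labels)
print(probe.predict([feats[0]]))                     # recovers "NOUN" on training data
```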
 

Semantic Knowledge

Probing classifiers have shown that BERT encodes information about entity types, relations, semantic roles, and proto-roles. However, BERT struggles with representations of numbers: addition tasks showed that BERT does not form good representations for floating-point numbers and fails to generalize from the training data. This is due to BERT's word-piece tokenization, since numbers of similar values can be divided into substantially different word chunks, creating an architectural shortcoming for BERT.
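A quick way to see the word-piece issue, assuming the Hugging Face tokenizer: numerically close values can be split into quite different piece sequences, so their representations are not smoothly related:

```python
# Inspect how BERT's word-piece tokenizer splits similar numbers.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
for number in ["710", "710.5", "711", "12345.67"]:
    print(number, "->", tokenizer.tokenize(number))
```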

World Knowledge

Model evaluations of how BERT captures common-sense world knowledge can be summarized as follows:
- BERT struggles with pragmatic inference and role-based event knowledge.
- BERT also struggles with abstract attributes of objects, as well as visual and perceptual properties that are likely to be assumed rather than mentioned.
- The MLM component of BERT (masked language modeling, prediction of randomly masked input tokens) is easy to adapt for knowledge induction by filling in the blanks (e.g., “Babies like to eat [ ]”); see the fill-mask sketch after this list.
- For some relation types and open-domain Question Answering, BERT is competitive with methods relying on knowledge bases, and it generalizes better to unseen data.
- Good prompt engineering with template sentences is needed in order to retrieve BERT’s knowledge.
- Syntax information is not encoded in Self-Attention weights, but can be obtained through BERT Token representations.
- Certain BERT attention heads attend to different syntactical structures.
- The final layers of BERT are the most task-specific.
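Here is a minimal fill-mask sketch of that knowledge-induction idea, assuming the Hugging Face `pipeline` API and the `bert-base-uncased` checkpoint:

```python
# Probe BERT's world knowledge by filling in a masked token.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("Babies like to eat [MASK]."):
    print(f'{pred["token_str"]:>12}  score={pred["score"]:.3f}')
```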
However, BERT cannot reason based on its world knowledge. BERT attempts to guess the potential action, meaning, and properties of many objects, but it cannot reason about the relationship between properties and actions. For example, it "knows" that people can fly planes and that planes are big, but it cannot infer that planes are bigger than people.
Much of BERT's world-knowledge success comes from learning stereotypical associations; for example, a person with a female-sounding name is predicted to be a woman, even when that is incorrect. Moreover, performance drops as the number of inference steps increases.

BERT Layers

As mentioned before, BERT is essentially a stack of Transformer encoder layers. The first layer of BERT receives inputs as a combination of token, segment, and positional embeddings. The lower layers have the most information about linear word order; moving up the stack, this is accompanied by increasing knowledge of hierarchical sentence structure. The middle layers are the most prominent in capturing syntactic information, including subject-verb agreement. Semantic information and features appear at the higher layers, drawing parallels between this order and the order of components in a typical NLP pipeline, from POS tagging to dependency parsing to semantic role labeling. Lastly, the final layers of BERT are the most task-specific. In pre-training this means specificity to the MLM task, which explains why the middle layers are more transferable; in fine-tuning, it explains why the final layers change the most, and why restoring the weights of the lower layers of fine-tuned BERT to their original values does not dramatically hurt model performance.
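A short sketch of how such layer-wise analyses are typically set up, assuming the Hugging Face `transformers` library: requesting `output_hidden_states=True` returns the embedding output plus the output of each of BERT-base's 12 encoder layers, which can then be probed individually:

```python
# Extract per-layer representations for layer-wise probing.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("The pilot who flew the plane landed safely.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states   # tuple of 13 tensors

print(len(hidden_states))       # 13: embedding output + 12 encoder layers
print(hidden_states[6].shape)   # a middle layer: (1, seq_len, 768)
```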
 

When Not to Trust Large Language Models

Parametric and non-parametric knowledge.
This section of the review aims to understand Large Language Models' strengths and limitations in memorizing factual knowledge ("Mallen et al. (2022)").
As discussed in the previous section, large pre-trained LLMs such as BERT ("Devlin et al. (2019)") can memorize a significant amount of world knowledge in their parameters (parametric knowledge) and can achieve competitive performance on open-domain QA. More recent and powerful LLMs further improve performance on diverse knowledge-intensive tasks by leveraging their strong parametric memories. However, relying solely on their parameters to encode a wealth of world knowledge requires a prohibitively large number of parameters, and the knowledge can become obsolete quickly, since LLMs are frozen at the point in time when their training took place. Recent work shows that augmenting LLMs with non-parametric memories (i.e., retrieval-augmented generation, RAG) enables much smaller models to match the performance of larger models.

Memorization performance

There is a positive relationship between string frequency in pre-training corpora and memorization. Co-occurrence of question and answer relations in pre-training corpora has a positive correlation with models' QA accuracy on popular open-domain knowledge. Furthermore, model evaluation results show that memorization has a strong correlation with entity popularity and that scaling up models may provide only marginal improvements. There is a relationship between scaling, popularity, relationship type, and an LLM's ability to capture world knowledge, but this depends on a definition of popularity that is time-dependent and may not perfectly reflect how frequently entities are discussed or updated on the web. Conversely, LLM accuracy is lower for relationship types with low popularity, where the model may try to "guess" the answer to questions of a certain relationship type.

Non-parametric Memory Complements Parametric Memory

Current state-of-the-art LLMs still struggle with less popular subjects or certain relationship types, and increasing the model size does not lead to further performance improvements. As a result, non-parametric sources of knowledge can be used to improve model performance on less popular facts (or domain-specific facts). Specifically, retrieval-augmented generation (RAG), which leverages non-parametric memories (i.e., text chunks retrieved from vector databases), improves performance and mitigates hallucination problems, which will be further explored in the next section. However, non-parametric memories can mislead language models on more popular entities. Thus, a hybrid architecture that uses both approaches (i.e., parametric and non-parametric memories) will also be evaluated in Part 3 (Hallucination Mitigation Techniques).
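To make the hybrid idea concrete, here is a toy, self-contained sketch of adaptive retrieval: popular entities are answered from parametric memory alone, while less popular ones get retrieved context prepended to the prompt. The popularity table, document store, and `llm_generate` function are hypothetical stand-ins, not a real API:

```python
# Toy adaptive-retrieval routing between parametric and non-parametric memory.
POPULARITY = {"Albert Einstein": 1_000_000, "Jan Lukasiewicz": 4_200}   # made-up counts
DOC_STORE = {"Jan Lukasiewicz": ["Jan Lukasiewicz invented Polish notation."]}
POPULARITY_THRESHOLD = 10_000   # tuned per dataset in a real system

def llm_generate(prompt: str) -> str:
    # Placeholder for a real LLM call (purely parametric generation).
    return f"<model answer for: {prompt!r}>"

def retrieve_chunks(entity: str, k: int = 3) -> list[str]:
    # Placeholder for a retriever over a vector database.
    return DOC_STORE.get(entity, [])[:k]

def answer(question: str, entity: str) -> str:
    if POPULARITY.get(entity, 0) >= POPULARITY_THRESHOLD:
        prompt = question                                  # trust parametric memory
    else:
        context = "\n".join(retrieve_chunks(entity))
        prompt = f"Context:\n{context}\n\nQuestion: {question}"   # ground in retrieved text
    return llm_generate(prompt)

print(answer("Who invented Polish notation?", "Jan Lukasiewicz"))
print(answer("Who developed general relativity?", "Albert Einstein"))
```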
 
