Mastering Amazon Bedrock Custom Models Fine-tuning (Part 1): Getting started with Fine-tuning

Explore fine-tuning and RAG techniques for foundation models, with guidance on choosing the appropriate approach for your use case and a demonstration of the fine-tuning process, including data preparation, hyperparameter tuning, and evaluation.

Haowen Huang
Amazon Employee
Published Jul 17, 2024
Large language models (LLMs) like Meta’s Llama, Cohere’s Command, Amazon’s Titan, and Anthropic’s Claude have revolutionized the way we approach language tasks. They are pre-trained on vast amounts of text data and can be adapted to various downstream tasks through a process known as fine-tuning.
Fine-tuning is a technique that involves further training a pre-trained language model on a specific task or domain, using a smaller dataset relevant to that task. By doing so, the model can learn to better understand and generate text tailored to the particular context, resulting in improved performance and accuracy.
However, there are situations where fine-tuning may not be the most appropriate approach. In such cases, a retrieval-augmented generation (RAG) approach can be more suitable. RAG combines the power of foundation models with external knowledge sources, allowing them to access and incorporate relevant information from databases or document collections during the generation process.
In this blog post, we'll explore the fundamentals of fine-tuning and RAG and provide guidance on choosing the right approach for your use case. We’ll cover:
  • A brief overview of fine-tuning
  • A brief background on Retrieval-Augmented Generation (RAG)
  • The criteria for choosing between fine-tuning and RAG
  • Getting started with fine-tuning

Understanding Fine-tuning

Fine-tuning is a powerful technique when you need to adapt a foundation model to a specialized task or domain. For instance, if you're building a customer service chatbot for a particular industry, fine-tuning a pre-trained model on relevant customer service data from that industry can significantly enhance its understanding of domain-specific terminology, jargon, and context.
One key advantage of fine-tuning over retrieval-augmented generation (RAG) approaches is its potential for improved performance and lower latency during inference, as there is no additional retrieval step involved. This makes fine-tuned models well-suited for scenarios where low latency and high throughput are critical, such as real-time conversational AI applications.
However, fine-tuning comes with its own set of challenges. It typically requires a larger investment compared to RAG, as it necessitates labeled, curated data for training, as well as additional computational resources for the fine-tuning process itself. Additionally, fine-tuned models may struggle with rapidly changing data, as the model would need to be periodically retrained to incorporate new information effectively.
The fine-tuning approach can be illustrated with the following high-level diagram:
Illustration diagram of fine-tuning generated by Claude 3 Sonnet in Amazon Bedrock
A practical example of fine-tuning can be found in the "Generative AI on AWS" book. The authors provide sample code that walks users through the process of fine-tuning the Llama 2 model on a subset of the Dolly dataset using Amazon SageMaker JumpStart.
The example covers various aspects, including data preparation, defining fine-tuning hyperparameters, creating a SageMaker estimator, launching the fine-tuning job, evaluating the fine-tuned model's performance, and deploying it to a SageMaker endpoint. This notebook leverages Amazon SageMaker's capabilities for efficient and scalable fine-tuning of large language models, providing a comprehensive workflow from start to finish.

Understanding Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is an approach that combines the power of large language models (LLMs) with information retrieval techniques. In a RAG setup, the language model generates text based on the provided prompt, and a separate retrieval component fetches relevant information from a knowledge base or corpus to augment the model's output.
RAG is particularly useful when working with frequently changing data or when the domain knowledge is too broad to be effectively captured by a fine-tuned model alone. News agencies, media outlets, and organizations dealing with rapidly evolving information often benefit from RAG approaches, as they can easily update the knowledge base without retraining the entire language model.
One of the key advantages of RAG is its flexibility and ease of implementation. Since no extensive training is required, RAG systems can be set up relatively quickly and at a lower initial cost compared to fine-tuning. However, RAG models tend to be slower than fine-tuned models due to the additional retrieval step, and they can become complex due to the involvement of multiple components like vector databases, embedding models, and document loaders.
The Retrieval-Augmented Generation (RAG) approach is illustrated in the following high-level diagram:
Illustration diagram of RAG generated by Claude 3 Sonnet in Amazon Bedrock
The Amazon Bedrock workshop provides a compelling example of RAG implementation by leveraging several years of Amazon’s Letters to Shareholders as a text corpus. This external knowledge base allows the RAG system to achieve better question-answering results by retrieving relevant information from the corpus. By augmenting the language model's output with this retrieved knowledge, the foundation models can generate more context-specific and accurate responses without requiring continuous retraining.
A notable advantage of this RAG implementation is the source attribution for retrieved information, which improves transparency and minimizes the risk of hallucinations, ensuring that the generated responses are grounded in factual data.
The workshop illustrates a customized RAG workflow, where the language model and the retrieval component work together to generate augmented responses. The following diagram depicts this workflow:
Customized RAG workflow
In this workflow, the input question is first used to retrieve relevant passages from the corpus of Amazon's Letters to Shareholders. The retrieved knowledge is then combined with the original prompt and passed to the language model, producing an augmented response that draws on both the model's capabilities and the external knowledge source.
The full code for this RAG implementation, including the customized workflow and the integration with Amazon's Letters to Shareholders corpus, can be accessed at the following GitHub link: https://github.com/aws-samples/amazon-bedrock-workshop/blob/main/02_KnowledgeBases_and_RAG/3_Langchain-rag-retrieve-api-claude-3.ipynb
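The workshop notebook builds this flow with LangChain and the Knowledge Bases Retrieve API; as a simpler hedged sketch of the same pattern, the call can be expressed directly with boto3's RetrieveAndGenerate API, assuming a Knowledge Base has already been created (the knowledge base ID below is a placeholder):

```python
# A minimal RAG sketch against Amazon Bedrock Knowledge Bases.
# Assumes an existing Knowledge Base; the ID and region are placeholders.
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.retrieve_and_generate(
    input={"text": "How has Amazon's investment in machine learning evolved?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "<your-knowledge-base-id>",  # hypothetical placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        },
    },
)

# The generated answer, grounded in the retrieved passages.
print(response["output"]["text"])

# Citations provide the source attribution mentioned above.
for citation in response.get("citations", []):
    for ref in citation.get("retrievedReferences", []):
        print(ref["location"])
```

Because the retrieval, prompt augmentation, and generation happen in one managed call, this variant also returns the citations that make source attribution possible.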

Choosing the Right Approach

So, when should you choose fine-tuning over RAG, or vice versa? The decision ultimately depends on your specific requirements and use case. Here are some general guidelines:

Fine-Tuning

When to Use:
  • Specialized Tasks: Fine-tuning is ideal for narrow, specialized tasks where precision and performance are paramount. For example, if you're developing a medical diagnosis model, fine-tuning on a curated dataset of medical records will yield highly accurate results.
  • High Performance and Low Latency: If your application demands low latency and high throughput, fine-tuning is the better option. Fine-tuned models do not require an additional retrieval step, making them faster in inference.
  • Curated Datasets: If you have access to a well-defined, labeled, and curated dataset relevant to your specific task, fine-tuning can leverage this data to optimize performance.
  • Quality of Prediction: For tasks where the quality and accuracy of predictions are critical, fine-tuning allows you to tailor the model closely to your specific requirements.
Pros:
  • High Performance: Optimized for specific tasks, leading to better accuracy and performance.
  • Low Latency: Faster inference times as there is no need for an additional retrieval step.
  • Task Specificity: Tailored to perform exceptionally well on the specific task it is trained on.
Trade-Offs:
  • Cost: Fine-tuning requires substantial initial investment in training, including preprocessing costs for scraping, transforming, and cleaning the data.
  • Loses Generalization: Fine-tuned models are highly specialized, meaning different models are needed for different tasks.
  • Not Ideal for Frequently Changing Data: As the model is trained on a static dataset, it does not adapt well to dynamic data environments.

Retrieval-Augmented Generation (RAG)

When to Use:
  • Frequently Changing Data: RAG is preferred when the data changes frequently, such as in news agencies or media outlets. The model can retrieve up-to-date information without retraining.
  • Broad Domain Knowledge: If your application covers a wide range of topics or domains, RAG can efficiently handle the diversity by retrieving relevant information dynamically.
  • Limited Labeled Data: RAG is advantageous when you lack a substantial labeled dataset. It uses pre-trained models and retrieves context from external sources, reducing the need for extensive training data.
  • Cost and Time Efficiency: RAG can be implemented quickly with lower initial costs since it avoids the extensive training process.
Pros:
  • Flexibility: Handles a wide variety of tasks by retrieving relevant information on-the-fly.
  • Lower Initial Costs: Avoids the costs associated with training, making it more accessible and faster to deploy.
  • Retains Generalization: The base model remains unaltered, maintaining its ability to generalize across different tasks.
Trade-Offs:
  • Slower Inference: The retrieval step adds latency, making RAG slower compared to fine-tuned models.
  • Complexity: Involves multiple components, such as a vector database, embedding models, and document loaders, which can complicate the system.
  • Higher Token Usage: The model must process the retrieved context along with the query, leading to increased token usage per prompt.

General Guidelines

  • Performance Sensitivity: If your application demands high performance, low latency, and high-quality predictions for a narrow domain, fine-tuning is the recommended approach.
  • Dynamic Data Environments: For applications dealing with frequently updated information or broad domain knowledge, RAG is often the more practical and cost-effective solution.
By carefully evaluating your use case and requirements, you can choose the most suitable approach, balancing the trade-offs between cost, performance, and complexity. Whether you opt for fine-tuning or RAG, each method offers unique advantages that can be leveraged to meet your specific needs.

Getting Started with Fine-tuning

If you've determined that fine-tuning is the appropriate approach for your use case, the next step is to prepare your data and set up the fine-tuning process. Here are some key considerations:
1. **Data Preparation**: Fine-tuning requires a high-quality, labeled dataset relevant to your task. This may involve collecting data from various sources, cleaning and transforming it, and annotating it with the appropriate labels. Data quality is crucial, as poor-quality data will lead to suboptimal model performance.
In the Llama 2 fine-tuning code implementation from the book "Generative AI on AWS", a subset of the Dolly dataset, an open-source instruction-following dataset, is utilized. The code demonstrates pre-processing and filtering the data to create a smaller subset suitable for fine-tuning. In a real-world scenario, however, you would need to carefully curate and preprocess your own dataset to ensure high-quality, relevant data for the fine-tuning task.
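As a rough illustration of this step, the following minimal sketch uses the Hugging Face datasets library to load the open-source databricks-dolly-15k dataset and filter it to a single task category; the category choice and output file name are illustrative, not taken from the notebook:

```python
# A minimal data-preparation sketch, assuming the Hugging Face `datasets`
# library. The summarization filter and file name are illustrative.
from datasets import load_dataset

# Load the open-source Dolly instruction-following dataset.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

# Keep only one task category to create a small, focused training subset.
subset = dolly.filter(lambda example: example["category"] == "summarization")

# Write the subset as JSON Lines, a format SageMaker training jobs accept.
subset.to_json("train.jsonl")
```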
2. **Model Selection**: Choose a suitable pre-trained language model as your starting point. In this example, we have chosen to fine-tune the Llama 2 model, a powerful language model developed by Meta AI. The Llama model is made available through Amazon SageMaker JumpStart, which simplifies the process of accessing and fine-tuning the model using AWS resources.
3. **Fine-tuning Hyperparameters**: After choosing the model and before fine-tuning it with Amazon SageMaker, we need to define the instance type to use for training. We can then experiment with different hyperparameters, such as the learning rate, batch size, number of epochs, and maximum input length, to optimize the fine-tuning process for your specific task and dataset.
The code in this Llama 2 fine-tuning example sets various fine-tuning hyperparameters, including enabling instruction-tuned mode, a maximum input length of 1024 tokens, and 5 training epochs. It’s essential to note that these hyperparameter values may not be optimal for all tasks or datasets; it is crucial to experiment with different configurations to find the best setup for your specific use case.
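A hedged sketch of this configuration using the SageMaker Python SDK's JumpStart estimator is shown below; the model ID, instance type, and S3 path are illustrative placeholders rather than values from the notebook:

```python
# A minimal sketch of configuring and launching the fine-tuning job with
# the SageMaker Python SDK. Model ID, instance type, and data location
# are illustrative placeholders.
from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-2-7b",
    environment={"accept_eula": "true"},  # Llama 2 requires EULA acceptance
    instance_type="ml.g5.12xlarge",       # illustrative training instance
)

# Hyperparameters discussed above: instruction-tuned mode, a maximum
# input length of 1024 tokens, and 5 training epochs.
estimator.set_hyperparameters(
    instruction_tuned="True",
    max_input_length="1024",
    epoch="5",
)

# Launch the fine-tuning job on the prepared dataset (hypothetical S3 path).
estimator.fit({"training": "s3://<your-bucket>/dolly-subset/"})
```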
4. **Evaluation and Iteration**: Regularly evaluate your fine-tuned model's performance using appropriate metrics and test datasets. Fine-tuning is an iterative process, and you may need to adjust your data, hyperparameters, or even the pre-trained model to achieve optimal results. Continuous monitoring and refinement are crucial to improving the model’s performance and ensuring it meets your requirements.
In the fine-tuning Llama 2 example, it includes code for evaluating the fine-tuned model's performance on the validation set. Based on the evaluation results, you can determine if the fine-tuned model meets your requirements or if further iteration is needed by adjusting the data, hyperparameters, or trying a different pre-trained model.
Additionally, it is recommended to evaluate the fine-tuned model’s performance on a separate test dataset to get an unbiased estimate of its real-world performance. The test data should be representative of the target domain and unseen during the fine-tuning process. You can compare the performance of the fine-tuned model with the pre-trained model on this test dataset using relevant metrics. The results may be presented in a tabular or graphical format, as shown in the provided screenshot, allowing you to assess the improvement achieved through fine-tuning.
Evaluation of the pre-trained and fine-tuned models
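As a hedged illustration of how such a comparison can be queried, the sketch below deploys the fine-tuned model and sends it a prompt that could equally be sent to the pre-trained baseline endpoint; the payload shape follows the JumpStart Llama 2 text-generation schema, and the prompt and parameters are illustrative:

```python
# A sketch of querying the fine-tuned model for side-by-side comparison
# with the pre-trained baseline. Assumes `estimator` from the previous
# step; the prompt and generation parameters are illustrative.
finetuned_predictor = estimator.deploy()

payload = {
    "inputs": "Summarize the following text: ...",
    "parameters": {"max_new_tokens": 256, "temperature": 0.1},
}

# Llama 2 endpoints on JumpStart require EULA acceptance at inference time.
response = finetuned_predictor.predict(
    payload, custom_attributes="accept_eula=true"
)
print(response)
```

Running the same payload against a deployed pre-trained endpoint and scoring both outputs with your chosen metric yields the kind of comparison table shown above.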
If you are interested in exploring the details of fine-tuning the Llama 2 model yourself, you can access the complete code for the fine-tuning example by visiting the following GitHub repository: https://github.com/generative-ai-on-aws/generative-ai-on-aws/blob/main/05_finetune/03_fine_tune_dolly_llama2_sagemaker_jumpstart.ipynb
This repository contains a Jupyter Notebook that walks you through the process of fine-tuning the Llama 2 model using Amazon SageMaker JumpStart. The notebook provides a step-by-step guide, including code snippets and explanations, to help you understand and replicate the fine-tuning process. Additionally, you can find instructions on setting up the required environment, preparing the data, and configuring the fine-tuning parameters.
By accessing this GitHub repository, you can gain hands-on experience and deeper insights into fine-tuning foundation models like Llama 2. Feel free to explore the code, experiment with different settings, and adapt it to your specific use case or dataset.

Summary

In this blog post, we delve into fine-tuning and Retrieval-Augmented Generation (RAG) techniques, offering an overview and recommendations for choosing the appropriate approach based on specific use cases. We provide insights on getting started with fine-tuning and present an example of fine-tuning the Llama 2 model using Amazon SageMaker, demonstrating data preprocessing, hyperparameter tuning, evaluation, and more. This will assist developers in understanding the fine-tuning process.
In the upcoming blog post, we'll explore fine-tuning foundation models using Amazon Bedrock, which simplifies the deployment of generative AI on AWS. Amazon Bedrock offers data privacy, network security, flexible billing for model customization, storage, and inference via provisioned throughput. It enables running custom model inference with guaranteed throughput levels and facilitates customized model deployment. Stay tuned for more details.
Note: The cover image for this blog post was generated using the SDXL 1.0 model on Amazon Bedrock. The prompt given was as follows:
two developers sitting in the cafe discussing model fine-tuning, comic, graphic illustration, comic art, graphic novel art, vibrant, highly detailed, colored, 2d minimalistic

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
