Mastering Amazon Bedrock Custom Models Fine-tuning (Part 2): Data Preparation for Fine-Tuning Claude 3 Haiku

This blog post provides a comprehensive guide on preparing data and resources for fine-tuning the Anthropic Claude 3 Haiku model using Amazon Bedrock. It covers creating an IAM role and an S3 bucket, processing a CNN news dataset into the required format, and uploading the datasets to S3 for the fine-tuning process.

Haowen Huang
Amazon Employee
Published Sep 16, 2024
Last Modified Sep 18, 2024


In the previous blog post, we explored fine-tuning and Retrieval-Augmented Generation (RAG) techniques, providing an overview and recommendations for choosing the appropriate approach based on specific use cases. We offered insights on getting started with fine-tuning and presented an example of fine-tuning the Llama model using Amazon SageMaker, demonstrating data preprocessing, hyperparameter tuning, evaluation, and more. This helped developers understand the fine-tuning process.
In this blog post, we will guide you through the process of creating the necessary resources and preparing the datasets for fine-tuning the Claude 3 Haiku model using Amazon Bedrock. By the end, you will have created an IAM role, an S3 bucket, and training, validation, and testing datasets in the required format for the fine-tuning process.

Prerequisites

Before diving into the data preparation process, ensure that you have the required permissions to create and manage IAM roles, S3 buckets, and access Amazon Bedrock. If you are not running with an Admin role, you will need the following managed policies:
  • IAMFullAccess
  • AmazonS3FullAccess
  • AmazonBedrockFullAccess
You can also create a custom model in the Bedrock console following the instructions here.

Setup

First, ensure that the necessary Python packages are installed or upgraded to the required versions for the project to run correctly.
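For example, in a Jupyter notebook (the package list here is an assumption based on the steps that follow):

```python
# Assuming a Jupyter environment; use plain `pip install` from a shell otherwise.
%pip install --upgrade boto3 datasets
```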
Next, import all the needed libraries and dependencies:
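A minimal sketch of the imports used throughout the rest of this walkthrough:

```python
import json
import os

import boto3
from datasets import load_dataset
```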
Then, set up various AWS clients and services that will be used:
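For example, using boto3 (the region and account ID are resolved dynamically; choose a region where Bedrock model customization is available, such as us-west-2):

```python
# Resolve the current region and account ID from the active credentials.
session = boto3.session.Session()
region = session.region_name

sts_client = boto3.client("sts")
account_id = sts_client.get_caller_identity()["Account"]

# Clients used in the steps below.
s3_client = boto3.client("s3")
iam = boto3.client("iam")
bedrock = boto3.client("bedrock", region_name=region)
```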
You may print the important settings to ensure everything is accessible when you need to check them at any time:
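For instance:

```python
print(f"Region: {region}")
print(f"Account ID: {account_id}")
print(f"boto3 version: {boto3.__version__}")
```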

Create an S3 Bucket

Create the S3 bucket, which will be used to store data for Claude 3 Haiku fine-tuning:
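A sketch along these lines, with a hypothetical bucket name (S3 bucket names must be globally unique, hence the account ID and region suffix):

```python
# Hypothetical bucket name; adjust to your own naming conventions.
bucket_name = f"haiku-fine-tuning-{account_id}-{region}"

# us-east-1 is the default location and must not be passed as a constraint.
if region == "us-east-1":
    s3_client.create_bucket(Bucket=bucket_name)
else:
    s3_client.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": region},
    )
```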

Create a Role and Policies

Then, create the role and policies required to run customization jobs with Amazon Bedrock.
The JSON object below defines the trust relationship that allows the Amazon Bedrock service to assume a role, giving it the ability to communicate with other required AWS services. The conditions restrict the assumption of the role to a specific account ID and a specific component of the Bedrock service (model customization jobs).
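A representative trust policy, expressed here as a Python dict so it can be serialized with `json.dumps` when the role is created:

```python
# Restrict role assumption to Bedrock model-customization jobs
# in this account and region.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "bedrock.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"aws:SourceAccount": account_id},
                "ArnEquals": {
                    "aws:SourceArn": f"arn:aws:bedrock:{region}:{account_id}:model-customization-job/*"
                },
            },
        }
    ],
}
```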
The JSON object below defines the permissions of the role that Amazon Bedrock will assume, allowing access to the S3 bucket created to hold our fine-tuning datasets and enabling certain bucket and object operations:
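For example, scoped to the bucket created above:

```python
# Grant read/write access limited to the fine-tuning bucket and its objects.
s3_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
        }
    ],
}
```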
You can print them out if you want to know the details:
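For instance:

```python
print(json.dumps(trust_policy, indent=2))
print(json.dumps(s3_access_policy, indent=2))
```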
Finally, we need to attach the defined policy to the specified role:
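A sketch that creates the role with the trust policy and attaches the permissions policy inline (the role and policy names here are hypothetical):

```python
# Hypothetical names; adjust as needed.
role_name = "bedrock-haiku-customization-role"

response = iam.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Role for Amazon Bedrock Claude 3 Haiku fine-tuning jobs",
)
# Keep the role ARN; the fine-tuning job in the next post will need it.
role_arn = response["Role"]["Arn"]

iam.put_role_policy(
    RoleName=role_name,
    PolicyName="bedrock-haiku-s3-access",
    PolicyDocument=json.dumps(s3_access_policy),
)
```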

Prepare the CNN News Article Dataset for Claude 3 Haiku Fine-Tuning and Evaluation

The dataset that will be used is a collection of news articles from CNN and their associated highlights. More information can be found on HuggingFace:
First, load the CNN News Article dataset from HuggingFace:
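The dataset is published on HuggingFace as `cnn_dailymail` (version 3.0.0), so it can be loaded with the `datasets` library:

```python
# Load all three splits of the CNN/DailyMail dataset.
dataset = load_dataset("cnn_dailymail", "3.0.0")
print(dataset)
```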
The provided dataset contains three different splits: `train`, `validation`, and `test`:
  • For the `train` split, there are 287,113 examples
  • For the `validation` split, there are 13,368 examples
  • For the `test` split, there are 11,490 examples
To fine-tune the Claude 3 Haiku model, the training data must be in JSONL format, where each line represents a single training record. Specifically, the training data format aligns with the MessageAPI:
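Concretely, each line has the following shape (shown as a Python dict with placeholder values):

```python
# One training record = one JSONL line, in the Bedrock MessageAPI format.
{
    "system": "<optional system prompt>",
    "messages": [
        {"role": "user", "content": "<instruction and article text>"},
        {"role": "assistant", "content": "<desired summary>"},
    ],
}
```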
In each line, the system message is optional; it provides context and instructions to the Haiku model, such as specifying a particular goal or role, and is also known as the system prompt.
The user input corresponds to the user’s instruction, and the assistant input is the desired response that the fine-tuned Haiku model should provide.
A common prompt structure for instruction fine-tuning includes a system prompt, an instruction, and an input that provides additional context.
Here we define the system prompt, which will be added to the MessageAPI, and an instruction header that will be added before each article and together will be the user content of each data point.
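For example (the exact prompt wording here is illustrative; tune it for your own use case):

```python
# Hypothetical prompt text for instruction fine-tuning.
system_string = (
    "Below is an instruction that describes a task, "
    "paired with an input that provides further context. "
    "Write a response that appropriately completes the request."
)
instruction = "instruction: Summarize the news article provided below.\n\ninput: "
```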
For the assistant component, we will use the summary/highlights of the article as the desired response. The transformation of each data point is performed with the code below:
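A sketch, assuming the `system_string` and `instruction` variables defined above (the CNN/DailyMail records expose `article` and `highlights` fields):

```python
def convert_datapoint(dp):
    """Map one CNN article/highlights pair into the MessageAPI format."""
    return {
        "system": system_string,
        "messages": [
            {"role": "user", "content": instruction + dp["article"]},
            {"role": "assistant", "content": dp["highlights"]},
        ],
    }

datapoints_train = [convert_datapoint(dp) for dp in dataset["train"]]
```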
An example of a processed data point can be printed below:
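For instance (the index choice is arbitrary):

```python
print(json.dumps(datapoints_train[0], indent=2))
```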
The same processing is done for the validation and test datasets as well.
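For instance:

```python
datapoints_valid = [convert_datapoint(dp) for dp in dataset["validation"]]
datapoints_test = [convert_datapoint(dp) for dp in dataset["test"]]
```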
Next, we will define some helper functions to process our data points further by modifying the number of data points we want to include in each dataset and the maximum string length of the data points we want to include. The final function will convert our datasets into JSONL files.
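A sketch of what such helpers might look like (names and the character-based length filter are assumptions):

```python
def process_dataset(datapoints, num_datapoints, max_dp_length):
    """Keep records whose combined user/assistant content stays under a
    character cap, then truncate the list to num_datapoints records."""
    filtered = [
        dp for dp in datapoints
        if len(dp["messages"][0]["content"]) + len(dp["messages"][1]["content"])
        <= max_dp_length
    ]
    return filtered[:num_datapoints]


def to_jsonl(datapoints, file_path):
    """Write the processed records to disk, one JSON object per line."""
    with open(file_path, "w") as f:
        for dp in datapoints:
            f.write(json.dumps(dp) + "\n")
```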
Claude 3 Haiku model fine-tuning has the following requirements on your datasets:
  • Context length can be up to 32,000 tokens
  • Training dataset cannot have greater than 10,000 records
  • Validation dataset cannot have greater than 1,000 records
For simplicity, we will process the datasets as follows:
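For example, with assumed dataset sizes and a conservative character cap (well within the 10,000/1,000 record limits, and keeping each record comfortably under the 32,000-token context):

```python
# Assumed sizes; increase them for a more thorough fine-tuning run.
train_processed = process_dataset(datapoints_train, 1000, 20000)
valid_processed = process_dataset(datapoints_valid, 100, 20000)
test_processed = process_dataset(datapoints_test, 10, 20000)
```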

Create a Local Directory for Datasets

Create a local directory, then save the processed data into it as JSONL files, one per split:
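A sketch, assuming the `to_jsonl` helper defined earlier and a hypothetical local folder layout:

```python
# Hypothetical local layout: one JSONL file per split.
dataset_folder = "fine-tuning-datasets"
os.makedirs(dataset_folder, exist_ok=True)

train_file = os.path.join(dataset_folder, "train-cnn.jsonl")
valid_file = os.path.join(dataset_folder, "validation-cnn.jsonl")
test_file = os.path.join(dataset_folder, "test-cnn.jsonl")

to_jsonl(train_processed, train_file)
to_jsonl(valid_processed, valid_file)
to_jsonl(test_processed, test_file)
```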

Upload Datasets to S3 Bucket

The following code blocks upload the created training, validation, and test datasets to the S3 bucket. The training and validation datasets will be used for the Haiku fine-tuning job, while the test dataset will be used to compare the performance of the fine-tuned and base Haiku models.
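For example, with assumed key prefixes (note the resulting S3 URIs, since the fine-tuning job in the next post will reference them):

```python
s3_client.upload_file(train_file, bucket_name, "fine-tuning-datasets/train/train-cnn.jsonl")
s3_client.upload_file(valid_file, bucket_name, "fine-tuning-datasets/validation/validation-cnn.jsonl")
s3_client.upload_file(test_file, bucket_name, "fine-tuning-datasets/test/test-cnn.jsonl")

print(f"s3://{bucket_name}/fine-tuning-datasets/train/train-cnn.jsonl")
```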

Summary

If you're interested in the details of preparing data for fine-tuning the Claude 3 Haiku model, you can refer to the following GitHub link:
By following the steps outlined in this blog post, you should have successfully prepared the necessary resources and datasets for fine-tuning the Claude 3 Haiku model on news article summarization using Amazon Bedrock. With the IAM role, S3 bucket, and processed datasets in place, you're ready to proceed with the fine-tuning process, which will be covered in the next blog post. Stay tuned.
Note: The cover image for this blog post was generated using the SDXL 1.0 model on Amazon Bedrock. The prompt given was:
A young lady developer and a handsome gentleman data scientist sitting in a café, laptop without a logo, excitedly discussing model fine-tuning, comic, graphic illustration, comic art, graphic novel art, vibrant, highly detailed, colored, 2d
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
