Deploying the Falcon-40B Model on Amazon SageMaker: A Comparative Guide
Dive into the intricacies of deploying the Falcon-40B, an open source Large Language Model, on Amazon SageMaker. We'll contrast two deployment avenues: SageMaker JumpStart, for those seeking swift and straightforward deployment, and SageMaker Notebook, tailored for enthusiasts who desire a hands-on approach with granular configuration control.
1. Falcon-40B Open Source LLM Overview
2. Deploy Falcon-40B using SageMaker JumpStart
2.1 Deploying Falcon-40B quickly, with sensible defaults
2.2 Deploying Falcon-40B, using the SageMaker SDK
2.3 Generating text with Falcon-40B deployed via SageMaker JumpStart
3. Deploy Falcon-40B from Hugging Face into a SageMaker Endpoint
3.1 Set up the development environment
3.2 Retrieve the Hugging Face LLM DLC
3.3 Deploy Falcon-40B to Amazon SageMaker Endpoint
3.4 Run Inference and Chat with the Model
- SageMaker JumpStart - with sensible defaults.
- SageMaker JumpStart - using the SageMaker SDK.
- SageMaker Studio Notebook - a hands-on approach, allowing for detailed configuration and granular control.
- Positional embeddings: rotary (Su et al., 2021);
- Multi-query attention (Shazeer et al., 2019) and Flash Attention (Dao et al., 2022);
- Decoder-block: parallel attention/MLP with two-layer norms.
!! Cost Warning !! - While Falcon-40B may not be the biggest LLM out there, it is still a production-scale LLM. Running a model like this in your account requires large compute instances, such as the ml.g5.12xlarge (or larger). Before deploying any resources in your account, always check the pricing first, and plan how you will decommission the resources when they are no longer needed. Note that ALL the methods described in this post deploy large compute instances.
AWS Service Quotas - Your AWS account has region-specific quotas for each AWS service. These quotas help protect the account from unexpected costs caused by unintended usage. You can request quota increases, and you may need to do so for the large instances used in this post.
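If you want to check your quotas or request an increase programmatically, a minimal sketch using boto3 could look like the following; it matches quotas by name, and the commented-out request call needs the real quota code from the listing:

```python
# Minimal sketch: inspect SageMaker quotas and (optionally) request an increase.
import boto3

quotas = boto3.client("service-quotas")

# List SageMaker quotas and print the ones related to g5.12xlarge instances
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if "g5.12xlarge" in quota["QuotaName"]:
            print(quota["QuotaName"], quota["QuotaCode"], quota["Value"])

# To request an increase, use the quota code printed above, e.g.:
# quotas.request_service_quota_increase(
#     ServiceCode="sagemaker",
#     QuotaCode="<quota-code-from-listing>",
#     DesiredValue=1,
# )
```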
- Navigate to JumpStart - First, from the home page of SageMaker Studio or from the left-hand menu, select 'JumpStart':
- Search for 'Falcon' - Now, using the search box, search for Falcon:
- Choose a version - There are a few versions of Falcon available in JumpStart, with different sizes and some that are instruction fine-tuned. Instruction fine-tuning is where the model has been further refined from its base training to follow instructions or to chat. Developers deploying and using a model will likely want to choose this version, whereas ML engineers looking to further refine (fine-tune) the model will probably want to start from the non-instruction-tuned model. JumpStart also has 7 billion and 180 billion parameter versions, as well as the 40 billion parameter version we are using. To deploy and use Falcon-40B right now, select 'Falcon 40B Instruct BF16'.
- You can review the deployment configuration options (see next), or just click the Deploy button - that's it, you're done! (Deployment may take between 15 and 20 minutes.)
- But wait, what sensible defaults were used? If you navigate to the Deployment Configuration, you will see the options. The main default to be aware of is the instance size used to deploy the model. You can change this, but the minimum size is ml.g5.12xlarge, which is quite a large instance.
- You will see how to use the deployed SageMaker model endpoint later in this post. In the meantime, to review what has been deployed, and when the time comes to shut down the endpoint, use the left side menu to navigate to SageMaker JumpStart > Launched JumpStart assets, then select the Endpoints tab. From there, select the endpoint that was deployed, scroll down, and select Delete. (A code alternative is sketched below.)
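If you would rather clean up from code than from the console, a minimal boto3 sketch (the endpoint name shown is a placeholder) is:

```python
# Minimal sketch: delete a deployed endpoint (and its config) with boto3.
# Replace the placeholder with the endpoint name shown under Launched JumpStart assets.
import boto3

sm = boto3.client("sagemaker")
endpoint_name = "<your-falcon-endpoint-name>"  # placeholder

sm.delete_endpoint(EndpointName=endpoint_name)
# The endpoint config usually shares the endpoint's name, but verify in the console first
sm.delete_endpoint_config(EndpointConfigName=endpoint_name)
```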
- As per the instructions above, navigate to the Falcon-40B model card in SageMaker JumpStart.
- This time, rather than clicking Deploy, scroll down and select Open notebook from the 'Run in notebook' section. This will open a Jupyter notebook in SageMaker Studio with all the code ready to be used, reused, or modified.
- If you are prompted to 'Set up notebook environment', selecting the default options will be fine. The instance being launched here only runs the SDK code; the model itself will launch in a different container.
- You may need to wait for the notebook to start the kernel.
- When the notebook is up and running, you will see that the code is split into three main parts, with Deploy Falcon model for inference being the section we want in order to use the SDK to deploy Falcon-40B as a SageMaker endpoint.
- Review the code and configuration in each of the cells, and run the code to deploy the endpoint. A sketch of what this deployment code amounts to is shown after this list.
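For reference, a minimal sketch of deploying Falcon-40B Instruct from JumpStart with the SageMaker SDK; the model_id and defaults are assumptions, so check the notebook cells for the exact values:

```python
# Minimal sketch: deploy Falcon-40B Instruct from JumpStart via the SageMaker SDK.
# The model_id is assumed to be the JumpStart identifier for 'Falcon 40B Instruct BF16';
# confirm it against the notebook before running.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="huggingface-llm-falcon-40b-instruct-bf16")

# Deploys to the model's default instance type (an ml.g5.12xlarge or larger)
predictor = model.deploy()
```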
- If it's not already loaded, follow the steps above to open up the Falcon-40B notebook.
- Scroll through the notebook and find the section 1.3 About the model. In there you will see cells to query the endpoint using the predictor object. If you have just used this notebook to create the endpoint, the predictor object will already be set. If you created the endpoint in another session, or you created the endpoint using the JumpStart Deploy button, you will need to create a predictor object yourself.
- To create a predictor object, you need to locate the name of the endpoint you have deployed. To do this, navigate to Launched JumpStart assets in the left hand menu, select the Model endpoints tab, and note the Title of the deployed endpoint.
- Back in the notebook, to create a predictor, add a new cell just before the cell that defines the query_endpoint function, and add code like the following, making sure to use the name of the endpoint that is deployed in your account:
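A minimal sketch of that cell (the endpoint name below is a placeholder; substitute the Title you noted above):

```python
# Minimal sketch: attach a Predictor to the already-deployed endpoint so the
# rest of the notebook can query it. Replace the endpoint name with yours.
import sagemaker
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = Predictor(
    endpoint_name="<your-falcon-endpoint-name>",  # the Title noted above
    sagemaker_session=sagemaker.Session(),
    serializer=JSONSerializer(),       # send JSON payloads
    deserializer=JSONDeserializer(),   # parse JSON responses
)
```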
To run the code in this section, create a new notebook in SageMaker Studio (File > New > Notebook).

The content of this section is partly based on Philipp Schmid's blog. Philipp is a Technical Lead at Hugging Face and an AWS Machine Learning Hero.
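A minimal setup sketch, assuming you are running inside SageMaker Studio (the pinned SDK version is only an example):

```python
# Minimal sketch: install the SageMaker SDK and set up the session and role.
!pip install "sagemaker>=2.175.0" --upgrade --quiet

import sagemaker

sess = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role the notebook runs with

print(f"role arn: {role}")
print(f"region:   {sess.boto_region_name}")
```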
To deploy Falcon-40B with the Hugging Face LLM DLC, we first need to retrieve the container image URI and provide it to the HuggingFaceModel model class, pointing to that image using image_uri. To retrieve the URI, we use the get_huggingface_llm_image_uri method provided by the Amazon SageMaker SDK. This method allows us to retrieve the URI of the required Hugging Face LLM DLC based on the specified backend, session, region, and version. The example code is as follows:
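A minimal sketch; the version pinned here is an example, so check the SageMaker SDK documentation for currently supported DLC versions:

```python
# Minimal sketch: retrieve the Hugging Face LLM DLC (TGI) image URI.
from sagemaker.huggingface import get_huggingface_llm_image_uri

llm_image = get_huggingface_llm_image_uri(
    "huggingface",    # backend
    version="0.8.2",  # example DLC version; newer versions may be available
)

print(f"llm image uri: {llm_image}")
```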
Next, we create the HuggingFaceModel model class and define the related endpoint configuration, including the hf_model_id, instance_type, etc. For this demo, we'll be using the ml.g5.12xlarge instance type, which has 4 NVIDIA A10G GPUs and 96GB of GPU memory.
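A minimal sketch of creating the model; the Hugging Face model id and the TGI environment values shown are illustrative:

```python
# Minimal sketch: define the endpoint configuration and create the HuggingFaceModel.
import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

instance_type = "ml.g5.12xlarge"       # 4x NVIDIA A10G, 96GB total GPU memory
number_of_gpus = 4

config = {
    "HF_MODEL_ID": "tiiuae/falcon-40b-instruct",  # model id from huggingface.co/models
    "SM_NUM_GPUS": json.dumps(number_of_gpus),    # shard the model across all GPUs
    "MAX_INPUT_LENGTH": json.dumps(1024),         # max length of the input text
    "MAX_TOTAL_TOKENS": json.dumps(2048),         # max length of input + generated text
}

llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,  # the DLC URI retrieved earlier
    env=config,
)
```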
The container also supports quantizing the model, using data types such as 4-bit Float and 4-bit NormalFloat. QLoRA is an efficient fine-tuning approach that reduces the memory usage of LLMs while maintaining solid performance. To learn more about using the bitsandbytes library to do 4-bit quantization based on the QLoRA technique, you can check out this blog post.
After creating the HuggingFaceModel, we can deploy it to an Amazon SageMaker endpoint using the deploy method. We'll deploy the model with the ml.g5.12xlarge instance type. Text Generation Inference (TGI) will automatically distribute and shard the model across all GPUs, as shown in the following code.
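Continuing the sketch above (llm_model and instance_type come from the previous block), the deployment call might look like this; the health-check timeout is an assumption to give the large model time to load:

```python
# Minimal sketch: deploy the model; TGI shards it across the instance's GPUs.
predictor = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,                 # ml.g5.12xlarge
    container_startup_health_check_timeout=600,  # give the large model time to load
)
```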
Once the endpoint is deployed, we can use the predict method from the predictor to begin model inference.
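For example, continuing with the predictor returned by deploy() above (the prompt and parameter values are illustrative):

```python
# Minimal sketch: query the endpoint. Generation options go in the "parameters"
# attribute of the payload, described below.
payload = {
    "inputs": "What is Amazon SageMaker?",
    "parameters": {
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.9,
        "stop": ["<|endoftext|>"],
    },
}

response = predictor.predict(payload)
print(response[0]["generated_text"])
```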
Generation behavior is controlled by the parameters attribute of the request payload. The Hugging Face LLM Inference Container supports various generation parameters, including top_p, temperature, stop, and max_new_tokens. You can find the full list of supported parameters here.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.