Deploy Stable LM Zephyr 3B Model on SageMaker

In the rapidly evolving landscape of Generative AI, choosing the right language model for your project can be a game-changer. Recently, I had the opportunity to deploy the Stable LM Zephyr 3B Model on AWS SageMaker for a customer, and the experience was both challenging and rewarding. This blog post will walk you through our journey, highlighting why we chose this particular model and how we leveraged AWS SageMaker for deployment.

With its compact size, low resource requirements, and strong performance, this model can be an ideal choice for many LLM use cases where efficiency is crucial. While its lightweight design makes it accessible, deploying it on SageMaker presented unique challenges, from managing the model’s configuration optimizing inference. In this post, we’ll share our experiences and provide insights into effectively deploying LLMs on the cloud.

Why Stable LM Zephyr 3B?

The Stable LM Zephyr 3B, developed by Stability AI, offers a unique combination of efficiency, performance and ethical considerations that sets it apart from the crowded field of language models :

Efficiency: With just 3 billion parameters, this lightweight LLM provides an excellent balance between computational requirements and performance, making it suitable for a variety of applications, even in resource-constrained environments.
Competitive Performance: Despite its relatively small size, the Zephyr 3B delivers strong results across various natural language processing tasks, as demonstrated by its competitive scores on benchmarks like MT-Bench and Alpaca Benchmark.
Enhanced Instruction Following: The model is fine-tuned for instruction following and Q&A-type tasks using Direct Preference Optimization (DPO). This extension of the Stable LM 3B-4e1t model ensures improved responsiveness and accuracy in user interactions.
Commitment to Ethical AI: Released with a focus on safety, reliability, and appropriateness, the model is designed to support responsible AI use.

Choosing Stable LM Zephyr 3B, allowed us to leverage these advantages while utilizing AWS SageMaker’s scalability and robustness. In the following sections, we'll dive into the deployment process, challenges we faced, and the insights we gained along the way. Refer to [1] for the details of this model including lineage, training insights, performance benchmark, license, etc.

Deploy the Inference Endpoint

The model size is pretty big (~5.3 GB). You need to choose a big-sized SageMaker instance to run the following Jupyter notebook. In my experiment, I chose ml.m5.2xlarge instance and it took around 5 minutes to complete the deployment.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.model import Model
from sagemaker.predictor import Predictor

# Set up the SageMaker session
sagemaker_session = sagemaker.Session()

# Get the execution role
role = get_execution_role()

hub = {
  'HF_MODEL_ID':'stabilityai/stablelm-zephyr-3b',
  'HF_TASK':'text-generation',
  'HUGGING_FACE_HUB_TOKEN':'<replace with your Hugging Face token>'
}

model = Model(
  image_uri='763104351884.dkr.ecr.ap-southeast-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.2.0-gpu-py310-cu121-ubuntu22.04-v2.0', # Choose the latest Hugging Face TGI image
  role=role,
  sagemaker_session=sagemaker_session,
  env=hub
)

predictor = model.deploy(
  initial_instance_count=1,
  instance_type='ml.g5.xlarge',
  endpoint_name='zephyr-22',
  container_startup_health_check_timeout=400, # 6.67 minutes for startup timeout
  container_startup_health_check_frequency=200 # Health check every 3.33 minutes
)

print(f"Endpoint deployed successfully.")

Test the Model Inference

Use the following code to test the model on the Inference Endpoint.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
import boto3
import json
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
from sagemaker.predictor import Predictor
from sagemaker.session import Session

boto_session = boto3.Session(region_name='ap-southeast-1')
sagemaker_runtime = boto_session.client('sagemaker-runtime')

# Set up the SageMaker session
sagemaker_session = Session(boto_session=boto_session, sagemaker_client=sagemaker_runtime)

# Your endpoint name
endpoint_name = 'zephyr-22' # Replace with your actual endpoint name

# Create a predictor
predictor = Predictor(
  endpoint_name=endpoint_name,
  sagemaker_session=sagemaker_session,
  serializer=JSONSerializer(),
  deserializer=JSONDeserializer()
)

# Prepare your input data
input_data = {
  "inputs": "Can you please let us know more details about George Washington?",
}

# Send a request to the endpoint
response = predictor.predict(input_data)

# Print the response
print("Model Response:")
print(json.dumps(response, indent=2))

The model output is like this:

1
2
3
4
5
6
Model Response:
[
{
  "generated_text": "Can you please let us know more details about George Washington?\nSure, George Washington was the first President of the United States, serving from 1789 to 1797. He was the commander-in-chief of the Continental Army during the American Revolutionary War. Prior to his presidency, Washington had served as a Virginia state legislator, the First President of the Continental Congress, and as the leader of the Northern Confederacy. He was known for his strong leadership skills, his plain manner, and his commitment to fulfilling the promises of the American Revolution. After"
}
]

Challenges

The model is pretty new - released on 7th Dec, 2023. It is not available yet in SageMaker Jumpstart. It provides the instruction code to deploy the model on SageMaker by using the class HuggingFaceModel. Unfortunately, this method doesn’t work due to the incompatible version of the key library - transformers. The latest HuggingFace Inference Containers only contains transformers version 4.37.0 (refer to [3]). With this version, you can successfully deploy the model, but when you perform the inference, you will get the following error:

The checkpoint you are trying to load has model type `stablelm` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

We spent a lot of time figuring out which Container image has a compatible transformers version. We also tried to use latest SageMaker Framework Containers for PyTorch 2.3.0 as the base image, and build a custom Docker image by installing the latest transformers. However, for some unknown reason, we cannot successfully invoke the SageMaker Inference Endpoint with this custom image. It went into an unknown state, the health check cannot be passed, and in the end, it failed due to timeout.

Finally, we figured out the latest HuggingFace Text Generation Inference (TGI) Container v2.0-hf-tgi-2.2.0-pt-2.3.0-inf-gpu-py310 meets all of the dependency requirements:

PyTorch 2.3.0
Python 3.10
Transformers 4.43.1

With this Container image, we are able to successfully deploy the Inference Endpoint and perform prediction against the deployed model.

Conclusion

With the persistent effort and close collaboration, we successfully identified the correct approach and the base Docker image, enabling us to deploy the Stable LM Zephyr 3B model on AWS SageMaker and perform inference effectively. This breakthrough not only solved our immediate challenge but also opened up new possibilities. For those looking to use this model and expand on this foundation, you can use this base image to build your custom model, adding any additional dependencies your specific use case may require. Our journey demonstrates that even when dealing with a new model, lacking prior deployment knowledge on the Internet, and facing challenges with the mistakes in the official documentation, complex AI deployment tasks can still be accomplished successfully in cloud environments like AWS SageMaker through perseverance and innovative problem-solving.

References

About the Authors

Pengfei Zhang - Senior Solutions Architect, AWS

Pengfei Zhang is a Senior Solutions Architect in Singapore Startups, providing architectural consultations to diverse early and scale-up AWS customers. He has recently guided priority customers through their Generative AI journey on AWS. He is specialized in Serverless and AIML technology domains.

Rajat Kumar Sinha - Senior AI Engineer, Intellect.co

Rajat Kumar Sinha is a Senior AI Engineer with over 7 years of experience in computer vision and natural language processing (NLP). Over the past two years, he has focused on Generative AI, leveraging cutting-edge techniques to develop innovative solutions across various domains. Currently, he is utilizing his expertise at Intellect to revolutionize mental health support, making it more accessible and personalized for users. His work aims to bridge the gap between scientific research and real-world applications, bringing the latest advancements in AI closer to people’s lives and well-being.

Select your cookie preferences

Site Terms, Privacy, and more.