
Deploy Stable LM Zephyr 3B Model on SageMaker
How to Deploy a Lightweight LLM on SageMaker
- Efficiency: With just 3 billion parameters, this lightweight LLM provides an excellent balance between computational requirements and performance, making it suitable for a variety of applications, even in resource-constrained environments.
- Competitive Performance: Despite its relatively small size, the Zephyr 3B delivers strong results across various natural language processing tasks, as demonstrated by its competitive scores on benchmarks like MT-Bench and Alpaca Benchmark.
- Enhanced Instruction Following: The model is fine-tuned for instruction following and Q&A-type tasks using Direct Preference Optimization (DPO). This extension of the Stable LM 3B-4e1t model ensures improved responsiveness and accuracy in user interactions.
- Commitment to Ethical AI: Released with a focus on safety, reliability, and appropriateness, the model is designed to support responsible AI use.
import sagemaker
from sagemaker import get_execution_role
from sagemaker.model import Model

# Set up the SageMaker session
sagemaker_session = sagemaker.Session()

# Get the execution role
role = get_execution_role()

# Environment variables for the Hugging Face TGI container
hub = {
    'HF_MODEL_ID': 'stabilityai/stablelm-zephyr-3b',
    'HF_TASK': 'text-generation',
    'HUGGING_FACE_HUB_TOKEN': '<replace with your Hugging Face token>'
}

model = Model(
    image_uri='763104351884.dkr.ecr.ap-southeast-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.3.0-tgi2.2.0-gpu-py310-cu121-ubuntu22.04-v2.0',  # Choose the latest Hugging Face TGI image for your region
    role=role,
    sagemaker_session=sagemaker_session,
    env=hub
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.g5.xlarge',
    endpoint_name='zephyr-22',
    container_startup_health_check_timeout=400  # allow ~6.7 minutes for the container to start
)

print("Endpoint deployed successfully.")
import boto3
import json
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
from sagemaker.predictor import Predictor
from sagemaker.session import Session

boto_session = boto3.Session(region_name='ap-southeast-1')
sagemaker_runtime = boto_session.client('sagemaker-runtime')

# Set up the SageMaker session, passing the runtime client used for invocations
sagemaker_session = Session(boto_session=boto_session, sagemaker_runtime_client=sagemaker_runtime)

# Your endpoint name
endpoint_name = 'zephyr-22'  # Replace with your actual endpoint name

# Create a predictor that serializes requests and deserializes responses as JSON
predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

# Prepare your input data
input_data = {
    "inputs": "Can you please let us know more details about George Washington?",
}

# Send a request to the endpoint
response = predictor.predict(input_data)

# Print the response
print("Model Response:")
print(json.dumps(response, indent=2))
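Besides `inputs`, TGI accepts an optional `parameters` object to control generation. The names below (`max_new_tokens`, `temperature`, `do_sample`, `stop`) are standard TGI generation parameters, but exact support can vary by TGI version, so treat this payload as a sketch and adjust values for your use case:

```python
import json

# A request payload with TGI-style generation parameters (assumed supported
# by the deployed TGI version; adjust names and values for your container).
input_data = {
    "inputs": "Can you please let us know more details about George Washington?",
    "parameters": {
        "max_new_tokens": 256,   # cap the length of the completion
        "temperature": 0.7,      # softer sampling (used when do_sample=True)
        "do_sample": True,       # sample instead of greedy decoding
        "stop": ["\n\n"]         # optional stop sequences
    }
}

# The payload must be JSON-serializable, since JSONSerializer sends it as JSON
payload = json.dumps(input_data)
print(payload[:60])
```

The same dict can be passed directly to `predictor.predict(input_data)`.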
Model Response:
[
{
"generated_text": "Can you please let us know more details about George Washington?\nSure, George Washington was the first President of the United States, serving from 1789 to 1797. He was the commander-in-chief of the Continental Army during the American Revolutionary War. Prior to his presidency, Washington had served as a Virginia state legislator, the First President of the Continental Congress, and as the leader of the Northern Confederacy. He was known for his strong leadership skills, his plain manner, and his commitment to fulfilling the promises of the American Revolution. After"
}
]
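Note that TGI echoes the prompt at the start of `generated_text` (unless the `return_full_text` parameter is disabled). A small local post-processing helper, assuming the response shape shown above, can strip that echo; the response text here is a shortened, hypothetical stand-in:

```python
def extract_completion(response, prompt):
    """Pull the completion out of a TGI response list, dropping the echoed prompt."""
    text = response[0]["generated_text"]
    # Strip the prompt prefix if the server returned the full text
    return text[len(prompt):].lstrip() if text.startswith(prompt) else text

# Example with a response shaped like the one above (shortened, hypothetical text)
prompt = "Can you please let us know more details about George Washington?"
response = [{"generated_text": prompt + "\nSure, George Washington was the first President of the United States."}]
print(extract_completion(response, prompt))
# → Sure, George Washington was the first President of the United States.
```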
The TGI container image used above ships with:
- PyTorch 2.3.0
- Python 3.10
- Transformers 4.43.1
- AWS DLC releases for TGI: https://github.com/aws/deep-learning-containers/releases?q=tgi+AND+gpu&expanded=true