
Deploying Alibaba's Qwen-2.5 Model on AWS using Amazon SageMaker AI
Step-by-step guide: Deploying Qwen2.5-14B-Instruct Model on Amazon SageMaker AI
In this guide, we will deploy Alibaba's Qwen2.5-14B-Instruct model on AWS infrastructure. The deployment runs the model on Amazon SageMaker AI, enabling you to harness Qwen 2.5's capabilities in the cloud. We will host the Qwen2.5-14B-Instruct model on the ml.g5.12xlarge instance type. Note that if your account's service quota for ml.g5.12xlarge for endpoint usage is 0, you will need to raise it before deploying.






In your SageMaker JupyterLab environment, under Notebook, click on Python 3 (ipykernel) to create a new notebook. Start by installing a recent version of the SageMaker Python SDK:
%pip install "sagemaker>=2.163.0"
Next, import the required libraries and create a sagemaker_session, which we use to determine the current region and execution role.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
import time
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()
Use get_huggingface_llm_image_uri() to generate the appropriate image URI for the Hugging Face Large Language Model (LLM) inference container. The function takes a required parameter, backend, and several optional parameters. The backend specifies the type of backend to use for the model; set it to "huggingface", which selects the Hugging Face TGI (Text Generation Inference) backend.
image_uri = get_huggingface_llm_image_uri(
    backend="huggingface",
    region=region
)
Next, create the HuggingFaceModel object by specifying a unique name, the image_uri for the managed TGI container, and the execution role for the endpoint. Additionally, we specify a number of environment variables, including HF_MODEL_ID, which corresponds to the model from the Hugging Face Hub that will be deployed, and HF_TASK, which configures the inference task to be performed by the model. You should also define SM_NUM_GPUS, which specifies the tensor parallelism degree of the model. Tensor parallelism splits the model across multiple GPUs, which is necessary when working with LLMs that are too big for a single GPU. Set SM_NUM_GPUS to the number of GPUs available on your selected instance type.
model_name = "qwen-14b-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

hub = {
    'HF_MODEL_ID': 'Qwen/Qwen2.5-14B-Instruct',
    'HF_TASK': 'conversational',
    'SM_NUM_GPUS': '4'
}

model = HuggingFaceModel(
    name=model_name,
    env=hub,
    role=role,
    image_uri=image_uri
)
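SM_NUM_GPUS must match the GPU count of the instance type you plan to use; a mismatch wastes capacity or fails at startup. As a guard, here is a small hypothetical helper (not part of the SageMaker SDK) with GPU counts for common g5 sizes, taken from the EC2 G5 instance specifications:

```python
# GPU counts for common ml.g5 instance sizes (per EC2 G5 specifications).
G5_GPU_COUNT = {
    "ml.g5.xlarge": 1,
    "ml.g5.2xlarge": 1,
    "ml.g5.4xlarge": 1,
    "ml.g5.8xlarge": 1,
    "ml.g5.16xlarge": 1,
    "ml.g5.12xlarge": 4,
    "ml.g5.24xlarge": 4,
    "ml.g5.48xlarge": 8,
}

def sm_num_gpus(instance_type: str) -> str:
    """Return the SM_NUM_GPUS value matching a given g5 instance type."""
    try:
        return str(G5_GPU_COUNT[instance_type])
    except KeyError:
        raise ValueError(f"Unknown instance type: {instance_type}")
```

With this, you could set hub['SM_NUM_GPUS'] = sm_num_gpus("ml.g5.12xlarge") instead of hard-coding '4'.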
Deploy the model to a real-time endpoint using the deploy() function. To efficiently deploy and run large language models, it is important to choose an instance type that can handle their computational requirements. Here we use an ml.g5.12xlarge instance, which comes with 4 GPUs. Please refer to the guide provided by Amazon SageMaker on instance type selection for large model inference.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    endpoint_name=model_name
)
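By default, deploy() blocks until the endpoint is in service, which can take several minutes for a 14B model. If you pass wait=False to deploy(), you can poll the endpoint status yourself. Below is a minimal generic polling sketch; the describe callable is a stand-in for a real status check such as boto3's describe_endpoint:

```python
import time

def wait_for_endpoint(describe, poll_seconds=30, timeout_seconds=1800):
    """Poll `describe()` (expected to return an EndpointStatus string such as
    'Creating', 'InService', or 'Failed') until the endpoint is ready."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = describe()
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("Endpoint creation failed")
        time.sleep(poll_seconds)
    raise TimeoutError("Endpoint did not become ready in time")

# With boto3, this could be wired up as:
#   sm = boto3.client("sagemaker")
#   wait_for_endpoint(lambda: sm.describe_endpoint(
#       EndpointName=model_name)["EndpointStatus"])
```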
Once the endpoint is in service, you can run inference against it with predictor.predict(). You can also pass generation parameters to control the sampling behavior of the model.
# Advanced generation parameters
generation_params = {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.7,
    "max_new_tokens": 512
}

# Sample request
predictor.predict({
    "inputs": "Explain quantum computing in simple terms",
    "parameters": generation_params
})
The endpoint returns a response similar to:
[{'generated_text': "Explain quantum computing in simple terms.\nQuantum computing is a type of computing that uses quantum mechanics, a branch of physics, to perform operations on data. Unlike classical computers that use bits (1s and 0s) to store and process information, quantum computers use quantum bits, or qubits. \n\nQubits have some unique properties that make them very powerful for certain types of computations. One of these properties is called superposition, which means that a qubit can be in multiple states at the same time. Another property is entanglement, which means that qubits can be linked together in a way that their states become dependent on each other.\n\nBecause of these properties, quantum computers can perform certain calculations much faster than classical computers. For example, they can quickly factor large numbers, which is important for cryptography and secure communication. They can also simulate complex quantum systems, which is useful for chemistry and materials science.\n\nIn summary, quantum computing is a way of processing information using quantum mechanics, which can solve certain problems much faster than classical computing. However, building and operating quantum computers is still a challenging task and they are not yet widely available. \n\nI hope that helps! Let me know if you have any more questions. \n\n(Note: I tried to avoid technical jargon and oversimplification in the explanation.) \n\nLet me know if you need more details or if there's anything else I can help with! 😊"}]
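Note that a raw "inputs" string like the one above is treated as text for the model to continue. Qwen2.5-Instruct models are trained on the ChatML conversation format, so for proper multi-turn chat you can construct the prompt explicitly. The sketch below shows the ChatML layout; in practice the model's tokenizer (via apply_chat_template) would normally handle this for you:

```python
def build_chatml_prompt(messages):
    """Render a list of {'role', 'content'} dicts in the ChatML format
    used by Qwen2.5-Instruct, ending with an open assistant turn."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms"},
])
```

The resulting prompt can then be sent as {"inputs": prompt, "parameters": generation_params}.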
Finally, clean up by deleting the model and endpoint to avoid incurring ongoing charges:
predictor.delete_model()
predictor.delete_endpoint()
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.