
Deploying Alibaba's Qwen-2.5 Model on AWS using Amazon SageMaker AI
Step-by-step guide: Deploying Qwen2.5-14B-Instruct Model on Amazon SageMaker AI
In this guide, we will deploy Alibaba's Qwen2.5-14B-Instruct model on AWS infrastructure. The deployment runs the model on Amazon SageMaker AI, enabling you to harness Qwen 2.5's capabilities in the cloud. We will host the Qwen2.5-14B-Instruct model on the ml.g5.12xlarge instance type. Note that if your account's service quota for ml.g5.12xlarge for endpoint usage is 0, you will need to raise it before deploying.






In your SageMaker JupyterLab environment, under Notebook, click on Python 3 (ipykernel) to create a new notebook. Start by installing a recent version of the SageMaker Python SDK:
%pip install "sagemaker>=2.163.0"
Next, import the required libraries and create a sagemaker_session, which we use to determine the current region and execution role.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
import time
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()
Use get_huggingface_llm_image_uri() to generate the appropriate image URI for the Hugging Face Large Language Model (LLM) inference container. The function takes a required parameter, backend, and several optional parameters. The backend specifies the type of backend to use for the model; set it to "huggingface", which selects the Hugging Face TGI (Text Generation Inference) backend.
image_uri = get_huggingface_llm_image_uri(
    backend="huggingface",
    region=region
)
Next, create the HuggingFaceModel object by specifying a unique name, the image_uri for the managed TGI container, and the execution role for the endpoint. Additionally, we specify a number of environment variables, including HF_MODEL_ID, which corresponds to the model from the Hugging Face Hub that will be deployed, and HF_TASK, which configures the inference task to be performed by the model. You should also define SM_NUM_GPUS, which specifies the tensor parallelism degree of the model. Tensor parallelism splits the model across multiple GPUs, which is necessary when working with LLMs that are too big for a single GPU. Set SM_NUM_GPUS to the number of GPUs available on your selected instance type.
model_name = "qwen-14b-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

hub = {
    'HF_MODEL_ID': 'Qwen/Qwen2.5-14B-Instruct',
    'HF_TASK': 'conversational',
    'SM_NUM_GPUS': '4'
}

model = HuggingFaceModel(
    name=model_name,
    env=hub,
    role=role,
    image_uri=image_uri
)
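SM_NUM_GPUS must match the GPU count of the instance type you plan to use; a mismatch wastes capacity or fails at startup. As a guard, here is a small hypothetical helper (not part of the SageMaker SDK) with GPU counts for common g5 sizes, taken from the EC2 G5 instance specifications:

```python
# GPU counts for common ml.g5 instance sizes (per EC2 G5 specifications).
G5_GPU_COUNT = {
    "ml.g5.xlarge": 1,
    "ml.g5.2xlarge": 1,
    "ml.g5.4xlarge": 1,
    "ml.g5.8xlarge": 1,
    "ml.g5.16xlarge": 1,
    "ml.g5.12xlarge": 4,
    "ml.g5.24xlarge": 4,
    "ml.g5.48xlarge": 8,
}

def sm_num_gpus(instance_type: str) -> str:
    """Return the SM_NUM_GPUS value matching a given g5 instance type."""
    try:
        return str(G5_GPU_COUNT[instance_type])
    except KeyError:
        raise ValueError(f"Unknown instance type: {instance_type}")
```

With this, you could set hub['SM_NUM_GPUS'] = sm_num_gpus("ml.g5.12xlarge") instead of hard-coding '4'.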
Deploy the model to a real-time endpoint using the deploy() function. To efficiently deploy and run large language models, it is important to choose an instance type that can handle their computational requirements. Here we use an ml.g5.12xlarge instance, which comes with 4 GPUs. Please refer to the guide provided by Amazon SageMaker on instance type selection for large model inference.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    endpoint_name=model_name
)
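By default, deploy() blocks until the endpoint is in service, which can take several minutes for a 14B model. If you pass wait=False to deploy(), you can poll the endpoint status yourself. Below is a minimal generic polling sketch; the describe callable is a stand-in for a real status check such as boto3's describe_endpoint:

```python
import time

def wait_for_endpoint(describe, poll_seconds=30, timeout_seconds=1800):
    """Poll `describe()` (expected to return an EndpointStatus string such as
    'Creating', 'InService', or 'Failed') until the endpoint is ready."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = describe()
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("Endpoint creation failed")
        time.sleep(poll_seconds)
    raise TimeoutError("Endpoint did not become ready in time")

# With boto3, this could be wired up as:
#   sm = boto3.client("sagemaker")
#   wait_for_endpoint(lambda: sm.describe_endpoint(
#       EndpointName=model_name)["EndpointStatus"])
```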
Once the endpoint is in service, you can run inference against it with predictor.predict(). You can also pass generation parameters to control the sampling behavior of the model.
# Advanced generation parameters
generation_params = {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.7,
    "max_new_tokens": 512
}

# Sample request
predictor.predict({
    "inputs": "Explain quantum computing in simple terms",
    "parameters": generation_params
})
The endpoint returns a response similar to:
[{'generated_text': "Explain quantum computing in simple terms.\nQuantum computing is a type of computing that uses quantum mechanics, a branch of physics, to perform operations on data. Unlike classical computers that use bits (1s and 0s) to store and process information, quantum computers use quantum bits, or qubits. \n\nQubits have some unique properties that make them very powerful for certain types of computations. One of these properties is called superposition, which means that a qubit can be in multiple states at the same time. Another property is entanglement, which means that qubits can be linked together in a way that their states become dependent on each other.\n\nBecause of these properties, quantum computers can perform certain calculations much faster than classical computers. For example, they can quickly factor large numbers, which is important for cryptography and secure communication. They can also simulate complex quantum systems, which is useful for chemistry and materials science.\n\nIn summary, quantum computing is a way of processing information using quantum mechanics, which can solve certain problems much faster than classical computing. However, building and operating quantum computers is still a challenging task and they are not yet widely available. \n\nI hope that helps! Let me know if you have any more questions. \n\n(Note: I tried to avoid technical jargon and oversimplification in the explanation.) \n\nLet me know if you need more details or if there's anything else I can help with! 😊"}]
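Note that a raw "inputs" string like the one above is treated as text for the model to continue. Qwen2.5-Instruct models are trained on the ChatML conversation format, so for proper multi-turn chat you can construct the prompt explicitly. The sketch below shows the ChatML layout; in practice the model's tokenizer (via apply_chat_template) would normally handle this for you:

```python
def build_chatml_prompt(messages):
    """Render a list of {'role', 'content'} dicts in the ChatML format
    used by Qwen2.5-Instruct, ending with an open assistant turn."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms"},
])
```

The resulting prompt can then be sent as {"inputs": prompt, "parameters": generation_params}.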
Finally, clean up by deleting the model and endpoint to avoid incurring ongoing charges:
predictor.delete_model()
predictor.delete_endpoint()
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.