Deploying Alibaba's Qwen-2.5 Model on AWS using Amazon SageMaker AI

Step-by-step guide: Deploying Qwen2.5-14B-Instruct Model on Amazon SageMaker AI

Jarrett
Amazon Employee
Published Jan 31, 2025

Introduction

Alibaba, a Chinese technology company, has recently made headlines with the introduction of Qwen 2.5-Max, an advanced AI model that challenges existing industry leaders. Unveiled on January 31, 2025, Qwen 2.5-Max claims to outperform models from DeepSeek, OpenAI, and Meta, particularly in areas such as coding, mathematics, and vision-language tasks.
Qwen 2.5-Max is accessible through the Qwen Chat platform and is not fully open-source at the time of writing, with some components available on open-source platforms such as Qwen 2.5 on Huggingface. Note that this blog post focuses on deploying Qwen 2.5 on AWS.
Hosting Qwen 2.5 on AWS gives you scalability and flexibility, so you can leverage its AI capabilities for your specific use case, whether for research, business intelligence, or development projects. This blog post walks you through a step-by-step process for hosting Qwen 2.5, specifically the Qwen2.5-14B-Instruct model, on AWS infrastructure by deploying it on Amazon SageMaker AI, enabling you to harness Qwen 2.5's capabilities in the cloud.

Instructions

Section 1. Raising endpoint usage service quota

In this tutorial, you will deploy the Qwen2.5-14B-Instruct model on the ml.g5.12xlarge instance type.
Because the default quota for ml.g5.12xlarge for endpoint usage is 0, you will need to raise it.
In the AWS Console, search for Service Quotas
Next, search for SageMaker
Search for ml.g5.12xlarge for endpoint usage, then click on Request increase at account level
Enter 1 under Increase quota value
Finally, wait until your request is processed; you should see a success message within a few minutes. If you prefer to script this step, see the sketch below.
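As an alternative to the console steps above, here is a minimal sketch of how the same quota increase could be requested programmatically with boto3 and the Service Quotas API. The quota name string is taken from the console label above; the required IAM permissions (servicequotas:ListServiceQuotas and servicequotas:RequestServiceQuotaIncrease) are assumptions.

```python
import boto3

# Sketch: request the ml.g5.12xlarge endpoint-usage quota increase programmatically.
sq = boto3.client("service-quotas")

# Look up the quota code by its display name (as shown in the console).
quota_code = None
paginator = sq.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if quota["QuotaName"] == "ml.g5.12xlarge for endpoint usage":
            quota_code = quota["QuotaCode"]
            break

# Request an increase to 1 and print the request status.
if quota_code:
    response = sq.request_service_quota_increase(
        ServiceCode="sagemaker",
        QuotaCode=quota_code,
        DesiredValue=1.0,
    )
    print(response["RequestedQuota"]["Status"])
```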

Section 2. Setting up SageMaker AI Domain

In this section, you will set up your SageMaker AI domain in the region of your choice. Follow this guide for a fuss-free creation of your domain. This can take a while, so be patient!
Click on the newly created domain
Once the domain has been successfully created, click on User Profiles, browse to your user profile, and under Launch, select Studio. This will bring you to the SageMaker AI Studio console.
In the Studio console, click on JupyterLab and follow this guide to create a new space. Once you have opened JupyterLab, you should see the Launcher tab; if not, click the + button. Under Notebook, click Python 3 (ipykernel) to create a new notebook.

Section 3. Deploying Qwen-2.5

We are finally ready to deploy Qwen-2.5 on Amazon SageMaker AI!
Copy and paste each of the code blocks below into its own cell and press the Play button to run it. By the end, you will have deployed your Qwen-2.5 model as a SageMaker endpoint.

Install the SageMaker Python SDK

First, make sure that the latest version of the SageMaker Python SDK is installed.
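A minimal sketch of what this install cell might look like, run from within the notebook:

```python
# Upgrade to the latest SageMaker Python SDK inside the notebook kernel.
%pip install --upgrade --quiet sagemaker
```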

Setup account and role

Then, we import the SageMaker Python SDK and instantiate a sagemaker_session, which we use to determine the current region and execution role.
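A sketch of that setup cell might look like the following:

```python
import sagemaker

# Create a session and resolve the current region and execution role.
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

print(f"Region: {region}")
print(f"Execution role: {role}")
```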

Retrieve the LLM Image URI

We use the helper function get_huggingface_llm_image_uri() to generate the appropriate image URI for the Hugging Face Large Language Model (LLM) inference container. The function takes a required backend parameter and several optional parameters. The backend specifies the type of inference backend to use for the model; pass the value "huggingface", which selects the Hugging Face Text Generation Inference (TGI) backend.
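A sketch of that cell, assuming the helper is imported from the SageMaker SDK and the region variable from the previous cell:

```python
from sagemaker.huggingface import get_huggingface_llm_image_uri

# Resolve the Hugging Face TGI inference container image for the current region.
image_uri = get_huggingface_llm_image_uri(backend="huggingface", region=region)
print(image_uri)
```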

Create the Hugging Face Model

Next, we configure the model object by specifying a unique name, the image_uri for the managed TGI container, and the execution role for the endpoint. Additionally, we specify a number of environment variables, including HF_MODEL_ID, which corresponds to the model from the Hugging Face Hub that will be deployed, and HF_TASK, which configures the inference task to be performed by the model. You should also define SM_NUM_GPUS, which specifies the tensor parallelism degree of the model. Tensor parallelism splits the model across multiple GPUs, which is necessary when working with LLMs that are too big for a single GPU. Here, set SM_NUM_GPUS to the number of GPUs available on your selected instance type.
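A sketch of the model configuration cell; the model name is illustrative, and SM_NUM_GPUS is set to 4 to match the ml.g5.12xlarge instance used later:

```python
from sagemaker.huggingface import HuggingFaceModel

# Environment variables passed to the TGI container.
config = {
    "HF_MODEL_ID": "Qwen/Qwen2.5-14B-Instruct",  # model id on the Hugging Face Hub
    "HF_TASK": "text-generation",                # inference task
    "SM_NUM_GPUS": "4",                          # ml.g5.12xlarge has 4 GPUs
}

model = HuggingFaceModel(
    name="qwen2-5-14b-instruct",  # illustrative model name
    image_uri=image_uri,
    env=config,
    role=role,
)
```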

Creating a SageMaker Endpoint

Next, we deploy the model by invoking the deploy() function. To efficiently deploy and run large language models, it is important to choose an instance type that can handle the computational requirements. Here we use an ml.g5.12xlarge instance, which comes with 4 GPUs. Please refer to the guide provided by Amazon SageMaker on instance type selection for large model inference.
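A sketch of the deployment cell; the endpoint name and the startup health-check timeout are illustrative values:

```python
# Deploy the model to a real-time endpoint on a 4-GPU instance.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    endpoint_name="qwen2-5-14b-instruct",        # illustrative endpoint name
    container_startup_health_check_timeout=900,  # allow time to download model weights
)
```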

Sample Inference Usage

Once the model has been deployed successfully, in the next cell you can specify inference parameters and invoke the model:
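A sketch of an inference cell using the TGI request format; the prompt and parameter values are illustrative:

```python
# Invoke the endpoint with a prompt and generation parameters.
response = predictor.predict({
    "inputs": "Give me a one-sentence summary of Amazon SageMaker.",
    "parameters": {
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.9,
        "do_sample": True,
    },
})
print(response[0]["generated_text"])
```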
Here is a sample response:

Cleaning Up

After you've finished using the endpoint, it's important to delete it to avoid incurring unnecessary costs.
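A sketch of the cleanup cell, using the predictor created earlier:

```python
# Delete the model and endpoint to stop incurring charges.
predictor.delete_model()
predictor.delete_endpoint()
```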

FAQ

When should I consider using Amazon SageMaker over Amazon Bedrock Custom Model Import?

One reason to choose Amazon SageMaker over importing a customized model into Amazon Bedrock is if your target region does not yet support Custom Model Import. View the available regions for Amazon Bedrock Custom Model Import here.

About the Authors

Germaine Ong - Startup Solutions Architect, AWS
Germaine is a Startup Solutions Architect in the AWS ASEAN Startup team covering Singapore Startup customers. She is an advocate for helping customers modernise their cloud workloads and improving their security stature through architecture reviews.
Jarrett Yeo - Associate Cloud Architect, AWS
Jarrett Yeo Shan Wei is a Delivery Consultant in the AWS Professional Services team covering the Public Sector across ASEAN and is an advocate for helping customers modernize and migrate into the cloud. He has attained five AWS certifications, and has also published a research paper on gradient boosting machine ensembles in the 8th International Conference on AI. In his free time, Jarrett focuses on and contributes to the generative AI scene at AWS.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
