Leveraging DeepSeek-R1 on AWS

A compilation of resources on how to leverage DeepSeek AI models on AWS, covering inference and fine-tuning across serverless, CPU and GPU options

Daniel Wirjo
Amazon Employee
Published Jan 28, 2025
Last Modified Feb 18, 2025
Cover image generated with Amazon Nova Canvas.
Updated 21 February 2025: added a reference to the vLLM v0.7.1 optimizations for DeepSeek AI models and an example for Ray on EKS.

Introduction to DeepSeek-R1

DeepSeek-R1 is a groundbreaking generative AI foundation model that combines reinforcement learning with a highly efficient Mixture of Experts architecture, delivering high performance at a fraction of the cost. The model is remarkably resource-efficient while maintaining strong capabilities in reasoning, mathematics, and coding.
This post covers how you can best use variants of the model in your AWS environment, drawing on resources and code samples from AWS experts globally.

Serverless inference using Amazon Bedrock Custom Model Import

DeepSeek released distilled versions of R1 that are compatible with the Llama Grouped Query Attention (GQA) architecture at 8B and 70B parameters: DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Llama-70B. As a result, you can export the model weights and use Amazon Bedrock Custom Model Import. This enables you to leverage Bedrock's serverless infrastructure, unified API, and tooling such as Guardrails for responsible AI safeguards. Another advantage is cost-efficiency: you are billed in 5-minute windows based on the custom model units required to service your inference volume, with 2 units for the 8B model and 8 units for the 70B model. Importing the model can take up to 30 minutes, and bear in mind a cold-start latency of 10 seconds.
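For illustration, here is a minimal sketch of invoking the model once the import completes, using the Bedrock runtime API via boto3. The region, account ID, and model ARN are placeholders, and the request body assumes the Llama-style prompt schema used by the distilled variants; adjust it to your model's expected schema.
```python
import json
import boto3

# Placeholder ARN: replace with the ARN of your imported model,
# shown in the Bedrock console after Custom Model Import completes.
MODEL_ARN = "arn:aws:bedrock:us-east-1:111122223333:imported-model/abc123example"

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.invoke_model(
    modelId=MODEL_ARN,
    body=json.dumps({
        "prompt": "What is 17 * 24? Think step by step.",
        "max_gen_len": 512,
        "temperature": 0.6,
    }),
)
print(json.loads(response["body"].read()))
```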

Real-time inference on CPU using AWS Graviton and Amazon SageMaker AI

Using quantization, you can deploy DeepSeek-R1 for real-time inference without the need for GPUs. To do this, you will need to export the model to a framework compatible with CPU-based inference, such as llama.cpp. Typically, 4-bit or 5-bit quantization provides the best speed/accuracy trade-off. While you can quantize the model yourself, many quantized versions have already been published by the open-source community on Hugging Face, for example bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF from the creators of LM Studio, or collections/unsloth/deepseek-r1-all-versions from the creators of Unsloth.
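For illustration, a quantized GGUF file can be loaded directly with the llama-cpp-python bindings; the file path, quantization level, context length, and thread count below are assumptions to adapt to the variant you download and the instance you run on.
```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Assumed local path to a community-quantized GGUF file, e.g. a Q4_K_M quant
# of DeepSeek-R1-Distill-Llama-8B downloaded from Hugging Face.
llm = Llama(
    model_path="./DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf",
    n_ctx=4096,    # context window
    n_threads=8,   # match the vCPUs on your (Graviton) instance
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain grouped query attention in two sentences."}],
    max_tokens=256,
)
print(output["choices"][0]["message"]["content"])
```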
Learn more on:

Real-time inference on GPU using Amazon Bedrock Marketplace

For more advanced and/or scaled use cases, you can deploy the DeepSeek-R1 model using Amazon Bedrock Marketplace. Bedrock Marketplace offers over 100 popular, emerging, and specialized foundation models alongside the current selection of industry-leading models in Amazon Bedrock. Deploying the model with Bedrock Marketplace to managed endpoints only takes a few clicks. The recommended instance for serving the 671B parameter model is p5e.48xlarge. The recommended instances to run distilled variants can be found on the AWS console when you deploy the model.
Learn more on DeepSeek-R1 model now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart by Vivek Gangasani, Niithiyn Vijeaswaran, Jonathan Evans, Banu Nagasundaram.
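Once deployed, the Marketplace endpoint can be invoked through the standard Bedrock runtime APIs. Below is a minimal sketch, assuming a placeholder endpoint ARN and that the deployed endpoint supports the Converse API; otherwise, use invoke_model with the model's native request schema.
```python
import boto3

# Placeholder: replace with the endpoint ARN shown after deploying
# DeepSeek-R1 from Bedrock Marketplace.
ENDPOINT_ARN = "arn:aws:sagemaker:us-east-1:111122223333:endpoint/deepseek-r1-example"

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId=ENDPOINT_ARN,
    messages=[{"role": "user", "content": [{"text": "Summarize the benefits of mixture-of-experts models."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.6},
)
print(response["output"]["message"]["content"][0]["text"])
```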

Real-time inference on GPU using Amazon SageMaker AI

For more advanced and/or scaled use cases and greater flexibility, you can deploy DeepSeek-R1 for real-time inference on GPU-based instances on Amazon SageMaker AI. Here, SageMaker provides an inference-optimized stack with Large Model Inference (LMI) containers that support popular inference optimization libraries such as vLLM and NVIDIA's TensorRT-LLM. SageMaker is continually optimized for scalability and efficiency, with features such as container caching and scale-to-zero (announced at re:Invent 2024). Depending on the size of the model, you will need to request and select an appropriate GPU instance and configuration, such as the tensor parallelism (TP) degree and batching. For example:
DeepSeek Model | AWS GPU Instance | Estimated Monthly Cost
deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | g4dn.xlarge | $310
deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | g6e.xlarge | $880
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | g6e.12xlarge (TP=4) | $4,100
deepseek-ai/DeepSeek-R1-Distill-Llama-70B | g6e.12xlarge (TP=4) | $5,900
deepseek-ai/DeepSeek-R1 | p5e.48xlarge (TP=8) | $30,000
*Note: The instances and costs are estimates only. Contact your AWS account team to learn more.
Learn more in the example notebook by Vivek Gangasani, Deployment on Amazon SageMaker - DeepSeek R1 on AWS by Davide Gallitelli, or aws-samples/deepseek-on-sagemaker by Sungmin Kim.
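As a rough sketch of the LMI deployment pattern described above, the snippet below deploys the 32B distilled variant with vLLM as the backend. The container image URI, instance type, and TP degree mirror the table but are assumptions; validate them against the latest LMI release notes for your region.
```python
import sagemaker
from sagemaker import Model, serializers, deserializers

role = sagemaker.get_execution_role()  # assumes a SageMaker execution role
endpoint_name = "deepseek-r1-distill-qwen-32b"

# Assumed LMI (DJL-Serving) container image; check the LMI release notes
# for the latest image URI available in your region.
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124"

model = Model(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
        "OPTION_ROLLING_BATCH": "vllm",   # vLLM backend with continuous batching
        "TENSOR_PARALLEL_DEGREE": "4",    # shard across the 4 GPUs of g6e.12xlarge
    },
)
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6e.12xlarge",
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=900,  # allow time to download weights
)

predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)
print(predictor.predict({"inputs": "Why is the sky blue?", "parameters": {"max_new_tokens": 256}}))
```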

Real-time inference on AWS Trainium using Amazon SageMaker AI

In addition to (NVIDIA) GPUs, AWS also offers instances that utilize AWS Trainium AI chips. These are purpose-built for specific generative AI workloads and can be leveraged using the AWS Neuron SDK. While the popular open-source inference library vLLM supports AWS Neuron, it may not yet have the comprehensive inference optimization capabilities available on (NVIDIA) GPUs. For example, as at the time of writing, the AWS Neuron What's New states:
Currently, we support continuous batching and streaming generation in the NxD Inference vLLM integration. We are working with the vLLM community to enable support for other vLLM features like PagedAttention and Chunked Prefill on Neuron instances through NxD Inference in upcoming releases.
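For illustration, offline inference through the vLLM Neuron integration looks roughly like the sketch below. The model, sequence limits, and tensor parallel degree are assumptions, and the arguments may change across vLLM and Neuron SDK releases, so check the current vLLM Neuron documentation.
```python
from vllm import LLM, SamplingParams

# Illustrative settings for a Trainium/Inferentia instance with the Neuron SDK
# installed; adjust tensor_parallel_size to the NeuronCores you want to use.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    device="neuron",
    tensor_parallel_size=2,
    max_num_seqs=8,       # the Neuron backend currently uses static batch sizes
    max_model_len=2048,
)

outputs = llm.generate(
    ["Explain the difference between Trainium and Inferentia in one paragraph."],
    SamplingParams(temperature=0.6, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```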

Scaling real-time inference using Ray on EKS or Amazon EKS Auto Mode

If you have Kubernetes expertise or prefer greater flexibility and control over your inference infrastructure, you can use:
  1. Amazon EKS Auto Mode: As released at re:Invent 2024, Amazon EKS Auto Mode fully automates Kubernetes cluster management for compute, storage, and networking on AWS. Here, you can set up an inference container using a library such as vLLM, and orchestrate scaling using Kubernetes.
  2. Ray on EKS: Ray is an open-source framework that helps you easily scale and manage AI workloads.
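For illustration, a minimal Ray Serve deployment wrapping vLLM might look like the sketch below; the model, resource settings, and request schema are assumptions. On EKS, this would typically be packaged into a container image and deployed as a RayService.
```python
from ray import serve
from vllm import LLM, SamplingParams

@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class DeepSeekDistill:
    def __init__(self):
        # Each replica loads the model onto its own GPU.
        self.llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")

    async def __call__(self, request):
        payload = await request.json()
        outputs = self.llm.generate(
            [payload["prompt"]],
            SamplingParams(temperature=0.6, max_tokens=512),
        )
        return {"completion": outputs[0].outputs[0].text}

app = DeepSeekDistill.bind()
# serve.run(app)  # deploy onto the Ray cluster (e.g. a RayService running on EKS)
```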
Learn more on:

Deploy and fine-tune from Hugging Face

The release of DeepSeek demonstrates the power of open-source models: in just a few days since release, more than 500 derivative models have been created on Hugging Face. Hugging Face collaborates with AWS to make it easier for developers to deploy the latest Hugging Face models on AWS services to build generative AI applications. In addition, Hugging Face has its own training and inference containers, including Hugging Face Text Generation Inference (TGI).
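As a sketch of this path on SageMaker (the instance type, token limits, and GPU count are assumptions), the TGI container can serve a distilled checkpoint directly from the Hugging Face Hub:
```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # assumes a SageMaker execution role

# Resolve the latest Hugging Face TGI container for the current region.
image_uri = get_huggingface_llm_image_uri("huggingface")

model = HuggingFaceModel(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "SM_NUM_GPUS": "1",
        "MAX_INPUT_TOKENS": "4096",
        "MAX_TOTAL_TOKENS": "8192",
    },
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g6e.2xlarge")
print(predictor.predict({"inputs": "Give a one-sentence definition of reinforcement learning."}))
```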

Fine-tuning on Amazon SageMaker

Due to its efficiency, DeepSeek-R1 is a good candidate base model for fine-tuning, where you customize the model for specialized tasks. You can use SageMaker to orchestrate efficient fine-tuning, using popular libraries such as PyTorch FSDP for sharded data parallelism and QLoRA for reducing memory usage.
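For illustration, the core QLoRA setup inside a training script might look like the sketch below; the base model, rank, and target modules are assumptions. SageMaker then orchestrates the training job, with PyTorch FSDP sharding the model when you scale beyond a single GPU.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the distilled base model in 4-bit (QLoRA) to reduce GPU memory usage.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach low-rank adapters; only these small matrices are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```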

Conclusion

DeepSeek's open-source models have been making waves in the AI and startup ecosystem recently, primarily due to their reasoning capabilities and efficiency. This expands the toolkit for builders looking to create solutions with AI, and highlights the importance of designing systems that can adapt to evolving AI models. This is where evals and open-source tools such as Promptfoo for offline evals and Langfuse for observability can assist.
It is also worth noting that DeepSeek models come in two distinct model architectures:
  • DeepSeek-R1-Distill models with Llama and Qwen variants. These variants leverage the Grouped Query Attention (GQA) model architecture and can be leveraged immediately on AWS.
  • DeepSeek-R1, a 671B parameter model, with a new and innovative model architecture. To fully benefit from its efficiency, specific inference optimization is required (see DeepSeek Model Optimizations on SGLang or the vLLM v0.7.1 optimizations).
For optimal price-performance, consider starting with the DeepSeek distilled (or quantized) variants and evaluate whether they meet your needs.
For high-volume use cases and the DeepSeek-R1 671B parameter model, consider:

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
