Leveraging DeepSeek-R1 on AWS

A compilation of resources on how to leverage DeepSeek AI models on AWS, covering inference and fine-tuning across serverless, CPU and GPU options

Daniel Wirjo
Amazon Employee
Published Jan 28, 2025
Last Modified Feb 18, 2025
Cover image generated with Amazon Nova Canvas.
Updated 21 February 2025: added a reference to the vLLM v0.7.1 optimizations for DeepSeek AI models and an example for Ray on EKS.

Introduction to DeepSeek-R1

DeepSeek-R1 is a groundbreaking generative AI foundation model that combines reinforcement learning with a highly efficient Mixture of Experts architecture, delivering high performance at a fraction of the cost. The model is remarkably resource-efficient while maintaining strong capabilities in reasoning, mathematics, and coding.
This post covers how you can best use variants of the model in your AWS environment, drawing on resources and code samples from AWS experts globally.

Serverless inference using Amazon Bedrock Custom Model Import

DeepSeek released distilled versions of R1 that are compatible with the Llama Grouped Query Attention (GQA) architecture at 8B and 70B parameters: DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Llama-70B. As a result, you can export the model weights and use Amazon Bedrock Custom Model Import. This enables you to leverage Bedrock's serverless infrastructure, unified API, and tooling such as Guardrails for responsible AI safeguards. Another advantage is cost-efficiency: you are billed in 5-minute windows based on the custom model units required to service your inference volume, with 2 units for the 8B model and 8 units for the 70B model. Importing the model can take up to 30 minutes, and bear in mind a cold-start latency of 10 seconds.
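For illustration, here is a minimal sketch of invoking the model once the import completes, using the Bedrock runtime API via boto3. The region, account ID, and model ARN are placeholders, and the request body assumes the Llama-style prompt schema used by the distilled variants; adjust it to your model's expected schema.
```python
import json
import boto3

# Placeholder ARN: replace with the ARN of your imported model,
# shown in the Bedrock console after Custom Model Import completes.
MODEL_ARN = "arn:aws:bedrock:us-east-1:111122223333:imported-model/abc123example"

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.invoke_model(
    modelId=MODEL_ARN,
    body=json.dumps({
        "prompt": "What is 17 * 24? Think step by step.",
        "max_gen_len": 512,
        "temperature": 0.6,
    }),
)
print(json.loads(response["body"].read()))
```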

Real-time inference on CPU using AWS Graviton and Amazon SageMaker AI

Using quantization, you can deploy DeepSeek-R1 for real-time inference without the need for GPUs. To do this, you will need to export the model to a framework compatible with CPU-based inference, such as llama.cpp. Typically, 4-bit or 5-bit quantization provides the best speed/accuracy trade-off. While you can quantize the model yourself, many quantized versions have already been published by the open-source community on Hugging Face, for example bartowski/DeepSeek-R1-Distill-Llama-70B-GGUF from the creators of LM Studio, or collections/unsloth/deepseek-r1-all-versions from the creators of Unsloth.
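For illustration, a quantized GGUF file can be loaded directly with the llama-cpp-python bindings; the file path, quantization level, context length, and thread count below are assumptions to adapt to the variant you download and the instance you run on.
```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Assumed local path to a community-quantized GGUF file, e.g. a Q4_K_M quant
# of DeepSeek-R1-Distill-Llama-8B downloaded from Hugging Face.
llm = Llama(
    model_path="./DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf",
    n_ctx=4096,    # context window
    n_threads=8,   # match the vCPUs on your (Graviton) instance
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain grouped query attention in two sentences."}],
    max_tokens=256,
)
print(output["choices"][0]["message"]["content"])
```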
Learn more on:

Real-time inference on GPU using Amazon Bedrock Marketplace

For more advanced and/or scaled use cases, you can deploy the DeepSeek-R1 model using Amazon Bedrock Marketplace. Bedrock Marketplace offers over 100 popular, emerging, and specialized foundation models alongside the current selection of industry-leading models in Amazon Bedrock. Deploying the model with Bedrock Marketplace to managed endpoints only takes a few clicks. The recommended instance for serving the 671B parameter model is p5e.48xlarge. The recommended instances to run distilled variants can be found on the AWS console when you deploy the model.
Learn more on DeepSeek-R1 model now available in Amazon Bedrock Marketplace and Amazon SageMaker JumpStart by Vivek Gangasani, Niithiyn Vijeaswaran, Jonathan Evans, Banu Nagasundaram.
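Once deployed, the Marketplace endpoint can be invoked through the standard Bedrock runtime APIs. Below is a minimal sketch, assuming a placeholder endpoint ARN and that the deployed endpoint supports the Converse API; otherwise, use invoke_model with the model's native request schema.
```python
import boto3

# Placeholder: replace with the endpoint ARN shown after deploying
# DeepSeek-R1 from Bedrock Marketplace.
ENDPOINT_ARN = "arn:aws:sagemaker:us-east-1:111122223333:endpoint/deepseek-r1-example"

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.converse(
    modelId=ENDPOINT_ARN,
    messages=[{"role": "user", "content": [{"text": "Summarize the benefits of mixture-of-experts models."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.6},
)
print(response["output"]["message"]["content"][0]["text"])
```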

Real-time inference on GPU using Amazon SageMaker AI

For more advanced and/or scaled use cases and greater flexibility, you can deploy DeepSeek-R1 for real-time inference on GPU-based instances on Amazon SageMaker AI. Here, SageMaker provides an inference-optimized stack with Large Model Inference (LMI) containers that support popular inference optimization libraries such as vLLM and NVIDIA's TensorRT-LLM. SageMaker is continually optimized for scalability and efficiency, with features such as container caching and scale-to-zero (announced at re:Invent 2024). Depending on the size of the model, you will need to request and select an appropriate GPU instance and configuration, such as the tensor parallelism (TP) degree and batching. For example:
DeepSeek Model | AWS GPU Instance | Estimated Monthly Cost
deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | g4dn.xlarge | $310
deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | g6e.xlarge | $880
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | g6e.12xlarge (TP=4) | $4,100
deepseek-ai/DeepSeek-R1-Distill-Llama-70B | g6e.12xlarge (TP=4) | $5,900
deepseek-ai/DeepSeek-R1 | p5e.48xlarge (TP=8) | $30,000
*Note: The instances and costs are estimates only. Contact your AWS account team to learn more.
Learn more in the example notebook by Vivek Gangasani, Deployment on Amazon SageMaker - DeepSeek R1 on AWS by Davide Gallitelli, or aws-samples/deepseek-on-sagemaker by Sungmin Kim.
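As a rough sketch of the LMI deployment pattern described above, the snippet below deploys the 32B distilled variant with vLLM as the backend. The container image URI, instance type, and TP degree mirror the table but are assumptions; validate them against the latest LMI release notes for your region.
```python
import sagemaker
from sagemaker import Model, serializers, deserializers

role = sagemaker.get_execution_role()  # assumes a SageMaker execution role
endpoint_name = "deepseek-r1-distill-qwen-32b"

# Assumed LMI (DJL-Serving) container image; check the LMI release notes
# for the latest image URI available in your region.
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124"

model = Model(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
        "OPTION_ROLLING_BATCH": "vllm",   # vLLM backend with continuous batching
        "TENSOR_PARALLEL_DEGREE": "4",    # shard across the 4 GPUs of g6e.12xlarge
    },
)
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6e.12xlarge",
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=900,  # allow time to download weights
)

predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)
print(predictor.predict({"inputs": "Why is the sky blue?", "parameters": {"max_new_tokens": 256}}))
```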

Real-time inference on AWS Trainium using Amazon SageMaker AI

In addition to (NVIDIA) GPUs, AWS also offers instances that utilize AWS Trainium AI chips. These are purpose-built for specific generative AI workloads and can be leveraged using the AWS Neuron SDK. While the popular open-source inference library vLLM supports AWS Neuron, it may not yet have the comprehensive inference optimization capabilities available on (NVIDIA) GPUs. For example, as at the time of writing, the AWS Neuron What's New states:
Currently, we support continuous batching and streaming generation in the NxD Inference vLLM integration. We are working with the vLLM community to enable support for other vLLM features like PagedAttention and Chunked Prefill on Neuron instances through NxD Inference in upcoming releases.
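For illustration, offline inference through the vLLM Neuron integration looks roughly like the sketch below. The model, sequence limits, and tensor parallel degree are assumptions, and the arguments may change across vLLM and Neuron SDK releases, so check the current vLLM Neuron documentation.
```python
from vllm import LLM, SamplingParams

# Illustrative settings for a Trainium/Inferentia instance with the Neuron SDK
# installed; adjust tensor_parallel_size to the NeuronCores you want to use.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    device="neuron",
    tensor_parallel_size=2,
    max_num_seqs=8,       # the Neuron backend currently uses static batch sizes
    max_model_len=2048,
)

outputs = llm.generate(
    ["Explain the difference between Trainium and Inferentia in one paragraph."],
    SamplingParams(temperature=0.6, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```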

Scaling real-time inference using Ray on EKS or Amazon EKS Auto Mode

If you have Kubernetes expertise or prefer greater flexibility and control over your inference infrastructure, you can use:
  1. Amazon EKS Auto Mode: As released at re:Invent 2024, Amazon EKS Auto Mode fully automates Kubernetes cluster management for compute, storage, and networking on AWS. Here, you can set up an inference container using a library such as vLLM, and orchestrate scaling using Kubernetes.
  2. Ray on EKS: Ray is an open-source framework that helps you easily scale and manage AI workloads.
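For illustration, a minimal Ray Serve deployment wrapping vLLM might look like the sketch below; the model, resource settings, and request schema are assumptions. On EKS, this would typically be packaged into a container image and deployed as a RayService.
```python
from ray import serve
from vllm import LLM, SamplingParams

@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class DeepSeekDistill:
    def __init__(self):
        # Each replica loads the model onto its own GPU.
        self.llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")

    async def __call__(self, request):
        payload = await request.json()
        outputs = self.llm.generate(
            [payload["prompt"]],
            SamplingParams(temperature=0.6, max_tokens=512),
        )
        return {"completion": outputs[0].outputs[0].text}

app = DeepSeekDistill.bind()
# serve.run(app)  # deploy onto the Ray cluster (e.g. a RayService running on EKS)
```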
Learn more on:

Deploy and fine-tune from Hugging Face

The release of DeepSeek demonstrates the power of open-source models: in just a few days since release, more than 500 derivative models have been created on Hugging Face. Hugging Face collaborates with AWS to make it easier for developers to deploy the latest Hugging Face models on AWS services to build generative AI applications. In addition, Hugging Face has its own training and inference containers, including Hugging Face Text Generation Inference (TGI).
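As a sketch of this path on SageMaker (the instance type, token limits, and GPU count are assumptions), the TGI container can serve a distilled checkpoint directly from the Hugging Face Hub:
```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # assumes a SageMaker execution role

# Resolve the latest Hugging Face TGI container for the current region.
image_uri = get_huggingface_llm_image_uri("huggingface")

model = HuggingFaceModel(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "SM_NUM_GPUS": "1",
        "MAX_INPUT_TOKENS": "4096",
        "MAX_TOTAL_TOKENS": "8192",
    },
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g6e.2xlarge")
print(predictor.predict({"inputs": "Give a one-sentence definition of reinforcement learning."}))
```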

Fine-tuning on Amazon SageMaker

Due to its efficiency, DeepSeek-R1 is a good candidate base model for fine-tuning, where you customize the model for specialized tasks. You can use SageMaker to orchestrate efficient fine-tuning, using popular libraries such as PyTorch FSDP for sharded data parallelism and QLoRA for reducing memory usage.
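For illustration, the core QLoRA setup inside a training script might look like the sketch below; the base model, rank, and target modules are assumptions. SageMaker then orchestrates the training job, with PyTorch FSDP sharding the model when you scale beyond a single GPU.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the distilled base model in 4-bit (QLoRA) to reduce GPU memory usage.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach low-rank adapters; only these small matrices are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```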

Conclusion

DeepSeek's open-source models have been making waves in the AI and startup ecosystem recently, primarily due to their reasoning capabilities and efficiency. This expands the toolkit for builders looking to create solutions with AI, and highlights the importance of designing systems that can adapt to evolving AI models. This is where evals and open-source tools such as Promptfoo for offline evals and Langfuse for observability can assist.
It is also worth noting that DeepSeek models come in two distinct model architectures:
  • DeepSeek-R1-Distill models with Llama and Qwen variants. These variants leverage the Grouped Query Attention (GQA) model architecture and can be leveraged immediately on AWS.
  • DeepSeek-R1, a 671B parameter model, with a new and innovative model architecture. To fully benefit from its efficiency, specific inference optimization is required (see DeepSeek Model Optimizations on SGLang or the vLLM v0.7.1 optimizations).
For optimal price-performance, consider starting with the DeepSeek distilled (or quantized) variants and evaluate whether they meet your needs.
For high-volume use cases and the DeepSeek-R1 671B parameter model, consider:

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
