
Leveraging DeepSeek-R1 on AWS
A compilation of resources on how to leverage DeepSeek AI models on AWS, covering inference and fine-tuning across serverless, CPU, GPU, and AWS Trainium options.
Serverless inference using Amazon Bedrock Custom Model Import
Real-time inference on CPU using AWS Graviton and Amazon SageMaker AI
Real-time inference on GPU using Amazon Bedrock Marketplace
Real-time inference on GPU using Amazon SageMaker AI
Real-time inference on AWS Trainium using Amazon SageMaker AI
Scaling real-time inference using Ray on EKS or Amazon EKS Auto Mode
Deploy and fine-tune from Hugging Face
- DeepSeek-R1 Distill Model on CPU with AWS Graviton4 by Vincent Wang and Yudho Ahmad Diponegoro
- Run LLMs on CPU with Amazon SageMaker Real-time Inference by Alex Tasarov, Aleksandra Jovovic, and Karan Thanvi
The full DeepSeek-R1 model requires a p5e.48xlarge instance. The recommended instances to run the distilled variants can be found in the AWS console when you deploy the model.

| DeepSeek Model | AWS GPU Instance | Estimated Monthly Cost |
|---|---|---|
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | g4dn.xlarge | $310 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | g6e.xlarge | $880 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | g6e.12xlarge (TP=4) | $4,100 |
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | g6e.12xlarge (TP=4) | $5,900 |
| deepseek-ai/DeepSeek-R1 | p5e.48xlarge (TP=8) | $30,000 |
Currently, we support continuous batching and streaming generation in the NxD Inference vLLM integration. We are working with the vLLM community to enable support for other vLLM features like PagedAttention and Chunked Prefill on Neuron instances through NxD Inference in upcoming releases.
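Once a model is served through vLLM's OpenAI-compatible server (for example, via `vllm serve` on a Neuron instance), streaming generation can be consumed with a plain HTTP client. Below is a minimal sketch using only the Python standard library; the endpoint URL and model name are illustrative assumptions, not values from this post.

```python
import json
import urllib.request


def build_streaming_request(prompt: str, model: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload with streaming on."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "stream": True,  # ask the server to stream tokens as server-sent events
    }


def stream_completion(base_url: str, payload: dict):
    """POST the request and yield raw server-sent-event lines as they arrive."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            if line.strip():
                yield line.decode()


if __name__ == "__main__":
    # Hypothetical endpoint and model name -- adjust to your deployment.
    payload = build_streaming_request(
        "Explain chain-of-thought reasoning in one sentence.",
        model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
    )
    for event in stream_completion("http://localhost:8000", payload):
        print(event, end="")
```

The same client works unchanged whether the server runs on GPU or Neuron, since only the serving backend differs.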
- Amazon EKS Auto Mode: Released at re:Invent 2024, Amazon EKS Auto Mode fully automates Kubernetes cluster management for compute, storage, and networking on AWS. With it, you can set up an inference container using a library such as vLLM and orchestrate scaling with Kubernetes.
- Ray on EKS: Ray is an open-source framework that helps you easily scale and manage AI workloads.
- Scaling DeepSeek with Ray on EKS by Vincent Wang and Faisal Masood.
- Hosting DeepSeek-R1 on Amazon EKS by Tiago Reichert and Lucas Duarte.
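To make the EKS path concrete, the sketch below builds a minimal Kubernetes Deployment manifest that runs vLLM's OpenAI-compatible server for a distilled model. The container image, resource names, and GPU counts are illustrative assumptions; the linked posts cover production-ready setups.

```python
import json


def vllm_deployment(model_id: str, replicas: int = 1, tp: int = 1) -> dict:
    """Build a minimal Kubernetes Deployment manifest for a vLLM
    inference container. Image tag and GPU counts are assumptions."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "deepseek-vllm"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": "deepseek-vllm"}},
            "template": {
                "metadata": {"labels": {"app": "deepseek-vllm"}},
                "spec": {
                    "containers": [{
                        "name": "vllm",
                        "image": "vllm/vllm-openai:latest",
                        "args": [
                            "--model", model_id,
                            "--tensor-parallel-size", str(tp),
                        ],
                        "ports": [{"containerPort": 8000}],
                        # Request one GPU per tensor-parallel rank.
                        "resources": {"limits": {"nvidia.com/gpu": str(tp)}},
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = vllm_deployment("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", tp=4)
    # kubectl accepts JSON manifests: pipe this to `kubectl apply -f -`
    print(json.dumps(manifest, indent=2))
```

On EKS Auto Mode, scaling this Deployment up triggers automatic provisioning of matching GPU capacity; with Ray on EKS, the equivalent container would instead be scheduled by a RayCluster.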
- DeepSeek-R1-Distill models with Llama and Qwen variants. These variants use the Grouped Query Attention (GQA) model architecture and can be deployed immediately on AWS.
- DeepSeek-R1, a 671B-parameter model with a novel model architecture. To fully benefit from its efficiency, specific inference optimization is required (see DeepSeek Model Optimizations on SGLang or the vLLM v0.7.1 optimizations).
- Contacting your AWS Account Team to inquire about inference optimization and additional support for DeepSeek AI models
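For intuition on why the distilled Llama and Qwen variants run well out of the box: Grouped Query Attention lets several query heads share one key/value head, which shrinks the KV cache relative to standard multi-head attention. A toy NumPy sketch (head counts and dimensions are illustrative, not the models' actual configuration):

```python
import numpy as np


def grouped_query_attention(q, k, v, n_kv_heads):
    """Toy GQA: q has n_q_heads, k/v have only n_kv_heads (n_q_heads must
    be a multiple of n_kv_heads). Each group of query heads attends to the
    same key/value head, so the KV cache is n_kv_heads / n_q_heads the
    size of standard multi-head attention.
    Shapes: q (n_q_heads, T, d); k, v (n_kv_heads, T, d)."""
    n_q_heads, seq_len, d = q.shape
    group = n_q_heads // n_kv_heads
    # Repeat each KV head so every query head in a group sees the same K/V.
    k = np.repeat(k, group, axis=0)  # (n_q_heads, T, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (n_q_heads, T, T)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax over keys
    return weights @ v  # (n_q_heads, T, d)


# Example: 8 query heads sharing 2 KV heads (groups of 4).
q = np.random.randn(8, 5, 16)
k = np.random.randn(2, 5, 16)
v = np.random.randn(2, 5, 16)
out = grouped_query_attention(q, k, v, n_kv_heads=2)
```

Here the KV cache holds 2 heads instead of 8, a 4x reduction in cache memory for this toy configuration.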
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.