8 ways to deploy models using Amazon SageMaker

Learn how to choose between 8 different model deployment options in Amazon SageMaker

Published May 12, 2025
Machine learning (ML) models have become an integral part of a growing number of applications, tackling diverse challenges from forecasting and classification to predictive maintenance and recommendation engines. Transitioning these models from experimentation to reliable production systems can feel like navigating a maze, demanding careful consideration of latency, traffic management, and cost efficiency. Fortunately, Amazon SageMaker helps streamline this process by providing fully managed, highly available infrastructure that caters to a wide range of needs. In this blog post, you will learn about eight methods for deploying ML models using Amazon SageMaker so you can choose the optimal approach for your use case.
Depending on the required latency, throughput, request frequency, and payload size, you might choose between real-time, asynchronous, serverless, or batch inference. In practice, there are additional variables to consider, such as how many models you plan to deploy and whether they share an underlying framework, whether you need to add pre- or post-processing logic for the model inputs and outputs, and how to handle A/B testing and traffic splitting between multiple models or multiple versions of the same model. To address these additional challenges, Amazon SageMaker provides very granular control over the deployment configuration of your ML workloads.

How does the model deployment process work? 

Regardless of the deployment strategy, deploying a model has two prerequisites: the inference code and the model artifacts. After a training job has completed, your model artifacts are available in an Amazon S3 bucket. As for the inference code, if you are using a built-in SageMaker algorithm, this is already handled for you and you don't need to write any custom code. However, if you have built your own algorithm, you need to supply either an inference script or your own model container image. In all scenarios, the model is deployed in a SageMaker container that holds both the model artifacts and the inference logic. With this process in mind, let's now explore the different options for deploying models.
Figure: Model Deployment Simplified
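To make this concrete, the following minimal boto3 sketch shows how a model object might be registered from a container image and trained artifacts in Amazon S3; the image URI, S3 path, and role ARN are placeholders, not values from this post.

```python
import boto3

sm = boto3.client("sagemaker")

# Register a SageMaker model: a container image holding the inference logic,
# plus the model artifacts produced by the training job in Amazon S3.
# All names, URIs, and ARNs below are placeholders.
sm.create_model(
    ModelName="my-model",
    PrimaryContainer={
        "Image": "<account>.dkr.ecr.<region>.amazonaws.com/my-inference-image:latest",
        "ModelDataUrl": "s3://my-bucket/model-artifacts/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::<account-id>:role/MySageMakerExecutionRole",
)
```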

(1) Starting Simple: One Model, One Container

This foundational deployment strategy involves packaging a single machine learning model within its own dedicated container and serving it via a real-time endpoint. It offers simplicity in setup and management, making it an excellent starting point for deploying individual models and understanding the basic inference process on Amazon SageMaker. This strategy is suitable for experimentation and testing, or if you are working on a simple ML use case, and it is the easiest and fastest configuration to spin up.
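As a minimal sketch, assuming the my-model example registered earlier, a single-variant real-time endpoint could be created and invoked with boto3 roughly as follows; the endpoint names, instance type, and payload are placeholders.

```python
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

# One production variant pointing at a single model/container.
sm.create_endpoint_config(
    EndpointConfigName="my-endpoint-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-model",        # the model registered earlier
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
    }],
)
sm.create_endpoint(
    EndpointName="my-endpoint",
    EndpointConfigName="my-endpoint-config",
)

# Once the endpoint is InService, send a real-time request.
response = runtime.invoke_endpoint(
    EndpointName="my-endpoint",
    ContentType="text/csv",
    Body="0.5,1.2,3.4",
)
print(response["Body"].read())
```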

(2) Multiple Production Variants

In practice, you will frequently need to experiment with different model versions and potentially split the traffic between them. In SageMaker, this process can be streamlined by adding multiple production variants to a single endpoint, each variant hosting its own model. With production variants, you have the flexibility to decide how traffic is split between models, as well as the ability to invoke a specific variant by specifying the TargetVariant header on the request. Furthermore, each production variant has its own separate autoscaling policy, which means you can host models with different resource requirements.
Use multiple production variants for A/B testing and traffic splitting between different models or model versions, with the flexibility of defining a separate auto-scaling policy for each variant.
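For illustration, here is a minimal boto3 sketch of an endpoint configuration that splits traffic 80/20 between two hypothetical models (model-a and model-b), followed by a request that explicitly targets one variant.

```python
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

# Two variants behind one endpoint: 80% of traffic to model-a, 20% to model-b.
# Each variant can use a different instance type and its own scaling policy.
sm.create_endpoint_config(
    EndpointConfigName="ab-test-config",
    ProductionVariants=[
        {
            "VariantName": "VariantA",
            "ModelName": "model-a",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.8,
        },
        {
            "VariantName": "VariantB",
            "ModelName": "model-b",
            "InstanceType": "ml.c5.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.2,
        },
    ],
)
sm.create_endpoint(EndpointName="ab-test-endpoint", EndpointConfigName="ab-test-config")

# Bypass the weighted split and route this request to a specific variant.
response = runtime.invoke_endpoint(
    EndpointName="ab-test-endpoint",
    TargetVariant="VariantB",
    ContentType="text/csv",
    Body="0.5,1.2,3.4",
)
```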

(3) Inference Pipeline Models

Machine learning models typically require input data in a certain format and may produce output in a format that is not directly suitable for your end users or backend application. Therefore, another frequently encountered scenario is the need to execute pre- and post-processing steps on the model inputs and outputs. These steps can be encapsulated and executed in separate containers that are linked together to form an inference pipeline model. In an inference pipeline model, you can define a series of at least 2 and up to 15 containers that are executed in a fixed order to produce the desired output, giving you a way to handle complex inference logic.
The steps you include as containers in an inference pipeline model can be processing scripts for the model inputs and outputs, or even a combination of built-in and custom algorithms, all executed in sequence. For each inference request, the output of the first container becomes the input of the second container, and so on. The limitation in this case is that you cannot invoke the containers separately; you can only control the order in which the containers are executed and the payload you provide to the first container in the series.
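As an illustration, a minimal sketch with the SageMaker Python SDK's PipelineModel is shown below; the three container images and artifact locations are hypothetical stand-ins for a preprocessing step, the model itself, and a post-processing step.

```python
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel

role = "arn:aws:iam::<account-id>:role/MySageMakerExecutionRole"  # placeholder

# Each step is a regular SageMaker Model; at inference time the containers run
# in the order listed, and the output of one becomes the input of the next.
preprocess = Model(image_uri="<preprocessing-image-uri>",
                   model_data="s3://my-bucket/preprocess/model.tar.gz", role=role)
predict = Model(image_uri="<algorithm-image-uri>",
                model_data="s3://my-bucket/model/model.tar.gz", role=role)
postprocess = Model(image_uri="<postprocessing-image-uri>",
                    model_data="s3://my-bucket/postprocess/model.tar.gz", role=role)

pipeline = PipelineModel(
    name="my-inference-pipeline",
    role=role,
    models=[preprocess, predict, postprocess],  # 2 to 15 containers, executed in order
)
pipeline.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```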

(4) Maximize resource efficiency: Multiple models, One Container

For some ML use cases, you may need to work with multiple models and be able to invoke them independently (not in a sequence). If your models are built using the same underlying framework, you can maximize resource efficiency by hosting multiple models in a single container. Then, using your own application business logic, you control which model is invoked on each request to the inference endpoint.
This strategy is suitable if you are working with multiple models built using the same ML framework that are used infrequently and do not require separate autoscaling policies (since the models share the same container and underlying compute resources, they will be scaled together, as a single unit).
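One common way to implement this pattern is a SageMaker multi-model endpoint, where a single container lazily loads model artifacts from an S3 prefix and the caller picks the model per request. A minimal boto3 sketch, with hypothetical bucket paths and model names, might look like this:

```python
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

# One framework container serving every model.tar.gz found under the S3 prefix.
sm.create_model(
    ModelName="my-multi-model",
    PrimaryContainer={
        "Image": "<framework-inference-image-uri>",
        "Mode": "MultiModel",
        "ModelDataUrl": "s3://my-bucket/models/",  # e.g. contains churn.tar.gz, fraud.tar.gz
    },
    ExecutionRoleArn="arn:aws:iam::<account-id>:role/MySageMakerExecutionRole",
)

# After creating the endpoint config and endpoint as usual, select the model
# to run for each individual request.
response = runtime.invoke_endpoint(
    EndpointName="my-multi-model-endpoint",
    TargetModel="churn.tar.gz",   # loaded on first use and cached on the instance
    ContentType="text/csv",
    Body="0.5,1.2,3.4",
)
```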

(5) Multi-container models with direct invocation

If you are working with multiple models that do not share the same framework but have similar resource requirements, you can deploy them on a multi-container endpoint with direct invocation. This means that each container on the endpoint can be invoked separately by passing a special header, TargetContainerHostname, when making a request to the SageMaker endpoint.
If your models have the same resource requirements, it is easier and more cost-effective to have a single endpoint serve multiple models, even if they were built using different frameworks. However, the limitation is that the models cannot be scaled separately, as the containers are replicated on each EC2 instance that serves the endpoint. In terms of infrastructure, this approach is very similar to the inference pipeline concept, the main difference being that here you have more control and can invoke each container separately.
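A minimal boto3 sketch of such a multi-container model with direct invocation follows; the two container images, artifact paths, and hostnames are hypothetical.

```python
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

# Several containers (here two, built on different frameworks) behind one endpoint.
# Direct mode lets you address each container individually instead of chaining them.
sm.create_model(
    ModelName="multi-container-model",
    Containers=[
        {"ContainerHostname": "xgboost-model",
         "Image": "<xgboost-inference-image-uri>",
         "ModelDataUrl": "s3://my-bucket/xgboost/model.tar.gz"},
        {"ContainerHostname": "pytorch-model",
         "Image": "<pytorch-inference-image-uri>",
         "ModelDataUrl": "s3://my-bucket/pytorch/model.tar.gz"},
    ],
    InferenceExecutionConfig={"Mode": "Direct"},  # "Serial" would chain them like a pipeline
    ExecutionRoleArn="arn:aws:iam::<account-id>:role/MySageMakerExecutionRole",
)

# After creating the endpoint config and endpoint as usual, route a request
# to one specific container.
response = runtime.invoke_endpoint(
    EndpointName="multi-container-endpoint",
    TargetContainerHostname="pytorch-model",
    ContentType="application/json",
    Body='{"inputs": [0.5, 1.2, 3.4]}',
)
```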

(6) Queue requests with Asynchronous Inference

If your payload size exceeds the real-time inference limit of 6 MB and you still require near real-time latency, you can use asynchronous inference. With asynchronous inference, you can handle payloads of up to 1 GB and long processing times of up to one hour. Furthermore, you can scale your endpoint down to zero instances when there are no requests to process. The process of creating an asynchronous inference endpoint is similar to real-time, the main difference being that the input data typically comes from an Amazon S3 location that you specify when invoking the endpoint. You can also optionally receive error or success notifications by adding Amazon SNS topics to the endpoint configuration.
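A minimal boto3 sketch of an asynchronous endpoint configuration and invocation is shown below; the S3 locations and SNS topic ARNs are placeholders, and scaling down to zero instances would additionally require an auto scaling policy with a minimum capacity of 0.

```python
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

# The AsyncInferenceConfig block is what turns a real-time style endpoint config
# into an asynchronous one; results land in S3 and SNS notifications are optional.
sm.create_endpoint_config(
    EndpointConfigName="async-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-model",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
    }],
    AsyncInferenceConfig={
        "OutputConfig": {
            "S3OutputPath": "s3://my-bucket/async-results/",
            "NotificationConfig": {
                "SuccessTopic": "arn:aws:sns:<region>:<account-id>:async-success",
                "ErrorTopic": "arn:aws:sns:<region>:<account-id>:async-errors",
            },
        },
        "ClientConfig": {"MaxConcurrentInvocationsPerInstance": 4},
    },
)

# Requests reference a payload already stored in S3; the call returns immediately
# with the S3 location where the result will be written.
response = runtime.invoke_endpoint_async(
    EndpointName="async-endpoint",
    InputLocation="s3://my-bucket/async-inputs/payload.json",
    ContentType="application/json",
)
print(response["OutputLocation"])
```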

(7) Batch Transform: High-throughput Offline Inference

When low-latency real-time inference is not a strict requirement and you need to process large datasets, for example to generate predictions in bulk, SageMaker Batch Transform offers a cost-effective and scalable solution. This feature lets you run inference jobs on entire datasets stored in Amazon S3, delivering high throughput for offline prediction tasks. You can trigger a batch transform job on a schedule (hourly, daily, weekly) using an Amazon EventBridge rule, from an Amazon S3 event notification when a new dataset is uploaded, or as part of a workflow using Amazon SageMaker Pipelines or AWS Step Functions.
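For instance, a minimal sketch with the SageMaker Python SDK's Transformer, using a previously registered model and placeholder S3 locations, might look like this:

```python
from sagemaker.transformer import Transformer

# Run offline inference over an entire dataset in S3 with an existing SageMaker model.
transformer = Transformer(
    model_name="my-model",                       # previously registered model (placeholder)
    instance_count=2,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/batch-predictions/",
)

transformer.transform(
    data="s3://my-bucket/batch-input/",          # S3 prefix holding the input files
    content_type="text/csv",
    split_type="Line",                           # treat each line as one record
)
transformer.wait()                               # block until the transform job finishes
```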

(8) Going serverless

The serverless inference option on Amazon SageMaker provides a highly scalable and cost-efficient way to deploy models without managing underlying infrastructure. This deployment strategy is easy to set up and is particularly suitable in scenarios where you have unpredictable traffic spikes followed by periods with no traffic. To avoid cold starts after a period with no traffic, you can configure provisioned concurrency for your serverless endpoint. 
A limitation you may encounter with serverless inference is the number of concurrent invocations the endpoint can handle, which can be set to a maximum of 200 per endpoint. 
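A minimal sketch of a serverless deployment with the SageMaker Python SDK is shown below; the image URI, artifact location, and role are placeholders, and the memory and concurrency values are only example settings.

```python
from sagemaker.model import Model
from sagemaker.serverless import ServerlessInferenceConfig

role = "arn:aws:iam::<account-id>:role/MySageMakerExecutionRole"  # placeholder

model = Model(
    image_uri="<inference-image-uri>",
    model_data="s3://my-bucket/model/model.tar.gz",
    role=role,
)

# No instance type or count: SageMaker provisions compute on demand per invocation.
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,   # between 1024 and 6144, in 1 GB increments
    max_concurrency=20,       # capped at 200 concurrent invocations per endpoint
    # provisioned_concurrency=5,  # optionally keep warm capacity to reduce cold starts
)

model.deploy(serverless_inference_config=serverless_config)
```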

Model Deployment Choice Diagram

The best deployment strategy for your model depends on several factors, including latency, traffic patterns, the complexity of the inference workload, model experimentation needs, and overall resource efficiency and cost optimization. The following diagram provides comprehensive guidance on how to choose between the different deployment options available in Amazon SageMaker.
Figure: Model Deployment Choice Diagram

Conclusion

In this blog post, I showed you eight distinct ways to deploy machine learning models using Amazon SageMaker, each tailored to different operational needs. Understanding how to navigate these various model deployment options is essential for efficiently transitioning your machine learning workload from experimentation to production. By carefully analyzing your specific scenario, you can leverage Amazon SageMaker's flexibility to seamlessly integrate your models into real-world applications.
 
