Implement cross-region inference with Amazon SageMaker AI Endpoints privately inside your VPC


This blog post showcases how to implement multi-region (cross-region) access to LLM serving with Amazon SageMaker AI to enable more flexible access to accelerated inference capacity while maintaining traffic within network isolation.

Oussama Kandakji
Amazon Employee
Published May 7, 2025

Why This Matters

In today's AI-driven business landscape, organizations face a critical challenge: how to access sufficient compute capacity for demanding machine learning workloads while maintaining strict security controls. This architecture delivers crucial advantages for enterprises that:
  • Need flexible compute capacity expansion: Access specialized ML instances (like GPUs and custom accelerators) in secondary regions when your primary region faces capacity constraints, without relocating your entire application stack
  • Maintain centralized application architecture: Keep your core application infrastructure in one region while extending AI capabilities across regions, simplifying operations and reducing management overhead
  • Enforce strict network isolation: Ensure sensitive AI models and data never traverse the public internet, addressing critical security and compliance requirements for highly regulated industries
  • Enable model redundancy without duplication: Deploy the same model across regions (primary or secondary) with centralized management, reducing operational complexity
  • Optimize for cost-efficiency: Leverage instance pricing advantages across regions for inference while maintaining consistent application architecture
In this blog post, we will walk through a sample multi-region implementation where we host an application stack in one region and use Amazon SageMaker AI as an LLM serving component in another region while maintaining private and secure access and connectivity.

Network Architecture

High Level Architecture
As shown in the architecture diagram, the architecture consists of:
  1. Two AWS Regions: Region A and Region B, each containing their own VPC
  2. VPC A: Contains a sample Lambda function that sends inference requests from within the VPC
  3. VPC B: Contains the SageMaker AI runtime VPC endpoint that enables secure access to a SageMaker AI endpoint
  4. VPC Peering Connection: Connects the two VPCs. This can also be replaced by a Transit Gateway.
  5. Private Hosted Zone: Associated with VPC A so that the SageMaker AI runtime endpoint name for Region B resolves to the VPC endpoint in VPC B
The architecture enables secure communication between the resource inside VPC A in Region A and the SageMaker endpoint in Region B, all within private network boundaries without going over the public internet.

Network Configuration

Here are the key resource configurations to implement:
  1. VPC: VPC A and VPC B must have DNS resolution and DNS hostnames set to Enabled. More details can be found here.
  2. VPC Peering Connection: This connection enables network communication and DNS resolution between the two VPCs.
    1. The VPC Peering Connection must be referenced as the target in the private subnet route table(s) for the corresponding neighbor VPC CIDR. More details can be found here.
  3. SageMaker AI Runtime VPC Endpoint: This is an Interface VPC Endpoint powered by AWS PrivateLink.
    1. The Service Name for this endpoint is com.amazonaws.<Region B>.sagemaker.runtime.
    2. The Security Group attached to the VPC Endpoint must allow inbound HTTPS (port 443) traffic from the VPC A CIDR or from the Security Group attached to the application resources in VPC A
  4. Lambda Function: Deployed in VPC A to generate inference requests
    1. The Lambda function must be deployed inside VPC A
    2. The Security Group attached to the Lambda function must allow outbound HTTPS (port 443) traffic to the VPC B CIDR or to the Security Group attached to the VPC Endpoint
  5. Route 53 Private Hosted Zone: Configured with domain name runtime.sagemaker.<Region B>.amazonaws.com
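The VPC endpoint creation above can be sketched with boto3. This is a minimal sketch, not a complete deployment: the VPC, subnet, and security group IDs passed to the function are hypothetical placeholders you would replace with your own.

```python
def sagemaker_runtime_service_name(region: str) -> str:
    """Build the PrivateLink service name for the SageMaker AI runtime in a region."""
    return f"com.amazonaws.{region}.sagemaker.runtime"


def create_runtime_endpoint(region_b, vpc_b_id, subnet_ids, security_group_id):
    """Create the SageMaker AI Runtime interface endpoint in VPC B (Region B).

    All resource IDs are placeholders supplied by the caller.
    """
    import boto3  # imported lazily so the pure helper above has no dependencies

    ec2 = boto3.client("ec2", region_name=region_b)
    return ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId=vpc_b_id,
        ServiceName=sagemaker_runtime_service_name(region_b),
        SubnetIds=subnet_ids,
        SecurityGroupIds=[security_group_id],
        # Private DNS only resolves inside VPC B; for cross-region resolution
        # from VPC A, the Private Hosted Zone described below is used instead.
        PrivateDnsEnabled=False,
    )
```

The `PrivateDnsEnabled=False` choice reflects that DNS resolution for VPC A is handled by the Private Hosted Zone rather than the endpoint's own private DNS.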

DNS Configurations

DNS is a critical component in this architecture since the application stack needs to resolve the domain name of the VPC Endpoint in Region B. To configure DNS resolution for cross-region connectivity, follow these steps:
  1. Create a Private Hosted Zone in Route53 with domain name runtime.sagemaker.<Region B>.amazonaws.com
  2. Associate the Private Hosted Zone with VPC A in Region A.
  3. Create an alias A record pointing to the SageMaker AI Runtime VPC endpoint in Region B
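The three DNS steps above can be sketched with boto3. This is a minimal sketch under assumptions: the VPC ID, caller reference, and the endpoint's DNS name and hosted zone ID (both returned when you describe the VPC endpoint) are hypothetical placeholders.

```python
def private_zone_name(region_b: str) -> str:
    """Domain name the Private Hosted Zone must carry for the runtime endpoint."""
    return f"runtime.sagemaker.{region_b}.amazonaws.com"


def create_dns_records(region_a, region_b, vpc_a_id, endpoint_dns_name, endpoint_zone_id):
    """Create the Private Hosted Zone, associate it with VPC A, and add the alias record.

    endpoint_dns_name / endpoint_zone_id come from describing the Region B VPC endpoint.
    """
    import boto3  # lazy import keeps the pure helper above testable offline

    r53 = boto3.client("route53")
    # Steps 1 and 2: a private zone created with a VPC argument is associated with it.
    zone = r53.create_hosted_zone(
        Name=private_zone_name(region_b),
        CallerReference="cross-region-sagemaker-demo",  # placeholder; must be unique
        HostedZoneConfig={"PrivateZone": True},
        VPC={"VPCRegion": region_a, "VPCId": vpc_a_id},
    )
    # Step 3: alias A record pointing at the Region B VPC endpoint.
    r53.change_resource_record_sets(
        HostedZoneId=zone["HostedZone"]["Id"],
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": private_zone_name(region_b),
                "Type": "A",
                "AliasTarget": {
                    "DNSName": endpoint_dns_name,
                    "HostedZoneId": endpoint_zone_id,
                    "EvaluateTargetHealth": False,
                },
            },
        }]},
    )
```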

Test Code

Here's a simple Python code snippet to test the cross-region inference setup:
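A minimal sketch follows, assuming a SageMaker endpoint named my-llm-endpoint deployed in Region B; the endpoint name and the JSON payload schema are hypothetical placeholders that depend on your deployed model container.

```python
import json

REGION_B = "us-west-2"             # region hosting the SageMaker AI endpoint (placeholder)
ENDPOINT_NAME = "my-llm-endpoint"  # hypothetical endpoint name


def build_payload(prompt: str) -> str:
    """JSON request body; the exact schema depends on the model serving container."""
    return json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": 128}})


def invoke(prompt: str) -> str:
    import boto3  # bundled in the AWS Lambda Python runtime

    # The SDK resolves runtime.sagemaker.<Region B>.amazonaws.com, which the
    # Private Hosted Zone aliases to the VPC endpoint's private IPs in VPC B.
    client = boto3.client("sagemaker-runtime", region_name=REGION_B)
    response = client.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=build_payload(prompt),
    )
    return response["Body"].read().decode("utf-8")


def handler(event, context):
    # Lambda entry point: run a single test inference over the private network.
    return {"statusCode": 200, "body": invoke(event.get("prompt", "Hello"))}
```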
When run from a VPC-attached Lambda Function in Region A, this code will connect to the SageMaker endpoint in Region B through the private network.

Conclusion

This architecture provides an implementation for organizations needing to extend AI inference capabilities across regions while maintaining control over their inference traffic. By implementing cross-region SageMaker inference within VPCs, you can:
  1. Access additional compute capacity: Utilize ML instances in secondary regions without relocating your entire application stack.
  2. Preserve network isolation: Keep all traffic within your private VPC networks, preventing your AI workloads from traversing the public internet.
  3. Maintain operational simplicity: Continue managing your core application in one region while extending AI capabilities to another with minimal architectural complexity.
By following the implementation steps outlined in this post, you can create a path to additional compute resources while keeping your network traffic private and your application architecture streamlined.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
