Re:Infrastructure for NextGen AI/ML and Beyond
Drawing from AWS re:Invent 2024 announcements, we explore how AI infrastructure is evolving from traditional stack-based approaches to a comprehensive ecosystem perspective. This analysis offers AI/ML practitioners practical insights for building next-generation infrastructure with the latest AI/ML-related AWS updates.
Published Dec 17, 2024
Abstract
This paper explores the changing landscape of AI infrastructure, inspired by AWS re:Invent 2024, and shifts from a stack-based to an ecosystem perspective. We discuss how AWS announcements like the Trainium2 UltraServer and next-gen AI services reshape AI infrastructure concepts and introduce a framework viewing it as a dynamic ecosystem of "hard" (computing, storage, networking) and "soft" (frameworks, tools, processes) components.
By reviewing key re:Invent updates, we address bottlenecks like compatibility, agility, and gaps in AI-optimized infrastructure. The paper highlights trends in AIOps and infrastructure transformation, offering insights for organizations adopting advanced AI systems. AI/ML practitioners and architects can learn how AWS innovations support building scalable, efficient, and future-ready AI infrastructure through a holistic ecosystem approach.
1. General Cloud Infrastructure: Stacks, Blocks and Process
The traditional understanding of cloud infrastructure, particularly from the perspective of Cloud Infrastructure System Integrators (SIers), has been fundamentally tied to Infrastructure as a Service (IaaS). AWS's canonical definition presents this as a hierarchical structure: network appliances at the foundation, followed by storage devices, server machines, virtualization technology, and operating systems that host applications. In practice, however, Cloud Infrastructure SIers increasingly find their responsibilities extending beyond these traditional boundaries to encompass various aspects of system configuration and optimization.
However, the emergence of Generative AI has catalyzed a shift toward a more streamlined conceptual framework. From a market perspective, the stack has been distilled into four essential layers: infrastructure, tools, models, and applications.
These developments reflect how the once-clear boundaries between stacks, blocks, and processes have become increasingly fluid. Rather than adhering to rigid definitions, next-generation cloud infrastructure demands a dynamic approach to system design and implementation. This shift is particularly evident in AI infrastructure, where the interplay between different components becomes as crucial as the components themselves.
As we delve deeper into the specific challenges and opportunities presented by this evolution, it becomes clear that success in next-generation cloud infrastructure requires a fundamental rethinking of our traditional approaches. The rigid boundaries of the past are giving way to more fluid, adaptable frameworks that are better aligned with the dynamic nature of AI and ML workloads.
2. The Signal and the Reality: Bottlenecks and Opportunities on the Way
When AWS announced new data center components designed to support the next generation of AI innovation and customers' evolving needs, it became clear that although the industry has long focused on cloud infrastructure as a virtualization and distribution layer, hardware innovation is once again a key driver of AI advancement.
At the same time, at Monday Night Live on December 2 in Las Vegas, AWS announced the AWS Trainium2 UltraServer, a completely new computing offering featuring 64 Trainium2 chips connected by a high-bandwidth, low-latency NeuronLink interconnect, built for peak inference and training performance on frontier foundation models. As the very first announcement opening the AWS re:Invent 2024 keynote sessions, it is more than just a product launch; it is a clear indicator that AWS has reached a crucial understanding: the next frontier of AI innovation requires a fundamental reimagining of infrastructure. This signal becomes even more pronounced with the preview of AWS Trainium3, which promises a move to a 3-nanometer process node, and the announcement of Project Rainier, a collaboration with Anthropic aimed at revolutionizing AI model training efficiency.
These developments suggest a spiral model of service and product development in AWS's AI/ML strategy. As each layer of the stack reaches its performance limits, innovation cycles back to foundational infrastructure, but at progressively higher levels of sophistication. This pattern reveals several critical bottlenecks that Cloud Infrastructure SIers must navigate:
- Component Version Compatibility
While AWS offers comprehensive AI/ML development services, organizations often seek to combine these with open-source solutions to avoid vendor lock-in. This hybrid approach introduces complexity, particularly in version management.
A telling example is that it is currently impossible to run the latest versions of PyTorch and the NVIDIA CUDA Toolkit on the latest GPU- and ECS-optimized Amazon Linux 2023 AMI, because PyTorch lacks stable support for Amazon Linux 2023 containers. This situation is further complicated by the retirement of Amazon Linux 2 in June 2025, forcing organizations to make difficult choices between operating system versions and customization capabilities.
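Because such combinations drift quickly, a small runtime sanity check in the container entrypoint can at least fail fast before a training job is scheduled. The following is a minimal sketch that assumes a PyTorch-based container image and uses only standard torch APIs; image, AMI, and orchestration details are deliberately left out.

```python
# Minimal sanity check for a GPU container image: verifies that the bundled
# PyTorch build can actually see the CUDA toolkit and driver on the host
# before any training work starts. Intended to run as part of the entrypoint.
import sys

import torch


def check_gpu_stack() -> None:
    print(f"PyTorch version  : {torch.__version__}")
    print(f"Compiled against : CUDA {torch.version.cuda}")
    if not torch.cuda.is_available():
        # Typical failure mode when OS, driver, toolkit, and PyTorch versions drift apart.
        sys.exit("CUDA is not available: the OS/driver/toolkit/PyTorch combination is incompatible")
    print(f"Visible GPU      : {torch.cuda.get_device_name(0)}")


if __name__ == "__main__":
    check_gpu_stack()
```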
- Infrastructure Agility
The rapid pace of AI/ML application development demands equally agile infrastructure responses. While high-level Infrastructure as Code (IaC) and Agile methodologies offer potential solutions, their effective implementation requires more than just technical tools. It necessitates a comprehensive transformation in technical skill sets and mindset across entire project teams, a particularly daunting task in large-scale implementations.
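To make the IaC side concrete, here is a minimal AWS CDK (Python) sketch of what declarative, version-controlled AI infrastructure can look like: a VPC plus a single GPU training node. The stack and construct names are hypothetical and the instance type is only an example; a real deployment would add IAM, storage, and networking details.

```python
# Hypothetical CDK stack: a VPC and one GPU instance for model training,
# declared as code so the environment can be reviewed, versioned, and
# reproduced like any other project artifact.
from aws_cdk import App, Stack
from aws_cdk import aws_ec2 as ec2
from constructs import Construct


class AiTrainingStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        vpc = ec2.Vpc(self, "TrainingVpc", max_azs=2)

        ec2.Instance(
            self,
            "TrainingNode",
            vpc=vpc,
            instance_type=ec2.InstanceType("g5.xlarge"),  # example GPU instance type
            machine_image=ec2.MachineImage.latest_amazon_linux2023(),
        )


app = App()
AiTrainingStack(app, "AiTrainingStack")
app.synth()
```

With an environment expressed this way, promoting the same setup across accounts or stages becomes a pipeline concern rather than a manual one, which is exactly the agility the transformation in skills and mindset is meant to enable.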
- The Real AI Infrastructure
Perhaps most significantly, we face the challenge of implementing True AI Infrastructure. Current cloud AI/ML infrastructure often relies on traditional cloud resources adapted for AI workloads, rather than purpose-built AI infrastructure. The performance metrics—GPU/CPU utilization, latency, and response accuracy—tell a clear story of the gap between adapted and purpose-built solutions. This disparity persists due to various factors: cost considerations, strategic decisions, and knowledge limitations among both Cloud Infrastructure SIers and their clients.
According to AWS, the following are the services it defines as members of its comprehensive, secure, and price-performant AI infrastructure category.
- Compute: Amazon EC2 Trn1/Inf1 Instances, Amazon EC2 P5/G5 Instances, Amazon EC2 Capacity Blocks, AWS Neuron
- Networking: Elastic Fabric Adapter, Amazon EC2 UltraClusters, AWS Direct Connect
- Storage: Amazon FSx for Lustre, Amazon S3, Amazon S3 Express One Zone
- Security: AWS Nitro System, AWS Nitro Enclaves, AWS Key Management Service
- Managed Services: Amazon SageMaker, Amazon Elastic Kubernetes Service, Amazon Elastic Container Service
3. NextGen AI Infrastructure: New Definition and Categories with re:Invent 2024 Recap
Our analysis thus far points to a fundamental shift in how we should conceptualize AI infrastructure. Rather than viewing it through the traditional lens of stacks and building blocks, NextGen AI infrastructure is better understood as an ecosystem.
- Redefining AI Infrastructure
NextGen AI infrastructure represents a dynamic and interconnected network where each component contributes to and draws from the whole. This ecosystem approach emphasizes not just the individual components, but their interactions and collective evolution. It supports the development, deployment, operation, and continuous evolution of AI solutions through a sustainable and efficient environment.
The ecosystem naturally divides into two complementary domains: Hard Infrastructure and Soft Infrastructure. Hard Infrastructure encompasses the physical and traditional components—computing resources, storage systems, networks, and cloud facilities. Soft Infrastructure, equally crucial, focuses on the functional aspects: data processing and analysis capabilities, model deployment and training frameworks, and specialized operational tools designed specifically for AI workloads.
Based on this definition and categorization, several related updates from AWS re:Invent 2024 stand out.
- Hard Infrastructure: Compute
- Amazon EC2 Trn2 & AWS Trainium2 UltraServer
Although going serverless is the trend when designing architectures for AI/ML systems, in some cases applications still need to run on EC2 or other compute nodes.
Amazon EC2 Trn2 instances are purpose-built for high-performance ML model training using AWS Trainium2 chips. The key features we would like to highlight are: they are powered by AWS Trainium2, which delivers high throughput and low latency for training complex models; they offer high scalability, supporting distributed training across multiple instances, an exciting announcement for cluster users; and they bring cost efficiency, which may make it easier to persuade customers to adopt them and step into real AI infrastructure.
AWS Trainium2 UltraServer is a data center component provided by AWS. Previously, users could only choose which Region and Availability Zone their instances ran in; now they can also choose whether their nodes run on an UltraServer. When deploying Trainium2 instances, choosing the instance type with the ".u" mark places the node on an UltraServer, as sketched below.
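As a rough illustration of that choice (not an official walkthrough), the boto3 sketch below launches a Trainium2 node; the AMI and subnet IDs are placeholders, and the ".u" instance type is what signals UltraServer placement.

```python
# Hedged sketch: launching a Trainium2 node with boto3. Choosing the "trn2u"
# instance type (the ".u" mark) places the node on a Trainium2 UltraServer,
# while a plain "trn2" type runs standalone. IDs below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",    # e.g. a Neuron deep learning AMI (placeholder)
    InstanceType="trn2u.48xlarge",      # ".u" -> UltraServer placement
    SubnetId="subnet-xxxxxxxxxxxxxxx",  # placeholder
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```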
- Amazon EKS Hybrid Nodes & Amazon Elastic VMware Service
AWS published its Hybrid Machine Learning white paper years ago, and several updates announced at re:Invent 2024 can be seen as part of that story when combined with the signal discussed in the previous chapter. Although not directly tied to AI/ML innovation, Amazon EKS Hybrid Nodes and Amazon Elastic VMware Service offer more possibilities for running cloud computing anywhere.
To briefly recap, Amazon EKS Hybrid Nodes allows users to extend EKS clusters across AWS and on-premises environments through a single EKS control plane, including AWS's native on-premises solution, AWS Outposts. Amazon Elastic VMware Service, on the other hand, allows users to bring their existing licenses and deploy their workloads on AWS with a fully managed experience, much like EC2.
- Soft Infrastructure: AI/ML Service
- Amazon Nova
During AWS re:Invent, there were several GenAI use-case workshops and chalk talks featuring LangChain, Anthropic, and NVIDIA. Unfortunately, the cases discussed were basically the same as those presented at AWS AI Day.
In the "Practical generative AI using Amazon Nova" talk, several of Amazon's internal uses of Amazon Nova were shared, including AWS Support technical-issue RCA, Amazon Prime Video season recaps, Amazon Ads video generation, and Amazon Q Developer assistance. We hope these releases will yield more practical use cases in the future.
- Amazon SageMaker Unified Studio & Amazon Bedrock IDE
For AWS-native development, Cloud9 used to be the natural choice, but as its end of support is no secret, Amazon SageMaker Studio has been discussed as the replacement. Since Amazon SageMaker was not designed for this mission, that solution is like going to a Michelin-starred restaurant but ordering only from the side menu.
As AI/ML users mostly like to start with Amazon Bedrock and its foundation models, having Amazon Bedrock IDE as an AWS-native development tool may be good news. Furthermore, because it is integrated with Amazon SageMaker Unified Studio, the overall AI/ML innovation strategy and upgrade process may become smoother in the future: start with the basics, then advance at scale.
- Amazon Bedrock Intelligent Prompt Routing & Amazon Bedrock Prompt Caching
One thing we learned from Meta's AWS AI Day keynote, "Empowering Industries with Llama Models" and its powering engine PyTorch, is to treat model usage as an ecosystem for cost efficiency: smaller models for simpler queries, larger models for complex tasks, and industry- or use-case-specific models for special requests. Amazon Bedrock Intelligent Prompt Routing embodies this approach, intelligently directing queries to appropriate models based on complexity and requirements. This aligns perfectly with the multi-model trend observed in successful AI deployments, where organizations leverage different models for different types of queries. When combined with Bedrock Prompt Caching, these features create an efficient and cost-effective environment for AI operations, as the sketch below illustrates.
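The following minimal boto3 sketch assumes that a prompt router ARN can be passed wherever a model ID is expected in the Converse API; the router ARN is a placeholder, and prompt caching configuration is not shown.

```python
# Hedged sketch: sending a request through an Amazon Bedrock prompt router
# instead of a fixed model ID, so the router can pick a smaller or larger
# model per query. The router ARN below is a placeholder.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

ROUTER_ARN = "arn:aws:bedrock:us-east-1:123456789012:default-prompt-router/example-router"  # placeholder

response = bedrock.converse(
    modelId=ROUTER_ARN,
    messages=[
        {"role": "user", "content": [{"text": "Summarize last week's support ticket trends."}]}
    ],
)

print(response["output"]["message"]["content"][0]["text"])
```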
4. Conclusion
The insights gained from AWS re:Invent 2024 signal a paradigm shift in AI infrastructure, moving beyond traditional stack-based architectures toward a more integrated ecosystem approach. This evolution represents not just technological advancement, but a fundamental reimagining of how we conceptualize and implement AI infrastructure.
Key Takeaways
- The emergence of a unified ecosystem that seamlessly integrates both Hard and Soft components. This integration transcends the traditional boundaries between stacks, building blocks, and processes to create a more cohesive and efficient environment for AI workloads.
- The shift toward purpose-built AI infrastructure, exemplified by innovations like the Trainium2 UltraServer, enhanced Amazon SageMaker, and Amazon Bedrock services, marks a departure from adapted traditional cloud resources toward truly AI-optimized solutions.
Future Tasks
- AIOps: observability, resiliency, scalability
During re:Invent, several talks and workshops related to non-functional requirements such as observability, resiliency, and scalability had GenAI as their keyword, but it was used more to draw attention than to provide practical AI-enabled solutions.
This is something of a disappointment, but also a chance for AWS users to figure out how to use AI services and products to improve these qualities, both for AI systems and for all systems in the GenAI era.
Here are some promising opportunities:
- Leveraging Amazon Bedrock's Automated Reasoning checks and multimodal toxicity detection with image support, together with Amazon Bedrock Knowledge Bases' LLM-as-a-judge, for enhanced APM analysis
- Implementing AI-driven operations through Amazon Q for comprehensive system insights
- Migration and then re:Infrastructure
As AWS announced its VMware solution and touched briefly on mainframes, infrastructure evolution and cloud migration involve not just moving existing systems to the cloud, but fundamentally reimagining them for the AI era. The following are some hypotheses:
- Leveraging new hybrid capabilities like EKS Hybrid Nodes to bridge on-premises and cloud environments
- Implementing intelligent workload routing and caching strategies to optimize model serving
- Developing migration patterns that embrace cloud-native principles while maintaining operational efficiency
The developments showcased at AWS re:Invent 2024 represent not just incremental improvements, but foundational changes in how we approach AI infrastructure. As the field continues to evolve, success will depend on our ability to embrace this ecosystem perspective while developing practical solutions to emerging challenges in AIOps and infrastructure modernization.