Petite SRE: An Observation on Cloud Infrastructure Design/Delivery in the GenAI Era
Revolutionizing Generative AI Infrastructure with Site Reliability Engineering (SRE)

Discover how SRE principles can transform AI infrastructure, enabling more reliable, scalable, and agile systems for cutting-edge generative AI technologies.
- For notable machine learning models, industry produced 51 while academia contributed only 15; another 21 notable models resulted from industry-academia collaborations. This data shows that AI research and development is still dominated by industry.
- For foundation model development, the number of releases in 2023 was more than double that of 2022, and 65.7% of them were open source.
- Regarding technical performance, which is related to real-life integration, AI has surpassed human performance in several benchmarks, such as image classification and language understanding. However, it still trails behind on more complex tasks, including competition-level mathematics and visual commonsense reasoning.
- At the same time, Responsible AI is another perspective to focus on: standardization is still lacking, and political deepfakes are already affecting elections worldwide. The number of AI incidents continues to rise, with 123 incidents reported in 2023, a 32.2% increase over 2022.
- Process: motivating AI use by specific needs
- Formulation: describing the problem to solve with AI
- Tools and Technologies: assessing AI affordances
- Data: informing AI use with appropriate information
- Context: shaping AI use, benefits, and risks in situated practice
- Application Development Integration: understanding deployment patterns, related function structures, and layered, well-structured resource allocation strategies.
- Infrastructure Design/Delivery Model: adopting function/service-based RACI frameworks instead of product/layer-based ones, while embracing changing service mesh requirements.
- SRE Implementation: automating deployment and improvement, with performance monitoring introduced at an early stage.
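The early-stage performance monitoring mentioned above can be made concrete with an error budget, the standard SRE mechanism for gating deployments. A minimal sketch, assuming a hypothetical 99.9% availability SLO and made-up request counts:

```python
# Minimal error-budget check, a core SRE practice for gating deployments.
# The SLO target and request counts below are illustrative assumptions.

def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the fraction of the error budget still unspent (can be negative)."""
    budget = (1.0 - slo_target) * total  # failures allowed in this window
    return (budget - failed) / budget if budget else 0.0

# Example: 99.9% SLO over 1,000,000 requests allows 1,000 failures;
# with 400 observed failures, 60% of the budget remains.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
print(f"{remaining:.0%} of the error budget remains")
if remaining <= 0:
    print("Budget exhausted: freeze feature deployments, focus on reliability")
```

When the budget goes negative, the conventional SRE response is exactly the kind of policy decision this section argues for automating: pause releases and redirect effort to reliability work.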
- Complete Cloud Native: replace existing on-premises systems with a new architecture designed for cloud environments.
- Semi-Cloud Native: modify existing on-premises systems by implementing AWS agents or AWS Outposts. These systems need to interact closely with cloud-native ones but, for various reasons, cannot yet be fully transformed.
- On-premises Remains: systems in this category are kept as-is while awaiting retirement.
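The triage into these three categories can be sketched as a simple decision function. The criteria used here (cloud readiness, need for cloud interaction, retirement status) are simplified assumptions for illustration, not an official framework:

```python
# Illustrative triage of systems into the three migration categories.
# The three boolean criteria are assumed simplifications of a real assessment.

def migration_strategy(cloud_ready: bool, needs_cloud_interaction: bool,
                       retiring: bool) -> str:
    if retiring:
        return "On-premises Remains"    # keep as-is until retirement
    if cloud_ready:
        return "Complete Cloud Native"  # re-architect for the cloud
    if needs_cloud_interaction:
        return "Semi-Cloud Native"      # bridge via AWS agents or Outposts
    return "On-premises Remains"

# A system that cannot be re-architected yet but must talk to cloud workloads:
print(migration_strategy(cloud_ready=False, needs_cloud_interaction=True,
                         retiring=False))
```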
- Individual Service-based Modules: e.g., an EC2 launch template containing pre-defined user data for OS settings. These are usually used for single-point deployments and service-implementation PoCs.
- Technical Solution-based Modules: e.g., a Fault Injection Simulator template for incident-handling training. These are generally based on daily system use cases, including backup/restore and disaster recovery.
- Industry Business Case-based Modules: e.g., Baseline Environment on AWS for Financial Services Institute (BLEA for FSI). These are tailored to specific industry features.
- Observability Modules: these work as a set with the modules above, collecting source data for module refinement and basic system operation.
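One way to keep these four categories manageable is a lightweight module registry that records the category of each module and which modules an observability module instruments. A toy sketch; the module names and fields are made up for illustration:

```python
# Toy registry for the four module categories; names and fields are
# illustrative assumptions, not a real template catalogue format.
from dataclasses import dataclass, field

@dataclass
class Module:
    name: str
    category: str                 # "service", "solution", "industry", "observability"
    depends_on: list = field(default_factory=list)

registry = [
    Module("ec2-launch-template", "service"),
    Module("fis-dr-drill", "solution"),
    Module("blea-for-fsi", "industry"),
    # Observability modules pair with the others to feed refinement data.
    Module("cw-dashboards", "observability",
           depends_on=["ec2-launch-template", "fis-dr-drill"]),
]

by_category: dict[str, list[str]] = {}
for m in registry:
    by_category.setdefault(m.category, []).append(m.name)
print(by_category)
```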
- Break down and define each task clearly enough for its worker: set up definitions for all tasks and make sure relevant members are on the same page about, at minimum, the task goal, the expected output, and the deadline. Group tasks into different skill-level tracks to align with teams, and coordinate all tasks into pipelines with relentless improvement.
- Leverage teams with different goals and technical skill sets: in general, three types of teams are effective and practical. The L1 team focuses on basic operations, such as maintaining existing automation workflows and handling on-demand operations by procedure; the L2 team focuses on advanced operations, including standardizing workflows and creating new automation solutions based on them; the L3 team handles high-level operations such as GPU-based container infrastructure and APM with third-party products. Members and their skills move between teams along a spiral track, as discussed later in 3.5 Continuous Input/Output.
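The grouping of tasks into L1/L2/L3 tracks can be sketched as a trivial routing step. The task names and level assignments below are illustrative assumptions:

```python
# Sketch: route tasks into L1/L2/L3 skill-level tracks.
# Task names and their level assignments are made-up examples.

tasks = [
    ("run existing backup workflow", 1),
    ("handle on-demand restore by procedure", 1),
    ("standardize the deployment workflow", 2),
    ("build new automation from the standard workflow", 2),
    ("operate GPU-based container infrastructure", 3),
]

tracks: dict[int, list[str]] = {1: [], 2: [], 3: []}
for name, level in tasks:
    tracks[level].append(name)

for level, names in tracks.items():
    print(f"L{level} track ({len(names)} tasks): {names}")
```

In practice the interesting part is not the routing itself but revisiting the assignments as workflows get standardized, so that L1 work feeds the spiral into L2 and L3.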
- Use AI-integrated services as a companion: Amazon Q Developer and Amazon CodeCatalyst can accelerate module refinement and development, while Amazon CloudWatch Logs Insights queries generated from natural language can help with log analysis.
- Transform people while transforming systems: as mentioned in 3.3 Team/Technical Agility, leveraging teams with different goals and technical skill sets keeps the whole project productive. The definition and scope of each team are not static but evolve dynamically; regularly refining each team's mission and shifting the team structure is important for keeping motivation high.
- Data-driven improvement: as metrics, traces, and logs are collected, use them to make next-step decisions that stay aligned with the goals set at the beginning and remain realistic to execute. Use Amazon QuickSight, AWS Glue DataBrew, and Amazon SageMaker to support solution visualization and simulation.
- Stay alert to micromanagement: micromanagement is exhausting for both sides. On one hand, members may feel distrusted by the team, resulting in reduced productivity and collaboration or, even worse, leaving the team. On the other hand, leaders lose the time to focus on what they should be doing and instead burn out in endless checking.
- Encourage innovation and empower leadership: smart leaders use their members' wisdom as a database for improvement. Encouraging innovation and empowering leadership brings enormous vitality; its underlying logic is making people feel responsible and in control.
- AI-driven automation expansion: use AI services to identify repetitive tasks from daily data collection, develop new workflows, and continuously improve them. The technical focus areas include model training, predictive analytics, and process automation.
- Advanced observability implementation: sub-tasks include real-time system performance monitoring, proactive problem detection, and detailed performance analysis. Use AI services to provide suggestions and handle the first response with automated workflows.
- Self-healing infrastructure: mainly aimed at large-scale incident handling and system restoration, such as disaster recovery.
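Before reaching for AI services, the core of "identify repetitive tasks from daily data" can be approximated with plain frequency counting over an operations log. A deliberately simple stand-in, with fabricated log entries and an assumed threshold:

```python
# Simple stand-in for AI-based repetitive-task discovery: count recurring
# operation names in a log and flag automation candidates above a threshold.
# The log entries and the cutoff value are fabricated for illustration.
from collections import Counter

ops_log = [
    "restart-service-A", "rotate-credentials", "restart-service-A",
    "clear-tmp-disk", "restart-service-A", "clear-tmp-disk",
]

THRESHOLD = 2  # assumed cutoff for "repetitive"
counts = Counter(ops_log)
candidates = [op for op, n in counts.items() if n >= THRESHOLD]
print("Automation candidates:", candidates)
```

Anything the counter flags repeatedly is a candidate for a new automation workflow; the AI services mentioned above extend this idea to fuzzier signals than exact-match operation names.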