Petite SRE: An Observation on Cloud Infrastructure Design/Delivery in the GenAI Era
Revolutionizing Generative AI Infrastructure with Site Reliability Engineering (SRE)

Discover how SRE principles can transform AI infrastructure, enabling more reliable, scalable, and agile systems for cutting-edge generative AI technologies.
- For notable machine learning models, industry produced 51 while academia contributed only 15; another 21 notable models resulted from industry-academia collaborations. This data shows that AI research and development is still dominated by industry.
- For foundation model development, the number of releases in 2023 was more than double that of 2022, and 65.7% of them were open source.
- Regarding technical performance, which is related to real-life integration, AI has surpassed human performance in several benchmarks, such as image classification and language understanding. However, it still trails behind on more complex tasks, including competition-level mathematics and visual commonsense reasoning.
- At the same time, Responsible AI is another perspective to focus on: standardization is still lacking, and political deepfakes are already affecting elections worldwide. The number of AI incidents continues to rise, with 123 incidents reported in 2023, a 32.2% increase over 2022.
- Process: motivating AI use by specific needs
- Formulation: describing the problem to solve with AI
- Tools and Technologies: assessing AI affordances
- Data: informing AI use with appropriate information
- Context: shaping AI use, benefits, and risks in situated practice
- Application Development Integration: understanding deployment patterns, related function structures, and layered, well-structured resource allocation strategies.
- Infrastructure Design/Delivery Model: adopting function/service-based RACI frameworks instead of product/layer-based ones, while embracing changing service mesh requirements.
- SRE Implementation: automating deployment and improvement, with performance monitoring introduced at an early stage.
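The early-stage performance monitoring mentioned above can be made concrete with an error budget, the standard SRE mechanism for gating deployments. A minimal sketch, assuming a hypothetical 99.9% availability SLO and made-up request counts:

```python
# Minimal error-budget check, a core SRE practice for gating deployments.
# The SLO target and request counts below are illustrative assumptions.

def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the fraction of the error budget still unspent (can be negative)."""
    budget = (1.0 - slo_target) * total  # failures allowed in this window
    return (budget - failed) / budget if budget else 0.0

# Example: 99.9% SLO over 1,000,000 requests allows 1,000 failures;
# with 400 observed failures, 60% of the budget remains.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
print(f"{remaining:.0%} of the error budget remains")
if remaining <= 0:
    print("Budget exhausted: freeze feature deployments, focus on reliability")
```

When the budget goes negative, the conventional SRE response is exactly the kind of policy decision this section argues for automating: pause releases and redirect effort to reliability work.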
- Complete Cloud Native: replace existing on-premises systems with a new architecture designed for cloud environments.
- Semi-Cloud Native: modify existing on-premises systems by implementing AWS agents or AWS Outposts. These systems need to interact closely with cloud-native ones but, for various reasons, cannot yet be fully transformed.
- On-premises Remains: systems in this category are kept as-is while awaiting retirement.
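The triage into these three categories can be sketched as a simple decision function. The criteria used here (cloud readiness, need for cloud interaction, retirement status) are simplified assumptions for illustration, not an official framework:

```python
# Illustrative triage of systems into the three migration categories.
# The three boolean criteria are assumed simplifications of a real assessment.

def migration_strategy(cloud_ready: bool, needs_cloud_interaction: bool,
                       retiring: bool) -> str:
    if retiring:
        return "On-premises Remains"    # keep as-is until retirement
    if cloud_ready:
        return "Complete Cloud Native"  # re-architect for the cloud
    if needs_cloud_interaction:
        return "Semi-Cloud Native"      # bridge via AWS agents or Outposts
    return "On-premises Remains"

# A system that cannot be re-architected yet but must talk to cloud workloads:
print(migration_strategy(cloud_ready=False, needs_cloud_interaction=True,
                         retiring=False))
```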
- Individual Service-based Modules: e.g., an EC2 launch template containing pre-defined user data for OS settings. These are usually used for single-point deployments and service-implementation PoCs.
- Technical Solution-based Modules: e.g., a Fault Injection Simulator template for incident-handling training. These are generally based on daily system use cases, including backup/restore and disaster recovery.
- Industry Business Case-based Modules: e.g., Baseline Environment on AWS for Financial Services Institute (BLEA for FSI). These are tailored to specific industry features.
- Observability Modules: these work as a set with the modules above, collecting source data for module refinement and basic system operation.
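One way to keep these four categories manageable is a lightweight module registry that records the category of each module and which modules an observability module instruments. A toy sketch; the module names and fields are made up for illustration:

```python
# Toy registry for the four module categories; names and fields are
# illustrative assumptions, not a real template catalogue format.
from dataclasses import dataclass, field

@dataclass
class Module:
    name: str
    category: str                 # "service", "solution", "industry", "observability"
    depends_on: list = field(default_factory=list)

registry = [
    Module("ec2-launch-template", "service"),
    Module("fis-dr-drill", "solution"),
    Module("blea-for-fsi", "industry"),
    # Observability modules pair with the others to feed refinement data.
    Module("cw-dashboards", "observability",
           depends_on=["ec2-launch-template", "fis-dr-drill"]),
]

by_category: dict[str, list[str]] = {}
for m in registry:
    by_category.setdefault(m.category, []).append(m.name)
print(by_category)
```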
- Break down and define each task clearly enough for its worker: set up definitions for all tasks and make sure relevant members are on the same page about, at minimum, the task goal, the expected output, and the deadline. Group tasks into different skill-level tracks to align with teams, and coordinate all tasks into pipelines with relentless improvement.
- Leverage teams with different goals and technical skill sets: in general, three types of teams are effective and practical. The L1 team focuses on basic operations, such as maintaining existing automation workflows and handling on-demand operations by procedure; the L2 team focuses on advanced operations, including standardizing workflows and creating new automation solutions based on them; the L3 team handles high-level operations such as GPU-based container infrastructure and APM with third-party products. Members and their skills move between teams along a spiral track, as discussed later in 3.5 Continuous Input/Output.
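The grouping of tasks into L1/L2/L3 tracks can be sketched as a trivial routing step. The task names and level assignments below are illustrative assumptions:

```python
# Sketch: route tasks into L1/L2/L3 skill-level tracks.
# Task names and their level assignments are made-up examples.

tasks = [
    ("run existing backup workflow", 1),
    ("handle on-demand restore by procedure", 1),
    ("standardize the deployment workflow", 2),
    ("build new automation from the standard workflow", 2),
    ("operate GPU-based container infrastructure", 3),
]

tracks: dict[int, list[str]] = {1: [], 2: [], 3: []}
for name, level in tasks:
    tracks[level].append(name)

for level, names in tracks.items():
    print(f"L{level} track ({len(names)} tasks): {names}")
```

In practice the interesting part is not the routing itself but revisiting the assignments as workflows get standardized, so that L1 work feeds the spiral into L2 and L3.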
- Use AI-integrated services as a companion: Amazon Q Developer and Amazon CodeCatalyst can accelerate module refinement and development, while Amazon CloudWatch Logs Insights queries generated from natural language can help with log analysis.
- Transform people while transforming systems: as mentioned in 3.3 Team/Technical Agility, leveraging teams with different goals and technical skill sets keeps the whole project productive. The definition and scope of each team are not static but evolve dynamically; regularly refining each team's mission and shifting the team structure is important for keeping motivation high.
- Data-driven improvement: as metrics, traces, and logs are collected, use them to make next-step decisions that stay aligned with the goals set at the beginning and remain realistic to execute. Use Amazon QuickSight, AWS Glue DataBrew, and Amazon SageMaker to support solution visualization and simulation.
- Stay alert to micromanagement: micromanagement is exhausting for both sides. On one hand, members may feel distrusted by the team, resulting in reduced productivity and collaboration or, even worse, leaving the team. On the other hand, leaders lose the time to focus on what they should be doing and instead burn out in endless checking.
- Encourage innovation and empower leadership: smart leaders use their members' wisdom as a database for improvement. Encouraging innovation and empowering leadership brings enormous vitality; its underlying logic is making people feel responsible and in control.
- AI-driven automation expansion: use AI services to identify repetitive tasks from daily data collection, develop new workflows, and continuously improve them. The technical focus areas include model training, predictive analytics, and process automation.
- Advanced observability implementation: sub-tasks include real-time system performance monitoring, proactive problem detection, and detailed performance analysis. Use AI services to provide suggestions and handle the first response with automated workflows.
- Self-healing infrastructure: mainly aimed at large-scale incident handling and system restoration, such as disaster recovery.
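Before reaching for AI services, the core of "identify repetitive tasks from daily data" can be approximated with plain frequency counting over an operations log. A deliberately simple stand-in, with fabricated log entries and an assumed threshold:

```python
# Simple stand-in for AI-based repetitive-task discovery: count recurring
# operation names in a log and flag automation candidates above a threshold.
# The log entries and the cutoff value are fabricated for illustration.
from collections import Counter

ops_log = [
    "restart-service-A", "rotate-credentials", "restart-service-A",
    "clear-tmp-disk", "restart-service-A", "clear-tmp-disk",
]

THRESHOLD = 2  # assumed cutoff for "repetitive"
counts = Counter(ops_log)
candidates = [op for op, n in counts.items() if n >= THRESHOLD]
print("Automation candidates:", candidates)
```

Anything the counter flags repeatedly is a candidate for a new automation workflow; the AI services mentioned above extend this idea to fuzzier signals than exact-match operation names.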