AWS Resource Automatic Diagnostic AI Program Development
Real-time monitoring of MSK cluster performance with AI-based automatic analysis
Hyunjoong Shin
Amazon Employee
Published Feb 4, 2025
Most issues that occur in AWS cloud environments can be resolved with the official documentation. In particular, AWS resources can be monitored in real time through the detailed metrics and event logs that CloudWatch and CloudTrail provide. By combining these monitoring capabilities with recent AI technology, we developed a program that automatically diagnoses and analyzes AWS resources.

This program was developed for Amazon MSK (Managed Streaming for Apache Kafka), AWS's managed streaming service. It aims to improve operator efficiency by monitoring and automatically analyzing the status of complex Kafka clusters in real time. In particular, it supports rapid problem detection and resolution so that MSK clusters handling large-scale traffic keep running stably.
(Note: Sensitive information has been removed from the screenshot.)
The core of this program lies in comprehensive data collection and AI-based analysis.

1. Resource Metrics Collection
- Real-time performance metrics collection through CloudWatch
- API activity and resource change history tracking through CloudTrail
- Cluster configuration information collection via AWS SDK's describe_cluster API
2. Context Information Integration
- Integration of Support Automation Workflows (SAW) documents
- Knowledge base construction using Bedrock Knowledge Base
- Implementation of RAG (Retrieval-Augmented Generation) method to improve AI response accuracy
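The collection step listed above can be sketched roughly as follows. This is a minimal illustration, not the program's actual code: it assumes boto3, and the helper names (`make_clients`, `collect_metric`, `collect_cluster_config`, `collect_recent_events`) are hypothetical. Each helper takes its client as an argument so it can be tried with a stub before pointing it at a real account.

```python
from datetime import datetime, timedelta, timezone


def make_clients(region_name):
    """Create the boto3 clients used below (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers below also work with stub clients

    return {
        "cloudwatch": boto3.client("cloudwatch", region_name=region_name),
        "cloudtrail": boto3.client("cloudtrail", region_name=region_name),
        "kafka": boto3.client("kafka", region_name=region_name),
    }


def collect_metric(cloudwatch, cluster_name, metric_name,
                   broker_id=None, minutes=60, period=300):
    """Return (timestamp, average) pairs for one MSK metric, oldest first."""
    dimensions = [{"Name": "Cluster Name", "Value": cluster_name}]
    if broker_id is not None:
        # Per-broker metrics such as KafkaDataLogsDiskUsed need a second dimension.
        dimensions.append({"Name": "Broker ID", "Value": str(broker_id)})
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/Kafka",
        MetricName=metric_name,
        Dimensions=dimensions,
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=period,
        Statistics=["Average"],
    )
    # CloudWatch does not guarantee datapoint order, so sort by timestamp.
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], p["Average"]) for p in points]


def collect_cluster_config(kafka, cluster_arn):
    """Summarize the cluster configuration returned by describe_cluster."""
    info = kafka.describe_cluster(ClusterArn=cluster_arn)["ClusterInfo"]
    return {
        "name": info.get("ClusterName"),
        "state": info.get("State"),
        "broker_count": info.get("NumberOfBrokerNodes"),
        "instance_type": info.get("BrokerNodeGroupInfo", {}).get("InstanceType"),
    }


def collect_recent_events(cloudtrail, hours=24):
    """Return (time, event name) pairs for recent MSK API calls (first page only)."""
    end = datetime.now(timezone.utc)
    response = cloudtrail.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventSource", "AttributeValue": "kafka.amazonaws.com"}
        ],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
    )
    return [(e["EventTime"], e["EventName"]) for e in response["Events"]]
```

A real run would call `make_clients("us-east-1")` (the region is an example) and pass the resulting clients to the helpers. Pagination and error handling are left out to keep the sketch short.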
The program monitors and analyzes the following key MSK cluster metrics in real time:
1. Cluster Status Metrics
- Node count and instance type configuration
- ActiveControllerCount monitoring
- OfflinePartitionsCount tracking
- KafkaDataLogsDiskUsed analysis
2. Performance Metrics
- CPU usage trend analysis
- HeapMemoryAfterGC monitoring
- PartitionCount analysis
- Network throughput measurement
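As a rough illustration of how the metrics above might be checked before any AI analysis, the sketch below applies simple health rules to the latest value of each metric. The thresholds are illustrative assumptions, not values from the program, and `Rule`, `RULES`, and `evaluate` are hypothetical names. The article only says "CPU usage", so `CpuUser` stands in for it here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    metric: str                          # CloudWatch metric name
    is_healthy: Callable[[float], bool]  # True when the value looks normal
    hint: str                            # short explanation attached to a finding


# Illustrative thresholds (assumptions); tune them for the actual cluster.
RULES = [
    Rule("ActiveControllerCount", lambda v: v == 1,
         "exactly one active controller is expected"),
    Rule("OfflinePartitionsCount", lambda v: v == 0,
         "offline partitions are unavailable to clients"),
    Rule("KafkaDataLogsDiskUsed", lambda v: v < 85,
         "data log disk usage (percent) is high"),
    Rule("HeapMemoryAfterGC", lambda v: v < 60,
         "heap usage after garbage collection (percent) is high"),
    Rule("CpuUser", lambda v: v < 60,
         "user-space CPU usage (percent) is high"),
]


def evaluate(latest: Dict[str, float]) -> List[str]:
    """Return one finding per rule that the latest metric values violate."""
    findings = []
    for rule in RULES:
        value = latest.get(rule.metric)
        if value is None:
            continue  # metric was not collected; skip rather than guess
        if not rule.is_healthy(value):
            findings.append(f"{rule.metric}={value}: {rule.hint}")
    return findings
```

For example, `evaluate({"OfflinePartitionsCount": 3})` returns a single finding for the offline partitions, which could then be passed on as context for the analysis step.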
The analysis proceeds in two phases:
1. Data Collection Phase
- Real-time metrics collection through AWS API
- Historical data aggregation (CloudWatch)
- Event log integration (CloudTrail)
2. AI Analysis Phase
- Primary analysis based on SAW documents
- Context analysis based on Bedrock
- Pattern recognition and anomaly detection
- Solution derivation
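The analysis phase could look something like the sketch below. It is an assumption-laden illustration rather than the program's real logic: a plain z-score check stands in for "pattern recognition and anomaly detection", and the Bedrock Knowledge Base is queried through the `retrieve_and_generate` call of the `bedrock-agent-runtime` client. The function names, the prompt wording, and the threshold of 3 standard deviations are all hypothetical.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=3.0):
    """Return indexes of values more than `threshold` std devs from the mean."""
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # a flat series has no outliers
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def build_question(cluster_name, findings):
    """Turn a list of finding strings into one question for the knowledge base."""
    lines = [f"MSK cluster '{cluster_name}' shows the following findings:"]
    lines += [f"- {finding}" for finding in findings]
    lines.append("What are the likely causes and recommended remediation steps?")
    return "\n".join(lines)


def ask_knowledge_base(bedrock_agent_runtime, question, knowledge_base_id, model_arn):
    """Run a RAG query against a Bedrock Knowledge Base and return the answer text.

    `bedrock_agent_runtime` is expected to be
    boto3.client("bedrock-agent-runtime"); it is passed in so that a stub
    can be used when trying the code without AWS access.
    """
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
            },
        },
    )
    return response["output"]["text"]
```

Passing the client in keeps the sketch independent of any particular account; the knowledge base ID and model ARN have to come from the user's own Bedrock setup.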
The program is expected to deliver the following benefits:
1. Operational Efficiency Improvement
- Automation of monitoring tasks
- Rapid problem detection and response
- Predictable maintenance
2. Cost Optimization
- Resource usage optimization
- Reduction in unnecessary support tickets
- Operational staff efficiency
Future plans include:
- Expansion to other AWS services
- Enhancement of machine learning models
- Integration of real-time alert system
- Development of customizable dashboards
This AWS resource automatic diagnostic AI program presents a new approach to MSK operations. Because it can be run in a user's local environment, it enables quick problem resolution without creating AWS Support tickets, which improves operational efficiency. We plan to keep improving and expanding it.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.