
AWS Resource Automatic Diagnostic AI Program Development

Real-time monitoring of MSK cluster performance with AI-based automatic analysis

Hyunjoong Shin
Amazon Employee
Published Feb 4, 2025

Background

Most issues that occur in AWS cloud environments can be resolved by consulting the official documentation. In addition, AWS resources can be monitored in real time through the detailed metrics and event logs that CloudWatch and CloudTrail provide. By combining these monitoring capabilities with recent AI technology, we have developed a program that automatically diagnoses and analyzes AWS resources.

Program Overview

[Screenshot: MSK analyzer]
This program was developed for Amazon MSK (Amazon Managed Streaming for Apache Kafka), AWS's managed Kafka service. It aims to improve operator efficiency by monitoring the status of complex Kafka clusters in real time and analyzing it automatically. In particular, it supports rapid problem detection and resolution so that MSK clusters handling large-scale traffic can operate stably.
(Note: Sensitive information has been removed from the screenshot.)

Data Collection and Analysis Architecture

The core of this program lies in comprehensive data collection and AI-based analysis.
[Figure: data ingestion and collection]
1. Resource Metrics Collection
  • Real-time performance metrics collection through CloudWatch
  • API activity and resource change history tracking through CloudTrail
  • Cluster configuration information collection via AWS SDK's describe_cluster API
2. Context Information Integration
  • Integration of Support Automation Workflows (SAW) documents
  • Knowledge base construction using Bedrock Knowledge Base
    • Implementation of RAG (Retrieval-Augmented Generation) method to improve AI response accuracy
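
As a rough illustration of the collection step, the sketch below shows how the three sources listed above might be queried with boto3. It is not the program's actual code: the helper names (build_metric_request, collect_metric, collect_cluster_config, recent_api_events), the time windows, and the period are assumptions made for this example. Clients are passed in as arguments so that the functions can also be exercised with stubs.

```python
from datetime import datetime, timedelta, timezone


def build_metric_request(cluster_name, metric_name, broker_id=None,
                         minutes=60, period=300, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    dimensions = [{"Name": "Cluster Name", "Value": cluster_name}]
    if broker_id is not None:
        # Per-broker metrics carry an additional "Broker ID" dimension.
        dimensions.append({"Name": "Broker ID", "Value": str(broker_id)})
    return {
        "Namespace": "AWS/Kafka",
        "MetricName": metric_name,
        "Dimensions": dimensions,
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,
        "Statistics": ["Average", "Maximum"],
    }


def collect_metric(cloudwatch, cluster_name, metric_name, **kwargs):
    """Return datapoints sorted by time (CloudWatch returns them unordered)."""
    request = build_metric_request(cluster_name, metric_name, **kwargs)
    response = cloudwatch.get_metric_statistics(**request)
    return sorted(response["Datapoints"], key=lambda p: p["Timestamp"])


def collect_cluster_config(kafka, cluster_arn):
    """Summarize the describe_cluster response for a provisioned cluster."""
    info = kafka.describe_cluster(ClusterArn=cluster_arn)["ClusterInfo"]
    return {
        "name": info["ClusterName"],
        "state": info["State"],
        "broker_count": info["NumberOfBrokerNodes"],
        "instance_type": info["BrokerNodeGroupInfo"]["InstanceType"],
    }


def recent_api_events(cloudtrail, hours=24, now=None):
    """List (time, event name) pairs for recent MSK API calls.

    Only the first page of results is read; pagination is omitted for brevity.
    """
    end = now or datetime.now(timezone.utc)
    response = cloudtrail.lookup_events(
        LookupAttributes=[{"AttributeKey": "EventSource",
                           "AttributeValue": "kafka.amazonaws.com"}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
    )
    return [(e["EventTime"], e["EventName"]) for e in response["Events"]]
```

With real credentials, the clients would come from boto3.client("cloudwatch"), boto3.client("kafka"), and boto3.client("cloudtrail").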

Key Monitoring Metrics and Analysis

The program monitors and analyzes several critical MSK cluster metrics in real time:
1. Cluster Status Metrics
  • Node count and instance type configuration
  • ActiveControllerCount monitoring
  • OfflinePartitionsCount tracking
  • KafkaDataLogsDiskUsed analysis
2. Performance Metrics
  • CPU usage trend analysis
  • HeapMemoryAfterGC monitoring
  • PartitionCount analysis
  • Network throughput measurement
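
To make the metric list more concrete, here is a minimal sketch of rule-based checks over the most recent value of some of the metrics named above. The Rule class, the evaluate function, and all thresholds are assumptions chosen for illustration and are not taken from the program. The sketch also assumes that ActiveControllerCount is read as a cluster-wide value and that the disk and heap metrics are percentages. CPU and network checks are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    metric: str
    is_healthy: Callable[[float], bool]
    message: str


# Illustrative thresholds (assumptions); tune them for a real cluster.
RULES = [
    Rule("ActiveControllerCount", lambda v: v == 1,
         "expected exactly one active controller"),
    Rule("OfflinePartitionsCount", lambda v: v == 0,
         "one or more partitions are offline"),
    Rule("KafkaDataLogsDiskUsed", lambda v: v < 85.0,
         "data log disk usage is at or above 85%"),
    Rule("HeapMemoryAfterGC", lambda v: v < 60.0,
         "heap usage after GC is at or above 60%"),
]


def evaluate(latest: Dict[str, float]) -> List[str]:
    """Return one finding per rule that fails or has no data."""
    findings = []
    for rule in RULES:
        value = latest.get(rule.metric)
        if value is None:
            findings.append(f"{rule.metric}: no data")
        elif not rule.is_healthy(value):
            findings.append(f"{rule.metric}={value:g}: {rule.message}")
    return findings
```

The findings are plain strings, so they can be shown to an operator directly or passed on as context to a later analysis step.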

Analysis Process and Workflow

1. Data Collection Phase
  • Real-time metrics collection through AWS API
  • Historical data aggregation (CloudWatch)
  • Event log integration (CloudTrail)
2. AI Analysis Phase
  • Primary analysis based on SAW documents
  • Context analysis based on Bedrock
  • Pattern recognition and anomaly detection
  • Solution derivation
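
The analysis phase could be sketched roughly as follows. This is an assumption-laden illustration, not the program's implementation: a simple z-score test stands in for "pattern recognition and anomaly detection", and the helper names (find_anomalies, build_question, ask_knowledge_base) are hypothetical. The Bedrock call uses the retrieve_and_generate operation of the bedrock-agent-runtime client; the knowledge base ID and model ARN must be supplied by the caller.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=3.0):
    """Return indices of values more than `threshold` standard deviations
    from the mean. Short series rarely exceed a threshold of 3, so use a
    reasonably long window or a lower threshold."""
    if len(values) < 2:
        return []
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # A constant series has no outliers.
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def build_question(cluster_name, findings):
    """Turn collected findings into a single question for the knowledge base."""
    lines = [f"MSK cluster '{cluster_name}' shows the following findings:"]
    lines += [f"- {finding}" for finding in findings]
    lines.append("What are the likely causes and recommended remediation steps?")
    return "\n".join(lines)


def ask_knowledge_base(bedrock_agent_runtime, knowledge_base_id, model_arn,
                       question):
    """Run a RAG query against a Bedrock Knowledge Base and return the answer."""
    response = bedrock_agent_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
            },
        },
    )
    return response["output"]["text"]
```

In practice the client would be created with boto3.client("bedrock-agent-runtime"), and the knowledge base would hold the SAW documents mentioned above.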

Expected Benefits and Applications

1. Operational Efficiency Improvement
  • Automation of monitoring tasks
  • Rapid problem detection and response
  • Predictive maintenance
2. Cost Optimization
  • Resource usage optimization
  • Reduction in unnecessary support tickets
  • More efficient use of operations staff

Future Development Direction

  • Expansion to other AWS services
  • Enhancement of machine learning models
  • Integration of real-time alert system
  • Development of customizable dashboards

Conclusion

This AI-based automatic diagnostic program for AWS resources offers a new approach to MSK operations. It can be set up easily in a user's local environment, and it improves operational efficiency by enabling quick problem resolution without opening AWS Support tickets. We plan to keep improving and expanding it.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
