
Improve Observability with AI/ML and GenAI: Key Considerations
Explore considerations for data ingestion, storage, indexing, databases, and search to take full advantage of AI/ML and generative AI in observability
1. Data Ingestion & Normalization
How Log, Metric, and Trace Formats Impact Observability
Kubernetes and Container Telemetry Data
Importance of Normalization and Its Applications
3. Search and Purpose-Built Databases
Using ML Capabilities within Existing Solutions
Training and Deployment with SageMaker
Using Capabilities within Existing Solutions
Building Blocks to Build a Custom Open-Source Solution for Event Correlation
5. Augmenting Analysis with Generative AI
In this article, we explore key considerations for integrating generative AI (GenAI) into observability and highlight how it can complement existing machine learning (ML) capabilities to enhance anomaly detection, event correlation, and predictive analytics.
- Data ingestion and normalization – How to standardize and centralize logs and metrics
- Centralized storage and indexing – Methods to structure your data for queries
- Machine learning for anomaly detection and correlation – Techniques to surface insights
- Augmenting analysis with GenAI – Ways to use large language models to summarize, interpret, and act on data
- Visualization and alerting – Creating actionable dashboards and notifications
- Security and compliance use cases – Applying these observability capabilities to security monitoring and compliance requirements
- Standardization: Use structured logging formats (such as JSON) and standardized metric and trace formats (e.g., OpenTelemetry) to enable seamless parsing and correlation.
- Contextual Enrichment: Append relevant metadata (e.g., service names, regions, correlation IDs, and span IDs) to improve traceability across distributed systems.
- Real-Time Processing: Utilize streaming telemetry pipelines to analyze and act on observability data in near real-time.
- Structured Logging with JSON: Many modern applications write logs in JSON format because it is both human-readable and machine-parsable. CloudWatch Logs automatically discovers log fields and indexes JSON data upon ingestion. JSON's flexibility also makes it ideal for embedding rich context (such as timestamps, severity levels, and service names) that can later be mapped into a common schema (a minimal example appears after this list).
- Metrics Formats: Metrics provide numerical insights into system health and performance. They are typically structured in formats such as:
- Prometheus/OpenMetrics: A standardized text-based wire format for exposing numeric time series data, widely adopted in cloud-native environments.
- OpenTelemetry Metrics: A standardized format that enables cross-platform compatibility and correlation with logs and traces.
- CloudWatch Embedded Metrics Format (EMF): AWS-native JSON-based structured format that integrates with CloudWatch Metrics.
- Trace Formats: Traces capture the execution path of requests across distributed systems. Common formats include:
- OpenTelemetry Traces: Enables distributed tracing across services with span-based correlation.
- AWS X-Ray: AWS-native tracing format that visualizes service interactions and latency bottlenecks.
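To make the logging piece concrete, here is a minimal Python sketch that emits a structured JSON log line with contextual metadata appended. The field names (service, region, trace_id, span_id) and their values are illustrative rather than a required schema:

```python
import json
from datetime import datetime, timezone

def emit_structured_log(level, message, **context):
    """Emit a single JSON log line with contextual metadata appended."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        # Contextual enrichment: illustrative fields, adapt to your own schema
        **context,
    }
    print(json.dumps(record))

emit_structured_log(
    "ERROR",
    "DatabaseConnectionError: timeout after 5s",
    service="checkout-api",
    region="us-east-1",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="00f067aa0ba902b7",
)
```

Because every line carries the same well-known fields, downstream tools can index, filter, and correlate the logs without per-application parsing rules.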
- Application logs are often emitted in JSON format to facilitate structured search and correlation.
- System and runtime logs use CRI logging for consistency across container runtimes and are typically transformed into JSON or other structured formats for better observability.
- Metrics collection is handled by tools such as Prometheus or the CloudWatch agent, with exporters that standardize data in formats like Prometheus/OpenMetrics.
- Traces are generated using OpenTelemetry SDKs or AWS X-Ray for end-to-end request tracking across microservices.
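For the tracing piece, a minimal sketch with the OpenTelemetry Python SDK (assuming the opentelemetry-sdk package is installed) might look like the following. The service name, span name, and attributes are illustrative, and in practice you would export to an OTLP endpoint or AWS X-Ray rather than the console:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Resource attributes give every span consistent service metadata
provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout-api"})
)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "12345")  # illustrative business attribute
    # ... business logic; the span records timing and context automatically
```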
- Before Ingestion:
- Filter at the source: Filter redundant or low-value data to reduce noise while ensuring critical telemetry is retained. For example, the CloudWatch agent can filter log events at the source, which is the recommended approach.
- Transform at the Source: Logging agents like Fluent Bit and Logstash process and convert CRI logs into structured formats like JSON, while OpenTelemetry Collectors normalize traces and metrics before forwarding them to monitoring platforms (a rough sketch of this step follows the list).
- Enforce standardization: Ensure consistency and usability across observability platforms by defining data schemas at the point where telemetry is initially generated. For example, Prometheus and OpenTelemetry client libraries and APIs help generate metrics and traces that adhere to correct formats.
- Append metadata: Enrich telemetry with contextual information, such as request IDs, regions, span IDs, and execution context, for better traceability. In a microservices environment, adding a unique trace_id to logs enables correlation across multiple services, allowing for end-to-end request tracking.
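As a rough illustration of source-side transformation and filtering (work normally done by an agent such as Fluent Bit rather than hand-written code), the following sketch parses a CRI-formatted container log line into JSON, drops debug-level entries, and appends metadata. The pod and namespace values are hypothetical:

```python
import json

def transform_cri_line(line, metadata):
    """Parse a CRI log line ('<time> <stream> <P|F> <message>') into JSON."""
    timestamp, stream, _tag, message = line.split(" ", 3)

    # Filter at the source: drop low-value entries before they are shipped
    if message.startswith("DEBUG"):
        return None

    record = {"timestamp": timestamp, "stream": stream, "message": message}
    record.update(metadata)  # enrich with pod, namespace, region, etc.
    return json.dumps(record)

line = "2024-01-15T12:05:00.123456789Z stderr F ERROR DatabaseConnectionError: timeout"
print(transform_cri_line(line, {"pod": "checkout-7d9f", "namespace": "prod"}))
```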
- During Ingestion:
- The following are examples of how data can be processed during ingestion into different destinations:
- Normalize data: Collect, transform, and normalize logs, metrics, and traces before indexing them into OpenSearch using OpenSearch Ingestion (a sketch of the mapping step follows this list).
- Use OpenTelemetry Collector to ingest, transform, and export telemetry data. For example, the collector can scrape metrics in Prometheus format, transform and enrich them with metadata, and then export them to CloudWatch or other monitoring platforms.
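Whichever tool performs it, normalization boils down to mapping heterogeneous records onto a common schema. The sketch below shows that mapping in plain Python purely for illustration; in practice the same logic would be expressed in an OpenSearch Ingestion pipeline or OpenTelemetry Collector processors, and the source and target field names here are assumptions:

```python
import json

def normalize(raw: dict) -> dict:
    """Map heterogeneous source records onto one common log schema."""
    return {
        "timestamp": raw.get("time") or raw.get("@timestamp") or raw.get("ts"),
        "level": str(raw.get("severity") or raw.get("level") or "INFO").upper(),
        "message": raw.get("msg") or raw.get("message") or "",
        "service": raw.get("service") or raw.get("app") or "unknown",
    }

# Two records from different sources, each using its own field names
sources = [
    {"time": "2024-01-15T12:05:00Z", "severity": "error",
     "msg": "DB timeout", "service": "checkout"},
    {"@timestamp": "2024-01-15T12:05:01Z", "level": "warn",
     "message": "Retrying connection", "app": "payments"},
]

for raw in sources:
    print(json.dumps(normalize(raw)))
```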
Amazon OpenSearch Service zero-ETL integration with Amazon S3 also lets you directly query operational data stored in S3 using OpenSearch's rich analytics capabilities, with no need to move the data. This is done by creating Glue Data Catalog tables that reference the data in S3, which OpenSearch can then query directly. This allows you to use OpenSearch for both real-time analytics on data ingested into the service, as well as ad-hoc analysis of the broader operational data lake in S3. The integration reduces the need for complex ETL pipelines.
- CloudWatch Logs Insights can detect and analyze patterns in log events to identify recurring text structures in the log data (a query sketch follows this list). CloudWatch also includes anomaly detection that automatically flags irregular patterns in metrics, with dynamic thresholds that adapt to usage patterns.
- OpenSearch has built-in machine learning capabilities through the OpenSearch ML plugin, which can automatically detect anomalies, identify patterns, and generate insights from the ingested log data.
- Amazon Timestream's integration with Amazon SageMaker empowers teams to use machine learning for predictive analysis. By training models on historical time-series data, organizations can proactively plan for capacity changes and prevent potential business interruptions.
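To make the CloudWatch example concrete, here is a small boto3 sketch that runs a Logs Insights query counting error messages in five-minute buckets. The log group name, region, and the ERROR filter are assumptions:

```python
import time
from datetime import datetime, timedelta, timezone

import boto3

logs = boto3.client("logs", region_name="us-east-1")  # assumed region

# Count ERROR lines in 5-minute buckets over the last hour
query = """
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as error_count by bin(5m)
| sort error_count desc
"""

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

query_id = logs.start_query(
    logGroupName="/eks/prod/application",  # hypothetical log group
    startTime=int(start.timestamp()),
    endTime=int(end.timestamp()),
    queryString=query,
)["queryId"]

# Poll until the query finishes, then print the result rows
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```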
There are a few ML approaches useful for predictive monitoring and event correlation:
- Unsupervised Anomaly Detection: Algorithms like Isolation Forest or Random Cut Forest (RCF) can identify outliers in metric trends or log frequencies without labeled data. For example, SageMaker has a built-in Random Cut Forest algorithm specifically designed to detect anomalous data points in time series. You could train an RCF model on weeks of CPU utilization metrics to learn what "normal" looks like, and then have it score new data to detect anomalies (like sudden spikes). Similarly, you could vectorize log data (e.g., count of error keywords per hour) and use an Isolation Forest to find abnormal spikes in error rates (see the sketch after this list).
- Time-Series Models (LSTM/ARIMA): Recurrent neural networks such as LSTMs can model sequences of data. An LSTM trained on system metrics might forecast expected values and detect when actual values diverge significantly (an anomaly). These are useful for capturing seasonality or trends such as daily patterns and detecting context-dependent anomalies.
- Supervised Learning for Event Correlation: If you have historical incidents labeled (e.g., past outages with known root causes), you can train a model such as a Random Forest or XGBoost classifier to predict when a combination of log patterns and metric anomalies signals a specific issue. For instance, a model could learn that when an "OutOfMemory" error log coincides with a high memory-usage metric, it likely indicates an application crash event. Such a model requires a training dataset of events (normal vs. issue), which is not always available, so this is an optional approach. Often, unsupervised models are easier to start with in observability.
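As a minimal illustration of the unsupervised approach, the following sketch applies scikit-learn's Isolation Forest to synthetic hourly [error_count, avg_latency_ms] vectors. The data and the contamination rate are placeholders you would replace with your own telemetry and tuning:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic feature vectors per hour: [error_count, avg_latency_ms]
rng = np.random.default_rng(42)
normal = np.column_stack([rng.poisson(5, 200), rng.normal(120, 15, 200)])
spikes = np.array([[250, 900], [180, 750]])  # two clearly anomalous hours
X = np.vstack([normal, spikes])

# Train on "normal" history, then score new observations
model = IsolationForest(contamination=0.01, random_state=42).fit(normal)

labels = model.predict(X)            # -1 = anomaly, 1 = normal
scores = model.decision_function(X)  # lower score = more anomalous
for features, label, score in zip(X[-3:], labels[-3:], scores[-3:]):
    print(features, "anomaly" if label == -1 else "normal", round(score, 3))
```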
- Deduplication and Aggregation: Redundant or duplicate events are common in large-scale systems. Modern solutions use advanced deduplication algorithms to merge multiple identical or highly similar events into a single incident. Aggregation further summarizes a burst of related events—whether from transient spikes or recurring alerts—reducing noise and preventing alert fatigue.
- Contextual Enrichment and Filtering: Beyond simply cleaning up raw data, observability platforms enrich events with context from configuration management databases (CMDB), topology maps, historical baselines, and threat intelligence feeds. This added context helps the correlation engine distinguish between benign anomalies and genuine incidents. For example, if multiple alerts are tied to a known critical service, they can be prioritized over similar alerts from less critical systems. Automated filtering techniques then suppress irrelevant or low-priority events, ensuring that teams see only the most actionable information.
- Temporal and Topological Analysis: By analyzing the timing of events and understanding the network or service topology, observability solutions can link events that occur within a specific window or that are part of a dependency chain. For instance, if a downstream service reports errors shortly after its upstream dependency fails, the correlation engine can attribute the incident to that dependency failure.
- AI and Machine Learning: To handle the dynamic nature of modern IT environments, many observability platforms incorporate machine learning. These models learn what "normal" looks like and can dynamically adjust correlation rules. Over time, the system becomes better at predicting which clusters of events represent true incidents versus those that are simply noise. One approach is to use unsupervised clustering on time windows – group together all unusual events that occurred within the same timeframe. Another approach is to use graph-based algorithms (treating events as nodes and linking those occurring closely in time or with similar patterns).
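The graph-based idea can be prototyped in a few lines: treat events as nodes, connect events that occur within a short time window, and treat each connected component as a candidate incident. A rough sketch using networkx follows; the 120-second window and the event data are illustrative:

```python
from itertools import combinations

import networkx as nx

# (event_id, epoch_seconds) pairs; in practice these come from your event stream
events = [
    ("cpu_spike_web-1", 1000), ("cpu_spike_web-2", 1010),
    ("db_conn_errors", 1045), ("disk_full_batch-7", 9000),
]

WINDOW_SECONDS = 120  # events closer than this are assumed to be related

graph = nx.Graph()
graph.add_nodes_from(event_id for event_id, _ in events)
for (id_a, t_a), (id_b, t_b) in combinations(events, 2):
    if abs(t_a - t_b) <= WINDOW_SECONDS:
        graph.add_edge(id_a, id_b)

# Each connected component is a candidate correlated incident
for component in nx.connected_components(graph):
    print(sorted(component))
# Expected grouping: the three events around t=1000-1045 form one incident,
# while the unrelated disk_full event stands alone.
```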
- OpenTelemetry can provide the strong data foundation needed for correlation. By using OpenTelemetry SDKs, all telemetry data can carry consistent identifiers (trace IDs, resource attributes, etc.) that make it easier to link events across systems. For example, traces from a distributed transaction can be tied to log entries by embedding trace IDs in logs.
- The OpenTelemetry Collector can receive data from many sources and export to multiple backends. This means that with one instrumentation standard, you can send data to an OpenSearch cluster for log analysis and to a Prometheus backend for metrics, ensuring all tools see a consistent stream of events.
- OpenSearch can be used to index and search logs and event data. It also has an Observability plugin that supports trace visualization and correlation features. A practical approach is to send logs and traces to OpenSearch and use its built-in Trace Analytics to correlate them. The OpenSearch Observability UI can reconstruct distributed traces as flame graphs and build service maps that show relationships between services. By indexing trace spans and log records together with trace and span IDs, you enable queries like "find all logs for this trace," effectively correlating different data types (see the query sketch after this list). OpenSearch also supports anomaly detection, as discussed earlier, and offers an Alerting plugin where you can define monitors and triggers.
- For metrics (system CPU, application latency, etc.), a specialized time-series database can be used. Prometheus is a popular choice for metric collection and alerting: it can scrape metrics from endpoints and evaluate alerting rules. Prometheus can perform simple correlation through recording rules or alerting expressions. For example, you might write a PromQL alert that checks whether error rates are high while traffic count is low to detect a specific condition. Prometheus Alertmanager groups alerts to avoid duplicates and can perform basic dependency groupings.
- OpenTelemetry traces can be used to build a service dependency graph using OpenSearch's service map feature, which visualizes the application topology and the calls between microservices.
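With trace IDs embedded in both spans and log records, the "find all logs for this trace" query becomes a simple term filter. A sketch with the opensearch-py client follows; the endpoint, index pattern, and the traceId field name are assumptions that depend on how your pipeline maps fields:

```python
from opensearchpy import OpenSearch

# Assumed endpoint; add authentication as appropriate for your cluster
client = OpenSearch(
    hosts=[{"host": "search-observability.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"  # e.g., taken from a slow trace

response = client.search(
    index="app-logs-*",  # hypothetical log index pattern
    body={
        "query": {"term": {"traceId": trace_id}},
        "sort": [{"@timestamp": "asc"}],
        "size": 100,
    },
)

for hit in response["hits"]["hits"]:
    print(hit["_source"].get("@timestamp"), hit["_source"].get("message"))
```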
"At 12:05 PM, an anomaly in CPU usage on 3 servers coincided with an uptick in error logs containing 'DatabaseConnectionError' messages."
"Around 12:05 PM, the application experienced a surge in CPU usage and database connection errors, suggesting a possible database overload or connectivity issue. This likely triggered high latency. Recommended action: check database health and consider scaling or optimizing queries."
- Amazon CloudWatch Dashboards: CloudWatch provides dashboards where you can display metrics, alarms, and even custom text or images. You can create a centralized dashboard that shows key metrics (CPU, memory, latency, etc.), counts of certain log events, and the output from your AI/ML pipeline. For instance, one panel might show a time series of anomaly scores (so you can see when the system thinks something is off), and another panel could list active CloudWatch Alarms. Using the CloudWatch custom widget feature, you can embed the generative AI summary directly into the dashboard, as discussed in the previous section. The dashboard can also include classic elements like graphs of metric trends or tables of recent log entries (a small dashboard sketch follows this list).
- Amazon QuickSight: QuickSight allows advanced BI visualization, especially if you want to combine data from multiple sources or do longer-term analysis. You can use QuickSight to create interactive charts and tables from logs stored in S3 by using Amazon Athena to query the data. Amazon QuickSight Q enables any user to ask questions of their data using natural language, without having to write SQL queries. Amazon QuickSight's machine learning (ML) capabilities allow you to uncover hidden insights, identify key drivers, and generate forecasts without needing to develop custom ML models or algorithms.
- Amazon OpenSearch Dashboards: If you have operators who need to dive into logs, OpenSearch Dashboards can be used for detailed exploration. You might set up saved searches or visualizations there, like a dashboard that shows log rate, error rate, and latency together.
- Amazon Managed Grafana: Amazon Managed Grafana can be used to visualize, query, and correlate metrics, logs, and traces at scale. Based on open-source Grafana with added features like single sign-on, Amazon Managed Grafana enables querying, visualizing, alerting on, and understanding observability data from various sources, such as container metrics stored in Amazon Managed Service for Prometheus.
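As a minimal example of the CloudWatch option, the boto3 sketch below creates a dashboard with one metric graph and one text panel that could hold a generated summary. The dashboard name, metric, and region are assumptions:

```python
import json

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # assumed region

dashboard_body = {
    "widgets": [
        {   # Metric graph: average EC2 CPU utilization (illustrative metric)
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "metrics": [["AWS/EC2", "CPUUtilization"]],
                "stat": "Average",
                "period": 300,
                "region": "us-east-1",
                "title": "CPU utilization",
            },
        },
        {   # Text panel: a place to surface a generated incident summary
            "type": "text",
            "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {"markdown": "## Latest AI summary\n_(updated by pipeline)_"},
        },
    ]
}

cloudwatch.put_dashboard(
    DashboardName="observability-overview",  # hypothetical dashboard name
    DashboardBody=json.dumps(dashboard_body),
)
```

A custom widget backed by a Lambda function could replace the static text panel so the summary refreshes dynamically.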
- Using Generative AI to Gain Insights into CloudWatch Logs – Demonstrates how to integrate Amazon Bedrock to summarize CloudWatch Logs on a dashboard
- Efficiently Build and Tune Custom Log Anomaly Detection Models with Amazon SageMaker – A step-by-step guide on using SageMaker to process log data and train anomaly detection models
- Aggregating, Searching, and Visualizing Log Data from Distributed Sources with Athena and QuickSight – A part of a series showing how to use S3 (as a data lake), AWS Glue, Amazon Athena, and Amazon QuickSight to centralize and visualize logs and metrics from many sources
- Ingest Streaming Data into Amazon OpenSearch Service with Amazon Kinesis Data Firehose – Introduces how to set up a Firehose delivery stream to send data to an OpenSearch (Elasticsearch) cluster in a VPC
- Amazon DevOps Guru – The DevOps Guru FAQs explain how it correlates metrics and logs using ML to surface insights and provides recommendations
- AWS Observability Best Practices – Provides comprehensive guides, tools, recipes, and patterns to help improve observability and monitoring of AWS cloud environments
- AWS CAF Operations Perspective for Event Management with AIOps – Provides an overview of event management using AIOps (artificial intelligence for IT operations) in the context of the AWS Cloud Adoption Framework's operations perspective
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.