
Improve Observability with AI/ML and GenAI: Key Considerations
Explore considerations for data ingestion, storage, indexing, databases, and search to take full advantage of AI/ML and generative AI in observability
1. Data Ingestion & Normalization
How Log, Metric, and Trace Formats Impact Observability
Kubernetes and Container Telemetry Data
Importance of Normalization and Its Applications
3. Search and Purpose-Built Databases
Using ML Capabilities within Existing Solutions
Training and Deployment with SageMaker
Using Capabilities within Existing Solutions
Building Blocks to Build a Custom Open-Source Solution for Event Correlation
5. Augmenting Analysis with Generative AI
In this article, we explore key considerations for integrating generative AI (GenAI) into observability and highlight how it can complement existing machine learning (ML) capabilities to enhance anomaly detection, event correlation, and predictive analytics.
- Data ingestion and normalization – How to standardize and centralize logs and metrics
- Centralized storage and indexing – Methods to structure your data for queries
- Machine learning for anomaly detection and correlation – Techniques to surface insights
- Augmenting analysis with GenAI – Ways to use large language models to summarize, interpret, and act on data
- Visualization and alerting – Creating actionable dashboards and notifications
- Security and compliance use cases – Applying these observability capabilities to security monitoring and compliance requirements
- Standardization: Use structured logging formats (such as JSON) and standardized metric and trace formats (e.g., OpenTelemetry) to enable seamless parsing and correlation.
- Contextual Enrichment: Append relevant metadata (e.g., service names, regions, correlation IDs, and span IDs) to improve traceability across distributed systems.
- Real-Time Processing: Utilize streaming telemetry pipelines to analyze and act on observability data in near real-time.
- Structured Logging with JSON: Many modern applications write logs in JSON format because it is both human-readable and machine-parsable. CloudWatch Logs automatically discovers log fields and indexes JSON data upon ingestion. JSON's flexibility also makes it ideal for embedding rich context (such as timestamps, severity levels, and service names) that can later be mapped into a common schema (a minimal example appears after this list).
- Metrics Formats: Metrics provide numerical insights into system health and performance. They are typically structured in formats such as:
- Prometheus/OpenMetrics: A standardized text-based wire format for exposing numeric time series data, widely adopted in cloud-native environments.
- OpenTelemetry Metrics: A standardized format that enables cross-platform compatibility and correlation with logs and traces.
- CloudWatch Embedded Metrics Format (EMF): AWS-native JSON-based structured format that integrates with CloudWatch Metrics.
- Trace Formats: Traces capture the execution path of requests across distributed systems. Common formats include:
- OpenTelemetry Traces: Enables distributed tracing across services with span-based correlation.
- AWS X-Ray: AWS-native tracing format that visualizes service interactions and latency bottlenecks.
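To make the logging piece concrete, here is a minimal Python sketch that emits a structured JSON log line with contextual metadata appended. The field names (service, region, trace_id, span_id) and their values are illustrative rather than a required schema:

```python
import json
from datetime import datetime, timezone

def emit_structured_log(level, message, **context):
    """Emit a single JSON log line with contextual metadata appended."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        # Contextual enrichment: illustrative fields, adapt to your own schema
        **context,
    }
    print(json.dumps(record))

emit_structured_log(
    "ERROR",
    "DatabaseConnectionError: timeout after 5s",
    service="checkout-api",
    region="us-east-1",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="00f067aa0ba902b7",
)
```

Because every line carries the same well-known fields, downstream tools can index, filter, and correlate the logs without per-application parsing rules.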
- Application logs are often emitted in JSON format to facilitate structured search and correlation.
- System and runtime logs use CRI logging for consistency across container runtimes and are typically transformed into JSON or other structured formats for better observability.
- Metrics collection is handled by tools such as Prometheus or the CloudWatch agent, with exporters that standardize data in formats like Prometheus/OpenMetrics.
- Traces are generated using OpenTelemetry SDKs or AWS X-Ray for end-to-end request tracking across microservices.
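For the tracing piece, a minimal sketch with the OpenTelemetry Python SDK (assuming the opentelemetry-sdk package is installed) might look like the following. The service name, span name, and attributes are illustrative, and in practice you would export to an OTLP endpoint or AWS X-Ray rather than the console:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Resource attributes give every span consistent service metadata
provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout-api"})
)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "12345")  # illustrative business attribute
    # ... business logic; the span records timing and context automatically
```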
- Before Ingestion:
- Filter at the source: Filter redundant or low-value data to reduce noise while ensuring critical telemetry is retained. For example, the CloudWatch agent can filter log events at the source, which is the recommended approach.
- Transform at the Source: Logging agents like Fluent Bit and Logstash process and convert CRI logs into structured formats like JSON, while OpenTelemetry Collectors normalize traces and metrics before forwarding them to monitoring platforms (a rough sketch of this step follows the list).
- Enforce standardization: Ensure consistency and usability across observability platforms by defining data schemas at the point where telemetry is initially generated. For example, Prometheus and OpenTelemetry client libraries and APIs help generate metrics and traces that adhere to correct formats.
- Append metadata: Enrich telemetry with contextual information, such as request IDs, regions, span IDs, and execution context, for better traceability. In a microservices environment, adding a unique trace_id to logs enables correlation across multiple services, allowing for end-to-end request tracking.
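As a rough illustration of source-side transformation and filtering (work normally done by an agent such as Fluent Bit rather than hand-written code), the following sketch parses a CRI-formatted container log line into JSON, drops debug-level entries, and appends metadata. The pod and namespace values are hypothetical:

```python
import json

def transform_cri_line(line, metadata):
    """Parse a CRI log line ('<time> <stream> <P|F> <message>') into JSON."""
    timestamp, stream, _tag, message = line.split(" ", 3)

    # Filter at the source: drop low-value entries before they are shipped
    if message.startswith("DEBUG"):
        return None

    record = {"timestamp": timestamp, "stream": stream, "message": message}
    record.update(metadata)  # enrich with pod, namespace, region, etc.
    return json.dumps(record)

line = "2024-01-15T12:05:00.123456789Z stderr F ERROR DatabaseConnectionError: timeout"
print(transform_cri_line(line, {"pod": "checkout-7d9f", "namespace": "prod"}))
```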
- During Ingestion:
- The following are examples of how data can be processed during ingestion into different destinations:
- Normalize data: Collect, transform, and normalize logs, metrics, and traces before indexing them into OpenSearch using OpenSearch Ingestion (a sketch of the mapping step follows this list).
- Use OpenTelemetry Collector to ingest, transform, and export telemetry data. For example, the collector can scrape metrics in Prometheus format, transform and enrich them with metadata, and then export them to CloudWatch or other monitoring platforms.
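Whichever tool performs it, normalization boils down to mapping heterogeneous records onto a common schema. The sketch below shows that mapping in plain Python purely for illustration; in practice the same logic would be expressed in an OpenSearch Ingestion pipeline or OpenTelemetry Collector processors, and the source and target field names here are assumptions:

```python
import json

def normalize(raw: dict) -> dict:
    """Map heterogeneous source records onto one common log schema."""
    return {
        "timestamp": raw.get("time") or raw.get("@timestamp") or raw.get("ts"),
        "level": str(raw.get("severity") or raw.get("level") or "INFO").upper(),
        "message": raw.get("msg") or raw.get("message") or "",
        "service": raw.get("service") or raw.get("app") or "unknown",
    }

# Two records from different sources, each using its own field names
sources = [
    {"time": "2024-01-15T12:05:00Z", "severity": "error",
     "msg": "DB timeout", "service": "checkout"},
    {"@timestamp": "2024-01-15T12:05:01Z", "level": "warn",
     "message": "Retrying connection", "app": "payments"},
]

for raw in sources:
    print(json.dumps(normalize(raw)))
```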
Amazon OpenSearch Service zero-ETL integration with Amazon S3 also lets you directly query operational data stored in S3 using OpenSearch's rich analytics capabilities, with no need to move the data. This is done by creating Glue Data Catalog tables that reference the data in S3, which OpenSearch can then query directly. This allows you to use OpenSearch for both real-time analytics on data ingested into the service, as well as ad-hoc analysis of the broader operational data lake in S3. The integration reduces the need for complex ETL pipelines.
- CloudWatch Logs Insights can detect and analyze patterns in log events to identify recurring text structures in the log data (a query sketch follows this list). CloudWatch also includes anomaly detection that automatically flags irregular patterns in metrics, with dynamic thresholds that adapt to usage patterns.
- OpenSearch has built-in machine learning capabilities through the OpenSearch ML plugin, which can automatically detect anomalies, identify patterns, and generate insights from the ingested log data.
- Amazon Timestream's integration with Amazon SageMaker empowers teams to use machine learning for predictive analysis. By training models on historical time-series data, organizations can proactively plan for capacity changes and prevent potential business interruptions.
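To make the CloudWatch example concrete, here is a small boto3 sketch that runs a Logs Insights query counting error messages in five-minute buckets. The log group name, region, and the ERROR filter are assumptions:

```python
import time
from datetime import datetime, timedelta, timezone

import boto3

logs = boto3.client("logs", region_name="us-east-1")  # assumed region

# Count ERROR lines in 5-minute buckets over the last hour
query = """
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as error_count by bin(5m)
| sort error_count desc
"""

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

query_id = logs.start_query(
    logGroupName="/eks/prod/application",  # hypothetical log group
    startTime=int(start.timestamp()),
    endTime=int(end.timestamp()),
    queryString=query,
)["queryId"]

# Poll until the query finishes, then print the result rows
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})
```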
There are a few ML approaches useful for predictive monitoring and event correlation:
- Unsupervised Anomaly Detection: Algorithms like Isolation Forest or Random Cut Forest (RCF) can identify outliers in metric trends or log frequencies without labeled data. For example, SageMaker has a built-in Random Cut Forest algorithm specifically designed to detect anomalous data points in time series. You could train an RCF model on weeks of CPU utilization metrics to learn what "normal" looks like, and then have it score new data to detect anomalies (like sudden spikes). Similarly, you could vectorize log data (e.g., count of error keywords per hour) and use an Isolation Forest to find abnormal spikes in error rates (see the sketch after this list).
- Time-Series Models (LSTM/ARIMA): Recurrent neural networks such as LSTMs can model sequences of data. An LSTM trained on system metrics might forecast expected values and detect when actual values diverge significantly (an anomaly). These are useful for capturing seasonality or trends such as daily patterns and detecting context-dependent anomalies.
- Supervised Learning for Event Correlation: If you have historical incidents labeled (e.g., past outages with known root causes), you can train a model such as a Random Forest or XGBoost classifier to predict when a combination of log patterns and metric anomalies signals a specific issue. For instance, a model could learn that when an "OutOfMemory" error log coincides with a high memory-usage metric, it likely indicates an application crash event. Such a model requires a training dataset of events (normal vs. issue), which is not always available, so this is an optional approach. Often, unsupervised models are easier to start with in observability.
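As a minimal illustration of the unsupervised approach, the following sketch applies scikit-learn's Isolation Forest to synthetic hourly [error_count, avg_latency_ms] vectors. The data and the contamination rate are placeholders you would replace with your own telemetry and tuning:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic feature vectors per hour: [error_count, avg_latency_ms]
rng = np.random.default_rng(42)
normal = np.column_stack([rng.poisson(5, 200), rng.normal(120, 15, 200)])
spikes = np.array([[250, 900], [180, 750]])  # two clearly anomalous hours
X = np.vstack([normal, spikes])

# Train on "normal" history, then score new observations
model = IsolationForest(contamination=0.01, random_state=42).fit(normal)

labels = model.predict(X)            # -1 = anomaly, 1 = normal
scores = model.decision_function(X)  # lower score = more anomalous
for features, label, score in zip(X[-3:], labels[-3:], scores[-3:]):
    print(features, "anomaly" if label == -1 else "normal", round(score, 3))
```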
- Deduplication and Aggregation: Redundant or duplicate events are common in large-scale systems. Modern solutions use advanced deduplication algorithms to merge multiple identical or highly similar events into a single incident. Aggregation further summarizes a burst of related events—whether from transient spikes or recurring alerts—reducing noise and preventing alert fatigue.
- Contextual Enrichment and Filtering: Beyond simply cleaning up raw data, observability platforms enrich events with context from configuration management databases (CMDB), topology maps, historical baselines, and threat intelligence feeds. This added context helps the correlation engine distinguish between benign anomalies and genuine incidents. For example, if multiple alerts are tied to a known critical service, they can be prioritized over similar alerts from less critical systems. Automated filtering techniques then suppress irrelevant or low-priority events, ensuring that teams see only the most actionable information.
- Temporal and Topological Analysis: By analyzing the timing of events and understanding the network or service topology, observability solutions can link events that occur within a specific window or that are part of a dependency chain. For instance, if a downstream service reports errors shortly after its upstream dependency fails, the correlation engine can attribute the incident to that dependency failure.
- AI and Machine Learning: To handle the dynamic nature of modern IT environments, many observability platforms incorporate machine learning. These models learn what "normal" looks like and can dynamically adjust correlation rules. Over time, the system becomes better at predicting which clusters of events represent true incidents versus those that are simply noise. One approach is to use unsupervised clustering on time windows – group together all unusual events that occurred within the same timeframe. Another approach is to use graph-based algorithms (treating events as nodes and linking those occurring closely in time or with similar patterns).
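The graph-based idea can be prototyped in a few lines: treat events as nodes, connect events that occur within a short time window, and treat each connected component as a candidate incident. A rough sketch using networkx follows; the 120-second window and the event data are illustrative:

```python
from itertools import combinations

import networkx as nx

# (event_id, epoch_seconds) pairs; in practice these come from your event stream
events = [
    ("cpu_spike_web-1", 1000), ("cpu_spike_web-2", 1010),
    ("db_conn_errors", 1045), ("disk_full_batch-7", 9000),
]

WINDOW_SECONDS = 120  # events closer than this are assumed to be related

graph = nx.Graph()
graph.add_nodes_from(event_id for event_id, _ in events)
for (id_a, t_a), (id_b, t_b) in combinations(events, 2):
    if abs(t_a - t_b) <= WINDOW_SECONDS:
        graph.add_edge(id_a, id_b)

# Each connected component is a candidate correlated incident
for component in nx.connected_components(graph):
    print(sorted(component))
# Expected grouping: the three events around t=1000-1045 form one incident,
# while the unrelated disk_full event stands alone.
```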
- OpenTelemetry can provide the strong data foundation needed for correlation. By using OpenTelemetry SDKs, all telemetry data can carry consistent identifiers (trace IDs, resource attributes, etc.) that make it easier to link events across systems. For example, traces from a distributed transaction can be tied to log entries by embedding trace IDs in logs.
- The OpenTelemetry Collector can receive data from many sources and export to multiple backends. This means that with one instrumentation standard, you can send data to an OpenSearch cluster for log analysis and to a Prometheus backend for metrics, ensuring all tools see a consistent stream of events.
- OpenSearch can be used to index and search logs and event data. It also has an Observability plugin that supports trace visualization and correlation features. A practical approach is to send logs and traces to OpenSearch and use its built-in Trace Analytics to correlate them. The OpenSearch Observability UI can reconstruct distributed traces as flame graphs and build service maps that show relationships between services. By indexing trace spans and log records together with trace and span IDs, you enable queries like "find all logs for this trace," effectively correlating different data types (see the query sketch after this list). OpenSearch also supports anomaly detection, as discussed earlier, and offers an Alerting plugin where you can define monitors and triggers.
- For metrics (system CPU, application latency, etc.), a specialized time-series database can be used. Prometheus is a popular choice for metric collection and alerting: it can scrape metrics from endpoints and evaluate alerting rules. Prometheus can perform simple correlation through recording rules or alerting expressions. For example, you might write a PromQL alert that checks whether error rates are high while traffic count is low to detect a specific condition. Prometheus Alertmanager groups alerts to avoid duplicates and can perform basic dependency groupings.
- OpenTelemetry traces can be used to build a service dependency graph using OpenSearch's service map feature, which visualizes the application topology and the calls between microservices.
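With trace IDs embedded in both spans and log records, the "find all logs for this trace" query becomes a simple term filter. A sketch with the opensearch-py client follows; the endpoint, index pattern, and the traceId field name are assumptions that depend on how your pipeline maps fields:

```python
from opensearchpy import OpenSearch

# Assumed endpoint; add authentication as appropriate for your cluster
client = OpenSearch(
    hosts=[{"host": "search-observability.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"  # e.g., taken from a slow trace

response = client.search(
    index="app-logs-*",  # hypothetical log index pattern
    body={
        "query": {"term": {"traceId": trace_id}},
        "sort": [{"@timestamp": "asc"}],
        "size": 100,
    },
)

for hit in response["hits"]["hits"]:
    print(hit["_source"].get("@timestamp"), hit["_source"].get("message"))
```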
"At 12:05 PM, an anomaly in CPU usage on 3 servers coincided with an uptick in error logs containing 'DatabaseConnectionError' messages."
"Around 12:05 PM, the application experienced a surge in CPU usage and database connection errors, suggesting a possible database overload or connectivity issue. This likely triggered high latency. Recommended action: check database health and consider scaling or optimizing queries."
- Amazon CloudWatch Dashboards: CloudWatch provides dashboards where you can display metrics, alarms, and even custom text or images. You can create a centralized dashboard that shows key metrics (CPU, memory, latency, etc.), counts of certain log events, and the output from your AI/ML pipeline. For instance, one panel might show a time series of anomaly scores (so you can see when the system thinks something is off), and another panel could list active CloudWatch Alarms. Using the CloudWatch custom widget feature, you can embed the generative AI summary directly into the dashboard, as discussed in the previous section. The dashboard can also include classic elements like graphs of metric trends or tables of recent log entries (a small dashboard sketch follows this list).
- Amazon QuickSight: QuickSight allows advanced BI visualization, especially if you want to combine data from multiple sources or do longer-term analysis. You can use QuickSight to create interactive charts and tables from logs stored in S3 by using Amazon Athena to query the data. Amazon QuickSight Q enables any user to ask questions of their data using natural language, without having to write SQL queries. Amazon QuickSight's machine learning (ML) capabilities allow you to uncover hidden insights, identify key drivers, and generate forecasts without needing to develop custom ML models or algorithms.
- Amazon OpenSearch Dashboards: If you have operators who need to dive into logs, OpenSearch Dashboards can be used for detailed exploration. You might set up saved searches or visualizations there, like a dashboard that shows log rate, error rate, and latency together.
- Amazon Managed Grafana: Amazon Managed Grafana can be used to visualize, query, and correlate metrics, logs, and traces at scale. Based on open-source Grafana with added features like single sign-on, Amazon Managed Grafana enables querying, visualizing, alerting on, and understanding observability data from various sources, such as container metrics stored in Amazon Managed Service for Prometheus.
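As a minimal example of the CloudWatch option, the boto3 sketch below creates a dashboard with one metric graph and one text panel that could hold a generated summary. The dashboard name, metric, and region are assumptions:

```python
import json

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # assumed region

dashboard_body = {
    "widgets": [
        {   # Metric graph: average EC2 CPU utilization (illustrative metric)
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "metrics": [["AWS/EC2", "CPUUtilization"]],
                "stat": "Average",
                "period": 300,
                "region": "us-east-1",
                "title": "CPU utilization",
            },
        },
        {   # Text panel: a place to surface a generated incident summary
            "type": "text",
            "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {"markdown": "## Latest AI summary\n_(updated by pipeline)_"},
        },
    ]
}

cloudwatch.put_dashboard(
    DashboardName="observability-overview",  # hypothetical dashboard name
    DashboardBody=json.dumps(dashboard_body),
)
```

A custom widget backed by a Lambda function could replace the static text panel so the summary refreshes dynamically.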
- Using Generative AI to Gain Insights into CloudWatch Logs – Demonstrates how to integrate Amazon Bedrock to summarize CloudWatch Logs on a dashboard
- Efficiently Build and Tune Custom Log Anomaly Detection Models with Amazon SageMaker – A step-by-step guide on using SageMaker to process log data and train anomaly detection models
- Aggregating, Searching, and Visualizing Log Data from Distributed Sources with Athena and QuickSight – A part of a series showing how to use S3 (as a data lake), AWS Glue, Amazon Athena, and Amazon QuickSight to centralize and visualize logs and metrics from many sources
- Ingest Streaming Data into Amazon OpenSearch Service with Amazon Kinesis Data Firehose – Introduces how to set up a Firehose delivery stream to send data to an OpenSearch (Elasticsearch) cluster in a VPC
- Amazon DevOps Guru – The DevOps Guru FAQs explain how it correlates metrics and logs using ML to surface insights and provides recommendations
- AWS Observability Best Practices – Provides comprehensive guides, tools, recipes, and patterns to help improve observability and monitoring of AWS cloud environments
- AWS CAF Operations Perspective for Event Management with AIOps – Provides an overview of event management using AIOps (artificial intelligence for IT operations) in the context of the AWS Cloud Adoption Framework's operations perspective
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.