Observability for GenAI workloads on AWS

Applying the operational excellence pillar to GenAI

Randy D
Amazon Employee
Published Sep 12, 2024
Observability is part of two of the design principles of the Operational Excellence pillar of the Well-Architected Framework:
Implement observability for actionable insights: Gain a comprehensive understanding of workload behavior, performance, reliability, cost, and health. Establish key performance indicators (KPIs) and leverage observability telemetry to make informed decisions and take prompt action when business outcomes are at risk. Proactively improve performance, reliability, and cost based on actionable observability data.
Learn from all operational events and metrics: Drive improvement through lessons learned from all operational events and failures. Share what is learned across teams and through the entire organization. Learnings should highlight data and anecdotes on how operations contribute to business outcomes.
In this article we’ll lay out a four-level framework for GenAI workload observability. We’ll then provide a reference design supported by a working implementation.

Four levels of GenAI observability

A complete GenAI application may include several components beyond the foundation model itself. A RAG application, for example, will have a vector database, an embedding model, and an ingest pipeline for unstructured data. Andreessen Horowitz published an “LLM Application Stack” that shows a comprehensive list of these components.
As with any application consisting of multiple components, when we consider observability, we’ll need to start with the lowest-level component metrics and work our way up to more meaningful KPIs.

Level 1: Component-level metrics

The basic level of observability focuses on visibility into the state of each of the components. This may include capturing metrics such as latency, number of invocation errors, and resource utilization.

Level 2: Tracing

We implement tracing to capture the interaction between a model and immediately surrounding components, like agents, tools, and vector databases.

Level 3: End-user feedback

End-user feedback provides early indications of potential issues with the application. This level starts to surface a KPI that we should track to indicate not only application health, but also application usefulness to the end user. End-user feedback can also drive improvements to the model through techniques like reinforcement learning with human feedback (RLHF) or direct preference optimization (DPO).

Level 4: Advanced metrics

This level involves implementing advanced monitoring techniques such as monitoring for embedding drift, measuring faithfulness (the factual consistency of the answer with the retrieved context), and measuring context precision and recall. It can also include monitoring user sentiment based on the feedback provided.

Reference design

Level 1: Component-level metrics

Amazon CloudWatch makes it easy to observe these types of metrics, since it tightly integrates with AWS services that are used in the application. If you use third-party or open-source libraries, you can push their metrics into CloudWatch, and you can also record your own custom application metrics. With Amazon CloudWatch, we can visualize these metrics and also implement necessary alarms.
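As a sketch of recording a custom application metric, assuming boto3 and a hypothetical `GenAI/App` namespace (neither is prescribed by the reference design), per-invocation model latency could be published like this:

```python
import datetime


def build_latency_metric(model_id: str, latency_ms: float) -> dict:
    """Build a CloudWatch PutMetricData payload for one invocation latency sample."""
    return {
        "Namespace": "GenAI/App",  # assumed namespace, not part of the reference design
        "MetricData": [
            {
                "MetricName": "InvocationLatency",
                "Dimensions": [{"Name": "ModelId", "Value": model_id}],
                "Timestamp": datetime.datetime.now(datetime.timezone.utc),
                "Value": latency_ms,
                "Unit": "Milliseconds",
            }
        ],
    }


def publish_latency(model_id: str, latency_ms: float) -> None:
    """Send the metric to CloudWatch (requires AWS credentials)."""
    import boto3  # imported here so the module loads even without boto3 installed

    boto3.client("cloudwatch").put_metric_data(**build_latency_metric(model_id, latency_ms))
```

An alarm on `InvocationLatency` (for example, p99 over a threshold) then covers the "implement necessary alarms" step.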

Level 2: Tracing

We can leverage OpenTelemetry to instrument tracing, where each span captures a different stage of the workflow. Each span contains events that further break the stage down into granular steps or operations, providing detailed insight into the execution of each stage. Important data points, such as the user prompt and the LLM response, are captured as event attributes.
We may also choose to leverage OpenLLMetry, an open source project built on top of OpenTelemetry that provides auto-tracing capabilities for generative AI applications. Currently, OpenLLMetry integrates with various LLM providers, vector databases, and frameworks (e.g., LangChain, LlamaIndex).
To collect these traces for further processing and visualization, we can use the OpenTelemetry Collector. We can then use the Collector to export the traces to AWS X-Ray for visualization and debugging.
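A minimal Collector configuration for this path might look like the following, assuming the OpenTelemetry Collector Contrib distribution (which includes the `awsxray` exporter); the region is a placeholder:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  awsxray:
    region: us-east-1   # placeholder region

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [awsxray]
```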

Level 3: End-user feedback

We can leverage OpenTelemetry/OpenLLMetry or MLFlow to capture the feedback, whether it is a simple thumbs up/thumbs down or a review written by the user.
At the moment, OpenLLMetry requires a Traceloop API key to capture traces. Once this requirement is lifted, the OpenTelemetry Collector can export simple feedback metrics to Amazon CloudWatch, while feedback with more substance will be captured as part of a trace span and exported to Amazon S3 for further processing (see Level 4).
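For the simple-feedback path, one hedged sketch: a thumbs up/down can reach CloudWatch without an explicit API call by logging a line in CloudWatch Embedded Metric Format (EMF), which CloudWatch Logs converts into a metric automatically. The namespace, dimension, and metric names here are assumptions:

```python
import json
import time


def feedback_emf(thumbs_up: bool, session_id: str) -> str:
    """Encode a thumbs up/down as a CloudWatch Embedded Metric Format log line."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": "GenAI/Feedback",        # assumed namespace
                    "Dimensions": [["Application"]],      # low-cardinality dimension
                    "Metrics": [{"Name": "ThumbsUp", "Unit": "Count"}],
                }
            ],
        },
        "Application": "rag-app",
        "SessionId": session_id,  # kept as a plain log property, not a dimension
        "ThumbsUp": 1 if thumbs_up else 0,
    }
    return json.dumps(record)


# Printing the line to stdout is enough on AWS Lambda, or anywhere the
# CloudWatch agent is configured to ingest EMF-formatted logs.
print(feedback_emf(True, "abc-123"))
```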

Level 4: Advanced metrics

These techniques typically require offline processing using services such as Amazon SageMaker Processing jobs.
We can use the OpenTelemetry Collector to export traces to Amazon S3, which will be used as the data source for running the monitoring jobs. The monitoring jobs will generate the relevant metrics and publish them to Amazon CloudWatch.
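To make the embedding-drift idea concrete, one simple signal (an assumption on our part, not the only way to measure drift) is the cosine distance between the centroid of a baseline embedding sample and the centroid of recent embeddings. A monitoring job could compute this over traces landed in S3 and publish the result to CloudWatch:

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embedding_drift(baseline: list[list[float]], recent: list[list[float]]) -> float:
    """1 minus the cosine similarity of the two centroids: 0.0 means no drift."""
    return 1.0 - cosine_similarity(centroid(baseline), centroid(recent))
```

An alarm on the published drift value can then flag when incoming queries or documents diverge from the distribution the application was validated against.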

Example implementation

This GitHub repository contains a working example of these four levels of observability, including tailored CloudWatch dashboards. It implements a RAG application.

Next steps

There’s no universal observability implementation that will work in every scenario. Rather, we tried to present a logical framework for GenAI observability. We then presented a reference design where you can plug in your own observability tools. We also provided a working example using mostly AWS-native services.
We welcome feedback, and also contributions to our GitHub repository. If you have additional patterns to share, feel free to file a pull request. We are also investing in contributions to OpenLLMetry, and again welcome collaboration.
Finally, we invite you to also review a repository written by some of our colleagues at AWS.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
