5 tips to control the "I" in RoI-driven observability
5 recommendations concerning the "investment", also known as "cost", aspect of the Return on Investment-driven observability framework.
Michael Hausenblas
Amazon Employee
Published Jan 16, 2025
This is the article I should have written two years ago. Oh well, better late than never. Note that I'm not trying to compete with the Quinns of the world here. If anything, I hope they can use the following as further input to help you reduce your AWS bill.
Some context: as part of my role at AWS, I work with customers on a daily basis to educate and advise on the usage of our offerings. This includes roadmap items as well as tips on using services such as Amazon Managed Service for Prometheus.
Return on Investment (RoI)-driven observability, in a nutshell, is a framework in which you start with a desired goal, such as reducing mean time to detect (MTTD) or increasing feature delivery, and compare that with the required investments, including the costs of the services used to meet and exceed the goals you set yourself. This helps align stakeholders (your boss, folks from finance, internal customers, etc.).
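To make that framing concrete, here is a back-of-the-envelope sketch. Every number in it is a made-up assumption; plug in your own:

```python
# Back-of-the-envelope RoI sketch; all numbers are made-up assumptions.
downtime_cost_per_hour = 10_000        # USD, assumed business impact of an outage
hours_saved_per_month = 5              # assumed MTTD/MTTR improvement
observability_spend_per_month = 8_000  # assumed total "I": services, people, tooling

monthly_return = downtime_cost_per_hour * hours_saved_per_month
roi = (monthly_return - observability_spend_per_month) / observability_spend_per_month
print(f"monthly return: ${monthly_return:,}, RoI: {roi:.0%}")  # -> $50,000, RoI: 525%
```

The exact numbers matter less than the exercise itself: once the "R" is written down, every cost-cutting measure below directly improves the ratio.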
So, let's focus on the costs. Here are 5 recommendations you can apply on a daily basis to lower your observability costs and, with that, increase your RoI:
Tip 1: Shut down what you don't use

This first one is obvious but worth calling out, because at scale it is non-trivial to execute. I remember that at Mesosphere, some ten years ago, there was an awesome internal tool that would automatically shut down cluster resources unless you actively prevented it. Inspired by that, I later created EKSphemeral, which lets you do the same for dev/test EKS clusters.
Coming back to the o11y space: for example, an idle Amazon Managed Service for Prometheus (AMP) managed collector, our agentless offering, costs you ca. $28 per month, easy to determine using our pricing calculator. I can also tell you how much it will cost you to use AMP for an average-sized EKS cluster. The question is: how many pieces of infra, such as AMP collectors, do you have running that don't do anything for you or your internal customers? Do you even know? What mechanisms and automation do you have in place to reap orphaned infra?
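As a sketch of what such automation could look like, the following flags AMP managed collectors (scrapers) whose source EKS cluster no longer exists. The region is an assumption, pagination is elided for brevity, and you should check the boto3 docs for the exact response shapes in your SDK version:

```python
# Sketch: flag AMP managed collectors (scrapers) whose source EKS cluster
# is gone. Region is an assumption; pagination is elided for brevity.
import boto3

region = "eu-west-1"  # assumption: use your region(s)
amp = boto3.client("amp", region_name=region)
eks = boto3.client("eks", region_name=region)

# Names of the EKS clusters that exist right now.
live_clusters = set(eks.list_clusters()["clusters"])

for scraper in amp.list_scrapers()["scrapers"]:
    # A managed collector scrapes one EKS cluster; the ARN ends in its name.
    cluster_arn = scraper["source"]["eksConfiguration"]["clusterArn"]
    if cluster_arn.split("/")[-1] not in live_clusters:
        print(f"Likely orphaned scraper: {scraper['scraperId']} -> {cluster_arn}")
```

Run something like this on a schedule and you have at least an inventory of candidates to reap; deleting automatically is a policy decision for your team.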
Tip 2: Ingest only what you use

Are you actually using a certain metric in, say, a Grafana dashboard? Are you alerting on it? Are logs from an EKS pod or a Lambda function from last year actually helpful in troubleshooting a problem you have right now? If not, why are you ingesting them in the first place? For example, in the Prometheus community, which I have been a member of since ca. 2016, there was for a long time, and arguably still is, guidance along the lines of "ingest everything, don't worry about it, decide later". While this may be fine for a vertically scaled, single Prometheus instance, it is certainly not true for SaaS. In my experience, ingest can make up 80% or more of the costs.
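One way to find candidates to drop is to cross-check your highest-cardinality metrics against what your dashboards actually reference. The sketch below does a naive version of this; the endpoints, token, and simple substring match are all assumptions, and tools such as Grafana's mimirtool do this analysis far more thoroughly:

```python
# Sketch: which of the most expensive metrics are never referenced in any
# Grafana dashboard? Endpoints/token are assumptions; the match is naive.
import json
import requests

PROM_URL = "http://localhost:9090"     # assumption: your Prometheus endpoint
GRAFANA_URL = "http://localhost:3000"  # assumption: your Grafana endpoint
GRAFANA_TOKEN = "..."                  # assumption: a Grafana service account token

# Top metric names by series count, from the TSDB status endpoint.
tsdb = requests.get(f"{PROM_URL}/api/v1/status/tsdb").json()
top_metrics = [entry["name"] for entry in tsdb["data"]["seriesCountByMetricName"]]

# Pull all dashboards and concatenate their JSON into one haystack.
headers = {"Authorization": f"Bearer {GRAFANA_TOKEN}"}
hits = requests.get(f"{GRAFANA_URL}/api/search?type=dash-db", headers=headers).json()
haystack = ""
for hit in hits:
    dash = requests.get(
        f"{GRAFANA_URL}/api/dashboards/uid/{hit['uid']}", headers=headers
    ).json()
    haystack += json.dumps(dash)

# Naive check: is the metric name mentioned anywhere in any dashboard?
for metric in top_metrics:
    if metric not in haystack:
        print(f"Candidate to drop at ingest: {metric}")
```

Anything this flags (and that no alert rule or ad-hoc query needs) is a candidate for a drop rule at the collection layer, before it ever hits your bill.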
Tip 3: Optimize your queries

Caching, sure. But what about non-optimized queries? PromQL, for example, is super powerful, and that's why customers love it and use it. But do you know how much a naive, unoptimized query can cost you? A time range that's way too long, or selectors that are not specific enough? And when do you actually notice it? Say you're part of a platform team that owns a compute platform and vends it to your internal customers. At the end of the month you get a 10x bill for your observability tool of choice. You may ask yourself: why is that the case, and which of your internal customers caused it? How can we prevent that from happening in the future, without compromising the insights our internal customers desire?
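To get a feel for the difference, you can time a naive range query against a properly scoped one. The metric and label names below are made up; point PROM_URL at any Prometheus-compatible query endpoint:

```python
# Sketch: compare the wall-clock cost of a naive PromQL range query against
# a scoped one. Metric/labels are made-up; PROM_URL is an assumption.
import time
import requests

PROM_URL = "http://localhost:9090"  # assumption

def timed_range_query(query: str, hours: int) -> float:
    now = time.time()
    params = {
        "query": query,
        "start": now - hours * 3600,
        "end": now,
        # Scale the step so we stay under Prometheus's ~11k points-per-series limit.
        "step": f"{max(60, hours * 3600 // 1000)}s",
    }
    t0 = time.perf_counter()
    requests.get(f"{PROM_URL}/api/v1/query_range", params=params).raise_for_status()
    return time.perf_counter() - t0

# Naive: no label selectors, 30 days of data.
naive = timed_range_query('rate(http_requests_total[5m])', hours=30 * 24)
# Scoped: one namespace, one job, the 6 hours you actually care about.
scoped = timed_range_query(
    'rate(http_requests_total{namespace="checkout",job="api"}[5m])', hours=6
)
print(f"naive: {naive:.2f}s, scoped: {scoped:.2f}s")
```

Wall-clock time is only a proxy: managed offerings typically meter queries on the number of samples processed, so narrower selectors and shorter time ranges reduce the bill directly, not just the latency.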
Tip 4: Be deliberate about AI/ML features

I am a big believer in, and supporter of, all things AI/ML. I think that GenAI has great use cases, and even simple things like natural language to DSL query translation (for example, to PromQL) are awesome. The same goes for anomaly detection driving dynamic alerting. However, if the output of the AI/ML feature in question is poorly understood and/or unbounded, you might find yourself in a world of pain: the "I" of RoI goes up by a lot without you seeing, or being able to justify, the "R".
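One generic guardrail, whatever the feature, is to put a hard alerting bound on spend before you turn it on. A sketch using AWS Budgets follows; the account ID, limit, and email address are all assumptions:

```python
# Sketch: an alerting bound on monthly spend, so an unbounded feature can't
# silently run up the "I". Account ID, limit, and email are assumptions.
import boto3

budgets = boto3.client("budgets")
budgets.create_budget(
    AccountId="111122223333",  # assumption: your account ID
    Budget={
        "BudgetName": "observability-monthly-cap",
        "BudgetLimit": {"Amount": "500", "Unit": "USD"},  # assumption
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,  # alert at 80% of the cap
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "you@example.com"}
            ],
        }
    ],
)
```

It won't stop the spend by itself, but it ensures the surprise arrives mid-month rather than on the invoice.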
Tip 5: Bet on open standards

You know why I love OpenTelemetry? Mostly because it's a standard: a standard for on-the-wire data transfer, a standard data model in the context of the SDKs, and standard naming (aka semantic conventions). The implementation might be scary, and the docs could be more helpful for non-trivial setups, but OpenTelemetry as a standard surely helps you to "instrument once, consume everywhere" … for those amongst you who are old enough: Java wants its headline back ;)
For example, if you build your telemetry collection pipeline on OpenTelemetry, you benefit not only from a rich set of correlated signals; should your current vendor turn out to be too expensive, the switching and opportunity costs are also lower than with proprietary APIs and SDKs.
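A minimal sketch of what that decoupling looks like with the OpenTelemetry Python SDK; the service name and endpoint are assumptions:

```python
# Sketch: instrument once with the OpenTelemetry SDK; switching backends
# means re-pointing the OTLP endpoint (or the collector behind it), not
# touching the instrumentation. Service name and endpoint are assumptions.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
# Point this at your collector; the backend behind it is swappable.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")
with tracer.start_as_current_span("process-order"):
    pass  # your business logic here
```

Swapping vendors then comes down to changing an exporter endpoint or a collector config, not re-instrumenting your code base, and that is exactly what keeps the "I" negotiable.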
Kudos: The cover image is by John McArthur on Unsplash.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.