Enhancing Incident Management with AI: Tips for Builders

Enhancing Incident Management with AI: Tips for Builders

Generative AI can help you take care of a lot of chores when managing incidents. In this article, you'll learn a few concrete use cases for AI in this context.

Published Mar 15, 2024
No system is perfect, specially in the distributed and complex landscape in which you operate. As a builder, your priority is to create resilient systems and to respond effectively to incidents when they occur. The integration of Artificial Intelligence (AI) into this process presents a game-changing opportunity because it can take a lot of manual work off your place.
Let's dive into practical tips for leveraging AI to revolutionize your incident management practices.

Embrace AI intentionally

Putting AI into critical systems like incident management is not a trivial affair. How your sensible data will be treated is crucial not only for peace-of-mind, but due to legal requirements. Rootly's CEO wrote a more extensive article on this topic for Forbes if you’re interested.
So, before diving into practical stuff, I want to look into a few key elements that you need to keep in mind when integrating AI into your toolchain.

Flexibility and Control

When implementing AI into your incident management workflow, flexibility is key. This means being able to tailor AI's involvement to your team's specific needs—whether that's full automation for certain tasks or AI-assisted recommendations that require human approval. The ability to opt in or out of AI features ensures that your team remains in control, finding the perfect balance between automation and human judgment.

Prioritize Data Security

Incorporating AI into your incident management process must not come at the expense of data security. Opt for tools that guarantee your data remains your own, with no cross-customer learning or unauthorized data storage. Moreover, ensure that any AI tool you employ scrubs sensitive information from its processing, maintaining compliance and data privacy.

Integration and Customization

A key to effective AI implementation is the seamless integration with your existing tools and processes. The ability to customize how AI interacts with your systems, such as connecting to your own OpenAI instance, ensures that AI operates within your predefined security and operational frameworks, enhancing trust and efficiency.

Setting up an LLM

AI is a catch-all term used to refer to a collection of machine learning techniques. One of the most popular AI techniques on the rise since last year is Large Language Models (LLMs), such as ChatGPT by OpenAI.
An LLM is a perfect fit for automating a lot of manual tasks in incident management, as it can ‘understand’ human speech—known as natural language—and write semantically and syntactically correct texts. There’s quite a few commercial alternatives offering LLMs that reduce the barriers of entry as you can interact with them through APIs. But there’s also a myriad of open-source options to build your own LLM.
At Rootly, we use enterprise OpenAI because it provides us with the guarantees that our customers demand: data privacy, no training on their data, and cutting-edge models. Then we train our model with SRE knowledge and teach it how to interact with responders.
Our Slack app has access to each incident’s channel messages (based on each customer’s privacy settings). The incident’s information is used as context for the model to respond to prompts that we’ve fine-tuned over months so they output relevant information in different use cases.

Use cases for AI in incident management

AI in incident management isn't about replacing builders or responders but getting rid of the burdensome chores and unlocking more time to focus on resolving incidents. For example, when an incident breaks, you have all these stakeholders asking for updates that you have to write every now and then in a language that makes sense to them.
Or, when a new person is added to the war room, they need to spend time catching up through a lot of messages or have somebody take some time to walk them through what has happened so far.
AI can process what’s going on in the incident and write summaries, answer questions about the incident for anyone interested, or craft unbiased resolution messages. Let’s look at some examples you can build into your Slack bot to leverage AI in incident management.

Getting up to speed

Just landed in the middle of an ongoing incident, with a Slack channel buzzing with theories, metrics, and discussions? Use AI to cut through the noise fast. Use the incident’s context to write a good summary of what’s going on.
Screenshot of Rootly AI's catchup feature
`/rootly catchup` command in Slack prompts the AI model to generate an updated summary for you

Ask any question about the incident

At Rootly we implemented a context-aware agent for our LLM. That means, our Slack bot can understand natural language questions and answer them based on all the information it has about the incident. We also implemented a UI to access our LLM through Rootly’s web app, for those who need it outside Slack.
Screenshot of Ask Rootly AI
Ask Rootly AI lets people ask any questions about the incident

Unbiased resolution messages

Incidents are loaded with a psychological and emotional charge, which might make it hard to write cold-headed conclusions of the things that happened without blaming anyone. I know teams that use AI for this purpose alone: so they can get honest reviews of what caused incidents and how they were resolved to evolve their reliability strategy.
Screenshot of a generated resolution message
`/rootly resolve` lets people use GenAI to get a facts-only message on how the resolution happened

AI writing is just the beginning

Integrating AI into your incident management practices saves you time by taking manual writing chores off your plate. But these are just the basic use cases. Once you have more confidence, you can built out deeper AI integrations that help you extract insights on how your reliability practice is performing and plan proactive improvements.
Remember, the goal is not just to manage incidents but to continuously improve your systems and processes, making each incident an opportunity for growth and learning. Ready to see how AI can transform your incident management? Explore the tools and strategies available and embark on your journey toward a smarter, more resilient system.
At Rootly, we built an entire suite of AI features that integrate throughout the entire incident lifecycle.