Supercharge your DevOps practices with generative AI


Learn how generative AI can be embedded into your DevOps practices to improve your agility and stability

Chris Williams
Amazon Employee
Published Jan 16, 2025
This article was written in collaboration with Chris Williams.
Development teams aspire to ship software at speed without compromising the stability of their application workloads. For many, this may seem unattainable due to existing pressures or too much time spent on low-effort tasks. But what if we could ease these challenges with technology?
This was the inspiration that led to Chris Williams, Julie Gunderson and I delivering our session Supercharge your DevOps practices with generative AI at re:Invent 2024. Following the session, many attendees wanted to learn more and get hands-on with these architectures.
In this article we will do just that: it contains a written summary of the session, including the architectures we discussed. You will also find links to the code assets so that you can deploy them within your own environment.

DevOps is Amazing

Before DevOps, software development teams faced numerous issues that hindered their ability to deliver software effectively. The wall between development and IT operations created silos that slowed the speed at which software could be delivered. Knowledge and common practices were not shared across teams, causing operational problems and knock-on effects on the reliability of application uptime.
But then DevOps was introduced to the world and teams began to adopt it. Suddenly the wall between disparate teams was broken down, fostering collaboration and shared responsibility. Delivery speeds increased as process changes created repeatable scaling, and tools such as continuous monitoring allowed rapid recovery when things went wrong. Through DevSecOps, even security concerns were baked into the entire development lifecycle rather than being an afterthought.
As DevOps matured, the question arose: how do we achieve “good DevOps”? For that you need measurable metrics. This is where the good people at DORA have provided well-thought-out metrics that you can use to measure your efforts:
  • Lead Time for Changes: The time it takes from code commit to production deployment.
  • Deployment Frequency: How often an organization successfully releases to production.
  • Change Failure Rate: The percentage of deployments causing a failure in production.
  • Mean Time to Recovery (MTTR): How long it takes to restore service when a failure occurs in production.
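As a rough illustration, the four metrics above can be derived from a simple deployment log. The record structure below is a hypothetical sketch, not from any DORA tooling:

```python
from datetime import datetime

# Hypothetical deployment records: commit time, deploy time, whether the
# deployment caused a production failure, and when service was restored.
deployments = [
    {"committed": datetime(2025, 1, 6, 9, 0), "deployed": datetime(2025, 1, 6, 15, 0),
     "failed": False, "recovered": None},
    {"committed": datetime(2025, 1, 7, 10, 0), "deployed": datetime(2025, 1, 8, 11, 0),
     "failed": True, "recovered": datetime(2025, 1, 8, 13, 30)},
    {"committed": datetime(2025, 1, 9, 8, 0), "deployed": datetime(2025, 1, 9, 12, 0),
     "failed": False, "recovered": None},
]

def lead_time_hours(deps):
    """Mean time from code commit to production deployment, in hours."""
    deltas = [(d["deployed"] - d["committed"]).total_seconds() / 3600 for d in deps]
    return sum(deltas) / len(deltas)

def change_failure_rate(deps):
    """Percentage of deployments causing a failure in production."""
    return 100 * sum(d["failed"] for d in deps) / len(deps)

def mttr_hours(deps):
    """Mean time to restore service after a failed deployment, in hours."""
    times = [(d["recovered"] - d["deployed"]).total_seconds() / 3600
             for d in deps if d["failed"]]
    return sum(times) / len(times) if times else 0.0

print(round(lead_time_hours(deployments), 1))  # mean commit-to-deploy time
print(change_failure_rate(deployments))        # percent of failed deploys
print(mttr_hours(deployments))                 # mean recovery time
```

Deployment frequency is simply the count of records over the observed window; the other three follow directly from the definitions in the list above.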

DevOps Meets Generative AI

Whilst most teams dream of fully embracing DevOps, allowing them to ship multiple times a day with a low chance of failure, there are often existing challenges that hinder this transition. These challenges are typically related to how the teams currently operate, preventing them from fully adopting DevOps practices.
Examples include:
  • Accumulating technical debt and poor code quality
  • Lack of automation, inadequate observability, and monitoring gaps
  • Inter-department task complexity
  • Poor incident management
  • Inefficient code review processes
However, the rise of Generative AI presents an opportunity to ease the burdens on all development teams. This technology can be used to automate parts of the processes, allowing teams to focus their time and efforts on more rewarding challenges.
In our demonstrations, we will show you how Generative AI can help to reduce the friction points as well as give actionable examples that you can leverage in your own environments.

Solutions

In our talk, we introduced three separate solutions that address key challenges found throughout the software development lifecycle.
These are:
  • Automation of Kanban quality validation and task refinement
  • Extending code reviews with bespoke team checklists
  • Capturing incident details to produce incident reports
In the subsequent sections we will explore each of these in more detail.

Automating Kanban Workflows

Most organisations use task management software to track workflows and distribute tasks across technical teams. An initial hurdle is making sure that ticket quality is high, requirements are clearly understood, and that workstreams are broken down into manageable components (often referred to as subtasks).
Today, many organisations perform these efforts manually, leading to delays and confusion about the expected deliverables. This is where the Automating Kanban Workflows architecture can support teams.
After a new ticket is created or an edit is made, this workflow can be programmatically triggered by posting to an Amazon SNS topic. This in turn triggers a workflow whereby the task description can be reviewed against a checklist of what a high quality ticket should contain, using Amazon Bedrock. If it doesn’t meet this checklist the task is reassigned back to the original author with some friendly suggestions to help meet that quality bar.
Figure 1: Architecture for Automating Kanban Workflows
If the task passes, further actions can be taken. In the sample architecture above (figure 1), this workflow can evaluate whether the task itself should be broken down into subtasks. This helps reduce risk by directing developers to ship smaller portions of code. The workflow could be extended with other logic: for example, if the ticket is a bug, reviewing previous commits or attempting to find the bug in the source code and raising a pull request.
In this sample architecture we are using Generative AI to help improve the overall quality of tickets, reducing the total time spent in back-and-forth dialogue. In addition, it is being used to make low-risk decisions (does the ticket meet the quality bar?) and to perform steps that, done manually, would have delayed the start of work or, skipped entirely, would have introduced risk.
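A minimal sketch of the quality-check step follows. The checklist wording, prompt, model ID, and payload handling are illustrative assumptions, not the session's actual code:

```python
import json

# Assumed team checklist for what a high-quality ticket should contain.
CHECKLIST = [
    "Has a clear, actionable title",
    "Describes acceptance criteria",
    "Identifies the affected components",
]

def build_prompt(ticket_description: str) -> str:
    """Ask the model to grade the ticket against the team checklist."""
    items = "\n".join(f"- {c}" for c in CHECKLIST)
    return (
        "Review this ticket against the checklist below. Respond with JSON: "
        '{"passes": true/false, "suggestions": [...]}\n'
        f"Checklist:\n{items}\n\nTicket:\n{ticket_description}"
    )

def parse_verdict(model_output: str) -> dict:
    """Extract the pass/fail verdict and suggestions from the model's JSON reply."""
    verdict = json.loads(model_output)
    return {"passes": bool(verdict.get("passes")),
            "suggestions": verdict.get("suggestions", [])}

def handler(event, context):
    """Lambda entry point, triggered by the SNS topic on ticket create/edit."""
    import boto3  # assumed available in the Lambda runtime
    ticket = json.loads(event["Records"][0]["Sns"]["Message"])
    bedrock = boto3.client("bedrock-runtime")
    resp = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model
        messages=[{"role": "user",
                   "content": [{"text": build_prompt(ticket["description"])}]}],
    )
    verdict = parse_verdict(resp["output"]["message"]["content"][0]["text"])
    # A follow-up step would reassign failing tickets to the author with the
    # suggestions, or evaluate subtask breakdown for passing tickets.
    return verdict
```

The Bedrock Converse call and the SNS event shape are standard; everything else (checklist contents, the JSON verdict contract) is a stand-in you would replace with your own quality bar.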

Improving Code Quality Reviews

A key responsibility for developers is not just to write their own code, but to review their peers' code. This process is absolutely necessary, however it can be a significant time sink when balanced against other deadlines and responsibilities. Code reviews are designed to validate the technical implementation alongside confirming that the code delivers on the objectives of the task.
From experience, a lot of effort goes into reviewing the technical implementation, with less spent on whether the code achieves those objectives. The Improving Code Quality Reviews architecture helps here, by running a code review through a developer checklist.
Figure 2: Architecture for Improving Code Quality Reviews
Every time a developer raises or amends a pull request, a webhook is triggered that notifies an AWS Lambda function that an automated code review is required. A bespoke developer checklist for the organisation is passed into Bedrock alongside the diff of the pull request, which then analyses the contents against the required technical implementation.
This means that expected criteria, such as new code having unit tests, infrastructure including monitoring, or APIs remaining backwards compatible, can be validated before the human code review happens. This saves peers from needing to check these themselves, whilst giving you feedback in near real time. Reviewers can then spend more time validating that the ticket objectives have been met.
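To make the flow concrete, here is a sketch of how the webhook payload, checklist, and diff might be combined into a Bedrock request. The checklist entries and payload shape are assumptions for illustration:

```python
import json

# Assumed bespoke checklist for this organisation.
TEAM_CHECKLIST = [
    "New code paths have unit tests",
    "New infrastructure includes monitoring",
    "Public APIs remain backwards compatible",
]

def review_prompt(diff: str, checklist: list) -> str:
    """Combine the team checklist with the pull request diff."""
    items = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(checklist))
    return (
        "You are an automated code reviewer. For each checklist item, state "
        "whether the diff satisfies it and explain why.\n"
        f"Checklist:\n{items}\n\nDiff:\n{diff}"
    )

def handler(event, context):
    """Lambda entry point, invoked by the repository webhook on PR activity."""
    import boto3  # assumed available in the Lambda runtime
    body = json.loads(event["body"])  # assumed webhook payload shape
    bedrock = boto3.client("bedrock-runtime")
    resp = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model
        messages=[{"role": "user",
                   "content": [{"text": review_prompt(body["diff"],
                                                      TEAM_CHECKLIST)}]}],
    )
    review = resp["output"]["message"]["content"][0]["text"]
    # A follow-up step would post `review` back as a pull request comment.
    return {"statusCode": 200, "body": review}
```

Because the checklist is plain text passed into the prompt, each team can maintain its own version without changing the Lambda code.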

Streamline Incident Response

Once code has shipped to production, that is not the end of the story for developers. Many teams operate in a model of “you build it, you run it”. This means that when applications encounter an issue, the expectation is that teams investigate and remediate it so that service is restored.
As a part of this process, details surrounding the incident should be captured into an incident report so that in future occurrences teams can remediate or mitigate the issue much sooner. Unfortunately this last step is often missed as other priorities become the focus. However much of this process can actually be automated, which is what the Streamline Incident Response architecture demonstrates.
During an incident, different data streams, such as the messages in a chat application or the logs from AWS CloudTrail, tell the story of how the incident unfolded. This includes the investigation performed and the actions used to resolve the incident. By passing these details into Bedrock, a markdown-formatted incident report is output. This is then stored in an Amazon S3 bucket to be reviewed during future incidents.
Figure 3: Architecture for Streamline Incident Response
But what about when it happens again? This architecture extends to ingest each incident report into an Amazon Bedrock Knowledge Base so that natural language can be used to find previous incidents and key resolution details. In this sample, the results are dynamically formatted as a runbook so that supporting teams can quickly get guidance on how an alarm was previously resolved.
By automating both the generation of incident reports and the dynamic creation of runbooks from historical events, teams can more rapidly reduce downtime from recurring alerts. This can be amplified by extending AWS Chatbot to invoke these workflows directly within your messaging application.
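A sketch of the report-generation step is shown below. The chat and CloudTrail record shapes, the bucket name, and the prompt are illustrative assumptions:

```python
import json
from datetime import datetime, timezone

def build_timeline(chat_messages, cloudtrail_events):
    """Merge chat and CloudTrail records into one chronological narrative."""
    entries = [(m["time"], f'[chat] {m["user"]}: {m["text"]}')
               for m in chat_messages]
    entries += [(e["EventTime"], f'[cloudtrail] {e["EventName"]} by {e["Username"]}')
                for e in cloudtrail_events]
    return "\n".join(line for _, line in sorted(entries))

def report_prompt(timeline: str) -> str:
    """Ask the model for a markdown incident report from the raw timeline."""
    return ("Write a markdown incident report (summary, timeline, root cause, "
            "resolution, follow-ups) from this raw timeline:\n" + timeline)

def handler(event, context):
    """Generate the incident report and store it for future incidents."""
    import boto3  # assumed available in the Lambda runtime
    timeline = build_timeline(event["chat"], event["cloudtrail"])
    bedrock = boto3.client("bedrock-runtime")
    resp = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # example model
        messages=[{"role": "user",
                   "content": [{"text": report_prompt(timeline)}]}],
    )
    report = resp["output"]["message"]["content"][0]["text"]
    s3 = boto3.client("s3")
    key = f"incidents/{datetime.now(timezone.utc):%Y-%m-%dT%H%M}.md"
    s3.put_object(Bucket="incident-reports-bucket",  # assumed bucket name
                  Key=key, Body=report.encode())
    # A Bedrock Knowledge Base syncing over this bucket would then make past
    # reports searchable with natural language.
    return {"report_key": key}
```

The merge step uses ISO-8601 timestamps so that plain string sorting yields chronological order; in practice you would normalise whatever timestamp formats your chat tool and CloudTrail actually emit.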

Considerations

Now that we’ve reviewed the solutions, what should you consider when using the samples or building your own automated workflows?
As these are built, it is important to remember that they integrate with the existing workflows your team has today. Therefore, consider how and when people should be involved to validate output or to ensure that the process is being followed as expected.
The focus of efforts should be on low-skill tasks that slow teams down from delivering meaningful and impactful work. By automating these kinds of friction points, the goal is that developers and other supporting roles can spend more time adding value to the organisation.
Finally, always consider your own compliance and regulatory requirements. Factors like geographical locations, PII (personally identifiable information) and industry-based frameworks (such as HIPAA) must always be discussed before proceeding.

Conclusion

By automating low value tasks, teams can free themselves to work on more rewarding and engaging tasks. In this article we discussed how your DevOps processes can evolve to use Generative AI in performing automation throughout the software development lifecycle.
All of the solutions in this article can be found in the GenAI for DevOps repository on GitHub. You can also see these architectures running in the demo videos featured throughout our re:Invent talk Supercharge your DevOps practices with generative AI on YouTube.
If you’ve enjoyed reading this article, or have ideas for other GenAI-based DevOps architectures, please let us know in the comments.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
