Automatic Troubleshooting & ITSM System using EventBridge and Lambda
This system monitors EC2 instance metrics such as CPU, memory, and disk, logs in to the affected instances, and performs basic troubleshooting.
Published Aug 21, 2024
Folks, in IT operations it's a very common task to monitor server metrics such as CPU, memory, and disk or filesystem utilization. When any of these metrics becomes critical, a dedicated person has to log in to the server and perform some basic troubleshooting to find the initial cause of the utilization. If the same alert fires many times, that person repeats the same steps every time, which is boring and not productive at all. As a workaround, a system can be developed that reacts once an alarm is triggered and acts on the affected instances by executing a few basic troubleshooting commands. Just to summarize the problem statement and expectation -
Develop a system which will fulfill the below expectations -
- Each EC2 instance should be monitored by CloudWatch.
- Once an alarm gets triggered, something has to log in to the affected EC2 instance and run some basic troubleshooting commands.
- Then, create a JIRA issue to document the incident and add the output of the commands in the comment section.
- Then, send an automatic email providing all alarm details and JIRA issue details.
- EC2 Instances
- CloudWatch Alarms
- EventBridge Rule
- Lambda Function
- JIRA Account
- Simple Notification Service
- Open the Systems Manager console and click on "Documents".
- Search for the "AWS-ConfigureAWSPackage" document and execute it by providing the required details.
- Package Name = AmazonCloudWatchAgent
- Post installation, the CloudWatch agent needs to be configured as per the configuration file. For this, execute the AmazonCloudWatch-ManageAgent document. Also, make sure the JSON CloudWatch config file is stored in an SSM parameter.
- Once you see that metrics are reporting to the CloudWatch console, create alarms for CPU utilization, memory utilization, etc.
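The console steps above can also be scripted. Below is a minimal sketch, assuming the clients passed in are boto3 SSM and CloudWatch clients (for example `boto3.client("ssm")` and `boto3.client("cloudwatch")`). The function names, the alarm name, the 5-minute period, and the parameter name you pass in are illustrative assumptions, not values from the original setup.

```python
def install_agent(ssm, instance_id):
    """Run AWS-ConfigureAWSPackage to install the CloudWatch agent."""
    response = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-ConfigureAWSPackage",
        Parameters={"action": ["Install"], "name": ["AmazonCloudWatchAgent"]},
    )
    return response["Command"]["CommandId"]


def configure_agent(ssm, instance_id, config_parameter):
    """Run AmazonCloudWatch-ManageAgent with a config stored in an SSM parameter.

    Call this only after the install command has finished.
    """
    response = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AmazonCloudWatch-ManageAgent",
        Parameters={
            "action": ["configure"],
            "mode": ["ec2"],
            "optionalConfigurationSource": ["ssm"],
            "optionalConfigurationLocation": [config_parameter],
            "optionalRestart": ["yes"],
        },
    )
    return response["Command"]["CommandId"]


def create_cpu_alarm(cloudwatch, instance_id, threshold=80.0):
    """Create a CPU utilization alarm for one instance (alarm name is hypothetical)."""
    alarm_name = f"cpu-utilization-{instance_id}"
    cloudwatch.put_metric_alarm(
        AlarmName=alarm_name,
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=1,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
    )
    return alarm_name
```

A memory alarm would follow the same shape, but against the namespace and metric name that your CloudWatch agent configuration publishes.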
To track alarm state changes, we customized the EventBridge rule's event pattern a little so that it matches only transitions from OK to ALARM, not the reverse. Then, add this rule to the Lambda function as a trigger.
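A pattern of this kind might look like the sketch below. The pattern follows the CloudWatch alarm state change event format as I understand it. The `matches` helper is purely a local illustration of exact-value matching and is not part of any AWS SDK.

```python
import json

# Event pattern: only alarm transitions from OK to ALARM.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
        "state": {"value": ["ALARM"]},
        "previousState": {"value": ["OK"]},
    },
}


def matches(pattern, event):
    """Rough local approximation of EventBridge exact-value matching."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            # Nested pattern: the event must have a nested object that matches too.
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            # Leaf: the event value must be one of the listed values.
            return False
    return True


# JSON form, ready to paste into the rule's "Event pattern" editor.
print(json.dumps(ALARM_PATTERN, indent=2))
```

With this pattern, a transition from ALARM back to OK does not match, so the Lambda function is not invoked when an alarm recovers.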
This Lambda function handles multiple activities. It is triggered by the EventBridge rule and publishes to an SNS topic as its destination using the AWS SDK (Boto3). Once the EventBridge rule is triggered, it sends the event content as JSON to Lambda, and the function captures multiple details from it to process in different ways.
So far, we have worked on two types of alarms - i. CPU Utilization and ii. Memory Utilization. Once either of these alarms is triggered and the alarm state changes from OK to ALARM, the EventBridge rule fires, which in turn triggers the Lambda function to perform the tasks defined in the code.
// Lambda Prerequisites:
We need to import the below modules to make the code work -
>> os
>> sys
>> json
>> boto3
>> time
>> requests
Note: Of the above modules, all except 'requests' are available in the Lambda runtime by default. Importing the 'requests' module directly is not supported in Lambda. Hence, first install the requests module into a folder on your local machine (laptop) by executing the below command -
After that, the module is downloaded into the folder from which you executed the command, or wherever you want to store the module's source code. I'm assuming the Lambda code is being prepared on your local machine. If so, create a zip file of the entire Lambda source code together with the module, and then upload the zip file to the Lambda function.
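The packaging step can be sketched in Python as well. This assumes the module was installed into the source folder with something like `pip install requests -t .` (a common way to do it, stated here as an assumption). The function name and paths are hypothetical.

```python
import zipfile
from pathlib import Path


def build_lambda_zip(source_dir, zip_path):
    """Zip every file under source_dir so the paths sit at the archive root."""
    source = Path(source_dir)
    target = Path(zip_path).resolve()
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            # Skip directories, and skip the archive itself if it lives in source_dir.
            if path.is_file() and path.resolve() != target:
                archive.write(path, path.relative_to(source))
    return target
```

Lambda expects the handler file and the module folders at the root of the archive, which is why the paths are stored relative to the source folder.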
So, here we are handling the below two scenarios -
- CPU Utilization - If the CPU utilization alarm gets triggered, the Lambda function needs to fetch the instance, log in to it, and list the top 5 highest-consuming processes. Then, it creates a JIRA issue and adds the process details in the comment section. At the same time, it sends an email with the alarm details and the JIRA issue details along with the process output.
- Memory Utilization - Same approach as above.
Now, let me reframe the tasks which the Lambda function is supposed to perform -
- Login to Instance
- Perform Basic Troubleshooting Steps.
- Create a JIRA Issue
- Send Email to Recipient with all Details
First Set (Define the cpu and memory function) :
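A minimal sketch of this step is shown here, assuming the "login" is done through SSM Run Command with the AWS-RunShellScript document on a Linux instance. The `ssm` argument is expected to be a boto3 SSM client. The function names, the `ps` command lines, and the polling values are illustrative assumptions rather than the article's exact code.

```python
import time

# Header line plus the top 5 processes, sorted by CPU or by memory.
COMMANDS = {
    "cpu": "ps -eo pid,ppid,user,%cpu,%mem,cmd --sort=-%cpu | head -n 6",
    "memory": "ps -eo pid,ppid,user,%cpu,%mem,cmd --sort=-%mem | head -n 6",
}


def run_on_instance(ssm, instance_id, command, attempts=30, delay=2):
    """Send one shell command through SSM and wait for it to finish."""
    sent = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [command]},
    )
    command_id = sent["Command"]["CommandId"]
    for _ in range(attempts):
        # Sleep first: the invocation may not be registered right after sending.
        time.sleep(delay)
        result = ssm.get_command_invocation(
            CommandId=command_id, InstanceId=instance_id
        )
        if result["Status"] not in ("Pending", "InProgress", "Delayed"):
            return result["Status"], result["StandardOutputContent"]
    raise TimeoutError(f"command {command_id} did not finish in time")


def top_processes(ssm, instance_id, kind):
    """kind is 'cpu' or 'memory'; returns (status, command output)."""
    return run_on_instance(ssm, instance_id, COMMANDS[kind])
```

For this to work, the instance needs the SSM agent running and an instance profile that allows Systems Manager, and the Lambda role needs permission to send commands and read their results.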
Second Set (Create JIRA Issue) :
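This step could look roughly like the sketch below, assuming the Jira REST API v2 endpoints for creating an issue and adding a comment. The `post` argument is meant to be `requests.post`, passed in so the function is easy to try without network access. The function name and the "Task" issue type are assumptions.

```python
def create_jira_issue(post, base_url, auth, project_key, summary, description, comment):
    """Create an issue, add the command output as a comment, return the issue key.

    post: a callable with the same interface as requests.post
    auth: for Jira Cloud, typically an (email, api_token) tuple
    """
    headers = {"Accept": "application/json"}
    issue = post(
        f"{base_url}/rest/api/2/issue",
        json={
            "fields": {
                "project": {"key": project_key},
                "summary": summary,
                "description": description,
                "issuetype": {"name": "Task"},  # assumed issue type
            }
        },
        auth=auth,
        headers=headers,
        timeout=10,
    )
    issue.raise_for_status()
    issue_key = issue.json()["key"]

    # Add the troubleshooting output as a comment on the new issue.
    note = post(
        f"{base_url}/rest/api/2/issue/{issue_key}/comment",
        json={"body": comment},
        auth=auth,
        headers=headers,
        timeout=10,
    )
    note.raise_for_status()
    return issue_key
```

In the Lambda function this would be called as `create_jira_issue(requests.post, ...)`, with the base URL and credentials read from environment variables or a secrets store rather than hard-coded.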
Third Set (Send an Email) :
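The email step can be sketched as a single SNS publish call, assuming the recipients are email subscribers of the topic. The `sns` argument is expected to be a boto3 SNS client. The function name and the message layout are illustrative assumptions.

```python
def send_alarm_email(sns, topic_arn, alarm_name, instance_id, reason, issue_key, output):
    """Publish the alarm, JIRA issue, and command output to an SNS topic."""
    # SNS limits the subject to 100 characters, so trim it to be safe.
    subject = f"ALARM: {alarm_name} on {instance_id}"[:100]
    message = "\n".join(
        [
            f"Alarm name : {alarm_name}",
            f"Instance   : {instance_id}",
            f"Reason     : {reason}",
            f"JIRA issue : {issue_key}",
            "",
            "Command output:",
            output,
        ]
    )
    response = sns.publish(TopicArn=topic_arn, Subject=subject, Message=message)
    return response["MessageId"]
```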
Fourth Set (Calling Lambda Handler Function) :
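The handler ties the previous steps together. Here is a minimal sketch of how it might read the EventBridge event, assuming the CloudWatch alarm state change event layout (`detail.alarmName`, `detail.state`, `detail.previousState`, `detail.configuration.metrics`). The helper name and the returned dictionary are assumptions, and the calls to the troubleshooting, JIRA, and email steps are only marked with a comment.

```python
def parse_alarm_event(event):
    """Pull the fields the function needs out of an alarm state change event."""
    detail = event["detail"]
    metric = detail["configuration"]["metrics"][0]["metricStat"]["metric"]
    return {
        "alarm_name": detail["alarmName"],
        "state": detail["state"]["value"],
        "previous_state": detail["previousState"]["value"],
        "reason": detail["state"].get("reason", ""),
        "metric_name": metric["name"],
        "instance_id": metric["dimensions"].get("InstanceId"),
    }


def lambda_handler(event, context):
    alarm = parse_alarm_event(event)
    # Act only on OK -> ALARM, mirroring the EventBridge rule.
    if (alarm["previous_state"], alarm["state"]) != ("OK", "ALARM"):
        return {"handled": False, **alarm}
    # Decide which troubleshooting command set applies to this alarm.
    kind = "memory" if "mem" in alarm["metric_name"].lower() else "cpu"
    # The real function would now run the commands on the instance,
    # create the JIRA issue, and send the email.
    return {"handled": True, "kind": kind, **alarm}
```

Checking the state transition inside the handler as well is a cheap safeguard in case the rule's pattern is later loosened.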
Full Code:
Alarm Email Screenshot :
Note: In an ideal scenario the threshold is 80%, but for testing I changed it to 10%. Please see the Reason in the alarm details.
Alarm JIRA Issue :
In this scenario, if CPU or memory utilization metric data is not captured for a server, the alarm state changes from OK to INSUFFICIENT_DATA. This state can be reached in two ways - a.) the server is in the stopped state, or b.) the CloudWatch agent is not running or has died.
So, as per the below script, when the CPU or memory utilization alarm goes into the INSUFFICIENT_DATA state, Lambda first checks whether the instance is running. If the instance is in the running state, it logs in and checks the CloudWatch agent status. After that, it creates a JIRA issue and posts the agent status in the comment section of the issue. Finally, it sends an email with the alarm details and the agent status.
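The decision at the start of that flow can be sketched as follows, assuming `ec2` is a boto3 EC2 client and the agent is installed at its default Linux path. The function names and the returned dictionary are hypothetical. The agent status command would then be run on the instance in the same way as the troubleshooting commands.

```python
# Status check for the CloudWatch agent at its default Linux install path.
AGENT_STATUS_COMMAND = (
    "/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl "
    "-m ec2 -a status"
)


def instance_state(ec2, instance_id):
    """Return the instance state name, e.g. 'running' or 'stopped'."""
    response = ec2.describe_instances(InstanceIds=[instance_id])
    return response["Reservations"][0]["Instances"][0]["State"]["Name"]


def plan_insufficient_data_check(ec2, instance_id):
    """Decide what to do when an alarm goes to INSUFFICIENT_DATA."""
    state = instance_state(ec2, instance_id)
    if state != "running":
        # Nothing to log in to: report the instance state itself.
        return {"instance_state": state, "command": None}
    # Instance is up, so the agent is the likely culprit: check its status.
    return {"instance_state": state, "command": AGENT_STATUS_COMMAND}
```

When `command` is `None`, the JIRA comment and the email can simply carry the instance state instead of the agent status.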
Full Code :
Insufficient Data Email Screenshot :
Insufficient data JIRA Issue :
In this article, we have tested scenarios for both CPU and memory utilization, but there are lots of other metrics on which we can configure the auto-incident and auto-email functionality, which will significantly reduce the effort spent on monitoring and creating incidents. This solution gives an initial approach for how to proceed further, but there are surely other ways to achieve this goal. I believe you all will understand the way we tried to make this relatable. Please like and comment if you love this article or have any other suggestions, so that we can cover them in coming articles. 🙂🙂
Thanks!!
Anirban Das