AWS Cost Optimization Using Lambda Functions and Terraform

Lambda-based scheduling and monitoring systems that help reduce costs

Published Feb 26, 2025
Last Modified Feb 27, 2025

Introduction

In modern cloud infrastructure, cost optimization and proactive incident prevention are crucial for maintaining efficient operations. This document outlines our implementation of AWS Lambda-based scheduling and monitoring systems that help reduce costs and prevent potential issues before they impact production.
💡 Note: For the complete code implementation and examples, please check my GitHub repository linked in the References section below.

I know that AWS offers the AWS Instance Scheduler, but despite its advantages, it has several drawbacks.

Limitations of AWS Instance Scheduler

Although AWS Instance Scheduler is a useful tool for cost optimization, it has some limitations that may make it unsuitable for certain use cases:
  1. Limited Flexibility – The scheduler works based on predefined schedules and does not allow dynamic scaling or real-time adjustments based on workload demands.
  2. No Load-Based Scaling – It stops or starts instances according to a set schedule, without considering actual usage metrics such as CPU or memory utilization.
  3. Supports Only EC2 & RDS – The service is restricted to managing EC2 and RDS instances, excluding other AWS resources such as ECS, EKS, or Lambda.
  4. No Auto-Discovery – Users must manually configure schedules and tag instances, making it less automated for large-scale environments.
  5. IAM Permission Complexity – The required IAM roles and policies can be difficult to set up correctly, leading to potential security misconfigurations.
  6. Latency in Execution – Since the scheduler relies on AWS Lambda and DynamoDB, there may be slight delays in instance state changes.

AWS Instance Scheduler vs. Custom Lambda + Terraform Solution

[Figure: comparison of AWS Instance Scheduler and the custom Lambda + Terraform solution]

Resource Scheduling System

We utilize several specialized Lambda functions to manage different AWS resources:

1. ASG (Auto Scaling Group) Scheduler

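The ASG scheduler adjusts the capacity of Auto Scaling groups on time-based schedules, for example reducing the number of EC2 instances in dev/stage environments outside working hours.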
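
A minimal sketch of such a scheduler in Python follows. It is not the author's implementation: the schedule values (08:00-20:00 on weekdays, 2-4 instances during working hours, 0 outside them), the `AsgSchedule` type, and the helper names are hypothetical. The Auto Scaling client is passed in so the logic can be tested; in a Lambda function it would be `boto3.client("autoscaling")`, whose `update_auto_scaling_group` call accepts the `MinSize`, `MaxSize`, and `DesiredCapacity` parameters used here. The timestamp is assumed to already be in the team's local time zone.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class AsgSchedule:
    """Hypothetical per-ASG schedule: sizes inside and outside working hours."""
    name: str
    work_start: time = time(8, 0)
    work_end: time = time(20, 0)
    work_min: int = 2
    work_max: int = 4
    off_capacity: int = 0


def is_working_hours(now: datetime, sched: AsgSchedule) -> bool:
    # weekday(): Monday=0 ... Sunday=6; weekends always count as off-hours.
    if now.weekday() >= 5:
        return False
    return sched.work_start <= now.time() < sched.work_end


def target_sizes(now: datetime, sched: AsgSchedule) -> tuple[int, int, int]:
    """Return (min, desired, max) for the ASG at local time `now`."""
    if is_working_hours(now, sched):
        return sched.work_min, sched.work_min, sched.work_max
    off = sched.off_capacity
    return off, off, off


def apply_schedule(autoscaling, now: datetime, sched: AsgSchedule) -> tuple[int, int, int]:
    """Resize the ASG through an injected client such as boto3.client("autoscaling")."""
    min_size, desired, max_size = target_sizes(now, sched)
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=sched.name,
        MinSize=min_size,
        MaxSize=max_size,
        DesiredCapacity=desired,
    )
    return min_size, desired, max_size
```

Pinning `MaxSize` to the off-hours capacity keeps scaling policies from launching instances overnight; a production version would likely record the pre-shutdown sizes instead of hard-coding them.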

2. RDS (Relational Database Service) Maintenance Scheduler

The RDS scheduler handles database maintenance tasks and stops dev/stage RDS instances during non-working hours.
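
For the stop/start part, a sketch along these lines could select instances by tag and act only on those in a valid state. The `Environment` tag key and its `dev`/`stage` values are assumptions, and the client is injected (`boto3.client("rds")` in Lambda). `describe_db_instances`, `stop_db_instance`, and `start_db_instance` are real RDS API calls; pagination is omitted for brevity.

```python
SCHEDULED_ENVS = {"dev", "stage"}  # hypothetical tag values eligible for scheduling


def _tags(db: dict) -> dict:
    """Flatten the RDS TagList ([{"Key": ..., "Value": ...}]) into a dict."""
    return {t["Key"]: t["Value"] for t in db.get("TagList", [])}


def schedule_rds(rds, action: str) -> list:
    """Stop or start tagged dev/stage RDS instances; return the identifiers acted on."""
    if action not in ("stop", "start"):
        raise ValueError(f"unknown action: {action!r}")
    # Stopping is only valid from "available", starting only from "stopped".
    required_status = "available" if action == "stop" else "stopped"
    touched = []
    # Single page only; use rds.get_paginator("describe_db_instances") for large fleets.
    for db in rds.describe_db_instances()["DBInstances"]:
        if "DBClusterIdentifier" in db:
            continue  # Aurora members are stopped/started at cluster level instead
        if _tags(db).get("Environment") not in SCHEDULED_ENVS:
            continue
        if db["DBInstanceStatus"] != required_status:
            continue
        db_id = db["DBInstanceIdentifier"]
        if action == "stop":
            rds.stop_db_instance(DBInstanceIdentifier=db_id)
        else:
            rds.start_db_instance(DBInstanceIdentifier=db_id)
        touched.append(db_id)
    return touched
```

RDS automatically starts a stopped instance again after seven days, so the stop action needs to run on a recurring schedule rather than once. Stopped instances also continue to incur storage charges.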

3. EKS (Elastic Kubernetes Service) Scheduler

The EKS scheduler manages Kubernetes clusters, shutting down dev/stage/test EKS clusters during off-hours.
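
An EKS control plane cannot be stopped and is billed for as long as the cluster exists, so one common way to "shut down" a non-production cluster is to scale its managed node groups to zero. That approach is an assumption here, not taken from the author's code. The sketch uses the real `list_nodegroups` and `update_nodegroup_config` EKS calls through an injected client (`boto3.client("eks")` in Lambda); the `work_sizes` mapping and `DEFAULT_WORK_SIZES` are hypothetical, and in practice the daytime sizes might be kept in tags or SSM Parameter Store.

```python
# Hypothetical daytime sizes (min, desired, max) used when a node group has no entry.
DEFAULT_WORK_SIZES = (1, 2, 3)


def scale_nodegroups(eks, cluster: str, working: bool, work_sizes: dict) -> dict:
    """Scale every managed node group in `cluster` up for work hours or down to zero."""
    applied = {}
    # list_nodegroups is paginated (nextToken); a single page is assumed here.
    for ng in eks.list_nodegroups(clusterName=cluster)["nodegroups"]:
        min_size, desired, max_size = work_sizes.get(ng, DEFAULT_WORK_SIZES)
        if working:
            cfg = {"minSize": min_size, "desiredSize": desired, "maxSize": max_size}
        else:
            # maxSize must stay at least 1 for managed node groups; min/desired can be 0.
            cfg = {"minSize": 0, "desiredSize": 0, "maxSize": max(max_size, 1)}
        eks.update_nodegroup_config(
            clusterName=cluster, nodegroupName=ng, scalingConfig=cfg
        )
        applied[ng] = cfg
    return applied
```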

Incident Prevention System

CloudWatch Metrics Monitoring

Our system implements proactive monitoring of critical metrics:
  1. Database Metrics
    • Storage space utilization
    • CPU usage
    • Connection count
    • IOPS utilization
  2. Application Metrics
    • Response times
    • Error rates
    • Queue lengths
    • Memory usage
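
To illustrate the database storage check, the sketch below derives storage utilization for one RDS instance from the `FreeStorageSpace` CloudWatch metric (reported in bytes) and the instance's `AllocatedStorage` (in GiB). Both clients are injected (`boto3.client("cloudwatch")` and `boto3.client("rds")` in Lambda); the 15-minute window, 5-minute period, and use of the `Minimum` statistic are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

GIB = 1024 ** 3


def storage_utilization(cloudwatch, rds, db_id: str, now: Optional[datetime] = None) -> float:
    """Return used storage as a fraction (0.0-1.0) of the instance's allocated storage."""
    now = now or datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="FreeStorageSpace",  # bytes
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": db_id}],
        StartTime=now - timedelta(minutes=15),
        EndTime=now,
        Period=300,
        Statistics=["Minimum"],  # worst case within each period
    )
    points = resp["Datapoints"]
    if not points:
        raise RuntimeError(f"no FreeStorageSpace datapoints for {db_id}")
    latest = max(points, key=lambda p: p["Timestamp"])
    db = rds.describe_db_instances(DBInstanceIdentifier=db_id)["DBInstances"][0]
    allocated_bytes = db["AllocatedStorage"] * GIB  # AllocatedStorage is in GiB
    return 1.0 - latest["Minimum"] / allocated_bytes
```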

Automated Prevention Actions

Metrics that cross their thresholds trigger automated responses such as maintenance, resource scaling, and team notifications.
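
One possible automated response to a storage alert is to enlarge the instance's allocated storage through the RDS `modify_db_instance` call. The sketch below assumes a hypothetical 85% threshold and 20% growth step, takes a utilization value as input, and uses an injected RDS client.

```python
import math
from typing import Optional


def grow_storage_if_needed(
    rds, db_id: str, utilization: float, threshold: float = 0.85, growth_pct: int = 20
) -> Optional[int]:
    """Increase AllocatedStorage by `growth_pct` percent once `utilization` >= `threshold`.

    Returns the new size in GiB, or None when no change was requested.
    """
    if utilization < threshold:
        return None
    db = rds.describe_db_instances(DBInstanceIdentifier=db_id)["DBInstances"][0]
    current_gib = db["AllocatedStorage"]
    new_gib = current_gib + math.ceil(current_gib * growth_pct / 100)
    rds.modify_db_instance(
        DBInstanceIdentifier=db_id,
        AllocatedStorage=new_gib,
        ApplyImmediately=True,
    )
    return new_gib
```

RDS also offers built-in storage autoscaling (enabled by setting `MaxAllocatedStorage`), which may be simpler than a custom action when growing storage is the only response needed.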

Slack Notifications

Our system sends notifications to Slack channels.
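
A minimal notification sender needs only the standard library, assuming a Slack incoming webhook, which accepts a JSON body with a `text` field. The severity labels, emoji mapping, and message format are illustrative assumptions. In Lambda, the webhook URL would typically come from an environment variable or a secrets store rather than being hard-coded.

```python
import json
import urllib.request

# Hypothetical severity-to-emoji mapping for the message prefix.
LEVEL_EMOJI = {"info": ":information_source:", "warning": ":warning:", "critical": ":rotating_light:"}


def build_slack_payload(level: str, resource: str, message: str) -> dict:
    """Format an alert as a Slack incoming-webhook payload."""
    emoji = LEVEL_EMOJI.get(level, ":grey_question:")
    return {"text": f"{emoji} [{level.upper()}] {resource}: {message}"}


def send_slack(webhook_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```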

IAM Security Configuration

Each Lambda function has its own IAM role with least-privilege access.
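
The following sketch builds an IAM policy document for each scheduler type in Python to show what least privilege can look like. The JSON it produces could, for example, be passed to the `policy` argument of a Terraform `aws_iam_policy` resource. The action lists are assumptions for a scheduler that resizes ASGs, stops and starts RDS instances, or scales EKS node groups. Read-only describe/list calls are granted on `*` because some of them (such as `autoscaling:DescribeAutoScalingGroups`) do not support resource-level permissions. Mutating calls are limited to the ARNs the caller supplies. The ARN in the usage block is a placeholder.

```python
import json

# Hypothetical action sets: (read-only actions, mutating actions) per scheduler.
SCHEDULER_ACTIONS = {
    "asg": (["autoscaling:DescribeAutoScalingGroups"], ["autoscaling:UpdateAutoScalingGroup"]),
    "rds": (["rds:DescribeDBInstances"], ["rds:StartDBInstance", "rds:StopDBInstance"]),
    "eks": (["eks:ListNodegroups"], ["eks:UpdateNodegroupConfig"]),
}
LOG_ACTIONS = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]


def scheduler_policy(kind: str, resource_arns: list) -> dict:
    """Return an IAM policy document for one scheduler Lambda."""
    if kind not in SCHEDULER_ACTIONS:
        raise ValueError(f"unknown scheduler kind: {kind!r}")
    if not resource_arns:
        raise ValueError("resource_arns must not be empty")
    read_actions, write_actions = SCHEDULER_ACTIONS[kind]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "Read", "Effect": "Allow", "Action": read_actions, "Resource": "*"},
            {"Sid": "Write", "Effect": "Allow", "Action": write_actions, "Resource": resource_arns},
            {"Sid": "Logs", "Effect": "Allow", "Action": LOG_ACTIONS, "Resource": "*"},
        ],
    }


if __name__ == "__main__":
    arn = "arn:aws:autoscaling:eu-west-1:123456789012:autoScalingGroup:*:autoScalingGroupName/dev-*"
    print(json.dumps(scheduler_policy("asg", [arn]), indent=2))
```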

Cost Optimization Features

1. Automated Resource Management 🤖

  • Scheduled start/stop of development resources
  • Capacity adjustment based on usage patterns
  • Weekend and holiday scheduling
  • ⏰ Shutdown of dev/stage/test EKS clusters during off-hours (~8-12 hours/day)
  • 🛑 Stopping RDS instances for dev/stage environments
  • 📉 Reducing the EC2 instance count in ASGs for dev/stage environments

2. Preventive Maintenance 🔧

  • Automated database VACUUM operations
  • Storage space monitoring
  • Performance optimization

3. Resource Right-sizing 📊

  • Regular utilization analysis
  • Automatic scaling adjustments
  • Cost-effective resource allocation
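
A simple form of utilization analysis compares a high percentile of CPU utilization against fixed bounds. The sketch below uses the nearest-rank 95th percentile. The 20%/80% bounds and the hint labels are hypothetical, and the samples could come from CloudWatch `CPUUtilization` datapoints.

```python
def p95(values: list) -> float:
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("no datapoints")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integer math
    return ordered[rank - 1]


def rightsizing_hint(cpu_samples: list, low: float = 20.0, high: float = 80.0) -> str:
    """Classify an instance by its p95 CPU utilization (percent)."""
    peak = p95(cpu_samples)
    if peak < low:
        return "downsize-candidate"
    if peak > high:
        return "upsize-candidate"
    return "keep"
```

Using a high percentile rather than the mean avoids downsizing an instance that is idle most of the time but runs short, heavy peaks.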

Benefits Achieved

1. Cost Reduction 💰

  • 40-60% reduction in development environment costs through:
    • Shutting down EKS clusters during off-hours
    • Stopping RDS instances during non-working hours
    • Reducing the number of EC2 instances in ASGs
  • Elimination of idle resource costs during weekends
  • Monthly savings of approximately $5,000-7,000 on dev/stage environments
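
As a back-of-the-envelope check, the share of weekly hours during which scheduled resources are off follows directly from the schedule. With nightly shutdowns of h hours on weekdays and full weekend shutdowns, resources are off for 5h + 48 of the week's 168 hours. This only bounds savings on hourly compute charges. Storage for stopped RDS instances and the EKS control plane are still billed, so the actual bill reduction is lower.

```python
HOURS_PER_WEEK = 7 * 24


def off_hours_fraction(nightly_off_hours: float, weekends_off: bool = True) -> float:
    """Fraction of a week during which scheduled resources are stopped."""
    if not 0 <= nightly_off_hours <= 24:
        raise ValueError("nightly_off_hours must be between 0 and 24")
    weekday_off = 5 * nightly_off_hours
    weekend_off = 48 if weekends_off else 2 * nightly_off_hours
    return (weekday_off + weekend_off) / HOURS_PER_WEEK


for h in (8, 12):
    print(f"{h} h/night: {off_hours_fraction(h):.0%} of weekly hours off")
```

For the 8-12 hour nightly shutdown mentioned above, this gives roughly 52% to 64% of weekly compute hours, which is consistent with the 40-60% cost reduction once non-hourly charges are included.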

2. Improved Reliability

  • Zero downtime due to storage issues
  • Proactive issue detection
  • Automated maintenance procedures

3. Operational Efficiency 🎯

  • Reduced manual intervention
  • Consistent resource management
  • Automated incident response

Implementation Details

Our RDS maintenance implementation automates routine database upkeep such as VACUUM operations and storage space monitoring.
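
The following sketch covers the VACUUM part of that maintenance. It assumes a PostgreSQL engine and a DB-API cursor whose connection is in autocommit mode, because VACUUM cannot run inside a transaction block (with psycopg2, `conn.autocommit = True`). It reads dead-tuple counts from the `pg_stat_user_tables` statistics view and vacuums tables whose dead-row share exceeds a threshold. The thresholds are hypothetical, and the `%s` placeholder assumes a driver using psycopg2's parameter style.

```python
DEAD_TUPLE_QUERY = """
    SELECT schemaname, relname, n_dead_tup, n_live_tup
    FROM pg_stat_user_tables
    WHERE n_dead_tup > %s
    ORDER BY n_dead_tup DESC
"""


def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def vacuum_bloated_tables(cursor, min_dead_tuples: int = 10_000, dead_ratio: float = 0.2) -> list:
    """VACUUM (ANALYZE) tables whose dead-tuple share is at least `dead_ratio`.

    The cursor's connection must be in autocommit mode, because VACUUM cannot
    run inside a transaction block.
    """
    cursor.execute(DEAD_TUPLE_QUERY, (min_dead_tuples,))
    candidates = cursor.fetchall()  # materialize before issuing further statements
    vacuumed = []
    for schema, table, dead, live in candidates:
        if dead / max(dead + live, 1) < dead_ratio:
            continue
        target = f"{quote_ident(schema)}.{quote_ident(table)}"
        cursor.execute(f"VACUUM (ANALYZE) {target}")
        vacuumed.append(target)
    return vacuumed
```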

Monitoring and Alerting

  1. Critical Metrics
    • Database storage utilization
    • Application error rates
    • Resource utilization patterns
    • Performance metrics
  2. Alert Thresholds
    • Warning: 70% utilization
    • Critical: 85% utilization
    • Emergency: 95% utilization
  3. Response Actions
    • Automated maintenance
    • Resource scaling
    • Team notifications
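
The graduated thresholds above can be expressed as a small lookup. Which response actions fire at each level is an assumption here, as are the `classify` and `plan_response` helper names.

```python
# Checked from highest to lowest; thresholds are in percent utilization.
THRESHOLDS = [(95.0, "emergency"), (85.0, "critical"), (70.0, "warning")]

# Hypothetical escalation: each level adds one more response action.
RESPONSES = {
    "ok": [],
    "warning": ["notify_team"],
    "critical": ["notify_team", "run_maintenance"],
    "emergency": ["notify_team", "run_maintenance", "scale_resource"],
}


def classify(utilization_pct: float) -> str:
    """Map a utilization percentage to an alert level."""
    for limit, level in THRESHOLDS:
        if utilization_pct >= limit:
            return level
    return "ok"


def plan_response(utilization_pct: float) -> tuple:
    """Return (level, actions) for a utilization reading."""
    level = classify(utilization_pct)
    return level, RESPONSES[level]
```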

Best Practices

  1. Resource Tagging
  2. Monitoring Configuration
    • Set appropriate thresholds based on historical data
    • Implement graduated response actions
    • Maintain comprehensive monitoring documentation
  3. Security Measures
    • Use least-privilege IAM roles
    • Implement proper error handling
    • Maintain audit logs
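
For resource tagging, a scheduler can refuse to act on resources whose tags are incomplete. The required keys (`Environment`, `Owner`, `Schedule`) and the allowed environment values below are hypothetical examples, not a convention taken from the original setup.

```python
# Hypothetical tagging convention enforced before a scheduler touches a resource.
REQUIRED_TAGS = {"Environment", "Owner", "Schedule"}
ALLOWED_ENVIRONMENTS = {"dev", "stage", "test", "prod"}


def tag_violations(tags: dict) -> list:
    """Return human-readable problems with a resource's tags (empty list if compliant)."""
    problems = [f"missing tag: {key}" for key in sorted(REQUIRED_TAGS - set(tags))]
    env = tags.get("Environment")
    if env is not None and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unexpected Environment value: {env}")
    return problems


def safe_to_schedule(tags: dict) -> bool:
    """Only non-production resources with complete, valid tags are eligible."""
    return not tag_violations(tags) and tags["Environment"] != "prod"
```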

Conclusion

Our AWS Lambda-based scheduling and monitoring system has proven highly effective in:
- Reducing operational costs through automated resource management
- Preventing incidents through proactive monitoring
- Improving system reliability through automated maintenance
- Reducing team workload through automation
The combination of scheduled resource management and proactive monitoring ensures optimal resource utilization while maintaining system stability and performance.

References