Reduce ETL Time by Converting Sequential Code to Parallel AWS Lambda Execution
Using AWS Lambda functions to cut ETL time by running the work in parallel
Published Sep 15, 2024
A few years back, when I was still fairly new to the cloud world, I was handed an ETL problem: the existing code was written in Java, ran on a Linux server, and the full ETL run took more than 8 hours. As a cloud enthusiast, my challenge was to reduce that time.
The code used the Google AdWords API to extract the data and store it on servers, from which it was sent to a data warehouse. The whole process used the Pentaho tool to perform the ETL.
For a quick resolution, I had two options: AWS Lambda or AWS Glue. I chose AWS Lambda because the ETL time per Google AdWords account would never exceed 10 minutes, even in the worst case, which fits within Lambda's maximum execution timeout (currently 15 minutes).
I designed the architecture as follows:
- Pentaho invokes the AWS Lambda function named 'data-sync-automate' with the accountID as payload.
- That function invokes 10 other AWS Lambda functions in parallel, each associated with one Google AdWords metric; each one fetches its records and stores them in S3.
- Once the data is fetched, 'data-sync-automate' sends a message to SQS.
- Pentaho picks up the message and downloads the data from S3 for that particular accountID.
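The fan-out step in this flow can be sketched roughly as below. This is a minimal illustration, not the production code: the worker function names (`adwords-metric-1` to `adwords-metric-10`), the queue URL, and the message shape are all hypothetical, and it assumes the workers are invoked synchronously from a thread pool so that the orchestrator knows when all of them have finished.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical names: the article only names 'data-sync-automate'.
METRIC_FUNCTIONS = [f"adwords-metric-{i}" for i in range(1, 11)]
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/data-sync-done"


def invoke_metric(lambda_client, function_name, account_id):
    """Synchronously invoke one metric Lambda; return (name, succeeded)."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # block until the worker returns
        Payload=json.dumps({"accountID": account_id}).encode("utf-8"),
    )
    # A worker that raised still returns HTTP 200 but sets 'FunctionError'.
    ok = response["StatusCode"] == 200 and "FunctionError" not in response
    return function_name, ok


def sync_account(account_id, lambda_client, sqs_client, functions=METRIC_FUNCTIONS):
    """Run all metric Lambdas in parallel, then publish one SQS message."""
    with ThreadPoolExecutor(max_workers=len(functions)) as pool:
        results = dict(
            pool.map(lambda name: invoke_metric(lambda_client, name, account_id),
                     functions)
        )
    failed = sorted(name for name, ok in results.items() if not ok)
    message = {"accountID": account_id, "failed": failed}
    sqs_client.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(message))
    return message


def lambda_handler(event, context):
    """Entry point for 'data-sync-automate'; expects {"accountID": "..."}."""
    import boto3  # imported lazily so the helpers above can be tested offline

    return sync_account(
        event["accountID"], boto3.client("lambda"), boto3.client("sqs")
    )
```

Passing the clients in as arguments keeps the fan-out logic testable without AWS credentials. Because the workers run concurrently, the wall-clock time per account is roughly that of the slowest metric rather than the sum of all ten.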
The whole ETL time was reduced from more than 8 hours to less than 50 minutes.
Below is an example of how to fetch a Google AdWords Keyword Report:
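As a rough illustration of such a fetch, here is a minimal sketch rather than the production code. It assumes the legacy `googleads` Python client, whose report downloader exposes `DownloadReportWithAwql`, and a boto3 S3 client. The selected fields, the date range, the bucket, and the S3 key layout are assumptions made for the example.

```python
import io

# Assumed field selection for the keyword report; adjust to your needs.
KEYWORD_FIELDS = [
    "CampaignId", "AdGroupId", "Id", "Criteria", "Impressions", "Clicks", "Cost",
]


def build_keyword_query(fields=KEYWORD_FIELDS, date_range="LAST_7_DAYS"):
    """Build an AWQL query against the keywords performance report."""
    return (
        f"SELECT {', '.join(fields)} "
        f"FROM KEYWORDS_PERFORMANCE_REPORT DURING {date_range}"
    )


def fetch_keyword_report(report_downloader, s3_client, bucket, account_id):
    """Download the keyword report as CSV and store it in S3 per account."""
    buffer = io.StringIO()
    report_downloader.DownloadReportWithAwql(
        build_keyword_query(),
        "CSV",
        buffer,
        skip_report_header=True,   # drop the title row
        skip_report_summary=True,  # drop the trailing totals row
    )
    key = f"{account_id}/keywords.csv"  # hypothetical key layout
    s3_client.put_object(
        Bucket=bucket, Key=key, Body=buffer.getvalue().encode("utf-8")
    )
    return key


# In a real Lambda the two clients would be created roughly like this
# (assumes googleads credentials are available in a googleads.yaml file):
#
#   from googleads import adwords
#   import boto3
#   client = adwords.AdWordsClient.LoadFromStorage()
#   client.SetClientCustomerId(account_id)
#   downloader = client.GetReportDownloader(version="v201809")
#   fetch_keyword_report(downloader, boto3.client("s3"), "my-bucket", account_id)
```

Each of the ten metric functions could follow the same shape, changing only the report type and field list.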
The code is short and simple, yet it added great value to the whole ETL process and made the clients happy.