Create an ETL Pipeline with Amazon EMR and Apache Spark
Learn how to build an ETL pipeline for batch processing with Amazon EMR and Apache Spark.
| About | |
| --- | --- |
| ✅ AWS Level | 200 - Intermediate |
| ⏱ Time to complete | 30 mins - 45 mins |
| 💰 Cost to complete | USD 0.30 |
| 🧩 Prerequisites | - An AWS Account (if you don't yet have one, create one and set up your environment) <br> - An IAM user that has access to create AWS resources <br> - Basic understanding of Python |
| 📢 Feedback | Any feedback, issues, or just a 👍 / 👎 ? |
| ⏰ Last Updated | 2023-03-30 |
In this tutorial, you will:

- Create and set up an Amazon EMR cluster
- Submit a PySpark job on EMR
- Integrate Amazon EMR with Amazon S3
In this scenario, a sales dataset arrives as a `CSV` file and needs to be processed and made available to your data analysts for querying and analysis. Throughout the tutorial we will work with two kinds of data in Amazon S3: `RAW` data (which is the input and unprocessed data) and `CLEANSED` data (which is the output and processed data).
Before creating the cluster, we need an EC2 `Key Pair`, which we will use to access the EMR cluster's primary node later on. So let's do that really quickly.

- Log in to your AWS account, navigate to the EC2 console, and click on the Key Pairs option in the left menu bar. Then click `Create Key Pair`.
- Provide a name (`mykey-emr`) for your key pair and click `Create Key Pair`.
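
If you prefer to script this step, the key pair can also be created with the AWS SDK for Python (boto3). This is a minimal sketch, not part of the original walkthrough: the key name matches the console step above, while the region and the local `.pem` path are assumptions.

```python
# Sketch: create the EC2 key pair with boto3 instead of the console.
# Assumes boto3 is installed and your AWS credentials are configured.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")  # region is an assumption

# Create the key pair and save the private key locally (file path is an assumption)
response = ec2.create_key_pair(KeyName="mykey-emr")
with open("mykey-emr.pem", "w") as f:
    f.write(response["KeyMaterial"])
```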
- Now we can go ahead and create an Amazon EMR cluster. For that, navigate to Amazon EMR in the console and click Create Cluster.
- Provide the Cluster name `MyDemoEMRCluster` for your EMR cluster, and select the following:
  - Select the latest release of EMR under the Software configuration section
  - Select Spark under the Application bundle section
  - Select the right EC2 key pair (which you created in the previous step) under the Security and access section
- Cluster creation will take some time; after a couple of minutes, you will see that the cluster is up and running with a state of `Waiting` (which means the cluster is now ready and waiting to execute any ETL job).
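
As an alternative to clicking through the console, here is a hedged boto3 sketch of launching a comparable cluster. The release label, instance types, and instance counts are assumptions, and the default EMR roles must already exist in your account (for example via `aws emr create-default-roles`).

```python
# Sketch: launch an EMR cluster similar to the console setup above.
# Release label, instance types, and counts are assumptions -- adjust as needed.
import boto3

emr = boto3.client("emr", region_name="us-east-2")

response = emr.run_job_flow(
    Name="MyDemoEMRCluster",
    ReleaseLabel="emr-6.10.0",             # pick the latest release available to you
    Applications=[{"Name": "Spark"}],      # matches the Spark application bundle
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "Ec2KeyName": "mykey-emr",            # key pair from the previous step
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster in the Waiting state
    },
    JobFlowRole="EMR_EC2_DefaultRole",     # default EMR roles must already exist
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])
```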
Next, create an Amazon S3 bucket and folders for the `RAW` and `CLEANSED` data.

- Navigate to the Amazon S3 console and click on Create Bucket.
- Create a bucket (e.g. `etl-batch-emr-demo`).
- Once the bucket is created, create two sub-folders named:
  - `cleaned_data`
  - `raw_data`
- Upload the sales dataset CSV file into the bucket under the folder `raw_data`.
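
The same bucket and folder layout can also be created with boto3, roughly as sketched below. The local CSV file name (`sales_data.csv`) is a placeholder, and remember that bucket names must be globally unique.

```python
# Sketch: create the bucket, the two folder prefixes, and upload the raw CSV.
import boto3

s3 = boto3.client("s3", region_name="us-east-2")
bucket = "etl-batch-emr-demo"  # bucket names are global -- use your own unique name

# Outside us-east-1, a LocationConstraint is required
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": "us-east-2"},
)

# "Folders" in S3 are just key prefixes; empty objects make them visible in the console
s3.put_object(Bucket=bucket, Key="raw_data/")
s3.put_object(Bucket=bucket, Key="cleaned_data/")

# Upload the sales dataset (local file name is an assumption)
s3.upload_file("sales_data.csv", bucket, "raw_data/sales_data.csv")
```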
Next, retrieve the cluster's primary node public DNS so you can connect to it.

- Sign in to the AWS Management Console, and open the Amazon EMR console.
- Under EMR on EC2 in the left navigation pane, choose Clusters, and then select the `MyDemoEMRCluster` cluster whose public DNS name you want to retrieve.
- Note the Primary node public DNS value in the Summary section of the cluster details page.
- SSH to the EMR cluster's primary node from your terminal (EMR expects the `hadoop` user, not `root`):

```bash
ssh -i "mykey-emr.pem" hadoop@ec2-18-219-203-79.us-east-2.compute.amazonaws.com
```
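
If you would rather fetch the primary node's public DNS programmatically instead of copying it from the Summary section, a boto3 sketch like the following works; looking the cluster up by name is an assumption about how you identify it.

```python
# Sketch: look up the cluster and print its primary (master) node public DNS.
import boto3

emr = boto3.client("emr", region_name="us-east-2")

# Find the active cluster by name (assumes the name is unique)
clusters = emr.list_clusters(ClusterStates=["WAITING", "RUNNING"])["Clusters"]
cluster_id = next(c["Id"] for c in clusters if c["Name"] == "MyDemoEMRCluster")

details = emr.describe_cluster(ClusterId=cluster_id)
print(details["Cluster"]["MasterPublicDnsName"])
```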
- Copy the PySpark code `etl-job.py`, save it on the primary node under the home directory, make the following changes, and save the file (a hypothetical sketch of such a job is shown after these steps):
  - `S3_INPUT_DATA = 's3://<YOUR_BUCKET_LOCATION_OF_RAW_DATA>'`
  - `S3_OUTPUT_DATA = 's3://<YOUR_BUCKET_LOCATION_OF_CLEANED_DATA>'`
- Submit the PySpark job and wait for the job to complete before proceeding.

```bash
sudo spark-submit etl-job.py
```

- Once the job completes, check the S3 bucket under the folder `cleaned_data`; you will see the newly transformed and processed data in Parquet format.
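
The tutorial does not reproduce `etl-job.py` here, so below is a minimal, hypothetical sketch of what such a job could look like: read the raw CSV from S3, apply some light cleaning, and write the result back as Parquet. The column names follow the ones queried later (`region`, `segment`, `forecasted_monthly_revenue`); the exact cleaning steps are assumptions, not the author's actual script.

```python
# etl-job.py -- illustrative sketch only; the real job may differ.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

S3_INPUT_DATA = "s3://<YOUR_BUCKET_LOCATION_OF_RAW_DATA>"
S3_OUTPUT_DATA = "s3://<YOUR_BUCKET_LOCATION_OF_CLEANED_DATA>"

def main():
    spark = SparkSession.builder.appName("SalesETLJob").getOrCreate()

    # Extract: read the raw CSV from S3
    raw_df = spark.read.csv(S3_INPUT_DATA, header=True, inferSchema=True)

    # Transform: drop incomplete rows and make sure the revenue column is numeric
    # (assumed cleaning steps -- adapt them to your dataset)
    cleaned_df = raw_df.dropna().withColumn(
        "forecasted_monthly_revenue",
        col("forecasted_monthly_revenue").cast("double"),
    )

    # Load: write the cleansed data back to S3 as Parquet
    cleaned_df.write.mode("overwrite").parquet(S3_OUTPUT_DATA)
    spark.stop()

if __name__ == "__main__":
    main()
```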
The cleansed data is now available in Amazon S3 in Parquet format, but to make it more consumable for data analysts or data scientists, it would be great if we could enable querying the data through SQL by making it available as a database table. To do that:

- We need to run a Glue crawler to create an AWS Glue Data Catalog table on top of the S3 data.
- Once that is done, we can run a query in Amazon Athena to validate the output.
- Navigate to the AWS Glue crawler console and click on Create Crawler.
- Give the Glue crawler a name (`my-crawler-1`).
- Add the data source: the S3 location where you have your cleansed and processed data (`s3://etl-batch-emr-demo/cleaned_data`).
- Create an IAM role (`AWSGlueServiceRole-default`) and attach it to the crawler. You can create the role and attach the following policies (refer to the AWS Glue documentation for the detailed steps):
  - The AWSGlueServiceRole AWS managed policy, which grants the required permissions on the Data Catalog
  - An inline policy that grants permissions on the crawler's data source (in this tutorial, the `cleaned_data` location)
- Create a database by clicking on Add database, and then select it (`my_demo_db`) from the dropdown menu.
- Review and verify all the details and click on Create crawler.
- Once the crawler is created, select the crawler and click on Run.
- Once the crawler finishes its run, you will see the detected tables.
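
For reference, the same database and crawler can be created and started with boto3, roughly as below; the IAM role is assumed to already exist with the policies described above.

```python
# Sketch: create the Glue database and crawler, then run the crawler.
import boto3

glue = boto3.client("glue", region_name="us-east-2")

# Database that will hold the crawled table
glue.create_database(DatabaseInput={"Name": "my_demo_db"})

# Crawler pointed at the cleansed data (role must already have the required policies)
glue.create_crawler(
    Name="my-crawler-1",
    Role="AWSGlueServiceRole-default",
    DatabaseName="my_demo_db",
    Targets={"S3Targets": [{"Path": "s3://etl-batch-emr-demo/cleaned_data"}]},
)

glue.start_crawler(Name="my-crawler-1")

# Optional: check the crawler state (READY means the run has finished)
state = glue.get_crawler(Name="my-crawler-1")["Crawler"]["State"]
print("Crawler state:", state)
```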
So far, we have cleaned the raw data with a PySpark job on EMR and cataloged the output with a Glue crawler. Finally, we will use that cleaned data for analysis using Amazon Athena.
- Open the Athena query editor. You can keep the Data Source as the default `AwsDataCatalog`, select `my_demo_db` for Database (as shown in the screenshot), and run the following query:

```sql
SELECT * FROM "my_demo_db"."cleaned_data" limit 10;
```
- Now you can perform other SQL queries to analyze the data. For example, if you would like to know the `forecast_monthly_revenue` for each region and segment, you can run this:

```sql
SELECT
    region,
    segment,
    SUM(forecasted_monthly_revenue) AS forecast_monthly_revenue
FROM "my_demo_db"."cleaned_data"
GROUP BY segment, region;
```
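
The same queries can be run programmatically with boto3 if you want to automate the validation step. The sketch below assumes an S3 prefix for Athena query results (a hypothetical `athena-results/` folder in the demo bucket).

```python
# Sketch: run the aggregation query through the Athena API and print the results.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-2")

QUERY = """
SELECT region, segment,
       SUM(forecasted_monthly_revenue) AS forecast_monthly_revenue
FROM "my_demo_db"."cleaned_data"
GROUP BY segment, region
"""

execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "my_demo_db"},
    # Athena needs somewhere to write results -- this prefix is an assumption
    ResultConfiguration={"OutputLocation": "s3://etl-batch-emr-demo/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([field.get("VarCharValue") for field in row["Data"]])
```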
To avoid incurring future charges, clean up the resources you created:

- Delete the EMR Cluster
- Delete the Amazon S3 bucket:

```bash
aws s3 rb s3://<YOUR_BUCKET_LOCATION> --force
```
- Delete the Glue Database
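
If you want to tear everything down from a script instead of the console, a hedged boto3 sketch follows; replace the cluster ID placeholder and the names with your own values.

```python
# Sketch: clean up the resources created in this tutorial.
import boto3

region = "us-east-2"

# Terminate the EMR cluster (cluster ID is a placeholder)
boto3.client("emr", region_name=region).terminate_job_flows(
    JobFlowIds=["j-XXXXXXXXXXXXX"]
)

# Empty and delete the S3 bucket
bucket = boto3.resource("s3", region_name=region).Bucket("etl-batch-emr-demo")
bucket.objects.all().delete()
bucket.delete()

# Delete the Glue crawler and database (removes the cataloged table as well)
glue = boto3.client("glue", region_name=region)
glue.delete_crawler(Name="my-crawler-1")
glue.delete_database(Name="my_demo_db")
```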
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.