Deploy Serverless Spark Jobs to AWS Using GitHub Actions
GitHub Actions have become a popular way of maintaining continuous integration and deployment as part of code repositories. In this post, we show how to deploy an end-to-end Spark ETL pipeline to Amazon EMR Serverless with GitHub Actions.
- How to use Amazon EMR Serverless
- How to set up OpenID Connect
- How to configure unit tests and integration tests for PySpark
- How to automatically deploy your latest code
About | |
---|---|
✅ AWS Level | 200 - Intermediate |
⏱ Time to complete | 60 minutes |
💰 Cost to complete | ~$10 |
🧩 Prerequisites | - AWS Account - GitHub Account |
📢 Feedback | Any feedback, issues, or just a 👍 / 👎 ? |
⏰ Last Updated | 2023-04-18 |
- An AWS account (if you don't yet have one, you can create one and set up your environment here).
- A GitHub account - sign up for free at github.com
- The `git` command
- An editor (VS Code, vim, emacs, Notepad.exe)
- EMR Serverless application - We'll use EMR 6.9.0 with Spark 3.3.0
- S3 bucket - This will contain our integration test artifacts, versioned production releases, and logs from each job run
- IAM roles
  - One role used by our GitHub Action with a limited set of permissions to deploy and run Spark jobs, and view the logs
  - One role used by the Spark job that can access data on S3
Note: This demo can only be run in the `us-east-1` region. If you want to use another region, you will need to create your EMR Serverless application with a VPC.
- `GitHubRepo` is the `user/repo` format of your GitHub repository that you want your OIDC role to be able to access. We create the repository in the next step, and it will be something like `<your-github-username>/ci-cd-serverless-spark`.
- `CreateOIDCProvider` allows you to disable creating the OIDC endpoint for GitHub in your AWS account if it already exists.
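
To make the role of the `user/repo` value more concrete, here is a minimal sketch of the kind of IAM trust policy an OIDC role for GitHub Actions typically carries. It is not taken from the CloudFormation template used in this post. The function name, the placeholder account ID, and the choice to allow every ref in the repository (`repo:<user/repo>:*`) are assumptions for illustration.

```python
import json


def github_oidc_trust_policy(account_id: str, github_repo: str) -> dict:
    """Build a trust policy that lets workflows in one GitHub repo assume a role.

    `github_repo` uses the same `user/repo` format as the GitHubRepo parameter.
    """
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {f"{provider}:aud": "sts.amazonaws.com"},
                    # Assumption: any branch, tag, or pull request in the repo
                    # may assume the role. A stricter policy could pin one ref.
                    "StringLike": {f"{provider}:sub": f"repo:{github_repo}:*"},
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID and repository name, for illustration only.
    policy = github_oidc_trust_policy("123456789012", "octocat/ci-cd-serverless-spark")
    print(json.dumps(policy, indent=2))
```

The `sub` condition is what ties the role to a single repository, which is why the repository name has to match the value given to the stack.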
Note: There's a lot of copy/paste in this tutorial. If you'd like to take a look at the finished state, please refer to the ci-cd-serverless-spark repo.
`ci-cd-serverless-spark` for the repository name. The repository can be public or private.

Note: Make sure you use the same repository name that you did when you created the CloudFormation Stack above!

`git push`. Assuming you're running in a standard terminal, we'll create a `test_basic.py` file in the `pyspark/tests` directory and a `requirements-dev.txt` file in the `pyspark` directory.

- Create a `test_basic.py` file in `pyspark/tests` that contains only the following simple assertion.
- Create `requirements-dev.txt` in `pyspark` that defines the Python requirements we need in our dev environment.
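
As a rough illustration, a placeholder test of this kind can be as small as the following. The function name `test_is_this_on` is an assumption; any always-true assertion serves the same purpose of proving that `pytest` discovers and runs tests.

```python
# pyspark/tests/test_basic.py
# Placeholder test: it only proves that pytest can find and run a test.
def test_is_this_on():
    assert 1 == 1
```

With that file in place, running `pytest` from the `pyspark` directory should collect and pass one test, provided `requirements-dev.txt` lists `pytest` so that it is installed.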
`pytest` every single time we push a new commit to our repository.

`unit-tests.yaml` file in the `.github/workflows` directory. The file should look like this:

- Check out the code
- Install Python 3.7.10 (the version that EMR Serverless uses)
- Install our `pytest` dependency from `requirements-dev.txt`
- Run `pytest`
`git add` and `git push` our code. Your directory structure should look like this:

`git push`, the unit tests in `pyspark/tests` will be run to validate your code. Let's move on to creating some actual Spark code.

`72793524234`. Let's take a look at the data from that station for 2022 - it's located at `s3://noaa-gsod-pds/2022/72793524234.csv`.

`pyspark` directory:

- An `entrypoint.py` file that will be where we initialize our job and run the analysis:
- A `jobs/extreme_weather.py` file that has the actual analysis code broken down into unit-testable methods:
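
The point of splitting the analysis into small methods is that each one can be exercised against a handful of in-memory rows. The sketch below shows that shape in plain Python rather than PySpark, so it runs without a Spark session. It is not the code from `jobs/extreme_weather.py`. The function name, the `DATE` and `MAX` column names, the `9999.9` missing-value sentinel, and the sample readings are all assumptions made for illustration.

```python
import csv
import io
from typing import Iterable, Optional

MISSING_TEMP = 9999.9  # Assumed sentinel for a missing temperature reading.


def find_hottest_day(rows: Iterable[dict]) -> Optional[dict]:
    """Return the row with the highest MAX temperature, skipping missing readings."""
    valid = [r for r in rows if float(r["MAX"]) != MISSING_TEMP]
    return max(valid, key=lambda r: float(r["MAX"]), default=None)


# Made-up sample values standing in for the real station CSV.
SAMPLE_CSV = """STATION,DATE,MAX
72793524234,2022-01-01,45.0
72793524234,2022-07-31,95.0
72793524234,2022-12-25,9999.9
"""

if __name__ == "__main__":
    rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
    print(find_hottest_day(rows))
```

Because the function takes rows and returns a row, a test can call it directly with a small sample instead of reading from S3. A PySpark version would follow the same pattern with a DataFrame in and a DataFrame or row out.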
`pyspark/tests`, create a `conftest.py` file.

`conftest.py` - Creates a sample dataframe for testing

`test_basic.py` file with a new test. Feel free to leave the old test in the file.

`requirements-dev.txt` file:

`integration_test.py` file that uses our existing code and runs a few validations over a known-good set of files. We'll then create a new GitHub Action to run when people create pull requests on our repository. This will help validate that any new changes we introduce still produce the expected behavior.

`pyspark` directory, create a new `integration_test.py` file.

`run-job.sh` script in the `pyspark/scripts` directory - this script runs an EMR Serverless job and waits for it to complete.

`.github/workflows` directory, we're going to create a new workflow for running our integration test! Create an `integration-test.yaml` file. In here, we'll replace environment variables using a few values from our CloudFormation stack.

`APPLICATION_ID`, `S3_BUCKET_NAME`, `JOB_ROLE_ARN`, and `OIDC_ROLE_ARN` values with the appropriate values from your stack.

`integration-test` workflow we created will run whenever somebody opens a new pull request.

`feature/integration-test` had a recent push and can create a new pull request.
`integration-test.yaml` workflow, click Compare & pull request to activate the integration workflow. Once you press the button, you will get the Open a pull request form. Give it a name `Add integration test` and press the Create pull request button.

`scripts/run-job.sh` shell script, which will reach out to your AWS resources and push a Spark job into your EMR Serverless application and run the `integration_test.py` script. You can monitor the progress and see the job status change from PENDING to RUNNING and then to SUCCESS.
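
The wait-for-completion behavior can be sketched as a simple polling loop. This is a Python illustration of the idea, not the contents of `run-job.sh`. The `wait_for_job` function, its `get_state` callback, and the polling limits are hypothetical names and values chosen for this example.

```python
import time
from typing import Callable

# States after which an EMR Serverless job run will not change again.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def wait_for_job(get_state: Callable[[], str], poll_seconds: float = 10.0,
                 max_polls: int = 360) -> str:
    """Poll `get_state` until the job reaches a terminal state, then return it."""
    for _ in range(max_polls):
        state = get_state()
        print(f"Job state: {state}")
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError("job did not finish within the polling budget")


if __name__ == "__main__":
    # Simulated state sequence; a real caller would query EMR Serverless instead.
    states = iter(["PENDING", "RUNNING", "RUNNING", "SUCCESS"])
    final = wait_for_job(lambda: next(states), poll_seconds=0)
    if final != "SUCCESS":
        raise SystemExit(1)  # A non-zero exit code fails the CI step.
```

In a real workflow, `get_state` could wrap a call such as boto3's `get_job_run` on the `emr-serverless` client and read the `state` field of the returned `jobRun`. Exiting non-zero on anything other than SUCCESS is what lets the pull request check fail when the integration test does.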

`v1.0.2`), we'll automatically package up our project and ship it to S3!

Note: In a production environment, we could make use of different environments or accounts to isolate production and test resources, but for this demo we just use a single set of resources.

`deploy` workflow that only occurs when a tag is applied.

`.github/workflows/deploy.yaml`, replacing `S3_BUCKET_NAME` and `OIDC_ROLE_ARN` with the previous values.

- Return to the GitHub UI and click on the Releases link on the right-hand side.
- Then click on the Create a new release button.
- Click on Choose a tag and in the Find or create a new tag box, type `v0.0.1`.
- Then click on the Create new tag: v0.0.1 on publish button below that.
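
To illustrate what packaging a tagged release might involve, the sketch below checks that a tag looks like a semantic version and zips a source directory into an archive named after that tag. It is not the deploy workflow from this post. The `package_release` function, the `pyspark-<tag>.zip` naming, and the tag pattern are assumptions, and uploading the archive to S3 would be a separate step.

```python
import re
import tempfile
import zipfile
from pathlib import Path

# Assumed tag convention: "v" followed by MAJOR.MINOR.PATCH, e.g. v1.0.2.
TAG_PATTERN = re.compile(r"v\d+\.\d+\.\d+")


def package_release(source_dir: Path, tag: str, out_dir: Path) -> Path:
    """Zip every file under `source_dir` into `pyspark-<tag>.zip` and return its path."""
    if not TAG_PATTERN.fullmatch(tag):
        raise ValueError(f"not a semantic version tag: {tag!r}")
    archive = out_dir / f"pyspark-{tag}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the source directory.
                zf.write(path, path.relative_to(source_dir).as_posix())
    return archive


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "pyspark"
        (src / "jobs").mkdir(parents=True)
        (src / "jobs" / "extreme_weather.py").write_text("# analysis code\n")
        (src / "entrypoint.py").write_text("# job entrypoint\n")
        print(package_release(src, "v0.0.1", Path(tmp)).name)
```

Keying the archive name on the tag means each release lands at its own location, so a job can be pointed at a specific version.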


`.github/workflows/run-job.yaml` and make sure to replace the environment variables at the top.

`BEGIN: BE SURE TO REPLACE THESE VALUES`? Because I sure didn't! But if you didn't, it's a good chance to remind you that this GitHub Action could be running in an entirely different account with an entirely different set of permissions. This is the awesome power of OIDC and CI/CD workflows.

`latest`. Click the green Run workflow button and this will kick off an EMR Serverless job!

`stdout`. When logs are enabled, EMR Serverless writes the driver `stdout` to a standard path on S3.
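
As a sketch of how that path is put together, the helper below builds the S3 URI of the driver `stdout` for one job run and decompresses the gzipped payload once it has been downloaded. The function names and the `logs` prefix are assumptions for illustration; the prefix depends on how logging was configured for the job.

```python
import gzip


def driver_stdout_uri(bucket: str, application_id: str, job_run_id: str,
                      prefix: str = "logs") -> str:
    """Build the S3 URI of the gzipped Spark driver stdout for one job run.

    Assumption: logs were configured under `s3://<bucket>/<prefix>/`.
    """
    return (
        f"s3://{bucket}/{prefix}/applications/{application_id}"
        f"/jobs/{job_run_id}/SPARK_DRIVER/stdout.gz"
    )


def read_gzipped_log(data: bytes) -> str:
    """Decompress a downloaded stdout.gz payload, like piping it through gunzip."""
    return gzip.decompress(data).decode("utf-8")


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    print(driver_stdout_uri("my-bucket", "00example", "00jobrun"))
    print(read_gzipped_log(gzip.compress(b"hello from the driver\n")))
```

The log object is gzip-compressed, which is why it has to be decompressed before it is readable.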
`aws s3 cp` command, assuming you have `gunzip` installed.

`S3_BUCKET` with the bucket from your CloudFormation stack and `APPLICATION_ID` and `JOB_RUN_ID` with the values from your Fetch Data GitHub Action.

`run-job.yaml` GitHub Action as well.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.