Deploy Serverless Spark Jobs to AWS Using GitHub Actions
GitHub Actions have become a popular way of maintaining continuous integration and deployment as part of code repositories. In this post, we show how to deploy an end-to-end Spark ETL pipeline to Amazon EMR Serverless with GitHub Actions.
AWS Admin
Amazon Employee
Published Apr 18, 2023
Last Modified Jun 21, 2024
Apache Spark is one of the most popular frameworks for data processing both on-premises and in the cloud. Despite its popularity, modern DevOps practices for Apache Spark are not well-documented or readily available to data teams. GitHub Actions have become a popular way of maintaining continuous integration and deployment as part of code repositories - by combining development workflows with source code, developers get immediate feedback on their changes and can iterate faster. In this post, we show how to deploy an end-to-end Spark ETL pipeline to Amazon EMR Serverless with GitHub Actions to measure weather trends for a provided location.
- How to use Amazon EMR Serverless
- How to set up OpenID Connect
- How to configure unit tests and integration tests for PySpark
- How to automatically deploy your latest code
About | |
---|---|
✅ AWS Level | 200 - Intermediate |
⏱ Time to complete | 60 minutes |
💰 Cost to complete | ~$10 |
🧩 Prerequisites | - AWS Account - GitHub Account |
📢 Feedback | Any feedback, issues, or just a 👍 / 👎 ? |
⏰ Last Updated | 2023-04-18 |
In this blog post, we'll show you how to build a production-ready Spark job in Python that automatically runs unit and integration tests, builds and deploys new releases, and can even run manual or scheduled ETL jobs.
We'll go step-by-step to create a new repository and build up a PySpark job from scratch to a fully-deployed job in production with unit and integration tests, automatic deploys of versioned assets, and automated job runs. We'll use the NOAA Global Surface Summary of Day as our source data.
In order to follow along, you'll need:
- An AWS account (if you don't yet have one, you can create one and set up your environment here).
- A GitHub account - sign up for free at github.com
- The `git` command
- An editor (VS Code, vim, emacs, Notepad.exe)
We also need to create some infrastructure for our jobs to run. For the purposes of this tutorial, we're only going to create one set of resources. In a real-world environment, you might create test, staging, and production environments and change your workflows to run in these different environments or even completely different AWS accounts. These are the resources we need:
- EMR Serverless application - We'll use EMR 6.9.0 with Spark 3.3.0
- S3 bucket - This will contain our integration test artifacts, versioned production releases, and logs from each job run
- IAM roles
- One role used by our GitHub Action with a limited set of permissions to deploy and run Spark jobs, and view the logs
- One role used by the Spark job that can access data on S3
Note: This demo can only be run in the `us-east-1` region. If you want to use another region, you will need to create your EMR Serverless application with a VPC.
You can create these resources by downloading the CloudFormation template and either using the AWS CLI, or by navigating to the CloudFormation console and uploading the template there.
There are two parameters you can set when creating the stack:
- `GitHubRepo` is the `user/repo` format of your GitHub repository that you want your OIDC role to be able to access. We create the repository in the next step, and it will be something like `<your-github-username>/ci-cd-serverless-spark`.
- `CreateOIDCProvider` allows you to disable creating the OIDC endpoint for GitHub in your AWS account if it already exists.
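
If you script the stack creation instead of using the console, the two parameters above need to be passed in the key/value shape that CloudFormation's CreateStack API uses. The following is a minimal sketch of that shape only; the helper name `stack_parameters` is hypothetical, and the `"true"`/`"false"` string values for `CreateOIDCProvider` are an assumption about what the template accepts.

```python
from __future__ import annotations


def stack_parameters(github_repo: str, create_oidc_provider: bool = True) -> list[dict]:
    """Build a Parameters list in the shape CloudFormation's CreateStack API expects."""
    # GitHubRepo must be in user/repo form, with both parts non-empty.
    if github_repo.count("/") != 1 or not all(github_repo.split("/")):
        raise ValueError("GitHubRepo must look like 'user/repo'")
    return [
        {"ParameterKey": "GitHubRepo", "ParameterValue": github_repo},
        # Assumption: the template takes the strings "true" / "false" here.
        {
            "ParameterKey": "CreateOIDCProvider",
            "ParameterValue": str(create_oidc_provider).lower(),
        },
    ]
```

The resulting list could be passed as the `Parameters` argument of boto3's `create_stack` call on a CloudFormation client, or translated into the equivalent AWS CLI flags.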
Once the stack is created, navigate to the Outputs tab for the stack you created on the CloudFormation console as you'll need these values later.
Note: There's a lot of copy/paste in this tutorial. If you'd like to take a look at the finished state, please refer to the ci-cd-serverless-spark repo.
Now let's get started!
First, create a new repository on GitHub. For the rest of this tutorial, we'll assume you used `ci-cd-serverless-spark` for the repository name. The repository can be public or private.
Note: Make sure you use the same repository name that you did when you created the CloudFormation Stack above!
In this step, we'll create our initial source code structure as well as our first GitHub Action that will be configured to run on every `git push`. Assuming you're running in a standard terminal, we'll create a `test_basic.py` file in the `pyspark/tests` directory and a `requirements-dev.txt` file in the `pyspark` directory.
- Create a `test_basic.py` file in `pyspark/tests` that contains only the following simple assertion.
- Create `requirements-dev.txt` in `pyspark` that defines the Python requirements we need in our dev environment.
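
As a rough illustration of the first file, a placeholder test can be as small as a single always-true assertion. This is a sketch, not necessarily the exact file from the finished repository, and the function name `test_is_this_on` is an assumption. The second file, `requirements-dev.txt`, would need to list at least `pytest`, since that is what the workflow installs and runs.

```python
# pyspark/tests/test_basic.py (illustrative sketch)


def test_is_this_on():
    # A trivially true assertion: if pytest reports this as passing,
    # test discovery and the CI wiring are working end to end.
    assert 1 == 1
```

pytest collects any function whose name starts with `test_` in files named `test_*.py`, so no further configuration is needed for this to run.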
Next we need to create our GitHub Action to run unit tests when we push our code. If you're not familiar with GitHub Actions, it's a way to automate all your software workflows by creating workflow files in your GitHub repository that can be triggered by a wide variety of actions on GitHub. The first GitHub Action we're going to create automatically runs `pytest` every single time we push a new commit to our repository.
To do this, create a `unit-tests.yaml` file in the `.github/workflows` directory. The file should look like this:
This performs a few steps:
- Check out the code
- Install Python 3.7.10 (the version that EMR Serverless uses)
- Install our `pytest` dependency from `requirements-dev.txt`
- Run `pytest`
With these three files added, we can now `git add` and `git push` our code. Your directory structure should look like this:
Once you do this, return to the GitHub UI and you'll see a yellow dot next to your commit. This indicates an Action is running. Click on the yellow dot or the "Actions" tab and you'll be able to view the logs associated with your commit once the GitHub runner starts up.
Great! Now whenever you `git push`, the unit tests in `pyspark/tests` will be run to validate your code. Let's move on to creating some actual Spark code.
As mentioned, we'll be using the NOAA GSOD dataset. What we'll do next is add our main PySpark entrypoint script and a new class that can return the largest values from a Spark DataFrame.
Let's take a quick look at the data. The raw structure is fairly typical and straightforward. We have an S3 bucket with CSV files split into yearly partitions. Each CSV file is a specific weather station ID. If we open one of the files, it contains daily weather readings including min, max, and mean measures of temperature, wind, and pressure as well as information about the amount and type of precipitation. You can find more information about the dataset on noaa.gov.
The station ID for Boeing Field in Seattle, WA is `72793524234`. Let's take a look at the data from that station for 2022. It's located at `s3://noaa-gsod-pds/2022/72793524234.csv`.
Our job is simple: we're going to extract "extreme weather events" across all stations from a single year.
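
The bucket layout described above (one prefix per year, one CSV per station) can be captured in two small helpers. This is a minimal sketch; the function names are hypothetical, and only the bucket name and path pattern come from the example path in the text.

```python
GSOD_BUCKET = "noaa-gsod-pds"


def gsod_station_path(year: int, station_id: str) -> str:
    """S3 URI of the CSV holding one station's daily readings for one year."""
    return f"s3://{GSOD_BUCKET}/{year}/{station_id}.csv"


def gsod_year_prefix(year: int) -> str:
    """S3 prefix covering every station file for a year (the input for an all-stations job)."""
    return f"s3://{GSOD_BUCKET}/{year}/"


# Boeing Field, 2022
print(gsod_station_path(2022, "72793524234"))
```

A job that looks at all stations for a single year would read from the year prefix rather than from an individual station file.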
For the structure of our PySpark job, we'll create the following files in the `pyspark` directory:
- An `entrypoint.py` file that will be where we initialize our job and run the analysis:
- A `jobs/extreme_weather.py` file that has the actual analysis code broken down into unit-testable methods:
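
To show what "unit-testable" means here, the core idea of the analysis (find the row with the largest value in a given column) can be sketched in plain Python. This is a stand-in that runs without Spark, not the tutorial's actual `extreme_weather.py`; the real job works on a Spark DataFrame. The function name, the `missing` sentinel parameter, and the sample rows are all assumptions made for illustration, and the sample values are synthetic rather than real readings.

```python
from __future__ import annotations

from typing import Iterable, Mapping, Optional


def find_largest(
    rows: Iterable[Mapping[str, str]],
    column: str,
    missing: Optional[float] = None,
) -> Optional[Mapping[str, str]]:
    """Return the row with the largest numeric value in `column`, or None if no row qualifies."""
    best = None
    best_value = None
    for row in rows:
        raw = row.get(column)
        if raw is None or raw == "":
            continue  # column absent or empty in this row
        value = float(raw)
        # Assumption: the dataset may mark missing readings with a sentinel value.
        if missing is not None and value == missing:
            continue
        if best_value is None or value > best_value:
            best, best_value = row, value
    return best


# Synthetic rows for illustration only.
sample = [
    {"NAME": "STATION A", "MAX": "70.1"},
    {"NAME": "STATION B", "MAX": "88.0"},
    {"NAME": "STATION C", "MAX": "9999.9"},  # treated as "missing" below
]
print(find_largest(sample, "MAX", missing=9999.9))
```

Keeping the logic in a small function that takes data in and returns a result, with no I/O, is what lets a unit test call it directly against a handful of mock rows.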
We'll also create a new unit test for our analysis as well as some mock data:
In `pyspark/tests`, create a `conftest.py` file that creates a sample DataFrame for testing.
Then update the `test_basic.py` file with a new test. Feel free to leave the old test in the file.
And add the following dependency to the `requirements-dev.txt` file:
Your directory structure should now look like this:
Now that's done, simply go ahead and commit and push your changes.
Using the GitHub Action we created before, your new unit test will automatically run and validate that your analysis code is running correctly.
In the GitHub UI, in the "Actions" tab, you should now have two workflow runs for your unit tests.
For extra credit, feel free to make the test fail and see what happens when you commit and push the failing test.
This is awesome! But our mock data is a small slice of what's actually available, and we want to make sure we catch any errors with big changes to the codebase.
In order to do this, we'll create a new `integration_test.py` file that uses our existing code and runs a few validations over a known-good set of files. We'll then create a new GitHub Action to run when people create pull requests on our repository. This will help validate that any new changes we introduce still produce the expected behavior.
In the `pyspark` directory, create a new `integration_test.py` file.
Let's also create a `run-job.sh` script in the `pyspark/scripts` directory. This script runs an EMR Serverless job and waits for it to complete.
Now in the `.github/workflows` directory, we're going to create a new workflow for running our integration test! Create an `integration-test.yaml` file. In here, we'll replace environment variables using a few values from our CloudFormation stack.
To find the right values to replace, take a look at the "Outputs" tab in the stack you created in the CloudFormation Console or use this AWS CLI command.
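
The stack outputs can also be read programmatically. The sketch below flattens a CloudFormation DescribeStacks response into a plain dictionary; the helper name is hypothetical, and the output keys in the example are placeholders, since the actual key names depend on the template.

```python
def outputs_to_dict(describe_stacks_response: dict) -> dict:
    """Map each OutputKey to its OutputValue for the first stack in the response."""
    stack = describe_stacks_response["Stacks"][0]
    # "Outputs" is absent when a stack defines no outputs.
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


# Example response shape with placeholder keys and values (not real output names).
fake_response = {
    "Stacks": [
        {
            "StackName": "example-stack",
            "Outputs": [
                {"OutputKey": "ExampleKeyOne", "OutputValue": "value-1"},
                {"OutputKey": "ExampleKeyTwo", "OutputValue": "value-2"},
            ],
        }
    ]
}
print(outputs_to_dict(fake_response))

# With boto3 installed and credentials configured, the response would come from:
#   import boto3
#   response = boto3.client("cloudformation").describe_stacks(StackName="<your-stack-name>")
```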
Replace the `APPLICATION_ID`, `S3_BUCKET_NAME`, `JOB_ROLE_ARN`, and `OIDC_ROLE_ARN` values with the appropriate values from your stack.
So that we can see how integration tests integrate (ha!) with pull requests, we're going to commit these changes by creating a new branch, pushing the files into that branch, then opening a pull request.
The `integration-test` workflow we created will run whenever somebody opens a new pull request.
Once pushed, go to your GitHub repository and you will see a notification that the new branch `feature/integration-test` had a recent push and that you can create a new pull request.
To activate the `integration-test.yaml` workflow, click Compare & pull request. Once you press the button, you will get the Open a pull request form. Give it the name `Add integration test` and press the Create pull request button.
This activates the integration workflow. In the new screen, click on the Details link of the PySpark Integration Tests.
You will see the status of the deploy-and-validate pull request workflow. The workflow will run the `scripts/run-job.sh` shell script, which will reach out to your AWS resources, push a Spark job into your EMR Serverless application, and run the `integration_test.py` script. You can monitor the progress and see the job status change from PENDING to RUNNING and then to SUCCESS.
If you want to, you can use the EMR Serverless Console to view the status of the jobs.
If you haven't set up EMR Studio before, click the Get started button and then Create and launch EMR Studio.
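
The "start a job, then wait for it to complete" behavior that `run-job.sh` provides boils down to a polling loop over the job state. Here is a minimal Python sketch of that loop. It is not the script from the tutorial; the function name and parameters are hypothetical, and the state lookup is injected as a callable so the loop can be exercised without AWS access.

```python
import time
from typing import Callable

# States after which an EMR Serverless job run will not change any further.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def wait_for_job(
    get_state: Callable[[], str],
    poll_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    on_change: Callable[[str], None] = print,
) -> str:
    """Poll get_state() until a terminal state is reached; report each state change once."""
    last = None
    while True:
        state = get_state()
        if state != last:
            on_change(state)  # e.g. PENDING, then RUNNING, then SUCCESS
            last = state
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)


# With boto3, get_state could be (IDs are placeholders):
#   client = boto3.client("emr-serverless")
#   get_state = lambda: client.get_job_run(
#       applicationId="<APPLICATION_ID>", jobRunId="<JOB_RUN_ID>"
#   )["jobRun"]["state"]
```

A CI step would typically exit non-zero when the returned state is anything other than `SUCCESS`, so that the pull request check fails along with the job.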
Once the checks finish, go ahead and click the Merge pull request button on the pull request page and now any new pull requests to your repo will require this integration check to pass before merging!
In your local repository, on your desktop/laptop, return to the main branch and do a git pull.
Okay, so we've taken our brand new repository and added unit tests, integration tests, and now we want to begin shipping things to production. In order to do this, we'll create a new GitHub Action that runs whenever somebody adds a tag to our repository. If the tag matches a semantic version (e.g. `v1.0.2`), we'll automatically package up our project and ship it to S3!
Note: In a production environment, we could make use of different environments or accounts to isolate production and test resources, but for this demo we just use a single set of resources.
In theory, tags will only be applied when new code has been verified and is ready to ship. This approach allows us to easily run new versions of code when ready, or rollback to an older version if a regression is identified.
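
To make the "tag matches a semantic version" rule concrete, here is a small Python sketch that accepts tags of the form `vMAJOR.MINOR.PATCH` and rejects everything else. The names are hypothetical, and the check is deliberately simpler than the full SemVer grammar (no pre-release or build suffixes). A workflow's own tag filter uses GitHub's glob-style patterns rather than a regular expression, so treat this only as an illustration of the rule.

```python
from __future__ import annotations

import re
from typing import Optional

# v<major>.<minor>.<patch>, digits only, nothing before or after.
RELEASE_TAG = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def parse_release_tag(tag: str) -> Optional[tuple[int, int, int]]:
    """Return (major, minor, patch) for a tag like 'v1.0.2', or None if it is not a release tag."""
    match = RELEASE_TAG.fullmatch(tag)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def latest_release(tags: list[str]) -> Optional[str]:
    """Pick the highest version among the tags that look like releases."""
    releases = [t for t in tags if parse_release_tag(t) is not None]
    return max(releases, key=parse_release_tag, default=None)
```

Comparing versions as integer tuples rather than strings matters for rollbacks and "latest" lookups: as strings, `v0.10.0` would sort before `v0.9.0`.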
We'll create a new `deploy` workflow that only runs when a tag is applied.
Create and commit this file in `.github/workflows/deploy.yaml`, replacing `S3_BUCKET_NAME` and `OIDC_ROLE_ARN` with the previous values:
Now let's create a new release.
- Return to the GitHub UI and click on the Releases link on the right-hand side.
- Then click on the Create a new release button.
- Click on Choose a tag and in the Find or create a new tag box, type `v0.0.1`.
- Then click on the Create new tag: v0.0.1 on publish button below that.
If you want you can fill in the release title or description, or just click the "Publish release" button!
When you do this, a new tag is added to the repository and will trigger the Action we just created.
Return to the main page of your repository and click on the Actions button. You should see a new Package and Deploy Spark Job Action running. Click on the job, then the deploy link and you'll see GitHub deploying your new code to S3.
The last step is getting our code to run in production. For this, we'll create a new GitHub Action that can both automatically run the latest version of our deployed code, or manually run the same job with a set of custom parameters.
Create the file `.github/workflows/run-job.yaml` and make sure to replace the environment variables at the top.
...just checking. Did you replace the 4 variables after `BEGIN: BE SURE TO REPLACE THESE VALUES`? Because I sure didn't! But if you didn't, it's a good chance to remind you that this GitHub Action could be running in an entirely different account with an entirely different set of permissions. This is the awesome power of OIDC and CI/CD workflows.
Now that you've triple-checked the placeholder values are replaced, commit and push the file.
Once pushed, this Action will run your job every day at 02:30 UTC. But for now, let's go ahead and trigger it manually.
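
In standard five-field cron syntax, a daily 02:30 run is written `30 2 * * *` (minute, hour, then wildcards for day, month, and weekday), and GitHub evaluates workflow schedules in UTC. To see when that next fires relative to a given moment, the following sketch computes the next 02:30 UTC occurrence. The function name is hypothetical, and the input is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, time, timedelta, timezone


def next_daily_run(now: datetime, at: time = time(2, 30)) -> datetime:
    """Return the next occurrence of `at` (interpreted as UTC) strictly after `now`."""
    now_utc = now.astimezone(timezone.utc)  # `now` must be timezone-aware
    candidate = datetime.combine(now_utc.date(), at, tzinfo=timezone.utc)
    if candidate <= now_utc:
        # Today's slot has already passed, so the next run is tomorrow.
        candidate += timedelta(days=1)
    return candidate


print(next_daily_run(datetime.now(timezone.utc)))
```

Scheduled workflows can start later than the nominal time when GitHub's runners are busy, so the computed value is best read as the earliest expected start.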
Return to the GitHub UI, click on the Actions tab and click on ETL Job on the left-hand side. Click on the "Run workflow" button and you're presented with some parameters we configured in the Action above.
Feel free to change the git tag we want to use, but we can just leave it as `latest`. Click the green Run workflow button and this will kick off an EMR Serverless job!
The GitHub Action we created starts the job, waits for it to finish, and then we can take a look at the output.
This job just logs the output to `stdout`. When logs are enabled, EMR Serverless writes the driver `stdout` to a standard path on S3.
If the job is successful, the job output is logged as part of the GitHub Action.
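
The "standard path" can be assembled from the bucket, application ID, and job run ID. The sketch below builds the driver `stdout` location and decompresses a downloaded log. The function names are hypothetical, the `logs` prefix is an assumption about how logging was configured for this stack, and the `applications/.../jobs/.../SPARK_DRIVER/stdout.gz` layout follows the EMR Serverless documentation as I understand it, so verify it against your own bucket.

```python
import gzip


def driver_stdout_uri(
    bucket: str, application_id: str, job_run_id: str, log_prefix: str = "logs"
) -> str:
    """S3 URI where EMR Serverless is expected to write the gzipped Spark driver stdout."""
    # Assumption: S3 logging was configured with s3://<bucket>/<log_prefix>/ as its base.
    return (
        f"s3://{bucket}/{log_prefix}/applications/{application_id}"
        f"/jobs/{job_run_id}/SPARK_DRIVER/stdout.gz"
    )


def read_gzipped_log(data: bytes) -> str:
    """Decompress the bytes of a downloaded stdout.gz into text (what gunzip would print)."""
    return gzip.decompress(data).decode("utf-8", errors="replace")


# Placeholder IDs, for illustration only.
print(driver_stdout_uri("<S3_BUCKET>", "<APPLICATION_ID>", "<JOB_RUN_ID>"))
```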
You can also view the logs with the following `aws s3 cp` command, assuming you have `gunzip` installed.
Replace `S3_BUCKET` with the bucket from your CloudFormation stack and `APPLICATION_ID` and `JOB_RUN_ID` with the values from your Fetch Data GitHub Action.
Keep in mind that this GitHub Action will run daily, incurring AWS costs.
To prevent additional cost, delete your EMR Serverless application in the EMR Serverless Console. And if you don't want email notifications when your scheduled job fails, be sure to delete your `run-job.yaml` GitHub Action as well.
The EMR team has been hard at work improving the local Spark development experience for EMR as well. Here are a few more resources for you to check out:
If you enjoyed this tutorial, found any issues, or have feedback for us, please send it our way!
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.