Deploy Serverless Spark Jobs to AWS Using GitHub Actions

GitHub Actions have become a popular way of maintaining continuous integration and deployment as part of code repositories. In this post, we show how to deploy an end-to-end Spark ETL pipeline to Amazon EMR Serverless with GitHub Actions.

AWS Admin
Amazon Employee
Published Apr 18, 2023
Last Modified Jun 21, 2024
Apache Spark is one of the most popular frameworks for data processing both on-premises and in the cloud. Despite its popularity, modern DevOps practices for Apache Spark are not well-documented or readily available to data teams. GitHub Actions have become a popular way of maintaining continuous integration and deployment as part of code repositories. By combining development workflows with source code, developers get immediate feedback on their changes and can iterate faster. In this post, we show how to deploy an end-to-end Spark ETL pipeline to Amazon EMR Serverless with GitHub Actions to measure weather trends for a provided location.

About
✅ AWS Level: 200 - Intermediate
⏱ Time to complete: 60 minutes
💰 Cost to complete: ~$10
🧩 Prerequisites: AWS Account, GitHub Account
📢 Feedback: Any feedback, issues, or just a 👍 / 👎 ?
⏰ Last Updated: 2023-04-18

Prerequisites

In this blog post, we'll show you how to build a production-ready Spark job in Python that automatically runs unit and integration tests, builds and deploys new releases, and can even run manual or scheduled ETL jobs.
We'll go step-by-step to create a new repository and build up a PySpark job from scratch to a fully-deployed job in production with unit and integration tests, automatic deploys of versioned assets, and automated job runs. We'll use the NOAA Global Surface Summary of Day as our source data.
In order to follow along, you'll need:
  • An AWS account (if you don't yet have one, you can create one and set up your environment here).
  • A GitHub account - sign up for free at github.com
  • The git command
  • An editor (VS Code, vim, emacs, Notepad.exe)
We also need to create some infrastructure for our jobs to run. For the purposes of this tutorial, we're only going to create one set of resources. In a real-world environment, you might create test, staging, and production environments and change your workflows to run in these different environments or even completely different AWS accounts. These are the resources we need:
  • EMR Serverless application - We'll use EMR 6.9.0 with Spark 3.3.0
  • S3 bucket - This will contain our integration test artifacts, versioned production releases, and logs from each job run
  • IAM roles
    • One role used by our GitHub Action with a limited set of permissions to deploy and run Spark jobs, and view the logs
    • One role used by the Spark job that can access data on S3

Creating Demo Resources

Note: This demo can only be run in the us-east-1 region. If you want to use another region, you will need to create your EMR Serverless application with a VPC.
You can create these resources by downloading the CloudFormation template and either using the AWS CLI, or by navigating to the CloudFormation console and uploading the template there.
There are two parameters you can set when creating the stack:
  • GitHubRepo is the GitHub repository, in user/repo format, that you want your OIDC role to be able to access. We create the repository in the next step, so the value will be something like <your-github-username>/ci-cd-serverless-spark
  • CreateOIDCProvider allows you to disable creating the OIDC endpoint for GitHub in your AWS account if it already exists.
Once the stack is created, navigate to the Outputs tab for the stack you created on the CloudFormation console as you'll need these values later.
Note: There's a lot of copy/paste in this tutorial. If you'd like to take a look at the finished state, please refer to the ci-cd-serverless-spark repo.
Now let's get started!

Create a Unit Test That Runs on git push

First, create a new repository on GitHub. For the rest of this tutorial, we'll assume you used ci-cd-serverless-spark for the repository name. The repository can be public or private.
Note: Make sure you use the same repository name that you used when you created the CloudFormation stack above!
Screenshot of creating a new repository on GitHub
In this step, we'll create our initial source code structure as well as our first GitHub Action that will be configured to run on every git push. Assuming you're running in a standard terminal, we'll create a test_basic.py file in the pyspark/tests directory and a requirements-dev.txt file in the pyspark directory.
  • Create a test_basic.py file in pyspark/tests that contains only the following simple assertion.
  • Create requirements-dev.txt in pyspark that defines the Python requirements we need in our dev environment.
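As a minimal sketch of what such a first test file could contain: the function name test_is_this_on is our own choice for illustration, and pytest collects any function whose name starts with test_.

```python
# pyspark/tests/test_basic.py
# A placeholder test: it only proves that the test runner is wired up.
def test_is_this_on():
    assert 1 == 1
```

For requirements-dev.txt, a single line naming pytest is enough at this stage.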
Next we need to create our GitHub Action to run unit tests when we push our code. If you're not familiar with GitHub Actions, it is a way to automate your software workflows by creating workflow files in your GitHub repository that can be triggered by a wide variety of events on GitHub. The first GitHub Action we're going to create automatically runs pytest every single time we push a new commit to our repository.
To do this, create a unit-tests.yaml file in the .github/workflows directory. The file should look like this:
This performs a few steps:
  • Check out the code
  • Install Python 3.7.10 (the version that EMR Serverless uses)
  • Install our pytest dependency from requirements-dev.txt
  • Run pytest
With these three files added, we can now git add and git push our code. Your directory structure should look like this:
Directory listing after step 1
Once you do this, return to the GitHub UI and you'll see a yellow dot next to your commit. This indicates an Action is running. Click on the yellow dot or the "Actions" tab and you'll be able to view the logs associated with your commit once the GitHub runner starts up.
Screenshot of GitHub Action unit test running
Great! Now whenever you git push, the unit tests in pyspark/tests will be run to validate your code. Let's move on to creating some actual Spark code.

Add PySpark Analysis and Unit Test

As mentioned, we'll be using the NOAA GSOD dataset. What we'll do next is add our main PySpark entrypoint script and a new class that can return the largest values from a Spark DataFrame.
Let's take a quick look at the data. The raw structure is fairly typical and straightforward. We have an S3 bucket with CSV files split into yearly partitions. Each CSV file is a specific weather station ID. If we open one of the files, it contains daily weather readings including min, max, and mean measures of temperature, wind, and pressure as well as information about the amount and type of precipitation. You can find more information about the dataset on noaa.gov.
The station ID for Boeing Field in Seattle, WA is 72793524234. Let's take a look at the data from that station for 2022 - it's located at s3://noaa-gsod-pds/2022/72793524234.csv.
Our job is simple: we're going to extract "extreme weather events" across all stations from a single year.
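The bucket layout described above (one prefix per year, one CSV per station) can be sketched as two small helpers. The function names here are hypothetical and only illustrate how the paths are put together.

```python
GSOD_BUCKET = "noaa-gsod-pds"


def gsod_station_uri(year: int, station_id: str) -> str:
    """Return the S3 URI of one station's CSV file for one year."""
    return f"s3://{GSOD_BUCKET}/{year}/{station_id}.csv"


def gsod_year_uri(year: int) -> str:
    """Return the S3 prefix that holds every station's file for one year."""
    return f"s3://{GSOD_BUCKET}/{year}/"


print(gsod_station_uri(2022, "72793524234"))
# prints: s3://noaa-gsod-pds/2022/72793524234.csv
```

Since our job looks at all stations for a single year, it would read from the year prefix rather than from a single station file.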
For the structure of our PySpark job, we'll create the following files in the pyspark directory:
  • An entrypoint.py file that will be where we initialize our job and run the analysis:
  • A jobs/extreme_weather.py file that has the actual analysis code broken down into unit-testable methods:
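To give a feel for what "unit-testable" means here, the following is a rough sketch of an analysis class, not the exact file from the finished repo. The class name ExtremeWeather and method name find_largest are assumptions for illustration. It relies only on DataFrame.orderBy and DataFrame.first, so the module itself needs no pyspark import.

```python
# pyspark/jobs/extreme_weather.py (illustrative sketch, names are assumptions)
from typing import Any, Optional


class ExtremeWeather:
    """Find the most extreme reading for a given measurement column."""

    def find_largest(self, df: Any, col_name: str) -> Optional[Any]:
        """Return the row with the largest value in `col_name`.

        `df` is expected to be a pyspark.sql.DataFrame. Returns None when
        the DataFrame is empty, because DataFrame.first() returns None then.
        """
        return df.orderBy(col_name, ascending=False).first()
```

Keeping each step in its own small method like this means a test can pass in a tiny DataFrame and check the result without touching S3. A real job would likely also need to exclude rows where the reading is missing before sorting.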
We'll also create a new unit test for our analysis as well as some mock data:
In pyspark/tests, create a conftest.py file
  • conftest.py - Creates a sample dataframe for testing
Then update the test_basic.py file with a new test. Feel free to leave the old test in the file.
And add the following dependency to the requirements-dev.txt file:
Your directory structure should now look like this:
Directory listing after step 2
Now that's done, simply go ahead and commit and push your changes.
Using the GitHub Action we created before, your new unit test will automatically run and validate that your analysis code is running correctly.
In the GitHub UI, in the "Actions" tab, you should now have two workflow runs for your unit tests.
Screenshot of GitHub unit test workflows and status
For extra credit, feel free to make the test fail and see what happens when you commit and push the failing test.

Create an Integration Test to Run on New Pull Requests

This is awesome! But our mock data is a small slice of what's actually available, and we want to make sure we catch any errors with big changes to the codebase.
In order to do this, we'll create a new integration_test.py file that uses our existing code and runs a few validations over a known-good set of files. We'll then create a new GitHub Action to run when people create pull requests on our repository. This will help validate that any new changes we introduce still produce the expected behavior.
In the pyspark directory, create a new integration_test.py file.
Let's also create a run-job.sh script in the pyspark/scripts directory - this script runs an EMR Serverless job and waits for it to complete.
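The tutorial does this step with a shell script, but the same start-and-wait logic can be sketched in Python to show what is going on. This is an assumption-laden illustration, not the run-job.sh script itself: the function names are hypothetical, and `client` is expected to be a boto3 "emr-serverless" client that is passed in by the caller.

```python
import time
from typing import Any

# States after which an EMR Serverless job run will not change again.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def wait_for_job_run(client: Any, application_id: str, job_run_id: str,
                     poll_seconds: float = 10.0) -> str:
    """Poll a job run until it reaches a terminal state, then return that state."""
    while True:
        response = client.get_job_run(applicationId=application_id, jobRunId=job_run_id)
        state = response["jobRun"]["state"]
        print(f"Job run {job_run_id} is {state}")
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)


def run_job(client: Any, application_id: str, job_role_arn: str,
            entry_point: str, poll_seconds: float = 10.0) -> str:
    """Start a Spark job from a script on S3 and wait for it to finish."""
    response = client.start_job_run(
        applicationId=application_id,
        executionRoleArn=job_role_arn,
        jobDriver={"sparkSubmit": {"entryPoint": entry_point}},
    )
    return wait_for_job_run(client, application_id, response["jobRunId"], poll_seconds)
```

Passing the client in, rather than creating it inside the function, keeps the polling logic testable without AWS credentials. In real use you would call run_job(boto3.client("emr-serverless"), ...) and fail the workflow if the returned state is not SUCCESS.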
Now in the .github/workflows directory, we're going to create a new workflow for running our integration test! Create an integration-test.yaml file. In here, we'll replace environment variables using a few values from our CloudFormation stack.
To find the right values to replace, take a look at the "Outputs" tab in the stack you created in the CloudFormation Console or use this AWS CLI command.
Replace the APPLICATION_ID, S3_BUCKET_NAME, JOB_ROLE_ARN, and OIDC_ROLE_ARN values with the appropriate values from your stack.
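If you would rather read the stack outputs programmatically than copy them from the console, a small helper along these lines could work. It is a sketch under assumptions: the function name is hypothetical, `cfn_client` is expected to be a boto3 "cloudformation" client, and the output key names depend on the template you deployed.

```python
from typing import Any, Dict


def stack_outputs(cfn_client: Any, stack_name: str) -> Dict[str, str]:
    """Return a CloudFormation stack's outputs as a {key: value} dict."""
    stack = cfn_client.describe_stacks(StackName=stack_name)["Stacks"][0]
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


# Example use (requires boto3 and AWS credentials; the stack name is whatever
# you chose when creating the stack):
#   import boto3
#   for key, value in stack_outputs(boto3.client("cloudformation"), "<your-stack-name>").items():
#       print(key, value)
```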
So that we can see how integration tests integrate (ha!) with pull requests, we're going to commit these changes by creating a new branch, pushing the files into that branch, and then opening a pull request.
The integration-test workflow we created will run whenever somebody opens a new pull request.
Once pushed, go to your GitHub repository and you will see a notification that the new branch feature/integration-test had a recent push and that you can create a new pull request from it.
Screenshot of pull request notification on GitHub
To activate the integration-test.yaml workflow, click Compare & pull request. Once you press the button, you will get the Open a pull request form. Name it Add integration test and press the Create pull request button.
Screenshot of creating a new pull request on GitHub
This activates the integration workflow. In the new screen, click on the Details link of the PySpark Integration Tests.
You will see the status of the deploy-and-validate pull request workflow. The workflow will run the scripts/run-job.sh shell script, which will reach out to your AWS resources and push a Spark job into your EMR Serverless application and run the integration_test.py script. You can monitor the progress and see the job status change from PENDING to RUNNING and then to SUCCESS.
Screenshot of deploy-and-validate workflow running in GitHub Actions
If you want to, you can use the EMR Serverless Console to view the status of the jobs.
If you haven't set up EMR Studio before, click the Get started button and then Create and launch EMR Studio.
EMR Studio creation dialog
Once the checks finish, go ahead and click the Merge pull request button on the pull request page. From now on, any new pull request to your repo will run this integration check, so you can confirm it passes before merging!
In your local repository, on your desktop/laptop, return to the main branch and do a git pull.

Ship It! 🚢

Okay, so we've taken our brand new repository and added unit tests and integration tests, and now we want to begin shipping things to production. In order to do this, we'll create a new GitHub Action that is triggered whenever somebody adds a tag to our repository. If the tag matches a semantic version (e.g. v1.0.2), we'll automatically package up our project and ship it to S3!
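To make the tag rule concrete, here is one way to express "looks like a semantic version" in Python. This is only an illustration of the idea: the workflow itself filters tags in its own configuration, and the function name is hypothetical.

```python
import re

# "v" followed by three dot-separated non-negative integers, e.g. v1.0.2
SEMVER_TAG = re.compile(r"v\d+\.\d+\.\d+")


def is_release_tag(tag: str) -> bool:
    """Return True if the whole tag looks like vMAJOR.MINOR.PATCH."""
    return SEMVER_TAG.fullmatch(tag) is not None


print(is_release_tag("v1.0.2"))  # prints: True
print(is_release_tag("latest"))  # prints: False
```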
Note: In a production environment, we could make use of different environments or accounts to isolate production and test resources, but for this demo we just use a single set of resources.
In theory, tags will only be applied when new code has been verified and is ready to ship. This approach allows us to easily run new versions of code when ready, or roll back to an older version if a regression is identified.
We'll create a new deploy workflow that only runs when a tag is applied.
Create and commit this file in .github/workflows/deploy.yaml, replacing S3_BUCKET_NAME and OIDC_ROLE_ARN with the previous values:
Now let's create a new release.
  • Return to the GitHub UI and click on the Releases link on the right-hand side.
  • Then click on the Create a new release button.
  • Click on Choose a tag and in the Find or create a new tag box, type v0.0.1.
  • Then click on the Create new tag: v0.0.1 on publish button below that.
Screenshot of creating a new release
If you want, you can fill in the release title or description, or just click the "Publish release" button!
When you do this, a new tag is added to the repository and will trigger the Action we just created.
Return to the main page of your repository and click on the Actions button. You should see a new Package and Deploy Spark Job Action running. Click on the job, then the deploy link and you'll see GitHub deploying your new code to S3.
Screenshot of the logs of the GitHub action uploading the pyspark job

Configure a Job Runner

The last step is getting our code to run in production. For this, we'll create a new GitHub Action that can either automatically run the latest version of our deployed code or manually run the same job with a set of custom parameters.
Create the file .github/workflows/run-job.yaml and make sure to replace the environment variables at the top.
...just checking. Did you replace the 4 variables after BEGIN: BE SURE TO REPLACE THESE VALUES? Because I sure didn't! But if you didn't, it's a good chance to remind you that this GitHub Action could be running in an entirely different account with an entirely different set of permissions. This is the awesome power of OIDC and CI/CD workflows.
Now that you've triple-checked the placeholder values are replaced, commit and push the file.
Once pushed, this Action will run your job every day at 02:30 UTC. But for now, let's go ahead and trigger it manually.
Return to the GitHub UI, click on the Actions tab and click on ETL Job on the left-hand side. Click on the "Run workflow" button and you're presented with some parameters we configured in the Action above.
Screenshot of the GitHub dialog to start a workflow
Feel free to change the git tag we want to use, but we can just leave it as latest. Click the green Run workflow button and this will kick off an EMR Serverless job!
The GitHub Action we created starts the job, waits for it to finish, and then we can take a look at the output.

View the Output

This job just logs the output to stdout. When logs are enabled, EMR Serverless writes the driver stdout to a standard path on S3.
If the job is successful, the job output is logged as part of the GitHub Action.
Screenshot of the job output from the GitHub job showing the pyspark job being submitted
You can also view the logs with the following aws s3 cp command, assuming you have gunzip installed.
Replace S3_BUCKET with the bucket from your CloudFormation stack and APPLICATION_ID and JOB_RUN_ID with the values from your Fetch Data GitHub Action.
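The "standard path" mentioned above follows a fixed layout under whatever log location the job was configured with. As a sketch, the helpers below build the driver stdout location and decompress a downloaded log. The function names are hypothetical, and the log_uri value in the example (a logs/ prefix in your bucket) is an assumption about how the workflow was configured.

```python
import gzip


def driver_stdout_uri(log_uri: str, application_id: str, job_run_id: str) -> str:
    """Return the S3 URI where the Spark driver's gzipped stdout is written."""
    return (f"{log_uri.rstrip('/')}/applications/{application_id}"
            f"/jobs/{job_run_id}/SPARK_DRIVER/stdout.gz")


def decode_log(gzipped: bytes) -> str:
    """Decompress a downloaded stdout.gz file into text."""
    return gzip.decompress(gzipped).decode("utf-8")


# Placeholder values, to be replaced with the ones from your stack and job run.
print(driver_stdout_uri("s3://S3_BUCKET/logs/", "APPLICATION_ID", "JOB_RUN_ID"))
# prints: s3://S3_BUCKET/logs/applications/APPLICATION_ID/jobs/JOB_RUN_ID/SPARK_DRIVER/stdout.gz
```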

Conclusion

Keep in mind that this GitHub Action will run daily, incurring AWS costs.
To prevent additional cost, delete your EMR Serverless application in the EMR Serverless Console. And if you don't want email notifications when your scheduled job fails, be sure to delete your run-job.yaml GitHub Action as well.
The EMR team has been hard at work improving the local Spark development experience for EMR as well. Here are a few more resources for you to check out:
If you enjoyed this tutorial, found any issues, or have feedback for us, please send it our way!

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
