Automating and orchestrating data engineering workflows in MWAA - Part 1
First of a two-part blog illustrating a data engineering workflow that uses MWAA (with ECS) to execute SQL queries in Redshift
Published Aug 31, 2024
In this two-part blog, I will demonstrate how to orchestrate a simple data pipeline using the Amazon Managed Workflows for Apache Airflow (MWAA) service. This involves a couple of tasks:
- fetching raw data from an S3 bucket and copying it into Redshift
- executing a custom Docker container in ECS, which transforms the data ingested into Redshift and creates a view.
The purpose of this is to illustrate the options of executing tasks either within the MWAA environment or in containers on ECS. We will provision the Airflow environment and the required resources (S3, Redshift, ECS) using IaC (Terraform).
All the resources below can be deployed using Terraform from modules in my GitHub repo. Install Terraform from the HashiCorp website. Clone the repo and run the following commands:
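(A sketch of the standard Terraform workflow, assuming the root module sits at the top of the cloned repo; supply any variables your setup requires.)

```bash
# initialise providers and modules, review the plan, then deploy
terraform init
terraform plan
terraform apply
```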
Note: The MWAA resource can take around 30 mins to create.
In summary, the following resources are required:
- A VPC with public and private subnets, NAT and internet gateways, routes, and the required security groups (ingress to Redshift and on port 443, plus an outbound rule allowing all traffic). The public Terraform VPC module is used to create these resources.
- An ECR repository to push images to, which can then be referenced in the ECS task definition. We will be offloading one of the tasks from Airflow to ECS, so the image needs to include the required dependencies.
- SSM parameters, which are referenced in the Python DAG scripts.
- Two S3 buckets: a demo-ml-datasets bucket containing the sample housing dataset used for this demo, and a bucket from which MWAA fetches the DAGs and requirements. The requirements.txt file contains all the dependencies needed to install Apache Airflow and other base requirements when setting up the Airflow environment; by default MWAA expects this to be in a requirements folder in the bucket. The dags folder contains the DAG plus a utils folder with a module that is imported into the main DAG script.
- Redshift, MWAA and the ECS cluster are deployed in the same VPC. Typically, it takes about 20-30 minutes for the MWAA environment to be created initially (a short CLI sketch for checking the environment status follows this list). However, once the environment is set up, any subsequent update to the DAG scripts in the S3 bucket is reflected almost instantaneously in the Airflow UI.
- MWAA needs an IAM service role to be able to make API calls to the various other AWS services for logging, data access, etc. Additionally, an IAM role (with the necessary permissions) is needed for the user to access the Airflow webserver UI. This `mwaa-role` is created as part of the Terraform deployment.
- A service-linked role predefined by Redshift is automatically created in the account the first time a Redshift cluster (or Redshift-managed VPC endpoint) is created. It includes the predefined permissions required to call other AWS services on behalf of the Redshift cluster, e.g. Secrets Manager for retrieving credentials, creating log groups and putting logs, and modifying or creating VPC endpoint and subnet configurations. These permissions cannot be modified. An IAM role for Redshift with the AWS managed policy for full access to S3 will also be created.
- ECS roles, including the following:
- A service-linked role for ECS, automatically created when the cluster is created.
- An ECS TaskRole with the AWS managed policies for Redshift full access, Systems Manager full access, and Secrets Manager read/write.
- An ECS TaskExecutionRole with the AWS managed ECS task execution role policy.
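Once terraform apply completes, a couple of quick CLI checks can be useful; the environment and parameter names below are placeholders for whatever your deployment created.

```bash
# poll the MWAA environment status (it moves from CREATING to AVAILABLE)
aws mwaa get-environment --name <MWAA-ENV-NAME> \
  --query 'Environment.Status' --output text

# confirm one of the SSM parameters that the DAG scripts will read
aws ssm get-parameter --name <PARAMETER-NAME> \
  --query 'Parameter.Value' --output text
```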
Once the Redshift cluster is set up, we will open the query editor and create a new table in the database schema. We will also load data into the table from the dataset in S3.
Click on Query Editor v2 in the Redshift console (on the left navigation bar). Right click on redshift-cluster-1 in the tree as shown in the screenshot below and select Edit Connection. In the new window, select Temporary credentials using a database user name. For the database name enter dev and for the username enter awsuser, then click Save.

Now if you expand the dropdown under redshift-cluster-1 you should see the dev database. Highlight it and then click the Load data option at the top. Select Load from S3 bucket as in the screenshot below, enter the path to the housing.csv file in S3, leave the other options as default and click Next.
Click on Load new table, select the schema public and enter the table name housing. Select the RedshiftRole IAM role created earlier, which has permissions to read from S3 and publish to CloudWatch. The schema details should be inferred automatically; accept the defaults and select Create Table.

This creates a query in the query window that copies the data from S3 into Redshift using the IAM role and specifying the format as CSV. Click Run on the query. If the query runs successfully, you should see a success log as below.
Click the Refresh icon under the Load data button and you should see the housing table in the tree under the public schema of the dev database. You can query it to check that the data loaded as expected.
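If you prefer to verify the load from the command line rather than the Query Editor, the Redshift Data API can run the same kind of check. This is a sketch; the cluster identifier, database and user below match the defaults used above.

```bash
# submit a quick row count against the new table
STATEMENT_ID=$(aws redshift-data execute-statement \
  --cluster-identifier redshift-cluster-1 \
  --database dev --db-user awsuser \
  --sql "SELECT COUNT(*) FROM public.housing;" \
  --query 'Id' --output text)

# fetch the result once the statement has finished running
aws redshift-data get-statement-result --id "$STATEMENT_ID"
```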
Navigate to the ECR console and you should see a repository created as part of the Terraform deployment. Click on View push commands.
This will bring up the window below with a list of commands to run to build and push custom docker images to this repository.
cd to the repository root, which contains the Dockerfile, and then run the docker_build_ecr.sh script using the command below. Replace <IMAGE-TAG>, <ACCOUNT-ID> and <AWS-REGION> with the tag you want to give the image (e.g. ecs-redshift), your account ID and the region (e.g. us-east-1).
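The exact invocation depends on how docker_build_ecr.sh reads its arguments; assuming it takes the three placeholders positionally, it would look something like this:

```bash
# hypothetical invocation -- adjust to match how the script expects its arguments
./docker_build_ecr.sh <IMAGE-TAG> <ACCOUNT-ID> <AWS-REGION>
```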
The region and account ID are required to authenticate to the ECR repository before building and pushing the image. The aws ecr get-login-password command below first generates an authentication token using the GetAuthorizationToken API, which is then piped to the docker login command to authenticate to the Amazon ECR registry. The docker commands are then run to build, tag and push the image to the private ECR repository.
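If you want to run the steps individually rather than through the script, they are essentially the standard push commands shown in the ECR console (same placeholders as above):

```bash
# authenticate Docker to the private ECR registry
aws ecr get-login-password --region <AWS-REGION> | \
  docker login --username AWS --password-stdin <ACCOUNT-ID>.dkr.ecr.<AWS-REGION>.amazonaws.com

# build the image from the Dockerfile at the repo root, tag it for the repository, and push
docker build -t <IMAGE-TAG> .
docker tag <IMAGE-TAG>:latest <ACCOUNT-ID>.dkr.ecr.<AWS-REGION>.amazonaws.com/<IMAGE-TAG>:latest
docker push <ACCOUNT-ID>.dkr.ecr.<AWS-REGION>.amazonaws.com/<IMAGE-TAG>:latest
```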
In the screenshot below, the image is tagged as ecs-redshift and pushed to an ECR repository in the us-east-1 region. Navigate to the repository in the ECR console and click on ecs-redshift; you should see the image artifact. Subsequent builds and pushes store additional artifacts, with the most recent image carrying the latest tag.

When debugging and testing code, the container can be run locally from the image before executing it in ECS, to identify any issues. We can set values for the environment variables AWS_DEFAULT_REGION, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to be used while the container is running. This avoids hardcoding credentials in application code or copying credential files into the container; alternatively, they could be stored in and retrieved from Secrets Manager. These are the access keys associated with your IAM user, which will be used to access AWS services programmatically. In this case we need access to Redshift to execute a query, so your IAM user will need a policy that allows read and write access to the table created previously in the Redshift cluster. Pass in the repository URI from the previous section, which should be of the format <ACCOUNT-ID>.dkr.ecr.<AWS-REGION>.amazonaws.com/<IMAGE-TAG>
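A local test run might look like the following sketch; it assumes the application in the container picks up credentials from the standard AWS environment variables.

```bash
# run the image locally, passing credentials through environment variables
# (values are placeholders -- never bake real keys into the image)
docker run --rm \
  -e AWS_DEFAULT_REGION=<AWS-REGION> \
  -e AWS_ACCESS_KEY_ID=<YOUR-ACCESS-KEY-ID> \
  -e AWS_SECRET_ACCESS_KEY=<YOUR-SECRET-ACCESS-KEY> \
  <ACCOUNT-ID>.dkr.ecr.<AWS-REGION>.amazonaws.com/<IMAGE-TAG>:latest
```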
This will execute the CMD instruction in the Dockerfile, which runs the execute_redshift_query.py script.
Navigate to the ECS console and you should see the cluster deployed earlier by Terraform. Click on the task definition. The screenshot below shows multiple task definitions because I updated the definition during the course of this deployment (each update increments the revision number, even when older revisions are made inactive and deleted), but you should see only one in your console. Click on the task definition link to see more details.
This shows the details of the task definition, including the task size (CPU, memory), as in the screenshot below. Since we are using Fargate, the CPU and memory must be one of the supported Fargate task-size combinations; this task is configured with 2 vCPU and 4 GB of memory. Under containers, we can see the ECR image URI referenced in the task definition during deployment; this is the image built and pushed to the ECR repository in the previous section. We also need to set a TaskRole and a TaskExecutionRole, which grant the Amazon ECS container and Fargate agents permission to make AWS API calls on your behalf. In this use case, the TaskRole has the permissions to run queries in Redshift, and the TaskExecutionRole grants permission to pull the container image from the private ECR repository and to send logs to CloudWatch using the awslogs driver.
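The same details can be inspected from the CLI; this sketch assumes the task definition family name created by your deployment.

```bash
# show the task size, roles and container image for the latest active revision
aws ecs describe-task-definition --task-definition <TASK-DEF-FAMILY> \
  --query 'taskDefinition.{cpu:cpu,memory:memory,taskRoleArn:taskRoleArn,executionRoleArn:executionRoleArn,image:containerDefinitions[0].image}'
```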
In Part 2 of this blog, we will go over the MWAA execution and analysis of the outputs.