Building customized Data Flows using Amazon AppFlow
Start building your own Data Integrations using AWS services
Published Jul 28, 2024
We will be exploring Amazon AppFlow, a fully managed, serverless AWS data integration service that enables you to securely and efficiently connect data between AWS services and multiple Software-as-a-Service (SaaS) applications without writing any code.
The diagram below, taken from the official documentation, shows how AppFlow works at a high level: data is ingested into your flow from one of the many available sources, passes through the flow for enrichment, mapping, and custom validations, and is finally delivered to one of the supported destinations.
In this tutorial, we will cover both theoretical concepts and practical exercises to help you build customized data flows using Amazon AppFlow.
Amazon AppFlow simplifies the process of integrating data across AWS services. This is its basic offering for developers looking to design and execute robust data integration solutions:
- Secure data transfer with encryption in transit and at rest.
- Built-in data transformation and filtering.
- Seamless integration with AWS services such as Amazon S3, Amazon Redshift, and AWS Lambda.
- No-code interface for easy configuration and management of data flows.
Amazon AppFlow can be configured to perform both real-time (event-triggered) and scheduled data transfers from source to target applications. One of the most common use cases for this service is enriching data from SaaS applications to drive better, data-informed management decisions.
Before creating your first data flow, you will need to prepare your environment to perform basic queries using the AWS CLI and to allow your applications to interact with other AWS services:
- Creating an IAM role: You will need an IAM role that grants permissions to the AWS Glue Data Catalog; a sample sketch follows this list.
- Configuring the AWS CLI: Set up your credentials and default region so that you can make programmatic calls from the AWS CLI.
- GitHub personal access token: If you don't have a personal access token configured for your GitHub account, you can get one by following these guidelines.
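For reference, here is a minimal sketch of that IAM setup using boto3, the Python AWS SDK. It assumes your AWS CLI credentials are already configured; the role name, policy name, and the wildcard resource scope are illustrative assumptions, so adjust them to your account.

```python
import json
import boto3

iam = boto3.client("iam")

ROLE_NAME = "appflow-tutorial-role"  # hypothetical name for this tutorial

# Trust policy that lets the AppFlow service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "appflow.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName=ROLE_NAME,
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Inline policy granting access to the AWS Glue Data Catalog.
# Scoped to "*" here for brevity; restrict the resources in real use.
glue_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "glue:GetDatabase",
            "glue:CreateDatabase",
            "glue:GetTable",
            "glue:CreateTable",
            "glue:UpdateTable",
            "glue:BatchCreatePartition",
        ],
        "Resource": "*",
    }],
}

iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="appflow-glue-data-catalog",
    PolicyDocument=json.dumps(glue_policy),
)
```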
The first thing you will need to do is search for Amazon AppFlow in the AWS Management Console. Once there, you can read a bit about how the service works, and even take a quick look at the official AWS documentation for it.
There are three main components in AppFlow, corresponding to three different things you can do in the console:
- Connectors: You can choose one of the built-in connectors to start your AppFlow application from scratch, or you can register a new connector by building a custom one with the Python or Java SDK, which must be implemented through a Lambda function; I will cover that whole process in a separate tutorial. You can also enumerate the available connectors programmatically, as shown in the sketch after this list.
- Connections: Here you can view all the connections you have previously created and set up new ones.
- Flows: This is the place where connectors, sources, and targets meet to build your flows. Typically, every flow uses one connector to interact with the source SaaS and another connector to interact with the target SaaS.
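As a quick aside, the boto3 sketch below lists the connectors available in your region and any connections (connector profiles) you have already created:

```python
import boto3

appflow = boto3.client("appflow")

# Enumerate the connectors available in this account/region.
for connector in appflow.list_connectors()["connectors"]:
    print(connector.get("connectorType"), "-", connector.get("connectorLabel"))

# List the connections (connector profiles) you have already set up.
profiles = appflow.describe_connector_profiles()
for profile in profiles["connectorProfileDetails"]:
    print(profile["connectorProfileName"], "->", profile["connectorType"])
```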
Once you have gathered some basic information about the service functionality, you can go ahead and get your hands dirty by clicking the "Create flow" button.
There you will have to complete a few mandatory steps to put your flow into action.
The next step is to configure your flow by selecting a source and a destination for your data. You can choose from more than 70 built-in connectors, or use a custom connector that you have built and previously registered for private or public use.
Here we are building a source based on the built-in GitHub connector; this is where you will need to provide your GitHub personal access token.
As you can see in the image above, I have previously set up a GitHub source that points to my account; the only requirement is a GitHub personal access token configured in your developer settings. Then you can choose one of the many destinations; for this example I chose an S3 bucket to store my GitHub commits.
You can also choose to catalog your output in the AWS Glue Data Catalog, which is completely optional. The Data Catalog is a metadata repository: it records aspects of your data such as schema, format, and data types; you can find more information on the Glue Data Catalog in the official documentation. The last step of this stage is to select how your flow will run: on demand or on a schedule.
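If you later want to reproduce this console setup through the API, the boto3 `create_flow` call takes the same building blocks: a source, a destination list, a trigger, a task list, and an optional Glue Data Catalog configuration. The sketch below is an assumption-heavy outline; the flow name, connection name, entity name, bucket, role ARN, and database name are all placeholders, and the exact connector properties may differ from what the console generates for you.

```python
import boto3

appflow = boto3.client("appflow")

appflow.create_flow(
    flowName="github-commits-to-s3",  # hypothetical flow name
    triggerConfig={
        # Use {"triggerType": "OnDemand"} for manual runs; a daily
        # schedule is shown here (AppFlow uses rate() expressions).
        "triggerType": "Scheduled",
        "triggerProperties": {
            "Scheduled": {
                "scheduleExpression": "rate(1days)",
                "dataPullMode": "Incremental",
            }
        },
    },
    sourceFlowConfig={
        # The GitHub connector is exposed through AppFlow's custom
        # connector framework; check list_connectors for the exact
        # type and API version in your region.
        "connectorType": "CustomConnector",
        "apiVersion": "v1",                              # placeholder
        "connectorProfileName": "my-github-connection",  # created earlier
        "sourceConnectorProperties": {
            "CustomConnector": {"entityName": "commits"}  # placeholder entity
        },
    },
    destinationFlowConfigList=[{
        "connectorType": "S3",
        "destinationConnectorProperties": {
            "S3": {
                "bucketName": "my-appflow-output-bucket",  # placeholder bucket
                "bucketPrefix": "github-commits",
            }
        },
    }],
    tasks=[{
        # Minimal pass-through mapping; real flows usually list one
        # Map task per field (see the mapping stage below).
        "taskType": "Map_all",
        "sourceFields": [],
        "taskProperties": {},
    }],
    # Optional: register the output metadata in the Glue Data Catalog,
    # using the IAM role created in the prerequisites.
    metadataCatalogConfig={
        "glueDataCatalog": {
            "roleArn": "arn:aws:iam::123456789012:role/appflow-tutorial-role",
            "databaseName": "appflow_tutorial_db",
            "tablePrefix": "github",
        }
    },
)
```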
The next stage is Mapping Fields. Here you define your mapping method: either manually, or by uploading a CSV file that defines how your fields map from source to destination.
You can also add formulas to create new fields and enrich your datasets.
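Under the hood, each mapped field and each formula becomes an entry in the flow's `tasks` list when the flow is defined through the API. Here is a hedged sketch of the boto3 `Task` shapes; the field names are placeholders, and the exact formula properties are assumptions:

```python
# Direct field mapping: copy "sha" from the source into "commit_sha"
# at the destination (field names are illustrative).
map_task = {
    "taskType": "Map",
    "sourceFields": ["sha"],
    "destinationField": "commit_sha",
    "taskProperties": {},
}

# Formula-style enrichment: AppFlow models formulas with dedicated
# task types, e.g. "Merge" to concatenate two source fields. The
# CONCAT_FORMAT value shown is an assumed template, not verified.
merge_task = {
    "taskType": "Merge",
    "sourceFields": ["author_name", "author_email"],
    "destinationField": "author",
    "taskProperties": {"CONCAT_FORMAT": "${author_name} <${author_email}>"},
}
```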
By setting partitions and aggregations, you can organize your output data into folders and files and optimize query performance for whatever will consume the data downstream. The last step of this stage is to set validations for your data; this step is completely optional, but it is extremely useful for guaranteeing that the output is the data you are expecting.
You can also specify filters that determine which records are transferred from source to destination; you can add multiple filters, each with its own criteria. For more details on how to configure your entire AppFlow application, follow the guidelines in the official AWS documentation for AppFlow.
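Filters and validations are likewise expressed as entries in the `tasks` list when a flow is defined through the API. Another hedged sketch, with assumed field names and criteria:

```python
# Filter: only transfer records whose "author_name" equals a value.
filter_task = {
    "taskType": "Filter",
    "sourceFields": ["author_name"],
    "connectorOperator": {"CustomConnector": "EQUAL_TO"},
    "taskProperties": {
        "DATA_TYPE": "string",
        "VALUE": "my-github-username",  # placeholder criterion
    },
}

# Validation: reject records where "sha" is null. The action value is
# an assumption; check the Task API reference for the accepted actions.
validate_task = {
    "taskType": "Validate",
    "sourceFields": ["sha"],
    "connectorOperator": {"CustomConnector": "VALIDATE_NON_NULL"},
    "taskProperties": {"VALIDATION_ACTION": "DropRecord"},
}
```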
Now that you have built your AppFlow application, you can run it on demand, or change the configuration to execute your flow on a specific schedule. From here there is nothing left to do but watch your flow in action: after executing it, go to the destination section and check that data was written to the S3 bucket you created during the configuration of your flow.
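For an on-demand run, the equivalent API calls are `start_flow` and `describe_flow_execution_records`. A short boto3 sketch, reusing the hypothetical flow name from earlier:

```python
import boto3

appflow = boto3.client("appflow")

# Trigger an on-demand execution of the flow.
run = appflow.start_flow(flowName="github-commits-to-s3")
print("Execution started:", run.get("executionId"))

# Check the status of recent runs.
records = appflow.describe_flow_execution_records(flowName="github-commits-to-s3")
for execution in records["flowExecutions"]:
    print(execution["executionId"], execution["executionStatus"])
```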
If you query the output data in your S3 bucket, you can verify that it corresponds to the data the flow was supposed to extract from your source (GitHub commits) and deliver to your destination bucket.
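Here is a quick boto3 sketch for that check, reusing the placeholder bucket and prefix from the flow configuration above:

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "my-appflow-output-bucket"  # placeholder bucket from the flow config
PREFIX = "github-commits"

# List the objects the flow wrote, then peek at the first one.
listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
contents = listing.get("Contents", [])
for obj in contents:
    print(obj["Key"], obj["Size"])

if contents:
    body = s3.get_object(Bucket=BUCKET, Key=contents[0]["Key"])["Body"].read()
    print(body[:500])  # first few hundred bytes of the extracted commit data
```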
This tutorial covered a very basic application that you can build using Amazon AppFlow; there are, of course, more complex flows you can build with this awesome serverless AWS service. We explored how to build customized data flows using Amazon AppFlow, covered its theoretical aspects, and walked through a very basic sample to create, transform, schedule, and monitor data flows. By leveraging Amazon AppFlow, you can automate and streamline your data integration processes, ensuring data consistency and reducing manual effort.