Choosing the Right Orchestration Service for Your Data Pipeline

Comparative Analysis between Workflows in AWS Glue, AWS Step Functions, and Amazon Managed Workflows for Apache Airflow (MWAA)

Christal Poon
Amazon Employee
Published Jul 8, 2024
Last Modified Aug 7, 2024
Data volumes in organisations are increasing at an unprecedented rate, exploding from terabytes to petabytes and, in some cases, exabytes. As data volume grows, it attracts more and more users and applications that consume the data in different ways, depending on each organisation's unique use cases, so that the relevant business innovations can be performed. In this context, a Modern Data Architecture on AWS becomes extremely relevant: it seamlessly integrates your data sources, enriches data with business context for easy consumption, and stores it in purpose-built data stores. ETL (extract, transform, load) is the most widespread approach to this data integration pattern; it is the process of combining data from multiple data sources into a large, central repository such as a data lake or data warehouse. A scalable and robust data orchestration strategy for the various ETL jobs on the data platform therefore carries significant weight, so that organisations can work seamlessly towards their innovation agenda.
In this post, we provide guidance on how a few AWS native services can be used for data orchestration, by discussing the features of each service against relevant parameters and workload characteristics.

What is Data Orchestration?

Data orchestration is the process of moving siloed data from multiple storage locations into a centralised repository where it can be combined, cleaned, and enriched for activation (for example, generating business-specific KPIs through a business intelligence service like Amazon QuickSight). It helps automate the flow of data between tools and systems so that organisations are working with complete, accurate, and up-to-date information. This end-to-end process may run through multiple steps or sub-paths, and these sub-paths can be of different types: sequential, parallel, or even a choice state where one path is picked based on a data condition. For the end-to-end process, you therefore need a robust system that can handle failures, retries, and so on, guided by your requirements.
Extending the data orchestration workflow further, workflows are generally composed of various tasks that can either be scheduled at different intervals or triggered by a specific event. They are finite and have a definitive purpose in what they do. Multiple workflows can also be stitched together to form an ordered sequence. Once defined, they act on data stored in multiple different locations, such as databases, data warehouses, file stores, and data lakes, and perform some action on that data. This can be as complex as transformations, aggregations, and processing, or as simple as copying from one source system to another. As an end user doing this orchestration, you need a medium that 1/ provides a good deal of observability, 2/ integrates with other AWS or third-party services, and 3/ scales. All of this has to be done with security as job zero, the ability to meet organisational standards, and cost effectiveness without proprietary licenses.

What are the Data Orchestration Service Options on AWS?

On AWS, at a high level, data orchestration services can be divided into two categories: purpose-built orchestrators and general purpose orchestrators.
  • Purpose-built orchestrators: These AWS native data orchestration services are built for a specific purpose and use case.
    • Workflows in AWS Glue - AWS Glue is a serverless data integration service that makes it easy for analytics users to discover, prepare, move, and integrate data from multiple sources. You can use it for analytics, machine learning, and application development. It also includes additional productivity and data ops tooling for authoring, running jobs, and implementing business workflows. Workflows in AWS Glue are part of this fully managed, serverless, AWS native service and provide orchestration of AWS Glue jobs, crawlers, and triggers. If the data pipeline is built using jobs and crawlers, this allows a single platform, AWS Glue, to handle all aspects of the data pipeline. It is particularly well suited for complex, multi-job ETL operations involving Python or Apache Spark. It provides a visual designer using workflow graphs, supports ingestion of both streaming and batch data, and offers template-based workflow creation that allows you to reuse some of your data pipeline assets. A minimal sketch of defining such a workflow with boto3 follows this list.
  • General purpose orchestrators: These AWS native data orchestration services are built to orchestrate workflows of various natures.
    • AWS Step Functions - A visual workflow service that helps developers use AWS services to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning pipelines.
    • Amazon Managed Workflows for Apache Airflow (MWAA) - A managed orchestration service for Apache Airflow that you can use to set up and operate data pipelines in the cloud at scale. Apache Airflow is an open-source tool used to programmatically author, schedule, and monitor sequences of processes and tasks, referred to as workflows.
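To make the Glue option concrete, the following is a minimal sketch of defining a workflow programmatically with boto3. The workflow, crawler, and job names (nightly-sales-etl, raw-sales-crawler, transform-sales-job) are hypothetical, and the crawler and job are assumed to already exist; the same structure can equally be assembled in the visual designer.

```python
import boto3

glue = boto3.client("glue")

# The workflow is a container for triggers, jobs, and crawlers.
glue.create_workflow(
    Name="nightly-sales-etl",  # hypothetical name
    Description="Crawl raw sales data, then run the transform job",
)

# A scheduled trigger starts the crawler every night at 02:00 UTC.
glue.create_trigger(
    Name="start-crawl",
    WorkflowName="nightly-sales-etl",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"CrawlerName": "raw-sales-crawler"}],
    StartOnCreation=True,
)

# A conditional trigger runs the job only after the crawler succeeds;
# this is how Glue workflows express step dependencies.
glue.create_trigger(
    Name="run-transform",
    WorkflowName="nightly-sales-etl",
    Type="CONDITIONAL",
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "CrawlerName": "raw-sales-crawler",
                "CrawlState": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"JobName": "transform-sales-job"}],
    StartOnCreation=True,
)

# The same workflow can also be started on demand.
glue.start_workflow_run(Name="nightly-sales-etl")
```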
In this post, we discuss Workflows in AWS Glue, AWS Step Functions, and Amazon Managed Workflows for Apache Airflow (MWAA) from a data orchestration point of view, covering both purpose-built and general purpose orchestrators.

Mapping concepts between Workflows in AWS Glue, AWS Step Functions and Amazon MWAA

In the intricate landscape of data orchestration, understanding the nuanced differences in terminology and constructs across the three services is crucial, as each service approaches key aspects of the orchestration process differently. The following table provides a high-level mapping of concepts between the three services on the basis of these parameters:
  • End-to-end pipeline: Describes the terminology each service uses for the end-to-end data pipeline.
  • Steps: Describes the components involved in each step, highlighting each service's unique classification.
  • Runtime: Describes each service's term for a single runtime execution of the data pipeline.
  • Error handling: Describes how each service handles errors with its specific constructs.
  • Step dependency: Describes how dependencies between steps are defined within each service, as shown in the sketch below.
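To ground the step-dependency and runtime mappings, the sketch below expresses the same two-step pipeline (an extract job followed by a transform job) in the two general purpose orchestrators' vocabularies: a Step Functions state machine chains Task states with Next, while an Airflow DAG wires tasks with the >> operator. In Glue, the equivalent dependency is the conditional trigger shown earlier. All names, the account ID, and the IAM role ARN are hypothetical placeholders, and in practice the Airflow fragment would live in its own DAG file uploaded to the MWAA environment's S3 bucket.

```python
import json
from datetime import datetime

import boto3

# --- AWS Step Functions: states are chained with "Next" (States Language) ---
definition = {
    "StartAt": "ExtractJob",
    "States": {
        "ExtractJob": {
            "Type": "Task",
            # Optimized integration: run a Glue job and wait for completion
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "extract-job"},
            "Next": "TransformJob",  # the step dependency
        },
        "TransformJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "transform-job"},
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="extract-then-transform",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::111122223333:role/sfn-glue-role",  # placeholder
)

# --- Amazon MWAA / Apache Airflow: the >> operator sets the dependency ---
from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="extract_then_transform",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = GlueJobOperator(task_id="extract", job_name="extract-job")
    transform = GlueJobOperator(task_id="transform", job_name="transform-job")
    extract >> transform  # transform runs only after extract succeeds
```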

Comparative Study on Features of Workflows in AWS Glue, AWS Step Functions and Amazon MWAA

The following table provides an in-depth comparison of the features between the three services, namely:
  • Deployment Model: Describes the type of deployment model the service uses.
  • Authoring: Describes how users can create and define their workflows.
  • Scheduling: Describes how workflows are scheduled and executed at specific times or intervals.
  • Multiple Scheduling: Describes whether the service allows the configuration of multiple schedules for the same workflow.
  • High Availability: Describes high availability and resiliency features offered to maintain robust workflows.
  • Invoke: Describes the different ways that workflows can be invoked.
  • Backfill / Catch-up: Describes whether the service can process historical data in a data store that was missed when the workflows initially ran.
  • Integration: Describes how well the service integrates with both AWS and non-AWS services.
  • Data Transferred / Payload: Describes the volume and method of data exchange between different steps or components within a workflow.
  • Error Handling: Describes workflow resilience and fault tolerance; see the sketch after this list.
  • Failure Notification: Describes how users get alerted in the event of a failure.
  • Observability: Describes the visibility and monitoring capabilities of the workflow execution, providing insight on performance and internal operations.
  • Loop Iterations: Describes whether the service supports repeating certain steps or tasks in a workflow as many times as needed, without defining each repetition.
  • Conditional Branching: Describes the ability to direct the flow of execution of a workflow based on conditional logic or decision points.
  • Concurrent Executions: Describes the ability to run multiple executions of a workflow simultaneously.
  • Maximum Number of Steps: Describes the maximum number of steps a user can have in a single workflow.
  • Batch Events: Describes the ability to trigger batch jobs or workflows involving large sets of data as opposed to single records.
  • Streaming Data: Describes the ability to orchestrate data that is being ingested in real time.
  • Cost: Describes the price of orchestrating workflows within the service.
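As an illustration of the error-handling and failure-notification rows above, the sketch below shows the constructs side by side: Step Functions attaches Retry and Catch fields to a Task state, while Airflow accepts retries and failure callbacks as task arguments (Glue, by contrast, handles retries through a job's MaxRetries setting and its trigger conditions). The state and job names and the callback body are hypothetical.

```python
from datetime import timedelta

# --- AWS Step Functions: Retry/Catch on a Task state (States Language) ---
task_with_error_handling = {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",
    "Parameters": {"JobName": "transform-job"},
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 30,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,  # waits 30s, 60s, 120s between attempts
        }
    ],
    "Catch": [
        {
            # After retries are exhausted, route to a failure-handling state
            "ErrorEquals": ["States.ALL"],
            "Next": "NotifyFailure",  # e.g. an SNS publish state
        }
    ],
    "End": True,
}

# --- Amazon MWAA / Apache Airflow: retries and callbacks per task ---
def alert_on_failure(context):
    # Failure notification hook; in practice, publish to SNS or a chat webhook
    print(f"Task {context['task_instance'].task_id} failed")

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "on_failure_callback": alert_on_failure,
}
```

In Airflow, default_args is typically passed to the DAG constructor so that every task inherits these settings; the on_failure_callback doubles as the failure-notification hook.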

Conclusion

ETL operations are the backbone of a data lake. ETL workflows often involve orchestrating and monitoring the execution of many sequential and parallel data processing tasks. As the volume of data grows, developers need to process it quickly so that they can make faster, well-informed design and business decisions. To process data at scale, developers must elastically provision resources to manage data coming from increasingly diverse sources, and they often end up building complicated data pipelines. AWS managed orchestration services such as AWS Step Functions, Amazon MWAA, and Workflows in AWS Glue help simplify the management of ETL workflows that involve a diverse set of technologies, and they provide the scalability, reliability, and availability needed to successfully manage your data processing workflows.
As a general rule of thumb, the table below defines the use cases for when to use which orchestration service.
Working backwards from our customers' data orchestration needs, we have realised that no one service fits all requirements. To identify the right data orchestration service, it is important to define the success criteria, workload characteristics, and feature requirements, and to understand the development team's preferences and familiarity with the tooling and services; this in turn helps in choosing one option or several. If needed, use a scoring model to make the choice of service data driven. Use the concept mappings and feature comparisons in this post as inputs when creating that scoring model.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
