Data Quality Checks and Anomaly Detection, all within AWS Glue

Learn how this new AWS Glue feature works and start defining your Data Quality checks inside your Glue pipelines

Published Jan 13, 2024

Introduction

Data quality checks are a fundamental element of every successful data pipeline, serving as safety nets across all stages of the ETL process. They act as gatekeepers, guaranteeing through clear, well-defined rules that data stays in top shape.
When data analysts, engineers, and ETL developers deal with huge amounts of data moving around, data quality checks become key components, catching issues before they become full-blown problems.
For this purpose, the AWS Glue Data Quality feature was recently added to the pipeline build workflow of AWS Glue. It helps improve the quality of Glue data pipelines through rule-based checks and by using machine learning to detect statistical anomalies and unusual patterns.
In this post we will review its main features and use cases, and see how it can help improve our data pipeline process by adding an extra data quality layer, all within the AWS ecosystem.

Benefits, Key Features, and How It Works

AWS Glue Data Quality allows developers to add an extra DQ layer to their pipelines by measuring and monitoring the quality of the data involved in their traditional AWS Glue jobs, which in turn drives better business decisions and more robust ETL processes.
Glue Data Quality is essentially built on the shoulders of the open-source framework Deequ, a tool developed and used internally at Amazon to verify the quality of many large production datasets. Under the hood, AWS Glue Data Quality also uses the Data Quality Definition Language (DQDL), the language used to define the data quality rules.
Data Quality Rules

To learn more about DQDL and supported rule types, see Data Quality Definition Language (DQDL) reference.
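Before wiring rules into a job, it helps to see what DQDL actually looks like. Below is a minimal sketch of two rulesets as they would be embedded in a Glue job script. ColumnCount, ColumnExists, IsComplete, and DetectAnomalies are rule types from the DQDL reference, but the column names are illustrative placeholders, and the DetectAnomalies syntax follows the anomaly detection examples in the Glue docs, so check it against the current reference:

```python
# A DQDL ruleset is plain text: a "Rules" list that Glue evaluates against
# your dataset. The column names ("profit", "order_id") are placeholders.
ruleset = """
Rules = [
    ColumnCount = 24,
    ColumnExists "profit",
    IsComplete "order_id"
]
"""

# Anomaly detection uses the DetectAnomalies rule type: Glue learns the
# normal range of a statistic (here, the row count) across job runs and
# flags statistically unusual values instead of checking a fixed threshold.
anomaly_ruleset = """
Rules = [
    DetectAnomalies "RowCount"
]
"""
```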

Benefits and Main Features

  • Serverless – there is no installation, patching, or maintenance.
  • Get started quickly – AWS Glue Data Quality analyzes your data right away and generates instant output for the data quality checks defined by your rules:
    Get instant results on your Data Quality Checks
  • Detect data quality issues – Use machine learning (ML) to detect anomalies and hard-to-detect data quality issues.
  • 25+ out-of-the-box DQ rules – start from the built-in rule types, then create rules that suit your specific needs.
    out-of-the-box DQ rules
  • Pay as you go – there are no annual licenses needed to use AWS Glue Data Quality.
  • No lock-in – AWS Glue Data Quality is built on open-source Deequ, allowing you to keep the rules you author in an open language.
  • Data quality checks – you can enforce data quality checks on the Data Catalog and on AWS Glue ETL pipelines, allowing you to manage data quality at rest and in transit.
    You can find more information about these features in the AWS documentation.
    To understand how Glue performs data quality checks, we first need to understand what has to be in place for that process:
  1. AWS Glue needs IAM permissions defined that allow it to use the data quality features.
  2. You need to define rules (built-in or your own) that check the quality of your data.
    Regarding the permissions, there is plenty of information in the documentation. As for the data quality rules, they can be run either over tables stored in the AWS Glue Data Catalog or inside Glue ETL jobs; the sketch below shows the Data Catalog route.
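For the Data Catalog route, a minimal sketch with boto3 might look like the following. The database name, table name, ruleset name, and role ARN are illustrative placeholders, and the calls assume the Glue client's data quality APIs (create_data_quality_ruleset and start_data_quality_ruleset_evaluation_run):

```python
import boto3

glue = boto3.client("glue")

# Register a named DQDL ruleset against a Data Catalog table.
# Database, table, and ruleset names here are placeholders.
glue.create_data_quality_ruleset(
    Name="orders_dq_ruleset",
    Ruleset='Rules = [ ColumnCount = 24, ColumnExists "profit" ]',
    TargetTable={
        "DatabaseName": "superstore_db",
        "TableName": "global_superstore_orders",
    },
)

# Kick off an evaluation run of that ruleset against the table.
# The role must carry the IAM permissions mentioned above.
run = glue.start_data_quality_ruleset_evaluation_run(
    DataSource={
        "GlueTable": {
            "DatabaseName": "superstore_db",
            "TableName": "global_superstore_orders",
        }
    },
    Role="arn:aws:iam::123456789012:role/GlueDataQualityRole",
    RulesetNames=["orders_dq_ruleset"],
)
print(run["RunId"])
```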
    For this post I have prepared some data to use in an ETL job. It corresponds to the Global Superstore dataset, which I left publicly available in the bucket https://glue-etl1217.s3.us-east-2.amazonaws.com/global_superstore_orders.csv
    S3 bucket dataset

    Then, when setting up the Glue job to read the data from the S3 bucket, you will get the usual preview of your loaded data.
    ETL Job
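In script form, the source step of such a job looks roughly like the sketch below. It assumes a standard Glue PySpark job and reads the dataset via the s3:// form of the URL shared above:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glue_context = GlueContext(sc)

# Read the Global Superstore CSV from S3 into a DynamicFrame,
# treating the first row as the header.
orders = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://glue-etl1217/global_superstore_orders.csv"]},
    format="csv",
    format_options={"withHeader": True},
)

orders.printSchema()  # preview the inferred schema, as in the console view
```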

    The next step is as easy as defining some quality checks over your data; this can be done from the Data Quality tab (the fifth tab) in your Glue job definition canvas.
    Defining Data Quality Checks

    You can find more details on how to manually define your own data quality checks using DQDL in the documentation. This time I kept things simple by defining the built-in DQ rules ColumnCount and ColumnExists; the output is visualized in real time, as shown in the image above. Both checks pass (Passed), since the dataset contains 24 columns and a column named profit is present.
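If you prefer keeping the checks in code rather than in the console, the same two rules can also be evaluated from the job script itself. This is a minimal sketch assuming the EvaluateDataQuality transform from the awsgluedq module (the transform Glue Studio generates), applied to the orders DynamicFrame from the earlier sketch:

```python
from awsgluedq.transforms import EvaluateDataQuality

# The same two checks used above: exactly 24 columns,
# and a column named "profit" must exist.
dq_ruleset = """
Rules = [
    ColumnCount = 24,
    ColumnExists "profit"
]
"""

# Evaluate the ruleset against the DynamicFrame. The publishing options
# also surface the results in CloudWatch and the Glue console.
dq_results = EvaluateDataQuality.apply(
    frame=orders,
    ruleset=dq_ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "orders_dq_check",
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True,
    },
)

# Inspect the per-rule outcomes (Passed/Failed) produced by the transform.
dq_results.toDF().show(truncate=False)
```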

Conclusion

As you can see, it is pretty easy to define quick and effective data quality checks right on your Glue ETL jobs using this new feature. It is up to you to try it out by reading the documentation in detail and defining some DQ checks of your own.
AWS Data Quality Release Notes
Use anomaly detection with AWS Glue

 
