Unleashing Data Analytics on S3 Data lake with AWS Glue Crawler and Amazon Athena

Introduction

In today's world, data is growing at an exponential rate. While this data holds immense value and has the potential to impact various aspects of our world, extracting value from such large datasets is no simple task. This is where data analytics comes in. Data analytics is a process used in the field of data science to examine large datasets, uncovering hidden patterns, correlations, trends, insights, and relationships. These insights can help organizations gain a competitive edge in the market by making data-driven decisions, improving operational efficiency, and identifying new business opportunities. The main tasks involved in data analytics include data collection, cleaning, processing, and analysis.

In my previous blog, I discussed ELT and data lakes, highlighting their importance in the current big data world. This blog continues that discussion by showcasing the power of AWS tools like AWS Glue and AWS Athena for illustrating data analytics on the AWS platform. But before we delve into that, let's first understand what AWS Glue and AWS Athena are and why they are used in data engineering. AWS Glue is a serverless data integration service that simplifies the process of data discovery, preparation, movement, and integration from multiple sources. It offers data crawlers that can automatically infer schema from data stored in data lakes like S3 buckets and create a centralized data catalog. The Glue data catalog serves as a metastore containing metadata for all the crawled tables. Additionally, AWS Glue allows us to visually create ETL pipelines for transformation and use Spark notebooks to process and transform data. Similarly, Amazon Athena is an interactive query service that enables us to analyze data using standard SQL. Amazon Athena offers two options for analyzing data: using the distributed query engine Trino to analyze data in S3, on-premises, or other clouds, or using notebooks to build interactive Spark applications and analyze data using PySpark and Spark SQL.

In this blog, I will use the Trino query engine of Amazon Athena to analyze the data in the Glue database, which was ingested from the S3 data lake using AWS Glue data crawlers. The architecture for this implementation is explained below in the Architecture section.

Architecture

About AWS Glue

AWS Glue is a serverless data integration solution widely used for data analytics in AWS. AWS Glue enables us to connect with over 70 diverse data sources and to manage data in a centralized data catalog. AWS Glue supports various workloads like ETL, ELT, and streaming, providing services for data discovery, modern ETL, data cleansing, transformation, and centralized cataloging. The following are some of the main components of AWS Glue:

AWS Glue Data Catalog

AWS Glue data catalog is a fully managed centralized metadata repository and catalog service that stores metadata like table definitions, column names, data types, and other attributes describing the data used in AWS Glue ETL jobs. It simplifies and automates the process of data discovery and cataloging, offering a unified view of data across different AWS services and data sources. This makes it easier for users to query and analyze data without needing to know its exact location or format while ensuring data consistency.

AWS Glue Crawler

AWS Glue Crawler is a feature of AWS Glue that automatically discovers data schema in the source and stores metadata in the AWS Glue Data Catalog. It automates the process of cataloging data making it easier for users to analyze their data without the need to manually define schema for each dataset in the source. AWS Glue crawlers run classifiers, which can be either built-in or custom classifiers created by users. These classifiers automate the process of inferring schema from datasets, saving time and effort and allowing users to focus on data analysis and deriving insights.

AWS Glue Databases

AWS Glue databases are logical containers that store metadata tables in the AWS Glue data catalog. They help manage and organize metadata tables that define data from different data stores, making data querying easier. When a Glue data crawler runs on a data source, it creates tables within a specified Glue database in the data catalog. Databases organize tables based on their schema and usage, making it easy for users to locate and access tables needed for analysis and processing.

AWS Glue ETL

AWS Glue ETL (Extract, Transform, Load) is a fully managed service by AWS that helps users prepare and transform data for analysis quickly. It simplifies the ETL process by automating tasks like data preparation, type conversion, and format conversion. It's a serverless, scalable, and cost-effective solution for data analytics. AWS Glue offers options to visually create ETL Jobs or write scripts and Spark jobs programmatically for ETL workloads. It allows scheduling jobs to run as needed. Additionally, AWS Glue evaluates and monitors data quality and provides version control options by linking Glue jobs with a git repository in AWS CodeCommit.

About Amazon Athena

AWS Athena is an interactive query engine service provided by AWS that is used for analyzing the data stored in data lakes like Amazon S3 using standard SQL queries. AWS Athena is a cost-effective, serverless, and highly scalable service that uses the trino query engine and spark in the backend which can run queries in parallel allowing it to generate fast results even for large datasets and complex queries. AWS Athena uses the AWS Glue Data Catalog to query data sources for analysis. It also integrates with tools like Amazon QuickSight or third-party BI tools to visualize query results, creating dashboards and reports to understand and communicate insights gained from data.

Implementation Details

This section describes the various steps taken to implement a simple data analytics pipeline in AWS using AWS Glue and AWS Athena.

Creating AWS Glue Data Crawler

Firstly, navigate to the AWS management console and search for AWS Glue, then open AWS Glue by clicking it.

This opens the Glue console as shown below. Here on the left side, you can see various sections like Data Catalog, Data Integration, and ETL and Legacy pages. For this project, we are only dealing with the AWS Glue data catalog and various components of Glue associated with the AWS Glue data catalog. Here we aim to connect AWS Glue with the AWS S3 data lake which we set up in our last blog and to ingest all the data in that S3 data lake for data analysis in AWS Athena.

Firstly we navigate to AWS Glue Crawler and create a crawler.

Our original data source was SpaceX API and the raw data from this API is stored in our S3 data lake, so let’s name the glue crawler as spacex_crawler and we can provide a suitable description to the crawler. Also, optionally we can add tags to the cluster. Once done we navigate to the next page.

On the next page, we need to configure the data source to ingest and catalog the data. Also, for this project, we don’t need to set up a custom classifier as the built-in classifier available in AWS Glue is capable of inferring schema from the CSV files stored in our s3 data lake.

As our data is not already mapped to Glue tables we have to select “Not yet” and click on “Add a data source”

Here select S3 as data source, choose location of s3 data as in this account, then click on browse S3 to select the S3 path of airbyte folder which stores all the data that was ingested from airbyte to our S3 data lake. Also select crawl all sub-folders options, as airbyte ingests data from multiple API endpoints and stores them into multiple folders in our S3 bucket. Then select “Add an S3 data source”.

In the next step, we will configure a new IAM role which will provide us necessary access to our S3 buckets. Select the Create new IAM Role option inside configure security settings and then provide a suitable name for your IAM role. This automatically creates a new IAM role with all necessary permissions required to access the files from s3 bucket as shown below. Once the role is configured, click on Next and navigate to the new page.

On the new page, we have to configure output for data and scheduling. Here we define a glue database as our target destination. For this, we will first add a new database by clicking on “Add database” and then provide a suitable name for the database i-e “spacex_db” in my case.

Once created return to the previous page where if we refresh the options, we can see our newly created database. Also, we can optionally configure a table prefix for the metadata tables that will be created inside the glue database. Also in the crawler schedule we have currently selected the “On Demand” option to run the glue crawler only when we need it manually. But according to your need, you can specify a schedule to automate the crawler runs.

If you want you can also explore the advanced options available in Glue Crawler.

These are the scheduling options available in the Glue crawler

Once done, we will click next.

Finally, we review all the properties and options that we have configured for the AWS Glue data crawler and click on Create to create the glue data crawler. You can now see your crawler in the list of available crawlers.

Now wait till the status of the glue crawler becomes ready. Once ready click on run crawler and this will start the metadata ingestion and data cataloging process in AWS Glue.

The status of the crawler changes to running which means that the crawler is now ingesting the data from the data lake and storing it inside the metadata tables in the glue database. This process takes some time and depends on the amount of data it needs to ingest from the data lake.

When the crawler run completes successfully, you can see the status of completed in the crawler runs tab. Also, the status of the crawler changes back to ready.

Now navigate to the AWS Glue databases page by clicking on the databases option in the right side. Here select the spacex_db.

Inside spacex-db, we can see the list of all the tables that were created after ingesting data from the data lake using the glue crawler. Here we can see the names of tables, the database they belong to, their location in the s3 data lake, and the type of file that contains the original data.

We can also view the list of metadata tables and inspect their details by navigating to the tables tab in the right-hand menu.

The successful creation of metadata tables inside the Glue database indicates the successful creation of the AWS Glue data catalog which contains data from our S3 data lake. Now we can use AWS Athena on this glue data catalog to analyze the data stored in our data lake easily.

Analyzing glue tables using Amazon Athena

For this firstly navigate to the AWS management console and search for Athena, then click on it to open Athena’s console.

In Athena’s UI click on launch query editor to open the Athena query editor. We can use the query editor to query the tables of the Glue database and perform different analyses.

Here in the query editor, we select the data source as the glue data catalog we just created. Then we can select spacex_db in databases. This will then show the list of all the tables and views available inside spacex_db. We can now write queries in the right-hand pane of the query editor. Below I have shown a simple example where I have written a select query to query the data stored in one of the tables. Once we click on Run, we can view the results of the query below, along with the time it takes to execute the query and the total data scanned by the query.

Then in this Athena query editor, we can perform different queries for data analysis, here are some of the examples.

Select all data from the spacex_crew table

Select specific columns from the spacex_crew table

Find the count of total crew members using the COUNT() aggregation function.

Filtering data to get details of crew members belonging to the “SpaceX” agency.

Aggregating the data based on agency to find the number of members belonging to each agency.

Hence in this way, we can run different SQL queries to analyze the data from the data lake in AWS Athena. Here I have only shown a few examples of queries, you can try to run more complex queries using joins, subqueries, etc. based on your data model. Also, if we want to do more complex data transformations, cleaning, and analysis in our data we will use spark notebooks from Glue ETL to process the data.

Conclusion

In conclusion, the combination of AWS Glue and AWS Athena offers a powerful solution for data analytics on the AWS platform. By utilizing AWS Glue's data crawler to ingest data from the s3 data lake, creating a Glue data catalog with databases and tables, and analyzing the data in Glue tables using SQL queries in AWS Athena, valuable insights can be effectively extracted the data, as we have done in our project so far. These tools streamline the process of data discovery, preparation, and integration, making it easier for businesses to gain a competitive edge by making data-driven decisions, improving operational efficiency, and identifying new business opportunities in today's data-driven world.

Epilogue

Hence, simple data analytics was done over the data we loaded in the S3 data lake using AWS Glue and AWS Athena. In the next part of this project, I will be cleaning and transforming the data using spark notebooks in AWS Glue and will finally load the transformed data into AWS Redshift, which will be the final data warehouse for storing the processed data.

Previous Blog: Community | Unleashing the power of ELT in AWS using Airbyte

Site Terms, Privacy, and more.