Unleashing Data Analytics on an S3 Data Lake with AWS Glue Crawler and Amazon Athena
Explore the power of AWS Glue and Amazon Athena for data analytics on the AWS platform. AWS Glue simplifies data integration, offering data crawlers that automatically infer schema from data in S3 and build a centralized data catalog. Amazon Athena, an interactive query service, enables analysis using standard SQL. Learn how to leverage Amazon Athena's Trino query engine to analyze data in a Glue database, ingested from an S3 data lake using AWS Glue data crawlers.
Published Mar 9, 2024
In today's world, data is growing at an exponential rate. While this data holds immense value and has the potential to impact various aspects of our world, extracting value from such large datasets is no simple task. This is where data analytics comes in. Data analytics is a process used in the field of data science to examine large datasets, uncovering hidden patterns, correlations, trends, insights, and relationships. These insights can help organizations gain a competitive edge in the market by making data-driven decisions, improving operational efficiency, and identifying new business opportunities. The main tasks involved in data analytics include data collection, cleaning, processing, and analysis.
In my previous blog, I discussed ELT and data lakes, highlighting their importance in the current big data world. This blog continues that discussion by showcasing AWS tools like AWS Glue and Amazon Athena for data analytics on the AWS platform. But before we delve into that, let's first understand what AWS Glue and Amazon Athena are and why they are used in data engineering. AWS Glue is a serverless data integration service that simplifies the process of data discovery, preparation, movement, and integration from multiple sources. It offers data crawlers that can automatically infer schema from data stored in data lakes like S3 buckets and create a centralized data catalog. The Glue Data Catalog serves as a metastore containing metadata for all the crawled tables. Additionally, AWS Glue allows us to visually create ETL pipelines for transformation and to use Spark notebooks to process and transform data. Amazon Athena, in turn, is an interactive query service that enables us to analyze data using standard SQL. It offers two options for analyzing data: using the distributed Trino query engine to analyze data in S3, on premises, or in other clouds, or using notebooks to build interactive Spark applications and analyze data with PySpark and Spark SQL.
In this blog, I will use the Trino query engine of Amazon Athena to analyze the data in the Glue database, which was ingested from the S3 data lake using AWS Glue data crawlers. The architecture for this implementation is explained below in the Architecture section.

AWS Glue is a serverless data integration service widely used for data analytics on AWS. It connects to over 70 diverse data sources and manages data in a centralized data catalog. AWS Glue supports various workloads like ETL, ELT, and streaming, providing services for data discovery, modern ETL, data cleansing, transformation, and centralized cataloging. The following are some of the main components of AWS Glue:
- AWS Glue Data Catalog
The AWS Glue Data Catalog is a fully managed, centralized metadata repository and catalog service that stores metadata like table definitions, column names, data types, and other attributes describing the data used in AWS Glue ETL jobs. It simplifies and automates data discovery and cataloging, offering a unified view of data across different AWS services and data sources. This makes it easier for users to query and analyze data without needing to know its exact location or format, while ensuring data consistency.
- AWS Glue Crawler
AWS Glue Crawler is a feature of AWS Glue that automatically discovers the schema of data in a source and stores the resulting metadata in the AWS Glue Data Catalog. It automates the process of cataloging data, making it easier for users to analyze their data without manually defining a schema for each dataset in the source. AWS Glue crawlers run classifiers, which can be either built-in or custom classifiers created by users. These classifiers automate schema inference, saving time and effort and allowing users to focus on data analysis and deriving insights.
- AWS Glue Databases
AWS Glue databases are logical containers that store metadata tables in the AWS Glue data catalog. They help manage and organize metadata tables that define data from different data stores, making data querying easier. When a Glue data crawler runs on a data source, it creates tables within a specified Glue database in the data catalog. Databases organize tables based on their schema and usage, making it easy for users to locate and access tables needed for analysis and processing.
- AWS Glue ETL
AWS Glue ETL (Extract, Transform, Load) is a fully managed service by AWS that helps users prepare and transform data for analysis quickly. It simplifies the ETL process by automating tasks like data preparation, type conversion, and format conversion, and it is a serverless, scalable, and cost-effective solution for data analytics. AWS Glue offers options to create ETL jobs visually or to write scripts and Spark jobs programmatically, and jobs can be scheduled to run as needed. Additionally, AWS Glue evaluates and monitors data quality and provides version control by linking Glue jobs with a Git repository in AWS CodeCommit.
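The components above can also be driven programmatically. As a minimal sketch, the following boto3 snippet registers and starts a Glue crawler over an S3 path; the crawler name, IAM role ARN, database name, and bucket path are all hypothetical placeholders, not values from this walkthrough.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build the kwargs for glue.create_crawler (kept pure so it is easy to inspect)."""
    return {
        "Name": name,
        "Role": role_arn,           # IAM role the crawler assumes to read the bucket
        "DatabaseName": database,   # Glue database that will hold the inferred tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

def create_and_run_crawler(request):
    import boto3  # deferred import: the sketch loads even without boto3 installed
    glue = boto3.client("glue")
    glue.create_crawler(**request)            # register the crawler
    glue.start_crawler(Name=request["Name"])  # run it; tables land in the Data Catalog

# Hypothetical names -- replace with your own crawler, role, database, and bucket:
request = crawler_request(
    name="spacex-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="spacex_db",
    s3_path="s3://my-datalake-bucket/spacex/",
)
# create_and_run_crawler(request)  # uncomment with valid AWS credentials configured
```

The same configuration can of course be done entirely in the console, as shown in the walkthrough below; the API route is useful when the crawler setup needs to be repeatable or version-controlled.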
Amazon Athena is an interactive query service provided by AWS for analyzing data stored in data lakes like Amazon S3 using standard SQL queries. Athena is a cost-effective, serverless, and highly scalable service that uses the Trino query engine and Apache Spark in the backend, running queries in parallel to generate fast results even for large datasets and complex queries. Athena uses the AWS Glue Data Catalog to locate and query data sources for analysis. It also integrates with tools like Amazon QuickSight or third-party BI tools to visualize query results, creating dashboards and reports to understand and communicate insights gained from data.
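Athena queries can also be submitted outside the console. As a hedged sketch of the boto3 flow, the snippet below starts a query, polls until it finishes, and fetches the result rows; the database name and results bucket are hypothetical placeholders.

```python
import time

def athena_request(sql, database, output_s3):
    """Build the kwargs for athena.start_query_execution (kept pure for clarity)."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},  # Athena writes results here
    }

def run_query(request):
    import boto3  # deferred import: the sketch loads even without boto3 installed
    athena = boto3.client("athena")
    qid = athena.start_query_execution(**request)["QueryExecutionId"]
    state = "QUEUED"
    while state in ("QUEUED", "RUNNING"):  # poll until the query leaves the running states
        time.sleep(1)
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
    if state != "SUCCEEDED":
        raise RuntimeError(f"query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]

# Hypothetical database and results bucket:
request = athena_request(
    sql="SELECT COUNT(*) FROM spacex_crew",
    database="spacex_db",
    output_s3="s3://my-athena-results-bucket/output/",
)
# rows = run_query(request)  # uncomment with valid AWS credentials configured
```

Note that Athena always writes query results to the configured S3 output location, so a results bucket must exist before the first query runs.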
This section describes the various steps taken to implement a simple data analytics pipeline in AWS using AWS Glue and AWS Athena.
First, navigate to the AWS Management Console, search for AWS Glue, and open it.
Since our data is not yet mapped to Glue tables, select “Not yet” and click “Add a data source”.
If you want, you can also explore the advanced options available in the Glue crawler.
To do this, first navigate to the AWS Management Console, search for Athena, and click on it to open the Athena console.
Then, in the Athena query editor, we can run different queries for data analysis. Here are some examples:
- Select all data from the spacex_crew table.
- Select specific columns from the spacex_crew table.
- Find the count of total crew members using the COUNT() aggregation function.
- Filter the data to get details of crew members belonging to the “SpaceX” agency.
- Aggregate the data by agency to find the number of members belonging to each agency.
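Written out as SQL strings, the examples above might look like the following sketch; the column names (name, agency, status) are assumptions about the crawled spacex_crew table, not values confirmed here.

```python
# The five example queries from the list above, against the crawled spacex_crew table.
# Column names (name, agency, status) are illustrative assumptions about the dataset.
QUERIES = {
    "select_all": "SELECT * FROM spacex_crew",
    "select_columns": "SELECT name, agency, status FROM spacex_crew",
    "count_members": "SELECT COUNT(*) AS total_crew FROM spacex_crew",
    "filter_agency": "SELECT * FROM spacex_crew WHERE agency = 'SpaceX'",
    "group_by_agency": "SELECT agency, COUNT(*) AS members FROM spacex_crew GROUP BY agency",
}
```

Each of these can be pasted directly into the Athena query editor against the Glue database that the crawler populated.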

In conclusion, the combination of AWS Glue and Amazon Athena offers a powerful solution for data analytics on the AWS platform. By using AWS Glue's data crawler to ingest data from the S3 data lake, creating a Glue Data Catalog with databases and tables, and analyzing the data in Glue tables using SQL queries in Amazon Athena, valuable insights can be effectively extracted from the data, as we have done in our project so far. These tools streamline data discovery, preparation, and integration, making it easier for businesses to gain a competitive edge by making data-driven decisions, improving operational efficiency, and identifying new business opportunities in today's data-driven world.
Hence, simple data analytics was performed over the data we loaded into the S3 data lake, using AWS Glue and Amazon Athena. In the next part of this project, I will clean and transform the data using Spark notebooks in AWS Glue and finally load the transformed data into Amazon Redshift, which will serve as the final data warehouse for the processed data.
Previous Blog: Unleashing the power of ELT in AWS using Airbyte