
How to Ingest Excel Files into a Data Lake Using AWS Glue

As organizations modernize their data infrastructure, ingesting legacy Excel files into cloud-based data lakes is becoming increasingly important. Whether you’re dealing with departmental spreadsheets or externally sourced data, AWS Glue provides a serverless, low-code approach for transforming and loading Excel files from Amazon S3 into your data lake.

Published Jun 5, 2025
In this post, we’ll walk through how to ingest Excel files using AWS Glue’s support for Excel via the S3 file source node, transforming and storing the data in a queryable format like Parquet.

Overview

AWS Glue now supports ingesting Excel files directly from Amazon S3 using the S3 file source node in the Glue Studio visual editor. This allows users to:
  • Read Excel files without custom scripts
  • Perform column mapping and data transformations visually
  • Write clean, structured output into S3
  • Register the resulting dataset in the AWS Glue Data Catalog for querying with Amazon Athena
This approach simplifies ETL workflows for Excel ingestion and eliminates the need for intermediate conversion steps.
In this blog post, we are going to look at the scenario where you have some financial data in an Excel workbook that you want to ingest into your data lake. In our example, the workbook has data for financial transactions, transaction types, categories, amounts, and so on, as shown below:

Prerequisites

To follow along, you’ll need:
  • An AWS account and access to the AWS Management Console
  • An Excel file
  • An S3 bucket where you will upload your Excel file
  • An S3 bucket where you will store the ingested data
  • An AWS Glue-compatible IAM role with S3 and Glue permissions
  • An AWS Glue Data Catalog database (optional, but recommended)

Step 1: Check your Excel file

Before you get started, have a quick check of your Excel file. When creating a Glue Visual ETL job, the default setting is to import the data from the first sheet of your Excel workbook, so make sure that the data you want to import is located there. Also check that the sheet is well-structured (headers in the first row, consistent column types, etc.). This will help later when AWS Glue attempts to infer the schema from the file.
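If you would like to sanity-check the workbook programmatically before uploading it, a quick local look with pandas (with openpyxl installed for .xlsx support) does the job; the file name here is simply the one used in this example:

import pandas as pd

# Read only the first sheet, which is what the Glue visual job imports by default
df = pd.read_excel("financialinfo.xlsx", sheet_name=0)

# Confirm the headers landed in the first row and the column types look consistent
print(df.head())
print(df.dtypes)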

Step 2: Upload your Excel File to S3

In order to read your Excel spreadsheet, you will need to upload it to an S3 bucket. To create an S3 bucket and upload your file, follow these steps:
1. Navigate to the Amazon S3 console. In this example, we are going to create a general purpose bucket to store your Excel spreadsheet.
2. In the S3 console, click on the Create bucket button and give the bucket a unique name, then click Create bucket at the bottom of the page. In my example, I named the bucket "raw-excel-data-019201".
3. Once the bucket has been created, navigate to it in the S3 console. You can use the search box to find your bucket, then click on the bucket name to view its contents; the bucket should be empty at this point.
4. Next, click the Upload button, then Add files, and select the Excel workbook you want to upload from your local computer.
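If you prefer to script these steps, a rough boto3 equivalent might look like the following; the bucket and file names match the ones used in this walkthrough, and you should adjust the region to your own:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Create the bucket that will hold the raw Excel file
# (outside us-east-1, also pass CreateBucketConfiguration={"LocationConstraint": "<your-region>"})
s3.create_bucket(Bucket="raw-excel-data-019201")

# Upload the workbook from your local machine
s3.upload_file("financialinfo.xlsx", "raw-excel-data-019201", "financialinfo.xlsx")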

Step 3: Create a bucket for your ingested data

When we get to the step where we create our AWS Glue job to extract and transform the data, we will need somewhere to store the output. In the ETL job, we are going to transform the Excel data into Parquet files. Parquet is an open-source, columnar storage file format designed for efficient data storage and retrieval, particularly for large datasets and analytical workloads.
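To get a feel for the format, here is a tiny local illustration with pandas (pyarrow assumed to be installed); the column names and values are made up for the example, and the Glue job will do the equivalent at scale:

import pandas as pd

# A small, made-up slice of transaction data
df = pd.DataFrame({"transactiontype": ["Purchase", "Refund"], "amount": [125.50, -40.00]})

# Write columnar, Snappy-compressed Parquet (the same compression the Glue target will use)
df.to_parquet("sample.parquet", compression="snappy")

# Reading back only the columns you need is where Parquet shines for analytics
print(pd.read_parquet("sample.parquet", columns=["amount"]))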
Before we can create our ETL job, we need to create a second S3 bucket to store these Parquet files. To do that, follow these steps:
1. Navigate to the Amazon S3 console. In this example, we are going to create another general purpose bucket to store your Parquet files.
2. In the S3 console, click on the Create bucket button and give the bucket a unique name, then click Create bucket at the bottom of the page. In this instance, I named the bucket "financial-data-10781".

Step 4: Create a Database in the Glue Data Catalog

For our last prerequisite step, we are going to create a Database in the Glue Data Catalog. This is optional, but it will mean that once you have your Parquet data, you will be able to query this data using Amazon Athena.
To create a database, follow these steps:
1. Navigate to the AWS Glue console, and under Data Catalog in the menu on the left-hand side, click on Databases.
2. To create a database, click the Add database button in the upper right-hand corner of the databases listing.
3. You will need to give your database a name—in the example below I have named the database “finance”. Click Create database to finish.
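If you would rather create the database from code, the equivalent boto3 call is a one-liner; the database name matches the one used in this example:

import boto3

glue = boto3.client("glue")

# Create the Data Catalog database that the ETL job will register its table in
glue.create_database(DatabaseInput={"Name": "finance"})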
With all the prerequisites complete, you are now ready to move on to the next step, and create your AWS Glue job to extract the data from your Excel spreadsheet.

Creating an AWS Glue Visual ETL Job

For this ETL job, we are going to create a very basic setup. The job itself will have two nodes—one for the S3 File Source for your Excel spreadsheet, and a second node for an S3 File Target for your Parquet files. If you did want to do some transformation, you could add an additional node between these two, like the Change Schema transform to change field mappings, data types, etc.
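If you prefer a code-based alternative to the visual editor, here is a minimal sketch of the same two-step pipeline using the AWS SDK for pandas (awswrangler) instead of the Glue-native Excel reader; the bucket, database, and table names are the ones used in this example, and openpyxl is assumed to be available for the Excel read:

import awswrangler as wr

# Read the first sheet of the Excel workbook straight from S3
df = wr.s3.read_excel("s3://raw-excel-data-019201/financialinfo.xlsx")

# Write Snappy-compressed Parquet to the target bucket and register the table
# in the Glue Data Catalog so it can be queried with Athena
wr.s3.to_parquet(
    df=df,
    path="s3://financial-data-10781/financial_info/",
    dataset=True,
    database="finance",
    table="financial_info",
    compression="snappy",
)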


Step 1: Create a new AWS Glue ETL job

  1. In the AWS Management Console, navigate to AWS Glue.
  2. From the AWS Glue home page, click on Go to ETL jobs. This will take you to AWS Glue Studio.
  3. To create a new ETL job, click on Visual ETL. This will open a visual editor you can use to create your job.

Step 2: Configure the S3 File Source Node

  1. Within the visual editor, locate the Add nodes dialog and search for Amazon S3 as a source node, then click to add this to your canvas.
  2. In the Data source properties, use the Browse S3 button to select the bucket where you uploaded your Excel file earlier and select the Excel file name. When complete, the S3 URL should include both the bucket name and file name (for example, s3://raw-excel-data-019201/financialinfo.xlsx).
  3. For the Data format, use the drop-down list to select Excel.
At this point, you should be able to preview your Excel data in the Data preview pane, as shown below:
Note: If you cannot see any data in the preview, it is most likely that the IAM role you are using for Glue does not have permission to access the S3 bucket where you uploaded your Excel file. If you click on the settings icon (which looks like a cog), you can change the IAM role that is used for the preview, or alternatively update your AWS Glue IAM role to ensure that it has access to the S3 bucket.
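If you need to grant that access, an inline policy along these lines attached to your Glue job role is usually enough; the role and policy names below are placeholders, and the bucket name is the one used in this example:

import json
import boto3

iam = boto3.client("iam")

# Allow the Glue job role to list the raw bucket and read the uploaded workbook
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:ListBucket", "Resource": "arn:aws:s3:::raw-excel-data-019201"},
        {"Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::raw-excel-data-019201/*"},
    ],
}

iam.put_role_policy(
    RoleName="MyGlueJobRole",            # placeholder: your Glue job's IAM role
    PolicyName="read-raw-excel-bucket",  # placeholder policy name
    PolicyDocument=json.dumps(policy),
)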

Step 3: Set Up the S3 Target

  1. Within the visual editor, locate the Add nodes dialog, click on the Targets tab, and search for Amazon S3 as a target node, then click to add it to your canvas.
  2. In the Data target properties, select Parquet for the file format and Snappy for the compression type. For the S3 target location, select the S3 bucket where you would like to store the data extracted from the Excel file. In this example, this is the second S3 bucket I created, which I named "financial-data-10781".
  3. Under the Data Catalog update options, select "Create a table in the Data Catalog and on subsequent runs, keep existing schema and add new partitions". This allows you to ingest additional Excel files later to update the data while keeping the existing schema in place.
  4. For the table name, enter "financial_info", and for the Database, select the finance database we created earlier.
With your source and target in place, your ETL job should look like this:

Step 4: Configure Job Settings and Run

Now for the moment of truth: running your ETL job to ingest and transform your Excel data. Before you can run the job, you need to configure a few job settings.
  1. Click on the Save button to save your ETL job. You will need to give your ETL job a name. In this example, I have named my job "Ingest Excel into Data Lake". Make sure to click the Save button again after you make any changes.
  2. To run the ETL job, click on the Run button in the upper right-hand corner. You will get a banner indicating that the job has started successfully, as shown below:
  3. You can click on the Run details link to watch the progress of your ETL job. Depending on how much data is in your Excel workbook, it may take a few minutes. When the job status is updated to "Succeeded", you will know that the job has finished running.
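Once the job is saved, you can also start it and poll its status from code with boto3; the job name is the one used above:

import time
import boto3

glue = boto3.client("glue")

# Start the saved job and wait for it to reach a terminal state
run = glue.start_job_run(JobName="Ingest Excel into Data Lake")
while True:
    job_run = glue.get_job_run(JobName="Ingest Excel into Data Lake", RunId=run["JobRunId"])
    state = job_run["JobRun"]["JobRunState"]
    print(state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)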
Once the job has run successfully, your Excel data will be transformed and saved to S3 in Parquet format. You can navigate to the S3 bucket you created to hold the Parquet data and verify that the files have been created.
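A quick way to confirm the output from code is to list the target bucket with boto3; the bucket name matches the one created earlier:

import boto3

s3 = boto3.client("s3")

# List the Parquet files the Glue job wrote to the target bucket
response = s3.list_objects_v2(Bucket="financial-data-10781")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])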

Step 5: Query with Athena (Optional)

For a final, optional step you can use Amazon Athena to query the Parquet files that you created, using the steps below:
  1. From the AWS console, navigate to Amazon Athena and select the finance database.
  2. Under the Tables menu, you should be able to see the financial_info table.
  3. You can use the Athena query editor to run queries like:
SELECT * FROM "finance"."financial_info" LIMIT 10;
SELECT * FROM "finance"."financial_info"
WHERE transactiontype = 'Purchase';
These queries run against the Parquet files that you created when you ran your AWS Glue job, which contain the data you ingested from the Excel file.
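You can also run the same queries from code, for example with the AWS SDK for pandas (awswrangler); this assumes your account already has an Athena query result location configured:

import awswrangler as wr

# Run the query in Athena and return the result as a pandas DataFrame
df = wr.athena.read_sql_query(
    'SELECT * FROM "financial_info" LIMIT 10',
    database="finance",
)
print(df)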

Conclusion

With native Excel support in the AWS Glue Studio visual editor, ingesting Excel spreadsheets into your data lake is easier and faster than ever. You can build fully managed, reusable pipelines—without writing code—to process tabular data from Excel files and make it available for downstream analytics and machine learning.
Whether you're onboarding legacy data or standardizing input from business teams, AWS Glue Studio offers a flexible and scalable solution to bring Excel into your modern data lake architecture.
 
