Simplifies Data Retrieval using AWS S3 Select
Run SQL Queries on S3 to filter and retrieve the data using Amazon S3 Select
Published Feb 12, 2024
Last Modified Feb 13, 2024
In this article, we will discuss a great feature of AWS i.e. AWS S3 Select. In general, when we think about querying to S3 the first solution that comes to our mind is ATHENA which is correct. Actually, In Athena, you have to create tables that represent many objects in S3 and perform aggregations, joins, etc. across the tables. But what if you want to apply the selection only on a single S3 object for a subset of data only without creating any tables - The answer is S3 Select.
S3 Select will help us to retrieve only a portion of data (subset) from an S3 object by using a simple SQL Query.
Let's consider a use case where a company has multiple departments and all of them want to read a large log file from S3 having 100 million records.
Case 1: Each department has to download the file create their version and upload it back to S3. This will cause data issues because everyone worked on different versions of the object.
Case 2: One department needs only 10 records, another department needs 50 records, and so on but in either case, they need to download the large file having all the records.
Solution: By the Implementation of AWS S3 Select you can use a simple SQL expression to return only the data from the file you’re interested in, instead of retrieving the entire object. You can resolve both the issues as well as increase the performance and save the cost of running the queries on the file.
AWS S3 Select works on objects that are stored in the following formats:
- Apache Parquet
You can perform the query using AWS SDK's, CLI commands, or AWS Console. If you need to fetch data larger than 40 MB use API or AWS CLI.
A few points need to be considered:
- The length of a SQL expression should not exceed 256 KB.
- The length of a record in the input or the result is 1 MB max.
- Permission of
s3:GetObjectfor the object you are querying is a must.
How it will:
- Increase performance: By selecting only limited data instead of all the data from S3.
- Reduce cost: Reduce the long-running thread of reading all the data and process only relevant data.
Image taken from the AWS website.
The above image shows that before implementing the S3 Select how we are reading all the data from S3 and do the processing for each data to get the result. And how we are using S3 select to get the limited data only.
According to AWS S3 Select is up to 400% faster and you can save up to 80% cost for the query.
- Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.
- Click on Buckets (left pane).
- Select the bucket that contains the object that you want to read the content from, and then choose the name of the object.
- Select the Object actions, and choose Query with S3 Select.
- Configure the Input settings, like the format, JSON content type, and Compression of your input data, and the Output settings i.e. the format of the output data that you want to receive.
- Under SQL query, enter the SELECT SQL commands according to your requirements.
- Then click on Run SQL query. Then, under Query results, you can see the results of your SQL queries.
This is how we can use the S3 Select using the AWS console. Alternatively, you can use AWS SDK when trying to implement S3 Select in the application or apply AWS CLI.