logo
Menu
Redshift Queries Iceberg Tables

Redshift Queries Iceberg Tables

Hands on guide setting up a transactional data lake with Iceberg and Querying it with Redshift

Published Feb 2, 2024
Why Iceberg
Iceberg is an open table format, Open tables are a way to give data lakes properties of relational databases. It aims to solve the problems with Hive table which is a directory based table format. Iceberg is a file based table format, storing it's data in metadata and manifest files
Depiction of Structure of hive table format vs Iceberg table format
Hive vs Iceberg Table format
Source
Iceberg improves data lakes compliance with ACID property. While maintaining the ability to use different compute engines like Spark, Flink on a single datastore.
Other Benefits of Iceberg
Hidden Partition - Iceberg partitions data and optimizes query without the user needing to specify how the data should be partitioned.
Time Travel - It enables rollback of data to a previous point in time. To learn more about problems solved by iceberg see this talk to learn about benefits of iceberg.

Overview of Iceberg

Iceberg store data in layers meta data layer and data layer
Iceberg Structure
Iceberg Architecture: ( Source )
 
The iceberg metadata stores information about the data schema and the manifest files, which store information of the original data and it's partition. To learn more about iceberg metadata see this blog and talk.
Lets create an iceberg table and query it using Redshift.

Copying Data to S3 Data Lake

  1. Create an S3 bucket give it any name, you can use this guide to create an s3 bucket, I named mine redshift-iceberg-tutorial .
  2. Navigate to the CloudShell, you can navigate to CloudShell console by typing CloudShell on the search bar in any service page and clicking Cloudshell. Paste the code snippet below into cloud shell CLI to Copy the data into the recently created bucket. Paste the code below in the CLI to explore list of files in the folder.
1
aws s3 ls "s3://nyc-tlc/trip data/"
Paste the code below to copy green taxi files to your bucket. Replace redshift-iceberg-tutorial with your bucket name.
1
2
aws s3 cp "s3://nyc-tlc/trip data/" s3://redshift-iceberg-tutorial/Parquet/ --exclude "*" --include "green_tripdata_2020*" --recursive
aws s3 cp "s3://nyc-tlc/trip data/" s3://redshift-iceberg-tutorial/Parquet/ --exclude "*" --include "green_tripdata_2021*" --recursive

 

(Optionally) Create A New Workgroup for Athena, or use the default primary Workgroup

To create an Athena workgroup
  1. Navigate to Athena service page and click workgroup on the left navigation pane. A workgroup in Athena is a resource for Athena to separate queries within the same account.
  2. Click create workgroup tab. Give your workgroup a name like redshift-iceberg. Leave the remaining settings as default and click create workgroup.
  3. Click Query Editor in the left navigation pane and select the newly created workgroup.
Creating Athena Workgroup

 

Create a Glue Database for the Catalog

You can follow the steps below to create a new database or skip to use an existing or the default database
  1. Go the Glue console to create a table for our data. In the left navigation pane of the glue service page click databases to create a new database. Click Add database button to add a new database. Give the database a name and click the Create database button, I named mine icebergdb.
Glue studio

 

Create a Glue Crawler

We would create and run a glue clue crawler using our database to populate our database with a table.
  1. Create a crawler to retrieve the schema of the data and create a table. Click on crawler from the left Navigation pane in Glue console. Click create crawler
  2. on the set crawler properties page give the crawler a name, leave the rest settings as default, and click the Next button.
  3. On the Chose data source and classifiers page click Browse S3 buckets to search our newly created bucket or
  4. Type the name of the bucket to search for the iceberg bucket, select it or any folder within it containing our data in my case the Parquet folder
  5. Click choose to add to Parquet folder as an s3 data source for our crawler
  6. Click Create new IAM role button. On the pop up icon give the role a name and or leave default name, click the create button leaving other settings as default.
  7. On the set output and scheduling page select the glue database we created previously.
  8. Review your choices and click create crawler to create the crawler.
  9. Click on the crawler page and then click run crawler, to create a table for our glue database
Create Crawler

 

Preview the Table in Athena

Under Glue Databases select our previously created database and under tables select newly created table, Click table data under view data to preview the table in Athena.
preview Athena data

 
If you get an Error Query results location not set click on settings tab in the query editor. Then click the Manage button and select an s3 location to store results of Athena query.

Creating an Iceberg Table

In the Athena Service page. Click Query Editor and open a new editor. In the editor paste the code below to create an iceberg table
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
CREATE TABLE green_taxi_iceberg (
vendorid bigint,
lpep_pickup_datetime timestamp,
lpep_dropoff_datetime timestamp,
store_and_fwd_flag string,
ratecodeid double,
pulocationid bigint,
dolocationid bigint,
passenger_count double,
trip_distance double,
fare_amount double,
extra double,
mta_tax double,
tip_amount double,
tolls_amount double,
ehail_fee int,
improvement_surcharge double,
total_amount double,
payment_type double,
trip_type double,
congestion_surcharge double)
PARTITIONED BY (year(tpep_pickup_datetime))
LOCATION 's3://redshift-iceberg-tutorial/iceberg/iceberg'
TBLPROPERTIES (
'table_type'='iceberg',
'write_compression'='snappy',
'format'='parquet');
To load the data into iceberg table from our Glue table using this code. Paste into Query editor
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
insert into green_taxi_iceberg (
vendorid, lpep_pickup_datetime,
lpep_dropoff_datetime,
store_and_fwd_flag, ratecodeid,
pulocationid, dolocationid,
passenger_count, trip_distance,
fare_amount, extra, mta_tax,
tip_amount, tolls_amount, ehail_fee,
improvement_surcharge, total_amount,
payment_type, trip_type, congestion_surcharge
)
select vendorid, lpep_pickup_datetime,
lpep_dropoff_datetime,
store_and_fwd_flag, ratecodeid,
pulocationid, dolocationid,
passenger_count, trip_distance,
fare_amount, extra, mta_tax,
tip_amount, tolls_amount, ehail_fee,
improvement_surcharge, total_amount,
payment_type, trip_type, congestion_surcharge
from parquet;
we can preview our table to verify our insert operation worked.

Querying Data With Redshift

Create Serverless workgroup, use an existing workgroup or cluster, you can setup a Redshift workgroup using this guide guide.
In the Serverless Namespace configuration click the name and select Security and encryption tab under the Namespace configuration.
Click the IAM role linked to the redshift cluster. In the IAM page select the Add permissions button. Search for AmazonAthenaFullAccess on the next page, select it and click the Add permissions button to attach AmazonAthenaFullAccess Policy to default Redshift role. Follow this guide to learn more.
Redshift Default IAM Role

 
In Redshift console open the Query v2 Editor to query our iceberg table.
With the Editor create an External Schema to Query to the Iceberg table created previously with Athena.
Query Editor

 
Paste the code below into the query editor to create external schema. Specifying the name of your glue database, and region.
1
2
3
4
5
CREATE external schema spectrum_iceberg_schema
from data catalog
database 'icebergdb'
region 'us-east-1'
iam_role default;
Now we can query our iceberg table using the external schema paste this code below into an editor in Query v2.
1
2
3
SELECT * FROM dev.spectrum_iceberg_schema.green_taxi_iceberg
WHERE vendorid=2
LIMIT 5;
Query Editor results
If you get an error similar to this confirm that you attached AmazonAthenaFullAccess to your default IAM role.
Access Denied Error RedshiftIamRoleSession is not authorized to perform: glue:GetTable on resource:
We have successfully Queried an Iceberg table in our centralized data lake giving it properties of relational databases and maintained it's compute engine agnostic property using Athena and Redshift.

Some Iceberg Features Currently Not Supported With Redshift

  1. Time travel
  2. Create and Alter commands currently not supported
  3. The table should be defined in a glue catalog

References

Comments