
Transition Data from OLTP to OLAP to Archive using Zero-ETL

Zero-ETL is a set of integrations that eliminates or minimizes the need to build ETL data pipelines. Extract, transform, and load (ETL) is the process of combining, cleaning, and normalizing data from different sources to get it ready for analytics.

Sherin Chandy
Amazon Employee
Published Dec 3, 2024
Zero-ETL integration between Amazon Aurora and Amazon Redshift makes it possible to run analytics queries on near real-time data replicated from Aurora to Redshift. Once the replicated data no longer needs to stay in Redshift for analytics queries, it can be moved to S3 using UNLOAD or Redshift Spectrum for long-term archival, making it possible to build a historical data repository for ad hoc querying for compliance and other purposes. Data offloaded from Redshift to S3 can be queried whenever needed using Redshift, Amazon Athena, Amazon EMR, or any other tool that supports reading data from S3.
After replicating data to Redshift via Zero-ETL, you can shrink Aurora database storage by deleting already-replicated rows or tables in Aurora. When replicated data in Aurora is deleted by dropping a table or database, or by deleting specific rows from a table, the overall allocated space decreases by a comparable amount thanks to the dynamic resizing capability of Aurora storage. This helps reduce storage charges.
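As a minimal sketch of that cleanup step, the statements below assume a hypothetical Aurora source table named orders with an order_date column. Run them only after the corresponding data has been preserved in Redshift or S3 as described in the rest of this post, because Zero-ETL also replicates deletes.

```sql
-- Aurora (MySQL-compatible syntax shown for illustration): remove rows that
-- have already been replicated and archived downstream. The table and column
-- names ("orders", "order_date") are hypothetical.
DELETE FROM orders
WHERE order_date < '2024-01-01';

-- Or, if the whole table is no longer needed in Aurora:
-- DROP TABLE orders;
```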
The following diagram illustrates the architecture for this use case.
[Architecture diagram]

Since the data changes replicated via Zero-ETL from Aurora to Redshift also include delete operations, bulk deletes in Aurora result in corresponding deletes in the Redshift replication target table as well. To preserve the replicated data in Redshift as a historical repository, copy the data from the replication target table in Redshift to another backup table in Redshift, or UNLOAD it to S3, before performing any bulk delete, truncate, or drop operations in Aurora.
The best approach is to periodically create a backup table in Redshift using CTAS from the original replication target table and to periodically move data from this backup table into S3 using UNLOAD or Spectrum for long-term retention. It is a good practice to create separate backup tables based on date/time (for example monthly, weekly, or daily) and to unload the tables to corresponding date/time-based prefixes in the S3 bucket, so that individual Redshift backup tables and S3 datasets corresponding to a specific date/time range can be deleted separately in the future. The data unloaded to S3 via UNLOAD or Spectrum can be catalogued in AWS Glue and accessed directly for ad hoc queries using Redshift Spectrum or Amazon Athena. The steps to implement this approach are given below.
  1. Follow https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/zero-etl.html to create a Zero-ETL integration between Amazon Aurora and Amazon Redshift and verify that data is being replicated to Redshift. (In this example, the Redshift table “zero_etl_replication_target” is the replication target table.) The screenshot below shows a row count on the target table; a sample query follows.
[Screenshot: row count of the replication target table “zero_etl_replication_target”]
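For reference, a quick check like the following, run from the Redshift query editor, confirms that rows are arriving; depending on how your integration was set up, you may need to qualify the table with the integration's database and schema.

```sql
-- Sanity check that the Zero-ETL integration is replicating rows to Redshift.
-- The table name comes from the example above.
SELECT COUNT(*) AS replicated_rows
FROM zero_etl_replication_target;
```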
  2. Create a backup table in Redshift at regular intervals (weekly, monthly, daily, and so on). The screenshot below shows the creation of a backup table “zero_etl_replication_target_2024_JAN_Archive” for the month of January 2024 from the replication target table “zero_etl_replication_target”; a sample CTAS statement follows.
[Screenshot: creating the backup table with CTAS]
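A sketch of that CTAS, assuming a hypothetical order_date column to scope the month:

```sql
-- Create a month-scoped backup table from the replication target using CTAS.
-- The filter column ("order_date") is an assumption for illustration; use the
-- date/time column of your own table.
CREATE TABLE zero_etl_replication_target_2024_JAN_Archive AS
SELECT *
FROM zero_etl_replication_target
WHERE order_date >= '2024-01-01'
  AND order_date <  '2024-02-01';
```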
  3. Once the backup table is queried infrequently or no longer needs to be retained in Redshift, UNLOAD the table to S3 and then truncate or drop the backup table. The screenshot below shows unloading the backup table “zero_etl_replication_target_2024_JAN_Archive” to S3 as Parquet files; a sample UNLOAD statement follows.
[Screenshot: UNLOAD of the backup table to S3 as Parquet]
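A sketch of the UNLOAD, in which the S3 bucket, prefix, and IAM role ARN are placeholders to replace with your own:

```sql
-- Unload the backup table to a date-based S3 prefix as Parquet for long-term
-- retention. The bucket, prefix, and IAM role ARN below are placeholders.
UNLOAD ('SELECT * FROM zero_etl_replication_target_2024_JAN_Archive')
TO 's3://my-archive-bucket/zero_etl_archive/2024/01/'
IAM_ROLE 'arn:aws:iam::111122223333:role/MyRedshiftUnloadRole'
FORMAT AS PARQUET;

-- After verifying the unloaded files, reclaim Redshift storage.
DROP TABLE zero_etl_replication_target_2024_JAN_Archive;
```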
  4. The unloaded dataset on S3 can be catalogued in AWS Glue and queried directly using Redshift Spectrum if it is needed in the future for any ad hoc requirements. The screenshot below shows querying the archived table directly from S3 by leveraging the AWS Glue Data Catalog and the automatic Glue Data Catalog mounting feature of Redshift; a sample query follows.
[Screenshot: querying the archived S3 data from Redshift through the Glue Data Catalog]
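Assuming the unloaded files have been catalogued in a Glue database (the database and table names below are placeholders created by a crawler or a manual table definition), a query through the auto-mounted awsdatacatalog database could look like this:

```sql
-- Query the archived Parquet files on S3 from Redshift through the
-- automatically mounted Glue Data Catalog ("awsdatacatalog"). The Glue
-- database and table names are placeholders.
SELECT COUNT(*) AS archived_rows
FROM awsdatacatalog.zero_etl_archive_db.zero_etl_replication_target_2024_jan_archive;
```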
  5. Ad hoc queries on the same S3 dataset can also be run directly from Athena, as shown in the screenshot and sample query below.
[Screenshot: ad hoc query on the archived S3 data from Athena]
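The same Glue Data Catalog table can be queried from Athena; again, the database and table names are the placeholders used above:

```sql
-- Athena query against the same Glue Data Catalog table backed by the
-- archived Parquet files on S3. Names are placeholders.
SELECT *
FROM zero_etl_archive_db.zero_etl_replication_target_2024_jan_archive
LIMIT 10;
```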
Note: If the dataset on S3 needs to be kept for a very long time, leverage S3 storage class tiering options to optimize cost. Also, use the S3 lifecycle management feature to delete objects automatically when the retention period ends.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
