
Near Real-Time Database Replication to Apache Iceberg Table on S3 Using Amazon Data Firehose

Maintain fresh, analytics-ready data in your data lake without implementing complex ETL processes or managing infrastructure

Pengfei Zhang
Amazon Employee
Published Jan 28, 2025
In November 2024, Amazon Web Services (AWS) announced a powerful new capability in Amazon Data Firehose that enables near real-time Change Data Capture (CDC) from operational databases like PostgreSQL and MySQL, with direct replication to Apache Iceberg tables in Amazon Simple Storage Service (Amazon S3).
Traditionally, to achieve near real-time data replication from databases to Amazon S3 data lakes, AWS customers needed to develop custom extract, transform, and load (ETL) jobs that periodically poll databases for changes. This approach presented two significant challenges: the ETL processes often impacted database transaction performance, and the batch-oriented nature of these jobs introduced latency, sometimes delaying data availability for analytics by several hours.
This new capability provides a simple, scalable, and fully managed solution for streaming database updates. Amazon Data Firehose leverages native database replication logs to capture changes, significantly reducing the impact on database transaction performance. You can configure a Data Firehose delivery stream in minutes to capture and process CDC events from your database. This enables seamless replication of data from various databases into Apache Iceberg tables on Amazon S3, making fresh data readily available for large-scale analytics and machine learning (ML) applications.
In this blog post, I will demonstrate this capability using an Amazon Relational Database Service (Amazon RDS) for MySQL database. I will walk you through the step-by-step process of configuring an Amazon Data Firehose delivery stream and validating the data synchronization. While this tutorial uses Amazon RDS, the solution also works with self-managed databases. Although we will use the AWS Management Console for this walkthrough, you can achieve the same results using the AWS Command Line Interface (AWS CLI), AWS SDKs, or AWS CloudFormation.

Preparation

The initial setup primarily focuses on establishing secure database connectivity. To enable Amazon Data Firehose to access your database within your Virtual Private Cloud (VPC) without exposing traffic to the internet, we need to configure an AWS PrivateLink VPC endpoint service. This setup is necessary because Data Firehose runs in AWS's service VPC and requires secure access to your database environment in your VPC.
The blog post Access Amazon RDS across VPCs using AWS PrivateLink and Network Load Balancer demonstrates how to create a private connection between your VPC resources and an external consumer. Following the same approach, the high-level steps to enable connectivity between the RDS database and Data Firehose are:
  1. Expose the RDS database as a service that runs behind a Network Load Balancer in your VPC
  2. Have Data Firehose connect to that service through a VPC interface endpoint
In this way, all traffic remains within the AWS network infrastructure, and you can control access through endpoint policies and service permissions.

Step 1: Prepare your RDS database

This tutorial assumes you already have an RDS for MySQL database created. If not, create one through the Amazon RDS console. During database creation, for Credentials Settings, choose Managed in AWS Secrets Manager; this is required later when you create the Data Firehose stream.
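If you prefer to script the database creation, here is a minimal sketch using the AWS SDK for Python (boto3). The identifier, instance class, and engine version are illustrative assumptions; the key detail is ManageMasterUserPassword=True, which stores the master credentials in Secrets Manager:

import boto3

rds = boto3.client("rds")

# Minimal sketch: create a MySQL instance whose master credentials are
# managed in AWS Secrets Manager (needed later when creating the Firehose stream).
rds.create_db_instance(
    DBInstanceIdentifier="school-db",    # hypothetical identifier
    Engine="mysql",
    EngineVersion="8.0.35",              # any supported MySQL 8.0 version
    DBInstanceClass="db.t3.medium",      # illustrative instance class
    AllocatedStorage=20,
    MasterUsername="admin",
    ManageMasterUserPassword=True,       # credentials managed in Secrets Manager
)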
To enable Change Data Capture, you need to configure MySQL binary logging with specific settings:
Set Binary Log Format
  • Create a new Parameter Group in the Amazon RDS console
  • Give Parameter group name and Description
  • Select the appropriate Engine type (e.g. MySQL Community)
  • Choose the matching Parameter group family for your MySQL version (e.g. mysql8.0)
  • Click Create
  • Edit the Parameter Group and change the binlog_format parameter from MIXED to ROW, then Save Changes
  • In the RDS console, Modify the database to associate this Parameter Group with it
  • Apply the changes immediately and reboot your database (a scripted equivalent of these steps is sketched below)
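For reference, the following is a minimal boto3 sketch of the same console steps; the parameter group name and database identifier are placeholder assumptions:

import boto3

rds = boto3.client("rds")

# Create a parameter group for MySQL 8.0 and switch binlog_format to ROW.
rds.create_db_parameter_group(
    DBParameterGroupName="cdc-binlog-row",          # hypothetical name
    DBParameterGroupFamily="mysql8.0",
    Description="Binary logging in ROW format for CDC",
)
rds.modify_db_parameter_group(
    DBParameterGroupName="cdc-binlog-row",
    Parameters=[{
        "ParameterName": "binlog_format",
        "ParameterValue": "ROW",
        "ApplyMethod": "immediate",
    }],
)

# Associate the parameter group with the instance, apply immediately, then reboot.
rds.modify_db_instance(
    DBInstanceIdentifier="school-db",               # hypothetical identifier
    DBParameterGroupName="cdc-binlog-row",
    ApplyImmediately=True,
)
rds.reboot_db_instance(DBInstanceIdentifier="school-db")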
Configure Binary Log Retention
  • Connect to your MySQL database using MySQL command-line client
  • You need to set the binlog retention to more than 24 hours. Execute the following command to set the retention to 30 hours (a scripted alternative is sketched after this list):
call mysql.rds_set_configuration('binlog retention hours', 30)
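As a scripted alternative to the interactive client, here is a minimal sketch that assumes the PyMySQL library (any MySQL client works) and placeholder connection details; it sets the retention and verifies it with mysql.rds_show_configuration:

import pymysql

# Connect to the RDS endpoint; host, user, and password are placeholders.
conn = pymysql.connect(
    host="school-db.xxxxxxxx.us-east-1.rds.amazonaws.com",
    user="admin",
    password="your-password",
    database="mysql",
)
with conn.cursor() as cur:
    # Keep binary logs for 30 hours (must be more than 24 hours).
    cur.execute("call mysql.rds_set_configuration('binlog retention hours', 30)")
    # Verify the setting.
    cur.execute("call mysql.rds_show_configuration")
    print(cur.fetchall())
conn.commit()
conn.close()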
In this tutorial, my database is named school and contains a table called students with a handful of sample records.

Step 2: Create a Target Group

Follow these steps to create a Target Group that will direct traffic to your Amazon RDS instance (a scripted equivalent is sketched at the end of this step):
  1. Navigate to the Target Groups section in the Amazon EC2 console
  2. Select Create target group
  3. Configure the basic settings:
    • For Target type, select IP addresses
    • Enter a name for your Target Group (e.g. rds-target-group)
    • For Protocol: Port, select TCP and enter port 3306
    • Choose your target VPC
  4. Click Next
  5. Configure the target IP address:
    • Obtain your RDS instance IP address by performing a DNS lookup: nslookup <RDS-endpoint>
    • Enter the IP address under Enter an IPv4 address from a VPC subnet
    • Click Include as pending below
  6. Click Create target group
Important: In production environments, you must update the target IP address manually when database failover occurs. While this is beyond the scope of this tutorial, you can refer to Access Amazon RDS across VPCs using AWS PrivateLink and Network Load Balancer for an automated solution using AWS Lambda and Amazon SNS.
The Target Group will initially show no associated load balancer. In the next step, we will create a Network Load Balancer and associate it with this Target Group.
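For reference, a scripted equivalent of this step might look like the sketch below; it assumes boto3, and the VPC ID and RDS endpoint are placeholders:

import socket
import boto3

elbv2 = boto3.client("elbv2")

# Resolve the RDS endpoint to its current IP address (same as the nslookup above).
rds_ip = socket.gethostbyname("school-db.xxxxxxxx.us-east-1.rds.amazonaws.com")  # placeholder endpoint

# Create a TCP target group for port 3306 with IP targets.
tg = elbv2.create_target_group(
    Name="rds-target-group",
    Protocol="TCP",
    Port=3306,
    VpcId="vpc-0123456789abcdef0",   # placeholder VPC ID
    TargetType="ip",
)
tg_arn = tg["TargetGroups"][0]["TargetGroupArn"]

# Register the resolved IP address as a target.
elbv2.register_targets(TargetGroupArn=tg_arn, Targets=[{"Id": rds_ip, "Port": 3306}])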

Step 3: Create a Network Load Balancer

Follow these steps to create an internal Network Load Balancer that will route traffic to your Amazon RDS instance (a scripted equivalent is sketched at the end of this step):
  1. Navigate to the Load Balancers section in the Amazon EC2 console
  2. Select Create load balancer
  3. Choose Network Load Balancer
  4. Configure the basic settings:
    • Enter a name (e.g., rds-load-balancer)
    • For Scheme, select internal
    • In your VPC, select all subnets where your RDS instances reside
  5. Configure the listener:
    • Set Protocol to TCP
    • Set Port to 3306
  6. Configure routing:
    • Select your previously created target group (e.g., rds-target-group)
  7. Review your settings and click Create load balancer
After creation, the Network Load Balancer will automatically begin routing traffic to your RDS instance through the specified target group.
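A rough boto3 equivalent of this step is sketched below; the subnet IDs and the target group ARN are placeholders:

import boto3

elbv2 = boto3.client("elbv2")

# Create an internal Network Load Balancer in the subnets where the RDS instance resides.
nlb = elbv2.create_load_balancer(
    Name="rds-load-balancer",
    Scheme="internal",
    Type="network",
    Subnets=["subnet-aaaa1111", "subnet-bbbb2222"],   # placeholder subnet IDs
)
nlb_arn = nlb["LoadBalancers"][0]["LoadBalancerArn"]

tg_arn = "<target-group-arn>"  # ARN of rds-target-group from the previous step

# Add a TCP:3306 listener that forwards to the target group.
elbv2.create_listener(
    LoadBalancerArn=nlb_arn,
    Protocol="TCP",
    Port=3306,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg_arn}],
)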

Step 4: Create a VPC Endpoint Service

Follow these steps to create and configure a VPC Endpoint Service that will enable Amazon Data Firehose to connect to your RDS database (a scripted equivalent is sketched at the end of this step):
  1. Navigate to the Endpoint services section in the Amazon VPC console
  2. Select Create endpoint service
  3. Configure the basic settings:
    • Enter a name (e.g., rds-endpoint-service)
    • For Load balancer type, select Network
    • Choose your previously created Network Load Balancer (e.g., rds-load-balancer)
  4. Click Create
  5. Configure service access:
    • Select your newly created endpoint service
    • Navigate to the Allow principals tab
    • Click Allow principals
    • Enter the service principal: firehose.amazonaws.com
    • Click Allow principals
  6. Disable acceptance requirements:
    • Select your endpoint service
    • From the Actions menu, choose Modify endpoint acceptance setting
    • Uncheck Acceptance required
    • Click Save changes
Important: Ensuring that Acceptance required is disabled is crucial. Amazon Data Firehose will not be able to establish a connection if this setting remains enabled.
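A scripted version of this step, assuming boto3 and a placeholder Network Load Balancer ARN, might look like the following; it also prints the service name, which you will need in the Firehose source settings later:

import boto3

ec2 = boto3.client("ec2")

# Create the endpoint service backed by the Network Load Balancer,
# with acceptance disabled so Firehose can connect without manual approval.
svc = ec2.create_vpc_endpoint_service_configuration(
    NetworkLoadBalancerArns=["<network-load-balancer-arn>"],   # ARN of rds-load-balancer
    AcceptanceRequired=False,
)
service_id = svc["ServiceConfiguration"]["ServiceId"]

# Allow the Data Firehose service principal to use the endpoint service.
ec2.modify_vpc_endpoint_service_permissions(
    ServiceId=service_id,
    AddAllowedPrincipals=["firehose.amazonaws.com"],
)

# The service name is required later in the Firehose source settings.
print(svc["ServiceConfiguration"]["ServiceName"])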

Create Amazon Data Firehose Stream

After completing the preparation steps, we can now create and configure the Data Firehose stream to replicate data from the MySQL database to Apache Iceberg tables.

Configure Basic Settings

  1. Navigate to Amazon Data Firehose in the AWS Management Console
  2. Select Create Firehose stream
  3. For source and destination, select:
    • Source: MySQL database
    • Destination: Apache Iceberg Tables
  4. Enter a stream name (e.g., rds-stream)

Configure Source Settings

  1. Provide the database connectivity details:
    • Enter the fully qualified DNS name of your Database endpoint
    • Enter the Database VPC endpoint service name
  2. Set Secret name to the RDS database credentials stored in Secrets Manager. The name follows the format rds!db-<id> (see the sketch after this list for one way to look it up).
  3. Configure data capture specifications:
    • Specify the databases, tables, and columns using explicit names or regular expressions
    • Important: Use fully qualified names for tables (format: <database-name>.<table-name>)
  4. Configure watermark settings:
    • Specify a watermark table name (e.g., school.watermark)
    • Note: Data Firehose uses watermarks to track incremental snapshot progress
    • You can either create the watermark table manually or let Data Firehose create it automatically
    • Ensure your database credentials have table creation permissions if using automatic creation
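If you are unsure of the exact secret name, one way to look it up is with boto3; this minimal sketch simply filters Secrets Manager for names starting with rds!db:

import boto3

secrets = boto3.client("secretsmanager")

# List RDS-managed secrets; their names look like rds!db-<id>.
resp = secrets.list_secrets(Filters=[{"Key": "name", "Values": ["rds!db"]}])
for s in resp["SecretList"]:
    print(s["Name"], s.get("Description", ""))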

Configure Destination Settings

  • Check Enable automatic table creation
  • Select an existing S3 bucket as the Warehouse location
  • Optionally enable Schema evolution to handle database schema changes
  • Set the S3 backup bucket (can be the same as warehouse location)

Additional Configuration

  1. Amazon CloudWatch error logging is enabled by default to monitor stream progress and errors
  2. The final step is to set the IAM role. You need to choose an existing role. To keep the example simple, I created a role in advance with the following trust policy and full administrator permissions (a scripted version follows the policy). For production use, refer to Controlling Access with Amazon Data Firehose for least-privilege access control.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "firehose.amazonaws.com"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<account-id>"
        }
      }
    }
  ]
}
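If you prefer to create such a role programmatically, here is a minimal sketch; the role name is a placeholder, and AdministratorAccess is attached only to mirror the simplified example above:

import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "firehose.amazonaws.com"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "<account-id>"}},
    }],
}

# Create the role and attach broad permissions for the walkthrough only;
# use a least-privilege policy in production.
iam.create_role(
    RoleName="firehose-cdc-role",   # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="firehose-cdc-role",
    PolicyArn="arn:aws:iam::aws:policy/AdministratorAccess",
)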
Finally, everything is set up. Click Create Firehose stream, and your Firehose stream will be ready in a few minutes.

Monitor Firehose Stream

After creating your Data Firehose stream, monitoring its performance and health is crucial. Once the Firehose stream is created, it takes a few minutes to connect to the data source and start ingesting data. After that, the status should show as Active.
In the Monitoring tab, you can see several metrics, such as Records read from database source (Sum). You should also pay attention to the Error logs status to make sure there are no errors. If there are errors, go to the Destination error logs and Backup error logs for details.
You should also check the Source data snapshots tab. In Table snapshot overview, you can see the number of tables in the Complete state and the corresponding Snapshot status for each table.
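You can also poll the stream status programmatically; a quick boto3 sketch:

import boto3

firehose = boto3.client("firehose")

# Check that the stream has finished provisioning and is ingesting data.
resp = firehose.describe_delivery_stream(DeliveryStreamName="rds-stream")
print(resp["DeliveryStreamDescription"]["DeliveryStreamStatus"])   # expect ACTIVE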

Query the Iceberg Table Data

Now the Iceberg table should have been created in the AWS Glue Data Catalog. We can query the data starting from the AWS Glue console.
Open the AWS Glue console. In the left navigation menu, choose Data Catalog → Databases → Tables.
You should be able to see the tables. Select a table to view its Table details; note that the table format is Apache Iceberg version 2. You can also check the Schema here.
Choose View data from the top-right Actions button. A pop-up will direct you to Athena to preview the data. Click Proceed.
You will be directed to the Athena Query editor, where the pre-built query runs automatically, and you will soon see the table data.
You can insert some new records into the database table, wait a few minutes, and then query the data again to confirm that CDC is working and capturing data changes in near real time.
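You can run the same check from code as well. The sketch below assumes boto3, that the Glue database is named school (adjust it to the database shown in your Data Catalog), and that your Athena workgroup already has a query result location configured (see the Troubleshooting Guide if not):

import time
import boto3

athena = boto3.client("athena")

# Start a query against the replicated Iceberg table.
start = athena.start_query_execution(
    QueryString="SELECT * FROM students LIMIT 10",
    QueryExecutionContext={"Database": "school", "Catalog": "AwsDataCatalog"},  # assumed database name
    WorkGroup="primary",
)
qid = start["QueryExecutionId"]

# Wait for the query to finish, then print the rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])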

Troubleshooting Guide

Here are solutions to common issues you might encounter when using Amazon Data Firehose for database replication:

No Data Ingestion

If you observe:
  • No errors in the logs
  • No metrics in the Monitoring tab charts (including Records read and Bytes read)
  • Tables not showing Complete status in Source data snapshots
Resolution: After fixing all known issues, if the stream still shows no activity, create a new Data Firehose stream. This often resolves the issue immediately.

Destination Error Logs

Navigate to Amazon Data Firehose → Destination error logs to review detailed error messages.
If you see errors like “Firehose was not able to connect to Database endpoint. Please check that the networking set up of the Database endpoint and VPC Endpoint Service is correct, and that the SSL certificate is signed by trusted CA. You may need to run 'mysqladmin flush-hosts' to unblock the connectivity from Firehose”, connect to your MySQL database from the command line and run the aforementioned command to unblock connectivity from Firehose. Note: depending on the MySQL server version, the actual command might be FLUSH HOSTS instead.

Amazon Athena Query Issues

If you encounter query execution errors in Athena, the resolution is to configure the query result location (a one-call SDK equivalent is sketched after this list):
  • Navigate to Workgroups in the left navigation bar
  • Select the primary workgroup
  • Choose Edit
  • Expand Query result configuration - Optional
  • Specify an S3 bucket for query results
Best Practice: Create a dedicated S3 bucket for Athena query results to maintain better organization and control over your query output data.
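The same change can be made with a single boto3 call; the bucket name below is a placeholder:

import boto3

athena = boto3.client("athena")

# Point the primary workgroup's query results at a dedicated S3 location.
athena.update_work_group(
    WorkGroup="primary",
    ConfigurationUpdates={
        "ResultConfigurationUpdates": {
            "OutputLocation": "s3://my-athena-query-results/"   # placeholder bucket
        }
    },
)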

Conclusion

In this post, we explored how Amazon Data Firehose enables near real-time data replication from Amazon RDS for MySQL databases to Apache Iceberg tables in your Amazon S3 data lake. We provided a comprehensive guide covering:
  • Setting up secure connectivity using Network Load Balancer and VPC Endpoint Service
  • Configuring and creating a Data Firehose delivery stream
  • Monitoring the replication process
  • Querying replicated data using AWS Glue and Amazon Athena
This solution enables you to maintain fresh, analytics-ready data in your data lake without implementing complex ETL processes or managing infrastructure. The fully managed nature of Data Firehose helps reduce operational overhead while ensuring reliable data delivery with minimal impact on your source database.
This implementation serves as a foundation for building more sophisticated data analytics solutions. You can extend this pattern to support multiple databases, implement more complex transformations, or integrate with other AWS analytics services to derive deeper insights from your data.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
