
Near Real-Time Database Replication to Apache Iceberg Table on S3 Using Amazon Data Firehose

Maintain fresh, analytics-ready data in your data lake without implementing complex ETL processes or managing infrastructure

Pengfei Zhang
Amazon Employee
Published Jan 28, 2025
In November 2024, Amazon Web Services (AWS) announced a powerful new capability in Amazon Data Firehose that enables near real-time Change Data Capture (CDC) from operational databases like PostgreSQL and MySQL, with direct replication to Apache Iceberg tables in Amazon Simple Storage Service (Amazon S3).
Traditionally, to achieve near real-time data replication from databases to Amazon S3 data lakes, AWS customers needed to develop custom extract, transform, and load (ETL) jobs that periodically poll databases for changes. This approach presented two significant challenges: the ETL processes often impacted database transaction performance, and the batch-oriented nature of these jobs introduced latency, sometimes delaying data availability for analytics by several hours.
This new capability provides a simple, scalable, and fully managed solution for streaming database updates. Amazon Data Firehose leverages native database replication logs to capture changes, significantly reducing the impact on database transaction performance. You can configure a Data Firehose delivery stream in minutes to capture and process CDC events from your database. This enables seamless replication of data from various databases into Apache Iceberg tables on Amazon S3, making fresh data readily available for large-scale analytics and machine learning (ML) applications.
In this blog post, I will demonstrate this capability using an Amazon Relational Database Service (Amazon RDS) for MySQL database. I will walk you through the step-by-step process of configuring an Amazon Data Firehose delivery stream and validating the data synchronization. While this tutorial uses Amazon RDS, the solution also works with self-managed databases. Although we will use the AWS Management Console for this walkthrough, you can achieve the same results using the AWS Command Line Interface (AWS CLI), AWS SDKs, or AWS CloudFormation.

Preparation

The initial setup primarily focuses on establishing secure database connectivity. To enable Amazon Data Firehose to access your database within your Virtual Private Cloud (VPC) without exposing traffic to the internet, we need to configure an AWS PrivateLink VPC endpoint service. This setup is necessary because Data Firehose runs in AWS's service VPC and requires secure access to your database environment in your VPC.
The blog post Access Amazon RDS across VPCs using AWS PrivateLink and Network Load Balancer demonstrates how to create a private connection between your VPC resources and an external consumer. Following the same approach, the high-level steps to enable connectivity between the RDS database and Data Firehose are:
  1. Expose the RDS database as a service that runs behind a Network Load Balancer in your VPC
  2. Have Data Firehose connect to that service through a VPC interface endpoint
In this way, all traffic remains within the AWS network infrastructure, and you can control access through endpoint policies and service permissions.

Step 1: Prepare your RDS database

This tutorial assumes you already have an RDS for MySQL database created. If not, create one through the Amazon RDS console. During database creation, for Credentials Settings, choose Managed in AWS Secrets Manager; this is required later when you create the Data Firehose stream.
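If you prefer to script the database creation, here is a minimal sketch using the AWS SDK for Python (boto3). The identifier, instance class, and engine version are illustrative assumptions; the key detail is ManageMasterUserPassword=True, which stores the master credentials in Secrets Manager:

import boto3

rds = boto3.client("rds")

# Minimal sketch: create a MySQL instance whose master credentials are
# managed in AWS Secrets Manager (needed later when creating the Firehose stream).
rds.create_db_instance(
    DBInstanceIdentifier="school-db",    # hypothetical identifier
    Engine="mysql",
    EngineVersion="8.0.35",              # any supported MySQL 8.0 version
    DBInstanceClass="db.t3.medium",      # illustrative instance class
    AllocatedStorage=20,
    MasterUsername="admin",
    ManageMasterUserPassword=True,       # credentials managed in Secrets Manager
)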
To enable Change Data Capture, you need to configure MySQL binary logging with specific settings:
Set Binary Log Format
  • Create a new Parameter Group in the Amazon RDS console
  • Give Parameter group name and Description
  • Select the appropriate Engine type (e.g. MySQL Community)
  • Choose the matching Parameter group family for your MySQL version (e.g. mysql8.0)
  • Click Create
  • Edit the Parameter Group and change the binlog_format parameter from MIXED to ROW, then Save Changes
  • In the RDS console, Modify the database to associate this Parameter Group with it
  • Apply the changes immediately and reboot your database (a scripted equivalent of these steps is sketched below)
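For reference, the following is a minimal boto3 sketch of the same console steps; the parameter group name and database identifier are placeholder assumptions:

import boto3

rds = boto3.client("rds")

# Create a parameter group for MySQL 8.0 and switch binlog_format to ROW.
rds.create_db_parameter_group(
    DBParameterGroupName="cdc-binlog-row",          # hypothetical name
    DBParameterGroupFamily="mysql8.0",
    Description="Binary logging in ROW format for CDC",
)
rds.modify_db_parameter_group(
    DBParameterGroupName="cdc-binlog-row",
    Parameters=[{
        "ParameterName": "binlog_format",
        "ParameterValue": "ROW",
        "ApplyMethod": "immediate",
    }],
)

# Associate the parameter group with the instance, apply immediately, then reboot.
rds.modify_db_instance(
    DBInstanceIdentifier="school-db",               # hypothetical identifier
    DBParameterGroupName="cdc-binlog-row",
    ApplyImmediately=True,
)
rds.reboot_db_instance(DBInstanceIdentifier="school-db")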
Configure Binary Log Retention
  • Connect to your MySQL database using MySQL command-line client
  • You need to set the binlog retention to more than 24 hours. Execute the following command to set the retention to 30 hours (a scripted alternative is sketched after this list):
call mysql.rds_set_configuration('binlog retention hours', 30)
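As a scripted alternative to the interactive client, here is a minimal sketch that assumes the PyMySQL library (any MySQL client works) and placeholder connection details; it sets the retention and verifies it with mysql.rds_show_configuration:

import pymysql

# Connect to the RDS endpoint; host, user, and password are placeholders.
conn = pymysql.connect(
    host="school-db.xxxxxxxx.us-east-1.rds.amazonaws.com",
    user="admin",
    password="your-password",
    database="mysql",
)
with conn.cursor() as cur:
    # Keep binary logs for 30 hours (must be more than 24 hours).
    cur.execute("call mysql.rds_set_configuration('binlog retention hours', 30)")
    # Verify the setting.
    cur.execute("call mysql.rds_show_configuration")
    print(cur.fetchall())
conn.commit()
conn.close()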
In this tutorial, my database is named school and contains a table called students with a handful of sample records.

Step 2: Create a Target Group

Follow these steps to create a Target Group that will direct traffic to your Amazon RDS instance (a scripted equivalent is sketched at the end of this step):
  1. Navigate to the Target Groups section in the Amazon EC2 console
  2. Select Create target group
  3. Configure the basic settings:
    • For Target type, select IP addresses
    • Enter a name for your Target Group (e.g. rds-target-group)
    • For Protocol: Port, select TCP and enter port 3306
    • Choose your target VPC
  4. Click Next
  5. Configure the target IP address:
    • Obtain your RDS instance IP address by performing a DNS lookup: nslookup <RDS-endpoint>
    • Enter the IP address under Enter an IPv4 address from a VPC subnet
    • Click Include as pending below
  6. Click Create target group
Important: In production environments, you must update the target IP address manually when database failover occurs. While this is beyond the scope of this tutorial, you can refer to Access Amazon RDS across VPCs using AWS PrivateLink and Network Load Balancer for an automated solution using AWS Lambda and Amazon SNS.
The Target Group will initially show no associated load balancer. In the next step, we will create a Network Load Balancer and associate it with this Target Group.
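For reference, a scripted equivalent of this step might look like the sketch below; it assumes boto3, and the VPC ID and RDS endpoint are placeholders:

import socket
import boto3

elbv2 = boto3.client("elbv2")

# Resolve the RDS endpoint to its current IP address (same as the nslookup above).
rds_ip = socket.gethostbyname("school-db.xxxxxxxx.us-east-1.rds.amazonaws.com")  # placeholder endpoint

# Create a TCP target group for port 3306 with IP targets.
tg = elbv2.create_target_group(
    Name="rds-target-group",
    Protocol="TCP",
    Port=3306,
    VpcId="vpc-0123456789abcdef0",   # placeholder VPC ID
    TargetType="ip",
)
tg_arn = tg["TargetGroups"][0]["TargetGroupArn"]

# Register the resolved IP address as a target.
elbv2.register_targets(TargetGroupArn=tg_arn, Targets=[{"Id": rds_ip, "Port": 3306}])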

Step 3: Create a Network Load Balancer

Follow these steps to create an internal Network Load Balancer that will route traffic to your Amazon RDS instance (a scripted equivalent is sketched at the end of this step):
  1. Navigate to the Load Balancers section in the Amazon EC2 console
  2. Select Create load balancer
  3. Choose Network Load Balancer
  4. Configure the basic settings:
    • Enter a name (e.g., rds-load-balancer)
    • For Scheme, select internal
    • In your VPC, select all subnets where your RDS instances reside
  5. Configure the listener:
    • Set Protocol to TCP
    • Set Port to 3306
  6. Configure routing:
    • Select your previously created target group (e.g., rds-target-group)
  7. Review your settings and click Create load balancer
After creation, the Network Load Balancer will automatically begin routing traffic to your RDS instance through the specified target group.
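A rough boto3 equivalent of this step is sketched below; the subnet IDs and the target group ARN are placeholders:

import boto3

elbv2 = boto3.client("elbv2")

# Create an internal Network Load Balancer in the subnets where the RDS instance resides.
nlb = elbv2.create_load_balancer(
    Name="rds-load-balancer",
    Scheme="internal",
    Type="network",
    Subnets=["subnet-aaaa1111", "subnet-bbbb2222"],   # placeholder subnet IDs
)
nlb_arn = nlb["LoadBalancers"][0]["LoadBalancerArn"]

tg_arn = "<target-group-arn>"  # ARN of rds-target-group from the previous step

# Add a TCP:3306 listener that forwards to the target group.
elbv2.create_listener(
    LoadBalancerArn=nlb_arn,
    Protocol="TCP",
    Port=3306,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg_arn}],
)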

Step 4: Create a VPC Endpoint Service

Follow these steps to create and configure a VPC Endpoint Service that will enable Amazon Data Firehose to connect to your RDS database (a scripted equivalent is sketched at the end of this step):
  1. Navigate to the Endpoint services section in the Amazon VPC console
  2. Select Create endpoint service
  3. Configure the basic settings:
    • Enter a name (e.g., rds-endpoint-service)
    • For Load balancer type, select Network
    • Choose your previously created Network Load Balancer (e.g., rds-load-balancer)
  4. Click Create
  5. Configure service access:
    • Select your newly created endpoint service
    • Navigate to the Allow principals tab
    • Click Allow principals
    • Enter the service principal: firehose.amazonaws.com
    • Click Allow principals
  6. Disable acceptance requirements:
    • Select your endpoint service
    • From the Actions menu, choose Modify endpoint acceptance setting
    • Uncheck Acceptance required
    • Click Save changes
Important: Ensuring that Acceptance required is disabled is crucial. Amazon Data Firehose will not be able to establish a connection if this setting remains enabled.
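A scripted version of this step, assuming boto3 and a placeholder Network Load Balancer ARN, might look like the following; it also prints the service name, which you will need in the Firehose source settings later:

import boto3

ec2 = boto3.client("ec2")

# Create the endpoint service backed by the Network Load Balancer,
# with acceptance disabled so Firehose can connect without manual approval.
svc = ec2.create_vpc_endpoint_service_configuration(
    NetworkLoadBalancerArns=["<network-load-balancer-arn>"],   # ARN of rds-load-balancer
    AcceptanceRequired=False,
)
service_id = svc["ServiceConfiguration"]["ServiceId"]

# Allow the Data Firehose service principal to use the endpoint service.
ec2.modify_vpc_endpoint_service_permissions(
    ServiceId=service_id,
    AddAllowedPrincipals=["firehose.amazonaws.com"],
)

# The service name is required later in the Firehose source settings.
print(svc["ServiceConfiguration"]["ServiceName"])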

Create Amazon Data Firehose Stream

After completing the preparation steps, we can now create and configure the Data Firehose stream to replicate data from the MySQL database to Apache Iceberg tables.

Configure Basic Settings

  1. Navigate to Amazon Data Firehose in the AWS Management Console
  2. Select Create Firehose stream
  3. For source and destination, select:
    • Source: MySQL database
    • Destination: Apache Iceberg Tables
  4. Enter a stream name (e.g., rds-stream)

Configure Source Settings

  1. Provide the database connectivity details:
    • Enter the fully qualified DNS name of your Database endpoint
    • Enter the Database VPC endpoint service name
  2. Set Secret name to the RDS database credentials stored in Secrets Manager. The name follows the format rds!db-<id> (see the sketch after this list for one way to look it up).
  3. Configure data capture specifications:
    • Specify the databases, tables, and columns using explicit names or regular expressions
    • Important: Use fully qualified names for tables (format: <database-name>.<table-name>)
  4. Configure watermark settings:
    • Specify a watermark table name (e.g., school.watermark)
    • Note: Data Firehose uses watermarks to track incremental snapshot progress
    • You can either create the watermark table manually or let Data Firehose create it automatically
    • Ensure your database credentials have table creation permissions if using automatic creation
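If you are unsure of the exact secret name, one way to look it up is with boto3; this minimal sketch simply filters Secrets Manager for names starting with rds!db:

import boto3

secrets = boto3.client("secretsmanager")

# List RDS-managed secrets; their names look like rds!db-<id>.
resp = secrets.list_secrets(Filters=[{"Key": "name", "Values": ["rds!db"]}])
for s in resp["SecretList"]:
    print(s["Name"], s.get("Description", ""))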

Configure Destination Settings

  • Check Enable automatic table creation
  • Select an existing S3 bucket as the Warehouse location
  • Optionally enable Schema evolution to handle database schema changes
  • Set the S3 backup bucket (can be the same as warehouse location)

Additional Configuration

  1. Amazon CloudWatch error logging is enabled by default to monitor stream progress and errors
  2. The final step is to set the IAM role. You need to choose an existing role. To keep the example simple, I created a role in advance with the following trust policy and full administrator permissions (a scripted version follows the policy). For production use, refer to Controlling Access with Amazon Data Firehose for least-privilege access control.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "firehose.amazonaws.com"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<account-id>"
        }
      }
    }
  ]
}
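If you prefer to create such a role programmatically, here is a minimal sketch; the role name is a placeholder, and AdministratorAccess is attached only to mirror the simplified example above:

import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "firehose.amazonaws.com"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "<account-id>"}},
    }],
}

# Create the role and attach broad permissions for the walkthrough only;
# use a least-privilege policy in production.
iam.create_role(
    RoleName="firehose-cdc-role",   # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="firehose-cdc-role",
    PolicyArn="arn:aws:iam::aws:policy/AdministratorAccess",
)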
Finally, everything is set up. Click Create Firehose stream, and your Firehose stream will be ready in a few minutes.

Monitor Firehose Stream

After creating your Data Firehose stream, monitoring its performance and health is crucial. Once the Firehose stream is created, it takes a few minutes to connect to the data source and start ingesting data. After that, the status should show as Active.
In the Monitoring tab, you can see several metrics, such as Records read from database source (Sum). You should also pay attention to the Error logs status to make sure there are no errors. If there are errors, go to the Destination error logs and Backup error logs for details.
You should also check the Source data snapshots tab. In Table snapshot overview, you can see the number of tables in the Complete state and the corresponding Snapshot status for each table.
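You can also poll the stream status programmatically; a quick boto3 sketch:

import boto3

firehose = boto3.client("firehose")

# Check that the stream has finished provisioning and is ingesting data.
resp = firehose.describe_delivery_stream(DeliveryStreamName="rds-stream")
print(resp["DeliveryStreamDescription"]["DeliveryStreamStatus"])   # expect ACTIVE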

Query the Iceberg Table Data

Now the Iceberg table should have been created in the AWS Glue Data Catalog. We can query the data starting from the AWS Glue console.
Open the AWS Glue console. In the left navigation menu, choose Data Catalog → Databases → Tables.
You should be able to see the tables. Select a table to view its Table details; note that the table format is Apache Iceberg version 2. You can also check the Schema here.
Choose View data from the top-right Actions button. A pop-up will direct you to Athena to preview the data. Click Proceed.
You will be directed to the Athena Query editor, where the pre-built query runs automatically, and you will soon see the table data.
You can insert some new records into the database table, wait a few minutes, and then query the data again to confirm that CDC is working and capturing data changes in near real time.
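You can run the same check from code as well. The sketch below assumes boto3, that the Glue database is named school (adjust it to the database shown in your Data Catalog), and that your Athena workgroup already has a query result location configured (see the Troubleshooting Guide if not):

import time
import boto3

athena = boto3.client("athena")

# Start a query against the replicated Iceberg table.
start = athena.start_query_execution(
    QueryString="SELECT * FROM students LIMIT 10",
    QueryExecutionContext={"Database": "school", "Catalog": "AwsDataCatalog"},  # assumed database name
    WorkGroup="primary",
)
qid = start["QueryExecutionId"]

# Wait for the query to finish, then print the rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])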

Troubleshooting Guide

Here are solutions to common issues you might encounter when using Amazon Data Firehose for database replication:

No Data Ingestion

If you observe:
  • No errors in the logs
  • No metrics in the Monitoring tab charts (including Records read and Bytes read)
  • Tables not showing Complete status in Source data snapshots
Resolution: After fixing all known issues, if the stream still shows no activity, create a new Data Firehose stream. This often resolves the issue immediately.

Destination Error Logs

Navigate to Amazon Data Firehose → Destination error logs to review detailed error messages.
If you see errors like “Firehose was not able to connect to Database endpoint. Please check that the networking set up of the Database endpoint and VPC Endpoint Service is correct, and that the SSL certificate is signed by trusted CA. You may need to run 'mysqladmin flush-hosts' to unblock the connectivity from Firehose”, connect to your MySQL database from the command line and run the aforementioned command to unblock connectivity from Firehose. Note: depending on the MySQL server version, the actual command might be FLUSH HOSTS instead.

Amazon Athena Query Issues

If you encounter query execution errors in Athena, the resolution is to configure the query result location (a one-call SDK equivalent is sketched after this list):
  • Navigate to Workgroups in the left navigation bar
  • Select the primary workgroup
  • Choose Edit
  • Expand Query result configuration - Optional
  • Specify an S3 bucket for query results
Best Practice: Create a dedicated S3 bucket for Athena query results to maintain better organization and control over your query output data.
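The same change can be made with a single boto3 call; the bucket name below is a placeholder:

import boto3

athena = boto3.client("athena")

# Point the primary workgroup's query results at a dedicated S3 location.
athena.update_work_group(
    WorkGroup="primary",
    ConfigurationUpdates={
        "ResultConfigurationUpdates": {
            "OutputLocation": "s3://my-athena-query-results/"   # placeholder bucket
        }
    },
)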

Conclusion

In this post, we explored how Amazon Data Firehose enables near real-time data replication from Amazon RDS for MySQL databases to Apache Iceberg tables in your Amazon S3 data lake. We provided a comprehensive guide covering:
  • Setting up secure connectivity using Network Load Balancer and VPC Endpoint Service
  • Configuring and creating a Data Firehose delivery stream
  • Monitoring the replication process
  • Querying replicated data using AWS Glue and Amazon Athena
This solution enables you to maintain fresh, analytics-ready data in your data lake without implementing complex ETL processes or managing infrastructure. The fully managed nature of Data Firehose helps reduce operational overhead while ensuring reliable data delivery with minimal impact on your source database.
This implementation serves as a foundation for building more sophisticated data analytics solutions. You can extend this pattern to support multiple databases, implement more complex transformations, or integrate with other AWS analytics services to derive deeper insights from your data.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
