
Near Real-Time Database Replication to Apache Iceberg Table on S3 Using Amazon Data Firehose
Maintain fresh, analytics-ready data in your data lake without building complex ETL pipelines or managing infrastructure
- Expose RDS as a service that runs behind a Network Load Balancer in your VPC
- Data Firehose connects to it through a VPC interface endpoint
- In this way, all traffic remains within the AWS network infrastructure
- You can control access through endpoint policies and service permissions
- Create a new Parameter Group in the Amazon RDS console
- Enter a Parameter group name and Description
- Select the appropriate Engine type (e.g., MySQL Community)
- Choose the matching Parameter group family for your MySQL version (e.g., mysql8.0)
- Click Create
- Edit the Parameter Group and modify the binlog_format parameter from MIXED to ROW, and Save Changes
- In the RDS console, modify the database to associate this Parameter Group with it
- Apply changes immediately and reboot your database
- Connect to your MySQL database using MySQL command-line client
- You must set the binlog retention to more than 24 hours. Execute the following command to set the retention to 30 hours:
call mysql.rds_set_configuration('binlog retention hours', 30);
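Data Firehose needs the binlog retained for more than 24 hours, and RDS for MySQL caps the retention at 2160 hours (90 days). A small sketch, assuming those bounds, that validates a value and renders the configuration call:

```python
def binlog_retention_sql(hours: int) -> str:
    """Render the rds_set_configuration call, enforcing the Firehose
    minimum (> 24 h) and the RDS for MySQL maximum (2160 h / 90 days)."""
    if not 24 < hours <= 2160:
        raise ValueError("binlog retention must be > 24 and <= 2160 hours")
    return f"call mysql.rds_set_configuration('binlog retention hours', {hours});"

print(binlog_retention_sql(30))
# call mysql.rds_set_configuration('binlog retention hours', 30);
```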
- Navigate to the Target Groups section in the Amazon EC2 console
- Select Create target group
- Configure the basic settings:
- For Target type, select IP addresses
- Enter a name for your Target Group (e.g., rds-target-group)
- For Protocol: Port, select TCP and enter port 3306
- Choose your target VPC
- Click Next
- Configure the target IP address:
- Obtain your RDS instance IP address by performing a DNS lookup:
nslookup <RDS-endpoint>
- Enter the IP address in Enter an IPv4 address from a VPC subnet
- Click Include as pending below
- Click Create target group
Important: In production environments, you must update the target IP address manually when database failover occurs. While this is beyond the scope of this tutorial, you can refer to Access Amazon RDS across VPCs using AWS PrivateLink and Network Load Balancer for an automated solution using AWS Lambda and Amazon SNS.
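The target group registers a fixed IP, so you need the address currently behind the RDS endpoint's DNS name; the same lookup can be rerun later (for example from the Lambda function mentioned above) to detect a failover. A minimal sketch using Python's standard library, equivalent to the nslookup step; the endpoint name shown is a placeholder:

```python
import socket

def resolve_rds_ip(endpoint: str) -> str:
    """Return the current IPv4 address behind an RDS endpoint DNS name,
    equivalent to `nslookup <RDS-endpoint>`."""
    return socket.gethostbyname(endpoint)

# Hypothetical endpoint name; substitute your own instance's endpoint:
# print(resolve_rds_ip("mydb.abc123xyz.us-east-1.rds.amazonaws.com"))
```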
- Navigate to the Load Balancers section in the Amazon EC2 console
- Select Create load balancer
- Choose Network Load Balancer
- Configure the basic settings:
- Enter a name (e.g., rds-load-balancer)
- For Scheme, select Internal
- In your VPC, select all subnets where your RDS instances reside
- Configure the listener:
- Set Protocol to TCP
- Set Port to 3306
- Configure routing:
- Select your previously created target group (e.g., rds-target-group)
- Review your settings and click Create load balancer
- Navigate to the Endpoint services section in the Amazon VPC console
- Select Create endpoint service
- Configure the basic settings:
- Enter a name (e.g., rds-endpoint-service)
- For Load balancer type, select Network
- Choose your previously created Network Load Balancer (e.g., rds-load-balancer)
- Click Create
- Configure service access:
- Select your newly created endpoint service
- Navigate to the Allow principals tab
- Click Allow principals
- Enter the service principal for Amazon Data Firehose: firehose.amazonaws.com
- Click Allow principals
- Disable acceptance requirements:
- Select your endpoint service
- From the Actions menu, choose Modify endpoint acceptance setting
- Uncheck Acceptance required
- Click Save changes
Important: Ensuring that Acceptance required is disabled is crucial. Amazon Data Firehose will not be able to establish a connection if this setting remains enabled.
- Navigate to Amazon Data Firehose in the AWS Management Console
- Select Create Firehose stream
- For source and destination, select:
- Source: MySQL database
- Destination: Apache Iceberg Tables
- Enter a stream name (e.g., rds-stream)
- Provide the database connectivity details:
- Enter the fully qualified DNS name of your Database endpoint
- Enter the Database VPC endpoint service name
- Set Secret name to the RDS database credentials secret stored in AWS Secrets Manager; the default name has the form rds!db-<id>
- Configure data capture specifications:
- Specify the databases, tables, and columns using explicit names or regular expressions
- Important: Use fully qualified names for tables (format: <database-name>.<table-name>)
- Specify a watermark table name (e.g., school.watermark)
- Note: Data Firehose uses watermarks to track incremental snapshot progress
- You can either create the watermark table manually or let Data Firehose create it automatically
- Ensure your database credentials have table creation permissions if using automatic creation
- Check Enable automatic table creation
- Select an existing S3 bucket as the Warehouse location
- Optionally enable Schema evolution to handle database schema changes
- Set the S3 backup bucket (can be the same as warehouse location)
- Amazon CloudWatch error logging is enabled by default to monitor stream progress and errors
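The table specifications entered above must use the fully qualified <database-name>.<table-name> form. A quick sketch for checking a specification before pasting it into the console; the identifier pattern is an assumption covering unquoted MySQL names, so adjust it if you use quoted or non-ASCII identifiers:

```python
import re

# Unquoted MySQL identifiers: letters, digits, '_' and '$' (an assumption;
# quoted identifiers permit a wider character set).
_FQN = re.compile(r"^[A-Za-z0-9_$]+\.[A-Za-z0-9_$]+$")

def is_fully_qualified(spec: str) -> bool:
    """True if spec looks like <database-name>.<table-name>."""
    return bool(_FQN.match(spec))

print(is_fully_qualified("school.watermark"))  # True
print(is_fully_qualified("watermark"))         # False: database name missing
```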
- The final step is to set the IAM role. You need to choose an existing role. To keep this example simple, we created a role in advance with the following trust policy and full administrator permissions. For production use, refer to Controlling Access with Amazon Data Firehose for least-privilege access control.
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "",
    "Effect": "Allow",
    "Principal": {
      "Service": "firehose.amazonaws.com"
    },
    "Action": "sts:AssumeRole",
    "Condition": {
      "StringEquals": {
        "sts:ExternalId": "<account-id>"
      }
    }
  }]
}
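The same trust policy can be generated programmatically when you script role creation; a sketch that fills in your account ID as the external ID (the sts:ExternalId condition guards against the confused-deputy problem):

```python
import json

def firehose_trust_policy(account_id: str) -> str:
    """Render the assume-role trust policy shown above for a given account."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "",
            "Effect": "Allow",
            "Principal": {"Service": "firehose.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": account_id}},
        }],
    }
    return json.dumps(policy, indent=2)

print(firehose_trust_policy("123456789012"))
```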
A stream that is silently stuck may show all of the following symptoms:
- No errors in the logs
- No metrics in the Monitoring tab charts (including Records read and Bytes read)
- Tables not showing Complete status in Source data snapshots
If you see errors like “Firehose was not able to connect to Database endpoint. Please check that the networking set up of the Database endpoint and VPC Endpoint Service is correct, and that the SSL certificate is signed by trusted CA. You may need to run 'mysqladmin flush-hosts' to unblock the connectivity from Firehose”, go to your MySQL command line and run the aforementioned command to unblock connectivity from Firehose. Note: depending on the MySQL server version, the actual command might be FLUSH HOSTS instead.
If Amazon Athena reports that no query result location is configured, the resolution is to set one:
- Navigate to Workgroups in the left navigation bar
- Select the primary workgroup
- Choose Edit
- Expand Query result configuration - Optional
- Specify an S3 bucket for query results
Best Practice: Create a dedicated S3 bucket for Athena query results to maintain better organization and control over your query output data.
In this post, we covered:
- Setting up secure connectivity using Network Load Balancer and VPC Endpoint Service
- Configuring and creating a Data Firehose delivery stream
- Monitoring the replication process
- Querying replicated data using AWS Glue and Amazon Athena
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.