
Migrating Large Data from Azure Blob Storage to Amazon S3
This guide outlines best practices, step-by-step methods, and practical examples for a successful migration.
Marin Frankovic
Amazon Employee
Published May 28, 2025
Migrating large datasets from Azure Blob Storage to Amazon S3 is a common requirement for organizations adopting a multi-cloud strategy or moving workloads to AWS. However, the process can be complex due to the scale of data, network constraints, security requirements, and the need for minimal disruption. The following best practices apply regardless of the transfer method you choose:
- Region Alignment: Deploy AWS resources (like EC2 or DataSync agents) in the same region as your Azure Blob Storage to minimize latency and data transfer costs.
- Network Optimization: Avoid routing S3 traffic through a NAT Gateway, which adds per-GB processing fees; use a VPC gateway endpoint for S3 instead to cut costs and keep traffic on the AWS network (see the sketch after this list).
- Resource Selection: Consider AWS Graviton2-based EC2 instances for better cost efficiency and performance when using tools like Rclone.
- Security: Use least privilege principles for IAM roles (AWS) and Azure AD service principals (Azure) to secure access to storage resources.
- Data Validation: Always verify the integrity and completeness of migrated data before decommissioning source resources.
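A gateway endpoint can be created with a single AWS CLI call. This is a minimal sketch; the VPC ID, route table ID, and region below are placeholders for your own values.

```bash
# Create an S3 gateway endpoint so instance-to-S3 traffic bypasses the NAT gateway.
# vpc-..., rtb-..., and us-east-1 are placeholders.
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc1234567890def \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0abc1234567890def
```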
AWS DataSync is a fully managed transfer service that supports Azure Blob Storage as a source location.
- Deploy DataSync Agent
- Deploy the DataSync agent close to your Azure Blob Storage, such as on an Azure VM using the provided VHD image.
Detailed steps: https://docs.aws.amazon.com/datasync/latest/userguide/deploy-agents.html
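Once the agent VM is running in Azure and you have retrieved its activation key (see the deployment guide above), register it with DataSync. A minimal sketch; the agent name, activation key, and region are placeholders.

```bash
# Register (activate) the agent that was deployed from the VHD in Azure.
aws datasync create-agent \
  --agent-name azure-blob-agent \
  --activation-key AAAAA-BBBBB-CCCCC-DDDDD-EEEEE \
  --region us-east-1
```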
- Configure DataSync Task
- Set Azure Blob Storage as the source and Amazon S3 as the destination in the DataSync console.
- Provide necessary credentials and configure transfer settings (e.g., filters, bandwidth limits).
Detailed steps: https://docs.aws.amazon.com/datasync/latest/userguide/create-task-how-to.html
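The same configuration can be scripted with the AWS CLI. This is a sketch only: the container URL, SAS token, bucket name, IAM role, and all ARNs are placeholders, and SAS is just one of the supported authentication options.

```bash
# 1. Register the Azure Blob container as the source location (SAS authentication).
aws datasync create-location-azure-blob \
  --container-url https://mystorageacct.blob.core.windows.net/mycontainer \
  --authentication-type SAS \
  --sas-configuration Token='<sas-token>' \
  --agent-arns arn:aws:datasync:us-east-1:111122223333:agent/agent-0abc1234567890def

# 2. Register the S3 bucket as the destination, using a bucket access role.
aws datasync create-location-s3 \
  --s3-bucket-arn arn:aws:s3:::my-migration-bucket \
  --s3-config BucketAccessRoleArn=arn:aws:iam::111122223333:role/DataSyncS3Role

# 3. Create the task that links source and destination.
aws datasync create-task \
  --source-location-arn arn:aws:datasync:us-east-1:111122223333:location/loc-0aaa1111222233334 \
  --destination-location-arn arn:aws:datasync:us-east-1:111122223333:location/loc-0bbb1111222233334 \
  --name azure-to-s3-migration
```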
- Start and Monitor Transfer
- Initiate the DataSync task.
- Monitor progress and logs via the AWS console.
Detailed steps: https://docs.aws.amazon.com/datasync/latest/userguide/monitoring-overview.html
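The task can also be started and polled from the CLI; the task and execution ARNs below are placeholders for the resources created in the previous step.

```bash
# Start an execution of the task created above.
aws datasync start-task-execution \
  --task-arn arn:aws:datasync:us-east-1:111122223333:task/task-0abc1234567890def

# Poll the execution for status, bytes transferred, and errors.
aws datasync describe-task-execution \
  --task-execution-arn arn:aws:datasync:us-east-1:111122223333:task/task-0abc1234567890def/execution/exec-0abc1234567890def
```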
Advantages:
- Built-in scheduling, monitoring, and error handling.
- Supports incremental and large-scale transfers.
- Network optimization features like in-line compression to reduce Azure egress costs.
Rclone is an open-source, command-line tool designed for cloud storage synchronization and transfer.
- Prepare AWS and Azure Resources
- Create or select an S3 bucket in the appropriate AWS region.
- Set up an IAM role for the EC2 instance with write permissions to the S3 bucket.
- Create an Azure AD service principal with read-only access to the Blob container.
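One way to create such a service principal is with the Azure CLI. In this sketch the principal name, subscription ID, resource group, and storage account are placeholders; the built-in Storage Blob Data Reader role grants read-only access to blob data.

```bash
# Create a service principal with read-only access to the storage account's blob data.
az ad sp create-for-rbac \
  --name rclone-migration-sp \
  --role "Storage Blob Data Reader" \
  --scopes "/subscriptions/<subscription-id>/resourceGroups/my-rg/providers/Microsoft.Storage/storageAccounts/mystorageacct"
# Record the appId, password, and tenant from the output for the Rclone configuration.
```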
- Launch EC2 Instance
- Deploy an Amazon Linux EC2 instance with the IAM role attached.
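For example, a Graviton2 instance can be launched with the role attached via an instance profile. The AMI ID, subnet ID, and profile name are placeholders; pick an arm64 Amazon Linux AMI for your region.

```bash
# Launch a Graviton2 (arm64) instance with the migration role attached.
aws ec2 run-instances \
  --image-id ami-0abc1234567890def \
  --instance-type m6g.large \
  --iam-instance-profile Name=rclone-migration-profile \
  --subnet-id subnet-0abc1234567890def \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=rclone-migration}]'
```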
- Install and Configure Rclone
- Download and install Rclone on the EC2 instance.
- Configure Rclone remotes for both Azure Blob and Amazon S3, using the Azure service principal credentials for the source and the instance's IAM role for the destination.
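A minimal sketch of the installation and configuration follows; the remote names (azureblob, s3), storage account, and credentials are placeholders, and the S3 remote picks up the instance role automatically via env_auth.

```bash
# Install Rclone using the official install script (it requires unzip).
sudo yum install -y unzip        # or dnf on Amazon Linux 2023
curl https://rclone.org/install.sh | sudo bash

# Write the remote definitions; names, account, and credentials are placeholders.
mkdir -p ~/.config/rclone
cat <<'EOF' > ~/.config/rclone/rclone.conf
[azureblob]
type = azureblob
account = mystorageacct
tenant = <tenant-id>
client_id = <service-principal-app-id>
client_secret = <service-principal-password>

[s3]
type = s3
provider = AWS
env_auth = true
region = us-east-1
EOF
```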
- Run Migration Commands
To copy data from the Azure container to the S3 bucket:
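For example (the remote, container, and bucket names are the placeholders from the configuration above, and the flag values are only starting points):

```bash
# Copy all objects from the Azure container to the S3 bucket.
# 32 parallel transfers/checkers is a starting point; tune for your instance size.
rclone copy azureblob:mycontainer s3:my-migration-bucket \
  --transfers 32 --checkers 32 --progress --log-file rclone-copy.log
```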
To synchronize (mirror) data so the destination exactly matches the source:
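Again a sketch with placeholder names:

```bash
# Mirror the container to the bucket; see the note below about deletions.
rclone sync azureblob:mycontainer s3:my-migration-bucket \
  --transfers 32 --checkers 32 --progress --log-file rclone-sync.log
```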
Note: The sync command deletes files in the destination that are not present in the source.
- Validate Migration
- Compare object counts and sizes between the Azure container and the S3 bucket before decommissioning the source (see the sketch below).
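One way to validate is Rclone's own check command plus an S3-side summary. Names are the placeholders used above; --size-only is used because Azure MD5 hashes may not match the ETags of multipart S3 uploads.

```bash
# Compare source and destination by size (drop --size-only to compare hashes
# where both sides expose a common hash).
rclone check azureblob:mycontainer s3:my-migration-bucket --size-only --one-way

# Cross-check the total object count and size on the S3 side.
aws s3 ls s3://my-migration-bucket --recursive --summarize | tail -n 2
```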
Example IAM Policy for S3 Access
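A minimal policy along these lines (the bucket name is a placeholder) gives the EC2 instance role just enough access for Rclone to list the bucket, write objects, and, when using sync, delete objects:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::my-migration-bucket"
    },
    {
      "Sid": "WriteObjects",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::my-migration-bucket/*"
    }
  ]
}
```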
Alternative and complementary approaches:
- Azure Functions: Use event-driven Azure Functions to trigger transfers to S3 when new blobs are added, or schedule periodic batch transfers.
- Apache Airflow: Use the AzureBlobStorageToS3Operator for orchestrating and automating transfers as part of data pipelines.
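If Airflow is already part of your stack, the transfer can be expressed as a short DAG. This is a sketch only: the connection IDs, container, and bucket/prefix are placeholders, and the operator's parameters should be checked against the version of the Amazon provider package you have installed.

```python
# Minimal DAG sketch using the Amazon provider's Azure Blob -> S3 transfer operator.
# Requires apache-airflow-providers-amazon and apache-airflow-providers-microsoft-azure.
# Connection IDs, container name, and bucket/prefix below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.azure_blob_to_s3 import (
    AzureBlobStorageToS3Operator,
)

with DAG(
    dag_id="azure_blob_to_s3_migration",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",  # or None to trigger runs manually
    catchup=False,
) as dag:
    copy_container = AzureBlobStorageToS3Operator(
        task_id="copy_container_to_bucket",
        wasb_conn_id="azure_blob_default",   # Azure Blob Storage connection in Airflow
        aws_conn_id="aws_default",           # AWS connection in Airflow
        container_name="mycontainer",
        dest_s3_key="s3://my-migration-bucket/mycontainer/",
        replace=False,                       # skip objects that already exist in S3
    )
```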
Tips for large-scale migrations:
- Large Files/High Volume: Use multi-threaded tools (such as Rclone with the --transfers flag) or DataSync for parallelism.
- Network Bottlenecks: Deploy transfer agents as close to the source as possible.
- Monitoring: Use AWS CloudWatch, Azure Monitor, or Application Insights for real-time transfer visibility.
- Error Recovery: Use sync tools that support checkpointing and retry logic to handle interruptions gracefully.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.