How to Download Kaggle Datasets to AWS S3
This step-by-step guide demonstrates how to efficiently transfer Kaggle datasets to AWS S3 using an EC2 instance
Published Jan 8, 2025
As data scientists and machine learning engineers, we often need to work with large datasets from Kaggle. In this tutorial, we'll walk through the process of downloading a Kaggle dataset directly to AWS S3 using an EC2 instance as an intermediate step. We'll use a retinal OCT dataset as our example, but you can adapt these steps for any Kaggle dataset.
You can explore the dataset here: https://www.kaggle.com/datasets/vidit210/retinaloctrekognition
Before we begin, make sure you have:
- An AWS account with appropriate permissions
- A Kaggle account
- Basic familiarity with AWS services and the command line
First, we'll create an S3 bucket to store our dataset:
- Log into the AWS Management Console
- Navigate to S3
- Click "Create bucket"
- Choose a unique bucket name
- Select your preferred region
- Keep default settings for other options
- Create a folder named "data" inside your bucket
This folder structure will help us maintain organization as we add more datasets in the future.
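If you prefer the command line, the same setup can be sketched with the AWS CLI (the bucket name and region below are placeholders; bucket names must be globally unique):

```bash
# Create the bucket in your preferred region (placeholder name and region)
aws s3 mb s3://your-unique-bucket-name --region us-east-1

# S3 "folders" are just key prefixes; an empty data/ key makes the
# folder show up in the console
aws s3api put-object --bucket your-unique-bucket-name --key data/
```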
We'll use an EC2 instance to download and process the dataset:
- Navigate to EC2 in the AWS Console
- Click "Launch Instance"
- Configure your instance:
  - Name: kaggle-downloader (or your preferred name)
  - AMI: Amazon Linux 2023
  - Instance type: t2.micro (Free Tier eligible)
  - Storage: 15 GiB gp2 (use more if your dataset is larger)
- Skip key pair creation (we'll use Instance Connect)
- Allow default security group settings
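If you prefer launching from the CLI, a rough equivalent looks like the sketch below (the AMI ID is a placeholder; look up the current Amazon Linux 2023 AMI for your region):

```bash
# Launch a t2.micro with a 15 GiB gp2 root volume.
# ami-xxxxxxxxxxxxxxxxx is a placeholder for the Amazon Linux 2023 AMI.
aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type t2.micro \
  --count 1 \
  --block-device-mappings '[{"DeviceName":"/dev/xvda","Ebs":{"VolumeSize":15,"VolumeType":"gp2"}}]' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=kaggle-downloader}]'
```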
- If you haven't already, create a Kaggle account at kaggle.com
- Go to your account settings
- Scroll to "API" section
- Click "Create New API Token"
- This will download a `kaggle.json` file; keep it secure!
Connect to your EC2 instance using Instance Connect and run these commands:
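The exact packages depend on your AMI; on Amazon Linux 2023, a minimal setup looks roughly like this:

```bash
# Refresh packages and install pip and unzip
sudo dnf update -y
sudo dnf install -y python3-pip unzip

# Install the official Kaggle CLI
pip3 install kaggle

# Create the directory the Kaggle CLI expects and open the credentials file
mkdir -p ~/.kaggle
nano ~/.kaggle/kaggle.json
```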
In the nano editor, paste your Kaggle API credentials:
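The file is a single JSON object; the values below are placeholders for the username and key from the token you downloaded:

```json
{"username": "your_kaggle_username", "key": "your_api_key"}
```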
Secure your credentials:
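The Kaggle CLI warns if the file is readable by other users, so restrict it to your own account:

```bash
# Make kaggle.json readable and writable by the owner only
chmod 600 ~/.kaggle/kaggle.json
```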
Now we're ready to download the dataset and upload it to S3:
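Here is a sketch of those steps, assuming the example dataset and a placeholder bucket name (swap in your own values):

```bash
# Download and extract the dataset (the zip is named after the dataset slug)
kaggle datasets download -d vidit210/retinaloctrekognition
unzip retinaloctrekognition.zip -d retinal-oct

# Provide AWS credentials for the upload; an instance profile with S3
# access is the safer alternative to pasting access keys here
aws configure

# Copy everything into the data/ folder of your bucket
aws s3 cp retinal-oct s3://your-unique-bucket-name/data/ --recursive
```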
To avoid unnecessary charges:
- Delete any temporary AWS access keys you created for this walkthrough
- Expire your Kaggle API token:
  - Go to your Kaggle account settings
  - Revoke the API token
- Terminate your EC2 instance:
  - Return to the EC2 Dashboard
  - Select your instance
  - Choose Instance State → Terminate
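If you prefer the CLI, termination is a single command (the instance ID is a placeholder):

```bash
# Terminate the instance; i-xxxxxxxxxxxxxxxxx is a placeholder for your ID
aws ec2 terminate-instances --instance-ids i-xxxxxxxxxxxxxxxxx
```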
- Always use IAM roles with minimal required permissions
- Never share your Kaggle API credentials
- Remove credentials from EC2 instance after use
- Consider using AWS Secrets Manager for production environments
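For the Secrets Manager route, a minimal sketch might store the token once and fetch it on the instance (the secret name kaggle/api-token is an arbitrary choice):

```bash
# Store the token once, from a machine that already has kaggle.json
aws secretsmanager create-secret --name kaggle/api-token \
  --secret-string file://kaggle.json

# On the EC2 instance, fetch it instead of pasting credentials by hand
aws secretsmanager get-secret-value --secret-id kaggle/api-token \
  --query SecretString --output text > ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
```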
- If the Kaggle API command fails, verify your API token is correctly formatted
- For S3 upload issues, check your AWS credentials and bucket permissions
- If unzip fails, ensure you have enough disk space
- For permission errors, verify file ownership and permissions
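A few quick checks that cover the cases above:

```bash
# Confirm the credentials file parses as valid JSON
python3 -m json.tool ~/.kaggle/kaggle.json

# Confirm there is room for the zip plus its extracted contents
df -h /

# Confirm your AWS identity and that the bucket is reachable
aws sts get-caller-identity
aws s3 ls s3://your-unique-bucket-name/
```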
You now have a repeatable process for downloading Kaggle datasets to AWS S3. This approach can be automated further using shell scripts or AWS Lambda functions for regular dataset updates.
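As a starting point for that automation, here is a minimal script version of the whole flow (the dataset slug and bucket name are the examples used above; adjust both):

```bash
#!/usr/bin/env bash
# Sketch: download one Kaggle dataset and mirror it to S3.
# Assumes the kaggle and aws CLIs are installed and credentials configured.
set -euo pipefail

DATASET="vidit210/retinaloctrekognition"   # owner/slug on Kaggle
BUCKET="your-unique-bucket-name"           # placeholder bucket name
WORKDIR="$(mktemp -d)"

kaggle datasets download -d "$DATASET" -p "$WORKDIR"
unzip -q "$WORKDIR"/*.zip -d "$WORKDIR/extracted"
aws s3 cp "$WORKDIR/extracted" "s3://$BUCKET/data/" --recursive
rm -rf "$WORKDIR"
```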
Remember to always check dataset licensing terms and Kaggle's terms of service before downloading and using datasets in your projects.
Happy data science!