How to Download Kaggle Datasets to AWS S3

This step-by-step guide demonstrates how to efficiently transfer Kaggle datasets to AWS S3 using an EC2 instance

Published Jan 8, 2025
As data scientists and machine learning engineers, we often need to work with large datasets from Kaggle. In this tutorial, we'll walk through the process of downloading a Kaggle dataset directly to AWS S3 using an EC2 instance as an intermediate step. We'll use a retinal OCT dataset as our example, but you can adapt these steps for any Kaggle dataset.

Prerequisites

Before we begin, make sure you have:
  1. An AWS account with appropriate permissions
  2. A Kaggle account
  3. Basic familiarity with AWS services and the command line

Step 1: Setting Up Your S3 Bucket

First, we'll create an S3 bucket to store our dataset:
  1. Log into the AWS Management Console
  2. Navigate to S3
  3. Click "Create bucket"
  4. Choose a unique bucket name
  5. Select your preferred region
  6. Keep default settings for other options
  7. Create a folder named "data" inside your bucket
This folder structure will help us maintain organization as we add more datasets in the future.
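If you prefer the command line, a rough equivalent with the AWS CLI looks like this (the bucket name and region are placeholders; S3 "folders" are just key prefixes, created here as an empty object):

    aws s3 mb s3://your-unique-bucket-name --region us-east-1
    aws s3api put-object --bucket your-unique-bucket-name --key data/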

Step 2: Creating an EC2 Instance

We'll use an EC2 instance to download and process the dataset:
  1. Navigate to EC2 in the AWS Console
  2. Click "Launch Instance"
  3. Configure your instance:
    • Name: kaggle-downloader (or your preferred name)
    • AMI: Amazon Linux 2023
    • Instance type: t2.micro (Free Tier eligible)
    • Storage: 15 GiB gp2 (use more if your dataset is larger)
    • Skip key pair creation (we'll use Instance Connect)
    • Allow default security group settings
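For reference, a rough CLI equivalent of this launch is sketched below (the AMI ID is a placeholder; look up the current Amazon Linux 2023 AMI for your region):

    aws ec2 run-instances \
        --image-id ami-xxxxxxxxxxxxxxxxx \
        --instance-type t2.micro \
        --block-device-mappings 'DeviceName=/dev/xvda,Ebs={VolumeSize=15,VolumeType=gp2}' \
        --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=kaggle-downloader}]'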

Step 3: Setting Up Your Kaggle Account

  1. If you haven't already, create a Kaggle account at kaggle.com
  2. Go to your account settings
  3. Scroll to the "API" section
  4. Click "Create New API Token"
  5. This will download a kaggle.json file; keep it secure!

Step 4: Configuring the EC2 Instance

Connect to your EC2 instance using Instance Connect, then install the Kaggle CLI and set up a credentials file.
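A minimal sketch of the setup on Amazon Linux 2023 (pip is installed through dnf; if the kaggle command isn't found afterward, add ~/.local/bin to your PATH):

    sudo dnf install -y python3-pip
    pip3 install kaggle          # official Kaggle CLI
    mkdir -p ~/.kaggle           # where the CLI looks for credentials
    nano ~/.kaggle/kaggle.json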
In the nano editor, paste your Kaggle API credentials:
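The contents mirror the kaggle.json token you downloaded in Step 3 (placeholder values shown):

    {"username":"your_kaggle_username","key":"your_api_key"}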
Secure your credentials:
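The Kaggle CLI warns when the credentials file is readable by other users, so restrict it to your own user:

    chmod 600 ~/.kaggle/kaggle.json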

Step 5: Downloading and Uploading the Dataset

Now we're ready to download the dataset and upload it to S3:
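A sketch of the commands, with the dataset slug and bucket name as placeholders (the upload assumes the instance can reach S3 through an IAM role or access keys configured with aws configure):

    kaggle datasets download -d <owner>/<dataset-name>
    unzip <dataset-name>.zip -d dataset       # check df -h first if the archive is large
    aws s3 cp dataset/ s3://your-unique-bucket-name/data/ --recursive

If the upload is interrupted, aws s3 sync can resume it by copying only the files that are missing from the bucket.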

Step 6: Clean Up

To avoid unnecessary charges:
  1. Delete any IAM access keys you created for the upload
  2. Expire your Kaggle API token:
    • Go to your Kaggle account settings
    • Revoke the API token
  3. Terminate your EC2 instance:
    • Return to EC2 Dashboard
    • Select your instance
    • Choose Instance State → Terminate

Security Best Practices

  1. Always use IAM roles with minimal required permissions
  2. Never share your Kaggle API credentials
  3. Remove credentials from EC2 instance after use
  4. Consider using AWS Secrets Manager for production environments

Troubleshooting Tips

  • If the Kaggle API command fails, verify your API token is correctly formatted
  • For S3 upload issues, check your AWS credentials and bucket permissions
  • If unzip fails, ensure you have enough disk space
  • For permission errors, verify file ownership and permissions
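A few quick checks cover most of these cases:

    df -h                        # free disk space before unzipping
    ls -l ~/.kaggle/kaggle.json  # should show -rw------- and your user
    aws sts get-caller-identity  # confirms which AWS identity the CLI is using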

Conclusion

You now have a repeatable process for downloading Kaggle datasets to AWS S3. This approach can be automated further using shell scripts or AWS Lambda functions for regular dataset updates.
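As a starting point, the whole flow fits in a small parameterized script (a sketch; the dataset slug and bucket name are passed as arguments):

    #!/bin/bash
    # Usage: ./kaggle-to-s3.sh <owner/dataset-slug> <bucket-name>
    set -euo pipefail
    SLUG="$1"
    BUCKET="$2"
    NAME="${SLUG#*/}"                  # strip the owner prefix from the slug
    kaggle datasets download -d "$SLUG"
    unzip -o "$NAME.zip" -d "$NAME"
    aws s3 cp "$NAME" "s3://$BUCKET/data/$NAME/" --recursive
    rm -rf "$NAME" "$NAME.zip"         # reclaim local disk space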
Remember to always check dataset licensing terms and Kaggle's terms of service before downloading and using datasets in your projects.
Happy data science!
 
