Building Secure Data Lakes with AWS: From S3 to ML
Follow a complete workflow from S3 data to ML models while maintaining robust security. Learn how to use AWS Lake Formation, Glue, CloudTrail, Macie, and SageMaker to create a secure, analytics-ready data ecosystem.
Ankit Patel
Amazon Employee
Published Dec 15, 2024
Last Modified Dec 16, 2024
This post was written in collaboration with Dylan Martin.
AWS Lake Formation is a fully managed service that makes it easy to securely set up a data lake in days. It helps you centrally govern and secure data for your analytics and machine learning workloads. With Lake Formation, you can import data from data stores already in AWS or from external sources. You can use AWS Glue Crawlers to extract your database and table schemas and store them in an AWS Glue Data Catalog, which allows you to centrally store and manage your organization’s database metadata. Lake Formation also supports fine-grained access control, which allows you to restrict access to specific columns and rows for your users, as well as tag-based access control, which enables your users to access only data that carries a specific tag.
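As a quick illustration of fine-grained access control, here is a minimal boto3 sketch of a column-level grant. The database, table, column, and role names are placeholders for illustration only, not resources created in this post.

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Grant SELECT on only two columns of a Data Catalog table to an analyst role.
# All names below are hypothetical placeholders.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "ml-datalake-db",
            "Name": "training_and_development_data",
            "ColumnNames": ["employee_id", "training_outcome"],
        }
    },
    Permissions=["SELECT"],
)
```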
The primary purpose of this post is to demonstrate AWS best practices for security when setting up a Lake Formation data lake for data originating from Amazon Simple Storage Service (S3).
In this post, we will be looking at the following aspects of AWS Lake Formation:
- Securing access to data in S3 from Lake Formation
- Setting up and integrating AWS Glue to crawl data and centrally store and manage metadata
- Implementing audit logging with AWS CloudTrail to monitor and demonstrate compliance with defined access policies
Additionally, this post will introduce Amazon Macie, a fully managed data security and privacy service that uses ML and pattern matching to discover and protect your sensitive data in S3. We will demonstrate how Amazon Macie can detect personally identifiable information (PII) and other types of sensitive data.
Finally, this post demonstrates an example ML use case for Lake Formation-managed data. Specifically, we will highlight how you can perform data cleansing on your Lake Formation-managed data with Amazon SageMaker Data Wrangler and how you can use that cleansed data to build and train your own machine learning (ML) models on Amazon SageMaker.
Familiarity with the AWS Management Console is a recommended prerequisite for this post.
Here is a high-level overview of the solution.
- Download the sample dataset and upload it to an S3 Bucket
- Set up an AWS Lake Formation data lake with tag-based access control
- Register your S3 Bucket as a data source to your data lake
- Set up an AWS CloudTrail trail on your S3 Bucket
- Set up Amazon Macie on your S3 Bucket and review the findings
- Configure AWS Glue to crawl your S3 data and store the table metadata in the Glue Data Catalog
- Import your dataset into Amazon SageMaker Data Wrangler to anonymize it and prepare it for machine learning
- Train an ML model based on the cleansed data and analyze its metrics
The dataset we are going to be working with contains synthetic, computer-generated employee training data for an organization. Take a look at the header columns of the dataset in the image below. You can see that Column G contains the trainer’s name. This is an example of PII. Depending on your organization’s governance and compliance requirements, it may be forbidden to use PII in data analytics or ML model training. Later in this post, we will demonstrate how you can use Amazon Macie to automatically detect various types of PII in your S3 data and how you can use Amazon SageMaker’s Data Wrangler feature to remove this unwanted data from the dataset prior to training a model on it.
Please download the dataset and unzip it. We will be working with the data contained within “training_and_development_data.csv”. You can delete the rest of the files.
Now, go to the AWS Management Console. For this exercise, we recommend using the us-east-1 (North Virginia) Region.
In the service search bar, type in “S3” and click the link that takes you to the Amazon S3 console page.
Click the “Create Bucket” button.
For the bucket type setting, leave it as the default value of “General purpose”.
For Bucket name, enter a name for your bucket. Since bucket names must be globally unique, we recommend a naming convention of lake-formation-ml-bucket-xxxxxxxxxx, where xxxxxxxxxx is an arbitrary string such as your name or username.
Leave the rest of the settings as their default values, scroll all the way down to the bottom of the page, and click “Create bucket”.
Once your bucket is created, click its name in your list of buckets.
Click the “Upload” button, then “Add files”. Choose the “training_and_development_data.csv” file. Finally, click the orange “Upload” button at the bottom of the page.
You have successfully uploaded the dataset to your S3 Bucket! Next, we will set up our data lake via AWS Lake Formation.
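If you prefer to script these steps, here is a minimal boto3 sketch that performs the same bucket creation and upload. The bucket name is a placeholder following the naming convention above.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Placeholder bucket name following the convention above.
bucket_name = "lake-formation-ml-bucket-your-name"

# In us-east-1, create_bucket needs no CreateBucketConfiguration.
s3.create_bucket(Bucket=bucket_name)

# Upload the dataset file from the current working directory.
s3.upload_file(
    Filename="training_and_development_data.csv",
    Bucket=bucket_name,
    Key="training_and_development_data.csv",
)
```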
In the service search bar, type “Lake Formation” and click the link that takes you to the AWS Lake Formation console page.
If you get a “Welcome to Lake Formation” pop-up message, ensure the “Add myself” option is checked and click “Get started”.
Follow the "data lake setup" process within the AWS Lake Formation console to get started. You will need to register your data lake storage in a secure S3 bucket, create your Lake Formation databases to manage your tables, and finally, leverage Lake Formation’s permissions management to maintain the principle of least privilege.
Within the AWS Lake Formation data lake setup, select the “Register your Amazon S3 Storage” option. Paste your S3 bucket path or browse and select the S3 bucket directory.
Register your S3 path with Lake Formation, paying special attention to the location permissions. When you register your data lake with Lake Formation, users may gain access to the lake’s data. If you use the service-linked role, Lake Formation defines the role’s permissions, and only Lake Formation can assume the role unless you configure it otherwise.
When creating any IAM identity for your workload, consider leveraging IAM roles for temporary credentials where possible. Roles allow you to grant fine-grained, least-privilege permissions while limiting access to temporary security credentials issued for a role session.
Note that you have the option to choose your own IAM role or select the Lake Formation service-linked role. By selecting the service-linked role, you reduce the burden of manually configuring and fine-tuning role permissions and take advantage of the principle of least privilege by default.
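The same registration can also be scripted. Below is a minimal boto3 sketch that registers the bucket using the Lake Formation service-linked role; the bucket name is a placeholder.

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Register the S3 location as data lake storage.
# UseServiceLinkedRole=True lets Lake Formation manage the role's permissions for you.
lf.register_resource(
    ResourceArn="arn:aws:s3:::lake-formation-ml-bucket-your-name",
    UseServiceLinkedRole=True,
)
```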
Stage 2: Create a Lake Formation Database:
Stage 3: Grant Permissions:
Leverage Lake Formation Tags (LF-Tags), assigning them to Data Catalog resources (such as databases and tables) to control resource access. Only principals granted matching LF-Tags can access the resources. The LF-TBAC (tag-based access control) authorization strategy supports rapidly growing environments by simplifying your policy management, and it is the recommended strategy for granting Lake Formation permissions when a large number of Data Catalog resources are in play.
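As a rough boto3 sketch of LF-TBAC, assuming a hypothetical “classification” tag and placeholder database and role names:

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Define an LF-Tag (key and its allowed values).
lf.create_lf_tag(TagKey="classification", TagValues=["public", "restricted"])

# Attach the tag to a database; its tables inherit the tag by default.
lf.add_lf_tags_to_resource(
    Resource={"Database": {"Name": "ml-datalake-db"}},
    LFTags=[{"TagKey": "classification", "TagValues": ["public"]}],
)

# Grant a principal access to all tables tagged classification=public.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "classification", "TagValues": ["public"]}],
        }
    },
    Permissions=["SELECT"],
)
```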
While IAM roles let us control permissions, missteps can occur, as can insider threats. Anticipating this, a comprehensive data protection strategy also involves API and user activity logging with AWS CloudTrail. CloudTrail records all API calls for Amazon S3, AWS Glue, Lake Formation, and the other services we might be using. This covers things like S3 bucket uploads, metadata operations in Glue, and so on. Additionally, we can track user activities as they access or modify our data lake's resources. We can then leverage CloudTrail's logs to perform security analyses, monitor unusual activity, and identify malicious or unauthorized activity. Note that CloudTrail event log files are encrypted using S3’s server-side encryption. You also have the option to encrypt log files with an AWS KMS key, store the log files in S3 for as long as needed, and configure lifecycle rules to automate file handling.
Within CloudTrail, create a trail. If you want granular logging of events related to your Amazon S3 buckets, enable collection of “Data events”.
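If you would rather script the trail, here is a minimal boto3 sketch that creates a trail, enables S3 data events for the dataset bucket, and starts logging. The trail name, logging bucket, and dataset bucket are placeholders, and the logging bucket must already exist with a bucket policy that allows CloudTrail to write to it.

```python
import boto3

cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")

data_bucket = "lake-formation-ml-bucket-your-name"
logs_bucket = "lake-formation-ml-trail-logs-your-name"  # must already allow CloudTrail delivery

# Create the trail and deliver log files to a separate logging bucket.
cloudtrail.create_trail(Name="data-lake-trail", S3BucketName=logs_bucket)

# Enable data events (reads and writes) for objects in the dataset bucket.
cloudtrail.put_event_selectors(
    TrailName="data-lake-trail",
    EventSelectors=[
        {
            "ReadWriteType": "All",
            "IncludeManagementEvents": True,
            "DataResources": [
                {"Type": "AWS::S3::Object", "Values": [f"arn:aws:s3:::{data_bucket}/"]}
            ],
        }
    ],
)

# Trails do not record events until logging is started.
cloudtrail.start_logging(Name="data-lake-trail")
```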
Enabling Amazon Macie helps you protect customers and maintain compliance and security best practices by discovering sensitive data within your data lakes. Macie allows you to select your buckets, define the scope, schedule data analysis, and automate remediation of data security risks.
Within the Amazon Macie console, we will create a new job, selecting our S3 Bucket to perform object analysis. You have the option to select your buckets individually and run the job on each bucket’s objects, or to specify bucket criteria. With the latter option, you define job criteria and Macie identifies which buckets meet the criteria, analyzing only the objects in those buckets.
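Here is a minimal boto3 sketch of a one-time classification job scoped to the dataset bucket. The account ID, job name, and bucket name are placeholders, and Macie must already be enabled in the Region.

```python
import boto3

macie = boto3.client("macie2", region_name="us-east-1")

# Enable Macie for the account if it is not already enabled.
# macie.enable_macie()

# One-time sensitive data discovery job scoped to the dataset bucket.
macie.create_classification_job(
    jobType="ONE_TIME",
    name="data-lake-pii-scan",
    s3JobDefinition={
        "bucketDefinitions": [
            {
                "accountId": "111122223333",  # placeholder account ID
                "buckets": ["lake-formation-ml-bucket-your-name"],
            }
        ]
    },
)
```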
By using Amazon Macie, we automatically scan for and identify any PII, and we evaluate and monitor our S3 data lake for security and access control. If Macie detects an event that reduces the security or privacy of an S3 Bucket, it creates a policy finding for us to review. In this step, we can see that Macie analyzed our data lake and generated findings. Macie ranks each finding by severity, references the location of the analyzed data, and lets us take action or dismiss the alert. In our case, it discovered several instances of PII within our dataset that we do not want to include in our model. We will remove this PII in the coming steps, resolving these findings.
AWS Glue Crawlers automatically discover and catalog metadata from your data sources, saving significant time and effort in manual schema definition. They keep your AWS Glue Data Catalog up-to-date, enabling seamless integration with other AWS analytics services. Crawlers support various data sources, detect schema changes over time, and facilitate efficient ETL job creation. By automating metadata management, they improve data discovery, governance, and overall analytics efficiency in your data ecosystem.
Here, we use an AWS Glue Crawler to extract the metadata from our S3 Bucket. This extracted metadata is stored in an AWS Glue Data Catalog.
First, create a Glue Database. This will be used to store our dataset’s schema and other metadata. In the AWS Glue Console, create a Database. You may need to expand the hamburger menu and the “Data Catalog” section to access Databases. Give the database a name such as “ml-datalake-db”.
Now, create a Crawler. Choose your CSV dataset in S3 as the crawler’s data source. For the crawler’s security settings, choose “Create new IAM role”. Choose the database that you created in the previous step. Click “Create crawler”.
After the crawler is successfully created, run it.
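Equivalently, the database and crawler can be created and started with boto3. The role ARN below is a placeholder for an IAM role with Glue permissions and read access to the bucket.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Database to hold the dataset's schema and other metadata.
glue.create_database(DatabaseInput={"Name": "ml-datalake-db"})

# Crawler pointed at the CSV dataset in S3; the role ARN is a placeholder.
glue.create_crawler(
    Name="ml-datalake-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="ml-datalake-db",
    Targets={"S3Targets": [{"Path": "s3://lake-formation-ml-bucket-your-name/"}]},
)

# Run the crawler; it should finish within a couple of minutes for this dataset.
glue.start_crawler(Name="ml-datalake-crawler")
```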
It should take under two minutes to crawl the CSV dataset in S3. After the crawler finishes, go to the “Tables” section. Here, you should see a table representing your S3 dataset.
Click on the Table Name. Your schema should look like the following:
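You can also verify the crawled schema programmatically. A small boto3 sketch, assuming the database name used above:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# List the crawled tables and print each column name and type.
for table in glue.get_tables(DatabaseName="ml-datalake-db")["TableList"]:
    print(table["Name"])
    for column in table["StorageDescriptor"]["Columns"]:
        print(f"  {column['Name']}: {column['Type']}")
```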
Now that our data is centrally managed by a Lake Formation data lake and has its metadata centrally stored in AWS Glue, we can proceed with actually using the data. We are going to use Amazon SageMaker Data Wrangler to clean and anonymize the data. Then, we will train a model on the cleansed data.
In the Amazon SageMaker Console, create a SageMaker Domain. You may need to expand the “Admin configurations” section in the hamburger menu to access this. You can choose the “Quick setup” option.
After you click “Set up”, it can take a few minutes to configure the domain.
Once the SageMaker Domain is ready, launch SageMaker Studio from it.
The SageMaker Studio should look similar to the following:
Expand the “Data” drop down menu and go to “Data Wrangler”. Then click “Open in Canvas”.
In the canvas, select your dataset. Continue through the configurations. Create a new model. Select “Predictive analysis”, then “Create”.
In the data transformation UI, select “Training Outcome” as the Target column.
Next, we are going to drop the column with PII data (“Trainer”), as well as columns that aren’t relevant to us (Training Date, Location, Employee ID, and Program Name). Finally, we only want to look at entries that resulted in a pass or fail, so we are going to remove rows where the Training Outcome was not “passed” or “failed”. This tells our model that we want to predict the training outcome (pass or fail) given the columns and rows we keep as variables and training data, respectively. After applying these transformations, select the “Quick build” option.
Note: This data transformation is performed on the copy of the data used for model training. The original source data in our S3 bucket remains unchanged.
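For reference, the equivalent transformations in pandas on a local copy of the CSV would look roughly like this. The column names are taken from the dataset as described above and may need adjusting if yours differ.

```python
import pandas as pd

# Load a local copy of the dataset; the original object in S3 is untouched.
df = pd.read_csv("training_and_development_data.csv")

# Drop the PII column and the columns that are not relevant to the prediction.
df = df.drop(columns=["Trainer", "Training Date", "Location", "Employee ID", "Program Name"])

# Keep only rows whose outcome is pass or fail.
df = df[df["Training Outcome"].str.lower().isin(["passed", "failed"])]

df.to_csv("training_data_cleansed.csv", index=False)
```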
It will take around 20 minutes to train the model. Once the model has finished training, you will see something like the following:
The model predicts the correct training outcome (pass or fail) with about 54% accuracy, which is only slightly better than a coin toss. This is expected because the dataset was synthetically generated and therefore contains a lot of randomness.
Note: If you want to persist and share this model across your organization, consider adding it to the SageMaker Model Registry.
Terminate your SageMaker Domain. Delete the Glue Crawler, Table, and Database. Delete your dataset in S3, as well as your AWS CloudTrail trails. Finally, delete your tables and database in AWS Lake Formation.
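If you scripted the setup, you can script most of the teardown as well. A rough boto3 sketch, using the placeholder names from the earlier sketches (the SageMaker Domain is easiest to delete from the console):

```python
import boto3

region = "us-east-1"
bucket = "lake-formation-ml-bucket-your-name"

glue = boto3.client("glue", region_name=region)
glue.delete_crawler(Name="ml-datalake-crawler")
glue.delete_database(Name="ml-datalake-db")  # also removes its tables

cloudtrail = boto3.client("cloudtrail", region_name=region)
cloudtrail.delete_trail(Name="data-lake-trail")

lf = boto3.client("lakeformation", region_name=region)
lf.deregister_resource(ResourceArn=f"arn:aws:s3:::{bucket}")

# Empty the dataset bucket before deleting it.
s3 = boto3.resource("s3", region_name=region)
s3.Bucket(bucket).objects.all().delete()
s3.Bucket(bucket).delete()
```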
About the Authors:
Dylan Martin is a Tech Enablement Specialist working in the Generative AI space helping the AWS Technical Field build AI/ML workloads on AWS. He brings past experience as both a Security Solutions Architect and Software Engineer. Outside of work he enjoys motorcycling, the French Riviera and studying languages.
Ankit Patel is a Solutions Developer at AWS based in the NYC area. As part of the Prototyping And Customer Engineering (PACE) team, he helps customers bring their innovative ideas to life by rapid prototyping; using the AWS platform to build, orchestrate, and manage custom applications.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.