Using Amazon Location Service and Aurora Serverless V2 to enrich addresses with US Census Block information

Learn how to connect Amazon Location Service to Aurora Serverless V2 to look up Census information about addresses using a SageMaker notebook.

Zach Elliott
Amazon Employee
Published Jan 6, 2025

Introduction

In the United States, every address is represented in the Census by a Block and a Tract. This information can be used to derive demographic data about a given location. The US contains over 8 million blocks and just over 84,000 tracts. Customers may be interested in determining which tract or block a given address falls into and using that data to make decisions. The US Census Bureau provides tract and block data in multiple formats, and even provides a web interface to look up this information for a given address. But what if you have millions of addresses to process and don't want to go through a tedious batch upload process? Or what if you want to minimize calls to external APIs and web sources? For this, we can use Amazon Location Service combined with Aurora Serverless V2, PostGIS, and Amazon SageMaker notebooks to perform large batch processing on millions of addresses.

CloudFormation

To make this easy to demonstrate, we've provided a sample CloudFormation template below. This template provisions a VPC specifically for our database infrastructure and SageMaker notebook, as well as any IAM roles required. Note the Outputs section, which we will use for database connections in later steps.
Note that this example assumes you do not have any existing VPC or database infrastructure. If you want to run this example with your own network and database setup, modify the template as needed.

Database Configuration

Out of the box, Aurora Serverless V2 is not configured to run geospatial queries. For this, we will use the PostGIS extension. Luckily, it's easy to set up using the RDS Data API in the console. We will follow a portion of the instructions contained on this page to configure PostGIS on our database.
First, navigate to the RDS console and select Query Editor in the left menu.
Here you will fill out some details about your database. Select the DB cluster we provisioned earlier, then choose to connect with a Secrets Manager ARN for the credentials. Retrieve the ARN of the Secret created by our CloudFormation stack earlier and paste it into the configuration window. For the database name, enter postgres; we'll initially connect to the default postgres database.
Image showing the RDS Data API database connection configuration

You'll now be greeted by the query editor. We'll run a few commands to get our database set up with PostGIS.
First, let's create a new role (user) to use with PostGIS and the new database we'll create in a few steps.
In the query editor, copy/paste the following command, replacing the placeholder with a secure password. Keep this password handy, because we'll use it to connect to the Data API in a future step.
CREATE ROLE gis_admin LOGIN PASSWORD 'change_me';
Select Run to run the query and create our new user.
Now we need to create our database. Copy/paste the following command, then select Run to execute the command:
CREATE DATABASE census;
Finally, we need to grant our new user privileges on the new database. Copy/paste the following command, then select Run to execute it:
GRANT ALL PRIVILEGES ON DATABASE census TO gis_admin;
Now we need to enable the PostGIS extension on our newly created database. To do this, we need to reconnect to the query editor and select the new database. Select Change database and go through the connection steps again with the same Secret ARN, but this time enter census for the database name. Select Connect to database and wait a few seconds to be connected.
Now that we are connected to our new census database, we can enable PostGIS. In the query editor window, copy/paste the following command, then select Run to execute it.
CREATE EXTENSION postgis;
 
We also need to grant privileges on the schema. Copy/paste the following command, then select Run to execute it:
GRANT ALL ON SCHEMA public TO gis_admin;
Now we can verify everything is working. Select Change database again, and this time, select Add new database credentials and enter the username we created earlier (gis_admin) and the secure password we set.
Image showing the database connection configuration with the new gis_admin credentials
Select Connect to database, and after a few seconds it will connect and allow you to run queries. This also stores your credentials in Secrets Manager, so we can use them with the SageMaker notebook in a later step.
To confirm everything is working as expected, copy/paste the following command and then select Run to execute the command:
SELECT PostGIS_version();
Image showing the query results with the installed PostGIS version
We have now confirmed that PostGIS is installed in our database. We can now connect using the Data API in our SageMaker notebook.
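As a preview of what that connection looks like, here is a minimal sketch using boto3's rds-data client; the two ARNs are placeholders for the CloudFormation stack outputs and the gis_admin secret, not values taken from this walkthrough.

import boto3

# RDS Data API client: runs SQL over HTTPS, no database driver or VPC connection required.
rds_data = boto3.client("rds-data")

# Placeholder ARNs: substitute the DB cluster ARN from the CloudFormation outputs
# and the Secrets Manager secret holding the gis_admin credentials.
DB_CLUSTER_ARN = "arn:aws:rds:us-east-1:111122223333:cluster:census-cluster"
GIS_ADMIN_SECRET_ARN = "arn:aws:secretsmanager:us-east-1:111122223333:secret:gis-admin"

response = rds_data.execute_statement(
    resourceArn=DB_CLUSTER_ARN,
    secretArn=GIS_ADMIN_SECRET_ARN,
    database="census",
    sql="SELECT PostGIS_version();",
)
print(response["records"])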

Loading Census Data into Database

Before we can start querying Census data, we need to load the Census Shapefiles into our database. Each Shapefile contains all the geometries and associated metadata for each block in a state. First, let's make sure our SageMaker notebook is set up correctly.
Navigate to the Amazon SageMaker AI console and select the notebook we created as part of the CloudFormation stack. Select Open Jupyter to access our Jupyter notebook environment.
Image showing the SageMaker notebook instance in the console
Now, let's create a new notebook. From the Jupyter interface, select New, then select conda_python3.
Image showing selection of a python notebook kernel
This will create a new notebook. Give the notebook a meaningful name, such as CensusData, and select Rename.
Image showing renaming a SageMaker notebook
Now we can begin writing some code!
In the first cell, we need to install a couple of packages. Enter the following, then run the cell using the Run button on the toolbar.
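The exact package list here is an assumption; a sketch of an install cell covering what the later cells rely on (GeoPandas for Shapefiles, SQLAlchemy/GeoAlchemy2 and psycopg2 for PostGIS writes, and an up-to-date boto3 for the Amazon Location Service Places client) might look like this:

# Upgrade boto3 so the standalone Places ("geo-places") client is available,
# and install the geospatial and database packages used in the cells below.
!pip install --upgrade boto3 geopandas sqlalchemy geoalchemy2 psycopg2-binary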
Once the cell shows a [1] next to it, it has finished running and our prerequisite packages are installed.
Now, in the next cell, we need to set up our imports and define our clients. Some of these may or may not be used depending on how your data is prepared and the output format you want. Copy/paste the following, then select the Run button on the toolbar.
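A sketch of such a cell, assuming the clients and libraries used later in this walkthrough; the client names are standard boto3 service names, and which ones you keep depends on your workflow.

import json
import multiprocessing
from pathlib import Path

import boto3
import geopandas as gpd
from sqlalchemy import create_engine

# AWS clients; some of these may go unused depending on your data flow.
location = boto3.client("geo-places")      # Amazon Location Service Places API (geocoding)
rds_data = boto3.client("rds-data")        # RDS Data API for the lookup queries
rds = boto3.client("rds")                  # to discover the cluster endpoint
secrets = boto3.client("secretsmanager")   # to fetch the gis_admin credentials
s3 = boto3.client("s3")                    # optional: reading/writing address data in S3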
Note: As a shortcut, you can press Shift+Return to run a cell without clicking the Run button.
At this point, it's important to note a limitation of the Data API: it only supports up to 64 KB in a query. This means that for larger geometries, the Data API will not work. Because of this, we will set up a direct network connection to our database for the initial load.
In the next cell, enter the following to set up our connection. Note that we dynamically retrieve the database cluster host and password to avoid hard-coding any values. Enter your DB cluster ARN and the ARN of the Secret that was created with the gis_admin username.
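A sketch of that connection cell, assuming the secret created by the Query Editor stores username and password keys; the two ARN values are placeholders you replace with your own.

# Placeholders: replace with the cluster ARN from the CloudFormation outputs
# and the ARN of the secret created with the gis_admin credentials.
DB_CLUSTER_ARN = "arn:aws:rds:us-east-1:111122223333:cluster:census-cluster"
GIS_ADMIN_SECRET_ARN = "arn:aws:secretsmanager:us-east-1:111122223333:secret:gis-admin"

# Look up the cluster's writer endpoint instead of hard-coding the hostname.
cluster_id = DB_CLUSTER_ARN.split(":")[-1]
db_host = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)["DBClusters"][0]["Endpoint"]

# Pull the gis_admin username and password out of Secrets Manager.
creds = json.loads(secrets.get_secret_value(SecretId=GIS_ADMIN_SECRET_ARN)["SecretString"])

# Direct PostgreSQL connection string used for the bulk Shapefile load.
DB_URL = f"postgresql+psycopg2://{creds['username']}:{creds['password']}@{db_host}:5432/census"
engine = create_engine(DB_URL)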
Now, in the next cell, we can define our function to load Shapefiles into the database. Note that we use multiprocessing to speed things up. In this instance, we are using a small SageMaker notebook; if you are processing more than a couple thousand rows (the entire US, for example), you will benefit from a larger instance type with more CPU and memory. In this example, however, we will load just a few states' worth of block data, which will not take long.
Copy/paste the following code into the next cell, but don't run it yet; we need to load our Shapefiles onto the local storage first. Once you've copied the code into the cell, save the notebook.
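A sketch of that loader cell, assuming the Shapefile zips will sit in a shapefiles/ folder (created in the next section) and that blocks are written to a census_blocks table; both names are illustrative rather than fixed.

SHAPEFILE_DIR = Path("shapefiles")   # folder we create and upload the zips into below

def load_shapefile(zip_path):
    """Read one TIGER/Line block Shapefile zip and append it to the census_blocks table."""
    # Each worker process opens its own database connection.
    worker_engine = create_engine(DB_URL)
    # GeoPandas can read a zipped Shapefile directly; TIGER/Line data ships in NAD83 (EPSG:4269).
    gdf = gpd.read_file(f"zip://{zip_path}")
    gdf.to_postgis("census_blocks", worker_engine, if_exists="append", index=False)
    worker_engine.dispose()
    return zip_path.name

def load_all_shapefiles(workers=4):
    zips = sorted(SHAPEFILE_DIR.glob("*.zip"))
    if not zips:
        print(f"No zip files found in {SHAPEFILE_DIR}/")
        return
    # Load the first file serially so the table is created exactly once,
    # then fan the remaining files out across worker processes.
    print(f"Loaded {load_shapefile(zips[0])}")
    with multiprocessing.Pool(processes=workers) as pool:
        for name in pool.imap_unordered(load_shapefile, zips[1:]):
            print(f"Loaded {name}")

load_all_shapefiles()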

Downloading and uploading shapefiles

If you have worked with US Census data in the past, these next steps will seem very familiar. Before we can query any data, we need to download Block data for each state or territory we are interested in.
On your local machine, open a web browser, navigate to https://www.census.gov/cgi-bin/geo/shapefiles/index.php, and select Blocks, then Submit. Next, select a state. I'm going to be biased towards the Midwest US where I grew up and select a few Midwest states. Select the state you want, then click Download. Depending on your internet connection, it may take a few moments. Repeat for any additional states you want to load into your database. Once these zip files are downloaded, we'll upload them to our SageMaker notebook.
Keep in mind that SageMaker notebooks come with only 5 GB of attached storage. If you are processing many Blocks, increase the storage size by stopping the instance and then modifying the volume size.
In your Jupyter Notebook, click the Jupyter logo in the upper left to return to your dashboard. From the dashboard, you will see your notebook and other options. Select New and select Folder.
This will create a new Untitled Folder. Select the checkbox to the left of the folder and select Rename.
Change the name to shapefiles and select Rename. Now navigate into the shapefiles directory. Either select the Upload button or drag files into the list to upload them. You will probably get a dialog box warning about the large file size; confirm that you want to upload the files.
Select Upload next to each file and wait for the uploads to finish. Once they are done, you should see your files in a list like this:
Now that our files are uploaded, we can run our processing script.
Navigate back to your notebook, and run the cell we just added. After a few moments, you will see your files processed and imported.
Image showing a python notebook with output depicting a successful operation
Now that our data is loaded into our PostGIS database, we can begin querying against it.

Querying the database

Before we can begin querying the database, we need to geocode our addresses. To do this, we will use Amazon Location Service. In the next cell, enter the following code and run the cell.
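A sketch of such a geocoding cell, assuming the standalone Places API (the geo-places client defined earlier); the IntendedUse parameter is discussed in the note below.

def geocode_address(address):
    """Geocode a free-text address and return its (longitude, latitude)."""
    response = location.geocode(
        QueryText=address,
        MaxResults=1,
        IntendedUse="SingleUse",   # switch to "Storage" if you persist the results
    )
    items = response.get("ResultItems", [])
    if not items:
        raise ValueError(f"No geocode result for: {address}")
    # Position comes back as [longitude, latitude].
    longitude, latitude = items[0]["Position"]
    return longitude, latitude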
This will take an address and geocode it, providing the coordinates as an output.
A note on storing results: Amazon Location Service supports storing the results of geocodes. If you save any portion of the value returned from Amazon Location Service, including coordinates or the formatted address, in your database or datastore, you will need to set IntendedUse='Storage' in the above function. In this example, we are not storing any values, so we are OK to use SingleUse.
Now we can define our query function. In the next cell, enter the following code and run it.
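A sketch of the lookup cell using the Data API, assuming the census_blocks table and 2020 block column names (GEOID20 and friends) from the loader sketch above; the point is built in NAD83 (EPSG:4269) to match the TIGER/Line geometries.

def lookup_block(longitude, latitude):
    """Return the census block record whose geometry contains the given point."""
    sql = """
        SELECT "GEOID20", "STATEFP20", "COUNTYFP20", "TRACTCE20", "BLOCKCE20"
        FROM census_blocks
        WHERE ST_Contains(geometry, ST_SetSRID(ST_Point(:lon, :lat), 4269))
        LIMIT 1
    """
    response = rds_data.execute_statement(
        resourceArn=DB_CLUSTER_ARN,
        secretArn=GIS_ADMIN_SECRET_ARN,
        database="census",
        sql=sql,
        parameters=[
            {"name": "lon", "value": {"doubleValue": longitude}},
            {"name": "lat", "value": {"doubleValue": latitude}},
        ],
    )
    records = response.get("records", [])
    if not records:
        return None
    # Each column comes back as a Data API field dict, e.g. {"stringValue": "290950154003001"};
    # flatten the first row into a plain list of values.
    return [list(field.values())[0] for field in records[0]]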
Now that we have our database lookup function, we can put it into action.
In the next cell, enter the following code, changing the address variable to an address that falls within one of the Shapefiles you downloaded and processed into the database.
For example, since I loaded Missouri into the database, I can perform the following query:
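The address below is an illustrative placeholder rather than the exact one from the original run; it simply chains the two functions defined above.

# Hypothetical Missouri address; replace with any address covered by the states you loaded.
address = "1600 Main Street, Kansas City, MO 64108"

longitude, latitude = geocode_address(address)
block = lookup_block(longitude, latitude)
print(address, block)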
image showing output from a python command
The function returns not only the address, but also the GEOID20, which identifies the Block the address falls in, along with other information. You can read more here about the GEOID and how the number breaks down.
Note that we have only looked at Census Block data. If you want Tract data, the above query will need to be modified to match the tract Shapefile schema.

Cleanup

Because we deployed everything via CloudFormation, in order to clean up our database and SageMaker Notebook, simply delete the stack.

Next steps

In this example, we've just scratched the surface and processed only a single address. With the framework we've laid out, you can easily process millions of addresses using multiprocessing and output the results to your own database or datastore. For example, data stored in S3 can be read, processed, and written right back to S3. Alternatively, the portion of the code that looks up the address in the database can be containerized or deployed in AWS Lambda and placed behind an API Gateway to provide a scalable service for use across an organization.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
