Fixing... err... I mean "implementing" my Amazon DynamoDB approach with Gen-AI [Part 2]
Second part of the series where I build a serverless app to track my Green Card priority date and implement storing the data in DynamoDB.
Cobus Bernard
Amazon Employee
Published Jul 2, 2024
In the previous article, I started building a serverless app to scrape the USCIS priority dates from their website. I was able to extract the dates from the website, but the schema for my DynamoDB table needed a change to store the data. Today, I'm going to tackle this using Amazon Q Developer.
The issue I ran into was how I was storing my data. I had the primary key as
pk = f"{filing_type}#{category}"
and the secondary key as sk = f"{country}"
. The data that I'm trying to store has a date for when the Visa Bulletin was published (bulletin_date
), and each bulletin has two tables I'm interested in for the priority date (filing_type
for Final Date
and Application Date
). Each of those tables has a row per visa category (category
), and column for the country groupings. The date for each combination of country and category is what I'm trying to store. This will allow me to query the historic data specific to me, with category='3rd'
, filing_type='Final Date'
, and country='All Chargeability Areas Except Those Listed'
.First item on my list today is to ensure I have a table schema to store my data without overwriting entries like I had in my previous attempt. First, I update the schema using the current data structure, update the
store_data
function with structure below (and also implement handling table request throttling), with a quick detour to dig into PROVISIONED
vs PAY_PER_REQUEST
for my table access. Now we can update our Terraform with the new table definition:The
bulletin_date
will be used for the primary key (pk
), and a composite secondary key (sk
) using the filing_type
, country
, and category
to group the data. Our pk
and sk
will be defined in our code as:I went back and forth between using provisioned capacity vs on-demand, I do like the idea of using the retry logic, but in the end, I settled on going for a simpler approach. I know that once I have everything working, I will do another pass to improve the code for the app. Ensuring that I combine the database write / retry logic with how I currently handle getting the data will cause issues. In the
scrape_visa_bulletin
function, I'm looping over each of the bulletins to extract the data, and adding that URL to the ProcessedURLs
table so I don't process it again, but if writing to the database fails, it won't reprocess the pages that weren't saved. We now have the following for our store_data
function:After running
terraform apply
, grabbing some fresh coffee, finishing it, and cleaning my keyboard, I can test the updated code.TIL
: it takes time to change a DynamoDB table's access mode.
Commented out the lines that handle checking if the URL was previously processed, and I can run the app again. Woohoo!! I can see my data in my table! It looks like what I need, but now I need to update the
read_data
method to retrieve it.Updating
read_data
to use the new structure, and printing the output provides the following code (I'm using defaults for now) after needing to query using the wrong pk
and sk
:And š„. It does not work!!!
Turns out that the primary key needs to be an "equality condition". I also notice that the files referenced are from a different directory than where I moved my code to. Initially this was going to be part of a larger repo with Terraform examples, but decided to split it out. I think this is due to using
venv
from that folder, but as long as my code runs, I'm going to ignore it for now. Let's fix the database first.After checking if a schema change would fix it, I update both the
store_data
and read_data
methods to use the new schema:While I've used DynamoDB quite a bit in the past, I haven't had to deal with data when doing a schema migration. I'm going to save that for another day, for now I just want a way to delete all the data with the AWS CLI. Looking at the list of methods, I don't see an equivalent to SQL's
TRUNCATE TABLE
statement, and it doesn't look like it exists. I'm now very curious what will happen to the data, so I'm going to update the table definition in Terraform and YOLO it.Oh.
The table schema isn't changing, only how I store the data.
š”!!!
Ok, this makes sense. We aren't changing the table schema, only the data we are storing in it. I feel like I need to generate on of those 300 memes with "THIS IS NOSQL!!!!!". To clear out the "old" data, I can either manually click to delete them in batches via the AWS Console, write some extra code, or drop and recreate the table - dropping the table since this is Sparta!!! :P
After another
terraform apply
, the code works! Here's the first few lines of output:After the back and forth to get the table structure to store my data, and then that last change of the values of the
pk
and sk
for the data that didn't change the table schema, I think I'm starting to think about the data structures in the right way. And repeating the recap from the first piece, the overall process of trial and error by patching different solutions together is different. Instead of the searching and multiple tabs, I now rarely leave my IDE.I've added a todo for myself to read up on how to handle migrating existing data with DynamoDB, think that will be very interesting. In the past, I've done many large SQL based data migrations, and I honestly have no idea how I would do it for this app other than code to handle it. Probably would need to have that in a function, call it from the main method, and do another deployment to delete it. Looks like I have my next article topic after this series is done.
While generative-ai has helped speed up creating applications, you do need to take into account that it is very literal. Looking at the response when I asked it for the AWS CLI command to delete all items in a DynamoDB table, it still provided a technically accurate and feasible solution, just not one you would really use.
As for the previous piece, I've tagged this with
article-2
, and pushed the tag. Source code for this article.I've been adding the prompts at the bottom of each piece so far, and I'll keep doing it for the rest of this series as I'm trying to simulate watching and listening to someone live code, but in written form. The intent is to share the thought process instead of just the final, polished solution. This will still be turned into a single solution of how to do it from start to finish without the back and forth, but for that, I first need to finish up all the parts still waiting for me. Next up, I'll update the function so I can pass the parameters that are currently hard-coded via defaults for the
read_data
function. At least, that is the plan, right after I double check the message I had after pushing this code:Stay tuned for the next installment of this series...
Initially I used provisioned capacity for my DynamoDB table, along with logic to handle the retries for me, here is the code that was removed with this logic:
Ā
Can you help me define the table structure for DynamoDB using Terraform, I'm trying to store the following data? My data consists of a collection of objects, each with a BulletinDate for the month/year the data is from, and then split into 2 categories for the filing type called "priority date" and "filing date". For each of these filings, I will have a collection of data defined by 2 dimensions: Country, and Category. So for a single BulletinDate, I would have a collection of data using Country and Category as the key, with PriorityDate as the value, and the collection needs to be grouped by FilingType for "priority date" and "filing date". Can you please show me how to create this table?
Response:
Sure, I can help you define the table structure for DynamoDB using Terraform. Based on your data structure, we can create a single table with a composite primary key consisting of a partition key and a sort key. [1]
Here's how you can define the table structure in your Terraform code:
In this structure, we're using a composite primary key with a partition key (
pk
) and a sort key (sk
). Both keys are defined as strings (S
).To store your data, you can use the following format for the primary key:
For example, if you have a
BulletinDate
of "2023-05", a FilingType
of "priority_date", a Country
of "India", and a Category
of "EB3", the primary key would be: [2]With this structure, you can store the
PriorityDate
value as an attribute in the DynamoDB item.You can also consider creating a Global Secondary Index (GSI) on the
pk
and sk
attributes to allow efficient querying based on different combinations of the primary key attributes.Note: Make sure to replace the placeholders (
<BulletinDate>
, <FilingType>
, <Country>
, and <Category>
) with your actual values when constructing the primary key.Sources:
Thanks, given that I have my source data added to a flat collection with the code below, can you please provide the python code to store this data into the table above? Here's how I store the data in my collection:
Response:
Sure, I can provide you with the Python code to store the data in the DynamoDB table using the structure we defined earlier. Here's an example: [1]
Here's how the
store_data
function works:- We iterate over each item in the
employment_based_data
list. - For each item, we extract the
filing_type
,country
,category
,bulletin_date
, anddate
values. - We construct the partition key (
pk
) using theBULLETIN_DATE
prefix and thebulletin_date
in the formatYYYY-MM
. - We construct the sort key (
sk
) using theFILING_TYPE
,filing_type
,COUNTRY
,country
, andCATEGORY
,category
values. - We use the
put_item
method of the DynamoDB table to store the item with the constructed primary key (pk
andsk
) and the other attributes (filing_type
,country
,category
,bulletin_date
, anddate
). [2]
Note that we're using the
datetime
module to convert the bulletin_date
and date
strings to datetime
objects, and then back to strings in the desired format (YYYY-MM
for bulletin_date
and YYYY-MM-DD
for date
).After running this code, your data should be stored in the DynamoDB table with the defined structure. [3]
Sources:
Can you update this code to be able to handle exceeding the write capacity for my table with some kind of graceful backoff and retry logic?
Response:
Sure, I can update the code to handle exceeding the write capacity for your DynamoDB table by implementing a backoff and retry logic. Here's an example of how you can modify the
store_data
function to achieve this:Here's what we've added to the
store_data
function:- We define the maximum number of retries (
max_retries
) and the initial retry delay (retry_delay
) in seconds. - We also define the maximum retry delay (
max_delay
) in seconds to prevent excessive delays. - We wrap the
put_item
operation in afor
loop that runs up tomax_retries
times. - If the
ProvisionedThroughputExceededException
is raised, indicating that the write capacity has been exceeded, we apply an exponential backoff with jitter before retrying the write operation. - The exponential backoff delay is calculated as
retry_delay * (2 ** attempt) + random.uniform(0, 1)
, whereattempt
is the current retry attempt. This introduces a random jitter to prevent multiple clients from retrying at the same time. - The calculated delay is capped at
max_delay
to prevent excessive delays. - If the maximum number of retries is reached, the
ProvisionedThroughputExceededException
is raised, allowing you to handle it as needed. - For any other exceptions, we print an error message and break out of the retry loop.
With this implementation, the code will retry writing to the DynamoDB table if the write capacity is exceeded, with an exponential backoff and jitter to avoid overwhelming the table with retries. If the maximum number of retries is reached, the exception is raised, allowing you to handle it appropriately (e.g., logging, retrying later, etc.).
Note that this implementation assumes that you're using the provisioned capacity mode for your DynamoDB table. If you're using the on-demand capacity mode, you don't need to handle the
ProvisionedThroughputExceededException
specifically, as DynamoDB will automatically handle the scaling for you.Can you explain the difference between the provisioned capacity and on-demand for DynamoDB, and include how it would affect accessing my data?
Response:
Certainly! Amazon DynamoDB offers two capacity modes: provisioned capacity and on-demand. The choice between these modes affects the way you manage capacity and pay for the service. Here's an explanation of the differences and how they impact costs: [1]
Provisioned Capacity Mode:
- In this mode, you specify the number of read and write capacity units you expect your application to require.
- You are billed for the provisioned capacity, regardless of whether you consume it or not.
- Provisioned capacity is suitable for workloads with predictable or consistent traffic patterns, where you can forecast your capacity requirements. [2]
- You can enable auto scaling to automatically adjust the provisioned capacity based on your utilization, helping to optimize costs and performance.
- If your application exceeds the provisioned capacity, you may experience throttling, which can be mitigated by increasing the provisioned capacity or enabling auto scaling.
On-Demand Capacity Mode:
- In this mode, you don't need to provision any capacity upfront.
- You pay for the data reads and writes your application performs, based on the number of read and write request units consumed.
- On-demand capacity is ideal for workloads with unpredictable or [3]
Sources:
Thanks, can you show me how to update the read_data function using this new table schema so that I can pass filing_type, category, and country to return a sorted list? It should use the date bulletin_date for the sort.
Response:
Sure, here's how you can update the
read_data
function to retrieve and sort the data based on the filing_type
, category
, and country
parameters, using the bulletin_date
for sorting:Here's how the
read_data
function works:- We construct the partition key (
pk
) using theFILING_TYPE
,filing_type
, andCATEGORY
,category
values. - We construct the sort key prefix (
sk_prefix
) using theCOUNTRY
andcountry
values. - We use the
query
method of the DynamoDB table to retrieve items where the partition key (pk
) matches the constructed value, and the sort key (sk
) begins with the constructed prefix (sk_prefix
). - We set the
ScanIndexForward
parameter toFalse
to reverse the order of the results, so that the latestbulletin_date
is returned first. - We store the retrieved items in the
items
variable. - We sort the
items
list in descending order based on thebulletin_date
attribute using thesorted
function and a lambda function as thekey
. - We return the sorted list of items.
To use the
read_data
function, you can call it with the desired filing_type
, category
, and country
parameters:The
sorted_data
variable will contain a list of items sorted in descending order by bulletin_date
, filtered by the specified filing_type
, category
, and country
.Note that the
query
operation in DynamoDB is efficient for retrieving data based on the primary key (partition key and sort key). If you need to filter the data based on other attributes, you may need to consider using a Global Secondary Index (GSI) or performing a scan operation, which can be less efficient for large datasets.Thanks, can you now show me how I would use the returned data to print out a line per entry, with the values from the input parameters? It should have the following first (with the [] around the values for the variables): Retrieved the [Final Date] data for [EB3] for [All Chargeability Areas Except Those Listed]:
And then a line per entry with this format: Bulletin 2024-07-01: 2021-11-01
The code provided by this response isn't quite correct, it is using the previous table schema for thepk
andsk
, so I just copied the lines fromstore_data
to replace them.
Response: Certainly! Here's how you can modify the
read_data
function to print the desired output format:In this updated version of the
read_data
function, we've added the following changes:- After sorting the items, we print the header line using the
print
function and an f-string. The header line will be in the format:Retrieved the [Final Date] data for [EB3] for [All Chargeability Areas Except Those Listed]: - We then iterate over the
sorted_items
list using afor
loop. - For each item, we extract the
bulletin_date
anddate
values. - We print a line for each item using the
print
function and an f-string. The line will be in the format:Bulletin 2024-07-01: 2021-11-01
To use the updated
read_data
function, you can call it with the desired filing_type
, category
, and country
parameters:The output will be printed to the console, with the header line followed by a line for each item in the sorted list, displaying the
bulletin_date
and date
values.Note that this implementation assumes that you have the necessary data in your DynamoDB table with the expected attributes (
bulletin_date
and date
). If the data structure or attribute names are different, you'll need to adjust the code accordingly.This doesn't look correct, can you please recheck as earlier in the conversation we defined pk and sk for the table when storing data using this:
Response:
You're absolutely right, my apologies. I should have used the correct primary key structure that we defined earlier. Here's the updated
read_data
function with the correct primary key construction:Here are the changes:
- We define the
pk_prefix
as"BULLETIN_DATE#"
to match the partition key structure we defined earlier. - We construct the
sk_prefix
using thefiling_type
,country
, andcategory
values, following the sort key structure we defined earlier. - In the
KeyConditionExpression
of thequery
method, we use thebegins_with
condition for both the partition key (pk
) and the sort key (sk
) to retrieve items where thepk
begins with"BULLETIN_DATE#"
and thesk
begins with the constructedsk_prefix
.
With these changes, the
read_data
function should now correctly retrieve and print the data based on the primary key structure we defined earlier.Thank you for catching my mistake. Please let me know if this updated version looks correct.
When I use this code, it errors with:
Response:
Ah, I see the issue now. The
ValidationException
is occurring because the KeyConditionExpression
we're using is not supported for the query
operation in DynamoDB. [1]The
query
operation in DynamoDB requires the partition key to be an equality condition, and the sort key can either be an equality condition or a range condition (using comparison operators or the begins_with
function).To fix this issue, we need to modify the
read_data
function to use the partition key as an equality condition and the sort key as a range condition. Here's the updated code:Here are the changes:
- We construct the partition key (
pk
) using theBULLETIN_DATE
prefix and thebulletin_date
in the formatYYYY-MM
. - In the
KeyConditionExpression
of thequery
method, we use theeq
condition for the partition key (pk
) to match the constructedpk
value exactly. - For the sort key (
sk
), we use thebegins_with
condition with the constructedsk_prefix
.
By using the
eq
condition for the partition key and the begins_with
condition for the sort key, we should be able to query the data correctly without encountering the ValidationException
.Note that you'll need to pass the
bulletin_date
parameter to the read_data
function, or you can use a default value if you want to retrieve the latest data by default.Please let me know if this updated code works for you or if you have any further questions.
Sources:
Ok, given the requirement for an equality condition on the primary key, can you suggest what data we should use for the primary key given this constraint? I want to be able to retrieve the data for a specific filing_type, category, and country.
Response:
Given the requirement for an equality condition on the partition key, we can modify the primary key structure to accommodate retrieving data for a specific
filing_type
, category
, and country
.Here's a suggested primary key structure:
With this structure, the partition key (
pk
) will be a composite key consisting of the filing_type
, category
, and country
values, separated by a delimiter (#
). The sort key ( sk
) will be the bulletin_date
.Here's how you can update the
store_data
and read_data
functions to use this new primary key structure:In the
store_data
function:- We construct the partition key (
pk
) using theFILING_TYPE
,filing_type
,CATEGORY
,category
, andCOUNTRY
,country
values. - We construct the sort key (
sk
) using theBULLETIN_DATE
prefix and thebulletin_date
value in the formatYYYY-MM-DD
.
In the
read_data
function:- We construct the partition key (
pk
) using thefiling_type
,category
, andcountry
values passed as arguments. - We use the
query
method with theKeyConditionExpression
set toKey('pk').eq(pk)
to retrieve items where the partition key matches the constructedpk
value exactly. - We sort the retrieved items by the sort key (
sk
) in descending order. - When printing the data, we extract the
bulletin_date
from the sort key (sk
) by splitting it on the#
delimiter and taking the second part.
With this primary key structure, you can retrieve data for a specific
filing_type
, category
, and country
combination by passing those values to the read_data
function:This approach ensures that you can query the data efficiently using the partition key equality condition, while still being able to retrieve data for specific combinations of
filing_type
, category
, and country
.Please note that if you need to query the data based on different combinations of these attributes, you may need to consider creating additional Global Secondary Indexes (GSIs) to optimize those queries.
Thanks, what is the easiest way to delete all my data from my DynamoDB table since the schema has changed? If I can do it with the AWS CLI, can you show me the command please?
Response:
To delete all data from your DynamoDB table using the AWS CLI, you can use the
scan
operation to retrieve all items in the table, and then use the batch-write-item
operation to delete them in batches[1].Here's the command to scan and retrieve all items from the table:
This command will scan the
VisaBulletinData
table and store the retrieved items in a JSON file named items.json
.Next, you can use the following Python script to delete the items in batches using the
batch-write-item
operation:Here's how the script works:
- It loads the items from the
items.json
file created by thescan
operation. - It defines a
batch_size
for thebatch-write-item
operation. In this example, it's set to 25, but you can adjust it based on your requirements. - It iterates over the items in batches of
batch_size
. - For each batch, it uses the
batch_writer
context manager to perform thebatch-write-item
operation. - Inside the
batch_writer
context, it iterates over the items in the batch and calls thedelete_item
method with the primary key (pk
andsk
) of each item. - After each batch is processed, it prints a message indicating the batch number.
- Finally, it prints a message indicating that all items have been deleted from the table.
Make sure to replace
'VisaBulletinData'
with the name of your DynamoDB table.Note: This approach assumes that your table doesn't have any Global Secondary Indexes (GSIs). If you have GSIs, you'll need to delete the data from the GSIs as well, or consider recreating the table if it's more convenient.
Sources:
Does DynamoDB have the equivalent of SQL's "Truncate table"?
Response:
Yes, DynamoDB does have an equivalent of SQL's "TRUNCATE TABLE" operation. It's called the
DeleteTable
operation, which deletes the entire table and all its data. [1]Here's how you can use the AWS CLI to delete (truncate) a DynamoDB table:
Replace
VisaBulletinData
with the name of your DynamoDB table.This command will delete the entire table, including all its data and any Global Secondary Indexes (GSIs) associated with the table.
After running this command, the table will be in the
DELETING
state for a short period, and then it will be completely removed from DynamoDB.If you want to recreate the table with the same name and structure after deleting it, you'll need to wait for the
DeleteTable
operation to complete before creating a new table with the same name.Alternatively, you can use the AWS Management Console, AWS SDKs, or the DynamoDB API to perform the
DeleteTable
operation. [2]It's important to note that the
DeleteTable
operation is an irreversible operation, and all data in the table will be permanently deleted. If you need to keep a backup of the data, you should export it before running the DeleteTable
operation.Sources:
Thanks, can you provide me with the updated Terraform to create this table please? Also change it to use on-demand pricing instead of provisioned.
Interesting thatttl
andtags
suddenly pop up in this response, I'm going to strip that out as I don't need or want that.
Response:
Sure, here's the updated Terraform code to create the DynamoDB table with the new schema and using on-demand pricing: [1]
Here's what the code does:
- It defines an
aws_dynamodb_table
resource with the namevisa_bulletin_data
. - The
name
attribute sets the name of the DynamoDB table toVisaBulletinData
. - The
billing_mode
is set toPAY_PER_REQUEST
, which enables on-demand pricing for the table. [2] - The
hash_key
andrange_key
attributes define the partition key (pk
) and sort key (sk
) for the table, respectively. - The
attribute
blocks define the attributes for the partition key (pk
) and sort key (sk
), both of which are strings (S
). - The
ttl
block is used to configure the Time to Live (TTL) feature for the table. In this case, it's disabled by settingenabled
tofalse
. - The
tags
block defines two tags for the DynamoDB table:Name
andEnvironment
.
After applying this Terraform code, it will create a new DynamoDB table named
VisaBulletinData
with the specified primary key structure ( pk
and sk
) and using on-demand pricing.Note that you'll need to have the AWS provider configured in your Terraform code, and you'll need to run
terraform init
and terraform apply
to create the DynamoDB table.Also, keep in mind that with on-demand pricing, you'll be charged for the read and write requests made to the table, rather than provisioning read and write capacity units upfront. You can refer to the AWS DynamoDB pricing documentation for more details on the on-demand pricing model.
Sources:
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.