![Adding parameters to my Serverless Web Scraper API [part 3]](/_next/image?url=https%3A%2F%2Fassets.community.aws%2Fa%2F2il02zbfl1qAW0AdeunsL7GUnfL%2Fauto.webp%3FimgSize%3D544x146&w=1200&q=75)
Adding parameters to my Serverless Web Scraper API [part 3]
Third installment of the series where I build a serverless API to scrape the historic dates from the USCIS monthly visa bulletin pages.
/visa-bulletin-for-, and then do a lookup in the DynamoDB table ProcessedURLs to see if that URL was processed. Only the data from pages that are not in that database table will be scraped and added to a single collection, and then that whole collection will be stored. We need to update the code so that it stores the data as it scrapes each URL, and before writing the processed entry - as it stands, a page that hit an error will not be reprocessed.
lambda_handler
for the loop where we process and store the data, updated version is now:
# Scrape data from each visa bulletin page
for link in visa_bulletin_links:
    if '2022' in link or '2023' in link or '2024' in link:
        # Check if the URL has been processed
        response = processed_urls_table.get_item(Key={'url': link})
        if 'Item' in response:
            print(f"Skipping URL: {link} (already processed)")
            continue
        # Process the URL
        print(f"Processing URL: {link}")
        url = f"https://travel.state.gov{link}"
        url_data = scrape_visa_bulletin(url)
        data.extend(url_data)
        # Store the data
        store_data(url_data)
        # Store the processed URL in DynamoDB
        processed_urls_table.put_item(Item={'url': link})
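The ordering in that loop is the important part: the data is stored first, and the URL is only recorded as processed afterwards, so a page that fails is picked up again on the next run. A minimal sketch of the same idea, assuming in-memory stand-ins for the two tables (`process_links`, `fake_scrape`, `stored_rows`, and `processed_links` are hypothetical names, not part of the project's code):

```python
def process_links(links, scrape, store, processed):
    """Store each page's data before marking the page as processed."""
    failed = []
    for link in links:
        if link in processed:
            continue  # already handled in an earlier run
        try:
            rows = scrape(link)
            store(rows)          # persist the rows first...
            processed.add(link)  # ...and only then record the URL
        except Exception as exc:  # broad catch, for the sketch only
            print(f"Failed {link}: {exc}")
            failed.append(link)
    return failed


# Hypothetical stand-ins for the scraper and the two DynamoDB tables
stored_rows = []
processed_links = set()


def fake_scrape(link):
    if "bad" in link:
        raise ValueError("could not parse page")
    return [{"url": link}]


failed_links = process_links(["/a", "/bad", "/b"], fake_scrape,
                             stored_rows.extend, processed_links)
```

Because `/bad` never reaches `processed.add(link)`, it stays eligible for the next run.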
enum would work, but I don’t know how this is done in Python:
def read_data_locally(data, filing_type = 'Final Date', category = '3rd', country = 'All Chargeability Areas Except Those Listed'):
from enum import Enum

class FilingType(Enum):
    FINAL_DATE = 'Final Date'
    DATES_FOR_FILING = 'Dates for Filing'

class Category(Enum):
    FIRST = '1st'
    SECOND = '2nd'
    THIRD = '3rd'
    FOURTH = '4th'
    OTHER_WORKERS = 'Other Workers'

class Country(Enum):
    ALL_AREAS = 'All Chargeability Areas Except Those Listed'
    CHINA = 'CHINA-mainland born'
    INDIA = 'INDIA'
    MEXICO = 'MEXICO'
    PHILIPPINES = 'PHILIPPINES'
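For anyone else new to Python enums, here is a small sketch of the accessors that matter for this project: the member name, the stored value, and lookups in both directions. It uses a cut-down copy of the `Category` enum so that it runs on its own.

```python
from enum import Enum


class Category(Enum):
    THIRD = '3rd'
    OTHER_WORKERS = 'Other Workers'


member = Category.THIRD
by_name = Category['THIRD']   # lookup by member name
by_value = Category('3rd')    # lookup by stored value
as_text = str(member)         # qualified name, not the stored value

print(member.name, member.value, as_text)
```

Note that `str()` on a plain `Enum` member gives `Category.THIRD`, so `.value` is the one to use when comparing against the strings stored in the data.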
PHILIPPINES for me, I’m assuming it picked it up from the context of the linked URLs - I did a reboot of my laptop this morning and started a new chat, so I don’t think it would be previous chat context, but I could be mistaken:
.value.
python3 local_test.py, which calls the handler.py code via:
mock_context = MockContext()
mock_event = {
"key1": "value1",
"key2": "value2",
# Add any other relevant data for your event
}
result = lambda_handler(mock_event, mock_context)
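As a rough, self-contained sketch of what such a local test file could look like: the `MockContext` class below is a hypothetical stand-in that mimics a few attributes of the real Lambda context object, and the `lambda_handler` here is only a placeholder so the example runs without the project's code.

```python
class MockContext:
    """Hypothetical stand-in for the Lambda context object."""
    function_name = "local-test"
    memory_limit_in_mb = 128
    aws_request_id = "local-request-id"

    def get_remaining_time_in_millis(self):
        return 30000


def lambda_handler(event, context):
    # Placeholder handler; the real one lives in handler.py
    params = event.get('queryStringParameters') or {}
    return {'statusCode': 200, 'body': params}


mock_event = {'queryStringParameters': {'category': 'THIRD'}}
result = lambda_handler(mock_event, MockContext())
print(result)
```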
event seems reasonable:
mock_event = {
'queryStringParameters': {
'filing_type': 'FINAL_DATE',
'category': 'THIRD',
'country': 'ALL_AREAS'
}
}
query_params = event.get('queryStringParameters', {})
filing_type_str = query_params.get('filing_type', 'FINAL_DATE')
country_str = query_params.get('country', 'ALL_AREAS')
category_str = query_params.get('category', 'THIRD')
get() method works, this does appear to be the correct way - that 2nd parameter is a default if it doesn’t find the value in the input. Having it as a string bugs me - if I ever change the name of the enum, this will not work. After a bit of back and forth (1, 2, 3) I had the following:
query_params = event.get('queryStringParameters', {})
filing_type_str = query_params.get('filing_type', FilingType.FINAL_DATE.name)
country_str = query_params.get('country', Country.ALL_AREAS.name)
category_str = query_params.get('category', Category.THIRD.name)
# Convert string values to enum members
filing_type = FilingType[filing_type_str.upper()]
country = Country[country_str.upper()]
category = Category[category_str.upper()]
...
# Updated call to read_data() with the parameters
eb3_data = read_data(filing_type, category, country)
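Two edge cases are worth a thought with this approach: a caller sending a name that is not in the enum, and an event where `queryStringParameters` is present but null (which, as far as I know, some event sources do when no parameters are sent; treat that as an assumption to verify). A hedged sketch of a helper that handles both follows; `parse_enum_param` is a hypothetical name, and the enum is a cut-down copy so the example is self-contained.

```python
from enum import Enum


class FilingType(Enum):
    FINAL_DATE = 'Final Date'
    DATES_FOR_FILING = 'Dates for Filing'


def parse_enum_param(event, key, enum_cls, default):
    """Look up the enum member named by a query string parameter."""
    # `or {}` also covers the key being present with a None value
    params = event.get('queryStringParameters') or {}
    raw = params.get(key, default.name)
    try:
        return enum_cls[raw.upper()]
    except KeyError:
        valid = ', '.join(member.name for member in enum_cls)
        raise ValueError(f"Invalid {key} '{raw}', expected one of: {valid}") from None


from_default = parse_enum_param({'queryStringParameters': None}, 'filing_type',
                                FilingType, FilingType.FINAL_DATE)
from_event = parse_enum_param({'queryStringParameters': {'filing_type': 'dates_for_filing'}},
                              'filing_type', FilingType, FilingType.FINAL_DATE)
```

The handler could catch the `ValueError` and return a 400-style response instead of letting the function fail.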
read_data() and read_data_locally(). Running the code locally returns the data as expected; I also checked with 'category': 'SECOND', just to make sure.
lambda_handler(event, context), and then set up a 2nd Lambda function using the same Lambda layer for the requirements. It does mean I will include some libraries not needed by each function, as the layer will contain all the ones across both functions, but I’m fine with that approach. Splitting this into 2 different projects, or even just splitting the requirements.txt file, feels even more over the top than we already are.
scraper.py and retriever.py, and then to have a lambda_handler function in each. I’m tempted to just create a 2nd function inside the 1 file, but let’s go with the suggestion and split it into those 2 files. After I rename handler.py to scraper.py and create a copy named retriever.py, I realise we will need to split it further. At the top of the code, we set the tables via table = dynamodb.Table('VisaBulletinData') and processed_urls_table = dynamodb.Table('ProcessedURLs'), and we also have the enums defined as classes, and both of the functions need them. I look up how I would do that, and merrily follow the suggestion.
enums.py, and add the import statement, I stare at the import line in scraper.py for a few seconds:
If all you have is a hammer, then all your problems look like a nail.
handler.py. Pasting all the Terraform resources again would take up quite a bit of space, but you can look at the current version at this point in time. We can keep the following resources as-is:
- resource "null_resource" "pip_install" - we aren’t splitting requirements.txt per source file, so a single one stays the same.
- data "archive_file" "layer" - used to trigger updating the Lambda layer we create for our dependencies.
- resource "aws_lambda_layer_version" "layer" - creates the Lambda layer.
- data "archive_file" "app" combined with the line source_code_hash = data.archive_file.app.output_base64sha256 to ensure we update the functions for any code changes. This will trigger for both functions even if we only update one of them, but I don’t feel it is worth the effort for this project.
scraper.py function:
# Define the Lambda function
resource "aws_lambda_function" "visa_bulletin_scraper" {
  function_name    = "visa-bulletin-scraper"
  handler          = "scraper.lambda_handler" # <--- I've updated the handler's filename as well
  runtime          = "python3.12"
  filename         = data.archive_file.app.output_path
  source_code_hash = data.archive_file.app.output_base64sha256
  role             = aws_iam_role.lambda_role.arn
  layers           = [aws_lambda_layer_version.layer.arn]
  environment {
    variables = {
      BULLETIN_DATA           = aws_dynamodb_table.visa_bulletin_data.name,
      PROCESSED_BULLETIN_URLS = aws_dynamodb_table.processed_urls.name
    }
  }
}
scraper.py to the following to use the environment variables, with a fallback value:
import os # <--- new import added
...
# Initialize DynamoDB client
dynamodb = boto3.resource('dynamodb')
table_name = os.environ.get('BULLETIN_DATA', 'VisaBulletinData')
table = dynamodb.Table(table_name)
processed_urls_table_name = os.environ.get('PROCESSED_BULLETIN_URLS', 'ProcessedURLs')
processed_urls_table = dynamodb.Table(processed_urls_table_name)
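One detail of `os.environ.get()` that is easy to miss: the fallback is only used when the variable is missing entirely, so a variable that is set to an empty string comes back as an empty string. The sketch below shows the difference; `table_name_from_env` is a hypothetical helper that treats an empty value as unset, which may or may not be what you want.

```python
import os


def table_name_from_env(var, fallback):
    # `or` treats an empty string the same as a missing variable;
    # os.environ.get(var, fallback) alone would return '' in that case
    return os.environ.get(var) or fallback


os.environ.pop('BULLETIN_DATA', None)
when_unset = table_name_from_env('BULLETIN_DATA', 'VisaBulletinData')

os.environ['BULLETIN_DATA'] = ''
when_empty = table_name_from_env('BULLETIN_DATA', 'VisaBulletinData')
plain_get = os.environ.get('BULLETIN_DATA', 'VisaBulletinData')

os.environ['BULLETIN_DATA'] = 'MyTable'
when_set = table_name_from_env('BULLETIN_DATA', 'VisaBulletinData')
```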
retriever.py, we only need to add the VisaBulletinData via BULLETIN_DATA.
retriever.py, along with its own IAM policy and role. While doing this, I notice that past-Cobus was lazy with the resource names:
resource "aws_iam_role" "lambda_role" {
  name = "visa-bulletin-scraper-role"
  ...
}
# Attach the necessary IAM policies to the role
resource "aws_iam_policy_attachment" "lambda_basic_execution" {
  name       = "lambda_basic_execution"
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
  roles      = [aws_iam_role.lambda_role.name]
}
resource "aws_iam_policy_attachment" "dynamodb_access" {
  name       = "dynamodb_access"
  policy_arn = aws_iam_policy.dynamodb_access_policy.arn
  roles      = [aws_iam_role.lambda_role.name]
}
# Define the IAM policy for DynamoDB access
resource "aws_iam_policy" "dynamodb_access_policy" {
  ...
}
retriever.py. It is also a good time to split out our Terraform resources into smaller files to make it easier to find each one of them. I decided to split them from the single app.tf into the following:
- dynamodb.tf - defines the 2 tables we use.
- lambda_layer.tf - builds and creates the Lambda layer used by both functions.
- lambda_zip.tf - zips up all the application code into a single .zip used by both functions.
- lambda_scraper.tf - defines the Lambda function, IAM policy, and IAM role for scraper.py.
- lambda_retriever.tf - defines the Lambda function, IAM policy, and IAM role for retriever.py.
terraform apply, the following error is returned:
│ Error: creating IAM Role (visa-bulletin-scraper-role): operation error IAM: CreateRole, https response error StatusCode: 409, RequestID: 94248aee-5615-4fbf-936e-81ceb9c24f0f, EntityAlreadyExists: Role with name visa-bulletin-scraper-role already exists.
│
│ with aws_iam_role.scraper_role,
│ on lambda_scraper.tf line 20, in resource "aws_iam_role" "scraper_role":
│ 20: resource "aws_iam_role" "scraper_role" {
│
╵
╷
│ Error: creating IAM Policy (visa-bulletin-scraper-dynamodb-access): operation error IAM: CreatePolicy, https response error StatusCode: 409, RequestID: da27e143-5a7e-4008-81a8-49bf23fb99d8, EntityAlreadyExists: A policy called visa-bulletin-scraper-dynamodb-access already exists. Duplicate names are not allowed.
│
│ with aws_iam_policy.scraper_dynamodb_access_policy,
│ on lambda_scraper.tf line 53, in resource "aws_iam_policy" "scraper_dynamodb_access_policy":
│ 53: resource "aws_iam_policy" "scraper_dynamodb_access_policy" {
│
╵
aws_iam_role resource from lambda_role to scraper_role, we ran into a race condition where the delete for the old one didn’t complete before the new one’s create started. IAM role names need to be unique, and this is why we encountered this issue. Running terraform apply a second time will fix this - worth keeping in mind if you ever run into this after doing a cleanup.
Apply complete! Resources: 4 added, 1 changed, 0 destroyed.
terraform fmt will format your .tf files in the current directory and fix the indentation. Be kind and ~rewind~ clean up your source files before you commit them (unlike past-Cobus, who also forgot that step).
If you deploy a Lambda function, but you never call it, does it even exist?
aws lambda invoke \
--function-name visa-bulletin-retriever \
--cli-binary-format raw-in-base64-out \
--payload '{"queryStringParameters": {"filing_type": "FINAL_DATE", "category": "THIRD", "country": "ALL_AREAS"}}' \
response.json
{
"StatusCode": 200,
"FunctionError": "Unhandled",
"ExecutedVersion": "$LATEST"
}
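The response above is a good reminder that a StatusCode of 200 only says the invoke request was accepted; the FunctionError field is what signals that the function itself failed. Should you want to do the same check from Python instead of the CLI, a sketch could look like the following. It assumes a boto3 Lambda client is passed in (for example `boto3.client('lambda')`); `invoke_and_check` and `FakeLambdaClient` are hypothetical names, and the fake client with its made-up error message exists only so the example runs without AWS credentials.

```python
import io
import json


def invoke_and_check(lambda_client, function_name, payload):
    """Invoke a function and raise if the function itself reported an error."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        Payload=json.dumps(payload).encode('utf-8'),
    )
    body = json.loads(response['Payload'].read())
    # StatusCode 200 only means the invoke call was accepted
    if 'FunctionError' in response:
        raise RuntimeError(f"{function_name} failed: {body}")
    return body


class FakeLambdaClient:
    """Hypothetical stub that mimics a failed invocation."""
    def invoke(self, FunctionName, Payload):
        error = {'errorMessage': 'example failure'}
        return {'StatusCode': 200, 'FunctionError': 'Unhandled',
                'Payload': io.BytesIO(json.dumps(error).encode('utf-8'))}


try:
    invoke_and_check(FakeLambdaClient(), 'visa-bulletin-retriever',
                     {'queryStringParameters': {}})
    outcome = 'ok'
except RuntimeError as exc:
    outcome = str(exc)
```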
retriever function with the same payload as above, I see the issue:
import statement. After fixing that, I run terraform apply again, and I’m quite curious to see what the change looks like since we’re using Lambda layers. The following change to the IAM role makes me suspicious:
# aws_iam_policy_attachment.retriever_lambda_basic_execution will be updated in-place
~ resource "aws_iam_policy_attachment" "retriever_lambda_basic_execution" {
id = "retriever_lambda_basic_execution"
name = "retriever_lambda_basic_execution"
~ roles = [
- "visa-bulletin-scraper-role",
# (1 unchanged element hidden)
]
# (3 unchanged attributes hidden)
}
# aws_iam_policy_attachment.scraper_lambda_basic_execution will be updated in-place
~ resource "aws_iam_policy_attachment" "scraper_lambda_basic_execution" {
id = "scraper_lambda_basic_execution"
name = "scraper_lambda_basic_execution"
~ roles = [
- "visa-bulletin-retriever-role",
# (1 unchanged element hidden)
]
# (3 unchanged attributes hidden)
}
aws_iam_policy_attachment so I can look at the documentation. Right at the top of the page, it has a big, red warning:
aws_security_group - in the case of aws_iam_policy_attachment, it will play whack-a-mole since we need to attach that policy to 2 different IAM roles. On the first run, one of them will end up succeeding, but on future runs, it will try to attach it again, since whichever one finished last would remove the one that finished first. To fix this, we need to use aws_iam_role_policy_attachment instead:
resource "aws_iam_role_policy_attachment" "retriever_lambda_basic_execution" {
  role       = aws_iam_role.retriever_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}
terraform apply again, only for it to time out after 2 minutes with:
╷
│ Error: reading IAM Role Policy Attachment (visa-bulletin-retriever-role:arn:aws:iam::631077426493:policy/visa-bulletin-retriever-dynamodb-access): empty result
│
│ with aws_iam_role_policy_attachment.retriever_dynamodb_access,
│ on lambda_retriever.tf line 45, in resource "aws_iam_role_policy_attachment" "retriever_dynamodb_access":
│ 45: resource "aws_iam_role_policy_attachment" "retriever_dynamodb_access" {
│
╵
╷
│ Error: reading IAM Role Policy Attachment (visa-bulletin-scraper-role:arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole): empty result
│
│ with aws_iam_role_policy_attachment.scraper_lambda_basic_execution,
│ on lambda_scraper.tf line 40, in resource "aws_iam_role_policy_attachment" "scraper_lambda_basic_execution":
│ 40: resource "aws_iam_role_policy_attachment" "scraper_lambda_basic_execution" {
│
╵
╷
│ Error: reading IAM Role Policy Attachment (visa-bulletin-scraper-role:arn:aws:iam::631077426493:policy/visa-bulletin-scraper-dynamodb-access): empty result
│
│ with aws_iam_role_policy_attachment.scraper_dynamodb_access,
│ on lambda_scraper.tf line 45, in resource "aws_iam_role_policy_attachment" "scraper_dynamodb_access":
│ 45: resource "aws_iam_role_policy_attachment" "scraper_dynamodb_access" {
│
empty result - when you try to attach a policy the 2nd time, it will not return an error if it is already attached. So we have another race condition (I think): the request to remove the attachment (since we replaced the aws_iam_policy_attachment with aws_iam_role_policy_attachment) was made in parallel to the new one, in an order where it didn’t return a successful response for the new one. At least this is what I would speculate without digging into it too much. Regardless, you can get around this issue by just running terraform apply a 2nd time.
aws lambda invoke command again returns:
{
"StatusCode": 200,
"ExecutedVersion": "$LATEST"
}
(END)
ctrl + c or cmd + c. Alternatively, you can add the --no-cli-pager parameter:
aws lambda invoke \
--function-name visa-bulletin-retriever \
--cli-binary-format raw-in-base64-out \
--payload '{"queryStringParameters": {"filing_type": "FINAL_DATE", "category": "THIRD", "country": "ALL_AREAS"}}' \
--no-cli-pager \
response.json
response.json, we can see the data!
{
"statusCode": 200,
"body": [
{
"date": "2021-12-01",
"filing_type": "Final Date",
"bulletin_date": "2024-07-01",
"category": "3rd",
"sk": "BULLETIN_DATE#2024-07-01",
"pk": "FILING_TYPE#Final Date#CATEGORY#3rd#COUNTRY#All Chargeability Areas Except Those Listed",
"country": "All Chargeability Areas Except Those Listed"
},
...
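The `pk` and `sk` values in that output follow a simple composite-key pattern: each part is a label followed by its value, joined with `#`. A small sketch of how such keys can be put together, based only on the format visible in the output above (`build_keys` is a hypothetical helper, not the project's code):

```python
def build_keys(filing_type, category, country, bulletin_date):
    """Compose the partition and sort key strings seen in the stored items."""
    pk = f"FILING_TYPE#{filing_type}#CATEGORY#{category}#COUNTRY#{country}"
    sk = f"BULLETIN_DATE#{bulletin_date}"
    return {"pk": pk, "sk": sk}


keys = build_keys("Final Date", "3rd",
                  "All Chargeability Areas Except Those Listed", "2024-07-01")
print(keys["pk"])
print(keys["sk"])
```

Because the three filter values all sit in the partition key, every item returned for one query shares them, which is why repeating them on each row adds nothing.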
bulletin_date along with the date as a sorted list, since all the rest of the data is what we sent in to filter on.
filing_type, category, and country once, and then have a list of key-value pairs for bulletin_date and date. It looks like we can do this by changing the return statement to the following:
...
response = {
'filing_type': filing_type,
'category': category,
'country': country,
'data': data
}
return {
'statusCode': 200,
'body': response
}
read_data is in the format that we need, which is that key-value pair list. To accomplish this, we also need to update the return of read_data to the following:
...
date_list = []
for item in sorted_items:
    date = item['date']
    bulletin_date = item['bulletin_date']
    date_list.append({"bulletin_date": bulletin_date, "date": date})
return date_list
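Putting the two changes together, the response ends up with the filter values once and a list of date pairs. The sketch below shows one way the pieces could fit, including JSON-encoding the body and decoding it again on the calling side. `build_response` is a hypothetical helper, the filter values are passed as plain strings (for example the enum `.value`), the newest-first sort order is an assumption, and the second sample row is made up for illustration.

```python
import json


def build_response(filing_type, category, country, items):
    """Return the filter values once, plus a sorted list of date pairs."""
    # Newest bulletin first; the sort order is an assumption for this sketch
    sorted_items = sorted(items, key=lambda item: item['bulletin_date'], reverse=True)
    date_list = [{'bulletin_date': i['bulletin_date'], 'date': i['date']}
                 for i in sorted_items]
    body = {'filing_type': filing_type, 'category': category,
            'country': country, 'data': date_list}
    return {'statusCode': 200, 'body': json.dumps(body)}


items = [
    {'bulletin_date': '2024-06-01', 'date': '2021-11-22'},  # made-up sample row
    {'bulletin_date': '2024-07-01', 'date': '2021-12-01'},
]
response = build_response('Final Date', '3rd',
                          'All Chargeability Areas Except Those Listed', items)
decoded = json.loads(response['body'])  # what an API consumer would do
```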
terraform apply later, it is deployed. You can also see from the Terraform output below that it only replaced the Lambda function code, not the Lambda layer, as we didn’t add any additional dependencies - json is built into Python:
Terraform will perform the following actions:
# aws_lambda_function.visa_bulletin_retriever will be updated in-place
~ resource "aws_lambda_function" "visa_bulletin_retriever" {
id = "visa-bulletin-retriever"
~ last_modified = "2024-07-03T22:28:49.000+0000" -> (known after apply)
~ source_code_hash = "BmwRVIejGX4B1ual5JdhfulogEDuyTxyW9/3/g95WBE=" -> "eTHf0nHRxTy/Pi4u+tOvpKATsPWqagYcD3rEl0wwwcQ="
tags = {}
# (27 unchanged attributes hidden)
# (4 unchanged blocks hidden)
}
# aws_lambda_function.visa_bulletin_scraper will be updated in-place
~ resource "aws_lambda_function" "visa_bulletin_scraper" {
id = "visa-bulletin-scraper"
~ last_modified = "2024-07-03T22:29:02.000+0000" -> (known after apply)
~ source_code_hash = "BmwRVIejGX4B1ual5JdhfulogEDuyTxyW9/3/g95WBE=" -> "eTHf0nHRxTy/Pi4u+tOvpKATsPWqagYcD3rEl0wwwcQ="
tags = {}
# (27 unchanged attributes hidden)
# (4 unchanged blocks hidden)
}
Plan: 0 to add, 2 to change, 0 to destroy.
body of the response will be JSON-encoded, so if we used this function as an API, we would need to deserialize the body into an object to access the data. The contents of response.json look as follows:
scraper function to auto-update if a new file is found. Actually, what would be really nice is to also include a notification if it finds a new bulletin - today is 3 July 2024, and I’m already refreshing the page to see when the draft of the August bulletin will be published…
article-3, and pushed the tag.
processed_urls_table.put_item?
processed_urls_table.put_item:
for link in visa_bulletin_links:
    if '2022' in link or '2023' in link or '2024' in link:
        # Check if the URL has been processed
        response = processed_urls_table.get_item(Key={'url': link})
        if 'Item' in response:
            print(f"Skipping URL: {link} (already processed)")
            continue
        # Process the URL
        print(f"Processing URL: {link}")
        url = f"https://travel.state.gov{link}"
        url_data = scrape_visa_bulletin(url)
        data.extend(url_data)
        # Store the data
        store_data(url_data)
        # Store the processed URL in DynamoDB
        processed_urls_table.put_item(Item={'url': link})
store_data(data)
- After scraping the visa bulletin data from the URL using scrape_visa_bulletin(url), the scraped data is stored in a new variable url_data.
- The url_data is then appended to the data list using data.extend(url_data).
- A new line store_data(url_data) is added to store the data scraped from the current URL.
processed_urls_table.put_item(Item={'url': link}).
store_data function is defined elsewhere in your code and correctly handles storing the data in the desired location (e.g., a database, a file, etc.).
filing_type, category, and country?
enum module, which was introduced in Python 3.4. It allows you to define a set of symbolic names (members) bound to unique, constant values. This can make your code more readable and less prone to errors caused by typos or invalid values. [1]
enum module to define the possible values for filing_type, category, and country in your read_data_locally function:
from enum import Enum

class FilingType(Enum):
    FINAL_DATE = 'Final Date'
    DATES_FOR_FILING = 'Dates for Filing'

class Category(Enum):
    FIRST = '1st'
    SECOND = '2nd'
    THIRD = '3rd'
    OTHER_WORKERS = 'Other Workers'

class Country(Enum):
    ALL_AREAS = 'All Chargeability Areas Except Those Listed'
    CHINA_MAINLAND = 'China-mainland born'
    EL_SALVADOR = 'El Salvador, Guatemala, Honduras'
    # Add more countries as needed

def read_data_locally(data, filing_type=FilingType.FINAL_DATE, category=Category.THIRD, country=Country.ALL_AREAS):
    # Your function logic here
    pass
FilingType, Category, and Country. Each class has members representing the valid values for that category.
read_data_locally, you can now use the enum members instead of strings, like filing_type=FilingType.FINAL_DATE. This has several advantages:
- Type safety: The function will only accept values from the defined enums, preventing typos or invalid values.
- Readability: The code becomes more self-documenting and easier to understand.
- Consistency: You ensure that the same values are used throughout your codebase.
filtered_data code?
filtered_data code, you can use the value attribute of the enum member. Calling str() on a plain Enum member returns its qualified name (for example 'Category.THIRD') rather than the stored value, so it will not match the strings in the data.
read_data_locally function to filter the data based on the string values of the enum members:
from operator import itemgetter

def read_data_locally(data, filing_type=FilingType.FINAL_DATE, category=Category.THIRD, country=Country.ALL_AREAS):
    # Filter the data based on filing_type, category, and country
    filtered_data = [entry for entry in data
                     if entry['filing_type'] == filing_type.value
                     and entry['category'] == category.value
                     and entry['country'] == country.value]
    # Sort the filtered data in descending order by bulletin_date
    sorted_data = sorted(filtered_data, key=itemgetter('bulletin_date'), reverse=True)
    # Print the sorted data
    for entry in sorted_data:
        print(f"Bulletin Date: {entry['bulletin_date']}, Date: {entry['date']}")
    return sorted_data
- For filing_type, we use filing_type.value to get the string value of the enum member.
- For category, we use category.value to get the string value of the enum member.
- For country, we use country.value to get the string value of the enum member.
filtered_data list will contain only the entries where the filing_type, category, and country values match the string values of the corresponding enum members.
read_data_locally function, you can still use the enum members directly, like read_data_locally(data, filing_type=FilingType.FINAL_DATE, category=Category.THIRD, country=Country.ALL_AREAS). By using the value attribute, you can access the string value of the enum members, which is necessary for filtering the data based on the string values stored in the dictionary entries.
lambda_handler
- I want to be able to inject it into the mock_event object for local testing, but also be able to call the Lambda using query string parameters? It should then insert the values from the event when calling read_data().
lambda_handler, you can modify the function to accept the enum values as query string parameters or as part of the event payload. Then, you can use these values when calling read_data().
lambda_handler function to accept the enum values as query string parameters:
import urllib.parse

def lambda_handler(event, context):
    # ... (existing code)
    # Parse query string parameters
    query_params = event.get('queryStringParameters', {})
    filing_type_str = query_params.get('filing_type', 'FINAL_DATE')
    country_str = query_params.get('country', 'ALL_AREAS')
    category_str = query_params.get('category', 'THIRD')
    # Convert string values to enum members
    filing_type = FilingType[filing_type_str.upper()]
    category = Category[category_str.upper()]
    country = Country[country_str.upper().replace(' ', '_')]
    # Call read_data with the enum values
    eb3_data = read_data(filing_type=filing_type, category=category, country=country)
    return {
        'statusCode': 200,
        'body': eb3_data
    }
query_params dictionary is created from the queryStringParameters in the event object.
filing_type_str, category_str, and country_str values are retrieved from the query_params dictionary, with default values provided if the parameters are not present.
FilingType, Category, and Country enums. The upper() method is used to handle case-insensitive values, and the replace(' ', '_') is used to handle spaces in the country string.
read_data function is called with the enum values.
mock_event object with the desired query string parameters:
mock_event = {
'queryStringParameters': {
'filing_type': 'FINAL_DATE',
'category': 'THIRD',
'country': 'ALL_AREAS'
}
}
result = lambda_handler(mock_event, None)
print(result)
https://your-lambda-function.aws/path?filing_type=FINAL_DATE&category=THIRD&country=ALL_AREAS.
lambda_handler function accordingly:
def lambda_handler(event, context):
    # ... (existing code)
    # Parse query string parameters
    query_params = event.get('queryStringParameters', {})
    filing_type_str = query_params.get('filing_type', 'FINAL_DATE')
    country_str = query_params.get('country', 'ALL_AREAS')
    category_str = query_params.get('category', 'THIRD')
    # Convert string values to enum members
    filing_type = FilingType[filing_type_str.upper()]
    category = Category[category_str.upper()]
    country = Country[country_str.upper().replace(' ', '_')]
    # Call read_data with the enum values
    eb3_data = read_data(filing_type=filing_type, category=category, country=country)
    return {
        'statusCode': 200,
        'body': eb3_data
    }
category_str = query_params.get('category', 'THIRD') works? Is the 2nd parameter “THIRD” a default value if it can’t extract it?
category_str = query_params.get('category', 'THIRD') is using the get() method of a dictionary (or dictionary-like object) to retrieve the value associated with the key 'category'.
get() method takes two arguments:
- The key to look up in the dictionary
- An optional default value to return if the key is not found
query_params is likely a dictionary containing query parameters extracted from a URL or form data. The get('category', 'THIRD') part is attempting to retrieve the value associated with the 'category' key from query_params.
'category' key exists in query_params, its associated value will be returned and assigned to category_str. However, if the 'category' key is not present in query_params, the default value 'THIRD' will be used instead and assigned to category_str.
'THIRD' is acting as a default value that will be used for category_str if the 'category' key cannot be found in query_params.
FINAL_DATE with the name of the enum, how would I do that? Would something like FilingType.FINAL_DATE.str() work?
str(). You can simply use FilingType.FINAL_DATE.value to get the string value associated with the FINAL_DATE member of the FilingType enum.
'FINAL_DATE' with the value of the FilingType.FINAL_DATE enum member:
filing_type_str = query_params.get('filing_type', FilingType.FINAL_DATE.value)
'filing_type' key is not present in query_params, the value of FilingType.FINAL_DATE.value (which is 'Final Date') will be used as the default value for filing_type_str.
country_str and category_str assignments:
filing_type_str = query_params.get('filing_type', FilingType.FINAL_DATE.value)
country_str = query_params.get('country', Country.ALL_AREAS.value)
category_str = query_params.get('category', Category.THIRD.value)
value attribute of the enum members, you can directly access the string values associated with each enum member, without needing any additional methods like str().
filing_type = FilingType[filing_type_str.upper()]?
FilingType[filing_type_str.upper()], you need to use the string name of the enum member as the default value, not the value attribute. [1]
'FINAL_DATE' with the name of the FilingType.FINAL_DATE enum member, you can use the following:
filing_type_str = query_params.get('filing_type', 'FINAL_DATE')
'filing_type' key is not present in query_params, the string 'FINAL_DATE' will be assigned to filing_type_str. Then, when you do filing_type = FilingType[filing_type_str.upper()], it will correctly look up the FilingType.FINAL_DATE member.
filing_type_str = query_params.get('filing_type', 'FINAL_DATE')
country_str = query_params.get('country', 'ALL_AREAS')
FINAL_DATE, and instead somehow reference FilingType.FINAL_DATE so that if I decide to rename that specific enum value to e.g. FINAL_DATE_NEW, I would get an error when trying to run the code, as FilingType.FINAL_DATE would no longer exist.
name attribute of the enum member. The name attribute returns the string name of the enum member.
from enum import Enum

class FilingType(Enum):
    FINAL_DATE = 'Final Date'
    # other enum members

default_filing_type = FilingType.FINAL_DATE
filing_type_str = query_params.get('filing_type', default_filing_type.name)
filing_type = FilingType[filing_type_str.upper()]
- We define a default_filing_type variable and assign it the FilingType.FINAL_DATE enum member.
- When getting the filing_type_str from query_params, we use default_filing_type.name as the default value.
FilingType.FINAL_DATE enum member to something else (e.g., FilingType.FINAL_DATE_NEW), you’ll get an error when trying to access FilingType.FINAL_DATE, because it no longer exists.
name attribute of the enum member is a string, so it can be used as the default value for filing_type_str. Later, when you look up the enum member using FilingType[filing_type_str.upper()], it will work as expected, because you’re using the string name of the enum member.
lambda_handler function into two separate functions, you can follow these steps: [1]
resource "aws_lambda_function" "scraper_function" {
  filename      = data.archive_file.scraper_lambda_zip.output_path
  function_name = "uscis-scraper"
  role          = aws_iam_role.lambda_exec_role.arn
  handler       = "scraper.lambda_handler"
  runtime       = "python3.9"
}

resource "aws_lambda_function" "retriever_function" {
  filename      = data.archive_file.retriever_lambda_zip.output_path
  function_name = "uscis-retriever"
  role          = aws_iam_role.lambda_exec_role.arn
  handler       = "retriever.lambda_handler"
  runtime       = "python3.9"
}
scraper.py and retriever.py. scraper.py will contain the code for scraping the USCIS website and storing the data in the database. The lambda_handler function in this file will be responsible for this task. retriever.py will contain the code for retrieving data from the database based on the provided parameters. The lambda_handler function in this file will be responsible for this task.
handler argument for each Lambda function to point to the correct file and function name. [3]
resource "aws_lambda_function" "scraper_function" {
  # ...
  handler = "scraper.lambda_handler"
}
resource "aws_lambda_function" "retriever_function" {
  # ...
  handler = "retriever.lambda_handler"
}
scraper.py and retriever.py:
enums.py in the same directory as scraper.py and retriever.py.
enums.py, define your enum classes:
# enums.py
from enum import Enum

class FilingType(Enum):
    FINAL_DATE = 'Final Date'
    INITIAL_DATE = 'Initial Date'
    # Add other enum members as needed

class Country(Enum):
    ALL_AREAS = 'All Areas'
    CHINA_MAINLAND_BORN = 'China Mainland-born'
    # Add other enum members as needed

class Category(Enum):
    FIRST = 'First'
    SECOND = 'Second'
    THIRD = 'Third'
    # Add other enum members as needed
scraper.py and retriever.py, import the enum classes from enums.py:
# scraper.py
from enums import FilingType, Country, Category
# Your scraper code here
# retriever.py
from enums import FilingType, Country, Category
# Your retriever code here
FilingType, Country, and Category in both scraper.py and retriever.py.
retriever.py, you can use the enum classes like this:
# retriever.py
from enums import FilingType, Country, Category

def lambda_handler(event, context):
    query_params = event.get('queryStringParameters', {})
    filing_type_str = query_params.get('filing_type', FilingType.FINAL_DATE.name)
    filing_type = FilingType[filing_type_str.upper()]
    country_str = query_params.get('country', Country.ALL_AREAS.name)
    country = Country[country_str.upper()]
    category_str = query_params.get('category', Category.THIRD.name)
    category = Category[category_str.upper()]
    # Your retrieval logic here
enums.py), you can easily import and use them in both scraper.py and retriever.py. This approach promotes code reusability and maintainability, as you only need to define the enum classes in one place, and any changes to the enum classes will be reflected in both files.
table = dynamodb.Table('VisaBulletinData') to use an environment variable called “BULLETIN_DATA” instead of the hard-coded string, and only use that hard-coded string as a fallback if the environment variable is not set?
table = dynamodb.Table('VisaBulletinData') to use an environment variable called BULLETIN_DATA and fall back to the hard-coded string if the environment variable is not set:
import os
table_name = os.environ.get('BULLETIN_DATA', 'VisaBulletinData')
table = dynamodb.Table(table_name)
- import os imports the os module, which provides a way to interact with the operating system, including accessing environment variables.
- os.environ.get('BULLETIN_DATA', 'VisaBulletinData') retrieves the value of the environment variable BULLETIN_DATA. If the environment variable is not set, it will use the fallback value 'VisaBulletinData'.
- The retrieved value (either from the environment variable or the fallback value) is assigned to the table_name variable.
- dynamodb.Table(table_name) creates a DynamoDB Table object using the value stored in table_name. [1]
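The fallback behaviour of os.environ.get() is easy to check locally. The sketch below sets and clears BULLETIN_DATA in the current process only. MyBulletinTable is a made-up table name used for illustration.

```python
import os

DEFAULT_TABLE = 'VisaBulletinData'

# Case 1: variable not set at all -> the fallback is used.
os.environ.pop('BULLETIN_DATA', None)
unset_result = os.environ.get('BULLETIN_DATA', DEFAULT_TABLE)

# Case 2: variable set to an empty string -> get() returns '' (no fallback).
os.environ['BULLETIN_DATA'] = ''
empty_result = os.environ.get('BULLETIN_DATA', DEFAULT_TABLE)

# Using `or` treats an empty string like a missing variable.
empty_with_or = os.environ.get('BULLETIN_DATA') or DEFAULT_TABLE

# Case 3: variable set to a real name -> that name wins.
os.environ['BULLETIN_DATA'] = 'MyBulletinTable'  # hypothetical table name
set_result = os.environ.get('BULLETIN_DATA', DEFAULT_TABLE)
```

The default passed to get() only applies when the variable is absent. If an empty value should also trigger the fallback, the `or` form is the one to use.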
By using os.environ.get(), you can easily switch between using the environment variable or the hard-coded string, depending on whether the environment variable is set or not.
To set the BULLETIN_DATA environment variable when running your Lambda function, you can configure it in the AWS Lambda console or through your Terraform configuration. Here’s an example of how you can set the environment variable using Terraform:
resource "aws_lambda_function" "your_lambda_function" {
  # ... other configuration ...
  environment {
    variables = {
      BULLETIN_DATA = "YourTableName"
    }
  }
}
Replace "YourTableName" with the desired name of your DynamoDB table.
With this change, your Lambda function will use the value of the BULLETIN_DATA environment variable as the DynamoDB table name. If the environment variable is not set, it will fall back to using the hard-coded string 'VisaBulletinData'. Note that os.environ.get() only falls back when the variable is absent: a variable set to an empty string returns the empty string, not the fallback.
How can I invoke the visa_bulletin_retriever Lambda function using the AWS CLI on my local terminal with an event payload of:
{
    'queryStringParameters': {
        'filing_type': 'FINAL_DATE',
        'category': 'THIRD',
        'country': 'ALL_AREAS'
    }
}
Here is how you can invoke the visa_bulletin_retriever Lambda function using the AWS CLI on your local terminal with the provided event payload:
aws lambda invoke \
  --function-name visa_bulletin_retriever \
  --cli-binary-format raw-in-base64-out \
  --payload '{"queryStringParameters": {"filing_type": "FINAL_DATE", "category": "THIRD", "country": "ALL_AREAS"}}' \
  response.json
- aws lambda invoke: This is the AWS CLI command to invoke a Lambda function.
- --function-name visa_bulletin_retriever: Specifies the name of the Lambda function you want to invoke. Replace visa_bulletin_retriever with the actual name of your Lambda function.
- --cli-binary-format raw-in-base64-out: This option tells AWS CLI version 2 to treat the payload as raw text instead of expecting it to be Base64-encoded, which is the version 2 default for binary parameters.
- --payload '{"queryStringParameters": {"filing_type": "FINAL_DATE", "category": "THIRD", "country": "ALL_AREAS"}}': This is the JSON payload that you want to pass to your Lambda function. It matches the structure you provided in your example.
- response.json: This is the name of the file where the Lambda function’s response will be saved. You can choose any file name you prefer.
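After running this command, the Lambda function’s response will be saved in the response.json file.
The output file is a required argument of aws lambda invoke, so the response.json part cannot simply be omitted from the command. To see the response directly in the terminal on Linux or macOS, pass /dev/stdout in its place:
aws lambda invoke \
  --function-name visa_bulletin_retriever \
  --cli-binary-format raw-in-base64-out \
  --payload '{"queryStringParameters": {"filing_type": "FINAL_DATE", "category": "THIRD", "country": "ALL_AREAS"}}' \
  /dev/stdout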
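The same invocation can also be made from Python with boto3. The sketch below is an assumption about how one might wrap the call and is not code from the original post. invoke_retriever is a hypothetical helper, and the client is passed in as an argument so the function can be exercised without AWS credentials.

```python
import json


def invoke_retriever(lambda_client, filing_type='FINAL_DATE',
                     category='THIRD', country='ALL_AREAS',
                     function_name='visa_bulletin_retriever'):
    """Hypothetical helper: invoke the retriever and return its decoded result."""
    event = {
        'queryStringParameters': {
            'filing_type': filing_type,
            'category': category,
            'country': country,
        }
    }
    # The payload is sent as UTF-8 encoded JSON bytes.
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType='RequestResponse',  # synchronous call
        Payload=json.dumps(event).encode('utf-8'),
    )
    # The 'Payload' in the response is a stream; read and decode it.
    return json.loads(response['Payload'].read())


# Real usage (requires boto3 and AWS credentials):
#   import boto3
#   result = invoke_retriever(boto3.client('lambda'))
#   print(result['statusCode'])
```

Passing the client in keeps the helper easy to test with a stand-in object, while production code would hand it the client returned by boto3.client('lambda').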
If you prefer to keep the payload in a file, you can use the --payload file://path/to/payload.json option instead of embedding the JSON payload directly in the command.
Can you update lambda_handler in retriever.py so that the data is in the following shape please?
{
    'filing_type': 'Final Date',
    'category': '3rd',
    'country': 'All Chargeability Areas Except Those Listed',
    'data': [
        {'bulletin_date': '2024-07-01', 'date': '2021-12-01'},
        {'bulletin_date': '2024-08-01', 'date': '2021-12-05'}
    ]
}
To update lambda_handler in retriever.py to return the data in the desired shape, you can modify the code as follows:
import json

from enums import FilingType, Country, Category


def lambda_handler(event, context):
    # queryStringParameters can be present but null, so fall back to {}
    query_params = event.get('queryStringParameters') or {}

    # Resolve each parameter to the string value of the matching enum member
    filing_type_str = query_params.get('filing_type', FilingType.FINAL_DATE.name)
    filing_type = FilingType[filing_type_str.upper()].value
    country_str = query_params.get('country', Country.ALL_AREAS.name)
    country = Country[country_str.upper()].value
    category_str = query_params.get('category', Category.THIRD.name)
    category = Category[category_str.upper()].value

    # read_data is assumed to be defined elsewhere in retriever.py
    data = read_data(filing_type, category, country)

    response = {
        'filing_type': filing_type,
        'category': category,
        'country': country,
        'data': data
    }
    return {
        'statusCode': 200,
        'body': json.dumps(response)
    }
- We import the json module to convert the Python dictionary to a JSON string.
- Inside the lambda_handler function, we retrieve the filing_type, country, and category values from the query_params dictionary, using the corresponding enum member names as defaults.
- Instead of using the string names of the enum members, we use the value attribute to get the actual string values associated with each enum member (e.g., FilingType.FINAL_DATE.value returns 'Final Date').
- We call the read_data function to get the data based on the filing_type, category, and country values.
- We create a new dictionary response with the desired shape, including the filing_type, category, country, and data keys.
- We return a dictionary with a statusCode of 200 and a body containing the JSON-encoded response dictionary.
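Because the handler places the result inside body as a JSON string, a caller has to decode that string before reading the fields. The sketch below uses a hard-coded sample result in place of a real invocation, with values taken from the shape requested earlier.

```python
import json

# Sample handler result; in practice this comes from invoking the function.
handler_result = {
    'statusCode': 200,
    'body': json.dumps({
        'filing_type': 'Final Date',
        'category': '3rd',
        'country': 'All Chargeability Areas Except Those Listed',
        'data': [
            {'bulletin_date': '2024-07-01', 'date': '2021-12-01'},
            {'bulletin_date': '2024-08-01', 'date': '2021-12-05'},
        ],
    }),
}

# 'body' is a string, so decode it to get back a dictionary.
body = json.loads(handler_result['body'])

# Pick the most recent bulletin entry; ISO dates sort correctly as strings.
latest = max(body['data'], key=lambda row: row['bulletin_date'])
```

Keeping the dates in ISO format (YYYY-MM-DD) means they can be compared and sorted as plain strings, with no date parsing needed on the client.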
With this change, the Lambda function’s response will look like this:
{
  "statusCode": 200,
  "body": "{\"filing_type\": \"Final Date\", \"category\": \"3rd\", \"country\": \"All Chargeability Areas Except Those Listed\", \"data\": [{\"bulletin_date\": \"2024-07-01\", \"date\": \"2021-12-01\"}, {\"bulletin_date\": \"2024-08-01\", \"date\": \"2021-12-05\"}]}"
}
Note that the body is a JSON-encoded string, so you may need to parse it on the client side (e.g., using JSON.parse() in JavaScript) to work with the data as a JavaScript object.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.