Learn Web Scraping with AWS Bedrock Agents
A beginner-friendly guide to setting up and deploying AWS Bedrock Agents for web scraping with Lambda, Streamlit, and Anthropic Claude.
Published Nov 21, 2024
In this blog post, I’ll discuss:
- (PART 1) AWS Bedrock Agents Introduction
- (PART 2) Hands-on project to implement a simple Bedrock agent that web scrapes a URL provided by the user
Think of a Bedrock Agent as a smart assistant that can:
1. Understand Tasks
- Takes complex requests from users
- Breaks them down into smaller, manageable steps
- Figures out what needs to be done in what order
2. Take Actions
- Can call APIs to get things done
- Access your company’s data when needed
- Execute multiple steps automatically
There are two core parts of an agent, plus an optional third:
1. Instructions
- Like a manual that tells the agent what it can do
- Sets boundaries for the agent’s actions
- Defines its specific purpose
2. Action Groups
- The specific things an agent can do
- Usually connected to Lambda functions
- Example: searching a database, creating a ticket, or sending an email
3. Knowledge Base (Optional)
- Reference information the agent can use
- Company documents, FAQs, policies
- Helps the agent give accurate responses
Action groups are a powerful pattern that can be applied beyond Bedrock Agents, helping organize and manage complex systems effectively.
Key components of action groups:
- API Schema (like `webscrape-schema.json`, which will be used in the project below)
- Lambda function for business logic
- Action group configuration
- Parameters and response definitions
The choice between using Agents or RAG often depends on whether you need simple information retrieval and augmented responses (RAG) or complex task orchestration with multiple steps and API interactions (via Agents).
For example, a customer service agent might use:
- RAG to retrieve accurate product information
- Agent capabilities to orchestrate the complete customer interaction, including checking inventory, processing returns, or updating customer records
Thought I’d write this part of the blog using Simon Sinek’s Golden Circle: the Why, How, and What behind this project.
I’m doing this new thing called the ‘AWS RawRRR…’ 🦁 series, where I ‘Recreate, Review, Rebuild Repos’ of AWS. For understanding agents, I chose this repo on using Bedrock agents to web scrape a user-provided URL: [Build-on-AWS](https://github.com/build-on-aws/bedrock-agents-webscraper?tab=readme-ov-file#step-1-aws-lambda-function-configuration)
Here are my tweaks:
1. I used the console route to create one action group to scrape a web URL of my choice
2. I used CloudFormation to create a Streamlit interface on EC2, and I modified the original interface elements using the Vim editor (via EC2 Instance Connect)
As an amateur photographer and new Fuji user, I cannot get enough of film simulation recipes, and I’m enjoying the whole notion of Straight Out of Camera (SOOC) pics. I found this awesome website called ‘Ross and his JPEGs’ and used it to test out the agent’s web scraping capabilities.
💡Pick any website that gets you curious to keep the project more relevant and interesting!
You can review my implementation of the project on my GitHub here: https://github.com/lulu3202/bedrock_web_crawler_agent
The idea behind this is that in learning you teach, and in teaching you learn.
1. Amazon Bedrock Agent Setup
— Use Anthropic Claude 3.5 Sonnet as the core model for the agent.
— Add an action group that links the agent to a Lambda function, enabling web scraping.
2. Lambda Function Deployment
— Write a Python-based function to fetch and parse webpage content.
— Package dependencies not included in the Lambda runtime, like `BeautifulSoup`, into a Lambda layer (`urllib.request` ships with Python’s standard library).
— Deploy the function and integrate it into the Bedrock agent.
3. Streamlit App on EC2
— Deploy a Streamlit app using a CloudFormation template.
— The app enables users to send queries, view scraped data, and explore the agent’s capabilities interactively.
Step 1: Lambda Function Setup
Create and deploy a Lambda function to handle web scraping (a minimal sketch follows this list). The function will:
- Accept a URL input.
- Scrape the webpage using `urllib.request` and `BeautifulSoup`.
- Return the cleaned data in JSON format.
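Here’s a minimal sketch of what such a handler could look like. The parameter name `url` and the response key `scraped_text` are placeholder assumptions for illustration; they must match whatever your OpenAPI schema (like `webscrape-schema.json`) defines, and the actual implementation lives in the repos linked in the references below.

```python
import json
import urllib.request
from bs4 import BeautifulSoup  # packaged in a Lambda layer (see Step 2)

def lambda_handler(event, context):
    # Collect the parameters the Bedrock agent passes in; the "url"
    # name is assumed here and must match the OpenAPI schema
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    url = params.get("url", "")

    # Fetch the page; a browser-like User-Agent avoids some basic bot blocks
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        html = resp.read()

    # Strip tags and collapse whitespace into clean text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

    # Wrap the result in the response shape Bedrock agent action groups expect;
    # truncating to 8000 chars is an arbitrary guard against oversized payloads
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event.get("actionGroup"),
            "apiPath": event.get("apiPath"),
            "httpMethod": event.get("httpMethod"),
            "httpStatusCode": 200,
            "responseBody": {
                "application/json": {"body": json.dumps({"scraped_text": text[:8000]})}
            },
        },
    }
```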
Step 2: Add Dependencies
Add a Lambda layer for libraries not included in the Lambda runtime, like `BeautifulSoup`. A common approach is to install the package into a local `python/` folder (e.g. `pip install beautifulsoup4 -t python/`), zip that folder, and upload the zip as a layer.
Step 3: Bedrock Agent and Action Group Configuration
Define the agent’s behavior using an OpenAPI schema in JSON format and link it to the Lambda function through an action group (an illustrative snippet follows).
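For illustration, a trimmed-down schema could look like the snippet below. The `/scrape` path, `url` parameter, and `scrapeUrl` operation ID are my own placeholder names; the real `webscrape-schema.json` in the Build-on-AWS repo is the source of truth.

```json
{
  "openapi": "3.0.0",
  "info": { "title": "Webscrape API", "version": "1.0.0" },
  "paths": {
    "/scrape": {
      "get": {
        "operationId": "scrapeUrl",
        "description": "Scrape the text content of a user-provided URL",
        "parameters": [
          {
            "name": "url",
            "in": "query",
            "required": true,
            "schema": { "type": "string" },
            "description": "The URL to scrape"
          }
        ],
        "responses": {
          "200": {
            "description": "Scraped page text",
            "content": {
              "application/json": {
                "schema": {
                  "type": "object",
                  "properties": { "scraped_text": { "type": "string" } }
                }
              }
            }
          }
        }
      }
    }
  }
}
```

The agent reads the `description` fields to decide when and how to call the action, so writing them clearly matters as much as the structure itself.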
Step 4: Test the Bedrock Agent Setup
- Use the Bedrock console to test the web scraping functionality
Step 5: Launch Streamlit App on EC2
- Deploy a Streamlit app for a user-friendly interface.
- Update the app with agent credentials to enable query handling (see the sketch after this list).
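As a rough sketch of the core of such an app, assuming placeholder agent credentials and the `bedrock-agent-runtime` boto3 client, the query loop could look like this:

```python
import uuid
import boto3
import streamlit as st

# Placeholder assumptions: replace with your own agent's ID, alias ID, and region
AGENT_ID = "YOUR_AGENT_ID"
AGENT_ALIAS_ID = "YOUR_AGENT_ALIAS_ID"

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

st.title("Bedrock Web Scraper Agent")
query = st.text_input("Enter a URL or question for the agent:")

if st.button("Submit") and query:
    # Invoke the agent; each submission gets a fresh conversation session here
    response = client.invoke_agent(
        agentId=AGENT_ID,
        agentAliasId=AGENT_ALIAS_ID,
        sessionId=str(uuid.uuid4()),
        inputText=query,
    )
    # The agent streams its answer back in chunks; stitch them together
    answer = "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in response["completion"]
        if "chunk" in event
    )
    st.write(answer)
```

Note that reusing the same `sessionId` across calls would keep one ongoing conversation with the agent instead of starting over on each submission.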
Step 6: Test the Agent on External Streamlit App running on EC2
Test the agent’s functionality by providing URLs and viewing the scraped content on the Streamlit interface.
💡Some helpful tips as I wrap up this blog post:
- Reminder to clean up, clean up, clean up all your AWS resources! 🧹
- It cost me between $0.25 and $1 to use Bedrock for this project (so don’t forget to clean up after your session) 💲
- I used Amazon Q Developer in both VS Code and the AWS Console to understand the code and to tweak or debug as needed 👩🏽💻
- You can use both my README files and the Build-on-AWS project README for a more detailed step-by-step approach (links in the references section below)
🙏🏽Once again, a huge shout-out to:
- Build-on-AWS GitHub profile (https://github.com/build-on-aws/bedrock-agents-webscraper?tab=readme-ov-file#step-1-aws-lambda-function-configuration)
- The ‘Ross and his JPEGs’ website, which was my muse for testing out the Bedrock Agent’s web scraping capabilities!
I also did a YT video on this if you prefer the video format - https://www.youtube.com/watch?v=tKEu-K2YTTc
References:
- Build-on-AWS Bedrock webscraper project: https://github.com/build-on-aws/bedrock-agents-webscraper?tab=readme-ov-file#step-1-aws-lambda-function-configuration
- Q Developer, for helping me debug and understand all the code artifacts!
- My implementation of this project: https://github.com/lulu3202/bedrock_web_crawler_agent
- My YT video on this content: https://www.youtube.com/watch?v=tKEu-K2YTTc
🌟🌟🌟 The expert in anything was once a beginner — Helen Hayes🌟🌟🌟