Learn Web Scraping with AWS Bedrock Agents
A beginner-friendly guide to setting up and deploying AWS Bedrock Agents for web scraping with Lambda, Streamlit, and Anthropic Claude.
Published Nov 21, 2024
In this blog post, I’ll discuss:
- (PART 1) AWS Bedrock Agents Introduction
- (PART 2) Hands-on project to implement a simple Bedrock agent that web scrapes a URL provided by the user
Think of a Bedrock Agent as a smart assistant that can:
1. Understand Tasks
- Takes complex requests from users
- Breaks them down into smaller, manageable steps
- Figures out what needs to be done in what order
2. Take Actions
- Can call APIs to get things done
- Access your company’s data when needed
- Execute multiple steps automatically
There are two core parts of an agent, plus an optional third:
1. Instructions
- Like a manual that tells the agent what it can do
- Sets boundaries for the agent’s actions
- Defines its specific purpose
2. Action Groups
- The specific things an agent can do
- Usually connected to Lambda functions
- Example: searching a database, creating a ticket, or sending an email
3. Knowledge Base (Optional)
- Reference information the agent can use
- Company documents, FAQs, policies
- Helps the agent give accurate responses
Action groups are a powerful pattern that can be applied beyond Bedrock Agents, helping organize and manage complex systems effectively.
Key components of action groups:
- API Schema (like `webscrape-schema.json`, which will be used in the project below)
- Lambda function for business logic
- Action group configuration
- Parameters and response definitions
The choice between using Agents or RAG often depends on whether you need simple information retrieval and augmented responses (RAG) or complex task orchestration with multiple steps and API interactions (via Agents).
For example, a customer service agent might use:
- RAG to retrieve accurate product information
- Agent capabilities to orchestrate the complete customer interaction, including checking inventory, processing returns, or updating customer records
Thought I’d write this part of the blog using Simon Sinek’s Golden Circle: the Why, How, and What behind this project.
I’m doing this new thing called the ‘AWS RawRRR…’ 🦁 series, where I ‘Recreate, Review, Rebuild Repos’ of AWS. For understanding agents, I chose this repo on using Bedrock agents to web scrape a user-provided URL: [Build-on-AWS](https://github.com/build-on-aws/bedrock-agents-webscraper?tab=readme-ov-file#step-1-aws-lambda-function-configuration)
Here are my tweaks:
1. I used the console route to create one action group to scrape a web URL of my choice
2. I used CloudFormation to create a Streamlit interface on EC2, and I modified the original interface elements using the Vim editor (via EC2 Instance Connect)
As an amateur photographer and new Fuji user, I cannot get enough of film simulation recipes, and I’m enjoying the whole notion of Straight Out of Camera (SOOC) pics. I found this awesome website called ‘Ross and his JPEGs’ and used it to test out the agent’s web scraping capabilities.
💡Pick any website that gets you curious to keep the project more relevant and interesting!
You can review my implementation of the project on my GitHub here: https://github.com/lulu3202/bedrock_web_crawler_agent
The idea behind this is that in learning you teach, and in teaching you learn.
1. Amazon Bedrock Agent Setup
— Use Anthropic Claude 3.5 Sonnet as the core model for the agent.
— Add an action group that links the agent to a Lambda function, enabling web scraping.
2. Lambda Function Deployment
— Write a Python-based function to fetch and parse webpage content.
— Package dependencies not included in the Lambda runtime, like `BeautifulSoup`, into a Lambda layer (`urllib.request` ships with Python’s standard library).
— Deploy the function and integrate it into the Bedrock agent.
3. Streamlit App on EC2
— Deploy a Streamlit app using a CloudFormation template.
— The app enables users to send queries, view scraped data, and explore the agent’s capabilities interactively.
Step 1: Lambda Function Setup
Create and deploy a Lambda function to handle web scraping (a minimal sketch follows this list). The function will:
- Accept a URL input.
- Scrape the webpage using `urllib.request` and `BeautifulSoup`.
- Return the cleaned data in JSON format.
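Here’s a minimal sketch of what such a handler could look like. The parameter name `url` and the response key `scraped_text` are placeholder assumptions for illustration; they must match whatever your OpenAPI schema (like `webscrape-schema.json`) defines, and the actual implementation lives in the repos linked in the references below.

```python
import json
import urllib.request
from bs4 import BeautifulSoup  # packaged in a Lambda layer (see Step 2)

def lambda_handler(event, context):
    # Collect the parameters the Bedrock agent passes in; the "url"
    # name is assumed here and must match the OpenAPI schema
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    url = params.get("url", "")

    # Fetch the page; a browser-like User-Agent avoids some basic bot blocks
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        html = resp.read()

    # Strip tags and collapse whitespace into clean text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

    # Wrap the result in the response shape Bedrock agent action groups expect;
    # truncating to 8000 chars is an arbitrary guard against oversized payloads
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event.get("actionGroup"),
            "apiPath": event.get("apiPath"),
            "httpMethod": event.get("httpMethod"),
            "httpStatusCode": 200,
            "responseBody": {
                "application/json": {"body": json.dumps({"scraped_text": text[:8000]})}
            },
        },
    }
```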
Step 2: Add Dependencies
Add a Lambda layer for libraries not included in the Lambda runtime, like `BeautifulSoup`. A common approach is to install the package into a local `python/` folder (e.g. `pip install beautifulsoup4 -t python/`), zip that folder, and upload the zip as a layer.
Step 3: Bedrock Agent and Action Group Configuration
Define the agent’s behavior using an OpenAPI schema in JSON format and link it to the Lambda function through an action group (an illustrative snippet follows).
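For illustration, a trimmed-down schema could look like the snippet below. The `/scrape` path, `url` parameter, and `scrapeUrl` operation ID are my own placeholder names; the real `webscrape-schema.json` in the Build-on-AWS repo is the source of truth.

```json
{
  "openapi": "3.0.0",
  "info": { "title": "Webscrape API", "version": "1.0.0" },
  "paths": {
    "/scrape": {
      "get": {
        "operationId": "scrapeUrl",
        "description": "Scrape the text content of a user-provided URL",
        "parameters": [
          {
            "name": "url",
            "in": "query",
            "required": true,
            "schema": { "type": "string" },
            "description": "The URL to scrape"
          }
        ],
        "responses": {
          "200": {
            "description": "Scraped page text",
            "content": {
              "application/json": {
                "schema": {
                  "type": "object",
                  "properties": { "scraped_text": { "type": "string" } }
                }
              }
            }
          }
        }
      }
    }
  }
}
```

The agent reads the `description` fields to decide when and how to call the action, so writing them clearly matters as much as the structure itself.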
Step 4: Test the Bedrock Agent Setup
- Use the Bedrock console to test the web scraping functionality
Step 5: Launch Streamlit App on EC2
- Deploy a Streamlit app for a user-friendly interface.
- Update the app with agent credentials to enable query handling (see the sketch after this list).
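As a rough sketch of the core of such an app, assuming placeholder agent credentials and the `bedrock-agent-runtime` boto3 client, the query loop could look like this:

```python
import uuid
import boto3
import streamlit as st

# Placeholder assumptions: replace with your own agent's ID, alias ID, and region
AGENT_ID = "YOUR_AGENT_ID"
AGENT_ALIAS_ID = "YOUR_AGENT_ALIAS_ID"

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

st.title("Bedrock Web Scraper Agent")
query = st.text_input("Enter a URL or question for the agent:")

if st.button("Submit") and query:
    # Invoke the agent; each submission gets a fresh conversation session here
    response = client.invoke_agent(
        agentId=AGENT_ID,
        agentAliasId=AGENT_ALIAS_ID,
        sessionId=str(uuid.uuid4()),
        inputText=query,
    )
    # The agent streams its answer back in chunks; stitch them together
    answer = "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in response["completion"]
        if "chunk" in event
    )
    st.write(answer)
```

Note that reusing the same `sessionId` across calls would keep one ongoing conversation with the agent instead of starting over on each submission.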
Step 6: Test the Agent on External Streamlit App running on EC2
Test the agent’s functionality by providing URLs and viewing the scraped content on the Streamlit interface.
💡Some helpful tips as I wrap up this blog post:
- Reminder to clean up, clean up, clean up all your AWS resources! 🧹
- It cost me between $0.25 and $1 to use Bedrock for this project (so don’t forget to clean up after your session) 💲
- I used Amazon Q Developer in both VS Code and the AWS Console to understand the code and to tweak or debug as needed 👩🏽💻
- You can use both my README files and the Build-on-AWS project README for a more detailed step-by-step approach (links in the references section below)
🙏🏽Once again, a huge shout-out to:
- Build-on-AWS GitHub profile (https://github.com/build-on-aws/bedrock-agents-webscraper?tab=readme-ov-file#step-1-aws-lambda-function-configuration)
- The ‘Ross and his JPEGs’ website, which was my muse for testing out the Bedrock Agent’s web scraping capabilities!
I also did a YT video on this if you prefer the video format - https://www.youtube.com/watch?v=tKEu-K2YTTc
References:
- Build-on-AWS Bedrock webscraper project: https://github.com/build-on-aws/bedrock-agents-webscraper?tab=readme-ov-file#step-1-aws-lambda-function-configuration
- Q Developer, for helping me debug and understand all the code artifacts!
- My implementation of this project: https://github.com/lulu3202/bedrock_web_crawler_agent
- My YT video on this content: https://www.youtube.com/watch?v=tKEu-K2YTTc
🌟🌟🌟 The expert in anything was once a beginner — Helen Hayes🌟🌟🌟