Scrape All Things: AI-powered web scraping with ScrapeGraphAI 🕷️ and Amazon Bedrock ⛰️


Learn how to extract information from documents and websites using natural language prompts.

João Galego
Amazon Employee
Published May 21, 2024

Overview

"In the beginning was a graph..." ― Greg Egan, Schild's Ladder
A couple of weeks ago, I began hearing rumors about a new project that was taking GitHub trends by storm.
I usually don't pay much attention to such gossip, but so many people kept calling it a "revolution" in web scraping and commending it for its "ease of use" that I decided to give it a chance.
The project in question is of course ScrapeGraphAI 🕷️, an open-source Python library that uses large language models (LLMs) 🧠💬 and directed graph logic 🟢→🟡→🟣 to create scraping pipelines for websites and documents, allowing us to extract information from multiple sources using only natural language prompts.
Four pull requests later, I can safely say that I have become a fan 🤩 (most of my contributions are around improving the integration between ScrapeGraphAI and AWS AI services; big thanks to Marco Vinciguerra and the rest of the team for reviewing them all 🙌).
So my plan for today is to show you how to create next-generation, AI-powered scraping pipelines using ScrapeGraphAI and Amazon Bedrock ⛰️, a fully managed AI service that lets you access state-of-the-art LLMs and embedding models.
But first, let's look at a simple example to illustrate how far we've come...

The Old Way 👴🏻

Take this XML file, which contains a small list of books 📚:
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</book>
<book id="bk104">
<author>Corets, Eva</author>
<title>Oberon's Legacy</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-03-10</publish_date>
<description>In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.</description>
</book>
<book id="bk105">
<author>Corets, Eva</author>
<title>The Sundered Grail</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-09-10</publish_date>
<description>The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.</description>
</book>
<book id="bk106">
<author>Randall, Cynthia</author>
<title>Lover Birds</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-09-02</publish_date>
<description>When Carla meets Paul at an ornithology
conference, tempers fly as feathers get ruffled.</description>
</book>
<book id="bk107">
<author>Thurman, Paula</author>
<title>Splish Splash</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-11-02</publish_date>
<description>A deep sea diver finds true love twenty
thousand leagues beneath the sea.</description>
</book>
<book id="bk108">
<author>Knorr, Stefan</author>
<title>Creepy Crawlies</title>
<genre>Horror</genre>
<price>4.95</price>
<publish_date>2000-12-06</publish_date>
<description>An anthology of horror stories about roaches,
centipedes, scorpions and other insects.</description>
</book>
<book id="bk109">
<author>Kress, Peter</author>
<title>Paradox Lost</title>
<genre>Science Fiction</genre>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.</description>
</book>
<book id="bk110">
<author>O'Brien, Tim</author>
<title>Microsoft .NET: The Programming Bible</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-09</publish_date>
<description>Microsoft's .NET initiative is explored in
detail in this deep programmer's reference.</description>
</book>
<book id="bk111">
<author>O'Brien, Tim</author>
<title>MSXML3: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-01</publish_date>
<description>The Microsoft MSXML3 parser is covered in
detail, with attention to XML DOM interfaces, XSLT processing,
SAX and more.</description>
</book>
<book id="bk112">
<author>Galos, Mike</author>
<title>Visual Studio 7: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>49.95</price>
<publish_date>2001-04-16</publish_date>
<description>Microsoft Visual Studio 7 is explored in depth,
looking at how Visual Basic, Visual C++, C#, and ASP+ are
integrated into a comprehensive development
environment.</description>
</book>
</catalog>
Let's say we want to parse this file, create a list of authors, titles and genres, and discard everything else. How would you do it?
Here's how I would've done it in the old days using lxml, a Python library for processing XML files:
"""
Old XML scraper 👴🏻
"""

import json

from lxml import etree

# Get catalog (root element)
catalog = etree.parse('books.xml').getroot()

# Find all books
books = catalog.findall('book')

# Get author, title and genre for each book
result = {'books': []}
for book in books:
    author, title, genre = book.find('author'), \
                           book.find('title'), \
                           book.find('genre')
    result['books'].append({
        'author': author.text,
        'title': title.text,
        'genre': genre.text
    })

# Print the final result
print(json.dumps(result, indent=4))
Output:
{
    "books": [
        {
            "author": "Gambardella, Matthew",
            "title": "XML Developer's Guide",
            "genre": "Computer"
        },
        {
            "author": "Ralls, Kim",
            "title": "Midnight Rain",
            "genre": "Fantasy"
        },
        {
            "author": "Corets, Eva",
            "title": "Maeve Ascendant",
            "genre": "Fantasy"
        },
        {
            "author": "Corets, Eva",
            "title": "Oberon's Legacy",
            "genre": "Fantasy"
        },
        {
            "author": "Corets, Eva",
            "title": "The Sundered Grail",
            "genre": "Fantasy"
        },
        {
            "author": "Randall, Cynthia",
            "title": "Lover Birds",
            "genre": "Romance"
        },
        {
            "author": "Thurman, Paula",
            "title": "Splish Splash",
            "genre": "Romance"
        },
        {
            "author": "Knorr, Stefan",
            "title": "Creepy Crawlies",
            "genre": "Horror"
        },
        {
            "author": "Kress, Peter",
            "title": "Paradox Lost",
            "genre": "Science Fiction"
        },
        {
            "author": "O'Brien, Tim",
            "title": "Microsoft .NET: The Programming Bible",
            "genre": "Computer"
        },
        {
            "author": "O'Brien, Tim",
            "title": "MSXML3: A Comprehensive Guide",
            "genre": "Computer"
        },
        {
            "author": "Galos, Mike",
            "title": "Visual Studio 7: A Comprehensive Guide",
            "genre": "Computer"
        }
    ]
}
To implement something like this, we need to know beforehand that every book is represented by a book element 📗, which contains child elements for the author, title and genre of the book, and that all books are themselves child nodes of the root node, which is called catalog 🗂️.
Not-So-Fun Fact: in a past life, when I worked as a software tester, I used to write scripts like this all the time using frameworks like BeautifulSoup (BS will make an appearance later in this article, just keep reading) and Selenium. Glad those days are over! 🤭
This option is great if you don't mind going through XML hell 🔥.
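Incidentally, you don't even need lxml for this kind of structure-dependent scraping: the standard library's xml.etree.ElementTree can do the same job. Here's a minimal sketch against a trimmed-down, two-book version of the catalog above:

```python
import xml.etree.ElementTree as ET

# A trimmed-down version of the catalog above (two books only)
XML = """<?xml version="1.0"?>
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
  </book>
  <book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
    <genre>Fantasy</genre>
  </book>
</catalog>"""

catalog = ET.fromstring(XML)

# findtext() returns the element's text (or None if the tag is missing),
# which saves the explicit find()/.text dance from the lxml version
books = [
    {
        "author": book.findtext("author"),
        "title": book.findtext("title"),
        "genre": book.findtext("genre"),
    }
    for book in catalog.findall("book")
]
print(books)
```

Either way, the fundamental limitation is the same: the script only works because we hard-coded the catalog/book/author structure into it.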

The AI Way 🕷️⛰️

Fortunately, there is now a better way: we can simply ask for what we need.
"""
New XML scraper 🕷️⛰️
"""

import json

from scrapegraphai.graphs import XMLScraperGraph

# Read XML file
with open("books.xml", 'r', encoding="utf-8") as file:
    source = file.read()

# Define the graph configuration
graph_config = {
    "llm": {"model": "bedrock/anthropic.claude-3-sonnet-20240229-v1:0"},
    "embeddings": {"model": "bedrock/cohere.embed-multilingual-v3"}
}

# Create a graph instance and run it
graph = XMLScraperGraph(
    prompt="List me all the authors, title and genres of the books. Skip the preamble.",
    source=source,
    config=graph_config
)
result = graph.run()

# Print the final result
print(json.dumps(result, indent=4))
Although the old 👴🏻 and the new 🕷️⛰️ scraper implementations are exactly the same size (28 LoC) and both generate the same output, the way they work is completely different.
In the new scraper, we're using two different Bedrock-provided models - Claude 3 Sonnet as the LLM and Cohere Embed Multilingual v3 as the embedder - that receive our file and generate a response.
As a prompt, we're simply sending our original request plus a Skip the preamble instruction to ensure that Claude goes straight to the point and generates the right output.
Tying everything together is the graph.
In ScrapeGraphAI parlance, graphs are just scraping pipelines aimed at solving specific tasks.
Each graph is composed of several nodes 🟢🟡🟣, which can be configured individually to address different aspects of the task, like fetching data or extracting information, and of the edges that connect them (Input → 🟢 → 🟡 → 🟣 → Output).
ScrapeGraphAI offers a wide range of pre-built graphs like the XMLScraperGraph, which we used in the example above, or the SmartScraperGraph (pictured below), and the possibility to create your own custom graphs.
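If it helps to see the node-and-edge idea in code, here is a toy pipeline in plain Python. To be clear, this is purely illustrative: the node names and the shared state dict below are made up for this sketch and are not ScrapeGraphAI's actual API, but the flow mirrors the Input → 🟢 → 🟡 → 🟣 → Output picture above:

```python
# Toy illustration of a node-based scraping pipeline.
# NOTE: these node names and the state dict are invented for illustration;
# they are NOT ScrapeGraphAI's actual API.

def fetch_node(state):
    # Pretend to fetch the source document
    state["document"] = "<catalog><book><title>Midnight Rain</title></book></catalog>"
    return state

def parse_node(state):
    # Pretend to split the document into chunks for the model
    state["chunks"] = [state["document"]]
    return state

def answer_node(state):
    # Pretend an LLM turns the chunks + prompt into an answer
    state["answer"] = {"chunks_seen": len(state["chunks"])}
    return state

# Edges: Input -> fetch -> parse -> answer -> Output
pipeline = [fetch_node, parse_node, answer_node]

state = {"prompt": "List me all the books."}
for node in pipeline:
    state = node(state)

print(state["answer"])
```

Each node reads what it needs from the shared state and writes its result back, so swapping a node (say, a different fetcher) leaves the rest of the pipeline untouched.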
AWS credentials and settings are usually injected via environment variables (AWS_*) but we can create a custom client and pass it along to the graph:
"""
SmartScraperGraph with custom AWS client ☁️
"""

import os
import json

import boto3

from scrapegraphai.graphs import SmartScraperGraph

# Initialize session
session = boto3.Session(
    aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
    aws_session_token=os.environ.get("AWS_SESSION_TOKEN"),
    region_name=os.environ.get("AWS_DEFAULT_REGION")
)

# Initialize client
client = session.client("bedrock-runtime")

# Define graph configuration
config = {
    "llm": {
        "client": client,
        "model": "bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
        "temperature": 0.0
    },
    "embeddings": {
        "client": client,
        "model": "bedrock/cohere.embed-multilingual-v3",
    },
}

# Create graph instance and run it
graph = SmartScraperGraph(
    prompt="List me all the articles.",
    source="https://perinim.github.io/projects",
    config=config
)
result = graph.run()

# Print the final result
print(json.dumps(result, indent=4))
Output:
{
    "articles": [
        "Rotary Pendulum RL",
        "DQN Implementation from scratch",
        "Multi Agents HAED",
        "Wireless ESC for Modular Drones"
    ]
}

Demo ☕

As a way to explore different use cases with Amazon Bedrock, I've created a multi-page Streamlit application that showcases a small subset of scraper graphs.
๐Ÿ‘จโ€๐Ÿ’ป All code and documentation is available on GitHub.
1/ Clone the ScrapeGraphAI-Bedrock repository
git clone https://github.com/JGalego/ScrapeGraphAI-Bedrock
cd ScrapeGraphAI-Bedrock
and install the dependencies
# Install Python packages
pip install -r requirements.txt

# Install browsers
# https://playwright.dev/python/docs/browsers#install-browsers
playwright install

# Install system dependencies
# https://playwright.dev/python/docs/browsers#install-system-dependencies
playwright install-deps
2/ Don't forget to set up the AWS credentials
# Option 1: (recommended) AWS CLI
aws configure

# Option 2: environment variables
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_SESSION_TOKEN=...
export AWS_DEFAULT_REGION=...
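If you go the environment-variable route, a quick sanity check like the one below (a hypothetical helper, not part of the repo) can save you a cryptic boto3 error later:

```python
import os

# Variables the demos expect; AWS_SESSION_TOKEN is only needed for
# temporary credentials, so it's deliberately not checked here
REQUIRED = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION"]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    print("Missing AWS settings:", ", ".join(missing))
else:
    print("AWS environment looks good")
```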
3/ Start the demo application
# Run the full application
streamlit run pages/scrapegraphai_bedrock.py

# or just a single demo
streamlit run pages/smart_scraper.py
Remember when I said that BeautifulSoup would make an appearance? You can use the Script Generator demo, which is backed by the ScriptCreatorGraph, to generate an old-style scraping pipeline powered by the beautifulsoup framework or any other scraping library.
from bs4 import BeautifulSoup

import requests

url = "https://perinim.github.io/projects"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

articles = soup.find_all("article")

for article in articles:
    print(article.get_text())
Try it out, share it and let me know what you think in the comments section below ⤵️
👷‍♂️ Want to contribute? Feel free to open an issue or a pull request!
See you next time! 👋

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
