
Scrape All Things: AI-powered web scraping with ScrapeGraphAI πŸ•·οΈ and Amazon Bedrock ⛰️

Learn how to extract information from documents and websites using natural language prompts.

JoΓ£o Galego
Amazon Employee
Published May 21, 2024

Overview

"In the beginning was a graph..." ― Greg Egan, Schild's Ladder
A couple of weeks ago, I began hearing rumors about a new project that was taking GitHub trends by storm.
I usually don't pay much attention to such gossip, but everyone kept calling it a "revolution" in web scraping and commending its "ease of use", so I decided to give it a chance.
The project in question is of course ScrapeGraphAI πŸ•·οΈ, an open-source Python library that uses large language models (LLMs) πŸ§ πŸ’¬ and directed graph logic πŸŸ’β†’πŸŸ‘β†’πŸŸ£ to create scraping pipelines for websites and documents, allowing us to extract information from multiple sources using only natural language prompts.
4 pull requests later, I can safely say that I have become a fan 🀩 (most of my contributions are around improving the integration between ScrapeGraphAI and AWS AI services - big thanks to Marco Vinciguerra and the rest of the team for reviewing them all πŸ™Œ).
So my plan for today is to show you how to create next-generation, AI-powered scraping pipelines using ScrapeGraphAI and Amazon Bedrock ⛰️, a fully managed AI service that lets you access state-of-the-art LLMs and embedding models.
But first, let's look at a simple example to illustrate how far we've come...

The Old Way πŸ‘΄πŸ»

Take this XML file which contains a small list of books πŸ“š:
<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.</description>
   </book>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology
      conference, tempers fly as feathers get ruffled.</description>
   </book>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty
      thousand leagues beneath the sea.</description>
   </book>
   <book id="bk108">
      <author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <genre>Horror</genre>
      <price>4.95</price>
      <publish_date>2000-12-06</publish_date>
      <description>An anthology of horror stories about roaches,
      centipedes, scorpions and other insects.</description>
   </book>
   <book id="bk109">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems
      of being quantum.</description>
   </book>
   <book id="bk110">
      <author>O'Brien, Tim</author>
      <title>Microsoft .NET: The Programming Bible</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-09</publish_date>
      <description>Microsoft's .NET initiative is explored in
      detail in this deep programmer's reference.</description>
   </book>
   <book id="bk111">
      <author>O'Brien, Tim</author>
      <title>MSXML3: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-01</publish_date>
      <description>The Microsoft MSXML3 parser is covered in
      detail, with attention to XML DOM interfaces, XSLT processing,
      SAX and more.</description>
   </book>
   <book id="bk112">
      <author>Galos, Mike</author>
      <title>Visual Studio 7: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>49.95</price>
      <publish_date>2001-04-16</publish_date>
      <description>Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are
      integrated into a comprehensive development
      environment.</description>
   </book>
</catalog>
Let's say we want to parse this file, create a list of authors, titles and genres, and discard everything else. How would you do it?
Here's how I would've done it in the old days using lxml, a Python library for processing XML files:
"""
Old XML scraper πŸ‘΄πŸ»
"""


import json

from lxml import etree

# Get catalog (root element)
catalog = etree.parse('books.xml').getroot()

# Find all books
books = catalog.findall('book')

# Get author, title and genre for each book
result = {'books': []}
for book in books:
author, title, genre = book.find('author'), \
book.find('title'), \
book.find('genre')
result['books'].append({
'author': author.text,
'title': title.text,
'genre': genre.text
})

# Print the final result
print(json.dumps(result, indent=4))
Output:
{
    "books": [
        {
            "author": "Gambardella, Matthew",
            "title": "XML Developer's Guide",
            "genre": "Computer"
        },
        {
            "author": "Ralls, Kim",
            "title": "Midnight Rain",
            "genre": "Fantasy"
        },
        {
            "author": "Corets, Eva",
            "title": "Maeve Ascendant",
            "genre": "Fantasy"
        },
        {
            "author": "Corets, Eva",
            "title": "Oberon's Legacy",
            "genre": "Fantasy"
        },
        {
            "author": "Corets, Eva",
            "title": "The Sundered Grail",
            "genre": "Fantasy"
        },
        {
            "author": "Randall, Cynthia",
            "title": "Lover Birds",
            "genre": "Romance"
        },
        {
            "author": "Thurman, Paula",
            "title": "Splish Splash",
            "genre": "Romance"
        },
        {
            "author": "Knorr, Stefan",
            "title": "Creepy Crawlies",
            "genre": "Horror"
        },
        {
            "author": "Kress, Peter",
            "title": "Paradox Lost",
            "genre": "Science Fiction"
        },
        {
            "author": "O'Brien, Tim",
            "title": "Microsoft .NET: The Programming Bible",
            "genre": "Computer"
        },
        {
            "author": "O'Brien, Tim",
            "title": "MSXML3: A Comprehensive Guide",
            "genre": "Computer"
        },
        {
            "author": "Galos, Mike",
            "title": "Visual Studio 7: A Comprehensive Guide",
            "genre": "Computer"
        }
    ]
}
To implement something like this, we need to know beforehand that every book is represented by a book element πŸ“—, which contains child elements for the book's author, title and genre, and that all books are themselves child nodes of the root node, which is called catalog πŸ—‚οΈ.
Not-So-Fun Fact: in a past life, when I worked as a software tester, I used to write scripts like this all the time using frameworks like BeautifulSoup (BS will make an appearance later in this article, just keep reading) and Selenium. Glad those days are over! 🀭
This option is great if you don't mind going through XML hell πŸ”₯
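Even XPath, the usual escape hatch, just moves the problem around: every query still hard-codes the same structural assumptions. Here's a minimal sketch of the same extraction using lxml's xpath method:

# Same extraction, XPath edition - the structure is still baked into every query
from lxml import etree

catalog = etree.parse('books.xml')
books = [
    {'author': author, 'title': title, 'genre': genre}
    for author, title, genre in zip(
        catalog.xpath('//book/author/text()'),
        catalog.xpath('//book/title/text()'),
        catalog.xpath('//book/genre/text()')
    )
]
print(books)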

The AI Way πŸ•·οΈβ›°οΈ

Fortunately, there is now a better way: we can simply ask for what we need.
"""
New XML scraper πŸ•·οΈβ›°οΈ
"""


import json

from scrapegraphai.graphs import XMLScraperGraph

# Read XML file
with open("books.xml", 'r', encoding="utf-8") as file:
source = file.read()

# Define the graph configuration
graph_config = {
"llm": {"model": "bedrock/anthropic.claude-3-sonnet-20240229-v1:0"},
"embeddings": {"model": "bedrock/cohere.embed-multilingual-v3"}
}

# Create a graph instance and run it
graph = XMLScraperGraph(
prompt="List me all the authors, title and genres of the books. Skip the preamble.",
source=source,
config=graph_config
)
result = graph.run()

# Print the final result
print(json.dumps(result, indent=4))
Although the old πŸ‘΄πŸ» and the new πŸ•·οΈβ›°οΈ scraper implementations are exactly the same size (28 LoC) and generate the same output, the way they work is completely different.
In the new scraper, we're using two different Bedrock-provided models - Claude 3 Sonnet as the LLM and Cohere Embed Multilingual v3 as the embedder - that receive our file and generate a response.
As a prompt, we're simply sending our original request plus a "Skip the preamble" instruction to ensure that Claude gets straight to the point and generates the right output.
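One caveat: the model IDs in the graph configuration must be enabled in your account and region (Bedrock model access is granted per account). If in doubt, you can list the foundation models you have access to with boto3:

# List the foundation models visible to this account/region
import boto3

bedrock = boto3.client("bedrock")
for model in bedrock.list_foundation_models()["modelSummaries"]:
    print(model["modelId"])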
Tying everything together is the graph.
In ScrapeGraphAI parlance, graphs are just scraping pipelines aimed at solving specific tasks.
Each graph is composed of several nodes 🟒🟑🟣, which can be configured individually to address different aspects of the task (like fetching data or extracting information), and of the edges that connect them: Input β†’ 🟒 β†’ 🟑 β†’ 🟣 β†’ Output.
ScrapeGraphAI offers a wide range of pre-built graphs, like the XMLScraperGraph we used in the example above or the SmartScraperGraph, as well as the possibility to create your own custom graphs.
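If the pre-built graphs don't cover your use case, you can wire the nodes together yourself. The sketch below is loosely based on ScrapeGraphAI's custom graph example; treat the node names and signatures as illustrative, since they may differ between versions:

"""
Custom graph sketch πŸ› οΈ (illustrative only - check the ScrapeGraphAI docs for the exact node signatures)
"""

from langchain_aws import ChatBedrock

from scrapegraphai.graphs import BaseGraph
from scrapegraphai.nodes import FetchNode, ParseNode, GenerateAnswerNode

# Bedrock-backed LLM shared by the nodes (via LangChain)
llm_model = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0")

# Each node addresses one aspect of the task...
fetch_node = FetchNode(input="url | local_dir", output=["doc"])
parse_node = ParseNode(input="doc", output=["parsed_doc"])
answer_node = GenerateAnswerNode(
    input="user_prompt & (parsed_doc | doc)",
    output=["answer"],
    node_config={"llm_model": llm_model}
)

# ...while the edges define the order of execution: Input β†’ 🟒 β†’ 🟑 β†’ 🟣 β†’ Output
graph = BaseGraph(
    nodes=[fetch_node, parse_node, answer_node],
    edges=[(fetch_node, parse_node), (parse_node, answer_node)],
    entry_point=fetch_node
)

result, execution_info = graph.execute({
    "user_prompt": "List me all the projects",
    "url": "https://perinim.github.io/projects"
})
print(result.get("answer", "No answer found."))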
AWS credentials and settings are usually injected via environment variables (AWS_*) but we can create a custom client and pass it along to the graph:
"""
SmartScraperGraph with custom AWS client ☁️
"""


import os
import json

import boto3

from scrapegraphai.graphs import SmartScraperGraph

# Initialize session
session = boto3.Session(
aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
aws_session_token=os.environ.get("AWS_SESSION_TOKEN"),
region_name=os.environ.get("AWS_DEFAULT_REGION")
)

# Initialize client
client = session.client("bedrock-runtime")

# Define graph configuration
config = {
"llm": {
"client": client,
"model": "bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
"temperature": 0.0
},
"embeddings": {
"client": client,
"model": "bedrock/cohere.embed-multilingual-v3",
},
}

# Create graph instance and run it
graph = SmartScraperGraph(
prompt="List me all the articles.",
source="https://perinim.github.io/projects",
config=config
)
result = graph.run()

# Print the final result
print(json.dumps(result, indent=4))
Output:
{
    "articles": [
        "Rotary Pendulum RL",
        "DQN Implementation from scratch",
        "Multi Agents HAED",
        "Wireless ESC for Modular Drones"
    ]
}

Demo β˜•

As a way to explore different use cases with Amazon Bedrock, I've created a multi-page Streamlit application that showcases a small subset of scraper graphs.
πŸ‘¨β€πŸ’» All code and documentation is available on GitHub.
1/ Clone the ScrapeGraphAI-Bedrock repository
git clone https://github.com/JGalego/ScrapeGraphAI-Bedrock
cd ScrapeGraphAI-Bedrock
and install the dependencies
# Install Python packages
pip install -r requirements.txt

# Install browsers
# https://playwright.dev/python/docs/browsers#install-browsers
playwright install

# Install system dependencies
# https://playwright.dev/python/docs/browsers#install-system-dependencies
playwright install-deps
2/ Don't forget to set up the AWS credentials
# Option 1: (recommended) AWS CLI
aws configure

# Option 2: environment variables
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_SESSION_TOKEN=...
export AWS_DEFAULT_REGION=...
3/ Start the demo application
# Run the full application
streamlit run pages/scrapegraphai_bedrock.py

# or just a single demo
streamlit run pages/smart_scraper.py
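By the way, each demo page is little more than a thin Streamlit wrapper around one of the graphs. Here's a minimal sketch of what a smart scraper page might look like (illustrative only, not the actual repo code):

"""
Smart scraper page sketch πŸ–ΌοΈ (illustrative only, not the actual demo code)
"""

import streamlit as st

from scrapegraphai.graphs import SmartScraperGraph

st.title("Smart Scraper πŸ•·οΈβ›°οΈ")

prompt = st.text_input("What do you want to extract?")
source = st.text_input("Source URL")

if st.button("Scrape") and prompt and source:
    graph = SmartScraperGraph(
        prompt=prompt,
        source=source,
        config={
            "llm": {"model": "bedrock/anthropic.claude-3-sonnet-20240229-v1:0"},
            "embeddings": {"model": "bedrock/cohere.embed-multilingual-v3"}
        }
    )
    st.json(graph.run())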
Remember when I said that BeautifulSoup would make an appearance? You can use the Script Generator demo, which is backed by the ScriptCreatorGraph, to generate an old-style scraping pipeline powered by BeautifulSoup or any other supported scraping library. Here's an example of the kind of script it produces:
from bs4 import BeautifulSoup

import requests

url = "https://perinim.github.io/projects"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

articles = soup.find_all("article")

for article in articles:
    print(article.get_text())
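For completeness, here's roughly how you'd call the ScriptCreatorGraph directly, outside the demo app; the library option tells it which scraping framework the generated script should target (a minimal sketch, assuming the same Bedrock setup as before):

"""
Script generator sketch πŸ“ (minimal, assuming the same Bedrock setup as before)
"""

from scrapegraphai.graphs import ScriptCreatorGraph

# "library" tells the graph which scraping framework the generated script should use
graph_config = {
    "llm": {"model": "bedrock/anthropic.claude-3-sonnet-20240229-v1:0"},
    "embeddings": {"model": "bedrock/cohere.embed-multilingual-v3"},
    "library": "beautifulsoup"
}

# Create a graph instance and run it
graph = ScriptCreatorGraph(
    prompt="Write a script that lists all the articles.",
    source="https://perinim.github.io/projects",
    config=graph_config
)
print(graph.run())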
Try it out, share it and let me know what you think in the comments section below ‡️
πŸ‘·β€β™‚οΈ Want to contribute? Feel free to open an issue or a pull request!
See you next time! πŸ‘‹
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
