
Web to Wisdom: Transforming Web Content with Amazon Bedrock for Knowledge Bases
Integrate web data with Amazon Bedrock Knowledge Bases and boost your RAG applications with up-to-date public web content for more precise AI responses.
- Web Data Source: Index public web pages that you are authorized to crawl so your RAG applications can draw on them. This is perfect for integrating up-to-date information from company blogs, news sites, and social media feeds.
- Confluence: Connect to Confluence to tap into your organization's documentation, meeting notes, and collaborative content. This integration allows your AI applications to utilize the latest insights from your team's shared knowledge.
- Salesforce: Integrate with Salesforce to query customer relationship management (CRM) data. This provides your RAG applications with contextually rich responses based on customer interactions, sales data, and more.
- Microsoft SharePoint: Access a wide array of documents and resources stored in your organization's SharePoint sites. This is ideal for you if you rely on SharePoint for document management and collaboration.
In this post, we'll use the Web Crawler as the data source, so let's get started. You can configure it through the AWS Management Console or the CreateDataSource API. The Web Crawler lets you:
- Select Multiple URLs: Choose multiple seed URLs to start the crawl.
- Respect robots.txt Directives: Follow standard robots.txt directives like Allow and Disallow to ensure compliance with web standards.
- Scope Limitation: Restrict the crawling scope to specific URLs and optionally exclude URLs that match certain filter patterns (see the configuration sketch after this list).
- Crawl Rate Limitation: Control the rate at which URLs are crawled to manage load and performance.
- Monitor Crawling Status: Use Amazon CloudWatch to view the status of URLs visited during the crawling process.
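For example, a crawler configuration can restrict the crawl to the seed host, cap the crawl rate, and filter URLs by regex pattern. Here's a minimal sketch; the seed URL and filter patterns are illustrative, and we'll build the actual configuration used in this walkthrough later:

# Sketch of a web data source configuration; seed URL and filter patterns are illustrative
web_configuration_sketch = {
    "sourceConfiguration": {
        "urlConfiguration": {
            "seedUrls": [{"url": "https://example.com/blog"}]
        }
    },
    "crawlerConfiguration": {
        "crawlerLimits": {"rateLimit": 50},   # max pages crawled per host per minute
        "inclusionFilters": [".*/blog/.*"],   # only crawl URLs matching these regex patterns
        "exclusionFilters": [".*/drafts/.*"], # skip URLs matching these regex patterns
        "scope": "HOST_ONLY"                  # stay on the seed URL's host
    }
}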
In this walkthrough, we'll build everything programmatically with boto3, but you can do the same via the console or the CLI.
import json
import os
import time
import boto3
from utility import create_bedrock_execution_role, create_oss_policy_attach_bedrock_execution_role, create_policies_in_oss
import random
from retrying import retry

suffix = random.randrange(200, 900)
boto3_session = boto3.session.Session()
region_name = boto3_session.region_name
bedrock_agent_client = boto3_session.client('bedrock-agent', region_name=region_name)
service = 'aoss'
bucket_name = "<PUT YOUR BUCKET NAME>"  # Provide the name of a bucket that is already created

## Step 1 - Create OSS policies and collection
vector_store_name = f'bedrock-sample-rag-{suffix}'
index_name = f"bedrock-sample-rag-index-{suffix}"
aoss_client = boto3_session.client('opensearchserverless')
bedrock_kb_execution_role = create_bedrock_execution_role(bucket_name)
bedrock_kb_execution_role_arn = bedrock_kb_execution_role['Role']['Arn']

# Create security, network and data access policies within OSS
encryption_policy, network_policy, access_policy = create_policies_in_oss(
    vector_store_name=vector_store_name,
    aoss_client=aoss_client,
    bedrock_kb_execution_role_arn=bedrock_kb_execution_role_arn)

collection = aoss_client.create_collection(name=vector_store_name, type='VECTORSEARCH')
collection_id = collection['createCollectionDetail']['id']
host = collection_id + '.' + region_name + '.aoss.amazonaws.com'

# Create OSS policy and attach it to the Bedrock execution role
create_oss_policy_attach_bedrock_execution_role(collection_id=collection_id,
                                                bedrock_kb_execution_role=bedrock_kb_execution_role)

## Step 2 - Create Vector Index
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

credentials = boto3.Session().get_credentials()
awsauth = AWSV4SignerAuth(credentials, region_name, service)
body_json = {
    "settings": {
        "index.knn": "true"
    },
    "mappings": {
        "properties": {
            "vector": {
                "type": "knn_vector",
                "dimension": 1536,
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "l2",
                    "parameters": {
                        "ef_construction": 200,
                        "m": 16
                    }
                }
            },
            "text": {
                "type": "text"
            },
            "text-metadata": {
                "type": "text"
            }
        }
    }
}

# Build the OpenSearch client
oss_client = OpenSearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=300
)

# It can take up to a minute for data access rules to be enforced
time.sleep(60)

# Create index
response = oss_client.indices.create(index=index_name, body=json.dumps(body_json))
print('\nCreating index:')
print(response)
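Note that the vector dimension of 1536 matches the output size of the Amazon Titan Embeddings G1 - Text (v1) model we'll use below; if you choose a different embedding model, adjust the dimension to match its output.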
Next, we'll initialize the following:
- Initialize the OpenSearch Serverless configuration, which includes the collection ARN, index name, vector field, text field, and metadata field.
- Initialize the chunking strategy, based on which the KB will split documents into pieces of the chunk size specified in the chunkingStrategyConfiguration.
- Initialize the web URL configuration, which will be used to create the data source object later.
- Initialize the Titan embeddings model ARN, as this will be used to create the embeddings for each of the text chunks.
opensearchServerlessConfiguration = {
    "collectionArn": collection["createCollectionDetail"]['arn'],
    "vectorIndexName": index_name,
    "fieldMapping": {
        "vectorField": "vector",
        "textField": "text",
        "metadataField": "text-metadata"
    }
}

chunkingStrategyConfiguration = {
    "chunkingStrategy": "FIXED_SIZE",
    "fixedSizeChunkingConfiguration": {
        "maxTokens": 512,
        "overlapPercentage": 20
    }
}

webConfiguration = {
    "sourceConfiguration": {
        "urlConfiguration": {
            "seedUrls": [{
                "url": "https://www.datascienceportfol.io/suman"  # <-- Change this to your web URL
            }]
        }
    },
    "crawlerConfiguration": {
        "crawlerLimits": {
            "rateLimit": 50
        },
        "scope": "HOST_ONLY"
    }
}

embeddingModelArn = f"arn:aws:bedrock:{region_name}::foundation-model/amazon.titan-embed-text-v1"
name = f"bedrock-sample-knowledge-base-{suffix}"
description = "Bedrock Knowledge Bases for Web URL Connector"
roleArn = bedrock_kb_execution_role_arn
Next, we'll call the create_knowledge_base method, which will create the KB.
# Create a KnowledgeBase (KB)
def create_knowledge_base_func():
    create_kb_response = bedrock_agent_client.create_knowledge_base(
        name=name,
        description=description,
        roleArn=roleArn,
        knowledgeBaseConfiguration={
            "type": "VECTOR",
            "vectorKnowledgeBaseConfiguration": {
                "embeddingModelArn": embeddingModelArn
            }
        },
        storageConfiguration={
            "type": "OPENSEARCH_SERVERLESS",
            "opensearchServerlessConfiguration": opensearchServerlessConfiguration
        }
    )
    return create_kb_response["knowledgeBase"]

try:
    kb = create_knowledge_base_func()
except Exception as err:
    print(f"{err=}, {type(err)=}")

# Get KnowledgeBase
get_kb_response = bedrock_agent_client.get_knowledge_base(knowledgeBaseId=kb['knowledgeBaseId'])
Now we'll create the Web URL data source, which will be associated with the knowledge base created above.
# Create a DataSource in KnowledgeBase
create_ds_response = bedrock_agent_client.create_data_source(
    name=name,
    description=description,
    knowledgeBaseId=kb['knowledgeBaseId'],
    dataDeletionPolicy='DELETE',
    dataSourceConfiguration={
        "type": "WEB",
        "webConfiguration": webConfiguration
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": chunkingStrategyConfiguration
    }
)
ds = create_ds_response["dataSource"]

# Get DataSource
bedrock_agent_client.get_data_source(knowledgeBaseId=kb['knowledgeBaseId'], dataSourceId=ds["dataSourceId"])
With the KB and data source in place, we can start the data sync. During the sync job, the KB will fetch the documents from the data source, pre-process them to extract text, chunk them based on the configured chunk size, create embeddings for each chunk, and then write them to the vector database, which in this case is our OpenSearch Serverless collection.
# Start an ingestion job
start_job_response = bedrock_agent_client.start_ingestion_job(knowledgeBaseId=kb['knowledgeBaseId'],
                                                              dataSourceId=ds["dataSourceId"])
job = start_job_response["ingestionJob"]

# Poll the job status until the sync completes
while job['status'] != 'COMPLETE':
    get_job_response = bedrock_agent_client.get_ingestion_job(
        knowledgeBaseId=kb['knowledgeBaseId'],
        dataSourceId=ds["dataSourceId"],
        ingestionJobId=job["ingestionJobId"]
    )
    job = get_job_response["ingestionJob"]
    print(job)
    time.sleep(40)
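Once the job reaches COMPLETE, the ingestion job object carries statistics you can use to sanity-check the sync. A minimal sketch, assuming the statistics field names returned by the GetIngestionJob API:

# Inspect ingestion statistics once the job completes
# (field names assumed from the GetIngestionJob response)
stats = job.get('statistics', {})
print(f"Scanned: {stats.get('numberOfDocumentsScanned')}")
print(f"Indexed: {stats.get('numberOfNewDocumentsIndexed')}")
print(f"Failed:  {stats.get('numberOfDocumentsFailed')}")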
Now we can use the RetrieveAndGenerate API to test the knowledge base. Behind the scenes, the RetrieveAndGenerate API converts queries into embeddings, searches the knowledge base, augments the foundation model prompt with the search results as context information, and returns the FM-generated response to the question. For multi-turn conversations, Knowledge Bases manage the short-term memory of the conversation to provide more contextual results. The output of the RetrieveAndGenerate API includes the generated response and source attribution, as well as the retrieved text chunks.
# Test the KB using the RetrieveAndGenerate API
bedrock_agent_runtime_client = boto3.client("bedrock-agent-runtime", region_name=region_name)
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
model_arn = f'arn:aws:bedrock:{region_name}::foundation-model/{model_id}'
kb_id = kb['knowledgeBaseId']
query = "What are the projects Suman has worked on?"
response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        'text': query
    },
    retrieveAndGenerateConfiguration={
        'type': 'KNOWLEDGE_BASE',
        'knowledgeBaseConfiguration': {
            'knowledgeBaseId': kb_id,
            'modelArn': model_arn
        }
    },
)
generated_text = response['output']['text']
print(generated_text)
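The response also carries the source attribution for the generated answer. Here's a small sketch for printing the cited web pages, assuming the citation structure returned by retrieve_and_generate for a web data source:

# Print the source URLs and text snippets cited in the generated answer
for citation in response.get('citations', []):
    for ref in citation.get('retrievedReferences', []):
        # Web data sources expose the source URL under webLocation
        url = ref.get('location', {}).get('webLocation', {}).get('url')
        snippet = ref.get('content', {}).get('text', '')[:100]
        print(f"{url}: {snippet}...")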
If you want to fetch only the relevant context, you can use the Retrieve API instead. Its output includes the retrieved text chunks, the location type and URI of the source data, and the relevance scores of the retrievals.
# Retrieve API for fetching only the relevant context
relevant_documents = bedrock_agent_runtime_client.retrieve(
    retrievalQuery={
        'text': query
    },
    knowledgeBaseId=kb_id,
    retrievalConfiguration={
        'vectorSearchConfiguration': {
            'numberOfResults': 3  # fetch the top 3 documents that match the query most closely
        }
    }
)
print(relevant_documents["retrievalResults"])
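To make the relevance scores and source locations easier to read, you can iterate over the results. A small sketch, assuming the retrievalResults structure described above:

# Print each retrieved chunk with its relevance score and source URL
for result in relevant_documents["retrievalResults"]:
    score = result.get('score')
    url = result.get('location', {}).get('webLocation', {}).get('url')
    text = result.get('content', {}).get('text', '')[:100]
    print(f"score={score} source={url}")
    print(f"{text}...\n")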
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.