
Web to Wisdom: Transforming Web Content with Amazon Bedrock for Knowledge Bases
Integrate web data with Amazon Bedrock Knowledge Bases and boost your RAG applications with up-to-date public web content for more precise AI responses.
- Web Data Source: Index public web pages that you are authorized to crawl so your RAG applications can draw on them. This is perfect for integrating up-to-date information from company blogs, news sites, and social media feeds.
- Confluence: Connect to Confluence to tap into your organization's documentation, meeting notes, and collaborative content. This integration allows your AI applications to utilize the latest insights from your team's shared knowledge.
- Salesforce: Integrate with Salesforce to query customer relationship management (CRM) data. This provides your RAG applications with contextually rich responses based on customer interactions, sales data, and more.
- Microsoft SharePoint: Access a wide array of documents and resources stored in your organization's SharePoint sites. This is ideal for you if you rely on SharePoint for document management and collaboration.
In this post, we'll use the Web Crawler as the data source, so let's get started. You can configure it through the AWS Management Console or the CreateDataSource API. The Web Crawler lets you:
- Select Multiple URLs: Choose multiple seed URLs to start the crawl.
- Respect robots.txt Directives: Follow standard robots.txt directives like Allow and Disallow to ensure compliance with web standards.
- Scope Limitation: Restrict the crawling scope to specific URLs and optionally exclude URLs that match certain filter patterns (see the configuration sketch after this list).
- Crawl Rate Limitation: Control the rate at which URLs are crawled to manage load and performance.
- Monitor Crawling Status: Use Amazon CloudWatch to view the status of URLs visited during the crawling process.
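For example, a crawler configuration can restrict the crawl to the seed host, cap the crawl rate, and filter URLs by regex pattern. Here's a minimal sketch; the seed URL and filter patterns are illustrative, and we'll build the actual configuration used in this walkthrough later:

# Sketch of a web data source configuration; seed URL and filter patterns are illustrative
web_configuration_sketch = {
    "sourceConfiguration": {
        "urlConfiguration": {
            "seedUrls": [{"url": "https://example.com/blog"}]
        }
    },
    "crawlerConfiguration": {
        "crawlerLimits": {"rateLimit": 50},   # max pages crawled per host per minute
        "inclusionFilters": [".*/blog/.*"],   # only crawl URLs matching these regex patterns
        "exclusionFilters": [".*/drafts/.*"], # skip URLs matching these regex patterns
        "scope": "HOST_ONLY"                  # stay on the seed URL's host
    }
}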
In this walkthrough, we'll build everything programmatically with boto3, but you can do the same via the console or the CLI.
import json
import os
import time
import boto3
from utility import create_bedrock_execution_role, create_oss_policy_attach_bedrock_execution_role, create_policies_in_oss
import random
from retrying import retry

suffix = random.randrange(200, 900)
boto3_session = boto3.session.Session()
region_name = boto3_session.region_name
bedrock_agent_client = boto3_session.client('bedrock-agent', region_name=region_name)
service = 'aoss'
bucket_name = "<PUT YOUR BUCKET NAME>"  # Provide the name of a bucket that is already created

## Step 1 - Create OSS policies and collection
vector_store_name = f'bedrock-sample-rag-{suffix}'
index_name = f"bedrock-sample-rag-index-{suffix}"
aoss_client = boto3_session.client('opensearchserverless')
bedrock_kb_execution_role = create_bedrock_execution_role(bucket_name)
bedrock_kb_execution_role_arn = bedrock_kb_execution_role['Role']['Arn']

# Create security, network and data access policies within OSS
encryption_policy, network_policy, access_policy = create_policies_in_oss(
    vector_store_name=vector_store_name,
    aoss_client=aoss_client,
    bedrock_kb_execution_role_arn=bedrock_kb_execution_role_arn)

collection = aoss_client.create_collection(name=vector_store_name, type='VECTORSEARCH')
collection_id = collection['createCollectionDetail']['id']
host = collection_id + '.' + region_name + '.aoss.amazonaws.com'

# Create OSS policy and attach it to the Bedrock execution role
create_oss_policy_attach_bedrock_execution_role(collection_id=collection_id,
                                                bedrock_kb_execution_role=bedrock_kb_execution_role)

## Step 2 - Create Vector Index
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

credentials = boto3.Session().get_credentials()
awsauth = AWSV4SignerAuth(credentials, region_name, service)
body_json = {
    "settings": {
        "index.knn": "true"
    },
    "mappings": {
        "properties": {
            "vector": {
                "type": "knn_vector",
                "dimension": 1536,
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "l2",
                    "parameters": {
                        "ef_construction": 200,
                        "m": 16
                    }
                }
            },
            "text": {
                "type": "text"
            },
            "text-metadata": {
                "type": "text"
            }
        }
    }
}

# Build the OpenSearch client
oss_client = OpenSearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=300
)

# It can take up to a minute for data access rules to be enforced
time.sleep(60)

# Create index
response = oss_client.indices.create(index=index_name, body=json.dumps(body_json))
print('\nCreating index:')
print(response)
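Note that the vector dimension of 1536 matches the output size of the Amazon Titan Embeddings G1 - Text (v1) model we'll use below; if you choose a different embedding model, adjust the dimension to match its output.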
Next, we'll initialize the following:
- Initialize the OpenSearch Serverless configuration, which includes the collection ARN, index name, vector field, text field, and metadata field.
- Initialize the chunking strategy, based on which the KB will split documents into pieces of the chunk size specified in the chunkingStrategyConfiguration.
- Initialize the web URL configuration, which will be used to create the data source object later.
- Initialize the Titan embeddings model ARN, as this will be used to create the embeddings for each of the text chunks.
opensearchServerlessConfiguration = {
    "collectionArn": collection["createCollectionDetail"]['arn'],
    "vectorIndexName": index_name,
    "fieldMapping": {
        "vectorField": "vector",
        "textField": "text",
        "metadataField": "text-metadata"
    }
}

chunkingStrategyConfiguration = {
    "chunkingStrategy": "FIXED_SIZE",
    "fixedSizeChunkingConfiguration": {
        "maxTokens": 512,
        "overlapPercentage": 20
    }
}

webConfiguration = {
    "sourceConfiguration": {
        "urlConfiguration": {
            "seedUrls": [{
                "url": "https://www.datascienceportfol.io/suman"  # <-- Change this to your web URL
            }]
        }
    },
    "crawlerConfiguration": {
        "crawlerLimits": {
            "rateLimit": 50
        },
        "scope": "HOST_ONLY"
    }
}

embeddingModelArn = f"arn:aws:bedrock:{region_name}::foundation-model/amazon.titan-embed-text-v1"
name = f"bedrock-sample-knowledge-base-{suffix}"
description = "Bedrock Knowledge Bases for Web URL Connector"
roleArn = bedrock_kb_execution_role_arn
Next, we'll call the create_knowledge_base method, which will create the KB.
# Create a KnowledgeBase (KB)
def create_knowledge_base_func():
    create_kb_response = bedrock_agent_client.create_knowledge_base(
        name=name,
        description=description,
        roleArn=roleArn,
        knowledgeBaseConfiguration={
            "type": "VECTOR",
            "vectorKnowledgeBaseConfiguration": {
                "embeddingModelArn": embeddingModelArn
            }
        },
        storageConfiguration={
            "type": "OPENSEARCH_SERVERLESS",
            "opensearchServerlessConfiguration": opensearchServerlessConfiguration
        }
    )
    return create_kb_response["knowledgeBase"]

try:
    kb = create_knowledge_base_func()
except Exception as err:
    print(f"{err=}, {type(err)=}")

# Get KnowledgeBase
get_kb_response = bedrock_agent_client.get_knowledge_base(knowledgeBaseId=kb['knowledgeBaseId'])
Now we'll create the Web URL data source, which will be associated with the knowledge base created above.
# Create a DataSource in KnowledgeBase
create_ds_response = bedrock_agent_client.create_data_source(
    name=name,
    description=description,
    knowledgeBaseId=kb['knowledgeBaseId'],
    dataDeletionPolicy='DELETE',
    dataSourceConfiguration={
        "type": "WEB",
        "webConfiguration": webConfiguration
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": chunkingStrategyConfiguration
    }
)
ds = create_ds_response["dataSource"]

# Get DataSource
bedrock_agent_client.get_data_source(knowledgeBaseId=kb['knowledgeBaseId'], dataSourceId=ds["dataSourceId"])
With the KB and data source in place, we can start the data sync. During the sync job, the KB will fetch the documents from the data source, pre-process them to extract text, chunk them based on the configured chunk size, create embeddings for each chunk, and then write them to the vector database, which in this case is our OpenSearch Serverless collection.
# Start an ingestion job
start_job_response = bedrock_agent_client.start_ingestion_job(knowledgeBaseId=kb['knowledgeBaseId'],
                                                              dataSourceId=ds["dataSourceId"])
job = start_job_response["ingestionJob"]

# Poll the job status until the sync completes
while job['status'] != 'COMPLETE':
    get_job_response = bedrock_agent_client.get_ingestion_job(
        knowledgeBaseId=kb['knowledgeBaseId'],
        dataSourceId=ds["dataSourceId"],
        ingestionJobId=job["ingestionJobId"]
    )
    job = get_job_response["ingestionJob"]
    print(job)
    time.sleep(40)
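Once the job reaches COMPLETE, the ingestion job object carries statistics you can use to sanity-check the sync. A minimal sketch, assuming the statistics field names returned by the GetIngestionJob API:

# Inspect ingestion statistics once the job completes
# (field names assumed from the GetIngestionJob response)
stats = job.get('statistics', {})
print(f"Scanned: {stats.get('numberOfDocumentsScanned')}")
print(f"Indexed: {stats.get('numberOfNewDocumentsIndexed')}")
print(f"Failed:  {stats.get('numberOfDocumentsFailed')}")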
Now we can use the RetrieveAndGenerate API to test the knowledge base. Behind the scenes, the RetrieveAndGenerate API converts queries into embeddings, searches the knowledge base, augments the foundation model prompt with the search results as context information, and returns the FM-generated response to the question. For multi-turn conversations, Knowledge Bases manage the short-term memory of the conversation to provide more contextual results. The output of the RetrieveAndGenerate API includes the generated response and source attribution, as well as the retrieved text chunks.
# Test the KB using the RetrieveAndGenerate API
bedrock_agent_runtime_client = boto3.client("bedrock-agent-runtime", region_name=region_name)
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
model_arn = f'arn:aws:bedrock:{region_name}::foundation-model/{model_id}'
kb_id = kb['knowledgeBaseId']
query = "What are the projects Suman has worked on?"
response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        'text': query
    },
    retrieveAndGenerateConfiguration={
        'type': 'KNOWLEDGE_BASE',
        'knowledgeBaseConfiguration': {
            'knowledgeBaseId': kb_id,
            'modelArn': model_arn
        }
    },
)
generated_text = response['output']['text']
print(generated_text)
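The response also carries the source attribution for the generated answer. Here's a small sketch for printing the cited web pages, assuming the citation structure returned by retrieve_and_generate for a web data source:

# Print the source URLs and text snippets cited in the generated answer
for citation in response.get('citations', []):
    for ref in citation.get('retrievedReferences', []):
        # Web data sources expose the source URL under webLocation
        url = ref.get('location', {}).get('webLocation', {}).get('url')
        snippet = ref.get('content', {}).get('text', '')[:100]
        print(f"{url}: {snippet}...")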
If you want to fetch only the relevant context, you can use the Retrieve API instead. Its output includes the retrieved text chunks, the location type and URI of the source data, and the relevance scores of the retrievals.
# Retrieve API for fetching only the relevant context
relevant_documents = bedrock_agent_runtime_client.retrieve(
    retrievalQuery={
        'text': query
    },
    knowledgeBaseId=kb_id,
    retrievalConfiguration={
        'vectorSearchConfiguration': {
            'numberOfResults': 3  # fetch the top 3 documents that match the query most closely
        }
    }
)
print(relevant_documents["retrievalResults"])
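To make the relevance scores and source locations easier to read, you can iterate over the results. A small sketch, assuming the retrievalResults structure described above:

# Print each retrieved chunk with its relevance score and source URL
for result in relevant_documents["retrievalResults"]:
    score = result.get('score')
    url = result.get('location', {}).get('webLocation', {}).get('url')
    text = result.get('content', {}).get('text', '')[:100]
    print(f"score={score} source={url}")
    print(f"{text}...\n")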
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.