Web to Wisdom: Transforming Web Content with Amazon Bedrock for Knowledge Bases
Integrate web data with Amazon Bedrock for KBs. Boost RAG apps using public web data for precise AI responses.
Suman Debnath
Amazon Employee
Published Jul 11, 2024
Last Modified Jul 16, 2024
Are you ready to enhance your Retrieval-Augmented Generation (RAG) applications? Amazon Bedrock now has a new feature (in preview) that makes Knowledge Bases for Amazon Bedrock (KB) even more versatile. You can now integrate a wider range of data sources, including web URLs as well as data from Atlassian Confluence, Microsoft SharePoint, and Salesforce. Whether you're a developer or a data scientist, these new options can help streamline your data workflows and make your AI-powered responses more accurate and relevant. In this blog, we explore how to get started with a web URL as a data source for Knowledge Bases.
But before we dive into this new capability, let's take a moment to understand what Knowledge Bases for Amazon Bedrock is all about.
Let's start with a quick primer, in case you're new to Amazon Bedrock or the concept of Knowledge Bases (KB). KB is a fully managed RAG capability that enables you to link foundation models (FMs) to your internal company data sources. This connection allows FMs to retrieve contextual information from your data, delivering more relevant and precise responses.
RAG is a powerful technique that enhances the capabilities of foundation models by providing them access to structured and unstructured data. This helps in generating responses that are not only more accurate but also more tailored to the specific needs of your business. If you are new to RAG and would like to have a deeper understanding of RAG, vectors, embeddings, and all the goodness around that, you might like to check out the two-part blog post on Vector Embeddings and RAG Demystified: Leveraging Amazon Bedrock, Aurora, and LangChain.
A data source connector is basically a bridge that links your proprietary data to a KB. Once set up, these connectors ensure that your data stays updated and is readily available for querying.
When you create a KB in Amazon Bedrock, you can configure a data source of your choice. Before we jump into how to set things up and make use of these connectors, let's see which new connectors are now available for you to use in addition to Amazon S3:
- Web Data Source: Index public web pages that you are authorized to crawl so your RAG applications can use their content. This is perfect for integrating up-to-date information from company blogs, news sites, and social media feeds.
- Confluence: Connect to Confluence to tap into your organization's documentation, meeting notes, and collaborative content. This integration allows your AI applications to utilize the latest insights from your team's shared knowledge.
- Salesforce: Integrate with Salesforce to query customer relationship management (CRM) data. This provides your RAG applications with contextually rich responses based on customer interactions, sales data, and more.
- Microsoft SharePoint: Access a wide array of documents and resources stored in your organization's SharePoint sites. This is ideal for you if you rely on SharePoint for document management and collaboration.
In this blog, we will focus on the Web Crawler as the data source, so let's get started.
Setting up these new data sources is straightforward, whether you prefer using the AWS Management Console or the CreateDataSource API.
The Web Crawler is a powerful tool that connects to and crawls HTML pages, starting from a specified seed URL and traversing all child links within the same top-level domain and path. It can also fetch supported documents referenced by these HTML pages, even if they are outside the primary domain. You can customize the crawling behavior through the crawling configuration settings.
Here's what you can do with the Web Crawler:
- Select Multiple URLs: Choose multiple seed URLs to start the crawl.
- Respect robots.txt Directives: Follow standard robots.txt directives like Allow and Disallow to ensure compliance with web standards.
- Scope Limitation: Restrict the crawling scope to specific URLs and optionally exclude URLs that match certain filter patterns.
- Crawl Rate Limitation: Control the rate at which URLs are crawled to manage load and performance.
- Monitor Crawling Status: Use Amazon CloudWatch to view the status of URLs visited during the crawling process.
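To make the crawler options above more concrete, here is a minimal sketch of how they could map onto the `dataSourceConfiguration` payload that the boto3 `bedrock-agent` `create_data_source` call accepts for a `WEB` data source. The helper name `build_web_data_source_config`, the default values, and the example URLs are assumptions made for illustration; they are not from the original post. The robots.txt handling and CloudWatch status reporting happen on the service side, so they do not appear in the payload.

```python
from typing import Any, Dict, List, Optional

# Crawl scopes accepted by the web crawler configuration.
VALID_SCOPES = {"HOST_ONLY", "SUBDOMAINS"}


def build_web_data_source_config(
    seed_urls: List[str],
    scope: str = "HOST_ONLY",
    rate_limit: int = 60,
    inclusion_filters: Optional[List[str]] = None,
    exclusion_filters: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build a dataSourceConfiguration dict for a WEB data source.

    Hypothetical helper: it only assembles a dictionary and makes no AWS calls.
    """
    if not seed_urls:
        raise ValueError("at least one seed URL is required")
    if scope not in VALID_SCOPES:
        raise ValueError(f"scope must be one of {sorted(VALID_SCOPES)}")
    if rate_limit < 1:
        raise ValueError("rate_limit must be a positive integer")

    crawler: Dict[str, Any] = {
        "scope": scope,
        # Crawl rate limit; the default of 60 is an arbitrary choice here.
        "crawlerLimits": {"rateLimit": rate_limit},
    }
    # Filters are regular-expression patterns matched against URLs.
    if inclusion_filters:
        crawler["inclusionFilters"] = list(inclusion_filters)
    if exclusion_filters:
        crawler["exclusionFilters"] = list(exclusion_filters)

    return {
        "type": "WEB",
        "webConfiguration": {
            "sourceConfiguration": {
                "urlConfiguration": {
                    # Multiple seed URLs are passed as a list of {"url": ...}.
                    "seedUrls": [{"url": url} for url in seed_urls]
                }
            },
            "crawlerConfiguration": crawler,
        },
    }


config = build_web_data_source_config(
    ["https://example.com/blog", "https://example.com/docs"],
    exclusion_filters=[r".*\.pdf$"],
)
```

The resulting dictionary would be passed as the `dataSourceConfiguration` argument of `create_data_source`, alongside the knowledge base ID and a data source name.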
Before using this connector, make sure you adhere to the Amazon Acceptable Use Policy and have permission to crawl the URLs of interest.
We will be using the Python SDK, boto3, but you can do the same via the console or the CLI.
First, we need to create the vector database which the KB will use in the backend. This vector store allows Bedrock to store, update, and manage embeddings (vector representations of the text crawled from the web URLs you provide). You can quickly create a new vector store or select a supported vector store you have previously created. You can follow this GitHub repo for more details.
Let’s create a new vector store, using Amazon OpenSearch Serverless (OSS).
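The original code for this step is not reproduced here, so the following is only a rough sketch of what creating an OpenSearch Serverless vector collection with boto3 could look like. The helper names, the policy names, the public network policy, and the broad `aoss:*` permissions are all assumptions chosen to keep the example short; a real setup would likely use tighter permissions. The policy documents are built as plain JSON strings, and the AWS calls are wrapped in a function that is defined but not invoked.

```python
import json
from typing import Dict, List


def build_oss_policies(collection_name: str, principal_arns: List[str]) -> Dict[str, str]:
    """Return JSON policy documents for a vector collection (illustrative only)."""
    resource = f"collection/{collection_name}"
    collection_rule = {"ResourceType": "collection", "Resource": [resource]}

    # Encryption policy: use an AWS-owned key for simplicity.
    encryption = {"Rules": [collection_rule], "AWSOwnedKey": True}
    # Network policy: public access is an assumption for a quick demo.
    network = [{"Rules": [collection_rule], "AllowFromPublic": True}]
    # Data access policy: grant the given principals access to the
    # collection and to every index inside it.
    access = [
        {
            "Rules": [
                {**collection_rule, "Permission": ["aoss:*"]},
                {
                    "ResourceType": "index",
                    "Resource": [f"index/{collection_name}/*"],
                    "Permission": ["aoss:*"],
                },
            ],
            "Principal": principal_arns,
        }
    ]
    return {
        "encryption": json.dumps(encryption),
        "network": json.dumps(network),
        "access": json.dumps(access),
    }


def create_vector_collection(
    collection_name: str, principal_arns: List[str], region: str = "us-east-1"
) -> str:
    """Create the policies and the collection; returns the collection ARN.

    Requires boto3 and AWS credentials, so it is not called in this sketch.
    Policy names are derived from the collection name (keep it short).
    """
    import boto3  # imported lazily so the helper above works without boto3

    aoss = boto3.client("opensearchserverless", region_name=region)
    policies = build_oss_policies(collection_name, principal_arns)
    aoss.create_security_policy(
        name=f"{collection_name}-enc", type="encryption", policy=policies["encryption"]
    )
    aoss.create_security_policy(
        name=f"{collection_name}-net", type="network", policy=policies["network"]
    )
    aoss.create_access_policy(
        name=f"{collection_name}-access", type="data", policy=policies["access"]
    )
    response = aoss.create_collection(name=collection_name, type="VECTORSEARCH")
    return response["createCollectionDetail"]["arn"]
```

Note that this sketch stops at the collection. The knowledge base also needs a vector index inside the collection, with vector, text, and metadata fields, and the principals should include the IAM role that the knowledge base assumes.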
Now that we have the vector store created, let’s create the KB. Here we'll:
- Initialize the OpenSearch Serverless configuration, which includes the collection ARN, index name, vector field, text field, and metadata field.
- Initialize the chunking strategy, based on which the KB will split the documents into pieces of the chunk size specified in the chunkingStrategyConfiguration.
- Initialize the web URL configuration, which will be used to create the data source object later.
- Initialize the Titan embeddings model ARN, as this will be used to create the embeddings for each of the text chunks.
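The four configuration pieces above can be sketched as plain dictionaries shaped for the boto3 `bedrock-agent` client. Everything specific in this example is an assumption for illustration: the helper names, the region, the field names `vector`, `text`, and `text-metadata`, the chunk size of 512 tokens with 20% overlap, and the choice of `amazon.titan-embed-text-v1` as the Titan embeddings model. The function that talks to AWS is defined but not called.

```python
from typing import Any, Dict, Tuple

REGION = "us-east-1"  # assumed region
# Assumed Titan embeddings model; swap in the model you actually use.
EMBEDDING_MODEL_ARN = (
    f"arn:aws:bedrock:{REGION}::foundation-model/amazon.titan-embed-text-v1"
)


def build_kb_request(
    name: str, role_arn: str, collection_arn: str, index_name: str
) -> Dict[str, Any]:
    """Keyword arguments for create_knowledge_base (OpenSearch Serverless store)."""
    return {
        "name": name,
        "roleArn": role_arn,
        "knowledgeBaseConfiguration": {
            "type": "VECTOR",
            "vectorKnowledgeBaseConfiguration": {
                "embeddingModelArn": EMBEDDING_MODEL_ARN
            },
        },
        "storageConfiguration": {
            "type": "OPENSEARCH_SERVERLESS",
            "opensearchServerlessConfiguration": {
                "collectionArn": collection_arn,
                "vectorIndexName": index_name,
                # Field names must match the index created in the collection.
                "fieldMapping": {
                    "vectorField": "vector",
                    "textField": "text",
                    "metadataField": "text-metadata",
                },
            },
        },
    }


def build_data_source_request(
    kb_id: str, name: str, seed_url: str, max_tokens: int = 512, overlap_pct: int = 20
) -> Dict[str, Any]:
    """Keyword arguments for create_data_source with fixed-size chunking."""
    return {
        "knowledgeBaseId": kb_id,
        "name": name,
        "dataSourceConfiguration": {
            "type": "WEB",
            "webConfiguration": {
                "sourceConfiguration": {
                    "urlConfiguration": {"seedUrls": [{"url": seed_url}]}
                }
            },
        },
        "vectorIngestionConfiguration": {
            "chunkingConfiguration": {
                "chunkingStrategy": "FIXED_SIZE",
                "fixedSizeChunkingConfiguration": {
                    "maxTokens": max_tokens,
                    "overlapPercentage": overlap_pct,
                },
            }
        },
    }


def create_kb_and_data_source(
    kb_request: Dict[str, Any], ds_name: str, seed_url: str
) -> Tuple[str, str]:
    """Create the KB, then its web data source (needs boto3 and credentials).

    In practice you may need to wait for the knowledge base to become
    ACTIVE before creating the data source.
    """
    import boto3  # lazy import keeps the builders usable without boto3

    agent = boto3.client("bedrock-agent", region_name=REGION)
    kb = agent.create_knowledge_base(**kb_request)["knowledgeBase"]
    ds_request = build_data_source_request(kb["knowledgeBaseId"], ds_name, seed_url)
    ds = agent.create_data_source(**ds_request)["dataSource"]
    return kb["knowledgeBaseId"], ds["dataSourceId"]
```

Keeping the request builders separate from the AWS calls makes it easy to inspect or log the exact payload before anything is created.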
Let’s now provide the above configurations as input to the create_knowledge_base method, which will create the KB.
Now finally we can create a Web URL data source, which will be associated with the knowledge base created above.
Once the KB is ready and the data source is created, we can start the data sync. During the sync job, the KB will fetch the documents in the data source, pre-process them to extract text, chunk them based on the provided chunking size, create embeddings for each chunk, and then write them to the vector database, which in this case is our OpenSearch Serverless collection.
Let's now use the RetrieveAndGenerate API to test the knowledge base. Behind the scenes, the RetrieveAndGenerate API converts queries into embeddings, searches the knowledge base, augments the foundation model prompt with the search results as context, and returns the FM-generated response to the question. For multi-turn conversations, Knowledge Bases manages short-term memory of the conversation to provide more contextual results.
The output of the RetrieveAndGenerate API includes the generated response, source attribution, and the retrieved text chunks.
The Retrieve API converts user queries into embeddings, searches the knowledge base, and returns the relevant results, giving you more control to build custom workflows on top of the semantic search results. The output of the Retrieve API includes the retrieved text chunks, the location type and URI of the source data, and the relevance scores of the retrievals.
You can do the same using the console; have a look at this quick walkthrough.
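As a rough sketch of the sync and query steps described above, the helpers below wrap the boto3 `start_ingestion_job`, `get_ingestion_job`, `retrieve_and_generate`, and `retrieve` calls. The function names `wait_for_sync`, `ask`, and `search`, the polling interval, and the shape of the returned dictionaries are assumptions made for this example. The clients are passed in as arguments, so nothing here runs against AWS until you supply real ones.

```python
import time
from typing import Any, Dict, List, Optional

# Statuses after which an ingestion job no longer changes.
TERMINAL_STATUSES = ("COMPLETE", "FAILED", "STOPPED")


def wait_for_sync(agent: Any, kb_id: str, ds_id: str, poll_seconds: float = 10.0) -> str:
    """Start a data sync and poll until it reaches a terminal status."""
    job = agent.start_ingestion_job(knowledgeBaseId=kb_id, dataSourceId=ds_id)[
        "ingestionJob"
    ]
    while job["status"] not in TERMINAL_STATUSES:
        time.sleep(poll_seconds)
        job = agent.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=ds_id,
            ingestionJobId=job["ingestionJobId"],
        )["ingestionJob"]
    return job["status"]


def ask(
    runtime: Any,
    kb_id: str,
    model_arn: str,
    question: str,
    session_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Call RetrieveAndGenerate; returns the answer, sources, and session ID."""
    kwargs: Dict[str, Any] = {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    }
    if session_id:
        # Reusing the session ID continues a multi-turn conversation.
        kwargs["sessionId"] = session_id
    response = runtime.retrieve_and_generate(**kwargs)
    sources = [
        ref["location"]
        for citation in response.get("citations", [])
        for ref in citation.get("retrievedReferences", [])
    ]
    return {
        "answer": response["output"]["text"],
        "sources": sources,
        "session_id": response.get("sessionId"),
    }


def search(runtime: Any, kb_id: str, query: str, top_k: int = 3) -> List[Dict[str, Any]]:
    """Call Retrieve; returns text chunks with their location and score."""
    response = runtime.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    return [
        {
            "text": result["content"]["text"],
            "location": result["location"],
            "score": result.get("score"),
        }
        for result in response["retrievalResults"]
    ]
```

With real credentials, `agent` would be `boto3.client("bedrock-agent")` and `runtime` would be `boto3.client("bedrock-agent-runtime")`. Passing the `session_id` from one `ask` call into the next is how a multi-turn conversation would be continued in this sketch.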
Integrating web and enterprise data sources with Amazon Bedrock significantly enhances the capabilities of your RAG applications. The new data connectors for Atlassian Confluence, Microsoft SharePoint, and Salesforce, alongside web data sources and Amazon S3, streamline data workflows and improve the accuracy and relevance of AI-powered responses. With these connectors, you can effortlessly integrate and manage data from various sources, empowering your applications to deliver more informed and contextually rich outputs. Whether you're a developer or a data scientist, these advancements in Amazon Bedrock open up new possibilities for creating sophisticated and responsive AI solutions.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.