A Developer’s Guide to Advanced Chunking and Parsing with Amazon Bedrock
Learn to optimize your data retrieval with Knowledge Bases for Amazon Bedrock's advanced parsing, chunking, and query reformulation techniques.
Suman Debnath
Amazon Employee
Published Jul 19, 2024
Last Modified Jul 22, 2024
As a developer working with Amazon Bedrock, you know how crucial it is to efficiently manage and retrieve information from vast and complex datasets. With the new features introduced in Amazon Bedrock, you can now take your knowledge bases to the next level by implementing advanced parsing, chunking, and query reformulation techniques. This guide will walk you through these features, providing code examples and detailed explanations to help you optimize your RAG (Retrieval Augmented Generation) workflows.
When dealing with large documents or complex datasets, traditional retrieval methods can fall short, leading to suboptimal results. Advanced parsing and chunking techniques enable you to split your documents into manageable, contextually meaningful chunks that improve the accuracy and relevance of the retrieved information.
Advanced parsing involves breaking down unstructured documents into their constituent parts, such as text, tables, images, and metadata. This process is crucial for understanding the structure and context of the information. Knowledge Bases (KB) for Amazon Bedrock now supports the use of foundation models (FMs) for parsing complex documents.
You can use advanced parsing techniques for parsing non-textual information from supported file types, such as PDF. This feature allows you to select an FM for parsing complex data, such as tables and charts. Additionally, you can tailor this to your specific needs by overriding the default prompts for data extraction, ensuring optimal performance across a diverse set of use cases. Currently, Claude 3 Sonnet and Claude 3 Haiku are supported. Here's how you can leverage this feature:
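As a minimal sketch (assuming the boto3 bedrock-agent client; the knowledge base ID, bucket ARN, model ARN, and prompt text below are placeholders), FM-based parsing is configured through the data source's vectorIngestionConfiguration:

```python
import boto3

# Sketch only: attach an FM-based parsing configuration to a new data source.
# All IDs, ARNs, and the prompt text are placeholders.
bedrock_agent = boto3.client("bedrock-agent", region_name="us-east-1")

response = bedrock_agent.create_data_source(
    knowledgeBaseId="YOUR_KB_ID",
    name="advanced-parsing-data-source",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::your-docs-bucket"},
    },
    vectorIngestionConfiguration={
        "parsingConfiguration": {
            "parsingStrategy": "BEDROCK_FOUNDATION_MODEL",
            "bedrockFoundationModelConfiguration": {
                "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/"
                            "anthropic.claude-3-sonnet-20240229-v1:0",
                # Optionally override the default data-extraction prompt
                "parsingPrompt": {
                    "parsingPromptText": "Transcribe the text, tables, and charts in this document."
                },
            },
        }
    },
)
print(response["dataSource"]["dataSourceId"])
```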
For more detailed code and examples, please refer to the source code on GitHub.
You can do the same via the console while creating a KB, by selecting the Custom option within the Chunking and parsing configurations settings.

Chunking isn't just about breaking data into pieces; it's about transforming it into a format that makes future retrieval efficient and valuable. Instead of asking, "How should I chunk my data?", the more pertinent question is, "What is the most optimal approach to transform the data so that the foundation model (FM) can effectively use it?"
To achieve this goal, we introduced two new data chunking options within KB, in addition to the existing fixed-size, default, and no chunking options:
- Semantic chunking
- Hierarchical chunking
The other options available are:
- Fixed-size chunking: Configure chunks by specifying the number of tokens per chunk and an overlap percentage (a configuration sketch follows this list).
- Default chunking: Splits content into chunks of approximately 300 tokens, preserving sentence boundaries.
- No chunking: Treats each document as a single text chunk.
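For instance, here is a rough sketch of what the fixed-size option might look like as a vectorIngestionConfiguration for create_data_source (the token count and overlap percentage are illustrative placeholders):

```python
# Sketch only: fixed-size chunking configuration, passed as the
# vectorIngestionConfiguration argument of create_data_source.
fixed_size_chunking = {
    "chunkingConfiguration": {
        "chunkingStrategy": "FIXED_SIZE",
        "fixedSizeChunkingConfiguration": {
            "maxTokens": 300,          # tokens per chunk
            "overlapPercentage": 20,   # overlap between consecutive chunks
        },
    }
}
```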
Let's talk about these new types of chunking added in KB.
Semantic chunking is a natural language processing technique that divides text into meaningful and complete chunks based on the semantic similarity calculated by the embedding model. By focusing on the text's meaning and context, semantic chunking significantly improves the quality of retrieval in most use cases, compared to blind, syntactic chunking.

When configuring semantic chunking on your data source, you can specify several parameters (a configuration sketch follows this list), including:
- Maximum tokens: The maximum number of tokens that should be included in a single chunk, while honoring sentence boundaries.
- Buffer size: For a given sentence, the buffer size defines the number of surrounding sentences to be added for embedding creation. For example, a buffer size of 1 results in 3 sentences (current, previous, and next sentence) being combined and embedded. A larger buffer size might capture more context but can also introduce noise, while a smaller buffer size might miss important context but ensures more precise chunking.
- Breakpoint percentile threshold: This threshold determines where to divide the text into chunks based on semantic similarity. It helps identify natural breaking points in the text to create coherent and meaningful chunks. For example, a breakpoint threshold of 90% results in the creation of a new chunk when the embedding similarity falls below 90%.
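Under the same assumptions as before (placeholder values, boto3 bedrock-agent client), these parameters might map to a configuration like this sketch:

```python
# Sketch only: semantic chunking configuration, passed as the
# vectorIngestionConfiguration argument of create_data_source.
semantic_chunking = {
    "chunkingConfiguration": {
        "chunkingStrategy": "SEMANTIC",
        "semanticChunkingConfiguration": {
            "maxTokens": 300,                     # max tokens per chunk
            "bufferSize": 1,                      # surrounding sentences used for each embedding
            "breakpointPercentileThreshold": 90,  # similarity threshold for starting a new chunk
        },
    }
}
```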
For more detailed code and examples, please refer to the source code on GitHub.
And again, you can do the same via the console while creating a Knowledge Base, by selecting the Custom option within the Chunking and parsing configurations settings.

Understanding the contextual boundaries within complex documents like legal papers, technical manuals, or academic articles can be challenging. Traditional chunking methods often struggle to capture the nested structures and intricate relationships within these documents. This is where hierarchical chunking excels.

Dynamic chunking automatically identifies and groups related content into coherent chunks based on semantic similarity. For example, in a legal document with various clauses and sub-clauses, dynamic chunking ensures that related content stays together, enhancing document parsing and analysis accuracy.
Hierarchical chunking goes a step further by organizing documents into parent and child chunks. Semantic searches are performed on the child chunks, but results are presented at the parent level. This approach often results in fewer, more relevant search results, as one parent chunk can encompass multiple child chunks. This method is particularly effective for documents with a nested or hierarchical structure, such as:
- Technical Manuals: Detailed guides with complex formatting and nested tables.
- Legal Documents: Papers with various sections and sub-sections.
- Academic Papers: Articles with layered information and references.
By structuring the document hierarchically, the model gains a better understanding of the relationships between different parts of the content, enabling it to provide more contextually relevant and coherent responses.
Hierarchical chunking allows defining parent chunk size, child chunk size, and the number of tokens overlapping between each chunk. During retrieval, the system retrieves child chunks but replaces them with broader parent chunks to provide more relevant context. This approach enhances efficiency by providing concise, higher-level summaries instead of granular details.
For this, KB supports specifying two levels of chunking depth with the following parameters (a configuration sketch follows this list):
- Parent: Maximum parent chunk token size.
- Child: Maximum child chunk token size.
- Overlap Tokens: Number of overlap tokens between each parent chunk and between each parent and its children.
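As a rough sketch (token sizes are illustrative placeholders), a hierarchical chunking configuration for create_data_source might look like this:

```python
# Sketch only: hierarchical chunking configuration, passed as the
# vectorIngestionConfiguration argument of create_data_source.
# The first level configuration is the parent, the second is the child.
hierarchical_chunking = {
    "chunkingConfiguration": {
        "chunkingStrategy": "HIERARCHICAL",
        "hierarchicalChunkingConfiguration": {
            "levelConfigurations": [
                {"maxTokens": 1500},  # maximum parent chunk size
                {"maxTokens": 300},   # maximum child chunk size
            ],
            "overlapTokens": 60,      # overlap tokens between chunks
        },
    }
}
```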
To further enhance the accuracy of document analysis, you can combine hierarchical chunking with FM parsing (which we have seen earlier). This combination allows for more precise parsing of complex documents, ensuring that every relevant detail is captured and appropriately organized. This synergy improves the quality of generated responses by maintaining the integrity of the document's structure.
For more detailed code and examples, please refer to the source code on GitHub.
You can do the same via the console while creating a KB, by selecting the Custom option within the Chunking and parsing configurations settings, like we did in the case of semantic chunking, but this time by selecting hierarchical chunking as your chunking strategy.

Lastly, for those seeking even greater flexibility in document analysis, custom processing logic can be integrated using Lambda functions. This approach allows for the addition of metadata or the definition of custom chunking logic tailored to specific needs.
Custom Chunking Logic:
- If native strategies don’t meet your needs, select the No chunking strategy.
- Specify a Lambda function with your custom chunking logic (see the configuration sketch after this list).
- Define an S3 bucket for the knowledge base to write and read the chunked files.
- The Lambda function processes the files, chunks them, and writes them back to the S3 bucket.
- Optionally, use your AWS KMS key for encryption.
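As a hedged sketch (the Lambda ARN, intermediate bucket, and values below are placeholders; verify the exact field names against the API reference), the "no chunking plus custom Lambda" setup might be expressed like this:

```python
# Sketch only: no built-in chunking, with a custom Lambda transformation.
# The knowledge base writes parsed files to the intermediate S3 location,
# invokes the Lambda function, and reads the chunked output back.
custom_chunking = {
    "chunkingConfiguration": {"chunkingStrategy": "NONE"},
    "customTransformationConfiguration": {
        "intermediateStorage": {
            "s3Location": {"uri": "s3://your-intermediate-bucket/kb-chunking/"}
        },
        "transformations": [
            {
                "stepToApply": "POST_CHUNKING",
                "transformationFunction": {
                    "transformationLambdaConfiguration": {
                        "lambdaArn": "arn:aws:lambda:us-east-1:111122223333:function:my-custom-chunker"
                    }
                },
            }
        ],
    },
}
```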
Adding Chunk-Level Metadata:
- Select a predefined strategy (e.g., Default or Fixed-size) if you need chunk-level metadata.
- Reference your Lambda function and specify the S3 bucket.
- The knowledge base stores parsed files in the S3 bucket.
- The Lambda function adds custom metadata and writes back to the S3 bucket.
This custom processing ensures your document analysis is tailored to your needs, providing precise and contextually relevant analysis. For more details, refer to the documentation.
Handling large datasets can often be cumbersome, especially when dealing with CSV files that combine content and metadata. Thankfully, KB now offers a more streamlined approach for processing CSV files, making it easier for developers to manage and manipulate data.

With the new enhancement, you can now separate content and metadata within your CSV files, transforming the way you handle data ingestion. This feature allows you to designate specific columns as content fields while marking others as metadata fields. The result? Fewer files to manage and a more efficient data processing workflow.
Imagine you’re working with a massive CSV dataset. Previously, you might have needed multiple content and metadata file pairs. Now, a single CSV file, accompanied by a metadata JSON file, suffices. This not only simplifies your file management but also enhances your data’s usability.
To start leveraging this feature, you need to prepare your CSV files and corresponding metadata JSON files. Ensure your CSV follows the RFC 4180 format, with the first row containing header information. The metadata JSON file, which should share the same name as your CSV file and have a .csv.metadata.json suffix, guides how each column in your CSV is treated. Here's a sneak peek at what your metadata JSON file might look like:
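As a rough sketch (the column names below are hypothetical, and the exact schema should be verified against the documentation), a metadata file for a reviews.csv with columns review_text, product_id, star_rating, and internal_notes could be generated like this:

```python
import json

# Sketch only: build and write reviews.csv.metadata.json.
# Column names are hypothetical; verify the schema against the documentation.
csv_metadata = {
    "metadataAttributes": {
        "source": "customer-reviews-2024"            # optional file-level metadata
    },
    "documentStructureConfiguration": {
        "type": "RECORD_BASED_STRUCTURE_METADATA",
        "recordBasedStructureMetadata": {
            "contentFields": [
                {"fieldName": "review_text"}          # column indexed as content
            ],
            "metadataFieldsSpecification": {
                "fieldsToInclude": [
                    {"fieldName": "product_id"},      # columns stored as metadata
                    {"fieldName": "star_rating"},
                ],
                "fieldsToExclude": [
                    {"fieldName": "internal_notes"}   # columns ignored entirely
                ],
            },
        },
    },
}

with open("reviews.csv.metadata.json", "w") as f:
    json.dump(csv_metadata, f, indent=2)
```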
Use the following steps to experiment with the .csv file improvement feature:
- Upload the .csv file and corresponding <filename>.csv.metadata.json file in the same Amazon S3 prefix.
- Create a knowledge base using either the console or the Amazon Bedrock SDK.
- Start ingestion using either the console or the SDK.
- Use the Retrieve API or RetrieveAndGenerate API to query the structured .csv file data, using either the console or the SDK (see the sketch after this list).
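For example, a minimal RetrieveAndGenerate call through the boto3 bedrock-agent-runtime client might look like this sketch (the knowledge base ID, model ARN, and question are placeholders):

```python
import boto3

# Sketch only: query the knowledge base with RetrieveAndGenerate.
# The knowledge base ID, model ARN, and question are placeholders.
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "What is the average star rating for product B00X?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/"
                        "anthropic.claude-3-sonnet-20240229-v1:0",
        },
    },
)
print(response["output"]["text"])
```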
For more details refer to the documentation.
In this blog, we've explored the powerful new features in Knowledge Bases for Amazon Bedrock that enable advanced parsing, chunking, and custom processing. These capabilities are game-changers for developers handling complex datasets, significantly enhancing the efficiency and accuracy of document analysis. Whether you're managing vast CSV datasets or intricate legal documents, Amazon Bedrock equips you with the tools needed to optimize your data retrieval processes. Dive into these features, experiment, and transform your data management and analysis workflows into something more powerful and effective. Embrace the enhancements, and watch how they revolutionize the way you handle and interpret your data.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.