
A Developer’s Guide to Advanced Chunking and Parsing with Amazon Bedrock
Learn to optimize your data retrieval with Knowledge Bases for Amazon Bedrock's advanced parsing, chunking, and query reformulation techniques.
For advanced parsing, Claude 3 Sonnet and Claude 3 Haiku are supported as parsing foundation models, configured under the Chunking and parsing configurations settings.

Knowledge Bases for Amazon Bedrock offers several chunking strategies:
- Semantic chunking
- Hierarchical chunking
- Fixed-size chunking: Configure chunks by specifying the number of tokens per chunk and an overlap percentage (see the configuration sketch after this list).
- Default chunking: Splits content into chunks of approximately 300 tokens, preserving sentence boundaries.
- No chunking: Treats each document as a single text chunk.
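As a concrete starting point, here is a minimal sketch of configuring fixed-size chunking when creating a data source with the boto3 bedrock-agent client. The knowledge base ID, data source name, and bucket ARN are placeholders for your own resources:

```python
import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Create a data source that splits documents into fixed-size chunks.
# All IDs/ARNs below are placeholders.
response = bedrock_agent.create_data_source(
    knowledgeBaseId="KB12345678",              # placeholder
    name="my-fixed-size-data-source",          # placeholder
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::my-docs-bucket"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 300,         # tokens per chunk
                "overlapPercentage": 20,  # overlap between adjacent chunks
            },
        }
    },
)
print(response["dataSource"]["dataSourceId"])
```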
Semantic chunking is a natural language processing technique that divides text into meaningful, complete chunks based on the semantic similarity calculated by the embedding model. By focusing on the text's meaning and context, semantic chunking significantly improves retrieval quality in most use cases compared to blind, syntactic chunking. You can tune it with three parameters (shown in the API sketch after this list):
- Maximum tokens: The maximum number of tokens allowed in a single chunk, while honoring sentence boundaries.
- Buffer size: For a given sentence, the buffer size defines the number of surrounding sentences to be added for embedding creation. For example, a buffer size of 1 results in 3 sentences (current, previous, and next) being combined and embedded. A larger buffer size can capture more context but may also introduce noise, while a smaller buffer size risks missing important context but produces more precise chunks.
- Breakpoint percentile threshold: This threshold determines where to divide the text into chunks based on semantic similarity, helping identify natural breaking points that yield coherent, meaningful chunks. For example, a breakpoint threshold of 90% results in a new chunk being created when the embedding similarity falls below 90%.
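These parameters map directly onto the API. A minimal sketch of the configuration dictionary, passed as vectorIngestionConfiguration to create_data_source just like the fixed-size example above (the values are illustrative, not recommendations):

```python
# Semantic chunking configuration for create_data_source.
semantic_ingestion_config = {
    "chunkingConfiguration": {
        "chunkingStrategy": "SEMANTIC",
        "semanticChunkingConfiguration": {
            "maxTokens": 300,                     # max tokens per chunk
            "bufferSize": 1,                      # surrounding sentences per embedding
            "breakpointPercentileThreshold": 90,  # split below this similarity threshold
        },
    }
}
```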
You can configure this in the console under the Chunking and parsing configurations settings when creating your knowledge base.

For documents with layered structure, hierarchical chunking excels. Hierarchical chunking goes a step further by organizing documents into parent and child chunks. Semantic searches are performed on the child chunks, but results are presented at the parent level. This approach often yields fewer, more relevant search results, since one parent chunk can encompass multiple child chunks. It is particularly effective for documents with a nested or hierarchical structure, such as:
- Technical Manuals: Detailed guides with complex formatting and nested tables.
- Legal Documents: Papers with various sections and sub-sections.
- Academic Papers: Articles with layered information and references.
Hierarchical chunking takes three parameters (see the sketch after this list):
- Parent: Maximum parent chunk token size.
- Child: Maximum child chunk token size.
- Overlap Tokens: Number of overlap tokens between each parent chunk and between each parent and its children.
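In the API, the parent and child sizes become two level configurations. A hedged sketch, with illustrative token values rather than recommendations:

```python
# Hierarchical chunking configuration for create_data_source.
# Two levels: parent chunks first, then child chunks.
hierarchical_ingestion_config = {
    "chunkingConfiguration": {
        "chunkingStrategy": "HIERARCHICAL",
        "hierarchicalChunkingConfiguration": {
            "levelConfigurations": [
                {"maxTokens": 1500},  # parent chunk size
                {"maxTokens": 300},   # child chunk size
            ],
            "overlapTokens": 60,      # overlap between chunks
        },
    }
}
```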
You can do the same via the console while creating a knowledge base: select the Custom option within the Chunking and parsing configurations settings, as we did for semantic chunking, but this time choose hierarchical chunking as your chunking strategy.

If the native strategies don't meet your needs, you can supply your own chunking logic through an AWS Lambda function (a sketch of such a function follows this list):
- Select the No chunking strategy.
- Specify a Lambda function with your custom chunking logic.
- Define an S3 bucket for the knowledge base to write and read the chunked files.
- The Lambda function processes the files, chunks them, and writes them back to the S3 bucket.
- Optionally, use your AWS KMS key for encryption.
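Below is a minimal sketch of such a Lambda function. It assumes the documented custom-transformation contract: an event whose inputFiles/contentBatches point at JSON batch files of fileContents in the intermediate S3 bucket. Treat the field names and the fixed character-based splitting as assumptions to verify against the current Bedrock documentation:

```python
import json
import boto3

s3 = boto3.client("s3")

CHUNK_SIZE = 1000  # characters per chunk; illustrative only


def lambda_handler(event, context):
    """Custom chunking for a Bedrock knowledge base.

    Reads each content batch from the intermediate S3 bucket, splits the
    text into fixed-size pieces, and writes the chunked batches back.
    Field names follow the documented contract; verify against the docs.
    """
    bucket = event["bucketName"]
    output_files = []

    for input_file in event.get("inputFiles", []):
        processed_batches = []
        for batch in input_file.get("contentBatches", []):
            # Fetch the batch of parsed file contents.
            obj = s3.get_object(Bucket=bucket, Key=batch["key"])
            file_contents = json.loads(obj["Body"].read())["fileContents"]

            chunks = []
            for content in file_contents:
                body = content.get("contentBody", "")
                for i in range(0, len(body), CHUNK_SIZE):
                    chunks.append({
                        "contentBody": body[i:i + CHUNK_SIZE],
                        "contentType": content.get("contentType"),
                        "contentMetadata": content.get("contentMetadata", {}),
                    })

            # Write the chunked batch back for the knowledge base to ingest.
            out_key = f"output/{batch['key']}"
            s3.put_object(
                Bucket=bucket,
                Key=out_key,
                Body=json.dumps({"fileContents": chunks}),
            )
            processed_batches.append({"key": out_key})

        output_files.append({
            "originalFileLocation": input_file["originalFileLocation"],
            "fileMetadata": input_file.get("fileMetadata", {}),
            "contentBatches": processed_batches,
        })

    return {"outputFiles": output_files}
```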
Alternatively, if you only need to attach chunk-level metadata rather than control the chunk boundaries (see the short sketch after this list):
- Select a predefined strategy (e.g., Default or Fixed-size).
- Reference your Lambda function and specify the S3 bucket.
- The knowledge base stores parsed files in the S3 bucket.
- The Lambda function adds custom metadata and writes back to the S3 bucket.
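In that flow the Lambda body reduces to annotating each pre-chunked item. A hedged sketch of the per-chunk step, reusing the fileContents shape from the previous example; the metadata keys and values here are made-up illustrations:

```python
# Inside the same batch loop as above: instead of re-chunking,
# attach custom metadata to each pre-chunked item.
for content in file_contents:
    content.setdefault("contentMetadata", {})
    content["contentMetadata"].update({
        "department": "legal",         # hypothetical metadata key/value
        "source_system": "sharepoint",  # hypothetical metadata key/value
    })
```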
Knowledge Bases for Amazon Bedrock also supports querying structured CSV data by pairing each CSV file with a metadata JSON file. Ensure your CSV follows the RFC 4180 format, with the first row containing header information. The metadata JSON file, which should share the same name as your CSV file and have a .csv.metadata.json suffix, guides how each column in your CSV is treated. The workflow is (see the sketch after this list):
- Upload the .csv file and the corresponding <filename>.csv.metadata.json file to the same Amazon S3 prefix.
- Create a knowledge base using either the console or the Amazon Bedrock SDK.
- Start ingestion using either the console or the SDK.
- Query the structured .csv data with the Retrieve and RetrieveAndGenerate APIs, using either the console or the SDK.
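To make this concrete, here is a hedged sketch: writing a metadata file that marks one column as content and the others as filterable metadata, uploading both files under the same prefix, and issuing a query. The bucket, file, and column names are placeholders, and the documentStructureConfiguration shape reflects the documented CSV format at the time of writing; verify it against the current docs:

```python
import json
import boto3

s3 = boto3.client("s3")

# Metadata sidecar controlling how each CSV column is treated.
# Column names ("description", "year", "genre") are placeholders.
csv_metadata = {
    "metadataAttributes": {"source": "catalog-export"},  # file-level metadata
    "documentStructureConfiguration": {
        "type": "RECORD_BASED_STRUCTURE_METADATA",
        "recordBasedStructureMetadata": {
            "contentFields": [{"fieldName": "description"}],  # indexed as text
            "metadataFieldsSpecification": {
                "fieldsToInclude": [{"fieldName": "year"}, {"fieldName": "genre"}],
                "fieldsToExclude": [],
            },
        },
    },
}

# Upload the CSV and its sidecar metadata file under the same prefix.
s3.upload_file("movies.csv", "my-docs-bucket", "data/movies.csv")
s3.put_object(
    Bucket="my-docs-bucket",
    Key="data/movies.csv.metadata.json",
    Body=json.dumps(csv_metadata),
)

# After ingestion completes, query the data.
runtime = boto3.client("bedrock-agent-runtime")
result = runtime.retrieve(
    knowledgeBaseId="KB12345678",  # placeholder
    retrievalQuery={"text": "comedies released after 2015"},
)
for hit in result["retrievalResults"]:
    print(hit["content"]["text"])
```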
In this post, we explored Knowledge Bases for Amazon Bedrock's support for advanced parsing, chunking, and custom processing. These capabilities significantly enhance the efficiency and accuracy of document analysis for developers handling complex datasets. Whether you're managing vast CSV datasets or intricate legal documents, Amazon Bedrock equips you with the tools to optimize your data retrieval processes. Dive into these features, experiment, and see how they transform your data management and analysis workflows.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.