GenAI under the hood [Part 8] - A visual guide to advanced chunking strategies

Picking the right chunking strategy influences everything from the cost to the quality of your search system.

Shreyas Subramanian
Amazon Employee
Published Jul 12, 2024

What is chunking?

You have probably heard (enough) about RAG and search systems. As a quick one-paragraph recap, here are the main steps in any information retrieval process:
  1. Dividing documents into smaller sections for efficient data access.
  2. Converting these sections into numerical representations called embeddings.
  3. Storing the embeddings in a vector index that maintains a link to the original documents.
  4. Using the vector embeddings to analyze and find related or relevant information within the data.
Focusing on the first point alone, how do you know where to split (“chunk”) your documents? Imagine chunking at every sentence (too many chunks!) or at every page (too few chunks, with semantic boundaries blurred). Should you pick a fixed number of words, sentences, or paragraphs? Should you ignore or preprocess some content before it gets embedded? As you can tell, this step has a major influence on the quality of search.
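To make that trade-off concrete, here is a tiny, purely illustrative Python sketch (not what any of the services below actually run) comparing per-sentence splitting against large fixed windows on a toy document:

```python
import re

# Toy document: fifty copies of the same sentence.
document = "Chunking splits a document into pieces before embedding. " * 50

# Option A: one chunk per sentence -- many tiny chunks, little context in each.
sentence_chunks = [s for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

# Option B: one chunk per ~2,000 characters (roughly a "page") -- few large
# chunks, with semantic boundaries blurred inside each one.
page_size = 2000
page_chunks = [document[i:i + page_size] for i in range(0, len(document), page_size)]

print(f"per-sentence: {len(sentence_chunks)} chunks")
print(f"per-page:     {len(page_chunks)} chunks")
```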
Both open-source and proprietary solutions (LangChain, LlamaIndex, Pinecone, Amazon Bedrock, etc.) provide a variety of built-in and customizable chunking strategies. In this post, let’s go through some of the chunking options available on Knowledge Bases for Amazon Bedrock, launched at the New York Summit (July 10th, 2024):
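All of the strategies below plug into the same spot when you create a data source with the Bedrock Agent API (boto3). The sketch below shows that wiring, with placeholder IDs and a fixed-size configuration as one example; the other strategies slot into the same chunkingConfiguration field. Double-check the field names against the current API documentation before relying on them:

```python
import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Placeholder identifiers -- replace with your own knowledge base and bucket.
response = bedrock_agent.create_data_source(
    knowledgeBaseId="KB_ID_PLACEHOLDER",
    name="my-documents",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::my-docs-bucket"},
    },
    vectorIngestionConfiguration={
        # The chunkingConfiguration is where you pick one of the strategies
        # described below (fixed-size, hierarchical, semantic, or none).
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 300,
                "overlapPercentage": 20,
            },
        }
    },
)
print(response["dataSource"]["dataSourceId"])
```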

Fixed chunking

Fixed-size chunking lets you customize the size of the text chunks by specifying the number of tokens per chunk and the amount of overlap between consecutive chunks, giving you flexibility to align chunking with your specific requirements. You set the maximum number of tokens a chunk may contain, as well as the percentage of overlap between consecutive chunks. Default chunking, on the other hand, splits the content into chunks of approximately 300 tokens while respecting sentence boundaries, so complete sentences are preserved within each chunk. Here’s a short clip of what fixed chunking with overlap looks like:
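To show the mechanics, here is a minimal, illustrative sketch in plain Python; words stand in for tokens, and it does not reproduce the service's sentence-boundary handling:

```python
def fixed_size_chunks(text: str, max_tokens: int = 300, overlap_pct: int = 20):
    """Split text into chunks of at most `max_tokens` words, with a
    percentage overlap between consecutive chunks (words ~ tokens here)."""
    words = text.split()
    overlap = int(max_tokens * overlap_pct / 100)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

# 1,000 words, 300-word chunks, 20% overlap -> 4 chunks sharing 60 words each.
print(len(fixed_size_chunks("word " * 1000)))
```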

Hierarchical chunking

This approach divides the information into nested structures of parent and child chunks. When creating a data source, you can define the maximum token size for both parent and child chunks, as well as the number of tokens that overlap between chunks. During retrieval, the system first matches against the smaller child chunks, but then replaces them with their broader parent chunks, giving the model more complete, coherent context rather than isolated fragments and enhancing the overall efficiency and relevance of the information retrieval. Notice how the red child chunks belong to one green parent chunk:
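A rough sketch of the parent/child idea (again with words standing in for tokens, and purely illustrative rather than the service's implementation):

```python
def hierarchical_chunks(text: str, parent_max: int = 1500, child_max: int = 300):
    """Build parent chunks and, inside each parent, smaller child chunks.
    Children are what you embed and search; the parent is what you return."""
    words = text.split()
    parents = []
    for p_start in range(0, len(words), parent_max):
        parent_words = words[p_start:p_start + parent_max]
        children = [
            " ".join(parent_words[c_start:c_start + child_max])
            for c_start in range(0, len(parent_words), child_max)
        ]
        parents.append({"parent": " ".join(parent_words), "children": children})
    return parents

index = hierarchical_chunks("token " * 4000)
# Retrieval sketch: a match on any child chunk returns its parent instead.
hit_parent = index[1]                 # pretend index[1]["children"][2] matched
context_for_llm = hit_parent["parent"]
```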

Semantic chunking

Semantic chunking is a natural language processing technique that divides text into meaningful, complete chunks based on the semantic similarity calculated by the embedding model. By focusing on the text's meaning and context, semantic chunking significantly improves retrieval quality in most use cases compared with blind, syntactic chunking.
When configuring semantic chunking, you can set parameters like the maximum number of tokens per chunk, the buffer size that captures surrounding context, and the threshold that determines natural breaking points, all of which help balance the need for coherent and manageable chunks of text. As a result, you may end up with chunks of varying sizes, but the content within each chunk is generally related:
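Here is an illustrative sketch of the underlying idea; the `embed` function is a stand-in for a real embedding model call, and the managed feature also enforces a maximum token count and buffer size that are omitted here:

```python
import numpy as np

def embed(sentence: str) -> np.ndarray:
    """Stand-in for a real embedding model call (e.g. a Bedrock embeddings model)."""
    rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
    return rng.standard_normal(256)

def semantic_chunks(sentences: list[str], breakpoint_percentile: float = 95.0):
    """Group consecutive sentences, starting a new chunk wherever the cosine
    distance between neighbouring sentence embeddings is unusually high."""
    if len(sentences) < 2:
        return [" ".join(sentences)]
    vectors = [embed(s) for s in sentences]
    distances = [
        1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        for a, b in zip(vectors, vectors[1:])
    ]
    threshold = np.percentile(distances, breakpoint_percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:   # a natural breaking point
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```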

LLM Parsing

Parsing documents with LLMs is particularly useful when the documents are complex, unstructured, or full of domain-specific terminology. You can instruct an LLM to parse in a very specific way, ignoring or specially handling non-textual information like tables. If you need to tailor the data-extraction prompts to your specific needs, this is the right chunking strategy. Currently, on Knowledge Bases for Amazon Bedrock, Claude 3 Sonnet and Claude 3 Haiku are the supported foundation models for this feature. Here, the LLM parser is shown scanning the document and deciding where to place the boundaries of the green chunks, ignoring some portions of the document and processing the orange table chunk differently.
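The idea can be sketched with the Bedrock Converse API directly; the managed feature wires this into ingestion for you, and the prompt and model ID below are just examples of the kind of instruction you might tailor:

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Example parsing instruction -- adjust to your documents and needs.
PARSING_PROMPT = (
    "Split the following document into self-contained sections. "
    "Render tables as Markdown, and skip headers, footers, and page numbers.\n\n"
    "{document_text}"
)

def llm_parse(document_text: str) -> str:
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{
            "role": "user",
            "content": [{"text": PARSING_PROMPT.format(document_text=document_text)}],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]
```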

Custom Lambda for chunking

You can use a custom Lambda function to add chunk-level metadata to your knowledge base, even if you are using one of the pre-defined chunking strategies above. In this case, the knowledge base stores the pre-chunked files in an S3 bucket and then calls your Lambda function to add the custom metadata to each chunk. This is very useful for hybrid search scenarios where, along with finding semantically relevant results for your search query, you also want to filter results based on metadata.
Alternatively, if you have completely custom chunking logic that is not natively supported, you can select the "No chunking strategy" option and provide a Lambda function that handles the chunking. Your Lambda function then writes the chunked files back to the same S3 bucket and returns their references for further processing by the knowledge base. The logic can be completely custom: you can call other AWS services, store metadata, or include/exclude certain types of content.
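As a rough skeleton of what such a Lambda function might look like (the event and response field names here are assumptions for illustration; follow the contract in the Knowledge Bases documentation for the exact schema):

```python
import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Skeleton of a custom chunking Lambda.

    NOTE: the event/response field names below are illustrative assumptions,
    not the documented contract -- check the Knowledge Bases for Amazon
    Bedrock docs for the exact schema your function must follow.
    """
    bucket = event["bucketName"]            # assumed field
    output_files = []

    for input_file in event["inputFiles"]:  # assumed field
        key = input_file["key"]             # assumed field
        text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Completely custom logic goes here: as one example, naive paragraph
        # chunks with per-chunk metadata for hybrid-search filtering.
        chunks = [
            {"contentBody": p, "contentMetadata": {"source": key}}
            for p in text.split("\n\n") if p.strip()
        ]

        out_key = f"chunked/{key}.json"
        s3.put_object(Bucket=bucket, Key=out_key,
                      Body=json.dumps({"fileContents": chunks}))
        output_files.append({"key": out_key})

    # The knowledge base picks the chunked files back up from the same bucket.
    return {"outputFiles": output_files}
```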
The good news is that all of these chunking options are available in Knowledge Bases for Amazon Bedrock today! Check out the following resources to dive deeper:

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
