Better RAG accuracy and consistency with Amazon Textract
Crafting a Retrieval-Augmented Generation (RAG) pipeline may seem straightforward, but optimizing it for accuracy, particularly during the PDF ingestion and chunking phase, presents significant challenges. This article explores how Amazon Textract can enhance your RAG pipeline's ingestion capabilities, leading to more precise and reliable outputs in your GenAI question-answering systems.
Published Nov 1, 2024
(You can go directly to part II if you're already familiar with RAG)
RAG is the process of optimizing your LLM-based application's accuracy by leveraging knowledge bases and contextual information (the retrieval part) prior to generating answers to users' questions. LLMs are trained with vast amounts of data but know little about your business context and even less about your internal data. RAG extends the LLM's capabilities to specific domains or internal knowledge bases without the need to retrain the model.
On the left-hand side of the diagram, we have the ingestion part, the part we'll focus on throughout this article. It consists of three main steps:
- Ingesting the document and pre-processing it, typically splitting it into chunks of a reasonable size (around 512-1024 tokens, balancing contextual completeness with the need to stay well below the LLM's context window). The chunks should preserve semantic meaning – for example, keeping paragraphs or sections intact where possible.
- Embeddings are created from text chunks by transforming them into multi-dimensional vectors that capture the semantic meaning of the extracted text. This numerical representation allows us to quantify semantic similarity between different pieces of text by measuring the distance between their corresponding vectors—smaller distances indicate greater semantic similarity (a minimal sketch follows this list). Many different models can generate embeddings: Amazon Titan Embeddings (Text or Multimodal) or Cohere Embed (English or Multimodal) as examples.
- Finally, we store these embeddings in a Vector Store, a specialized system designed for efficient storage and retrieval of multi-dimensional vectors. OpenSearch, pgvector for PostgreSQL, Pinecone, and LanceDB are popular vector stores, each optimized for rapid similarity searches across large collections of vector embeddings.
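To make the idea concrete, here is a minimal sketch of generating embeddings with Amazon Bedrock and comparing two of them with cosine similarity. The example sentences and the use of numpy are illustrative, not part of the original article:

```python
import json
import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime")

def embed(text: str) -> np.ndarray:
    # Titan Text Embeddings expects {"inputText": ...} and returns the vector under "embedding"
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(response["body"].read())["embedding"])

a = embed("How do I clean the mesh filter?")
b = embed("Cleaning the debris filter of the washing machine")

# Cosine similarity: closer to 1.0 means the two texts are semantically closer
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)
```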
On the other side of the picture, we have the RAG part itself, where we leverage the knowledge ingested in the Vector Store:
- The user asks a question; this is the query.
- The system processes this query by converting it into an embedding vector, using the same embedding model used during ingestion to ensure compatibility.
- Using established similarity metrics like cosine or Euclidean distance, the system searches the vector store to identify the most semantically relevant content to the query.
- This "Retrieval" phase extracts the most pertinent contextual information from the knowledge base, which serves as the foundation for generating an accurate response.
- The Large Language Model (LLM) then synthesizes a response by combining three key elements: the original query, the retrieved context, and a carefully crafted prompt that guides its output.
- Finally, the system delivers this contextually augmented, AI-generated response to the user.
Many great articles explain RAG systems in detail (have a look at this one). While each component of a RAG system plays a crucial role, this article focuses on one specific step: document chunking. As we'll see, the way we prepare and chunk documents before creating embeddings can significantly impact the quality of your RAG system's answers. Let's dive into different chunking approaches and their implications for retrieval accuracy.
Text chunking, also known as text splitting, is a crucial step that involves breaking down large documents into smaller, more manageable segments called "chunks." This process enables efficient handling of extensive documents while enhancing the precision of semantic search by creating focused, contextual segments that can be more accurately matched to user queries.
When working with PDFs, the quality of text extraction significantly impacts our chunking options. While basic PDF libraries (PyMuPDF, PyPDF2, pdfminer, ...) can extract text, they often struggle with maintaining correct reading order or preserving document structure. As a result, traditional chunking approaches end up treating documents as simple streams of text, losing valuable structural information that could help maintain context.
Let's examine some common approaches and their limitations (following screenshots were made with https://huggingface.co/spaces/m-ric/chunk_visualizer):
- Character/Word Count Chunking: The most basic approach: split the text every N characters or words (possibly with an overlap). While straightforward to implement, it often breaks content mid-sentence or splits related information, leading to poor context preservation and inaccurate responses (see the short sketch after this list).
- Sentence-Based Chunking: A more natural approach that uses sentence boundaries as splitting points. However, it still treats documents as flat text, missing important structural relationships between sentences and often creating chunks that ignore document hierarchy.
- Recursive Chunking: An improvement over simple sentence splitting that attempts to create larger chunks by combining consecutive sentences up to a maximum size. While this can sometimes preserve more context than single-sentence chunks, it still lacks understanding of document structure and may create arbitrary splits in the content.
- Paragraph-Based Chunking: Uses paragraph breaks ("\n\n") to create chunks. Though better at maintaining some content coherence, it struggles with real-world documents that contain tables, columns, headers, or other complex layouts.
- Semantic Chunking: More advanced methods that try to keep related content together based on meaning. While this method improves context preservation, implementing it reliably is complex.
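As a quick illustration of the first limitation, a fixed-size split can easily separate a safety warning from the instruction it belongs to. The text and chunk size below are purely illustrative:

```python
text = (
    "Cleaning the mesh filter. Unplug the machine before you start. "
    "Open the lower cover and turn the filter counter-clockwise to remove it."
)

# Naive fixed-size chunking: split every 60 characters, with no overlap
chunks = [text[i:i + 60] for i in range(0, len(text), 60)]
for chunk in chunks:
    print(repr(chunk))

# The split lands mid-sentence, so the safety warning and the removal
# instruction end up in different chunks and lose each other's context.
```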
Most of these methods either sacrifice accuracy for simplicity or require complex implementation for better results. However, they all share two common limitations: they rely on basic PDF text extraction that loses document structure, and they process the resulting text as a simple stream, ignoring valuable layout information. This is where Amazon Textract comes into play, offering a more sophisticated approach that not only extracts text from PDFs but also understands and preserves document structure and layout, enabling more intelligent chunking strategies.
Beyond simple text extraction, Amazon Textract provides rich layout analysis capabilities. Using the AnalyzeDocument API, you can request the features you want to analyze (especially "LAYOUT"). It identifies the different structural elements within your documents:
- Titles (LAYOUT_TITLE)
- Headers (LAYOUT_HEADER) and footers (LAYOUT_FOOTER)
- Sections / paragraphs and their boundaries (thanks to LAYOUT_SECTION_HEADER)
- Tables and their structure (LAYOUT_TABLE)
- Lists, bulleted and numbered (LAYOUT_LIST)
- Page numbers (LAYOUT_PAGE_NUMBER)
- Key-value pairs / forms (LAYOUT_KEY_VALUE)
- Figures (LAYOUT_FIGURE)
- Obviously text (LAYOUT_TEXT), generally part of sections
- And last but not least, multi-column layouts: the natural reading order is preserved, processing each column separately rather than concatenating lines across columns
Example:
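For instance, a minimal AnalyzeDocument request asking only for layout analysis might look like this sketch. It uses boto3 directly on a single-page document; the file name is illustrative, and a multi-page PDF would go through the asynchronous StartDocumentAnalysis API instead:

```python
import boto3

textract = boto3.client("textract")

with open("manual-page.png", "rb") as f:
    document_bytes = f.read()

response = textract.analyze_document(
    Document={"Bytes": document_bytes},
    FeatureTypes=["LAYOUT"],
)

# Every detected element comes back as a Block; layout elements use the LAYOUT_* block types
for block in response["Blocks"]:
    if block["BlockType"].startswith("LAYOUT"):
        print(block["BlockType"], round(block.get("Confidence", 0), 1))
```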
Now that we understand Textract's capabilities for layout analysis, let's explore how to implement this approach in practice.
To effectively use Textract's layout capabilities for chunking, we can leverage powerful Python libraries:
- amazon-textract-textractor / amazon-textract-caller: to call the Textract APIs on top of boto3.
- amazon-textract-response-parser: to get understandable objects out of the very verbose JSON response of the API.
- amazon-textract-prettyprinter: to visualize the extracted information in different formats (CSV, Markdown).
First, let's install these libraries (or add them to your requirements.txt):
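For example, with pip:

```bash
pip install amazon-textract-textractor amazon-textract-caller amazon-textract-response-parser amazon-textract-prettyprinter
```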
Let's see how we can use them to extract structured content, with a basic implementation that groups content per section or chapter.
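A rough sketch of the idea, working directly on the Blocks of a Textract layout response (such as the one obtained above) rather than reproducing the original code, could look like this:

```python
def chunk_by_section(response: dict) -> list[str]:
    """Build one chunk per section, using LAYOUT_SECTION_HEADER blocks as boundaries."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    consumed = set()  # LINE ids already attached to a chunk, to avoid double counting

    def layout_text(block: dict) -> str:
        # Walk CHILD relationships down to LINE blocks and collect their text;
        # nested layout blocks (e.g. the items of a LAYOUT_LIST) are handled recursively
        parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "LINE" and child_id not in consumed:
                    consumed.add(child_id)
                    parts.append(child["Text"])
                elif child["BlockType"].startswith("LAYOUT"):
                    parts.append(layout_text(child))
        return " ".join(p for p in parts if p)

    chunks, current = [], []
    for block in response["Blocks"]:
        if block["BlockType"] == "LAYOUT_SECTION_HEADER":
            if current:
                chunks.append("\n".join(current))   # close the previous section
            current = [layout_text(block)]          # start a new one with its header text
        elif block["BlockType"] in ("LAYOUT_TEXT", "LAYOUT_LIST", "LAYOUT_TABLE"):
            text = layout_text(block)
            if text:
                current.append(text)
    if current:
        chunks.append("\n".join(current))
    return chunks

textract_section_chunks = chunk_by_section(response)
```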
To compare both chunking methods (Structured-based vs RecursiveSplit), I've used a washing machine user manual (62 pages, ~ 3.4 MB), a structured document with well-defined sections and instructions on how to perform different operations on the machine:
The following picture depicts the overall testing process:
- I have a series of questions related to this document (e.g., "How to clean the mesh filter?", "What cycle should I select for my underwear?" or "The machine is making excessive noise, what can I do?")
- Each question is processed the same way, but using different vector stores:
- Create embeddings from the question, using Amazon Bedrock and the amazon.titan-embed-text-v1 model.
- Search with cosine similarity in an in-memory FAISS vector store.
- Provide the question and context to the LLM (anthropic.claude-3-sonnet-20240229-v1:0) to get a response.
- I also have a set of responses to these questions, created based on the document and my expectations; this is the ground truth.
- Finally, I've evaluated their performance with the following metrics:
- Response accuracy, using ROUGE scores
- Chunking time and characteristics
- Query response time
As we are comparing chunking methods, the traditional approach uses a standard RecursiveCharacterTextSplitter from LangChain, with different chunk sizes (250, 500, 750 and 1000). I'm using a separate vector store for each approach, one for the Structured (Textract) chunks and one per chunk size, to keep their embeddings apart:
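A minimal sketch of this setup, assuming LangChain's Bedrock embeddings wrapper and one FAISS index per configuration. The variable names and the overlap value are illustrative, and depending on your LangChain version the splitter may live in langchain_text_splitters:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")

full_document_text = "..."          # the manual's raw text, extracted with a basic PDF library
textract_section_chunks = ["..."]   # the per-section chunks from the Textract approach above

# One FAISS index per chunking configuration, so results stay comparable
vector_stores = {}
for chunk_size in (250, 500, 750, 1000):
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=50)
    chunks = splitter.split_text(full_document_text)
    vector_stores[f"recursive_{chunk_size}"] = FAISS.from_texts(chunks, embeddings)

# The Textract-based chunks (one per section) go into their own store
vector_stores["structured"] = FAISS.from_texts(textract_section_chunks, embeddings)
```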
After chunking the document and generating and storing the embeddings in the vector stores, I can perform several queries using the following prompt and code:
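A sketch of that query path, with an illustrative prompt (not the original one) and the Anthropic Messages request format used by Bedrock:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

PROMPT = """Answer the question using only the context below. If the context is not sufficient, say so.

Context:
{context}

Question: {question}"""

def answer(question: str, vector_store, k: int = 4) -> str:
    # Retrieve the k most similar chunks, then let Claude answer from that context only
    docs = vector_store.similarity_search(question, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [
            {"role": "user", "content": PROMPT.format(context=context, question=question)}
        ],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["content"][0]["text"]

print(answer("How to clean the mesh filter?", vector_stores["structured"]))
```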
Let's begin by examining the chunk characteristics. The distribution of chunk sizes across the different approaches is shown in the following figure. While the structured approach with Textract occasionally produces larger chunks for bigger sections (like chapters), it generally maintains relatively compact chunks (median: 382 characters, mean: 602 characters). This reduction in chunk size not only improves processing efficiency but also leads to lower costs in production environments, as LLM providers typically charge based on the number of input tokens.
However, the key distinction lies not in the size metrics but in the semantic coherence of the chunks. Textract's layout-aware processing ensures that each chunk represents a cohesive topic or section, preserving the document's logical structure. In contrast, the traditional approach, despite various chunk size configurations, may split content arbitrarily, potentially breaking apart related information or combining unrelated segments.
The performance analysis reveals significant differences in processing time between the two approaches. The traditional chunking method consistently completes in 5-10 seconds, regardless of the configured chunk size. In contrast, the structured-based approach requires substantially more time, averaging 40-50 seconds per document. This increased processing time is attributed to the additional overhead of OCR processing and layout analysis that enables Textract's structural understanding of the document.
It's important to note that this processing time is a one-time cost incurred during the document ingestion phase. However, for applications involving large document collections or requiring frequent updates, this performance difference should be carefully considered.
To evaluate the accuracy, I've used the ground truth (my "expected" answers) and computed the following ROUGE metrics (a short computation sketch follows the list):
- ROUGE-1 measures the overlap of individual words between the generated response and the ground truth answer, helping us evaluate how well the system captures the key vocabulary and concepts from the source document.
- ROUGE-2 evaluates the overlap of word pairs (bigrams), indicating how well the system preserves the original phrasing and word order, which is particularly important for maintaining accuracy in technical instructions or specific procedures.
- ROUGE-L looks at the longest common subsequence between the generated and reference texts, assessing how well the system maintains longer, coherent sequences of text and the overall flow of information.
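One common way to compute these metrics (not necessarily the exact evaluation code used here) is the rouge-score package; the example answers below are illustrative:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

ground_truth = "Unplug the machine, open the lower cover and turn the mesh filter counter-clockwise to remove and rinse it."
generated = "Unplug the washer, open the bottom cover, then turn the mesh filter counter-clockwise and rinse it under water."

scores = scorer.score(ground_truth, generated)
for name, score in scores.items():
    print(name, round(score.fmeasure, 2))  # f-measure combines precision and recall of the overlap
```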
And the results are represented in the following figure:
The results reveal an interesting pattern: while the structured approach (with Textract) consistently maintains ROUGE scores around 0.60 for ROUGE-1, 0.44 for ROUGE-2, and 0.52 for ROUGE-L, the traditional approach's performance varies significantly with chunk size. With smaller chunks (250 characters), traditional chunking underperforms considerably, but as chunk size increases to 1000 characters, it nearly matches the structured approach performance. This highlights a key trade-off: traditional chunking can achieve comparable results, but only after careful tuning of chunk size parameters. Structured chunking with Textract, on the other hand, delivers robust performance without prior configuration, making it a more reliable choice when dealing with diverse documents or when you want to avoid the overhead of chunk size optimization.
While these results demonstrate the potential benefits of Textract-based chunking, it's important to understand the full picture. Let's examine the practical considerations and limitations that should inform your choice of chunking method.
While layout-based chunking using Textract offers consistent performance without tuning, it's important to understand its trade-offs:
- Processing Time: Initial document processing with Textract takes longer
- Query Response Time: Slightly longer response times
- Accuracy: Comparable or slightly better ROUGE scores than well-tuned traditional chunking, achieved consistently without tuning.
Chunks follow the document's natural structure, leading to size variations. Some sections might be very large while others might be very small. For LLMs with context window limitations, you may need to:
- Implement maximum chunk size checks
- Split large sections using fallback chunking methods (a simple fallback is sketched below)
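A simple way to implement both points is to re-split only the oversized sections, for example with the same recursive splitter used earlier. The size limit below is an illustrative value, not a recommendation:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

MAX_CHUNK_CHARS = 2000  # illustrative limit; tune to your embedding model and LLM context window

fallback_splitter = RecursiveCharacterTextSplitter(chunk_size=MAX_CHUNK_CHARS, chunk_overlap=100)

def enforce_max_size(section_chunks):
    """Keep section chunks as-is unless they exceed the limit, then re-split them."""
    bounded = []
    for chunk in section_chunks:
        if len(chunk) <= MAX_CHUNK_CHARS:
            bounded.append(chunk)
        else:
            bounded.extend(fallback_splitter.split_text(chunk))
    return bounded
```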
Consider Textract when you need:
- Consistent performance without chunk size tuning
- Robust handling of different document types
- Quick deployment without extensive optimization
- Preservation of document structure and context
This layout-based chunking approach works particularly well for:
- Structured documents like technical documentation
- Academic papers with clear section hierarchy
- Business reports and white papers
- Documents with tables and multi-column layouts
- PDF forms and structured documents
This method might not be optimal for:
- Non-PDF documents
- Plain text documents without layout structure
- Very long narrative content (like books)
- Documents where section headers don't meaningfully relate to content
- Cases where very precise token-count control is required
This exploration of document chunking methods reveals that while traditional approaches can achieve good results with proper tuning, Amazon Textract offers a compelling alternative that combines consistency with structural awareness. The key takeaway isn't just about raw performance metrics—it's about finding the right balance between accuracy, ease of implementation, and maintenance overhead.
This analysis shows that Textract-based chunking:
- Delivers consistent ROUGE scores without requiring chunk size optimization
- Preserves document structure and semantic coherence
- Trades slightly longer processing time for improved reliability
- Offers particular advantages for structured technical documentation
As RAG systems continue to evolve, the choice of chunking method becomes increasingly important. While Textract may require more initial processing time, its ability to maintain consistent performance across different document types and its zero-configuration approach make it an attractive choice for many real-world applications, especially when dealing with structured technical documentation.
Whether you choose Textract or traditional chunking methods ultimately depends on your specific use case, but understanding these trade-offs enables you to make an informed decision that aligns with your project's requirements and constraints.