Amazon Bedrock's Knowledge Base: Parsing and Chunking


This blog summarizes our experiment using Amazon Bedrock Knowledge Bases to test how Parsing and Chunking settings impact retrieval accuracy in RAG applications.

Rajnish
Amazon Employee
Published Apr 25, 2025
This article was co-written by Richa Gupta and Porus Arora
If you're working with Retrieval-Augmented Generation (RAG), you already know that strong retrieval is key to generating accurate, meaningful responses. When using Amazon Bedrock Knowledge Bases, how you configure Parsing and Chunking can make a significant difference. The following is a summary of our experiment observing how changes to these two key settings influence retrieval outcomes and overall response quality.
The success of Retrieval-Augmented Generation (RAG) applications heavily depends on their ability to accurately retrieve relevant information—it's the foundation that determines the quality of your AI-generated responses. Through practical experiments using Amazon's 2023 Annual Report, we demonstrate how optimizing these settings can enhance the quality of retrieved information and generate more comprehensive responses. Whether you're building a production RAG system or optimizing an existing one, these insights will help you make informed decisions about your Knowledge Base configuration.
When implementing RAG applications with Amazon Bedrock, developers often overlook the critical impact that document preprocessing can have on the final output quality. While default configurations might seem sufficient during initial testing, our experiments reveal that thoughtful adjustment of parsing and chunking parameters can lead to dramatically improved retrieval accuracy and more contextually relevant responses. This is particularly crucial for enterprise applications where precision and completeness of information are paramount.
In this article, we'll walk through:
  • A comparative analysis of default versus optimized configurations
  • Real-world examples demonstrating the impact on response quality
  • Practical guidelines for choosing the right parsing and chunking strategies
  • Performance implications and trade-offs to consider
The RAG Pipeline: Understanding Where Parsing and Chunking Fit
The RAG pipeline transforms documents into AI responses through three essential stages. During Document Ingestion, the system parses various document formats into machine-readable text and chunks them into manageable segments. The Embedding and Indexing stage converts these chunks into vector embeddings, creating searchable indexes for efficient retrieval. In the final Retrieval and Generation stage, the system matches user queries with relevant chunks to provide context for generating accurate responses.
Why Parsing and Chunking Matter
Parsing and chunking are critical preprocessing steps in RAG systems. Poor parsing can strip documents of essential structure, causing loss of headers, lists, and tables, while misinterpreting formatting and corrupting special characters. Ineffective chunking risks fragmenting content inappropriately, splitting sentences or paragraphs mid-thought and severing connections between related sections. These issues can lead to context loss and inefficient retrieval, ultimately compromising the system's ability to generate accurate, coherent responses. By optimizing these processes, RAG systems can maintain document integrity and create meaningful, context-aware segments, forming a solid foundation for precise information retrieval and generation.

Understanding the RAG Flow: Why Parsing and Chunking Matter

A RAG system generally includes:
  • Document Ingestion
  • Embedding and Indexing
  • Retrieval and Generation
Parsing and chunking happen during the ingestion phase and directly affect the quality of retrieval in later stages.
  • Parsing: by default, converts PDF/DOC into raw text; may lose headers, lists, and structure.
  • Chunking: by default, breaks text by token count; may split logically connected content.
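To see why count-based chunking can sever related content, here is a toy chunker (whitespace-separated words stand in for real tokenizer tokens; the sample text is illustrative, not taken from the report):

```python
def fixed_size_chunks(text, max_tokens):
    # Naive fixed-size chunker: split on whitespace and group every
    # max_tokens "tokens", ignoring sentence and section boundaries.
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

doc = ("Graviton4 offers better compute performance than Graviton3. "
       "Trainium2 delivers faster training than Trainium1.")
chunks = fixed_size_chunks(doc, 8)
# The second sentence's subject ("Trainium2") lands at the end of chunk 0,
# while its claim starts chunk 1, so neither chunk retrieves well alone.
```

This is exactly the failure mode the table above describes: the chunk boundary falls mid-sentence, so a query about Trainium2 may retrieve a chunk that never mentions it.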

Document Used for Testing

For this test, we used Amazon's 2023 Annual Report. This document includes well-structured sections, bullet points, and formal language, making it an ideal candidate for analyzing how structural preservation impacts downstream performance. We uploaded the document to an S3 bucket.

Knowledge Base A: Default Configuration

  • Go to Amazon Bedrock > Create Knowledge Base
  • Select S3 location of the uploaded document
  • Parsing strategy: Bedrock Default Parser
  • Chunking strategy: Default Chunking (chunks of about 300 tokens)
  • Embedding model: Titan Text Embeddings v2 (default settings)
Basic configuration
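The console steps above can also be scripted against the `bedrock-agent` API. Below is a minimal boto3 sketch for attaching the S3 data source with fixed-size chunking of roughly 300 tokens; the Knowledge Base ID, data source name, bucket ARN, and 20% overlap are placeholders/assumptions, and field names should be verified against the current API documentation:

```python
def default_ingestion_config(max_tokens=300, overlap_pct=20):
    # Fixed-size chunking, roughly matching the console's default (~300 tokens).
    # overlap_pct=20 is an assumed value, not one from the experiment.
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": max_tokens,
                "overlapPercentage": overlap_pct,
            },
        }
    }

def create_default_data_source(kb_id, bucket_arn):
    # Local import keeps the config builder usable without the AWS SDK.
    import boto3
    client = boto3.client("bedrock-agent")
    return client.create_data_source(
        knowledgeBaseId=kb_id,                     # ID returned by create_knowledge_base
        name="amazon-2023-annual-report",          # placeholder name
        dataSourceConfiguration={
            "type": "S3",
            "s3Configuration": {"bucketArn": bucket_arn},
        },
        vectorIngestionConfiguration=default_ingestion_config(),
    )
```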

Knowledge Base B: Advanced Configuration

  • Go to Amazon Bedrock > Create Knowledge Base
  • Select same S3 location
  • Parsing strategy: Bedrock Data Automation (preserves structure)
  • Chunking strategy: Semantic Chunking with overlap (minimum 0, maximum 300)
  • Embedding model: Titan Text Embeddings v2 (default settings)
Advanced setup
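The advanced setup maps to the same API with a different ingestion configuration; here is a hedged sketch of what that might look like. The `bufferSize` and `maxTokens` values mirror the settings above, while `breakpointPercentileThreshold=95` is an assumed threshold we did not tune in this experiment; confirm exact field names against the current `bedrock-agent` API docs:

```python
def advanced_ingestion_config(max_tokens=300, buffer_size=0, breakpoint_pct=95):
    # Semantic chunking groups sentences by embedding similarity instead of
    # raw token count, so logically connected content stays in one chunk.
    # Bedrock Data Automation parsing preserves headers, lists, and tables.
    return {
        "parsingConfiguration": {
            "parsingStrategy": "BEDROCK_DATA_AUTOMATION",
        },
        "chunkingConfiguration": {
            "chunkingStrategy": "SEMANTIC",
            "semanticChunkingConfiguration": {
                "maxTokens": max_tokens,
                "bufferSize": buffer_size,
                "breakpointPercentileThreshold": breakpoint_pct,  # assumed default
            },
        },
    }
```

Passing this dict as `vectorIngestionConfiguration` when creating the data source is the only change relative to Knowledge Base A; the embedding model stays the same, which isolates parsing and chunking as the variables under test.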

Questions and Findings

We tested both Knowledge Bases, Knowledge Base A (Bedrock Default Parser / Default Chunking) and Knowledge Base B (Bedrock Data Automation / Semantic Chunking), by asking the same questions. Below are a few samples:
Question: "what is Leased Square Footage used by amazon"
  • Knowledge Base A: Based on the retrieved results, the total leased square footage used by Amazon is 609,908.
  • Knowledge Base B: The total leased square footage used by Amazon is 609,908 thousand square feet, as per the retrieved results. This is the sum of leased square footage across North America (424,145 thousand square feet), International (165,329 thousand square feet), and AWS (20,434 thousand square feet) segments.

Question: "What new AWS chips were introduced in 2023 and how do they compare to previous versions?"
  • Knowledge Base A: In 2023, AWS introduced two new chips: Graviton4 and Trainium2 - **Graviton4** offers up to 3 - **Trainium2** delivers up to four times faster machine learning training for generative AI applications and three times more memory capacity than Trainium1.
  • Knowledge Base B: In 2023, AWS introduced two new chips: Graviton4 and Trainium2 - **Graviton4**: This chip offers up to 3 - **Trainium2**: This chip is designed for machine learning training, specifically for generative AI applications. It provides up to four times faster training and three times more memory capacity than the previous version, Trainium1.

Question: "What were the key outcomes of Project Kuiper's satellite launches?"
  • Knowledge Base A: The key outcomes of Project Kuiper's satellite launches were the successful validation of all key systems and sub-systems, which is rare for an initial launch. This milestone indicates significant progress towards the project's goal of providing broadband connectivity to underserved areas.
  • Knowledge Base B: The key outcome of Project Kuiper's satellite launches was the successful validation of all key systems and sub-systems of the two end-to-end prototype satellites. This is considered rare for an initial launch of this nature. The project aims to provide broadband connectivity to underserved areas and is expected to be a significant revenue opportunity for Amazon. The first production satellites are planned to be launched in 2024.
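We issued each question against each Knowledge Base the same way. A sketch of that query path using the `bedrock-agent-runtime` `RetrieveAndGenerate` API follows; the Knowledge Base ID and model ARN are placeholders for the identifiers created in the steps above:

```python
def build_rag_request(kb_id, question, model_arn):
    # Request payload for retrieve_and_generate: fetch the best-matching
    # chunks from the Knowledge Base, then use them as context for the model.
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,       # placeholder ID
                "modelArn": model_arn,          # placeholder generation model ARN
            },
        },
    }

def ask(kb_id, question, model_arn):
    # Local import keeps the payload builder usable without the AWS SDK.
    import boto3
    runtime = boto3.client("bedrock-agent-runtime")
    resp = runtime.retrieve_and_generate(
        **build_rag_request(kb_id, question, model_arn)
    )
    return resp["output"]["text"]
```

Running `ask()` once per Knowledge Base with an identical question is what produced the side-by-side answers above, keeping the retrieval configuration as the only variable.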

Final Thoughts

Amazon Bedrock's default settings are a good starting point for prototyping. However, if you're aiming for production-grade performance in RAG applications, you can benefit from tuning these settings in your Knowledge Base. We focused this experiment on Parsing and Chunking, but there are additional configurations worth exploring as well.
Key Takeaways for Implementation:
  • Start with structure-aware parsing for maintaining document integrity
  • Implement semantic chunking with appropriate overlap (10-15% recommended)
  • Consider your specific use case when configuring chunk sizes
  • Monitor and adjust settings based on response quality
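For the overlap recommendation above, the arithmetic is simple but worth making concrete (the 10-15% range is the guideline stated here, applied to an example 300-token chunk size):

```python
def overlap_tokens(chunk_size, overlap_fraction):
    # Number of tokens two adjacent chunks should share, given the
    # recommended overlap as a fraction of the chunk size.
    return round(chunk_size * overlap_fraction)

# For 300-token chunks, a 10-15% overlap means 30-45 shared tokens,
# enough for a sentence or two of continuity at each boundary.
```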
The journey to optimal RAG performance is iterative. While our experiments demonstrate the significant impact of parsing and chunking optimizations, each implementation will require its own fine-tuning based on unique requirements. Remember that the quality of your retrieved context directly influences the accuracy of your generated responses. Keep experimenting, measuring, and refining—the results are worth the effort.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
