Building an Enterprise Knowledge RAG Platform with LlamaIndex and Amazon Bedrock


Published Apr 6, 2025

Introduction

Retrieving accurate information from large enterprise document repositories has long been a challenge for organizations. I recently architected a secure enterprise Retrieval-Augmented Generation (RAG) platform that handles over 200,000 proprietary documents using LlamaIndex as the framework and Amazon Bedrock for LLM integration. This blog post provides a technical deep dive into our implementation, complete with code samples, architecture decisions, and important limitations.

System Architecture

The Enterprise Knowledge RAG Platform consists of five core components:
  1. Document Processing Pipeline - Ingests, chunks, and processes documents
  2. Vector Database - Stores and indexes embeddings
  3. Retrieval Engine - Implements hybrid retrieval strategies
  4. LLM Integration - Generates responses using Amazon Bedrock models
  5. Evaluation Framework - Measures performance and identifies issues
Here's the high-level architecture diagram:
[Figure: Enterprise RAG Architecture]

1. Document Processing Pipeline

The document processing pipeline is responsible for ingesting, chunking, and embedding documents. LlamaIndex makes this easy with its flexible document loading and processing capabilities, while leveraging Amazon Bedrock for embedding generation:

The HybridChunker Class

The HybridChunker class is the heart of the document processing pipeline. It implements a chunking strategy that improves on simple fixed-length chunking by respecting natural semantic boundaries in the text, which leads to higher-quality retrieval later on; a minimal sketch follows the component list below.

Key Components:

  1. TransformComponent Inheritance:
    • By inheriting from TransformComponent, this class fits into LlamaIndex's transformation pipeline architecture
    • This allows it to be chained with other processors like extractors
  2. Constructor Configuration:
    • chunk_size (512 by default): Maximum size of each chunk
    • chunk_overlap (50 by default): Amount of overlap between adjacent chunks, helping maintain context
    • paragraph_separator and sentence_terminators: Define how to recognize natural text boundaries
  3. Core Processing Logic (__call__ method):
    • It uses a two-pass approach:
      • First pass: Uses LlamaIndex's built-in SentenceSplitter for initial chunking
      • Second pass: Refines chunk boundaries based on semantic elements
  4. Boundary Detection Functions:
    • _ends_with_paragraph_break: Checks if a chunk already ends at a paragraph
    • _starts_with_section_header: Uses regex and formatting heuristics to identify section headers
  5. Refinement Process: The algorithm specifically:
    • Preserves existing good boundaries (paragraph breaks, section headers)
    • When boundaries aren't ideal, it searches for better split points in the overlap region
    • Prioritizes paragraph breaks, then sentence breaks
    • Falls back to the original chunks if no natural boundaries are found
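
To make this concrete, here is a minimal sketch of such a two-pass chunker built on LlamaIndex's TransformComponent interface. The field names and boundary heuristics are illustrative assumptions, not the exact production implementation:

```python
# Minimal sketch of the two-pass hybrid chunker described above.
# Boundary heuristics and defaults are illustrative, not production code.
import re
from typing import Any, List

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import BaseNode, TransformComponent


class HybridChunker(TransformComponent):
    chunk_size: int = 512
    chunk_overlap: int = 50
    paragraph_separator: str = "\n\n"

    def __call__(self, nodes: List[BaseNode], **kwargs: Any) -> List[BaseNode]:
        # First pass: sentence-aware, fixed-size splitting via LlamaIndex.
        splitter = SentenceSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            paragraph_separator=self.paragraph_separator,
        )
        chunks = splitter(nodes)

        # Second pass: if a chunk does not already end on a natural boundary,
        # look for a paragraph or sentence break inside the trailing overlap
        # region (safe to trim there, since the overlap is repeated in the
        # next chunk). Otherwise fall back to the original chunk.
        refined: List[BaseNode] = []
        for node in chunks:
            text = node.get_content()
            if not (self._ends_with_paragraph_break(text)
                    or self._starts_with_section_header(text)):
                overlap_start = max(0, len(text) - self.chunk_overlap)
                cut = max(
                    text.rfind(self.paragraph_separator, overlap_start),
                    text.rfind(". ", overlap_start),
                )
                if cut != -1:
                    node.set_content(text[: cut + 1])
            refined.append(node)
        return refined

    def _ends_with_paragraph_break(self, text: str) -> bool:
        return text.rstrip(" \t").endswith("\n")

    def _starts_with_section_header(self, text: str) -> bool:
        # Heuristic: a short, numbered or title-cased first line.
        first_line = text.lstrip().splitlines()[0] if text.strip() else ""
        return bool(re.match(r"^(\d+(\.\d+)*\s+)?[A-Z][^.!?]{0,80}$", first_line))
```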

The EnterpriseDocumentProcessor Class

This class orchestrates the entire document processing workflow, connecting multiple components (a condensed sketch follows the list below):
  1. Amazon Bedrock Integration:
    • Sets up the Bedrock client for embedding generation
    • Uses Amazon's Titan embedding model by default
  2. Pipeline Configuration:
    • Creates a HybridChunker with the specified parameters
    • Adds TitleExtractor to automatically extract document titles
    • Adds KeywordExtractor to identify important keywords
  3. Vector Store Initialization:
    • Supports multiple vector databases (Weaviate, OpenSearch)
    • Provides appropriate connection logic for each
    • Includes a fallback SimpleVectorStore for testing
  4. Document Processing Workflow (process_documents method):
    • Loads documents using LlamaIndex's SimpleDirectoryReader
    • Enriches documents with metadata (like department, confidentiality level)
    • Runs documents through the ingestion pipeline
    • Creates and configures a vector index
    • Persists the index for later use
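
The workflow can be wired together roughly as follows. This condensed sketch reuses the HybridChunker above together with LlamaIndex's ingestion pipeline and Bedrock integrations; the model IDs, directory paths, and metadata keys are placeholders, not values from the production system:

```python
# Illustrative wiring of the document-processing workflow (a sketch).
# Model IDs, paths, and metadata keys are assumptions.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.extractors import KeywordExtractor, TitleExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.embeddings.bedrock import BedrockEmbedding
from llama_index.llms.bedrock import Bedrock


def process_documents(input_dir: str, persist_dir: str = "./storage"):
    embed_model = BedrockEmbedding(
        model_name="amazon.titan-embed-text-v2:0",  # assumed Titan embedding ID
        region_name="us-east-1",
    )
    llm = Bedrock(model="anthropic.claude-3-sonnet-20240229-v1:0",
                  region_name="us-east-1")

    # Chunking plus metadata extractors, chained as pipeline transformations.
    pipeline = IngestionPipeline(
        transformations=[
            HybridChunker(chunk_size=512, chunk_overlap=50),
            TitleExtractor(llm=llm),
            KeywordExtractor(llm=llm, keywords=5),
        ]
    )

    # Load documents and enrich them with organizational metadata.
    documents = SimpleDirectoryReader(input_dir).load_data()
    for doc in documents:
        doc.metadata.setdefault("department", "unknown")
        doc.metadata.setdefault("confidentiality", "internal")

    nodes = pipeline.run(documents=documents)

    # Build the vector index (SimpleVectorStore by default; swap in Weaviate
    # or OpenSearch via a StorageContext in production) and persist it.
    index = VectorStoreIndex(nodes, embed_model=embed_model)
    index.storage_context.persist(persist_dir=persist_dir)
    return index
```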

Why This Approach Is Innovative

  1. Semantic Awareness: Unlike basic chunkers that split text at fixed intervals, this hybrid approach respects the document's inherent structure, leading to more coherent chunks.
  2. Improved Retrieval Relevance: By aligning chunk boundaries with natural text divisions, retrieval becomes more accurate since the chunks contain complete thoughts or concepts.
  3. Flexible Vector Store Support: The system can work with different vector databases, making it adaptable to various enterprise environments.
  4. Metadata Enrichment: Documents are tagged with organizational metadata, enabling filtered searches based on department, confidentiality, etc.
  5. Persistence: The index is saved to disk, allowing for incremental updates and preventing the need to reprocess documents.
The 40% improvement in answer relevance we observed is likely due in large part to this chunking strategy, as it ensures that the retrieval engine has access to coherent, contextually complete pieces of information rather than arbitrarily split text.

2. Retrieval Engine

The retrieval engine implements hybrid search strategies to maximize relevance. LlamaIndex makes it easy to combine multiple retrieval methods:

The EnterpriseRetrievalEngine Class

This class implements an advanced retrieval mechanism that combines multiple retrieval strategies to significantly improve the relevance of results; a simplified sketch follows the component list below.

Key Components:

  1. Hybrid Retrieval Configuration:
    • The class supports both vector similarity search and keyword-based (BM25) search
    • It can combine these approaches using a fusion method for better results
    • Parameters like use_hybrid and use_reranking control which techniques are active
  2. Amazon Bedrock Integration:
    • Uses the same Bedrock embedding model as the document processor
    • Ensures consistency between indexing and searching
  3. Index Loading:
    • Loads a previously created index from disk storage
    • Uses the storage context to reconnect to the vector database
    • Maintains the same embedding model for query encoding
  4. Retriever Initialization:
    • VectorIndexRetriever: Uses vector similarity to find semantically similar content
    • BM25Retriever: Uses classic information retrieval techniques based on keyword matching
    • QueryFusionRetriever: Combines both methods for improved results
  5. Retrieval Method (retrieve):
    • Takes a query string and optional metadata filters
    • Applies metadata filtering if provided (e.g., limiting to specific departments)
    • Performs the actual retrieval operation
    • Applies re-ranking if enabled
  6. Metadata Filtering:
    • Converts simple filter specifications into LlamaIndex's filter objects
    • Supports both exact matches and "OR" conditions for list values
    • Filters are applied before retrieval to narrow the search space
  7. Formatting Method (retrieve_and_format):
    • Retrieves documents and formats them into a context string
    • Includes source information from metadata
    • Returns both the formatted context and the raw nodes
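
A simplified sketch of this setup is shown below. It assumes the llama-index-retrievers-bm25 package is installed and uses a "department" metadata key as an example filter; the production class wraps the same pieces with additional configuration:

```python
# Sketch of the hybrid retrieval setup (illustrative, not the full class).
# The "department" filter key is an example metadata field.
from typing import List, Optional

from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core.retrievers import QueryFusionRetriever, VectorIndexRetriever
from llama_index.core.vector_stores import FilterCondition, MetadataFilter, MetadataFilters
from llama_index.embeddings.bedrock import BedrockEmbedding
from llama_index.retrievers.bm25 import BM25Retriever


def load_enterprise_index(persist_dir: str = "./storage"):
    # Reload the persisted index with the same Bedrock embedding model used
    # at indexing time, so query and document vectors stay comparable.
    embed_model = BedrockEmbedding(model_name="amazon.titan-embed-text-v2:0")
    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    return load_index_from_storage(storage_context, embed_model=embed_model)


def build_hybrid_retriever(index, top_k: int = 5,
                           departments: Optional[List[str]] = None):
    # Optional metadata filter ("OR" across departments), applied to the
    # vector search before retrieval to narrow the search space.
    filters = None
    if departments:
        filters = MetadataFilters(
            filters=[MetadataFilter(key="department", value=d) for d in departments],
            condition=FilterCondition.OR,
        )

    vector_retriever = VectorIndexRetriever(
        index=index, similarity_top_k=top_k, filters=filters
    )
    bm25_retriever = BM25Retriever.from_defaults(
        docstore=index.docstore, similarity_top_k=top_k
    )

    # Reciprocal-rank fusion of the semantic and keyword result lists.
    return QueryFusionRetriever(
        [vector_retriever, bm25_retriever],
        similarity_top_k=top_k,
        num_queries=1,        # fuse the user query only; no LLM query expansion
        mode="reciprocal_rerank",
        use_async=False,
    )
```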

Hybrid Retrieval in Detail

The most innovative aspect of this code is the hybrid retrieval approach, which combines:
  1. Vector Search:
    • Pros: Understands semantic meaning, can find relevant results even if keywords don't match
    • Cons: Might miss exact term matches, sensitive to embedding quality
  2. BM25 Search:
    • Pros: Excellent at exact keyword matching, well-established algorithm
    • Cons: Doesn't understand semantic relationships or synonyms
  3. Fusion Algorithm: This uses the "reciprocal rank fusion" algorithm, which works by:
    • Running both retrieval methods independently
    • Assigning a score to each result based on its rank in each method
    • Combining the scores using a formula that favors documents ranked highly by both methods
    • Re-sorting the combined results
  4. Optional Re-Ranking: The system can apply an additional re-ranking step using Cohere's re-ranker, which further refines results by using a specialized model to assess query-document relevance.
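
When re-ranking is enabled, this optional step can be applied as a node post-processor along these lines (a sketch; it assumes the llama-index Cohere rerank integration and a COHERE_API_KEY environment variable):

```python
# Optional re-ranking sketch. Assumes llama-index-postprocessor-cohere-rerank
# is installed and COHERE_API_KEY is set in the environment.
import os

from llama_index.postprocessor.cohere_rerank import CohereRerank

reranker = CohereRerank(api_key=os.environ["COHERE_API_KEY"], top_n=5)


def rerank(query: str, nodes):
    # Re-scores the fused candidates with a specialized relevance model
    # and keeps the top_n best matches for the query.
    return reranker.postprocess_nodes(nodes, query_str=query)
```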

Example Workflow

When you call retrieve_and_format with a query like "What is our policy on customer refunds for digital products?", the following happens (a usage sketch follows this list):
  1. The query text is encoded into an embedding using Amazon Bedrock
  2. The vector search finds documents semantically related to refunds and policies
  3. The BM25 search finds documents containing the keywords "refund", "policy", "digital products"
  4. The fusion algorithm combines these results, prioritizing documents that both contain the keywords and are semantically relevant
  5. If enabled, the re-ranker makes final adjustments to the ordering
  6. Results are filtered to only include documents from Legal and Finance departments with "Policy" document type
  7. The results are formatted into a context string that can be passed to an LLM
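
A hypothetical end-to-end call for this workflow, using the helpers sketched earlier (the filter values and metadata keys are examples, not the production schema):

```python
# Hypothetical usage matching the workflow above.
index = load_enterprise_index("./storage")
retriever = build_hybrid_retriever(index, top_k=5, departments=["Legal", "Finance"])
nodes = retriever.retrieve("What is our policy on customer refunds for digital products?")

# Roughly what retrieve_and_format produces: a context string with source
# attribution, returned alongside the raw nodes.
context = "\n\n".join(
    f"[Source: {n.node.metadata.get('file_name', 'unknown')}]\n{n.node.get_content()}"
    for n in nodes
)
```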

Why This Approach Is Powerful

  1. Precision and Recall Balance:
    • Vector search provides high recall (finding semantically relevant documents)
    • Keyword search provides high precision (finding exact matches)
    • Combining them gives the best of both worlds
  2. Metadata Filtering:
    • Enterprise documents often have rich metadata (department, document type, etc.)
    • The filtering system lets users narrow searches to relevant document subsets
    • This dramatically improves result relevance in large document collections
  3. Flexible Configuration:
    • The system can be tuned for different use cases:
      • Speed-critical applications might disable reranking
      • Quality-critical applications might use all features
      • Different retrieval parameters can be adjusted based on document characteristics
The 40% improvement in answer relevance we observed likely comes from this sophisticated retrieval strategy, especially the hybrid approach and reranking, which ensure that the most relevant documents are provided to the LLM for generating responses.

3. LLM Integration with Amazon Bedrock

The LLM service integrates with Amazon Bedrock using LlamaIndex's built-in Bedrock integration:

The LLMService Class

This class connects the retrieval components with the large language model (LLM) to generate coherent answers from the retrieved context. It's the "generation" part of Retrieval-Augmented Generation.

Key Components:

  1. Amazon Bedrock Integration:
    • Uses the Bedrock client to connect to Amazon's hosted LLMs
    • Default model is Claude 3 Sonnet, but can be configured with other Bedrock models
    • Configurable parameters include token limit and temperature
  2. LlamaIndex LLM Wrapper:
    • Utilizes LlamaIndex's Bedrock wrapper for standardized interaction
    • This wrapper handles the specifics of calling Amazon Bedrock's API
  3. Response Synthesizer:
    • Uses LlamaIndex's CompactAndRefine synthesizer
    • This is a specialized component that can handle large contexts by:
      • Breaking up large context into smaller chunks
      • Generating responses for each chunk
      • Refining answers across chunks for coherence
      • Handling contexts that exceed the model's token limit
  4. Dual Response Generation Paths: There are two ways to generate responses:
    • Using the advanced synthesizer when node objects are available (preferred)
    • Using direct LLM completion when only text context is available (fallback)
  5. Prompt Template: The prompt template:
    • Defines the AI's role as an enterprise knowledge assistant
    • Instructs it to use only the provided context
    • Includes explicit instructions not to hallucinate information
    • Has a clear structure for context and question
  6. Citation Extraction: This method:
    • Creates citation references for the sources used in the response
    • Extracts metadata like source document names
    • Ensures each source is only listed once
    • Returns a structured list of citations
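
A trimmed-down sketch of the generation service is shown below. The model ID, prompt wording, and metadata keys are assumptions, not the exact production implementation:

```python
# Sketch of the generation side (not the exact LLMService implementation).
from typing import List, Optional

from llama_index.core import PromptTemplate
from llama_index.core.response_synthesizers import CompactAndRefine
from llama_index.core.schema import NodeWithScore
from llama_index.llms.bedrock import Bedrock

QA_PROMPT = PromptTemplate(
    "You are an enterprise knowledge assistant. Answer using ONLY the context "
    "below. If the answer is not in the context, say you don't know.\n\n"
    "Context:\n{context_str}\n\nQuestion: {query_str}\n\nAnswer:"
)


class LLMService:
    def __init__(self, model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0"):
        self.llm = Bedrock(model=model_id, temperature=0.0, max_tokens=1024)
        # CompactAndRefine packs as much context as fits per call, then
        # refines the answer across the remaining chunks.
        self.synthesizer = CompactAndRefine(llm=self.llm, text_qa_template=QA_PROMPT)

    def generate(self, query: str, nodes: Optional[List[NodeWithScore]] = None,
                 context: str = ""):
        if nodes:  # preferred path: synthesizer over retrieved nodes
            response = self.synthesizer.synthesize(query, nodes=nodes)
            return str(response), self._extract_citations(nodes)
        # fallback path: plain completion over a preformatted context string
        prompt = QA_PROMPT.format(context_str=context, query_str=query)
        return str(self.llm.complete(prompt)), []

    @staticmethod
    def _extract_citations(nodes: List[NodeWithScore]) -> List[str]:
        # De-duplicated list of source document names from node metadata.
        seen: List[str] = []
        for n in nodes:
            source = n.node.metadata.get("file_name", "unknown source")
            if source not in seen:
                seen.append(source)
        return seen
```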

How This Component Reduces Hallucinations

The significant hallucination reduction we observed (from 12% to under 3%) likely comes from several aspects of this design:
  1. Clear Instructions: The prompt explicitly tells the model to only use provided information and to admit when it doesn't know.
  2. Context Refinement: The CompactAndRefine synthesizer helps manage large contexts more effectively than simple prompt injection.
  3. Zero Temperature Setting: Setting temperature to 0.0 by default makes the model more deterministic and less likely to generate creative but potentially false information.
  4. Citation Tracking: By tracking which sources contribute to the answer, there's an implicit verification mechanism.

How It Works in Practice

When you call generate_llm_response with a query like "What is our policy on customer refunds for digital products?":
  1. The LLMService initializes with the Claude 3 Sonnet model
  2. The query and retrieved context are formatted into a prompt
  3. If node objects are provided, the advanced synthesizer handles generating the response:
    • Breaking the context into multiple chunks if it is too large
    • Generating partial responses for each chunk
    • Refining these into a coherent final answer
  4. Otherwise, it sends the prompt directly to the model
  5. Citations are extracted from the nodes that were used
  6. The response and citations are returned together
This component is critical in ensuring that the generated responses are:
  • Accurate (based only on retrieved information)
  • Relevant (addressing the specific query)
  • Helpful (formatted in a clear, readable way)
  • Trustworthy (including citations to verify information)
The integration with LlamaIndex's synthesizer is particularly valuable for enterprise use cases with large document sets, as it allows handling more context than would fit in a single prompt, enabling more comprehensive answers from broader context.

4. Evaluation Framework

The custom evaluation framework leverages LlamaIndex's evaluation tools and Amazon Bedrock to measure hallucination rates and other performance metrics:

The EnterpriseEvaluationFramework Class

This class implements a multi-faceted evaluation system that uses both LLM-based evaluation techniques and embedding-based verification approaches.

Key Components:

  1. Evaluation Initialization:
    • Uses Amazon Bedrock for both the LLM evaluator (Claude 3 Sonnet) and embedding model (Titan)
    • Leverages LlamaIndex's built-in evaluators for standard metrics
    • Sets up DynamoDB for persistent storage of evaluation results
  2. Multiple Evaluation Metrics:
    • FaithfulnessEvaluator: Measures if the response contains only information supported by the context
    • RelevancyEvaluator: Assesses how well the response addresses the query
    • Custom hallucination detection: Analyzes response at the sentence level
  3. Main Evaluation Method (evaluate_response): This orchestrates multiple evaluation approaches and compiles them into a comprehensive assessment.
  4. Hallucination Detection (_detect_hallucinations): This method:
    • Breaks the response into individual sentences
    • Checks each sentence against the context
    • Identifies which specific sentences appear to be hallucinations
    • Calculates an overall hallucination rate
  5. Multi-Level Verification (_check_sentence_support): This is the most sophisticated part of the framework, using a two-stage verification process:
    • Stage 1: Embedding Similarity - Calculates the semantic similarity between the sentence and each chunk of context.
    • Stage 2: LLM Verification for High-Risk Sentences - For sentences containing specific claims (numbers, named entities, etc.), the system performs an additional verification step using the LLM to check whether the claim is actually supported by the context.

  6. High-Risk Sentence Detection (_contains_specific_claims): This method identifies sentences that are more likely to contain hallucinations, specifically those with:
    • Numerical values (which are easy to fabricate)
    • Named entities (people, organizations, etc.)
    • Claim indicators (phrases that suggest factual statements)
  7. Persistent Storage (_store_evaluation): This stores evaluation results in DynamoDB, creating a historical record for:
    • Tracking system performance over time
    • Identifying problematic query patterns
    • Supporting continuous improvement
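
To make the two-stage verification concrete, here is a compressed sketch of the sentence-level check. Thresholds, regexes, and model IDs are illustrative assumptions; the full framework also wires in LlamaIndex's FaithfulnessEvaluator and RelevancyEvaluator and writes results to DynamoDB:

```python
# Compressed sketch of the sentence-level hallucination check described above.
# Thresholds, regexes, and model names are illustrative assumptions.
import re
from typing import List

import numpy as np
from llama_index.embeddings.bedrock import BedrockEmbedding
from llama_index.llms.bedrock import Bedrock

embed_model = BedrockEmbedding(model_name="amazon.titan-embed-text-v2:0")
llm = Bedrock(model="anthropic.claude-3-sonnet-20240229-v1:0", temperature=0.0)


def _cosine(a: List[float], b: List[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def _contains_specific_claims(sentence: str) -> bool:
    # High-risk sentences: numbers, capitalized entities, or claim indicators.
    return bool(
        re.search(r"\d", sentence)
        or re.search(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", sentence)
        or re.search(r"\b(according to|states that|reported)\b", sentence, re.I)
    )


def _check_sentence_support(sentence: str, contexts: List[str],
                            threshold: float = 0.78) -> bool:
    # Stage 1: embedding similarity against each context chunk.
    sent_emb = embed_model.get_text_embedding(sentence)
    best = max(_cosine(sent_emb, embed_model.get_text_embedding(c)) for c in contexts)
    if best < threshold:
        return False
    # Stage 2: LLM verification, only for high-risk sentences.
    if _contains_specific_claims(sentence):
        verdict = llm.complete(
            "Context:\n" + "\n".join(contexts)
            + f"\n\nIs the following statement fully supported by the context? "
            f"Answer YES or NO.\nStatement: {sentence}"
        )
        return "YES" in str(verdict).upper()
    return True


def detect_hallucinations(response: str, contexts: List[str]) -> float:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    unsupported = [s for s in sentences if not _check_sentence_support(s, contexts)]
    # Hallucination rate = fraction of unsupported sentences; results can be
    # persisted (e.g. DynamoDB put_item) for longitudinal tracking.
    return len(unsupported) / max(len(sentences), 1)
```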

Why This Approach Dramatically Reduces Hallucinations

The reduction in hallucination rate from 12% to under 3% likely comes from several innovative aspects of this framework:
  1. Sentence-Level Analysis: By evaluating at the sentence level rather than the entire response, the system can pinpoint specific hallucinations with greater precision.
  2. Two-Stage Verification: Using both embedding similarity and LLM verification creates a more robust detection system that catches different types of hallucinations.
  3. Special Handling for High-Risk Content: The framework applies stricter verification to sentences containing specific claims, numbers, or named entities, which are more likely to be hallucinated.
  4. Continuous Monitoring and Improvement: By storing all evaluation results, the system supports ongoing analysis and refinement of the RAG components.

Real-World Application

In practice, this evaluation framework serves several critical functions:
  1. Real-time Quality Control: It can be used in production to flag potentially hallucinated content before it reaches users.
  2. System Tuning: The detailed metrics help identify which components need improvement (retrieval, prompt design, etc.).
  3. Confidence Scoring: The evaluation results can be used to provide confidence levels with responses.
  4. Feedback Loop: Identified hallucinations can be fed back into the system to improve retrieval or fine-tune models.
This sophisticated evaluation framework is a key differentiator for enterprise RAG systems where accuracy and reliability are paramount. The granular, multi-faceted approach to hallucination detection helps ensure that responses are trustworthy, which is essential for business-critical applications.

5. Fine-Tuning Pipeline

The fine-tuning pipeline adapts foundation models to domain-specific terminology with 30% less training data:

The FineTuningPipeline Class

This class handles the entire workflow for creating custom-tuned models that better understand your organization's specific language and documents.

Key Components:

  1. Initialization and Setup:
    • Configures Amazon Bedrock clients (both runtime for inference and management for creating jobs)
    • Loads the existing vector index to access document content
    • Sets up S3 storage for training data and model artifacts
  2. Training Data Generation (generate_training_data):
    • Creates synthetic question-answer pairs for fine-tuning
    • The key innovation is generating these automatically rather than requiring manual labeling
    • Returns the S3 path where training data is stored
  3. Domain Terminology Extraction (_extract_domain_terminology): This method uses the LLM itself to identify industry-specific terminology from your documents, creating a foundation for generating relevant training examples.
  4. Synthetic Query Generation (_generate_synthetic_query): This creates realistic questions using your organization's terminology, generating diverse training examples that cover various query patterns.
  5. Data Augmentation (_augment_training_data): This is the key innovation that reduces training data needs by 30%. It:
    • Creates variations of the same examples with rephrased queries
    • Generates alternative response formats for the same content
    • Effectively multiplies your training data without requiring additional documents
  6. Fine-Tuning Job Creation (create_finetuning_job): This method launches the actual fine-tuning job in Amazon Bedrock, configuring all necessary parameters and pointing to the generated training data.
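
A sketch of the data-upload and job-creation steps via boto3 is shown below. The bucket, role ARN, base model ID, and hyperparameter values are placeholders, not values from the production system:

```python
# Sketch of the fine-tuning data upload and Bedrock job creation steps.
# Bucket, role ARN, model IDs, and hyperparameters are placeholders.
import json

import boto3

s3 = boto3.client("s3")
bedrock = boto3.client("bedrock")  # management plane: model customization jobs

BUCKET = "my-finetuning-bucket"                                   # placeholder
ROLE_ARN = "arn:aws:iam::123456789012:role/BedrockFineTuneRole"   # placeholder


def generate_training_data(synthetic_pairs, key: str = "train/data.jsonl") -> str:
    # synthetic_pairs: list of {"prompt": ..., "completion": ...} dicts produced
    # by the terminology-extraction, query-generation, and augmentation steps.
    body = "\n".join(json.dumps(p) for p in synthetic_pairs)
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
    return f"s3://{BUCKET}/{key}"


def create_finetuning_job(training_s3_uri: str, job_name: str = "enterprise-rag-ft-001"):
    # Launches a Bedrock model-customization (fine-tuning) job.
    return bedrock.create_model_customization_job(
        jobName=job_name,
        customModelName="enterprise-rag-custom-model",
        roleArn=ROLE_ARN,
        baseModelIdentifier="amazon.titan-text-express-v1",  # placeholder base model
        trainingDataConfig={"s3Uri": training_s3_uri},
        outputDataConfig={"s3Uri": f"s3://{BUCKET}/output/"},
        hyperParameters={"epochCount": "2", "batchSize": "1", "learningRate": "0.00001"},
    )
```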

How This System Enables Efficient Fine-Tuning

The 30% reduction in training data needs we achieved comes from several techniques:
  1. Automated Terminology Extraction: Instead of requiring manual identification of important domain terms, the system uses the LLM itself to analyze document content and extract key terminology.
  2. Template-Based Query Generation: By using templates filled with domain-specific terminology, the system creates realistic queries that match how users actually ask questions, covering a wide variety of query patterns.
  3. Two-Pronged Data Augmentation:
    • Query Reformulation: The same question is asked in different ways while preserving the meaning
    • Response Formatting Variation: Multiple presentation styles for the same content (narrative vs. structured)
  4. Leveraging Existing RAG Components: The system uses the already-configured RAG pipeline to generate answers to synthetic questions, creating training pairs without human annotation.

Real-World Benefits

In a practical enterprise setting, this fine-tuning pipeline delivers several key advantages:
  1. Reduced Manual Effort:
    • No need to manually create hundreds of training examples
    • Automated extraction of domain terminology
    • Self-generating training data
  2. Domain Adaptation:
    • Models learn your organization's specific terminology
    • Better understanding of company-specific concepts
    • Improved handling of acronyms and jargon
  3. Response Style Consistency:
    • Train models to match your organization's communication style
    • Support for both narrative and structured response formats
    • Consistent handling of company policies and procedures
  4. Continuous Improvement:
    • As documents are added to the system, new training data can be generated
    • Models can be periodically re-fine-tuned on expanded data
This component is particularly valuable for specialized industries with unique terminology (like healthcare, legal, finance, etc.) where general-purpose LLMs might struggle with domain-specific concepts. The automatic generation and augmentation of training data dramatically reduces the cost and effort typically associated with model fine-tuning.
The practical result is a system that understands your organization's language and can provide more accurate, relevant responses without requiring extensive manual annotation work.

Putting It All Together: The Enterprise RAG API

Here's how we integrate all components into a complete RAG API using FastAPI:
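
The listing below is a condensed sketch of the query endpoint, reusing the retrieval and generation helpers sketched in the earlier sections; the endpoint shape and request fields are illustrative:

```python
# Condensed sketch of the RAG API (endpoint shape and field names are
# illustrative; it reuses the retrieval and generation sketches above).
from typing import List, Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Enterprise Knowledge RAG API")

# Built once at startup: persisted index and LLM service.
index = load_enterprise_index("./storage")
llm_service = LLMService()


class QueryRequest(BaseModel):
    query: str
    departments: Optional[List[str]] = None
    top_k: int = 5


class QueryResponse(BaseModel):
    answer: str
    citations: List[str]


@app.post("/query", response_model=QueryResponse)
def query(request: QueryRequest) -> QueryResponse:
    # Hybrid retrieval with optional department filters, then generation.
    retriever = build_hybrid_retriever(
        index, top_k=request.top_k, departments=request.departments
    )
    nodes = retriever.retrieve(request.query)
    answer, citations = llm_service.generate(request.query, nodes=nodes)
    return QueryResponse(answer=answer, citations=citations)
```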

Limitations and Challenges

While our Enterprise Knowledge RAG Platform achieved significant improvements in answer relevance and reduced hallucinations, there are several important limitations to consider:

1. Scalability Challenges

  • Vector Database Scaling: As the document corpus grows beyond millions of documents, vector database performance can degrade without careful sharding and indexing strategies.
  • Embedding Computation: Generating embeddings for large document repositories is computationally expensive and time-consuming, requiring batch processing and efficient resource allocation.
  • Real-Time Updates: Incorporating new documents in real-time while maintaining index consistency presents challenges, especially with distributed systems.

2. Retrieval Limitations

  • Long Document Handling: The hybrid chunking approach still struggles with extremely long documents where critical information is scattered across the text.
  • Cross-Document Reasoning: The current architecture doesn't easily support synthesizing information across multiple documents when required for complex queries.
  • Query Ambiguity: Highly ambiguous queries may still result in suboptimal retrieval, especially when domain-specific contextual understanding is required.

3. Hallucination Detection Challenges

  • Subtle Inaccuracies: While our hallucination detection framework is effective, it can still miss subtle factual errors, especially when they appear plausible in context.
  • Computational Overhead: Comprehensive hallucination detection adds significant latency to response generation, creating a tradeoff between accuracy and performance.
  • False Positives: Overly aggressive hallucination detection can sometimes flag valid inferences as hallucinations, resulting in overly cautious responses.

4. Security and Compliance Considerations

  • Data Privacy: Embedding vectors can potentially leak sensitive information if not properly secured and anonymized.
  • Access Control: Implementing fine-grained access control at the document chunk level is complex and requires careful integration with enterprise authentication systems.
  • Audit Trails: Maintaining comprehensive audit trails for regulatory compliance adds complexity and storage requirements.

5. Technical Debt

  • Model Dependencies: The system relies on specific versions of foundation models, requiring careful change management when upgrading.
  • Pipeline Complexity: The multi-stage architecture introduces numerous failure points that must be monitored and maintained.
  • Integration Overhead: Enterprise integrations with existing document management systems, authentication services, and user interfaces add significant development overhead.

Future Improvements

Based on our experience, here are key areas for future improvement:
  1. Hierarchical Retrieval: Implement a multi-stage retrieval process that first identifies relevant document clusters before detailed chunk retrieval.
  2. Cross-Document Knowledge Graphs: Build knowledge graphs to enable reasoning across documents for complex queries.
  3. Adaptive Chunking: Develop more intelligent chunking strategies that adapt to document structure and content density.
  4. Privacy-Preserving Embeddings: Explore techniques for generating embeddings that preserve privacy while maintaining retrieval quality.
  5. Pre-Computation Optimization: Implement intelligent caching and pre-computation strategies for common query patterns.

Conclusion

Building an Enterprise Knowledge RAG Platform with LlamaIndex and Amazon Bedrock provides powerful capabilities for organizations with large document repositories. By implementing hybrid chunking, advanced retrieval techniques, custom evaluation frameworks, and fine-tuning pipelines, we've demonstrated how to significantly improve answer relevance and reduce hallucination rates.
The provided code samples offer a starting point for your implementation, but remember to adapt the architecture to your specific requirements and address the limitations outlined above. With careful planning and continuous improvement, an enterprise RAG system can transform how organizations leverage their proprietary knowledge at scale.
 
