
Fix Your Broken RAG Pipeline with Amazon Bedrock
The problem isn’t the model — it’s usually everything around it.
Published Apr 6, 2025
Retrieval-Augmented Generation (RAG) is a widely adopted strategy for enterprise LLM applications — particularly for question answering, knowledge retrieval, and customer support. But if you’ve built RAG systems and found the answers to be vague, hallucinated, or just plain wrong, you’re not alone.
In this post, I’ll walk through three common RAG pitfalls, explain how to solve each one technically, and show how to implement the fixes using Amazon Bedrock and related AWS services.
Pitfall 1: Fixed-size chunking
Why it fails:
Most pipelines chunk documents by fixed size — e.g. 500 tokens with 100-token overlap. This leads to loss of semantic boundaries, meaning each chunk might contain incomplete or unrelated thoughts.
The fix:
Use semantic chunking based on document structure. AWS doesn’t currently have a built-in semantic splitter, but you can preprocess with libraries like nltk, spaCy, or LangChain, then feed those chunks into Bedrock-compatible embeddings. Store the resulting chunks in a vector DB like Amazon OpenSearch Serverless or Pinecone, using Bedrock embeddings.
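Here’s a rough sketch of what that can look like, using nltk for sentence-aware chunking and Titan Text Embeddings v2 on Bedrock. The chunk size, file name, and model ID are placeholders; swap in whatever fits your documents.

```python
# Sentence-aware chunking with nltk, then embedding each chunk via Bedrock.
import json
import boto3
import nltk

# Newer nltk versions use "punkt_tab"; downloading both is harmless.
for pkg in ("punkt", "punkt_tab"):
    nltk.download(pkg, quiet=True)

def semantic_chunks(text: str, max_chars: int = 1500) -> list[str]:
    """Group whole sentences into chunks so no chunk cuts a thought in half."""
    chunks, current = [], ""
    for sentence in nltk.sent_tokenize(text):
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += " " + sentence
    if current.strip():
        chunks.append(current.strip())
    return chunks

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(chunk: str) -> list[float]:
    """Call a Bedrock embedding model (Titan Text Embeddings v2 shown)."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": chunk}),
    )
    return json.loads(response["body"].read())["embedding"]

document = open("handbook.txt").read()  # hypothetical source document
vectors = [(chunk, embed(chunk)) for chunk in semantic_chunks(document)]
```

From here you index each (chunk, vector) pair into OpenSearch Serverless or Pinecone, keeping the raw chunk text alongside the embedding so BM25 can work on it later.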
Pitfall 2: Relying on vector similarity alone
Why it fails:
Vector similarity alone (ANN search) isn’t always reliable. Embedding models retrieve similar surface-level content, not necessarily relevant answers.
The fix:
Use hybrid retrieval: combine dense vector similarity with sparse (keyword-based) filtering and reranking.
How to implement hybrid search on AWS:
Use Amazon OpenSearch Serverless with the k-NN plugin and BM25 scoring:
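Something like the following: a single query that combines a k-NN clause over the chunk embeddings with a BM25 `match` clause over the raw text. The collection endpoint, index name, and field names ("rag-chunks", "embedding", "text") are placeholders.

```python
# Hybrid retrieval against OpenSearch Serverless: dense k-NN + sparse BM25.
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, "us-east-1", "aoss")  # "aoss" = OpenSearch Serverless

client = OpenSearch(
    hosts=[{"host": "your-collection-id.us-east-1.aoss.amazonaws.com", "port": 443}],
    http_auth=auth,
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

def hybrid_search(query_text: str, query_vector: list[float], k: int = 5):
    body = {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    # Dense: approximate k-NN over the chunk embeddings
                    {"knn": {"embedding": {"vector": query_vector, "k": k}}},
                    # Sparse: BM25 keyword match over the chunk text
                    {"match": {"text": {"query": query_text}}},
                ]
            }
        },
    }
    return client.search(index="rag-chunks", body=body)["hits"]["hits"]
```

OpenSearch also offers a dedicated hybrid query type with score normalization via search pipelines; the bool query above is the simplest version that still gets you both signals.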
To further improve relevance, rerank the retrieved results with a local re-ranker like bge-reranker, or with Cohere Rerank via API (until Bedrock supports it natively).
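As a rough sketch, a local cross-encoder reranker looks like this with sentence-transformers; Cohere Rerank would slot into the same place via its API.

```python
# Rerank retrieved chunks with a local cross-encoder (bge-reranker).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    # Score each (query, chunk) pair and keep the highest-scoring chunks.
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```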
Pitfall 3: Unstructured prompts
Why it fails:
Most prompts simply paste in the context and ask the model to answer. Without structure, the model can hallucinate, ignore the context, or mix facts.
The fix:
Use structured prompting: define roles, expected behavior, and format.
Example Bedrock prompt using Anthropic Claude 3:
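Something along these lines. The role, rules, and output format here are illustrative (the company name is made up); adapt them to your domain.

```python
# A structured prompt: explicit role, hard rules about staying inside the
# retrieved context, and a fixed output format.
SYSTEM_PROMPT = """You are a support assistant for ACME Corp.
Answer ONLY using the information inside <context>. If the answer is not in
the context, reply exactly: "I don't know based on the provided documents."
Give a short answer, then a bulleted list of the source chunks you used."""

USER_TEMPLATE = """<context>
{context}
</context>

Question: {question}"""
```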
Then invoke the model via Bedrock:
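A minimal invocation with boto3’s bedrock-runtime client, reusing SYSTEM_PROMPT and USER_TEMPLATE from above. Claude 3 Sonnet’s model ID is shown; any Claude 3 model on Bedrock accepts the same request shape.

```python
# Invoke Claude 3 on Bedrock with the structured prompt and retrieved chunks.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def answer(question: str, chunks: list[str]) -> str:
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "system": SYSTEM_PROMPT,
        "messages": [
            {
                "role": "user",
                "content": USER_TEMPLATE.format(
                    context="\n\n".join(chunks), question=question
                ),
            }
        ],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["content"][0]["text"]
```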
Once the pipeline is running, close the loop with evaluation. Monitor these metrics:
- Precision of retrieved chunks (via human eval or auto-relevance scoring)
- Answer correctness (BLEU/ROUGE/F1 vs ground truth, or with LLM-as-a-judge techniques)
- Latency per RAG stage (embedding, retrieval, model inference)
You can integrate Amazon CloudWatch to track latency, or set up feedback loops with Amazon SageMaker Ground Truth or a custom feedback collector.
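If you go the CloudWatch route, a thin timing wrapper is enough to get per-stage latency dashboards. The namespace and dimension names below are arbitrary; use your own conventions.

```python
# Publish per-stage RAG latency as a custom CloudWatch metric.
import time
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def timed(stage: str, fn, *args, **kwargs):
    """Run one RAG stage and push its latency to CloudWatch."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    cloudwatch.put_metric_data(
        Namespace="RAGPipeline",
        MetricData=[{
            "MetricName": "StageLatency",
            "Dimensions": [{"Name": "Stage", "Value": stage}],
            "Value": elapsed_ms,
            "Unit": "Milliseconds",
        }],
    )
    return result

# e.g. hits = timed("retrieval", hybrid_search, question, query_vector)
```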
To build effective RAG pipelines on Amazon Bedrock:
- ✅ Use semantic-aware chunking
- ✅ Combine dense + sparse retrieval (hybrid)
- ✅ Add reranking
- ✅ Write structured prompts
- ✅ Track & iterate with eval loops
This isn’t “just call an LLM” — each stage of the RAG stack is a product in itself. But when you invest in the pieces, the payoff is much better answers.
The full repo can be found here.
If you’re building production-grade LLM apps on AWS and want to jam on architecture, fine-tuning, or retrieval strategies, hit me up. Happy to share more patterns and pitfalls.