Streaming a response from Bedrock Knowledge Base
Improving time to first token and user experience in Bedrock's fully managed RAG solution
gengis
Amazon Employee
Published Aug 27, 2024
Last Modified Sep 4, 2024
RAG in Bedrock is easy: the RetrieveAndGenerate command queries a Bedrock Knowledge Base to retrieve relevant document extracts, then uses those extracts to query your Bedrock LLM endpoint. This approach combines retrieval and generation in one step, simplifying the implementation of knowledge-grounded AI applications.
While this method is effective, as of August 2024 the RetrieveAndGenerate API doesn't support streaming responses. Users must therefore wait for the entire response to be generated before seeing any output. Compared to a streamed response, where text appears progressively, this feels less interactive and increases the time to first token.
To work around this limitation, you can implement an alternative approach:
1. First, query your Knowledge Base using the RetrieveCommand to get relevant document extracts.
2. Then, augment your query with these extracts.
3. Finally, use the InvokeModelWithResponseStreamCommand to query your LLM and get back a streaming response.
This strategy allows you to leverage the streaming capabilities of the InvokeModelWithResponseStreamCommand, offering a more responsive and interactive experience for your users while still utilizing the power of your Knowledge Base.
Now let's look at implementing it.
In our case, we handle everything within a Lambda function and stream the response back using Lambda response streaming.
1 - Call RetrieveCommand. The required input parameters are your Knowledge Base ID and your retrievalQuery.
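As a rough sketch of this step, the RetrieveCommand input can be built as a plain object. The helper name `buildRetrieveInput`, the placeholder Knowledge Base ID, and the default of five results are assumptions made for illustration. The commented lines show how the object would typically be sent with `@aws-sdk/client-bedrock-agent-runtime`.

```typescript
// Subset of the Retrieve API input used in this sketch.
interface RetrieveInput {
  knowledgeBaseId: string;
  retrievalQuery: { text: string };
  retrievalConfiguration: {
    vectorSearchConfiguration: { numberOfResults: number };
  };
}

// Hypothetical helper: builds the input object for RetrieveCommand.
function buildRetrieveInput(
  knowledgeBaseId: string,
  question: string,
  numberOfResults: number = 5, // assumed default, tune for your use case
): RetrieveInput {
  const text = question.trim();
  if (text.length === 0) {
    throw new Error("The retrieval query must not be empty");
  }
  return {
    knowledgeBaseId,
    retrievalQuery: { text },
    retrievalConfiguration: {
      vectorSearchConfiguration: { numberOfResults },
    },
  };
}

// In the Lambda function, the object would be sent roughly like this:
//   import { BedrockAgentRuntimeClient, RetrieveCommand }
//     from "@aws-sdk/client-bedrock-agent-runtime";
//   const client = new BedrockAgentRuntimeClient({});
//   const { retrievalResults } = await client.send(new RetrieveCommand(retrieveInput));
const retrieveInput = buildRetrieveInput(
  "KB_ID_PLACEHOLDER", // placeholder, replace with your Knowledge Base ID
  "What is our refund policy?",
);
console.log(JSON.stringify(retrieveInput, null, 2));
```

The optional `numberOfResults` setting controls how many extracts come back, which directly affects how long the final prompt will be.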
2 - Use the response to build your prompt. Note that here we simply insert the extracts directly into the prompt. You could also apply other prompt engineering techniques, such as using the relevance score returned by the Retrieve API call to rank the extracts.
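Here is a minimal sketch of what this step could look like, combining direct insertion with the relevance-score ranking mentioned above. The helper name `buildPrompt`, the prompt wording, the `<extract>` tags, the 0.4 score threshold, and the cap of three extracts are all assumptions for illustration, not values from the original implementation.

```typescript
// Subset of a Retrieve API result used in this sketch.
interface RetrievalResult {
  content?: { text?: string };
  score?: number;
}

// Hypothetical helper: ranks extracts by relevance score, drops weak or
// empty ones, and inserts the remaining extracts directly into the prompt.
function buildPrompt(
  question: string,
  results: RetrievalResult[],
  minScore: number = 0.4, // assumed threshold
  maxExtracts: number = 3, // assumed cap
): string {
  const ranked = results
    .map((r) => ({ text: r.content?.text ?? "", score: r.score ?? 0 }))
    .filter((r) => r.text.length > 0 && r.score >= minScore)
    .sort((a, b) => b.score - a.score) // highest relevance first
    .slice(0, maxExtracts);

  const extracts = ranked.map(
    (r, i) => `<extract index="${i + 1}">\n${r.text}\n</extract>`,
  );

  return [
    "Answer the question using only the extracts below.",
    "If the extracts do not contain the answer, say that you don't know.",
    "",
    extracts.join("\n"),
    "",
    `Question: ${question}`,
  ].join("\n");
}

// Example with made-up retrieval results.
const samplePrompt = buildPrompt("What is our refund policy?", [
  { content: { text: "Shipping takes 3 to 5 days." }, score: 0.21 },
  { content: { text: "Refunds are issued within 30 days." }, score: 0.87 },
  { content: { text: "Refund requests go through support." }, score: 0.64 },
]);
console.log(samplePrompt);
```

Filtering before sorting keeps low-relevance extracts out of the prompt entirely, which shortens the prompt and can help the time to first token.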
3 - Call the Bedrock endpoint using InvokeModelWithResponseStreamCommand.
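The sketch below shows one way this last step might be handled. It assumes an Anthropic Claude model using the Messages API request and streaming format; other models use different body and chunk shapes. The helper names `buildClaudeBody` and `pipeTextDeltas`, the `TextSink` interface, and the `MODEL_ID` placeholder are hypothetical. A fake stream stands in for the Bedrock response so the decoding logic can be run locally.

```typescript
// Assumption: Anthropic Claude Messages API request format.
function buildClaudeBody(prompt: string, maxTokens: number = 1024): string {
  return JSON.stringify({
    anthropic_version: "bedrock-2023-05-31",
    max_tokens: maxTokens,
    messages: [{ role: "user", content: [{ type: "text", text: prompt }] }],
  });
}

// Minimal shape of the events yielded by the streaming response body.
interface StreamEvent {
  chunk?: { bytes?: Uint8Array };
}

// Minimal shape of a writable target, such as the Lambda response stream.
interface TextSink {
  write(text: string): void;
  end(): void;
}

// Decodes each chunk and forwards text deltas as soon as they arrive.
async function pipeTextDeltas(
  body: AsyncIterable<StreamEvent>,
  sink: TextSink,
): Promise<void> {
  const decoder = new TextDecoder();
  for await (const event of body) {
    const bytes = event.chunk?.bytes;
    if (!bytes) continue; // skip events that carry no payload
    const payload = JSON.parse(decoder.decode(bytes));
    if (payload.type === "content_block_delta" && payload.delta?.type === "text_delta") {
      sink.write(payload.delta.text);
    }
  }
  sink.end();
}

// In the Lambda handler, the wiring would look roughly like this:
//   import { BedrockRuntimeClient, InvokeModelWithResponseStreamCommand }
//     from "@aws-sdk/client-bedrock-runtime";
//   export const handler = awslambda.streamifyResponse(async (event, responseStream) => {
//     const response = await client.send(new InvokeModelWithResponseStreamCommand({
//       modelId: MODEL_ID, // placeholder
//       contentType: "application/json",
//       accept: "application/json",
//       body: buildClaudeBody(prompt),
//     }));
//     if (response.body) await pipeTextDeltas(response.body, responseStream);
//   });

// Local demo: a fake stream standing in for the Bedrock response body.
async function* fakeBody(): AsyncGenerator<StreamEvent> {
  const encoder = new TextEncoder();
  for (const text of ["Hello", ", ", "world"]) {
    const payload = { type: "content_block_delta", delta: { type: "text_delta", text } };
    yield { chunk: { bytes: encoder.encode(JSON.stringify(payload)) } };
  }
}

const received: string[] = [];
pipeTextDeltas(fakeBody(), {
  write: (text) => { received.push(text); },
  end: () => { console.log(received.join("")); },
});
```

Because each delta is written as soon as it is decoded, the user starts seeing text while the model is still generating the rest of the answer.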
And that's it.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.