Using WebSockets to support Bedrock Model response Streaming in Python Lambda functions

Learn how to implement model response streaming with Amazon Bedrock from AWS Lambda functions written in Python with LangChain

Published Jan 14, 2025
I recently developed a chatbot application for a work project, using Python for the code development. For the frontend interface I used Streamlit, an open-source Python library that makes it easy to build web applications that interact with data, like chatbots, while the backend was built as a serverless architecture using the following AWS services:
  • AWS Lambda - utilized for inferencing API calls with Python and Langchain
  • Amazon Bedrock - used for foundation model access
  • Amazon API Gateway - used to invoke the Lambda functions
  • Amazon S3 - used to store LanceDB vector store tables, enabling Retrieval Augmented Generation (RAG) on proprietary data using Amazon Bedrock models
  • Optional: For a fully serverless experience, you can replace LanceDB on S3 with Amazon OpenSearch Serverless and its vector store engine.
This is an overview of the architecture:
Architecture Diagram
While building the application I noticed that getting answers back from the model can take a while, sometimes more than 30 seconds, which is a long wait for an interactive chat experience. I used the Claude 3 Sonnet model for this chatbot, but the experience can be similar with other LLMs, especially when you take into account the RAG process, which requires database access and increases the input context. As I found in these performance benchmarks, the total response time for a 100-token response can take a few seconds depending on the input token length, and since the document context used in RAG can increase the input significantly, it can sometimes take 10-30 seconds or even more to get the full response back.
To avoid waiting for the full response and give the chat a more responsive feel, most models support response streaming. As soon as the first chunk of output comes back from the model, it is streamed to your application, so users don't have to wait for the full response before output is displayed.
As my backend was built in Python with LangChain running on AWS Lambda, I looked into how to implement model response streaming from a Lambda function.
Natively, AWS Lambda supports response streaming only with Node.js, so I wasn't able to use this functionality for my Python function and decided to look for alternative solutions. One solution is to use Amazon API Gateway to create a WebSocket interface between the Lambda function and the chatbot application.
Since I didn’t find a good guide for implementing this, I thought I’d share what I learned and some important code parts.

Create Amazon API Gateway Websocket

To start with, we need to create an API Gateway WebSocket API endpoint and route it to three Lambda functions: Connect, Disconnect, and our LLM streaming code (the Default route).
Below are a few AWS CDK code snippets that can help you create the necessary resources for this solution.
  • Create an Amazon API Gateway WebSocket API
  • Create three Lambda functions for Connect, Disconnect and Default
  • Add routes from the WebSocket API to the Lambda functions you created
You can read more details about the different WebSocket API routes here.
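As a starting point, here is a minimal sketch of what this could look like in Python with CDK v2 (assuming a recent release where the WebSocket constructs live in aws_apigatewayv2 rather than the alpha module). The construct IDs, handler names, and asset paths are placeholders, and the snippet is meant to live inside a Stack's __init__:
```python
from aws_cdk import Duration
from aws_cdk import aws_apigatewayv2 as apigwv2
from aws_cdk import aws_apigatewayv2_integrations as integrations
from aws_cdk import aws_lambda as _lambda

# Three Lambda functions: connect, disconnect and the default (streaming) route.
# Runtime, handlers and asset paths are placeholders - adjust to your project.
connect_fn = _lambda.Function(
    self, "ConnectFn", runtime=_lambda.Runtime.PYTHON_3_12,
    handler="connect.lambda_handler", code=_lambda.Code.from_asset("lambda/connect"))
disconnect_fn = _lambda.Function(
    self, "DisconnectFn", runtime=_lambda.Runtime.PYTHON_3_12,
    handler="disconnect.lambda_handler", code=_lambda.Code.from_asset("lambda/disconnect"))
streaming_fn = _lambda.Function(
    self, "StreamingFn", runtime=_lambda.Runtime.PYTHON_3_12,
    handler="app.lambda_handler", code=_lambda.Code.from_asset("lambda/streaming"),
    timeout=Duration.minutes(5))

# WebSocket API with the $connect, $disconnect and $default routes wired to the functions
ws_api = apigwv2.WebSocketApi(
    self, "ChatWebSocketApi",
    connect_route_options=apigwv2.WebSocketRouteOptions(
        integration=integrations.WebSocketLambdaIntegration("ConnectIntegration", connect_fn)),
    disconnect_route_options=apigwv2.WebSocketRouteOptions(
        integration=integrations.WebSocketLambdaIntegration("DisconnectIntegration", disconnect_fn)),
    default_route_options=apigwv2.WebSocketRouteOptions(
        integration=integrations.WebSocketLambdaIntegration("DefaultIntegration", streaming_fn)))

# Deployed stage, plus permission for the streaming function to post messages back
apigwv2.WebSocketStage(self, "ProdStage", web_socket_api=ws_api,
                       stage_name="prod", auto_deploy=True)
ws_api.grant_manage_connections(streaming_fn)
```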

Create the model inferencing streaming lambda

In order to understand the code flow for streaming data over the WebSocket API, we'll focus on the code sections that build the process:
  1. __init__: WebSocket connection and invocation parameter setup
  2. set_prompt: Set inferencing prompt templates
  3. init_lancedb: Set up the Vector Database (LanceDB) retriever
  4. call_bedrock: Create a conversational RAG chain using the Amazon Bedrock service and the vector retriever
  5. stream_chain: Stream data back to the WebSocket connection

WebSocket connection and invocation parameter setup

To stream the model output from the Lambda function to our client we'll use the boto3 client to manage our API Gateway WebSocket connection. We'll create an invoke_bedrock class and initialize the connection and base parameters:
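Here's a minimal sketch of what this could look like. The class name, parameter names (connection_id, domain_name, stage), and model defaults reflect my setup and are placeholders; the important part is the apigatewaymanagementapi client pointed at the WebSocket callback endpoint. The methods shown in the following sections belong to this same class.
```python
import boto3


class invoke_bedrock:
    """Runs Bedrock inference and streams the output back over the WebSocket."""

    def __init__(self, connection_id, domain_name, stage):
        self.connection_id = connection_id
        # API Gateway Management API client used to post messages back to the
        # open WebSocket connection (callback endpoint: https://{domain}/{stage})
        self.apigw_client = boto3.client(
            "apigatewaymanagementapi",
            endpoint_url=f"https://{domain_name}/{stage}",
        )
        # Bedrock runtime client shared by the LLM and the embeddings model
        self.bedrock_client = boto3.client("bedrock-runtime")
        # Base invocation parameters (illustrative defaults)
        self.model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
        self.model_kwargs = {"temperature": 0.1, "max_tokens": 1024}
```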

Set inferencing prompt templates

Next we’ll create a method that will initialize the prompt templates:

Set up the Vector Database (LanceDB) retriever

In our sample code we set up LanceDB as the retriever for our RAG process, using the Amazon Titan embeddings model. Notice that we added metadata filtering to the retriever so only relevant documents are used for context.
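A sketch of the retriever setup, assuming the LanceDB table is stored on S3. The bucket, table name, and metadata filter are placeholders, and the exact filter syntax depends on the versions of langchain-community and LanceDB you use:
```python
import lancedb
from langchain_aws import BedrockEmbeddings
from langchain_community.vectorstores import LanceDB


def init_lancedb(self, s3_bucket, table_name, source_filter):
    # Titan embeddings model served through Bedrock
    embeddings = BedrockEmbeddings(
        client=self.bedrock_client,
        model_id="amazon.titan-embed-text-v2:0",
    )
    # LanceDB tables stored directly on S3 (placeholder path)
    db = lancedb.connect(f"s3://{s3_bucket}/lancedb")
    vector_store = LanceDB(
        connection=db,
        table_name=table_name,
        embedding=embeddings,
    )
    # Metadata filter so only relevant documents are used as context;
    # LanceDB accepts a SQL-style where clause (syntax varies by version)
    self.retriever = vector_store.as_retriever(
        search_kwargs={"k": 4, "filter": f"source = '{source_filter}'"},
    )
```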

Create a conversational RAG chain using the Amazon Bedrock service and the vector retriever

Next we’ll integrate our retriever and initiate a conversational rag chain. We’ll refer chat history to a DynamoDB table and will user Amazon Bedrock service to handle the conversion. We’ll make sure we enable streaming on the LLM definition (streaming=true).

Stream data back to the WebSocket connection

Next, we’ll get the response back from our streaming chain and extract the answer and context information (like the source file name and path) for reference in the chat.

Put everything together

To make everything work together, we'll wire it all up in our lambda_handler. After a few input validations, we'll initialize our invoke_bedrock class, start our conversational RAG chain, and post all the responses from the RAG chain back to the WebSocket connection we opened.
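A simplified lambda_handler could look like the following; the request body fields (question, session_id, doc_filter) and the VECTOR_BUCKET environment variable are conventions from my setup, not fixed names:
```python
import json
import os


def lambda_handler(event, context):
    # Connection details come from the WebSocket request context
    request_ctx = event["requestContext"]
    connection_id = request_ctx["connectionId"]
    domain_name = request_ctx["domainName"]
    stage = request_ctx["stage"]

    body = json.loads(event.get("body") or "{}")
    question = body.get("question")
    session_id = body.get("session_id", connection_id)
    if not question:
        return {"statusCode": 400, "body": "Missing 'question' in request body"}

    # Initialize the class, build the conversational RAG chain and stream back
    bedrock = invoke_bedrock(connection_id, domain_name, stage)
    bedrock.set_prompt()
    bedrock.init_lancedb(os.environ["VECTOR_BUCKET"], "docs", body.get("doc_filter"))
    chain = bedrock.call_bedrock()
    bedrock.stream_chain(chain, question, session_id)

    return {"statusCode": 200, "body": "Done"}
```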

Frontend chatbot application

On the frontend side I used Python with Streamlit to create the chatbot interface. To open the WebSocket connection to the Lambda function I used the websockets Python package.
Streamlit provides a simple way to write a streaming response from our model to the chatbot interface. Since it takes a Python iterator as input, we'll build a function that handles the WebSocket connection and returns the streaming response as an iterator.
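For example, a generator like this one, using the synchronous client from the websockets package (available in recent versions). The endpoint URL placeholder and the final "sources" message are specific to my backend convention:
```python
import json

from websockets.sync.client import connect

# Placeholder WebSocket endpoint of the API Gateway stage created earlier
WEBSOCKET_URL = "wss://<api-id>.execute-api.<region>.amazonaws.com/prod"


def stream_response(question, session_id):
    """Yield answer chunks from the backend as they arrive over the WebSocket."""
    with connect(WEBSOCKET_URL) as ws:
        ws.send(json.dumps({"question": question, "session_id": session_id}))
        while True:
            try:
                message = ws.recv(timeout=60)
            except TimeoutError:
                break
            if isinstance(message, bytes):
                message = message.decode("utf-8")
            # The backend sends a JSON object with the source list at the end
            try:
                payload = json.loads(message)
                if isinstance(payload, dict) and "sources" in payload:
                    yield "\n\nSources: " + ", ".join(payload["sources"])
                    break
            except json.JSONDecodeError:
                pass
            yield message
```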
Then we run the main chatbot as in the following code, using the write_stream function for the chatbot assistant's response:
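A minimal sketch of the chat loop (st.write_stream requires a recent Streamlit release; the session id is hardcoded here for illustration):
```python
import streamlit as st

st.title("Bedrock RAG Chatbot")

if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if question := st.chat_input("Ask a question"):
    st.session_state.messages.append({"role": "user", "content": question})
    with st.chat_message("user"):
        st.markdown(question)
    with st.chat_message("assistant"):
        # write_stream consumes the iterator and renders chunks as they arrive
        answer = st.write_stream(stream_response(question, "demo-session"))
    st.session_state.messages.append({"role": "assistant", "content": answer})
```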

Summary

When building applications in the GenAI space, Python is often the programming language of choice due to its simplicity and robust development libraries, such as LangChain for backend development and Streamlit for frontend interfaces. For applications using a model-as-a-service approach like with Amazon Bedrock, adopting a serverless infrastructure can streamline development, improve efficiency, and accelerate time-to-value. However, a current limitation of AWS Lambda with Python is the lack of support for model response streaming.
To achieve real-time responsiveness in GenAI applications, you can leverage solutions like API Gateway WebSockets to stream data from the model as it becomes available. This blog has outlined the steps to set up these components, enabling a more responsive and seamless experience for your Python-based serverless GenAI applications.
 
