Managing Chat History at scale in Generative AI Chatbots


An architecture pattern to manage chat history and context at scale in Generative AI chatbots

Aravind Singirikonda
Amazon Employee
Published Jul 12, 2024

Introduction:


Generative AI chatbots are transforming customer engagement across industries by providing instant support, tailored recommendations, and seamless interactions. They excel in diverse use cases like automating helpdesk support, guiding users through onboarding, and answering queries with personalized responses. One of the critical aspects of effective chatbot interactions is the ability to maintain conversation context. By recalling previous interactions, chatbots can respond in a relevant and coherent manner, creating a more engaging experience.
One way to provide this context is by passing the chat history directly to the large language model (LLM). However, as chat history grows with continued user interaction, several challenges arise. The LLM’s token limit restricts how much data can be processed at once, making it challenging to convey complete conversation context in a concise manner. Additionally, managing the growing chat history efficiently becomes a significant data management problem.
In this post, I present a pattern that leverages a hybrid architecture combining Redis for in-memory storage and DynamoDB for persistent, scalable storage. This solution also uses a summarization algorithm that aggregates batches of messages to ensure concise context is passed to the LLM.

The Problem:


  1. Maintaining Accurate Conversation Context:
    • Users expect chatbots to understand previous interactions and provide relevant, personalized responses. Without efficient access to chat history, responses can become disjointed and impersonal.
  2. Token Limit Constraints of LLMs:
    • LLMs have a maximum input token limit, restricting how much context they can consider at once.
    • Directly passing large chat histories to an LLM can exceed this limit, leading to incomplete context or errors.
  3. Relevance vs. Redundancy:
    • Passing too much data to the LLM can introduce noise, reducing response quality.
    • Including redundant information overwhelms the LLM and consumes valuable tokens.
  4. Data Management Challenges:
    • The chat system must efficiently handle growing volumes of messages.
    • Ensuring each message is quickly stored and queried is critical for performance.

The Solution


Architecture
The solution involves three key components that work together to ensure chat history is stored, summarized, and queried efficiently (a minimal setup sketch for the code examples in this post follows the list):
  1. In-Memory Chat History Storage (Redis):
    • Purpose: Provides immediate access to recent chat messages, reducing latency during live conversations.
    • Structure: Messages are stored in user-specific Redis stacks, organized by batch for fast retrieval and summarization.
    • Use: Offers high-speed reads/writes, serving as a buffer before data is persisted to DynamoDB.
  2. Persistent Chat History Storage (DynamoDB):
    • Purpose: Ensures reliable and scalable storage of all historical chat data.
    • Structure: Uses a composite primary key to query messages based on UserId and Timestamp.
    • Use: Enables consistent long-term storage and retrieval, allowing access to complete historical data.
  3. Summarization Logic:
    • Purpose: Aggregates every 20 messages into a summary to provide concise conversation context.
    • Batch Summaries: Summaries retain the essential details of each batch, giving the LLM more relevant context than passing every message individually.
    • Use: Provides a compact overview of conversations, reducing input size for LLMs while maintaining relevant historical context.
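To keep the later workflow sketches concrete, the snippets in this post assume a small amount of shared setup along these lines. The Redis endpoint, table name (ChatHistory), and batch size are illustrative placeholders rather than part of the pattern itself; in practice the Redis endpoint would typically be an Amazon ElastiCache cluster.

```python
# Shared setup assumed by the sketches below (illustrative names and endpoints)
import json
import time

import boto3
import redis

BATCH_SIZE = 20  # messages per batch before summarization, as described above

# In-memory layer: Redis (e.g. a local instance or an ElastiCache endpoint)
redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Persistent layer: a DynamoDB table keyed by UserId (partition) and Timestamp (sort)
dynamodb = boto3.resource("dynamodb")
chat_table = dynamodb.Table("ChatHistory")  # hypothetical table name
```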

Algorithm Explanation:


New Message Workflow:
  • When a new chat message arrives, it is tagged with the current batch identifier and a timestamp.
  • The message is then stored in the Redis stack (an in-memory cache) for quick access and in the DynamoDB table for long-term storage.
  • The system checks how many messages are in the current batch.
  • If 20 messages are reached:
    • A separate background process starts to summarize them.
    • The batch identifier is incremented so new messages can start forming a new batch (see the sketch after this list).
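A minimal sketch of the new-message workflow, building on the shared setup above. The field names (Role, Text) are illustrative, and summarize_batch is the summarization routine sketched in the next section; in production it would run asynchronously rather than inline.

```python
def handle_new_message(user_id: str, role: str, text: str) -> None:
    """Store a new message in Redis and DynamoDB, then trigger summarization
    once the current batch reaches BATCH_SIZE (error handling omitted)."""
    timestamp = int(time.time() * 1000)
    batch_id = int(redis_client.get(f"{user_id}:batch") or 0)

    message = {"UserId": user_id, "Timestamp": timestamp,
               "BatchId": batch_id, "Role": role, "Text": text}

    # Push onto the user's Redis stack for fast access during the live conversation
    redis_client.lpush(f"{user_id}:stack", json.dumps(message))

    # Persist to DynamoDB for durable long-term storage
    chat_table.put_item(Item=message)

    # When the batch is full, start a new batch and summarize the completed one
    if redis_client.llen(f"{user_id}:stack") >= BATCH_SIZE:
        redis_client.incr(f"{user_id}:batch")
        summarize_batch(user_id, batch_id)  # ideally dispatched as a background job
```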
Summary Creation Workflow:
  • In the background, the system retrieves the 20 messages for the current batch from Redis.
  • It aggregates these messages into a summary containing essential information about the conversation.
  • The summary is added to Redis for quick retrieval and persisted in DynamoDB for long-term storage.
  • The original 20 messages are removed from Redis to save memory (see the sketch below).
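A sketch of the summarization step under the same assumptions. A real implementation would call an LLM to produce the summary; here a simple concatenation of the batch stands in for that call.

```python
def summarize_batch(user_id: str, batch_id: int) -> None:
    """Condense the completed batch into a summary, store it in Redis and
    DynamoDB, and remove the original messages from Redis."""
    stack_key = f"{user_id}:stack"

    # Read the BATCH_SIZE most recent messages (the completed batch) from Redis
    raw = redis_client.lrange(stack_key, 0, BATCH_SIZE - 1)
    messages = [json.loads(m) for m in reversed(raw)]  # oldest first

    # Stand-in for an LLM summarization call over the batch
    summary_text = "Summary: " + " ".join(m["Text"] for m in messages)

    summary = {"UserId": user_id, "Timestamp": int(time.time() * 1000),
               "BatchId": batch_id, "Type": "summary", "Text": summary_text}

    # Keep the summary hot in Redis and durable in DynamoDB
    redis_client.lpush(f"{user_id}:summary", json.dumps(summary))
    chat_table.put_item(Item=summary)

    # Drop the original batch messages from Redis to free memory
    redis_client.ltrim(stack_key, BATCH_SIZE, -1)
```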
Reloading Cache Workflow:
  • If the application restarts or the in-memory cache is reset, the system reloads the most recent 20 messages or summaries from DynamoDB.
  • This chat history is populated into Redis, ensuring the chatbot can still access recent conversation context (a reload sketch follows).
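A sketch of the reload step, again assuming the setup above. It queries the user's partition in DynamoDB for the newest items (messages or summaries) and re-pushes them so the Redis stack ends up in the same order lpush would have produced.

```python
from boto3.dynamodb.conditions import Key

def reload_cache(user_id: str, limit: int = BATCH_SIZE) -> None:
    """Repopulate the Redis stack from DynamoDB after a restart or cache reset."""
    # Fetch the newest items for this user, most recent first
    response = chat_table.query(
        KeyConditionExpression=Key("UserId").eq(user_id),
        ScanIndexForward=False,  # descending by Timestamp
        Limit=limit,
    )

    # Re-push oldest first so the newest item ends up at the head of the stack
    for item in reversed(response.get("Items", [])):
        redis_client.lpush(f"{user_id}:stack", json.dumps(item, default=str))
```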

Data Model


Redis:
  • Stack Keys: Chat messages are organized into individual stacks per user, using the pattern {user_id}:stack.
  • Summary Keys: Each user also has summary entries with the key pattern {user_id}:summary for batch summaries.
DynamoDB:
  • Partition Key (UserId): Messages are grouped by user for efficient retrieval.
    • Sort Key (Timestamp): Messages are ordered chronologically within each user’s partition (a table-definition sketch follows).
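For reference, a table with this key schema could be created with the dynamodb resource from the setup sketch roughly as follows; the table name and billing mode are illustrative.

```python
# Hypothetical one-time table creation matching the key schema above
table = dynamodb.create_table(
    TableName="ChatHistory",
    KeySchema=[
        {"AttributeName": "UserId", "KeyType": "HASH"},      # partition key
        {"AttributeName": "Timestamp", "KeyType": "RANGE"},  # sort key
    ],
    AttributeDefinitions=[
        {"AttributeName": "UserId", "AttributeType": "S"},
        {"AttributeName": "Timestamp", "AttributeType": "N"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()
```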

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
