Bridging the Efficiency Gap: Mastering LLM Caching for Next-Generation AI (Part 2)


LLM caching refers to the process of storing and managing the intermediate computations and outputs generated by language models, allowing for rapid retrieval and reuse in subsequent queries or tasks. In this second part of a blog series, we'll explore LLM caching implementations.

Uri Rosenberg
Amazon Employee
Published Aug 7, 2024
In the first part of our blog series on caching large language models, we explored the fundamental principles and architectural patterns that can be used to optimize LLM performance. From single-layer caching to multi-tiered approaches, we examined how these techniques can dramatically improve the responsiveness and efficiency of these powerful AI systems.
Now, in this second installment, we're going to dive deep into the practical implementation of one of the most versatile caching approaches - semantic caching. Unlike exact key matching, which is limited to serving only identical inputs, semantic caching leverages advanced natural language processing to identify relevant cached responses even for slightly different or paraphrased queries. This added flexibility can be particularly valuable when working with LLMs in real-world applications.
We'll explore how to implement semantic caching for LLMs on the Amazon Web Services (AWS) cloud platform. We'll cover a range of AWS services and tools that can be leveraged to build a robust, scalable, and high-performing semantic caching layer, including Amazon ElastiCache, Amazon OpenSearch Service, and AWS Lambda. Additionally, we'll discuss key design considerations, such as cache eviction policies, cache store types, and integration with your existing LLM-powered workflows.
Whether you're running your LLMs on AWS or exploring ways to enhance the performance of your AI-driven applications, this guide will equip you with the knowledge and strategies needed to harness the power of semantic caching for your large language models. Let's dive in!

Option 1: Do It Yourself (DIY)

The "Do It Yourself" route provides you with the greatest flexibility and control. This custom implementation involves building out the various components yourself, rather than relying solely on pre-packaged frameworks (which we'll explore later).
DIY Implementation
Here are the core components of this DIY semantic caching solution (a minimal code sketch putting them together follows the list):
  1. Custom business logic, which governs how the caching system handles incoming requests. This includes any pre/post-processing, authentication and role-based caching, semantic analysis, and retrieval or storage of the relevant responses. It could be implemented using AWS Lambda or any of the container-based services like Amazon EKS or Amazon ECS.
  2. An embedding model to convert text inputs into high-dimensional vector representations. Because every request is checked against past cached inputs, this model needs to be fast, light, and cheap (after all, we are trying to avoid invoking the heavy LLM!). You can use Amazon Bedrock embedding models such as Amazon Titan Text Embeddings or Cohere Embed, or SageMaker JumpStart embedding models such as GTE, E5, or BGE.
  3. These embeddings need to be efficiently stored and queried in a purpose-built vector database: a specialized data storage and retrieval system designed to efficiently store and query high-dimensional vector data, such as the embeddings produced by an embedding model. Again, since every request is checked against past cached inputs, this database needs to be fast and scalable enough to store and query a large number of embedded inputs. On AWS, options include Amazon Aurora PostgreSQL-Compatible Edition, Amazon OpenSearch Service, vector search for Amazon MemoryDB, or Amazon DocumentDB.
  4. For the cache store itself, you can leverage a high-performance database. A cache store is a high-performance data storage system used to temporarily hold and quickly retrieve frequently accessed data, in this case the results of semantic queries to the language model. It can be built on various AWS database options, such as Amazon RDS (PostgreSQL, MySQL, MariaDB, SQL Server), Amazon DynamoDB, or Amazon DocumentDB, depending on factors like scalability, latency, and data modeling requirements.
  5. Finally, to manage the cache eviction strategy, you'll need to implement your own policies, potentially considering factors like access frequency, recency, or predefined expiration periods.
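To make the flow concrete, here is a minimal, illustrative sketch of that request path. It uses in-memory dictionaries as stand-ins for the vector database and cache store, and Amazon Bedrock for both the embedding model and the LLM; the model IDs and similarity threshold are assumptions you would adapt to your own account and workload.

```python
import json
import math
import uuid

import boto3

bedrock = boto3.client("bedrock-runtime")

SIMILARITY_THRESHOLD = 0.8  # illustrative value; tune for your workload

# In-memory stand-ins for the vector database and the cache store.
# In production these would be, for example, OpenSearch / Aurora and DynamoDB / ElastiCache.
vector_index: dict[str, list[float]] = {}   # entry_id -> embedding
response_cache: dict[str, str] = {}         # entry_id -> cached LLM response


def embed(text: str) -> list[float]:
    # Amazon Titan Text Embeddings via Bedrock (model ID may differ in your account/Region)
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def handler(event, context=None):
    prompt = event["prompt"]
    embedding = embed(prompt)

    # 1. Look for a semantically similar past request
    best_id, best_score = None, -1.0
    for entry_id, stored in vector_index.items():
        score = cosine(embedding, stored)
        if score > best_score:
            best_id, best_score = entry_id, score

    if best_id is not None and best_score >= SIMILARITY_THRESHOLD:
        return response_cache[best_id]  # 2a. cache hit: skip the LLM entirely

    # 2b. Cache miss: invoke the (heavy) LLM
    llm_resp = bedrock.invoke_model(
        modelId="amazon.titan-text-express-v1",
        body=json.dumps({"inputText": prompt}),
    )
    answer = json.loads(llm_resp["body"].read())["results"][0]["outputText"]

    # 3. Store the embedding and the answer for future requests (eviction policy omitted)
    entry_id = str(uuid.uuid4())
    vector_index[entry_id] = embedding
    response_cache[entry_id] = answer
    return answer
```

In production, the dictionary lookups would be replaced by queries against your chosen vector database and cache store, and the eviction policy (TTL, LRU, or usage-based) would be enforced there rather than in application code.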
While this DIY approach requires a bit more upfront investment in terms of development time and resources, it can ultimately provide you with a more tailored, optimized, and scalable semantic caching system for your large language models. This level of customization can be particularly valuable if you have unique business requirements, specialized data sources, or the need for tighter integration with your existing infrastructure and workflows.

Option 2: LangChain GPTCache

GPTCache is an open source caching library that integrates with the LangChain framework. It supports different caching backends, including in-memory caching, Redis, and DynamoDB, allowing you to choose the most appropriate caching solution based on your specific requirements and infrastructure. It also supports different caching strategies, such as exact matching and semantic matching, enabling you to balance speed and flexibility in your caching approach.
Let's see GPTCache in action with a simple example using Amazon Bedrock.
First, install the needed packages.
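In a notebook, that might look like the following; the exact package names are assumptions based on recent LangChain releases (the Bedrock integrations live in langchain-aws), and GPTCache's default similarity setup additionally needs a local vector index such as Faiss.

```python
# Run in a notebook cell; adjust package names and versions to your environment
%pip install -qU langchain langchain-aws langchain-community gptcache boto3 faiss-cpu
```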
We then use the BedrockLLM class. For this example we are using the Amazon Titan Text Express model.
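A minimal setup might look like this; the Region and model ID are example values:

```python
from langchain_aws import BedrockLLM

# Amazon Titan Text Express on Amazon Bedrock (Region is an example)
llm = BedrockLLM(
    model_id="amazon.titan-text-express-v1",
    region_name="us-east-1",
)
```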
Next, let's initialize a simple similarity cache with GPTCache. By default, GPTCache uses exact key caching.
To support semantic caching, we will use the built-in init_similar_cache function.
The following setup uses Faiss as the vector store.
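A sketch of that setup, following the pattern from the LangChain GPTCache integration, could look like this; init_similar_cache defaults to a local SQLite store plus a Faiss vector index, and the per-LLM cache directory naming is just a convention:

```python
import hashlib

from gptcache import Cache
from gptcache.adapter.api import init_similar_cache
from langchain.globals import set_llm_cache
from langchain_community.cache import GPTCache


def get_hashed_name(name: str) -> str:
    # Stable, filesystem-safe identifier for the LLM being cached
    return hashlib.sha256(name.encode()).hexdigest()


def init_gptcache(cache_obj: Cache, llm: str) -> None:
    # init_similar_cache switches GPTCache from exact key matching to semantic matching
    init_similar_cache(cache_obj=cache_obj, data_dir=f"similar_cache_{get_hashed_name(llm)}")


set_llm_cache(GPTCache(init_gptcache))
```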
That's it! The cache is set! Let's see a few example invocations!
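The calls below mirror the screenshots that follow; timings will naturally vary by account, Region, and model:

```python
llm.invoke("What is Amazon Bedrock?")           # first call: no cache entry yet, goes to the model
llm.invoke("What is  Amazon Bedrock?")          # extra space: still served from the cache
llm.invoke("Explain what is Amazon Bedrock?")   # paraphrase: semantic cache hit
```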
GPTCache example
First invocation, "What is Amazon Bedrock?", took 12 seconds to respond.
Second invocation, "What is  Amazon Bedrock?" (notice the extra space between "is" and "Amazon"), took 881 ms.
Third invocation, "Explain what is Amazon Bedrock?", took 1.03 seconds.
GPTCache is designed with modularity in mind, allowing users to easily customize their own semantic caching system. The library provides a range of pre-built implementations for each of the core components, such as the cache backend, similarity search, and cache management, giving users the flexibility to choose the implementation that best fits their requirements.
Furthermore, the modular architecture of GPTCache enables users to develop and integrate their own custom implementations for any of the module's components. This level of customization allows users to tailor the caching system to their specific use cases, data sources, and performance needs, ensuring the optimal efficiency and effectiveness of their large language model applications.

Option 3: LangChain built-in semantic cache

The langchain.cache module in LangChain allows you to cache the results of individual LLM calls using different cache stores. The full list of supported caching stores can be found here; note that the list includes exact key caching as well. Let's look at an example using Amazon OpenSearch Service.
First, follow the instructions here to create an Amazon OpenSearch Service domain. We'll call our domain "cache". At the end of the creation process, copy the domain endpoint along with the user name and password.
Next, let's set up and use the Bedrock chat module.
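For example, using the ChatBedrock class; the specific chat model and Region here are assumptions, and any Bedrock chat model you have access to will do:

```python
from langchain_aws import ChatBedrock

chat = ChatBedrock(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",  # example model
    region_name="us-east-1",
)
```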
Format the domain URL with the user name and password from the OpenSearch setup process.
Now, initialize the caching with OpenSearchSemanticCache. Notice that we also need to provide the embedding model we want to use; here, we are using the Bedrock Titan embedding model. Both steps are sketched below.
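Putting those two steps together, a sketch might look like the following; the user name, password, and domain endpoint are placeholders from the domain creation step, and the score_threshold value is just an example:

```python
from langchain.globals import set_llm_cache
from langchain_aws import BedrockEmbeddings
from langchain_community.cache import OpenSearchSemanticCache

# Placeholders copied from the OpenSearch domain creation step
opensearch_url = f"https://{username}:{password}@{domain_endpoint}:443"

# Bedrock Titan embedding model
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")

set_llm_cache(
    OpenSearchSemanticCache(
        opensearch_url=opensearch_url,
        embedding=embeddings,
        score_threshold=0.5,  # optional; the default is 0.2
    )
)
```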
In this example we provided an additional score_threshold parameter. The semantic matching threshold controls the level of similarity required between the user's input and the cached results. A higher threshold value means stricter matching criteria, where the input and cached results must be highly similar to count as a match. Conversely, a lower threshold allows for more lenient semantic matching, where a broader range of inputs can be deemed similar enough to the cached results.
The default threshold value in the caching system is set to 0.2. This means that the user's input and the cached results must have a cosine similarity of at least 0.2 to be considered semantically similar and eligible for retrieval from the cache.
Let's ask our model about Bedrock.
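For example, something along these lines:

```python
chat.invoke("What is Amazon Bedrock?")  # first call: nothing cached yet
```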
OpenSearch uncached response
The first invocation took more than 3 seconds.
Next, we'll ask a similar question.
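For instance, a paraphrase like the one below (the exact wording is hypothetical) should be close enough to be served from the semantic cache:

```python
chat.invoke("Please explain what Amazon Bedrock is")  # hypothetical paraphrase; semantic cache hit
```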
OpenSearch cached response
Now the response came in 332 ms! Not bad ;)

RAG implementation (using a LangChain retriever)

You can use the same setup in a RAG application. As an example, we use the Knowledge Bases and RAG workshop, where we store information about Amazon shareholder letters. Once everything is set up, we can ask our RAG application questions about the information in the letters.
This time we use the RetrievalQA chain.
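A sketch of the chain, assuming a Bedrock knowledge base created in the workshop (the knowledge base ID is a placeholder):

```python
from langchain.chains import RetrievalQA
from langchain_aws import AmazonKnowledgeBasesRetriever

retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id="XXXXXXXXXX",  # placeholder: ID of the workshop knowledge base
    retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
)

qa = RetrievalQA.from_chain_type(
    llm=chat,
    retriever=retriever,
    return_source_documents=True,  # also returns the referenced documents shown below
)
```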
Caching is set exactly as before.
Now, let's ask how many employees Amazon has.
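For example:

```python
qa.invoke("How many employees does Amazon have?")
```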
RAG first invocation
Notice that because we are using RetrievalQA, we also see the referenced documents that were retrieved. The response time here was 5.4 seconds.
See the result below when we ask the question a second time.
RAG Cached
We get the same (cached) response, including the reference documents, but this time it came back in under a second.

Summary

In the first part of our blog series on caching strategies for large language models (LLMs), we explored the fundamental principles and architectural patterns that can be employed to optimize the performance of these powerful AI systems. From single-layer caching to multi-tiered approaches, we examined how different caching techniques can dramatically improve the responsiveness and efficiency of LLM-powered applications.
Now, in this second and final installment, we delve into the practical implementation of one of the most versatile caching approaches - semantic caching. Unlike exact key matching, which is limited to serving only identical inputs, semantic caching leverages advanced natural language processing to identify relevant cached responses even for slightly different or paraphrased queries. This added flexibility can be particularly valuable when working with LLMs in real-world scenarios.
Whether you're running your LLMs on AWS or exploring ways to enhance the performance of your AI-driven applications, I hope this comprehensive blog series has equipped you with the knowledge and strategies needed to harness the power of caching for your large language models.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
