
Bridging the Efficiency Gap: Mastering LLM Caching for Next-Generation AI (Part 2)
LLM caching refers to the process of storing and managing the intermediate computations and outputs generated by language models, allowing for rapid retrieval and reuse in subsequent queries or tasks. In this second part of the blog series, we'll explore LLM caching implementations on AWS. A semantic caching architecture typically needs the following building blocks:
- Custom business logic, which governs how the caching system handles incoming requests. This includes any pre/post-processing, authentication and role-based caching, semantic analysis, and retrieving or storing the relevant responses. It could be implemented using AWS Lambda or any of the container-based services like Amazon EKS or Amazon ECS.
- An embedding model to convert text inputs into high-dimensional vector representations. Because every request will be checked against past cached inputs, this model needs to be fast, light, and cheap (after all, we are trying to avoid invoking the heavy LLM!). Options include Amazon Bedrock embedding models like Titan Text Embeddings and Cohere Embed, or SageMaker JumpStart embedding models like gte, e5, or bge.
- A purpose-built vector database to efficiently store and query these embeddings. A vector database is a specialized data storage and retrieval system designed for high-dimensional vector data, such as the embeddings produced by an embedding model. Again, since every request will be checked against past cached inputs, this database needs to be fast and scalable enough to store and query a large number of embeddings. On AWS, options include Amazon Aurora PostgreSQL, Amazon OpenSearch Service, vector search for Amazon MemoryDB, and Amazon DocumentDB.
- For the cache store itself, you can use a high-performance database. A cache store temporarily holds and quickly retrieves frequently accessed data, in this case the results of semantic queries to the language model. It can be built on various AWS database options, such as PostgreSQL, MySQL, MariaDB, SQL Server, Amazon DynamoDB, or Amazon DocumentDB, depending on factors like scalability, latency, and data modeling requirements.
- Finally, to manage cache eviction, you'll need to implement your own policies, potentially considering factors like access frequency, recency, or predefined expiration periods. A minimal sketch of how these pieces fit together is shown right after this list.
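To make these pieces concrete, here is a minimal, self-contained sketch of the flow described above: embed the incoming prompt with a lightweight Bedrock model, search previously cached prompts for a close enough match, return the cached answer on a hit, and otherwise invoke the LLM, store the result, and evict stale entries with a simple TTL policy. The Bedrock model IDs are real, but the in-memory `SemanticCache` class, its similarity threshold, and the TTL value are illustrative stand-ins for the purpose-built vector database and cache store discussed above.

```python
import json
import math
import time

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


def embed(text):
    """Embed the prompt with a small, cheap Bedrock embedding model."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Illustrative in-memory stand-in for a vector database plus cache store."""

    def __init__(self, threshold=0.9, ttl_seconds=3600):
        self.threshold = threshold      # minimum similarity to count as a cache hit
        self.ttl_seconds = ttl_seconds  # simple time-based eviction policy
        self.entries = []               # list of (embedding, response, stored_at)

    def lookup(self, embedding):
        now = time.time()
        # Evict expired entries (TTL/recency-based policy).
        self.entries = [e for e in self.entries if now - e[2] < self.ttl_seconds]
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is not None and cosine(embedding, best[0]) >= self.threshold:
            return best[1]
        return None

    def store(self, embedding, response):
        self.entries.append((embedding, response, time.time()))


cache = SemanticCache()


def answer(prompt):
    vector = embed(prompt)
    cached = cache.lookup(vector)
    if cached is not None:
        return cached  # semantic cache hit: skip the expensive LLM call
    # Cache miss: invoke the heavier text generation model and cache the result.
    resp = bedrock.invoke_model(
        modelId="amazon.titan-text-express-v1",
        body=json.dumps({"inputText": prompt}),
    )
    text = json.loads(resp["body"].read())["results"][0]["outputText"]
    cache.store(vector, text)
    return text
```

In production you would run this logic in Lambda or a container, back the lookup with one of the vector databases listed above, keep responses in a store like DynamoDB or MemoryDB, and tune the threshold and eviction policy to your traffic; the control flow stays the same.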
Let's start with a semantic cache implementation using GPTCache with LangChain. First, install the required packages:
%pip install --upgrade --quiet langchain langchain_aws langchain-community gptcache
Next, initialize an LLM through Amazon Bedrock:
from langchain_aws import BedrockLLM

# Amazon Titan Text Express, served through Amazon Bedrock
llm = BedrockLLM(
    model_id="amazon.titan-text-express-v1"
)
Next, we configure LangChain to use GPTCache for semantic (similarity-based) caching via the init_similar_cache function.
2
3
4
5
6
7
8
9
10
11
12
13
14
import hashlib

from langchain_core.globals import set_llm_cache
from gptcache import Cache
from gptcache.adapter.api import init_similar_cache
from langchain_community.cache import GPTCache


def get_hashed_name(name):
    return hashlib.sha256(name.encode()).hexdigest()


def init_gptcache(cache_obj: Cache, llm: str):
    # Keep a separate on-disk similarity cache per LLM
    hashed_llm = get_hashed_name(llm)
    init_similar_cache(cache_obj=cache_obj, data_dir=f"similar_cache_{hashed_llm}")


# Register GPTCache as the global LLM cache for LangChain
set_llm_cache(GPTCache(init_gptcache))
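With the cache registered, we can run a few prompts and compare response times. The timings discussed below came from calls along these lines (a sketch; the exact notebook cells and timing method are assumed):

```python
# First call: a cache miss, so the request goes to the Bedrock model
print(llm.invoke("What is Amazon Bedrock?"))

# Near-duplicate and reworded prompts: candidates for the similarity cache
print(llm.invoke("What is  Amazon Bedrock?"))        # note the extra space
print(llm.invoke("Explain what is Amazon Bedrock?"))
```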
"What is Amazon Bedrock?"
, took 12 seconds for response."What is Amazon Bedrock?"
(notice the extra space between "is" and "Amazon"), 881 ms"Explain what is Amazon Bedrock?"
, 1.03 secondslangchain.cache
module in LangChain is a feature that allows you to cache results of individual LLM calls using different cache stores. The full list of supported cacheing stores can be found here, however notice these include exact key caching as well. Let's look at an example using Amazon OpenSearch.1
This time we'll use a chat model, Anthropic's Claude 3 Haiku on Amazon Bedrock:
from langchain_aws import ChatBedrock

# Anthropic Claude 3 Haiku, served through Amazon Bedrock
chat = ChatBedrock(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    model_kwargs={"temperature": 0.5},
)
Set the endpoint of your OpenSearch Service domain:
endpoint='https://<USER>:<PASSWORD>@<DOMAIN>.<REGION>.es.amazonaws.com'
Next, we set the LLM cache to an OpenSearchSemanticCache. Notice we also need to provide the embedding model we want to use; here, we are using the Amazon Titan embedding model on Bedrock.
from langchain.globals import set_llm_cache
from langchain_aws import BedrockEmbeddings
from langchain_community.cache import OpenSearchSemanticCache

bedrock_embeddings = BedrockEmbeddings(
    model_id="amazon.titan-embed-text-v1", region_name="us-east-1"
)

# Enable LLM cache. Make sure OpenSearch is set up and running. Update URL accordingly.
set_llm_cache(
    OpenSearchSemanticCache(
        opensearch_url=endpoint, embedding=bedrock_embeddings, score_threshold=0.1
    )
)
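Once the cache is set, a quick way to see it working (a usage sketch, assuming a notebook environment) is to time two semantically similar chat calls: the first goes to the model, while the second should be answered from OpenSearch.

```python
import time

for question in ["What is Amazon Bedrock?", "What is Amazon Bedrock"]:
    start = time.time()
    response = chat.invoke(question)
    print(f"{question!r} answered in {time.time() - start:.2f}s")
```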
Notice the score_threshold parameter. This semantic matching threshold controls the level of similarity required between the user's input and the cached results. A higher threshold value indicates stricter matching criteria, where the input and cached results must be highly similar to be considered a match. Conversely, a lower threshold allows for more lenient semantic matching, where a broader range of inputs can be deemed similar enough to the cached results.
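For example, if the cache starts returning answers for questions that are only loosely related, you could re-create it with a stricter threshold (the value below is illustrative; tune it against your own traffic):

```python
# Stricter matching: only near-identical questions are served from the cache
set_llm_cache(
    OpenSearchSemanticCache(
        opensearch_url=endpoint,
        embedding=bedrock_embeddings,
        score_threshold=0.5,
    )
)
```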
Semantic caching also works when the LLM is invoked as part of a larger chain, for example a Retrieval Augmented Generation (RAG) flow built with LangChain's RetrievalQA module.
from langchain.chains import RetrievalQA

# `retriever` and `claude_prompt` are assumed to be defined as part of the
# earlier RAG setup (a vector store retriever and a prompt template for Claude).
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": claude_prompt},
)
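Running a query through the chain looks roughly like this (a sketch; the exact question and output handling are assumed):

```python
result = qa.invoke({"query": "What is Amazon Bedrock?"})

print(result["result"])                   # the generated answer
for doc in result["source_documents"]:    # included because return_source_documents=True
    print(doc.metadata)
```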
With the semantic cache enabled for RetrievalQA, we are also seeing the referenced source documents retrieved alongside the answer. The response time here was 5.4 seconds.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.