Client-side parallel invocation of LLMs in Amazon Bedrock


Guidance on potentially improving large language model (LLM) invocation throughput in real-time use cases by optimizing how the client calls the LLM.

Published Jul 30, 2024
With Amazon Bedrock, you can use foundation models (FMs) from leading AI companies like Anthropic, Cohere, Meta, and Mistral AI through an API. These are commonly referred to as large language models (LLMs). In a typical workflow, you pass your data into the API and programmatically invoke the LLM to retrieve the output; your input data is processed to produce the output data.
There can be a scenario where you need a certain throughput so that a large amount of data is processed within a short period of time. Provisioned Throughput for Amazon Bedrock can be a solution when the bottleneck is the requests-per-minute or tokens-per-minute quotas. However, there may be a situation where your usage of Amazon Bedrock has not reached those quotas yet, but is instead limited by the client-side application's ability to invoke the LLM in parallel to achieve higher throughput. While batch inference in Amazon Bedrock may help for use cases where inferences can be made in batches, you likely need a different solution for a real-time use case.

How it can be done

One way to achieve this parallel processing in Python is with the concurrent.futures library. With the ThreadPoolExecutor class, asynchronous invocation is implemented with threads. A code example is provided below.
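Here is a minimal sketch of what such a parallel invocation could look like. It assumes an Anthropic Claude 3 model on Amazon Bedrock called through boto3; the model ID, prompt shape, and the call_llm helper are illustrative assumptions rather than the exact code from the solution.

import json
import concurrent.futures

import boto3

# Assumes AWS credentials and region are already configured for this client.
bedrock_runtime = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # example model, swap for your own

def call_llm(text):
    """Invoke the LLM once for a single input item and return the generated text."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": [{"type": "text", "text": text}]}],
    })
    response = bedrock_runtime.invoke_model(modelId=MODEL_ID, body=body)
    return json.loads(response["body"].read())["content"][0]["text"]

input_list = ["first prompt", "second prompt", "third prompt"]

# Submit one LLM invocation per input item to a thread pool so the calls overlap.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(call_llm, item) for item in input_list]
    results = [f.result() for f in concurrent.futures.as_completed(futures)]

Note that as_completed yields results in completion order, not input order, which is exactly the behavior discussed in the next section.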

How it works

With this invocation technique, the calls to Amazon Bedrock can be performed in parallel by the client-side device. The calls are made asynchronously instead of sequentially. Instead of sitting idle while the program waits for a response from the server side, it can start the asynchronous invocation for the next item in the input list. This reduction of idle time is how the higher throughput is achieved.
It is important to know that the responses may not be received in the same order as the input list. If the input list contains data [1, 2, 3] and the LLM call is denoted by f(), then the output of f(2) may be received before f(1). This means that no code inside the call_llm method above should depend on the output of a previous item.

An anti-pattern is described in the following scenario. Suppose you have a list of the chat history for a given customer across different sessions, such as [“long session 1 chat . . .”, “long session 2 chat . . .”, “most recent chat”], and you want to summarize each chat session using the LLM. You can still use the parallel technique for that requirement. However, once you need to summarize a particular chat session using the previous chat sessions as context, problems may arise because the summaries of the previous chats may not be ready yet. Instead, you can process the chat sessions sequentially, or process the per-chat summarization in parallel and separate the cross-chat summarization into a later step, as sketched below.
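The following sketch shows one way to keep each output matched to its input despite out-of-order completion, using the call_llm helper from the sketch above; the summarize_sessions function and its structure are assumptions for illustration.

import concurrent.futures

def summarize_sessions(chat_sessions, max_workers=5):
    """Summarize each chat session independently; results keep the input order."""
    summaries = [None] * len(chat_sessions)
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the index of the session it was submitted for.
        future_to_index = {
            executor.submit(call_llm, session): i
            for i, session in enumerate(chat_sessions)
        }
        for future in concurrent.futures.as_completed(future_to_index):
            summaries[future_to_index[future]] = future.result()
    return summaries

# Cross-session summarization can then run afterwards, once all per-session
# summaries exist, so no parallel call depends on another call's output.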

Performance

In the previous code example, the value of max_workers can be adjusted based on the number of CPUs the device has and other factors. A possible formula is os.cpu_count()*alpha + beta, where alpha is a multiplication factor and beta is a constant; for example, alpha may be 1 and beta may be 4. I used max_workers = 5 on one occasion and 15 on another, and performed testing to find the optimal max_workers.
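As a rough illustration of that formula (the alpha and beta values below are just the example numbers above, not a recommendation):

import os

alpha, beta = 1, 4  # example values; tune through testing
max_workers = (os.cpu_count() or 1) * alpha + beta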
I did a simple benchmark for an image understanding use case with a multi-modal LLM in Amazon Bedrock in the Oregon (us-west-2) region. I used max_workers = 5 on an ml.t3.medium instance in SageMaker JupyterLab. The result showed about 5.47 times higher throughput when invoking the LLM in parallel as opposed to invoking it sequentially for a list of data points.

Application

This technique is used in the Video Understanding Solution sample published in the AWS Samples repository. The solution uses an LLM by making invocation calls to Amazon Bedrock to extract information from uploaded .mp4 videos, which is later used for summary generation and Q&A. The input data for the LLM calls are the video frames. The use of the ThreadPoolExecutor class from the concurrent.futures library can be seen on this line of code.
The time needed to process uploaded videos has decreased meaningfully with this technique. It is important to remember that the input data may not be processed in the order arranged in the input list. For the Video Understanding Solution, this means that the processing of one frame must not depend on another, since frames can be processed in a different order.

Next steps

Learn more about generative AI with Amazon Bedrock. You can refer to the code samples here and more resources here. I hope this blog post is useful and meaningful for you. Thank you for reading.

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
