Triggering a GenAI task from a live Contact Flow with Amazon Connect
Written by Jean Malha (Senior Solutions Architect, ISV), Moh Tahsin (Specialist Solutions Architect, Generative AI), and Richard Kim (Senior Technical Account Manager).
Published Sep 13, 2024
Last Modified Sep 16, 2024
Deciding how to use a Foundation Model in a user workflow can lead down many different paths. We can ask a Large Language Model to generate a natural language response based on its own knowledge, ask it to query a database for relevant information, or even have it execute programmatic functions on our behalf. Depending on their complexity, these tasks can take multiple seconds or longer.
The timeout for Lambda calls within an Amazon Connect Contact Flow cannot exceed 8 seconds (see documentation), making it hard to incorporate a call to a Foundation Model into a live contact flow.
Let us break down an example:
A customer is calling to ask for a summary of the most recent quarterly earnings call for Amazon, to keep informed on their investments. A large document like that could be around 12,500 tokens (around 9,325 words, or 18 pages). Even for some of the most robust models available today (September 13th, 2024), summarizing a corpus of text that size could take upwards of 20 seconds. Normally the 8 second timeout would be a roadblock for a live answer within an Amazon Connect Contact Flow (see documentation). With the proposed solution, a waiting prompt plays while the task executes. We are able to circumvent the 8 second timeout because the processing actually takes place asynchronously, while still feeling like a real-time request to the customer.
This gives us two distinct advantages:
- As mentioned previously, we can take as long as needed for the model to generate a response (within reason for the customer, of course).
- In the future, we can cache the responses after generation.
The cache (DynamoDB, as shown in the diagram below) can speed up similar requests by pulling results directly from DynamoDB instead of querying the LLM again. When the same query to summarize the earnings call is asked a second time, the wait time drops from 20 seconds to potentially under one second, because we are only reading from the cache.
Proposed solution:
In order to keep the customer engaged in the conversation, limit the number of abandoned calls, and make it possible to cache a specific result in case the customer drops, we propose to create a batch processing capability. It plays waiting prompts repeated at a defined interval and stores results in a DynamoDB table, so that if the same input is sent more than once, the application can provide the stored answer.
The processing is split into 3 tasks:
- Producing the generation query - handled by a Producer Lambda function
- Generating the response - handled by a Generative Lambda function
- Checking if a generation has already been posted for the query - handled by a Consumer Lambda Function
Implementation:
Once the contact flow gets to a point where you are ready to invoke the model (or any other long-running task), trigger the Producer Lambda through an Invoke AWS Lambda function block (see documentation).
The Producer Lambda function will take the event and post it to an SQS queue.
Here is a simplified example in Python:
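(A minimal sketch for illustration: the GENAI_QUEUE_URL environment variable and the CustomerQuery contact attribute are assumed names, not part of the original solution.)

```python
import json
import os

import boto3

sqs = boto3.client("sqs")

# Assumed environment variable holding the SQS queue URL (illustrative name)
QUEUE_URL = os.environ["GENAI_QUEUE_URL"]


def lambda_handler(event, context):
    # Amazon Connect passes contact data in event["Details"]; here we assume
    # the customer's question was stored earlier in the flow in a contact
    # attribute named "CustomerQuery" (an illustrative name).
    contact_data = event["Details"]["ContactData"]
    contact_id = contact_data["ContactId"]
    query = contact_data["Attributes"].get("CustomerQuery", "")

    # Post the work item to the SQS queue for the Generative Lambda to pick up
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"contactId": contact_id, "query": query}),
    )

    # Return immediately (flat key/value pairs, as Amazon Connect expects)
    # so the contact flow keeps executing
    return {"statusCode": 200, "contactId": contact_id}
```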
The Generative Lambda will be automatically invoked by the SQS trigger and write the results to a pre-defined DynamoDB table.
Here is a simplified example in Python:
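(A minimal sketch for illustration: it assumes the Amazon Bedrock Converse API with an Anthropic Claude model and a DynamoDB table named GenAIResults keyed by contactId; your model and table choices may differ.)

```python
import json
import os

import boto3

bedrock = boto3.client("bedrock-runtime")
dynamodb = boto3.resource("dynamodb")

# Assumed table and model identifiers, for illustration only
TABLE_NAME = os.environ.get("RESULTS_TABLE", "GenAIResults")
MODEL_ID = os.environ.get("MODEL_ID", "anthropic.claude-3-sonnet-20240229-v1:0")


def lambda_handler(event, context):
    table = dynamodb.Table(TABLE_NAME)

    # The SQS trigger can deliver several messages in one invocation
    for record in event["Records"]:
        message = json.loads(record["body"])
        contact_id = message["contactId"]
        query = message["query"]

        # Ask the model for a response through the Bedrock Converse API
        response = bedrock.converse(
            modelId=MODEL_ID,
            messages=[{"role": "user", "content": [{"text": query}]}],
        )
        answer = response["output"]["message"]["content"][0]["text"]

        # Store the generated answer so the Consumer Lambda can retrieve it
        table.put_item(
            Item={"contactId": contact_id, "query": query, "answer": answer}
        )

    return {"statusCode": 200}
```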
In parallel, the contact flow will keep executing and invoke the Consumer Lambda. This Lambda function has 2 objectives:
- Check if the results are available in the DynamoDB table.
- Wait an appropriate amount of time so that the waiting prompt is not repeated too quickly.
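Here is a simplified example in Python (a minimal sketch that reuses the assumed GenAIResults table from the previous example and a fixed five-second pause):

```python
import os
import time

import boto3

dynamodb = boto3.resource("dynamodb")

# Same assumed table name as in the Generative Lambda sketch
TABLE_NAME = os.environ.get("RESULTS_TABLE", "GenAIResults")

# Fixed pause before checking, so the waiting prompt is not repeated
# too quickly; it must stay well under the 8 second Lambda limit
WAIT_SECONDS = 5


def lambda_handler(event, context):
    contact_id = event["Details"]["ContactData"]["ContactId"]

    time.sleep(WAIT_SECONDS)

    table = dynamodb.Table(TABLE_NAME)
    item = table.get_item(Key={"contactId": contact_id}).get("Item")

    if item is None:
        # Not ready yet: the flow plays the waiting prompt and loops back
        return {"statusCode": 202}

    # Ready: the flow can play the generated answer through a Play prompt block
    return {"statusCode": 200, "answer": item["answer"]}
```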
In this example, we use the 202 status code as the response to the Invoke AWS Lambda function block. In the Contact Flow, we can leverage a Check contact attributes block (see documentation) to decide whether to route the flow to a Loop block if the result is not ready, or to play the answer through a Play prompt block.
Conclusion and proposed improvements:
With this architecture, the contact flow keeps customers engaged for tasks that take longer than the maximum Lambda timeout for a live Contact Flow, allowing you to delegate higher value, more complex tasks to a Generative AI application rather than routing them systematically to a live agent.
If you want to take this further, consider the following recommendations:
- Generate consistent question references that will allow the Consumer Lambda to re-use an existing answer for very similar questions.
- Replace the time.sleep call in the Consumer Lambda with a time-bound live polling of the DynamoDB table, to be able to exit early if the answer is generated while the function is executing (see the sketch below).
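For example, the time.sleep call could be replaced with a bounded polling helper along these lines (a sketch that reuses the assumed table from the earlier examples):

```python
import time


def wait_for_answer(table, contact_id, max_wait_seconds=5, poll_interval=0.5):
    # Poll the results table until the answer appears or the time budget
    # runs out, so the function can return early instead of always
    # sleeping for the full interval.
    deadline = time.monotonic() + max_wait_seconds
    while time.monotonic() < deadline:
        item = table.get_item(Key={"contactId": contact_id}).get("Item")
        if item is not None:
            return item["answer"]
        time.sleep(poll_interval)
    return None
```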
Please reach out if you have any questions about testing out or implementing this type of solution.
Illustration generated through Stable Diffusion - SD3 Large 1.0 Model in Amazon Bedrock. (prompt: "An artificial intelligent robot is answering the phone in a call center")
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.