Detailed summaries and high-quality content creation with genAI
Get really long, detailed, and accurate answers from generative AI models (Reference code and Streamlit application provided!)
Ignacio Sanchez Alvarado
Amazon Employee
Published Jul 8, 2024
Have you ever had trouble getting long, high-quality answers from genAI models?
As you may have noticed, AI models tend to reply with relatively short answers, which for some use cases is exactly the opposite of what we need as an output. Simply using prompt engineering (e.g. "answer me in more than X words...") usually doesn't solve the problem, either because the model ignores the instruction or because it follows it but pads the answer with inconsistent content just to meet the required number of words.
This is especially an issue for a common use case like summarization. Let's say we have a technical document of more than 100 pages and we want a detailed, 10-page summary. If we simply ask the model to summarize it, we will get only a few paragraphs that do not contain much detailed information.
Let's see an example:
In this case we asked the model to summarize "Treasure Island", a 140+ page novel by Robert Louis Stevenson. While the content is accurate, the summary is very sparse; a couple of paragraphs do not seem enough to cover the content of a 140+ page book!
Like this one, there are many other use cases that require long, high-quality content creation, and for all of them there is a method you can use to get the desired outcome. In this blog post we will focus on the summarization example, but a similar architecture can be used for other use cases with some minor modifications. Let's see how to do this.
Let's first explain the different types of summarization that we can use with generative AI.
According to LangChain, one of the most popular frameworks for generative AI application development, there are two main summarization techniques:
- Stuff: when all the content of the document fits in the model's context window, you simply pass the content and prompt the model to summarize it.
- Map-Reduce: when the content doesn't fit in the context window, this technique splits the content into several chunks, summarizes each chunk, and then a final model call creates the overall summary from the chunk summaries (a minimal code sketch of this setup follows below).
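For reference, here is a minimal sketch of how a map-reduce summarization chain can be set up with LangChain and Amazon Bedrock. The model ID, file name, and chunk sizes are illustrative assumptions, not the exact values used in this post:

```python
# Minimal map-reduce summarization sketch with LangChain + Amazon Bedrock.
# Model ID, file name, and chunk sizes are illustrative.
from langchain_aws import ChatBedrock
from langchain.chains.summarize import load_summarize_chain
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

llm = ChatBedrock(model_id="anthropic.claude-3-haiku-20240307-v1:0")

# Load the PDF and split it into chunks that fit the context window
docs = PyPDFLoader("treasure_island.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=8000, chunk_overlap=500).split_documents(docs)

# The "map" step summarizes each chunk; the "reduce" step combines the partial summaries
chain = load_summarize_chain(llm, chain_type="map_reduce")
summary = chain.invoke({"input_documents": chunks})["output_text"]
print(summary)
```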
In the first section I showed an example using the stuff technique. Let's see what an example using map-reduce looks like:
As you can see, the result is somewhat better than in the first example, but the final answer is still not long enough. Because the final summary is generated by a single model prompt, it is always going to be relatively short; as discussed in the introduction, generative AI models don't like to talk too much in their answers!
Let's talk about how we can overcome this problem. It is clear that with just one final prompt we will never get a detailed and extensive answer, so what we can do is combine several prompts to obtain our final result.
This sectioning technique is similar to map-reduce, but the main difference is that the final output is not generated by a single model prompt. Instead, we ask the model to create separate summaries for specific sections of the document, and then we join all those summaries together to form the final output.
Have a look at the diagram to understand better how this works:
There are two main calls to the generative AI model: first we ask for the main sections of the document, and then we ask the model to create a summary of each section.
Two main improvements over the previous techniques:
- Each section summary is generated with the full content of the original text as context. With map-reduce we were generating each summary only from the information in its chunk, so all that contextual information was lost.
- For the final summary we don't use a generative AI model; we simply join all the section summaries. This allows us to produce responses as long as we need.
One main downside:
- Token consumption is much higher with this technique, as we pass the whole document as input for every section summary. The good news is that we can use some of the newest small models that are really cheap and optimized for this kind of task (for example, Claude Haiku from Anthropic!). With this we can have a really cost-effective solution that provides high-quality content.
Let's explore how we can implement this type of summarization in a Python program. We will use LangChain as the generative AI application framework and Amazon Bedrock as the model provider service. We are going to use Claude 3 as the model; feel free to swap in a model of your preference.
First, we initialize the language model in LangChain:
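A minimal sketch of this initialization, assuming the langchain-aws integration and a Claude 3 model ID available in your Region (both are examples):

```python
# Initialize the Bedrock chat model in LangChain.
# The Region and model ID are examples; use any Bedrock model available to you.
import boto3
from langchain_aws import ChatBedrock

bedrock_client = boto3.client("bedrock-runtime", region_name="us-east-1")

llm = ChatBedrock(
    client=bedrock_client,
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    model_kwargs={"temperature": 0.2, "max_tokens": 4096},
)
```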
Load the document using the pypdf library:
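A sketch of the loading step, assuming the pypdf package and an example file name:

```python
# Load the PDF with pypdf and concatenate all page text into one string.
from pypdf import PdfReader

reader = PdfReader("treasure_island.pdf")  # example file name
document_text = "\n".join(page.extract_text() for page in reader.pages)
```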
We obtain the sections of the document with a first model call:
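A sketch of this first call; the prompt wording and the XML-style tags are illustrative assumptions:

```python
# First model call: ask for the main sections of the document.
# The prompt wording is illustrative.
sections_prompt = f"""Here is a document:

<document>
{document_text}
</document>

List the main sections or chapters of this document, one per line.
Return only the section titles, with no additional commentary."""

sections_response = llm.invoke(sections_prompt)
print(sections_response.content)
```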
Then, we convert the sections output into a Python list:
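Assuming the model returns one section title per line, a simple parsing sketch could look like this:

```python
# Parse the model output (one section title per line) into a Python list.
sections = [
    line.strip("-* ").strip()
    for line in sections_response.content.split("\n")
    if line.strip()
]
print(sections)
```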
And finally we obtain and join all the section summaries. Here we use Python's ThreadPoolExecutor to run all those model calls in parallel, saving a massive amount of waiting time for the final response:
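A sketch of the parallel section summarization; the number of workers and the prompt wording are illustrative assumptions:

```python
# Summarize each section in parallel and join the results into the final output.
from concurrent.futures import ThreadPoolExecutor

def summarize_section(section: str) -> str:
    # Every call gets the full document as context plus the section to focus on.
    prompt = f"""Here is a document:

<document>
{document_text}
</document>

Write a detailed summary of the section "{section}".
Use only information contained in the document."""
    return f"{section}\n\n{llm.invoke(prompt).content}"

with ThreadPoolExecutor(max_workers=8) as executor:
    section_summaries = list(executor.map(summarize_section, sections))

final_summary = "\n\n".join(section_summaries)
print(final_summary)
```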
Do you want to bring this code into a working application? No worries! I have also created a Streamlit app that provides a webpage where you can upload documents and start a high-quality summarization job.
Here is the repo link to explore the code and deploy the app in your local environment: high_quality_summarization. In the backend it runs the code shared in the previous section.
This is the final result for a high-quality summary of the same example document, the "Treasure Island" novel:
In addition to these two, there are another 35 sections in the final output. Altogether it produces a 10-page summary with detailed information about the book.
This type of content creation is especially useful for other use cases like extracting insights from technical documents. Here is the result for a 160-page study, "Energy performance certificates in buildings":
And this is just one of the 18 section summaries generated.
Generative AI models are really powerful tools if we know how to use them properly. In this example we have seen how to overcome one of the main issues with current models: the lack of detailed, lengthy responses.
By applying some prompt engineering techniques as well as some application logic, we can make multiple calls to a model and stitch the results together into a long, high-quality final response.
- Sectioning summarization video (by Sam Witteveen)
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.