Visual Analysis with GenAI - Part 1: Sentiment & Emotion
In this blog post, you'll learn how to analyze sentiment and emotional tone of images found in documents.
Ross Alas
Amazon Employee
Published Aug 12, 2024
Last Modified Aug 19, 2024
Authors: @Ross Alas, @Arya Subramanyam
In this four-part Visual Analysis with GenAI series, Arya and I will take you through different techniques, including prompt engineering, that help you gain a deeper understanding of documents as part of your intelligent document processing pipeline.
Updated August 19, 2024:
We've released Part 2 here: Visual Analysis with GenAI - Part 2: Graphs, Charts & Tables
Many PDF documents, particularly marketing materials, rely heavily on visual elements to convey sentiment and emotional tone. These documents often contain a rich mix of text, figures, tables, and images that traditional intelligent document processing tools struggle to interpret comprehensively.
When using these complex documents with Generative AI large language models (LLMs), a common pre-processing step involves converting PDFs to plaintext. However, this approach has significant limitations, especially when dealing with image-rich content. The result is often a loss of crucial visual information that contributes to the document's overall message and emotional impact.
In this blog post, we'll introduce an innovative solution to this challenge. The approach involves first converting PDF documents to images, preserving the visual elements as is. The solution uses Amazon Bedrock, a fully managed service that simplifies the development of GenAI-powered applications, in conjunction with Anthropic's Claude 3.5 Sonnet, a multi-modal large language model capable of understanding both text and images.
Multi-modal LLMs represent a significant advancement in AI technology. These models can analyze images alongside text, providing a more holistic understanding of content. By utilizing Claude 3.5 Sonnet through Amazon Bedrock, we can gain deeper insights into documents, capturing the sentiment and emotional nuances conveyed by both textual and visual elements.
This method allows us to overcome the limitations of traditional text-only processing, offering a more comprehensive analysis of complex PDF documents. Throughout this post, we'll demonstrate how this approach can unlock new possibilities in document understanding, particularly for content where visual elements play a crucial role in conveying the overall message and emotional tone.
The solution first splits the PDF document into pages and converts each page to an image using a Python library called pdf2image. These images are then included in the prompt sent to Anthropic's Claude 3.5 Sonnet hosted on Amazon Bedrock. The model responds with the sentiment and emotional tone as it extracts information from the pages.
Figure 1. Architecture overview of the solution. It takes a PDF document, splits it into pages in the form of images, then the images are used as part of the prompt sent to Bedrock, and finally the LLM hosted on Amazon Bedrock responds with the sentiment and emotional tone.
To implement this solution, ensure you have the following:
- An AWS account with an AWS Identity and Access Management (IAM) user with permissions to invoke Amazon Bedrock
- The AWS Command Line Interface (AWS CLI) installed and configured for use
- Python 3.11 or later with the AWS SDK for Python (Boto3) and pdf2image installed
- (Optional) Use virtualenv or conda to create a virtual Python environment
- (Optional) Use Jupyter Notebooks
Ensure that you have the AWS CLI installed and configured for use, as well as Python 3.11 or later. Then, install boto3 and pdf2image using pip:
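For example (assuming pip is available on your PATH):

```bash
pip install boto3 pdf2image
```

Note that pdf2image also relies on the Poppler utilities being installed on your system.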
Import boto3 and pdf2image and create the Bedrock client. Feel free to experiment with different models from the Anthropic Claude 3 and Claude 3.5 families and to adjust the temperature and max_tokens parameters.
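A minimal sketch of that setup is shown below. The model ID is the Claude 3.5 Sonnet identifier available at the time of writing, and the region and inference values are assumptions you should adjust for your own account.

```python
import boto3
from pdf2image import convert_from_path

# Bedrock Runtime client used to call the Converse API
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Claude 3.5 Sonnet model ID (swap in another Claude 3 / 3.5 model to experiment)
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

# Inference parameters you can tune
TEMPERATURE = 0.0
MAX_TOKENS = 2000
```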
This is the sample PDF (financial_freedom_achieved.pdf) that I’ve used for this blog. It’s in the form of a news article with an image of a couple at the top. You can use any of your own PDFs in this example.
Figure 2. The test document containing a picture of a couple and a news article about financial freedom.
After you have the PDF that you want to use, convert it to an array of image bytes in JPEG format.
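One way to do this with pdf2image is sketched below; the helper name is illustrative, and the snippet reuses the convert_from_path import from the previous step.

```python
import io

def pdf_to_jpeg_bytes(pdf_path: str) -> list[bytes]:
    """Convert each page of a PDF into JPEG-encoded bytes."""
    pages = convert_from_path(pdf_path)  # returns one PIL image per page
    image_bytes = []
    for page in pages:
        buffer = io.BytesIO()
        page.save(buffer, format="JPEG")
        image_bytes.append(buffer.getvalue())
    return image_bytes

image_bytes_list = pdf_to_jpeg_bytes("financial_freedom_achieved.pdf")
```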
The following is an example of how we can prompt the LLM to give the sentiment and emotion of the images found in the document. Depending on what you are looking to accomplish and the structure of your documents, you may have to do some prompt engineering to get the output you want.
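The exact prompt from the original post isn't reproduced here; the following is an illustrative starting point you can adapt.

```python
prompt = """You will be given the pages of a document as images.
For each image (photo or illustration) found in the document:
1. Briefly describe what the image shows.
2. State the overall sentiment it conveys (positive, negative, or neutral).
3. Describe the emotional tone (for example: joy, trust, calm, anxiety) and explain why.
"""
```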
Once the prompt and images have been prepared, build the Converse API call to Amazon Bedrock. Using the list of image bytes, build an image content block for each image and incorporate the text prompt. This becomes the message sent to the Converse API. Note that at the time of writing, there is a maximum of 20 images per call, a maximum of 3.75 MB per image, and a maximum resolution of 8,000 px by 8,000 px.
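A sketch of building that message with the Converse API content-block format follows, assuming the image_bytes_list and prompt variables from the earlier snippets.

```python
# Build one image content block per page, then append the text prompt
content = []
for img in image_bytes_list[:20]:  # Converse allows at most 20 images per call
    content.append({
        "image": {
            "format": "jpeg",
            "source": {"bytes": img},
        }
    })
content.append({"text": prompt})

messages = [{"role": "user", "content": content}]
```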
Using the message built above, you can now call the Converse API.
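For example, reusing the client, model ID, and inference parameters defined earlier:

```python
response = bedrock_runtime.converse(
    modelId=MODEL_ID,
    messages=messages,
    inferenceConfig={
        "temperature": TEMPERATURE,
        "maxTokens": MAX_TOKENS,
    },
)

# The model's analysis comes back as a text content block
print(response["output"]["message"]["content"][0]["text"])
```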
As you can see in the following model response, it correctly identifies the sentiment and emotion depicted in the image in the document.
Using multi-modal language models hosted on Amazon Bedrock, you can now gain deeper understanding of documents that contain images. You can create a description of each image that’s found in the document and understand the sentiment and emotions conveyed.
To learn more about using Generative AI in your applications, check out the following resources:
- Amazon Bedrock is the easiest way to build GenAI-powered applications
- Learn more about Anthropic Claude 3.5 Sonnet
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.