logo
Menu
Multi-modal Image text validation using Bedrock Claude 3

Multi-modal Image text validation using Bedrock Claude 3

The post demos a scenario to validate restaurant hours of operation based on image submission time using multi-modal model, Claude 3 on Amazon Bedrock.

Swagat Kulkarni
Amazon Employee
Published Apr 3, 2024

Introduction

In our previous article, we showcased the utilization of Amazon Rekognition and Amazon Bedrock to address a use case that involved validating the text present in an image. The objective was to determine whether a restaurant was closed or open at a specific time of the day.
In this article, we aim to utilize the recently released Claude 3 model family to perform similar tasks. The Claude 3 model family sets new industry benchmarks across a wide range of cognitive tasks. This family includes three state-of-the-art models with increasing levels of capability: Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus.
Amazon Bedrock offer support for Claude 3 Haiku and Sonnet with support for Opus coming soon at the time of writing this article.

Putting the models to test

Lets look at a concrete example. The screenshot below shows the chat playground experience within Amazon Bedrock inside AWS console. We have selected Claude 3 Sonnet to test this use case.
Chat playground
Amazon Bedrock Chat Playground
Next we open the prompt editor and upload an image which shows the restaurant hours of operation.
And the prompt used as shown here:
After hitting confirm and run, we get the following response enclosed in the xml tags:
1
2
3
4
5
6
7
8
9
10
11
<reasoning>
The image shows the hours of operation for the restaurant:
- Monday to Friday: 11am - 11pm
- Saturday and Sunday: 8am - 11pm

The given information states that the picture was submitted at 10AM on Monday.

Since 10AM on Monday falls before the opening time of 11AM on Monday, the restaurant would be considered closed at the time the picture was submitted.
</reasoning>

<outcome>Closed</outcome>
Note that the model was able to detect the text within the image with accurate hours of operation. It was able able to reason based on the time and day of the image submitted and provided an accurate outcome. Furthermore, the prompt specifically asked the model to wrap the responses in xml tags which improves readability and makes it convenient to programmatically process the output response.
In another example, we passed the following image which had no hours of operation in it. The picture only shows the name of the restaurant.
Using the same prompt, the model responded with the following output:
1
2
3
4
5
6
7
8

<reasoning>
The image does not show the hours of operation for the restaurant. It only displays the name "Mi Cocina Mexican Restaurante" on the storefront sign. There is no information provided about the restaurant's opening and closing times.

Given the lack of details regarding the restaurant's operating hours in the image, it is impossible to determine whether the restaurant would be open or closed at 10AM on Monday based solely on the information given.
</reasoning>

<outcome>Insufficient data</outcome>
The model accurately detected the absence of relevant data to determine whether the restaurant is open or closed, and came back with an outcome stating insufficient data as instructed in the prompt.
When testing with Claude 3 Haiku, the responses were not sufficiently accurate and exhibited hallucination (generating fictional or factually incorrect information) when using the same prompt. However, after switching to the more capable Claude 3 Sonnet model, no hallucinations were observed in the responses.

Conclusion

In this post, we demonstrated the power of multi-modal models to solve for complex use cases which require reasoning capabilities. It's important to note that the outcome will depend on the clarity of the prompt. It's important for the prompt to be precise, with clear and unambiguous instructions that will result in accurate responses from the models.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.