Image Text Validation using Amazon Rekognition and Bedrock
This post details a serverless image text validation solution built with AWS AI services and Amazon Bedrock.
Published Mar 5, 2024
In this article, we address a use case: validating pictures of restaurant hours of operation taken by delivery drivers to ascertain whether the restaurant is open or closed. Given the wide variety of store operating hour signs, it can be challenging for a system to perform such validation. The solution leverages Amazon Rekognition to detect the text in the pictures and the cognitive abilities of large language models (LLMs), readily available on Amazon Bedrock.

Architecture Overview
The diagram above provides a high-level overview of the end-to-end architecture powering this use case.
- A mobile client app uploads the image along with metadata (including the restaurant name, driver ID, driver name, and the timestamp when the picture was taken) to an S3 bucket.
- The picture upload triggers an event, which in turn invokes a Lambda function.
- The Lambda function initiates the orchestration flow, first invoking the DetectText API in Amazon Rekognition.
- The API response contains the detected text, in this case, the restaurant's hours of operation.
- Next, the Lambda function invokes an LLM, passing it the text extracted in the previous step along with the metadata containing the time the picture was taken.
- The LLM processes the information provided in the prompt and, in response, indicates whether the restaurant is open or closed, along with its reasoning.
- Finally, the LLM response, along with the metadata, is stored in an Amazon DynamoDB table for analysis. A sketch of this orchestration flow follows the list.
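To make the flow concrete, here is a minimal sketch of such a Lambda handler using boto3. This is not the repository's code: the table name, metadata keys, model ID, and prompt wording are illustrative assumptions.

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")
rekognition = boto3.client("rekognition")
bedrock = boto3.client("bedrock-runtime")
dynamodb = boto3.resource("dynamodb")

# Assumed table name; the actual sample may use a different one.
TABLE_NAME = "RestaurantHoursValidation"


def lambda_handler(event, context):
    # The S3 event notification identifies the uploaded picture.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Metadata attached by the mobile client (keys here are assumptions).
    head = s3.head_object(Bucket=bucket, Key=key)
    metadata = head.get("Metadata", {})
    taken_at = metadata.get("timestamp", "unknown")

    # Extract the sign text with the Rekognition DetectText API.
    detections = rekognition.detect_text(
        Image={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    sign_text = "\n".join(
        d["DetectedText"] for d in detections["TextDetections"] if d["Type"] == "LINE"
    )

    # Ask Claude v2 on Bedrock whether the restaurant is open at that time.
    prompt = (
        f"\n\nHuman: A delivery driver photographed this restaurant hours sign:\n"
        f"{sign_text}\n"
        f"The photo was taken at {taken_at}. Is the restaurant open or closed "
        f"at that time? Explain your reasoning.\n\nAssistant:"
    )
    llm = bedrock.invoke_model(
        modelId="anthropic.claude-v2",
        body=json.dumps({"prompt": prompt, "max_tokens_to_sample": 300}),
    )
    verdict = json.loads(llm["body"].read())["completion"]

    # Persist the verdict together with the metadata for later analysis.
    dynamodb.Table(TABLE_NAME).put_item(
        Item={"image_key": key, "verdict": verdict, **metadata}
    )
    return {"statusCode": 200, "body": verdict}
```

Because the handler goes through the bedrock-runtime client, swapping in a different Bedrock model is a matter of changing the modelId and request body format.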
NOTE - The code sample is available in the amazon-bedrock-samples git repository here.

We used the one-shot prompting technique to prompt the LLM to provide accurate responses. One-shot prompting is a technique that includes a single worked example in the prompt to guide the model's output. Here is an example of the one-shot prompt used in this solution with Anthropic's Claude v2 model.
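The sketch below assumes Claude v2's Human/Assistant format; the sign text, placeholders, and wording are illustrative, not the exact prompt from the repository.

```
Human: You will be given text extracted from a photo of a restaurant's hours
of operation sign, plus the time the photo was taken. Decide whether the
restaurant was open or closed at that time and explain your reasoning.

Example:
Sign text: Mon-Fri 9:00 AM - 9:00 PM; Sat-Sun 10:00 AM - 6:00 PM
Photo taken: Saturday, 7:30 PM
Answer: Closed. Saturday hours end at 6:00 PM, which is before 7:30 PM.

Sign text: {extracted_text}
Photo taken: {photo_timestamp}
Answer:

Assistant:
```

The single worked example shows the model both the expected answer format and the kind of reasoning to apply to the placeholders that follow.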
If the text in the image is illegible or unclear, the LLM will indicate that it cannot determine whether the restaurant is open or closed due to insufficient information. This behavior can aid in identifying potentially fraudulent situations, such as when an image is posted by an untrustworthy individual.
In this post, we demonstrated a sample solution for validating text in images, focusing on restaurant operating hours. Nonetheless, this architectural approach can be adapted to other scenarios that require extracting and validating text from images using Amazon Rekognition and Bedrock.
This post has been co-authored with Tony Howell (anhwell).