Automate parsing of PDFs, xlsx files, images, and other document formats into structured JSON, leveraging LLMs on Amazon Bedrock
This post presents a fully serverless solution for automating the parsing of unstructured and semi-structured documents and images into predetermined, structured JSON output. The deterministic structured JSON can then be integrated with your downstream systems for further processing or for taking actions. The solution leverages AWS Lambda for serverless compute and Amazon Bedrock for serverless access to a variety of multimodal Large Language Models (LLMs).
Tamer Soliman
Amazon Employee
Published Jul 28, 2024
Last Modified Jul 29, 2024
Parsing semi-structured and unstructured documents and images, while presenting the output in structured JSON or a similar format, often presents a challenge. Those documents can be of different types and formats, e.g. xlsx, doc, html, or jpeg. In addition, they can be totally unstructured, such as plain text, or semi-structured, such as PDF or XLS files. The semi-structured documents can, at times, have a structure that cannot be predetermined, or that is dynamic, e.g. third-party forms.
Why is structured output important:
Once we have structured, JSON-formatted output, companies can pass it to their next-stage systems for further processing in an automated document processing pipeline. Such a pipeline can facilitate automating order processing, logging and paying invoices, making a reservation based on an email request, or loading a parsed, scanned expense receipt into an expense report; in aggregate, it can also be used to derive business insight from social media feeds or email interactions.
What are the challenges then:
There are challenges in processing unstructured and semi-structured documents and images using traditional methods, especially if the structure is not predetermined or static. Let's take a few examples to better understand these challenges.
If your forms are semi-structured, static, and deterministic, you can go all the way and build a programmed parser. The challenge is that every time the form layout is updated, you must update your parser, which adds unnecessary and undesired engineering overhead.
If your forms are a bit more dynamic, you can use traditional ML methods, train a model, and improve accuracy over time. With every change in form layout, depending on the nature and extent of the changes, you may need to do additional model re-training. If you don't want to train your own models, you can use ready-to-use AI services like Amazon Textract and Amazon Comprehend to alleviate some of this burden.
If your documents or forms are often dynamic, do not adhere to a predetermined structure, or have no structure at all (e.g. an email or a text file), or if they are third-party documents or images that you need to process yet have no control over how they look and how often their layout changes (e.g. an invoice, a receipt, a purchase order), you may not do very well with traditional ML methods, and hard coding certainly won't help.
Leveraging GenAI:
With the evolution of Large Language Models (LLMs), trained on a very large amount of data in almost every discipline, and the introduction of multimodality support, more specifically text and vision capabilities, we have a new tool in our arsenal. Leading LLMs like Anthropic's Claude 3 and 3.5 families have demonstrated a strong ability for complex logical reasoning across both text and vision. This, combined with the extensive amount of data used to train the base Foundation Model (FM), makes LLMs a very appealing tool for the unstructured and dynamic semi-structured document challenge we are up against.
In this blog, I address a use case where your documents and images may be a combination of unstructured and semi-structured files. They may also be forms you know, or third-party forms whose layout is unpredictable. We will talk about how to leverage LLMs to address these challenges and extract the insights we need from those documents, regardless of document type, structure, or dynamic nature. Our goal is to produce predetermined, JSON-formatted output of our choice.
In our approach today, we will focus on leveraging the capabilities of multimodal Large Language Models (LLMs), more specifically the text and vision modalities. I used Claude 3 Sonnet for my testing and it proved very reliable, with a high degree of accuracy. Claude 3.5 Sonnet would likely be faster and more accurate with the task at hand; I just did not have access to the model during my testing. I recommend trying it out if you have access to it. Depending on the complexity of your use case, you might even be able to use a lighter model, like Claude 3 Haiku, and get the accuracy level you are targeting at a better price and performance point.
Leveraging Amazon Bedrock, you have access to all these models and more with just a change of the target model_id in your API call. In our solution we will also explore the Amazon Bedrock Runtime's new Converse API.
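To give a sense of what that looks like, here is a minimal sketch of a Converse API call; the prompt text and model ID shown are placeholders, and swapping models is a one-line change:

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# Swapping models is just a matter of changing the model ID string,
# e.g. to a Claude 3 Haiku model ID, with no other code changes.
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

response = bedrock_runtime.converse(
    modelId=model_id,
    messages=[{"role": "user", "content": [{"text": "Summarize this receipt as JSON."}]}],
    inferenceConfig={"maxTokens": 1024, "temperature": 0},
)
print(response["output"]["message"]["content"][0]["text"])
```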
For orchestration, we will use AWS Lambda as our serverless compute. Combining that with Bedrock for serverless access to LLMs, we are looking at a complete end-to-end serverless solution.
In addition, we will leverage the native Lambda integration with Amazon S3 to automate triggering the flow. The diagram above shows the high-level architecture. Basically, your user or app drops a new file into S3, in any of the supported formats, and the Lambda function prepares the file and presents it to Claude 3 Sonnet on Bedrock using the Converse API. Once the response is received, the function prepares the output and stores the JSON-formatted file in the output S3 bucket you specify.
Additional Challenges:
The Claude 3 family of models is built with multimodality from the ground up, with two currently supported modalities, text and vision. As we have a variety of source document formats, we will need to a) choose the modality most suitable for processing each of these document types, and b) prepare our documents to be presented to Claude based on the modality selected.
Let's take an example: a file of type xlsx. We first need to decide which modality would produce the best parsing outcome. As an architect, I started experimenting with different options. I first thought converting the xlsx file into an image would present Claude with the most intact content, so I took an image of an xlsx-formatted receipt and presented it to Claude as image input. I then tried converting the xlsx file into a comma-separated text file and presented it to Claude as text input, using the text modality. Last, I tried using pandas to load the file as a dictionary, flatten it, and present it to Claude as text input. Comparing the results of the three approaches, using the comma-separated format and presenting it to Claude as text input produced not just the most accurate result of the three, but a very reliable one as well. I was about to embark on a journey to experiment with every other format, pdf, html, etc.
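For illustration, the comma-separated conversion can be as simple as the sketch below (the file path is a placeholder, and this reflects the experiment described above rather than an exact recipe):

```python
import pandas as pd

# One way to flatten an xlsx receipt into comma-separated text before
# sending it to the model as plain text input (pre-Converse-API approach).
df = pd.read_excel("receipt.xlsx", header=None)   # hypothetical local file
csv_text = df.to_csv(index=False, header=False)   # comma-separated string for the prompt
```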
Before taking on all that work, I decided to test out the Converse API instead of the InvokeModel API. With the Converse API, all I had to do was present the xlsx file in bytes as a document file, and I got the same level of accuracy. Basically, the Converse API took care of preparing and presenting the file to the model in the most optimal way based on the file type. This finding changed my approach and made my code much simpler, and I will explain why. Had I continued to use the InvokeModel API, for every document type I would have had to find the right tool to convert it into text input without losing too much of the layout information and its impact on the data. This would have been a lengthy, inefficient process and would have resulted in much more complex code. Now let me take you on a journey to explore the Converse API and how it simplifies all of that.
Using the Converse API:
The Bedrock Runtime Converse API provides a consistent interface that works with all models that support messages. This allows you to write code once and use it with different models. One of the appealing features of the Converse API for our specific use case is the ability to include images or documents in your message along with your text prompt input; on that we will dive a little deeper.
The Converse API introduces the ability to include a variety of image and document formats. You present them through the API, and the API takes care of preparing them for the model in the format the model expects. In our use case with Claude 3, we are looking at either text or image input. Let's take a look at how that works.
The ImageBlock data type allows you to introduce an image to the model alongside your text prompt input. It supports any of the following image formats: png | jpeg | gif | webp. With a DocumentBlock, on the other hand, you can introduce any of the following document types to the model: pdf | csv | doc | docx | xls | xlsx | html | txt | md. Here is an example of how your message might look with each block type.
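The sketch below shows both variants (the prompt text and local file paths are placeholders):

```python
# Raw bytes read straight from the source files (hypothetical local paths).
with open("receipt.jpg", "rb") as f:
    image_bytes = f.read()
with open("receipt.xlsx", "rb") as f:
    doc_bytes = f.read()

# A user message carrying an image alongside the text prompt (ImageBlock):
image_message = {
    "role": "user",
    "content": [
        {"text": "Extract the line items from this receipt as JSON."},
        {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
    ],
}

# The same idea with a DocumentBlock, e.g. for an xlsx file:
document_message = {
    "role": "user",
    "content": [
        {"text": "Extract the line items from this receipt as JSON."},
        {"document": {"format": "xlsx", "name": "receipt", "source": {"bytes": doc_bytes}}},
    ],
}
```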
Now that we have learned the fundamentals and the tools we will need to use, let's get building!
In this example we will use expense receipt processing as a sample use case. The same method can apply to invoice processing, form processing, email processing, or any other use case that requires parsing an unstructured or semi-structured document or image into predetermined, JSON-formatted output. All you will need to do is change the system prompt to reflect your use case, and the JSON output format to represent the desired structure and keys your use case requires.
In this example, the function we will build together will be able to handle any of the following document formats [pdf | csv | doc | docx | xls | xlsx | html | txt | md] and any of the following image formats [png | jpeg | gif | webp]. It will automatically identify the document or image format based on the source file extension, and prepare the document to be presented to Claude 3 Sonnet, the model we are using in this code example.
We will be using Python 3.12, and since we will be using the Bedrock Converse API, we also need to make sure our boto3 version is 1.34.116 or higher. A Lambda layer containing an updated version of boto3 may be required if the Python runtime you are using does not include the minimum required boto3 version. You can run a quick check like the one below to print your boto3 version from within your Lambda code.
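A minimal sketch of such a check:

```python
import sys

import boto3

# Quick runtime check: the Converse API needs boto3 1.34.116 or newer.
print(f"Python: {sys.version}")
print(f"boto3: {boto3.__version__}")
```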
The automated document processing solution we are building assumes your user or app places the file to be analyzed, in order to produce your desired structured, JSON-formatted output, into an S3 bucket/folder. We will refer to these as the source bucket and source folder. Once a new file is placed, an S3 event is generated that we will use to trigger our Lambda function. The Lambda function takes care of preparing the file and uses the Converse API to access Claude 3 Sonnet on Bedrock, presenting the file to the model along with the system prompt, inference configuration, and prompt details. The model's output is returned to the Lambda function in JSON format, processed, and then stored in our S3 output bucket/folder.
Now we go about building the solution. We start by creating the Lambda function, adding the necessary permissions, and configuring an S3 trigger. We will create a new Lambda function and ensure the Lambda execution role has the proper permissions to access both the S3 source folder and the S3 output folder you will be using. The Lambda execution role will also need permission for bedrock:InvokeModel in order to use the Converse API. Once the Lambda function is created and the execution role has the right permissions, we can go ahead and configure our function trigger. To receive triggers from S3 when a source file is loaded or updated, we can use the Lambda console: simply select "Add trigger", add the S3 source bucket and folder, select the trigger events you would like included, and save; you are good to go. To reduce the risk of running into a loop, where the output file triggers a new Lambda execution, it is recommended to use a separate S3 bucket to store your function output.
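As a rough sketch, the execution role policy could include statements along these lines (the bucket names, prefixes, and region scoping are placeholders to adapt to your environment):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-source-bucket/incoming/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::my-output-bucket/processed/*"
    },
    {
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel"],
      "Resource": "arn:aws:bedrock:*::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0"
    }
  ]
}
```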
Now that we have the groundwork done, let's build our Lambda!
Here is the step-by-step procedure for constructing our Lambda function code. I have also included a full code example at the end.
- Import the necessary Python libraries, including json, base64, tempfile, sys, boto3, and os.
- Set up the AWS client for the Amazon Bedrock Runtime service and set up the inference configuration parameters.
- Define the system prompt that instructs the AI model to act as an accounting expert specialized in analyzing expenses and receipts. The prompt provides instructions on how to extract information from receipts and how to format the output as JSON. This can be adjusted to your use case.
- Define the text prompt; in this example I am calling it user_input. Here we give additional instructions and define the desired JSON output format. This, too, can be adjusted to your use case.
- Define supported document formats (e.g., PDF, Word, Excel) and image formats (e.g., PNG, JPEG, GIF).
- Set up the target output bucket and folder names. This is where the output JSON will be stored.
- Define our lambda_handler and parse the S3 event trigger data. We will extract the bucket name, file key, file name, and file extension from the event data triggered by an S3 file upload.
- Download the source file from S3 to the function's temp directory.
- Prepare the file for processing and construct the prompt message. Depending on the file type, we will read the file content into memory as either a document file or an image file, then construct a message object containing the user input and the file content (either a document or an image).
- The function then invokes the specified Amazon Bedrock model (in our case, anthropic.claude-3-sonnet-20240229-v1:0) using the Converse API, passing the system prompt, the prepared message, and the inference configuration.
- Next, we parse the response from the Bedrock model, extracting the output text, stop reason, and token usage.
- Construct the output as a JSON object containing the output text, stop reason, and token usage. In this step we also construct the output file name to be the same as the input file name, but with a .json extension.
- Write the output JSON to a new file in the specified output S3 bucket and folder.
- Finally, the function returns its final output.
A full Lambda code example follows. It is worth noting that although this is functional code, it is not meant for production use as is; it is a sample you can use for POC testing and to guide your own development.
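The sketch below shows one way the steps above could come together; the prompts, bucket names, JSON keys, and inference settings are illustrative placeholders to adapt to your use case:

```python
import json
import os
import urllib.parse

import boto3

# Clients and inference configuration (values are illustrative).
bedrock_runtime = boto3.client("bedrock-runtime")
s3 = boto3.client("s3")

MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"
INFERENCE_CONFIG = {"maxTokens": 2000, "temperature": 0, "topP": 0.9}

# System prompt: adjust to your own use case.
SYSTEM_PROMPT = (
    "You are an accounting expert specialized in analyzing expenses and receipts. "
    "Extract the requested fields from the provided receipt and respond with valid JSON only."
)

# Text prompt defining the desired JSON structure: adjust the keys to your use case.
USER_INPUT = (
    "Extract the merchant, date, currency, line_items (description, quantity, price), "
    "and total from this receipt. Respond with a single JSON object using exactly those keys."
)

# Formats the Converse API accepts as DocumentBlock / ImageBlock input.
DOC_FORMATS = ["pdf", "csv", "doc", "docx", "xls", "xlsx", "html", "txt", "md"]
IMAGE_FORMATS = ["png", "jpeg", "gif", "webp"]

# Output location (placeholder names; use a bucket separate from the source).
OUTPUT_BUCKET = "my-output-bucket"
OUTPUT_PREFIX = "processed/"


def lambda_handler(event, context):
    # 1. Parse the S3 event trigger data: bucket, key, file name, and extension.
    record = event["Records"][0]
    source_bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
    file_name = os.path.basename(key)
    base_name, extension = os.path.splitext(file_name)
    extension = extension.lstrip(".").lower()
    if extension == "jpg":
        extension = "jpeg"

    # 2. Download the source file to the Lambda temp directory and read its bytes.
    local_path = os.path.join("/tmp", file_name)
    s3.download_file(source_bucket, key, local_path)
    with open(local_path, "rb") as f:
        file_bytes = f.read()

    # 3. Build the Converse message with a document or image content block.
    if extension in DOC_FORMATS:
        file_block = {"document": {"format": extension, "name": "source-document",
                                   "source": {"bytes": file_bytes}}}
    elif extension in IMAGE_FORMATS:
        file_block = {"image": {"format": extension, "source": {"bytes": file_bytes}}}
    else:
        raise ValueError(f"Unsupported file extension: {extension}")

    messages = [{"role": "user", "content": [{"text": USER_INPUT}, file_block]}]

    # 4. Invoke the model through the Converse API.
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        system=[{"text": SYSTEM_PROMPT}],
        messages=messages,
        inferenceConfig=INFERENCE_CONFIG,
    )

    # 5. Parse the response: output text, stop reason, and token usage.
    output_text = response["output"]["message"]["content"][0]["text"]
    result = {
        "output": output_text,
        "stop_reason": response["stopReason"],
        "usage": response["usage"],
    }

    # 6. Write the output JSON using the input file name with a .json extension.
    output_key = f"{OUTPUT_PREFIX}{base_name}.json"
    s3.put_object(
        Bucket=OUTPUT_BUCKET,
        Key=output_key,
        Body=json.dumps(result, indent=2).encode("utf-8"),
        ContentType="application/json",
    )

    # 7. Return the final output.
    return {"statusCode": 200, "body": json.dumps({"output_file": output_key})}
```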
Happy hacking!
There are many ways you can enhance, augment, map, and validate your JSON output before passing it to next-stage systems. The goal is to increase your parsing confidence level for critical tasks, like processing payments. Here are a few common validation techniques and tools:
You can leverage Bedrock's model optionality and run parallel queries against two different LLMs, then compare the two outputs, something that can also be done with an LLM call or, in some cases, a simple programmatic check. If the outputs are identical, you can accept the JSON as is; if not, you can kick the item to a review queue. A good example: you are parsing a PO or receipt, and parsing it with two different LLMs produced different dollar amounts.
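As a sketch, reusing the messages and system prompt built in the Lambda example above, such a dual-model check could look like the following (the Haiku model ID and the compared field are illustrative):

```python
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")


def parse_with(model_id, messages, system_prompt):
    """Run the same parsing prompt against a given model and return its JSON output."""
    response = bedrock_runtime.converse(
        modelId=model_id,
        system=[{"text": system_prompt}],
        messages=messages,
        inferenceConfig={"maxTokens": 2000, "temperature": 0},
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])


# messages and SYSTEM_PROMPT are assumed to be built as in the Lambda sketch above.
result_a = parse_with("anthropic.claude-3-sonnet-20240229-v1:0", messages, SYSTEM_PROMPT)
result_b = parse_with("anthropic.claude-3-haiku-20240307-v1:0", messages, SYSTEM_PROMPT)

# Programmatic comparison on a critical field.
if result_a.get("total") == result_b.get("total"):
    accepted = result_a        # amounts agree: accept the parse
else:
    accepted = None            # amounts differ: route the item to a human review queue
```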
Let's say we are processing a PO and need to make sure the customer name matches an existing customer in our customer database. In this case we can take the customer name from our output JSON, run a semantic or SQL search against our customer database, and then use an LLM to validate the match and map back to the actual customer entity. A good example: our parsed customer name appears as John J. Smith with address X, and our customer database has John Smith with a similar address. In this case it is obvious that we are dealing with the same customer, and we can remap our JSON output to the exact customer name in our database. The step of reasoning through a match and remapping or updating our JSON output to match our database can also be done with an LLM call. Our prompt would instruct the LLM to reason through the data, determine whether there is a match, and if so map the parsed value to the actual value in an updated JSON. In our example, John J. Smith and John Smith at the same address are very likely the same person. Once verified, and during the same LLM call, we can update our JSON.
It is worth noting that automating search validation can be done within our flow by calling an external system through tool use ("Call a tool with the Converse API") as part of your workflow. Tool use is basically the equivalent of function calling.
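A sketch of what that could look like with the Converse API's toolConfig parameter, reusing bedrock_runtime, MODEL_ID, and messages from the earlier sketches (the tool name and input schema are assumptions for illustration):

```python
# Tool definition the model can call to validate a parsed customer name
# against our customer database (hypothetical tool; adapt the schema to your data).
tool_config = {
    "tools": [
        {
            "toolSpec": {
                "name": "lookup_customer",
                "description": "Search the customer database for the closest matching customer record.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "customer_name": {
                                "type": "string",
                                "description": "Customer name parsed from the document",
                            }
                        },
                        "required": ["customer_name"],
                    }
                },
            }
        }
    ]
}

response = bedrock_runtime.converse(
    modelId=MODEL_ID,
    messages=messages,
    toolConfig=tool_config,
)

# When the model decides to call the tool, stopReason is "tool_use" and the
# request arrives as a toolUse content block; your code runs the lookup and
# returns the result to the model in a follow-up toolResult message.
if response["stopReason"] == "tool_use":
    tool_requests = [
        block["toolUse"]
        for block in response["output"]["message"]["content"]
        if "toolUse" in block
    ]
```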
With a few lines of code, leveraging S3 event triggers, Lambda for serverless compute, and Bedrock for serverless LLM access with model optionality, we can build a fully automated, fully serverless, and deterministic document processing solution in under an hour!
Leveraging the Bedrock Runtime Converse API, with some simple code we can prepare a variety of document types and image formats to present to multimodal LLMs using the Message construct and a ContentBlock.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.