Transcribing PDF with tables and graphs to Markdown with GenAI

Do you want to build your own virtual assistant, but your knowledge base is in complex PDF documents? This post is for you!

Published Jul 2, 2024
Last Modified Jul 12, 2024
Disclaimer: I'm an AWS Solutions Architect and this post represents my own opinion.

Introduction

In today's world we are all rushing to build Generative AI-powered tools and assistants, but the foundation of these tools usually comes down to your data. I have seen many projects get stuck on how to structure their documents so they can be fed into large language models (LLMs), and processing them is not always easy.
Traditionally you would first tackle this problem with classical pdf-to-text tools (there are several available online), but you may face two challenges with them: 1/ tabular data gets all messed up, and 2/ non-text data is simply lost. Sometimes you face an even worse challenge: your PDF may contain scanned text that pdf-to-text tools can't catch. In those scenarios you would typically fall back to OCR, but running OCR over complex documents is an engineering problem of its own.
Here I want to propose an alternative way to process the PDF documents your use case needs to feed from, leveraging the multi-modal capability of LLMs.

Solution

The solution leverages Amazon Bedrock's capability to process images and output text from them; in particular, I use Anthropic's Claude 3 Sonnet for my use cases with great success.
Solution Architecture

Transform PDF to Images

To transform PDFs to images I leveraged the Python package pdf2image, which makes this task as easy as one line of code.
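
A minimal sketch of that step; the file name, DPI, and output naming here are just illustrative:

```python
from pdf2image import convert_from_path

# Convert every page of the PDF into a PIL Image (requires poppler on the system).
pages = convert_from_path("document.pdf", dpi=200)

# Optionally persist each page as a PNG to feed into the next step.
for i, page in enumerate(pages, start=1):
    page.save(f"page_{i}.png", "PNG")
```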

Build Claude 3 message

Claude 3 needs a particular syntax when you feed it images; this is described in their messages documentation. In a nutshell, you need to transform the image to base64 and then build a specific JSON structure.
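
Here is a minimal sketch of that message structure for the Anthropic Messages format; the helper name and file handling are my own:

```python
import base64


def build_image_message(image_path: str, prompt: str) -> dict:
    """Build a single user message containing one page image plus an instruction."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_b64,
                },
            },
            {"type": "text", "text": prompt},
        ],
    }
```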

Calling Amazon Bedrock

When calling Bedrock, there are a few tips to keep in mind (a sketch combining them follows the list).
  1. You will want to increase the default read_timeout of AWS's SDK, as the Bedrock answer may take a while if you are processing several images together. You can do this with the config = Config(read_timeout=1000) line.
  2. It's helpful to feed the LLM an initial keyword to force it to skip the small talk and jump straight to the output. I do this by pre-filling the assistant turn with the "```markdown" hint.
  3. Claude 3 Sonnet has a maximum output of 4,096 tokens. If your images contain more data than that, you will need to call the LLM again, passing the output so far so it can resume from where it stopped.
  4. The prompt is rather simple:
    You are an assistant that helps transcribe a PDF to Markdown format. The user will feed you images that represent the pages of the PDF and you will generate Markdown format out of it.
    If there are graphs or tables in the image, generate Markdown tables with the data contained inside those resources.
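
Putting those tips together, here is a minimal sketch of the Bedrock call. The model ID and request body follow the Anthropic Messages format on Bedrock; the function name and the single-call structure are my own simplification:

```python
import json

import boto3
from botocore.config import Config

SYSTEM_PROMPT = (
    "You are an assistant that helps transcribe a PDF to Markdown format. "
    "The user will feed you images that represent the pages of the PDF and you will "
    "generate Markdown format out of it. If there are graphs or tables in the image, "
    "generate Markdown tables with the data contained inside those resources."
)

# Tip 1: raise the read timeout, since the response can take a while for several images.
bedrock = boto3.client("bedrock-runtime", config=Config(read_timeout=1000))


def transcribe_pages(image_messages: list[dict]) -> str:
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "system": SYSTEM_PROMPT,
        # Tip 2: pre-fill the assistant turn so the model jumps straight to the Markdown.
        "messages": image_messages
        + [{"role": "assistant", "content": "```markdown"}],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps(body),
    )
    result = json.loads(response["body"].read())
    return result["content"][0]["text"]
```

For tip 3, you would check whether the response stopped because of the token limit (the stop_reason field) and, if so, call the model again with the text generated so far appended to the assistant pre-fill, so it resumes from there.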
And that's it! Here is a sample run.

Sample

Sample PDF
Here is the transformed text

Caveats

Claude 3 accepts a maximum of 20 images in a single prompt. Depending on your document, you may want to send one image per prompt, or several images per prompt and then merge the results, as sketched below.
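
As a tiny illustration of the batching approach, a helper like this (the name and batch size handling are my own) splits the page images into groups of at most 20 before sending each group to the model:

```python
def chunk_pages(pages: list, batch_size: int = 20) -> list[list]:
    """Split the page images into batches Claude 3 can accept in a single prompt."""
    return [pages[i:i + batch_size] for i in range(0, len(pages), batch_size)]
```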

Costs

Estimating the token cost of images is not trivial and differs for each LLM. For Claude 3, the vision documentation gives the following guide: tokens ≈ (width px × height px) / 750.
In my case, transcribing this two-page PDF cost about US$0.024.
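
As a rough back-of-the-envelope sketch of that guide in code; the per-token price below is an assumption for Claude 3 Sonnet input at the time of writing, so check the current Bedrock pricing page:

```python
def estimate_image_cost(width_px: int, height_px: int,
                        price_per_1k_input_tokens: float = 0.003) -> float:
    """Approximate the input cost of one image using the (width * height) / 750 rule of thumb.

    The default price is an assumption; check the Bedrock pricing page for current numbers.
    """
    tokens = (width_px * height_px) / 750
    return tokens / 1000 * price_per_1k_input_tokens


# e.g. a 1200x1600 px page image is roughly 2,560 tokens, about US$0.0077 of input.
print(estimate_image_cost(1200, 1600))
```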

Summary

Leveraging Amazon Bedrock's multi-modal capabilities, you can process complex PDFs by transforming them to images and generating text from them.