Reading complex .pdf files using Amazon Bedrock

Claude 2.1 is becoming the default for many customers to prompt and analyze complex data sources.

Albert
Amazon Employee
Published Mar 28, 2024
Bedrock recently launched support for Claude 3 models, so this guide is a bit outdated: you can now feed screenshots of .pdfs as prompts for Claude to parse and analyze. I'll update this soon.
A growing use case I've seen from users is using Bedrock to prompt a foundation model with data pulled from .pdfs stored in S3. We did release an example app showing how you can get any FM on Bedrock to analyze a .pdf file you upload. However, I want to walk through how you can do this under the hood using Claude 2.1, which gives reliable responses on external data when the prompt is structured well.
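As a rough illustration of what a well-structured prompt can look like, the sketch below keeps the instructions and the extracted document data apart by wrapping the data in tags. The `build_prompt` helper, the `<document>` tag name, and the sample table are assumptions made for this example, not part of the original app.

```python
def build_prompt(instructions: str, document_xml: str) -> str:
    """Wrap extracted document data in tags so it stays separate from the instructions."""
    return (
        f"{instructions}\n\n"
        "<document>\n"
        f"{document_xml}\n"
        "</document>"
    )


# Made-up sample data standing in for the XML produced later in this guide.
prompt = build_prompt(
    "You are my accountant. Return any invoice that exceeds $10,000.",
    "<table><row><cell>INV-001</cell><cell>12500.00</cell></row></table>",
)
print(prompt)
```

Keeping the data inside its own tag makes it easier to tell the model which part is the question and which part is the material to read.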
We'll start with structuring our directory.
[Image: repository structure in Visual Studio Code, with lambda.py, requirements.txt, and utils/xml_conversion.py]
In my lambda.py, I'll pull in the .pdf from S3 and use the utility module for XML conversion.
import json
import xml.etree.ElementTree as ET

import boto3

from utils.xml_conversion import dict_to_xml

s3_client = boto3.client('s3')
textract_client = boto3.client('textract')
# Call Bedrock through boto3 rather than shelling out to the AWS CLI,
# which is not installed in the default Lambda runtime.
bedrock_client = boto3.client('bedrock-runtime', region_name='us-east-1')


def lambda_handler(event, context):
    bucket_name = 'your-bucket-name'
    file_key = 'path/to/your/file.pdf'

    # Download the .pdf from S3 as raw bytes.
    response = s3_client.get_object(Bucket=bucket_name, Key=file_key)
    pdf_file = response['Body'].read()

    # Synchronous call: suited to single-page documents. Multi-page .pdfs
    # need the asynchronous start_document_analysis API instead.
    response = textract_client.analyze_document(
        Document={'Bytes': pdf_file},
        FeatureTypes=['TABLES']
    )

    # Note: TABLE blocks carry geometry and child IDs; the cell text itself
    # lives in the related CELL and WORD blocks.
    tables = [block for block in response['Blocks'] if block['BlockType'] == 'TABLE']

    # Convert each table block to XML and concatenate the results.
    xml_data = ''
    for table in tables:
        xml_root = dict_to_xml(table, 'table')
        xml_data += ET.tostring(xml_root, encoding='unicode')

    prompt = (
        "You are my accountant. Please read this document of invoices that "
        "contains many tables and return any invoice that exceeds $10,000.\n"
        f"{xml_data}"
    )

    # The Messages API on Bedrock requires anthropic_version and max_tokens.
    body = json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': 1024,
        'messages': [
            {'role': 'user', 'content': [{'type': 'text', 'text': prompt}]}
        ]
    })

    # anthropic.claude-v2:1 is the model ID for Claude 2.1.
    response = bedrock_client.invoke_model(
        modelId='anthropic.claude-v2:1',
        body=body
    )
    result = json.loads(response['body'].read())

    return {
        'statusCode': 200,
        'body': json.dumps(result['content'][0]['text'])
    }
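One thing worth knowing about the Textract response: a TABLE block points to its CELL blocks by ID, and each CELL points to WORD blocks that hold the text. The sketch below shows one way to follow those links and rebuild the rows, assuming the usual `Relationships`, `RowIndex`, `ColumnIndex`, and `Text` fields. The `table_rows` helper and the sample blocks are made up for illustration and are not part of the original code.

```python
def table_rows(table_block, blocks_by_id):
    """Return a table as a list of rows, each row a list of cell strings."""
    cells = {}
    for rel in table_block.get('Relationships', []):
        if rel['Type'] != 'CHILD':
            continue
        for cell_id in rel['Ids']:
            cell = blocks_by_id[cell_id]
            if cell['BlockType'] != 'CELL':
                continue
            # Collect the text of every WORD block the cell points to.
            words = [
                blocks_by_id[word_id]['Text']
                for cell_rel in cell.get('Relationships', [])
                if cell_rel['Type'] == 'CHILD'
                for word_id in cell_rel['Ids']
                if blocks_by_id[word_id]['BlockType'] == 'WORD'
            ]
            cells[(cell['RowIndex'], cell['ColumnIndex'])] = ' '.join(words)
    if not cells:
        return []
    n_rows = max(row for row, _ in cells)
    n_cols = max(col for _, col in cells)
    # Textract row and column indexes start at 1.
    return [
        [cells.get((row, col), '') for col in range(1, n_cols + 1)]
        for row in range(1, n_rows + 1)
    ]


def _cell(cell_id, row, col, word_id):
    """Build a hand-written CELL block for the sample below."""
    return {'Id': cell_id, 'BlockType': 'CELL', 'RowIndex': row, 'ColumnIndex': col,
            'Relationships': [{'Type': 'CHILD', 'Ids': [word_id]}]}


# Hand-written sample blocks shaped like a trimmed Textract response.
blocks = [
    {'Id': 't1', 'BlockType': 'TABLE',
     'Relationships': [{'Type': 'CHILD', 'Ids': ['c1', 'c2', 'c3', 'c4']}]},
    _cell('c1', 1, 1, 'w1'), _cell('c2', 1, 2, 'w2'),
    _cell('c3', 2, 1, 'w3'), _cell('c4', 2, 2, 'w4'),
    {'Id': 'w1', 'BlockType': 'WORD', 'Text': 'Invoice'},
    {'Id': 'w2', 'BlockType': 'WORD', 'Text': 'Amount'},
    {'Id': 'w3', 'BlockType': 'WORD', 'Text': 'INV-001'},
    {'Id': 'w4', 'BlockType': 'WORD', 'Text': '12500.00'},
]
blocks_by_id = {block['Id']: block for block in blocks}
rows = table_rows(blocks[0], blocks_by_id)
print(rows)
```

In the handler, `blocks_by_id` could be built from `response['Blocks']` in the same way, and the resulting rows passed to the XML conversion in place of the raw TABLE block.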
Next, I'll need to make sure all external libraries are listed in my requirements.txt. Here that is only boto3, since my xml_conversion.py relies solely on Python's standard library.
# requirements.txt
boto3==1.34.71
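# utils/xml_conversion.py
import xml.etree.ElementTree as ET


def dict_to_xml(data, root_name='root'):
    """Recursively convert a dict into an ElementTree element named root_name."""
    root = ET.Element(root_name)
    for key, value in data.items():
        if isinstance(value, dict):
            # Nested dicts become nested elements.
            root.append(dict_to_xml(value, key))
        elif isinstance(value, list):
            # Each list item becomes its own element, repeated under the same key.
            for item in value:
                if isinstance(item, dict):
                    root.append(dict_to_xml(item, key))
                else:
                    child = ET.SubElement(root, key)
                    child.text = str(item)
        else:
            # Scalars are stored as the element's text.
            child = ET.SubElement(root, key)
            child.text = str(value)
    return root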
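To see what this conversion produces, here is a small usage sketch. The function is repeated so the snippet runs on its own, and the `sample` dict is a made-up, heavily trimmed stand-in for a Textract TABLE block.

```python
import xml.etree.ElementTree as ET


def dict_to_xml(data, root_name='root'):
    """Same conversion as above, repeated so this snippet is self-contained."""
    root = ET.Element(root_name)
    for key, value in data.items():
        if isinstance(value, dict):
            root.append(dict_to_xml(value, key))
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    root.append(dict_to_xml(item, key))
                else:
                    ET.SubElement(root, key).text = str(item)
        else:
            ET.SubElement(root, key).text = str(value)
    return root


# Made-up sample shaped like a trimmed Textract TABLE block.
sample = {
    'BlockType': 'TABLE',
    'Id': 't1',
    'Relationships': [{'Type': 'CHILD', 'Ids': ['c1', 'c2']}],
}
xml_string = ET.tostring(dict_to_xml(sample, 'table'), encoding='unicode')
print(xml_string)
# <table><BlockType>TABLE</BlockType><Id>t1</Id><Relationships><Type>CHILD</Type><Ids>c1</Ids><Ids>c2</Ids></Relationships></table>
```

Note that list items turn into repeated sibling elements with the same tag, which is why `Ids` appears twice in the output.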
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
