AI-Powered Text Extraction with Amazon Bedrock

Learn to extract highlighted text from images using Amazon Bedrock. Automate document processing and information extraction.

Akash Ninave
Amazon Employee
Published Mar 6, 2025
Automating highlighted text extraction with Amazon Bedrock and Amazon S3: building a solution that extracts highlighted text from images, generates explanations, and stores the results as formatted PDFs in Amazon S3 using Amazon Bedrock and Python.
This project demonstrates the power of combining AI services with cloud storage and serverless computing to create a useful tool for researchers, students, or anyone who needs to quickly digitize and explain highlighted text from physical documents or digital images.
Simply upload an image of your highlighted text, and the AI accurately extracts the important information and provides:
  1. Comprehensive Explanations: For each piece of highlighted text, receive a detailed explanation that breaks down complex concepts into easily understandable insights.
  2. Curated Resource Links: Gain access to a wealth of knowledge with carefully selected links to public articles that provide additional context and information related to your extracted text.
  3. Relevant Video Content: Enhance your understanding with links to YouTube videos that offer visual explanations and expert discussions on the topics you've highlighted.

Infrastructure:

  1. Amazon Bedrock:
    • Used for AI-powered text extraction and explanation generation
    • Specifically uses the Anthropic Claude 3.5 Sonnet model (anthropic.claude-3-5-sonnet-20240620-v1:0)
  2. Amazon S3:
    • Stores the generated PDFs in a pre-defined bucket
    • The bucket name is hardcoded in the script (BUCKET_NAME variable)
  3. AWS IAM:
    • Used implicitly to manage permissions for the Bedrock and S3 calls (see the policy sketch after this list)
  4. Local Environment:
    • Python script running on a local machine or server
    • Handles image processing, PDF generation, and AWS service interactions
  5. ReportLab:
    • Python library used for generating well-formatted PDFs
  6. Boto3:
    • AWS SDK for Python, used to interact with AWS services (Bedrock and S3)
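
For the IAM piece, a minimal identity-based policy along the following lines should cover what the script does. This is a hedged sketch (expressed here as a Python dict, with a placeholder bucket name), not a complete or authoritative policy:

# Minimal permissions this script needs (sketch; "example-bucket" is a placeholder)
minimal_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["bedrock:InvokeModel"],           # needed for the Converse call
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject"],  # upload + pre-signed download
            "Resource": "arn:aws:s3:::example-bucket/*",
        },
    ],
}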

Key Components:

  • S3 Client: For uploading PDFs and generating pre-signed URLs.
  • Bedrock Runtime Client: For interacting with the Bedrock AI model.
  • PDF Generation: Using ReportLab to create formatted PDFs from extracted text.
  • Error Handling and Logging: Basic error handling and logging mechanisms are in place.

Prerequisites:

  • AWS Account with appropriate permissions
  • Python 3.8 or later
  • boto3 library installed
  • reportlab library installed
  • Access to Amazon Bedrock (specifically the Anthropic Claude 3.5 Sonnet model)
  • AWS CLI configured with appropriate credentials (you can verify this with the sanity check after this list)
  • An S3 bucket created for storing the generated PDFs
  • An image file containing highlighted text for processing
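
Before running the script, a quick, hypothetical sanity check can confirm that credentials and Bedrock access are in place (the region below is an assumption; adjust it to yours):

import boto3

# Confirm the configured credentials resolve to an identity
print(boto3.client("sts").get_caller_identity()["Arn"])

# Confirm Bedrock is reachable and list the available Anthropic models
bedrock = boto3.client("bedrock", region_name="us-east-1")  # region is an assumption
models = bedrock.list_foundation_models(byProvider="Anthropic")
for m in models["modelSummaries"]:
    print(m["modelId"])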

Data flow:

  • User provides an image containing highlighted text.
  • The application reads the image file.
  • AWS Bedrock is called to extract the highlighted text and generate explanations.
  • The extracted text and explanations are formatted into a PDF.
  • The PDF is uploaded to a unique, timestamped folder in the S3 bucket.
  • A pre-signed URL for the PDF is generated and returned to the user.

Code block:

import logging
import boto3
import time
from datetime import datetime
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from reportlab.lib.utils import simpleSplit
from botocore.exceptions import ClientError
import io

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# AWS Clients
s3_client = boto3.client("s3")
bedrock_client = boto3.client(service_name="bedrock-runtime")

# Fixed S3 Bucket Name
BUCKET_NAME = "example-bucket"  # replace with your bucket name

def save_text_as_pdf(content):
    """
    Generates a properly formatted PDF from extracted text with auto-wrapping and pagination.
    """
    pdf_buffer = io.BytesIO()  # Create an in-memory byte-stream buffer
    pdf = canvas.Canvas(pdf_buffer, pagesize=letter)
    width, height = letter  # Page dimensions

    pdf.setFont("Helvetica-Bold", 14)
    pdf.drawString(50, height - 50, "Extracted Highlighted Text")

    pdf.setFont("Helvetica", 12)  # Switch back to the body font for the extracted text
    line_height = 18  # Line height chosen for readability
    y_position = height - 80
    max_width = width - 100  # Maximum width for text wrapping

    for line in content.split("\n"):
        wrapped_lines = simpleSplit(line, "Helvetica", 12, max_width)  # Auto-wrap long lines
        for sub_line in wrapped_lines:
            pdf.drawString(50, y_position, sub_line)
            y_position -= line_height
            if y_position < 50:  # Check for page overflow
                pdf.showPage()
                pdf.setFont("Helvetica", 12)
                y_position = height - 50

    pdf.save()
    pdf_buffer.seek(0)  # Reset buffer position to the beginning
    return pdf_buffer

def upload_to_s3(content):
    """
    Creates a unique folder with a timestamp and uploads the extracted text as a PDF to S3.
    Returns a pre-signed URL for the file.
    """
    try:
        # Generate a unique folder using a timestamp
        timestamp = datetime.utcnow().strftime('%Y%m%d-%H%M%S')
        folder_name = f"highlighted-text-{timestamp}"
        file_name = f"{folder_name}/extracted_text.pdf"

        # Convert text content to PDF
        pdf_buffer = save_text_as_pdf(content)

        # Upload the PDF to the fixed S3 bucket
        s3_client.put_object(
            Bucket=BUCKET_NAME,
            Key=file_name,
            Body=pdf_buffer.getvalue(),
            ContentType="application/pdf"
        )

        # Generate a pre-signed URL (valid for 24 hours)
        presigned_url = s3_client.generate_presigned_url(
            'get_object',
            Params={'Bucket': BUCKET_NAME, 'Key': file_name},
            ExpiresIn=86400
        )

        return presigned_url

    except ClientError as e:
        logger.error(f"❌ Failed to upload PDF to S3: {e}")
        return None

def generate_conversation(model_id, input_text, input_image):
    """
    Sends an image and message to the Amazon Bedrock Converse API and retrieves extracted text.
    """
    logger.info("Generating message with model %s", model_id)

    try:
        with open(input_image, "rb") as f:
            image = f.read()
    except FileNotFoundError:
        logger.error(f"Image file not found: {input_image}")
        return None, None
    except Exception as e:
        logger.error(f"Error reading image file: {str(e)}")
        return None, None

    message = {
        "role": "user",
        "content": [
            {"text": input_text},
            {
                "image": {
                    "format": "png",
                    "source": {"bytes": image}
                }
            }
        ]
    }

    messages = [message]
    start_time = time.time()

    try:
        response = bedrock_client.converse(modelId=model_id, messages=messages)
    except ClientError as err:
        logger.error(f"A client error occurred: {err.response['Error']['Message']}")
        return None, None
    except Exception as e:
        logger.error(f"An unexpected error occurred: {str(e)}")
        return None, None

    end_time = time.time()
    latency = end_time - start_time
    logger.info("Total response time: %.2f seconds", latency)

    return response, latency

def extract_text_and_upload():
    """
    Extracts text from an image using Amazon Bedrock and uploads the output as a PDF
    to a new timestamped folder.
    """
    model_id = "anthropic.claude-3-5-sonnet-20240620-v1:0"
    input_text = (
        "Extract all highlighted text from the provided image. "
        "Ensure text is properly formatted, and provide detailed explanations and references for each extracted text. "
        "If there are diagrams in the image, describe their content accurately. "
        "For each highlighted text, provide relevant references to public articles and YouTube videos with proper links."
    )
    input_image = "path/to/your/image"  # replace with the path to your input image

    # Step 1: Extract text from the image
    response, latency = generate_conversation(model_id, input_text, input_image)

    if not response:
        logger.error("No response received from Bedrock API.")
        return

    output_message = response.get('output', {}).get('message', None)
    if not output_message:
        logger.error("No output message received from the model.")
        return

    # Extract the text
    extracted_text = "\n".join([content.get("text", "N/A") for content in output_message.get("content", [])])

    # Step 2: Upload extracted text as a PDF to a new unique folder in the fixed S3 bucket
    s3_url = upload_to_s3(extracted_text)

    if s3_url:
        print(f"✅ Extracted text successfully saved as PDF in S3: {s3_url}")
    else:
        print("❌ Failed to save extracted text in S3.")

if __name__ == "__main__":
    extract_text_and_upload()
Note:
Replace the BUCKET_NAME value with the name of your actual S3 bucket.
Replace the input_image value with the actual path to your image.

Warning:

Hardcoding your bucket name in the code is not a best practice; use environment variables instead.
For example:

import os

BUCKET_NAME = os.environ.get('S3_BUCKET_NAME')
input_image = os.environ.get('INPUT_IMAGE_PATH')
model_id = os.environ.get('BEDROCK_MODEL_ID')
Refer to this link to learn how to create environment variables.
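
To fail fast when required configuration is missing, you can extend that example along these lines (a hedged sketch using the same hypothetical variable names):

import os

BUCKET_NAME = os.environ.get("S3_BUCKET_NAME")
if not BUCKET_NAME:
    raise RuntimeError("The S3_BUCKET_NAME environment variable is not set")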

Reference input image:

[Input image]

Output:

[Output]

Understanding the code:

Import Statements and Basic Setup:
import logging
import boto3
import time
from datetime import datetime
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from reportlab.lib.utils import simpleSplit
from botocore.exceptions import ClientError
import io

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# AWS Clients
s3_client = boto3.client("s3")
bedrock_client = boto3.client(service_name="bedrock-runtime")

# Fixed S3 Bucket Name
BUCKET_NAME = "example-bucket"  # replace with your bucket name
This section imports the necessary libraries:
  • logging: for application logging
  • boto3: the AWS SDK for Python
  • reportlab: for PDF generation
  • Utility imports for time, datetime, and IO operations
It also initializes the AWS clients for the S3 and Bedrock services.
PDF Generation Function:
def save_text_as_pdf(content):
    """
    Generates a properly formatted PDF from extracted text with auto-wrapping and pagination.
    """
    pdf_buffer = io.BytesIO()  # Create an in-memory byte-stream buffer
    pdf = canvas.Canvas(pdf_buffer, pagesize=letter)
    width, height = letter  # Page dimensions

    pdf.setFont("Helvetica-Bold", 14)
    pdf.drawString(50, height - 50, "Extracted Highlighted Text")

    pdf.setFont("Helvetica", 12)  # Switch back to the body font for the extracted text
    line_height = 18  # Line height chosen for readability
    y_position = height - 80
    max_width = width - 100  # Maximum width for text wrapping

    for line in content.split("\n"):
        wrapped_lines = simpleSplit(line, "Helvetica", 12, max_width)  # Auto-wrap long lines
        for sub_line in wrapped_lines:
            pdf.drawString(50, y_position, sub_line)
            y_position -= line_height
            if y_position < 50:  # Check for page overflow
                pdf.showPage()
                pdf.setFont("Helvetica", 12)
                y_position = height - 50

    pdf.save()
    pdf_buffer.seek(0)  # Reset buffer position to the beginning
    return pdf_buffer
This function creates a PDF from the input text with proper formatting, pagination, and text wrapping.
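
To try it in isolation, here is a minimal, hypothetical usage sketch that writes the generated buffer to a local file:

# Hypothetical local test of save_text_as_pdf (not part of the original script)
sample = "First highlighted sentence.\nSecond highlighted sentence."
buffer = save_text_as_pdf(sample)
with open("sample_output.pdf", "wb") as f:
    f.write(buffer.getvalue())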
S3 Upload Function:
def upload_to_s3(content):
    """
    Creates a unique folder with a timestamp and uploads the extracted text as a PDF to S3.
    Returns a pre-signed URL for the file.
    """
    try:
        timestamp = datetime.utcnow().strftime('%Y%m%d-%H%M%S')
        folder_name = f"highlighted-text-{timestamp}"
        file_name = f"{folder_name}/extracted_text.pdf"

        pdf_buffer = save_text_as_pdf(content)

        s3_client.put_object(
            Bucket=BUCKET_NAME,
            Key=file_name,
            Body=pdf_buffer.getvalue(),
            ContentType="application/pdf"
        )

        presigned_url = s3_client.generate_presigned_url(
            'get_object',
            Params={'Bucket': BUCKET_NAME, 'Key': file_name},
            ExpiresIn=86400
        )

        return presigned_url

    except ClientError as e:
        logger.error(f"❌ Failed to upload PDF to S3: {e}")
        return None
This function manages S3 operations:
  • Creates timestamped folders
  • Converts text to PDF
  • Handles S3 upload
  • Generates temporary access URLs
  • Includes error handling
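
As a quick, hypothetical check that a returned pre-signed URL actually works, you can fetch it with the standard library (assuming the upload succeeded):

import urllib.request

url = upload_to_s3("Some extracted text")
if url:
    with urllib.request.urlopen(url) as resp:  # Download the PDF via the pre-signed URL
        print(resp.status, resp.headers.get("Content-Type"))  # Expect 200 and application/pdf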
Bedrock Integration Function:
def generate_conversation(model_id, input_text, input_image):
    logger.info("Generating message with model %s", model_id)

    # Image reading with error handling
    try:
        with open(input_image, "rb") as f:
            image = f.read()
    except FileNotFoundError:
        logger.error(f"Image file not found: {input_image}")
        return None, None
    except Exception as e:
        logger.error(f"Error reading image file: {str(e)}")
        return None, None

    # Message preparation
    message = {
        "role": "user",
        "content": [
            {"text": input_text},
            {"image": {"format": "png", "source": {"bytes": image}}}
        ]
    }

    # API call with timing and error handling
    messages = [message]
    start_time = time.time()
    try:
        response = bedrock_client.converse(modelId=model_id, messages=messages)
    except ClientError as err:
        logger.error(f"A client error occurred: {err.response['Error']['Message']}")
        return None, None
    except Exception as e:
        logger.error(f"An unexpected error occurred: {str(e)}")
        return None, None

    end_time = time.time()
    latency = end_time - start_time
    logger.info("Total response time: %.2f seconds", latency)

    return response, latency
This function handles:
  • Image file reading
  • Message formatting for Bedrock
  • API call execution
  • Response timing
  • Comprehensive error handling
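
The Converse API returns a structured payload; this hedged sketch shows how you might inspect the parts the script relies on, assuming the model_id, input_text, and input_image variables from the main function:

response, latency = generate_conversation(model_id, input_text, input_image)
if response:
    message = response["output"]["message"]  # the model's reply
    text = "\n".join(c.get("text", "") for c in message.get("content", []))
    usage = response.get("usage", {})        # token counts, if present
    print(f"{latency:.2f}s, {usage.get('inputTokens')} in / {usage.get('outputTokens')} out")
    print(text)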
Main Orchestration Function:
def extract_text_and_upload():
    model_id = "anthropic.claude-3-5-sonnet-20240620-v1:0"
    input_text = (
        "Extract all highlighted text from the provided image. "
        "Ensure text is properly formatted, and provide detailed explanations and references for each extracted text. "
        "If there are diagrams in the image, describe their content accurately. "
        "For each highlighted text, provide relevant references to public articles and YouTube videos with proper links."
    )
    input_image = "path/to/your/image"  # replace with the path to your input image

    # Process image and get response
    response, latency = generate_conversation(model_id, input_text, input_image)
    if not response:
        logger.error("No response received from Bedrock API.")
        return

    # Extract and process text
    output_message = response.get('output', {}).get('message', None)
    if not output_message:
        logger.error("No output message received from the model.")
        return

    extracted_text = "\n".join([content.get("text", "N/A") for content in output_message.get("content", [])])

    # Upload and get URL
    s3_url = upload_to_s3(extracted_text)
    if s3_url:
        print(f"✅ Extracted text successfully saved as PDF in S3: {s3_url}")
    else:
        print("❌ Failed to save extracted text in S3.")
This main function:
  • Sets up model and input parameters
  • Coordinates the extraction process
  • Handles response processing
  • Manages the upload process
  • Provides status feedback
The entry point:
if __name__ == "__main__":
    extract_text_and_upload()
This ensures the script runs only when executed directly.

Key Features and Best Practices

  1. Error Handling: The script includes error handling and logging throughout, making it more robust and easier to debug.
  2. Modularity: Functions are well-separated, promoting code reusability and maintainability.
  3. Security: The use of pre-signed URLs ensures secure, time-limited access to the uploaded files.
  4. Scalability: By leveraging AWS services, the solution can easily scale to handle large volumes of documents.
  5. Flexibility: The AI model and input instructions can be easily modified to adapt to different use cases.

Potential Improvements and Extensions

  1. Parallel Processing: Process multiple images concurrently (see the sketch after this list).
  2. Integration with Document Management Systems: Extend the script to integrate with popular document management systems.
  3. User Interface: Develop a web interface for easy upload and processing of images.
  4. Automated Workflow: Integrate with Amazon S3 event notifications and Amazon SNS to create an automated workflow.
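
As a hedged sketch of the first idea, a thread pool can process several images concurrently by reusing the functions defined above (the image paths here are hypothetical, and model_id and input_text are assumed to be defined as in the main function):

from concurrent.futures import ThreadPoolExecutor

def process_image(image_path):
    # Run extraction and upload for a single image, reusing the script's functions
    response, _ = generate_conversation(model_id, input_text, image_path)
    if not response:
        return None
    message = response.get("output", {}).get("message", {})
    text = "\n".join(c.get("text", "") for c in message.get("content", []))
    return upload_to_s3(text)

image_paths = ["page1.png", "page2.png", "page3.png"]  # hypothetical inputs
with ThreadPoolExecutor(max_workers=3) as pool:
    urls = list(pool.map(process_image, image_paths))
print(urls)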

Conclusion:

This implementation demonstrates how AWS services can be combined to create practical solutions for:
  • Researchers digitizing research papers
  • Students organizing study materials
  • Business professionals extracting key information from documents
As AI and cloud technologies continue to evolve, we can expect even more powerful and efficient solutions for document processing and information extraction. By staying up-to-date with these technologies and continuously improving our processes, we can unlock new levels of productivity and insight from our document repositories.
Note: Remember to follow security best practices and handle errors appropriately when implementing similar solutions in production environments.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
