AI-Powered Text Extraction with Amazon Bedrock

Learn to extract highlighted text from images using Amazon Bedrock. Automate document processing and information extraction.

Akash Ninave
Amazon Employee
Published Mar 6, 2025
Automating highlighted text extraction with Amazon Bedrock and Amazon S3: building a solution that extracts highlighted text from images, generates explanations, and stores the results as formatted PDFs in Amazon S3 using Amazon Bedrock and Python.
This project demonstrates the power of combining AI services with cloud storage and serverless computing to create a useful tool for researchers, students, or anyone who needs to quickly digitize and explain highlighted text from physical documents or digital images.
Simply upload an image of your highlighted text, and the AI accurately extracts the important information and provides:
  1. Comprehensive Explanations: For each piece of highlighted text, receive a detailed explanation that breaks down complex concepts into easily understandable insights.
  2. Curated Resource Links: Gain access to a wealth of knowledge with carefully selected links to public articles that provide additional context and information related to your extracted text.
  3. Relevant Video Content: Enhance your understanding with links to YouTube videos that offer visual explanations and expert discussions on the topics you've highlighted.

Infrastructure:

  1. Amazon Bedrock:
    • Used for AI-powered text extraction and explanation generation
    • Specifically uses the Anthropic Claude 3.5 Sonnet model (anthropic.claude-3-5-sonnet-20240620-v1:0)
  2. Amazon S3:
    • Stores the generated PDFs in a pre-defined bucket
    • The bucket name is hardcoded in the script (BUCKET_NAME variable)
  3. AWS IAM:
    • Used implicitly to manage permissions for the Bedrock and S3 calls (see the policy sketch after this list)
  4. Local Environment:
    • Python script running on a local machine or server
    • Handles image processing, PDF generation, and AWS service interactions
  5. ReportLab:
    • Python library used for generating well-formatted PDFs
  6. Boto3:
    • AWS SDK for Python, used to interact with AWS services (Bedrock and S3)
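
For the IAM piece, a minimal identity-based policy along the following lines should cover what the script does. This is a hedged sketch (expressed here as a Python dict, with a placeholder bucket name), not a complete or authoritative policy:

# Minimal permissions this script needs (sketch; "example-bucket" is a placeholder)
minimal_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["bedrock:InvokeModel"],           # needed for the Converse call
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject"],  # upload + pre-signed download
            "Resource": "arn:aws:s3:::example-bucket/*",
        },
    ],
}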

Key Components:

  • S3 Client: For uploading PDFs and generating pre-signed URLs.
  • Bedrock Runtime Client: For interacting with the Bedrock AI model.
  • PDF Generation: Using ReportLab to create formatted PDFs from extracted text.
  • Error Handling and Logging: Basic error handling and logging mechanisms are in place.

Prerequisites:

  • AWS Account with appropriate permissions
  • Python 3.8 or later
  • boto3 library installed
  • reportlab library installed
  • Access to Amazon Bedrock (specifically the Anthropic Claude 3.5 Sonnet model)
  • AWS CLI configured with appropriate credentials (you can verify this with the sanity check after this list)
  • An S3 bucket created for storing the generated PDFs
  • An image file containing highlighted text for processing
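
Before running the script, a quick, hypothetical sanity check can confirm that credentials and Bedrock access are in place (the region below is an assumption; adjust it to yours):

import boto3

# Confirm the configured credentials resolve to an identity
print(boto3.client("sts").get_caller_identity()["Arn"])

# Confirm Bedrock is reachable and list the available Anthropic models
bedrock = boto3.client("bedrock", region_name="us-east-1")  # region is an assumption
models = bedrock.list_foundation_models(byProvider="Anthropic")
for m in models["modelSummaries"]:
    print(m["modelId"])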

Data flow:

  • User provides an image containing highlighted text.
  • The application reads the image file.
  • AWS Bedrock is called to extract the highlighted text and generate explanations.
  • The extracted text and explanations are formatted into a PDF.
  • The PDF is uploaded to a unique, timestamped folder in the S3 bucket.
  • A pre-signed URL for the PDF is generated and returned to the user.

Code block:

import logging
import boto3
import time
from datetime import datetime
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from reportlab.lib.utils import simpleSplit
from botocore.exceptions import ClientError
import io

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# AWS Clients
s3_client = boto3.client("s3")
bedrock_client = boto3.client(service_name="bedrock-runtime")

# Fixed S3 Bucket Name
BUCKET_NAME = "example-bucket"  # replace with your bucket name

def save_text_as_pdf(content):
    """
    Generates a properly formatted PDF from extracted text with auto-wrapping and pagination.
    """
    pdf_buffer = io.BytesIO()  # Create an in-memory byte-stream buffer
    pdf = canvas.Canvas(pdf_buffer, pagesize=letter)
    width, height = letter  # Page dimensions

    pdf.setFont("Helvetica-Bold", 14)
    pdf.drawString(50, height - 50, "Extracted Highlighted Text")

    pdf.setFont("Helvetica", 12)  # Switch back to the body font for the extracted text
    line_height = 18  # Line height chosen for readability
    y_position = height - 80
    max_width = width - 100  # Maximum width for text wrapping

    for line in content.split("\n"):
        wrapped_lines = simpleSplit(line, "Helvetica", 12, max_width)  # Auto-wrap long lines
        for sub_line in wrapped_lines:
            pdf.drawString(50, y_position, sub_line)
            y_position -= line_height
            if y_position < 50:  # Check for page overflow
                pdf.showPage()
                pdf.setFont("Helvetica", 12)
                y_position = height - 50

    pdf.save()
    pdf_buffer.seek(0)  # Reset buffer position to the beginning
    return pdf_buffer

def upload_to_s3(content):
    """
    Creates a unique folder with a timestamp and uploads the extracted text as a PDF to S3.
    Returns a pre-signed URL for the file.
    """
    try:
        # Generate a unique folder using a timestamp
        timestamp = datetime.utcnow().strftime('%Y%m%d-%H%M%S')
        folder_name = f"highlighted-text-{timestamp}"
        file_name = f"{folder_name}/extracted_text.pdf"

        # Convert text content to PDF
        pdf_buffer = save_text_as_pdf(content)

        # Upload the PDF to the fixed S3 bucket
        s3_client.put_object(
            Bucket=BUCKET_NAME,
            Key=file_name,
            Body=pdf_buffer.getvalue(),
            ContentType="application/pdf"
        )

        # Generate a pre-signed URL (valid for 24 hours)
        presigned_url = s3_client.generate_presigned_url(
            'get_object',
            Params={'Bucket': BUCKET_NAME, 'Key': file_name},
            ExpiresIn=86400
        )

        return presigned_url

    except ClientError as e:
        logger.error(f"❌ Failed to upload PDF to S3: {e}")
        return None

def generate_conversation(model_id, input_text, input_image):
    """
    Sends an image and message to the Amazon Bedrock Converse API and retrieves extracted text.
    """
    logger.info("Generating message with model %s", model_id)

    try:
        with open(input_image, "rb") as f:
            image = f.read()
    except FileNotFoundError:
        logger.error(f"Image file not found: {input_image}")
        return None, None
    except Exception as e:
        logger.error(f"Error reading image file: {str(e)}")
        return None, None

    message = {
        "role": "user",
        "content": [
            {"text": input_text},
            {
                "image": {
                    "format": "png",
                    "source": {"bytes": image}
                }
            }
        ]
    }

    messages = [message]
    start_time = time.time()

    try:
        response = bedrock_client.converse(modelId=model_id, messages=messages)
    except ClientError as err:
        logger.error(f"A client error occurred: {err.response['Error']['Message']}")
        return None, None
    except Exception as e:
        logger.error(f"An unexpected error occurred: {str(e)}")
        return None, None

    end_time = time.time()
    latency = end_time - start_time
    logger.info("Total response time: %.2f seconds", latency)

    return response, latency

def extract_text_and_upload():
    """
    Extracts text from an image using Amazon Bedrock and uploads the output as a PDF
    to a new timestamped folder.
    """
    model_id = "anthropic.claude-3-5-sonnet-20240620-v1:0"
    input_text = (
        "Extract all highlighted text from the provided image. "
        "Ensure text is properly formatted, and provide detailed explanations and references for each extracted text. "
        "If there are diagrams in the image, describe their content accurately. "
        "For each highlighted text, provide relevant references to public articles and YouTube videos with proper links."
    )
    input_image = "path/to/your/image"  # replace with the path to your input image

    # Step 1: Extract text from the image
    response, latency = generate_conversation(model_id, input_text, input_image)

    if not response:
        logger.error("No response received from Bedrock API.")
        return

    output_message = response.get('output', {}).get('message', None)
    if not output_message:
        logger.error("No output message received from the model.")
        return

    # Extract the text
    extracted_text = "\n".join([content.get("text", "N/A") for content in output_message.get("content", [])])

    # Step 2: Upload extracted text as a PDF to a new unique folder in the fixed S3 bucket
    s3_url = upload_to_s3(extracted_text)

    if s3_url:
        print(f"✅ Extracted text successfully saved as PDF in S3: {s3_url}")
    else:
        print("❌ Failed to save extracted text in S3.")

if __name__ == "__main__":
    extract_text_and_upload()
Note:
Replace the BUCKET_NAME value with the name of your actual S3 bucket.
Replace the input_image value with the actual path to your image.

Warning:

Hardcoding your bucket name in the code is not a best practice; use environment variables instead.
For example:

import os

BUCKET_NAME = os.environ.get('S3_BUCKET_NAME')
input_image = os.environ.get('INPUT_IMAGE_PATH')
model_id = os.environ.get('BEDROCK_MODEL_ID')
Refer to this link to learn how to create environment variables.
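
To fail fast when required configuration is missing, you can extend that example along these lines (a hedged sketch using the same hypothetical variable names):

import os

BUCKET_NAME = os.environ.get("S3_BUCKET_NAME")
if not BUCKET_NAME:
    raise RuntimeError("The S3_BUCKET_NAME environment variable is not set")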

Reference input image:

[Input image]

Output:

[Output]

Understanding the code:

Import Statements and Basic Setup:
import logging
import boto3
import time
from datetime import datetime
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from reportlab.lib.utils import simpleSplit
from botocore.exceptions import ClientError
import io

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# AWS Clients
s3_client = boto3.client("s3")
bedrock_client = boto3.client(service_name="bedrock-runtime")

# Fixed S3 Bucket Name
BUCKET_NAME = "example-bucket"  # replace with your bucket name
This section imports the necessary libraries:
  • logging: for application logging
  • boto3: the AWS SDK for Python
  • reportlab: for PDF generation
  • Utility imports for time, datetime, and IO operations
It also initializes the AWS clients for the S3 and Bedrock services.
PDF Generation Function:
def save_text_as_pdf(content):
    """
    Generates a properly formatted PDF from extracted text with auto-wrapping and pagination.
    """
    pdf_buffer = io.BytesIO()  # Create an in-memory byte-stream buffer
    pdf = canvas.Canvas(pdf_buffer, pagesize=letter)
    width, height = letter  # Page dimensions

    pdf.setFont("Helvetica-Bold", 14)
    pdf.drawString(50, height - 50, "Extracted Highlighted Text")

    pdf.setFont("Helvetica", 12)  # Switch back to the body font for the extracted text
    line_height = 18  # Line height chosen for readability
    y_position = height - 80
    max_width = width - 100  # Maximum width for text wrapping

    for line in content.split("\n"):
        wrapped_lines = simpleSplit(line, "Helvetica", 12, max_width)  # Auto-wrap long lines
        for sub_line in wrapped_lines:
            pdf.drawString(50, y_position, sub_line)
            y_position -= line_height
            if y_position < 50:  # Check for page overflow
                pdf.showPage()
                pdf.setFont("Helvetica", 12)
                y_position = height - 50

    pdf.save()
    pdf_buffer.seek(0)  # Reset buffer position to the beginning
    return pdf_buffer
This function creates a PDF from the input text with proper formatting, pagination, and text wrapping.
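
To try it in isolation, here is a minimal, hypothetical usage sketch that writes the generated buffer to a local file:

# Hypothetical local test of save_text_as_pdf (not part of the original script)
sample = "First highlighted sentence.\nSecond highlighted sentence."
buffer = save_text_as_pdf(sample)
with open("sample_output.pdf", "wb") as f:
    f.write(buffer.getvalue())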
S3 Upload Function:
def upload_to_s3(content):
    """
    Creates a unique folder with a timestamp and uploads the extracted text as a PDF to S3.
    Returns a pre-signed URL for the file.
    """
    try:
        timestamp = datetime.utcnow().strftime('%Y%m%d-%H%M%S')
        folder_name = f"highlighted-text-{timestamp}"
        file_name = f"{folder_name}/extracted_text.pdf"

        pdf_buffer = save_text_as_pdf(content)

        s3_client.put_object(
            Bucket=BUCKET_NAME,
            Key=file_name,
            Body=pdf_buffer.getvalue(),
            ContentType="application/pdf"
        )

        presigned_url = s3_client.generate_presigned_url(
            'get_object',
            Params={'Bucket': BUCKET_NAME, 'Key': file_name},
            ExpiresIn=86400
        )

        return presigned_url

    except ClientError as e:
        logger.error(f"❌ Failed to upload PDF to S3: {e}")
        return None
This function manages S3 operations:
  • Creates timestamped folders
  • Converts text to PDF
  • Handles S3 upload
  • Generates temporary access URLs
  • Includes error handling
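
As a quick, hypothetical check that a returned pre-signed URL actually works, you can fetch it with the standard library (assuming the upload succeeded):

import urllib.request

url = upload_to_s3("Some extracted text")
if url:
    with urllib.request.urlopen(url) as resp:  # Download the PDF via the pre-signed URL
        print(resp.status, resp.headers.get("Content-Type"))  # Expect 200 and application/pdf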
Bedrock Integration Function:
def generate_conversation(model_id, input_text, input_image):
    logger.info("Generating message with model %s", model_id)

    # Image reading with error handling
    try:
        with open(input_image, "rb") as f:
            image = f.read()
    except FileNotFoundError:
        logger.error(f"Image file not found: {input_image}")
        return None, None
    except Exception as e:
        logger.error(f"Error reading image file: {str(e)}")
        return None, None

    # Message preparation
    message = {
        "role": "user",
        "content": [
            {"text": input_text},
            {"image": {"format": "png", "source": {"bytes": image}}}
        ]
    }

    # API call with timing and error handling
    messages = [message]
    start_time = time.time()
    try:
        response = bedrock_client.converse(modelId=model_id, messages=messages)
    except ClientError as err:
        logger.error(f"A client error occurred: {err.response['Error']['Message']}")
        return None, None
    except Exception as e:
        logger.error(f"An unexpected error occurred: {str(e)}")
        return None, None

    end_time = time.time()
    latency = end_time - start_time
    logger.info("Total response time: %.2f seconds", latency)

    return response, latency
This function handles:
  • Image file reading
  • Message formatting for Bedrock
  • API call execution
  • Response timing
  • Comprehensive error handling
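
The Converse API returns a structured payload; this hedged sketch shows how you might inspect the parts the script relies on, assuming the model_id, input_text, and input_image variables from the main function:

response, latency = generate_conversation(model_id, input_text, input_image)
if response:
    message = response["output"]["message"]  # the model's reply
    text = "\n".join(c.get("text", "") for c in message.get("content", []))
    usage = response.get("usage", {})        # token counts, if present
    print(f"{latency:.2f}s, {usage.get('inputTokens')} in / {usage.get('outputTokens')} out")
    print(text)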
Main Orchestration Function:
def extract_text_and_upload():
    model_id = "anthropic.claude-3-5-sonnet-20240620-v1:0"
    input_text = (
        "Extract all highlighted text from the provided image. "
        "Ensure text is properly formatted, and provide detailed explanations and references for each extracted text. "
        "If there are diagrams in the image, describe their content accurately. "
        "For each highlighted text, provide relevant references to public articles and YouTube videos with proper links."
    )
    input_image = "path/to/your/image"  # replace with the path to your input image

    # Process image and get response
    response, latency = generate_conversation(model_id, input_text, input_image)
    if not response:
        logger.error("No response received from Bedrock API.")
        return

    # Extract and process text
    output_message = response.get('output', {}).get('message', None)
    if not output_message:
        logger.error("No output message received from the model.")
        return

    extracted_text = "\n".join([content.get("text", "N/A") for content in output_message.get("content", [])])

    # Upload and get URL
    s3_url = upload_to_s3(extracted_text)
    if s3_url:
        print(f"✅ Extracted text successfully saved as PDF in S3: {s3_url}")
    else:
        print("❌ Failed to save extracted text in S3.")
This main function:
  • Sets up model and input parameters
  • Coordinates the extraction process
  • Handles response processing
  • Manages the upload process
  • Provides status feedback
The entry point:
if __name__ == "__main__":
    extract_text_and_upload()
This ensures the script runs only when executed directly.

Key Features and Best Practices

  1. Error Handling: The script includes error handling and logging throughout, making it more robust and easier to debug.
  2. Modularity: Functions are well-separated, promoting code reusability and maintainability.
  3. Security: The use of pre-signed URLs ensures secure, time-limited access to the uploaded files.
  4. Scalability: By leveraging AWS services, the solution can easily scale to handle large volumes of documents.
  5. Flexibility: The AI model and input instructions can be easily modified to adapt to different use cases.

Potential Improvements and Extensions

  1. Parallel Processing: Process multiple images concurrently (see the sketch after this list).
  2. Integration with Document Management Systems: Extend the script to integrate with popular document management systems.
  3. User Interface: Develop a web interface for easy upload and processing of images.
  4. Automated Workflow: Integrate with Amazon S3 event notifications and Amazon SNS to create an automated workflow.
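
As a hedged sketch of the first idea, a thread pool can process several images concurrently by reusing the functions defined above (the image paths here are hypothetical, and model_id and input_text are assumed to be defined as in the main function):

from concurrent.futures import ThreadPoolExecutor

def process_image(image_path):
    # Run extraction and upload for a single image, reusing the script's functions
    response, _ = generate_conversation(model_id, input_text, image_path)
    if not response:
        return None
    message = response.get("output", {}).get("message", {})
    text = "\n".join(c.get("text", "") for c in message.get("content", []))
    return upload_to_s3(text)

image_paths = ["page1.png", "page2.png", "page3.png"]  # hypothetical inputs
with ThreadPoolExecutor(max_workers=3) as pool:
    urls = list(pool.map(process_image, image_paths))
print(urls)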

Conclusion:

This implementation demonstrates how AWS services can be combined to create practical solutions for:
  • Researchers digitizing research papers
  • Students organizing study materials
  • Business professionals extracting key information from documents
As AI and cloud technologies continue to evolve, we can expect even more powerful and efficient solutions for document processing and information extraction. By staying up-to-date with these technologies and continuously improving our processes, we can unlock new levels of productivity and insight from our document repositories.
Note: Remember to follow security best practices and handle errors appropriately when implementing similar solutions in production environments.
 

Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.
