Intelligent Document Processing With Augmented AI


Hands-on Guide: Extract Text with Textract, Classify Batch Documents Using Comprehend, and Implement Human Review with Amazon Augmented AI (A2I)

Published Mar 15, 2024
We will explore an Intelligent Document Processing (IDP) solution, demonstrating its core functionality through code. Where useful, we also show the same operations in the AWS Console, giving you both programmatic and console-based approaches to building an IDP solution.
Intelligent Document Processing
Intelligent Document Processing (IDP) is the automation of manual document processing tasks. IDP typically uses machine learning to extract text from images and other legacy documents and to perform business tasks on the extracted text, such as classifying documents by their content.
AWS definition of IDP - Intelligent document processing (IDP) is automating the process of manual data entry from paper-based documents or document images to integrate with other digital business processes.

Benefits of Intelligent Document Processing (IDP)

  • IDP decreases the chance of human error.
  • It reduces employee workload, letting staff focus on the edge cases that require human verification.
  • It increases the scale of enterprise document processing.
  • It reduces the cost of document processing.
We will use A2I to send documents classified below a certain confidence threshold to humans for review, while documents above the threshold pass through without review.

Tools

Textract - An AWS service that extracts text from unstructured documents (PNG, JPEG, TIFF, and PDF).
Comprehend - A service that performs analysis on text, such as key-phrase extraction, redaction of personally identifiable information, and text classification; it can also be extended by training it on your own data.
Amazon Augmented AI (A2I) - A service that improves the accuracy of machine learning tasks by having humans verify outputs based on rules, minimizing misclassification in edge cases. For example, if a task's confidence score falls below a certain threshold, it is flagged for human verification.
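To make the thresholding idea concrete, here is a minimal sketch of the routing rule (the function name and the 0.9 default are our own illustration, not part of any AWS API):

```python
def needs_human_review(confidence: float, threshold: float = 0.9) -> bool:
    """Route a prediction to human review when the model's confidence
    falls below the chosen threshold."""
    return confidence < threshold

# A document classified with 0.72 confidence is flagged for review;
# one classified with 0.98 confidence passes through automatically.
print(needs_human_review(0.72))  # True
print(needs_human_review(0.98))  # False
```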

Solution Overview

Our documents are mixed together, and we want to automate classifying them and separating them into folders by document type. Our solution:
  1. Extract text from legacy documents using Textract.
  2. Train Comprehend to classify documents based on the content of the extracted text.
  3. Classify documents using the Comprehend custom classifier.
  4. Save each document under its predicted category.
  5. If Comprehend's confidence in a document's category is below a certain threshold, send it for human review using A2I.
To get started, you'll need to create a SageMaker Domain. You can find detailed instructions in my other blog post or refer to the official AWS guide.
After that, create a JupyterLab space for code editing by following the steps outlined in this guide under the To create a space and open JupyterLab section.
We will modify Lab 1 of the AWS Intelligent Document Processing (IDP) workshop. You can download the complete solution from my GitHub repository, along with the original Jupyter notebook. The content will be similar to the original workshop, but we will enhance the solution by incorporating a human review component using Amazon Augmented AI (A2I).

Permissions

We will attach an inline policy to our SageMakerExecution role that grants it permission to pass IAM role credentials to other AWS services, such as Amazon Comprehend. This lets SageMaker integrate with other services in our IDP workflow.
In the AWS console, navigate to the IAM service.
Select Roles in the left pane.
Under Roles, select your SageMakerExecution role.
Click Add permissions and select Create inline policy.
Switch to the JSON tab and paste the following policy.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::*:role/*",
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": [
                        "comprehend.amazonaws.com"
                    ]
                }
            }
        }
    ]
}
Click Next. Under Review and save, give the policy a name, then click Create policy.
Attach the following managed policies to your SageMakerExecution role.
ComprehendFullAccess
AmazonTextractFullAccess
IAMPass
AmazonS3FullAccess

Copying Documents to S3

We will be using the default SageMaker bucket; you can use another bucket of your choice.
!curl https://idp-assets-wwso.s3.us-east-2.amazonaws.com/workshop-data/classification-training.zip --output classification-training.zip
We use the command above to download the document dataset for our classification task.
The code below unzips the dataset and removes hidden files.
import os
import shutil

try:
    shutil.unpack_archive("./classification-training.zip", extract_dir="classification-training")
    print("Document archive extracted successfully...")
    for path, subdirs, files in os.walk('./classification-training'):
        for name in files:
            if name.startswith('.'):
                hidden = os.path.join(path, name)
                print(f'Removing hidden files/directories: {hidden}')
                os.system(f"rm -rf {hidden}")
        for dirs in subdirs:
            if dirs.startswith('.'):
                hidden = os.path.join(path, dirs)
                print(f'Removing hidden files/directories: {hidden}')
                os.system(f"rm -rf {hidden}")
except Exception as e:
    print("Please upload the document zip file classification-training.zip")
    raise e

 

Extracting Document with Textract

Instead of interacting with the Textract SDK (boto3) directly, we use the third-party amazon-textract-caller library, whose call_textract helper wraps common Textract use cases.
def textract_extract_text(document, bucket=data_bucket):
    try:
        print(f'Processing document: {document}')
        lines = ""
        row = []

        # using amazon-textract-caller
        response = call_textract(input_document=f's3://{bucket}/{document}')
        # using pretty printer to get all the lines
        lines = get_string(textract_json=response, output_type=[Textract_Pretty_Print.LINES])

        label = [name for name in names if (name in document)]
        row.append(label[0])
        row.append(lines)
        return row
    except Exception as e:
        print(e)
The code submits a document to Textract and extracts its text. We use a multiprocessing pool to submit documents and extract their text concurrently, as shown below.
pool = mp.Pool(mp.cpu_count())
pool_results = [pool.apply_async(textract_extract_text, (document, data_bucket)) for document in docs]
labeled_collection = [res.get() for res in pool_results]
pool.close()
 
After extracting the text from documents using Amazon Textract, we compiled the extracted data into a CSV file. This file contains the extracted texts and their corresponding document types, which will serve as the training data for our Comprehend custom classifier. To make this data accessible for training, we uploaded the CSV file to an Amazon S3 bucket.
comprehend_df = pd.DataFrame(labeled_collection, columns=['label','document'])
comprehend_df
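Comprehend custom classification expects a headerless two-column CSV: the class label in the first column and the document text in the second. As a sketch of the save step (the helper name and file name here are our own), the DataFrame can be written out like this:

```python
import pandas as pd

def write_comprehend_csv(df: pd.DataFrame, path: str) -> str:
    """Write (label, document) rows in the headerless two-column CSV
    layout that Comprehend custom classification training expects."""
    df.to_csv(path, index=False, header=False, columns=["label", "document"])
    return path

# The resulting file can then be uploaded with boto3, e.g.:
#   boto3.client("s3").upload_file("comprehend_train.csv", data_bucket, key)
```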

Creating Custom Comprehend Classifier

We will create a custom Comprehend model to detect document types from the CSV dataset we uploaded to S3. The code below creates the custom classifier.
# Create a document classifier
account_id = boto3.client('sts').get_caller_identity().get('Account')
id = str(datetime.datetime.now().strftime("%s"))

document_classifier_name = 'Sample-Doc-Classifier-IDP'
document_classifier_version = 'Sample-Doc-Classifier-IDP-v1'
document_classifier_arn = ''
response = None

try:
    create_response = comprehend.create_document_classifier(
        InputDataConfig={
            'DataFormat': 'COMPREHEND_CSV',
            'S3Uri': f's3://{data_bucket}/{key}'
        },
        DataAccessRoleArn=role,
        DocumentClassifierName=document_classifier_name,
        VersionName=document_classifier_version,
        LanguageCode='en',
        Mode='MULTI_CLASS'
    )

    document_classifier_arn = create_response['DocumentClassifierArn']

    print(f"Comprehend Custom Classifier created with ARN: {document_classifier_arn}")
except Exception as error:
    if error.response['Error']['Code'] == 'ResourceInUseException':
        print(f'A classifier with the name "{document_classifier_name}" already exists.')
        document_classifier_arn = f'arn:aws:comprehend:{region}:{account_id}:document-classifier/{document_classifier_name}/version/{document_classifier_version}'
        print(f'The classifier ARN is: "{document_classifier_arn}"')
    else:
        print(error)
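Training a custom classifier can take a while. One way to wait for it (a sketch; the helper name and polling interval are our own) is to poll describe_document_classifier until the classifier reaches a terminal status:

```python
import time

def wait_for_classifier(comprehend_client, classifier_arn, poll_seconds=60):
    """Poll Comprehend until the custom classifier finishes training.
    Returns the terminal status, e.g. TRAINED or IN_ERROR."""
    while True:
        desc = comprehend_client.describe_document_classifier(
            DocumentClassifierArn=classifier_arn
        )
        status = desc["DocumentClassifierProperties"]["Status"]
        if status in ("TRAINED", "IN_ERROR"):
            return status
        print(f"Classifier status: {status}, waiting...")
        time.sleep(poll_seconds)
```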
 

Start Classification Job

You can now classify documents with the custom classification model created above, using either the console or the SDK, by specifying the input and output data locations. We demonstrate how to run a Comprehend batch job below.
The code below starts a Comprehend document classification job using our previously trained classifier. The Comprehend API uses Textract to extract text from the input documents, abstracting that step away from users.
import uuid

jobname = f'doc-classification-job-{uuid.uuid1()}'
print(f'Starting Comprehend Classification job {jobname} with model {document_classifier_arn}')

response = comprehend.start_document_classification_job(
    JobName=jobname,
    DocumentClassifierArn=document_classifier_arn,
    InputDataConfig={
        'S3Uri': f's3://{data_bucket}/idp/comprehend/mixedbag/',
        'InputFormat': 'ONE_DOC_PER_FILE',
        'DocumentReaderConfig': {
            'DocumentReadAction': 'TEXTRACT_DETECT_DOCUMENT_TEXT',
            'DocumentReadMode': 'FORCE_DOCUMENT_READ_ACTION'
        }
    },
    OutputDataConfig={
        'S3Uri': f's3://{data_bucket}/idp/comprehend/doc-class-output/'
    },
    DataAccessRoleArn=role
)

response
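When the job completes, its output archive contains one JSON line per input document, each listing candidate classes with confidence scores. A small helper (the function name is our own) to pick the top prediction from one of those lines might look like:

```python
import json

def top_class(prediction_line: str):
    """Given one JSON line from a Comprehend classification job's output,
    return the highest-scoring (class name, confidence) pair."""
    record = json.loads(prediction_line)
    best = max(record["Classes"], key=lambda c: c["Score"])
    return best["Name"], best["Score"]

line = '{"File": "doc-1.png", "Classes": [{"Name": "invoice", "Score": 0.92}, {"Name": "receipt", "Score": 0.08}]}'
print(top_class(line))  # ('invoice', 0.92)
```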

 

Amazon Augmented AI (A2I)

There are three core components of A2I:
Template - Contains the instructions reviewers follow.
Human workforce - The human reviewers performing the task.
Workflow (flow definition) - Ties the other components together, specifying the UI template with instructions, the workforce team doing the review, the type of task, and the conditions for human review. We will create a custom A2I workflow definition.
We use A2I to minimize misclassification; our workflow sends a document for human review whenever its classification confidence is below 1.0.
We will create a private workforce to review the lower-confidence classifications.
You can follow this guide to create a work team in the console: Create a Private Workforce (Amazon SageMaker Console) - Amazon SageMaker. We could also create the private work team programmatically with the API, but that would require creating a Cognito user pool, so to keep things simple we created our workforce in the SageMaker console.
On the SageMaker console, select Labeling workforces under Ground Truth, and under Private teams copy the work team ARN; we will use it when creating the workflow with the API.
Copying A2I ARN
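The flow definition below also references a humanTaskUiArn, the ARN of the worker task UI. As a sketch of how such a UI can be registered (the helper name, UI name, and category list here are placeholders of our own; the workshop's actual Liquid template differs), one can call SageMaker's create_human_task_ui:

```python
def create_task_ui(sagemaker_client, ui_name, template_html):
    """Register a worker task UI (a Liquid HTML template) and return its ARN."""
    response = sagemaker_client.create_human_task_ui(
        HumanTaskUiName=ui_name,
        UiTemplate={"Content": template_html}
    )
    return response["HumanTaskUiArn"]

# Minimal placeholder template using the crowd-classifier element;
# the category list below is illustrative only.
TEMPLATE = """
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
  <crowd-classifier
    name="category"
    categories="['letter', 'invoice', 'receipt']"
    initial-value="{{ task.input.initialValue }}"
    header="Select the correct document type">
    <classification-target>{{ task.input.taskObject }}</classification-target>
  </crowd-classifier>
</crowd-form>
"""
```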

Human Review Workflow

# Flow definition name - this value is unique per account and region. You can also provide your own value here.
flowDefinitionName = 'fd-comprehend-demo-' + str(uuid.uuid4())

create_workflow_definition_response = sagemaker.create_flow_definition(
    FlowDefinitionName=flowDefinitionName,
    RoleArn=ROLE,
    HumanLoopConfig={
        "WorkteamArn": WORKTEAM_ARN,
        "HumanTaskUiArn": humanTaskUiArn,
        "TaskCount": 1,
        "TaskDescription": "Identify the sentiment of the provided text",
        "TaskTitle": "Detect Sentiment of Text"
    },
    OutputConfig={
        "S3OutputPath": OUTPUT_PATH
    }
)
flowDefinitionArn = create_workflow_definition_response['FlowDefinitionArn']  # let's save this ARN for future use
 

Starting Human Loop

The code below applies custom logic to send a document for human classification when its confidence score is less than 1.
Note that the initialValue in inputContent should be one of the possible categories.
human_loops_started = []
SENTIMENT_SCORE_THRESHOLD = 1
for _, blurb in doc_class_df.iterrows():
    response = blurb["Confidence"]

    print(f'Processing blurb: \"{blurb["Document"]}\"')

    # Our condition for when we want to engage a human for review
    if (response < SENTIMENT_SCORE_THRESHOLD):
        humanLoopName = str(uuid.uuid4())
        inputContent = {
            "initialValue": blurb["DocType"][:-1],
            "taskObject": blurb["DocText"]
        }
        start_loop_response = a2i.start_human_loop(
            HumanLoopName=humanLoopName,
            FlowDefinitionArn=flowDefinitionArn,
            HumanLoopInput={
                "InputContent": json.dumps(inputContent)
            }
        )
        human_loops_started.append(humanLoopName)
        print(f'SentimentScore of {response}, {blurb["DocType"]} is less than the threshold of {SENTIMENT_SCORE_THRESHOLD}')
        print(f'Starting human loop with name: {humanLoopName} \n')
    else:
        print(f'SentimentScore of {response}, {blurb["DocType"]} is above threshold of {SENTIMENT_SCORE_THRESHOLD}')
        print('No human loop created. \n')
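Once loops are started, you can check which ones reviewers have finished. A sketch of that check (the helper name is ours) using the A2I runtime's describe_human_loop call:

```python
def completed_loops(a2i_client, loop_names):
    """Return the subset of human loops whose review has completed."""
    done = []
    for name in loop_names:
        desc = a2i_client.describe_human_loop(HumanLoopName=name)
        status = desc["HumanLoopStatus"]
        print(f"{name}: {status}")
        if status == "Completed":
            done.append(name)
    return done
```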

Optional

You could continue with the optional part of the original workshop and try adding a human loop to a deployed Comprehend real-time endpoint, or repeat the exercise replacing the Comprehend classifier with an LLM. Bonus points: write your own blog about it.

Conclusion

We have demonstrated how to implement an intelligent document processing solution using AWS tools, both within and outside SageMaker: we extracted text from a legacy document format (images) and trained a text classification model on the extracted text to classify our documents.

References

https://github.com/aws-samples/amazon-a2i-sample-jupyter-notebooks/blob/master/Amazon%20Augmented%20AI%20(A2I)%20and%20Comprehend%20DetectSentiment.ipynb
 
