Redacting PII efficiently using agents
Using an agent and tools, you can efficiently and cost-effectively redact PII from files.
Ross Alas
Amazon Employee
Published Nov 23, 2024
Follow us on LinkedIn for more AWS & GenAI-related content: www.linkedin.com/in/ross-alas and https://www.linkedin.com/in/deborshi-c-77797863/
Personally Identifiable Information (PII) can be effectively redacted from text using AI agents that outperform traditional regex-based and NLP approaches to PII detection. While it’s relatively simple to feed a source document to a large language model (LLM) and create a prompt to have it output the redacted version, this approach will be cost-prohibitive for some organizations. If you have 1 million input tokens, you’ll have approximately 1 million output tokens minus the redacted elements. Additionally, it’s quite common for the initial output of the LLM to miss some PII. You will have to re-feed the output of one iteration into the next to perform multi-pass PII redaction. As a result, costs will increase significantly.
There are better ways to achieve efficient and high-quality PII redaction. Let’s explore a solution that leverages AI agents and tools to optimize this process.
An AI agent is an autonomous piece of software that can create a plan, execute it, and fulfill an objective without human intervention. In this case, an AI agent is tasked with redacting all of the PII within a given text. The agent reads the text, identifies the defined PII, calls a tool with a list of strings to replace, and, if necessary, automatically repeats the process to check its work. By outputting only the strings to redact, the agent significantly reduces the number of output tokens compared to outputting the entire text.
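The repeat-and-check behavior described above can be sketched as a small loop. This is only an illustration under stated assumptions: `redact_until_clean`, `Detector`, and `toy_detector` are hypothetical names, and the toy detector is a stand-in for the model call, deliberately returning one finding per pass to mimic a model that misses some PII on a single pass.

```python
from typing import Callable

# Hypothetical type: given text, return the (pii_text, placeholder) pairs still present.
Detector = Callable[[str], list[tuple[str, str]]]


def redact_until_clean(text: str, detect: Detector, max_turns: int = 3) -> tuple[str, int]:
    """Repeat detect -> replace until nothing is found or the turn budget runs out."""
    turns = 0
    while turns < max_turns:
        findings = detect(text)
        turns += 1
        if not findings:
            break  # the checking pass found nothing left to redact
        for target, placeholder in findings:
            if target:  # ignore empty strings, which would match everywhere
                text = text.replace(target, placeholder)
    return text, turns


def toy_detector(text: str) -> list[tuple[str, str]]:
    """Stand-in for the LLM: reports at most one remaining item per call."""
    known = [("Bob", "[FIRST_NAME]"), ("555-0100", "[PHONE]")]
    return [(t, p) for t, p in known if t in text][:1]


redacted, turns = redact_until_clean("Call Bob at 555-0100.", toy_detector)
print(redacted)
print(turns)
```

In this toy run the loop needs two redaction passes plus one final pass that finds nothing. Each pass only emits the short list of findings, never the full text.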
- PII Definitions: Create PII definitions containing the types of PII (e.g., First Name) and their replacements (e.g., [FIRST_NAME]). Include examples of PII to account for variations in how the PII may appear in the text. For example, an account number may be written as "12345" or "1 2 three forty-five".
- AI Agent and Tools: The AI agent is provided with two tools:
  - Read File: Reads the original raw text file from storage.
  - RedactPIITool: Accepts a file_uri (such as an S3 URI or a local file path) and a list of tuples, each containing the text to redact and its replacement, such as [("Bob", "[FIRST_NAME]"), ("September 1, 1987", "[BIRTHDAY]"), ...].
- Agent Execution: Once the agent receives the prompt, the agent uses the Read File tool to ingest the original text file. The agent will then reason through the text to identify PII based on the definitions.
- Redaction Process: The agent calls the RedactPIITool with the file_uri and the list of identified PII tuples.
- Text Replacement: The RedactPIITool reads the text from storage, loops through the list of tuples, and performs find-and-replace operations.
- Output Generation: Finally, the RedactPIITool outputs the redacted text to the destination.
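The replacement steps above can be sketched in a few lines of Python. This is a minimal sketch, not the author's implementation: it assumes local file paths rather than S3 URIs, and the names `apply_redactions`, `redact_pii_tool`, and `dest_uri` are hypothetical. Replacing longer strings first is also an assumption added here, so that a short match does not break up a longer one.

```python
from pathlib import Path


def apply_redactions(text: str, replacements: list[tuple[str, str]]) -> str:
    """Apply each (pii_text, placeholder) pair as a literal find-and-replace."""
    # Assumption: handle longer strings first so a short match does not
    # split a longer string before that longer string is processed.
    ordered = sorted(replacements, key=lambda pair: len(pair[0]), reverse=True)
    for target, placeholder in ordered:
        if target:  # skip empty strings, which would match everywhere
            text = text.replace(target, placeholder)
    return text


def redact_pii_tool(file_uri: str, replacements: list[tuple[str, str]], dest_uri: str) -> None:
    """Read the source file, redact it, and write the result to the destination."""
    # Assumption: file_uri and dest_uri are local paths; S3 access is not shown.
    source = Path(file_uri).read_text(encoding="utf-8")
    Path(dest_uri).write_text(apply_redactions(source, replacements), encoding="utf-8")


sample = "Bob Smith was born on September 1, 1987. Bob's account is 12345."
pairs = [
    ("Bob", "[FIRST_NAME]"),
    ("September 1, 1987", "[BIRTHDAY]"),
    ("12345", "[ACCOUNT_NUMBER]"),
]
redacted_sample = apply_redactions(sample, pairs)
print(redacted_sample)
```

Because the tool does literal string matching, the agent has to pass each variation exactly as it appears in the text, which is why the PII definitions include examples of those variations.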
With the simple redaction solution, where the LLM outputs the entire text in redacted form, we can assume that the numbers of input and output tokens will be very similar. Here’s a comparison of costs:
- Input Tokens: 1 million
- Output Tokens: ~1 million
- Cost with Anthropic Claude 3 Haiku:
  - Input Token Cost: 1 million tokens * $0.00025 / 1000 tokens = $0.25
  - Output Token Cost: 1 million tokens * $0.00125 / 1000 tokens = $1.25
  - Total Cost: $1.50
With the architecture above, assume that for every 1 million input tokens, the model only needs to output 10,000 tokens but may need 3 turns to complete:
- Input Tokens: 1 million tokens * 3 turns = ~3 million tokens
- Output Tokens: 10,000 tokens
- Cost with Anthropic Claude 3 Haiku:
  - Input Token Cost: 3 million tokens * $0.00025 / 1000 tokens = $0.75
  - Output Token Cost: 10,000 tokens * $0.00125 / 1000 tokens = $0.0125
  - Total Cost: $0.7625 (about $0.76)
The numbers here will change depending on your use case, but by offloading the redaction process to a tool and minimizing the output tokens, the tool-based approach nearly halves the cost.
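To rerun the comparison with your own numbers, the arithmetic above can be reproduced in a short script. The prices are the Claude 3 Haiku per-1,000-token figures quoted in this post; the function name `cost_usd` and the variable names are only illustrative.

```python
# Per-1,000-token prices in USD, as quoted above for Anthropic Claude 3 Haiku.
INPUT_PRICE_PER_1K = 0.00025
OUTPUT_PRICE_PER_1K = 0.00125


def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Total cost for a given number of input and output tokens."""
    return (input_tokens * INPUT_PRICE_PER_1K + output_tokens * OUTPUT_PRICE_PER_1K) / 1000


# Simple approach: the model rewrites the whole document once.
full_rewrite = cost_usd(input_tokens=1_000_000, output_tokens=1_000_000)

# Tool-based approach: the document is read on each of 3 turns,
# but only about 10,000 tokens of redaction pairs are output.
tool_based = cost_usd(input_tokens=3 * 1_000_000, output_tokens=10_000)

print(f"Full rewrite: ${full_rewrite:.4f}")
print(f"Tool-based:   ${tool_based:.4f}")
print(f"Ratio:        {tool_based / full_rewrite:.2f}")
```

Changing the number of turns or the output size shows how sensitive the savings are to those two assumptions. For example, the advantage shrinks as the agent needs more turns, since every turn re-reads the input.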
Let’s try out this approach with a sample document. On the left is the original text, and on the right is the output.
Using AI agents with tools for PII redaction offers a more efficient and cost-effective alternative to other methods. By focusing on identifying and replacing only the PII elements rather than reprocessing entire documents, organizations can significantly reduce costs and improve redaction accuracy. The tool-based approach also allows for scalability and adaptability to various data types and PII definitions.
Read more about Amazon Bedrock Agents
If you want to learn more about how to make your own custom agents, take a look at previous blog posts:
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.