Published November 25, 2025 | Version v2
Journal article
Open
AI-Powered Secure Document Anonymization Pipeline: A Serverless AWS Architecture for PII Detection and Redaction
Description
Organizations struggle with manual redaction of Personally Identifiable Information (PII) from sensitive documents and face significant compliance challenges under GDPR and HIPAA. This work develops an intelligent, serverless document anonymization pipeline on Amazon Web Services that automates PII detection and redaction. The solution uses AWS Step Functions to orchestrate a microservices architecture that ingests documents through a secure web interface, extracts text with Amazon Textract, identifies sensitive information with Amazon Comprehend, and applies configurable anonymization strategies. The system integrates several AWS services: Lambda functions for processing logic, API Gateway for communication between the frontend and backend, S3 for storage, DynamoDB for audit trails, and EventBridge for workflow management. Key features include a JavaScript-based frontend with real-time progress tracking, support for multiple document formats (PDF, TXT, and images), and PII detection covering names, Social Security numbers, email addresses, medical information, and other sensitive data. Security measures include malware scanning via GuardDuty, encryption at rest and in transit, fine-grained IAM policies, and comprehensive audit logging via CloudTrail. The serverless architecture keeps costs proportional to usage through pay-per-use pricing and scales automatically with load. The implementation demonstrates a practical application of cloud-native architectures and AI services to real-world data privacy challenges in enterprise environments.
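The detection and redaction step described above can be illustrated with a minimal Python sketch. It is not the article's implementation: the strategy names (`tag`, `mask`), the `min_score` threshold, and the function names are assumptions made for illustration. The only AWS-specific part is the entity shape (`Type`, `Score`, `BeginOffset`, `EndOffset`) returned by the Amazon Comprehend `detect_pii_entities` call, and the sketch assumes the detected spans do not overlap.

```python
from typing import Callable, Dict, List


def tag(span: str, entity_type: str) -> str:
    """Replace the span with a bracketed label such as [EMAIL]."""
    return f"[{entity_type}]"


def mask(span: str, entity_type: str) -> str:
    """Replace every character of the span with an asterisk."""
    return "*" * len(span)


# Hypothetical strategy registry; the article does not list its strategies.
STRATEGIES: Dict[str, Callable[[str, str], str]] = {"tag": tag, "mask": mask}


def redact(text: str, entities: List[dict], strategy: str = "tag",
           min_score: float = 0.8) -> str:
    """Apply an anonymization strategy to each sufficiently confident entity.

    Assumes spans do not overlap. min_score is an illustrative threshold.
    """
    apply = STRATEGIES[strategy]
    kept = [e for e in entities if e["Score"] >= min_score]
    # Work right to left so earlier offsets stay valid after each replacement.
    for e in sorted(kept, key=lambda e: e["BeginOffset"], reverse=True):
        start, end = e["BeginOffset"], e["EndOffset"]
        text = text[:start] + apply(text[start:end], e["Type"]) + text[end:]
    return text


def detect_pii(text: str, language_code: str = "en") -> List[dict]:
    """Call Amazon Comprehend; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so redact() works without the AWS SDK

    client = boto3.client("comprehend")
    response = client.detect_pii_entities(Text=text, LanguageCode=language_code)
    return response["Entities"]
```

In a Lambda handler, the text extracted by Textract would be passed to `detect_pii` and the result to `redact`. Keeping `redact` free of AWS calls makes the anonymization strategies testable locally with hand-written entity lists.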
Files
document-anonymization-publication.pdf (860.2 kB)
md5:76d46ec0eb7b254416871ab27631ac84