The Java sample program below shows how to search a PDF document for formatted fields (such as Social Security Numbers, email addresses, phone numbers or dates), add redaction annotations over the matches, and then burn them in to remove the underlying text content.

Formatted text is identified using regular expressions (“regex”), so the code can be adapted to search for any kind of formatted data, such as phone numbers in a European national format, record numbers, patient numbers or customer IDs.
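Before running a pattern against a PDF, it can help to try it on a plain string first. The sketch below does that with the standard java.util.regex classes. The class RegexTrial, its findMatches helper and the French phone pattern are hypothetical names and examples written for this illustration; they are not part of jPDFProcess. The sketch also assumes that the library accepts the same regex syntax as java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of jPDFProcess): tries regexes on plain strings.
final class RegexTrial {
    // US Social Security Number: ###-##-####
    static final String SSN = "[0-9]{3}-[0-9]{2}-[0-9]{4}";
    // Year-first date: YYYY-MM-DD, months and days may be 1 or 2 digits
    static final String YEAR_FIRST_DATE = "[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}";
    // Illustrative French number in international format:
    // +33, then 9 digits (first digit 1-9), optionally separated by spaces
    static final String FR_PHONE = "\\+33 ?[1-9]( ?[0-9]{2}){4}";

    // Returns every substring of text that matches regex, in order of appearance
    static List<String> findMatches(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "SSN 123-45-6789, born 1980-02-29, call +33 1 23 45 67 89.";
        System.out.println(findMatches(SSN, sample));             // [123-45-6789]
        System.out.println(findMatches(YEAR_FIRST_DATE, sample)); // [1980-02-29]
        System.out.println(findMatches(FR_PHONE, sample));        // [+33 1 23 45 67 89]
    }
}
```

Like the patterns in the sample, these are deliberately loose: they check the shape of the text only, so a string such as 9999-99-99 would still be reported as a date.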

The sample uses Qoppa’s Java PDF library jPDFProcess to search the text and then modify the PDF document.

// Open the document
PDFDocument pdfDoc = new PDFDocument("input.pdf", null);

// Per page: search text, create redaction annotations, then apply them
for (int i = 0; i < pdfDoc.getPageCount(); i++)
{
    PDFPage pdfPage = pdfDoc.getPage(i);

    // Overly simplistic email matching expression copied off the web
    Vector<TextPositionWithContext> emails_maybe = pdfPage.findTextWithContextUsingRegEx("([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})");

    // Overly simplistic US phone number regex, written without being too careful.
    // Matches a 3 digit area code (with or without parentheses) followed by
    // 3 digits and 4 digits, each separated by whitespace or a hyphen
    Vector<TextPositionWithContext> usPhoneNums = pdfPage.findTextWithContextUsingRegEx("(([(][0-9]{3}[)])|[0-9]{3})(\\s|-)[0-9]{3}(\\s|-)[0-9]{4}");

    // SSNs: ###-##-####
    Vector<TextPositionWithContext> ssns = pdfPage.findTextWithContextUsingRegEx("[0-9]{3}-[0-9]{2}-[0-9]{4}");

    // Very simple dates in YYYY-MM-DD format
    Vector<TextPositionWithContext> yearFirstDates = pdfPage.findTextWithContextUsingRegEx("[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}");

    // Collect the matches from all four searches into one list
    List<TextPositionWithContext> allResults = new ArrayList<TextPositionWithContext>();
    allResults.addAll(emails_maybe);
    allResults.addAll(usPhoneNums);
    allResults.addAll(ssns);
    allResults.addAll(yearFirstDates);

    // Create a redaction annotation over each match
    for (TextPosition textPos : allResults)
    {
        pdfPage.addAnnotation(pdfDoc.getAnnotationFactory().createRedaction("Redaction sample", textPos.getPDFQuadrilaterals()));
    }

    // Apply ("burn in") all redaction annotations on the page
    pdfPage.applyRedactionAnnotations();
}