The Java sample program below shows how to search a PDF document for formatted fields (such as Social Security numbers, email addresses, phone numbers, or dates), add redaction annotations over the matches, and then burn in those annotations to remove the underlying text content.
Formatted text is identified using regular expressions (“regex”), so the code can be adapted to search for any kind of formatted data, such as a European phone number, a patient number, a customer ID, or another formatted record number.
The sample uses Qoppa’s Java PDF library jPDFProcess to search the text and then modify the PDF document.
// Requires java.util.ArrayList, java.util.List and java.util.Vector, plus the
// jPDFProcess classes PDFDocument, PDFPage, TextPosition and TextPositionWithContext.

// Open the document
PDFDocument pdfDoc = new PDFDocument("input.pdf", null);

// per page: search text, create redaction annotations, then apply
for (int i = 0; i < pdfDoc.getPageCount(); i++)
{
    PDFPage pdfPage = pdfDoc.getPage(i);

    // overly simplistic email matching expression I copied off the web
    Vector<TextPositionWithContext> emails_maybe = pdfPage.findTextWithContextUsingRegEx("([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})");

    // overly simplistic phone # regex I just came up with without being too careful
    // matches 7 digit phone numbers with an optional 3 digit area code,
    // with the groups separated by whitespace or a hyphen
    Vector<TextPositionWithContext> usPhoneNums = pdfPage.findTextWithContextUsingRegEx("((([(][0-9]{3}[)])|[0-9]{3})(\\s|-))?[0-9]{3}(\\s|-)[0-9]{4}");

    // SSNs ###-##-####
    Vector<TextPositionWithContext> ssns = pdfPage.findTextWithContextUsingRegEx("[0-9]{3}-[0-9]{2}-[0-9]{4}");

    // very simple dates in YYYY-MM-DD format
    Vector<TextPositionWithContext> yearFirstDates = pdfPage.findTextWithContextUsingRegEx("[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}");

    // collect the matches from all four searches
    List<TextPositionWithContext> allResults = new ArrayList<TextPositionWithContext>();
    allResults.addAll(emails_maybe);
    allResults.addAll(usPhoneNums);
    allResults.addAll(ssns);
    allResults.addAll(yearFirstDates);

    // create redaction annotations
    for (TextPosition textPos : allResults)
    {
        pdfPage.addAnnotation(pdfDoc.getAnnotationFactory().createRedaction("Redaction sample", textPos.getPDFQuadrilaterals()));
    }

    // apply ("burn-in") all redaction annotations on the page
    pdfPage.applyRedactionAnnotations();
}

// save the redacted document (the output file name is only an example)
pdfDoc.saveDocument("output.pdf");
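Before running a pattern against real documents, it can help to try it on plain strings first. The following is a minimal sketch using only java.util.regex; it assumes that the regex search in jPDFProcess follows standard Java regex syntax, which is not confirmed here. The class name RegexCheck, the helper findAll, and the sample text are made up for illustration, and the PATIENT_ID format is a hypothetical record number, not one taken from the sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexCheck {
    // Patterns copied from the sample program
    static final String EMAIL = "([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})";
    static final String SSN = "[0-9]{3}-[0-9]{2}-[0-9]{4}";
    static final String YEAR_FIRST_DATE = "[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}";

    // Hypothetical record format, for illustration only: "PT-" followed by 6 digits
    static final String PATIENT_ID = "PT-[0-9]{6}";

    // Returns every substring of text matched by regex, in order of appearance
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample text containing one value of each kind
        String text = "Patient PT-004217, SSN 123-45-6789, born 1980-02-29, "
                + "contact jane.doe@example.com.";

        System.out.println("emails:      " + findAll(EMAIL, text));
        System.out.println("SSNs:        " + findAll(SSN, text));
        System.out.println("dates:       " + findAll(YEAR_FIRST_DATE, text));
        System.out.println("patient IDs: " + findAll(PATIENT_ID, text));
    }
}
```

Once a pattern matches what is expected and nothing else, the same string can be passed to findTextWithContextUsingRegEx in the sample. Since burned-in redactions remove content permanently, it is worth checking for false positives as well as for missed matches.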