Here is a sample Java program that finds all social security numbers (SSNs) in a PDF document using a regular expression and permanently redacts them. Each SSN found is first covered with a redaction annotation. When the redaction annotations are applied, or “burnt in”, the underlying text is removed from the PDF content and the area is blacked out, leaving just a black rectangle where the SSN used to be. This sample code uses jPDFProcess, Qoppa’s PDF redaction SDK.
Note: Make sure you use the regular expression that corresponds to the format of the social security numbers in your documents. The sample code in this article matches the pattern “123-12-1234”.
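Before running a pattern against real documents, it can help to check it against a few sample strings. The sketch below uses only the standard java.util.regex package and does not call jPDFProcess. The class name SsnPatternDemo and both helper methods are hypothetical, introduced for illustration only. The anchored pattern is the one used in this article. With plain java.util.regex it matches only when the whole input is an SSN. The word-boundary variant is an assumption, shown as one way to find an SSN inside a longer string. How jPDFProcess applies the pattern to page text is not covered here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SsnPatternDemo {
    // Pattern from this article: anchored with ^ and $, so it validates a whole string.
    static final Pattern ANCHORED =
        Pattern.compile("^(?!666|000|9\\d{2})\\d{3}-(?!00)\\d{2}-(?!0{4})\\d{4}$");

    // Assumed variant (not from the article): word boundaries instead of anchors,
    // so the pattern can match an SSN embedded in longer text.
    static final Pattern EMBEDDED =
        Pattern.compile("\\b(?!666|000|9\\d{2})\\d{3}-(?!00)\\d{2}-(?!0{4})\\d{4}\\b");

    // Returns true only if the entire string is a well-formed SSN.
    static boolean isValidSsn(String s) {
        return ANCHORED.matcher(s).matches();
    }

    // Returns every SSN-shaped substring found in the text.
    static List<String> findSsns(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMBEDDED.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(isValidSsn("123-12-1234"));  // true
        System.out.println(isValidSsn("000-12-1234"));  // false: area number 000 is excluded
        System.out.println(findSsns("SSN: 123-12-1234, bad: 666-12-1234"));  // [123-12-1234]
    }
}
```

If your documents write SSNs differently, for example without dashes, adjust the pattern and re-run the same checks before redacting.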
    // Open the PDF document
    PDFDocument pdfDoc = new PDFDocument("input.pdf", null);

    // Regular expression to check valid SSN
    String redactSSN = "^(?!666|000|9\\d{2})\\d{3}-(?!00)\\d{2}-(?!0{4})\\d{4}$";

    // Per page: search text, create redaction annotations, then apply
    for (int i = 0; i < pdfDoc.getPageCount(); i++)
    {
        PDFPage pdfPage = pdfDoc.getPage(i);

        // Search for the text
        List<TextPosition> searchResults = pdfPage.findTextUsingRegex(redactSSN);

        // Create redaction annotations
        for (TextPosition textPos : searchResults)
        {
            Redaction redact = pdfDoc.getAnnotationFactory().createRedaction("Redaction sample", textPos.getPDFQuadrilaterals());
            pdfPage.addAnnotation(redact);
        }

        // Apply ("burn in") all redaction annotations on the page
        pdfPage.applyRedactionAnnotations();
    }

    // Save the redacted PDF document
    pdfDoc.saveDocument("output.pdf");
Download Full Java Sample Search & Redact SSN
Note: This sample uses jPDFProcess v2021R1, to be released in August 2021. For previous versions, look at the method findTextWithContextUsingRegEx instead. Contact us with any questions.