This Java sample shows how to search for a text label within a PDF document and then remove the text following that label. This is done by adding a redaction annotation and burning it, which removes any content beneath the redaction annotation. This sample uses Qoppa’s PDF library jPDFProcess.

In the sample code below, we’re searching for the label “Phone Number” to redact all phone numbers contained in a PDF document. The same process could be applied to redact Social Security Numbers (SSN), Patient Record Numbers or other confidential information.

// Load the document
PDFDocument pdfDoc = new PDFDocument ("C:\\myfolder\\input.pdf", null);
 
// this is my search label that comes before the text to be redacted
String searchLabel = "Phone Number";
 
// Loop through all pages in the document
boolean foundLabel = false;
for (int pageix = 0; pageix < pdfDoc.getPageCount(); ++pageix)
{
   // Search for the label text
   Vector<TextPosition> labelInstances = pdfDoc.getPage(pageix).findText(searchLabel, false, false);
 
   // Add annotations after the instances of the label
   if (labelInstances != null && labelInstances.size() > 0)
   {
      foundLabel = true;
      for (TextPosition tp : labelInstances)
      {
         Rectangle2D labelBounds = tp.getEnclosingShape().getBounds2D();

         // Region to erase: starts 1 unit to the right of the label, is 2 * 72 units wide
         // (2 inches at 72 points per inch), and is padded by 2 units above and below the label
         Rectangle2D.Double eraseBounds = new Rectangle2D.Double(labelBounds.getX() + labelBounds.getWidth() + 1,
               labelBounds.getY() - 2, 2 * 72, labelBounds.getHeight() + 4);

         Redaction redact = pdfDoc.getAnnotationFactory().createRedaction("Redaction");
         redact.setRectangle(eraseBounds);
         redact.setInternalColor(Color.black);
         pdfDoc.getPage(pageix).addAnnotation(redact);
      }
   }
}
 
// output whether the search label was found or not
System.out.println("Search Label found " + foundLabel);
 
// save doc with redaction
pdfDoc.saveDocument ("C:\\myfolder\\output_redact.pdf");

Download Full Java Sample SearchAndRedact.java
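
The redaction rectangle in the sample is derived from the label's bounding box with plain geometry, so that step can be tried on its own without the PDF library. The sketch below isolates the calculation using only `java.awt.geom`. The class name `EraseBoundsSketch`, the method name `eraseBoundsAfter`, and the sample label coordinates are made up for illustration, and it assumes the coordinates are in PDF points (72 per inch), as the `2 * 72` in the sample suggests.

```java
import java.awt.geom.Rectangle2D;

public class EraseBoundsSketch {
    // Assumption: coordinates are in PDF points, 72 points per inch
    static final double POINTS_PER_INCH = 72.0;

    // Hypothetical helper: computes the area to redact to the right of a label,
    // using the same offsets as the sample code above
    public static Rectangle2D.Double eraseBoundsAfter(Rectangle2D labelBounds) {
        double x = labelBounds.getX() + labelBounds.getWidth() + 1; // 1 point past the label's right edge
        double y = labelBounds.getY() - 2;                          // 2 points of padding on one side
        double width = 2 * POINTS_PER_INCH;                         // fixed 2-inch wide region
        double height = labelBounds.getHeight() + 4;                // 2 points of padding on each side
        return new Rectangle2D.Double(x, y, width, height);
    }

    public static void main(String[] args) {
        // Made-up label bounds: x=100, y=200, 80 points wide, 12 points tall
        Rectangle2D label = new Rectangle2D.Double(100, 200, 80, 12);
        Rectangle2D.Double erase = eraseBoundsAfter(label);

        // For this label the region is x=181, y=198, width=144, height=16
        System.out.println(erase.getX() + ", " + erase.getY() + ", "
                + erase.getWidth() + ", " + erase.getHeight());
    }
}
```

Because the width is fixed at two inches, a value longer than that would only be partly covered, and a shorter one may take neighboring text with it. Adjust the width to fit the data being redacted.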

This is a screenshot of the output PDF:

This is a page where phone numbers were redacted. The black rectangles are the areas where the redaction annotations were burnt.