When trying to automate processes, it can be necessary to identify and extract specific text strings contained in PDF documents such as invoices, statements, medical records or other business documents.

Text is not structured in PDF documents, so the best way to do this is to identify text by its location on the PDF page.

Sample invoice where the invoice number and invoice date are located within specific rectangles on the page
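
To make the idea of identifying text by location concrete, here is a minimal sketch in plain Java. It is not how jPDFProcess works internally. The PositionedWord class, the textInArea helper and the sample coordinates are all hypothetical, and the sketch assumes each word is already available together with its bounding box in 72 dpi units.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class RegionTextSketch {

    // Hypothetical model: one word plus its bounding box on the page (72 dpi units).
    public static final class PositionedWord {
        public final String text;
        public final Rectangle2D bounds;

        public PositionedWord(String text, Rectangle2D bounds) {
            this.text = text;
            this.bounds = bounds;
        }
    }

    // Joins, in input order, the words whose bounding box lies entirely inside the area.
    public static String textInArea(List<PositionedWord> words, Rectangle2D area) {
        List<String> hits = new ArrayList<>();
        for (PositionedWord word : words) {
            if (area.contains(word.bounds)) {
                hits.add(word.text);
            }
        }
        return String.join(" ", hits);
    }

    public static void main(String[] args) {
        // Made-up word positions for illustration only.
        List<PositionedWord> words = List.of(
            new PositionedWord("Invoice", new Rectangle2D.Double(100, 150, 40, 10)),
            new PositionedWord("No.", new Rectangle2D.Double(145, 150, 20, 10)),
            new PositionedWord("INV-1042", new Rectangle2D.Double(170, 150, 50, 10)),
            new PositionedWord("Total", new Rectangle2D.Double(100, 400, 30, 10)));

        Rectangle2D area = new Rectangle2D.Double(100, 150, 250, 10);
        System.out.println(textInArea(words, area)); // prints "Invoice No. INV-1042"
    }
}
```

The sketch keeps only words that fit completely inside the rectangle, so a word that straddles the edge is dropped. That is a design choice of this illustration; a real extraction library may treat partially covered text differently.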

Here is sample code that extracts the text contained within a rectangle at a specific position on a PDF page. The sample uses Qoppa’s jPDFProcess library:

import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.Rectangle;
import com.qoppa.pdf.TextSelection;
import com.qoppa.pdfProcess.PDFDocument;
import com.qoppa.pdfProcess.PDFPage;

// Load the PDF document
PDFDocument pdfDoc = new PDFDocument ("C:\\doc.pdf", null);
// Get the first page (page indexes start at 0)
PDFPage page = pdfDoc.getPage(0);
// Define the position rectangle (x, y, width, height) for the text to be identified on the page
// These coordinates are in 72 dpi, i.e. one unit is 1/72 of an inch.
Rectangle rectangle = new Rectangle(100, 150, 250, 10);
// Get the text contained within the rectangle on the PDF page
TextSelection selection = page.getTextInArea(rectangle);
if (selection != null)
{
 System.out.println("Text found in the defined rectangle " + selection.getText());
}
// This will draw a red rectangle around the search area on the PDF page
// This is for debugging purposes only so you can see the rectangular area
// where the text is being extracted from
Graphics2D g2d = page.createGraphics();
g2d.setColor(Color.red);
g2d.draw(rectangle);
// Save the PDF with the red rectangle
pdfDoc.saveDocument ("C:\\doc_with_red_rectangle.pdf");
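
The sample above expresses its rectangle in 72 dpi units, while positions on a printed invoice are usually measured in inches or millimeters. The following is a small sketch of the arithmetic needed to convert between the two. The PageUnits class and its method names are hypothetical helpers, not part of jPDFProcess, and the sketch only rescales values: where the origin sits on the page is whatever the library uses.

```java
import java.awt.Rectangle;

public class PageUnits {

    // At 72 dpi, one unit is 1/72 of an inch.
    public static final double UNITS_PER_INCH = 72.0;
    public static final double MM_PER_INCH = 25.4;

    // Converts a length in inches to 72 dpi units.
    public static double inchesToUnits(double inches) {
        return inches * UNITS_PER_INCH;
    }

    // Converts a length in millimeters to 72 dpi units.
    public static double mmToUnits(double mm) {
        return mm / MM_PER_INCH * UNITS_PER_INCH;
    }

    // Builds a java.awt.Rectangle from a position and size given in inches,
    // rounding each value to the nearest whole unit.
    public static Rectangle rectFromInches(double x, double y, double width, double height) {
        return new Rectangle(
            (int) Math.round(inchesToUnits(x)),
            (int) Math.round(inchesToUnits(y)),
            (int) Math.round(inchesToUnits(width)),
            (int) Math.round(inchesToUnits(height)));
    }

    public static void main(String[] args) {
        // A field 1.5 in from the left and 2 in from the top, 3 in wide and 0.25 in tall
        // becomes x=108, y=144, width=216, height=18.
        Rectangle area = rectFromInches(1.5, 2.0, 3.0, 0.25);
        System.out.println(area.x + ", " + area.y + ", " + area.width + ", " + area.height);
    }
}
```

Because java.awt.Rectangle stores integers, rounding can shift an edge by up to half a unit. When a field sits close to neighboring text, it may help to pad the rectangle slightly and check the result with the red debugging outline.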

The returned TextSelection is an interface that describes the text found. It provides methods to get a Shape object and a quadrilateral that enclose the text selection on the page, as well as a method, getText(), to retrieve the selected text as a string.