When trying to automate processes, it can be necessary to identify and extract specific text strings contained in PDF documents such as invoices, statements, medical records or other business documents.
Text in PDF documents is not structured, so the best way to do this is to identify text by its location on the PDF page.
Here is sample code that extracts the text contained within a rectangle at a specific position on a PDF page. This sample uses Qoppa’s jPDFProcess library:
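To make the idea of location-based extraction concrete, here is a minimal, library-independent sketch. It assumes each word on the page is already available together with its bounding box. The `PositionedWord` class, the `textInArea` helper and the sample words are hypothetical and exist only for illustration. This is not how any particular PDF library implements it.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

class AreaTextFilter {

    // Hypothetical model of one word and its bounding box in page coordinates.
    static class PositionedWord {
        final String text;
        final Rectangle2D bounds;

        PositionedWord(String text, Rectangle2D bounds) {
            this.text = text;
            this.bounds = bounds;
        }
    }

    // Joins, in input order, the words whose boxes lie entirely inside the area.
    static String textInArea(List<PositionedWord> words, Rectangle2D area) {
        List<String> hits = new ArrayList<>();
        for (PositionedWord word : words) {
            if (area.contains(word.bounds)) {
                hits.add(word.text);
            }
        }
        return String.join(" ", hits);
    }

    public static void main(String[] args) {
        // Made-up sample data: two words inside the search area, one far below it.
        List<PositionedWord> words = new ArrayList<>();
        words.add(new PositionedWord("Invoice", new Rectangle2D.Double(100, 151, 40, 8)));
        words.add(new PositionedWord("#1042", new Rectangle2D.Double(145, 151, 30, 8)));
        words.add(new PositionedWord("Total", new Rectangle2D.Double(100, 400, 30, 8)));

        Rectangle2D area = new Rectangle2D.Double(100, 150, 250, 10);
        System.out.println(textInArea(words, area)); // prints "Invoice #1042"
    }
}
```

The sketch only keeps words that fall completely inside the rectangle. Whether partially overlapping text should count is a design choice, and a real library may decide it differently.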
// Load the PDF document
PDFDocument pdfDoc = new PDFDocument("C:\\doc.pdf", null);

// Get the first page
PDFPage page = pdfDoc.getPage(0);

// Define your position rectangle for the text to be identified on the page
// These coordinates are in 72 dpi.
Rectangle rectangle = new Rectangle(100, 150, 250, 10);

// Get text contained within a rectangle in a PDF page
TextSelection selection = page.getTextInArea(rectangle);
if (selection != null)
{
    System.out.println("Text found in the defined rectangle " + selection.getText());
}

// This will draw a red rectangle around the search area on the PDF page
// This is for debugging purposes only so you can see the rectangular area
// where the text is being extracted from
Graphics2D g2d = page.createGraphics();
g2d.setColor(Color.red);
g2d.draw(rectangle);

// Save the PDF with the red rectangle
pdfDoc.saveDocument("C:\\doc_with_red_rectangle.pdf");
The returned object is a TextSelection, an interface that describes the text found. It provides methods to get a Shape object and a quadrilateral that enclose the text selection on the page, as well as a method to retrieve the selected text as a string.
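Because the search rectangle in the sample is expressed in 72 dpi units (72 units per inch), a region measured on paper in inches or millimeters has to be converted first. The following is a small sketch of that arithmetic using only `java.awt.Rectangle`. The `PageUnits` class and its method names are hypothetical helpers, not part of jPDFProcess. The sketch only converts units and makes no assumption about where the page origin lies.

```java
import java.awt.Rectangle;

class PageUnits {

    static final double UNITS_PER_INCH = 72.0; // 72 dpi, as in the sample above
    static final double MM_PER_INCH = 25.4;

    // Converts a length in inches to 72 dpi units, rounded to the nearest integer.
    static int inchesToUnits(double inches) {
        return (int) Math.round(inches * UNITS_PER_INCH);
    }

    // Converts a length in millimeters to 72 dpi units, rounded to the nearest integer.
    static int mmToUnits(double mm) {
        return (int) Math.round(mm / MM_PER_INCH * UNITS_PER_INCH);
    }

    // Builds a search rectangle from a position and size given in inches.
    static Rectangle rectangleFromInches(double x, double y, double width, double height) {
        return new Rectangle(inchesToUnits(x), inchesToUnits(y),
                             inchesToUnits(width), inchesToUnits(height));
    }

    public static void main(String[] args) {
        // A region 1 inch from the left, 2 inches down, 3.5 inches wide, 0.25 inch tall.
        Rectangle r = rectangleFromInches(1.0, 2.0, 3.5, 0.25);
        // prints "72, 144, 252, 18"
        System.out.println(r.x + ", " + r.y + ", " + r.width + ", " + r.height);
    }
}
```

The resulting `Rectangle` could then be passed to a call such as `page.getTextInArea(rectangle)` from the sample. Rounding to integers can shift an edge by up to half a unit, so it may help to pad very tight regions slightly.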