Code Sample: Extract Words and Position in a PDF document in Java

This Java program extracts all the words in a PDF document together with their bounding boxes and echoes this information to the console. Each bounding box is a quadrilateral that describes the location of the word on its page as well as the word’s length and height.

// Load the PDF document
PDFText pdfText = new PDFText ("input.pdf", null);
 
// Loop through the PDF pages
for (int pageIx = 0; pageIx < pdfText.getPageCount(); ++pageIx)
{
  // Echo the page number (pageIx is zero-based, so add 1 for display)
  System.out.println ("\n***** Page " + (pageIx + 1) + " *****\n");
 
  // Get the words in the page and their position
  Vector wordList = pdfText.getWordsWithPositions(pageIx);
 
  // Echo each of the words in the page
  for (int wordIx = 0; wordIx < wordList.size(); ++wordIx)
  {
    // Echo the word text and its quadrilateral.
    // echoQuad is a helper that formats the quadrilateral as a string; it is not defined in this sample.
    TextPosition tp = (TextPosition)wordList.get(wordIx);
    System.out.println (tp.getText() + " - " + echoQuad (tp.getQuadrilateral()));
  }
}
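
The sample above calls echoQuad but does not show it. The following is a minimal sketch of what such a helper might look like. It assumes that getQuadrilateral() returns the corner points as an array of java.awt.geom.Point2D, which is not confirmed by this sample, and the class name QuadEcho is hypothetical. Check the jPDFText API documentation for the actual return type before using it.

```java
import java.awt.geom.Point2D;

public class QuadEcho
{
  // Formats the corner points of a quadrilateral as "(x1, y1) (x2, y2) ...".
  // Assumption: the quadrilateral is supplied as an array of Point2D corners.
  static String echoQuad (Point2D[] quad)
  {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < quad.length; ++i)
    {
      if (i > 0)
      {
        sb.append(' ');
      }
      sb.append('(').append(quad[i].getX()).append(", ").append(quad[i].getY()).append(')');
    }
    return sb.toString();
  }

  public static void main (String[] args)
  {
    // Made-up corner points for illustration only, not taken from a real PDF
    Point2D[] quad = {
      new Point2D.Double(72, 700), new Point2D.Double(110, 700),
      new Point2D.Double(110, 712), new Point2D.Double(72, 712)
    };
    // Prints: (72.0, 700.0) (110.0, 700.0) (110.0, 712.0) (72.0, 712.0)
    System.out.println(echoQuad(quad));
  }
}
```

Printing all four corners, rather than only a width and height, keeps the output meaningful for words on rotated or skewed text lines, where the bounding box is not an axis-aligned rectangle.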

Related Articles

Customizing separators when extracting words in a PDF using jPDFText

Extracting fields data and positions from invoices and statements using jPDFText

Check if a PDF file contains any text content

Code Sample: Extract text from each page on a PDF document (in Java)

jPDFText Java API

Code Sample: Extract Words from a PDF document in Java
