This Java program extracts all the words in a PDF document together with their bounding boxes and echoes this information to the console. Each bounding box is a quadrilateral that gives the location of the word on its page as well as the word’s length and height.

// Load the PDF document
PDFText pdfText = new PDFText ("input.pdf", null);
 
// Loop through the PDF pages
for (int pageIx = 0; pageIx < pdfText.getPageCount(); ++pageIx)
{
  // Echo page number (pageIx is zero-based, so add 1 for display)
  System.out.println ("\n***** Page " + (pageIx + 1) + " *****\n");
 
  // Get the words in the page and their position
  Vector wordList = pdfText.getWordsWithPositions(pageIx);
 
  // Echo each of the words in the document
  for (int wordIx = 0; wordIx < wordList.size(); ++wordIx)
  {
    // Echo the word text and its bounding quadrilateral
    // (echoQuad is a helper method defined elsewhere in the program)
    TextPosition tp = (TextPosition)wordList.get(wordIx);
    System.out.println (tp.getText() + " - " + echoQuad (tp.getQuadrilateral()));
  }
}
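
The snippet above calls echoQuad without showing it. The following is a minimal sketch of what such a helper might look like, not the version shipped in the download. It assumes the quadrilateral is an array of four java.awt.geom.Point2D corners listed in order around the shape, so that consecutive corners share an edge; check the library documentation for the actual return type and corner order. The class name QuadEcho and the method sideLength are hypothetical names used only for illustration.

```java
import java.awt.geom.Point2D;

public class QuadEcho
{
  // Hypothetical helper: formats each corner as "(x, y)".
  // Assumption: the quadrilateral is a Point2D array.
  static String echoQuad (Point2D[] quad)
  {
    StringBuilder sb = new StringBuilder ("[");
    for (int i = 0; i < quad.length; ++i)
    {
      if (i > 0)
      {
        sb.append (", ");
      }
      sb.append ("(").append (quad[i].getX()).append (", ").append (quad[i].getY()).append (")");
    }
    return sb.append ("]").toString();
  }

  // Distance between two corners. For adjacent corners this is
  // the length of one edge of the quadrilateral.
  static double sideLength (Point2D a, Point2D b)
  {
    return a.distance (b);
  }

  public static void main (String[] args)
  {
    // Sample corners made up for this example, not read from a PDF
    Point2D[] quad = {
      new Point2D.Double (10, 20), new Point2D.Double (50, 20),
      new Point2D.Double (50, 32), new Point2D.Double (10, 32) };

    System.out.println (echoQuad (quad));

    // Assumption: corners 0-1 run along the word and corners 1-2 across it
    System.out.println ("length = " + sideLength (quad[0], quad[1])
        + ", height = " + sideLength (quad[1], quad[2]));
  }
}
```

Measuring the edges with Point2D.distance rather than subtracting coordinates keeps the length and height correct for rotated text, where the quadrilateral is not aligned with the page axes.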

GetWordsAndPositions.java