This Java program extracts all the words in a PDF document together with their bounding boxes and echoes this information to the console. Each bounding box is a quadrilateral that gives the location of the word on its page as well as the word's length and height.
// Load the PDF document
PDFText pdfText = new PDFText ("input.pdf", null);

// Loop through the PDF pages
for (int pageIx = 0; pageIx < pdfText.getPageCount(); ++pageIx)
{
    // Echo page number
    System.out.println ("\n***** Page " + pageIx + " *****\n");

    // Get the words in the page and their position
    Vector wordList = pdfText.getWordsWithPositions(pageIx);

    // Echo each of the words in the document
    for (int wordIx = 0; wordIx < wordList.size(); ++wordIx)
    {
        // Echo the word information
        TextPosition tp = (TextPosition)wordList.get(wordIx);
        System.out.println (tp.getText() + " - " + echoQuad (tp.getQuadrilateral()));
    }
}
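The listing calls echoQuad, which is not shown. The following is a minimal sketch of what such a helper might look like. It assumes the quadrilateral arrives as an array of four java.awt.geom.Point2D vertices; that type, the class name QuadEcho, and the edgeLength helper are assumptions made for illustration and are not taken from the listing.

```java
import java.awt.geom.Point2D;
import java.util.Locale;

// Hypothetical helper class: the name and both methods are illustrative only.
final class QuadEcho
{
    // Format each vertex as "(x, y)" with two decimals, separated by spaces.
    // ASSUMPTION: the quadrilateral is an array of Point2D vertices.
    static String echoQuad (Point2D[] quad)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < quad.length; ++i)
        {
            if (i > 0)
            {
                sb.append (' ');
            }
            sb.append (String.format (Locale.ROOT, "(%.2f, %.2f)",
                                      quad[i].getX(), quad[i].getY()));
        }
        return sb.toString();
    }

    // Length of the edge from vertex edgeIx to the next vertex (wrapping around).
    static double edgeLength (Point2D[] quad, int edgeIx)
    {
        Point2D from = quad[edgeIx];
        Point2D to = quad[(edgeIx + 1) % quad.length];
        return from.distance (to);
    }

    public static void main (String[] args)
    {
        // Made-up sample quadrilateral, not data from a real PDF
        Point2D[] quad = {
            new Point2D.Double (10, 20),
            new Point2D.Double (60, 20),
            new Point2D.Double (60, 32),
            new Point2D.Double (10, 32)
        };
        System.out.println (echoQuad (quad));
        System.out.println ("edge 0: " + edgeLength (quad, 0));
        System.out.println ("edge 1: " + edgeLength (quad, 1));
    }
}
```

If the first edge happens to run along the word's baseline, edgeLength (quad, 0) would give the word's length and edgeLength (quad, 1) its height. The vertex order is an assumption here and should be checked against the library's documentation.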