Customizing separators when extracting words in a PDF using jPDFText

Q: I need to extract words and their positions from a PDF. I am using the getWordsWithPositions(int pageIndex) method in jPDFText. Having each word and its bounding quadrilateral is exactly what I need. However, I also need punctuation marks (i.e. , ; : .) to be included as part of the words, with their bounding information. Is there a […]


Extracting fields data and positions from invoices and statements using jPDFText

With the Java PDF library jPDFText, you can obtain strings and their positions from invoices and statements using the PDFText.getLinesWithPosition method. Knowing the rectangular coordinates and location of each text string lets you analyze the content of the invoice or statement and get data values for specific fields such as invoice date, customer name, customer address, […]
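As a rough illustration of this kind of position-based content analysis, the sketch below finds the value printed to the right of a field label on the same row. It does not call jPDFText: `TextLine`, `InvoiceFieldFinder`, `valueRightOf`, and the `rowTolerance` parameter are hypothetical names introduced here, and the sample coordinates are made up. In practice the text and rectangle of each line would come from the library's output.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

final class InvoiceFieldFinder {

    // Hypothetical stand-in for one extracted line and its bounding rectangle.
    // This is NOT a jPDFText class; adapt it to whatever the library returns.
    static final class TextLine {
        final String text;
        final Rectangle2D bounds;

        TextLine(String text, double x, double y, double width, double height) {
            this.text = text;
            this.bounds = new Rectangle2D.Double(x, y, width, height);
        }
    }

    // Returns the text of the closest line that starts to the right of the
    // label and sits on roughly the same row, or null if there is none.
    static String valueRightOf(List<TextLine> lines, String label, double rowTolerance) {
        TextLine labelLine = null;
        for (TextLine line : lines) {
            if (line.text.trim().equalsIgnoreCase(label)) {
                labelLine = line;
                break;
            }
        }
        if (labelLine == null) {
            return null;
        }

        TextLine best = null;
        for (TextLine line : lines) {
            if (line == labelLine) {
                continue;
            }
            double rowOffset = Math.abs(line.bounds.getCenterY() - labelLine.bounds.getCenterY());
            boolean sameRow = rowOffset <= rowTolerance;
            boolean toTheRight = line.bounds.getMinX() >= labelLine.bounds.getMaxX();
            if (sameRow && toTheRight
                    && (best == null || line.bounds.getMinX() < best.bounds.getMinX())) {
                best = line;
            }
        }
        return best == null ? null : best.text.trim();
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for lines extracted from an invoice.
        List<TextLine> lines = new ArrayList<>();
        lines.add(new TextLine("Invoice Date:", 50, 100, 80, 12));
        lines.add(new TextLine("2024-03-15", 140, 100, 60, 12));
        lines.add(new TextLine("Customer Name:", 50, 120, 90, 12));
        lines.add(new TextLine("Acme Corp", 150, 121, 55, 12));

        System.out.println(valueRightOf(lines, "Invoice Date:", 3.0));
        System.out.println(valueRightOf(lines, "Customer Name:", 3.0));
    }
}
```

Matching on the vertical center with a small tolerance, rather than on exact coordinates, is one way to cope with labels and values whose baselines differ slightly. Real invoices may also place values below the label, which this sketch does not handle.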


Check if a PDF file contains any text content

Here is a Java sample program that uses Qoppa’s jPDFText library to determine whether a PDF file contains any text content. The method “findTextInPDF” returns true if text was found on any page in the PDF, and false if no text was found on any page. public static boolean findTextInPDF(String absoluteFilePath) throws PDFException, FileNotFoundException, IOException […]


jPDFText Java API

Q: Where can I find the jPDFText Javadoc API? A: You can find the API specification for the latest version of our library jPDFText on our website at this link. jPDFText is a Java library for extracting text and words from PDF documents.


Code Sample: Extract Words from a PDF document in Java

Java program that gets all the words in a PDF document and echoes them to the console using Qoppa’s library jPDFText. // Load the document PDFText pdfText = new PDFText ("input.pdf", null); // Get the words in the document Vector wordList = pdfText.getWords(); // Echo the words for (int wordIx = 0; wordIx < wordList.size(); […]


Code Sample: Extract Words and Position in a PDF document in Java

Java program that extracts all the words in a PDF document with their bounding boxes (as quadrilaterals) and echoes this information to the console. The bounding box is a quadrilateral that gives the location of the word on each page as well as the word’s length and height. // Load the […]
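To make the relationship between a bounding quadrilateral and a word's length and height concrete, here is a minimal sketch that derives both from four corner points. It does not use jPDFText: the `QuadMetrics` class and the assumed corner order (start of baseline, end of baseline, then the two upper corners) are illustrative assumptions, so check the library's Javadoc for how it actually orders the corners.

```java
import java.awt.geom.Point2D;

final class QuadMetrics {

    // Assumed corner order (hypothetical, not taken from jPDFText):
    // quad[0] = start of baseline, quad[1] = end of baseline,
    // quad[2] = upper corner above the end, quad[3] = upper corner above the start.

    // Length of the word, measured along the baseline edge.
    static double length(Point2D[] quad) {
        return quad[0].distance(quad[1]);
    }

    // Height of the word, measured along the edge rising from the baseline start.
    static double height(Point2D[] quad) {
        return quad[0].distance(quad[3]);
    }

    // Rotation of the baseline in degrees; 0 for ordinary horizontal text.
    static double angleDegrees(Point2D[] quad) {
        double dx = quad[1].getX() - quad[0].getX();
        double dy = quad[1].getY() - quad[0].getY();
        return Math.toDegrees(Math.atan2(dy, dx));
    }

    public static void main(String[] args) {
        // Made-up corners for a horizontal word 30 units long and 10 units high.
        Point2D[] quad = {
            new Point2D.Double(100, 200),
            new Point2D.Double(130, 200),
            new Point2D.Double(130, 210),
            new Point2D.Double(100, 210)
        };
        System.out.println("length = " + length(quad));
        System.out.println("height = " + height(quad));
        System.out.println("angle  = " + angleDegrees(quad));
    }
}
```

Measuring along the edges rather than taking an axis-aligned width and height keeps the values meaningful for rotated text, which is the usual reason for reporting a quadrilateral instead of a rectangle.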
