Q: I need to extract words and their positions from a PDF. I am using the getWordsWithPositions(int pageIndex) method in jPDFText. Having each word and its bounding quadrilateral is exactly what I need. However, I also need punctuation marks (i.e. , ; : .) to be included as part of the words with bounding information. Is there a […]
Category: jPDFText: Extract Text From PDFs
Developer library to extract text from PDF documents in Java.
Extracting field data and positions from invoices and statements using jPDFText
With the Java PDF library jPDFText, you can obtain strings and positions from invoices and statements using the PDFText.getLinesWithPosition method. Knowing the rectangular coordinates and location of each text string allows you to do content analysis of the invoice or statement and get data values for specific fields such as invoice date, customer name, customer address, […]
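To illustrate the kind of content analysis described above, here is a minimal sketch that finds the value printed to the right of a field label, using only the bounding rectangles of the strings. It does not call jPDFText: the PositionedText class, the valueRightOf method and the sample coordinates are all hypothetical and only stand in for whatever strings and rectangles the library returns. It also assumes that a label and its value sit on the same row, which will not hold for every invoice layout.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;
import java.util.Optional;

class InvoiceFieldFinder {

    // Hypothetical holder for one extracted string and its bounding rectangle.
    // This is NOT a jPDFText class; it only models "text plus position".
    static final class PositionedText {
        final String text;
        final Rectangle2D bounds;

        PositionedText(String text, Rectangle2D bounds) {
            this.text = text;
            this.bounds = bounds;
        }
    }

    // Returns the text of the nearest item that lies to the right of the
    // label and overlaps it vertically (i.e. sits on the same row).
    static Optional<String> valueRightOf(List<PositionedText> items, String label) {
        // Locate the label itself (case-insensitive, ignoring surrounding spaces).
        PositionedText anchor = null;
        for (PositionedText item : items) {
            if (item.text.trim().equalsIgnoreCase(label)) {
                anchor = item;
                break;
            }
        }
        if (anchor == null) {
            return Optional.empty();
        }

        // Among items on the same row and to the right, keep the closest one.
        PositionedText best = null;
        for (PositionedText item : items) {
            if (item == anchor) {
                continue;
            }
            Rectangle2D a = anchor.bounds;
            Rectangle2D b = item.bounds;
            boolean sameRow = b.getMinY() < a.getMaxY() && b.getMaxY() > a.getMinY();
            boolean toRight = b.getMinX() >= a.getMaxX();
            if (sameRow && toRight
                    && (best == null || b.getMinX() < best.bounds.getMinX())) {
                best = item;
            }
        }
        return best == null ? Optional.empty() : Optional.of(best.text);
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for strings extracted from one page.
        List<PositionedText> items = List.of(
            new PositionedText("Invoice Date:", new Rectangle2D.Double(50, 100, 80, 12)),
            new PositionedText("2024-01-15", new Rectangle2D.Double(140, 100, 60, 12)),
            new PositionedText("Customer Name:", new Rectangle2D.Double(50, 120, 90, 12)),
            new PositionedText("Acme Corp", new Rectangle2D.Double(150, 120, 55, 12)));

        System.out.println(valueRightOf(items, "Invoice Date:").orElse("(not found)"));
        System.out.println(valueRightOf(items, "Customer Name:").orElse("(not found)"));
    }
}
```

In practice the list would be filled from the strings and rectangles returned by the library for a page, and the row-overlap test may need a tolerance when labels and values are not perfectly aligned.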
Check if a PDF file contains any text content
Here is a Java sample program that uses Qoppa’s jPDFText library to determine if a PDF file contains any text content. The method “findTextInPDF” will return true if text was found on any page in the PDF, false if no text was found on any page. public static boolean findTextInPDF(String absoluteFilePath) throws PDFException, FileNotFoundException, IOException […]
Code Sample: Extract text from each page on a PDF document (in Java)
Java program that extracts the text for each page in a PDF document and writes it to a file using Qoppa’s library jPDFText. // Load the document PDFText pdfText = new PDFText ("input.pdf", null); // Loop through the pages for (int pageIx = 0; pageIx < pdfText.getPageCount(); ++pageIx) { // Get the text for […]
jPDFText Java API
Q: Where can I find the jPDFText javadoc API? A: You can find the API specification for the latest version of our library jPDFText on our website at this link. jPDFText is a Java library to extract text and words from PDF documents.
Code Sample: Extract Words from a PDF document in Java
Java program that gets all the words in a PDF document and echoes them to the console using Qoppa’s library jPDFText. // Load the document PDFText pdfText = new PDFText ("input.pdf", null); // Get the words in the document Vector wordList = pdfText.getWords(); // Echo the words for (int wordIx = 0; wordIx < wordList.size(); […]
Code Sample: Extract Words and Position in a PDF document in Java
Java program that extracts all the words in a PDF document with their bounding box (as a quadrilateral) and echoes this information to the console. The bounding box is a quadrilateral which gives information about the location of the word on each page as well as the word’s length and height. // Load the […]
Code Sample: Extract text from a PDF document into a text file in Java
Simple Java program that extracts the entire text from a PDF document as a single String and then saves the text to a file using Qoppa’s library jPDFText.