Java Sample Code to Recognize (OCR) and Add Text to a PDF Document

Here is a short Java program that uses Qoppa’s PDF library jPDFProcess together with the Tesseract libraries to recognize text in a PDF and add it as invisible text to each page:

// Imports for the jPDFProcess and OCR classes used below
import com.qoppa.ocr.OCRBridge;
import com.qoppa.ocr.TessJNI;
import com.qoppa.pdfProcess.PDFDocument;
import com.qoppa.pdfProcess.PDFPage;

// Load a PDF that contains scanned pages that need OCR
PDFDocument pdfDoc = new PDFDocument("C:/test/test.pdf", null);
// Initialize the OCR bridge with the Tesseract libpath and datapath
OCRBridge.initialize("C:/test/tess", "C:/test/tess/tessdata");
// Create the Tesseract OCR engine
TessJNI ocr = new TessJNI();
// Loop through all pages
for (int count = 0; count < pdfDoc.getPageCount(); ++count)
{
  PDFPage page = pdfDoc.getPage(count);
  // Perform OCR on the current page using the "eng" (English) language data.
  // 300 is the resolution, in DPI, at which the page is rendered
  // before it is sent to the OCR engine.
  String pageOCR = ocr.performOCR("eng", page, 300);
  // Insert the OCR result into the current page
  page.insert_hOCR(pageOCR, true);
}
// Save the OCRed PDF document
pdfDoc.saveDocument("C:/test/test_ocr.pdf");
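
To get a feel for what the 300 DPI rendering parameter means in practice, the following minimal sketch computes the size in pixels of the image that would be rendered for a page. It assumes the default PDF unit of 72 points per inch; the class RenderSize and its pixels method are hypothetical helpers written for this illustration and are not part of jPDFProcess.

```java
public class RenderSize {
    // Default PDF user space: 72 points per inch (assumption stated above).
    static final double POINTS_PER_INCH = 72.0;

    // Converts a page dimension in points to pixels at the given DPI.
    public static int pixels(double points, int dpi) {
        return (int) Math.round(points / POINTS_PER_INCH * dpi);
    }

    public static void main(String[] args) {
        // A US Letter page measures 612 x 792 points (8.5 x 11 inches).
        int width = pixels(612.0, 300);
        int height = pixels(792.0, 300);
        // prints: 2550 x 3300 pixels
        System.out.println(width + " x " + height + " pixels");
    }
}
```

A higher DPI gives the OCR engine more detail to work with, but the rendered image grows with the square of the resolution, so memory use and processing time per page increase accordingly.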

For the code above to run, you first need to read this KB article and set up the two folders below:

“C:/test/tess”: the libpath containing Tesseract libraries
“C:/test/tess/tessdata”: the datapath containing Tesseract languages
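
Before calling OCRBridge.initialize, it can help to confirm that both folders are in place. The sketch below is one possible check, not part of jPDFProcess: TessSetupCheck and findProblem are hypothetical names, and the check assumes Tesseract's usual naming of language files as <language>.traineddata (for example eng.traineddata) inside the datapath.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TessSetupCheck {
    // Returns null when the setup looks usable, otherwise a description
    // of the first problem found.
    public static String findProblem(Path libPath, Path dataPath, String lang) {
        if (!Files.isDirectory(libPath)) {
            return "libpath is not a directory: " + libPath;
        }
        if (!Files.isDirectory(dataPath)) {
            return "datapath is not a directory: " + dataPath;
        }
        // Assumption: language data is stored as <lang>.traineddata
        Path trained = dataPath.resolve(lang + ".traineddata");
        if (!Files.isRegularFile(trained)) {
            return "missing language file: " + trained;
        }
        return null;
    }

    public static void main(String[] args) {
        // Same paths and language as in the sample code above
        String problem = findProblem(
                Paths.get("C:/test/tess"),
                Paths.get("C:/test/tess/tessdata"),
                "eng");
        System.out.println(problem == null ? "Tesseract folders look ready" : problem);
    }
}
```

This only checks that the folders and the language file exist; it does not verify that the native Tesseract libraries in the libpath match your operating system and JVM architecture.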

Download Full Java Sample PDF OCR Program
