Here is a small Java program that uses Qoppa’s PDF library jPDFProcess and the Tesseract libraries to recognize text in a PDF and add it as invisible text on each page:

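// Imports (assuming the standard Qoppa package names)
import com.qoppa.ocr.OCRBridge;
import com.qoppa.ocr.TessJNI;
import com.qoppa.pdfProcess.PDFDocument;

// Load a PDF that contains scanned pages needing to be OCRed
PDFDocument pdfDoc = new PDFDocument("C:/test/test.pdf", null);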
// initialize the OCR bridge with the Tesseract libpath and the datapath
OCRBridge.initialize("C:/test/tess", "C:/test/tess/tessdata");
// start OCR
TessJNI ocr = new TessJNI();
// loop through all pages
for (int count = 0; count < pdfDoc.getPageCount(); ++count)
{
  // perform OCR on the current PDF page
  // 300 is the DPI resolution parameter at which to render the page
  // when sending it to the OCR engine
  String pageOCR = ocr.performOCR("eng", pdfDoc.getPage(count), 300);
  // insert OCR text (hOCR output) into the current PDF page
  pdfDoc.getPage(count).insert_hOCR(pageOCR, true);
}
// Save the OCRed PDF document (the output path here is only an example)
pdfDoc.saveDocument("C:/test/test_ocr.pdf");

For the code above to run, you will need to have read this KB article and set up the two folders below:

  • “C:/test/tess”: the libpath containing Tesseract libraries
  • “C:/test/tess/tessdata”: the datapath containing Tesseract languages
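
Before initializing the OCR bridge, it can help to confirm that both folders are in place. The following is a minimal sketch, not part of the Qoppa API: the class name TessFolderCheck and its check method are hypothetical helpers that use only the standard Java library. It assumes the usual Tesseract convention that each language is stored in the datapath as a file named <language>.traineddata (for example eng.traineddata). It does not verify that the native libraries in the libpath match your platform.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of jPDFProcess): checks the folder layout
// described above before OCRBridge.initialize is called.
public class TessFolderCheck {

    // Returns a list of problems found; an empty list means the layout looks usable.
    public static List<String> check(Path libPath, Path dataPath, String language) {
        List<String> problems = new ArrayList<>();
        if (!Files.isDirectory(libPath)) {
            problems.add("libpath is not a directory: " + libPath);
        }
        if (!Files.isDirectory(dataPath)) {
            problems.add("datapath is not a directory: " + dataPath);
        } else if (!Files.isRegularFile(dataPath.resolve(language + ".traineddata"))) {
            // Assumption: Tesseract language files are named <language>.traineddata
            problems.add("missing language file: " + language + ".traineddata");
        }
        return problems;
    }

    public static void main(String[] args) {
        // Same paths as in the sample above
        List<String> problems = check(
                Paths.get("C:/test/tess"),
                Paths.get("C:/test/tess/tessdata"),
                "eng");
        if (problems.isEmpty()) {
            System.out.println("Tesseract folders look ready.");
        } else {
            problems.forEach(System.err::println);
        }
    }
}
```

Running this check first turns a missing folder or language file into a readable message rather than a failure inside the OCR engine.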

Download Full Java Sample PDF OCR Program