Here is a small Java program that uses Qoppa’s PDF library jPDFProcess together with the Tesseract libraries to recognize text in a scanned PDF and add it as invisible text on each page:
// Load a PDF that contains scanned pages needing to be OCRed
PDFDocument pdfDoc = new PDFDocument("C:/test/test.pdf", null);

// Initialize the OCR bridge with the Tesseract libpath and the datapath
OCRBridge.initialize("C:/test/tess", "C:/test/tess/tessdata");

// Start OCR
TessJNI ocr = new TessJNI();

// Loop through all pages
for (int count = 0; count < pdfDoc.getPageCount(); ++count)
{
    // Perform OCR on the current PDF page.
    // 300 is the DPI resolution at which to render the page
    // when sending it to the OCR engine.
    String pageOCR = ocr.performOCR("eng", pdfDoc.getPage(count), 300);

    // Insert the OCR text into the current PDF page
    pdfDoc.getPage(count).insert_hOCR(pageOCR, true);
}

// Save the OCRed PDF document
pdfDoc.saveDocument("C:/test/test_ocr.pdf");
For the code above to run, you will need to have read this KB article and set up the two folders below:
- “C:/test/tess”: the libpath containing Tesseract libraries
- “C:/test/tess/tessdata”: the datapath containing Tesseract languages
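Because a missing or misplaced folder is an easy mistake to make, it can help to check both paths before initializing the OCR bridge. The sketch below uses only the standard Java library and is not part of jPDFProcess. The class name `TessPathCheck` and the method `findProblems` are hypothetical names chosen for illustration. It also assumes the usual Tesseract convention that the data for a language such as "eng" is stored in a file named `eng.traineddata` inside the datapath.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of jPDFProcess): checks the two folders
// described above before they are handed to the OCR bridge.
class TessPathCheck {

    // Returns a list of human-readable problems; an empty list means
    // both folders look usable for the given language code.
    static List<String> findProblems(Path libPath, Path dataPath, String language) {
        List<String> problems = new ArrayList<>();

        if (!Files.isDirectory(libPath)) {
            problems.add("libpath is not a directory: " + libPath);
        }

        if (!Files.isDirectory(dataPath)) {
            problems.add("datapath is not a directory: " + dataPath);
        } else {
            // Assumption: Tesseract language data is named <language>.traineddata
            Path languageFile = dataPath.resolve(language + ".traineddata");
            if (!Files.isRegularFile(languageFile)) {
                problems.add("language file not found: " + languageFile);
            }
        }

        return problems;
    }

    public static void main(String[] args) {
        // Defaults mirror the paths used in the sample above
        Path libPath = Paths.get(args.length > 0 ? args[0] : "C:/test/tess");
        Path dataPath = Paths.get(args.length > 1 ? args[1] : "C:/test/tess/tessdata");

        List<String> problems = findProblems(libPath, dataPath, "eng");
        if (problems.isEmpty()) {
            System.out.println("Tesseract folders look OK.");
        } else {
            problems.forEach(System.out::println);
        }
    }
}
```

Running this check before the call to OCRBridge.initialize reports a wrong path in plain terms, which is usually easier to act on than a failure surfacing later from the native libraries.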