jPDFProcess, Qoppa’s Java PDF creation and manipulation library, has an OCR module. Please contact us regarding licensing this additional feature.

How to Activate / Implement OCR

To get started, you can download the latest jPDFProcess version from here:
https://www.qoppa.com/pdfprocess/demo/download
 
And the JNI native bridge files from here:
https://www.qoppa.com/files/pdfprocess/ocr/libtessjni411.zip

The JNI zip file contains builds of the native libraries for Windows, Linux and macOS, in 32-bit and 64-bit versions. At runtime, these native libraries need to be present on the machine that is running the software.

If you are running in an application, you can bundle the native libraries in your installation. If you are running in an applet, you probably want to get these files from a server on demand: when the user chooses to use OCR, the applet can download the file matching the OS and bitness from your web server to a local folder.
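As a rough illustration of picking the file that matches the OS and bitness, the sketch below maps the standard os.name and os.arch system properties to a folder name. The class OcrNativeFolder and the folder names (win64, linux32 and so on) are hypothetical and are not the layout of the JNI zip file; adjust them to however you organize the native libraries.

```java
import java.util.Locale;

/** Hypothetical helper: maps JVM system properties to a native-library folder name. */
final class OcrNativeFolder {
    private OcrNativeFolder() {}

    /** Returns an illustrative folder name such as "win64" (naming is an assumption). */
    static String resolve(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);

        String family;
        // Check for macOS first because "darwin" also contains "win".
        if (os.contains("mac") || os.contains("darwin")) {
            family = "mac";
        } else if (os.contains("win")) {
            family = "win";
        } else if (os.contains("linux")) {
            family = "linux";
        } else {
            throw new IllegalArgumentException("Unsupported OS: " + osName);
        }

        // "amd64", "x86_64" and "aarch64" contain "64"; "x86" and "i386" do not.
        String bits = arch.contains("64") ? "64" : "32";
        return family + bits;
    }

    /** Resolves the folder name for the JVM that is currently running. */
    static String resolveForCurrentJvm() {
        return resolve(System.getProperty("os.name"), System.getProperty("os.arch"));
    }
}
```

The os.arch property reports the architecture of the running JVM rather than of the operating system, which is the value that matters when loading JNI libraries.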

Additionally, you will need the OCR language data files, which can be downloaded from here:
https://kbdeveloper.qoppa.com/ocr-language-download-links/

The language files are compressed for efficiency. To use them, uncompress them into a folder and tell jPDFProcess where that folder is located when initializing OCR. The local machine only needs the language files for the languages that you want to support.

To activate the OCR functionality, call OCRBridge.initialize() with the paths to these two directories.

OCRBridge.initialize(String libraryPath, String dataPath);

    • libraryPath is the path to the folder where the native libraries (the Tesseract libraries) are located
    • dataPath is the path to the folder where the uncompressed OCR language files are located (the English files in this example)
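Before calling OCRBridge.initialize(), it can help to confirm that both folders exist and that the language you plan to use is present. The sketch below is one way to do that with plain java.io. The class OcrPathCheck is hypothetical, and it assumes the language files follow the usual Tesseract naming of <language>.traineddata (for example eng.traineddata).

```java
import java.io.File;

/** Hypothetical pre-flight check for the two folders passed to OCRBridge.initialize(). */
final class OcrPathCheck {
    private OcrPathCheck() {}

    /**
     * Returns a description of the first problem found, or null if both folders
     * look usable. Assumes language files are named "<language>.traineddata".
     */
    static String describeProblem(File libraryDir, File dataDir, String language) {
        if (!libraryDir.isDirectory()) {
            return "Native library folder not found: " + libraryDir;
        }
        if (!dataDir.isDirectory()) {
            return "Language data folder not found: " + dataDir;
        }
        File trained = new File(dataDir, language + ".traineddata");
        if (!trained.isFile()) {
            return "Missing language file: " + trained;
        }
        return null;
    }
}
```

For example, describeProblem(new File("C:/test/tess"), new File("C:/test/tess/tessdata"), "eng") returns null only when both folders exist and the English data file is in place.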

Once the bridge is initialized, you can run OCR on each page through a TessJNI instance and feed the results back into the document with jPDFProcess.

// Load a PDF that contains scanned pages needing to be OCRed
PDFDocument pdfDoc = new PDFDocument("C:/test/test.pdf", null);
// Initialize the OCR bridge with the Tesseract library path and the data path
OCRBridge.initialize("C:/test/tess", "C:/test/tess/tessdata");
TessJNI ocr = new TessJNI();
for (int count = 0; count < pdfDoc.getPageCount(); ++count)
{
    // Run OCR on the page in English at 300 DPI, then insert the hOCR result into the page
    String pageOCR = ocr.performOCR("eng", pdfDoc.getPage(count), 300);
    pdfDoc.getPage(count).insert_hOCR(pageOCR, true);
}

Download Full Java OCR sample program that shows this.

Additional Languages

Additional languages, including non-Latin and CJK languages, can be downloaded from the OCR Language Download Links page (https://kbdeveloper.qoppa.com/ocr-language-download-links/).

Extract the archives and place all files for a language in the “tessdata” directory. Add entries to languages.xml so that the language prefix (for example, eng) is converted for display in the language combo box.
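To check which languages ended up in the “tessdata” directory after extraction, a small scan of the folder is enough. The sketch below is only an illustration: the class InstalledOcrLanguages is hypothetical, and it assumes each language is identified by a file named <prefix>.traineddata, as is usual for Tesseract data.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: lists language prefixes found in a tessdata folder. */
final class InstalledOcrLanguages {
    private static final String SUFFIX = ".traineddata";

    private InstalledOcrLanguages() {}

    /** Returns the sorted prefixes of all "<prefix>.traineddata" files in the folder. */
    static List<String> list(File tessdataDir) {
        List<String> prefixes = new ArrayList<>();
        File[] files = tessdataDir.listFiles();
        if (files == null) {
            // Not a directory, or it could not be read.
            return prefixes;
        }
        for (File file : files) {
            String name = file.getName();
            if (file.isFile() && name.endsWith(SUFFIX)) {
                prefixes.add(name.substring(0, name.length() - SUFFIX.length()));
            }
        }
        Collections.sort(prefixes);
        return prefixes;
    }
}
```

The returned prefixes are the values that would be passed as the language argument to performOCR, such as "eng" in the example above.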
