jPDFProcess, Qoppa’s Java PDF creation and manipulation library, has an OCR module. Please contact us regarding licensing this additional feature.

How to Activate / Implement OCR

To get started, you can download:

  • the latest jPDFProcess version from our standard download page
  • the JNI native bridge files from here:

http://www.qoppa.com/files/pdfprocess/ocr/libtessjni.zip

The JNI zip file contains the native library builds for Windows, Linux and Mac OS X, each in 32-bit and 64-bit versions. At runtime, the native libraries matching the operating system and bitness must be present on the machine that is running the software.

If you are running in an application, you can bundle the native libraries in your installation. If you are running in an applet, you probably want to get these files from a server on demand: when the user chooses to use OCR, the applet can download the appropriate file for the OS and bitness from your web server to a local folder.
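Picking "the appropriate file for the OS and bitness" can be sketched with standard Java system properties. This is only an illustration: the class name and the folder names it returns (such as windows64) are hypothetical and do not describe the actual layout of libtessjni.zip, so adapt them to the files you find in the archive.

```java
import java.util.Locale;

class NativeLibSelector {
    /** Maps an os.name value to "windows", "mac" or "linux" (the fallback). */
    static String osFamily(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Check Mac first: "darwin" contains the substring "win".
        if (os.contains("mac") || os.contains("darwin")) return "mac";
        if (os.contains("win")) return "windows";
        return "linux";
    }

    /** Returns 64 when the os.arch value names a 64-bit JVM, otherwise 32. */
    static int bitness(String osArch) {
        return osArch.contains("64") ? 64 : 32;
    }

    /** Hypothetical folder name for the native build, e.g. "windows64". */
    static String folderFor(String osName, String osArch) {
        return osFamily(osName) + bitness(osArch);
    }

    public static void main(String[] args) {
        // os.arch reflects the bitness of the running JVM, which is what
        // matters when loading a JNI library.
        String folder = folderFor(System.getProperty("os.name"),
                                  System.getProperty("os.arch"));
        System.out.println("Native library folder to use: " + folder);
    }
}
```

The resulting name could then be used to choose which file to bundle or download before OCR is activated.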

  • the OCR language files from here:

http://www.qoppa.com/files/pdfprocess/ocr/tesslang.zip

The language zip file contains language files for English, German, French, Spanish and Italian. The files inside the zip file come directly from the Tesseract project site. They are archive files, one for each language, which you will need to uncompress so that jPDFProcess can use them. On the local machine, you only need the language files for the languages that you want to support. In an applet, you should probably also install these on demand by having the applet download the files from your server when necessary.

To activate the OCR functionality, call OCRBridge.initialize() with the paths to these two directories.

OCRBridge.initialize(String libraryPath, String dataPath);

  • libraryPath is the path to the folder where the native libraries are located
  • dataPath is the path to the folder where the (uncompressed) OCR language files are located
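Since both arguments are folder paths, it can help to check them before activation. The following is a minimal sketch using only the standard library; the class and helper names are hypothetical, and the OCRBridge.initialize() call is left as a comment so that the example runs without jPDFProcess on the classpath.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class OcrPathCheck {
    /** Returns the absolute path if dir is an existing directory, otherwise throws. */
    static String requireDirectory(String label, String dir) {
        Path p = Paths.get(dir).toAbsolutePath();
        if (!Files.isDirectory(p)) {
            throw new IllegalArgumentException(label + " is not a directory: " + p);
        }
        return p.toString();
    }

    public static void main(String[] args) {
        // Defaults to the current directory when no arguments are given.
        String libraryPath = requireDirectory("libraryPath", args.length > 0 ? args[0] : ".");
        String dataPath = requireDirectory("dataPath", args.length > 1 ? args[1] : ".");

        // With jPDFProcess available, the call from the text would follow here:
        // OCRBridge.initialize(libraryPath, dataPath);
        System.out.println("libraryPath = " + libraryPath);
        System.out.println("dataPath    = " + dataPath);
    }
}
```

Failing early with a clear message is usually easier to diagnose than an error raised later while the native library is being loaded.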

You can then run OCR through a TessJNI object and feed the results to jPDFProcess by calling insert_hOCR() on each page.

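// pdf is a document that has already been loaded with jPDFProcess
TessJNI ocr = new TessJNI();
for (int count = 0; count < pdf.getPageCount(); ++count)
{
    // Run OCR on the page using the English language files ("eng")
    String pageOCR = ocr.performOCR("eng", pdf.getPage(count), 300);
    // Insert the OCR result into the same page
    pdf.getPage(count).insert_hOCR(pageOCR, true);
}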

Download the OCR sample (TestOCR.java) that shows this.

Additional Languages

Additional languages, including non-Latin and CJK languages, can be downloaded from OCR Language Download Links.

Extract the archives and place all files for a language in the “tessdata” directory. Add entries to languages.xml so that the language prefix is converted to a language name in the language combo box.
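To see which languages are in place after extraction, the “tessdata” directory can be scanned. The sketch below assumes each language ships a file named <prefix>.traineddata (the Tesseract 3 naming convention, for example eng.traineddata). That naming is an assumption here and may not match every archive. The class name is hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class InstalledLanguages {
    // Assumed file suffix for extracted Tesseract language data.
    static final String SUFFIX = ".traineddata";

    /** Returns the sorted language prefixes found in the given directory. */
    static List<String> list(Path tessdataDir) throws IOException {
        List<String> prefixes = new ArrayList<>();
        try (DirectoryStream<Path> files =
                 Files.newDirectoryStream(tessdataDir, "*" + SUFFIX)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                // Strip the suffix to get the prefix, e.g. "eng".
                prefixes.add(name.substring(0, name.length() - SUFFIX.length()));
            }
        }
        Collections.sort(prefixes);
        return prefixes;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : "tessdata");
        if (!Files.isDirectory(dir)) {
            System.out.println("Not a directory: " + dir.toAbsolutePath());
            return;
        }
        System.out.println("Languages found: " + list(dir));
    }
}
```

The prefixes returned are the values that would be passed as the language argument to performOCR(), such as "eng" in the example above.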

