Qoppa offers a PDF OCR solution for Java which supports most languages, including English, German, French, and Spanish as well as Chinese, Japanese and Korean. It is available for Windows®, Mac OS X® and Linux®, in 32 and 64 bit. This is a clean, production-level Java integration of the well-known Tesseract engine with Qoppa’s own advanced PDF rendering and editing technology.

Qoppa’s OCR solution lets you add searchable text directly to existing PDF documents, or create PDF documents from images and then add text to them. Typically, PDF documents needing OCR are scanned documents or documents containing images (JPG, TIFF or PNG).

OCR is available as an optional module to the following Java library and component products:


Easy OCR Evaluation / Demo:

Before integrating OCR into your Java product, you may first wish to evaluate the accuracy of our OCR function. We have released the OCR function in our desktop end-user tool, PDF Studio. Try the free demo of PDF Studio. After installing, launch PDF Studio, open the PDF document that you wish to process, and look under Document -> OCR – Create Searchable PDFs in the top menu.

hOCR Format and Tesseract Integration:

Qoppa has not developed its own text recognition engine. This is a huge project on its own. We tried to find a 100% Java OCR solution but were unable to find one that was production level.

As we evaluated other non-Java OCR engines, both commercial and open source, we found Tesseract to be a very solid option. Tesseract was originally developed by HP, which released it to the community as an open source project. Google has been sponsoring the project since 2006. We have released this OCR integration in our desktop tool PDF Studio and have received great feedback from our end-users.

Tesseract is developed in native C and requires a JNI bridge to connect from Java. So our OCR solution is not 100% Java when it comes to communicating with the OCR engine. Everything else in Qoppa’s PDF libraries and components is 100% Java, i.e., conversion from PDF to images and adding the recognized text to the PDF. Our API works with the hOCR format, the file format representing OCR output.

Our API can take hOCR input and add the text to PDF documents. Because of this, it is possible to substitute Tesseract with another OCR engine that produces hOCR output.
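To make the hOCR format more concrete: hOCR is HTML in which each recognized word is typically a span with class ocrx_word and a title attribute carrying its bounding box in image pixels. The following is a minimal sketch, using only the Java standard library, of how such word boxes could be read from an hOCR string. It is not Qoppa’s API or implementation; the class name HocrWords and its methods are hypothetical, and the regular expression assumes the attribute order Tesseract commonly emits (class before title).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of any Qoppa product) that extracts word boxes from hOCR. */
final class HocrWords {

    /** One recognized word and its bounding box in image pixels, origin at the top-left. */
    static final class Word {
        final String text;
        final int x0, y0, x1, y1;

        Word(String text, int x0, int y0, int x1, int y1) {
            this.text = text;
            this.x0 = x0;
            this.y0 = y0;
            this.x1 = x1;
            this.y1 = y1;
        }
    }

    // Matches <span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>text</span>.
    // Assumption: the class attribute appears before the title attribute.
    private static final Pattern WORD = Pattern.compile(
            "<span[^>]*class=['\"]ocrx_word['\"][^>]*title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+)"
                    + "[^'\"]*['\"][^>]*>(.*?)</span>",
            Pattern.DOTALL);

    /** Returns every ocrx_word found in the given hOCR markup, in document order. */
    static List<Word> parse(String hocr) {
        List<Word> words = new ArrayList<>();
        Matcher m = WORD.matcher(hocr);
        while (m.find()) {
            // Drop inline markup such as <strong>; HTML entities are left undecoded in this sketch.
            String text = m.group(5).replaceAll("<[^>]+>", "").trim();
            words.add(new Word(text,
                    Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4))));
        }
        return words;
    }
}
```

A production parser would more likely use a real HTML parser than a regular expression; the regex keeps this sketch dependency-free.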

Added-Value of Qoppa’s Integration:

  • Convert PDF documents to images using 100% Java top-of-the-line PDF rendering technology (Qoppa’s).
  • Insert hOCR results into the PDF using 100% Java advanced editing PDF technology (Qoppa’s).
  • Tesseract is built in native C and we’ve created native builds for Windows, Mac and Linux in both 32 and 64 bit that we make available to our customers.
  • We’ve made some minor changes to the Tesseract code to include rotation information and to be able to set the data directory programmatically.
  • To communicate with our Java libraries, we created a JNI interface to the native builds (instead of calling the command line) for tighter integration.
  • We will support the native builds to keep them up to date with new OCR engine releases, new operating system releases, etc.
  • We will support and enhance, as needed, the API that inserts the hOCR results into the PDF.
  • We do not plan to support any of the Tesseract code for accuracy or performance; we plan to leave the recognition code “as is”.
  • Tested by thousands of end-users in our PDF desktop tool for Windows, Mac and Linux.
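
Inserting OCR results into a PDF implies mapping word boxes from the rendered image back onto the page. As an illustration of the arithmetic involved, the sketch below converts an hOCR bounding box (pixels, origin at the top-left) into a PDF rectangle in points (1 point = 1/72 inch, origin at the bottom-left). This is not Qoppa’s implementation; the class and method names are hypothetical, and it assumes an unrotated page with no crop box offset, rendered at a known DPI.

```java
/** Hypothetical helper (not part of any Qoppa product) for pixel-to-point conversion. */
final class PixelToPdf {

    /** Converts a length in image pixels to PDF points for an image rendered at the given DPI. */
    static double toPoints(double pixels, double dpi) {
        return pixels * 72.0 / dpi;
    }

    /**
     * Maps an hOCR box (x0, y0) to (x1, y1), given in pixels with a top-left origin,
     * to {llx, lly, urx, ury} in points with a bottom-left origin.
     * Assumption: the page is not rotated and its origin is at (0, 0).
     */
    static double[] toPdfRect(int x0, int y0, int x1, int y1, double dpi, double pageHeightPts) {
        double llx = toPoints(x0, dpi);
        double urx = toPoints(x1, dpi);
        // The y axis is flipped: the top of the image is the top of the page.
        double ury = pageHeightPts - toPoints(y0, dpi);
        double lly = pageHeightPts - toPoints(y1, dpi);
        return new double[] {llx, lly, urx, ury};
    }
}
```

For example, on a US Letter page (792 points tall) rendered at 300 DPI, a box from pixel (300, 300) to (600, 450) maps to a rectangle spanning 72 to 144 points horizontally and 684 to 720 points vertically. Rotated pages need an extra transform, which is presumably why rotation information matters in the OCR output.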


OCR Languages:

We support most languages, including Dutch, English, French, German, Italian, Portuguese and Spanish. In v2015R2, we extended support to most languages, including CJK (Chinese, Japanese, Korean).


Operating Systems Supported:

  • Windows 32 and 64 bit
  • Mac 32 and 64 bit
  • Linux 32 and 64 bit
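
Since the native builds differ per operating system and bitness, an integrating application generally has to select the matching one at run time. The sketch below shows one way to do that with the standard os.name and os.arch system properties. The class name and the returned directory names (win64, linux32, and so on) are hypothetical and do not reflect how Qoppa packages its native builds.

```java
import java.util.Locale;

/** Hypothetical helper (not part of any Qoppa product) that picks a native build folder. */
final class NativeBuildSelector {

    /** Returns a folder name such as "win64" for the given os.name and os.arch values. */
    static String nativeDir(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String prefix;
        if (os.startsWith("windows")) {
            prefix = "win";
        } else if (os.contains("mac")) {
            prefix = "mac";
        } else if (os.contains("linux")) {
            prefix = "linux";
        } else {
            throw new IllegalArgumentException("Unsupported operating system: " + osName);
        }
        // os.arch describes the JVM, not the OS: a 32-bit JVM on 64-bit Windows reports "x86",
        // which is the bitness that matters when loading a JNI library.
        String bits = osArch.contains("64") ? "64" : "32";
        return prefix + bits;
    }

    /** Convenience overload that reads the properties of the running JVM. */
    static String nativeDir() {
        return nativeDir(System.getProperty("os.name"), System.getProperty("os.arch"));
    }
}
```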

Please email us if you have any technical or licensing question about our OCR module.
