OCR – Knowledge Base – Qoppa Java PDF API SDK & Server Products

Java Sample Code to Recognize (OCR) and Add Text to a PDF Document

Here is a short Java program that uses Qoppa’s PDF library jPDFProcess and the Tesseract libraries to recognize text in a PDF and add it as invisible text on each PDF page:

// Load a PDF that contains scanned pages needing to be OCRed
PDFDocument pdfDoc = new PDFDocument("C:/test/test.pdf", null);
// initialize the OCR […]

How to skip PDF pages that have already been OCRed when recognizing text in a PDF

If you are OCRing a document in which some pages have already been OCRed, you can skip those pages:

TessJNI ocr = new TessJNI();
for (int count = 0; count < pdf.getPageCount(); ++count) {
    PDFPage page = pdf.getPage(count);
    // if the page already has invisible text, skip it
    if (!page.containsInvisibleText()) {
        String pageOCR = ocr.performOCR("eng", page, […]

Comparing Tesseract versions 3.02 and 3.05 for accuracy & performance

The Java PDF OCR module available in Qoppa PDF libraries currently runs on Tesseract 3.02. Tesseract 3.05 was released on June 1, 2017, and as part of our 2018 software release cycle we looked into upgrading the OCR module to use that version. Tests were done to compare Tesseract 3.02.02 against the new 3.05.01 […]
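One common way to put a number on OCR accuracy when comparing two engine versions is the character error rate: the edit distance between the recognized text and a known-correct transcription, divided by the length of the transcription. The sketch below is only an illustration of that metric, not necessarily the method used in the tests described above; the class name `OcrAccuracy` and its methods are hypothetical.

```java
// Hypothetical helper for illustration; not part of any Qoppa or Tesseract API.
final class OcrAccuracy {
    private OcrAccuracy() {}

    /** Levenshtein distance: minimum number of single-character edits turning a into b. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Edit distance divided by the length of the expected (ground-truth) text. */
    static double characterErrorRate(String expected, String recognized) {
        if (expected.isEmpty()) {
            return recognized.isEmpty() ? 0.0 : 1.0;
        }
        return (double) editDistance(expected, recognized) / expected.length();
    }

    public static void main(String[] args) {
        // Sample strings are made up for the demonstration.
        String truth = "Invoice number 12345";
        String ocrOutput = "Invoice nurnber 12345";
        System.out.println(characterErrorRate(truth, ocrOutput));
    }
}
```

Running the same ground-truth pages through both engine versions and comparing the resulting rates gives a simple like-for-like accuracy figure; timing each run separately covers the performance side.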

OCR Languages Download Links

Required Data File for All Languages
Orientation and script detection

Common Languages
English – English
French – Français
German – Deutsch
Spanish – Español
Italian – Italiano
Chinese (Simplified) – 中文简体中文
Chinese (Traditional) – 中文繁體

All Other Languages
This file contains all the languages available (large file): tessdata_fast.zip
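Before running OCR it can help to confirm that the downloaded data files are actually in place. The sketch below assumes the standard Tesseract naming convention, where each language is stored as `<code>.traineddata` (for example `eng.traineddata`) and orientation and script detection is `osd.traineddata`. The class `TessDataCheck`, its methods, and the default `tessdata` folder name are hypothetical; where the OCR module expects the files to live is not covered here.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any Qoppa or Tesseract API.
final class TessDataCheck {
    private TessDataCheck() {}

    /** Data file names needed for a language spec such as "eng+fra", plus the OSD file. */
    static List<String> requiredFiles(String languageSpec) {
        List<String> files = new ArrayList<>();
        files.add("osd.traineddata"); // orientation and script detection, needed for all languages
        for (String code : languageSpec.split("\\+")) {
            String trimmed = code.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String name = trimmed + ".traineddata";
            if (!files.contains(name)) {
                files.add(name);
            }
        }
        return files;
    }

    /** Returns the required files that are not present in the given folder. */
    static List<String> missingFiles(Path tessdataDir, String languageSpec) {
        List<String> missing = new ArrayList<>();
        for (String name : requiredFiles(languageSpec)) {
            if (!Files.isRegularFile(tessdataDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The folder defaults to "tessdata" purely as an example.
        Path dir = Paths.get(args.length > 0 ? args[0] : "tessdata");
        System.out.println("Missing: " + missingFiles(dir, "eng+fra"));
    }
}
```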

Creating Searchable PDF from Image Files

Q: Can we convert image files into searchable PDF documents, by performing OCR, using Qoppa’s Java PDF library?

A: Yes, you can do that using jPDFProcess.

1. Convert Images to PDF Pages

The first step is to create a PDF from the images:

// create a new PDF document
PDFDocument pdfDoc = new PDFDocument();
// […]

PDF OCR With Multiple Languages

To call OCR with multiple languages, for instance English and French, join the language codes with a plus sign:

com.qoppa.ocr.TessJNI.performOCR("eng+fra", myPage, 200);
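When the set of languages comes from configuration or user input, the plus-separated string can be assembled programmatically. This is a minimal sketch in plain Java; the class `OcrLanguageSpec` and its `join` method are hypothetical helpers and not part of the Qoppa API. The resulting string is what would be passed as the first argument to `performOCR`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of any Qoppa or Tesseract API.
final class OcrLanguageSpec {
    private OcrLanguageSpec() {}

    /** Joins language codes such as "eng" and "fra" into "eng+fra", dropping blanks and duplicates. */
    static String join(List<String> codes) {
        if (codes == null) {
            throw new IllegalArgumentException("Language codes must not be null");
        }
        String spec = codes.stream()
                .map(String::trim)
                .filter(code -> !code.isEmpty())
                .distinct()
                .collect(Collectors.joining("+"));
        if (spec.isEmpty()) {
            throw new IllegalArgumentException("At least one language code is required");
        }
        return spec;
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList("eng", "fra")));
    }
}
```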

New Languages Supported in OCR

v2015R2 added OCR support for non-Latin and CJK languages. New Latin languages have also been added to the available list of languages. Here is a complete list of newly added OCR languages: New OCR Languages: Afrikaans Albanian – shqip Arabic – العربية Azerbaijani – azərbaycan Basque – euskara Belarusian – беларуская Bengali – বাংলা Bulgarian […]

