Java Sample Code to Recognize (OCR) and Add Text to a PDF Document

Here is a small Java program that uses Qoppa’s PDF library jPDFProcess and the Tesseract libraries to recognize text in a PDF and add it as invisible text on each PDF page:

    // Load a PDF that contains scanned pages that need OCR
    PDFDocument pdfDoc = new PDFDocument("C:/test/test.pdf", null);
    // initialize the OCR […]

OCR Languages Download Links

Required Data File for All Languages
    Orientation and script detection

Common Languages
    English – English
    French – Français
    German – Deutsch
    Spanish – Español
    Italian – Italiano
    Chinese (Simplified) – 中文简体中文
    Chinese (Traditional) – 中文繁體

All Other Languages
    This file contains all the available languages (large file):
    tessdata_fast.zip

Creating Searchable PDF from Image Files

Q: Can we convert image files into searchable PDF documents by performing OCR with Qoppa’s Java PDF library?

A: Yes, you can do that using jPDFProcess.

1. Convert Images to PDF Pages

The first step is to create a PDF from the images:

    // create a new PDF document
    PDFDocument pdfDoc = new PDFDocument();
    // […]

PDF OCR With Multiple Languages

To run OCR with multiple languages, for instance English and French, join the language codes with a plus sign and call:

    com.qoppa.ocr.TessJNI.performOCR("eng+fra", myPage, 200);
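When the set of languages is chosen at runtime, the "+"-separated string can be built from a list of codes. The following is a minimal sketch rather than part of the Qoppa API: the class `OcrLanguages` and its `join` method are hypothetical helpers introduced here for illustration, and only the `performOCR` call with the `"eng+fra"` argument comes from the text above.

```java
import java.util.List;

// Hypothetical helper (not part of jPDFProcess): builds the language
// argument, such as "eng+fra", from individual language codes.
final class OcrLanguages {

    private OcrLanguages() {
    }

    // Joins language codes with '+', rejecting empty input so that a
    // malformed string such as "" or "eng+" never reaches the OCR call.
    static String join(List<String> codes) {
        if (codes == null || codes.isEmpty()) {
            throw new IllegalArgumentException("At least one language code is required");
        }
        for (String code : codes) {
            if (code == null || code.trim().isEmpty()) {
                throw new IllegalArgumentException("Language codes must not be blank");
            }
        }
        return String.join("+", codes);
    }

    public static void main(String[] args) {
        String languages = join(List.of("eng", "fra"));
        System.out.println(languages); // prints eng+fra

        // Assumed usage, with myPage obtained from a loaded document as in the call above:
        // com.qoppa.ocr.TessJNI.performOCR(languages, myPage, 200);
    }
}
```

Validating the codes before joining keeps a missing or blank entry from producing a string the OCR engine would not recognize.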

New Languages Supported in OCR

v2015R2 added OCR support for non-Latin and CJK languages. New Latin languages have also been added to the list of available languages. Here is a complete list of the newly added OCR languages:

    Afrikaans
    Albanian – shqip
    Arabic – العربية
    Azerbaijani – azərbaycan
    Basque – euskara
    Belarusian – беларуская
    Bengali – বাংলা
    Bulgarian […]
