Here is a small Java program that uses Qoppa’s PDF library jPDFProcess and the Tesseract libraries to recognize text in a PDF and add it as invisible text on each PDF page: // Load a PDF that contains scanned pages needing to be OCRed PDFDocument pdfDoc = new PDFDocument("C:/test/test.pdf", null); // initialize the OCR […]
Articles Tagged: OCR
How to skip PDF pages that have already been OCRed when recognizing text in a PDF
If you are OCRing a document where some pages have already been OCRed, you can skip these pages: TessJNI ocr = new TessJNI(); for (int count = 0; count < pdf.getPageCount(); ++count) { PDFPage page = pdf.getPage(count); // if the page already has invisible text, skip it if (!page.containsInvisibleText()) { String pageOCR = ocr.performOCR("eng", page, […]
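The skip logic in the excerpt above can be tried without the library installed. The following is only a sketch: `SkipOcrSketch`, its nested `Page` interface, and `pagesNeedingOcr` are hypothetical stand-ins introduced for illustration, and only the `containsInvisibleText()` check is taken from the excerpt. It collects the indexes of pages that would still be passed to OCR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types belong to the Qoppa API.
final class SkipOcrSketch {
    private SkipOcrSketch() {}

    // Stand-in for the library's page type. Only the check used in the
    // excerpt is modeled here.
    interface Page {
        boolean containsInvisibleText();
    }

    // Returns the zero-based indexes of pages that have no invisible text,
    // i.e. the pages that would still need OCR.
    static List<Integer> pagesNeedingOcr(List<? extends Page> pages) {
        List<Integer> result = new ArrayList<>();
        for (int count = 0; count < pages.size(); ++count) {
            // Pages that already carry invisible text are skipped.
            if (!pages.get(count).containsInvisibleText()) {
                result.add(count);
            }
        }
        return result;
    }
}
```

In real code the loop body would call the OCR engine on each remaining page, as the excerpt does, instead of collecting indexes.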
Comparing Tesseract versions 3.02 and 3.05 for accuracy & performance
The Java PDF OCR module available in Qoppa PDF libraries currently runs on Tesseract 3.02. On June 1, 2017, Tesseract 3.05 was released, and as part of our 2018 software release cycle, we looked into upgrading the OCR module to use that version. Tests were done to compare Tesseract 3.02.02 against the new 3.05.01 […]
OCR Languages Download Links
Required Data File for All Languages
Orientation and script detection

Common Languages
English – English
French – Français
German – Deutsch
Spanish – Español
Italian – Italiano
Chinese (Simplified) – 中文简体中文
Chinese (Traditional) – 中文繁體

All Other Languages – This file contains all the languages available (large file)
tessdata_fast.zip
Creating Searchable PDF from Image Files
Q: Can we convert image files into searchable PDF documents, by performing OCR, using Qoppa’s Java PDF library? A: Yes, using jPDFProcess, you can do that. 1. Convert Images to PDF Pages The first step is to create a PDF from the images: // create a new PDF document PDFDocument pdfDoc = new PDFDocument(); // […]
PDF OCR With Multiple Languages
To call OCR with multiple languages, for instance English and French, call: com.qoppa.ocr.TessJNI.performOCR("eng+fra", myPage, 200);
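When the language list is not known until run time, the "eng+fra" style string can be assembled from individual codes. The helper below is a minimal sketch: `OcrLanguages` and `join` are hypothetical names, not part of the Qoppa API, and the only thing assumed about the library is the `+`-separated format shown in the call above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the Qoppa API.
final class OcrLanguages {
    private OcrLanguages() {}

    // Joins language codes such as "eng" and "fra" into "eng+fra".
    // Null, blank, and repeated codes are dropped; order is preserved.
    static String join(String... codes) {
        List<String> cleaned = new ArrayList<>();
        for (String code : codes) {
            if (code == null) {
                continue;
            }
            String trimmed = code.trim();
            if (!trimmed.isEmpty() && !cleaned.contains(trimmed)) {
                cleaned.add(trimmed);
            }
        }
        if (cleaned.isEmpty()) {
            throw new IllegalArgumentException("At least one language code is required");
        }
        return String.join("+", cleaned);
    }
}
```

The result of `OcrLanguages.join("eng", "fra")` could then be passed as the first argument of the `performOCR` call in place of the literal string.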
New Languages Supported in OCR
v2015R2 added OCR support for non-Latin and CJK languages. New Latin languages have also been added to the available list of languages. Here is a complete list of newly added OCR languages: New OCR Languages: Afrikaans Albanian – shqip Arabic – العربية Azerbaijani – azərbaycan Basque – euskara Belarusian – беларуская Bengali – বাংলা Bulgarian […]