Q: Can we convert image files into searchable PDF documents by performing OCR with Qoppa’s Java PDF library?

A: Yes, you can do this with jPDFProcess.

1. Convert Images to PDF Pages

The first step is to create a PDF from the images:

// Create a new PDF document
PDFDocument pdfDoc = new PDFDocument();
// Create a new page from a JPG image file
pdfDoc.appendJPEGAsPage ("C:\\somefile.jpg");
// Create page(s) from a TIFF image file
pdfDoc.appendTIFFAsPages ("C:\\somefile.tif");
// Create a new page from a PNG image file
pdfDoc.appendPNGAsPage ("C:\\somefile.png");

Images can be read from a file or from an input stream: each of these methods can also take an input stream.
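
When a folder mixes image formats, each file has to be routed to the matching append method. The sketch below shows one possible way to choose by file extension. ImageKind and fromFileName are hypothetical names invented for this example and are not part of jPDFProcess; only the three append method names come from the code above. It is written in plain Java so it runs without the library on the classpath.

```java
import java.util.Locale;

/**
 * Hypothetical helper (not part of jPDFProcess): maps an image file name
 * to the PDFDocument append method shown in the article.
 */
enum ImageKind {
    JPEG("appendJPEGAsPage"),
    TIFF("appendTIFFAsPages"),
    PNG("appendPNGAsPage");

    /** Name of the PDFDocument method that handles this image type. */
    final String appendMethod;

    ImageKind(String appendMethod) {
        this.appendMethod = appendMethod;
    }

    /** Picks the image kind from the file extension, ignoring case. */
    static ImageKind fromFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            throw new IllegalArgumentException("No file extension: " + fileName);
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "jpg":
            case "jpeg":
                return JPEG;
            case "tif":
            case "tiff":
                return TIFF;
            case "png":
                return PNG;
            default:
                throw new IllegalArgumentException("Unsupported image type: " + fileName);
        }
    }
}

class ImageKindDemo {
    public static void main(String[] args) {
        String[] files = { "C:\\somefile.jpg", "C:\\somefile.tif", "C:\\somefile.png" };
        for (String file : files) {
            // Print which append method would be called for each file
            System.out.println(file + " -> " + ImageKind.fromFileName(file).appendMethod);
        }
    }
}
```

In a real conversion loop, the returned value could drive a switch that calls pdfDoc.appendJPEGAsPage, pdfDoc.appendTIFFAsPages or pdfDoc.appendPNGAsPage with the file path.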

2. Add Searchable Text to the PDF Pages

The PDF then needs to be “OCRed” to recognize the text in the images and add it to the pages as invisible, searchable text:

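// Create the OCR engine
TessJNI ocr = new TessJNI();
for (int count = 0; count < pdfDoc.getPageCount(); ++count)
{
    // Run OCR on the page, using the English ("eng") language
    String pageOCR = ocr.performOCR("eng", pdfDoc.getPage(count), 300);
    // Insert the hOCR result into the page as searchable text
    pdfDoc.getPage(count).insert_hOCR(pageOCR, true);
}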

Read more about how to set up and run OCR in jPDFProcess:

3. Save the File

Finally, save the PDF to a file or to an output stream.

// Save the PDF document to a file
pdfDoc.saveDocument ("myDoc.pdf");