One way to tell whether a page in a PDF document is empty is to count the drawing commands on the page. If there are no commands, the page is truly empty, which usually means it was programmatically generated to be blank.
But sometimes a page contains a scanned image that appears blank, and the page still has commands in it. In that case, the image itself must be analyzed to determine whether it looks empty.
This Java sample shows how to determine whether a page appears empty in a PDF document. It uses Qoppa’s PDF library jPDFProcess but could be adapted to use jPDFImages.
We convert the page to an image and then measure how much the luminance varies across the image. A low variance means the pixels are nearly uniform, so the page looks blank.
```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

import com.qoppa.pdfProcess.PDFDocument;
import com.qoppa.pdfProcess.PDFPage;

// Load the PDF document
PDFDocument pdfDoc = new PDFDocument("doc.pdf", null);

// Get the first page
PDFPage page = pdfDoc.getPage(0);

// Convert the page to an image
BufferedImage pageImage = page.getImage();

// Set a tolerance
int tolerance = 5;

// We need to remove the alpha channel from the page's image
// before checking its variance in luminance.
BufferedImage rgbImage = new BufferedImage(pageImage.getWidth(), pageImage.getHeight(), BufferedImage.TYPE_INT_RGB);
Graphics2D g2d = rgbImage.createGraphics();
g2d.drawImage(pageImage, 0, 0, null);
g2d.dispose();

// Compute the variance in luminance and compare it with the tolerance
boolean isEmptyPage = (calculateLumVar(rgbImage) < tolerance);
```
And here is the method to calculate the variance in luminance in a BufferedImage.
```java
/**
 * This method calculates the variance in luminance for a BufferedImage. The
 * single pass variance algorithm is adapted from
 * http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm.
 *
 * @param image the image to analyze
 * @return luminance variance
 */
private double calculateLumVar(BufferedImage image)
{
    double x;
    double mean = 0;
    double m2 = 0;
    double delta;
    long n = 0;

    // Buffer holding one row of pixels at a time
    int[] rgb = new int[image.getWidth()];
    for (int hIdx = 0; hIdx < image.getHeight(); ++hIdx)
    {
        // Read row hIdx; the scan size is the row width
        image.getRGB(0, hIdx, image.getWidth(), 1, rgb, 0, image.getWidth());
        for (int i = 0; i < rgb.length; ++i)
        {
            ++n;

            // Standard luminance calculation (Rec. 709 weights)
            x = ((rgb[i] >> 16) & 0xff) * 0.2126
                + ((rgb[i] >> 8) & 0xff) * 0.7152
                + ((rgb[i]) & 0xff) * 0.0722;

            // Single pass update of the mean and the sum of squared differences
            delta = x - mean;
            mean += (delta / n);
            m2 += (delta * (x - mean));
        }
    }

    if (n < 2)
    {
        return 0;
    }
    return m2 / (n - 1);
}
```
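To get a feel for how the variance behaves, here is a minimal, self-contained sketch that runs the same calculation on synthetic images instead of real PDF pages. The class name `LumVarDemo` and the helper `syntheticPage` are made up for this illustration and are not part of any Qoppa library; the tolerance of 5 is simply the value used in the sample above.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class LumVarDemo {

    // Same single pass variance calculation as the method above
    public static double calculateLumVar(BufferedImage image) {
        double mean = 0;
        double m2 = 0;
        long n = 0;
        int[] rgb = new int[image.getWidth()];
        for (int y = 0; y < image.getHeight(); ++y) {
            image.getRGB(0, y, image.getWidth(), 1, rgb, 0, image.getWidth());
            for (int i = 0; i < rgb.length; ++i) {
                ++n;
                double x = ((rgb[i] >> 16) & 0xff) * 0.2126
                         + ((rgb[i] >> 8) & 0xff) * 0.7152
                         + (rgb[i] & 0xff) * 0.0722;
                double delta = x - mean;
                mean += delta / n;
                m2 += delta * (x - mean);
            }
        }
        return (n < 2) ? 0 : m2 / (n - 1);
    }

    // Hypothetical helper: a white "page" with a black square of inkSize pixels
    // in the top left corner (inkSize = 0 leaves the page completely white)
    public static BufferedImage syntheticPage(int width, int height, int inkSize) {
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, width, height);
        g.setColor(Color.BLACK);
        g.fillRect(0, 0, inkSize, inkSize);
        g.dispose();
        return img;
    }

    public static void main(String[] args) {
        double tolerance = 5;
        double blank = calculateLumVar(syntheticPage(200, 200, 0));
        double inked = calculateLumVar(syntheticPage(200, 200, 50));
        System.out.println("blank page variance: " + blank + " -> empty? " + (blank < tolerance));
        System.out.println("inked page variance: " + inked + " -> empty? " + (inked < tolerance));
    }
}
```

A uniformly white image has a variance of exactly zero, while a page with a visible black area lands far above the tolerance. Real scans contain noise, so the tolerance may need tuning against a few representative pages.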
If you only need to detect blank pages, you can use our jPDFImages library.
If you need to manipulate the PDF document further, for instance to delete the empty pages or split the PDF into multiple documents, you will need our jPDFProcess library.
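When deleting empty pages, it helps to collect the blank page indexes first and process them from the highest index down, so that removing one page does not shift the indexes of the pages still to be removed. The sketch below is plain Java and assumes you have already computed one luminance variance per page (for example with `calculateLumVar`); the class and method names are hypothetical, and the actual page removal call is left to the library's API.

```java
import java.util.ArrayList;
import java.util.List;

public class BlankPageIndexes {

    // Returns the indexes of all pages whose variance is below the tolerance,
    // ordered from the last page to the first
    public static List<Integer> blankIndexesDescending(double[] variances, double tolerance) {
        List<Integer> blank = new ArrayList<>();
        for (int i = variances.length - 1; i >= 0; --i) {
            if (variances[i] < tolerance) {
                blank.add(i);
            }
        }
        return blank;
    }

    public static void main(String[] args) {
        // Made-up variances for a five page document
        double[] variances = {812.4, 0.0, 1530.9, 2.3, 0.7};
        System.out.println(blankIndexesDescending(variances, 5)); // prints [4, 3, 1]
    }
}
```

Iterating over the returned list and removing each page in that order keeps every remaining index valid.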