Customizing separators when extracting words in a PDF using jPDFText

Q: I need to extract words and their positions from a PDF. I am using the getWordsWithPositions(int pageIndex) method in jPDFText. Having each word and its bounding quadrilateral is exactly what I need. However, I also need punctuation marks (i.e. , ; : .) to be included as part of the words, along with their bounding information. Is there a way to do this?

A: You can specify which separator characters are used to identify and split words by calling the following method in the API:

PDFText.getWordsWithPositions(int pageIndex, String wordSeparators);

The default separator characters are:

 “,/;\n><():?&.@*\t”

but you can customize them by passing your own string, adding or removing separator characters as needed.

For instance, if your sentence is:

For my purpose, I need punctuation marks.

With the default separators, you would get the following words (in brackets):

[For] [my] [purpose] [I] [need] [punctuation] [marks]

If you change the separators to “/\n><()&@*\t” (removing punctuation characters ‘,’ ‘.’ ‘?’ ‘:’ ‘;’ from the original string of separators), you would get the following words:

[For] [my] [purpose,] [I] [need] [punctuation] [marks.]
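
To see how the choice of separator characters changes the result, here is a minimal, self-contained sketch in plain Java. It does not call jPDFText and is not the library's implementation: splitWords is a hypothetical helper that only mimics splitting on a set of separator characters, and it assumes that a space always ends a word (the space character is not shown in the separator list above). Bounding quadrilaterals are left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

final class SeparatorDemo {
    // Default separator characters as listed in this article.
    static final String DEFAULT_SEPARATORS = ",/;\n><():?&.@*\t";

    // Same list with , . ? : ; removed, so punctuation stays attached to words.
    static final String CUSTOM_SEPARATORS = "/\n><()&@*\t";

    // Hypothetical helper, NOT part of jPDFText: splits text on the given
    // separator characters to illustrate the effect of the separator string.
    static List<String> splitWords(String text, String separators) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Assumption: a space always ends a word, in addition to the separators.
            if (c == ' ' || separators.indexOf(c) >= 0) {
                if (current.length() > 0) {
                    words.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        // Flush the last word if the text did not end with a separator.
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        String sentence = "For my purpose, I need punctuation marks.";
        System.out.println(splitWords(sentence, DEFAULT_SEPARATORS));
        System.out.println(splitWords(sentence, CUSTOM_SEPARATORS));
    }
}
```

With the real library, the custom string would be passed as the wordSeparators argument of the method shown earlier, and each returned word would carry its position information.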

See our PDF technology in action!

Privacy Policy

Links to Qoppa’s Main Website

Contact Support

Follow Us

Related Articles

Sample Java Code to Print a PDF as Image

PAdES PDF Advanced Electronic Signatures Support in Qoppa PDF SDK

Merging signed PDFs with Java PDF SDK jPDFProcess

Add Document TimeStamp (DTS) to a PDF document with Java

Search & Redact Social Security Numbers SSN in a PDF with Java

Can a digital signature in a PDF document have multiple linked widgets?
