Q: I need to extract words and their positions from a PDF. I am using the getWordsWithPositions(int pageIndex) method in jPDFText. Having each word and its bounding quadrilateral is exactly what I need. However, I also need punctuation marks (i.e. , ; : .) to be included as part of the words, with their bounding information. Is there a way to do this?

A: You can specify which separator characters are used to identify and split words by calling the following method in the API:

PDFText.getWordsWithPositions(int pageIndex, String wordSeparators);

The default separator characters are:

 “,/;\n><():?&.@*\t” 

but you can customize them by passing your own string, removing or adding separator characters as needed.

For instance if your sentence is:

For my purpose, I need punctuation marks.

With the default separators, you would get the following words (in brackets):

[For] [my] [purpose] [I] [need] [punctuation] [marks]

If you change the separators to “/\n><()&@*\t” (removing punctuation characters ‘,’ ‘.’ ‘?’ ‘:’ ‘;’ from the original string of separators), you would get the following words:

[For] [my] [purpose,] [I] [need] [punctuation] [marks.]
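
To see how the separator string changes the result, the splitting can be approximated in plain Java. This is only a rough sketch: it does not call jPDFText and says nothing about how the library works internally. The class name SeparatorDemo and its split method are hypothetical, and treating a space as a word boundary regardless of the separator string is an assumption made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class SeparatorDemo {
    // Default separator characters as quoted in the answer above.
    static final String DEFAULT_SEPARATORS = ",/;\n><():?&.@*\t";

    // Splits text into words. Every character in 'separators' ends a word
    // and is dropped from the output.
    // Assumption: a space always ends a word, whatever the separator string is.
    static List<String> split(String text, String separators) {
        List<String> words = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(text, separators + " ");
        while (tokens.hasMoreTokens()) {
            words.add(tokens.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        String sentence = "For my purpose, I need punctuation marks.";
        // Default separators: punctuation is stripped from the words.
        System.out.println(split(sentence, DEFAULT_SEPARATORS));
        // Punctuation removed from the separators: it stays attached to the words.
        System.out.println(split(sentence, "/\n><()&@*\t"));
    }
}
```

With jPDFText itself, the same custom separator string would be passed as the wordSeparators argument, and each returned word would carry its own bounding information.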