Q: I need to extract words and their positions from a PDF. I am using the getWordsWithPositions(int pageIndex) method in jPDFText. Having each word and its bounding quadrilateral is exactly what I need. However, I also need punctuation marks (i.e. , ; : .) to be included as part of the words, with bounding information. Is there a way to do this?
A: You can specify which separator characters are used to split text into words by calling the following method in the API:
PDFText.getWordsWithPositions(int pageIndex, String wordSeparators);
The default separator characters are:
“,/;\n><():?&.@*\t”
but you can customize them by removing or adding separator characters as needed.
For instance, if your sentence is:
For my purpose, I need punctuation marks.
With the default separators, you would get the following words (in brackets):
[For] [my] [purpose] [I] [need] [punctuation] [marks]
If you change the separators to “/\n><()&@*\t” (removing punctuation characters ‘,’ ‘.’ ‘?’ ‘:’ ‘;’ from the original string of separators), you would get the following words:
[For] [my] [purpose,] [I] [need] [punctuation] [marks.]
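The effect of the two separator strings can be tried out without a PDF. The following is a minimal sketch in plain Java that only simulates the splitting with java.util.StringTokenizer; it is not how jPDFText works internally. The class name SeparatorDemo and the method splitWords are hypothetical names made up for this illustration, and the sketch assumes that a space always ends a word, as the example sentence above suggests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class: simulates word splitting only, it does not call jPDFText.
class SeparatorDemo {
    // Default separator characters as listed in the answer above.
    static final String DEFAULT_SEPARATORS = ",/;\n><():?&.@*\t";

    // Same string with the punctuation characters , ; : ? . removed.
    static final String KEEP_PUNCTUATION =
            DEFAULT_SEPARATORS.replaceAll("[,;:?.]", "");

    // Splits text on the given separators.
    // Assumption: a space always ends a word, so it is added to the delimiters.
    static List<String> splitWords(String text, String separators) {
        StringTokenizer tokens = new StringTokenizer(text, separators + " ");
        List<String> words = new ArrayList<>();
        while (tokens.hasMoreTokens()) {
            words.add(tokens.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        String sentence = "For my purpose, I need punctuation marks.";
        // Print each word in brackets, once per separator string.
        for (String separators : new String[] {DEFAULT_SEPARATORS, KEEP_PUNCTUATION}) {
            StringBuilder line = new StringBuilder();
            for (String word : splitWords(sentence, separators)) {
                line.append('[').append(word).append("] ");
            }
            System.out.println(line.toString().trim());
        }
        // With jPDFText itself, the customized string would be passed as the
        // wordSeparators argument of the method shown in the answer above.
    }
}
```

With the default separators the simulated split drops the comma and the period; with the customized string they stay attached to "purpose," and "marks.". Deriving the customized string from the default one, rather than typing it by hand, makes it harder to drop a separator by accident.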