The following code is generating special characters instead of spaces for one PDF but not another:
String fullText;
BodyContentHandler handler = null;
try {
// size is limit is 100M
handler = new BodyContentHandler(100 * 1024 * 1024);
Metadata meta = new Metadata();
PDFParser parser = new PDFParser();
parser.setEnableAutoSpace(false);
parser.parse(new FileInputStream(this.pdf /*always a valid pdf file*/), handler, meta, new ParseContext());
}
catch (SAXException e) {
throw new IOException(e);
} catch (TikaException e) {
throw new IOException(e);
}
fullText = handler.toString();
Depending on the PDF a substring of fullText will look like:
will*continue*to*be*used*in*support*of*the
When It should look like this:
will continue to be used in support of the
In other places, '%' substitute '-' and '!' substitute spaces amongst bolded text.
This issue only when processing one PDF but not the other. According to pdfinfo, both PDF's are generated by Quartz PDFContext.
linux command pdftotext renders the same results.
Is this a problem with how the original PDF is generated? Why is this happening?