java - Why is PDFParser generating special characters instead of spaces?

The following code is generating special characters instead of spaces for one PDF but not another:

    String fullText;

    // Write limit for the extracted text: 100M characters.
    BodyContentHandler handler = new BodyContentHandler(100 * 1024 * 1024);
    Metadata meta = new Metadata();

    PDFParser parser = new PDFParser();
    parser.setEnableAutoSpace(false);

    // this.pdf is always a valid PDF file; try-with-resources closes the stream.
    try (InputStream in = new FileInputStream(this.pdf)) {
        parser.parse(in, handler, meta, new ParseContext());
    } catch (SAXException | TikaException e) {
        throw new IOException(e);
    }

    fullText = handler.toString();

Depending on the PDF, a substring of fullText will look like:

will*continue*to*be*used*in*support*of*the

when it should look like this:

will continue to be used in support of the

In other places, '%' replaces '-', and '!' replaces spaces within bolded text.
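To see which symbols are standing in for spaces or hyphens, one option is to tally every symbol that sits directly between two letters in the extracted text. The following is only a minimal diagnostic sketch: the class name `SeparatorTally` and the method `tally` are hypothetical and not part of Tika. It assumes the text has already been extracted into a String such as fullText.

```java
import java.util.Map;
import java.util.TreeMap;

public class SeparatorTally {

    // Counts symbols that sit directly between two letters,
    // such as the '*' in "will*continue".
    public static Map<Character, Integer> tally(String text) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (int i = 1; i < text.length() - 1; i++) {
            char c = text.charAt(i);
            boolean symbol = !Character.isLetterOrDigit(c) && !Character.isWhitespace(c);
            if (symbol
                    && Character.isLetter(text.charAt(i - 1))
                    && Character.isLetter(text.charAt(i + 1))) {
                counts.merge(c, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = "will*continue*to*be*used*in*support*of*the";
        System.out.println(tally(sample)); // prints {*=8}
    }
}
```

Legitimate hyphens and apostrophes between letters are counted as well, so the tally is a hint about which characters to inspect, not proof of a substitution.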

This issue occurs with only one of the two PDFs. According to pdfinfo, both PDFs were generated by Quartz PDFContext.

The Linux command pdftotext produces the same results.

Is this a problem with how the original PDF is generated? Why is this happening?

asked Oct 11, 2013 at 13:51

JSK NS


There are probably inconsistencies in the font objects inside your PDF file. Note that in PDF, the visual representation of a glyph is generally independent of the character it represents. If you post a sample file that fails and tell us which piece of text is affected, we might be able to give more information.
– yms
Commented Oct 11, 2013 at 18:57
Also take a look at this question: stackoverflow.com/questions/12184304/…
– yms
Commented Oct 11, 2013 at 19:02
For a more specific analysis please provide a sample pdf.
– mkl
Commented Oct 11, 2013 at 19:29
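
The point in the first comment, that a glyph's appearance is independent of the character it represents, can be illustrated with a toy model. This is only a sketch under stated assumptions: the class `GlyphMappingDemo`, the numeric codes, and both tables are invented for illustration and do not reflect the actual structures in any PDF. In a real file, an extractor relies on the font's encoding or ToUnicode mapping; if that mapping assigns '*' to the code whose glyph draws as a blank, the page still looks correct while the extracted text does not.

```java
import java.util.HashMap;
import java.util.Map;

public class GlyphMappingDemo {

    // Decodes character codes to text via a code -> Unicode table,
    // roughly as a text extractor would.
    public static String extract(int[] codes, Map<Integer, Character> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            // Unknown codes become U+FFFD in this toy model (an assumption).
            sb.append(toUnicode.getOrDefault(code, '\uFFFD'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical codes for the glyphs o, f, blank, t, h, e.
        int[] codes = {1, 2, 3, 4, 5, 6};

        Map<Integer, Character> correct = new HashMap<>();
        correct.put(1, 'o');
        correct.put(2, 'f');
        correct.put(3, ' ');
        correct.put(4, 't');
        correct.put(5, 'h');
        correct.put(6, 'e');

        // Same glyphs on the page, but the table claims code 3 is '*'.
        Map<Integer, Character> broken = new HashMap<>(correct);
        broken.put(3, '*');

        System.out.println(extract(codes, correct)); // prints "of the"
        System.out.println(extract(codes, broken));  // prints "of*the"
    }
}
```

This would also be consistent with pdftotext giving the same output, since any extractor reading the same mapping would report the same characters.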
