
The following code produces special characters instead of spaces for one PDF but not for another:

    String fullText;

    // Limit the extracted text to 100M characters
    BodyContentHandler handler = new BodyContentHandler(100 * 1024 * 1024);
    Metadata meta = new Metadata();

    PDFParser parser = new PDFParser();
    parser.setEnableAutoSpace(false);

    // this.pdf is always a valid PDF file; try-with-resources closes the stream
    try (FileInputStream in = new FileInputStream(this.pdf)) {
        parser.parse(in, handler, meta, new ParseContext());
    } catch (SAXException | TikaException e) {
        throw new IOException(e);
    }

    fullText = handler.toString();

Depending on the PDF, a substring of fullText will look like this:

will*continue*to*be*used*in*support*of*the

when it should look like this:

will continue to be used in support of the

In other places, '%' replaces '-', and '!' replaces spaces within bold text.

This issue occurs only when processing one PDF but not the other. According to pdfinfo, both PDFs were generated by Quartz PDFContext.

The Linux command pdftotext produces the same results.
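
Since two independent extractors agree, it may help to flag affected files automatically before looking at the PDF itself. The following is only a rough sketch of such a check: it assumes that normal English text contains a noticeable share of whitespace, and the class name, method names, and the 5% threshold are all hypothetical choices, not part of Tika or pdftotext.

```java
class ExtractionCheck {

    // Fraction of characters in the text that are whitespace.
    static double whitespaceRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int whitespace = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                whitespace++;
            }
        }
        return (double) whitespace / text.length();
    }

    // Heuristic (assumed threshold): text of at least 20 characters with
    // under 5% whitespace probably had its spaces replaced by something else.
    static boolean looksMisencoded(String text) {
        return text.length() >= 20 && whitespaceRatio(text) < 0.05;
    }

    public static void main(String[] args) {
        String bad = "will*continue*to*be*used*in*support*of*the";
        String good = "will continue to be used in support of the";
        System.out.println(looksMisencoded(bad));  // prints "true"
        System.out.println(looksMisencoded(good)); // prints "false"
    }
}
```

In practice the check would be run on fullText after extraction. It only detects the symptom; it does not say which character was substituted or why.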

Is this a problem with how the original PDF is generated? Why is this happening?

  • Probably there are inconsistencies in the font objects inside your PDF file. Note that in PDF, the visual representation of a glyph is generally independent of the character it represents. If you post a sample file that does not work and tell us which piece of text is affected, we might be able to give more information.
    – yms
    Commented Oct 11, 2013 at 18:57
  • Also take a look at this question: stackoverflow.com/questions/12184304/…
    – yms
    Commented Oct 11, 2013 at 19:02
  • For a more specific analysis, please provide a sample PDF.
    – mkl
    Commented Oct 11, 2013 at 19:29
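
The point made in the comments, that a glyph's appearance is independent of the character it is mapped to, can be illustrated with a toy model. This is a simplified sketch, not the actual PDF font or CMap format: the glyph codes, the two mapping tables, and all names are hypothetical. The same sequence of glyph codes renders identically on screen in both cases, but a text extractor only sees the code-to-character table.

```java
import java.util.HashMap;
import java.util.Map;

class GlyphMappingDemo {

    // Table in which glyph code 3 (the blank glyph) maps to a real space.
    static Map<Integer, Character> correctMap() {
        Map<Integer, Character> m = new HashMap<>();
        m.put(1, 't');
        m.put(2, 'o');
        m.put(3, ' ');
        m.put(4, 'b');
        m.put(5, 'e');
        return m;
    }

    // Same table, except the blank glyph is (wrongly) mapped to '*'.
    static Map<Integer, Character> brokenMap() {
        Map<Integer, Character> m = correctMap();
        m.put(3, '*');
        return m;
    }

    // Turn glyph codes into text the way an extractor would: by table lookup,
    // with U+FFFD standing in for codes that have no mapping.
    static String decode(int[] glyphCodes, Map<Integer, Character> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : glyphCodes) {
            Character c = toUnicode.get(code);
            sb.append(c != null ? c : '\uFFFD');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = {1, 2, 3, 4, 5};
        System.out.println(decode(codes, correctMap())); // prints "to be"
        System.out.println(decode(codes, brokenMap()));  // prints "to*be"
    }
}
```

If the problematic PDF carries a table like the second one, every extractor that trusts it would return '*' for spaces, which would be consistent with Tika and pdftotext agreeing.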
