python PyPDF2 - Special characters are printed while trying to print text from a PDF file?

Question

I am trying to print the text of a PDF file using the PyPDF2 module, but only some special characters are printed.
I already tried this solution, but it does not seem to work.
code

import PyPDF2

obj = open('/home/sarthak/Documents/UNIT-4.pdf','rb')

pdfReader = PyPDF2.PdfFileReader(obj)

print(pdfReader.numPages)   #printing No. of pages

pageObj = pdfReader.getPage(0)

print(pageObj.extractText().encode('ascii','ignore'))    #also used 'utf-8' but doesn't work either

obj.close()

output

17
b'\n\n\n\n!#$\n\n\n\n\n\n\n\n\n\n\n  \n\n"%$\n\n\n"#\n\n\n $\n\n\n\'())(*+, -$&\n\n\n\n\n $&-\n $\n'

Jinu Joseph · Accepted Answer · 2020-02-24 06:45:06Z

1

To remove the \n characters, you can pass the result through textacy.

import textacy
data = textacy.preprocess.remove_punct(section, marks='\n')
print(data)

where section is the extracted text.

To install textacy: pip install textacy
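If installing textacy only for this step feels heavy, the same newline cleanup can be sketched with the standard library. This is a minimal sketch, not part of the original answer: the helper name collapse_whitespace is hypothetical, and sample is a shortened stand-in for the string returned by extractText(). It only tidies whitespace; it does not recover characters the extractor failed to decode.

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace (newlines included) with one space."""
    # \s+ matches spaces, tabs and newlines; strip() drops the leading/trailing space.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical stand-in for the str returned by pageObj.extractText()
sample = "\n\n\n\n!#$\n\n\n  \n\n\"%$\n"
print(collapse_whitespace(sample))  # prints: !#$ "%$
```

Apply it to the str itself, before any .encode(...) call: encode returns a bytes object, which is why the output in the question shows a b'...' literal with escaped \n sequences.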
