I am trying to print the text of a PDF file using the PyPDF2 module, but only a few special characters are printed.
I already tried this solution, but it does not seem to work.
Code:

import PyPDF2

obj = open('/home/sarthak/Documents/UNIT-4.pdf','rb')

pdfReader = PyPDF2.PdfFileReader(obj)

print(pdfReader.numPages)   # print the number of pages

pageObj = pdfReader.getPage(0)

print(pageObj.extractText().encode('ascii','ignore'))    # also tried 'utf-8', but that doesn't work either

obj.close()

Output:

17
b'\n\n\n\n!#$\n\n\n\n\n\n\n\n\n\n\n  \n\n"%$\n\n\n"#\n\n\n $\n\n\n\'())(*+, -$&\n\n\n\n\n $&-\n $\n'

1 Answer

To remove the \n characters, you can pass the result to textacy.

import textacy
# section holds the text returned by extractText()
data = textacy.preprocess.remove_punct(section, marks='\n')
print(data)

where section is the extracted text.

To install textacy, run: pip install textacy
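If you would rather not install anything extra, the same cleanup can probably be done with the standard library. This is only a rough sketch: clean_extracted_text is a hypothetical helper name, and the sample string is a shortened piece of the output from the question. Note that it decodes back to str after encoding, which is why the result prints without the b'...' wrapper and the escaped \n sequences.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Drop non-ASCII characters and collapse whitespace runs into single spaces.

    Hypothetical helper for illustration; it is not part of PyPDF2 or textacy.
    """
    # encode(..., 'ignore') discards non-ASCII characters; decode() turns the
    # bytes back into a str so print() does not show the b'...' form.
    ascii_only = text.encode("ascii", "ignore").decode("ascii")
    # \s+ matches any run of newlines, spaces, or tabs.
    return re.sub(r"\s+", " ", ascii_only).strip()


# Shortened sample of the extracted text shown in the question.
sample = "\n\n\n\n!#$\n\n\n  \n\n\"%$\n\n\n\"#\n"
print(clean_extracted_text(sample))
```

This only tidies whatever extractText() returned; it cannot recover letters that the extraction itself did not produce.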
