Python - convert pdf to text, encoding error

Question

I tried to convert a PDF document to a text file. (example of pdf file link)

So I tried the code below, but the extracted text is garbled, like ??챘#?遏?h첨챦_철?‾n?~w??¬?k. How can I fix it?

#!/usr/bin/python
# -*- coding: cp949 -*-
# -*- coding: utf-8 -*-
# -*- coding: latin-1 -*-
# -*- coding: euc-kr -*-

import codecs
import pyPdf
filename = "d:/data/processed_data/paper/iscram/2006/iscram1.pdf"
#pdf = codecs.open(filename, "rb", encoding = 'utf-8') 
pdf = codecs.open(filename, "rb", encoding = 'latin1')
for page in pdf:
    print page.encode('utf-8')

I am using the Korean version of 64-bit Windows 7.

I also tried another way, using pyPdf, as shown below:

import os
import glob
from pyPdf import PdfFileReader
import pdfminer
 
f=open("d:/data/processed_data/paper/iscram/2006/iscram1.txt",'w')
parent = "d:/data/processed_data/paper/iscram/2006"
os.chdir(parent)
filename = os.path.abspath('iscram1.pdf')
 
input = PdfFileReader(file(filename, "rb"))
for page in input.pages:
    f.write(page.extractText())

But it doesn't work either. It fails with the error "'ascii' codec can't encode character u'\u0152' in position 602: ordinal not in range(128)".

You can't really use all those encoding declarations can you -- like, that doesn't work does it? — jedwards, Commented Mar 15, 2015 at 6:03
Also, you're not using pyPdf anywhere, that probably doesn't help. — jedwards, Commented Mar 15, 2015 at 6:04
@jedwards I used the pypdf. but I failed to get a good result..... — user3704652, Commented Mar 15, 2015 at 6:07

Community · Accepted Answer · 2017-05-23 10:34:11Z

3

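The first snippet cannot work at all: a PDF is a binary format and does not necessarily contain directly readable text, so decoding the raw file line by line produces garbage. The second snippet, which uses pyPdf, looks more promising.

The error is raised because extractText() returns a unicode string, while f.write in Python 2 expects a byte string and implicitly encodes unicode with the ASCII codec. That fails for a character such as u'\u0152'.

Thus you might try encoding the text returned by the extractText method explicitly before writing it: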

for page in input.pages:
    f.write(page.extractText().encode('UTF-8'))
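
To illustrate why the explicit encoding helps, here is a minimal sketch that does not need a PDF at all. The sample string and the temporary output path are assumptions made for the example. io.open is used because it accepts an encoding argument on both Python 2 and Python 3, so the file object does the UTF-8 encoding itself.

```python
import io
import os
import tempfile

# Sample text standing in for the result of page.extractText(); it contains
# U+0152, the character named in the error message.
extracted = u"\u0152uvre"

# Encoding to ASCII fails in the same way the implicit encoding did.
try:
    extracted.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Hypothetical output location, used only for this demonstration.
out_path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# The file object encodes unicode text as UTF-8 on write.
with io.open(out_path, "w", encoding="utf-8") as f:
    f.write(extracted)

# Reading the file back with the same encoding recovers the original text.
with io.open(out_path, "r", encoding="utf-8") as f:
    round_trip = f.read()
```

With this approach the loop body could pass the unicode result of extractText() straight to f.write, without calling encode on each page.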

edited May 23, 2017 at 10:34


answered Mar 15, 2015 at 7:07

Antti Haapala -- Слава Україні


Thank you! I fixed the code following your guide, but it still doesn't work. The error "'ascii' codec can't encode character u'\u0152' in position 602: ordinal not in range(128)" still comes up.
– user3704652
Commented Mar 15, 2015 at 7:14
@user3704652 Fixed. I forgot that this is Python 2.
– Antti Haapala -- Слава Україні
Commented Mar 15, 2015 at 7:17

Can you recommend a paper or lecture where I can learn more about this? The problem is fixed! I really appreciate your help!
– user3704652
Commented Mar 15, 2015 at 7:32


user13526470 · Accepted Answer · 2020-05-12 14:19:15Z

The PDF command stream is encoded with an encoding similar to Latin-1.
The command stream includes instructions to display content on the page.
Where this content is "text", it is actually a set of instructions to display character shapes, i.e. glyphs taken from a font (or a subset of a font, or a combination of parts of several fonts).
Most of the time the information needed to translate the bytes in these instructions to (say) Unicode text is stored within the PDF, but sometimes it is not, and sometimes the translation is not possible at all (for example, where the font prints a logo).
PyPDF2 (and many other open source PDF packages) does not include functionality to deal with the full complexity of this, but fortunately many creators of documents rely on a small set of "standard encodings", which include a number of Latin-1 variants, and the 'extract text' function does provide usable results in these cases. I have also found PDFs where the font definitions have replacement mappings that give you the name of the glyph for each byte used, and found it easy to modify PyPDF2 to take care of this. Other cases are not so simple.
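
The glyph-name idea can be sketched roughly as follows. This is not PyPDF2 code: the DIFFERENCES table, the function name and the sample bytes are hypothetical, and the glyph table is only a three-entry excerpt of standard glyph names. Bytes without a replacement entry are assumed to be Latin-1 here, which a real font may not honour.

```python
# Hypothetical per-font table: byte code -> glyph name, in the spirit of the
# replacement mappings described above (not read from a real PDF).
DIFFERENCES = {0x01: "fi", 0x02: "OE", 0x03: "quoteright"}

# Small excerpt mapping standard glyph names to Unicode characters.
GLYPH_TO_UNICODE = {"fi": "\ufb01", "OE": "\u0152", "quoteright": "\u2019"}


def decode_glyph_bytes(data, differences=DIFFERENCES, glyphs=GLYPH_TO_UNICODE):
    """Translate the bytes of a text-showing instruction to Unicode text."""
    out = []
    for byte in data:  # iterating over bytes yields integers in Python 3
        name = differences.get(byte)
        if name is None:
            # No replacement entry: assume the byte is Latin-1.
            out.append(chr(byte))
        else:
            # Unknown glyph names become U+FFFD (the replacement character).
            out.append(glyphs.get(name, "\ufffd"))
    return "".join(out)
```

For example, decode_glyph_bytes(b"\x02uvre") gives the string "Œuvre" under these assumed tables.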
Finally, there are two other factors that need to be taken into account when trying to extract readable text from PDFs. The first is that some PDF streams are compressed and some are encrypted; PyPDF2 can take care of both of these cases. The second is that the PDF instructions only place characters at specific points on the page. In most cases PDF writers emit the data in reading order, but they may make positioning changes within words as well as at word breaks.
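
To make the positioning problem concrete, here is a rough sketch of putting positioned fragments back into reading order. It assumes the fragments have already been obtained as (x, y, text) tuples in PDF coordinates (y grows upwards), that each fragment carries its own spaces, and that fragments whose y values differ by at most line_tolerance belong to the same line. The function name and the tolerance value are illustrative only.

```python
def reading_order(fragments, line_tolerance=2.0):
    """Join (x, y, text) fragments top to bottom, then left to right."""
    # Larger y is higher on the page in PDF coordinates, so sort by -y first.
    ordered = sorted(fragments, key=lambda frag: (-frag[1], frag[0]))

    lines = []  # each entry: (y of the line's first fragment, [(x, text), ...])
    for x, y, text in ordered:
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            # Close enough vertically: treat as part of the current line.
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))

    # Within each line, order the fragments by their x position.
    return "\n".join(
        "".join(text for _, text in sorted(parts)) for _, parts in lines
    )


sample = [(50, 700.0, "world"), (10, 700.5, "Hello "), (10, 680.0, "Second line")]
result = reading_order(sample)
```

A real page needs more care than this (columns, rotated text, and spacing that is implied by position rather than by space characters).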

user3888329 · Accepted Answer · 2023-04-22 14:03:20Z

# Import the os and PyPDF2 libraries.
import os
import PyPDF2

# Get the directory path where the PDF files are stored.
pdf_dir = 'C:\\Store\\output'
print("working......")

# Loop over all the files in the directory.
for filename in os.listdir(pdf_dir):
    if filename.endswith('.pdf'):
        # Open the PDF file in read binary mode; the with block closes it afterwards.
        with open(os.path.join(pdf_dir, filename), 'rb') as pdf_file:
            # Read the PDF file using PyPDF2. PdfReader, pages and extract_text
            # replace PdfFileReader, getPage and extractText in PyPDF2 2.0 and later.
            pdf_reader = PyPDF2.PdfReader(pdf_file)

            # Extract the text of each page and join the pages into one string.
            text = ''.join(page.extract_text() for page in pdf_reader.pages)

        # Generate a new filename for the text file.
        txt_filename = os.path.splitext(filename)[0] + '.txt'

        # Open a new text file in write mode and write the extracted text as UTF-8.
        with open(os.path.join(pdf_dir, txt_filename), 'w', encoding='utf-8') as txt_file:
            txt_file.write(text)

print("Complete!!!")

As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center. — Community, Commented Apr 24, 2023 at 22:42

K J · Accepted Answer · 2023-04-22 20:07:51Z

0

To extract text, you can use a shortcut or similar to run a simple command-line extraction.
Here are three lines:

one to fetch the binary PDF,
one to write its text in layout order to a UTF-8 text file,
and one to open that UTF-8 text file for reading.

curl -O https://web.archive.org/web/20170829051433/http://www.iscramlive.org/ISCRAM2014/papers/p18.pdf  
"pdftotext.exe" -layout -enc UTF-8 -nopgbrk "C:\downloads\p18.pdf"
notepad p18.txt
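
If the conversion needs to be driven from Python, the same pdftotext call can be wrapped with subprocess. This is only a sketch: it assumes a pdftotext executable is installed and on the PATH, and the function names are made up for the example. The flags are the ones used in the command above.

```python
import subprocess


def pdftotext_command(pdf_path, exe="pdftotext"):
    """Build the argument list for the pdftotext call shown above."""
    return [exe, "-layout", "-enc", "UTF-8", "-nopgbrk", pdf_path]


def convert(pdf_path, exe="pdftotext"):
    """Run pdftotext; it writes a .txt file next to the PDF.

    Raises subprocess.CalledProcessError if pdftotext exits with an error,
    and FileNotFoundError if the executable cannot be found.
    """
    subprocess.run(pdftotext_command(pdf_path, exe), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.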

edited Apr 22, 2023 at 20:07

answered Apr 22, 2023 at 19:42

K J

