0

I am trying to convert a PDF document to a text file. (example of pdf file link)

I tried the code below, but the extracted text is garbled, like ??챘#?遏?h첨챦_철?‾n?~w??¬?k. How can I fix it?

#!/usr/bin/python
# -*- coding: cp949 -*-
# -*- coding: utf-8 -*-
# -*- coding: latin-1 -*-
# -*- coding: euc-kr -*-

import codecs
import pyPdf
filename = "d:/data/processed_data/paper/iscram/2006/iscram1.pdf"
#pdf = codecs.open(filename, "rb", encoding = 'utf-8') 
pdf = codecs.open(filename, "rb", encoding = 'latin1')
for page in pdf:
    print page.encode('utf-8')

I use the 64-bit Korean version of Windows 7.

I also tried another way, using pyPdf, as shown below:

import os
import glob
from pyPdf import PdfFileReader
import pdfminer
 
f=open("d:/data/processed_data/paper/iscram/2006/iscram1.txt",'w')
parent = "d:/data/processed_data/paper/iscram/2006"
os.chdir(parent)
filename = os.path.abspath('iscram1.pdf')
 
input = PdfFileReader(file(filename, "rb"))
for page in input.pages:
    f.write(page.extractText())

But it doesn't work either; it raises the error "'ascii' codec can't encode character u'\u0152' in position 602: ordinal not in range(128)".

3
  • You can't really use all those encoding declarations can you -- like, that doesn't work does it?
    – jedwards
    Commented Mar 15, 2015 at 6:03
  • 1
    Also, you're not using pyPdf anywhere, that probably doesn't help.
    – jedwards
    Commented Mar 15, 2015 at 6:04
  • @jedwards I did use pyPdf, but I failed to get a good result. Commented Mar 15, 2015 at 6:07

4 Answers

3

The first snippet could not work at all: a PDF does not necessarily contain directly readable text, so decoding the raw file bytes only produces garbage. The second snippet, using pyPdf, looks more promising.

The UnicodeEncodeError is raised because page.extractText() returns a unicode string, and in Python 2 f.write tries to encode it with the ASCII codec, which fails on non-ASCII characters such as u'\u0152'.

Thus you should encode the extracted text explicitly, for example as UTF-8, before writing it:

for page in input.pages:
    f.write(page.extractText().encode('UTF-8'))
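
To make the cause of the error concrete, here is a minimal sketch that does not need a PDF at all. It assumes a made-up sample string containing the same character (u'\u0152') as the error message, and a temporary file path chosen only for illustration. It shows that the ASCII codec rejects the character while UTF-8 accepts it, and that opening the output file with an explicit encoding (io.open, available in Python 2.6+ and equivalent to the built-in open in Python 3) avoids the implicit ASCII step.

```python
import io
import os
import tempfile

# Sample text (an assumption for illustration) containing U+0152,
# the character named in the error message.
text = u"\u0152uvre"

# The ASCII codec cannot represent U+0152, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent every Unicode character.
utf8_bytes = text.encode("utf-8")

# Opening the file with an explicit encoding lets us write unicode directly.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the file back with the same encoding to confirm nothing was lost.
with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()
```

With a file opened this way, the loop above could write page.extractText() directly, without calling .encode('UTF-8') on each page.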
3
  • Thank you! I fixed the code following your guide, but it still doesn't work T.T. The error ['ascii' codec can't encode character u'\u0152' in position 602: ordinal not in range(128)'] still comes up. Commented Mar 15, 2015 at 7:14
  • @user3704652 fixed. Forgot that this is Python 2 Commented Mar 15, 2015 at 7:17
  • 1
    Can you recommend a paper or lecture where I can learn more about this? The problem is fixed! I really appreciate your help!! Commented Mar 15, 2015 at 7:32
0
  1. The PDF command stream is encoded with an encoding similar to latin-1.
  2. The command stream includes instructions to display stuff on the page.
  3. Where this stuff is "text", it is actually instructions to display character shapes, i.e. glyphs taken from a font (or a subset of a font, or a combination of bits of several fonts).
  4. Most of the time the information needed to translate the bytes in these instructions to (say) Unicode text is stored within the PDF, but sometimes it is not, and sometimes the translation is not possible at all (for example where the font prints a logo).
  5. PyPDF2 (and many other open source PDF packages) does not include functionality to deal with the full complexity of this. Fortunately, many creators of documents rely on a small set of "standard encodings", which include a number of latin-1 variants, and the 'extract text' function does provide usable results in these cases. I have also found PDFs where the font definitions have replacement mappings that give you the name of the glyph for each byte used, and found it easy to modify PyPDF2 to take care of this. Other cases are not so simple.
  6. Finally, there are two other factors that need to be taken into account when trying to extract readable text from PDFs. The first is that some PDF streams are compressed and some are encrypted; PyPDF2 can take care of both of these cases. The second is that the PDF instructions only put the characters at specific points on the page. In most cases PDF writers emit the data in reading order, but they may make positioning changes within words as well as at word breaks, which makes it hard to tell where words begin and end.
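
The points above can be illustrated with a small, self-contained sketch. It is not how PyPDF2 works internally; the content stream, the byte codes, and both lookup tables are made up for illustration (the code-to-glyph table only stands in for the kind of per-font replacement mapping mentioned in point 5). It shows that a Flate-compressed stream is unreadable when decoded as latin-1, that the drawing instructions reappear after decompression, and that the bytes in a text-showing instruction only become Unicode once they are mapped through the font's glyph names.

```python
import re
import zlib

# A tiny hand-written content stream: select font F1 at 12pt,
# then show a 3-byte string. The byte codes are arbitrary.
content = b"BT /F1 12 Tf (\x01\x02\x03) Tj ET"

# Streams are often stored Flate-compressed (zlib), so the raw bytes
# in the file do not look like text at all.
compressed = zlib.compress(content)
raw_view = compressed.decode("latin-1")  # never raises, but is not the text

# After decompression the drawing instructions are visible again.
decoded = zlib.decompress(compressed)

# Hypothetical per-font mapping: byte code -> glyph name.
code_to_glyph = {1: "O", 2: "OE", 3: "K"}

# Hypothetical (and heavily truncated) glyph name -> Unicode table.
glyph_to_unicode = {"O": u"O", "OE": u"\u0152", "K": u"K"}

# Pull the string operand of the Tj (show text) operator out of the stream.
match = re.search(br"\((.*?)\)\s*Tj", decoded, re.S)
codes = bytearray(match.group(1))

# Translate each byte code to a glyph name, then to a Unicode character.
text = u"".join(glyph_to_unicode[code_to_glyph[c]] for c in codes)
```

Reading the PDF file itself with a text codec, as in the first snippet of the question, corresponds to raw_view here: it succeeds but yields nothing readable.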

0
# Convert every PDF in a directory to a UTF-8 text file with PyPDF2.
# Uses the PdfReader API of recent PyPDF2 releases; the older PdfFileReader,
# getNumPages, getPage and extractText names were removed in PyPDF2 3.0.
import os

import PyPDF2

# Directory where the PDF files are stored.
pdf_dir = 'C:\\Store\\output'
print("working......")

# Loop over all the files in the directory.
for filename in os.listdir(pdf_dir):
    if not filename.endswith('.pdf'):
        continue

    # Open the PDF in binary mode; the with block closes it afterwards.
    with open(os.path.join(pdf_dir, filename), 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)

        # Extract the text of each page and join the pages together.
        text = ''.join(page.extract_text() for page in pdf_reader.pages)

    # Name the text file after the PDF, with a .txt extension.
    txt_filename = os.path.splitext(filename)[0] + '.txt'

    # Write the extracted text as UTF-8 so non-ASCII characters survive.
    with open(os.path.join(pdf_dir, txt_filename), 'w', encoding='utf-8') as txt_file:
        txt_file.write(text)

print("Complete!!!")
0

To extract the text you can run a simple command-line extraction, for example from a shortcut or batch file. Here are three lines:

  • one to fetch the PDF (binary)
  • one to write its text, in layout order, as a UTF-8 .txt file
  • and one to open the resulting UTF-8 text file


curl -O https://web.archive.org/web/20170829051433/http://www.iscramlive.org/ISCRAM2014/papers/p18.pdf  
"pdftotext.exe" -layout -enc UTF-8 -nopgbrk "C:\downloads\p18.pdf"
notepad p18.txt
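
If you would rather drive the same conversion from Python, a minimal sketch could look like the following. The function names are hypothetical, and it assumes a pdftotext executable is installed and on the PATH; the flags are the ones from the command line above. When no output file is given, pdftotext writes a .txt file next to the input PDF, which is what the second function relies on.

```python
import os
import subprocess


def build_pdftotext_command(pdf_path, exe="pdftotext"):
    """Return the argument list for the pdftotext call shown above."""
    return [exe, "-layout", "-enc", "UTF-8", "-nopgbrk", pdf_path]


def pdf_to_text(pdf_path, exe="pdftotext"):
    """Run pdftotext and return the path of the .txt file it writes.

    Assumes pdftotext is installed; raises CalledProcessError if it fails.
    """
    subprocess.check_call(build_pdftotext_command(pdf_path, exe))
    # pdftotext replaces the .pdf extension with .txt by default.
    return os.path.splitext(pdf_path)[0] + ".txt"
```

Calling pdf_to_text on the path of the downloaded p18.pdf would then return the path of the matching p18.txt, which can be read with an explicit encoding='utf-8'.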
