14

How would I open a PDF from a URL instead of from disk?

Something like

input1 = PdfFileReader(file("http://example.com/a.pdf", "rb"))

I want to open several files from the web and download a single merged file containing all of them.

1

4 Answers

20

I think urllib2 will get you what you want.

from urllib2 import Request, urlopen
from pyPdf import PdfFileWriter, PdfFileReader
from StringIO import StringIO

url = "http://www.silicontao.com/ProgrammingGuide/other/beejnet.pdf"
writer = PdfFileWriter()

remoteFile = urlopen(Request(url)).read()
memoryFile = StringIO(remoteFile)
pdfFile = PdfFileReader(memoryFile)

# Copy every page of the downloaded PDF into the writer
for pageNum in xrange(pdfFile.getNumPages()):
    currentPage = pdfFile.getPage(pageNum)
    #currentPage.mergePage(watermark.getPage(0))
    writer.addPage(currentPage)

# Save the result to disk in binary mode
outputStream = open("output.pdf","wb")
writer.write(outputStream)
outputStream.close()
6
  • I get AttributeError: 'str' object has no attribute 'seek'
    – meadhikari
    Commented Mar 17, 2012 at 16:38
  • 1
    @meadhikari, sorry about that, it's fixed now.
    – John
    Commented Mar 17, 2012 at 17:15
  • 1
    @meadhikari Your code is good, my fault again. outputStream = file("output.pdf","wb") needs to be outputStream = open("output.pdf","wb")
    – John
    Commented Mar 17, 2012 at 20:04
  • 3
    Use urllib.request instead of urllib2 on Python 3.
    Commented Apr 24, 2020 at 5:03
  • 4
    For "StringIO" on Python 3, import from the io module; since the downloaded data is bytes, use from io import BytesIO rather than StringIO.
    Commented Apr 24, 2020 at 5:07
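The Python 3 comments above can be illustrated with a small sketch. On Python 3, `urlopen(...).read()` returns bytes, so the in-memory wrapper that fits here is presumably `io.BytesIO`, which also provides the `seek` method the earlier `AttributeError` complained about. The helper name `to_seekable` and the sample bytes are made up for illustration, and the snippet only covers the buffer handling, not the PDF parsing.

```python
import io


def to_seekable(data):
    """Wrap downloaded bytes in a seekable, file-like object."""
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError("expected bytes, got %s" % type(data).__name__)
    return io.BytesIO(data)


# Stand-in for urlopen(Request(url)).read(); this is not a real PDF.
sample = b"%PDF-1.4 placeholder bytes"

memory_file = to_seekable(sample)
memory_file.seek(0, io.SEEK_END)   # PDF readers jump around the file like this
size = memory_file.tell()
memory_file.seek(0)                # rewind before handing it to a reader

# io.StringIO only accepts text, so passing bytes fails on Python 3.
try:
    io.StringIO(sample)
    string_io_accepts_bytes = True
except TypeError:
    string_io_accepts_bytes = False
```

The resulting `memory_file` is what would be passed to the PDF reader in place of a file opened from disk.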
9

I think it could be simplified with Requests now.

import io
import requests
from PyPDF2 import PdfReader
headers = {'User-Agent': 'Mozilla/5.0 (X11; Windows; Windows x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36'}

url = 'https://www.url_of_pdf_file.com/sample.pdf'
response = requests.get(url=url, headers=headers, timeout=120)
on_fly_mem_obj = io.BytesIO(response.content)
pdf_file = PdfReader(on_fly_mem_obj)
1
  • 2
    This is the right answer now.
    Commented Mar 23, 2023 at 16:57
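The question also asks about merging several downloaded files, which the snippet above does not cover. A possible extension, assuming the `requests` and `pypdf` packages are installed (recent PyPDF2 releases offer the same `PdfReader` and `PdfWriter` names), might look like this. The function names `fetch_with_requests`, `download_all` and `merge_streams` are hypothetical, as are the URLs in the usage comment.

```python
import io


def fetch_with_requests(url, timeout=120):
    """Download a URL and return its body as bytes (needs the requests package)."""
    import requests  # imported lazily so download_all can be used with another fetcher

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail early on 404, 500, ...
    return response.content


def download_all(urls, fetch=fetch_with_requests):
    """Return one in-memory file per URL, in the same order as urls."""
    return [io.BytesIO(fetch(url)) for url in urls]


def merge_streams(streams, output_path):
    """Append every page of every PDF stream to a single output file."""
    from pypdf import PdfReader, PdfWriter  # assumption: pypdf is installed

    writer = PdfWriter()
    for stream in streams:
        for page in PdfReader(stream).pages:
            writer.add_page(page)
    with open(output_path, "wb") as output_file:
        writer.write(output_file)


# Hypothetical usage with placeholder URLs:
# streams = download_all(["https://example.com/a.pdf", "https://example.com/b.pdf"])
# merge_streams(streams, "merged.pdf")
```

Passing the fetch function as a parameter keeps the download step separate from the merge step, so either one can be swapped out or tried without network access.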
4

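Well, you can first download the PDF separately and then use pyPdf to read it (Python 2):

import os
import urllib

from pyPdf import PdfFileReader

url = 'http://example.com/a.pdf'
pdfFile = url.split('/')[-1]  # local file name taken from the URL

# Download the PDF and save it to disk in binary mode
webFile = urllib.urlopen(url)
localFile = open(pdfFile, 'wb')
localFile.write(webFile.read())
webFile.close()
localFile.close()

# Make sure the saved file has a .pdf extension
base = os.path.splitext(pdfFile)[0]
os.rename(pdfFile, base + ".pdf")
pdfFile = base + ".pdf"

input1 = PdfFileReader(open(pdfFile, "rb"))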
2
  • Hey, what is thisFile in the line base = os.path.splitext(thisFile)[0]?
    – meadhikari
    Commented Mar 17, 2012 at 16:33
  • 1
    Oh sorry, that was a mistake; it should be pdfFile (the path of the downloaded file).
    – Switch
    Commented Mar 17, 2012 at 16:41
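For readers on Python 3, the same download-to-disk idea could be sketched with the standard library only. The names `local_name_from_url` and `download_pdf` and the `download.pdf` fallback are assumptions made for this example; the saved file can then be opened with whichever PDF library is in use.

```python
import os
import shutil
from urllib.parse import unquote, urlsplit
from urllib.request import urlopen


def local_name_from_url(url, default="download.pdf"):
    """Derive a local file name ending in .pdf from the last URL path segment."""
    path = urlsplit(url).path               # drops any ?query and #fragment
    name = os.path.basename(unquote(path))
    if not name:
        return default                      # URL ended in "/", nothing to reuse
    base = os.path.splitext(name)[0]
    return base + ".pdf"


def download_pdf(url, directory="."):
    """Stream the response body to disk in binary mode and return the path."""
    target = os.path.join(directory, local_name_from_url(url))
    with urlopen(url) as response, open(target, "wb") as local_file:
        shutil.copyfileobj(response, local_file)
    return target


# Hypothetical usage with a placeholder URL:
# path = download_pdf("http://example.com/a.pdf")
```

Parsing the URL before taking the last segment avoids file names that contain a query string, which a plain `url.split('/')[-1]` would keep.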
2

For Python 3.8:

import io
from urllib.request import Request, urlopen

from PyPDF2 import PdfFileReader


class GetPdfFromUrlMixin:
    def get_pdf_from_url(self, url):
        """
        :param url: url to get pdf file
        :return: PdfFileReader object
        """
        remote_file = urlopen(Request(url)).read()
        memory_file = io.BytesIO(remote_file)
        pdf_file = PdfFileReader(memory_file)
        return pdf_file
2
  • 1
    You might want to use PdfReader instead of the deprecated PdfFileReader.
    Commented Oct 15, 2022 at 12:13
  • Also, it is completely unnecessary to put this function inside a class.
    Commented Oct 15, 2022 at 12:20
