
I am crawling the web using urllib3. Example code:

from urllib3 import PoolManager

pool = PoolManager()
response = pool.request("GET", url)

The problem is that I may stumble upon a URL that is a download of a really large file, and I am not interested in downloading it.

I found this question - Link - which suggests using urllib and urlopen, but I don't want to contact the server twice.

I want to limit the file size to 25MB. Is there a way I can do this with urllib3?

  • Read until you hit 25MB and then cancel the download?
    – jarmod
    Commented Nov 14, 2016 at 17:51
That is an option. How can I do that?
    – Montoya
    Commented Nov 14, 2016 at 17:51
  • You can use the HTTP HEAD verb and read the Content-Length header to retrieve the size. If the server omits Content-Length, there is no way to check the size unless, as jarmod mentioned, you start downloading the file (see the sketch after these comments). Commented Nov 14, 2016 at 17:55
  • I believe you can issue a HEAD request, instead of GET, and it should contain the Content-Length header. Commented Nov 14, 2016 at 17:55
  • @JohnGordon not always. Especially if it's a script sending the file and the developer did not manually set the Content-Length header, the headers will not include one.
    – spectras
    Commented Nov 14, 2016 at 17:56
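
A minimal sketch of the HEAD-first approach suggested in the comments above, assuming the server answers HEAD requests and reports Content-Length (url is a placeholder, and note that this contacts the server a second time for the actual download, which the question wanted to avoid):

from urllib3 import PoolManager

pool = PoolManager()

# 25 MB limit from the question
max_bytes = 25 * 1024 * 1024

# A HEAD request transfers headers only, no body
head = pool.request("HEAD", url)
content_length = head.headers.get("Content-Length")

if content_length is not None and int(content_length) > max_bytes:
    # Reported size exceeds the limit -- skip this URL
    pass
else:
    # Small enough, or size unknown; fetch the body with a normal GET
    response = pool.request("GET", url)
    data = response.data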

1 Answer


If the server supplies a Content-Length header, then you can use that to determine if you'd like to continue downloading the remainder of the body or not. If the server does not provide the header, then you'll need to stream the response until you decide you no longer want to continue.

To do this, you'll need to make sure that you're not preloading the full response.

from urllib3 import PoolManager

pool = PoolManager()
response = pool.request("GET", url, preload_content=False)

# Maximum amount we want to read  
max_bytes = 1000000

content_bytes = response.headers.get("Content-Length")
if content_bytes and int(content_bytes) < max_bytes:
    # Expected body is smaller than our maximum, read the whole thing
    data = response.read()
    # Do something with data
    ...
elif content_bytes is None:
    # Alternatively, stream until we hit our limit
    amount_read = 0
    for chunk in response.stream():
        amount_read += len(chunk)
        # Save chunk
        ...
        if amount_read > max_bytes:
            break

# Release the connection back into the pool
response.release_conn()
  • I also opened an issue to improve our documentation for this scenario, please add any additional notes that would be useful or helpful: github.com/shazow/urllib3/issues/1037
    – shazow
    Commented Nov 14, 2016 at 18:43
  • Quick question: as you don't close the connection and just release it to the pool, won't the next request just resume the download and break because it does not recognise an HTTP response? Shouldn't it be forcefully closed? (A sketch of that approach follows these comments.)
    – spectras
    Commented Nov 16, 2016 at 8:38
  • @spectras Honestly I'm not 100% sure what will happen off the top of my head, but if it does indeed fail to recover the connection then I'd consider it a bug in urllib3 and ask that you please report it. :) I'm pretty sure we do a check before we re-use the connection.
    – shazow
    Commented Nov 18, 2016 at 19:30
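
A minimal sketch of the defensive variant spectras suggests above: if the body was abandoned partway through, close the response instead of releasing it, so a half-read socket is never handed back to the pool. Whether urllib3 would recover the connection anyway is the open question in these comments; url and max_bytes are placeholders as in the answer.

from urllib3 import PoolManager

pool = PoolManager()
response = pool.request("GET", url, preload_content=False)

max_bytes = 25 * 1024 * 1024
amount_read = 0
aborted = False

for chunk in response.stream():
    amount_read += len(chunk)
    # Save chunk
    ...
    if amount_read > max_bytes:
        aborted = True
        break

if aborted:
    # Partially consumed body: close the connection outright rather than
    # returning it to the pool with unread data still on the wire.
    response.close()
else:
    # Fully consumed body: safe to hand the connection back for reuse.
    response.release_conn()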
