
I am crawling the web using urllib3. Example code:

from urllib3 import PoolManager

pool = PoolManager()
response = pool.request("GET", url)

The problem is that I may stumble upon a URL that is a download of a really large file, and I am not interested in downloading it.

I found this question - Link - which suggests using urllib and urlopen, but I don't want to contact the server twice.

I want to limit the file size to 25MB. Is there a way I can do this with urllib3?

  • Read until you hit 25MB and then cancel the download?
    – jarmod
    Commented Nov 14, 2016 at 17:51
That is an option. How can I do that?
    – Montoya
    Commented Nov 14, 2016 at 17:51
  • You can use the HTTP HEAD verb and read the Content-Length header to retrieve the size. If the server omits Content-Length, there is no way to check the size unless, as jarmod mentioned, you start downloading the file (see the sketch after these comments). Commented Nov 14, 2016 at 17:55
  • I believe you can issue a HEAD request, instead of GET, and it should contain the Content-Length header. Commented Nov 14, 2016 at 17:55
  • @JohnGordon not always. Especially if it's a script sending the file and the developer did not manually set the Content-Length header, the headers will not include one.
    – spectras
    Commented Nov 14, 2016 at 17:56
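
A minimal sketch of the HEAD-first approach suggested in the comments above, assuming the server answers HEAD requests and reports Content-Length (url is a placeholder, and note that this contacts the server a second time for the actual download, which the question wanted to avoid):

from urllib3 import PoolManager

pool = PoolManager()

# 25 MB limit from the question
max_bytes = 25 * 1024 * 1024

# A HEAD request transfers headers only, no body
head = pool.request("HEAD", url)
content_length = head.headers.get("Content-Length")

if content_length is not None and int(content_length) > max_bytes:
    # Reported size exceeds the limit -- skip this URL
    pass
else:
    # Small enough, or size unknown; fetch the body with a normal GET
    response = pool.request("GET", url)
    data = response.data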

1 Answer


If the server supplies a Content-Length header, then you can use that to determine if you'd like to continue downloading the remainder of the body or not. If the server does not provide the header, then you'll need to stream the response until you decide you no longer want to continue.

To do this, you'll need to make sure that you're not preloading the full response.

from urllib3 import PoolManager

pool = PoolManager()
response = pool.request("GET", url, preload_content=False)

# Maximum amount we want to read  
max_bytes = 1000000

content_bytes = response.headers.get("Content-Length")
if content_bytes and int(content_bytes) < max_bytes:
    # Expected body is smaller than our maximum, read the whole thing
    data = response.read()
    # Do something with data
    ...
elif content_bytes is None:
    # Alternatively, stream until we hit our limit
    amount_read = 0
    for chunk in response.stream():
        amount_read += len(chunk)
        # Save chunk
        ...
        if amount_read > max_bytes:
            break

# Release the connection back into the pool
response.release_conn()
  • I also opened an issue to improve our documentation for this scenario, please add any additional notes that would be useful or helpful: github.com/shazow/urllib3/issues/1037
    – shazow
    Commented Nov 14, 2016 at 18:43
  • Quick question: as you don't close the connection and just release it to the pool, won't the next request just resume the download and break because it does not recognise an HTTP response? Shouldn't it be forcefully closed? (A sketch of that approach follows these comments.)
    – spectras
    Commented Nov 16, 2016 at 8:38
  • @spectras Honestly I'm not 100% sure what will happen off the top of my head, but if it does indeed fail to recover the connection then I'd consider it a bug in urllib3 and ask that you please report it. :) I'm pretty sure we do a check before we re-use the connection.
    – shazow
    Commented Nov 18, 2016 at 19:30
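
A minimal sketch of the defensive variant spectras suggests above: if the body was abandoned partway through, close the response instead of releasing it, so a half-read socket is never handed back to the pool. Whether urllib3 would recover the connection anyway is the open question in these comments; url and max_bytes are placeholders as in the answer.

from urllib3 import PoolManager

pool = PoolManager()
response = pool.request("GET", url, preload_content=False)

max_bytes = 25 * 1024 * 1024
amount_read = 0
aborted = False

for chunk in response.stream():
    amount_read += len(chunk)
    # Save chunk
    ...
    if amount_read > max_bytes:
        aborted = True
        break

if aborted:
    # Partially consumed body: close the connection outright rather than
    # returning it to the pool with unread data still on the wire.
    response.close()
else:
    # Fully consumed body: safe to hand the connection back for reuse.
    response.release_conn()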
