
I'm downloading an entire directory from a web server. It works OK, but I can't figure out how to get the file size before downloading, to check whether the file was updated on the server or not. Can this be done, as if I were downloading the file from an FTP server?

import urllib
import re

url = "http://www.someurl.com"

# Download the page locally
f = urllib.urlopen(url)
html = f.read()
f.close()

f = open ("temp.htm", "w")
f.write (html)
f.close()

# List only the .TXT / .ZIP files
fnames = re.findall('^.*<a href="(\w+(?:\.txt|\.zip))".*$', html, re.MULTILINE)

for fname in fnames:
    print fname, "..."

    f = urllib.urlopen(url + "/" + fname)

    #### Here I want to check the filesize to download or not #### 
    file = f.read()
    f.close()

    f = open(fname, "w")
    f.write(file)
    f.close()

@Jon: thanks for your quick answer. It works, but the file size on the web server is slightly less than the file size of the downloaded file.

Examples:

Local Size  Server Size
 2,223,533    2,115,516
   664,603      662,121

Does it have anything to do with CR/LF conversion?

  • Possibly. Can you run diff on it and see a difference? Also, do you see the file size difference in the binary (.zip) files? Edit: this is where things like ETags come in handy. The server will tell you when something changes, so you don't have to download the complete file to figure it out. (Aug 8, 2008)
  • You're right, I wasn't using "wb" when opening the local file for writing. Works like a charm, thanks! (See the sketch after these comments.) – PabloG (Jan 29, 2011)
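
Putting the comments' fix together with the size check the question asks for: a minimal sketch (Python 2, matching the question's urllib usage; the skip-if-same-size logic and the placeholder names are assumptions, not the original poster's code):

import os
import urllib

url = "http://www.someurl.com"
fname = "file.zip"  # placeholder

f = urllib.urlopen(url + "/" + fname)
# Content-Length is the number of bytes the server will send
remote_size = int(f.info().getheader("Content-Length", 0))
local_size = os.path.getsize(fname) if os.path.exists(fname) else -1

if remote_size != local_size:
    data = f.read()
    out = open(fname, "wb")  # binary mode: no CR/LF conversion on Windows
    out.write(data)
    out.close()
f.close()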

12 Answers


I have reproduced what you are seeing:

import urllib, os
link = "http://python.org"
print "opening url:", link
site = urllib.urlopen(link)
meta = site.info()
print "Content-Length:", meta.getheaders("Content-Length")[0]

f = open("out.txt", "r")
print "File on disk:",len(f.read())
f.close()


f = open("out.txt", "w")
f.write(site.read())
site.close()
f.close()

f = open("out.txt", "r")
print "File on disk after download:",len(f.read())
f.close()

print "os.stat().st_size returns:", os.stat("out.txt").st_size

Outputs this:

opening url: http://python.org
Content-Length: 16535
File on disk: 16535
File on disk after download: 16535
os.stat().st_size returns: 16861

What am I doing wrong here? Is os.stat().st_size not returning the correct size?


Edit: OK, I figured out what the problem was:

import urllib, os
link = "http://python.org"
print "opening url:", link
site = urllib.urlopen(link)
meta = site.info()
print "Content-Length:", meta.getheaders("Content-Length")[0]

f = open("out.txt", "rb")
print "File on disk:",len(f.read())
f.close()


f = open("out.txt", "wb")
f.write(site.read())
site.close()
f.close()

f = open("out.txt", "rb")
print "File on disk after download:",len(f.read())
f.close()

print "os.stat().st_size returns:", os.stat("out.txt").st_size

This outputs:

$ python test.py
opening url: http://python.org
Content-Length: 16535
File on disk: 16535
File on disk after download: 16535
os.stat().st_size returns: 16535

Make sure you are opening both files for binary read/write.

# open for binary write
open(filename, "wb")
# open for binary read
open(filename, "rb")
  • When you do site = urllib.urlopen(link) you have already performed the download, so it is not the size before downloading; in fact the content is downloaded into a buffer, and that is where you retrieve the Content-Length from. (Jul 5, 2014)
  • @Ciastopiekarz I think it's when you attempt to read() that the file actually gets downloaded into the buffer; check this answer. (Jul 31, 2017)
  • urllib.urlopen is no longer valid in Python 3 (checked on 3.6); see the Python 3 sketch after these comments. – ch4rl1e97 (Jun 20, 2019)
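
Both comments can be addressed at once: in Python 3 the call lives in urllib.request, and a HEAD request returns the headers without transferring the body. A minimal sketch (the URL is just an example):

import urllib.request

# HEAD asks for the headers only, so the body is never downloaded
req = urllib.request.Request("http://python.org", method="HEAD")
with urllib.request.urlopen(req) as resp:
    print("Content-Length:", resp.getheader("Content-Length"))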

Using the info() method of the object returned by urlopen(), you can get various information about the retrieved document. An example, grabbing the current Google logo:

>>> import urllib
>>> d = urllib.urlopen("http://www.google.co.uk/logos/olympics08_opening.gif")
>>> print d.info()

Content-Type: image/gif
Last-Modified: Thu, 07 Aug 2008 16:20:19 GMT  
Expires: Sun, 17 Jan 2038 19:14:07 GMT 
Cache-Control: public 
Date: Fri, 08 Aug 2008 13:40:41 GMT 
Server: gws 
Content-Length: 20172 
Connection: Close

It behaves like a dict, so to get the size of the file you do urllibobject.info()['Content-Length']:

print d.info()['Content-Length']

And to get the size of the local file (for comparison), you can use the os.stat() function:

os.stat("/the/local/file.zip").st_size
  • I have been using this solution, but I have hit an edge case where sometimes the Content-Length header is not defined. Can anyone explain why it isn't consistently returned? (See the streaming sketch after these comments.) (Jun 10, 2016)
  • stackoverflow.com/questions/22087370/… maybe explains it? – dbr (Jul 8, 2016)
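
As the linked question hints, a server that uses Transfer-Encoding: chunked sends no Content-Length at all. In that case the only way to learn the size is to count the bytes as they arrive. A minimal sketch using the requests package (the URL and filename are placeholders):

import requests

resp = requests.get("http://www.someurl.com/file.zip", stream=True)

declared = resp.headers.get("Content-Length")
if declared is not None:
    print("declared size:", int(declared))

# whether or not a length was declared, counting while streaming
# always yields the true size
size = 0
with open("file.zip", "wb") as f:
    for chunk in resp.iter_content(chunk_size=8192):
        f.write(chunk)
        size += len(chunk)
print("actual size:", size)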

A requests-based solution using HEAD instead of GET (also prints HTTP headers):

#!/usr/bin/python
# display size of a remote file without downloading

from __future__ import print_function
import sys
import requests

# number of bytes in a megabyte
MBFACTOR = float(1 << 20)

response = requests.head(sys.argv[1], allow_redirects=True)

print("\n".join([('{:<40}: {}'.format(k, v)) for k, v in response.headers.items()]))
size = response.headers.get('content-length', 0)
print('{:<40}: {:.2f} MB'.format('FILE SIZE (MB)', int(size) / MBFACTOR))

Usage

$ python filesize-remote-url.py https://httpbin.org/image/jpeg
...
Content-Length                          : 35588
FILE SIZE (MB)                          : 0.03 MB
  • Not every response will include a Content-Length; sometimes the response is generated using Transfer-Encoding: chunked, in which case there's no way to know without downloading. (Apr 8, 2023)

The size of the file is sent as the Content-Length header. Here is how to get it with urllib:

>>> site = urllib.urlopen("http://python.org")
>>> meta = site.info()
>>> print meta.getheaders("Content-Length")
['16535']
>>>

Also, if the server you are connecting to supports it, look at ETags and the If-Modified-Since and If-None-Match headers.

Using these takes advantage of the web server's caching rules: the server returns a 304 Not Modified status code if the content hasn't changed, as the sketch below shows.
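
A minimal sketch of such a conditional request, using the requests package (the URL and filename are placeholders; the answer above does not name a specific library):

import requests

url = "http://www.someurl.com/file.zip"

# first download: remember the validators the server sent
first = requests.get(url)
etag = first.headers.get("ETag")
last_modified = first.headers.get("Last-Modified")

# later: revalidate instead of unconditionally re-downloading
headers = {}
if etag:
    headers["If-None-Match"] = etag
if last_modified:
    headers["If-Modified-Since"] = last_modified

second = requests.get(url, headers=headers)
if second.status_code == 304:
    print("Not modified; keep the local copy")
else:
    with open("file.zip", "wb") as f:
        f.write(second.content)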


In Python 3:

>>> import urllib.request
>>> site = urllib.request.urlopen("http://python.org")
>>> print("FileSize: ", site.length)

For a Python 3 approach (tested on 3.5) I'd recommend:

from urllib.request import urlopen

with urlopen(file_url) as in_file, open(local_file_address, 'wb') as out_file:
    print(in_file.getheader('Content-Length'))
    out_file.write(in_file.read())

For anyone using Python 3 and looking for a quick solution using the requests package:

import requests 
response = requests.head( 
    "https://website.com/yourfile.mp4",  # Example file 
    allow_redirects=True
)
print(response.headers['Content-Length']) 

Note: not all responses will have a Content-Length, so your application should check whether it exists:

if 'Content-Length' in response.headers:
    ... # Do your stuff here 

Here is a much safer way for Python 3:

import urllib.request
site = urllib.request.urlopen("http://python.org")
meta = site.info()
meta.get('Content-Length') 

Returns:

'49829'

meta.get('Content-Length') will return the "Content-Length" header if it exists. Otherwise it will be blank.

  • Otherwise it will be None. Still +1 for your answer. Note: if you want it to return e.g. 0 when there's no Content-Length, do meta.get('Content-Length', 0). Overall, my one-liner is urllib.request.urlopen(url).info().get('Content-Length', 0). (Oct 26, 2022)

@PabloG Regarding the local/server file size difference:

The following is a high-level, illustrative explanation of why it may occur.

The size on disk is sometimes different from the actual size of the data. It depends on the underlying file system and how it operates on data. As you may have seen on Windows, when formatting a flash drive you are asked to provide a 'block/cluster size', which varies from 512 B to 8 KB. When a file is written to disk, it is stored in a sort of linked list of disk blocks. When a certain block is used to store part of a file, no other file's contents will be stored in the same block, so even if the chunk does not occupy the entire block space, the block is rendered unusable by other files.

Example: when the file system is divided into 512-byte blocks and we need to store a 600-byte file, two blocks will be occupied. The first block will be fully utilized, while the second block will have only 88 bytes utilized, and the remaining (512 - 88) bytes will be unusable, resulting in a 'file size on disk' of 1024 bytes. This is why Windows has different notions of 'file size' and 'size on disk'. (See the sketch below.)

NOTE: There are different pros and cons that come with smaller/bigger file-system blocks, so research this before playing with your file system.
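
A small sketch showing the two numbers on a POSIX system, where os.stat() exposes both (st_blocks is always counted in 512-byte units regardless of the file system's block size; the path is a placeholder):

import os

st = os.stat("file.zip")  # placeholder path
print("apparent size:", st.st_size, "bytes")
print("size on disk: ", st.st_blocks * 512, "bytes")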


A quick and reliable one-liner for Python 3 using urllib:

import urllib.request

url = 'https://<your url here>'

size = urllib.request.urlopen(url).info().get('Content-Length', 0)

.get(<dict key>, 0) looks the key up in the dict and, if the key is absent, returns 0 (or whatever the second argument is).


You can use requests to pull this data:

import requests

# X-File-Name is a non-standard header; it is only present if the
# server chooses to send it
file_name = requests.head(LINK).headers["X-File-Name"]

# other useful info, like the size of the file, comes from the same
# headers dict
file_size = requests.head(LINK).headers["Content-Length"]
