
I have written some Python scripts to download images off an HTTP website, but because I'm using urllib2, it closes the connection after each file and then opens a new one for the next. I don't really understand networking all that much, but this probably slows things down considerably, and grabbing 100 images at a time would take a considerable amount of time.

I started looking at other alternatives like pycurl or httplib, but found them complicated to figure out compared to urllib2, and I haven't found many code snippets that I could just take and use.

Simply, how would I establish a persistent connection to a website, download a number of files, and then close the connection only when I am done? (probably with an explicit call to close it)

  • I don't think the few bytes of network overhead amount to much, compared to the size of images in general. It's probably not worth the trouble, unless you have evidence to the contrary.
    – Thomas
    Commented Jan 31, 2011 at 20:40
  • Dive into Python covered a similar task. Commented Jan 31, 2011 at 20:40
  • I disagree that network overhead doesn't make a big difference. Because of TCP Slow Start (en.wikipedia.org/wiki/Slow-start), every newly created connection will be slow at first, so reusing the same TCP connection will make a difference if the data is big enough (and I think 100 pictures will be between 10 and 100 MB). Commented Jan 31, 2011 at 20:50
  • "would I establish a persistent connection to a website" Good question. Depends on the web server. You need to check the headers to see what software and what version of the HTTP protocol is supported.
    – S.Lott
    Commented Jan 31, 2011 at 21:02

3 Answers


Since you asked for an httplib snippet:

import httplib

images = ['img1.png', 'img2.png', 'img3.png']

# a single HTTPConnection keeps one TCP connection open across requests
conn = httplib.HTTPConnection('www.example.com')

for image in images:
    conn.request('GET', '/images/%s' % image)
    resp = conn.getresponse()
    data = resp.read()  # the response must be read fully before issuing the next request
    with open(image, 'wb') as f:
        f.write(data)

conn.close()

This would issue multiple (sequential) GET requests for the images in the list over a single connection, then close the connection.
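For anyone on Python 3: the same module is named http.client and the HTTPConnection API is essentially unchanged, so a near-identical sketch would be (the host and image names are placeholders, as above):

# Python 3 version of the same idea, using http.client (the renamed httplib).
# www.example.com and the image names are placeholders.
import http.client

images = ['img1.png', 'img2.png', 'img3.png']

conn = http.client.HTTPConnection('www.example.com')

for image in images:
    conn.request('GET', '/images/%s' % image)
    resp = conn.getresponse()
    data = resp.read()  # read the full body before reusing the connection
    with open(image, 'wb') as f:
        f.write(data)

conn.close()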

  • Thanks, I never thought of creating a list for the location of the images.
    – MxLDevs
    Commented Feb 1, 2011 at 13:51
  • Tried. Not enough reputation to do so.
    – MxLDevs
    Commented Feb 2, 2011 at 0:46

I found urllib3, and it claims to reuse existing TCP connections.

As I already stated in a comment on the question, I disagree with the claim that this will not make a big difference: because of the TCP Slow Start algorithm, every newly created connection will be slow at first, so reusing the same TCP socket will make a difference if the data is big enough. And I think for 100 pictures the data will be between 10 and 100 MB.

Here is a code sample from http://code.google.com/p/urllib3/source/browse/test/benchmark.py

TO_DOWNLOAD = [
    'http://code.google.com/apis/apps/',
    'http://code.google.com/apis/base/',
    'http://code.google.com/apis/blogger/',
    'http://code.google.com/apis/calendar/',
    'http://code.google.com/apis/codesearch/',
    'http://code.google.com/apis/contact/',
    'http://code.google.com/apis/books/',
    'http://code.google.com/apis/documents/',
    'http://code.google.com/apis/finance/',
    'http://code.google.com/apis/health/',
    'http://code.google.com/apis/notebook/',
    'http://code.google.com/apis/picasaweb/',
    'http://code.google.com/apis/spreadsheets/',
    'http://code.google.com/apis/webmastertools/',
    'http://code.google.com/apis/youtube/',
]

from urllib3 import HTTPConnectionPool
import urllib

pool = HTTPConnectionPool.from_url(TO_DOWNLOAD[0])
for url in TO_DOWNLOAD:
    r = pool.get_url(url)
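That benchmark uses an early urllib3 API. With a more recent urllib3 release the same idea would look roughly like this (a sketch, assuming the newer PoolManager interface):

# Rough sketch with a newer urllib3 API (assumes urllib3 1.x+ and its PoolManager).
import urllib3

http = urllib3.PoolManager()
for url in TO_DOWNLOAD:
    r = http.request('GET', url)  # connections to the same host are pooled and reused
    # r.data holds the response body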
  • I will look into urllib3 and see what I can do. I have recently installed pycurl and used an example code that seems to run multiple threads to establish connections, but I don't understand the code and can only copy/paste so far.
    – MxLDevs
    Commented Feb 1, 2011 at 13:54

If you are not going to make any complicated requests, you could open a socket and make the requests yourself, like:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((server_name, server_port))

for url in urls:
    sock.sendall('GET %s HTTP/1.1\r\nHost: %s\r\n\r\n' % (url, server_name))
    # Parse the HTTP response header
    # Download the picture (its size should be in the HTTP header)

sock.close()
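The two commented steps are the fiddly part. Here is a rough sketch of one way to do them, assuming the server answers with a Content-Length header (no chunked transfer encoding); read_response is a hypothetical helper, not part of any library:

def read_response(sock):
    # Read until the blank line that ends the headers.
    header = ''
    while '\r\n\r\n' not in header:
        header += sock.recv(1)
    # Pull the body length out of the Content-Length header.
    length = 0
    for line in header.split('\r\n')[1:]:   # skip the status line
        name, _, value = line.partition(':')
        if name.strip().lower() == 'content-length':
            length = int(value.strip())
    # Read exactly that many body bytes.
    body = ''
    while len(body) < length:
        body += sock.recv(length - len(body))
    return body

Calling read_response after each sendall would return one image's bytes, which can then be written to disk.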

But I do not think that establishing 100 TCP sessions adds much overhead in general.
