
I have written some Python scripts to download images off an HTTP website, but because I'm using urllib2, it closes the connection after each file and then opens a new one for the next. I don't really understand networking all that much, but this probably slows things down considerably, and grabbing 100 images at a time would take a considerable amount of time.

I started looking at other alternatives like pycurl or httplib, but found them complicated to figure out compared to urllib2, and I haven't found many code snippets that I could just take and use.

Simply, how would I establish a persistent connection to a website, download a number of files, and then close the connection only when I am done? (probably with an explicit call to close it)

  • I don't think the few bytes of network overhead amount to much, compared to the size of images in general. It's probably not worth the trouble, unless you have evidence to the contrary.
    – Thomas
    Commented Jan 31, 2011 at 20:40
  • Dive into Python covered a similar task. Commented Jan 31, 2011 at 20:40
  • I disagree that network overhead doesn't make a big difference. Because of TCP Slow Start (en.wikipedia.org/wiki/Slow-start), every newly created connection will be slow at first, so reusing the same TCP connection will make a difference if the data is big enough (and I think 100 pictures will be between 10 and 100 MB). Commented Jan 31, 2011 at 20:50
  • "would I establish a persistent connection to a website" Good question. Depends on the web server. You need to check the headers to see what software and what version of the HTTP protocol is supported.
    – S.Lott
    Commented Jan 31, 2011 at 21:02

3 Answers


Since you asked for an httplib snippet:

import httplib

images = ['img1.png', 'img2.png', 'img3.png']

# a single HTTPConnection keeps one TCP connection open across requests
conn = httplib.HTTPConnection('www.example.com')

for image in images:
    conn.request('GET', '/images/%s' % image)
    resp = conn.getresponse()
    data = resp.read()  # the response must be read fully before issuing the next request
    with open(image, 'wb') as f:
        f.write(data)

conn.close()

This would issue multiple (sequential) GET requests for the images in the list over a single connection, then close the connection.
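For anyone on Python 3: the same module is named http.client and the HTTPConnection API is essentially unchanged, so a near-identical sketch would be (the host and image names are placeholders, as above):

# Python 3 version of the same idea, using http.client (the renamed httplib).
# www.example.com and the image names are placeholders.
import http.client

images = ['img1.png', 'img2.png', 'img3.png']

conn = http.client.HTTPConnection('www.example.com')

for image in images:
    conn.request('GET', '/images/%s' % image)
    resp = conn.getresponse()
    data = resp.read()  # read the full body before reusing the connection
    with open(image, 'wb') as f:
        f.write(data)

conn.close()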

  • Thanks, I never thought of creating a list for the location of the images.
    – MxLDevs
    Commented Feb 1, 2011 at 13:51
  • Tried. Not enough reputation to do so.
    – MxLDevs
    Commented Feb 2, 2011 at 0:46

I found urllib3, and it claims to reuse existing TCP connections.

As I already stated in a comment on the question, I disagree with the claim that this will not make a big difference: because of the TCP Slow Start algorithm, every newly created connection will be slow at first, so reusing the same TCP socket will make a difference if the data is big enough. And I think for 100 pictures the data will be between 10 and 100 MB.

Here is a code sample from http://code.google.com/p/urllib3/source/browse/test/benchmark.py

TO_DOWNLOAD = [
    'http://code.google.com/apis/apps/',
    'http://code.google.com/apis/base/',
    'http://code.google.com/apis/blogger/',
    'http://code.google.com/apis/calendar/',
    'http://code.google.com/apis/codesearch/',
    'http://code.google.com/apis/contact/',
    'http://code.google.com/apis/books/',
    'http://code.google.com/apis/documents/',
    'http://code.google.com/apis/finance/',
    'http://code.google.com/apis/health/',
    'http://code.google.com/apis/notebook/',
    'http://code.google.com/apis/picasaweb/',
    'http://code.google.com/apis/spreadsheets/',
    'http://code.google.com/apis/webmastertools/',
    'http://code.google.com/apis/youtube/',
]

from urllib3 import HTTPConnectionPool
import urllib

pool = HTTPConnectionPool.from_url(TO_DOWNLOAD[0])
for url in TO_DOWNLOAD:
    r = pool.get_url(url)
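That benchmark uses an early urllib3 API. With a more recent urllib3 release the same idea would look roughly like this (a sketch, assuming the newer PoolManager interface):

# Rough sketch with a newer urllib3 API (assumes urllib3 1.x+ and its PoolManager).
import urllib3

http = urllib3.PoolManager()
for url in TO_DOWNLOAD:
    r = http.request('GET', url)  # connections to the same host are pooled and reused
    # r.data holds the response body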
  • I will look into urllib3 and see what I can do. I have recently installed pycurl and used an example code that seems to run multiple threads to establish connections, but I don't understand the code and can only copy/paste so far.
    – MxLDevs
    Commented Feb 1, 2011 at 13:54

If you are not going to make any complicated requests, you could open a socket and make the requests yourself, like:

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((server_name, server_port))

for url in urls:
    sock.sendall('GET %s HTTP/1.1\r\nHost: %s\r\n\r\n' % (url, server_name))
    # Parse the HTTP response header
    # Download the picture (its size should be in the HTTP header)

sock.close()
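The two commented steps are the fiddly part. Here is a rough sketch of one way to do them, assuming the server answers with a Content-Length header (no chunked transfer encoding); read_response is a hypothetical helper, not part of any library:

def read_response(sock):
    # Read until the blank line that ends the headers.
    header = ''
    while '\r\n\r\n' not in header:
        header += sock.recv(1)
    # Pull the body length out of the Content-Length header.
    length = 0
    for line in header.split('\r\n')[1:]:   # skip the status line
        name, _, value = line.partition(':')
        if name.strip().lower() == 'content-length':
            length = int(value.strip())
    # Read exactly that many body bytes.
    body = ''
    while len(body) < length:
        body += sock.recv(length - len(body))
    return body

Calling read_response after each sendall would return one image's bytes, which can then be written to disk.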

But I do not think that establishing 100 TCP sessions adds much overhead in general.
