
I've written my first web scraper, which (surprisingly) does the job. I'm scraping a popular comic website for the images (there are over 950 of them) but the problem is the scraper is way too slow.

For example, if I download a sample of 10 comics, it takes an average of 4 to 5 seconds per image (over 40 seconds in total for the sample), which is a bit too slow if you ask me, because each image is only about 80 KB to 800 KB in size.

I've read that I could switch to lxml to do the scraping asynchronously, but the package is not compatible with Python 3.6.

I'm submitting the scraper for both a code and a performance review. Any insight on either topic will be highly appreciated.

import time
import os
import sys
import re
import requests
import itertools
from requests import get
from bs4 import BeautifulSoup as bs

HOME_DIR = os.getcwd()
DEFAULT_DIR_NAME = 'poorly_created_folder'

def show_logo():
  print("""
a Python comic(al) scraper for poorlydwarnlines.com
                         __
.-----.-----.-----.----.|  |.--.--.
|  _  |  _  |  _  |   _||  ||  |  |
|   __|_____|_____|__|  |__||___  |
|__|                        |_____|
                __ __   __
.--.--.--.----.|__|  |_|  |_.-----.-----.
|  |  |  |   _||  |   _|   _|  -__|     |
|________|__|  |__|____|____|_____|__|__|

.-----.----.----.---.-.-----.-----.----.
|__ --|  __|   _|  _  |  _  |  -__|   _|
|_____|____|__| |___._|   __|_____|__|
                      |__|
version: 0.2 | author: baduker | https://github.com/baduker
  """)

def handle_menu():
  print("\nThe scraper has found {} comics.".format(len(found_comics)))
  print("How many comics do you want to download?")
  print("Type 0 to exit.")

  while True:
    try:
      global n_of_comics
      n_of_comics = int(input(">> ").strip())
    except ValueError:
      print("Error: incorrect value. Try again.")
      continue
    if n_of_comics > len(found_comics) or n_of_comics < 0:
      print("Error: incorrect number of comics to download. Try again.")
      continue
    elif n_of_comics == 0:
      sys.exit()
    else:
      break
  return n_of_comics

def move_to_dir(title):
  if os.getcwd() != HOME_DIR:
    os.chdir(HOME_DIR)
  try:
    os.mkdir(title)
    os.chdir(title)
  except FileExistsError:
    os.chdir(title)
  except:
    print("Couldn't create directory!")

def generate_comic_link(array, num):
  for link in itertools.islice(array, 0, num):
    yield link

def grab_image_src_url(link):
  req = requests.get(link)
  comic = req.text
  soup = bs(comic, 'html.parser')
  for i in soup.find_all('p'):
    for img in i.find_all('img', src=True):
      return img['src']

def download_image(link):
  file_name = url.split('/')[-1]
  with open(file_name, "wb") as file:
    response = get(url)
    file.write(response.content)

def fetch_comic_archive():
  url = 'http://www.poorlydrawnlines.com/archive/'
  req = requests.get(url)
  page = req.text
  soup = bs(page, 'html.parser')
  all_links = []
  for link in soup.find_all('a'):
    all_links.append(link.get('href'))
  return all_links

def filter_comic_archive(archive):
  pattern = re.compile(r'http://www.poorlydrawnlines.com/comic/.+')
  filtered_links = [i for i in archive if pattern.match(i)]
  return filtered_links

show_logo()

all_comics = fetch_comic_archive()
found_comics = filter_comic_archive(all_comics)

handle_menu()

start = time.time()
for link in generate_comic_link(found_comics, n_of_comics):
  print("Downloading: {}".format(link))
  move_to_dir(DEFAULT_DIR_NAME)
  url = grab_image_src_url(link)
  download_image(url)
end = time.time()

print("Successfully downloaded {} comics in {:.2f} seconds.".format(n_of_comics, end - start))
  • My guess is that the biggest issue is server-side: the comic server itself may simply be this slow. You can optimize your code and see hardly any change in timing, or you can implement threading (as done below) and thereby put more stress on the target server. But the question is: do you need the job done quickly? Scraping with respect for the target server is usually how good web scrapers operate.
    – Jeyekomon
    Commented Mar 19, 2018 at 11:09

2 Answers


This seems like a cool project, and a nice first introduction to web scraping! I'll cover some general advice first, and then address your main concern: speed.


General

  1. You use two spaces for indentation. PEP-8 recommends 4 spaces (most code editors have the option to automatically convert tabs to spaces, which I highly recommend).

  2. While on the subject, PEP-8 also recommends two blank lines between top-level functions.

  3. In general, the use of globals is discouraged. There are some exceptions:

    • Global constants are okay;
    • Under certain circumstances, avoiding global state would result in unacceptably complicated code.

    However, neither of these applies here. Using global, non-constant variables makes your code harder to debug, and is a burden for anyone reading or updating it in the future. You can easily avoid global state here by passing variables as parameters and returning results.
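    For example, a minimal sketch (the full rewrite below follows the same approach):

    def handle_menu(found_comics):
        ...  # same body as before, minus the `global n_of_comics` statement
        return n_of_comics

    # The caller passes the list in and receives the count back:
    n_of_comics = handle_menu(found_comics)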

  4. In my opinion, show_logo() is a bit overkill. I'd make it a constant instead (LOGO).

  5. The while loop in handle_menu() can be improved:

    while True:
      try:
        global n_of_comics
        n_of_comics = int(input(">> ").strip())
    

    There's no need to call str.strip() before passing the return value to int.

      except ValueError:
        print("Error: incorrect value. Try again.")
        continue
    

    The error message is a bit vague. Why is the value incorrect? Maybe something like 'Error: expected a number. Try again.' would fit better.

      if n_of_comics > len(found_comics) or n_of_comics < 0:
        print("Error: incorrect number of comics to download. Try again.")
        continue
    

    This error message is quite clear :)

      elif n_of_comics == 0:
        sys.exit()
      else:
        break
    

    You don't need the else clause here, since the elif clause stops the program if executed. In fact, you can simply return n_of_comics there. Putting this all together:

    while True:
      try:
        global n_of_comics
        n_of_comics = int(input(">> "))
      except ValueError:
        print("Error: expected a number. Try again.")
        continue
      if n_of_comics > len(found_comics) or n_of_comics < 0:
        print("Error: incorrect number of comics to download. Try again.")
        continue
      elif n_of_comics == 0:
        sys.exit()
      return n_of_comics
    
  6. Your 'main routine' isn't encapsulated in a function, which means it's hard to test / explore functions individually from a Python interactive session (or from a separate file). I suggest putting all this top level code into a main() function:

    def main():
        show_logo()
    
        all_comics = fetch_comic_archive()
        found_comics = filter_comic_archive(all_comics)
    
        handle_menu()
    
        start = time.time()
        for link in generate_comic_link(found_comics, n_of_comics):
            print("Downloading: {}".format(link))
            move_to_dir(DEFAULT_DIR_NAME)
            url = grab_image_src_url(link)
            download_image(url)
        end = time.time()
    
        print("Successfully downloaded {} comics in {:.2f} seconds.".format(n_of_comics, end - start))
    

    You can then check if __name__ == "__main__", to run main if the script is called as the main program (see 'What does if __name__ == "__main__" do?'):

    if __name__ == "__main__":
        main()
    
  7. I don't get the point of move_to_dir(). It handles the case where the current working directory has been changed, and it's called for each comic to download. That seems pretty pointless to me. Instead, I would only create the directory once, in main():

    DEFAULT_DIR_NAME = "poorly_created_folder"
    COMICS_DIRECTORY = os.path.join(os.getcwd(), DEFAULT_DIR_NAME)
    
    ...
    
    def download_image(link):
        ...
        with open(os.path.join(COMICS_DIRECTORY, file_name), "wb") as file:
            ...
    
    ...
    
    def main():
        ...
        try:
            os.mkdir(COMICS_DIRECTORY)
        except OSError as exc:
            sys.exit("Failed to create directory (errno {})".format(exc.errno))
            # `sys.exit` will write the message to stderr and return with status code 1
    
  8. generate_comic_link() is superfluous. The following functions all do the same:

    def generate_comic_link(array, num):
        # Using itertools.islice()
        for link in itertools.islice(array, 0, num):
            yield link
    
    
    def generate_comic_link2(array, num):
        # Using slicing
        for link in array[:num]:
            yield link
    
    
    def generate_comic_link3(array, num):
        # Using slicing with yield from
        yield from array[:num]
    

    itertools.islice() is overkill (and a little harder to read). Since generate_comic_link3() is a one-liner, you can probably get rid of the function altogether, iterating over the URLs directly using slicing.
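    For example, the main loop could iterate over the slice directly (the move_to_dir() call is dropped here, per point 7):

    for link in found_comics[:n_of_comics]:
        print("Downloading: {}".format(link))
        url = grab_image_src_url(link)
        download_image(url)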

  9. A bit of a nitpick, but req = requests.get(url) is wrong: requests.get doesn't return the request, it returns the response. Thus, response = requests.get(url) makes more sense.

  10. Some of your variables could (should) be constants:

    url = 'http://www.poorlydrawnlines.com/archive/'
    # Might be
    ARCHIVE_URL = "http://www.poorlydrawnlines.com/archive/"
    
    pattern = re.compile(r'http://www.poorlydrawnlines.com/comic/.+')
    # Might be
    COMIC_PATTERN = re.compile(r"http://www.poorlydrawnlines.com/comic/.+")
    

    (I prefer double quotes, as you might be able to tell)

  11. In fetch_comic_archive():

    all_links = []
    for link in soup.find_all('a'):
        all_links.append(link.get('href'))
    return all_links
    

    ... might be a one-liner:

    return [link.get("href") for link in soup.find_all("a")]
    
  12. In filter_comic_archive():

    filtered_links = [i for i in archive if pattern.match(i)]
    return filtered_links
    

    There's no need for the intermediate variable.
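    That is, simply:

    return [i for i in archive if pattern.match(i)]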

Using threading.Thread

Now for the actual challenge: improving performance. You don't need asynchronous lxml; threads will do just fine here! By wrapping the relevant code in a function, we can spawn any number of threads to do the work:

import threading

def download_comic(link):
    print("Downloading: {}".format(link))
    move_to_dir(DEFAULT_DIR_NAME)
    url = grab_image_src_url(link)
    download_image(url)

...

def main():
    ...
    for link in generate_comic_link(found_comics, n_of_comics):
        thread = threading.Thread(target=download_comic, args=(link,))
        thread.start()
    thread.join()
    # Join the last thread to make sure all comics have been
    # downloaded before printing the time difference
    ...

Rewrite

Putting everything together (I've changed names, reordered some functionality and added comments explaining certain changes):

import time
import os
import sys
import re
import threading

# PEP-8 recommends a blank line in between
# stdlib imports and third-party imports.

import requests
# Importing `requests` *and* `get` from `requests` is confusing
from bs4 import BeautifulSoup as bs

DEFAULT_DIR_NAME = "poorly_created_folder"
COMICS_DIRECTORY = os.path.join(os.getcwd(), DEFAULT_DIR_NAME)
LOGO = """
a Python comic(al) scraper for poorlydwarnlines.com
                         __
.-----.-----.-----.----.|  |.--.--.
|  _  |  _  |  _  |   _||  ||  |  |
|   __|_____|_____|__|  |__||___  |
|__|                        |_____|
                __ __   __
.--.--.--.----.|__|  |_|  |_.-----.-----.
|  |  |  |   _||  |   _|   _|  -__|     |
|________|__|  |__|____|____|_____|__|__|

.-----.----.----.---.-.-----.-----.----.
|__ --|  __|   _|  _  |  _  |  -__|   _|
|_____|____|__| |___._|   __|_____|__|
                      |__|
version: 0.2 | author: baduker | https://github.com/baduker
"""
ARCHIVE_URL = "http://www.poorlydrawnlines.com/archive/"
COMIC_PATTERN = re.compile(r"http://www.poorlydrawnlines.com/comic/.+")


def download_comics_menu(comics_found):
    print("\nThe scraper has found {} comics.".format(len(comics_found)))
    print("How many comics do you want to download?")
    print("Type 0 to exit.")

    while True:
        try:
            comics_to_download = int(input(">> "))
        except ValueError:
            print("Error: expected a number. Try again.")
            continue
        if comics_to_download > len(comics_found) or comics_to_download < 0:
            print("Error: incorrect number of comics to download. Try again.")
            continue
        elif comics_to_download == 0:
            sys.exit()
        return comics_to_download


def grab_image_src_url(url):
    response = requests.get(url)
    soup = bs(response.text, "html.parser")
    for i in soup.find_all("p"):
        for img in i.find_all("img", src=True):
            return img["src"]


def download_and_write_image(url):
    # `download_and_write_image` is a bit more accurate, since
    # it also writes the image to the disk
    file_name = url.split("/")[-1]
    with open(os.path.join(COMICS_DIRECTORY, file_name), "wb") as file:
        response = requests.get(url)
        # Replaced `get` with `requests.get`
        file.write(response.content)


def fetch_comics_from_archive():
    # Merged `fetch_comic_archive` and `filter_comic_archive`
    # into a single function
    response = requests.get(ARCHIVE_URL)
    soup = bs(response.text, "html.parser")
    comics = [url.get("href") for url in soup.find_all("a")]
    return [url for url in comics if COMIC_PATTERN.match(url)]


def download_comic(url):
    print("Downloading: {}".format(url))
    url = grab_image_src_url(url)
    download_and_write_image(url)


def main():
    print(LOGO)

    comics = fetch_comics_from_archive()
    comics_to_download = download_comics_menu(comics)

    try:
        os.mkdir(DEFAULT_DIR_NAME)
    except OSError as exc:
        sys.exit("Failed to create directory (errno {})".format(exc.errno))

    start = time.time()
    for url in comics[:comics_to_download]:
        thread = threading.Thread(target=download_comic, args=(url,))
        thread.start()
    thread.join()

    end = time.time()
    print("Successfully downloaded {} comics in {:.2f} seconds.".format(
        comics_to_download, end - start)
    )

if __name__ == "__main__":
    main()

Results

  • Non-threaded:

    The scraper has found 957 comics.
    How many comics do you want to download?
    Type 0 to exit.
    >> 6
    Downloading: http://www.poorlydrawnlines.com/comic/new-phone/
    Downloading: http://www.poorlydrawnlines.com/comic/new-things/
    Downloading: http://www.poorlydrawnlines.com/comic/return-to-nature/
    Downloading: http://www.poorlydrawnlines.com/comic/phone/
    Downloading: http://www.poorlydrawnlines.com/comic/stars/
    Downloading: http://www.poorlydrawnlines.com/comic/big-dreams/
    Successfully downloaded 6 comics in 37.13 seconds.
    
  • Threaded:

    The scraper has found 957 comics.
    How many comics do you want to download?
    Type 0 to exit.
    >> 6
    Downloading: http://www.poorlydrawnlines.com/comic/new-phone/
    Downloading: http://www.poorlydrawnlines.com/comic/new-things/
    Downloading: http://www.poorlydrawnlines.com/comic/return-to-nature/
    Downloading: http://www.poorlydrawnlines.com/comic/phone/
    Downloading: http://www.poorlydrawnlines.com/comic/stars/
    Downloading: http://www.poorlydrawnlines.com/comic/big-dreams/
    Successfully downloaded 6 comics in 7.07 seconds.
    

Alternatively: using concurrent.futures.ThreadPoolExecutor

If you run the rewritten code, you may notice the program takes a couple of seconds to shut down after (reportedly) having downloaded all images. This is because the last thread started doesn't necessarily end last (that's the whole point of threads, after all!). To avoid this, and to get rid of some of the boilerplate code, we can use ThreadPoolExecutor.map() and ThreadPoolExecutor.shutdown().

I've created a GitHub gist here that uses ThreadPoolExecutor together with requests.Session; reusing the underlying TCP connection can potentially result in even better performance.
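For illustration, the main() from the rewrite above might look roughly like this with a thread pool (a sketch only: the worker count is arbitrary, and the requests.Session handling from the gist is omitted here):

from concurrent.futures import ThreadPoolExecutor

def main():
    print(LOGO)

    comics = fetch_comics_from_archive()
    comics_to_download = download_comics_menu(comics)

    try:
        os.mkdir(DEFAULT_DIR_NAME)
    except OSError as exc:
        sys.exit("Failed to create directory (errno {})".format(exc.errno))

    start = time.time()
    # The `with` block calls `shutdown(wait=True)` on exit, so it only
    # ends once every submitted download has finished; the timing below
    # therefore covers all downloads.
    with ThreadPoolExecutor(max_workers=8) as executor:
        executor.map(download_comic, comics[:comics_to_download])
    end = time.time()

    print("Successfully downloaded {} comics in {:.2f} seconds.".format(
        comics_to_download, end - start))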

  • Also, since the OP is using requests, why not make use of sessions from requests-html (no bs4 needed)?
    – hjpotter92
    Commented Mar 17, 2018 at 12:41
  • @hjpotter92 I've implemented requests.Session (without requests-html).
    – Daniel
    Commented Mar 17, 2018 at 17:57
  • @Coal_ This is a fantastically thorough review that helped me improve my code in all sorts of ways. I'll keep all your suggestions in mind for future projects, and there will be more of those, because your review has greatly encouraged me to learn even more Python (I've been dabbling in it for about two months now). Thank you for your time and effort. I truly appreciate it.
    – baduker
    Commented Mar 18, 2018 at 10:46

I'm adding a few points to what @Coal_ wrote in his excellent answer.

  • You can use lxml instead of html.parser, as it is faster (stated in the docs). Change it everywhere you've used html.parser.

  • The nested for loops in grab_image_src_url seem redundant. You can use the following:

    def grab_image_src_url(url):
        response = requests.get(url)
        soup = bs(response.text, "lxml")
        return soup.find("div", class_="post").find("img")["src"]
    
  • The fetch_comics_from_archive() function can be further optimized. Currently, it uses one list comprehension to find all the URLs and another to filter them using a regex. It can be done in a single list comprehension, using a CSS selector with a partial match, without any regex. You can change the function to:

    def fetch_comics_from_archive():
        response = requests.get(ARCHIVE_URL)
        soup = bs(response.text, "lxml")
        return [url.get("href") for url in soup.select('a[href^="http://www.poorlydrawnlines.com/comic/"]')]
    

    Or, the same can be done without the use of partial match (^). All the required links are located inside the div tag with class="content page". So, the CSS selector would be div[class="content page"] a. Or, even a shorter selector like ".content.page a" would work, as there are no other tags with class="content page".

    def fetch_comics_from_archive():
        response = requests.get(ARCHIVE_URL)
        soup = bs(response.text, "lxml")
        return [url["href"] for url in soup.select(".content.page a")]
    