33
\$\begingroup\$

This came out on the spur of the moment, as a quick and dirty tool to get the job done.

This is a simple image scraper for the immensely popular and legendary comic website xkcd, inspired by a Python Easter egg.

For those who don't know it, open your Python interpreter, type import antigravity, and hit Enter. :)

As for the code below, I'd appreciate any feedback, particularly in regards to threading, as I'm new to this.

#!/usr/bin/python3

import os
import sys
import time
import threading
from pathlib import Path
from shutil import copyfileobj


import requests
from lxml import html


BASE_URL = "https://www.xkcd.com/"
ARCHIVE = "https://www.xkcd.com/archive"
SAVE_DIRECTORY = Path('xkcd_comics')
LOGO = """
       _           _                      
 tiny | |  image  | | downloader for
 __  _| | _____ __| |  ___ ___  _ __ ___  
 \ \/ / |/ / __/ _` | / __/ _ \| '_ ` _ \ 
  >  <|   < (_| (_| || (_| (_) | | | | | |
 /_/\_\_|\_\___\__,_(_)___\___/|_| |_| |_|
 version 0.1
"""


def show_logo():
    print(LOGO)


def fetch_url(url: str) -> requests.Response:
    return requests.get(url)


def head_option(values: list) -> str:
    return next(iter(values), None)


def get_penultimate(url: str) -> int:
    page = fetch_url(url)
    tree = html.fromstring(page.content)
    newest_comic = head_option(
        tree.xpath('//*[@id="middleContainer"]/a[1]/@href'))
    return int(newest_comic.replace("/", ""))


def get_images_from_page(url: str) -> str:
    page = fetch_url(url)
    tree = html.fromstring(page.content)
    return head_option(tree.xpath('//*[@id="comic"]//img/@src'))


def get_number_of_pages(latest_comic: int) -> int:
    print(f"There are {latest_comic} comics.")
    print(f"How many do you want to download? Type 0 to exit.")
    while True:
        try:
            number_of_comics = int(input(">> "))
        except ValueError:
            print("Error: Expected a number. Try again.")
            continue
        if number_of_comics > latest_comic or number_of_comics < 0:
            print("Error: Incorrect number of comics. Try again.")
            continue
        elif number_of_comics == 0:
            sys.exit()
        return number_of_comics


def clip_url(img: str) -> str:
    return img.rpartition("/")[-1]


def make_dir():
    return os.makedirs(SAVE_DIRECTORY, exist_ok=True)


def save_image(img: str):
    comic_name = clip_url(img)
    print(f"Downloading: {comic_name}")
    f_name = SAVE_DIRECTORY / comic_name
    with requests.get("https:" + img, stream=True) as img, open(f_name, "wb") \
            as output:
        copyfileobj(img.raw, output)


def show_time(seconds: int) -> int:
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    time_elapsed = f"{hours:02d}:{minutes:02d}:{seconds:02d}" 
    return time_elapsed


def get_xkcd():
    show_logo()
    make_dir()

    collect_garbage = []
    latest_comic = get_penultimate(ARCHIVE)
    pages = get_number_of_pages(latest_comic)

    start = time.time()
    for page in reversed(range(latest_comic - pages + 1, latest_comic + 1)):
        print(f"Fetching page {page} out of {latest_comic}")
        try:
            url = get_images_from_page(f"{BASE_URL}{page}/")
            thread = threading.Thread(target=save_image, args=(url, ))
            thread.start()
        except (ValueError, AttributeError, requests.exceptions.MissingSchema):
            print(f"WARNING: Invalid comic image source url.")
            collect_garbage.append(f"{BASE_URL}{page}")
            continue
    thread.join()
    end = time.time()

    print(f"Downloaded {pages} comic(s) in {show_time(int(end - start))}.")

    if len(collect_garbage) > 0:
        print("However, was unable to download images for these pages:")
        print("\n".join(page for page in collect_garbage))


def main():
    get_xkcd()


if __name__ == '__main__':
    main()
\$\endgroup\$
10
  • 17
    \$\begingroup\$ I have rolled back your last edit. Please do not update the code in your question to incorporate feedback from answers, doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. \$\endgroup\$
    – Heslacher
    Commented Sep 2, 2019 at 8:57
  • 2
    \$\begingroup\$ While your intentions behind removing part of the code are good, it does invalidate existing answers. We can't have that. \$\endgroup\$
    – Mast
    Commented Sep 2, 2019 at 8:58
  • 5
    \$\begingroup\$ @Thomas there's an API there? If so, that's a game changer! \$\endgroup\$
    – baduker
    Commented Sep 2, 2019 at 9:51
  • 25
    \$\begingroup\$ xkcd.com/json.html \$\endgroup\$
    – Thomas
    Commented Sep 2, 2019 at 10:05
  • 3
    \$\begingroup\$ Relevant xkcd \$\endgroup\$
    – BruceWayne
    Commented Sep 2, 2019 at 21:05

7 Answers

88
\$\begingroup\$

Stop hammering XKCD server

Just in case it wasn't clear: don't do that.

I suppose the load on XKCD's servers is high enough that your code won't make any noticeable difference. Still, the primary goal of your code should be to stay under the radar: seen from the outside, there should be no difference between launching your script and casually browsing https://xkcd.com/.

It isn't just a courtesy to fellow developers: you might get banned by the remote servers if you send too many requests too fast.

Solutions

  • Wait at least one second before downloading each picture. The desired pause might be defined as Crawl-delay in robots.txt.
  • Remove the multi-threading, or at least limit the number of threads.
  • Don't download a file again if it is already in SAVE_DIRECTORY. You can check whether it is already there, whether its size is larger than 0, or whether it is indeed a valid PNG file (see the sketch below).
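
A minimal sketch of the last two points, assuming the question's SAVE_DIRECTORY layout; the polite_download helper and the CRAWL_DELAY value are made up for illustration, and the real delay should come from robots.txt:

import time
from pathlib import Path

import requests

SAVE_DIRECTORY = Path("xkcd_comics")
CRAWL_DELAY = 1  # seconds between requests; raise it if robots.txt asks for more


def polite_download(img_url: str) -> None:
    """Fetch one comic image, skipping files that are already on disk."""
    SAVE_DIRECTORY.mkdir(exist_ok=True)
    target = SAVE_DIRECTORY / img_url.rpartition("/")[-1]
    if target.exists() and target.stat().st_size > 0:
        return  # already downloaded, no need to hit the server again
    response = requests.get(img_url, timeout=10)
    response.raise_for_status()
    target.write_bytes(response.content)
    time.sleep(CRAWL_DELAY)  # politely space out the requests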

Unrelated note

You know the id of each comic. You should probably include it in the filename: geologic_time.png could be called 2187_geologic_time.png or 02187_geologic_time.png.

This way, comics are sorted chronologically inside SAVE_DIRECTORY.
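
A sketch of how such a name could be built, assuming the comic number is available at the point where the file is saved (the numbered_filename helper is hypothetical, not part of the question's code):

def numbered_filename(comic_id: int, img_url: str) -> str:
    """Prefix the image name with a zero-padded comic id so files sort chronologically."""
    return f"{comic_id:05d}_{img_url.rpartition('/')[-1]}"


# numbered_filename(2187, "https://imgs.xkcd.com/comics/geologic_time.png")
# -> '02187_geologic_time.png'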

\$\endgroup\$
14
  • 9
    \$\begingroup\$ I was completely unaware of this and had no intention whatsoever of causing, or inviting others to perform, a DoS and/or DDoS. I'll remove the threading part from the code. Actually, now that I've accidentally written one, I fully understand what a DoS is and how dangerous it can be. O.o \$\endgroup\$
    – baduker
    Commented Sep 2, 2019 at 8:56
  • 27
    \$\begingroup\$ @baduker: It shouldn't be a problem because XKCD is so popular. Calling your code a DDOS is a bit of a stretch but I wanted to make my point very clear. When scraping information, the goal should be to stay under the radar. \$\endgroup\$ Commented Sep 2, 2019 at 9:19
  • 3
    \$\begingroup\$ I appreciate your feedback, since I'm still rather new to the scraping realm. Either way, that's a valuable lesson for the future. \$\endgroup\$
    – baduker
    Commented Sep 2, 2019 at 9:24
  • 11
    \$\begingroup\$ @baduker This is even for the benefit of your crawler: the first line of defense against crawlers/bots is request speed. I ban bots on my website if they make more than 5 requests in 10 seconds, and I tell them that in my robots.txt file with Crawl-delay: 2000 \$\endgroup\$ Commented Sep 2, 2019 at 9:26
  • 3
    \$\begingroup\$ @Accountantم that sounds like another answer. I'd love to see the code for pinging the robots.txt and querying relevant parts (such as respecting no-robots, crawl delay, etc). \$\endgroup\$ Commented Sep 2, 2019 at 14:04
15
\$\begingroup\$

Globals

As it is, there would be an advantage to LOGO being a local instead of a global - it's only used by show_logo, and moving it there would clean up the global namespace.

That said (and as others have pointed out), it's fairly common to see stuff like this at the top of Python files in global scope. However, the larger issue is that if you move it to local scope, you have to get clever with indentation. There are no great solutions to this - either you de-indent all but the first line, which is ugly; or you post-process the string to strip the extra indentation from each line (see the sketch below). So this one is kind of a wash.
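
For what it's worth, textwrap.dedent from the standard library does that post-processing: it strips the indentation common to all lines. A sketch with a simplified logo (not the full ASCII art from the question):

from textwrap import dedent


def show_logo() -> None:
    # dedent() removes the leading whitespace shared by all lines, so the
    # literal can be indented along with the function body.
    logo = dedent("""\
        tiny xkcd downloader
        version 0.1
        """)
    print(logo)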

Base URLs

You correctly saved a base URL, but then didn't use it in the correct contexts. Particularly problematic:

ARCHIVE = "https://www.xkcd.com/archive"

This ignores the BASE_URL entirely, when it shouldn't.

fetch_url is currently useless - it doesn't add anything to requests.get. You could make it useful by making the argument a path relative to the base path.

with requests.get("https:" + img
# ...
url = get_images_from_page(f"{BASE_URL}{page}/")

Naive string concatenation is not the right thing to do here. Python has a full-featured urllib package to deal with URL parsing and construction.
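
A sketch of what that could look like with urllib.parse.urljoin (the values shown are the question's constants):

from urllib.parse import urljoin

BASE_URL = "https://www.xkcd.com/"

archive_url = urljoin(BASE_URL, "archive")  # 'https://www.xkcd.com/archive'
page_url = urljoin(BASE_URL, "2187/")       # 'https://www.xkcd.com/2187/'

# urljoin also understands the protocol-relative '//imgs.xkcd.com/...' sources
# returned by the comic pages, keeping the scheme of the base URL:
img_url = urljoin(BASE_URL, "//imgs.xkcd.com/comics/geologic_time.png")
# -> 'https://imgs.xkcd.com/comics/geologic_time.png'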

show_time

divmod on a numeric time interval is not the right thing to do. Use datetime.timedelta.
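
A sketch of what show_time could look like with timedelta (str() on a timedelta renders as H:MM:SS):

from datetime import timedelta


def show_time(seconds: int) -> str:
    # str(timedelta) gives 'H:MM:SS', or 'N days, H:MM:SS' for very long runs
    return str(timedelta(seconds=seconds))


# show_time(3725) -> '1:02:05'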

\$\endgroup\$
4
  • 8
    \$\begingroup\$ What's wrong with listing hard-coded values as constants at the beginning of a module? It looks like common practice to me. \$\endgroup\$
    – Vincent
    Commented Sep 2, 2019 at 8:15
  • \$\begingroup\$ @Vincent Sure; though the larger problem is indentation. I've added some commentary to what's a more nuanced issue than I first indicated. \$\endgroup\$
    – Reinderien
    Commented Sep 2, 2019 at 13:08
    \$\begingroup\$ @Vincent That's common, but it doesn't mean it's best. From my point of view, when I see a whole bunch of constants declared at the beginning of a module, and those constants' names imply that they might change depending on the situation (the base URL might change in the future), I think they belong in a config file instead. Yes, in this case it might be overkill, but that's the idea. \$\endgroup\$
    – ChatterOne
    Commented Sep 2, 2019 at 14:03
  • 1
    \$\begingroup\$ @ChatterOne While a config file is fine too, using one makes you import configparser and handle paths just for the sake of having a config file. It makes no sense to do that unless the constants are meant to be pluggable, rather than just living in one place in the file so that you don't need to rewrite strings everywhere if the script suddenly stops working. \$\endgroup\$ Commented Sep 3, 2019 at 8:27
11
\$\begingroup\$
for page in reversed(range(latest_comic - pages + 1, latest_comic + 1)):
    print(f"Fetching page {page} out of {latest_comic}")
    try:
        url = get_images_from_page(f"{BASE_URL}{page}/")
        thread = threading.Thread(target=save_image, args=(url, ))
        thread.start()
    except (ValueError, AttributeError, requests.exceptions.MissingSchema):
        print(f"WARNING: Invalid comic image source url.")
        collect_garbage.append(f"{BASE_URL}{page}")
        continue
thread.join()

Here you create several threads that download the pages. There are at least two problems with this code:

  1. You create a number of threads, but only join() the last one you created. There is no guarantee that all the other threads have finished before that one does. Maintain a list of your threads and join each of them (see the sketch after this list).
  2. No rate limiting. If you try to download 100 pages, it will try to fetch all 100 simultaneously. That is not a good idea. Only create a limited number of threads at a time.
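
A minimal sketch of the first point, with a stand-in download function instead of the question's save_image:

import threading


def download(url: str) -> None:
    ...  # stand-in for the question's save_image()


urls = [f"https://www.xkcd.com/{page}/" for page in range(2180, 2188)]

threads = []
for url in urls:
    thread = threading.Thread(target=download, args=(url,))
    thread.start()
    threads.append(thread)

# Join *every* worker, not just the last one that was started.
for thread in threads:
    thread.join()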
\$\endgroup\$
3
    \$\begingroup\$ As far as threading is concerned, which is what the OP is most interested in, I wonder why they decided to accept the other answer when this one is much more valuable. \$\endgroup\$
    – t3chb0t
    Commented Sep 4, 2019 at 10:21
  • 1
    \$\begingroup\$ @t3chb0t To be fair, the accepted answer explained the issue with #2 better than me. \$\endgroup\$
    – kutschkem
    Commented Sep 4, 2019 at 11:06
    \$\begingroup\$ @t3chb0t: You're right, this answer (and every other one here) contains valuable information for the OP. It's hard to pick an accepted answer in this case, and the choice must have been somewhat arbitrary. \$\endgroup\$ Commented Sep 4, 2019 at 14:58
11
\$\begingroup\$

As mentioned in a comment by Thomas, another option is to use XKCD's JSON interface rather than scraping HTML:

import requests, time, tempfile, os.path
from shutil import copyfileobj

path = tempfile.mkdtemp()
print(path)
f_name = "{num:04d}-{safe_title}.{ext}"
current_comic = requests.get("https://xkcd.com/info.0.json").json()

# Iterate over comic numbers from 1 up to (but not including) the current comic
for n in range(1, current_comic["num"]):
    comic_req = requests.get("https://xkcd.com/{}/info.0.json".format(n))
    # only proceed on a 2xx (success) status code
    if 200 <= comic_req.status_code <= 299:
        comic = comic_req.json()
        comic["ext"] = comic["img"][-3:]
        fn = f_name.format(**comic)

        img = requests.get(comic["img"], stream=True)
        with open(os.path.join(path, fn), "wb") as output:
            copyfileobj(img.raw, output)
        img.close()

        print("Saved {}".format(os.path.join(path, fn)))
\$\endgroup\$
8
\$\begingroup\$

As for the code below, I'd appreciate any feedback, particularly in regards to threading

You might want to use a ThreadPoolExecutor to manage your threads. This approach has two advantages:

  • The executor can be used as a context manager to make sure all the threads are joined.
  • It lets you limit the number of threads in the thread pool.

Example:

    with ThreadPoolExecutor(max_workers=8) as executor:
        for page in reversed(range(latest_comic - pages + 1, latest_comic + 1)):
            print(f"Fetching page {page} out of {latest_comic}")
            try:
                url = get_images_from_page(f"{BASE_URL}{page}/")
                executor.submit(save_image, url)
            except (ValueError, AttributeError, requests.exceptions.MissingSchema):
                print(f"WARNING: Invalid comic image source url.")
                collect_garbage.append(f"{BASE_URL}{page}")
\$\endgroup\$
8
\$\begingroup\$

Not a problem in terms of functionality (seems to have been covered pretty well already) but in terms of clarity/readability:

Misleading name get_penultimate

It appears to functionally be get_last (get_ultimate, if you will), so there is a clear mismatch between this name and the implemented functionality. The functionality makes sense, so the name appears to be wrong.
To explicitly point out the mismatch: penultimate means second to last.

\$\endgroup\$
0
\$\begingroup\$

Why have main() be a single line calling a single function? Either rename main() to get_xkcd() or just call get_xkcd().

Also, maybe use BeautifulSoup instead of lxml.
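
If you do switch, a sketch of the equivalent lookup with BeautifulSoup (the CSS selector mirrors the question's XPath; html.parser is the built-in backend):

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.xkcd.com/2187/")
soup = BeautifulSoup(page.content, "html.parser")

comic_img = soup.select_one("#comic img")          # same element the XPath targeted
img_src = comic_img["src"] if comic_img else None  # e.g. '//imgs.xkcd.com/comics/...'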

\$\endgroup\$
4
  • 1
    \$\begingroup\$ Without the main() function, the code would get executed even if the script was imported as a module. Also, the function makes it possible to run that code with: import module; module.main(). If the code was just in the if block, it couldn't be run from elsewhere. \$\endgroup\$
    – baduker
    Commented Sep 4, 2019 at 16:41
  • 4
    \$\begingroup\$ @baduker Tbf, I don't think that Kurt is suggesting that any code should be moved out of a function. To my understanding, their suggestion is rather to either rename get_xkcd to main or to call get_xkcd directly in the if __name__ == '__main__' block. \$\endgroup\$ Commented Sep 4, 2019 at 19:51
  • \$\begingroup\$ @HåkanLindqvist That's exactly what I was suggesting. Thanks for the downvote baduker... \$\endgroup\$
    – Kurt
    Commented Sep 4, 2019 at 22:25
    \$\begingroup\$ lxml vs. BeautifulSoup: could you elaborate on that in your answer? \$\endgroup\$
    – zardilior
    Commented Sep 24, 2019 at 21:46
