
I built a web scraper to collect all the historical data from CoinMarketCap.com, a site that records historical data for cryptocurrencies but unfortunately doesn't offer a way to download all of it in one convenient file.

It works, but it's somewhat slow and could probably be designed better. I'm self-taught, so I'd love some advice on how to improve the code's functionality, speed, and readability. Thanks a lot in advance.

def scrape_coin_marketcap():
    print('Loading Packages...')
    from bs4 import BeautifulSoup
    import requests
    import pandas as pd

    print('Scraping links...')
    #Download HTML of links to historical CoinMarketCap data
    url = 'https://coinmarketcap.com/historical/'
    r  = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data,'html.parser')

    #scrape a list of links to historical data
    raw_links = []
    for link in soup.find_all('a'):
        raw_links.append(link.get('href'))

    #Remove non-historical links
    historical_links = []
    for link in raw_links:
        if "201" in str(link):
            historical_links.append('https://coinmarketcap.com' + link)

    print('Scraping Data....')

    #Scrape historical data from each time period
    master_df = pd.DataFrame()
    num_links = len(historical_links)
    print(str(num_links) + " dates to be scraped...")
    for link in historical_links:
        try:
            res = requests.get(link)
            soup = BeautifulSoup(res.content, 'lxml')
            table = soup.find_all('table')[0]
            df = pd.read_html(str(table))[0]
            date = str(soup.find_all('h1')[0])[51:-5]
            df['date'] = date
            master_df = master_df.append(df)
            print("    Scraping: " + str(date))
        except:
            print("    ERROR Scraping: " + str(link))

    print('Saving to disk...')
    master_df.to_csv('CoinMarketCap.csv', index = False)
    print("Completed.")

if __name__ == "__main__":
    scrape_coin_marketcap()

3 Answers


Welcome to Code Review SE, nice first question! Here are some of my comments, not pertaining to performance:

  • There's no reason to import modules in the body of scrape_coin_marketcap(). Just do it at the top of the source code file. If you need to be sure the modules are present at all, you can catch ImportError:

    import sys
    
    try:
        from bs4 import BeautifulSoup
        import requests
        import pandas as pd
    except ImportError:
        sys.exit("This script requires Beautiful Soup 4, Requests and Pandas.")
    

    Note: I'm using sys.exit() here instead of print(), because I'm assuming this is a standalone script and an ImportError should therefore be fatal.

  • The URL for the webpage is unlikely to change in the future and isn't supposed to be modified at runtime, so you should define it as a constant. Constants use the UPPERCASE_WITH_UNDERSCORES naming convention:

    HISTORICAL_DATA_URL = "https://coinmarketcap.com/historical/"
    
  • This comment:

    #Download HTML of links to historical CoinMarketCap data
    

    ... is pretty pointless. Anyone familiar with Python doesn't need that comment to understand what the snippet of code below it does. Only add comments to difficult-to-comprehend code, or to snippets that need extra attention for whatever reason.

  • The following pattern:

    my_list = []
    for x in some_iterable:
        my_list.append(some_function(x))
    

    ... is a good candidate for a list comprehension:

    my_list = [some_function(x) for x in some_iterable]
    

    ... so in your case:

    raw_links = [link.get("href") for link in soup.find_all("a")]
    

    In fact, the part where you remove non-historical links could benefit from a list comprehension too:

    historical_links = [
        "https://coinmarketcap.com" + link for link in raw_links
        if "201" in str(link)
    ]
    
  • Don't mix single and double quotes. Either choice is valid, but mixing them is a bad idea. As an exception, you are encouraged to use single quotes if you need to output literal double quotes, and vice versa.
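
    For example (an illustrative sketch; the strings are made up, not taken from the question):

    message = "Scraping finished."            # pick one style and stick to it
    html = '<a href="/historical/">2018</a>'  # single quotes to embed literal double quotes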

  • Unless you need access to the result of r.text again later, don't bind it to a variable; pass it along directly.
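
    Applied to the question's code, that means passing the response body straight through:

    soup = BeautifulSoup(requests.get(url).text, "html.parser")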

  • Debug / error messages should not be sent to stdout, but to stderr. You can do that by explicitly passing sys.stderr as a parameter to print():

    import sys
    
    ...
    
    print("Loading Packages...", file=sys.stderr)
    
  • Thanks a lot for the excellent advice! Just out of curiosity, what is the issue with mixing double quotes and single quotes? Is there an explicit reason, or is it just stylistic? – TheSaint321, May 6, 2018 at 22:19
  • @TheSaint321 No, it's almost entirely a stylistic issue. It might lead to some confusion, because mixing single and double quotes is illegal in lots of other languages (most famously C-derived languages). – Daniel, May 6, 2018 at 22:22
  • Got it, thanks! Another question: why use stderr versus stdout? I put the print statements in both for debugging and so that the user can see how far along the program is, since it takes a few minutes to run on a normal computer. Is this stylistic, since it is not the actual output, or is there a deeper reason? – TheSaint321, May 6, 2018 at 22:28
  • @TheSaint321 Yes and no. It's simply common practice, but it's a good idea to adopt it. There are multiple examples of how this could be useful: for one, it allows people using your script to 'disable' debugging by only reading from stdout. Or maybe the caller wants to redirect stdout to another file, without the debug messages. It's all just convention, but it standardizes program behavior. – Daniel, May 6, 2018 at 22:38

A couple of other things which, I think, are worth considering:

  • instead of calling requests.get() for every link, initialize a web-scraping session and reuse it; this improves performance, as sketched below:

    if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase
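
    A minimal sketch of that change, assuming the historical_links list built earlier in the question:

    import requests

    with requests.Session() as session:
        for link in historical_links:
            res = session.get(link)  # the underlying TCP connection is reused here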

  • you don't need to call BeautifulSoup, look for a table and pass it to Pandas; you can call .read_html() directly and pass the HTML string into it:

    df = pd.read_html(res.text)[0]
    
  • on the other hand, you may go for another performance trick: SoupStrainer, which instructs BeautifulSoup to parse only what is needed (the table, in your case):

    parse_only = SoupStrainer("table")
    with requests.Session() as session:
        for link in historical_links:
            res = session.get(link)
            soup = BeautifulSoup(res.text, "lxml", parse_only=parse_only)

            df = pd.read_html(str(soup.table))[0]
    

    don't forget to import SoupStrainer:

    from bs4 import BeautifulSoup, SoupStrainer
    

All that said, though, the main bottleneck is of course the synchronous nature of the script: you process the links sequentially, in a blocking manner, and don't start on the next link until the current one is finished. Look into asynchronous approaches if performance is a concern (Scrapy is the first thing that comes to mind).
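
Short of switching frameworks, even a thread pool from the standard library can overlap the downloads. A rough sketch, assuming the historical_links list from the question (fetch_table is a made-up helper name):

    from concurrent.futures import ThreadPoolExecutor

    import pandas as pd
    import requests

    def fetch_table(link):
        # Download one historical page and parse its first table.
        res = requests.get(link)
        res.raise_for_status()
        return pd.read_html(res.text)[0]

    # Threads overlap the network waits; eight workers is an arbitrary choice.
    with ThreadPoolExecutor(max_workers=8) as executor:
        frames = list(executor.map(fetch_table, historical_links))

    master_df = pd.concat(frames, ignore_index=True)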

  • These are fantastic suggestions. It is definitely clear that I need to study the functionality of Beautiful Soup and Requests further. Thank you. – May 7, 2018 at 11:24

Imports

You should define your imports at the top of your file, not in the function. I know you only have one function, but if that weren't the case, it would get complicated to track where each import comes from.

Exceptions

Depending on how "normal" it is to fail to scrape a link, you might want to stop the process there and investigate. So in your except you could use raise, which keeps the stack trace intact.
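
For example, a sketch of the pattern using the names from the question's loop (with the bare except narrowed to Exception):

    try:
        res = requests.get(link)
        # ... parse and append as before ...
    except Exception:
        print("    ERROR Scraping: " + link)
        raise  # re-raises the original exception, so the stack trace stays intact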

Apart from these two little details, I don't see many problems in your code.

  • Great advice, thank you very much! For the raise statement, are you just saying to have it output the exception type so that debugging is easier, or to program something to handle the exception? I added that exception handler in case they decide to change the format of one of the historical pages, so that it wouldn't crash and lose the rest of the scrape. – May 6, 2018 at 22:08
  • Well, I was raising a concern: is the result still valuable if part of the information is not scraped? Sometimes exceptions need to be raised. – IEatBagels, May 6, 2018 at 23:19
