Web Scraping with Scrapy
Virendra Rajput
Hacker @Markitty
Agenda
●   What is web scraping and why it's fun
●   My experiments with web scraping
●   Getting started with Scrapy
●   How Scrapy works and a quick Demo
●   Why Scrapy
●   Questions
What is Web Scraping?
● Extracting information from websites
● Problem:
  ○ Static websites
  ○ No access to APIs to extract the data you
     need
  ○ Need to extract data periodically
● Manual solution - go to the website and copy
  the required data
● Smarter solution: Web Scraping
My Experiments with Scraping
Web Scraping in Python
● Download the webpage with urllib2 (urllib.request in Python 3) or requests
● Parse the page with BeautifulSoup or lxml
● Select elements with XPath or CSS selectors
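The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the approach shown in the talk: `urllib.request` stands in for urllib2/requests, and `xml.etree.ElementTree` (which supports only a small XPath subset and needs well-formed markup) stands in for BeautifulSoup/lxml. The names `download`, `extract_links`, and `SAMPLE_PAGE` are hypothetical.

```python
from urllib.request import urlopen
import xml.etree.ElementTree as ET


def download(url):
    """Step 1: download the raw page (not called below, to avoid network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def extract_links(markup):
    """Steps 2 and 3: parse the markup, then select nodes with an XPath expression."""
    root = ET.fromstring(markup)
    # Select every <a> element whose class attribute is exactly "title".
    return [(a.text, a.get("href")) for a in root.findall(".//a[@class='title']")]


# Hypothetical, well-formed sample page used instead of a live download.
SAMPLE_PAGE = """<html><body>
<a class="title" href="/post/1">First post</a>
<a class="nav" href="/about">About</a>
<a class="title" href="/post/2">Second post</a>
</body></html>"""

links = extract_links(SAMPLE_PAGE)
print(links)
```

Real-world HTML is rarely well-formed, which is why a tolerant parser such as BeautifulSoup or lxml is normally used for step 2.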
Scrapy - a fast, high-level screen scraping
and web crawling framework
●   Pick a website
●   Define the data you want to scrape
●   Write the spider to extract the data
●   Run the spider
●   Store the Data
Demo
Getting started with Scrapy in Python
Why Scrapy
●   Simplicity
●   Fast
●   Productive / Extensible
●   Portable
●   Well documented & healthy community
●   Commercial Support
Advanced Features (built in)
● Interactive shell for trying XPaths (useful for
  debugging)
● Selecting and extracting data from HTML
  sources
● Cleaning and sanitizing the scraped data
● Generating feed exports (JSON, CSV)
● Media pipeline for downloading files
● Middlewares for cookies, HTTP compression,
  cache, user-agent spoofing, etc.
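As an illustration of the "cleaning and sanitizing" feature, Scrapy passes every scraped item through item pipelines, which are plain classes exposing a `process_item(item, spider)` method. The pipeline below is a minimal sketch: the class name and the whitespace-normalizing rule are assumptions, and items are assumed to be plain dicts.

```python
class CleanTextPipeline:
    """Hypothetical pipeline that collapses stray whitespace in string fields."""

    def process_item(self, item, spider):
        # Copy the pairs first so the item is not mutated while being iterated.
        for key, value in list(item.items()):
            if isinstance(value, str):
                # split() drops leading/trailing whitespace and newlines,
                # join() puts single spaces back between the words.
                item[key] = " ".join(value.split())
        return item


# In a real project the pipeline would be enabled in settings.py, e.g.
# ITEM_PIPELINES = {"myproject.pipelines.CleanTextPipeline": 300}
# where "myproject.pipelines" is a hypothetical module path.
cleaned = CleanTextPipeline().process_item(
    {"title": "  Web \n Scraping  ", "votes": 3}, spider=None
)
print(cleaned)
```

The same debugging loop applies to selectors: `scrapy shell <url>` opens the interactive shell mentioned above, where expressions such as `response.xpath(...)` can be tried before they go into a spider.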
Questions?
