Scrapy workshop
- 2. Karthik Ananth
Who am I?
! Leading professional services
@ Scrapinghub
! I have a vision to synergise data
generation and analytics
! Open source promoter
- 4. What is Web Scraping?
The main goal in scraping is to
extract structured data from
unstructured sources, typically
web pages.
- 5. What for?
! Monitor prices
! Lead generation
! Aggregate information
! Your imagination is the limit
- 6. Do you speak HTTP?
Methods: GET, POST, PUT, HEAD…
Status codes: 2XX, 3XX, 4XX, 418, 5XX, 999
Headers, query string: Accept-Language, UA*…
Persistence: Cookies
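The pieces above can be sketched with the Python standard library; the URL, query parameters, and User-Agent string here are illustrative:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Build a GET request with a query string and custom headers
# (the URL and parameter names are made up for illustration)
params = urlencode({'q': 'scrapy', 'page': 2})
req = Request(
    'http://example.com/search?' + params,
    headers={'Accept-Language': 'en', 'User-Agent': 'my-crawler'},
    method='GET',
)
print(req.full_url)      # http://example.com/search?q=scrapy&page=2
print(req.get_method())  # GET
```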
- 10. lxml: pythonic binding for the C libraries libxml2
and libxslt
beautifulsoup: wraps a choice of backends (html.parser, lxml, html5lib)
HTML parsers
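As a minimal sketch of the lowest-level option, the stdlib html.parser backend can be driven directly; the `LinkExtractor` class and sample markup are made up for illustration:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.extend(value for name, value in attrs if name == 'href')

parser = LinkExtractor()
parser.feed('<p><a href="/one">one</a> <a href="/two">two</a></p>')
print(parser.links)  # ['/one', '/two']
```

Event-driven parsers like this are fast but low-level; lxml and beautifulsoup build a queryable tree on top, which is what makes XPath/CSS extraction convenient.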
- 11. import requests
import lxml.html

req = requests.get('http://nyc2015.pydata.org/schedule/')
tree = lxml.html.fromstring(req.text)
for span in tree.xpath('//span[@class="speaker"]'):
    name = span.xpath('text()')
    url = span.xpath('@href')
    print(name)
    print(url)
Show me the code!
- 14. “An open source and collaborative framework for
extracting the data you need from websites. In a
fast, simple, yet extensible way.”
- 16. $ scrapy shell <url>
An interactive shell console
Invaluable tool for developing and debugging your spiders
- 17. An interactive shell console
>>> response.url
'http://example.com'
>>> response.xpath('//h1/text()')
[<Selector xpath='//h1/text()' data=u'Example Domain'>]
>>> view(response) # open in browser
>>> fetch('http://www.google.com') # fetch other URL
- 18. $ scrapy startproject <name>
pydata
├── pydata
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
Starting a project
- 20. import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/',
    ]

    def parse(self, response):
        msg = 'A response from %s just arrived!' % response.url
        self.logger.info(msg)
What is a Spider?
- 21. import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3/text()').extract():
            yield {'title': h3}
        for url in response.xpath('//a/@href').extract():
            # urljoin resolves relative hrefs against the page URL
            yield scrapy.Request(response.urljoin(url), callback=self.parse)
What is a Spider? 1.0
- 23. Avoid getting banned
! Rotate your User Agent
! Disable cookies
! Randomize download delays
! Use a pool of rotating IPs
! Crawlera
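Most of these points map directly onto Scrapy settings; a minimal settings.py sketch (the delay and UA values are illustrative, and UA/IP rotation via middleware or Crawlera is not shown):

```python
# settings.py fragment matching the points above (values are illustrative)
COOKIES_ENABLED = False           # disable cookies
DOWNLOAD_DELAY = 2.0              # base delay between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True   # wait 0.5x to 1.5x of DOWNLOAD_DELAY
USER_AGENT = 'my-crawler (+http://www.example.com)'  # identify (or rotate) your UA
```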
- 24. A service daemon to run Scrapy spiders
$ scrapyd-deploy
Deployment 1.0
scrapyd