Scrapy workshop
- 2. Karthik Ananth
Who am I?
! Leading professional services
@ Scrapinghub
! I have a vision to synergise data
generation and analytics
! Open source promoter
- 4. What is Web Scraping?
The main goal in scraping is to
extract structured data from
unstructured sources, typically
web pages.
- 5. What for?
! Monitor prices
! Lead generation
! Aggregate information
! Your imagination is the limit
- 6. Do you speak HTTP?
Methods: GET, POST, PUT, HEAD…
Status codes: 2XX, 3XX, 4XX, 418, 5XX, 999
Headers, query string: Accept-Language, UA*…
Persistence: Cookies
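The pieces above can be sketched with the Python standard library; the URL, query parameters, and User-Agent string here are illustrative:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Build a GET request with a query string and custom headers
# (the URL and parameter names are made up for illustration)
params = urlencode({'q': 'scrapy', 'page': 2})
req = Request(
    'http://example.com/search?' + params,
    headers={'Accept-Language': 'en', 'User-Agent': 'my-crawler'},
    method='GET',
)
print(req.full_url)      # http://example.com/search?q=scrapy&page=2
print(req.get_method())  # GET
```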
- 10. lxml: pythonic binding for the C libraries libxml2
and libxslt
beautifulsoup: wraps a choice of backends (html.parser, lxml, html5lib)
HTML parsers
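As a minimal sketch of the lowest-level option, the stdlib html.parser backend can be driven directly; the `LinkExtractor` class and sample markup are made up for illustration:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.extend(value for name, value in attrs if name == 'href')

parser = LinkExtractor()
parser.feed('<p><a href="/one">one</a> <a href="/two">two</a></p>')
print(parser.links)  # ['/one', '/two']
```

Event-driven parsers like this are fast but low-level; lxml and beautifulsoup build a queryable tree on top, which is what makes XPath/CSS extraction convenient.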
- 11. import requests
import lxml.html

req = requests.get('http://nyc2015.pydata.org/schedule/')
tree = lxml.html.fromstring(req.text)
for span in tree.xpath('//span[@class="speaker"]'):
    name = span.xpath('text()')
    url = span.xpath('@href')
    print(name)
    print(url)
Show me the code!
- 14. “An open source and collaborative framework for
extracting the data you need from websites. In a
fast, simple, yet extensible way.”
- 16. $ scrapy shell <url>
An interactive shell console
Invaluable tool for developing and debugging your spiders
- 17. An interactive shell console
>>> response.url
'http://example.com'
>>> response.xpath('//h1/text()')
[<Selector xpath='//h1/text()' data=u'Example Domain'>]
>>> view(response) # open in browser
>>> fetch('http://www.google.com') # fetch other URL
- 18. $ scrapy startproject <name>
pydata
├── pydata
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
Starting a project
- 20. import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/',
    ]

    def parse(self, response):
        msg = 'A response from %s just arrived!' % response.url
        self.logger.info(msg)
What is a Spider?
- 21. import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3/text()').extract():
            yield {'title': h3}
        for url in response.xpath('//a/@href').extract():
            # urljoin resolves relative hrefs against the page URL
            yield scrapy.Request(response.urljoin(url), callback=self.parse)
What is a Spider? 1.0
- 23. Avoid getting banned
! Rotate your User Agent
! Disable cookies
! Randomize download delays
! Use a pool of rotating IPs
! Crawlera
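Most of these points map directly onto Scrapy settings; a minimal settings.py sketch (the delay and UA values are illustrative, and UA/IP rotation via middleware or Crawlera is not shown):

```python
# settings.py fragment matching the points above (values are illustrative)
COOKIES_ENABLED = False           # disable cookies
DOWNLOAD_DELAY = 2.0              # base delay between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True   # wait 0.5x to 1.5x of DOWNLOAD_DELAY
USER_AGENT = 'my-crawler (+http://www.example.com)'  # identify (or rotate) your UA
```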
- 24. A service daemon to run Scrapy spiders
$ scrapyd-deploy
Deployment 1.0
scrapyd