/me wants it
Scraping Sites to get Data




Rob Coup
robert@coup.net.nz
Who am I?

• Koordinates
• Open data
  open.org.nz

• Geek
• Pythonista
Datasets as websites
But I want to mix it up!

         http://www.flickr.com/photos/bowbrick/2365377635
DATA
       http://fl1p51d3.deviantart.com/art/The-Matrix-4594403
And when do I want it?




                 http://www.flickr.com/photos/davidmaddison/102584440
Just Scrape It
First Example


• Wanganui District Council Food Gradings
• http://j.mp/i4yNZ
Review
• POST to URLs for each Grade
• Parse HTML response for:
 • Business Name
 • Address
 • Grading
• Output as CSV
What to POST?
• Tools: Firebug, Charles
  http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp

  txtGrading=A
    [ B, C, D, E, “Exempt”, “Currently Not Graded” ]
  Submit=Go
POSTing in Python
import urllib
import urllib2

url = ('http://www.wanganui.govt.nz/services/foodgrading/'
       'SearchResults.asp')
post_data = {
    'txtGrading': 'A',
    'Submit': 'Go',
}

post_encoded = urllib.urlencode(post_data)        # form-encode the POST body
html = urllib2.urlopen(url, post_encoded).read()  # passing data makes it a POST

print html
Results
…
<TD class="bodytext">
  <h2>Search results...</h2>
  <B>39 South</B><br />
  159 Victoria Ave<br />
  Wanganui<br />
  Grading: <B>A</b>
  <hr />
  <B>Alma Junction Dairy</B><br />
  1 Alma Rd<br />
  Wanganui<br />
  Grading: <B>A</b>
  <hr />
  …
Getting Data Out
• Tools: BeautifulSoup

• Parses HTML-ish documents
• Easy navigation & searching of tree
Our Parser
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)
container = soup.find('td', {'class':'bodytext'})

for hr_el in container.findAll('hr'):
    # <b>NAME</b><br/>ADDRESS_0<br/>ADDRESS_1<br/>Grading:<b>GRADE</b><hr/>
    text_parts = hr_el.findPreviousSiblings(text=True, limit=3)
    # ['Grading:', 'ADDRESS_1', 'ADDRESS_0']
    address = (text_parts[2], text_parts[1])
    el_parts = hr_el.findPreviousSiblings('b', limit=2)
    # [<b>GRADE</b>, <b>NAME</b>]
    grade = el_parts[0].string
    name = el_parts[1].string
    print name, address, grade
Putting it all together


• loop over the grading values
• write CSV output (a minimal sketch follows)
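
Putting those pieces together — a sketch of the whole scraper, reusing the POST and parse code from the earlier slides (the output filename and CSV column names are my own choices):

import csv
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

url = ('http://www.wanganui.govt.nz/services/foodgrading/'
       'SearchResults.asp')
gradings = ['A', 'B', 'C', 'D', 'E', 'Exempt', 'Currently Not Graded']

writer = csv.writer(open('gradings.csv', 'wb'))
writer.writerow(['name', 'address_0', 'address_1', 'grade'])

for grading in gradings:
    # POST the search form for this grade
    post_encoded = urllib.urlencode({'txtGrading': grading, 'Submit': 'Go'})
    html = urllib2.urlopen(url, post_encoded).read()

    # parse out each business record, as on the "Our Parser" slide
    soup = BeautifulSoup(html)
    container = soup.find('td', {'class': 'bodytext'})
    for hr_el in container.findAll('hr'):
        text_parts = hr_el.findPreviousSiblings(text=True, limit=3)
        el_parts = hr_el.findPreviousSiblings('b', limit=2)
        writer.writerow([el_parts[1].string, text_parts[2],
                         text_parts[1], el_parts[0].string])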
Advanced Crawlers


• Form filling
• Authentication & cookies
Mechanize


•   http://wwwsearch.sourceforge.net/mechanize/

•   programmable browser in Python

•   fills forms, navigates links & pages, eats cookies (sketch below)
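
A minimal form-filling sketch with mechanize (the URL and field names here are hypothetical):

import mechanize

br = mechanize.Browser()
br.open('http://example.com/login')

br.select_form(nr=0)        # first form on the page
br['username'] = 'me'       # field names are hypothetical
br['password'] = 'secret'
br.submit()                 # session cookies are kept for later requests

response = br.open('http://example.com/members/report')
print response.read()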
Data Parsing

• JSON: simplejson (bundled as the json module from Python 2.6)
• XML: ElementTree
• HTML: BeautifulSoup
• Nasties: Adobe PDF, Microsoft Excel
      “PDF files are where data goes to die”
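
The friendly formats take a couple of lines each — a minimal sketch (the payloads here are invented):

import simplejson as json            # just `import json` on Python 2.6+
from xml.etree import ElementTree

data = json.loads('{"name": "39 South", "grade": "A"}')
print data['name'], data['grade']

doc = ElementTree.fromstring('<business name="39 South"><grade>A</grade></business>')
print doc.get('name'), doc.findtext('grade')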
Reading nasties in Python

• Adobe PDF: PDFMiner, pdftable
• MS Excel: xlrd (sketch below)
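
A minimal xlrd sketch (the filename is hypothetical):

import xlrd

book = xlrd.open_workbook('gradings.xls')
sheet = book.sheet_by_index(0)          # first worksheet
for rowx in range(sheet.nrows):
    print sheet.row_values(rowx)        # each row as a list of cell values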
Example Two


• Palmerston North City Food Gradings
• http://j.mp/31YuRH
Review
• Get HTML page
• Find current PDF link (sketch below)
• Download PDF
• Parse table
 • Name
 • Grading
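
A sketch of the first two steps, yielding the pdf_url that the next slide downloads and parses (assuming the first link ending in .pdf is the current one — that rule is mine):

import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

page_url = 'http://j.mp/31YuRH'
response = urllib2.urlopen(page_url)
base_url = response.geturl()            # resolve the j.mp redirect
soup = BeautifulSoup(response.read())

# assume the first link ending in .pdf is the current gradings table
link = soup.find('a', href=lambda h: h and h.lower().endswith('.pdf'))
pdf_url = urlparse.urljoin(base_url, link['href'])
print pdf_url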
Parsing PDF
import urllib2
from cStringIO import StringIO
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.layout import LAParams

pdf_file = StringIO(urllib2.urlopen(pdf_url).read())   # pdf_url: found on the previous slide

text = StringIO()
rsrc = PDFResourceManager()
device = TextConverter(rsrc, text, laparams=LAParams())   # collect text, with layout analysis
process_pdf(rsrc, device, pdf_file)
device.close()

print text.getvalue()
Summary

• Python has some great tools for:
 • querying websites
 • parsing HTML & other formats

• Open data as data, not websites


Editor's Notes

  1. We’ve ended up with this datasets-as-websites problem.
  2. I might want to create an alternative presentation. Use it for something different, that the creator would never have conceived of. Or maybe just compare or combine it with other data. http://www.flickr.com/photos/bowbrick/2365377635
  3. So, I need the raw data. Not some pretty webpages. http://fl1p51d3.deviantart.com/art/The-Matrix-4594403
  4. At 3am on a Sunday morning of course. When my interest is up. No use having some mail-in-take-21-working-days option. http://www.flickr.com/photos/davidmaddison/102584440
  5. Usually it’s easier to ask forgiveness than permission.