/me wants it. Scraping sites to get data.
- 1. /me wants it
Scraping Sites to get Data
Rob Coup
robert@coup.net.nz
- 2. Who am I?
• Koordinates
• Open data
open.org.nz
• Geek
• Pythonista
- 5. DATA
http://fl1p51d3.deviantart.com/art/The-Matrix-4594403
- 6. And when do I want it?
http://www.flickr.com/photos/davidmaddison/102584440
- 9. Review
• POST to URLs for each Grade
• Parse HTML response for:
• Business Name
• Address
• Grading
• Output as CSV
- 10. What to POST?
• Tools: Firebug, Charles
http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp
txtGrading=A
[ B, C, D, E, “Exempt”, “Currently Not Graded” ]
Submit=Go
- 11. POSTing in Python
import urllib
import urllib2
url = 'http://www.wanganui.govt.nz/services/foodgrading/SearchResults.asp'
post_data = {
    'txtGrading': 'A',
    'Submit': 'Go',
}
post_encoded = urllib.urlencode(post_data)
html = urllib2.urlopen(url, post_encoded).read()
print html
- 12. Results
…
<TD class="bodytext">
<h2>Search results...</h2>
<B>39 South</B><br />
159 Victoria Ave<br />
Wanganui<br />
Grading: <B>A</b>
<hr />
<B>Alma Junction Dairy</B><br />
1 Alma Rd<br />
Wanganui<br />
Grading: <B>A</b>
<hr />
…
- 13. Getting Data Out
• Tools: BeautifulSoup
• Parses HTML-ish documents
• Easy navigation & searching of tree
- 14. Our Parser
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)
container = soup.find('td', {'class': 'bodytext'})
for hr_el in container.findAll('hr'):
    # <b>NAME</b><br/>ADDRESS_0<br/>ADDRESS_1<br/>Grading:<b>GRADE</b><hr/>
    text_parts = hr_el.findPreviousSiblings(text=True, limit=3)
    # ['Grading:', 'ADDRESS_1', 'ADDRESS_0']
    address = (text_parts[2], text_parts[1])
    el_parts = hr_el.findPreviousSiblings('b', limit=2)
    # [<b>GRADE</b>, <b>NAME</b>]
    grade = el_parts[0].string
    name = el_parts[1].string
    print name, address, grade
- 15. Putting it all together
• loop over the grading values
• write CSV output
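Those two bullets can be sketched roughly like this. Here `scrape()` is a hypothetical stand-in for the POST-and-parse code from the earlier slides, and the grade list and output filename are illustrative:

```python
import csv

GRADES = ['A', 'B', 'C', 'D', 'E', 'Exempt', 'Currently Not Graded']

def scrape(grade):
    # hypothetical stand-in: POST txtGrading=grade and parse the
    # response as on the previous slides, returning a list of
    # (name, address, grading) tuples
    return []

out = open('gradings.csv', 'w')
writer = csv.writer(out)
writer.writerow(['Name', 'Address', 'Grading'])
for grade in GRADES:
    for name, address, grading in scrape(grade):
        writer.writerow([name, address, grading])
out.close()
```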
- 17. Mechanize
• http://wwwsearch.sourceforge.net/mechanize/
• programmable browser in Python
• fills forms, navigates links & pages, eats cookies
- 18. Data Parsing
• JSON: simplejson (stdlib json from Python 2.6)
• XML: ElementTree
• HTML: BeautifulSoup
• Nasties: Adobe PDF, Microsoft Excel
“PDF files are where data goes to die”
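For the friendlier formats, the standard library already does most of the work. A quick sketch (the sample JSON and XML strings here are invented for illustration):

```python
import json
import xml.etree.ElementTree as ET

# JSON: the json module is in the stdlib from Python 2.6;
# simplejson has the same API on earlier versions
record = json.loads('{"name": "39 South", "grading": "A"}')

# XML: ElementTree ships as xml.etree.ElementTree
root = ET.fromstring(
    '<results><business grading="A">39 South</business></results>')
businesses = [(b.text, b.get('grading')) for b in root.findall('business')]
# [('39 South', 'A')]
```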
- 21. Review
• Get HTML page
• Find current PDF link
• Download PDF
• Parse table
• Name
• Grading
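The "find current PDF link" step can reuse BeautifulSoup as before; as a stdlib-only illustration, a regex over the page source also works for a simple case like this (the HTML snippet below is invented, not the real council page):

```python
import re

html = '<p><a href="/documents/food-gradings.pdf">Current gradings (PDF)</a></p>'

# grab the first href that ends in .pdf
match = re.search(r'href="([^"]+\.pdf)"', html)
pdf_path = match.group(1)
# '/documents/food-gradings.pdf'
```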
- 22. Parsing PDF
import urllib2
from cStringIO import StringIO
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.layout import LAParams
pdf_file = StringIO(urllib2.urlopen(pdf_url).read())
text = StringIO()
rsrc = PDFResourceManager()
device = TextConverter(rsrc, text, laparams=LAParams())
process_pdf(rsrc, device, pdf_file)
device.close()
print text.getvalue()
- 23. Summary
• Python has some great tools for:
• querying websites
• parsing HTML & other formats
• Open data as data, not websites
Editor's Notes
- We’ve ended up with this datasets-as-websites problem.
- I might want to create an alternative presentation. Use it for something different, that the creator would never have conceived of. Or maybe just compare or combine it with other data.
http://www.flickr.com/photos/bowbrick/2365377635
- So, I need the raw data. Not some pretty webpages.
http://fl1p51d3.deviantart.com/art/The-Matrix-4594403
- At 3am on a Sunday morning of course. When my interest is up. No use having some mail-in-take-21-working-days option.
http://www.flickr.com/photos/davidmaddison/102584440
- Usually it’s easier to ask forgiveness than permission.