The problem
Data ingestion
Data parsing
Data cleansing
Scaling out
Building a data processing pipeline in Python
Joe Cabrera
https://github.com/greedo
@greedoshotlast
jcabrera@eminorlabs.com
PyGotham, 2015
Joe Cabrera Building a data processing pipeline in Python
Outline
1 The problem
2 Data ingestion
3 Data parsing
4 Data cleansing
5 Scaling out
Poorly formatted data
Largely dispersed across the web
No standard data processing library
Pandas
Bubbles
Data processing
Requests and Futures
Requests makes it easy to send the required parameters
Concurrent Futures allows for the asynchronous execution of download requests
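A minimal sketch of how the two libraries might be combined, not the speaker's actual code. The names `fetch` and `download_all`, the worker count, and the timeout are assumptions made for illustration. Each URL is submitted to a thread pool and results are collected as the downloads finish.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(url, params=None, timeout=10):
    """Download one URL with Requests and return the response body as text."""
    # Third-party dependency, imported here so download_all() can also be
    # exercised with a stub fetcher when Requests is not installed.
    import requests

    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()
    return response.text


def download_all(urls, fetcher=fetch, max_workers=8):
    """Run fetcher(url) for every URL on a thread pool.

    Returns (results, errors): two dicts keyed by URL.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it is downloading.
        future_to_url = {pool.submit(fetcher, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one download fails
                errors[url] = exc
    return results, errors
```

Threads suit this stage because the work is I/O-bound. Collecting failures in a separate dict lets one bad URL be retried later without aborting the rest of the batch.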
Parsers
Python tokenize
BeautifulSoup
Why BeautifulSoup
More forgiving than standard XML or HTML libraries
Supports regex
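Both points can be illustrated with a short sketch. The markup, the `.csv` pattern, and the helper name `find_csv_links` are made up for this example. The HTML below has unclosed tags, yet BeautifulSoup still builds a tree, and `find_all` accepts a compiled regular expression as an attribute filter.

```python
import re

from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

# Deliberately sloppy markup: unclosed <li> and <a> tags, no closing </ul>.
MESSY_HTML = """
<ul>
  <li><a href="/data/2015-01.csv">January
  <li><a href="/about">About us
  <li><a href="/data/2015-02.csv">February
"""


def find_csv_links(html):
    """Return the href of every <a> tag whose target ends in .csv."""
    soup = BeautifulSoup(html, "html.parser")
    # A compiled regex is matched against each tag's href attribute.
    links = soup.find_all("a", href=re.compile(r"\.csv$"))
    return [link["href"] for link in links]


print(find_csv_links(MESSY_HTML))
```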
Celery job scheduling
Each download job is a task
Each parse job is a task
Each cleanse job is a task
Re-insert cleansed data
Clean up data after raw ingest
Separate stores for raw and clean data
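The slide does not name a storage backend, so the sketch below uses two in-memory SQLite databases purely for illustration. The table names, function names, and whitespace-collapsing rule are all assumptions. Raw rows are written once at ingest and never modified, and the cleansed copies go to a second store.

```python
import sqlite3


def open_stores(raw_path=":memory:", clean_path=":memory:"):
    """Open two independent databases: one for raw data, one for clean data."""
    raw = sqlite3.connect(raw_path)
    clean = sqlite3.connect(clean_path)
    raw.execute(
        "CREATE TABLE IF NOT EXISTS raw_docs (url TEXT PRIMARY KEY, body TEXT)"
    )
    clean.execute(
        "CREATE TABLE IF NOT EXISTS clean_docs (url TEXT PRIMARY KEY, body TEXT)"
    )
    return raw, clean


def ingest_raw(raw, url, body):
    """Store the downloaded content exactly as received."""
    with raw:  # commits on success, rolls back on error
        raw.execute("INSERT OR REPLACE INTO raw_docs VALUES (?, ?)", (url, body))


def cleanse(body):
    """Hypothetical cleansing rule: collapse runs of whitespace."""
    return " ".join(body.split())


def reinsert_clean(raw, clean):
    """Read every raw row, cleanse it, and write it to the clean store."""
    rows = raw.execute("SELECT url, body FROM raw_docs").fetchall()
    with clean:
        clean.executemany(
            "INSERT OR REPLACE INTO clean_docs VALUES (?, ?)",
            [(url, cleanse(body)) for url, body in rows],
        )
    return len(rows)
```

Keeping the raw store untouched means the cleansing rules can be changed and re-run later without downloading anything again.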
Distributed task queue
Distribute data processing jobs to many machines
Distribute jobs on a given machine across many CPUs
SQLAlchemy basic sharding API
Each database has a shard id
We query for data based on which shard contains the data
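The routing idea can be shown without SQLAlchemy. The sketch below is plain Python over SQLite and is not the SQLAlchemy sharding API. The shard ids, the CRC32 routing rule, and the function names are assumptions. Each database is registered under a shard id, and reads and writes go to the shard chosen for the record's key.

```python
import sqlite3
import zlib

SHARD_IDS = ("shard_0", "shard_1", "shard_2")  # hypothetical shard ids


def make_shards():
    """Create one database per shard id (in-memory here for illustration)."""
    shards = {}
    for shard_id in SHARD_IDS:
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE docs (url TEXT PRIMARY KEY, body TEXT)")
        shards[shard_id] = conn
    return shards


def shard_for(url):
    """Map a key to a shard id with a stable hash (CRC32 modulo shard count)."""
    # zlib.crc32 is used because Python's built-in hash() of a str
    # changes between processes.
    return SHARD_IDS[zlib.crc32(url.encode("utf-8")) % len(SHARD_IDS)]


def save(shards, url, body):
    """Write the record to the shard that owns its key."""
    conn = shards[shard_for(url)]
    with conn:
        conn.execute("INSERT OR REPLACE INTO docs VALUES (?, ?)", (url, body))


def load(shards, url):
    """Query only the shard that can contain the key."""
    row = (
        shards[shard_for(url)]
        .execute("SELECT body FROM docs WHERE url = ?", (url,))
        .fetchone()
    )
    return row[0] if row else None
```

SQLAlchemy's horizontal sharding extension (`sqlalchemy.ext.horizontal_shard`) follows the same pattern: user-supplied chooser functions decide which shard a given object or query is sent to.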
Questions
Thanks!
https://github.com/greedo
@greedoshotlast
jcabrera@eminorlabs.com