1
PLANET
Massively Parallel Learning of Tree Ensembles with
MapReduce
Joshua Herbach*
Google Inc., AdWords
*Joint work with Biswanath Panda, Sugato Basu, Roberto J. Bayardo
2
Outline
• PLANET – infrastructure for building trees
• Decision trees
• Usage and motivation
• MapReduce
• PLANET details
• Results
• Future Work
3
Tree Models
• Classic data mining model
• Interpretable
• Good when built with ensemble
techniques like bagging
and boosting
[Figure: an example regression tree. The root A (|D|=100) splits on X1 < v1 into a leaf predicting .42 (|D|=10) and node C (|D|=90); C splits on X2 є {v2, v3} into D and E (|D|=45 each), whose children are the leaves F, G, H, I (|D|=20, 25, 15, 30). A second diagram, labeled "Construction", shows the same tree with the left child labeled B, illustrating how the tree is grown node by node.]
4
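A tree like the one in the figure is evaluated by walking from the root to a leaf. A minimal sketch (node layout, thresholds, and the non-.42 leaf values are illustrative assumptions, not the exact figure):

```python
# Minimal sketch of prediction with a regression tree shaped like the figure.
def make_tree():
    leaf = lambda value: {"leaf": value}          # leaves carry a prediction
    return {
        "test": lambda r: r["X1"] < 5.0,          # A: X1 < v1
        "left": leaf(0.42),                        # B: |D|=10, predicts .42
        "right": {                                 # C: |D|=90
            "test": lambda r: r["X2"] in {"v2", "v3"},   # X2 є {v2, v3}
            "left": leaf(1.0),                     # illustrative leaf values
            "right": leaf(2.0),
        },
    }

def predict(node, record):
    # Internal nodes carry a test; follow left on true, right on false.
    while "leaf" not in node:
        node = node["left"] if node["test"](record) else node["right"]
    return node["leaf"]

tree = make_tree()
print(predict(tree, {"X1": 3.0, "X2": "v5"}))  # falls into leaf B -> 0.42
```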
5
Find Best Split
6
Trees at Google
• Large Datasets
 Iterating through a large dataset (10s, 100s, or 1000s of GB) is slow
 Computing values based on the records in a large dataset is really slow
• Parallelism!
 Break up dataset across many processing units and then combine results
 Super computers with specialized parallel hardware to support high throughput
are expensive
 Computers made from commodity hardware are cheap
• Enter MapReduce
7
MapReduce*
*http://labs.google.com/papers/mapreduce.html
[Figure: MapReduce data flow. Five input shards feed the Mappers, which emit key-value pairs under keys A, B, and C; the shuffle groups pairs by key and delivers each key's values to a Reducer, producing Output 1 and Output 2.]
A secondary key can be used to control the order in which reducers see
key-value pairs
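The map/shuffle/reduce flow in the figure can be imitated in-process. A toy word count in that shape (illustrative only, not Google's MapReduce API):

```python
from collections import defaultdict

def map_phase(inputs, mapper):
    # Run the mapper over every input record, collecting key-value pairs.
    pairs = []
    for record in inputs:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    # Group values by key, as the shuffle phase does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: mappers emit (word, 1); reducers sum the counts per word.
inputs = ["a b a", "b c"]
mapper = lambda line: [(w, 1) for w in line.split()]
reducer = lambda key, values: sum(values)
print(reduce_phase(shuffle(map_phase(inputs, mapper)), reducer))
# {'a': 2, 'b': 2, 'c': 1}
```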
8
PLANET
• Parallel Learner for Assembling Numerous Ensemble Trees
• PLANET is a learner for training decision trees that is built on MapReduce
 Regression models (or classification using logistic regression)
 Supports boosting, bagging and combinations thereof
 Scales to very large datasets
9
System Components
• Master
 Monitors and controls everything
• MapReduce Initialization Task
 Identifies all the attribute values which need to be considered for splits
• MapReduce FindBestSplit Task
 MapReduce job to find best split when there is too much data to fit in memory
• MapReduce InMemoryGrow Task
 Task to grow an entire subtree once the data for it fits in memory
• Model File
 A file describing the state of the model
10
Architecture
Master
Input Data
Initialization
MapReduce
Attribute
Metadata
Model
FindBestSplit
MapReduce
Intermediate
Results
[Figure: PLANET architecture. The Master consults the Model and the Attribute Metadata produced by the Initialization MapReduce, schedules FindBestSplit and InMemory MapReduce jobs over the Input Data, and folds their Intermediate Results back into the growing example tree.]
11
Master
• Controls the entire process
• Determines the state of the tree and grows it
 Decides if nodes should be leaves
 If relatively little data enters a node, launches an InMemory MapReduce
job to grow the entire subtree
 For larger nodes, launches a MapReduce job to find candidate best splits
 Collects results from MapReduce jobs and chooses the best split for a node
 Updates Model
• Periodically checkpoints system
• Maintains status page for monitoring
12
Status page
13
Initialization MapReduce
• Identifies all the attribute values which need to be considered for splits
• Continuous attributes
 Compute an approximate equi-depth histogram*
 Boundary points of histogram used for potential splits
• Categorical attributes
 Identify attribute's domain
• Generates an “attribute file” to be loaded in memory by other tasks
*G. S. Manku, S. Rajagopalan, and B. G. Lindsay, SIGMOD, 1999.
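A simple stand-in for the equi-depth histogram step: each bucket boundary becomes a candidate split point. (The paper uses Manku et al.'s approximate one-pass streaming algorithm; this exact, in-memory version is just for illustration.)

```python
def equi_depth_boundaries(values, num_buckets):
    """Boundary points such that each bucket holds ~len(values)/num_buckets items."""
    ordered = sorted(values)
    n = len(ordered)
    # The value at each interior bucket edge is a candidate split point.
    return [ordered[i * n // num_buckets] for i in range(1, num_buckets)]

values = [1, 3, 3, 4, 7, 8, 9, 12, 15, 20]
print(equi_depth_boundaries(values, 5))  # [3, 7, 9, 15]
```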
14
FindBestSplit MapReduce
• MapReduce job to find best split when there is too much data to fit in memory
• Mapper
 Initialize by loading attribute file from Initialization task and current model file
 For each record run the Map algorithm
 For each node output to all reducers
<Node.Id, <Sum Result, Sum Squared Result, Count>>
 For each split output <Split.Id, <Sum Result, Sum Squared Result, Count>>
Map(data):
  Node = TraverseTree(data, Model)
  if Node is to be grown:
    Node.stats.AddData(data)
    for feature in data:
      Split = FindSplitForValue(Node.Id, feature)
      Split.stats.AddData(data)
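A runnable sketch of what the mapper accumulates: per-node and per-candidate-split triples of (sum of target, sum of squared target, count). Names like the split predicates are illustrative assumptions.

```python
from collections import defaultdict

# Each stats entry is [sum_y, sum_y_squared, count]: enough to recover the
# mean prediction and the variance-based impurity on the reducer side.
node_stats = defaultdict(lambda: [0.0, 0.0, 0])
split_stats = defaultdict(lambda: [0.0, 0.0, 0])

def add(stats, y):
    stats[0] += y
    stats[1] += y * y
    stats[2] += 1

def map_record(node_id, record, target, candidate_splits):
    # node_id would come from traversing the current model with the record.
    add(node_stats[node_id], target)
    for split_id, goes_left in candidate_splits:
        if goes_left(record):          # record falls on the left under this split
            add(split_stats[split_id], target)

splits = [("n1:X1<5", lambda r: r["X1"] < 5)]
map_record("n1", {"X1": 3}, target=2.0, candidate_splits=splits)
map_record("n1", {"X1": 7}, target=4.0, candidate_splits=splits)
print(node_stats["n1"], split_stats["n1:X1<5"])
# [6.0, 20.0, 2] [2.0, 4.0, 1]
```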
15
FindBestSplit MapReduce
• MapReduce job to find best split when there is too much data to fit in memory
• Reducer (Continuous Attributes)
 Load in all the <Node_Id, List<Sum Result, Sum Squared Result, Count>> pairs
and aggregate the per-node statistics
 For each <Split_Id, List<Sum Result, Sum Squared Result, Count>> run the
Reduce algorithm
 For each Node_Id, output the best split found
Reduce(Split_Id, values):
  split = NewSplit(Split_Id)
  best = FindBestSplitSoFar(split.Node.Id)
  for stats in values:
    split.stats.AddStats(stats)
  left = ComputeImpurity(split.stats)
  right = ComputeImpurity(split.Node.stats - split.stats)
  split.impurity = left + right
  if split.impurity < best.impurity:
    UpdateBestSplit(split.Node.Id, split)
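The impurity arithmetic can be made concrete. With (sum, sum of squares, count) triples, the variance impurity of one side is sum_sq - sum²/count, and the right side's stats are the node totals minus the left side's. A sketch, not the production code:

```python
def impurity(stats):
    s, ss, n = stats
    if n == 0:
        return 0.0
    return ss - s * s / n   # equals n * variance of the targets on this side

def split_impurity(node_stats, left_stats):
    # Right-side stats are the node totals minus the left-side split stats.
    s, ss, n = node_stats
    ls, lss, ln = left_stats
    right_stats = (s - ls, ss - lss, n - ln)
    return impurity(left_stats) + impurity(right_stats)

# Node with targets [1, 1, 5, 5]; the candidate split sends [1, 1] left.
node = (12.0, 52.0, 4)
left = (2.0, 2.0, 2)
print(split_impurity(node, left))  # 0.0: both sides are pure
```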
16
FindBestSplit MapReduce
• MapReduce job to find best split when there is too much data to fit in memory
 Reducer (Categorical Attributes)
• Modification to the Reduce algorithm:
 Compute the aggregate stats for each individual value
 Sort values by average target value
 Iterate through list and find optimal subsequence in list*
*L. Breiman, J. H. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. 1984.
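The Breiman trick above can be sketched directly: after sorting category values by mean target, only the |domain|-1 prefix splits of that ordering need to be scanned, and for squared-error regression this yields the optimal binary partition. A sketch over per-value (sum, sum of squares, count) stats:

```python
def best_categorical_split(per_value_stats):
    """per_value_stats: {value: (sum_y, sum_sq, count)} for one node and attribute.
    Returns (left_value_set, impurity) for the best binary partition."""
    # Sort category values by their average target value.
    ordered = sorted(per_value_stats,
                     key=lambda v: per_value_stats[v][0] / per_value_stats[v][2])
    total = [sum(st[i] for st in per_value_stats.values()) for i in range(3)]

    def imp(s, ss, n):                      # n * variance on one side
        return 0.0 if n == 0 else ss - s * s / n

    best, best_left = float("inf"), None
    ls = lss = ln = 0.0
    for i, value in enumerate(ordered[:-1]):   # each prefix is a candidate left branch
        s, ss, n = per_value_stats[value]
        ls, lss, ln = ls + s, lss + ss, ln + n
        score = imp(ls, lss, ln) + imp(total[0] - ls, total[1] - lss, total[2] - ln)
        if score < best:
            best, best_left = score, set(ordered[: i + 1])
    return best_left, best

stats = {"v1": (10.0, 50.0, 2),   # mean target 5
         "v2": (2.0, 2.0, 2),     # mean target 1
         "v3": (12.0, 72.0, 2)}   # mean target 6
print(best_categorical_split(stats))  # ({'v2'}, 1.0)
```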
17
InMemoryGrow MapReduce
• Task to grow an entire subtree once the data for it fits in memory
• Mapper
 Initialize by loading current model file
 For each record identify the node it falls under and if that node is to be grown,
output <Node_Id, Record>
• Reducer
 Initialize by loading attribute file from Initialization task
 For each <Node_Id, List<Record>> run the basic tree growing algorithm on the
records
 Output the best splits for each node in the subtree
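The InMemoryGrow mapper is essentially a router: it sends each record to the reducer that owns the subtree the record falls into. A sketch (the `traverse` model interface is an assumed stand-in):

```python
def in_memory_grow_map(records, traverse, nodes_being_grown):
    """Emit (node_id, record) only for frontier nodes whose data fits in memory."""
    for record in records:
        node_id = traverse(record)            # walk the current model to a frontier node
        if node_id in nodes_being_grown:
            yield node_id, record

# Toy model: records with X < 5 reach node "n2", others "n3"; only "n2"
# is small enough to grow in memory.
traverse = lambda r: "n2" if r["X"] < 5 else "n3"
records = [{"X": 1}, {"X": 9}, {"X": 4}]
print(list(in_memory_grow_map(records, traverse, {"n2"})))
# [('n2', {'X': 1}), ('n2', {'X': 4})]
```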
18
Ensembles
• Bagging
 Construct multiple trees in parallel, each on a sample of the data
 Sampling without replacement is easy to implement on the Mapper side for both
types of MapReduce tasks
• Compute a hash of <Tree_Id, Record_Id> and if it's below a threshold then sample it
 Get results by combining the output of the trees
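The hash trick for per-tree sampling might look like the following (the hash function and threshold mapping are illustrative; any fast deterministic hash works):

```python
import hashlib

def sampled(tree_id, record_id, rate):
    """Deterministically decide whether a record is in this tree's sample."""
    key = f"{tree_id}:{record_id}".encode()
    h = int.from_bytes(hashlib.md5(key).digest()[:8], "big")
    return h / 2**64 < rate   # map the hash to [0, 1) and compare to the rate

# Every mapper makes the same decision for the same (tree, record) pair,
# so sampling needs no coordination or shared state.
picks = [r for r in range(1000) if sampled(tree_id=1, record_id=r, rate=0.1)]
print(len(picks))  # roughly 100
```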
• Boosting
 Construct multiple trees in a series, each on a sample of the data*
 Modify the target of each record to be the residual of the target and the model's
prediction for the record
• For regression, the residual z is the target y minus the model prediction F(x)
• For classification, z = y – 1 / (1 + exp(-F(x)))
 Get results by combining output from each tree
*J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 2001.
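One round of the residual update above, directly from the two formulas on this slide (`F` here is a list of current ensemble predictions; a full PLANET pass would then fit the next tree to these targets):

```python
import math

def boosting_targets(y, F, classification=False):
    """Residual targets for the next tree, per Friedman's gradient boosting."""
    if classification:
        # z = y - 1 / (1 + exp(-F(x)))
        return [yi - 1.0 / (1.0 + math.exp(-fi)) for yi, fi in zip(y, F)]
    # Regression: z = y - F(x)
    return [yi - fi for yi, fi in zip(y, F)]

y = [3.0, 1.0, 2.0]
F = [2.5, 1.5, 2.0]            # current ensemble predictions
print(boosting_targets(y, F))  # [0.5, -0.5, 0.0]
```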
19
Performance Issues
• Set up and Tear down
 Per-MapReduce overhead is significant for large forests or deep trees
 Reduce tear-down cost by polling for output instead of waiting for a task to return
 Reduce start-up cost through forward scheduling
• Maintain a set of live MapReduce jobs and assign them tasks instead of starting new
jobs from scratch
• Categorical Attributes
 Basic implementation stored and tracked these as strings
• This made traversing the tree expensive
 Improved latency by instead considering fingerprints of these values
• Very high dimensional data
 If the number of splits is too large the Mapper might run out of memory
 Instead of defining split tasks as a set of nodes to grow, define them as a set of
nodes to grow and a set of attributes to explore
20
Results
21
Conclusions
• Large-scale learning is increasingly important
• Computing infrastructures like MapReduce can be leveraged for large-scale
learning
• PLANET scales efficiently with larger datasets and complex models.
• Future work
 Adding support for sampling with replacement
 Categorical attributes with large domains
• Might run out of memory
 Only support splitting on single values
 Area for future exploration
Google Confidential and Proprietary 22
Thank You!
Q&A