Optimization for Iterative Queries on MapReduce
Makoto Onizuka, Hiroyuki Kato, Soichiro
Hidaka, Keisuke Nakano, Zhenjiang Hu
1
Demand for Big Data Analysis
 Big Data Analysis
 Cyber space: Click log, query log
 Real space: shopping log, sensing data
 Machine learning
 Algorithm: classification, clustering
 Data type: relation, vector, graph, time series
 Distributed computing framework
 Interface: MPI, MapReduce, BSP (bulk
synchronous parallel)
2
Iterative analysis examples
 Clustering
 Partitioning: k-means, EM-algorithm, affinity
propagation
 Hierarchical clustering: Ward's method,
BIRCH
 Matrix factorization
 Graph mining
 PageRank, Random walk with restarts
3
Running example: PageRank
This program is not efficient. Which parts?
4
map function shuffles the
whole graph structure
in every iteration
scores are computed
even for nodes that
have already converged
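The two inefficiencies called out on the slide can be made concrete with a toy, single-machine sketch of the naive MapReduce-style PageRank loop (illustrative Python, not the paper's actual program): every iteration re-emits the whole adjacency list into the shuffle, and every node's score is recomputed whether or not it has converged.

```python
# Toy sketch of the naive MapReduce-style PageRank loop.
# All names here are illustrative, not from the paper.

def naive_pagerank(adj, d=0.85, iters=10):
    n = len(adj)
    score = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        # "map" phase: emits the whole graph structure AND the contributions,
        # so the adjacency lists are shuffled again on every iteration.
        shuffled = []
        for src, dests in adj.items():
            shuffled.append((src, ("adj", dests)))          # structure, every time
            for dest in dests:
                shuffled.append((dest, ("score", score[src] / len(dests))))
        # "reduce" phase: sums contributions per node.
        new_score = {v: (1 - d) / n for v in adj}
        for key, (tag, val) in shuffled:
            if tag == "score":
                new_score[key] += d * val
        score = new_score                                    # converged nodes recomputed too
    return score

adj = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = naive_pagerank(adj)
```

Because every node here has outgoing edges, the scores stay normalized; the point of the sketch is only where the waste occurs, which is exactly what OptIQ targets.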
Issues for iterative analysis
 How to optimize the program?
 Reuse the intermediate (shuffled) data
 Skip computing the scores of converged nodes
 Removing these redundant computations by
hand is possible, but difficult
 In fact, Spark, HaLoop, and REX force
programmers to do exactly that
 Our goal: Automatically remove redundant
computations for iterative queries
5
Overview
 OptIQ is a new optimization framework for
iterative queries with a convergence property
 Declarative high-level language; programmers
are freed from the burden of removing redundancy
 OptIQ integrates traditional optimization
techniques from the database and compiler areas
 Two techniques for removing redundancy
 view materialization for invariant views
 incrementalization for variant views
 We implement it on Hive and Spark
6
Iterative query language
 SQL extended with iteration
 Syntax
 Behavior
 initialize: statements before iteration
 update: the update table is repeatedly updated by
the step query until convergence
 return: statements after iteration
7
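The initialize / update / return semantics above can be sketched as a generic driver loop (hypothetical Python, with made-up names; OptIQ's actual language is the SQL extension on the slide):

```python
# Hypothetical driver for the initialize / update / return semantics.
# The function names are mine, not OptIQ's.

def run_iterative_query(initialize, step, converged, finalize, max_iters=100):
    table = initialize()                 # initialize: statements before iteration
    for _ in range(max_iters):
        new_table = step(table)          # step query produces the next update table
        if converged(table, new_table):  # iterate until convergence
            table = new_table
            break
        table = new_table
    return finalize(table)               # return: statements after iteration

# Toy usage: Newton iteration x -> (x + 2/x) / 2 converging to sqrt(2).
result = run_iterative_query(
    initialize=lambda: 1.0,
    step=lambda x: (x + 2.0 / x) / 2.0,
    converged=lambda old, new: abs(old - new) < 1e-12,
    finalize=lambda x: x,
)
```

The update table plays the role of the loop variable; the step query is the loop body, and the convergence test decides termination.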
Example: PageRank
8
Example: k-means
9
Query Optimization
 Goal: remove redundant computations
 Question: What is redundant computation?
 Operations on unmodified attributes of tuples
 Operations on attributes of unmodified tuples
 OptIQ reuses partial results of step queries
 View materialization reuses operations on
unmodified attributes
 incrementalization reuses operations on unmodified
tuples
10
Query Optimization cont.
11
View materialization
 The purpose is to reuse the unmodified attributes
of the update table across iterations
 Procedure
1. Decompose update table into variant and
invariant tables by conservative analysis
2. Materialize sub-query in step query that only
accesses invariant table
3. Rewrite the step query to use the materialized
view, so query processing runs against the view
12
Table decomposition
 discriminate modified/unmodified attributes
 unmodified attribute: src, dest in Graph
 modified attribute: score in Graph
 decompose update table
 Graph’ = select VT.src, IT.dest, VT.score
from VT, IT
where VT.src = IT.src 13
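The decomposition on this slide can be mimicked with plain Python lists and dicts (a toy stand-in for the relational tables): Graph(src, dest, score) splits into the invariant table IT(src, dest) and the variant table VT(src, score), and Graph is recovered by joining them on src.

```python
# Sketch of table decomposition: Graph(src, dest, score) splits into an
# invariant table IT(src, dest) and a variant table VT(src, score).
# Toy data; not from the paper.

graph = [("a", "b", 0.5), ("a", "c", 0.5), ("b", "c", 1.0)]

IT = [(src, dest) for src, dest, _ in graph]        # unmodified attributes: never reshuffled
VT = {src: score for src, dest, score in graph}     # modified attribute: updated each iteration

# Graph' = select VT.src, IT.dest, VT.score from VT, IT where VT.src = IT.src
graph_rejoined = [(src, dest, VT[src]) for src, dest in IT]
```

Only VT changes between iterations, so the bulky IT half never needs to travel through the shuffle again.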
Example: PageRank
 Table decomposition
 Remove Graph’ table from query
 discriminate
14
simplification
Subquery lifting
 construct read-only (invariant) views
accessed by step queries
 extract loop-invariant computations by
using unmodified attributes
 Procedure
1. Constant let statement lifting (to initialize
clause)
2. Invariant subquery lifting (to initialize clause)
3. Common subquery elimination with query
rewrite, unnesting, identity query elimination
15
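Subquery lifting is the relational analogue of loop-invariant code motion. A minimal sketch (illustrative Python, not OptIQ output): the out-degree aggregate depends only on the invariant edge table, so it is materialized once in the initialize clause instead of being re-derived inside every iteration.

```python
# Loop-invariant "subquery" lifting sketched in plain Python.
# Toy edge list; names are illustrative.

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]

# Lifted to the initialize clause: out-degree reads only invariant data,
# so it is computed once, before the loop.
outdeg = {}
for src, _ in edges:
    outdeg[src] = outdeg.get(src, 0) + 1

score = {v: 1.0 for v in outdeg}
for _ in range(5):
    # Inside the loop, the step only reads the precomputed outdeg
    # instead of re-aggregating over the edge list each iteration.
    contrib = {}
    for src, dest in edges:
        contrib[dest] = contrib.get(dest, 0.0) + score[src] / outdeg[src]
    score = {v: contrib.get(v, 0.0) for v in score}
```

After lifting, the loop body touches only the variant score table, which is the precondition for the incrementalization step that follows.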
Example: PageRank
16
Invariant subquery lifting
Identity query elimination, VT = Score
Example: k-means
17
table decomposition
query elimination (for VT)
simplification
Automatic incrementalization
 Not all records are updated in every iteration.
The purpose is to reuse the unmodified tuples in
variant views.
 Procedure
1. Detect delta table between iterations before
starting 1st iteration.
2. Derive incremental queries. Both input and
output are delta tables.
3. Execute queries in incremental mode as much
as possible.
18
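The three steps above can be sketched as follows (a toy single-machine illustration with invented helper names, not OptIQ's engine): detect the delta of the variant table between iterations, then recompute only the rows the delta touches.

```python
# Toy sketch of incrementalization: propagate only the delta of the
# variant table through the step. Helper names are hypothetical.

def delta(old, new, eps=1e-9):
    """Delta table: tuples whose value actually changed between iterations."""
    return {k: v for k, v in new.items() if abs(old.get(k, float("inf")) - v) > eps}

def incremental_step(scores, changed, step_one):
    """Recompute only the rows affected by the delta; copy the rest."""
    out = dict(scores)
    for k in changed:
        out[k] = step_one(k, scores)
    return out

old = {"a": 0.2, "b": 0.3, "c": 0.5}
new = {"a": 0.2, "b": 0.35, "c": 0.45}
d = delta(old, new)                                   # only b and c changed
updated = incremental_step(new, d, lambda k, s: s[k] * 0.9)
```

As the deck notes, queries run in this incremental mode only "as much as possible": operators that do not distribute over the delta fall back to full evaluation.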
Delta table detection
 The delta table is detected easily, since we
have already identified the variant views:
 ΔT = T’ – T
 Update operations for update tables
 insertion
 deletion
 update
19
Deriving incremental queries
 There is a rich literature on incremental query
evaluation [9, 13, 19]
 We focus on incremental query
evaluation for update operations, since
they are frequent in iterative queries.
20
Deriving incremental queries
 Query:
where q is the step query, T the update table, ΔT the
delta table, and φ the termination condition
 Suppose q is distributed:
We obtain incremental query:
where ψ is an optional filter
21
Distribution rules
 Rules for relational operators
 selection
 projection
 join
 group-by
22
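For selection and projection, the distribution property is easy to see with Python sets standing in for relations: applying the operator to T ∪ ΔT equals applying it to T and ΔT separately and merging (join and group-by need extra bookkeeping, as the following slides show). This is an illustrative sketch, not OptIQ code.

```python
# Distribution of selection and projection over a delta, with Python
# sets standing in for relations. Toy data.

T = {(1, "x"), (2, "y"), (3, "z")}
dT = {(4, "y"), (5, "x")}

select = lambda rows: {r for r in rows if r[1] == "x"}   # selection sigma
project = lambda rows: {r[1] for r in rows}              # projection pi

# Both operators distribute over union:
assert select(T | dT) == select(T) | select(dT)
assert project(T | dT) == project(T) | project(dT)
```

Distributivity is exactly what lets the incremental query feed only ΔT through the operator and merge the result with the previous output.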
Example: PageRank
 Remember the query after lifting
 In algebraic form:
23
Example: PageRank
 This is re-written to:
24
Additional rules for group-by
 insertion/deletion rules for group-by
 sum, count: insertion and deletion
 max, min: only for insertion (not distributive for deletion)
25
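The asymmetry between sum/count and max/min can be shown in a few lines (a toy illustration, not OptIQ code): a running sum can absorb a deletion by subtracting the removed value, while deleting the current maximum may force a rescan of the whole group.

```python
# Why sum/count survive deletions but max/min do not. Toy group.

group = [3, 7, 7, 2]
running_sum, running_max = sum(group), max(group)

# insertion: both aggregates update incrementally
group.append(9)
running_sum += 9
running_max = max(running_max, 9)

# deletion: sum is still incremental ...
group.remove(2)
running_sum -= 2
# ... but max is not: once the maximum itself is deleted,
# the group must be rescanned to find the new one.
group.remove(9)
running_sum -= 9
running_max = max(group)     # full recomputation needed
```

This is why the slide restricts the incremental group-by rules for max/min to insertions only.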
MapReduce implementation
 We extend Hive for OptIQ
 Iterative query processing
 convergence is tested by joining old and new
update tables
 View materialization
 partition invariant views by group-by/join keys
for efficient group-by/join operations
 Incrementalization
 apply incrementalization as much as possible
 delta table is kept on DFS
 Putting MR design patterns together
26
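The convergence test described above (joining the old and new update tables) can be sketched as follows; this is a single-machine stand-in for the Hive join, with an assumed per-score threshold.

```python
# Sketch of the convergence test: join old and new update tables on the
# key and check every score moved less than a threshold. The threshold
# eps is an assumption for illustration.

def converged(old, new, eps=1e-4):
    joined = ((old[k], new[k]) for k in old.keys() & new.keys())
    return all(abs(a - b) < eps for a, b in joined)

assert converged({"a": 0.50, "b": 0.50}, {"a": 0.50005, "b": 0.49995})
assert not converged({"a": 0.5, "b": 0.5}, {"a": 0.6, "b": 0.4})
```

In the Hive implementation the same check is expressed as a join between the two tables followed by an aggregate over the per-tuple differences.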
Experiments
 Purpose
 How effective is OptIQ for real analyses?
 How much error is introduced by
incrementalization?
 Is OptIQ applicable to both MapReduce and Spark?
 Environment: 11 computers
 Workload
 Datasets: graph (Wikipedia, web graph),
multidimensional data (US census, mnist8m)
 Analysis: PageRank, RWR, k-means clustering
27
PageRank: performance
28
PageRank: convergence
29
k-means: performance
30
k-means: convergence
31
Related work
 Iterative MapReduce runtime system
 Twister: iterative MR computation
 Iterative MapReduce programming models
 HaLoop: manual view caching
 iMapReduce
 Spark: in-memory cluster computing for iterative
applications, manual optimization for map-side join
 Pregel: Bulk synchronous parallel model
 GraphLab: Distributed graph computation model
 PEGASUS: matrix multiplication model on MapReduce
32
Related work cont.
 Declarative MapReduce programming
 HiveQL and Pig : SQL on MapReduce
 HadoopDB: Integration of RDBMS and MapReduce
 MRQL: iterative query language, algebraic/MR-level
optimization; map fusion, join/group-by fusion
 Query optimization in MapReduce
 Comet: algebraic-level (shared selection, grouping,
time-spanned views) and MR-level sharing (shared
scan, shared shuffle)
 Ysmart: sharing among group-by and joins
 REX: explicit incremental computation
33
Conclusion
 OptIQ is an optimization framework for iterative
queries with a convergence property
 Two techniques for removing redundancy
 view materialization for invariant views
 incrementalization for variant views
 We implement it on Hive and Spark
 OptIQ improves performance by up to a
factor of five
34
Future work
 Apply OptIQ to other analyses: NMF, affinity
propagation, logistic regression
 adaptive and incremental evaluation techniques
for matrix computation, such as PageRank, NMF,
centrality computation
35
