Optimization for iterative queries on MapReduce
- 2. Demand for Big Data Analysis
Big Data Analysis
Cyber space: Click log, query log
Real space: shopping log, sensing data
Machine learning
Algorithm: classification, clustering
Data type: relation, vector, graph, time series
Distributed computing framework
Interface: MPI, MapReduce, BSP (bulk synchronous parallel)
- 3. Iterative analysis examples
Clustering
Partitioning: k-means, EM algorithm, affinity propagation
Hierarchical clustering: Ward's method, BIRCH
Matrix factorization
Graph mining
PageRank, Random walk with restarts
- 4. Running example: PageRank
This program is not efficient. Which parts?
The map function shuffles the whole graph structure in every iteration
Scores are computed even for nodes that have already converged
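The two inefficiencies above can be made concrete with a minimal sketch of naive MapReduce-style PageRank (a hypothetical illustration, not the program from the slide): the map step re-emits the whole adjacency list in every iteration, and every node's score is recomputed even after it has stabilized.

```python
def pagerank_map(node, adj, score):
    # Redundancy 1: the graph structure is re-shuffled every iteration
    yield node, ("graph", adj)
    # Distribute rank mass to neighbors
    for dest in adj:
        yield dest, ("rank", score / len(adj))

def pagerank_reduce(node, values, d=0.85):
    adj, rank_sum = [], 0.0
    for kind, v in values:
        if kind == "graph":
            adj = v
        else:
            rank_sum += v
    return node, adj, (1 - d) + d * rank_sum

def pagerank(graph, scores, iterations):
    # graph: {node: [dest, ...]}, scores: {node: score}
    # Redundancy 2: every node is recomputed in every iteration
    for _ in range(iterations):
        shuffled = {}
        for node, adj in graph.items():
            for key, val in pagerank_map(node, adj, scores[node]):
                shuffled.setdefault(key, []).append(val)
        for node, values in shuffled.items():
            n, _, s = pagerank_reduce(node, values)
            scores[n] = s
    return scores
```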
- 5. Issues for iterative analysis
How can we optimize the program?
Reuse the intermediate (shuffled) data
Skip computing the scores of converged nodes
Removing these redundant computations manually is possible but difficult
In fact, Spark, HaLoop, and REX force programmers to remove them manually
Our goal: automatically remove redundant computations for iterative queries
- 6. Overview
OptIQ is a new optimization framework for iterative queries with a convergence property
Declarative high-level language: programmers are freed from the burden of removing redundancy
OptIQ integrates traditional optimization techniques from the database and compiler areas
Two techniques for removing redundancy
view materialization for invariant views
incrementalization for variant views
We implement OptIQ on Hive and Spark
- 7. Iterative query language
SQL extended with iteration
Syntax
Behavior
initialize: statements executed before the iteration
update table is repeatedly updated by the step query until convergence
return: statements executed after the iteration
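The initialize/step/return behavior can be sketched as a generic driver loop (a hypothetical illustration of the execution model, not OptIQ's implementation; the function names are made up). The update table is initialized, repeatedly replaced by the step query's result, and the return statement runs once convergence is detected.

```python
def run_iterative_query(initialize, step, converged, final):
    # initialize clause: build the initial update table
    table = initialize()
    while True:
        # step query: produce the next version of the update table
        new_table = step(table)
        # convergence test on old vs. new update table
        if converged(table, new_table):
            break
        table = new_table
    # return clause: post-iteration statements
    return final(new_table)

# Toy usage: iterate x -> (x + 2/x) / 2 until stable (Newton's method for sqrt(2))
result = run_iterative_query(
    initialize=lambda: 1.0,
    step=lambda t: (t + 2.0 / t) / 2.0,
    converged=lambda old, new: abs(old - new) < 1e-12,
    final=lambda t: t,
)
```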
- 10. Query Optimization
Goal: remove redundant computations
Question: What is redundant computation?
Operations on unmodified attributes of tuples
Operations on attributes of unmodified tuples
OptIQ reuses partial results of step queries
View materialization reuses operations on unmodified attributes
Incrementalization reuses operations on unmodified tuples
- 12. View materialization
Purpose: reuse the unmodified attributes of the update table across iterations
Procedure
1. Decompose the update table into variant and invariant tables by conservative analysis
2. Materialize the sub-query in the step query that accesses only the invariant table
3. Rewrite the step query to use the materialized view, and process the query using the view
- 13. Table decomposition
Discriminate modified/unmodified attributes
unmodified attributes: src, dest in Graph
modified attribute: score in Graph
Decompose the update table
Graph’ = select src, IT.dest, VT.score
from VT, IT
where VT.src = IT.src
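The decomposition can be sketched in a few lines (a hypothetical illustration with made-up helper names): the update table Graph(src, dest, score) is split into an invariant table IT(src, dest) holding the structure and a variant table VT(src, score) holding the scores, and joining them on src reconstructs Graph.

```python
def decompose(graph_rows):
    # graph_rows: [(src, dest, score), ...]
    it = sorted({(src, dest) for src, dest, _ in graph_rows})   # invariant: structure
    vt = sorted({(src, score) for src, _, score in graph_rows}) # variant: scores
    return it, vt

def recompose(it, vt):
    # the join VT.src = IT.src from the slide, as a dictionary lookup
    scores = dict(vt)
    return sorted((src, dest, scores[src]) for src, dest in it)

graph = [(1, 2, 0.5), (1, 3, 0.5), (2, 1, 1.0)]
it, vt = decompose(graph)
```

Only VT changes between iterations; IT can be shuffled once and reused.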
- 15. Subquery lifting
Construct read-only (invariant) views accessed by step queries
Extract loop-invariant computations by using unmodified attributes
Procedure
1. Constant let-statement lifting (to the initialize clause)
2. Invariant subquery lifting (to the initialize clause)
3. Common subquery elimination with query rewrite, unnesting, and identity query elimination
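A sketch of what invariant subquery lifting buys for PageRank (hypothetical code, not OptIQ's output): the out-degree of each node depends only on the invariant table IT(src, dest), so it can be computed once before the loop and read as a materialized view by every step query, instead of being recomputed each iteration.

```python
from collections import Counter

def lift_out_degree(it_edges):
    # lifted to the initialize clause: a materialized view src -> out-degree
    return Counter(src for src, _ in it_edges)

def step(scores, it_edges, out_degree, d=0.85):
    # the step query reads the materialized view instead of recomputing it
    incoming = {}
    for src, dest in it_edges:
        incoming[dest] = incoming.get(dest, 0.0) + scores[src] / out_degree[src]
    return {n: (1 - d) + d * incoming.get(n, 0.0) for n in scores}

edges = [(1, 2), (2, 1)]
deg = lift_out_degree(edges)          # computed once, before the iteration
scores = {1: 1.0, 2: 1.0}
for _ in range(10):
    scores = step(scores, edges, deg)
```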
- 18. Automatic incrementalization
Not all records are updated in each iteration.
Purpose: reuse the unmodified tuples in variant views.
Procedure
1. Detect the delta table between iterations before starting the 1st iteration.
2. Derive incremental queries whose input and output are both delta tables.
3. Execute queries in incremental mode as much as possible.
- 19. Delta table detection
The delta table is detected easily, since we have already identified the variant views.
ΔT = T’ – T
Update operations for update tables
insertion
deletion
update
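Delta detection on a keyed variant view can be sketched as follows (a hypothetical illustration of ΔT = T’ – T, classifying the three update operations above):

```python
def delta(old, new):
    # old, new: {key: value} snapshots of a variant view in two iterations
    inserts = {k: v for k, v in new.items() if k not in old}
    deletes = {k: v for k, v in old.items() if k not in new}
    updates = {k: (old[k], new[k]) for k in old.keys() & new.keys()
               if old[k] != new[k]}
    return inserts, deletes, updates

old = {1: 0.5, 2: 1.0, 3: 0.2}
new = {1: 0.5, 2: 0.9, 4: 0.1}
ins, dels, upds = delta(old, new)
# node 2 updated, node 3 deleted, node 4 inserted; node 1 unchanged
```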
- 20. Deriving incremental queries
There is a large literature on incremental query evaluation [9, 13, 19]
We focus on incremental query evaluation for update operations, since they are frequent in iterative queries.
- 21. Deriving incremental queries
Query: repeat T ← q(T) until the termination condition φ holds,
where q is the step query, T the update table, and ΔT the delta table
Suppose q is distributive: q(T ∪ ΔT) = q(T) ∪ q(ΔT)
Then only q(ΔT), optionally filtered by ψ, needs to be evaluated in each iteration,
where ψ is an optional filter
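The distributivity condition can be sketched concretely (hypothetical code; `q` here is just an example per-tuple step query, not the PageRank query): if q(T ∪ ΔT) = q(T) ∪ q(ΔT), the cached result q(T) can be reused and only the delta needs processing.

```python
def q(rows):
    # an example distributive step query: a per-tuple projection
    return {(src, score * 0.85) for src, score in rows}

def incremental_q(cached_qT, delta_rows):
    # evaluate q only on the delta and merge with the cached result
    return cached_qT | q(delta_rows)

T = {(1, 1.0), (2, 1.0)}
dT = {(3, 0.5)}
assert incremental_q(q(T), dT) == q(T | dT)  # same answer, less work
```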
- 25. Additional rules for group-by
Insertion/deletion rules for group-by aggregates
sum, count: both insertion and deletion
max, min: insertion only (not distributive for deletion)
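A sketch of why the rules differ (hypothetical classes for illustration): sum is invertible, so a deletion is just a subtraction, while max can absorb insertions but not deletions, because removing the current maximum leaves the new maximum unknown without rescanning the group.

```python
class IncSum:
    def __init__(self):
        self.value = 0.0
    def insert(self, x):
        self.value += x
    def delete(self, x):
        self.value -= x  # sum is invertible: deletion undoes insertion

class IncMax:
    def __init__(self):
        self.value = float("-inf")
    def insert(self, x):
        self.value = max(self.value, x)  # max absorbs insertions
    # no delete(): removing the current max would require a full rescan

s = IncSum()
for x in [3.0, 5.0, 2.0]:
    s.insert(x)
s.delete(5.0)      # incremental deletion: 3.0 + 2.0 = 5.0

m = IncMax()
for x in [3.0, 5.0, 2.0]:
    m.insert(x)    # running maximum is 5.0
```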
- 26. MapReduce implementation
We extend Hive to implement OptIQ
Iterative query processing
convergence is tested by joining the old and new update tables
View materialization
partition invariant views by group-by/join keys for efficient group-by/join operations
Incrementalization
apply incrementalization as much as possible
the delta table is kept on DFS
Putting MapReduce design patterns together
- 27. Experiments
Purpose
How effective is OptIQ for real analyses?
How much error is introduced by incrementalization?
Is OptIQ applicable to both MapReduce and Spark?
Environment: 11 computers
Workload
Datasets: graphs (Wikipedia, web graph), multidimensional data (US census, mnist8m)
Analyses: PageRank, RWR, k-means clustering
- 32. Related work
Iterative MapReduce runtime systems
Twister: iterative MR computation
Iterative MapReduce programming models
HaLoop: manual view caching
iMapReduce
Spark: in-memory cluster computing for iterative applications, manual optimization for map-side join
Pregel: bulk synchronous parallel model
GraphLab: distributed graph computation model
PEGASUS: matrix multiplication model on MapReduce
- 33. Related work cont.
Declarative MapReduce programming
HiveQL and Pig: SQL on MapReduce
HadoopDB: integration of RDBMS and MapReduce
MRQL: iterative query language, algebraic/MR-level optimization; map fusion, join/group-by fusion
Query optimization in MapReduce
Comet: algebraic-level (shared selection, grouping, time-spanned views) and MR-level sharing (shared scan, shared shuffle)
YSmart: sharing among group-bys and joins
REX: explicit incremental computation
- 34. Conclusion
OptIQ is an optimization framework for iterative queries with a convergence property
Two techniques for removing redundancy
view materialization for invariant views
incrementalization for variant views
We implement OptIQ on Hive and Spark
OptIQ improves performance by up to a factor of five
- 35. Future work
Apply OptIQ to other analyses: NMF, affinity propagation, logistic regression
Adaptive and incremental evaluation techniques for matrix computations, such as PageRank, NMF, and centrality computation