Strata NYC 2015: Sketching Big Data with Spark: randomized algorithms for large-scale data analytics
- 2. About Databricks
Founded by creators of Spark in 2013
Cloud service for end-to-end data processing
• Interactive notebooks, dashboards,
and production jobs
We are hiring!
- 8. “Spark is the Taylor Swift
of big data software.”
- Derrick Harris, Fortune
- 9. Who is this guy?
Co-founder & architect for Spark at Databricks
Former PhD student at UC Berkeley AMPLab
A “systems” guy, which means I won’t be showing equations, and this talk
might be the easiest one to consume in HDS
- 10. This talk
1. Develop intuitions about these sketches so you know when to use them
2. Understand how certain parts of distributed data processing systems (e.g.
Spark) work
- 12. Sketch: Reynold’s not-so-scientific definition
1. Uses a small amount of space to summarize a large dataset
2. Goes over each data point once, a.k.a. a “streaming algorithm” or
“online algorithm”
3. Parallelizable, with only a small amount of communication
- 14. Sketches in Spark
Set membership (Bloom filter)
Cardinality (HyperLogLog)
Histogram (count-min sketch)
Frequent pattern mining
Frequent items
Stratified Sampling
…
- 15. This Talk
Set membership (Bloom filter)
Cardinality (HyperLogLog)
Histogram (count-min sketch)
Frequent pattern mining
Frequent items
Stratified Sampling
…
- 18. Exact set membership
Track every member of the set
• Space: size of data
• One pass: yes
• Parallelizable & communication: size of data
- 19. Approximate set membership
Take 1. Use a 32-bit integer hash map to track
• ~4 bytes per record
• Max 4 billion items
Take 2. Hash items to 256 buckets
• Memory usage only 256 bits
• Good if num records is small
• Bad if num records is large (with more than 256 items, a collision is guaranteed)
- 20. Bloom filter
Bloom filter algorithm
• k hash functions
• hash item into k separate positions
• if any of the k positions is not set, then item is not in set
Properties
• ~500MB needed to have 10% error rate on 1 billion items
• See http://hur.st/bloomfilter?n=1000000000&p=0.1
• False positives possible
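The k-hash-function scheme above can be sketched in a few lines of Python. The bit-array size, the number of hashes, and the salted-SHA-256 trick are illustrative choices for this sketch, not Spark's implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: an m-bit array and k hash functions."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k positions by salting the item and hashing with SHA-256.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False => definitely not in the set; True => probably in the set.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for word in ["spark", "sketch", "strata"]:
    bf.add(word)
```

If any of the k positions for an item is unset, the item was never added; if all are set, the item is probably present, with a false-positive rate that shrinks as m grows.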
- 21. Use case beyond exploration
SELECT * FROM A JOIN B ON A.key = B.key
1. Assume A and B are both large, i.e. “shuffle join”
2. Some rows in A might not have matched rows in B
3. Wouldn’t it be nice if we only needed to shuffle the rows that match?
Answer: use a Bloom filter to drop the rows that can’t match
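A toy illustration of the idea, with made-up tables A and B, and a plain Python set standing in for the broadcast Bloom filter of B's keys:

```python
# Idea: before shuffling A, consult a compact summary of B's keys and
# drop A-rows that cannot possibly match. A plain set stands in for the
# Bloom filter here; the tables are hypothetical.
A = [("k1", "a1"), ("k2", "a2"), ("k3", "a3"), ("k4", "a4")]
B = [("k2", "b2"), ("k4", "b4")]

b_keys = {k for k, _ in B}  # in practice: a Bloom filter over B.key
a_filtered = [(k, v) for k, v in A if k in b_keys]

# Only the surviving rows of A need to be shuffled and joined.
joined = [(k, va, vb) for k, va in a_filtered for k2, vb in B if k == k2]
```

With a real Bloom filter, false positives only mean a few non-matching rows still get shuffled; the join result stays exact.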
- 25. 4,474
3,146
2,352
1,749
1,2931,248
1,1071,0941,065
907 835 793 789 737
598 582 517 482 447 444 420 409 409 405 400 381 378 369 367 366
0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
Twitterfollowersinthousands
Twitter Followers of NBA teams (in 1,000s), September 2015
Source: http://www.statista.com/statistics/240386/twitter-followers-of-national-basketball-association-teams/
- 27. Frequent Items: Exact Algorithm
SELECT item, count(*) AS cnt
FROM corpus
GROUP BY item
HAVING cnt > k * total_cnt
(total_cnt is the total number of records; computing it is why a second pass is needed)
• Space: linear in |item|
• One pass: no (two passes)
• Parallelizable & communication: linear in |item|
- 40. How do we implement this?
Maintain a hash table of counts
- 46. When the hash table has k items,
remove 1 from each item and
remove the item if count = 0
4 => 3
1 => 0
- 52. Implementation
Maintains a hash table of counts
• For each item, increment its count
• If hash table size == k:
– decrement 1 from each item; and
– remove items whose count == 0
Parallelization: merge hash tables of max size k
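The algorithm described above (often called Misra–Gries) might look like this in Python; the merge step is a simplified illustration of combining two size-bounded tables:

```python
def freq_items(stream, k):
    """Approximate frequent items with at most k - 1 surviving counters.
    Any item occurring more than len(stream)/k times is guaranteed to
    remain in the table (its count is a lower-bound estimate)."""
    counts = {}
    for item in stream:
        counts[item] = counts.get(item, 0) + 1
        if len(counts) == k:
            # Table full: decrement every count, drop items that hit zero.
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return counts

def merge(c1, c2, k):
    """Parallelization sketch: sum two tables, then shrink back below k
    entries by repeating the decrement step."""
    merged = dict(c1)
    for item, c in c2.items():
        merged[item] = merged.get(item, 0) + c
    while len(merged) >= k:
        for key in list(merged):
            merged[key] -= 1
            if merged[key] == 0:
                del merged[key]
    return merged
```

For example, with k = 2 this degenerates into the classic majority-element algorithm: only an item occurring in more than half the stream can survive.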
- 53. Comparing Exact vs Approximate
                 Naïve Exact    Sketch
  # Passes       2              1
  Memory         |item|         k
  Communication  |item|         k
- 54. Comparing Exact vs Approximate
                 Naïve Exact    Sketch    Smart Exact
  # Passes       2              1         2 (1st pass using sketch)
  Memory         |item|         k         k
  Communication  |item|         k         k
- 56. How to use it in Spark?
Frequent items for multiple columns independently
• df.stat.freqItems(["columnA", "columnB", …])
Frequent items for composite keys
• df.stat.freqItems(struct("columnA", "columnB"))
- 58. Bernoulli sampling & Variance
Sample US population (300m) using rate 0.000002 (~600)
• Wyoming (0.5m) should have 1
• Bernoulli sampling likely leads to Wyoming having 0
Intuition: uniform sampling yields ~600 samples in expectation
• i.e. it might be 600, or 601, or 599, or …
• The impact on WY of going from 600 to 601 is much larger than the impact on CA
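A quick back-of-envelope check of the Wyoming claim: at rate p = 0.000002, the chance that none of Wyoming's ~500,000 residents is sampled is (1 - p)^500000, which is approximately e^-1, i.e. about 37%:

```python
import math

p = 0.000002   # sampling rate: ~600 people out of 300M
wy = 500_000   # Wyoming population

prob_zero = (1 - p) ** wy  # P(no Wyoming resident is sampled)
expected = p * wy          # expected number of Wyoming samples (~1)
```

So even though Wyoming "should" contribute one sample, plain Bernoulli sampling misses the state entirely more than a third of the time.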
- 60. Random sort
Example: sampling probability p = 0.1 on 100 items.
1. Generate random keys
• (0.644, t1), (0.378, t2), … (0.500, t99), (0.471, t100)
2. Sort and select the smallest 10 items
• (0.028, t94), (0.029, t44), …, (0.137, t69), …, (0.980, t26), (0.988, t60)
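The two steps above, sketched in Python. The item names t1..t100 and the rate p = 0.1 are from the example; the seed is arbitrary:

```python
import random

random.seed(42)
items = [f"t{i}" for i in range(1, 101)]  # t1 .. t100
p = 0.1

# 1. Pair each item with a uniform random key in [0, 1).
keyed = [(random.random(), t) for t in items]

# 2. Sort by key and keep the p * n items with the smallest keys.
sample = [t for _, t in sorted(keyed)[: round(p * len(items))]]
```

Because every item's key is an independent uniform draw, the smallest 10 keys pick a uniform sample of 10 items without replacement.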
- 61. Heuristics
Qualitatively speaking
• If u is “much larger” than p, then t is “unlikely” to be selected
• If u is “much smaller” than p, then t is “likely” to be selected
Set two thresholds q1 and q2, such that:
• If u < q1, accept t directly
• If u > q2, reject t directly
• Otherwise, put t in a buffer to be sorted
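A sketch of the two-threshold heuristic in Python. The thresholds q1 and q2 here are picked by eye around p; the values Spark actually uses come from the concentration bounds in the paper cited below:

```python
import random

def sample_with_thresholds(items, p, q1, q2):
    """Accept u < q1 outright, reject u > q2 outright; only the narrow
    band q1 <= u <= q2 is buffered and resolved by exact sorting."""
    accepted, buffered = [], []
    for t in items:
        u = random.random()
        if u < q1:
            accepted.append(t)
        elif u <= q2:
            buffered.append((u, t))
    # Top up from the buffer (smallest keys first) to reach p * n exactly.
    need = round(p * len(items)) - len(accepted)
    accepted += [t for _, t in sorted(buffered)[:max(need, 0)]]
    return accepted

random.seed(7)
sample = sample_with_thresholds(list(range(10_000)), p=0.1, q1=0.09, q2=0.11)
```

Only the small buffer (expected size (q2 - q1) * n) ever needs sorting, which is what keeps memory and communication low.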
- 62. Spark’s stratified sampling algorithm
Combines “exact” and “sketch” to achieve parallelization & low
memory overhead
df.stat.sampleByKeyExact(col, fractions, seed)
Xiangrui Meng. Scalable Simple Random Sampling and Stratified Sampling. ICML 2013.
- 63. This Talk
Set membership (Bloom filter)
Cardinality (HyperLogLog)
Histogram (count-min sketch)
Frequent pattern mining
Frequent items
Stratified Sampling
…
- 64. Conclusion
Sketches are useful in exploration and feature engineering, as well as
for building faster exact algorithms.
We are building a lot of these into Spark so you don’t need to
reinvent the wheel!