introduction to big data
Albert Bifet (@abifet)
Paris, 7 October 2015
albert.bifet@telecom-paristech.fr
who am i
• Associate Professor at Télécom ParisTech  
• I work on data stream mining algorithms and systems
• MOA: Massive Online Analysis
• Apache SAMOA: Scalable Advanced Massive Online Analysis
• PhD: UPC BarcelonaTech, 2009
• Previous affiliations:
• University of Waikato (New Zealand)
• Yahoo! Labs (Barcelona)
• Huawei (Hong Kong)
1
big data
BIG DATA are data sets so large or complex that
traditional data processing applications cannot
deal with them.
BIG DATA is an OPEN SOURCE Software Revolution.
2
digital universe
EMC Digital Universe with Research & Analysis by
IDC
The Digital Universe of Opportunities: Rich
Data and the Increasing Value of the Internet
of Things
April 2014
3
digital universe
Figure 1: EMC Digital Universe, 2014
4
digital universe
Memory unit Size Binary size
kilobyte (kB/KB) 10^3 2^10
megabyte (MB) 10^6 2^20
gigabyte (GB) 10^9 2^30
terabyte (TB) 10^12 2^40
petabyte (PB) 10^15 2^50
exabyte (EB) 10^18 2^60
zettabyte (ZB) 10^21 2^70
yottabyte (YB) 10^24 2^80
5
digital universe
Figure 2: EMC Digital Universe, 2014
6
digital universe
Figure 3: EMC Digital Universe, 2014
7
digital universe
Figure 4: EMC Digital Universe, 2014
8
digital universe
Figure 5: EMC Digital Universe, 2014
9
big data 6v’s
• Volume
• Variety
• Velocity
• Value
• Variability
• Veracity
10
controversy of big data
• All data is BIG now
• Hype to sell Hadoop-based systems
• Ethical concerns about accessibility
• Limited access to Big Data creates new digital divides
• Statistical significance:
• When the number of variables grows, the number of spurious
correlations also grows (Leinweber: the S&P 500 stock index
correlated with butter production in Bangladesh)
11
future challenges for big data
• Evaluation
• Time evolving data
• Distributed mining
• Compression
• Visualization
• Hidden Big Data
12
big data ecosystem
13
batch and streaming engines
Figure 6: Batch, streaming and hybrid data processing engines.
14
motivation mapreduce
how many servers does google have?
Figure 7: Asking Google
16
a google server room
Figure 8:
https://www.youtube.com/watch?t=3&v=avP5d16wEp0
17
typical big data challenges
• How do we break up a large problem into smaller tasks that
can be executed in parallel?
• How do we assign tasks to workers distributed across a
potentially large number of machines?
• How do we ensure that the workers get the data they need?
• How do we coordinate synchronization among the different
workers?
• How do we share partial results from one worker that is
needed by another?
• How do we accomplish all of the above in the face of
software errors and hardware faults?
18
google 2004
There was a need for an abstraction that hides many
system-level details from the programmer.
MapReduce addresses this challenge by providing a
simple abstraction for the developer, transparently
handling most of the details behind the scenes in a
scalable, robust, and efficient manner.
19
jeff dean
MapReduce, BigTable, Spanner
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat. OSDI’04: Sixth
Symposium on Operating System Design and Implementation
20
jeff dean facts
Google Culture Facts
“When Jeff Dean designs software, he first codes the binary
and then writes the source as documentation.”
21
jeff dean facts
Google Culture Facts
“Jeff Dean compiles and runs his code before submitting, but
only to check for compiler and CPU bugs.”
21
jeff dean facts
Google Culture Facts
“The rate at which Jeff Dean produces code jumped by a factor
of 40 in late 2000 when he upgraded his keyboard to USB2.0.”
21
jeff dean facts
Google Culture Facts
“The speed of light in a vacuum used to be about 35 mph.
Then Jeff Dean spent a weekend optimizing physics.”
21
jeff dean facts
Google Culture Facts
“Compilers don’t warn Jeff Dean. Jeff Dean warns compilers.”
21
mapreduce
references
23
numbers everyone should know (jeff dean)
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lock/unlock 100 ns
Main memory reference 100 ns
Compress 1K bytes with Zippy 10,000 ns
Send 2K bytes over 1 Gbps network 20,000 ns
Read 1 MB sequentially from memory 250,000 ns
Round trip within same datacenter 500,000 ns
Disk seek 10,000,000 ns
Read 1 MB sequentially from network 10,000,000 ns
Read 1 MB sequentially from disk 30,000,000 ns
Send packet CA → Netherlands → CA 150,000,000 ns
24
typical big data problem
• Iterate over a large number of records
• Extract something of interest from each
• Shuffle and sort intermediate results
• Aggregate intermediate results
• Generate final output
25
typical big data problem
• Iterate over a large number of records
• Extract something of interest from each –MAP–
• Shuffle and sort intermediate results
• Aggregate intermediate results –REDUCE–
• Generate final output
25
functional programming
Figure 9: Map as a transformation function and Fold as an
aggregation function
26
map and reduce functions
• In MapReduce, the programmer defines the program logic as
two functions:
• map: (k1,v1) → list[(k2,v2)]
• Map transforms the input into key-value pairs to process
• reduce: (k2,list[v2]) → list[(k3,v3)]
• Reduce aggregates the list of values for each key
• The MapReduce environment takes care of the distribution
aspects.
• A complex program can be decomposed as a succession of
Map and Reduce tasks
27
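In Hadoop's Java API, these two signatures correspond to the four generic type
parameters of the Mapper and Reducer classes. A minimal sketch of this
correspondence (the concrete types LongWritable/Text/IntWritable are chosen
here only for illustration; bodies are omitted):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: (k1,v1) → list[(k2,v2)]
// type parameters: <input key, input value, intermediate key, intermediate value>
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  // override map(k1, v1, context) here and emit (k2, v2) pairs via context.write(...)
}

// reduce: (k2,list[v2]) → list[(k3,v3)]
// type parameters: <intermediate key, intermediate value, output key, output value>
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  // override reduce(k2, Iterable<v2>, context) here and emit (k3, v3) pairs
}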
simplified view of mapreduce
Figure 10: Two-stage processing structure
28
an example application: word count
Input Data
foo.txt: Sweet, this is the foo file
bar.txt: This is the bar file
Output Data
sweet 1
this 2
is 2
the 2
foo 1
bar 1
file 2
29
wordcount example
1: class Mapper
2: method Map(docid a,doc d)
3: for all term t ∈ doc d do
4: Emit(term t,count 1)
5: end for
6: end method
7: end class
1: class Reducer
2: method Reduce(term t,counts [c1,c2,...])
3: sum ← 0
4: for all count c ∈ counts [c1,c2,...] do
5: sum ← sum+c
6: end for
7: Emit(term t,count sum)
8: end method
9: end class
30
simple mapreduce variations
No Reducers
Each mapper's output is written directly to a file on
disk
No Mappers
Not possible!
Identity Function Mappers
Sorting and regrouping the input data
Identity Function Reducers
Sorting and regrouping the data from mappers
31
mapreduce framework
Figure 11: Runtime Framework
32
mapreduce framework
• Handles scheduling
• Assigns workers to map and reduce tasks
• Handles “data distribution”
• Moves processes to data
• Handles synchronization
• Gathers, sorts, and shuffles intermediate data
• Handles errors and faults
• Detects worker failures and restarts failed tasks
• Everything happens on top of a distributed filesystem
33
fault tolerance
The Master periodically checks the availability and reachability
of the tasktrackers (heartbeats) and whether map or reduce
tasks make any progress
• if a mapper fails, its task is reassigned to another tasktracker
• if a reducer fails, its task is reassigned to another
tasktracker; this usually requires restarting mapper tasks as
well (to reproduce the intermediate groups)
• if the jobtracker fails, the whole job should be re-initiated
Speculative execution: schedule redundant copies of the
remaining tasks across several nodes
34
complete mapreduce framework
35
partitioners and combiners
Partitioners
Divide up the intermediate key space and assign intermediate
key-value pairs to reducers: “simple hash of the key”
partition: (k, number of partitions) → partition for k
Combiners
An optimization in MapReduce that allows for local aggregation
before the shuffle and sort phase: “mini-reducers”
combine: (k2,list[v2]) → list[(k3,v3)]
Combiners run in memory, and their goal is to reduce network traffic.
36
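As a concrete sketch of the default behaviour described above (not taken from
the slides), a partitioner that hashes the key can be written as follows;
Hadoop's own HashPartitioner does essentially this:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// "simple hash of the key": map every intermediate key to one of the reduce partitions
public class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // mask the sign bit so the hash is non-negative, then take it modulo the partition count
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

A combiner, in turn, is simply a Reducer class registered with
job.setCombinerClass(...), as in the word-count driver shown later.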
apache hadoop
origins of apache hadoop
• Hadoop was created by Doug Cutting (Apache Lucene) when
he was building Apache Nutch, an open source web search
engine.
• Cutting was an employee of Yahoo!, where he led the
Hadoop project.
• The name comes from his son's favorite stuffed elephant.
38
differences between hadoop mapreduce and google mapreduce
• In Hadoop MapReduce, the list of values that arrives at the
reducers is not ordered. In Google MapReduce it is possible
to specify a secondary sort key for ordering the values.
• In Google MapReduce reducers, the output key must be
the same as the input key. Hadoop MapReduce reducers can
output key-value pairs whose keys differ from the
input key.
• In Google MapReduce, mappers output to combiners; in
Hadoop MapReduce, mappers output to partitioners.
39
what is apache hadoop?
The Apache Hadoop project develops open-source software
for reliable, scalable, distributed computing.
It includes these modules:
• Hadoop Common: The common utilities that support the
other Hadoop modules.
• Hadoop Distributed File System (HDFS): A distributed file
system that provides high-throughput access to application
data.
• Hadoop YARN: A framework for job scheduling and cluster
resource management.
• Hadoop MapReduce: A YARN-based system for parallel
processing of large data sets.
40
hadoop v2
Figure 13: Apache Hadoop NextGen MapReduce (YARN)
41
apache hadoop nextgen mapreduce (yarn)
Figure 14: MRv2 splits up the two major functionalities of the
JobTracker, resource management and job scheduling/monitoring,
into separate daemons. An application is either a single job in the
classical sense of Map-Reduce jobs or a DAG of jobs.
42
apache hadoop nextgen mapreduce (yarn)
In YARN, the ResourceManager has two main components:
• The Scheduler: responsible for allocating resources to the
various running applications, subject to familiar constraints
of capacities, queues, etc.
• The ApplicationsManager: responsible for accepting
job submissions, negotiating the first container for
executing the application-specific ApplicationMaster, and
providing the service for restarting the ApplicationMaster
container on failure.
43
the hadoop distributed file system hdfs
Assumptions and Goals
• Hardware Failure
• Streaming Data Access
• Large Data Sets
• Simple Coherency Model (write-once-read-many access
model)
• “Moving Computation is Cheaper than Moving Data”
• Portability Across Heterogeneous Hardware and Software
Platforms
44
the distributed file system
Figure 15: Distributed File System Architecture
45
the distributed file system
Figure 16: Block Replication
46
an example application: word count
Input Data
foo.txt: Sweet, this is the foo file
bar.txt: This is the bar file
Output Data
sweet 1
this 2
is 2
the 2
foo 1
bar 1
file 2
47
wordcount example
1: class Mapper
2: method Map(docid a,doc d)
3: for all term t ∈ doc d do
4: Emit(term t,count 1)
5: end for
6: end method
7: end class
1: class Reducer
2: method Reduce(term t,counts [c1,c2,...])
3: sum ← 0
4: for all count c ∈ counts [c1,c2,...] do
5: sum ← sum+c
6: end for
7: Emit(term t,count sum)
8: end method
9: end class
48
mapper java code
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // tokenize the input line and emit <word, 1> for every token
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
49
reducer java code
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values,
                     Context context)
      throws IOException, InterruptedException {
    // sum all the partial counts received for this word
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
50
driver java code
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);   // the reducer also serves as a combiner
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
51
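To run this job one typically packages the three classes (the standard
WordCount example, with the usual Hadoop imports) into a jar, copies the input
into HDFS, and submits the job from the command line. A sketch, where the
paths and jar name are placeholders:

hdfs dfs -put foo.txt bar.txt /user/me/wordcount/input
hadoop jar wordcount.jar WordCount /user/me/wordcount/input /user/me/wordcount/output
hdfs dfs -cat /user/me/wordcount/output/part-r-00000

The output directory must not exist before the job runs; each reducer writes
one part-r-xxxxx file into it.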
hadoop mapreduce data flow
Figure 17: High-level MapReduce pipeline
52
hadoop mapreduce data flow
Figure 18: Detailed Hadoop MapReduce data flow
53
hadoop mapreduce data flow
Figure 19: Combiner step inserted into the MapReduce data flow
54
mapreduce algorithms
simple mapreduce algorithms
Distributed Grep
• Grep: reports matching lines on input files
• Split all files across the nodes
• Map: emits a line if it matches the specified pattern
• Reduce: identity function
Count of URL Access Frequency
• Processing logs of web access
• Map: outputs <URL,1>
• Reduce: adds the counts together and outputs <URL, Total Count>
56
simple mapreduce algorithms
Reverse Web-Link Graph
• Computes, for each target URL, the list of source pages that link to it
• Map: outputs <target,source>
• Reduce: Concatenates together and outputs <target,
list(source)>
Inverted Index
• Build an inverted index
• Map: emits a sequence of <word, docID>
• Reduce: outputs <word, list(docID)>
57
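A minimal Java sketch of the inverted-index job (assuming the document
identifier is taken from the input file name; class and field names are
illustrative, and duplicate docIDs are not removed):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {
  public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text word = new Text();
    private final Text docId = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // use the name of the input file as the document identifier
      docId.set(((FileSplit) context.getInputSplit()).getPath().getName());
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, docId);                   // emit <word, docID>
      }
    }
  }

  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      StringBuilder postings = new StringBuilder();
      for (Text docId : values) {                     // concatenate the document list
        if (postings.length() > 0) postings.append(",");
        postings.append(docId.toString());
      }
      context.write(key, new Text(postings.toString()));  // output <word, list(docID)>
    }
  }
}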
joins in mapreduce
Two datasets, A and B, that we need to join in a MapReduce
task
• If one of the datasets is small, it can be sent over in full to each
tasktracker and exploited inside the map (and possibly
reduce) functions; a sketch of this map-side join follows below
• Otherwise, each dataset should be grouped according to the
join key, and the result of the join can be computed in the
reduce function
58
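A sketch of the first case, a map-side join where the small dataset B fits in
memory on every tasktracker (the file name, record layout and field positions
are assumptions made only for this illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: dataset B is loaded fully into memory in setup(),
// and every record of the large dataset A is joined inside map().
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> small = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // assumed layout: one "key,value" line per record of B,
    // shipped to every node (for instance via the distributed cache)
    try (BufferedReader in = new BufferedReader(new FileReader("small_dataset.csv"))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split(",", 2);
        small.put(parts[0], parts[1]);
      }
    }
  }

  @Override
  protected void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    String[] parts = record.toString().split(",", 2);   // assumed layout of dataset A
    String match = small.get(parts[0]);                 // look up the join key
    if (match != null) {
      context.write(new Text(parts[0]), new Text(parts[1] + "," + match));
    }
  }
}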
wordcount example revisited
1: class Mapper
2: method Map(docid a,doc d)
3: for all term t ∈ doc d do
4: Emit(term t,count 1)
5: end for
6: end method
7: end class
1: class Reducer
2: method Reduce(term t,counts [c1,c2,...])
3: sum ← 0
4: for all count c ∈ counts [c1,c2,...] do
5: sum ← sum+c
6: end for
7: Emit(term t,count sum)
8: end method
9: end class
59
wordcount example revisited
1: class Mapper
2: method Map(docid a,doc d)
3: for all term t ∈ doc d do
4: Emit(term t,count 1)
5: end for
6: end method
7: end class
1: class Mapper
2: method Map(docid a,doc d)
3: H ← new AssociativeArray
4: for all term t ∈ doc d do
5: H{t} ← H{t}+1 ▷ Tally counts for entire document
6: end for
7: for all term t ∈ H do
8: Emit(term t,count H{t})
9: end for
10: end method
11: end class
60
wordcount example revisited
1: class Mapper
2: method Initialize
3: H ← new AssociativeArray
4: end method
5: method Map(docid a,doc d)
6: for all term t ∈ doc d do
7: H{t} ← H{t}+1 ▷ Tally counts across documents
8: end for
9: end method
10: method Close
11: for all term t ∈ H do
12: Emit(term t,count H{t})
13: end for
14: end method
15: end class
Word count mapper using the “in-mapper combining” pattern.
61
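A Java version of the same in-mapper combining pattern, using Hadoop's
setup()/cleanup() hooks in place of Initialize/Close (a sketch, not official
example code):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombiningMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private Map<String, Integer> counts;            // the associative array H

  @Override
  protected void setup(Context context) {
    counts = new HashMap<>();
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      // tally counts across all the input seen by this mapper task
      counts.merge(itr.nextToken(), 1, Integer::sum);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // emit one <term, local count> pair per distinct term at the end of the task
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}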
average computing example
Example
Given a large number of key-value pairs, where
• keys are strings
• values are integers
find the average of the values for each key
Example
• Input: <“a”,1>, <“b”,2>, <“c”,10>, <“b”,4>,
<“a”,7>
• Output: <“a”,4>, <“b”,3>, <“c”,10>
62
average computing example
1: class Mapper
2: method Map(string t,integer r)
3: Emit(string t,integer r)
4: end method
5: end class
1: class Reducer
2: method Reduce(string t,integers [r1,r2,...])
3: sum ← 0
4: cnt ← 0
5: for all integer r ∈ integers [r1,r2,...] do
6: sum ← sum+r
7: cnt ← cnt+1
8: end for
9: ravg ← sum/cnt
10: Emit(string t,integer ravg)
11: end method
12: end class
63
average computing example
Example
Given a large number of key-value pairs, where
• keys are strings
• values are integers
find the average of the values for each key
Computing the average is not associative:
• average(1,2,3,4,5) ≠ average(average(1,2), average(3,4,5))
• 3 ≠ average(1.5, 4) = 2.75
64
average computing example
1: class Mapper
2: method Map(string t,integer r)
3: Emit(string t,pair (r,1))
4: end method
5: end class
1: class Combiner
2: method Combine(string t,pairs [(s1,c1),(s2,c2)...])
3: sum ← 0
4: cnt ← 0
5: for all pair (s,c) ∈ pairs [(s1,c1),(s2,c2)...] do
6: sum ← sum+s
7: cnt ← cnt+c
8: end for
9: Emit(string t,pair (sum,cnt))
10: end method
11: end class
1: class Reducer
2: method Reduce(string t,pairs [(s1,c1),(s2,c2)...])
3: sum ← 0
4: cnt ← 0
5: for all pair (s,c) ∈ pairs [(s1,c1),(s2,c2)...] do
6: sum ← sum+s
7: cnt ← cnt+c
8: end for
9: ravg ← sum/cnt
10: Emit(string t,integer ravg)
11: end method
12: end class
65
monoidify!
Monoids as a Design Principle for Efficient MapReduce
Algorithms (Jimmy Lin)
Given a set S, an operator ⊕ and an identity element e, for all a,
b,c in S:
• Closure: a⊕b is also in S.
• Associativity: a⊕(b⊕c) = (a⊕b)⊕c
• Identity: e⊕a = a⊕e = a
66
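For the running average, the combinable value is the (sum, count) pair: such
pairs form a monoid under component-wise addition, with identity (0, 0). A
small sketch of the idea in plain Java (independent of Hadoop; the class name
is illustrative):

// Partial averages as a monoid: elements are (sum, count) pairs,
// ⊕ is component-wise addition, and the identity element is (0, 0).
final class SumCount {
  static final SumCount IDENTITY = new SumCount(0, 0);
  final long sum;
  final long count;

  SumCount(long sum, long count) { this.sum = sum; this.count = count; }

  SumCount combine(SumCount other) {          // associative ⊕, so any grouping works
    return new SumCount(sum + other.sum, count + other.count);
  }

  double average() { return (double) sum / count; }
}

Because combine is associative and has an identity, mappers, combiners and
reducers can merge partial (sum, count) pairs in any grouping and still obtain
the correct final average.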
average computing example
1: class Mapper
2: method Initialize
3: S ← new AssociativeArray
4: C ← new AssociativeArray
5: end method
6: method Map(string t,integer r)
7: S{t} ← S{t}+r
8: C{t} ← C{t}+1
9: end method
10: method Close
11: for all term t ∈ S do
12: Emit(term t,pair (S{t},C{t}))
13: end for
14: end method
15: end class
67
compute word co-occurrence matrices
Problem of building word co-occurrence matrices from large
corpora
• The co-occurrence matrix of a corpus is a square n×n
matrix where n is the number of unique words in the corpus
(i.e., the vocabulary size).
• A cell m_ij contains the number of times word w_i co-occurs
with word w_j within a specific context:
• a sentence,
• a paragraph,
• a document,
• a certain window of m words (where m is an
application-dependent parameter).
• Co-occurrence is a symmetric relation
68
compute word co-occurrence (“pairs” approach)
1: class Mapper
2: method Map(docid a,doc d)
3: for all term w ∈ doc d do
4: for all term u ∈ Neighbors(w) do
5: Emit(pair (w,u),count 1)
6: end for
7: end for
8: end method
9: end class
1: class Reducer
2: method Reduce(pair p,counts [c1,c2,...])
3: s ← 0
4: for all count c ∈ counts [c1,c2,...] do
5: s ← s+c
6: end for
7: Emit(pair p,count s)
8: end method
9: end class
69
compute word co-occurrence (“stripes” approach)
1: class Mapper
2: method Map(docid a,doc d)
3: for all term w ∈ doc d do
4: H ← new AssociativeArray
5: for all term u ∈ Neighbors(w) do
6: H{u} ← H{u}+1
7: end for
8: Emit(Term w, Stripe H)
9: end for
10: end method
11: end class
1: class Reducer
2: method Reduce(term w, stripes [H1,H2,H3,...])
3: Hf ← new AssociativeArray
4: for all stripe H ∈ stripes [H1,H2,H3,...] do
5: Sum(Hf,H)
6: end for
7: Emit(term w,stripe Hf)
8: end method
9: end class
70
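The Sum(Hf, H) step is an element-wise merge of two associative arrays. A
sketch in plain Java (in an actual Hadoop job the stripe would be a Writable
map type such as MapWritable rather than a HashMap):

import java.util.HashMap;
import java.util.Map;

final class Stripes {
  // element-wise sum: for every key in h, add its count into hf
  static void sum(Map<String, Integer> hf, Map<String, Integer> h) {
    for (Map.Entry<String, Integer> e : h.entrySet()) {
      hf.merge(e.getKey(), e.getValue(), Integer::sum);
    }
  }
}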
mapreduce big data processing
A given application may have:
• A chain of map functions
• (input processing, filtering, extraction, ...)
• A sequence of several map-reduce jobs (a driver sketch
follows below)
• No reduce task when everything can be expressed in the
map (zero reducers, or the identity reducer function)
Prefer:
• Simple map and reduce functions
• Mapper tasks processing large data chunks (at least the size
of distributed filesystem blocks)
71
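A sequence of map-reduce jobs is usually expressed as a driver that submits
one job after another, feeding each job the output directory of the previous
one. A sketch (class names and paths are placeholders; mapper, reducer and
key/value classes would be set where indicated):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStageDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);   // output of stage 1, input of stage 2
    Path output = new Path(args[2]);

    Job first = Job.getInstance(conf, "stage 1");
    first.setJarByClass(TwoStageDriver.class);
    // first.setMapperClass(...); first.setReducerClass(...); key/value classes go here
    FileInputFormat.addInputPath(first, input);
    FileOutputFormat.setOutputPath(first, intermediate);
    if (!first.waitForCompletion(true)) System.exit(1);  // stop if stage 1 fails

    Job second = Job.getInstance(conf, "stage 2");
    second.setJarByClass(TwoStageDriver.class);
    // second.setMapperClass(...); second.setReducerClass(...);
    FileInputFormat.addInputPath(second, intermediate);
    FileOutputFormat.setOutputPath(second, output);
    System.exit(second.waitForCompletion(true) ? 0 : 1);
  }
}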
