MapReduce is a software framework introduced by Google that enables automatic parallelization and distribution of large-scale computations. It hides the details of parallelization, data distribution, load balancing, and fault tolerance. MapReduce allows programmers to specify a map function that processes key-value pairs to generate intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. It then automatically parallelizes the computation across large clusters of machines.
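To make the model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases for word counting (plain Python, no cluster; the function names are illustrative, not part of any framework API):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: emit an intermediate (word, 1) pair for every word.
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: merge all intermediate values for one key.
    return word, sum(counts)

def map_reduce(documents):
    # Shuffle: group intermediate pairs by key (a real framework
    # does this transparently across the cluster).
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = map_reduce({"d1": "the cat sat", "d2": "the cat ran"})
# counts == {"the": 2, "cat": 2, "sat": 1, "ran": 1}
```

On a real cluster the same two user-supplied functions run unchanged; only the grouping and distribution are handled by the framework.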
A Unified Framework for Computer Vision Tasks: (Conditional) Generative Model...
This lab seminar introduces three recent works by Ting Chen:
- Pix2seq: A Language Modeling Framework for Object Detection (ICLR’22)
- A Unified Sequence Interface for Vision Tasks (NeurIPS’22)
- A Generalist Framework for Panoptic Segmentation of Images and Videos (submitted to ICLR’23)
Log management has always been a complex topic, and over time various solutions of differing complexity have been tried, often hard to integrate into one's application stack. We will give a general overview of the main real-time log aggregation systems (Fluentd, Graylog, and so on) and explain what led us to choose ELK to solve a need of our client: making logs readable by non-technical people.
The ELK stack (Elasticsearch, Logstash, Kibana) lets developers consult logs during debugging and in production without relying on the sysadmin staff. We will demonstrate how we deployed the ELK stack and set it up to parse and structure
Magento's application logs.
Natural Language Processing: Comparing NLTK and OpenNLP
In this presentation, given at the AI & ML meetup on 2nd Feb, Sangram Mishra builds the same NLP solution with both NLTK and OpenNLP, then compares and contrasts the two open-source technologies for deeper insight into choosing and using them in real-world projects.
An introductory but precise slide deck on the mathematics of RNN/LSTM algorithms. It should give you a clearer understanding of forward and backward propagation in RNNs.
*This slide deck is not finished yet. If you like it, please give me some feedback to motivate me.
I made this slide deck as an intern at DATANOMIQ GmbH.
URL: https://www.datanomiq.de/
This document provides an overview of caching strategies when using an API gateway. It first discusses different types of caches like HTTP caches, DNS caches, and database caches. It then focuses on caching strategies for web services, specifically caching within an API gateway. It discusses patterns like cache-aside, read-through, and write-through. It also covers using replication caches for API gateway infrastructure data and distributing caches between nodes. Finally, it provides examples of caching technologies that could be used in an API gateway like Ehcache, Infinispan, Hazelcast, and Redis.
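As an illustration of the first pattern mentioned above, cache-aside can be sketched in a few lines, with an in-memory dict standing in for a real cache such as Redis (all names here are illustrative):

```python
cache = {}

def fetch_from_backend(key):
    # Stand-in for the slow call the gateway would otherwise make.
    return f"value-for-{key}"

def get(key):
    # Cache-aside: the caller checks the cache first and, on a miss,
    # loads from the backend and populates the cache itself.
    if key in cache:
        return cache[key]
    value = fetch_from_backend(key)
    cache[key] = value
    return value

get("user:42")   # miss: hits the backend, fills the cache
get("user:42")   # hit: served from the cache
```

In read-through, by contrast, the cache itself loads missing values from the backend, so callers only ever talk to the cache; write-through additionally updates the cache and backend together on writes.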
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The document discusses the Vision Transformer (ViT) model for computer vision tasks. It covers:
1. How ViT tokenizes images into patches and uses position embeddings to encode spatial relationships.
2. How ViT uses a learnable class embedding (class token) to produce class predictions, rather than a decoder as in CNN-based pipelines.
3. The receptive field of ViT grows as the attention mechanism allows elements to attend to other distant elements in later layers.
4. Initial results showed ViT performance was comparable to CNNs when trained on large datasets but lagged CNNs trained on smaller datasets like ImageNet.
Donald Knuth is an American computer scientist, mathematician, and professor emeritus at Stanford University. He began writing "The Art of Computer Programming" in 1962, a comprehensive monograph covering programming algorithms and their analysis. The work is divided into multiple volumes on different aspects of computer programming, such as fundamental algorithms, sorting and searching, and syntactic algorithms. In developing the book, Knuth also popularized asymptotic ("Big O") notation for characterizing the growth rate of functions. Frustrated with the publishing tools of the time, he developed the TeX computer typesetting system, on which the widely used LaTeX format was later built. Knuth is strongly opposed to software patents, arguing that they cover ideas which, like mathematical facts, should be freely available to everyone.
1. The manual presents the different signatures of Braskem's I'm green seal, including full, simple, and descriptive signatures.
2. The correct proportions for applying each signature are described, and color and monochrome versions are provided.
3. The document also specifies the seal's clear space and the maximum reductions allowed for different substrates and printing techniques.
What is a superpixel?
This presentation describes superpixel algorithms such as watershed, mean-shift, SLIC, and BSLIC (SLIC superpixels based on a boundary term).
References:
[1] Luc Vincent and Pierre Soille. Watersheds in digital spaces: An efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(6):583–598, 1991.
[2] D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, May 2002.
[3] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, May 2012.
[4] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels. EPFL Technical Report no. 149300, June 2010.
[5] Hai Wang, Xiongyou Peng, Xue Xiao, and Yan Liu. BSLIC: SLIC superpixels based on boundary term. Symmetry, 9(3), Feb 2017.
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
An increasing number of popular applications become data-intensive in nature. In the past decade, the World Wide Web has been adopted as an ideal platform for developing data-intensive applications, since the communication paradigm of the Web is sufficiently open and powerful. Data-intensive applications like data mining and web indexing need to access ever-expanding data sets ranging from a few gigabytes to several terabytes or even petabytes. Google leverages the MapReduce model to process approximately twenty petabytes of data per day in a parallel fashion. In this talk, we introduce the Google’s MapReduce framework for processing huge datasets on large clusters. We first outline the motivations of the MapReduce framework. Then, we describe the dataflow of MapReduce. Next, we show a couple of example applications of MapReduce. Finally, we present our research project on the Hadoop Distributed File System.
The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous in nature. Data locality has not been taken into account for launching speculative map tasks, because it is assumed that most maps are data-local. Unfortunately, both the homogeneity and data-locality assumptions are not satisfied in virtualized data centers. We show that ignoring the data-locality issue in heterogeneous environments can noticeably reduce MapReduce performance. In this paper, we address the problem of how to place data across nodes so that each node has a balanced data-processing load. Given a data-intensive application running on a Hadoop MapReduce cluster, our data placement scheme adaptively balances the amount of data stored in each node to achieve improved data-processing performance. Experimental results on two real data-intensive applications show that our data placement strategy can always improve MapReduce performance by rebalancing data across nodes before a data-intensive application runs in a heterogeneous Hadoop cluster.
This document provides an overview of Google's Bigtable distributed storage system. It describes Bigtable's data model as a sparse, multidimensional sorted map indexed by row, column, and timestamp. Bigtable stores data across many tablet servers, with a single master server coordinating metadata operations like tablet assignment and load balancing. The master uses Chubby, a distributed lock service, to track which tablet servers are available and reassign tablets if servers become unreachable.
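The data model described above can be sketched as a plain mapping keyed by (row, column, timestamp); this is a toy illustration of the model only, not Bigtable's API:

```python
# Toy Bigtable-style store: a sparse sorted map
# (row key, column, timestamp) -> value.
table = {
    ("com.cnn.www", "contents:", 2): "<html>v2</html>",
    ("com.cnn.www", "contents:", 1): "<html>v1</html>",
    ("com.cnn.www", "anchor:cnnsi.com", 1): "CNN",
}

def read_latest(row, column):
    # Return the value with the highest timestamp for this cell,
    # mimicking Bigtable's default "most recent version" read.
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None
```

The map is "sparse" in that most (row, column) cells simply have no entry, and versioning falls out of keeping the timestamp in the key.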
This document describes MapReduce, a programming model and software framework for processing large datasets in a distributed manner. It introduces the key concepts of MapReduce including the map and reduce functions, distributed execution across clusters of machines, and fault tolerance. The document outlines how MapReduce abstracts away complexities like parallelization, data distribution, and failure handling. It has been used successfully at Google for large-scale tasks like search indexing and machine learning.
This document provides an overview of MapReduce, a programming model developed by Google for processing and generating large datasets in a distributed computing environment. It describes how MapReduce abstracts away the complexities of parallelization, fault tolerance, and load balancing to allow developers to focus on the problem logic. Examples are given showing how MapReduce can be used for tasks like word counting in documents and joining datasets. Implementation details and usage statistics from Google demonstrate how MapReduce has scaled to process petabytes of data across thousands of machines.
The document discusses network performance profiling of Hadoop jobs. It presents results from running two common Hadoop benchmarks - Terasort and Ranked Inverted Index - on different Amazon EC2 instance configurations. The results show that the shuffle phase accounts for a significant portion (25-29%) of total job runtime. The authors aim to reproduce existing findings that network performance is a key bottleneck for shuffle-intensive Hadoop jobs. Some questions are also raised about inconsistencies in reported network bandwidth capabilities for EC2.
MapReduce: Simplified Data Processing on Large Clusters, presented by Areej Qasrawi
MapReduce is a programming model and an associated implementation for processing and generating large data sets on a distributed computing environment. It allows users to write map and reduce functions to process input key/value pairs in parallel across large clusters of commodity machines. The MapReduce framework handles parallelization, scheduling, input/output distribution, and fault tolerance automatically, allowing developers to focus just on the logic of their map and reduce functions. The paper presents the MapReduce model and describes its implementation at Google for processing terabytes of data across thousands of machines efficiently and with fault tolerance.
Map reduce - simplified data processing on large clusters
The document describes MapReduce, a programming model and software framework for processing large datasets in a distributed computing environment. It discusses how MapReduce allows users to specify map and reduce functions to parallelize tasks across large clusters of machines. It also covers how MapReduce handles parallelization, fault tolerance, and load balancing transparently through an easy-to-use programming interface.
The document discusses using Map-Reduce for machine learning algorithms on multi-core processors. It describes rewriting machine learning algorithms in "summation form" to express the independent computations as Map tasks and aggregating results as Reduce tasks. This formulation allows the algorithms to be parallelized efficiently across multiple cores. Specific machine learning algorithms that have been implemented or analyzed in this Map-Reduce framework are listed.
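The "summation form" idea can be illustrated with linear regression, whose sufficient statistics XᵀX and Xᵀy are sums over examples: each map task computes partial sums over its shard, and reduce adds them up. A single-process NumPy sketch (the sharding scheme here is illustrative):

```python
import numpy as np

def map_partial(shard_X, shard_y):
    # Map: each core computes partial sums over its own shard.
    return shard_X.T @ shard_X, shard_X.T @ shard_y

def reduce_sum(partials):
    # Reduce: aggregate the partial statistics by addition,
    # then solve the normal equations once.
    XtX = sum(p[0] for p in partials)
    Xty = sum(p[1] for p in partials)
    return np.linalg.solve(XtX, Xty)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

shards = [(X[i::4], y[i::4]) for i in range(4)]   # 4 "cores"
w = reduce_sum([map_partial(sx, sy) for sx, sy in shards])
# w recovers [1.0, -2.0, 0.5] up to floating-point error
```

Because the statistics are exact sums, the parallel result is identical to the serial one, which is precisely what makes algorithms in summation form easy to parallelize.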
This document discusses using R for large-scale data analysis on distributed data clouds. It recommends splitting large datasets into segments using MapReduce or UDFs, then building separate models for each segment in R. PMML can be used to combine the separate models into an ensemble model. The Sawmill framework is proposed to preprocess data in parallel, build models for each segment using R, and combine the models into a PMML file for deployment. Running R on each segment sequentially allows scaling to large datasets, with examples showing processing times for different numbers of segments.
MapReduce is a programming model for processing large datasets in parallel across clusters of machines. It involves splitting the input data into independent chunks which are processed by the "map" step, and then grouping the outputs of the maps together and inputting them to the "reduce" step to produce the final results. The MapReduce paper presented Google's implementation which ran on a large cluster of commodity machines and used the Google File System for fault tolerance. It demonstrated that MapReduce can efficiently process very large amounts of data for applications like search, sorting and counting word frequencies.
The document provides an introduction to physics-informed machine learning. It discusses the limitations of traditional modeling approaches and machine learning alone. Physics-informed machine learning aims to embed physical laws and constraints into machine learning models. There are three main approaches: incorporating observational biases, inductive biases from physics, and learning biases like physics-informed neural networks (PINNs). PINNs have been applied to problems with complex geometries and different physical laws but can have convergence issues that require further research. Overall, physics-informed machine learning shows promise for improving simulations but many open problems remain.
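The core idea of embedding a physical law as a learning bias can be shown with a toy linear model: fit u(t) ≈ Σ cₖ tᵏ to a few data points while also penalizing the residual of an assumed ODE u' + u = 0 at collocation points. This is a minimal sketch of the loss construction under that assumed ODE, not a full PINN:

```python
import numpy as np

# Polynomial "model" u(t) = sum_k c_k t^k, degree 4.
deg = 4
t_data = np.array([0.0, 1.0])          # sparse observations of u(t) = exp(-t)
u_data = np.exp(-t_data)
t_col = np.linspace(0.0, 1.0, 20)      # collocation points for the physics term

# Data rows: u(t_i) = sum_k c_k t_i^k
A_data = np.vander(t_data, deg + 1, increasing=True)
# Physics rows: residual of u' + u = 0, i.e. sum_k c_k (k t^(k-1) + t^k) = 0
A_phys = np.stack([k * t_col ** max(k - 1, 0) * (k > 0) + t_col ** k
                   for k in range(deg + 1)], axis=1)

# Solve the combined least-squares problem: data fit + physics residual.
A = np.vstack([A_data, A_phys])
b = np.concatenate([u_data, np.zeros(len(t_col))])
c, *_ = np.linalg.lstsq(A, b, rcond=None)

u_pred = np.vander(np.array([0.5]), deg + 1, increasing=True) @ c
# u_pred is close to exp(-0.5) despite only two data points
```

A PINN follows the same recipe with a neural network in place of the polynomial and automatic differentiation supplying the derivative terms; the convergence issues mentioned above arise in balancing the data and physics parts of that loss.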
The document discusses programmable network devices and open programmability. Key points include:
1) Programmable network devices allow non-vendor applications to run on network devices through technologies like Java Virtual Machines, enabling new types of applications and local computation on devices.
2) This open programmability enables applications involving distributed computing across network devices and servers, as well as new services like mobile agents, local intelligence for network management systems, and application-layer collaboration between routers and servers.
3) Achieving open programmability requires architectures like programmable networks, active networking, and network services architectures that provide standardized interfaces and safe execution environments for third-party applications on devices.
The macrame of scholarly training - collecting the cords that bind, by Danny Kingsley
This document summarizes a presentation on the need for a modern curriculum to teach research skills to students. It argues that current training focuses more on teaching and learning but not research practice. A modern curriculum is needed to define and standardize the skills required for research. Libraries are well-positioned to help develop such a curriculum since they already provide much of the training on skills like scholarly communication. Developing a standardized framework of research skills would help libraries and others consistently teach the practices needed for success in research.
Users define a deadline for cluster computing tasks. The proposed solution uses stream processing and MapReduce to dynamically expand or contract the cluster size to meet the deadline while minimizing costs. The authors implemented deadline queries on Amazon EC2 and experiments showed the approach was feasible and effective in meeting deadlines even when introducing node perturbations.
This document discusses embarrassingly parallel problems and the MapReduce programming model. It provides examples of MapReduce functions and how they work. Key points include:
- Embarrassingly parallel problems can be easily split into independent parts that can be solved simultaneously without much communication. MapReduce is well-suited for these types of problems.
- MapReduce involves two functions - map and reduce. Map processes a key-value pair to generate intermediate key-value pairs, while reduce merges all intermediate values associated with the same intermediate key.
- Implementations like Hadoop handle distributed execution, parallelization, data partitioning, and fault tolerance. Users just provide map and reduce functions.
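The "independent parts, little communication" property can be seen in a tiny example: split the input into chunks and process each in its own worker, with the only coordination being the final collection of results (standard library only; the chunking scheme is illustrative, and threads are used here purely to keep the sketch self-contained):

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Each chunk is handled with no communication between workers:
    # the hallmark of an embarrassingly parallel problem.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, n_chunks=4):
    chunks = [data[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor() as pool:
        partials = pool.map(process_chunk, chunks)
    return sum(partials)  # the only "reduce"-style coordination

print(parallel_sum_of_squares(list(range(10))))  # prints 285
```

MapReduce generalizes exactly this shape: independent map work over partitions, then one grouping/aggregation step.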
This document provides an introduction to the MapReduce programming model. It describes how MapReduce, inspired by Lisp's map and reduce functions, divides work into mapping and reducing parts that are distributed and processed in parallel. It then gives examples of using MapReduce for word counting and calculating total sales. It also details the MapReduce daemons in Hadoop and includes demo code for summing array elements in Java and for word counting on a text file using the Hadoop framework in Python.
Data Science is concerned with the analysis of large amounts of data. When the volume of data is really large, it requires the use of cooperating, distributed machines. The most popular method of doing this is Hadoop, a collection of programs to perform computations on connected machines in a cluster. Hadoop began life as an open-source implementation of MapReduce, an idea first developed and implemented by Google for its own clusters. Though Hadoop's MapReduce is Java-based, and quite complex, this talk focuses on the "streaming" facility, which allows Python programmers to use MapReduce in a clean and simple way. We will present the core ideas of MapReduce and show you how to implement a MapReduce computation using Python streaming. The presentation will also include an overview of the various components of the Hadoop "ecosystem."
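With streaming, a word count is one small script that reads stdin and writes tab-separated key/value lines; Hadoop runs the mapper, sorts the pairs by key, and pipes them into the reducer. A compact sketch of both roles in a single file (structured as functions so the logic can be exercised without a cluster; the file name and role argument are our convention, not Hadoop's):

```python
import sys
from itertools import groupby

def mapper(lines):
    # Mapper: emit "word\t1" for every word on stdin.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Reducer: input arrives sorted by key, so consecutive lines
    # with the same word can be summed with groupby.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(c) for _, c in group)}"

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    stage = mapper if role == "map" else reducer
    sys.stdout.writelines(line + "\n" for line in stage(sys.stdin))
```

Locally you can simulate the framework with a shell pipeline such as `cat input.txt | python wc.py map | sort | python wc.py reduce`, which is exactly the map/sort/reduce dataflow Hadoop performs at scale.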
NYC Data Science Academy is excited to welcome Sam Kamin, who will be presenting an Introduction to Hadoop for Python Programmers as well as a discussion of MapReduce with Streaming Python.
Sam Kamin was a professor in the University of Illinois Computer Science Department. His research was in programming languages, high-performance computing, and educational technology. He taught a wide variety of courses and served as the Director of Undergraduate Programs. He retired as Emeritus Associate Professor and worked at Google until taking his current position as VP of Data Engineering at NYC Data Science Academy.
--------------------------------------
Our fall 12-Week Data Science bootcamp starts on Sept 21st, 2015. Apply now to get a spot!
If you are hiring Data Scientists, call us at (1)888-752-7585 or reach info@nycdatascience.com to share your openings and set up interviews with our excellent students.
This document provides an introduction to big data and MapReduce frameworks. It discusses:
- What big data is and examples of large datasets.
- An overview of MapReduce, including how it allows programmers to break problems into parallelizable map and reduce tasks.
- Details of how MapReduce frameworks like Apache Hadoop work, including distributed processing, fault tolerance, and the roles of mappers, reducers, and other components.
International Journal of Computational Engineering Research (IJCER) is an international, monthly, English-language online journal. It publishes original research that contributes significantly to scientific knowledge in engineering and technology.
The document summarizes the MapReduce programming model and associated implementation developed by Google for processing and generating large datasets in a distributed computing environment. It describes how users specify computations using map and reduce functions, and the underlying system automatically parallelizes execution across large clusters, handles failures, and coordinates inter-machine communication. The authors note over 10,000 distinct programs have been implemented using MapReduce internally at Google to process over 20 petabytes of data daily across its clusters.
2. Problem and Motivations
— Large data sizes
— Limited CPU power
— Difficulty of distributed, parallel computing
3. MapReduce
— MapReduce is a software framework
— Introduced by Google
— Enables automatic parallelization and distribution of large-scale computations
— Hides the details of parallelization, data distribution, load balancing, and fault tolerance
— Achieves high performance
4. Outline
— MapReduce: Execution Example
— Programming Model
— MapReduce: Distributed Execution
— More Examples
— Customizations on Clusters
— Refinements
— Performance measurement
— Conclusion and Future Work
— MapReduce in other companies
6. Example
q Input:
— Page 1: the weather is good
— Page 2: today is good
— Page 3: good weather is good
q Output desired:
The frequency with which each word is encountered across all pages:
(the, 1), (is, 3), (weather, 2), (today, 1), (good, 4)
7. Example: Data Flow
Input data:
"The weather is good"  "Today is good"  "Good weather is good"

map(key, value):
    for each word w in value:
        emit(w, 1)

Intermediate data (after the map phase):
(The,1) (weather,1) (is,1) (good,1) (Today,1) (is,1) (good,1) (good,1) (weather,1) (is,1) (good,1)

Grouped data (after grouping by key):
(The,[1]) (weather,[1,1]) (is,[1,1,1]) (good,[1,1,1,1]) (Today,[1])

reduce(key, values):
    result = 0
    for each count v in values:
        result += v
    emit(key, result)

Output data (after the reduce phase):
(The,1) (weather,2) (is,3) (good,4) (Today,1)
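The data flow above can be simulated on a single machine. The following is a minimal Python sketch, not Hadoop or Google's MapReduce itself, just the same map → group-by-key → reduce pipeline run locally (all lowercased for simplicity):

```python
from itertools import groupby

def map_fn(key, value):
    # Map: emit (word, 1) for every word in the input line.
    for w in value.split():
        yield (w, 1)

def reduce_fn(key, values):
    # Reduce: sum all counts collected for one word.
    result = 0
    for v in values:
        result += v
    return (key, result)

pages = {
    "Page 1": "the weather is good",
    "Page 2": "today is good",
    "Page 3": "good weather is good",
}

# Map phase: apply map_fn to every input key/value pair.
intermediate = [kv for k, v in pages.items() for kv in map_fn(k, v)]

# Shuffle & sort phase: group intermediate pairs by key.
intermediate.sort(key=lambda kv: kv[0])
grouped = {k: [v for _, v in g]
           for k, g in groupby(intermediate, key=lambda kv: kv[0])}

# Reduce phase: one reduce_fn call per distinct key.
output = dict(reduce_fn(k, vs) for k, vs in grouped.items())
print(output)  # {'good': 4, 'is': 3, 'the': 1, 'today': 1, 'weather': 2}
```

In the real framework the shuffle happens across machines and each reduce task sees only its partition of the keys; here one process plays all the roles.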
8. Programming Model
§ Input: a set of key/value pairs
§ Programmer specifies two functions, Map and Reduce:
• map(k, v) → <k', v'>
• reduce(k', <v'>*) → <k', v'>*
All v' with the same k' are reduced together
11. MapReduce Examples
— Count of URL Access Frequency:
Input (web server logs): www.cbc.com, www.cnn.com, www.bbc.com, www.cbc.com, www.cbc.com, www.bbc.com
MAP → (CBC, [1,1,1]) (CNN, [1]) (BBC, [1,1])
RED → (CBC, 3) (BBC, 2) (CNN, 1)
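The URL-access-frequency job is word count with URLs as the words. A hypothetical single-machine Python sketch (the log lines are made up for illustration):

```python
from collections import defaultdict

# Map reads web-server log lines and emits (URL, 1);
# reduce sums the counts per URL.
log_lines = [
    "www.cbc.com", "www.cnn.com", "www.bbc.com",
    "www.cbc.com", "www.cbc.com", "www.bbc.com",
]

grouped = defaultdict(list)
for line in log_lines:
    url = line.strip()       # map: emit (URL, 1)
    grouped[url].append(1)   # shuffle: group by URL

freq = {url: sum(ones) for url, ones in grouped.items()}  # reduce
print(freq)  # {'www.cbc.com': 3, 'www.cnn.com': 1, 'www.bbc.com': 2}
```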
12. MapReduce Examples
q Reverse Web-Link Graph:
Input: crawled pages / web server logs
MAP: for each link to a target URL found in a source page, emit (target, source)
e.g., (facebook, youtube), (facebook, disney)
RED: concatenate the list of all source pages associated with each target
e.g., (Facebook, [youtube, disney])
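A hypothetical Python sketch of the reverse web-link-graph job; the page contents below are invented for illustration. The mapper emits (target, source) for every link a source page contains, and the reducer concatenates all sources pointing at each target:

```python
from collections import defaultdict

# source page -> list of target URLs it links to (made-up data)
pages = {
    "youtube.com": ["facebook.com"],
    "disney.com": ["facebook.com", "youtube.com"],
}

links_to = defaultdict(list)
for source, targets in pages.items():
    for target in targets:               # map: emit (target, source)
        links_to[target].append(source)  # shuffle: group by target

# reduce: identity over the grouped lists -> (target, [sources])
result = dict(links_to)
print(result)
# {'facebook.com': ['youtube.com', 'disney.com'], 'youtube.com': ['disney.com']}
```

Inverting the graph this way is useful, for example, when computing a page's importance from its set of inbound links.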
13. MapReduce Examples
q Term-Vector per Host:
Input: the documents of a host (e.g., facebook)
MAP: for each document, emit the terms it contains, keyed by hostname
e.g., <facebook, word1>, <facebook, word2>, <facebook, word2>, ….
RED: combine the terms for each host, keeping only the most frequent ones
e.g., <facebook, [word2, …]> — a summary of the most popular words
18. Customizations on Clusters
q Scheduling
Master scheduling policy (objective: conserve network bandwidth):
1. GFS divides each file into 64 MB blocks.
2. Input data are stored on the workers' local disks (managed by GFS).
Ø Locality: using the same cluster for both data storage and data processing.
3. GFS stores multiple copies of each block (typically 3 copies) on different machines.
19. Customizations on Clusters
q Fault Tolerance
On worker failure:
• Detect failure via periodic heartbeats
• Re-execute completed and in-progress map tasks
• Re-execute in-progress reduce tasks
• Task completion committed through master
On master failure:
• Could handle, but don't yet (master failure unlikely)
• MapReduce task is aborted and the client is notified
20. Customizations on Clusters
q Task Granularity (how tasks are divided)
Rule of thumb: make M and R much larger than the number of worker machines
à Improves dynamic load balancing
à Speeds recovery from worker failure
Usually R is smaller than M
21. Customizations on Clusters
q Backup Tasks
— Problem of stragglers (machines taking a long time to complete one of the last few tasks)
— When a MapReduce operation is about to complete:
Ø The master schedules backup executions of the remaining tasks
Ø A task is marked "complete" whenever either the primary or the backup execution completes
Effect: dramatically shortens job completion time
22. Outline
þ MapReduce: Execution Example
þ Programming Model
þ MapReduce: Distributed Execution
þ More Examples
þ Customizations on Clusters
— Refinements
— Performance measurement
— Conclusion & Future Work
— Companies using MapReduce
24. Refinements: Partitioning Function
— MapReduce users specify the number of reduce tasks/output files desired (R)
— For reduce, we need to ensure that records with the same intermediate key end up at the same worker
— The system uses a default partition function,
e.g., hash(key) mod R (results in fairly well-balanced partitions)
— Sometimes useful to override,
e.g., hash(hostname(URL key)) mod R
Ø ensures URLs from the same host end up in the same output file
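Both partition functions above can be sketched in a few lines of Python. This is a hypothetical illustration (the value of R and the example URLs are made up), not the framework's actual implementation:

```python
from urllib.parse import urlparse

R = 4  # desired number of reduce tasks / output files (assumed value)

def default_partition(key: str) -> int:
    # Default: hash(key) mod R gives fairly well-balanced partitions.
    # Python's % always returns a value in [0, R) for positive R.
    return hash(key) % R

def host_partition(url_key: str) -> int:
    # Override: hash(hostname(URL)) mod R, so every URL from the same
    # host goes to the same reduce task and hence the same output file.
    host = urlparse(url_key).netloc
    return hash(host) % R

# Two URLs from the same host always land in the same partition.
a = host_partition("http://example.com/page1")
b = host_partition("http://example.com/page2")
assert a == b
```

Note that Python randomizes string hashing per process; within one run (one job, in this analogy) the mapping is consistent, which is all a partitioner needs.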
25. Refinements: Skipping Bad Records
§ Map/Reduce functions sometimes fail for particular inputs
• MapReduce has special treatment for "bad" input data, i.e., input data that repeatedly leads to the crash of a task
Ø The master, tracking crashes of tasks, recognizes such situations and, after a number of failed retries, decides to ignore this piece of data
• Effect: can work around bugs in third-party libraries
26. Refinements: Status Information
— Status pages show the computation's progress
— Links to standard error and output files generated by each task
— The user can:
Ø Predict the computation's length
Ø Add more resources if needed
Ø Know which workers have failed
— Useful in diagnosing bugs in user code
27. Other Refinements
§ Combiner function: compression of intermediate data
Ø Useful for saving network bandwidth
§ User-defined counters
Ø Periodically propagated to the master from worker machines
Ø Useful for checking the behavior of MapReduce operations (appear on the master status page)
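A combiner runs the reduce logic on each mapper's local output before the shuffle; this is only valid because the reduction here (summing counts) is associative and commutative. A hypothetical Python sketch of mapper-side combining for word count:

```python
from collections import Counter

def map_with_combiner(line: str):
    # Instead of emitting one (word, 1) pair per occurrence, combine
    # locally and emit one (word, count) pair per distinct word.
    # Far fewer intermediate pairs then cross the network.
    local = Counter(line.split())
    return list(local.items())

pairs = map_with_combiner("good weather is good good")
print(pairs)  # [('good', 3), ('weather', 1), ('is', 1)]
```

Without the combiner this line would produce five intermediate pairs; with it, only three.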
28. Outline
þ MapReduce: Execution Example
þ Programming Model
þ MapReduce: Distributed Execution
þ More Examples
þ Customizations on Clusters
þ Refinements
— Performance measurement
— Conclusion & Future Work
— Companies using MapReduce
29. Performance
§ Tests run on a cluster of 1800 machines; each machine has:
— 4 GB of memory
— Dual-processor 2 GHz Xeons with Hyper-Threading
— Dual 160 GB IDE disks
— Gigabit Ethernet link
— Bisection bandwidth approximately 100–200 Gbps
§ Two benchmarks:
— Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
— Sort: sort 10^10 100-byte records (about 1 TB of data)
30. Grep
1764 workers
M = 15000 (input split = 64 MB)
R = 1
Assume all machines have the same host
Search pattern: 3 characters, found in 92,337 records
• 1800 machines read 1 TB of data at a peak of ~31 GB/s
• Startup overhead is significant for short jobs (entire computation ≈ 80 seconds of processing plus about 1 minute of startup)
31. Sort
M = 15000 (input split = 64 MB)
R = 4000, number of workers = 1746
(a) Normal execution: better than the TeraSort benchmark's reported result of 1057 s
(b) No backup tasks
(c) 200 tasks killed
(a) Locality optimization è input rate > shuffle rate and output rate;
the output phase writes 2 copies of the sorted data è shuffle rate > output rate
(b) 5 stragglers à entire computation time increases 44% over normal
32. Experience: Rewrite of the Production Indexing System
§ New code is simpler and easier to understand
§ MapReduce takes care of failures and slow machines
§ Easy to make indexing faster by adding more machines
33. Outline
þ MapReduce: Execution Example
þ Programming Model
þ MapReduce: Distributed Execution
þ More Examples
þ Customizations on Clusters
þ Refinements
þ Performance measurement
— Conclusion & Future Work
— Companies using MapReduce
34. Conclusion & Future Work
— MapReduce has proven to be a useful abstraction
— Greatly simplifies large-scale computations
— Fun to use: focus on the problem, let the library deal with the messy details
35. MapReduce Advantages/Disadvantages
Now it's easy to program for many CPUs:
• Communication management effectively gone
Ø I/O scheduling done for us
• Fault tolerance and monitoring
Ø Machine failures, suddenly-slow machines, etc. are handled
• Can be much easier to design and program!
• Can cascade several (many?) MapReduce tasks
But… it further restricts solvable problems:
• Might be hard to express a problem in MapReduce
• Data parallelism is key
Ø Need to be able to break up a problem by data chunks
• MapReduce is closed-source (to Google) C++
Ø Hadoop is an open-source, Java-based rewrite
36. Outline
þ MapReduce: Execution Example
þ Programming Model
þ MapReduce: Distributed Execution
þ More Examples
þ Customizations on Clusters
þ Refinements
þ Performance measurement
þ Conclusion & Future Work
— Companies using MapReduce
37. Companies using MapReduce
v Amazon: Amazon Elastic MapReduce:
§ a web service
§ enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data
§ utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3)
§ allows you to use Hadoop with no hardware investment
— http://aws.amazon.com/elasticmapreduce/
38. Companies using MapReduce
— Amazon: to build product search indices
— Facebook: processing of web logs, via both MapReduce and Hive
— IBM and Google: making large compute clusters available to higher-education and research organizations
— New York Times: large-scale image conversions
— Yahoo: uses MapReduce and Pig for web log processing, data model training, web map construction, and much, much more
— Many universities: for teaching parallel and large-data systems
And many more; see them all at http://wiki.apache.org/hadoop/PoweredBy