MapReduce
Simplified Data Processing on Large
Clusters
Google, Inc.
Presented by

Noha El-Prince
Winter 2011
Problem and Motivations
—  Large data size
—  Limited CPU power
—  Difficulties of distributed, parallel computing

2
MapReduce
—  MapReduce is a software framework introduced by Google
—  Enables automatic parallelization and distribution of large-scale computations
—  Hides the details of parallelization, data distribution, load balancing and fault tolerance
—  Achieves high performance

3
Outline
—  MapReduce: Execution Example
—  Programming Model
—  MapReduce: Distributed Execution
—  More Examples
—  Customizations on Clusters
—  Refinements
—  Performance measurement
—  Conclusion and Future Work
—  MapReduce in other companies
4

Programming Model
[Diagram: raw data (k, v) enters the MapReduce library; Map (M) produces intermediate data <k', v'>*, which is grouped into (k', <v'>*) and passed to Reduce (R) to produce the reduced, processed output.]
5
Example
q  Input:
—  Page 1: the weather is good
—  Page 2: today is good
—  Page 3: good weather is good.

q  Output Desired:
The frequency with which each word is encountered across all pages:
(the 1), (is 3), (weather 2), (today 1), (good 4)

6
Input Data
The weather is good | Today is good | Good weather is good

map(key, value):
  for each word w in value:
    emit(w, 1)

Intermediate Data
(the,1) (weather,1) (is,1) (good,1) (today,1) (is,1) (good,1) (good,1) (weather,1) (is,1) (good,1)

Group by Key

Grouped Data
(the,[1]) (weather,[1,1]) (is,[1,1,1]) (good,[1,1,1,1]) (today,[1])

reduce(key, values):
  result = 0
  for each count v in values:
    result += v
  emit(key, result)

Output Data
(the,1) (weather,2) (is,3) (good,4) (today,1)
7
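The walkthrough above can be reproduced as a minimal single-process sketch of the library's map → group-by-key → reduce flow (a toy simulation for illustration, not the distributed implementation; the function names are hypothetical):

```python
from itertools import groupby

def map_fn(key, value):
    # map(key, value): emit (w, 1) for each word w in the page text
    return [(w.lower(), 1) for w in value.split()]

def reduce_fn(key, values):
    # reduce(key, values): sum the counts emitted for one word
    return (key, sum(values))

def map_reduce(pages):
    # Map phase over all input records
    intermediate = []
    for name, text in pages:
        intermediate.extend(map_fn(name, text))
    # Group by key -- the library's shuffle/sort step
    intermediate.sort(key=lambda kv: kv[0])
    grouped = ((k, [v for _, v in g])
               for k, g in groupby(intermediate, key=lambda kv: kv[0]))
    # Reduce phase
    return dict(reduce_fn(k, vs) for k, vs in grouped)

pages = [("Page 1", "the weather is good"),
         ("Page 2", "today is good"),
         ("Page 3", "good weather is good")]
print(map_reduce(pages))  # {'good': 4, 'is': 3, 'the': 1, 'today': 1, 'weather': 2}
```

The same driver reappears, with different map/reduce pairs, in all of the examples that follow.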
Programming Model
§  Input: a set of key/value pairs
§  Programmer specifies two functions:

Map:     map (k, v) → <k', v'>*
Reduce:  reduce (k', <v'>*) → <k', v'>*

All v' with the same k' are reduced together
8

Distributed Execution Overview

[Diagram: the user program forks a Master and Worker processes. The Master assigns map tasks and reduce tasks to workers. Map workers read the input splits (Split 0, Split 1, Split 2) and write intermediate data to their local disks; reduce workers remotely read and sort that data, then write the results to the output files (Output File 0, Output File 1).]
9
MapReduce Examples
Distributed Grep: search pattern (key): "virus"

[Web pages: A…virus, B……, C..virus…
→ MAP emits (virus, A…), (virus, C…)
→ RED emits (virus, [A…, C…])]
10
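A sketch of the grep map/reduce pair in the same toy style (hypothetical function names; the real job emits matching lines from file splits):

```python
def grep_map(doc_id, text, pattern="virus"):
    # Emit (pattern, document reference) for each line containing the pattern
    return [(pattern, f"{doc_id}: {line}")
            for line in text.splitlines() if pattern in line]

def grep_reduce(pattern, matches):
    # Identity reduce: just collect all matching lines for the pattern
    return (pattern, matches)
```

Feeding pages A, B and C through `grep_map` and grouping by key yields (virus, [A…, C…]) as on the slide; B emits nothing.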
MapReduce Examples
—  Count of URL Access Frequency:

[Web server logs: www.cbc.com, www.cnn.com, www.bbc.com, www.cbc.com, www.cbc.com, www.bbc.com
→ MAP → (CBC, [1,1,1]) (CNN, [1]) (BBC, [1,1])
→ RED → (CBC, 3) (BBC, 2) (CNN, 1)]
11
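The URL-frequency pair is word count with log lines as keys (a toy sketch; names are hypothetical):

```python
def url_map(log_line):
    # Emit (URL, 1) for each request recorded in the web server log
    return [(log_line.strip(), 1)]

def url_reduce(url, counts):
    # Sum the per-request 1s into the total access count
    return (url, sum(counts))
```

Grouping the emitted pairs by URL and applying `url_reduce` produces the totals shown above.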
MapReduce Examples
q Reverse Web-Link Graph:
www.facebook.com

www.youtube.com

source

target

MAP

(facebook,youtube)
(facebook, disney)

Facebook.com
Twitter.com
RED

www.disney.com
Facebook.com

Web server logs

(Facebook, [youtube, disney])

12
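In the paper's formulation, map emits a (target, source) pair per link and reduce concatenates the sources for each target. A toy sketch with a small in-process driver (function names are hypothetical):

```python
from collections import defaultdict

def link_map(source, targets):
    # Emit (target, source) for every outgoing link on the source page
    return [(target, source) for target in targets]

def link_reduce(target, sources):
    # Concatenate all source pages that link to this target
    return (target, sorted(set(sources)))

def reverse_links(pages):
    # Toy driver: group the (target, source) pairs by target, then reduce
    grouped = defaultdict(list)
    for source, targets in pages:
        for target, src in link_map(source, targets):
            grouped[target].append(src)
    return dict(link_reduce(t, s) for t, s in grouped.items())
```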

MapReduce Examples
q  Term-Vector per Host:.

MAP

word1>
word2>
word2>
word2>

….
RED

Documents of the
facebook (hostname)

<facebook,
<facebook,
<facebook,
<facebook,

<facebook, [word2, …]>
Summary of the most popular words

13
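A sketch of the term-vector pair, assuming map emits a per-document term vector keyed by hostname and reduce adds the vectors and keeps the most frequent terms (hypothetical names; `top_n` is an illustrative parameter):

```python
from collections import Counter
from urllib.parse import urlparse

def termvec_map(url, document):
    # Emit (hostname, term vector of this one document)
    host = urlparse(url).hostname
    return (host, Counter(document.lower().split()))

def termvec_reduce(host, vectors, top_n=3):
    # Add the per-document vectors, keeping only the most frequent terms
    total = Counter()
    for vector in vectors:
        total += vector
    return (host, total.most_common(top_n))
```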
MapReduce Examples
q Inverted Index:

MAP

Docs

<word1,
<word2,
…
<word3,
<word1,
…
<word1,

docID1>
docID1>

<word1, [docID1,
docID2, docID3]>
RED

<word2, [docID1]>

docID2>
docID2>
docID3>

14
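The inverted-index pair in the same toy style: map emits (word, docID) pairs and reduce builds the posting list (hypothetical names, a sketch only):

```python
def index_map(doc_id, text):
    # Emit (word, docID) once for each distinct word in the document
    return [(word, doc_id) for word in sorted({w.lower() for w in text.split()})]

def index_reduce(word, doc_ids):
    # Build the sorted, de-duplicated posting list for this word
    return (word, sorted(set(doc_ids)))
```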
Outline
þ—  MapReduce : Execution
þ—  Example
þ—  Programming Model
þ—  MapReduce: Distributed Execution
þ—  More Examples
—  Customizations on Clusters
—  Refinements
—  Performance measurement
—  Conclusion & Future Work
—  MapReduce in other companies
15
Customizations on Clusters
—  Coordination
—  Scheduling
—  Fault Tolerance
—  Task Granularity
—  Backup Tasks

16

Customizations on Clusters
q Coordination
Master Data Structure
M

250.133.22.7

Completed

Root/intFile.txt

M

250.133.22.8

inprogress

Root/intFile.txt

R

250.123.23.3

idle

Root/outFile.txt

17
Customizations on Clusters
q Scheduling
Master scheduling policy (objective: conserve network bandwidth):
1.  GFS divides each file into 64 MB blocks.
2.  Input data are stored on the workers' local disks (managed by GFS).
Ø  Locality: the same cluster is used for both data storage and data processing.
3.  GFS stores multiple copies of each block (typically 3 copies) on different machines.

18
Customizations on Clusters
q Fault Tolerance
On worker failure:
•  Detect failure via periodic heartbeats
•  Re-execute completed and in-progress map tasks
•  Re-execute in-progress reduce tasks
•  Task completion committed through master

On master failure:
•  Could handle, but don't yet (master failure unlikely)
•  The MapReduce task is aborted and the client is notified

19
Customizations on Clusters
q  Task Granularity

(How are tasks divided?)

Rule of thumb:
Make M and R much larger than the number of worker machines
→  Improves dynamic load balancing
→  Speeds recovery from worker failure

Usually R is smaller than M
20

Customizations on Clusters
q  Backup tasks
—  Problem of stragglers (machines taking long time
to complete one of the last few tasks )

—  When a MapReduce operation is about to complete:
Ø 
Ø 

Master schedules backup executions of the
remaining tasks
Task is marked “complete” whenever either the
primary or the backup execution completes.

Effect: dramatically shortens job completion time
21
Outline
þ—  MapReduce : Execution
þ—  Example
þ—  Programming Model
þ—  MapReduce: Distributed Execution
þ—  More Examples
— 
þ Customizations on Clusters
—  Refinements
—  Performance measurement
—  Conclusion & Future Work
—  Companies using MapReduce
22
Refinements
—  Partitioning functions.
—  Skipping bad records.
—  Status info.
—  Other Refinements

23
Refinements : Partitioning Function
—  MapReduce users specify the number of tasks/output files desired (R)
—  For reduce, we need to ensure that records with the same intermediate key end up at the same worker

—  System uses a default partition function
e.g., hash(key) mod R (results in fairly well-balanced partitions)

—  Sometimes useful to override
—  E.g., hash(hostname(URL key)) mod R
Ø  ensures URLs from a host end up in the same output file
24
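The two partitioners can be sketched as follows (a toy illustration: Python's built-in `hash` stands in for the library's hash function, and the function names are hypothetical):

```python
from urllib.parse import urlparse

def default_partition(key, R):
    # Default partitioner: hash(key) mod R spreads keys across R reduce tasks
    return hash(key) % R

def host_partition(url_key, R):
    # Override: hash(hostname(URL)) mod R sends all URLs from one host
    # to the same reduce task, and hence to the same output file
    return hash(urlparse(url_key).hostname) % R
```

With `host_partition`, every URL on cbc.com lands in the same partition regardless of its path.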

Refinements : Skipping Bad Records
§  Map/Reduce functions sometimes fail for particular inputs
•  MapReduce has a special treatment for 'bad' input data, i.e. input data that repeatedly leads to the crash of a task
Ø  The master, tracking crashes of tasks, recognizes such situations and, after a number of failed retries, decides to ignore this piece of data
•  Effect: can work around bugs in third-party libraries
25
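The skip mechanism can be caricatured in-process (a loose sketch: the real master tracks crashes across distributed task re-executions; here one loop retries each record and gives up after a fixed number of failures, and all names are hypothetical):

```python
MAX_FAILURES = 2  # retries before a record is declared bad

def run_map_skipping_bad(map_fn, records):
    # Apply map_fn to each record; skip records that keep crashing
    output, skipped = [], []
    for i, record in enumerate(records):
        for attempt in range(MAX_FAILURES + 1):
            try:
                output.extend(map_fn(record))
                break
            except Exception:
                if attempt == MAX_FAILURES:
                    skipped.append(i)  # record ignored from now on
    return output, skipped
```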
Refinements : Status Information
—  Status pages show the computation progress
—  Links to standard error and output files generated by each task

—  User can
Ø  Predict the computation length
Ø  Add more resources if needed
Ø  Know which workers have failed

—  Useful in diagnosing bugs in user code

26
Other Refinements
§  Combiner function: compression of intermediate data
Ø  useful for saving network bandwidth

§  User-defined counters
Ø  periodically propagated to the master from worker machines
Ø  Useful for checking the behavior of MapReduce operations (appears on the master status page)
27
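For word count, the combiner has the same shape as the reducer: it merges the (word, 1) pairs on the map worker before they cross the network. A toy sketch (hypothetical names):

```python
from collections import Counter

def map_words(line):
    # Map: emit (word, 1) for each word in the line
    return [(w, 1) for w in line.lower().split()]

def combine(pairs):
    # Combiner: merge pairs with the same key locally on the map worker,
    # so fewer pairs are shipped to the reduce workers
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())
```

For the line "Good weather is good", the map output of four pairs shrinks to three before transfer.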
Outline
þ—  MapReduce : Execution
þ—  Example
þ—  Programming Model
þ—  MapReduce: Distributed Execution
þ—  More Examples
— 
þ Customizations on Clusters

þ Refinements
— 
—  Performance measurement
—  Conclusion & Future Work
—  Companies using MapReduce
28

Performance
§  Tests run on a cluster of 1800 machines; each machine has:
—  4 GB of memory
—  Dual-processor 2 GHz Xeons with Hyperthreading
—  Dual 160 GB IDE disks
—  Gigabit Ethernet link
—  Bisection bandwidth approximately 100-200 Gbps
§  Two benchmarks:
—  Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
—  Sort: sort 10^10 100-byte records

29
Grep

1764 workers
M = 15000 (input split = 64 MB)
R = 1
Assume all machines have the same host
Search pattern: 3 characters, found in 92,337 records

•  1800 machines read 1 TB of data at a peak of ~31 GB/s
•  Startup overhead is significant for short jobs (entire computation = 80 s + 1 minute startup)

30
Sort

M = 15000 (input split = 64 MB)
R = 4000, # of workers = 1746
Fig. (a): better than the TeraSort benchmark's reported result of 1057 s

(a) Normal execution    (b) No backup tasks    (c) 200 tasks killed

(a) Locality optimization → input rate > shuffle rate and output rate;
    the output phase writes 2 copies of the sorted data → shuffle rate > output rate
(b) 5 stragglers → entire computation time increases 44% over normal
31
Experience: Rewrite of the Production Indexing System

§  New code is simpler, easier to understand
§  MapReduce takes care of failures, slow machines
§  Easy to make indexing faster by adding more machines

32

Outline
þ—  MapReduce : Execution
þ—  Example
þ—  Programming Model
þ—  MapReduce: Distributed Execution
þ—  More Examples
— 
þ Customizations on Clusters
— 
þ Refinements
— 
þ Performance measurement
—  Conclusion & Future Work
—  Companies using MapReduce
33
Conclusion & Future Work
—  MapReduce has proven to be a useful abstraction
—  Greatly simplifies large-scale computations
—  Fun to use: focus on problem, let library deal w/
messy details

34
MapReduce Advantages/Disadvantages
Now it's easy to program for many CPUs
•  Communication management effectively gone
Ø  I/O scheduling done for us

•  Fault tolerance, monitoring
Ø  machine failures, suddenly-slow machines, etc are handled

•  Can be much easier to design and program!
•  Can cascade several (many?) MapReduce tasks

But … it further restricts solvable problems
•  Might be hard to express problem in MapReduce
•  Data parallelism is key
Ø  Need to be able to break up a problem by data chunks

•  MapReduce is closed-source (to Google) C++
Ø  Hadoop is open-source Java-based rewrite

35
Outline
þ—  MapReduce : Execution
þ—  Example
þ—  Programming Model
þ—  MapReduce: Distributed Execution
þ—  More Examples
— 
þ Customizations on Clusters
— 
þ Refinements
— 
þ Performance measurement
— 
þ Conclusion & Future Work
—  Companies using MapReduce
36

Companies using MapReduce
v  Amazon: Amazon Elastic MapReduce :
§  a web service
§  enables businesses, researchers, data analysts, and

developers to easily and cost-effectively process vast
amounts of data.
§  It utilizes a hosted Hadoop framework running on the webscale infrastructure of Amazon Elastic Compute Cloud
(Amazon EC2) and Amazon Simple Storage Service
(Amazon S3).
§  allows you to use Hadoop with no hardware investment

—  http://aws.amazon.com/elasticmapreduce/

37
Companies using MapReduce
—  Amazon: to build product search indices
—  Facebook: processing of web logs, via both Map-Reduce and Hive
—  IBM and Google: making large compute clusters available to higher-ed and research organizations
—  New York Times: large-scale image conversions
—  Yahoo: uses MapReduce and Pig for web log processing, data model training, web map construction, and much, much more
—  Many universities for teaching parallel and large-data systems

And many more; see them all at http://wiki.apache.org/hadoop/PoweredBy
38
39

 
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
 
【PlayCanvas×NCMB 勉強会+ハンズオン】HTML5ゲームにバックエンド機能をらくらく追加!ハンズオン(2017/09/05講演)
【PlayCanvas×NCMB 勉強会+ハンズオン】HTML5ゲームにバックエンド機能をらくらく追加!ハンズオン(2017/09/05講演)【PlayCanvas×NCMB 勉強会+ハンズオン】HTML5ゲームにバックエンド機能をらくらく追加!ハンズオン(2017/09/05講演)
【PlayCanvas×NCMB 勉強会+ハンズオン】HTML5ゲームにバックエンド機能をらくらく追加!ハンズオン(2017/09/05講演)
 
One-Shot Learning
One-Shot LearningOne-Shot Learning
One-Shot Learning
 
[MLOps KR 행사] MLOps 춘추 전국 시대 정리(210605)
[MLOps KR 행사] MLOps 춘추 전국 시대 정리(210605)[MLOps KR 행사] MLOps 춘추 전국 시대 정리(210605)
[MLOps KR 행사] MLOps 춘추 전국 시대 정리(210605)
 
Data Driven Decision을 위한 데이터플랫폼구축기@kakaomobility
Data Driven Decision을 위한 데이터플랫폼구축기@kakaomobilityData Driven Decision을 위한 데이터플랫폼구축기@kakaomobility
Data Driven Decision을 위한 데이터플랫폼구축기@kakaomobility
 
Donald Knuth
Donald KnuthDonald Knuth
Donald Knuth
 
Manual de aplicação I'green (Empresa Braskem)
Manual de aplicação I'green (Empresa Braskem)Manual de aplicação I'green (Empresa Braskem)
Manual de aplicação I'green (Empresa Braskem)
 
Superpixel algorithms (whatershed, mean-shift, SLIC, BSLIC), Foolad
Superpixel algorithms (whatershed, mean-shift, SLIC, BSLIC), FooladSuperpixel algorithms (whatershed, mean-shift, SLIC, BSLIC), Foolad
Superpixel algorithms (whatershed, mean-shift, SLIC, BSLIC), Foolad
 
Data pipeline and data lake
Data pipeline and data lakeData pipeline and data lake
Data pipeline and data lake
 




My mapreduce1 presentation

  • 1. MapReduce: Simplified Data Processing on Large Clusters. Google, Inc. Presented by Noha El-Prince, Winter 2011
  • 2. Problem and Motivations — Large data size — Limited CPU power — Difficulties of distributed, parallel computing
  • 3. MapReduce — MapReduce is a software framework introduced by Google — Enables automatic parallelization and distribution of large-scale computations — Hides the details of parallelization, data distribution, load balancing, and fault tolerance — Achieves high performance
  • 4. Outline — MapReduce: Execution Example — Programming Model — MapReduce: Distributed Execution — More Examples — Customization on Cluster — Refinements — Performance measurement — Conclusion and Future Work — MapReduce in other companies
  • 6. Example — Input: Page 1: "the weather is good"; Page 2: "today is good"; Page 3: "good weather is good." — Desired output: the frequency with which each word occurs across all pages: (the 1), (is 3), (weather 2), (today 1), (good 4)
  • 7. Input data: "The weather is good" / "Today is good" / "Good weather is good". map(key, value): for each word w in value: emit(w, 1). Intermediate data: (The,1) (weather,1) (is,1) (good,1) (Today,1) (is,1) (good,1) (good,1) (weather,1) (is,1) (good,1). Group by key: (The,[1]) (weather,[1,1]) (is,[1,1,1]) (good,[1,1,1,1]) (Today,[1]). reduce(key, values): result = 0; for each count v in values: result += v; emit(key, result). Output data: (The,1) (weather,2) (is,3) (good,4) (Today,1)
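The pipeline on this slide can be sketched as a single-process simulation (the function and variable names here are illustrative, not from the paper; a real MapReduce run distributes the map and reduce calls across workers):

```python
from collections import defaultdict

def map_fn(key, value):
    # Map: emit (word, 1) for every word on the page.
    for word in value.lower().split():
        yield word.strip("."), 1

def reduce_fn(key, values):
    # Reduce: sum all counts collected for one word.
    return key, sum(values)

def run_word_count(pages):
    intermediate = defaultdict(list)
    for name, text in pages.items():              # map phase
        for k, v in map_fn(name, text):
            intermediate[k].append(v)             # group by key
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())  # reduce phase

pages = {
    "page1": "the weather is good",
    "page2": "today is good",
    "page3": "good weather is good.",
}
print(run_word_count(pages))
# {'the': 1, 'weather': 2, 'is': 3, 'good': 4, 'today': 1}
```

Note that (good, 1) is emitted four times across the three pages, so the grouped list for "good" has four entries and the reduced count is 4.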
  • 8. Programming Model § Input: a set of key/value pairs § Programmer specifies two functions: • map (k, v) → <k', v'> • reduce (k', <v'>*) → <k', v'>* — All v' with the same k' are reduced together
  • 9. Distributed Execution Overview: the user program forks a master and workers; the master assigns map and reduce tasks to idle workers; map workers read input splits (Split 0, Split 1, Split 2) and write intermediate results to local disk; reduce workers perform remote reads, sort by key, and write the final output files (Output File 0, Output File 1)
  • 10. MapReduce Examples — Distributed Grep: the map function emits a line if it matches the search pattern (key); the reduce function is the identity, copying the matching lines to the output. E.g., scanning web pages for "virus": MAP emits (virus, A…), (virus, B…); REDUCE outputs (virus, [A…, B…])
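A minimal sketch of the distributed-grep example, run in-process (names and the sample lines are mine; in the real system each map task scans its own input split):

```python
import re

PATTERN = re.compile(r"virus")   # the rare search pattern (example)

def grep_map(key, line):
    # Map: emit the line only if it matches the pattern.
    if PATTERN.search(line):
        yield "virus", line

def grep_reduce(key, values):
    # Reduce: identity -- just pass the matching lines through.
    return key, values

lines = ["A virus was found", "all clear", "B virus too"]
matches = [v for i, line in enumerate(lines) for _, v in grep_map(i, line)]
print(grep_reduce("virus", matches))
# ('virus', ['A virus was found', 'B virus too'])
```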
  • 11. MapReduce Examples — Count of URL Access Frequency: the map function processes web-server logs and emits (URL, 1) for each request; after grouping, e.g. (CBC, [1,1,1]), (CNN, [1]), (BBC, [1,1]), the reduce function sums the counts: (CBC, 3), (BBC, 2), (CNN, 1)
  • 12. MapReduce Examples — Reverse Web-Link Graph: for each link to a target page found in a source page of the web-server logs, the map function emits a (target, source) pair; the reduce function concatenates all sources for a given target, e.g. links from www.youtube.com and www.disney.com to www.facebook.com yield (facebook, [youtube, disney])
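The reverse-link construction above can be simulated in a few lines (the forward-graph dictionary and function name are illustrative):

```python
from collections import defaultdict

def reverse_links(pages):
    # Map: for each link in a source page, emit (target, source).
    # Reduce: concatenate all sources pointing at each target.
    intermediate = defaultdict(list)
    for source, targets in pages.items():
        for target in targets:
            intermediate[target].append(source)
    return dict(intermediate)

forward = {
    "youtube.com": ["facebook.com"],
    "disney.com": ["facebook.com", "twitter.com"],
}
print(reverse_links(forward))
# {'facebook.com': ['youtube.com', 'disney.com'], 'twitter.com': ['disney.com']}
```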
  • 13. MapReduce Examples — Term-Vector per Host: for each input document, the map function emits a (hostname, term-vector) pair, e.g. <facebook, word1>, <facebook, word2>, …; the reduce function adds together the term vectors for all documents of that host and emits a summary of its most popular words, e.g. <facebook, [word2, …]>
  • 15. Outline ✓ MapReduce: Execution Example ✓ Programming Model ✓ MapReduce: Distributed Execution ✓ More Examples — Customizations on Clusters — Refinements — Performance measurement — Conclusion & Future Work — MapReduce in other companies
  • 16. Customizations on Clusters —  Coordination —  Scheduling —  Fault Tolerance —  Task Granularity —  Backup Tasks 16
  • 17. Customizations on Clusters — Coordination: the master keeps a data structure recording, for each task, its type, worker, state, and file — e.g. (M, 250.133.22.7, completed, Root/intFile.txt); (M, 250.133.22.8, in progress, Root/intFile.txt); (R, 250.123.23.3, idle, Root/outFile.txt)
  • 18. Customizations on Clusters — Scheduling: the master's scheduling policy (objective: conserve network bandwidth): 1. GFS divides each file into 64 MB blocks. 2. Input data are stored on the workers' local disks (managed by GFS) — locality: the same cluster is used for both data storage and data processing. 3. GFS stores multiple copies of each block (typically 3) on different machines.
  • 19. Customizations on Clusters — Fault Tolerance: on worker failure: detect the failure via periodic heartbeats; re-execute completed and in-progress map tasks; re-execute in-progress reduce tasks; task completion is committed through the master. On master failure: could handle it, but don't yet (master failure is unlikely); the MapReduce task is aborted and the client is notified.
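A toy sketch of the master's bookkeeping after a heartbeat round (names hypothetical; the real master additionally distinguishes completed map tasks, whose output on the dead worker's local disk is lost and must be re-run, from completed reduce tasks, whose output already sits in the global file system):

```python
def tasks_to_reexecute(assignment, heartbeat_ok):
    # Master's view: every task assigned to a worker that missed its
    # heartbeat goes back to idle and is re-assigned to a live worker.
    return [task for task, worker in assignment.items()
            if not heartbeat_ok.get(worker, False)]

assignment = {"map-0": "w1", "map-1": "w2", "reduce-0": "w1"}
heartbeat_ok = {"w1": False, "w2": True}   # w1 missed its heartbeat
print(tasks_to_reexecute(assignment, heartbeat_ok))
# ['map-0', 'reduce-0']
```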
  • 20. Customizations on Clusters — Task Granularity (how tasks are divided): rule of thumb: make M and R much larger than the number of worker machines → improves dynamic load balancing → speeds recovery from worker failure. Usually R is smaller than M.
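As a back-of-envelope check of how M follows from the 64 MB split size (the function name is mine):

```python
def map_task_count(input_bytes, split_bytes=64 * 2**20):
    # M is roughly the input size divided by the 64 MB GFS block size.
    return input_bytes // split_bytes

# 1 TB of input yields about 15,000 map tasks -- consistent with the
# M = 15000 used in the Grep and Sort measurements later in the deck.
print(map_task_count(10**12))
# 14901
```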
  • 21. Customizations on Clusters — Backup tasks: the problem of stragglers (machines taking a long time to complete one of the last few tasks). When a MapReduce operation is about to complete, the master schedules backup executions of the remaining tasks; a task is marked "complete" whenever either the primary or the backup execution completes. Effect: dramatically shortens job completion time.
  • 22. Outline ✓ MapReduce: Execution Example ✓ Programming Model ✓ MapReduce: Distributed Execution ✓ More Examples ✓ Customizations on Clusters — Refinements — Performance measurement — Conclusion & Future Work — Companies using MapReduce
  • 23. Refinements —  Partitioning functions. —  Skipping bad records. —  Status info. —  Other Refinements 23
  • 24. Refinements: Partitioning Function — MapReduce users specify the number of reduce tasks/output files desired (R) — For reduce, we need to ensure that records with the same intermediate key end up at the same worker — The system uses a default partition function, e.g. hash(key) mod R (results in fairly well-balanced partitions) — Sometimes it is useful to override it, e.g. hash(hostname(URL key)) mod R, which ensures URLs from the same host end up in the same output file
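The two partitioners can be sketched as follows (the naive hostname extraction is an assumption for illustration; a real system would parse URLs properly):

```python
def default_partition(key, R):
    # Default partitioner: hash(key) mod R spreads keys evenly
    # across the R reduce tasks.
    return hash(key) % R

def host_partition(url, R):
    # Override: hash the hostname instead of the full URL, so all
    # URLs from one host land in the same reduce partition / output
    # file. (Hostname extraction here is a naive split, for illustration.)
    hostname = url.split("/")[2] if "//" in url else url.split("/")[0]
    return hash(hostname) % R

R = 4
p1 = host_partition("http://cbc.ca/news", R)
p2 = host_partition("http://cbc.ca/sports", R)
print(p1 == p2)   # same host -> same partition
```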
  • 25. Refinements: Skipping Bad Records — Map/Reduce functions sometimes fail for particular inputs — MapReduce has special treatment for "bad" input data, i.e. input data that repeatedly leads to the crash of a task: the master, tracking crashes of tasks, recognizes such situations and, after a number of failed retries, decides to ignore that piece of data — Effect: can work around bugs in third-party libraries
  • 26. Refinements: Status Information — Status pages show the computation progress — Links to the standard error and output files generated by each task — The user can: predict the computation length; add more resources if needed; know which workers have failed — Useful in diagnosing bugs in user code
  • 27. Other Refinements § Combiner function: partial merging/compression of intermediate data, useful for saving network bandwidth § User-defined counters: periodically propagated to the master from worker machines; useful for checking the behavior of MapReduce operations (appear on the master status page)
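A minimal sketch of the combiner idea for word count (illustrative names): instead of shipping one ("the", 1) pair per occurrence over the network, the map task pre-aggregates locally, so far fewer pairs reach the reduce workers:

```python
from collections import Counter

def map_without_combiner(doc):
    # Every occurrence ships one (word, 1) pair over the network.
    return [(word, 1) for word in doc.split()]

def map_with_combiner(doc):
    # Combiner pre-sums the counts inside the map task before shipping.
    return list(Counter(doc.split()).items())

doc = "the cat and the hat and the bat"
print(len(map_without_combiner(doc)), "->", len(map_with_combiner(doc)))
# 8 -> 5
```

The effect grows with repetition: word frequencies follow a Zipf distribution, so common words like "the" collapse from thousands of pairs to one per map task.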
  • 28. Outline ✓ MapReduce: Execution Example ✓ Programming Model ✓ MapReduce: Distributed Execution ✓ More Examples ✓ Customizations on Clusters ✓ Refinements — Performance measurement — Conclusion & Future Work — Companies using MapReduce
  • 29. Performance § Tests run on a cluster of 1800 machines; each machine has: 4 GB of memory; dual-processor 2 GHz Xeons with Hyper-Threading; dual 160 GB IDE disks; a gigabit Ethernet link; bisection bandwidth of approximately 100-200 Gbps § Two benchmarks: Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records); Sort: sort 10^10 100-byte records
  • 30. Grep — 1764 workers, M = 15000 (input split = 64 MB), R = 1; all machines assumed identical; search pattern: 3 characters, found in 92,337 records — 1800 machines read 1 TB of data at a peak of ~31 GB/s — Startup overhead is significant for short jobs (entire computation ≈ 80 s plus about 1 minute of startup)
  • 31. Sort — M = 15000 (input split = 64 MB), R = 4000, number of workers = 1746 — (a) Normal execution: better than the TeraSort benchmark's reported result of 1057 s; locality optimization → input rate > shuffle rate and output rate; the output phase writes 2 copies of the sorted data → shuffle rate > output rate — (b) No backup tasks: 5 stragglers → entire computation time increases 44% over normal — (c) 200 tasks killed
  • 32. Experience: Rewrite of Production Indexing System §  New code is simpler, easier to understand §  MapReduce takes care of failures, slow machines §  Easy to make indexing faster by adding more machines 32
  • 33. Outline ✓ MapReduce: Execution Example ✓ Programming Model ✓ MapReduce: Distributed Execution ✓ More Examples ✓ Customizations on Clusters ✓ Refinements ✓ Performance measurement — Conclusion & Future Work — Companies using MapReduce
  • 34. Conclusion & Future Work —  MapReduce has proven to be a useful abstraction —  Greatly simplifies large-scale computations —  Fun to use: focus on problem, let library deal w/ messy details 34
  • 35. MapReduce Advantages/Disadvantages — Now it's easy to program for many CPUs: communication management is effectively gone (I/O scheduling is done for us); fault tolerance and monitoring (machine failures, suddenly-slow machines, etc. are handled); can be much easier to design and program; can cascade several (many?) MapReduce tasks — But it further restricts the solvable problems: it might be hard to express a problem in MapReduce; data parallelism is key (need to be able to break up a problem by data chunks); MapReduce is closed-source (to Google) C++, while Hadoop is an open-source Java-based rewrite
  • 36. Outline ✓ MapReduce: Execution Example ✓ Programming Model ✓ MapReduce: Distributed Execution ✓ More Examples ✓ Customizations on Clusters ✓ Refinements ✓ Performance measurement ✓ Conclusion & Future Work — Companies using MapReduce
  • 37. Companies using MapReduce — Amazon: Amazon Elastic MapReduce: § a web service § enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data § utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3) § allows you to use Hadoop with no hardware investment — http://aws.amazon.com/elasticmapreduce/
  • 38. Companies using MapReduce — Amazon: to build product search indices — Facebook: processing of web logs, via both MapReduce and Hive — IBM and Google: making large compute clusters available to higher-ed and research organizations — New York Times: large-scale image conversions — Yahoo: uses MapReduce and Pig for web log processing, data model training, web map construction, and much, much more — Many universities for teaching parallel and large-data systems — And many more; see them all at http://wiki.apache.org/hadoop/PoweredBy