Introduction to MapReduce, an Abstraction for Large-Scale Computation
Ilan Horn, Google, Inc. (most slides borrowed from Jeff Dean)
Outline
  - Overview of our computing environment
  - MapReduce: overview, examples, implementation details, usage stats
  - Implications for parallel program development
Problem: lots of data
  - Example: 20+ billion web pages x 20 KB = 400+ terabytes
  - One computer can read 30-35 MB/sec from disk
    - ~four months to read the web
  - ~1,000 hard drives just to store the web
  - Even more to do something with the data
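The figures on this slide can be checked with a few lines of arithmetic. The sketch below uses the slide's page count, page size, and the upper read rate; the 400 GB drive size is an assumption chosen for illustration, since the slide does not state one. With these inputs the read time comes out at roughly 132 days, a little over four months.

```python
# Back-of-envelope check of the slide's numbers.
PAGES = 20e9           # 20 billion pages (slide says "20+ billion")
PAGE_BYTES = 20e3      # 20 KB per page
READ_RATE = 35e6       # 35 MB/s, upper end of the slide's 30-35 MB/s
DRIVE_BYTES = 400e9    # ASSUMPTION: 400 GB per drive (not given on the slide)

total_bytes = PAGES * PAGE_BYTES
read_days = total_bytes / READ_RATE / 86_400   # 86,400 seconds per day
drives_needed = total_bytes / DRIVE_BYTES

print(f"corpus size: {total_bytes / 1e12:.0f} TB")
print(f"read time on one machine: {read_days:.0f} days")
print(f"drives to store it: {drives_needed:.0f}")
```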
Solution: spread the work over many machines
  - Good news: same problem with 1,000 machines, < 3 hours
  - Bad news: programming work
    - communication and coordination
    - recovering from machine failure
    - status reporting
    - debugging
    - optimization
    - locality
  - Bad news II: repeat for every problem you want to solve
Computing Clusters
  - Many racks of computers, thousands of machines per cluster
  - Limited bisection bandwidth between racks
Machines
  - 2 CPUs
    - Typically hyperthreaded or dual-core
    - Future machines will have more cores
  - 1-6 locally-attached disks
    - 200 GB to ~2 TB of disk
  - 4 GB-16 GB of RAM
  - Typical machine runs:
    - Google File System (GFS) chunkserver
    - Scheduler daemon for starting user tasks
    - One or many user tasks
Implications of our Computing Environment
  - Single-thread performance doesn’t matter
    - We have large problems and total throughput/$ more important than peak performance
  - Stuff breaks
    - If you have one server, it may stay up three years (1,000 days)
    - If you have 10,000 servers, expect to lose ten a day
  - “Ultra-reliable” hardware doesn’t really help
    - At large scales, super-fancy reliable hardware still fails, albeit less often
    - software still needs to be fault-tolerant
    - commodity machines without fancy hardware give better perf/$
  - How can we make it easy to write distributed programs?
MapReduce
  - A simple programming model that applies to many large-scale computing problems
  - Hide messy details in MapReduce runtime library:
    - automatic parallelization
    - load balancing
    - network and disk transfer optimization
    - handling of machine failures
    - robustness
    - improvements to core library benefit all users of library!
Typical problem solved by MapReduce
  - Read a lot of data
  - Map: extract something you care about from each record
  - Shuffle and Sort
  - Reduce: aggregate, summarize, filter, or transform
  - Write the results
  - Outline stays the same, map and reduce change to fit the problem
More specifically…
  - Programmer specifies two primary methods:
    - map(k, v) -> <k', v'>*
    - reduce(k', <v'>*) -> <k', v'>*
  - All v' with same k' are reduced together, in order.
  - Usually also specify:
    - partition(k', total partitions) -> partition for k'
      - often a simple hash of the key
      - allows reduce operations for different k' to be parallelized
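The three signatures above can be sketched as a tiny in-memory driver. This is an illustration only, not Google's C++ library: `run_mapreduce` and the sample `map_fn`/`reduce_fn` are hypothetical names, and `crc32` is an assumed stand-in for "a simple hash of the key".

```python
from collections import defaultdict
from zlib import crc32

def partition(key, total_partitions):
    """Stable hash of the key (crc32 is an illustrative choice)."""
    return crc32(str(key).encode("utf-8")) % total_partitions

def run_mapreduce(records, map_fn, reduce_fn, total_partitions=4):
    # Map phase: each (k, v) record yields zero or more (k', v') pairs,
    # which are routed to a partition chosen from k'.
    partitions = [defaultdict(list) for _ in range(total_partitions)]
    for k, v in records:
        for k2, v2 in map_fn(k, v):
            partitions[partition(k2, total_partitions)][k2].append(v2)
    # Reduce phase: partitions share no keys, so a real system could
    # process them in parallel; here they simply run one after another.
    output = []
    for part in partitions:
        for k2 in sorted(part):          # keys are visited in sorted order
            output.extend(reduce_fn(k2, part[k2]))
    return output

# Usage: total length of the values, grouped by first letter.
def map_fn(k, v):
    yield v[0], len(v)

def reduce_fn(k, values):
    yield k, sum(values)

result = run_mapreduce([(1, "apple"), (2, "avocado"), (3, "banana")],
                       map_fn, reduce_fn)
print(sorted(result))
```

All values for one key land in the same partition because the partition depends only on the key, which is what lets different keys be reduced independently.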
Example: Word Frequencies in Web Pages
  - A typical exercise for a new engineer in his or her first week
  - Input is files with one document per record
  - Specify a map function that takes a key/value pair
    - key = document URL
    - value = document contents
  - Output of map function is (potentially many) key/value pairs. In our case, output (word, “1”) once per word in the document
  - Input: “document1”, “to be or not to be”
  - Map output: (“to”, “1”), (“be”, “1”), (“or”, “1”), …
Example continued: word frequencies in web pages
  - MapReduce library gathers together all pairs with the same key (shuffle/sort)
  - The reduce function combines the values for a key. In our case, compute the sum
      key = “be”, values = “1”, “1” -> “2”
      key = “not”, values = “1” -> “1”
      key = “or”, values = “1” -> “1”
      key = “to”, values = “1”, “1” -> “2”
  - Output of reduce (usually 0 or 1 value) paired with key and saved
      “be”, “2”
      “not”, “1”
      “or”, “1”
      “to”, “2”
Example: Pseudo-code
    Map(String input_key, String input_value):
      // input_key: document name
      // input_value: document contents
      for each word w in input_value:
        EmitIntermediate(w, "1");

    Reduce(String key, Iterator intermediate_values):
      // key: a word, same for input and output
      // intermediate_values: a list of counts
      int result = 0;
      for each v in intermediate_values:
        result += ParseInt(v);
      Emit(AsString(result));
  - Total 80 lines of C++ code including comments, main()
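For readers who want to run the word-frequency example, here is a minimal Python rendering of the pseudo-code. It is a local sketch, not the C++ library: `word_frequencies` is a hypothetical driver in which a dictionary stands in for the shuffle/sort step, and words are split on whitespace as a simplifying assumption.

```python
from collections import defaultdict

def map_words(input_key, input_value):
    # input_key: document name; input_value: document contents
    for word in input_value.split():
        yield word, "1"

def reduce_counts(key, intermediate_values):
    # key: a word; intermediate_values: counts encoded as strings
    result = 0
    for v in intermediate_values:
        result += int(v)
    return str(result)

def word_frequencies(documents):
    grouped = defaultdict(list)      # stands in for shuffle/sort
    for name, contents in documents:
        for word, count in map_words(name, contents):
            grouped[word].append(count)
    return {word: reduce_counts(word, grouped[word])
            for word in sorted(grouped)}

counts = word_frequencies([("document1", "to be or not to be")])
print(counts)
```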
Widely applicable at Google
  - Implemented as a C++ library linked to user programs
  - Can read and write many different data types
  - Example uses:
    - web access log stats
    - web link-graph reversal
    - inverted index construction
    - statistical machine translation
    - distributed grep
    - distributed sort
    - term-vector per host
    - document clustering
    - machine learning
    - …
Example: Generating Language Model Statistics
  - Used in our statistical machine translation system
    - need to count # of times every 5-word sequence occurs in large corpus of documents (and keep all those where count >= 4)
  - Easy with MapReduce:
    - map: extract 5-word sequences => count from document
    - reduce: combine counts, and keep if count large enough
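A rough sketch of the map and reduce functions described above, using the slide's sequence length (5) and threshold (4). The whitespace tokenizer, the function names, and the in-memory grouping are assumptions made for illustration.

```python
from collections import Counter

N = 5            # sequence length from the slide
MIN_COUNT = 4    # keep threshold from the slide

def map_ngrams(doc_id, text):
    words = text.split()
    # Count each 5-word sequence within this one document.
    local = Counter(tuple(words[i:i + N])
                    for i in range(len(words) - N + 1))
    yield from local.items()

def reduce_ngrams(ngram, counts):
    total = sum(counts)
    if total >= MIN_COUNT:       # emit nothing for rare sequences
        yield ngram, total

def language_model_stats(documents):
    grouped = {}                 # stands in for shuffle/sort
    for doc_id, text in documents:
        for ngram, count in map_ngrams(doc_id, text):
            grouped.setdefault(ngram, []).append(count)
    return dict(pair
                for ngram, counts in grouped.items()
                for pair in reduce_ngrams(ngram, counts))

docs = [(i, "the quick brown fox jumps over") for i in range(4)]
docs.append((4, "a b c d e"))    # appears once, so it is filtered out
stats = language_model_stats(docs)
print(stats)
```

The reduce step here shows the "0 or 1 value" pattern: a sequence below the threshold produces no output at all.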
Example: Joining with Other Data
  - Example: generate per-doc summary, but include per-host information (e.g. # of pages on host, important terms on host)
    - per-host information might be in per-process data structure, or might involve RPC to a set of machines containing data for all sites
  - map: extract host name from URL, lookup per-host info, combine with per-doc data and emit
  - reduce: identity function (just emit key/value directly)
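The join described above might look like the following sketch. It assumes the per-host data sits in an in-process dictionary (`PER_HOST_INFO`, a hypothetical name with made-up sample values); the RPC variant mentioned on the slide would replace the dictionary lookup.

```python
from urllib.parse import urlsplit

# ASSUMPTION: per-host data held in a per-process dictionary.
PER_HOST_INFO = {
    "example.com": {"pages_on_host": 2, "important_terms": ["example"]},
}

def map_join(url, doc_summary):
    host = urlsplit(url).hostname            # extract host name from URL
    host_info = PER_HOST_INFO.get(host, {})  # look up per-host info
    yield url, {**doc_summary, "host": host, **host_info}

def reduce_identity(key, values):
    # Identity reduce: emit each key/value pair unchanged.
    for value in values:
        yield key, value

docs = [("http://example.com/a", {"title": "A"}),
        ("http://example.com/b", {"title": "B"})]
joined = [out
          for url, summary in docs
          for k, v in map_join(url, summary)
          for out in reduce_identity(k, [v])]
print(joined[0])
```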
[Figure: MapReduce Programs in Google’s Source Tree]
[Figure: New MapReduce Programs Per Month; annotation: “Summer intern effect”]
MapReduce: Scheduling
  - One master, many workers
    - Input data split into M map tasks (typically 64 MB in size)
    - Reduce phase partitioned into R reduce tasks
    - Tasks are assigned to workers dynamically
    - Often: M=200,000; R=4,000; workers=2,000
  - Master assigns each map task to a free worker
    - Considers locality of data to worker when assigning task
    - Worker reads task input (often from local disk!)
    - Worker produces R local files containing intermediate k/v pairs
  - Master assigns each reduce task to a free worker
    - Worker reads intermediate k/v pairs from map workers
    - Worker sorts & applies user’s Reduce op to produce the output
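To make "tasks are assigned to workers dynamically" concrete, the sketch below splits an input into 64 MB map tasks and hands each task to whichever worker frees up first. It is a toy simulation under stated assumptions: the task durations are invented, and the real master also weighs data locality, which is ignored here.

```python
import heapq
import math

SPLIT_BYTES = 64 * 2**20   # 64 MB per map task, as on the slide

def num_map_tasks(input_bytes):
    return math.ceil(input_bytes / SPLIT_BYTES)

def assign_dynamically(task_durations, num_workers):
    """Give each task to the worker that becomes free earliest.

    Returns (finish time of the last task, tasks handled per worker).
    """
    free_at = [(0.0, w) for w in range(num_workers)]  # (time free, worker id)
    heapq.heapify(free_at)
    tasks_per_worker = [0] * num_workers
    for duration in task_durations:
        t, w = heapq.heappop(free_at)
        tasks_per_worker[w] += 1
        heapq.heappush(free_at, (t + duration, w))
    return max(t for t, _ in free_at), tasks_per_worker

print(num_map_tasks(200_000 * SPLIT_BYTES))
# Hypothetical durations: nine short tasks and one long one, two workers.
makespan, load = assign_dynamically([1.0] * 9 + [5.0], num_workers=2)
print(makespan, load)
```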
Parallel MapReduce
[Diagram: input data feeds several parallel Map tasks; a Shuffle step routes their intermediate output to parallel Reduce tasks, which write partitioned output; a Master coordinates the tasks]
Task Granularity and Pipelining
  - Fine granularity tasks: many more map tasks than machines
    - Minimizes time for fault recovery
    - Can pipeline shuffling with map execution
    - Better dynamic load balancing
  - Often use 200,000 map/5,000 reduce tasks w/ 2,000 machines
Fault tolerance: Handled via re-execution
  - On worker failure:
    - Detect failure via periodic heartbeats
    - Re-execute completed and in-progress map tasks
    - Re-execute in-progress reduce tasks
    - Task completion committed through master
  - On master failure:
    - State is checkpointed to GFS: new master recovers & continues
  - Very robust: lost 1600 of 1800 machines once, but finished fine
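The worker-failure rules above can be sketched as master-side bookkeeping. The task records and the function name are hypothetical; the point is only which tasks go back to the idle pool. Completed map tasks are reset because, per the scheduling slide, map output is written to the worker's local files, which are lost with the machine.

```python
def handle_worker_failure(tasks, failed_worker):
    """Mark tasks for re-execution; return how many were reset.

    Each task is a dict with keys "kind" ("map" or "reduce"),
    "state" ("idle", "in_progress", "completed") and "worker".
    """
    reset = 0
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        # Map tasks are rerun even if completed; reduce tasks only if
        # they were still in progress on the failed worker.
        if task["kind"] == "map" or task["state"] == "in_progress":
            task["state"], task["worker"] = "idle", None
            reset += 1
    return reset

tasks = [
    {"kind": "map", "state": "completed", "worker": "w1"},
    {"kind": "map", "state": "in_progress", "worker": "w1"},
    {"kind": "reduce", "state": "in_progress", "worker": "w1"},
    {"kind": "reduce", "state": "completed", "worker": "w1"},
    {"kind": "map", "state": "completed", "worker": "w2"},
]
reset_count = handle_worker_failure(tasks, "w1")
print(reset_count, [t["state"] for t in tasks])
```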
Refinement: Backup Tasks
  - Slow workers significantly lengthen completion time
    - Other jobs consuming resources on machine
    - Bad disks with soft errors transfer data very slowly
    - Weird things: processor caches disabled (!!)
  - Solution: Near end of phase, spawn backup copies of tasks
    - Whichever one finishes first "wins"
  - Effect: Dramatically shortens job completion time
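A small simulation shows why backup copies help when one task straggles. Everything numeric here is made up for illustration, and the model assumes each task runs on its own worker; it is not a measurement of the real system.

```python
def completion_time(task_times, backup_start=None, backup_time=None):
    """Finish time of a phase whose tasks all run in parallel.

    task_times: hypothetical runtimes of the primary copies.
    If backup_start is given, every task still running at that moment
    gets a backup copy that takes backup_time; the first copy to
    finish wins.
    """
    finish = []
    for t in task_times:
        if backup_start is not None and t > backup_start:
            t = min(t, backup_start + backup_time)
        finish.append(t)
    return max(finish)

times = [10, 11, 9, 10, 60]     # one straggler
without_backup = completion_time(times)
with_backup = completion_time(times, backup_start=12, backup_time=10)
print(without_backup, with_backup)
```

Without backups the phase waits for the slowest task; with them it waits only until the backup copy of that task completes.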
Refinement: Locality Optimization
  - Master scheduling policy:
    - Asks GFS for locations of replicas of input file blocks
    - Map tasks typically split into 64 MB (== GFS block size)
    - Map tasks scheduled so GFS input block replicas are on same machine or same rack
  - Effect: Thousands of machines read input at local disk speed
    - Without this, rack switches limit read rate
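One way to picture the scheduling policy is a preference order: a free worker that holds a replica, then a free worker in the same rack as a replica, then anyone. The sketch below is an assumed simplification with hypothetical machine and rack names, not the master's actual algorithm.

```python
def pick_worker(replica_locations, free_workers, rack_of):
    """Return (worker, placement) for one map task.

    replica_locations: set of machines holding the input block.
    free_workers: list of idle machines.
    rack_of: dict mapping machine name to rack name.
    """
    replica_racks = {rack_of[m] for m in replica_locations}
    same_rack = None
    for worker in free_workers:
        if worker in replica_locations:
            return worker, "local"            # best case: read from local disk
        if same_rack is None and rack_of[worker] in replica_racks:
            same_rack = worker                # remember first same-rack option
    if same_rack is not None:
        return same_rack, "same-rack"
    if free_workers:
        return free_workers[0], "remote"      # crosses rack switches
    return None, "none"

rack_of = {"m1": "r1", "m2": "r1", "m3": "r2", "m4": "r3"}
replicas = {"m1", "m3"}
print(pick_worker(replicas, ["m4", "m2", "m3"], rack_of))
print(pick_worker(replicas, ["m4", "m2"], rack_of))
print(pick_worker(replicas, ["m4"], rack_of))
```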
Refinement: Skipping Bad Records
  - Map/Reduce functions sometimes fail for particular inputs
    - Best solution is to debug & fix, but not always possible
  - On seg fault:
    - Send UDP packet to master from signal handler
    - Include sequence number of record being processed
  - If master sees K failures for same record (typically K set to 2 or 3):
    - Next worker is told to skip the record
  - Effect: Can work around bugs in third-party libraries
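The skip protocol can be imitated in a single process. In this sketch a Python exception stands in for the seg fault, and a direct method call stands in for the UDP packet sent from the signal handler; `SkipTracker` and `run_task` are hypothetical names.

```python
from collections import Counter

class SkipTracker:
    """Master-side bookkeeping: skip a record after K reported failures."""

    def __init__(self, k=2):
        self.k = k
        self.failures = Counter()

    def report_failure(self, record_seq):
        # Stand-in for receiving the worker's failure message.
        self.failures[record_seq] += 1

    def should_skip(self, record_seq):
        return self.failures[record_seq] >= self.k

def run_task(records, process, tracker):
    """Re-execute the task until it completes, skipping flagged records."""
    while True:
        results, current = [], None
        try:
            for seq, record in enumerate(records):
                if tracker.should_skip(seq):
                    continue
                current = seq
                results.append(process(record))
            return results
        except Exception:
            # Stand-in for the crash: report which record was in flight.
            tracker.report_failure(current)

tracker = SkipTracker(k=2)
# The second record (sequence number 1) always fails with a division by zero.
survivors = run_task([1, 0, 5], lambda r: 10 // r, tracker)
print(survivors, tracker.failures[1])
```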
Other Refinements
  - Optional secondary keys for ordering
  - Compression of intermediate data
  - Combiner: useful for saving network bandwidth
  - Local execution for debugging/testing
  - User-defined counters
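As an illustration of the combiner, the sketch below pre-sums word counts on the map side so that fewer pairs need to be sent to the reducers. The function names are hypothetical, and pre-summing is safe here only because addition is associative and commutative.

```python
from collections import Counter

def map_words(text):
    return [(word, 1) for word in text.split()]

def combine(pairs):
    """Pre-sum counts on the map worker before anything is sent out."""
    combined = Counter()
    for word, count in pairs:
        combined[word] += count
    return sorted(combined.items())

raw = map_words("to be or not to be")
shipped = combine(raw)
print(len(raw), "pairs before combining,", len(shipped), "after")
```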
Performance Results & Experience
  - Using 1,800 machines:
    - MR_Grep scanned 1 terabyte in 100 seconds
    - MR_Sort sorted 1 terabyte of 100-byte records in 14 minutes
  - Rewrote Google's production indexing system
    - a sequence of 7, 10, 14, 17, 21, 24 MapReductions
    - simpler
    - more robust
    - faster
    - more scalable
[Figure: MR_Sort]
Usage Statistics Over Time
                                  Aug '04  Mar '05  Mar '06  Sep '07
Number of jobs (1000s)                 29       72      172    2,217
Average completion time (secs)        634      934      874      395
Machine years used                    217      981    2,002   11,081
Input data read (TB)                3,288   12,571   52,254  403,152
Intermediate data (TB)                758    2,756    6,743   34,774
Output data written (TB)              193      941    2,970   14,018
Average worker machines               157      232      268      394
Implications for Multi-core Processors
  - Multi-core processors require parallelism, but many programmers are uncomfortable writing parallel programs
  - MapReduce provides an easy-to-understand programming model for a very diverse set of computing problems
    - users don’t need to be parallel programming experts
    - system automatically adapts to number of cores & machines available
  - Optimizations useful even in single machine, multi-core environment
    - locality, load balancing, status monitoring, robustness, …
Conclusion
  - MapReduce has proven to be a remarkably useful abstraction
  - Greatly simplifies large-scale computations at Google
  - Fun to use: focus on problem, let library deal with messy details
  - Many thousands of parallel programs written by hundreds of different programmers in last few years
    - Many had no prior parallel or distributed programming experience
  - Further info: MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, OSDI’04
    http://labs.google.com/papers/mapreduce.html (or search Google for [MapReduce])

Quantum Communications Q&A with Gemini LLM
 
What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
 
7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf
 
Best Programming Language for Civil Engineers
Best Programming Language for Civil EngineersBest Programming Language for Civil Engineers
Best Programming Language for Civil Engineers
 
Manual | Product | Research Presentation
Manual | Product | Research PresentationManual | Product | Research Presentation
Manual | Product | Research Presentation
 
Mitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing SystemsMitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing Systems
 
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
 
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
 
Comparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdfComparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdf
 
How RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptxHow RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptx
 
Research Directions for Cross Reality Interfaces
Research Directions for Cross Reality InterfacesResearch Directions for Cross Reality Interfaces
Research Directions for Cross Reality Interfaces
 
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyyActive Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
Active Inference is a veryyyyyyyyyyyyyyyyyyyyyyyy
 

Introduction To Map Reduce

  • 1. Introduction to MapReduce, an Abstraction for Large-Scale Computation Ilan Horn Google, Inc. (most slides borrowed from Jeff Dean)
  • 2. Outline Overview of our computing environment MapReduce overview, examples implementation details usage stats Implications for parallel program development
  • 3. Problem: lots of data Example: 20+ billion web pages x 20KB = 400+ terabytes One computer can read 30-35 MB/sec from disk ~four months to read the web ~1,000 hard drives just to store the web Even more to do something with the data
  • 4. Solution: spread the work over many machines Good news: same problem with 1000 machines, < 3 hours Bad news: programming work communication and coordination recovering from machine failure status reporting debugging optimization locality Bad news II: repeat for every problem you want to solve
  • 5. Computing Clusters Many racks of computers, thousands of machines per cluster Limited bisection bandwidth between racks
  • 6. Machines 2 CPUs Typically hyperthreaded or dual-core Future machines will have more cores 1-6 locally-attached disks 200GB to ~2 TB of disk 4GB-16GB of RAM Typical machine runs: Google File System (GFS) chunkserver Scheduler daemon for starting user tasks One or many user tasks
  • 7. Implications of our Computing Environment Single-thread performance doesn’t matter We have large problems and total throughput/$ more important than peak performance Stuff Breaks If you have one server, it may stay up three years (1,000 days) If you have 10,000 servers, expect to lose ten a day “Ultra-reliable” hardware doesn’t really help At large scales, super-fancy reliable hardware still fails, albeit less often software still needs to be fault-tolerant commodity machines without fancy hardware give better perf/$ How can we make it easy to write distributed programs?
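The back-of-the-envelope numbers on slides 3, 4 and 7 can be re-derived in a few lines. This is only a sketch: it assumes decimal units (1 KB = 1,000 bytes), the low end of the quoted disk rate, and perfectly even work division across machines. The slides round their figures, so the results land in the same range rather than matching exactly.

```python
# Inputs taken from the slides; decimal units are an assumption.
PAGES = 20e9                 # 20+ billion web pages
PAGE_BYTES = 20e3            # 20 KB per page
READ_BYTES_PER_SEC = 30e6    # low end of 30-35 MB/sec from one disk

total_bytes = PAGES * PAGE_BYTES
total_tb = total_bytes / 1e12

# Time for a single machine to read everything once.
seconds_one_machine = total_bytes / READ_BYTES_PER_SEC
days_one_machine = seconds_one_machine / 86_400

# Same work split evenly over 1,000 machines (ideal case, no overhead).
hours_1000_machines = seconds_one_machine / 1_000 / 3_600

# Failure arithmetic: each server lasts about 1,000 days on average.
SERVERS = 10_000
DAYS_PER_FAILURE = 1_000
failures_per_day = SERVERS / DAYS_PER_FAILURE

print(f"data volume: {total_tb:.0f} TB")
print(f"one machine: {days_one_machine:.0f} days")
print(f"1,000 machines: {hours_1000_machines:.1f} hours")
print(f"expected failures per day: {failures_per_day:.0f}")
```

At 30 MB/sec the single-machine read takes roughly five months and the 1,000-machine read a little under four hours; the faster 35 MB/sec rate brings both closer to the slides' rounded "~four months" and "< 3 hours".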
  • 8. MapReduce A simple programming model that applies to many large-scale computing problems Hide messy details in MapReduce runtime library: automatic parallelization load balancing network and disk transfer optimization handling of machine failures robustness improvements to core library benefit all users of library!
  • 9. Typical problem solved by MapReduce Read a lot of data Map : extract something you care about from each record Shuffle and Sort Reduce : aggregate, summarize, filter, or transform Write the results Outline stays the same, map and reduce change to fit the problem
  • 10. More specifically… Programmer specifies two primary methods: map (k, v) -> <k', v'>* reduce (k', <v'>*) -> <k', v'>* All v' with same k' are reduced together, in order. Usually also specify: partition (k’, total partitions) -> partition for k’ often a simple hash of the key allows reduce operations for different k’ to be parallelized
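The three signatures on this slide can be sketched as plain Python functions with a tiny single-process driver. This is an illustration of the programming model only, not the Google C++ library: the names `run_local`, `identity_map` and `sum_reduce` are hypothetical, and `zlib.crc32` is just one stable hash that could serve as the "simple hash of the key".

```python
import zlib
from collections import defaultdict


def partition(key, total_partitions):
    """partition(k', total partitions) -> partition for k', via a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % total_partitions


def run_local(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, reduce in key order."""
    grouped = defaultdict(list)
    for key, value in records:
        # map(k, v) -> <k', v'>*
        for k2, v2 in map_fn(key, value):
            grouped[k2].append(v2)
    output = []
    for k2 in sorted(grouped):
        # reduce(k', <v'>*) -> <k', v'>*
        output.extend(reduce_fn(k2, grouped[k2]))
    return output


def identity_map(key, value):
    yield key, value


def sum_reduce(key, values):
    yield key, sum(values)


result = run_local([("a", 1), ("b", 2), ("a", 3)], identity_map, sum_reduce)
print(result)  # [('a', 4), ('b', 2)]
```

Because `partition` depends only on the key, every value for a given k' ends up in the same partition, which is what lets reduce work for different keys run in parallel.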
  • 11. Example: Word Frequencies in Web Pages A typical exercise for a new engineer in his or her first week Input is files with one document per record Specify a map function that takes a key/value pair key = document URL value = document contents Output of map function is (potentially many) key/value pairs. In our case, output (word, “1”) once per word in the document “document1”, “to be or not to be” → “to”, “1”; “be”, “1”; “or”, “1”; …
  • 12. Example continued: word frequencies in web pages MapReduce library gathers together all pairs with the same key (shuffle/sort) The reduce function combines the values for a key In our case, compute the sum Output of reduce (usually 0 or 1 value) paired with key and saved key = “be” values = “1”, “1” → “2”; key = “not” values = “1” → “1”; key = “or” values = “1” → “1”; key = “to” values = “1”, “1” → “2” Output: “be”, “2”; “not”, “1”; “or”, “1”; “to”, “2”
  • 13. Example: Pseudo-code
      Map(String input_key, String input_value):
        // input_key: document name
        // input_value: document contents
        for each word w in input_value:
          EmitIntermediate(w, "1");
      Reduce(String key, Iterator intermediate_values):
        // key: a word, same for input and output
        // intermediate_values: a list of counts
        int result = 0;
        for each v in intermediate_values:
          result += ParseInt(v);
        Emit(AsString(result));
      Total 80 lines of C++ code including comments, main()
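The word-frequency example from slides 11 to 13 can be run end to end in Python. This is a sketch under stated assumptions: words are split on whitespace, counts travel as strings to mirror the pseudo-code, and the `word_count` driver is a hypothetical single-process stand-in for the shuffle/sort that the MapReduce library performs.

```python
from collections import defaultdict


def map_fn(input_key, input_value):
    # input_key: document name; input_value: document contents
    for word in input_value.split():
        yield word, "1"


def reduce_fn(key, intermediate_values):
    # key: a word; intermediate_values: a list of counts as strings
    result = 0
    for v in intermediate_values:
        result += int(v)
    yield key, str(result)


def word_count(documents):
    """Hypothetical local driver standing in for the library's shuffle/sort."""
    grouped = defaultdict(list)
    for url, contents in documents:
        for word, count in map_fn(url, contents):
            grouped[word].append(count)
    output = []
    for word in sorted(grouped):
        output.extend(reduce_fn(word, grouped[word]))
    return output


counts = word_count([("document1", "to be or not to be")])
print(counts)  # [('be', '2'), ('not', '1'), ('or', '1'), ('to', '2')]
```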
  • 14. Widely applicable at Google Implemented as a C++ library linked to user programs Can read and write many different data types Example uses: web access log stats web link-graph reversal inverted index construction statistical machine translation … distributed grep distributed sort term-vector per host document clustering machine learning ...
  • 15. Example: Generating Language Model Statistics Used in our statistical machine translation system need to count # of times every 5-word sequence occurs in large corpus of documents (and keep all those where count >= 4) Easy with MapReduce: map : extract 5-word sequences => count from document reduce : combine counts, and keep if count large enough
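The language-model job fits the same two-function shape. In the sketch below the whitespace tokenizer and the `count_ngrams` driver are assumptions made for illustration; only the 5-word window and the `count >= 4` threshold come from the slide.

```python
from collections import defaultdict

N = 5          # sequence length from the slide
MIN_COUNT = 4  # keep sequences seen at least this often


def map_fn(doc_id, text):
    """Emit (5-word sequence, 1) for every window in the document."""
    words = text.split()  # assumed tokenizer: split on whitespace
    for i in range(len(words) - N + 1):
        yield " ".join(words[i:i + N]), 1


def reduce_fn(ngram, counts):
    """Combine counts and keep the sequence only if it is frequent enough."""
    total = sum(counts)
    if total >= MIN_COUNT:
        yield ngram, total


def count_ngrams(documents):
    """Hypothetical local driver: group map output by key, then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in documents:
        for ngram, one in map_fn(doc_id, text):
            grouped[ngram].append(one)
    output = []
    for ngram in sorted(grouped):
        output.extend(reduce_fn(ngram, grouped[ngram]))
    return output


frequent = count_ngrams([("d1", "a b c d e " * 4)])
print(frequent)  # [('a b c d e', 4)]
```

Note that the reduce function here emits zero values for rare sequences, which is the "usually 0 or 1 value" case mentioned on slide 12.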
  • 16. Example: Joining with Other Data Example: generate per-doc summary, but include per-host information (e.g. # of pages on host, important terms on host) per-host information might be in per-process data structure, or might involve RPC to a set of machines containing data for all sites map : extract host name from URL, lookup per-host info, combine with per-doc data and emit reduce : identity function (just emit key/value directly)
  • 17. MapReduce Programs in Google’s Source Tree
  • 18. New MapReduce Programs Per Month Summer intern effect
  • 19. MapReduce: Scheduling One master, many workers Input data split into M map tasks (typically 64 MB in size) Reduce phase partitioned into R reduce tasks Tasks are assigned to workers dynamically Often: M =200,000; R =4,000; workers=2,000 Master assigns each map task to a free worker Considers locality of data to worker when assigning task Worker reads task input (often from local disk!) Worker produces R local files containing intermediate k/v pairs Master assigns each reduce task to a free worker Worker reads intermediate k/v pairs from map workers Worker sorts & applies user’s Reduce op to produce the output
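The data flow on this slide, where each map task writes R local files and each reduce task reads its own file from every map task, can be sketched in a single process. Lists stand in for local files, there is no master or dynamic assignment here, and the function names and the `zlib.crc32` partitioner are assumptions for illustration.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key, R):
    return zlib.crc32(key.encode("utf-8")) % R


def run_map_task(split, map_fn, R):
    """One map task: produce R 'local files' of intermediate k/v pairs."""
    files = [[] for _ in range(R)]
    for key, value in split:
        for k2, v2 in map_fn(key, value):
            files[partition(k2, R)].append((k2, v2))
    return files


def run_reduce_task(r, map_outputs, reduce_fn):
    """Reduce task r: read file r from every map task, sort, apply reduce."""
    pairs = [pair for files in map_outputs for pair in files[r]]
    pairs.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        output.extend(reduce_fn(key, [v for _, v in group]))
    return output


def word_map(key, value):
    for word in value.split():
        yield word, 1


def sum_reduce(key, values):
    yield key, sum(values)


# M = 2 input splits and R = 3 reduce tasks (toy sizes, not the slide's).
splits = [[("doc1", "to be or not")], [("doc2", "to be")]]
R = 3
map_outputs = [run_map_task(s, word_map, R) for s in splits]
partitioned_output = [run_reduce_task(r, map_outputs, sum_reduce) for r in range(R)]
merged = sorted(pair for part in partitioned_output for pair in part)
print(merged)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

The output stays partitioned into R pieces, as in the diagram on the next slide; merging them is done here only to make the result easy to inspect.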
  • 20. Parallel MapReduce Map Map Map Map Input data Reduce Shuffle Reduce Shuffle Reduce Shuffle Partitioned output Master
  • 21. Task Granularity and Pipelining Fine granularity tasks: many more map tasks than machines Minimizes time for fault recovery Can pipeline shuffling with map execution Better dynamic load balancing Often use 200,000 map/5000 reduce tasks w/ 2000 machines
  • 22. Fault tolerance: Handled via re-execution On worker failure: Detect failure via periodic heartbeats Re-execute completed and in-progress map tasks Re-execute in-progress reduce tasks Task completion committed through master On master failure: State is checkpointed to GFS: new master recovers & continues Very Robust: lost 1600 of 1800 machines once, but finished fine
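The worker-failure rules above can be sketched as bookkeeping on the master. The `Task` record, the state names, and the timeout value are hypothetical. The asymmetry between map and reduce follows the OSDI'04 paper: completed map output lives on the failed worker's local disk and is lost, while completed reduce output is already in the global file system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str                    # "idle", "in_progress", or "completed"
    worker: Optional[str] = None  # worker currently or last assigned


def dead_workers(last_heartbeat, now, timeout):
    """Workers whose most recent heartbeat is older than the timeout."""
    return {w for w, seen in last_heartbeat.items() if now - seen > timeout}


def handle_worker_failure(tasks, failed_worker):
    """Reset every task that must be re-executed and return those tasks."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        lost_map = task.kind == "map" and task.state in ("in_progress", "completed")
        lost_reduce = task.kind == "reduce" and task.state == "in_progress"
        if lost_map or lost_reduce:
            task.state = "idle"
            task.worker = None
            rescheduled.append(task)
    return rescheduled


tasks = [
    Task("map", "completed", "w1"),
    Task("map", "in_progress", "w1"),
    Task("reduce", "completed", "w1"),
    Task("reduce", "in_progress", "w1"),
    Task("map", "completed", "w2"),
]
failed = dead_workers({"w1": 0.0, "w2": 95.0}, now=100.0, timeout=30.0)
redo = [t for w in failed for t in handle_worker_failure(tasks, w)]
print(len(redo))  # 3
```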
  • 23. Refinement: Backup Tasks Slow workers significantly lengthen completion time Other jobs consuming resources on machine Bad disks with soft errors transfer data very slowly Weird things: processor caches disabled (!!) Solution: Near end of phase, spawn backup copies of tasks Whichever one finishes first "wins" Effect: Dramatically shortens job completion time
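A toy timing model shows why backup copies help. It assumes a phase ends when its slowest task ends, and that a backup launched at `backup_start` takes `backup_duration` on a healthy machine. All numbers are invented for illustration and are not measurements from the talk.

```python
def phase_time(task_times, backup_start=None, backup_duration=None):
    """Completion time of a phase, optionally with backup copies of stragglers."""
    if backup_start is None:
        return max(task_times)
    finish_times = []
    for t in task_times:
        if t > backup_start:
            # Task still running when backups are spawned: first finisher wins.
            t = min(t, backup_start + backup_duration)
        finish_times.append(t)
    return max(finish_times)


# Three healthy tasks and one straggler on a slow machine (hypothetical seconds).
times = [10, 11, 12, 60]
without_backup = phase_time(times)
with_backup = phase_time(times, backup_start=12, backup_duration=10)
print(without_backup, with_backup)  # 60 22
```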
  • 24. Refinement: Locality Optimization Master scheduling policy: Asks GFS for locations of replicas of input file blocks Map tasks typically split into 64MB (== GFS block size) Map tasks scheduled so GFS input block replicas are on same machine or same rack Effect: Thousands of machines read input at local disk speed Without this, rack switches limit read rate
  • 25. Refinement: Skipping Bad Records Map/Reduce functions sometimes fail for particular inputs Best solution is to debug & fix, but not always possible On seg fault: Send UDP packet to master from signal handler Include sequence number of record being processed If master sees K failures for same record (typically K set to 2 or 3) : Next worker is told to skip the record Effect: Can work around bugs in third-party libraries
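The skip protocol can be simulated in one process. In this sketch a Python exception stands in for the seg fault, a method call stands in for the UDP packet, and the `Master` class and function names are hypothetical. Only the idea of counting failures per record sequence number and skipping after K of them comes from the slide.

```python
from collections import Counter


class Master:
    """Hypothetical master that counts failures per record sequence number."""

    def __init__(self, k=2):
        self.k = k
        self.failures = Counter()

    def report_failure(self, seq):
        self.failures[seq] += 1

    def records_to_skip(self):
        return {seq for seq, n in self.failures.items() if n >= self.k}


def run_attempt(records, map_fn, master):
    """One task attempt: returns output, or None if the attempt crashed."""
    skip = master.records_to_skip()
    output = []
    for seq, record in enumerate(records):
        if seq in skip:
            continue  # master told this worker to skip the record
        try:
            output.extend(map_fn(record))
        except Exception:
            master.report_failure(seq)  # stands in for the UDP packet
            return None
    return output


def run_with_retries(records, map_fn, master, max_attempts=10):
    for _ in range(max_attempts):
        result = run_attempt(records, map_fn, master)
        if result is not None:
            return result
    raise RuntimeError("task kept failing")


def fragile_map(record):
    if record == "bad":
        raise ValueError("simulated crash in a third-party library")
    return [(record, 1)]


master = Master(k=2)
output = run_with_retries(["a", "bad", "b"], fragile_map, master)
print(output)  # [('a', 1), ('b', 1)]
```

With K set to 2, the first two attempts crash on record 1, and the third attempt skips it and completes.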
  • 26. Other Refinements Optional secondary keys for ordering Compression of intermediate data Combiner: useful for saving network bandwidth Local execution for debugging/testing User-defined counters
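Of these refinements, the combiner is the easiest to show in code. The sketch below pre-aggregates word counts inside a map task so fewer pairs cross the network. It relies on the reduce operation (a sum) being associative and commutative, and the function names are made up for illustration.

```python
from collections import Counter


def map_without_combiner(text):
    """Emit one (word, 1) pair per occurrence."""
    return [(word, 1) for word in text.split()]


def map_with_combiner(text):
    """Emit one (word, partial_count) pair per distinct word in this task."""
    return sorted(Counter(text.split()).items())


text = "to be or not to be"
plain = map_without_combiner(text)
combined = map_with_combiner(text)
print(len(plain), len(combined))  # 6 4
print(combined)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

The reduce function is unchanged: summing partial counts gives the same totals as summing individual ones.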
  • 27. Performance Results & Experience Using 1,800 machines: MR_Grep scanned 1 terabyte in 100 seconds MR_Sort sorted 1 terabyte of 100 byte records in 14 minutes Rewrote Google's production indexing system: a sequence of 7, 10, 14, 17, 21, 24 MapReductions simpler more robust faster more scalable
  • 29. Usage Statistics Over Time
      Metric                            Aug '04   Mar '05   Mar '06   Sep '07
      Number of jobs (1000s)                 29        72       172     2,217
      Average completion time (secs)        634       934       874       395
      Machine years used                    217       981     2,002    11,081
      Input data read (TB)                3,288    12,571    52,254   403,152
      Intermediate data (TB)                758     2,756     6,743    34,774
      Output data written (TB)              193       941     2,970    14,018
      Average worker machines               157       232       268       394
  • 30. Implications for Multi-core Processors Multi-core processors require parallelism, but many programmers are uncomfortable writing parallel programs MapReduce provides an easy-to-understand programming model for a very diverse set of computing problems users don’t need to be parallel programming experts system automatically adapts to number of cores & machines available Optimizations useful even in single machine, multi-core environment locality, load balancing, status monitoring, robustness, …
  • 31. Conclusion MapReduce has proven to be a remarkably-useful abstraction Greatly simplifies large-scale computations at Google Fun to use: focus on problem, let library deal with messy details Many thousands of parallel programs written by hundreds of different programmers in last few years Many had no prior parallel or distributed programming experience Further info: MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, OSDI’04 http://labs.google.com/papers/mapreduce.html (or search Google for [MapReduce])