An Introduction to MapReduce
Francisco Pérez-Sorrosal
Distributed Systems Lab (DSL/LSD), Universidad Politécnica de Madrid
10/Apr/2008
Outline
Motivation
What is MapReduce?
Simple Example
What is MapReduce's Main Goal?
Main Features
What Does MapReduce Solve?
Programming Model
Framework Overview
Example
Other Features
Hadoop: A MapReduce Implementation
Example
References
Motivation
Increasing demand for large-scale processing applications: web engines, semantic search tools, scientific applications...
Most of these applications can be parallelized.
There are many ad-hoc implementations for such applications, but...
Motivation (II)
...developing and managing the execution of such ad-hoc parallel applications is too complex: it usually implies using and managing hundreds or thousands of machines.
However, these applications basically share the same problems:
Parallelization
Fault tolerance
Data distribution
Load balancing
What is MapReduce?
It is a framework that...
...automatically partitions jobs with large input data sets into simpler work units or tasks and distributes them across the nodes of a cluster (map), and...
...combines the intermediate results of those tasks (reduce) to produce the required result.
Presented by Google in 2004: http://labs.google.com/papers/mapreduce.html
Simple Example
(figure: the input data is mapped in parallel on Node 1 and Node 2, and the mapped data is combined into the result)
What is MapReduce's Main Goal?
Simplify the parallelization and distribution of large-scale computations in clusters.
MapReduce Main Features
Simple interface
Automatic partitioning, parallelization and distribution of tasks
Fault tolerance
Status and monitoring
What does MapReduce solve?
It allows programmers with no experience in parallel and distributed systems to use large distributed systems easily.
Used extensively in many applications inside Google and Yahoo that...
...require simple processing tasks...
...but have large input data sets.
What does MapReduce solve? Examples:
Distributed grep
Distributed sort
Count of URL access frequency
Web crawling
Representing the structure of web documents
Generating summaries (pages crawled per host, most frequent queries, results returned...)
Programming Model
Input & Output: each one is a set of key/value pairs.
Map: processes input key/value pairs and computes a set of intermediate key/value pairs:
map(in_key, in_value) -> list(int_key, intermediate_value)
Reduce: combines all the intermediate values that share the same key and produces a set of merged output values (usually just one per key):
reduce(int_key, list(intermediate_value)) -> list(out_value)
Programming Model: Example
Problem: count URL access frequency.
Input: log of web page requests.
Map: processes the assigned chunk of the log and computes a set of intermediate pairs <URL, 1>.
Reduce: processes the intermediate pairs <URL, 1>, adds together all the values that share the same URL, and produces a set of pairs of the form <URL, total count>.
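The map/reduce contract above can be sketched as a minimal in-memory Java program (this is not Hadoop code; the class and method names are illustrative, and the log-line format is an assumption):

```java
import java.util.*;

// In-memory sketch of the URL-access-frequency job: map emits <URL, 1>
// for each request, reduce sums the values that share the same URL.
class UrlAccessCount {

    // Map: each log line (assumed "METHOD URL ..." format) yields <URL, 1>.
    static List<Map.Entry<String, Integer>> map(List<String> logChunk) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : logChunk) {
            String url = line.split(" ")[1];
            intermediate.add(new AbstractMap.SimpleEntry<>(url, 1));
        }
        return intermediate;
    }

    // Reduce: add together all values sharing the same URL -> <URL, total count>.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> intermediate) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("GET /index", "GET /about", "GET /index");
        System.out.println(reduce(map(log))); // {/about=1, /index=2}
    }
}
```

In the real framework the intermediate pairs would be partitioned across machines and the reduce would run per partition; this sketch collapses both phases into one process to show only the data flow.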
Framework Overview
(architecture figure)
Example: Count # of Each Letter in a Big File
1) The Master splits the 640MB big file into 10 pieces of 64MB each.
R = 4 output files (set by the user).
There are 26 different key letters in the range [a..z].
All workers start idle.
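Step 1 is just arithmetic over the file size; a sketch of computing the split boundaries (sizes in MB for readability, numbers taken from the slide; the class name is illustrative):

```java
// Sketch of step 1: a 640MB file split into 64MB pieces gives 10 map inputs.
class InputSplitter {
    // Returns the [start, end) offset of each split.
    static int[][] split(int fileSizeMb, int splitSizeMb) {
        int n = (fileSizeMb + splitSizeMb - 1) / splitSizeMb; // ceiling division
        int[][] splits = new int[n][2];
        for (int i = 0; i < n; i++) {
            splits[i][0] = i * splitSizeMb;
            splits[i][1] = Math.min((i + 1) * splitSizeMb, fileSizeMb);
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(split(640, 64).length); // 10
    }
}
```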
Example: Count # of Each Letter in a Big File
2) The Master assigns map and reduce tasks to the idle workers: some become mappers, the rest reducers.
Example: Count # of Each Letter in a Big File
3) The mappers read their assigned split data (map tasks in progress, reduce tasks still idle).
Example: Count # of Each Letter in a Big File
4) Process the data (in memory). On Machine 1, Map Task 1 emits intermediate pairs such as (a,1) (b,1) (a,1) (m,1) (o,1) (p,1) (r,1) (y,1); a partition function maps each letter into one of the R=4 regions (R1..R4).
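The partition function in step 4 just needs to send every occurrence of the same key to the same region; a common choice (the MapReduce paper's default) is hash(key) mod R. A sketch for single-letter keys (the class name is illustrative):

```java
// Partition function sketch: deterministically assigns an intermediate
// key to one of R regions, so all pairs for a key reach the same reducer.
class Partitioner {
    static int partition(char letter, int numRegions) {
        return letter % numRegions; // char promotes to its code point; result in [0, R)
    }

    public static void main(String[] args) {
        int r = partition('a', 4);
        System.out.println("key 'a' -> region R" + (r + 1));
    }
}
```

Any deterministic function of the key works; hashing just spreads the keys roughly evenly across the R regions.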
Example: Count # of Each Letter in a Big File
5) Apply the combiner function: the buffered pairs (a,1) (b,1) (a,1) (m,1) (o,1) (p,1) (r,1) (y,1) are locally merged into (a,2) (b,1) (m,1) (o,1) (p,1) (r,1) (y,1).
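The combiner of step 5 runs the reduce logic locally on the map task's own output, shrinking what must later cross the network. A plain-Java sketch using the pairs from the slide (not Hadoop code; names are illustrative):

```java
import java.util.*;

// Combiner sketch: merge a map task's local pairs before writing them
// to disk, e.g. (a,1)(a,1) -> (a,2).
class Combiner {
    static Map<Character, Integer> combine(List<Map.Entry<Character, Integer>> pairs) {
        Map<Character, Integer> combined = new TreeMap<>();
        for (Map.Entry<Character, Integer> p : pairs) {
            combined.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return combined;
    }

    // Helper: turn "aba..." into the pair list (a,1)(b,1)(a,1)...
    static List<Map.Entry<Character, Integer>> ones(String letters) {
        List<Map.Entry<Character, Integer>> pairs = new ArrayList<>();
        for (char c : letters.toCharArray()) {
            pairs.add(new AbstractMap.SimpleEntry<>(c, 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The slide's buffered pairs: (a,1)(b,1)(a,1)(m,1)(o,1)(p,1)(r,1)(y,1)
        System.out.println(combine(ones("abamopry"))); // {a=2, b=1, m=1, o=1, p=1, r=1, y=1}
    }
}
```

A combiner is only valid because this reduce (addition) is associative and commutative; applying it locally does not change the final totals.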
Example: Count # of Each Letter in a Big File
6) Store the combined results on local disk, partitioned into the regions R1..R4.
Example: Count # of Each Letter in a Big File
7) The worker informs the Master of the location of Map Task 1's intermediate results on its local disk.
Example: Count # of Each Letter in a Big File
8) The Master assigns the next task (Map Task 5) to the recently freed worker.
Example: Count # of Each Letter in a Big File
9) The Master forwards the location of Map Task 1's intermediate results to the reducers (the R1 location to Reduce Task 1, and likewise for each region Rx).
Example: Count # of Each Letter in a Big File
Region 1 holds the letters a..g. Reduce Task 1 (still idle) will consume the region-1 pairs produced by the different map tasks, e.g. (a,2) (b,1) (e,1) (d,1) (c,1) (e,1) (g,1) (e,1) (a,3) (c,1) (c,1) (a,1) (b,1) (a,2) (f,1) (e,1) (a,2) (e,1) (c,1) (e,1).
Example: Count # of Each Letter in a Big File
10) Reduce Task 1 (now in progress on Machine N) reads the data stored in region 1 by each map task.
Example: Count # of Each Letter in a Big File
11) Reduce Task 1 sorts the data by key: (a,2) (a,3) (a,1) (a,2) (a,2) (b,1) (b,1) (c,1) (c,1) (c,1) (c,1) (d,1) (e,1) (e,1) (e,1) (e,1) (e,1) (e,1) (f,1) (g,1).
Example: Count # of Each Letter in a Big File
12) Then it passes each key and its corresponding set of intermediate values to the user's reduce function: (a,{2,3,1,2,2}) (b,{1,1}) (c,{1,1,1,1}) (d,{1}) (e,{1,1,1,1,1,1}) (f,{1}) (g,{1}).
Example: Count # of Each Letter in a Big File
13) Finally, after executing the user's reduce, it generates output file 1 of R: (a,10) (b,2) (c,4) (d,1) (e,6) (f,1) (g,1).
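Steps 10-13 on the reduce side can be sketched in plain Java: collect the region's pairs, sort and group them by key, then apply the user's reduce (here, summation). The data below is illustrative, not the slides' exact pairs, and the names are not Hadoop's:

```java
import java.util.*;

// Reduce-side sketch: sort/group the region's intermediate pairs by key
// (steps 11-12), then sum each key's value list (step 13).
class ReduceSide {
    static Map<Character, Integer> reduceRegion(List<Map.Entry<Character, Integer>> regionPairs) {
        // Group values by key; TreeMap keeps keys sorted, mirroring the sort step.
        Map<Character, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<Character, Integer> p : regionPairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        // User's reduce: sum the list of counts for each key.
        Map<Character, Integer> out = new TreeMap<>();
        for (Map.Entry<Character, List<Integer>> e : grouped.entrySet()) {
            out.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).sum());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<Character, Integer>> pairs = Arrays.asList(
            new AbstractMap.SimpleEntry<>('a', 2),
            new AbstractMap.SimpleEntry<>('b', 1),
            new AbstractMap.SimpleEntry<>('a', 3));
        System.out.println(reduceRegion(pairs)); // {a=5, b=1}
    }
}
```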
Other Features: Failures
Re-execution is the main fault-tolerance mechanism.
Worker failures: the Master detects worker failures via periodic heartbeats and drives the re-execution of their tasks. Both completed and in-progress map tasks are re-executed (their output lives on the failed machine's local disk), while only in-progress reduce tasks are re-executed.
Master failure: the initial implementation did not support failures of the Master. Possible solutions: checkpoint the state of its internal structures in the GFS, or use replication techniques.
Robust: once lost 1600 of 1800 machines, but finished fine.
Other Features: Locality
Most input data is read locally. Why? To avoid consuming network bandwidth.
How does it achieve that? The Master attempts to schedule a map task on a machine that holds a replica (in the GFS) of the corresponding input data; failing that, it attempts to schedule the task near a replica (e.g. on the same network switch).
Other Features: Backup Tasks
Some tasks may be delayed (stragglers): a machine takes too long to complete one of the last few map or reduce tasks.
Causes: bad disk, contention with other processes, processor caches disabled.
Solution: when the job is close to completion, the Master schedules backup tasks for the remaining in-progress tasks; whichever copy finishes first "wins".
Effect: dramatically shortens job completion time.
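The backup-task idea, run a duplicate of a straggling task and take whichever copy finishes first, can be sketched with a thread pool (the names and the simulated delays are illustrative, not the framework's scheduler):

```java
import java.util.Arrays;
import java.util.concurrent.*;

// Backup-task sketch: submit the original task and a backup copy, return
// the result of whichever finishes first, and cancel the loser.
class BackupTasks {
    static String firstToFinish(Callable<String> task, Callable<String> backup) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // invokeAny blocks until one task completes successfully and
            // returns its result, cancelling the other.
            return pool.invokeAny(Arrays.asList(task, backup));
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        Callable<String> straggler = () -> { Thread.sleep(5000); return "straggler"; };
        Callable<String> backup = () -> "backup";
        System.out.println(firstToFinish(straggler, backup)); // the backup "wins"
    }
}
```

Since map and reduce tasks are deterministic and idempotent here, it does not matter which copy wins; the duplicate work is wasted only in the rare straggler case.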
Performance
Tests run on a cluster of ~1800 machines:
4 GB of memory
Dual-processor 2 GHz Xeons with Hyper-Threading
Two 160 GB IDE disks
Gigabit Ethernet per machine
All machines placed in the same hosting facility
Performance: Distributed Grep
The program searches for a rare three-character pattern, which occurs 97,337 times.
It scans through 10^10 100-byte records (input).
The input is split into approx. 64MB pieces: map tasks = 15,000.
The entire output is placed in one file: reducers = 1.
Performance: Grep
The test completes in ~150 sec.
The locality optimization helps: 1800 machines read 1 TB of data at a peak of ~31 GB/s; without it, the rack switches would limit the rate to 10 GB/s.
Startup overhead is significant for such short jobs.
(figure: scan rate over time; 1764 workers; maps starting to finish)
Hadoop: A MapReduce Implementation
http://hadoop.apache.org
Installing Hadoop MapReduce:
Install Hadoop Core.
Configure the Hadoop site in conf/hadoop-site.xml: the HDFS master, the MapReduce master, and the number of replicated files in the cluster.

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Hadoop: A MapReduce Implementation
Format a new distributed filesystem:
$ bin/hadoop namenode -format
Start the Hadoop daemons:
$ bin/start-all.sh  ($ bin/start-dfs.sh + $ bin/start-mapred.sh)
Check the NameNode (HDFS): http://localhost:50070/
Check the JobTracker (MapReduce): http://localhost:50030/
Hadoop: HDFS Console (screenshot)
Hadoop: JobTracker Console (screenshot)
Hadoop: Word Count Example
$ bin/hadoop dfs -ls /tmp/fperez-hadoop/wordcount/input/
/tmp/fperez-hadoop/wordcount/input/file01
/tmp/fperez-hadoop/wordcount/input/file02
$ bin/hadoop dfs -cat /tmp/fperez-hadoop/wordcount/input/file01
Welcome To Hadoop World
$ bin/hadoop dfs -cat /tmp/fperez-hadoop/wordcount/input/file02
Goodbye Hadoop World
Hadoop: Running the Example
Run the application:
$ bin/hadoop jar /tmp/fperez-hadoop/wordcount.jar org.myorg.WordCount /tmp/fperez-hadoop/wordcount/input /tmp/fperez-hadoop/wordcount/output
Output:
$ bin/hadoop dfs -cat /tmp/fperez-hadoop/wordcount/output/part-00000
Goodbye 1
Hadoop 2
To 1
Welcome 1
World 2
Hadoop: Word Count Example

public class WordCount extends Configured implements Tool {
    ...
    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
        ... // Map Task Definition
    }

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
        ... // Reduce Task Definition
    }

    public int run(String[] args) throws Exception {
        ... // Job Configuration
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new WordCount(), args);
        System.exit(res);
    }
}
Hadoop: Job Configuration

public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), WordCount.class);
    conf.setJobName("wordcount");

    // the keys are words (strings)
    conf.setOutputKeyClass(Text.class);
    // the values are counts (ints)
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputPath(new Path(args[0]));
    conf.setOutputPath(new Path(args[1]));

    JobClient.runJob(conf);
    return 0;
}
Hadoop: Map Class

public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // map(WritableComparable, Writable, OutputCollector, Reporter)
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}
Hadoop: Reduce Class

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

    // reduce(WritableComparable, Iterator, OutputCollector, Reporter)
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
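The job configuration registers the same Reduce class as both combiner and reducer. That is only safe because summing is associative and commutative: partial sums computed locally on each mapper, then summed again by the reducer, equal a single global sum. A minimal plain-Java sketch of that property (the class name CombinerSketch and the hard-coded values are illustrative, not Hadoop code):

```java
import java.util.Arrays;

// Sketch: why WordCount's Reduce can double as the Combiner.
// The int arrays stand for the 1s emitted for one word (e.g. <"Hadoop", 1>)
// on two different mapper nodes.
public class CombinerSketch {
    // The reduce function: sum a list of counts.
    static int sum(int[] values) {
        return Arrays.stream(values).sum();
    }

    public static void main(String[] args) {
        int[] mapper1 = { 1, 1, 1 }; // three emissions on node 1
        int[] mapper2 = { 1, 1 };    // two emissions on node 2

        // With a combiner: reduce locally, then reduce the partial sums.
        int combined = sum(new int[] { sum(mapper1), sum(mapper2) });

        // Without a combiner: reduce all raw values at once.
        int direct = sum(new int[] { 1, 1, 1, 1, 1 });

        System.out.println(combined == direct); // prints true: both are 5
    }
}
```

A reduce function that is not associative and commutative (e.g. an average of raw values) cannot be reused as a combiner this way.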

References

- Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04, San Francisco, CA, December 2004.
- Ralf Lämmel. Google's MapReduce Programming Model – Revisited. 2006-2007. Accepted for publication in Science of Computer Programming.
- Jeff Dean, Sanjay Ghemawat. Slides from OSDI'04. http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
- Hadoop. http://hadoop.apache.org
Questions?


Editor's Notes

  1. There are libraries for programming clusters, such as PVM (Parallel Virtual Machine) and MPI (Message Passing Interface).
  2. Built on top of GFS.
  3. Given a collection of shapes, we split the collection into 2 parts and send each part to a grid node. Each node counts the shapes it receives and returns the count to the caller. The caller then adds the results received from the remote nodes and returns the reduced result to the user (the counts are displayed next to every shape).
  4. Simplify the parallelization and distribution of massive data processing.
  5. So, in order to achieve this goal, MapReduce provides...
  6. The input is a list of web page URLs.
  7. Let's see with an example how the MapReduce framework works. The example program counts the number of occurrences of each letter in a big file and classifies the letters into 4 different output files by ranges of seven letters (e.g., A to G, H to N, ...). A user-defined partitioning function, usually a hash function, is used for this.
  8. The master assigns each worker a specific task. In this case we assume the workers on the left are assigned map tasks and those on the right reduce tasks.
  9. The next step assigns each map task its corresponding part of the input file.
  10. Using the partitioning function, task 1 classifies the letters into regions as follows: 'a' goes to region 1, 'y' to region 4, and so on.
  11. Completed map tasks need to be re-executed because their results are stored on local disks and become inaccessible. Completed reduce tasks don't need to be re-executed because their results are in GFS, which provides fault tolerance through data replication.