Distributed Computing with Apache Hadoop
Introduction to MapReduce
Konstantin V. Shvachko
Birmingham Big Data Science Group
October 19, 2011
Computing
• The history of computing started a long time ago
• Fascination with numbers
– A vast universe with simple, strict rules
– Computing devices
– Crunching numbers
• The Internet
– A universe of words and fuzzy rules
– A different type of computing
– Understanding the meaning of things
– Human thinking
– Errors & deviations are a part of the study
Computer History Museum, San Jose
Words vs. Numbers
• In 1997 IBM built the Deep Blue supercomputer
– Played chess against the champion G. Kasparov
– The human race was defeated
– Strict rules of chess
– Fast, deep analysis of the current state
– Still numbers
• In 2011 IBM built the Watson computer to play Jeopardy
– Questions and hints in human terms
– Analysis of texts from libraries and the Internet
– Human champions defeated
Big Data
• Computations that need the power of many computers
– Large datasets: hundreds of TBs or PBs
– Or the use of thousands of CPUs in parallel
– Or both
• The cluster as a computer
What is a PB?
1 KB = 1000 Bytes
1 MB = 1000 KB
1 GB = 1000 MB
1 TB = 1000 GB
1 PB = 1000 TB
1 EB = 1000 PB

Examples – Science
• Fundamental physics: Large Hadron Collider (LHC)
– Smashing high-energy protons at near the speed of light
– 1 PB of event data per second, most of it filtered out
– 15 PB of data per year
– 150 computing centers around the world
– 160 PB of disk + 90 PB of tape storage
• Math: Big Numbers
– The 2 quadrillionth (10¹⁵) digit of π is 0
– A pure CPU workload
– 12 days of cluster time
– 208 years of CPU-time on a cluster with 7600 CPU cores
• Big Data – Big Science
Examples – Web
• Search engine Webmap
– Map of the Internet
– 2008 @ Yahoo, 1500 nodes, 5 PB raw storage
• Internet Search Index
– Traditional application
• Social Network Analysis
– Intelligence
– Trends
The Sorting Problem
• Classic in-memory sorting
– Complexity: number of comparisons
• External sorting
– Cannot load all the data into memory
– 16 GB of RAM vs. a 200 GB file
– Complexity: + disk IOs (bytes read or written)
• Distributed sorting
– Cannot load the data on a single server
– 12 drives * 2 TB = 24 TB of disk space vs. a 200 TB dataset
– Complexity: + network transfers

              Worst        Average      Space
Bubble Sort   O(n²)        O(n²)        In-place
Quicksort     O(n²)        O(n log n)   In-place
Merge Sort    O(n log n)   O(n log n)   Double
What do we do?
• We need a lot of computers
• How do we make them work together?

Hadoop
• Apache Hadoop is an ecosystem of tools for processing “Big Data”
• Started in 2005 by D. Cutting and M. Cafarella
• Consists of two main components, providing a unified view of the cluster:
1. HDFS – a distributed file system
– A file system API connecting thousands of drives
2. MapReduce – a framework for distributed computations
– Splits jobs into parts executable on one node
– Schedules and monitors job execution
• Today used everywhere: becoming a standard of distributed computing
• Hadoop is an open source project
MapReduce
• MapReduce
– 2004: Jeffrey Dean, Sanjay Ghemawat. Google.
– “MapReduce: Simplified Data Processing on Large Clusters”
• A computational model
– What is a computational model? Think of the Turing machine, or the Java language
– Split large input data into pieces small enough to be processed in parallel
• An execution framework
– Plays the role that compilers and interpreters play for a language
– Scheduling, processing, coordination
– Failure recovery
Functional Programming
• Map: a higher-order function
– Applies a given function to each element of a list
– Returns the list of results
• Map( f(x), X[1:n] ) → [ f(X[1]), …, f(X[n]) ]
• Example: Map( x², [0,1,2,3,4,5] ) = [0,1,4,9,16,25]
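As an added illustration (not part of the original slides), the same map written in plain Java with the Stream API:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapDemo {
    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(0, 1, 2, 3, 4, 5);
        // Map(x², xs): apply the function independently to every element
        List<Integer> squares = xs.stream()
                                  .map(x -> x * x)
                                  .collect(Collectors.toList());
        System.out.println(squares);   // [0, 1, 4, 9, 16, 25]
    }
}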
Functional Programming: reduce
• Reduce / fold: a higher-order function
– Iterates a given function over a list of elements
– Applies the function to the previous result and the current element
– Returns a single result
• Example: Reduce( x + y, [0,1,2,3,4,5] ) = (((((0 + 1) + 2) + 3) + 4) + 5) = 15

• Another example: Reduce( x * y, [0,1,2,3,4,5] ) = 0
– The first element is 0, so every partial product, and therefore the result, is 0
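Again as an added sketch (not from the slides), both folds in plain Java:

import java.util.Arrays;
import java.util.List;

public class ReduceDemo {
    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(0, 1, 2, 3, 4, 5);
        // Reduce(x + y, xs): fold the list with +, starting from the identity 0
        int sum = xs.stream().reduce(0, (x, y) -> x + y);
        // Reduce(x * y, xs): fold the list with *, starting from the identity 1
        int product = xs.stream().reduce(1, (x, y) -> x * y);
        System.out.println(sum);      // 15
        System.out.println(product);  // 0, because the list contains 0
    }
}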
Example: Sum of Squares
• A composition of
– a map, followed by
– a reduce applied to the results of the map
• Example:
– Map( x², [1,2,3,4,5] ) = [1,4,9,16,25]
– Reduce( x + y, [1,4,9,16,25] ) = ((((1 + 4) + 9) + 16) + 25) = 55
• Map is easily parallelizable
– Compute x² for 1, 2, 3 on one node and for 4, 5 on another
• Reduce is notoriously sequential
– Need all the squares on one node to compute the total sum

Square Pyramidal Number
1 + 4 + … + n² = n(n+1)(2n+1) / 6
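An added sketch of the same composition in Java, where the map runs in parallel and the reduce combines the partial results:

import java.util.stream.IntStream;

public class SumOfSquares {
    public static void main(String[] args) {
        int n = 5;
        // The map phase (x -> x²) is embarrassingly parallel; the reduce
        // phase combines partial sums, which works because + is associative.
        long sum = IntStream.rangeClosed(1, n)
                            .parallel()
                            .mapToLong(x -> (long) x * x)
                            .sum();
        System.out.println(sum);                                  // 55
        System.out.println((long) n * (n + 1) * (2 * n + 1) / 6); // 55, by the formula
    }
}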
Computational Model
• MapReduce is a parallel computational model
• A Map-Reduce algorithm = a job
• Operates on key-value pairs: (k, V)
– Primitive types, Strings, or more complex structures
• A Map-Reduce job’s input and output is a list of pairs {(k, V)}
• An MR job is defined by 2 functions
– map: (k1, v1) → {(k2, v2)}
– reduce: (k2, {v2}) → {(k3, v3)}
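These two signatures can be written down as Java interfaces. This is an added sketch of the abstract contract, not Hadoop's actual API (Hadoop's Mapper and Reducer classes appear later):

import java.util.List;
import java.util.Map.Entry;

// map: (k1, v1) -> {(k2, v2)}: one input pair produces a list of output pairs
interface MapFunction<K1, V1, K2, V2> {
    List<Entry<K2, V2>> map(K1 key, V1 value);
}

// reduce: (k2, {v2}) -> {(k3, v3)}: all values for one key are reduced together
interface ReduceFunction<K2, V2, K3, V3> {
    List<Entry<K3, V3>> reduce(K2 key, Iterable<V2> values);
}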

Job Workflow
[Diagram: the input words “dogs”, “like”, “cats” are mapped to consonant/vowel counts (C,3)(V,1), (C,2)(V,2), (C,3)(V,1), which the reduce phase sums to the totals (C,8) and (V,4)]
The Algorithm

Map(null, word)
  nC = Consonants(word)
  nV = Vowels(word)
  Emit(“Consonants”, nC)
  Emit(“Vowels”, nV)

Reduce(key, {n1, n2, …})
  nRes = n1 + n2 + …
  Emit(key, nRes)
Computation Framework
• Two virtual clusters: HDFS and MapReduce
– Physically tightly coupled; designed to work together
• HDFS, the Hadoop Distributed File System, presents data as files and directories
• MapReduce is a parallel computation framework
– A job scheduling and execution framework
HDFS Architecture Principles
• The namespace is a hierarchy of files and directories
• Files are divided into blocks (typically 128 MB)
• The namespace (metadata) is decoupled from the data
– Fast namespace operations, not slowed down by data streaming
• A single NameNode keeps the entire namespace in RAM
• DataNodes store data blocks on local drives
• Blocks are replicated on 3 DataNodes for redundancy and availability
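To make the “file system API” point concrete, here is an added sketch (not from the slides) of reading a file through Hadoop's FileSystem API, the same API the WordMean example below uses to read its results:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadDemo {
    public static void main(String[] args) throws IOException {
        // The client sees one namespace, regardless of how many
        // DataNodes actually hold the blocks of the file.
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path(args[0]);  // an HDFS path passed on the command line
        try (BufferedReader in =
                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = in.readLine()) != null)
                System.out.println(line);
        }
    }
}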

MapReduce Framework
• Job input is a file or a set of files in a distributed file system (HDFS)
– Input is split into blocks of roughly the same size
– Blocks are replicated to multiple nodes
– A block holds a list of key-value pairs
• A map task is scheduled to one of the nodes containing its block
– Map task input is node-local
– Map task results are node-local
• Map task results are grouped, one group per reducer; each group is sorted
• A reduce task is scheduled to a node
– The reduce task transfers its targeted groups from all mapper nodes
– Computes and stores results in a separate HDFS file
• Job output is a set of files in HDFS, with #files = #reducers
MapReduce Example: Mean
• Mean: µ = (1/n) ∑ xᵢ , where the sum runs over i = 1..n
• Input: a large text file
• Output: the average length µ of the words in the file
• Example: µ({dogs, like, cats}) = 4
Mean Mapper
• Map input is the set of words {w} in the partition
– Key = null, Value = w
• Map computes
– The number of words in the partition
– The total length of the words, ∑ length(w)
• Map output
– <“count”, #words>
– <“length”, #totalLength>

Map(null, w)
  Emit(“count”, 1)
  Emit(“length”, length(w))
Single Mean Reducer
• Reduce input
– {<key, {value}>}, where
– key = “count” or “length”
– value is an integer
• Reduce computes
– The total number of words: N = sum of all “count” values
– The total length of words: L = sum of all “length” values
• Reduce output
– <“count”, N>
– <“length”, L>
• The result
– µ = L / N

Reduce(key, {n1, n2, …})
  nRes = n1 + n2 + …
  Emit(key, nRes)

Analyze()
  read(“part-r-00000”)
  print(“mean = ” + L/N)

Mean: Mapper, Reducer
public class WordMean {
  private final static Text COUNT_KEY = new Text("count");
  private final static Text LENGTH_KEY = new Text("length");
  private final static LongWritable ONE = new LongWritable(1);

  public static class WordMeanMapper
      extends Mapper<Object, Text, Text, LongWritable> {
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Tokenize the line and emit two pairs per word:
      // its length under "length" and a 1 under "count"
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        String word = itr.nextToken();
        context.write(LENGTH_KEY, new LongWritable(word.length()));
        context.write(COUNT_KEY, ONE);
      }
    }
  }

  public static class WordMeanReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text key, Iterable<LongWritable> values,
        Context context) throws IOException, InterruptedException {
      // Sum all values for the key ("count" or "length");
      // use a long so that large inputs do not overflow
      long sum = 0;
      for (LongWritable val : values)
        sum += val.get();
      context.write(key, new LongWritable(sum));
    }
  }
. . . . . . . . . . . . . . . .
Mean: main()
. . . . . . . . . . . . . . . .
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs =
        new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordmean <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word mean");
    job.setJarByClass(WordMean.class);
    job.setMapperClass(WordMeanMapper.class);
    job.setReducerClass(WordMeanReducer.class);
    // The reducer doubles as a combiner: summing is associative,
    // so partial sums can be computed on the mapper nodes
    job.setCombinerClass(WordMeanReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setNumReduceTasks(1);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    Path outputpath = new Path(otherArgs[1]);
    FileOutputFormat.setOutputPath(job, outputpath);
    boolean result = job.waitForCompletion(true);
    analyzeResult(outputpath);
    System.exit(result ? 0 : 1);
  }
. . . . . . . . . . . . . . . .
Mean: analyzeResult()
. . . . . . . . . . . . . . . .
  private static void analyzeResult(Path outDir) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    // With a single reducer the whole result is in one file
    Path reduceFile = new Path(outDir, "part-r-00000");
    if (!fs.exists(reduceFile)) return;
    long count = 0, length = 0;
    BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(reduceFile)));
    while (in.ready()) {
      StringTokenizer st = new StringTokenizer(in.readLine());
      String key = st.nextToken();
      String value = st.nextToken();
      if (key.equals("count")) count = Long.parseLong(value);
      else if (key.equals("length")) length = Long.parseLong(value);
    }
    in.close();
    double average = (double) length / count;
    System.out.println("The mean is: " + average);
  }
} // end WordMean
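Assuming the class is packaged into a jar (the jar name here is hypothetical), the job would be launched with the standard hadoop jar command, after which analyzeResult() prints the mean:

  hadoop jar wordmean.jar WordMean <in> <out>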
MapReduce Implementation
• A single master JobTracker shepherds the distributed herd of TaskTrackers
1. Job scheduling and resource allocation
2. Job monitoring and job lifecycle coordination
3. Cluster health and resource tracking
• A job is defined by
– Program: the myJob.jar file
– Configuration: conf.xml
– Input and output paths
• The JobClient submits the job to the JobTracker
– Calculates and creates splits based on the input
– Writes myJob.jar and conf.xml to HDFS

MapReduce Implementation (continued)
• The JobTracker divides the job into tasks: one map task per split
– Assigns a TaskTracker for each task, collocated with the split
• TaskTrackers execute tasks and report status to the JobTracker
– A TaskTracker can run multiple map and reduce tasks
– Map and reduce slots
• Failed attempts are reassigned to other TaskTrackers
• Job execution status and results are reported back to the client
• The scheduler lets many jobs run in parallel
Example: Standard Deviation
• Standard deviation: σ = sqrt( (1/n) ∑ (xᵢ − µ)² ), where the sum runs over i = 1..n
• Input: a large text file
• Output: the standard deviation σ of the word lengths
• Example: σ({dogs, like, cats}) = 0
• How many jobs are needed?
Standard Deviation: Hint
Expand the square so that σ can be computed in a single pass:

(1/n) ∑ (xᵢ − µ)² = (1/n) ∑ xᵢ² − (2µ/n) ∑ xᵢ + µ² = (1/n) ∑ xᵢ² − µ²

So one job collecting ∑ xᵢ and ∑ xᵢ² is enough.
Standard Deviation Mapper
• Map input is the set of words {w} in the partition
– Key = null, Value = w
• Map computes
– The number of words in the partition
– The total length of the words, ∑ length(w)
– The sum of the squared lengths, ∑ length(w)²
• Map output
– <“count”, #words>
– <“length”, #totalLength>
– <“squared”, #sumLengthSquared>

Map(null, w)
  Emit(“count”, 1)
  Emit(“length”, length(w))
  Emit(“squared”, length(w)²)

Standard Deviation Reducer
• Reduce input
– {<key, {value}>}, where
– key = “count”, “length”, or “squared”
– value is an integer
• Reduce computes
– The total number of words: N = sum of all “count” values
– The total length of words: L = sum of all “length” values
– The sum of squared lengths: S = sum of all “squared” values
• Reduce output
– <“count”, N>
– <“length”, L>
– <“squared”, S>
• The result
– µ = L / N
– σ = sqrt(S/N − µ²)

Reduce(key, {n1, n2, …})
  nRes = n1 + n2 + …
  Emit(key, nRes)

Analyze()
  read(“part-r-00000”)
  print(“mean = ” + L/N)
  print(“std.dev = ” + sqrt(S/N − (L/N)*(L/N)))
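An added sketch of the final arithmetic in Java; note that the operator precedence matters, since L*L/N*N parses as ((L*L)/N)*N, not as µ²:

public class StdDevCalc {
    // n = total word count, l = total length, s = sum of squared lengths
    static double stdDev(long n, long l, long s) {
        double mean = (double) l / n;                   // µ = L / N
        return Math.sqrt((double) s / n - mean * mean); // σ = sqrt(S/N − µ²)
    }
    public static void main(String[] args) {
        // {dogs, like, cats}: N = 3, L = 12, S = 48 → σ = 0
        System.out.println(stdDev(3, 12, 48));
    }
}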
Combiner, Partitioner
• Combiners perform local aggregation before the shuffle & sort phase
– An optimization that reduces data transfers during the shuffle
– In the Mean example, it reduces the transfer of many keys to only two
• Partitioners assign intermediate (map) key-value pairs to reducers
– Responsible for dividing up the intermediate key space
– Not used with a single reducer

[Diagram: Input → Map → Combiner → Partitioner → Shuffle & sort → Reduce → Output]
Distributed Sorting
• Sort a dataset that cannot be entirely stored on one node
• Input:
– A set of files of 100-byte records
– The first 10 bytes of each record are the key, the rest is the value
• Output:
– An ordered list of files: f1, …, fN
– Each file fi is sorted, and
– if i < j, then for any keys k Є fi and r Є fj, k ≤ r
– The concatenation of the files in the given order forms a completely sorted record set
Naïve MapReduce Sorting
• Works if the output can be stored on one node
• The input to any reducer is always sorted by key
– The shuffle sorts the map outputs
• One identity mapper and one identity reducer would do the trick
– Identity: <k,v> → <k,v>

[Diagram: input “dogs like cats” → identity Map → shuffle (sorts by key) → identity Reduce → output “cats dogs like”]
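A hedged sketch of configuring such a job (added here, not from the slides). In Hadoop's newer org.apache.hadoop.mapreduce API the base Mapper and Reducer classes are identity by default; a recent Hadoop with KeyValueTextInputFormat in the new API is assumed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NaiveSort {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "naive sort");
        job.setJarByClass(NaiveSort.class);
        // The base Mapper and Reducer pass every <k,v> through unchanged;
        // the shuffle between them does all the sorting by key.
        job.setMapperClass(Mapper.class);    // identity map
        job.setReducerClass(Reducer.class);  // identity reduce
        job.setNumReduceTasks(1);            // one sorted output file
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}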

Naïve Sorting: Multiple Maps
• Multiple identity mappers and one identity reducer give the same result
– This does not work for multiple reducers

[Diagram: input blocks “dogs like cats” are processed by several identity Maps; the shuffle feeds a single Reduce, which outputs “cats dogs like”]
Sorting: Generalization
• Define a hash function such that
– h: {k} → [1,N]
– It preserves the order: k ≤ s → h(k) ≤ h(s)
– e.g., h(k) is a fixed-size prefix of the string k (the first 2 bytes)
• Identity mapper
• With a specialized partitioner (a sketch follows below)
– Computes the hash h(k) of the key and assigns <k,v> to reducer Rh(k)
• Identity reducers
– The number of reducers is N: R1, …, RN
– The input of Ri is all pairs whose keys satisfy h(k) = i
– Ri is an identity reducer, which writes its output to HDFS file fi
– The choice of hash function guarantees that keys from fi are less than keys from fj if i < j
• This algorithm was implemented to win Gray’s Terasort Benchmark in 2008
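A minimal added sketch of such an order-preserving partitioner in Hadoop's Partitioner API (an illustration in the spirit of TeraSort, not the actual winning code): h(k) is the first two bytes of the key, scaled onto the reducer range, so byte-wise key order implies partition order:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class PrefixPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        byte[] b = key.getBytes();
        int first = key.getLength() > 0 ? (b[0] & 0xff) : 0;
        int second = key.getLength() > 1 ? (b[1] & 0xff) : 0;
        int prefix = (first << 8) | second;               // h(k) in [0, 65536)
        return (int) ((long) prefix * numPartitions / 65536);
    }
}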
Undirected Graphs
• “A Discipline of Programming”, E. W. Dijkstra, Ch. 23
– Good old classics
• A graph is defined by V = {v}, E = {<v,w> | v,w Є V}
• Undirected graph: E is symmetric, that is, <v,w> Є E ≡ <w,v> Є E
• Different representations of E
1. A set of pairs
2. <v, {direct neighbors}>
3. An adjacency matrix
• From 1 to 2 in one MR job (sketched below)
– Identity mapper
– Combiner = reducer
– The reducer joins the values for each vertex
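An added sketch of the set-of-pairs → adjacency-list job. The slides use an identity mapper, which assumes the symmetric pair <w,v> is already present in the input for every <v,w>; this mapper emits both directions explicitly instead:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class EdgesToAdjacency {
    public static class EdgeMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] vw = value.toString().split("\\s+");  // one edge "v w" per line
            ctx.write(new Text(vw[0]), new Text(vw[1]));
            ctx.write(new Text(vw[1]), new Text(vw[0]));
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text v, Iterable<Text> neighbors, Context ctx)
                throws IOException, InterruptedException {
            // Join all neighbors of v into one record: <v, {direct neighbors}>
            StringBuilder sb = new StringBuilder();
            for (Text w : neighbors)
                sb.append(w).append(' ');
            ctx.write(v, new Text(sb.toString().trim()));
        }
    }
}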
Connected Components
• Partition the set of nodes V into disjoint subsets V1, …, VN
– V = V1 U … U VN
– There are no paths using E from Vi to Vj if i ≠ j
– Gi = <Vi, Ei>
• Representation of a connected component
– key = min{Vi}
– value = Vi
• A chain of MR jobs
• Initial data representation
– E is partitioned into sets of records (blocks)
– <v,w> Є E → <min(v,w), {v,w}> = <k, C>

MR Connected Components
• Mapper / Reducer input
– {<k, C>}, where C is a subset of V and k = min(C)
• The mapper merges overlapping components within its input; the reducer unions all components sharing the same minimal vertex. For example, {1,2} and {2,3} overlap at vertex 2, so they merge into <1, {1,2,3}>.
• Iterate; stop when the components stabilize

Map({<k, C>})
  For all <ki, Ci> and <kj, Cj>
    if Ci ∩ Cj ≠ Ø then
      C = Ci U Cj
      Emit(min(C), C)

Reduce(k, {C1, C2, …})
  resC = C1 U C2 U …
  Emit(k, resC)
The End

More Related Content

What's hot

Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
joelcrabb
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013
WANdisco Plc
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
DerrekYoungDotCom
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce
fvanvollenhoven
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practice
Denis Shestakov
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
IIIT-H
 
Hadoop
HadoopHadoop
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
Shweta Patnaik
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
Milind Bhandarkar
 
Hadoop distributed computing framework for big data
Hadoop distributed computing framework for big dataHadoop distributed computing framework for big data
Hadoop distributed computing framework for big data
Cyanny LIANG
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Ran Ziv
 
Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)
Robert Grossman
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
tipanagiriharika
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
Victoria López
 
An Introduction to the World of Hadoop
An Introduction to the World of HadoopAn Introduction to the World of Hadoop
An Introduction to the World of Hadoop
University College Cork
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
Nagarjuna Kanamarlapudi
 
Hadoop Technologies
Hadoop TechnologiesHadoop Technologies
Hadoop Technologies
Kannappan Sirchabesan
 
Scalable high-dimensional indexing with Hadoop
Scalable high-dimensional indexing with HadoopScalable high-dimensional indexing with Hadoop
Scalable high-dimensional indexing with Hadoop
Denis Shestakov
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
Phil Young
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
Siva Pandeti
 

What's hot (20)

Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practice
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Hadoop distributed computing framework for big data
Hadoop distributed computing framework for big dataHadoop distributed computing framework for big data
Hadoop distributed computing framework for big data
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
An Introduction to the World of Hadoop
An Introduction to the World of HadoopAn Introduction to the World of Hadoop
An Introduction to the World of Hadoop
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
 
Hadoop Technologies
Hadoop TechnologiesHadoop Technologies
Hadoop Technologies
 
Scalable high-dimensional indexing with Hadoop
Scalable high-dimensional indexing with HadoopScalable high-dimensional indexing with Hadoop
Scalable high-dimensional indexing with Hadoop
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 

Similar to Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Hadoop
HadoopHadoop
Hadoop
Anil Reddy
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
MaruthiPrasad96
 
Hadoop classes in mumbai
Hadoop classes in mumbaiHadoop classes in mumbai
Hadoop classes in mumbai
Vibrant Technologies & Computers
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
ThoughtWorks
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive Computing
Collin Bennett
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
Harisankar H
 
Introduction to R.pptx
Introduction to R.pptxIntroduction to R.pptx
Introduction to R.pptx
karthikks82
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
EMC
 
MapReduce Algorithm Design - Parallel Reduce Operations
MapReduce Algorithm Design - Parallel Reduce OperationsMapReduce Algorithm Design - Parallel Reduce Operations
MapReduce Algorithm Design - Parallel Reduce Operations
Jason J Pulikkottil
 
ch02-mapreduce.pptx
ch02-mapreduce.pptxch02-mapreduce.pptx
ch02-mapreduce.pptx
GiannisPagges
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
Kelly Technologies
 
mapreduce ppt.ppt
mapreduce ppt.pptmapreduce ppt.ppt
mapreduce ppt.ppt
TAGADPALLEWARPARTHVA
 
Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologies
Kelly Technologies
 
R for hadoopers
R for hadoopersR for hadoopers
R for hadoopers
Gwen (Chen) Shapira
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
Fabio Fumarola
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
Kelly Technologies
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
Csaba Toth
 
L3.fa14.ppt
L3.fa14.pptL3.fa14.ppt
L3.fa14.ppt
Tushar557668
 
MAPREDUCE ppt big data computing fall 2014 indranil gupta.ppt
MAPREDUCE ppt big data computing fall 2014 indranil gupta.pptMAPREDUCE ppt big data computing fall 2014 indranil gupta.ppt
MAPREDUCE ppt big data computing fall 2014 indranil gupta.ppt
zuhaibmohammed465
 
Hadoop
HadoopHadoop

Similar to Distributed Computing with Apache Hadoop. Introduction to MapReduce. (20)

Hadoop
HadoopHadoop
Hadoop
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 
Hadoop classes in mumbai
Hadoop classes in mumbaiHadoop classes in mumbai
Hadoop classes in mumbai
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive Computing
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
Introduction to R.pptx
Introduction to R.pptxIntroduction to R.pptx
Introduction to R.pptx
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
MapReduce Algorithm Design - Parallel Reduce Operations
MapReduce Algorithm Design - Parallel Reduce OperationsMapReduce Algorithm Design - Parallel Reduce Operations
MapReduce Algorithm Design - Parallel Reduce Operations
 
ch02-mapreduce.pptx
ch02-mapreduce.pptxch02-mapreduce.pptx
ch02-mapreduce.pptx
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
mapreduce ppt.ppt
mapreduce ppt.pptmapreduce ppt.ppt
mapreduce ppt.ppt
 
Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologies
 
R for hadoopers
R for hadoopersR for hadoopers
R for hadoopers
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
 
L3.fa14.ppt
L3.fa14.pptL3.fa14.ppt
L3.fa14.ppt
 
MAPREDUCE ppt big data computing fall 2014 indranil gupta.ppt
MAPREDUCE ppt big data computing fall 2014 indranil gupta.pptMAPREDUCE ppt big data computing fall 2014 indranil gupta.ppt
MAPREDUCE ppt big data computing fall 2014 indranil gupta.ppt
 
Hadoop
HadoopHadoop
Hadoop
 

Recently uploaded

Intro to Amazon Web Services (AWS) and Gen AI
Intro to Amazon Web Services (AWS) and Gen AIIntro to Amazon Web Services (AWS) and Gen AI
Intro to Amazon Web Services (AWS) and Gen AI
Ortus Solutions, Corp
 
Folding Cheat Sheet #7 - seventh in a series
Folding Cheat Sheet #7 - seventh in a seriesFolding Cheat Sheet #7 - seventh in a series
Folding Cheat Sheet #7 - seventh in a series
Philip Schwarz
 
Leading Project Management Tool Taskruop.pptx
Leading Project Management Tool Taskruop.pptxLeading Project Management Tool Taskruop.pptx
Leading Project Management Tool Taskruop.pptx
taskroupseo
 
Google ML-Kit - Understanding on-device machine learning
Google ML-Kit - Understanding on-device machine learningGoogle ML-Kit - Understanding on-device machine learning
Google ML-Kit - Understanding on-device machine learning
VishrutGoyani1
 
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) Course Introducti...
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) Course Introducti...AWS Cloud Practitioner Essentials (Second Edition) (Arabic) Course Introducti...
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) Course Introducti...
karim wahed
 
active-directory-auditing-solution (2).pptx
active-directory-auditing-solution (2).pptxactive-directory-auditing-solution (2).pptx
active-directory-auditing-solution (2).pptx
sudsdeep
 
Discover the Power of ONEMONITAR: The Ultimate Mobile Spy App for Android Dev...
Discover the Power of ONEMONITAR: The Ultimate Mobile Spy App for Android Dev...Discover the Power of ONEMONITAR: The Ultimate Mobile Spy App for Android Dev...
Discover the Power of ONEMONITAR: The Ultimate Mobile Spy App for Android Dev...
onemonitarsoftware
 
Addressing the Top 9 User Pain Points with Visual Design Elements.pptx
Addressing the Top 9 User Pain Points with Visual Design Elements.pptxAddressing the Top 9 User Pain Points with Visual Design Elements.pptx
Addressing the Top 9 User Pain Points with Visual Design Elements.pptx
Sparity1
 
Shivam Pandit working on Php Web Developer.
Shivam Pandit working on Php Web Developer.Shivam Pandit working on Php Web Developer.
Shivam Pandit working on Php Web Developer.
shivamt017
 
A Comparative Analysis of Functional and Non-Functional Testing.pdf
A Comparative Analysis of Functional and Non-Functional Testing.pdfA Comparative Analysis of Functional and Non-Functional Testing.pdf
A Comparative Analysis of Functional and Non-Functional Testing.pdf
kalichargn70th171
 
dachnug51 - HCL Sametime 12 as a Software Appliance.pdf
dachnug51 - HCL Sametime 12 as a Software Appliance.pdfdachnug51 - HCL Sametime 12 as a Software Appliance.pdf
dachnug51 - HCL Sametime 12 as a Software Appliance.pdf
DNUG e.V.
 
ThaiPy meetup - Indexes and Django
ThaiPy meetup - Indexes and DjangoThaiPy meetup - Indexes and Django
ThaiPy meetup - Indexes and Django
akshesh doshi
 
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdf
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdfAWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdf
AWS Cloud Practitioner Essentials (Second Edition) (Arabic) AWS Security .pdf
karim wahed
 
Splunk_Remote_Work_Insights_Overview.pptx
Splunk_Remote_Work_Insights_Overview.pptxSplunk_Remote_Work_Insights_Overview.pptx
Splunk_Remote_Work_Insights_Overview.pptx
sudsdeep
 
ENISA Threat Landscape 2023 documentation
ENISA Threat Landscape 2023 documentationENISA Threat Landscape 2023 documentation
ENISA Threat Landscape 2023 documentation
sofiafernandezon
 
React vs Next js: Which is Better for Web Development? - Semiosis Software Pr...
React vs Next js: Which is Better for Web Development? - Semiosis Software Pr...React vs Next js: Which is Better for Web Development? - Semiosis Software Pr...
React vs Next js: Which is Better for Web Development? - Semiosis Software Pr...
Semiosis Software Private Limited
 
React Native vs Flutter - SSTech System
React Native vs Flutter  - SSTech SystemReact Native vs Flutter  - SSTech System
React Native vs Flutter - SSTech System
SSTech System
 
NBFC Software: Optimize Your Non-Banking Financial Company
NBFC Software: Optimize Your Non-Banking Financial CompanyNBFC Software: Optimize Your Non-Banking Financial Company
NBFC Software: Optimize Your Non-Banking Financial Company
NBFC Softwares
 
Ported to Cloud with Wing_ Blue ZnZone app from _Hexagonal Architecture Expla...
Ported to Cloud with Wing_ Blue ZnZone app from _Hexagonal Architecture Expla...Ported to Cloud with Wing_ Blue ZnZone app from _Hexagonal Architecture Expla...
Ported to Cloud with Wing_ Blue ZnZone app from _Hexagonal Architecture Expla...
Asher Sterkin
 
WEBINAR SLIDES: CCX for Cloud Service Providers
WEBINAR SLIDES: CCX for Cloud Service ProvidersWEBINAR SLIDES: CCX for Cloud Service Providers
WEBINAR SLIDES: CCX for Cloud Service Providers
Severalnines
 

Recently uploaded (20)

Intro to Amazon Web Services (AWS) and Gen AI

Distributed Computing with Apache Hadoop. Introduction to MapReduce.

6. Examples – Web
• Social Network Analysis
  – Intelligence
  – Trends

7. The Sorting Problem
• Classic in-memory sorting
  – Complexity: number of comparisons
• External sorting
  – Cannot load all data in memory
  – 16 GB RAM vs. a 200 GB file
  – Complexity: + disk IOs (bytes read or written)
• Distributed sorting
  – Cannot load the data on a single server
  – 12 drives * 2 TB = 24 TB of disk space vs. a 200 TB data set
  – Complexity: + network transfers

                Worst        Average      Space
  Bubble Sort   O(n²)        O(n²)        In-place
  Quicksort     O(n²)        O(n log n)   In-place
  Merge Sort    O(n log n)   O(n log n)   Double

8. What do we do?
• We need a lot of computers
• How do we make them work together?

9. Hadoop
• Apache Hadoop is an ecosystem of tools for processing “Big Data”
• Started in 2005 by D. Cutting and M. Cafarella
• Consists of two main components, providing a unified cluster view:
  1. HDFS – a distributed file system
     – File system API connecting thousands of drives
  2. MapReduce – a framework for distributed computations
     – Splitting jobs into parts executable on one node
     – Scheduling and monitoring of job execution
• Today used everywhere: becoming a standard of distributed computing
• Hadoop is an open source project

10. MapReduce
• MapReduce
  – 2004: Jeffrey Dean, Sanjay Ghemawat. Google.
  – “MapReduce: Simplified Data Processing on Large Clusters”
• Computational model
  – What is a computational model? E.g., the Turing machine, or Java
  – Split large input data into pieces small enough to process in parallel
• Execution framework
  – Analogous to compilers and interpreters
  – Scheduling, processing, coordination
  – Failure recovery

11–14. Functional Programming: map and reduce
• Map: a higher-order function
  – Applies a given function to each element of a list
  – Returns the list of results
  – Map( f(x), X[1:n] ) → [ f(X[1]), …, f(X[n]) ]
  – Example: Map( x², [0,1,2,3,4,5] ) = [0,1,4,9,16,25]
• Reduce / fold: a higher-order function
  – Iterates a given function over a list of elements
  – Applies the function to the previous result and the current element
  – Returns a single result
  – Example: Reduce( x + y, [0,1,2,3,4,5] ) = (((((0 + 1) + 2) + 3) + 4) + 5) = 15
  – Reduce( x * y, [0,1,2,3,4,5] ) = 0, because the list contains 0

15. Example: Sum of Squares
• Composition of
  – a map, followed by
  – a reduce applied to the results of the map
• Example:
  – Map( x², [1,2,3,4,5] ) = [1,4,9,16,25]
  – Reduce( x + y, [1,4,9,16,25] ) = ((((1 + 4) + 9) + 16) + 25) = 55
• Map is easily parallelizable
  – Compute x² for 1,2,3 on one node and for 4,5 on another
• Reduce is notoriously sequential
  – Need all squares at one node to compute the total sum
• Square pyramidal number: 1 + 4 + … + n² = n(n+1)(2n+1) / 6

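These examples translate directly into Java streams. A minimal sketch; the class name and printed values are illustrative, not part of the original deck:

  import java.util.List;
  import java.util.stream.Collectors;

  public class MapReduceDemo {
      public static void main(String[] args) {
          List<Integer> xs = List.of(0, 1, 2, 3, 4, 5);

          // Map: apply f(x) = x*x to every element of the list
          List<Integer> squares =
              xs.stream().map(x -> x * x).collect(Collectors.toList());
          System.out.println(squares);                           // [0, 1, 4, 9, 16, 25]

          // Reduce / fold: iterate a binary function over the list
          int sum = xs.stream().reduce(0, Integer::sum);         // 15
          int product = xs.stream().reduce(1, (x, y) -> x * y);  // 0 (the list contains 0)

          // Composition, the Sum of Squares example: a map followed by a reduce
          int sumOfSquares = List.of(1, 2, 3, 4, 5).stream()
              .map(x -> x * x)
              .reduce(0, Integer::sum);
          System.out.println(sum + " " + product + " " + sumOfSquares); // 15 0 55
      }
  }
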
16. Computational Model
• MapReduce is a parallel computational model
• A map-reduce algorithm = a job
• Operates on key-value pairs: (k, v)
  – Primitive types, Strings, or more complex structures
• A map-reduce job's input and output is a list of pairs {(k, v)}
• An MR job is defined by 2 functions:
  – map: (k1, v1) → {(k2, v2)}
  – reduce: (k2, {v2}) → {(k3, v3)}

17. Job Workflow
[Diagram: the words “dogs”, “like”, “cats” flow through map tasks emitting consonant/vowel counts – (C,3)(V,1), (C,2)(V,2), (C,3)(V,1) – which the reducers sum to the totals (C,8) and (V,4)]

18. The Algorithm

  Map(null, word)
      nC = Consonants(word)
      nV = Vowels(word)
      Emit(“Consonants”, nC)
      Emit(“Vowels”, nV)

  Reduce(key, {n1, n2, …})
      nRes = n1 + n2 + …
      Emit(key, nRes)

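As a sketch, this pseudocode maps onto the Hadoop Java API roughly as follows; the class name, key names, and vowel test are illustrative, not from the original deck:

  public static class LetterCountMapper
      extends Mapper<Object, Text, Text, LongWritable> {
    private static final String VOWELS = "aeiou";
    private static final Text CONSONANTS_KEY = new Text("Consonants");
    private static final Text VOWELS_KEY = new Text("Vowels");

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String word : value.toString().split("\\s+")) {
        // count vowels, then count the remaining letters as consonants
        long nV = word.chars()
            .filter(c -> VOWELS.indexOf(Character.toLowerCase(c)) >= 0).count();
        long nC = word.chars().filter(Character::isLetter).count() - nV;
        context.write(CONSONANTS_KEY, new LongWritable(nC));
        context.write(VOWELS_KEY, new LongWritable(nV));
      }
    }
  }
  // The reducer is the same per-key summation as WordMeanReducer on slide 25.
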
19. Computation Framework
• Two virtual clusters: HDFS and MapReduce
  – Physically tightly coupled; designed to work together
• Hadoop Distributed File System: views data as files and directories
• MapReduce is a parallel computation framework
  – A job scheduling and execution framework

20. HDFS Architecture Principles
• The name space is a hierarchy of files and directories
• Files are divided into blocks (typically 128 MB)
• Namespace (metadata) is decoupled from data
  – Fast namespace operations, not slowed down by data streaming
• A single NameNode keeps the entire name space in RAM
• DataNodes store data blocks on local drives
• Blocks are replicated on 3 DataNodes for redundancy and availability

21. MapReduce Framework
• Job input is a file or a set of files in a distributed file system (HDFS)
  – Input is split into blocks of roughly the same size
  – Blocks are replicated to multiple nodes
  – A block holds a list of key-value pairs
• A map task is scheduled to one of the nodes containing the block
  – Map task input is node-local
  – Map task results are node-local
• Map task results are grouped, one group per reducer; each group is sorted
• A reduce task is scheduled to a node
  – The reduce task transfers the targeted groups from all mapper nodes
  – Computes and stores results in a separate HDFS file
• Job output is a set of files in HDFS, with #files = #reducers

22. MapReduce Example: Mean
• Mean: µ = (1/n) ∑ xi
• Input: a large text file
• Output: the average length µ of the words in the file
• Example: µ({dogs, like, cats}) = 4

23. Mean Mapper
• Map input is the set of words {w} in the partition
  – Key = null, Value = w
• Map computes
  – The number of words in the partition
  – The total length of the words, ∑ length(w)
• Map output
  – <“count”, #words>
  – <“length”, #totalLength>

  Map(null, w)
      Emit(“count”, 1)
      Emit(“length”, length(w))

24. Single Mean Reducer
• Reduce input
  – {<key, {value}>}, where
  – key = “count” | “length”
  – value is an integer
• Reduce computes
  – Total number of words: N = sum of all “count” values
  – Total length of words: L = sum of all “length” values
• Reduce output
  – <“count”, N>
  – <“length”, L>
• The result: µ = L / N

  Reduce(key, {n1, n2, …})
      nRes = n1 + n2 + …
      Emit(key, nRes)

  Analyze()
      read(“part-r-00000”)
      print(“mean = ” + L/N)

25. Mean: Mapper, Reducer

  public class WordMean {
    private final static Text COUNT_KEY = new Text("count");
    private final static Text LENGTH_KEY = new Text("length");
    private final static LongWritable ONE = new LongWritable(1);

    public static class WordMeanMapper
        extends Mapper<Object, Text, Text, LongWritable> {
      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          String word = itr.nextToken();
          context.write(LENGTH_KEY, new LongWritable(word.length()));
          context.write(COUNT_KEY, ONE);
        }
      }
    }

    public static class WordMeanReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {
      public void reduce(Text key, Iterable<LongWritable> values, Context context)
          throws IOException, InterruptedException {
        long sum = 0;                       // long, not int: counts can exceed 2^31
        for (LongWritable val : values)
          sum += val.get();
        context.write(key, new LongWritable(sum));
      }
    }
    . . . . . . . . . . . . . . . .

26. Mean: main()

    . . . . . . . . . . . . . . . .
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
      if (otherArgs.length != 2) {
        System.err.println("Usage: wordmean <in> <out>");
        System.exit(2);
      }
      Job job = new Job(conf, "word mean");
      job.setJarByClass(WordMean.class);
      job.setMapperClass(WordMeanMapper.class);
      job.setReducerClass(WordMeanReducer.class);
      job.setCombinerClass(WordMeanReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(LongWritable.class);
      job.setNumReduceTasks(1);
      FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
      Path outputpath = new Path(otherArgs[1]);
      FileOutputFormat.setOutputPath(job, outputpath);
      boolean result = job.waitForCompletion(true);
      analyzeResult(outputpath);
      System.exit(result ? 0 : 1);
    }
    . . . . . . . . . . . . . . . .

27. Mean: analyzeResult()

    . . . . . . . . . . . . . . . .
    private static void analyzeResult(Path outDir) throws IOException {
      FileSystem fs = FileSystem.get(new Configuration());
      Path reduceFile = new Path(outDir, "part-r-00000");
      if (!fs.exists(reduceFile)) return;
      long count = 0, length = 0;
      BufferedReader in =
          new BufferedReader(new InputStreamReader(fs.open(reduceFile)));
      // read every output line; ready() can stop early, readLine() cannot
      for (String line = in.readLine(); line != null; line = in.readLine()) {
        StringTokenizer st = new StringTokenizer(line);
        String key = st.nextToken();
        String value = st.nextToken();
        if (key.equals("count")) count = Long.parseLong(value);
        else if (key.equals("length")) length = Long.parseLong(value);
      }
      in.close();
      double average = (double) length / count;
      System.out.println("The mean is: " + average);
    }
  } // end WordMean

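With the three pieces above in one file, the job is submitted with the standard hadoop jar command. A usage sketch, where the jar name and paths are illustrative:

  hadoop jar wordmean.jar WordMean /data/books /out/wordmean
  # analyzeResult() then reads /out/wordmean/part-r-00000 and prints the mean
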
28. MapReduce Implementation
• A single master JobTracker shepherds the distributed herd of TaskTrackers
  1. Job scheduling and resource allocation
  2. Job monitoring and job lifecycle coordination
  3. Cluster health and resource tracking
• A job is defined by
  – Program: the myJob.jar file
  – Configuration: conf.xml
  – Input and output paths
• The JobClient submits the job to the JobTracker
  – Calculates and creates splits based on the input
  – Writes myJob.jar and conf.xml to HDFS

29. MapReduce Implementation
• The JobTracker divides the job into tasks: one map task per split
  – Assigns a TaskTracker for each task, collocated with the split
• TaskTrackers execute tasks and report status to the JobTracker
  – A TaskTracker can run multiple map and reduce tasks
  – Map and reduce slots
• Failed attempts are reassigned to other TaskTrackers
• Job execution status and results are reported back to the client
• The scheduler lets many jobs run in parallel

30. Example: Standard Deviation
• Standard deviation: σ = sqrt( (1/n) ∑ (xi - µ)² )
• Input: a large text file
• Output: the standard deviation σ of word lengths
• Example: σ({dogs, like, cats}) = 0
• How many jobs are needed?

32. Standard Deviation Mapper
• Map input is the set of words {w} in the partition
  – Key = null, Value = w
• Map computes
  – The number of words in the partition
  – The total length of the words, ∑ length(w)
  – The sum of the lengths squared, ∑ length(w)²
• Map output
  – <“count”, #words>
  – <“length”, #totalLength>
  – <“squared”, #sumLengthSquared>

  Map(null, w)
      Emit(“count”, 1)
      Emit(“length”, length(w))
      Emit(“squared”, length(w)²)

33. Standard Deviation Reducer
• Reduce input
  – {<key, {value}>}, where
  – key = “count” | “length” | “squared”
  – value is an integer
• Reduce computes
  – Total number of words: N = sum of all “count” values
  – Total length of words: L = sum of all “length” values
  – Sum of length squares: S = sum of all “squared” values
• Reduce output
  – <“count”, N>
  – <“length”, L>
  – <“squared”, S>
• The result
  – µ = L / N
  – σ = sqrt(S/N - µ²)

  Reduce(key, {n1, n2, …})
      nRes = n1 + n2 + …
      Emit(key, nRes)

  Analyze()
      read(“part-r-00000”)
      print(“mean = ” + L/N)
      print(“std.dev = ” + sqrt(S/N - (L/N)*(L/N)))

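A single job therefore suffices: compared to the Mean job, the mapper needs one extra emission and the reducer is unchanged. A sketch reusing the WordMean classes from slide 25; SQUARED_KEY and the class name are illustrative:

  private final static Text SQUARED_KEY = new Text("squared");

  public static class WordStdDevMapper
      extends Mapper<Object, Text, Text, LongWritable> {
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        long len = itr.nextToken().length();
        context.write(COUNT_KEY, ONE);
        context.write(LENGTH_KEY, new LongWritable(len));
        context.write(SQUARED_KEY, new LongWritable(len * len)); // the extra emission
      }
    }
  }
  // WordMeanReducer is reused unchanged; the analysis step then computes
  // µ = L/N and σ = sqrt(S/N - µ²) from the three output lines.
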
34. Combiner, Partitioner
• Combiners perform local aggregation before the shuffle & sort phase
  – An optimization to reduce data transfers during shuffle
  – In the Mean example it reduces the transfer of many keys to only two
• Partitioners assign intermediate (map) key-value pairs to reducers
  – Responsible for dividing up the intermediate key space
  – Not used with a single reducer
[Diagram: Input → Map → Combiner → Partitioner → Shuffle & sort → Reduce → Output]

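In the Java API the combiner is just another reducer class set on the job (as main() on slide 26 already does), while a custom partitioner is a small Partitioner subclass. A sketch, with illustrative partitioning logic:

  // Combiner: reuse the reducer, since summing is associative and commutative
  job.setCombinerClass(WordMeanReducer.class);

  // Partitioner: route each intermediate key-value pair to a reducer
  public static class FirstLetterPartitioner
      extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text key, LongWritable value, int numPartitions) {
      // illustrative: partition by the first character of the key
      return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
  }
  job.setPartitionerClass(FirstLetterPartitioner.class);
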
35. Distributed Sorting
• Sort a dataset that cannot be entirely stored on one node
• Input:
  – A set of files of 100-byte records
  – The first 10 bytes of each record are the key; the rest is the value
• Output:
  – An ordered list of files: f1, …, fN
  – Each file fi is sorted, and
  – If i < j, then for any keys k Є fi and r Є fj, k ≤ r
  – The concatenation of the files in the given order must form a completely sorted record set

36. Naïve MapReduce Sorting
• If the output could be stored on one node:
• The input to any reducer is always sorted by key
  – Shuffle sorts the map outputs
• One identity mapper and one identity reducer would do the trick
  – Identity: <k,v> → <k,v>
[Diagram: input “dogs like cats” → Map → Shuffle → Reduce → sorted output “cats dogs like”]

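In Hadoop's Java API the base Mapper and Reducer classes already behave as identity functions, so the naïve sort is, as a configuration sketch, almost no code at all:

  // a sketch: the default Mapper and Reducer pass <k,v> through unchanged
  job.setMapperClass(Mapper.class);      // identity map
  job.setReducerClass(Reducer.class);    // identity reduce
  job.setNumReduceTasks(1);              // a single reducer yields fully sorted output
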
37. Naïve Sorting: Multiple Maps
• Multiple identity mappers and one identity reducer give the same result
  – Does not work for multiple reducers
[Diagram: several map tasks feed one reduce task through the shuffle; “dogs like cats” again comes out as “cats dogs like”]

38. Sorting: Generalization
• Define a hash function h such that
  – h: {k} → [1,N]
  – It preserves the order: k ≤ s → h(k) ≤ h(s)
  – h(k) is a fixed-size prefix of the string k (the 2 first bytes)
• Identity mapper
  – With a specialized partitioner
  – Compute the hash of the key, h(k), and assign <k,v> to reducer R_h(k)
• Identity reducer
  – The number of reducers is N: R1, …, RN
  – The inputs of Ri are all pairs whose keys satisfy h(k) = i
  – Ri is an identity reducer, which writes its output to HDFS file fi
  – The choice of the hash function guarantees that keys from fi are less than keys from fj if i < j
• The algorithm was implemented to win Gray’s Terasort benchmark in 2008

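A sketch of such an order-preserving partitioner, assuming keys of at least two bytes with roughly uniform prefixes (a production sorter would additionally balance partitions, e.g. by sampling the input):

  public static class PrefixPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
      // order-preserving hash: the first two bytes of the key, read as an
      // unsigned 16-bit number, scaled to the number of reducers
      byte[] b = key.getBytes();
      int prefix = ((b[0] & 0xFF) << 8) | (b[1] & 0xFF);   // 0 .. 65535
      return prefix * numPartitions / 65536;               // monotone in the key
    }
  }
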
39. Undirected Graphs
• “A Discipline of Programming”, E. W. Dijkstra, Ch. 23
  – Good old classics
• A graph is defined by V = {v}, E = {<v,w> | v,w Є V}
• Undirected graph: E is symmetrical, that is <v,w> Є E ≡ <w,v> Є E
• Different representations of E
  1. A set of pairs
  2. <v, {direct neighbors}>
  3. An adjacency matrix
• From 1 to 2 in one MR job (see the sketch below)
  – Identity mapper
  – Combiner = Reducer
  – The reducer joins the values for each vertex

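A sketch of the reducer for the 1 → 2 conversion, with illustrative types (vertices and neighbor lists as Text):

  // edges arrive as <v, w> pairs (both directions present, since E is symmetric);
  // the reducer gathers all neighbors of v into one comma-separated record
  public static class AdjacencyReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text v, Iterable<Text> neighbors, Context context)
        throws IOException, InterruptedException {
      StringBuilder list = new StringBuilder();
      for (Text w : neighbors) {
        if (list.length() > 0) list.append(',');
        list.append(w.toString());
      }
      context.write(v, new Text(list.toString()));   // <v, {direct neighbors}>
    }
  }
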
40. Connected Components
• Partition the set of nodes V into disjoint subsets V1, …, VN
  – V = V1 U … U VN
  – No paths using E from Vi to Vj if i ≠ j
  – Gi = <Vi, Ei>
• Representation of a connected component
  – key = min{Vi}
  – value = Vi
• A chain of MR jobs
• Initial data representation
  – E is partitioned into sets of records (blocks)
  – <v,w> Є E → <min(v,w), {v,w}> = <k, C>

41. MR Connected Components
• Mapper / Reducer input
  – {<k, C>}, where C is a subset of V and k = min(C)
• Mapper

  Map({<k, C>})
      For all <ki, Ci> and <kj, Cj>
          if Ci ∩ Cj ≠ ∅ then C = Ci U Cj
      Emit(min(C), C)

• Reducer

  Reduce(k, {C1, C2, …})
      resC = C1 U C2 U …
      Emit(k, resC)

• Iterate; stop when the components stabilize

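A sketch of the chain-of-jobs driver, assuming a Configuration conf and mapper/reducer classes implementing the pseudocode above (all names are illustrative):

  static void runConnectedComponents(Configuration conf) throws Exception {
    boolean changed = true;
    Path input = new Path("cc/round0");                   // initial <k, C> records
    for (int round = 1; changed; round++) {
      Path output = new Path("cc/round" + round);
      Job job = Job.getInstance(conf, "connected components, round " + round);
      job.setJarByClass(ConnectedComponents.class);       // illustrative class names
      job.setMapperClass(ComponentMergeMapper.class);     // merges overlapping C's
      job.setReducerClass(ComponentUnionReducer.class);   // unions the C's per key
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, output);
      if (!job.waitForCompletion(true))
        throw new IOException("round " + round + " failed");
      // a user-defined counter the reducer increments whenever two sets merge;
      // when no merges happen, the components have stabilized
      changed = job.getCounters().findCounter("CC", "MERGES").getValue() > 0;
      input = output;
    }
  }
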