Big Data Analysis using Hadoop
Map-Reduce – An Introduction
Lecture 2
Brendan Tierney (t: @brendantierney)
HDFS Architecture [from Hadoop in Practice, Alex Holmes]

MapReduce
•  A batch-based, distributed computing framework modelled on Google’s paper on
MapReduce []
•  MapReduce decomposes work into small parallelised map and reduce tasks which
are scheduled for remote execution on slave nodes
•  Terminology
•  A job is a full programme
•  A task is the execution of a single map or reduce task over a slice of
data called a split
•  A Mapper is a map task
•  A Reducer is a reduce task
•  MapReduce works by manipulating key/value pairs in the general format 
map(key1,value1)➝ list(key2,value2)
reduce(key2,list(value2)) ➝ (key3, value3)
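
The key/value flow above can be sketched in plain Java without any Hadoop classes. This is only an illustration of the data flow under stated assumptions: the class name KeyValueFlowSketch and its methods are hypothetical, the input is an in-memory list of lines, and a TreeMap stands in for the grouping and sorting that Hadoop performs between the two phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the map -> group -> reduce data flow (no Hadoop involved).
class KeyValueFlowSketch {

    // map(key1, value1) -> list(key2, value2): here (line offset, line) -> list of (word, 1)
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // reduce(key2, list(value2)) -> (key3, value3): here (word, [1, 1, ...]) -> (word, total)
    static Map.Entry<String, Integer> reduce(String word, List<Integer> ones) {
        int total = 0;
        for (int one : ones) {
            total += one;
        }
        return Map.entry(word, total);
    }

    // Runs map over every line, groups the intermediate pairs by key, then reduces each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>(); // keeps keys sorted
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(offset, line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
            offset += line.length() + 1; // assumes single-byte characters and a one-byte newline
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            Map.Entry<String, Integer> out = reduce(group.getKey(), group.getValue());
            result.put(out.getKey(), out.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {drink=1, he=2, likes=2, to=2, wink=1}
        System.out.println(run(List.of("he likes to wink", "he likes to drink")));
    }
}
```

In a real job the grouping step is distributed across the cluster; here it is a single in-memory map so the three stages can be read in one place.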
A MapReduce Job [from Hadoop in Practice, Alex Holmes]
•  The input is divided into fixed-size pieces called input splits. A map task is
created for each split.
•  The role of the programmer is to define the Map and Reduce functions.
•  The Shuffle & Sort phases between the Map and the Reduce phases combine map
outputs and sort them for the Reducers.
•  The Reduce phase merges the data, as defined by the programmer, to produce the
outputs.

•  The Map function 
•  The Mapper takes as input a key/value pair which represents a logical
record from the input data source (e.g. a line in a file) 
•  It produces zero or more output key/value pairs for each input pair
•  e.g. a filtering function may only produce output if a certain
condition is met
•  e.g. a counting function may produce multiple key/value pairs, one
per element being counted
map(in_key, in_value) ➝ list(temp_key, temp_value)
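
The two kinds of map function mentioned above can be sketched in plain Java. The class and method names (MapFunctionSketch, filterMap, letterCountMap) are hypothetical and no Hadoop types are used; the point is only that a map call may return an empty list, one pair, or many pairs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MapFunctionSketch {

    // Filtering map: emits the record only when the line contains the search term,
    // so each input pair yields zero or one output pairs.
    static List<Map.Entry<Long, String>> filterMap(long offset, String line, String term) {
        List<Map.Entry<Long, String>> out = new ArrayList<>();
        if (line.contains(term)) {
            out.add(Map.entry(offset, line));
        }
        return out;
    }

    // Counting map: emits one (letter, 1) pair per letter in the line,
    // so each input pair can yield many output pairs.
    static List<Map.Entry<Character, Integer>> letterCountMap(long offset, String line) {
        List<Map.Entry<Character, Integer>> out = new ArrayList<>();
        for (char c : line.toCharArray()) {
            if (Character.isLetter(c)) {
                out.add(Map.entry(Character.toLowerCase(c), 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(filterMap(0L, "he likes to drink", "wink")); // no match, empty list
        System.out.println(letterCountMap(0L, "a yink"));               // five (letter, 1) pairs
    }
}
```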

•  The Reducer(s)
•  A single Reducer handles all the map output for a unique map output key
•  A Reducer outputs zero to many key/value pairs
•  The output is written to HDFS files, to external DBs, or to any data sink...
reduce(temp_key, list(temp_values)) ➝ list(out_key, out_value)
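
As a small illustration of "zero to many" outputs, the following plain-Java sketch shows a reduce function that emits a pair only when a word is frequent enough. The name ReduceFunctionSketch, the method frequentWords and the minCount threshold are assumptions made for this example, not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ReduceFunctionSketch {

    // Receives every value emitted for one key and returns (word, total)
    // only when the total reaches minCount: zero or one output pair per key.
    static List<Map.Entry<String, Integer>> frequentWords(String word, List<Integer> counts, int minCount) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        if (total >= minCount) {
            out.add(Map.entry(word, total));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(frequentWords("he", List.of(1, 1, 1), 2)); // one pair: he=3
        System.out.println(frequentWords("a", List.of(1), 2));        // empty list
    }
}
```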

•  JobTracker - (Master)
•  Controls MapReduce jobs
•  Assigns Map & Reduce tasks to the other nodes on the cluster
•  Monitors the tasks as they are running
•  Relaunches failed tasks on other nodes in the cluster
•  TaskTracker - (Slave)
•  A single TaskTracker per slave node 
•  Manages the execution of the individual tasks on the node
•  Can instantiate many JVMs to handle tasks in parallel
•  Communicates back to the JobTracker (via a heartbeat)
[from Hadoop in Practice, Alex Holmes]

A MapReduce Job [from Hadoop: The Definitive Guide, Tom White]

Monitoring progress [from Hadoop: The Definitive Guide, Tom White]

YARN (Yet Another Resource Negotiator) Framework

Data Locality
“This is a local node for local Data”
•  Whenever possible Hadoop will attempt to ensure that a Mapper on a node is
working on a block of data stored locally on that node via HDFS
•  If this is not possible, the Mapper will have to transfer the data across the network as
it accesses the data
•  Once all the Map tasks are finished, the map output data is transferred across the
network to the Reducers
•  Although Reducers may run on the same node (physical machine) as the Mappers
there is no concept of data locality for Reducers
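
The preference for local data can be sketched as a simple selection rule. This is a toy model, not Hadoop's scheduler: the class LocalitySketch, the method pickSplit and the split and node names are all hypothetical, and the sketch ignores intermediate cases such as rack locality.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

class LocalitySketch {

    // Chooses a pending split for the given node: prefer a split with a replica on that
    // node, otherwise fall back to any pending split (which must be read over the network).
    static Optional<String> pickSplit(String node, List<String> pendingSplits,
                                      Map<String, Set<String>> replicaNodesBySplit) {
        for (String split : pendingSplits) {
            if (replicaNodesBySplit.getOrDefault(split, Set.of()).contains(node)) {
                return Optional.of(split); // data-local task
            }
        }
        return pendingSplits.stream().findFirst(); // non-local, or empty if nothing is pending
    }

    public static void main(String[] args) {
        Map<String, Set<String>> replicas = Map.of(
                "split-A", Set.of("node1", "node3"),
                "split-B", Set.of("node2", "node3"));
        List<String> pending = List.of("split-A", "split-B");
        System.out.println(pickSplit("node2", pending, replicas)); // Optional[split-B]
        System.out.println(pickSplit("node4", pending, replicas)); // Optional[split-A]
    }
}
```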

Bottlenecks?
•  Reducers cannot start until all Mappers are finished and the output has been
transferred to the Reducers and sorted
•  To alleviate bottlenecks in Shuffle & Sort - Hadoop starts to transfer data to the
Reducers as the Mappers finish
•  The percentage of Mappers which should finish before the Reducers
start retrieving data is configurable
•  To alleviate bottlenecks caused by slow Mappers - Hadoop uses speculative
execution
•  If a Mapper appears to be running significantly slower than the others, a
new instance of the Mapper will be started on another machine,
operating on the same data (remember replication)
•  The results of the first Mapper to finish will be used
•  The Mapper which is still running will be terminated by Hadoop
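
The "first to finish wins" idea behind speculative execution can be imitated with the standard java.util.concurrent API. This is an analogy rather than Hadoop's mechanism: the class SpeculativeExecutionSketch and its methods are hypothetical, both attempts are started at the same time (Hadoop only launches the second attempt once a task looks slow), and Thread.sleep stands in for the work of a map task.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SpeculativeExecutionSketch {

    // Simulates one attempt of a map task: sleeps for delayMillis, then returns its label.
    static Callable<String> attempt(String label, long delayMillis) {
        return () -> {
            Thread.sleep(delayMillis);
            return label;
        };
    }

    // Runs two attempts of the same task and returns the result of whichever finishes first.
    static String firstToFinish(Callable<String> original, Callable<String> speculative) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // invokeAny returns one successfully completed result and cancels the other tasks
            return pool.invokeAny(List.of(original, speculative));
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("no attempt completed", e);
        } finally {
            pool.shutdownNow(); // interrupt the attempt that is still running
        }
    }

    public static void main(String[] args) {
        String winner = firstToFinish(attempt("slow original", 2000), attempt("speculative copy", 50));
        System.out.println(winner);
    }
}
```

Because both attempts operate on the same input, it does not matter which one supplies the result; the slower attempt is simply abandoned.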

The MapReduce Job
Let us build up an example

The Scenario
•  Build a Word Counter
•  Using the Shakespeare Poems
•  Count the number of times a word appears
in the data set
•  Use Map-Reduce to do this work
•  Step-by-Step of creating the MR process

Driving Class, Mapper, Reducer

Setting up the MapReduce Job 

•  A Job object forms the specification for the job
•  Job needs to know:
•  the jar file that the code is in which will be distributed around the cluster; setJarByClass()
•  the input path(s) (in HDFS) for the job; FileInputFormat.addInputPath()
•  the output path(s) (in HDFS) for the job; FileOutputFormat.setOutputPath()
•  the Mapper and Reducer classes; setMapperClass() setReducerClass()
•  the output key and value classes; setOutputKeyClass() setOutputValueClass()
•  the Mapper output key and value classes if they are different from the Reducer;
setMapOutputKeyClass() setMapOutputValueClass()
•  the name of the job, default is the name of the jar file; setJobName()
•  The default input considers the file as lines of text
•  The default key input is LongWritable (the byte offset into the file)
•  The default value input is Text (the contents of the line read from the file)
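
To make the default (byte offset, line) records concrete, here is a plain-Java sketch that produces the same kind of pairs from an in-memory string. LineRecordSketch and records are hypothetical names, and the sketch assumes UTF-8 text with single '\n' line endings; it does not read from HDFS or deal with split boundaries.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class LineRecordSketch {

    // Splits text on '\n' and returns (byte offset of line start -> line contents), in file order.
    static Map<Long, String> records(String fileContents) {
        Map<Long, String> out = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            out.put(offset, line);
            // advance by the UTF-8 byte length of the line plus one byte for the newline
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        // prints {0=this one, 9=he likes, 18=to drink}
        System.out.println(records("this one\nhe likes\nto drink"));
    }
}
```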

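Driver Code
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: WordCount <input path> <output path>");
            System.exit(-1);
        }
        Job job = Job.getInstance();
        job.setJarByClass(WordCount.class);
        job.setJobName("WordCount");
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}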

You will typically import these classes into every
MapReduce job you write. We will omit the import
statements in future slides for brevity.

The main method accepts two command-line arguments: the input and output directories.
The first step is to ensure that we have been given two command-line arguments. If not,
print a help message and exit:
    if (args.length != 2) {
        System.err.println("Usage: WordCount <input path> <output path>");
        System.exit(-1);
    }

Create a new job, specify the class used to locate the jar file that contains the job's
code, and give it a job name:
    Job job = Job.getInstance();
    job.setJarByClass(WordCount.class);
    job.setJobName("WordCount");

Give the Job information about the classes for the Mapper and the Reducer:
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);

Specify the types of the intermediate output key and value produced by the Mapper:
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

Specify the types for the Reducer output key and value:
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

Specify the input directory (where the data will be read from) and the output directory
(where the data will be written):
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

File formats - Inputs
•  The default InputFormat (TextInputFormat) will be used unless you specify otherwise
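•  To use an InputFormat other than the default, use e.g.
job.setInputFormatClass(KeyValueTextInputFormat.class)
(with the older JobConf API: conf.setInputFormat(KeyValueTextInputFormat.class))
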
•  By default, FileInputFormat.setInputPaths() will read all files from a specified directory
and send them to Mappers
•  Exceptions: items whose names begin with a period (.) or underscore (_)
•  Globs can be specified to restrict input 
•  For example, /2010/*/01/*
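
The effect of a glob plus the hidden-file rule can be tried out with the JDK's own glob matcher. This only approximates Hadoop's behaviour: InputGlobSketch and accepted are hypothetical names, java.nio globs are not guaranteed to match Hadoop's glob syntax in every detail, and the example assumes Unix-style paths.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class InputGlobSketch {

    // True when the path matches the glob and its file name is not "hidden",
    // i.e. does not begin with a period or an underscore.
    static boolean accepted(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        Path p = Paths.get(path);
        String name = p.getFileName().toString();
        boolean hidden = name.startsWith(".") || name.startsWith("_");
        return matcher.matches(p) && !hidden;
    }

    public static void main(String[] args) {
        String glob = "/2010/*/01/*";
        System.out.println(accepted(glob, "/2010/05/01/part.txt")); // matches the glob
        System.out.println(accepted(glob, "/2010/05/02/part.txt")); // wrong day directory
        System.out.println(accepted(glob, "/2010/05/01/_SUCCESS")); // hidden by the underscore rule
    }
}
```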

File formats - Outputs
•  FileOutputFormat.setOutputPath() specifies the directory to which the Reducers will
write their final output
•  The driver can also specify the format of the output data
•  Default is a plain text file 
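•  Could be explicitly written as
job.setOutputFormatClass(TextOutputFormat.class);
(with the older JobConf API: conf.setOutputFormat(TextOutputFormat.class);)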

Submit the Job and wait for completion:
    System.exit(job.waitForCompletion(true) ? 0 : 1);

•  The Mapper takes as input a key/value pair which represents a logical record from the
input data source (e.g. a line in a file) 
•  The Mapper may use or ignore the input key
•  E.g. a standard pattern is to read a file one line at a time
•  Key = byte offset into the file where the line starts
•  Value = contents of the line in the file 
•  Typically the key can be considered irrelevant
•  It produces zero or more output key/value pairs for each input pair
•  e.g. a filtering function may only produce output if a certain condition is
met
•  e.g. a counting function may produce multiple key/value pairs, one per
element being counted

Mapper Class
•  extends the Mapper <K1, V1, K2, V2> class
•  key classes implement the WritableComparable interface and value classes
implement the Writable interface
•  most Mappers override the map method, which is called once for every
key/value pair in the input
•  void map (K1 key,
V1 value,
Context context) throws IOException, InterruptedException
•  the default map method is the Identity mapper - maps the inputs directly
to the outputs
•  in general the map input types K1, V1 are different from the map output
types K2, V2

Mapper Class
•  Hadoop provides a number of Mapper implementations:
InverseMapper - swaps the keys and values
TokenCounterMapper - tokenises the input and outputs each token with a count of 1
RegexMapper - extracts text matching a regular expression
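
What these three library Mappers do to a single record can be imitated in plain Java. The functions below are stand-ins written for illustration only (LibraryMapperSketch and its method names are hypothetical); they mirror the descriptions above rather than reproduce the Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LibraryMapperSketch {

    // InverseMapper-style: swap the key and the value.
    static <K, V> Map.Entry<V, K> inverse(K key, V value) {
        return Map.entry(value, key);
    }

    // TokenCounterMapper-style: emit (token, 1) for each whitespace-separated token.
    static List<Map.Entry<String, Integer>> tokenCount(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            out.add(Map.entry(tokens.nextToken(), 1));
        }
        return out;
    }

    // RegexMapper-style: emit (match, 1) for each substring matching the pattern.
    static List<Map.Entry<String, Long>> regexMatches(String line, String regex) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(line);
        while (m.find()) {
            out.add(Map.entry(m.group(), 1L));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(inverse("he", 3));
        System.out.println(tokenCount("he likes to wink,"));
        System.out.println(regexMatches("he likes to drink and drink", "dr\\w+"));
    }
}
```

Note that whitespace tokenising keeps punctuation attached to a token ("wink," above), which is one reason a word counter may prefer its own Mapper.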

Mapper Code
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Process the input text: split the line on runs of non-word characters
        String s = value.toString();
        for (String word : s.split("\\W+")) {
            if (word.length() > 0) {
                // Write the output: one (word, 1) pair per word
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }
}
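
The length check in the Mapper matters because String.split can return an empty first token. The following plain-Java sketch shows this, assuming the split pattern is the regular expression \W+ (one or more non-word characters); TokeniseSketch and words are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TokeniseSketch {

    // Splits a line on runs of non-word characters and drops empty tokens,
    // mirroring the loop in the Mapper's map method.
    static List<String> words(String line) {
        List<String> out = new ArrayList<>();
        for (String word : line.split("\\W+")) {
            if (word.length() > 0) {
                out.add(word);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "\"he likes to wink\"";
        // The leading quote yields an empty first token: prints [, he, likes, to, wink]
        System.out.println(Arrays.toString(line.split("\\W+")));
        // With the length check: prints [he, likes, to, wink]
        System.out.println(words(line));
    }
}
```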

What the mapper does
•  Input to the Mapper:
(“this one I think is called a yink”)
(“he likes to wink, he likes to drink”)
(“he likes to drink and drink and drink”)
•  Output from the Mapper:
(this, 1)
(one, 1)
(I, 1)
(think, 1)
(is, 1)
(a, 1)
(he, 1)
(drink, 1)

Shuffle and sort
•  Shuffle 
•  Integrates the data (key/value pairs) from outputs of each mapper
•  For now, integrates into 1 file
•  Sort 
•  The set of intermediate keys on a single node is automatically
sorted by Hadoop before they are presented to the Reducer
•  Sorted within key
•  Determines what subset of data goes to which Reducer
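
One common way to decide which Reducer receives a key is to hash the key and take the remainder modulo the number of Reducers. The sketch below uses String.hashCode() as a stand-in for the key's hash; PartitionSketch and partition are hypothetical names, and real jobs may configure a different partitioning rule.

```java
class PartitionSketch {

    // Maps a key to a reducer index in [0, numReducers).
    // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"he", "likes", "to", "drink"}) {
            System.out.println(word + " -> reducer " + partition(word, 4));
        }
    }
}
```

Because the index depends only on the key, every pair with the same key reaches the same Reducer, which is what allows a single reduce call to see all values for that key.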

Mapper output:
(this, 1)
(one, 1)
(I, 1)
(think, 1)
(is, 1)
(a, 1)
(he, 1)
(drink, 1)
Shuffle (Group):
(this, [1])
(one, [1])
(I, [1])
(think, [1])
(is, [1])
(a, [1])
(he, [1,1,1])
Sort:
(a, [1])
(he, [1,1,1])
(I, [1])
(is, [1])
(one, [1])
(think, [1])
(this, [1])

Reducer Class
•  extends the Reducer <K2, V2, K3, V3> class
•  key classes implement the WritableComparable interface and value classes implement the Writable interface
•  void reduce (K2 key,
Iterable<V2> values,
Context context) throws IOException, InterruptedException
•  called once for each input key
•  generates a list of output key/values pairs by iterating over the values associated with the
input key
•  the reduce input types K2, V2 must be the same types as the map output types
•  the reduce output types K3, V3 can be different from the reduce input types
•  the default reduce method is the Identity reducer - outputs each input key/value pair directly
•  getConfiguration() - access the Configuration for a Job
•  void setup (Context context) - called once at the beginning of the reduce task
•  void cleanup(Context context) - called at the end of the task to wrap up any
loose ends, closes files, db connections etc.
•  Default number of reducers = 1
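
The order of the lifecycle calls described above (setup once, reduce once per key, cleanup once) can be sketched in plain Java. ReducerLifecycleSketch and runTask are hypothetical names; the sketch records the calls as strings instead of invoking a real Reducer, and a TreeMap supplies the keys in sorted order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReducerLifecycleSketch {

    // Simulates one reduce task over already grouped input and returns the calls made, in order.
    static List<String> runTask(Map<String, List<Integer>> groupedInput) {
        List<String> calls = new ArrayList<>();
        calls.add("setup"); // once, before any key is processed
        try {
            for (Map.Entry<String, List<Integer>> group : new TreeMap<>(groupedInput).entrySet()) {
                int sum = 0;
                for (int v : group.getValue()) {
                    sum += v;
                }
                calls.add("reduce(" + group.getKey() + ") -> " + sum); // once per key
            }
        } finally {
            calls.add("cleanup"); // once, after the last key
        }
        return calls;
    }

    public static void main(String[] args) {
        // prints [setup, reduce(a) -> 1, reduce(he) -> 3, cleanup]
        System.out.println(runTask(Map.of("he", List.of(1, 1, 1), "a", List.of(1))));
    }
}
```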

Reducer Class
•  Hadoop provides some Reducer implementations
IntSumReducer - sums the values (integers) for a given key 
LongSumReducer - sums the values (longs) for a given key

Reducer Code
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts emitted by the Mappers for this word
        int wordCount = 0;
        for (IntWritable value : values) {
            wordCount += value.get();
        }
        // Write one (word, total) pair for the key
        context.write(key, new IntWritable(wordCount));
    }
}

Processes the input values for the key:
    int wordCount = 0;
    for (IntWritable value : values) {
        wordCount += value.get();
    }

Writes the outputs:
    context.write(key, new IntWritable(wordCount));

Introduction to Map-Reduce

  • 1. t : @brendantierney e : Big Data Analysis using Hadoop! ! Map-Reduce – An Introduction! ! Lecture 2! ! ! Brendan Tierney
  • 2. [from Hadoop in Practice, Alex Holmes] HDFS Architecture
  • 3. t : @brendantierney e : MapReduce •  A batch based, distributed computing framework modelled on Google’s paper on MapReduce [] •  MapReduce decomposes work into small parallelised map and reduce tasks which are scheduled for remote execution on slave nodes •  Terminology •  A job is a full programme •  A task is the execution of a single map or reduce task over a slice of data called a split •  A Mapper is a map task •  A Reducer is a reduce task •  MapReduce works by manipulating key/value pairs in the general format map(key1,value1)➝ list(key2,value2) reduce(key2,list(value2)) ➝ (key3, value3)
  • 4. [from Hadoop in Practice, Alex Holmes] A MapReduce Job
  • 5. [from Hadoop in Practice, Alex Holmes] A MapReduce Job The input is divided into fixed-size pieces called input splits A map task is created for each split
  • 6. [from Hadoop in Practice, Alex Holmes] A MapReduce Job The role of the programmer is to define the Map and Reduce functions
  • 7. [from Hadoop in Practice, Alex Holmes] A MapReduce Job The Shuffle & Sort phases between the Map and the Reduce phases combines map outputs and sorts them for the Reducers...
  • 8. [from Hadoop in Practice, Alex Holmes] A MapReduce Job The Shuffle & Sort phases between the Map and the Reduce phases combines map outputs and sorts them for the Reducers... The Reduce phase merges the data, as defined by the programmer to produce the outputs.
  • 9. t : @brendantierney e : Map •  The Map function •  The Mapper takes as input a key/value pair which represents a logical record from the input data source (e.g. a line in a file) •  It produces zero or more outputs key/value pairs for each input pair •  e.g. a filtering function may only produce output if a certain condition is met •  e.g. a counting function may produce multiple key/value pairs, one per element being counted map(in_key, in_value) ➝ list(temp_key, temp_value)
  • 10. t : @brendantierney e : Reduce •  The Reducer(s) •  A single Reducer handles all the map output for a unique map output key •  A Reducer outputs zero to many key/value pairs •  The output is written to HDFS files, to external DBs, or to any data sink... reduce(temp_key,list(temp_values) ➝ list(out_key, out_value)
  • 11. t : @brendantierney e : MapReduce •  JobTracker - (Master) •  Controls MapReduce jobs •  Assigns Map & Reduce tasks to the other nodes on the cluster •  Monitors the tasks as they are running •  Relaunches failed tasks on other nodes in the cluster •  TaskTracker - (Slave) •  A single TaskTracker per slave node •  Manage the execution of the individual tasks on the node •  Can instantiate many JVMs to handle tasks in parallel •  Communicates back to the JobTracker (via a heartbeat)
  • 12. [from Hadoop in Practice, Alex Holmes]
  • 13. t : @brendantierney e : [from Hadoop the Definitive Guide,Tom White] A MapReduce Job
  • 14. t : @brendantierney e : [from Hadoop the Definitive Guide,Tom White] Monitoring progress
  • 15. t : @brendantierney e : YARN (Yet Another Resource Negotiator) Framework
  • 16. t : @brendantierney e : Data Locality ! “This is a local node for local Data” •  Whenever possible Hadoop will attempt to ensure that a Mapper on a node is working on a block of data stored locally on that node vis HDFS •  If this is not possible, the Mapper will have to transfer the data across the network as it accesses the data •  Once all the Map tasks are finished, the map output data is transferred across the network to the Reducers •  Although Reducers may run on the same node (physical machine) as the Mappers there is no concept of data locality for Reducers
  • 17. t : @brendantierney e : Bottlenecks? •  Reducers cannot start until all Mappers are finished and the output has been transferred to the Reducers and sorted •  To alleviate bottlenecks in Shuffle & Sort - Hadoop starts to transfer data to the Reducers as the Mappers finish •  The percentage of Mappers which should finish before the Reducers start retrieving data is configurable •  To alleviate bottlenecks caused by slow Mappers - Hadoop uses speculative execution •  If a Mapper appears to be running significantly slower than the others, a new instance of the Mapper will be started on another machine, operating on the same data (remember replication) •  The results of the first Mapper to finish will be used •  The Mapper which is still running will be terminated by Hadoop
  • 19. t : @brendantierney e : The MapReduce Job! ! Let us build up an example
  • 20. t : @brendantierney e : The Scenario •  Build a Word Counter •  Using the Shakespeare Poems •  Count the number of times a word appears in the data set •  Use Map-Reduce to do this work •  Step-by-Step of creating the MR process
  • 21. t : @brendantierney e : Driving Class Mapper Reducer
  • 22. t : @brendantierney e : Setting up the MapReduce Job •  A Job object forms the specification for the job •  Job needs to know: •  the jar file that the code is in which will be distributed around the cluster; setJarByClass() •  the input path(s) (in HDFS) for the job; FileInputFormat.addInputPath() •  the output path(s) (in HDFS) for the job; FileOutputFormat.setOutputPath() •  the Mapper and Reducer classes; setMapperClass() setReducerClass() •  the output key and value classes; setOutputKeyClass() setOutputValueClass() •  the Mapper output key and value classes if they are different from the Reducer; setMapOutputKeyClass() setMapOutputValueClass() •  the Mapper output key and value classes •  the name of the job, default is the name of the jar file; setJobName() •  The default input considers the file as lines of text •  The default key input is LongWriteable (the byte offset into the file) •  The default value input is Text (the contents of the line read from the file)
  • 23. t : @brendantierney e : Driver Code import org.apache.hadoop.fs.Path; import; import; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; public class WordCount { public static void main(String[] args) throws Exception { if (args.length != 2) { System.err.println("Usage: WordCount <input path> <output path>"); System.exit(-1); } Job job = Job.getInstance(); job.setJarByClass(WordCount.class); job.setJobName("WordCount"); job.setMapperClass(WordMapper.class); job.setReducerClass(SumReducer.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(IntWritable.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } }
  • 24. t : @brendantierney e : Driver Code import org.apache.hadoop.fs.Path; import; import; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; public class WordCount { public static void main(String[] args) throws Exception { if (args.length != 2) { System.err.println("Usage: WordCount <input path> <output path>"); System.exit(-1); } Job job = Job.getInstance(); job.setJarByClass(WordCount.class); job.setJobName("WordCount"); job.setMapperClass(WordMapper.class); job.setReducerClass(SumReducer.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(IntWritable.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } } You will typically import these classes into every MapReduce job you write. We will omit the import statements in future slides for brevity.
  • 25. t : @brendantierney e : Driver Code public class WordCount { public static void main(String[] args) throws Exception { if (args.length != 2) { System.err.println("Usage: WordCount <input path> <output path>"); System.exit(-1); } Job job = Job.getInstance(); job.setJarByClass(WordCount.class); job.setJobName("WordCount"); job.setMapperClass(WordMapper.class); job.setReducerClass(SumReducer.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(IntWritable.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } }
  • 26. t : @brendantierney e : Driver Code public class WordCount { public static void main(String[] args) throws Exception { if (args.length != 2) { System.err.println("Usage: WordCount <input path> <output path>"); System.exit(-1); } Job job = Job.getInstance(); job.setJarByClass(WordCount.class); job.setJobName("WordCount"); job.setMapperClass(WordMapper.class); job.setReducerClass(SumReducer.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(IntWritable.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } } The main method accepts two command-line arguments: the input and output directories. The first step is to ensure that we have been given two command line arguments. If not, print a help message and exit.
  • 27. t : @brendantierney e : Driver Code public class WordCount { public static void main(String[] args) throws Exception { if (args.length != 2) { System.err.println("Usage: WordCount <input path> <output path>"); System.exit(-1); } Job job = Job.getInstance(); job.setJarByClass(WordCount.class); job.setJobName("WordCount"); job.setMapperClass(WordMapper.class); job.setReducerClass(SumReducer.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(IntWritable.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } } Create a new job, specify the class which will be called to run the job, and give it a Job Name.
  • 28. t : @brendantierney e : Driver Code public class WordCount { public static void main(String[] args) throws Exception { if (args.length != 2) { System.err.println("Usage: WordCount <input path> <output path>"); System.exit(-1); } Job job = Job.getInstance(); job.setJarByClass(WordCount.class); job.setJobName("WordCount"); job.setMapperClass(WordMapper.class); job.setReducerClass(SumReducer.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(IntWritable.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } } Give the Job information about the classes for the Mapper and the reducer
  • 29. t : @brendantierney e : Driver Code public class WordCount { public static void main(String[] args) throws Exception { if (args.length != 2) { System.err.println("Usage: WordCount <input path> <output path>"); System.exit(-1); } Job job = Job.getInstance(); job.setJarByClass(WordCount.class); job.setJobName("WordCount"); job.setMapperClass(WordMapper.class); job.setReducerClass(SumReducer.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(IntWritable.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } } Specify the format of the intermediate output key and value produced by the Mapper
  • 30. t : @brendantierney e : Driver Code public class WordCount { public static void main(String[] args) throws Exception { if (args.length != 2) { System.err.println("Usage: WordCount <input path> <output path>"); System.exit(-1); } Job job = Job.getInstance(); job.setJarByClass(WordCount.class); job.setJobName("WordCount"); job.setMapperClass(WordMapper.class); job.setReducerClass(SumReducer.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(IntWritable.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } } Specify the types for the Reducer output key and value
  • 31. t : @brendantierney e : Driver Code public class WordCount { public static void main(String[] args) throws Exception { if (args.length != 2) { System.err.println("Usage: WordCount <input path> <output path>"); System.exit(-1); } Job job = Job.getInstance(); job.setJarByClass(WordCount.class); job.setJobName("WordCount"); job.setMapperClass(WordMapper.class); job.setReducerClass(SumReducer.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(IntWritable.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } } Specify the input directory (where the data will be read from) and the output directory where the data will be written.
  • 32. t : @brendantierney e : File formats - Inputs •  The default InputFormat (TextInputFormat) will be used unless you specify otherwise •  To use an InputFormat other than the default, use e.g. conf.setInputFormat(KeyValueTextInputFormat.class) •  By default, FileInputFormat.setInputPaths() will read all files from a specified directory and send them to Mappers •  Exceptions: items whose names begin with a period (.) or underscore (_) •  Globs can be specified to restrict input •  For example, /2010/*/01/*
  • 33. t : @brendantierney e : File formats - Outputs •  FileOutputFormat.setOutputPath() specifies the directory to which the Reducers will write their final output •  The driver can also specify the format of the output data •  Default is a plain text file •  Could be explicitly written as conf.setOutputFormat(TextOutputFormat.class);
  • 34. Driver Code: job.waitForCompletion(true) submits the Job and waits for it to complete; the driver then exits with status 0 on success and 1 on failure.
  • 36. Mapper
      •  The Mapper takes as input a key/value pair which represents a logical record from the input data source (e.g. a line in a file)
      •  The Mapper may use or ignore the input key
          •  E.g. a standard pattern is to read a file one line at a time
          •  Key = byte offset into the file where the line starts
          •  Value = contents of the line in the file
          •  Typically the key can be considered irrelevant
      •  It produces zero or more output key/value pairs for each input pair
          •  e.g. a filtering function may only produce output if a certain condition is met
          •  e.g. a counting function may produce multiple key/value pairs, one per element being counted
  • 37. Mapper Class
      •  extends the Mapper<K1, V1, K2, V2> class
      •  key classes implement the WritableComparable interface and value classes implement the Writable interface
      •  most Mappers override the map method, which is called once for every key/value pair in the input
      •  void map(K1 key, V1 value, Context context) throws IOException, InterruptedException
      •  the default map method is the identity mapper: it maps the inputs directly to the outputs
      •  in general, the map input types K1, V1 can differ from the map output types K2, V2
  • 38. Mapper Class
      •  Hadoop provides a number of Mapper implementations:
          •  InverseMapper - swaps the keys and values
          •  TokenCounterMapper - tokenises the input and outputs each token with a count of 1
          •  RegexMapper - extracts text matching a regular expression
      •  Example: job.setMapperClass(TokenCounterMapper.class);
  • 39. Mapper Code

      ...
      public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        // Inputs: (LongWritable, Text); outputs: (Text, IntWritable)

        @Override
        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          // Processes the input text: split the line on runs of non-word characters
          String s = value.toString();
          for (String word : s.split("\\W+")) {
            if (word.length() > 0) {
              // Writes the outputs: one (word, 1) pair per word
              context.write(new Text(word), new IntWritable(1));
            }
          }
        }
      }
  • 40. What the mapper does
      •  Input to the Mapper:
          (“this one I think is called a yink”)
          (“he likes to wink, he likes to drink”)
          (“he likes to drink and drink and drink”)
      •  Output from the Mapper:
          (this, 1) (one, 1) (I, 1) (think, 1) (is, 1) (called, 1) (a, 1) (yink, 1)
          (he, 1) (likes, 1) (to, 1) (wink, 1) (he, 1) (likes, 1) (to, 1) (drink, 1)
          (he, 1) (likes, 1) (to, 1) (drink, 1) (and, 1) (drink, 1) (and, 1) (drink, 1)
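The tokenising step can be tried without a Hadoop cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class name MapSketch and its map method are hypothetical, and java.util's SimpleEntry stands in for the (Text, IntWritable) pairs that context.write would emit.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for WordMapper.map(); no Hadoop classes are used.
class MapSketch {

    // Emits one (word, 1) pair per non-empty token in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        // Split on runs of non-word characters (spaces, commas, ...).
        for (String word : line.split("\\W+")) {
            // A leading separator (or an empty line) yields an empty token; skip it.
            if (word.length() > 0) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> pair : map("he likes to wink, he likes to drink")) {
            System.out.println("(" + pair.getKey() + ", " + pair.getValue() + ")");
        }
    }
}
```

Note that the comma in the input line never reaches the output: it is part of a separator run, which is why the pattern must be the regular expression \W+ rather than the literal text W+.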
  • 41. Shuffle and sort
      •  Shuffle
          •  Integrates the data (key/value pairs) from the outputs of each Mapper
          •  For now, integrates them into one file
      •  Sort
          •  The set of intermediate keys on a single node is automatically sorted by Hadoop before it is presented to the Reducer
          •  Sorted by key; the order of the values within a key is not guaranteed
          •  Determines which subset of the data goes to which Reducer
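When more than one Reducer runs, each key has to be assigned to exactly one of them so that all of its values arrive at the same place. A minimal plain-Java sketch of hash partitioning follows. It uses the same formula as Hadoop's default HashPartitioner, but the class and method names are hypothetical and String.hashCode() stands in for the hash of a Hadoop key type.

```java
// Plain-Java sketch of hash partitioning; no Hadoop classes are used.
class PartitionSketch {

    // Returns the index (0 .. numReducers - 1) of the Reducer that receives this key.
    static int partitionFor(String key, int numReducers) {
        // Masking with Integer.MAX_VALUE clears the sign bit,
        // so the remainder is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        String[] keys = {"he", "likes", "to", "drink", "he"};
        for (String key : keys) {
            System.out.println(key + " -> reducer " + partitionFor(key, 3));
        }
    }
}
```

Because the result depends only on the key, both occurrences of "he" are sent to the same Reducer. With a single Reducer every key maps to partition 0.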
  • 42. Mapper, Shuffle (Group), Sort
      •  Mapper output:
          (this, 1) (one, 1) (I, 1) (think, 1) (is, 1) (called, 1) (a, 1) (yink, 1)
          (he, 1) (likes, 1) (to, 1) (wink, 1) (he, 1) (likes, 1) (to, 1) (drink, 1)
          (he, 1) (likes, 1) (to, 1) (drink, 1) (and, 1) (drink, 1) (and, 1) (drink, 1)
      •  After Shuffle (Group):
          (this, [1]) (one, [1]) (I, [1]) (think, [1]) (called, [1]) (is, [1]) (a, [1]) (yink, [1])
          (he, [1,1,1]) (likes, [1,1,1]) (to, [1,1,1]) (wink, [1]) (drink, [1,1,1,1]) (and, [1,1])
      •  After Sort:
          (a, [1]) (and, [1,1]) (called, [1]) (drink, [1,1,1,1]) (he, [1,1,1]) (I, [1]) (is, [1])
          (likes, [1,1,1]) (one, [1]) (think, [1]) (this, [1]) (to, [1,1,1]) (wink, [1]) (yink, [1])
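The group and sort steps can be imitated in memory with two standard Java maps. This is only an illustrative sketch under that assumption, not how Hadoop implements the shuffle (which merges sorted spill files across nodes); the class ShuffleSortSketch and its methods are hypothetical names.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of Shuffle (Group) and Sort; no Hadoop classes are used.
class ShuffleSortSketch {

    // Shuffle (group): collect the values emitted for each key,
    // keeping keys in the order in which they were first seen.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Sort: order the grouped entries by key (String's natural ordering).
    static Map<String, List<Integer>> sort(Map<String, List<Integer>> grouped) {
        return new TreeMap<>(grouped);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : "he likes to drink and drink and drink".split(" ")) {
            pairs.add(new SimpleEntry<>(word, 1));
        }
        System.out.println(sort(group(pairs)));
    }
}
```

TreeMap uses String's natural ordering, in which uppercase letters precede lowercase ones, so a key such as I would sort ahead of a. The slide lists the keys case-insensitively.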
  • 43. Reducer Class
      •  extends the Reducer<K2, V2, K3, V3> class
      •  key classes implement the WritableComparable interface and value classes implement the Writable interface
      •  void reduce(K2 key, Iterable<V2> values, Context context) throws IOException, InterruptedException
          •  called once for each input key
          •  generates a list of output key/value pairs by iterating over the values associated with the input key
      •  the reduce input types K2, V2 must be the same types as the map output types
      •  the reduce output types K3, V3 can be different from the reduce input types
      •  the default reduce method is the identity reducer: it outputs each input key/value pair directly
      •  context.getConfiguration() - accesses the Configuration for a Job
      •  void setup(Context context) - called once at the beginning of the reduce task
      •  void cleanup(Context context) - called at the end of the task to wrap up any loose ends, e.g. closing files and database connections
      •  Default number of Reducers = 1
  • 44. Reducer Class
      •  Hadoop provides some Reducer implementations:
          •  IntSumReducer - sums the values (integers) for a given key
          •  LongSumReducer - sums the values (longs) for a given key
      •  Example: job.setReducerClass(IntSumReducer.class);
  • 45. Reducer Code

      public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          // Sum the counts that were grouped under this key
          int wordCount = 0;
          for (IntWritable value : values) {
            wordCount += value.get();
          }
          // Emit one (word, total) pair per key
          context.write(key, new IntWritable(wordCount));
        }
      }
  • 46. Reducer Code: in Reducer<Text, IntWritable, Text, IntWritable>, the first two type parameters are the input key and value types and the last two are the output key and value types.
  • 47. Reducer Code: the for loop processes the input values, adding each count for the key to wordCount.
  • 48. Reducer Code: context.write(key, new IntWritable(wordCount)) writes the output pair for the key.
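The summing logic of the Reducer can also be checked in isolation. Below is a minimal plain-Java sketch that mirrors SumReducer.reduce(); the class ReduceSketch is a hypothetical name, Integer stands in for IntWritable, and the returned entry stands in for the pair passed to context.write.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Map;

// Plain-Java stand-in for SumReducer.reduce(); no Hadoop classes are used.
class ReduceSketch {

    // Adds up all the counts grouped under one key and returns (key, total).
    static Map.Entry<String, Integer> reduce(String key, Iterable<Integer> values) {
        int wordCount = 0;
        for (int value : values) {
            wordCount += value;
        }
        return new SimpleEntry<>(key, wordCount);
    }

    public static void main(String[] args) {
        Map.Entry<String, Integer> result = reduce("drink", Arrays.asList(1, 1, 1, 1));
        System.out.println("(" + result.getKey() + ", " + result.getValue() + ")");
    }
}
```

The method is called once per key, so the key passes through unchanged and only the list of values is collapsed into a single total.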