Introduction to MapReduce and Hadoop
Agenda: what will be covered
● History.
● What is Hadoop.
● Hadoop vs SQL.
● MapReduce.
● Hadoop Building Blocks.
● Installing, Configuring and Running Hadoop.
● Anatomy of MapReduce program.
How was Hadoop born?
Doug Cutting
Challenges of Distributed Processing of Large Data
● How to distribute the work?
● How to store and distribute the data itself?
● How to overcome failures?
● How to balance the load?
● How to deal with unstructured data?
● ...
Hadoop tackles these
So, what’s Hadoop?
What is Hadoop?
Hadoop is an open source framework for writing and
running distributed applications that process large
amounts of data.
Key distinctions of Hadoop:
● Accessible
● Robust
● Scalable
● Simple

Hadoop vs SQL
● Structured and Unstructured data.
● Datastore and Data Analysis.
● Scale-out and Scale-up.
● Offline batch processing and online transactions.
Hadoop Uses
What is MapReduce?...
● Parallel programming model for clusters of
commodity machines.
● MapReduce provides:
o Automatic parallelization & distribution.
o Fault tolerance.
o Locality of data.
What is MapReduce?
MapReduce … Map then Reduce

Keys and Values
● Key/Value pairs.
● Keys divide Reduce Space.
         Input              Output
Map      <k1, v1>           list(<k2, v2>)
Reduce   <k2, list(v2)>     list(<k3, v3>)
WordCount in Action
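Input:
foo.txt: “This is the foo file”
bar.txt: “And this is the bar one”
Map output: one <token, 1> pair per word, e.g. <this, 1>, <is, 1>, <the, 1>, <foo, 1>, <file, 1>, ...
Reduce#1: Input: this, [1, 1]   Output: this, 2
Reduce#2: Input: is, [1, 1]     Output: is, 2
Reduce#3: Input: foo, [1]       Output: foo, 1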
Final output:
this 2
is 2
the 2
foo 1
file 1
and 1
bar 1
one 1
WordCount with MapReduce
map(String filename, String document) {
    List<String> T = tokenize(document);
    for each token in T {
        emit ((String)token, (Integer) 1);
    }
}
reduce(String token, List<Integer> values) {
    Integer sum = 0;
    for each value in values {
        sum = sum + value;
    }
    emit ((String)token, (Integer) sum);
}
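The pseudo-code above can be tried out without a cluster. The following is a minimal in-memory sketch in plain Java, not Hadoop code: the class name `WordCountSketch` and its `map`, `shuffle`, `reduce` and `wordCount` helpers are hypothetical names chosen for illustration, and tokenization is assumed to be lowercase splitting on whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration only: no Hadoop classes are used here.
class WordCountSketch {

    // Map phase: emit a (token, 1) pair for every token in the document.
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : document.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all values that share a key, sorted by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the list of counts collected for one token.
    static int reduce(String token, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs map over every document, then shuffle, then reduce per key.
    static Map<String, Integer> wordCount(List<String> documents) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String document : documents) {
            mapped.addAll(map(document));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("This is the foo file", "And this is the bar one")));
    }
}
```

In a real cluster the three phases run on different machines; here they are ordinary method calls so the data flow is easy to follow.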
Hadoop Building Blocks
How does Hadoop work?

Hadoop Building Blocks
1. NameNode
2. DataNode
3. Secondary NameNode
4. JobTracker
5. TaskTracker
HDFS: NameNode and DataNodes
JobTracker and TaskTracker
Typical Hadoop Cluster

Running Hadoop
Three modes to run Hadoop:
1. Local (standalone) mode.
2. Pseudo-distributed mode (“cluster of one”).
3. Fully distributed mode.
An Action
Running Hadoop on Local Machine
Actions ...
1. Installing Hadoop.
2. Configuring Hadoop (Pseudo-distributed mode).
3. Running WordCount example.
4. Web-based cluster UI.
1. HDFS is a filesystem designed for large-scale
distributed data processing.
2. HDFS isn’t a native Unix filesystem.
Basic File Commands:
$ hadoop fs -cmd <args>
$ hadoop fs -ls
$ hadoop fs -mkdir /user/chuck
$ hadoop fs -copyFromLocal

Anatomy of a MapReduce program
MapReduce and beyond
1. Data Types
2. Mapper
3. Reducer
4. Partitioner
5. Combiner
6. Reading and Writing
a. InputFormat
b. OutputFormat
Anatomy of a MapReduce program
Hadoop Data Types
● Certain defined way of serializing key/value pairs.
● Values should implement Writable Interface.
● Keys should implement WritableComparable interface.
● Some predefined classes:
o BooleanWritable.
o ByteWritable.
o IntWritable
o ...
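To make the serialization contract concrete, here is a minimal sketch of a custom key type in plain Java. It is an assumption-laden stand-in: `WordLengthKey` is a hypothetical class that mirrors the `write(DataOutput)` and `readFields(DataInput)` methods of Hadoop's `Writable` and the ordering required of a `WritableComparable`, but it does not implement the real Hadoop interfaces so that it compiles without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical key type; in real Hadoop code it would implement
// WritableComparable<WordLengthKey> instead of plain Comparable.
class WordLengthKey implements Comparable<WordLengthKey> {
    private int length;

    WordLengthKey() {}                          // no-arg constructor, used before readFields
    WordLengthKey(int length) { this.length = length; }

    int get() { return length; }

    // Same shape as Writable.write: serialize the fields to a binary stream.
    void write(DataOutput out) throws IOException {
        out.writeInt(length);
    }

    // Same shape as Writable.readFields: restore the fields from a binary stream.
    void readFields(DataInput in) throws IOException {
        length = in.readInt();
    }

    // Keys must be comparable because the framework sorts them before reduce.
    @Override
    public int compareTo(WordLengthKey other) {
        return Integer.compare(length, other.length);
    }

    // Serializes a key to bytes and reads it back, as happens between map and reduce.
    static WordLengthKey roundTrip(WordLengthKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            WordLengthKey copy = new WordLengthKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Values only need the two serialization methods; keys additionally need `compareTo`, which is why the slide distinguishes Writable from WritableComparable.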

Mapper
1. Extends Mapper<K1,V1,K2,V2>
2. Overrides method:
void map(K1 key, V1 value, Context context)
3. Use context.write(K2, V2) to emit key/value pairs.
WordCount Mapper
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);  // emit <token, 1>
        }
    }
}
Predefined Mappers

Reducer
1. Extends Reducer<K2,V2,K3,V3>
2. Overrides method:
void reduce(K2 key, Iterable<V2> values, Context context)
3. Use context.write(K3, V3) to emit key/value pairs.
WordCount Reducer
public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();  // add up the counts for this token
        }
        context.write(key, new IntWritable(sum));
    }
}
Predefined Reducers

Partitioner
The partitioner decides which key goes where.
class WordSizePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text word, IntWritable count, int numOfPartitions) {
        return 0;  // placeholder: sends every key to partition 0
    }
}
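The WordSizePartitioner above returns 0 for every key, so all keys land in one partition. As a rough sketch of what a non-trivial rule could look like, the plain-Java class below (hypothetical name `PartitionSketch`, no Hadoop dependency) shows two options: the hash arithmetic used by Hadoop's default HashPartitioner, and a word-length rule that is only an assumption about what a "word size" partitioner might do.

```java
// Plain-Java illustration of partitioning rules; Strings stand in for Text keys.
class PartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner: mask off the sign bit
    // so the result is never negative, then take the remainder.
    static int hashPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Hypothetical rule: words of the same length go to the same partition.
    static int wordSizePartition(String word, int numPartitions) {
        return word.length() % numPartitions;
    }
}
```

Whatever rule is chosen, it must return a value between 0 and the number of partitions minus one, and it must send equal keys to the same partition so that one reducer sees all values for a key.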
Combiner
It’s a local Reduce task at the Mapper.
WordCount Mapper Output:
1. Without Combiner: <the, 1>, <file, 1>, <the, 1>, …
2. With Combiner: <the, 2>, <file, 2>, ...
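The effect of a combiner can be sketched in plain Java as a local pre-aggregation of one mapper's output before it is sent over the network. The class `CombinerSketch` and its `combine` method are hypothetical names, and the input is just the three tokens visible on the slide, so `file` ends up with a count of 1 here rather than the 2 shown for the slide's longer, elided output.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of a combiner: sums counts per token on the mapper side.
class CombinerSketch {

    // Each token in the list stands for one <token, 1> pair emitted by a single mapper.
    static Map<String, Integer> combine(List<String> mapperTokens) {
        Map<String, Integer> combined = new LinkedHashMap<>();  // keeps first-seen order
        for (String token : mapperTokens) {
            combined.merge(token, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Three pairs go in; two pairs come out.
        System.out.println(combine(List.of("the", "file", "the")));
    }
}
```

For WordCount the combiner can reuse the reducer logic because addition is associative and commutative; that is not true of every reduce function.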

Reading and Writing
1. Input data usually resides in large files.
2. MapReduce’s processing power comes from splitting the
input data into chunks (InputSplits) that are processed in parallel.
3. Hadoop’s FileSystem provides the class
FSDataInputStream for file reading. It extends
DataInputStream with random read access.
InputFormat Classes
● TextInputFormat
o <offset, line>
● KeyValueTextInputFormat
o key\tvalue => <key, value>
● NLineInputFormat
o <offset, nLines>
You can define your own InputFormat class ...
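To illustrate the record shapes listed above, here is a small plain-Java sketch that imitates how text input is turned into key/value records. It is not the Hadoop implementation: `InputFormatSketch`, `textRecords` and `keyValueRecord` are hypothetical names, the input is assumed to be UTF-8 text with `\n` line endings, and a line without a tab is assumed to become a key with an empty value.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java imitation of the records produced for text input.
class InputFormatSketch {

    // TextInputFormat style: key = byte offset where the line starts, value = the line.
    static List<Map.Entry<Long, String>> textRecords(String content) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            records.add(Map.entry(offset, line));
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;  // +1 for the '\n'
        }
        return records;
    }

    // KeyValueTextInputFormat style: split each line at the first tab character.
    static Map.Entry<String, String> keyValueRecord(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return Map.entry(line, "");  // assumption: no tab means an empty value
        }
        return Map.entry(line.substring(0, tab), line.substring(tab + 1));
    }
}
```

The byte offset key is why the WordCount mapper declares its input key as LongWritable and then ignores it.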
OutputFormat
1. The output has no splits.
2. Each reducer generates an output file named
part-nnnnn, where nnnnn is the partition ID
of the reducer.
Predefined OutputFormat classes:
> TextOutputFormat <k, v> => k\tv


  4. Key distinctions of Hadoop (Hadoop in Action, section 1.2):
     Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2).
     Robust: Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
     Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
     Simple: Hadoop allows users to quickly write efficient parallel code.
  7. Table from “Hadoop In Action” Images source:
  8. Pseudo-code for map and reduce functions for word counting Source: Hadoop In Action
  9. We now have a general overview of MapReduce; let’s see how Hadoop works.
  10. Hadoop In Action Figure 2.1
  11. Local (standalone) mode: no HDFS and no Hadoop daemons. Used for debugging and testing the logic of a MapReduce program.
      Pseudo-distributed mode: all daemons run on a single machine. Used for debugging your code, allowing you to examine memory usage, HDFS input/output issues, and other daemon interactions.
      Fully distributed mode.
  16. When the reducer task receives the output from the various mappers, it sorts the incoming data on the key of the (key/value) pair and groups together all values of the same key.