Java MapReduce Programming on Apache Hadoop
Aaron T. Myers, aka ATM
with thanks to Sandy Ryza
● Software Engineer/Tech Lead for HDFS at
● Committer/PMC Member on the Apache
Hadoop project
● My work focuses primarily on HDFS and
Hadoop security
What is MapReduce?
● A distributed programming paradigm
What is a distributed programming paradigm?
Distributed Systems are Hard
● Monitoring
● RPC protocols, serialization
● Fault tolerance
● Deployment
● Scheduling/Resource Management
Writing Data Parallel Programs
Should Not Be
MapReduce to the Rescue
● You specify map(...) and reduce(...)
○ map = (list(k, v) -> list(k, v))
○ reduce = (k, list(v) -> k, v)
● The framework does the rest
○ Split up the data
○ Run several mappers over the splits
○ Shuffle the data around for the reducers
○ Run several reducers
○ Store the final results
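The division of labor above can be sketched in plain Java, with no Hadoop dependency. This is only an illustration under stated assumptions: the class name `MiniWordCount` and its methods are hypothetical, everything runs in one process, and a sorted in-memory map stands in for the shuffle. The user-supplied parts are `map` and `reduce`; `run` plays the role of the framework.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniWordCount {
    // User code: turn one input line into a list of (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // User code: collapse all values seen for one key into a single count.
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Stand-in for the framework: run the mapper over every record,
    // group the emitted values by key (the shuffle), then reduce each group.
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffled.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "apple apple banana",
                "a happy airplane");
        System.out.println(run(lines));
    }
}
```

A real job differs mainly in that the mappers and reducers run as separate tasks on different machines, so the grouping step moves data over the network instead of through a local map.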
Input Data (also the Map Inputs, one record per line)
apple apple banana
a happy airplane
airplane on the runway
runway apple runway
rumple on the apple
Map Function -> Map Outputs
apple - 1
apple - 1
banana - 1
a - 1
happy - 1
airplane - 1
airplane - 1
on - 1
the - 1
runway - 1
runway - 1
apple - 1
runway - 1
rumple - 1
on - 1
the - 1
apple - 1
Shuffle (values grouped by key)
a - 1
airplane - 1, 1
apple - 1, 1, 1, 1
banana - 1
happy - 1
on - 1, 1
rumple - 1
runway - 1, 1, 1
the - 1, 1
Reduce -> Output
a - 1
airplane - 2
apple - 4
banana - 1
happy - 1
on - 2
rumple - 1
runway - 3
the - 2
What is (Core) Hadoop?
● An open source platform for storing,
processing, and analyzing enormous
amounts of data
● Consists of…
○ A distributed file system (HDFS)
○ An implementation of the Map/Reduce paradigm
(Hadoop MapReduce)
● Written in Java!
What is Hadoop?
● Traditional Operating System
○ File System
● Hadoop (Distributed operating system)
○ Hadoop Distributed File System (HDFS)
HDFS (briefly)
● Distributed file system that runs on all nodes
in the cluster
○ Co-located with Hadoop MapReduce daemons
● Looks like a pretty normal Unix file system
○ hadoop fs -ls /user/atm/
○ hadoop fs -cp /user/atm/data.txt /user/atm/data2.txt
○ hadoop fs -rm /user/atm/data.txt
○ …
● Don’t use the normal Java File API
○ Instead use org.apache.hadoop.fs.FileSystem API
Writing MapReduce programs in Java
● Interface to MapReduce in Hadoop is Java
● WordCount!
Word Count Map Function
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    // Emit (word, 1) for every whitespace-separated token in the line.
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
Word Count Reduce Function
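import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    // Add up every count the mappers emitted for this word.
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}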
Word Count Driver
● TextInputFormat
○ Each line becomes <LongWritable, Text> = <byte
offset in file, whole line>
● KeyValueTextInputFormat
○ Splits lines on delimiter into Text key and Text value
● SequenceFileInputFormat
○ Reads key/value pairs from SequenceFile, a Hadoop file format
● DBInputFormat
○ Uses JDBC to connect to a database
● Many more, or write your own!
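To make the record shapes above concrete, here is a plain-Java sketch that imitates what the two text-based formats hand to a mapper. It does not use Hadoop at all: the class `LineRecords` and its methods are hypothetical, the input is assumed to be UTF-8 text with `\n` line endings, and a tab is assumed as the key/value delimiter.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class LineRecords {
    // TextInputFormat-style records: byte offset of each line -> the whole line.
    public static Map<Long, String> records(String contents) {
        Map<Long, String> out = new LinkedHashMap<>();
        if (contents.isEmpty()) {
            return out;
        }
        long offset = 0;
        for (String line : contents.split("\n")) {
            out.put(offset, line);
            // Advance past the line's bytes plus the one-byte newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    // KeyValueTextInputFormat-style record: split one line at the first tab.
    // A line without a tab becomes a key with an empty value.
    public static String[] keyValue(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        System.out.println(records("apple\nbanana split\ncarrot\n"));
        String[] kv = keyValue("fruit\tapple");
        System.out.println(kv[0] + " => " + kv[1]);
    }
}
```

The byte offset is rarely useful as a key in itself, which is why the word count mapper ignores its `LongWritable` key and works only on the line.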
● Writables
○ Native to Hadoop
○ Implement serialization for higher level structures
● Avro
○ Extensible
○ Cross-language
○ Handles serialization of higher level structures for you
● And others…
○ Parquet, Thrift, etc.
public class MyNumberAndStringWritable implements Writable {
  private int number;
  private String str;
  public void write(DataOutput out) throws IOException {
    out.writeInt(number);
    out.writeUTF(str);
  }
  public void readFields(DataInput in) throws IOException {
    // Read the fields in the same order write() wrote them.
    number = in.readInt();
    str = in.readUTF();
  }
}
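The contract behind a Writable is that `readFields` consumes exactly the bytes `write` produced, in the same order. The sketch below exercises that round trip using only `java.io`, so it runs without Hadoop on the classpath. The class `NumberAndString` is a hypothetical stand-in that mirrors the two methods but does not implement the real `Writable` interface.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class NumberAndString {
    public int number;
    public String str;

    // Same shape as Writable.write: serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeInt(number);
        out.writeUTF(str);
    }

    // Same shape as Writable.readFields: read the fields back in that order.
    public void readFields(DataInput in) throws IOException {
        number = in.readInt();
        str = in.readUTF();
    }

    // Serialize a value to bytes and rebuild a fresh object from them.
    public static NumberAndString roundTrip(int number, String str) {
        try {
            NumberAndString original = new NumberAndString();
            original.number = number;
            original.str = str;

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));

            NumberAndString copy = new NumberAndString();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        NumberAndString copy = roundTrip(42, "apple");
        System.out.println(copy.number + " " + copy.str);
    }
}
```

Swapping the two reads in `readFields` would corrupt both fields, which is the kind of mistake a round-trip unit test catches cheaply.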
protocol MyMapReduceObjects {
  record MyNumberAndString {
    string str;
    int number;
  }
}
Testing MapReduce Programs
● First, write unit tests (duh) with MRUnit
● LocalJobRunner
○ Runs job in single process
● Single-node cluster (Cloudera VM!)
○ Multiple processes on the same machine
● On the real cluster
@Test
public void testMapper() throws IOException {
  MapDriver<LongWritable, Text, Text, IntWritable> mapDriver =
      new MapDriver<LongWritable, Text, Text, IntWritable>(new WordCountMapper());
  String line = "apple banana banana carrot";
  mapDriver.withInput(new LongWritable(0), new Text(line));
  // Expected outputs, in the order the mapper emits them.
  mapDriver.withOutput(new Text("apple"), new IntWritable(1));
  mapDriver.withOutput(new Text("banana"), new IntWritable(1));
  mapDriver.withOutput(new Text("banana"), new IntWritable(1));
  mapDriver.withOutput(new Text("carrot"), new IntWritable(1));
  mapDriver.runTest();
}
@Test
public void testReducer() throws IOException {
  ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver =
      new ReduceDriver<Text, IntWritable, Text, IntWritable>(new WordCountReducer());
  reduceDriver.withInput(new Text("apple"),
      Arrays.asList(new IntWritable(1), new IntWritable(2)));
  // 1 + 2 = 3
  reduceDriver.withOutput(new Text("apple"), new IntWritable(3));
  reduceDriver.runTest();
}
Map-Reduce Framework
Map input records=183
Map output records=183
Map output bytes=533563
Map output materialized bytes=534190
Input split bytes=144
Combine input records=0
Combine output records=0
Reduce input groups=183
Reduce shuffle bytes=0
Reduce input records=183
Reduce output records=183
Spilled Records=366
Shuffled Maps =0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=7
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
File System Counters
FILE: Number of bytes read=1844866
FILE: Number of bytes written=1927344
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
File Input Format Counters
Bytes Read=655137
File Output Format Counters
Bytes Written=537484
if (record.isUgly()) {
  context.getCounter("Ugly Record Counters",
      "Ugly Records").increment(1);
}
Map-Reduce Framework
Map input records=183
Map output records=183
Map output bytes=533563
Map output materialized bytes=534190
Input split bytes=144
Combine input records=0
Combine output records=0
Reduce input groups=183
Reduce shuffle bytes=0
Reduce input records=183
Reduce output records=183
Spilled Records=366
Shuffled Maps =0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=7
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
File System Counters
FILE: Number of bytes read=1844866
FILE: Number of bytes written=1927344
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
File Input Format Counters
Bytes Read=655137
File Output Format Counters
Bytes Written=537484
Ugly Record Counters
Ugly Records=1024
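Each task increments its own copy of a counter, and the job-level totals printed above are the sums across tasks. The plain-Java sketch below models that bookkeeping under stated assumptions: `CounterSketch` is a hypothetical class, not Hadoop's counter implementation, and the merge is done by hand where the framework would do it automatically.

```java
import java.util.Map;
import java.util.TreeMap;

public class CounterSketch {
    // group name -> counter name -> value
    private final Map<String, Map<String, Long>> groups = new TreeMap<>();

    // Add an amount to one named counter, creating it on first use.
    public void increment(String group, String name, long amount) {
        groups.computeIfAbsent(group, g -> new TreeMap<>())
              .merge(name, amount, Long::sum);
    }

    // Current value of a counter, or 0 if it was never incremented.
    public long get(String group, String name) {
        Map<String, Long> counters = groups.get(group);
        if (counters == null || !counters.containsKey(name)) {
            return 0L;
        }
        return counters.get(name);
    }

    // Fold one task's counters into this (job-level) set of counters.
    public void mergeFrom(CounterSketch task) {
        for (Map.Entry<String, Map<String, Long>> g : task.groups.entrySet()) {
            for (Map.Entry<String, Long> c : g.getValue().entrySet()) {
                increment(g.getKey(), c.getKey(), c.getValue());
            }
        }
    }

    public static void main(String[] args) {
        CounterSketch task1 = new CounterSketch();
        CounterSketch task2 = new CounterSketch();
        task1.increment("Ugly Record Counters", "Ugly Records", 3);
        task2.increment("Ugly Record Counters", "Ugly Records", 2);

        CounterSketch job = new CounterSketch();
        job.mergeFrom(task1);
        job.mergeFrom(task2);
        System.out.println(job.get("Ugly Record Counters", "Ugly Records"));
    }
}
```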
Distributed Cache
We need some data and libraries on all the nodes
Distributed Cache
(Diagram: the Distributed Cache makes the files available to each Map or Reduce Task)
Distributed Cache
In our driver:
DistributedCache.addCacheFile(
    new URI("/some/path/to/ourfile.txt"), conf);
In our mapper or reducer:
private Path[] localFiles;

public void setup(Context context) throws IOException,
    InterruptedException {
  Configuration conf = context.getConfiguration();
  // Local paths of the files the framework copied to this node.
  localFiles = DistributedCache.getLocalCacheFiles(conf);
}
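Once `setup` has the local path, a task usually reads the file into memory once and consults it for every record. The sketch below shows that loading step with `java.nio` only, so it runs without Hadoop. It assumes the cached file is UTF-8 text with one entry per line, which the slides do not state, and `CachedLookup` is a hypothetical helper name.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CachedLookup {
    // Read a local, already-distributed file into a set for fast lookups.
    // Blank lines are skipped and surrounding whitespace is trimmed.
    public static Set<String> load(Path localFile) {
        Set<String> entries = new HashSet<>();
        try {
            for (String line : Files.readAllLines(localFile, StandardCharsets.UTF_8)) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    entries.add(trimmed);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the copy the framework would place on the node.
        Path file = Files.createTempFile("ourfile", ".txt");
        Files.write(file, Arrays.asList("the", "on", "a"), StandardCharsets.UTF_8);
        Set<String> stopWords = load(file);
        System.out.println(stopWords.contains("the"));
        Files.delete(file);
    }
}
```

Loading in `setup` rather than in `map` matters because `setup` runs once per task, while `map` runs once per record.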
Built on
● Library on top of MapReduce that makes it
easy to write pipelines of jobs in Java
● Contains capabilities like joins and
aggregation functions to save programmers
from writing these for each job
public class WordCount {
  public static void main(String[] args) throws Exception {
    // An Apache Crunch pipeline that executes as MapReduce jobs.
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);
    PCollection<String> words = lines.parallelDo("my splitter", new DoFn<String, String>() {
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());
    PTable<String, Long> counts = Aggregate.count(words);
    pipeline.writeTextFile(counts, args[1]);
    // Nothing runs until the pipeline is executed.
    pipeline.done();
  }
}
● Machine Learning on Hadoop
○ Collaborative Filtering
○ User and Item based recommenders
○ K-Means, Fuzzy K-Means clustering
○ Dirichlet process clustering
○ Latent Dirichlet Allocation
○ Singular value decomposition
○ Parallel Frequent Pattern mining
○ Complementary Naive Bayes classifier
○ Random forest decision tree based classifier
Non-Java technologies that use
● Hive
○ SQL -> M/R translator, metadata manager
● Pig
○ Scripting DSL -> M/R translator
● Distcp
○ HDFS tool to bulk copy data from one HDFS cluster
to another
● Questions?

