Mastering Map Reduce
Scott Crespo
Path to Success
Map Reduce Refresher
Optimization Strategies
CustomType Example
Applications
What’s Hadoop?
A framework that facilitates data flow through a cluster of servers
What’s Map Reduce?
• A paradigm for analyzing distributed data sets
Raw Data -> (K, V) -> (K, [V1..Vn])
What About Hive And Pig?
Use them whenever possible!
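Data States in Map Reduce (Letter Count)
Input: HelloWorld
Split: Hello | World
Map: (H,1) (E,1) (L,1) (L,1) (O,1) | (W,1) (O,1) (R,1) (L,1) (D,1)
Partition/Shuffle: (H,[1]) (E,[1]) (L,[1,1,1]) (O,[1,1]) (W,[1]) (R,[1]) (D,[1])
Reduce: (H,1) (E,1) (L,3) (O,2) (W,1) (R,1) (D,1)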
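The letter-count data states can be traced in a few lines of plain Python. This is only a single-process sketch of the idea, not Hadoop code; the function names (split_input, map_phase, shuffle, reduce_phase) and the split size of 5 are assumptions made for illustration.

```python
from collections import defaultdict

def split_input(text, size):
    """Cut the raw input into fixed-size splits (one per hypothetical mapper)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def map_phase(chunk):
    """Emit one (letter, 1) pair per character."""
    return [(ch.upper(), 1) for ch in chunk]

def shuffle(pairs):
    """Group all values that share a key, as the partition/shuffle step does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_phase(grouped):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in grouped.items()}

splits = split_input("HelloWorld", 5)
mapped = [pair for chunk in splits for pair in map_phase(chunk)]
grouped = shuffle(mapped)
counts = reduce_phase(grouped)
print(counts)
```

On a real cluster each split would be mapped on a different node and the shuffle would move data over the network; here everything runs in one process so that each intermediate state can be printed and inspected.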
Basic Map Reduce Program Structure
MyMapReduceProgram {
    MyMapperClass extends Mapper {
        map() {
            // map code
        }
    }
    MyReducerClass extends Reducer {
        reduce() {
            // reduce code
        }
    }
    main() {
        // driver code
    }
}
Advanced Optimizations
• Drivers
• CustomTypes
• Setup Methods
• Partitioning
• Combiners
• Chaining
• Fault Tolerance
Generating N-Grams
• N-Gram: a run of n consecutive elements taken from a sequence; the n-grams of a text are all such runs.
Trigram: “The quick brown fox jumps over the lazy dog”
(the quick brown), (quick brown fox), (brown fox jumps),
(fox jumps over), (jumps over the), (over the lazy), (the lazy dog)
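The definition translates directly into a sliding window. The sketch below is a generic illustration in Python; the helper name ngrams is an assumption, not something taken from the deck.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The quick brown fox jumps over the lazy dog"
trigrams = ngrams(sentence.lower().split(), 3)
for trigram in trigrams:
    print(trigram)
```

A sequence of k tokens yields k - n + 1 n-grams, so the nine-word sentence produces seven trigrams, and any input shorter than n produces none.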
Solution Design
NGramCounter {
    NGramMapper {
        map() {
            // Tokenize and sanitize inputs
            // Create NGram (the CustomType!)
            // Output (NGram ngram, Int count)
        }
    }
    NGramCombiner {
        combine() {
            // Sum local NGram counts that share the same key
            // Output (NGram ngram, Int count)
        }
    }
    NGramReducer {
        reduce() {
            // Sum NGram counts that share the same key
            // Output (NGram ngram, Int count)
        }
    }
}
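To see what the combiner buys, the design can be simulated in one Python process. This is a sketch under stated assumptions: the two input splits are made-up sample data, and map_trigrams, combine, and reduce_counts are hypothetical names that stand in for the three stages.

```python
from collections import Counter

def map_trigrams(line):
    """Mapper: emit (trigram, 1) for every three-word window in the line."""
    words = line.lower().split()
    return [((words[i], words[i + 1], words[i + 2]), 1)
            for i in range(len(words) - 2)]

def combine(pairs):
    """Combiner: pre-sum counts locally, before anything leaves the mapper."""
    local = Counter()
    for key, count in pairs:
        local[key] += count
    return list(local.items())

def reduce_counts(all_pairs):
    """Reducer: sum the (already partially summed) counts per key."""
    totals = Counter()
    for key, count in all_pairs:
        totals[key] += count
    return dict(totals)

# Two hypothetical input splits, each handled by its own mapper.
split_a = ["the quick brown fox", "the quick brown dog"]
split_b = ["the quick brown cat"]

combined = []
for split in (split_a, split_b):
    mapped = [pair for line in split for pair in map_trigrams(line)]
    combined.extend(combine(mapped))

totals = reduce_counts(combined)
print(totals[("the", "quick", "brown")])  # prints 3
```

The mappers emit six pairs in total, but only five reach the reducer because the first split collapses its two copies of (the quick brown) locally. The final counts are the same with or without the combiner; only the amount of shuffled data changes.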
Work Flow
• Prototype (Python)
• CustomType (Trigram)
• UnitTests
• Mapper
• Reducer
Quick and Dirty Python
import sys

def test_mapper():
    lines = ["the quick brown fox jumped over the lazy dog", "the quick brown"]
    for line in lines:
        words = line.split()
        length = len(words)
        sys.stdout.write("\nLength of %d \n-------------------\n" % length)
        i = 0
        # slide a three-word window across the line
        while i + 2 < length:
            first = words[i]
            second = words[i + 1]
            third = words[i + 2]
            trigram = "%s %s %s \n" % (first, second, third)
            sys.stdout.write(trigram)
            i += 1

test_mapper()
Output
Length of 9
-------------------
the quick brown
quick brown fox
brown fox jumped
fox jumped over
jumped over the
over the lazy
the lazy dog
Length of 3
-------------------
the quick brown
Custom DataTypes
Custom KeyTypes
Must implement Hadoop's WritableComparable interface
• Writable: The key can be serialized and transmitted across a network
• Comparable: The key can be compared to other keys and combined/sorted for the reduce phase
Methods to provide: write(), readFields(), compareTo(), hashCode(), toString(), equals()
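The contract behind those six methods can be illustrated without Hadoop. The class below is a Python analogue, not the Hadoop API: write and read_fields mirror write() and readFields(), the comparison operators play the role of compareTo(), and the length-prefixed UTF-8 wire format is an assumption chosen for the sketch.

```python
import io
import struct
from functools import total_ordering

@total_ordering
class Trigram:
    """Python stand-in for a WritableComparable key holding three words."""

    def __init__(self, first="", second="", third=""):
        self.words = (first, second, third)

    def write(self, out):
        # Serialize each word as a 4-byte big-endian length plus UTF-8 bytes.
        for word in self.words:
            data = word.encode("utf-8")
            out.write(struct.pack(">I", len(data)))
            out.write(data)

    def read_fields(self, inp):
        # Rebuild the key from the same layout that write() produced.
        words = []
        for _ in range(3):
            (size,) = struct.unpack(">I", inp.read(4))
            words.append(inp.read(size).decode("utf-8"))
        self.words = tuple(words)

    def __eq__(self, other):
        return self.words == other.words

    def __lt__(self, other):
        # Tuples compare word by word, like a field-by-field compareTo().
        return self.words < other.words

    def __hash__(self):
        return hash(self.words)

    def __str__(self):
        return " ".join(self.words)

buffer = io.BytesIO()
original = Trigram("the", "quick", "brown")
original.write(buffer)

buffer.seek(0)
restored = Trigram()
restored.read_fields(buffer)
print(restored, restored == original)
```

The round trip through the byte buffer is the point: a key that is written on one node must deserialize to an object that is equal, hashes the same, and sorts the same on another node.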
public class Trigram implements WritableComparable<Trigram> {
    …
    // Order by the first word, then the second, then the third.
    public int compareTo(Trigram other) {
        int compared = first.compareTo(other.first);
        if (compared != 0) {
            return compared;
        }
        compared = second.compareTo(other.second);
        if (compared != 0) {
            return compared;
        }
        return third.compareTo(other.third);
    }

    public int hashCode() {
        return first.hashCode()*163 + second.hashCode() + third.hashCode();
    }
}
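The hash code matters because it decides which reducer receives a key. The following Python sketch imitates that routing under two labeled assumptions: java_string_hash emulates Java's String.hashCode() as a stand-in for whatever hash the three fields really use, and partition follows the "mask the sign bit, then take the remainder" rule that Hadoop's default hash partitioner is understood to apply.

```python
def java_string_hash(text):
    """Emulate Java's String.hashCode() with 32-bit wraparound (assumed stand-in)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

def trigram_hash(first, second, third):
    """Same formula as the slide: first * 163 + second + third."""
    return (java_string_hash(first) * 163
            + java_string_hash(second)
            + java_string_hash(third)) & 0xFFFFFFFF

def partition(hash_code, num_reducers):
    """Clear the sign bit, then map the hash onto a reducer index."""
    return (hash_code & 0x7FFFFFFF) % num_reducers

a = trigram_hash("the", "quick", "brown")
b = trigram_hash("the", "brown", "quick")
print(partition(a, 4), a == b)
```

Because the second and third words carry the same weight in the formula, swapping them produces the same hash. That is still correct, since equal hashes only mean the two keys land on the same reducer and are told apart there by comparison, but giving each field its own multiplier would spread keys more evenly.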
Map Reduce Program
public static class TrigramMapper
    extends Mapper<Object, Text, Trigram, IntWritable> {
    …
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().toLowerCase();   // lower-case the input line
        line = line.replaceAll("[^a-z\\s]", "");         // keep only letters and whitespace
        String[] words = line.trim().split("\\s+");      // split on runs of whitespace
        int len = words.length;                          // needed for the loop condition
        // Lines with fewer than three words never enter the loop.
        for (int i = 0; i + 2 < len; i++) {
            first.set(words[i]);
            second.set(words[i + 1]);
            third.set(words[i + 2]);
            trigram.set(first, second, third);
            context.write(trigram, one);
        }
    }
}
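The sanitizing step in the mapper is easy to check in isolation. The sketch below assumes the intended character class is [^a-z\s] (the backslashes appear to have been lost in the slide text) and uses a hypothetical helper named tokenize.

```python
import re

def tokenize(line):
    """Lower-case the line, drop everything except letters and whitespace, then split."""
    cleaned = re.sub(r"[^a-z\s]", "", line.lower())
    # str.split() with no argument collapses runs of whitespace,
    # so repeated spaces never produce empty tokens.
    return cleaned.split()

print(tokenize("The quick,  brown fox!"))  # prints ['the', 'quick', 'brown', 'fox']
```

Splitting on a single whitespace character instead would turn the double space into an empty token, which would then show up inside trigrams.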
public static class TrigramReducer
    extends Reducer<Trigram, IntWritable, Trigram, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Trigram key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);   // total count for this trigram
        context.write(key, result);
    }
}
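public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "Trigram Count");
    job.setJarByClass(TrigramCount.class);
    job.setMapperClass(TrigramMapper.class);
    job.setMapOutputKeyClass(Trigram.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setReducerClass(TrigramReducer.class);
    job.setCombinerClass(TrigramReducer.class);   // summing is associative, so the reducer doubles as the combiner
    job.setOutputKeyClass(Trigram.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}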
Speech Recognition
(Trigram1, 90)
(Trigram2, 76)
(Trigram3, 8)
(Trigram4, 1)
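One plausible reading of this slide, offered here as an assumption rather than something the deck states, is that a recognizer prefers the candidate trigram that was counted most often. A minimal Python sketch of turning the four counts into relative frequencies and ranking them:

```python
# Counts taken from the slide; the trigram names are the slide's placeholders.
counts = {"Trigram1": 90, "Trigram2": 76, "Trigram3": 8, "Trigram4": 1}

total = sum(counts.values())
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for trigram, count in ranked:
    # Relative frequency among the listed candidates only.
    print(f"{trigram}: {count / total:.3f}")
```

The frequencies are relative to these four candidates, not to a whole corpus; a real language model would also need smoothing for trigrams that were never observed.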
Other Applications
• Blog Posts
• Stocks
• GIS Coordinates
Any object with multiple attributes!
Text timeStamp;
Text ticker;
Float price;
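The three fields suggest a composite key for stock data. As a rough Python analogue (not Hadoop code), a frozen, ordered dataclass gives such a record equality, hashing, and field-by-field ordering for free. The class name StockTick, the choice to sort by ticker before timestamp, and the sample values are all made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class StockTick:
    # Field order defines sort order: ticker first, then timestamp, then price.
    ticker: str
    time_stamp: str
    price: float

# Hypothetical sample records.
ticks = [
    StockTick("XYZ", "2015-01-02T09:30:00", 10.5),
    StockTick("ABC", "2015-01-02T09:31:00", 20.0),
    StockTick("ABC", "2015-01-02T09:30:00", 19.5),
]

for tick in sorted(ticks):
    print(tick)
```

Sorting groups all records for one ticker together in time order, which is the arrangement a reducer would want when the key is a multi-attribute object.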
Conclusion
Custom DataTypes Can:
• Improve Runtime Performance
• Result in Reusable Code
• Provide a Consistent Interface