Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Mastering Map Reduce
Scott Crespo

Path to Success
Map Reduce Refresher
Optimization Strategies
CustomType Example
Applications

What’s Hadoop?
A framework that facilitates data flow through a cluster of servers

What’s Map Reduce?
 A paradigm for analyzing distributed data sets
Raw Data ( K, [V1..Vn] )(K,V)

What About Hive And Pig?
Use them whenever possible!

Data States in Map Reduce (Letter Count)
HelloWorld
Hello
World
H,1
E,1
L,1
L,1
O,1
W,1
O,1
R,1
L,1
D,1
H,[1]
E,[1]
L,[1,1,1]
O,[1,1]
W,[1]
R,[1]
D,[1]
H,1
E,1
L,3
O,2
W,1
R,1
D,1
Split
Map Partition/Shuffle
Reduce

Basic Map Reduce Program Structure
MyMapReduceProgram {
MyMapperClass extends Mapper {
map() {
// map code
}
}
MyReducerClass extends Reducer {
reduce() {
//reduce code
}
}
main() {
//driver code
}
}

Advanced Optimizations
 Drivers
 CustomTypes
 Setup Methods
 Partitioning
 Combiners
 Chaining
 FaultTolerance

Generating N-Grams
 N-Gram: Set of all n sequential elements in a set.
Trigram: “The quick brown fox jumps over the lazy dog”
(the quick brown), (quick brown fox), (brown fox jumps),
(fox jumps over), (jumps over the), (the lazy dog)

Solution Design
NGramCounter {
NGramMapper {
map() {
//Tokenize and Sanitize Inputs
// Create NGram
// Output (NGram ngram, Int count)
}
}
NGramCombiner {
combine() {
// Sum local NGrams counts that are of the same key
// Output (NGram ngram, Int Count)
}
}
NGramReducer {
reduce() {
// Sum Ngrams counts of the same key
// Output (NGram ngram, Int Count)
}
}
}
CustomType!

Work Flow
 Prototype (Python)
 CustomType (Trigram)
 UnitTests
 Mapper
 Reducer

Prototype
Quick and Dirty Python

Prototype
def test_mapper():
lines = [“the quick brown fox jumped over the lazy dog", "the quick brown”]
for line in lines:
words = line.split()
length = len(words)
sys.stdout.write("nLength of %d n-------------------n" % length)
i = 0
while (i+2 < length):
first = words[i]
second = words [i+1]
third = words[i+2]
trigram = "%s %s %s n" % (first, second, third)
sys.stdout.write(trigram)
i += 1

Output
Length of 9
-------------------
the quick brown
quick brown fox
brown fox jumped
fox jumped over
jumped over the
over the lazy
the lazy dog
Length of 3
-------------------
the quick brown

Custom KeyTypes
Must implement Hadoops WritableComparable interface
 Writable:The key can be serialized and transmitted across a
network
 Comparable:The key can be compared to other keys &
combined/sorted for the reduce phase
write() readFields() compareTo() hashCode()
toString() equals()

Trigram.java
public class Trigram implements WritableComparable<Trigram> {
…
public int compareTo(Trigram other) {
int compared = first.compareTo(other.first);
if (compared != 0) {
return compared;
}
compared = second.compareTo(other.second);
if (compared != 0) {
return compared;
}
return third.compareTo(other.third);
}
public int hashCode() {
return first.hashCode()*163 + second.hashCode() + third.hashCode();
}
}

TrigramMapper
public static class TrigramMapper
extends Mapper<Object, Text, Trigram, IntWritable> {
…
public void map(Object key, Text value, Context context) {
String line = value.toString().toLowerCase(); // create string and lower case
line = line.replaceAll("[^a-zs]",""); // remove bad non-word chars
String[] words = line.split("s"); // split line into list of words
int len = words.length; // need the length for our loop condition
for(int i = 0; i+2 < len; i++) {
if(len <= 1) { continue; } // remove short lines
first.set(words[i]);
second.set(words[i+1]);
third.set(words[i+2]);
trigram.set(first, second, third);
context.write(trigram, one);

TrigramReducer
public static class TrigramReducer
extends Reducer<Trigram, IntWritable, Trigram, IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Trigram key, Iterable<IntWritable> values, Context context ) {
int sum = 0;
for(IntWritable value : values) {
sum += value.get();
}
result.set(sum);
context.write(key, result);
…

Driver
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Trigram Count");
job.setJarByClass(TrigramCount.class);
job.setMapperClass(TrigramMapper.class);
job.setMapOutputKeyClass(Trigram.class);
job.setMapOutputValueClass(IntWritable.class);
job.setReducerClass(TrigramReducer.class);
job.setCombinerClass(TrigramReducer.class);
job.setOutputKeyClass(Trigram.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Speech Recognition
(Trigram1, 90)
(Trigram2, 76)
(Trigram3, 8)
(Trigram4, 1)

Other Applications
 Blog Posts
 Stocks
 GIS Coordinates
Any object with multiple attributes!

Stock
Attributes
Text timeStamp;
Text ticker;
Float price;

Conclusion
Custom DataTypes Can:
 Improve Runtime Performance
 Result in Reusable Code
 Provide a Consistent Interface

ThankYou!
Scott Crespo
scott@orlandods.com

Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

More Related Content

Similar to Mastering Hadoop Map Reduce - Custom Types and Other Optimizations

Similar to Mastering Hadoop Map Reduce - Custom Types and Other Optimizations (20)

Recently uploaded

Recently uploaded (20)

Mastering Hadoop Map Reduce - Custom Types and Other Optimizations