MapReduce
Writing MapReduce with Java
MapReduce
Recap - Why MapReduce?
● Instead of processing Big Data directly
● We breakdown the logic into
○ map()
■ Executed on machines with data
■ Gives out key-value pairs
○ reduce()
■ Gets output of maps grouped by key
■ Grouping is done by MapReduce Framework
■ Can aggregate data
MapReduce
MAP / REDUCE - Why JAVA
Why in Java?
• Java is Hadoop's primary, first-class API
• Behaviour can be customised to a very large extent
MapReduce
MAP / REDUCE - JAVA - Objective
Write a map-reduce job to count unique words
this is a cow
this is a buffalo
there is a hen
3 a
1 buffalo
1 cow
1 hen
3 is
1 there
2 this
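Before diving into the Hadoop classes, the grouping-and-counting that the job performs can be sketched in plain Java (a hypothetical helper for illustration, not part of the actual job):

```java
import java.util.*;

public class WordCountSketch {
    // Count occurrences of each word across the input lines:
    // the same grouping-and-summing the MapReduce job performs.
    static Map<String, Long> count(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("[ \t]+")) {
                counts.merge(word, 1L, Long::sum); // emit (word, 1) and sum per key
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("this is a cow", "this is a buffalo", "there is a hen");
        count(input).forEach((w, c) -> System.out.println(c + " " + w));
    }
}
```

Running this on the three sample lines reproduces the counts shown above.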
MapReduce
MAP / REDUCE - JAVA - Objective
Write a map-reduce job to count unique words in text file
Input file (location in HDFS, in CloudxLab):
/data/mr/wordcount/input/big.txt
MapReduce
MAP / REDUCE - JAVA - Mapper
[Diagram: on a Datanode, the InputFormat turns an HDFS block (an InputSplit) into records; each record is a (key, value) pair passed to one Map() call of the Mapper, and each call emits key-value pairs such as (key1, value1), or nothing at all]
We need to write the code that breaks each input record down into a key-value pair.
MapReduce
MAP / REDUCE - JAVA - Mapper
TextInputFormat
Datanode
this is a cow\nthis is a
buffalo\nthere is a hen
Record
(0, "this is a cow")
(15, "this is a buffalo")
(34, "there is a hen")
InputSplit
MapReduce
MAP / REDUCE - JAVA - Mapper
TextInputFormat
Datanode
this is a cow\nthis is a buffalo\nthere is a hen
Record
(0, "this is a cow")
(15, "this is a buffalo")
(34, "there is a hen")
InputSplit
15 and 34 are the byte offsets
at which each line starts
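The offsets can be reproduced with a quick sketch; note they only match the slide if each line ends with a two-byte \r\n terminator, which is an assumption about how the sample file was saved:

```java
public class OffsetDemo {
    public static void main(String[] args) {
        String[] lines = {"this is a cow", "this is a buffalo", "there is a hen"};
        long offset = 0;
        for (String line : lines) {
            System.out.println("(" + offset + ", \"" + line + "\")");
            offset += line.length() + 2; // +2 assumes a \r\n line terminator
        }
    }
}
```

This prints offsets 0, 15 and 34; with single-byte \n endings they would be 0, 14 and 32 instead.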
MapReduce
public class StubMapper extends Mapper<Object, Text, Text, LongWritable> {
// A class in Java is a complex data type that can also contain methods.
// In other words, a class is a blueprint:
// Person is a class; sandeep is an object of that class.
}
MAP / REDUCE - JAVA - Mapper - Class
MapReduce
public class StubMapper extends Mapper<Object, Text, Text, LongWritable> {
// Our class StubMapper inherits from the parent class Mapper,
// which is provided by the framework
// StubMapper is initialized for each input split
}
MAP / REDUCE - JAVA - Mapper - Extends
MapReduce
public class StubMapper extends Mapper<Object, Text, Text, LongWritable> {
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] words = value.toString().split("[ \t]+");
for(String word:words)
{
context.write(new Text(word), new LongWritable(1));
}
}
}
MAP / REDUCE - JAVA - Mapper - Datatypes
Data types of the input and output keys and values
Data type of the input key.
In our example, it is the byte
offset at which the value starts
The Data type of
input value.
In our case, input
value is each line, i.e.
Text
The Data type of
output key,
We are going to
give key as word,
therefore it is Text
The data type of the output value.
We are going to give the value as
1, therefore it is LongWritable
MapReduce
public class StubMapper extends Mapper<Object, Text, Text, LongWritable> {
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] words = value.toString().split("[ \t]+");
for(String word:words)
{
context.write(new Text(word), new LongWritable(1));
}
}
}
MAP / REDUCE - JAVA - Mapper - method
MapReduce
public class StubMapper extends Mapper<Object, Text, Text, LongWritable> {
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] words = value.toString().split("[ \t]+");
for(String word:words)
{
context.write(new Text(word), new LongWritable(1));
}
}
}
MAP / REDUCE - JAVA - Mapper - method
The input line is split on spaces
or tabs into an array of strings
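A quick standalone check of the regex: `[ \t]+` treats any run of spaces and tabs as a single delimiter:

```java
import java.util.Arrays;

public class SplitDemo {
    public static void main(String[] args) {
        // A run of spaces/tabs acts as one delimiter.
        String[] words = "this\tis  a cow".split("[ \t]+");
        System.out.println(Arrays.toString(words)); // [this, is, a, cow]
    }
}
```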
MapReduce
public class StubMapper extends Mapper<Object, Text, Text, LongWritable> {
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] words = value.toString().split("[ \t]+");
for(String word:words)
{
context.write(new Text(word), new LongWritable(1));
}
}
}
MAP / REDUCE - JAVA - Mapper - method
For each of the words ...
MapReduce
public class StubMapper extends Mapper<Object, Text, Text, LongWritable> {
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] words = value.toString().split("[ \t]+");
for(String word:words)
{
context.write(new Text(word), new LongWritable(1));
}
}
}
MAP / REDUCE - JAVA - Mapper - method
… we give out the
word as key ...
… and numeric 1 as the value.
MapReduce
MAP / REDUCE - JAVA - Writable
What is "new Text(word) "?
Java's usual types for representing numbers and text were not efficient for
serialization, so the MapReduce team designed its own classes, called Writables
Java Hadoop
String Text
long LongWritable
int IntWritable
MapReduce
MAP / REDUCE - JAVA - Writable
What is "new Text(word) "?
Before handing anything over to MapReduce, you need to wrap it in the
corresponding Writable class or create a new one.
new Text(word)
new LongWritable(1)
Wrapping
value.toString()
Unwrapping
MapReduce
public class StubMapper extends Mapper<Object, Text, Text,
LongWritable> {
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] words = value.toString().split("[ \t]+");
for(String word:words)
{
context.write(new Text(word), new LongWritable(1));
}
}
}
MAP / REDUCE - Java - Full Code
Create a Mapper
MapReduce
MAP / REDUCE - Java - Full Code
Take a look at the complete code in the GitHub folder:
https://github.com/singhabhinav/cloudxlab/tree/master/hdpexamples/java/src/com/cloudxlab/wordcount
MapReduce
MAP / REDUCE - Java - Complete Code
The output of Mapper
Record
(0, "this is a cow")
(15, "this is a buffalo")
(34, "there is a hen")
InputSplit
this 1
is 1
a 1
cow 1
this 1
is 1
a 1
buffalo 1
there 1
is 1
a 1
hen 1
StubMapper.map()
StubMapper.map()
StubMapper.map()
MapReduce
MAP / REDUCE - JAVA - Reducer
public class StubReducer extends Reducer<Text, LongWritable, Text,
LongWritable> {
@Override
public void reduce(Text key, Iterable<LongWritable> values, Context context)
throws IOException, InterruptedException {
long sum = 0;
for(LongWritable iw:values)
{
sum += iw.get();
}
context.write(key, new LongWritable(sum));
}
}
Create a Reducer
MapReduce
MAP / REDUCE - JAVA
public class StubDriver {
public static void main(String[] args) throws Exception {
Job job = Job.getInstance();
job.setJarByClass(StubDriver.class);
job.setMapperClass(StubMapper.class);
job.setReducerClass(StubReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
FileInputFormat.addInputPath(job, new Path("/data/mr/wordcount/input/big.txt"));
FileOutputFormat.setOutputPath(job, new Path("javamrout"));
boolean result = job.waitForCompletion(true);
System.exit(result ? 0 : 1);
}
}
Create a Driver
MapReduce
MAP / REDUCE - JAVA
Writing Map-Reduce in Java (Continued)
9. Export jar
10. scp jar to the hadoop server
11. Run it using the following command:
hadoop jar sandeep/training2.jar StubDriver <arguments>
e.g: hadoop jar sandeep/training2.jar StubDriver
/users/root/wordcount/input
/users/root/wordcount/output16/
12. In case external jars are needed, use -libjars
13. Testing: Add all the jars provided
Using external Jars:
$ export LIBJARS=/path/jar1,/path/jar2
$ hadoop jar my-example.jar com.example.MyTool -libjars ${LIBJARS}
MapReduce
MAP / REDUCE - JAVA - Hands-On
## These are the examples of Map-Reduce
git clone https://github.com/<your github login>/bigdata.git
cd cloudxlab/hdpexamples/java
ant jar
To Run wordcount MapReduce, use:
hadoop jar build/jar/hdpexamples.jar
com.cloudxlab.wordcount.StubDriver
MapReduce
MAP / REDUCE - INPUT SPLITS (CONT.)
public abstract class InputSplit {
public abstract long getLength()
public abstract String[] getLocations()
}
• Has length and locations
• Largest gets processed first
• InputFormat creates splits
• Default one is TextInputFormat
• Extend it for custom splits/records
public abstract class InputFormat<K, V> {
List<InputSplit> getSplits(JobContext context);
RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context);
}
MapReduce
MAP / REDUCE - Secondary Sorting
• The key-value pairs generated by Mapper are sorted by key
• Reducer receives the values for each key.
• These values are not sorted.
• To have these sorted, you need to use Secondary Sorting.
[Diagram: Mapper → Partitioning → Sorting (Primary & Secondary) → Grouping → Reducer → HDFS]
MapReduce
MAP / REDUCE - Secondary Sorting
1. Define Sorting:
a. Create a WritableComparable class to use as the key
b. In this class, compare on the primary key first, then the secondary key.
2. Define Grouping
a. Create a grouping class by extending WritableComparator
3. Define Partitioning
a. Extend Partitioner and implement how to partition on the primary key
See the folder “nextword” from “Session 5” project
More: see "Hadoop: The Definitive Guide".
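Stripped of Hadoop's serialization plumbing, the compare logic of such a composite key looks like this plain-Java sketch (the field names are hypothetical; a real key would implement WritableComparable and add write()/readFields()):

```java
import java.util.*;

// Sketch of a composite key: primary word, secondary count.
public class CompositeKey implements Comparable<CompositeKey> {
    final String word;   // primary key: used for partitioning and grouping
    final long count;    // secondary key: orders values within a group

    CompositeKey(String word, long count) { this.word = word; this.count = count; }

    @Override
    public int compareTo(CompositeKey o) {
        int byWord = word.compareTo(o.word);   // primary sort
        if (byWord != 0) return byWord;
        return Long.compare(o.count, count);   // secondary sort: descending count
    }

    @Override
    public String toString() { return word + ":" + count; }

    public static void main(String[] args) {
        List<CompositeKey> keys = new ArrayList<>(List.of(
            new CompositeKey("is", 1), new CompositeKey("a", 2), new CompositeKey("is", 3)));
        Collections.sort(keys);
        System.out.println(keys); // [a:2, is:3, is:1]
    }
}
```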
MapReduce
MAP / REDUCE - DATA FLOW WITH SINGLE REDUCER
Network Transfer
Local Transfer
Node
MapReduce
MAP / REDUCE - MULTIPLE REDUCERS
MapReduce
MAP / REDUCE - PARTITIONER
• Defines the key for partitioning
• Decides which key goes to which reducers
public static class AgePartitioner extends Partitioner<Text, Text> {
public int getPartition(Text gender, Text value, int numReduceTasks) {
if(gender.toString().equals("M"))
return 0;
else
return 1;
}
}
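For reference, when no custom partitioner is set, Hadoop's default HashPartitioner routes keys with logic equivalent to this sketch:

```java
public class HashPartitionDemo {
    // Equivalent of the default HashPartitioner:
    // mask off the sign bit, then take the modulo.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The same key always lands on the same reducer.
        System.out.println(getPartition("cow", 4));
        System.out.println(getPartition("cow", 4));
    }
}
```

A custom partitioner such as AgePartitioner above is registered with job.setPartitionerClass(AgePartitioner.class), together with job.setNumReduceTasks(2) so that both partitions have a reducer.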
MapReduce
MAP / REDUCE - HOW MANY REDUCERS?
• By default, one
• Too many reducers: the shuffling effort is high
• Too few reducers: the computation takes a long time
• Tune it to the total number of reduce slots
MapReduce
MAP / REDUCE - COMBINER FUNCTIONS
• Runs on the same node after Map has finished
• Processes the output of Map
• Helps minimise the data transfer
• Does not replace reducer
• Should be commutative and associative
MapReduce
MAP / REDUCE - COMBINER FUNCTIONS
• Defined Using Reducer Class - Same signature as reducer
• No matter how it is applied, the output should be the same
• Examples: Sum, Min, Max
• max(0, 20, 10, 25) = max(max(0, 20), max(10,25)) = max(20,
25) = 25
• = max(max(0, 10), max(20,25)) = max(10, 25) = 25
• Not: average or mean
• avg(0, 20, 10, 25) = 13.75
• avg(avg(0, 20), avg(10, 25)) = avg(10, 17.5) = 13.75 (matches only because the groups are equal-sized)
• avg(avg(0, 10, 20), avg(25)) = avg(10, 25) = 17.5 (wrong)
• Question: is f(a, b, c, …) = sqrt(a*a + b*b + c*c + …) commutative and associative?
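The max/avg arithmetic can be checked in plain Java: combining partial maxima preserves the result, while averaging unequal-sized partial averages does not:

```java
public class CombinerCheck {
    static double max(double... xs) { double m = xs[0]; for (double x : xs) m = Math.max(m, x); return m; }
    static double avg(double... xs) { double s = 0; for (double x : xs) s += x; return s / xs.length; }

    public static void main(String[] args) {
        // max is associative: grouping does not change the answer.
        System.out.println(max(0, 20, 10, 25) == max(max(0, 20), max(10, 25))); // true
        // avg is not: averaging partial averages of unequal groups is wrong.
        System.out.println(avg(0, 20, 10, 25));            // 13.75
        System.out.println(avg(avg(0, 10, 20), avg(25)));  // 17.5
    }
}
```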
job.setCombinerClass(MaxTemperatureReducer.class);
MapReduce
MAP / REDUCE - Job Chaining
Map1 Reduce1
Job1
Map2 Reduce2
Job2
Method1:
Using our Java Code:
if(job1.waitForCompletion(true))
{
job2.waitForCompletion(true);
}
MapReduce
MAP / REDUCE - Job Chaining
Method2: Using Unix
hadoop jar x.jar Driver1 inputdir outputdir1 && hadoop jar x.jar Driver2 outputdir1 outdir2
Method3: Using Oozie
We will discuss it later.
Method4: Using dependencies
// job2 can't start until job1 completes
job2.addDependingJob(job1);
See the sample project: it chains our previously done wordcount with a new job
to order the words in descending order of counts
MapReduce
1. For running C/C++ code
2. Better than streaming
3. You can run it as follows:
$ bin/hadoop pipes -input inputPath -output outputPath -program path/to/executable
MAP / REDUCE - Pipes
MapReduce
Thank you.
Hadoop & Spark
+1 419 665 3276 (US)
+91 803 959 1464 (IN)
support@knowbigdata.com
Subscribe to our Youtube channel for latest videos -
https://www.youtube.com/channel/UCxugRFe5wETYA7nMH6VGyEA
MapReduce
Writing Map-Reduce in Java
1. Install Eclipse
2. Create a Java project
3. Add Libs: hadoop-mapreduce-client-core.jar,
hadoop-common.jar
4. Change JDK to 7.0
5. Change the Java Compiler settings to 1.6
MAP / REDUCE - JAVA
MapReduce
MAP / REDUCE - JAVA
Checkout & Follow instructions at
https://github.com/girisandeep/mrexamples
MapReduce
MAP / REDUCE - JAVA
public class StubTest {
@Before
public void setUp() {
mapDriver = new MapDriver<Object, Text, Text, LongWritable>();
mapDriver.setMapper(new StubMapper());
reduceDriver = new ReduceDriver<Text, LongWritable, Text, LongWritable>();
reduceDriver.setReducer(new StubReducer());
mapReduceDriver = new MapReduceDriver<Object, Text, Text, LongWritable, Text,
LongWritable>();
mapReduceDriver.setMapper(new StubMapper());
mapReduceDriver.setReducer(new StubReducer());
}
@Test
public void testXYZ() {
….
}
}
14. Create Test Case
MapReduce
MAP / REDUCE - JAVA
@Test
public void testMapReduce() throws IOException {
mapReduceDriver.addInput(new Pair<Object, Text>
("1", new Text("sandeep giri is here")));
mapReduceDriver.addInput(new Pair<Object, Text>
("2", new Text("teach the map and reduce class is fun.")));
List<Pair<Text, LongWritable>> output = mapReduceDriver.run();
for (Pair<Text, LongWritable> p : output) {
System.out.print(p.getFirst() + "-" + p.getSecond());
//assert here
….
}
}
15. Create a test case
MapReduce
Custom Writable
● Objects that are serialized need to implement Writable
● Examples: Text, IntWritable, LongWritable, FloatWritable, BooleanWritable etc.
● You can define your own
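A Writable just serializes its fields to a DataOutput and reads them back from a DataInput; the mechanics can be sketched in plain Java (class and field names are hypothetical; a real Hadoop class would implement org.apache.hadoop.io.Writable with the same write()/readFields() bodies):

```java
import java.io.*;

// Sketch of what a custom Writable boils down to:
// a (word, count) pair serialized field by field.
public class WordCountPair {
    String word;
    long count;

    WordCountPair() { }
    WordCountPair(String word, long count) { this.word = word; this.count = count; }

    void write(DataOutput out) throws IOException {
        out.writeUTF(word);    // fields are written in a fixed order...
        out.writeLong(count);
    }

    void readFields(DataInput in) throws IOException {
        word = in.readUTF();   // ...and read back in the same order
        count = in.readLong();
    }

    static byte[] serialize(WordCountPair p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            p.write(new DataOutputStream(bytes));
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static WordCountPair deserialize(byte[] data) {
        try {
            WordCountPair p = new WordCountPair();
            p.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Round-trip through bytes, as the framework does over the network.
        WordCountPair q = deserialize(serialize(new WordCountPair("cow", 1)));
        System.out.println(q.word + " " + q.count); // cow 1
    }
}
```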
MapReduce
AVAILABLE INPUT SPLITS
• You can directly read files inside your mapper:
• FileSystem fs = FileSystem.get(URI.create(uri), conf);
• Third party - splittable GZip: http://niels.basjes.nl/splittable-gzip
• https://github.com/twitter/hadoop-lzo
Transcript: Details of description part II: Describing images in practice - T...
BookNet Canada
 
What's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptxWhat's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptx
Stephanie Beckett
 
Pigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdfPigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdf
Pigging Solutions
 
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
shanthidl1
 
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Mydbops
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
HackersList
 
Choose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presenceChoose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presence
rajancomputerfbd
 
20240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 202420240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 2024
Matthew Sinclair
 
Research Directions for Cross Reality Interfaces
Research Directions for Cross Reality InterfacesResearch Directions for Cross Reality Interfaces
Research Directions for Cross Reality Interfaces
Mark Billinghurst
 
The Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive ComputingThe Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive Computing
Larry Smarr
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
Emerging Tech
 
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfINDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
jackson110191
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
KAMAL CHOUDHARY
 
How RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptxHow RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptx
SynapseIndia
 
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
Toru Tamaki
 
UiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs ConferenceUiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs Conference
UiPathCommunity
 
Best Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdfBest Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdf
Tatiana Al-Chueyr
 

Recently uploaded (20)

WPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide DeckWPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide Deck
 
The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
 
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
 
Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...
 
What's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptxWhat's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptx
 
Pigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdfPigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdf
 
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
 
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
 
Choose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presenceChoose our Linux Web Hosting for a seamless and successful online presence
Choose our Linux Web Hosting for a seamless and successful online presence
 
20240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 202420240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 2024
 
Research Directions for Cross Reality Interfaces
Research Directions for Cross Reality InterfacesResearch Directions for Cross Reality Interfaces
Research Directions for Cross Reality Interfaces
 
The Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive ComputingThe Rise of Supernetwork Data Intensive Computing
The Rise of Supernetwork Data Intensive Computing
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
 
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdfINDIAN AIR FORCE FIGHTER PLANES LIST.pdf
INDIAN AIR FORCE FIGHTER PLANES LIST.pdf
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
 
How RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptxHow RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptx
 
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
論文紹介:A Systematic Survey of Prompt Engineering on Vision-Language Foundation ...
 
UiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs ConferenceUiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs Conference
 
Best Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdfBest Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdf
 

Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | CloudxLab

  • 2. MapReduce Recap - Why MapReduce? ● Instead of processing Big Data directly ● We break down the logic into ○ map() ■ Executed on the machines where the data resides ■ Emits key-value pairs ○ reduce() ■ Gets the output of the maps, grouped by key ■ Grouping is done by the MapReduce framework ■ Can aggregate data
  • 3. MapReduce MAP / REDUCE - Why JAVA Why in Java? • Primary support - Java is Hadoop's native API • Behaviour can be customised to a very large extent
  • 4. MapReduce MAP / REDUCE - JAVA - Objective Write a map-reduce job to count unique words this is a cow this is a buffalo there is a hen 3 a 1 buffalo 1 cow 1 hen 3 is 1 there 2 this
  • 5. MapReduce MAP / REDUCE - JAVA - Objective Write a map-reduce job to count unique words in text file /data/mr/wordcount/input/big.txt Location in HDFS In CloudxLab Input File
  • 6. MapReduce MAP / REDUCE - JAVA - Mapper InputFormat Datanode HDFS Block1 Record1 (key, value) Record2 Record3 Map() Mapper Map() Map() (key1, value1) (key2, value2) Nothing (key3, value3) InputSplit We need to write the code that breaks each input record down into a key-value pair.
  • 7. MapReduce MAP / REDUCE - JAVA - Mapper TextInputFormat Datanode this is a cow\nthis is a buffalo\nthere is a hen Record (0, "this is a cow") (15, "this is a buffalo") (34, "there is a hen") InputSplit
  • 8. MapReduce MAP / REDUCE - JAVA - Mapper TextInputFormat Datanode this is a cow\nthis is a buffalo\nthere is a hen Record (0, "this is a cow") (15, "this is a buffalo") (34, "there is a hen") InputSplit 34 15 Location where the line starts
  • 9. MapReduce public class StubMapper extends Mapper<Object, Text, Text, LongWritable> { // A class in Java is a complex data type that can also contain methods // In other words, a class is a blueprint // Person is a class; sandeep is an object } MAP / REDUCE - JAVA - Mapper - Class
  • 10. MapReduce public class StubMapper extends Mapper<Object, Text, Text, LongWritable> { // Our class StubMapper inherits from the parent class Mapper // which is provided by the framework // StubMapper is initialized once for each input split } MAP / REDUCE - JAVA - Mapper - Extends
  • 11. MapReduce public class StubMapper extends Mapper<Object, Text, Text, LongWritable> { @Override public void map(Object key, Text value, Context context) throws IOException, InterruptedException { String[] words = value.toString().split("[ \t]+"); for(String word:words) { context.write(new Text(word), new LongWritable(1)); } } } MAP / REDUCE - JAVA - Mapper - Datatypes Data types of the input and output keys and values The data type of the input key. In our example, it is the byte offset at which the value starts The data type of the input value. In our case, the input value is each line, i.e. Text The data type of the output key. We are going to emit the word as the key, therefore it is Text The data type of the output value. We are going to emit 1 as the value, therefore it is LongWritable
  • 12. MapReduce public class StubMapper extends Mapper<Object, Text, Text, LongWritable> { @Override public void map(Object key, Text value, Context context) throws IOException, InterruptedException { String[] words = value.toString().split("[ \t]+"); for(String word:words) { context.write(new Text(word), new LongWritable(1)); } } } MAP / REDUCE - JAVA - Mapper - method
  • 13. MapReduce public class StubMapper extends Mapper<Object, Text, Text, LongWritable> { @Override public void map(Object key, Text value, Context context) throws IOException, InterruptedException { String[] words = value.toString().split("[ \t]+"); for(String word:words) { context.write(new Text(word), new LongWritable(1)); } } } MAP / REDUCE - JAVA - Mapper - method The input line is split by spaces or tabs into an array of strings
  • 14. MapReduce public class StubMapper extends Mapper<Object, Text, Text, LongWritable> { @Override public void map(Object key, Text value, Context context) throws IOException, InterruptedException { String[] words = value.toString().split("[ \t]+"); for(String word:words) { context.write(new Text(word), new LongWritable(1)); } } } MAP / REDUCE - JAVA - Mapper - method For each of the words ...
  • 15. MapReduce public class StubMapper extends Mapper<Object, Text, Text, LongWritable> { @Override public void map(Object key, Text value, Context context) throws IOException, InterruptedException { String[] words = value.toString().split("[ \t]+"); for(String word:words) { context.write(new Text(word), new LongWritable(1)); } } } MAP / REDUCE - JAVA - Mapper - method … we give out the word as the key ... … and the numeric 1 as the value.
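The split call can be tried outside Hadoop. A minimal plain-Java sketch (the `tokenize` helper is hypothetical; it just mirrors the mapper's split step and prints each word with a count of 1):

```java
public class SplitDemo {
    // Hypothetical helper mirroring the mapper's tokenizing step:
    // split a line on runs of spaces or tabs.
    static String[] tokenize(String line) {
        return line.split("[ \t]+");
    }

    public static void main(String[] args) {
        // Emit each word of the line with a count of 1,
        // just as the mapper does with context.write().
        for (String word : tokenize("this is\ta cow")) {
            System.out.println(word + " 1");
        }
    }
}
```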
  • 16. MapReduce MAP / REDUCE - JAVA - Writable What is "new Text(word)"? Java's usual types for representing numbers and text are not efficient to serialize. So the MapReduce team designed its own classes, called Writables String long int Text LongWritable IntWritable Java Hadoop
  • 17. MapReduce MAP / REDUCE - JAVA - Writable What is "new Text(word)"? Before handing anything over to MapReduce, you need to wrap it in the corresponding Writable class, or create a new one. new Text(word) new LongWritable(1) Wrapping value.toString() Unwrapping
  • 18. MapReduce public class StubMapper extends Mapper<Object, Text, Text, LongWritable> { @Override public void map(Object key, Text value, Context context) throws IOException, InterruptedException { String[] words = value.toString().split("[ \t]+"); for(String word:words) { context.write(new Text(word), new LongWritable(1)); } } } MAP / REDUCE - Java - Full Code Create a Mapper
  • 19. MapReduce MAP / REDUCE - Java - Full Code Take a look at the complete code in the github folder: https://github.com/singhabhinav/cloudxlab/tree/master/hdpexamples/java/src/com/cloudxlab/wordcount
  • 20. MapReduce MAP / REDUCE - Java - Complete Code The output of Mapper Record (0, "this is a cow") (15, "this is a buffalo") (34, "there is a hen") InputSplit this 1 is 1 a 1 cow 1 this 1 is 1 a 1 buffalo 1 there 1 is 1 a 1 hen 1 StubMapper.map() StubMapper.map() StubMapper.map()
  • 21. MapReduce MAP / REDUCE - JAVA - Reducer public class StubReducer extends Reducer<Text, LongWritable, Text, LongWritable> { @Override public void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException { long sum = 0; for(LongWritable iw:values) { sum += iw.get(); } context.write(key, new LongWritable(sum)); } } Create a Reducer
  • 22. MapReduce MAP / REDUCE - JAVA public class StubDriver { public static void main(String[] args) throws Exception { Job job = Job.getInstance(); job.setJarByClass(StubDriver.class); job.setMapperClass(StubMapper.class); job.setReducerClass(StubReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(LongWritable.class); FileInputFormat.addInputPath(job, new Path("/data/mr/wordcount/input/big.txt")); FileOutputFormat.setOutputPath(job, new Path("javamrout")); boolean result = job.waitForCompletion(true); System.exit(result ? 0 : 1); } } Create a Driver
  • 23. MapReduce MAP / REDUCE - JAVA - Writing Map-Reduce in Java (Continued) 9. Export the jar 10. scp the jar to the hadoop server 11. Run it using the following command: hadoop jar sandeep/training2.jar StubDriver <arguments> e.g: hadoop jar sandeep/training2.jar StubDriver /users/root/wordcount/input /users/root/wordcount/output16/ 12. If external jars are needed, pass them with -libjars: $ export LIBJARS=/path/jar1,/path/jar2 $ hadoop jar my-example.jar com.example.MyTool -libjars ${LIBJARS} 13. Testing: add all the jars provided
  • 24. MapReduce MAP / REDUCE - JAVA - Hands-ON ## These are the examples of Map-Reduce git clone https://github.com/<your github login>/cloudxlab.git cd cloudxlab/hdpexamples/java ant jar To run the wordcount MapReduce job, use: hadoop jar build/jar/hdpexamples.jar com.cloudxlab.wordcount.StubDriver
  • 25. MapReduce MAP / REDUCE - INPUT SPLITS (CONT.) public abstract class InputSplit { public abstract long getLength(); public abstract String[] getLocations(); } • Has a length and locations • The largest split gets processed first • InputFormat creates the splits • The default one is TextInputFormat • Extend it for custom splits/records public abstract class InputFormat { List<InputSplit> getSplits(JobContext context); RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context); }
  • 26. MapReduce MAP / REDUCE - Secondary Sorting • The key-value pairs generated by the Mapper are sorted by key • The Reducer receives the values for each key • These values are not sorted • To have them sorted, you need to use Secondary Sorting Sorting (Primary & Secondary) Grouping partitioning Reducer Mapper GroupingReducerH D F S
  • 27. MapReduce MAP / REDUCE - Secondary Sorting 1. Define Sorting: a. Create a WritableComparable class to use instead of the plain key b. In this class, compare on the Primary and Secondary keys 2. Define Grouping a. Create a Grouping class by extending WritableComparator 3. Define Partitioning a. Extend Partitioner and implement how to partition on the Primary key See the folder “nextword” from the “Session 5” project More: here and here and here and in “Hadoop: The Definitive Guide”.
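The composite-key idea in step 1 can be sketched in plain Java. This hypothetical `CompositeKey` implements `Comparable` so the sketch is self-contained; a real Hadoop key would implement `WritableComparable` and add `write()`/`readFields()`, but the `compareTo()` logic is the same — compare the primary key first, then use the secondary key as a tie-breaker:

```java
import java.util.Arrays;

// Hypothetical composite key for a (word, count) pair:
// primary sort by word ascending, secondary sort by count descending.
public class CompositeKey implements Comparable<CompositeKey> {
    final String word;   // primary key
    final long count;    // secondary key

    CompositeKey(String word, long count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public int compareTo(CompositeKey other) {
        int cmp = word.compareTo(other.word);      // primary key first
        if (cmp != 0) return cmp;
        return Long.compare(other.count, count);   // tie-break: count, descending
    }

    @Override
    public String toString() {
        return word + ":" + count;
    }

    public static void main(String[] args) {
        CompositeKey[] keys = {
            new CompositeKey("is", 1),
            new CompositeKey("a", 3),
            new CompositeKey("is", 3)
        };
        Arrays.sort(keys);
        System.out.println(Arrays.toString(keys)); // [a:3, is:3, is:1]
    }
}
```

In a real job, the grouping comparator would then group on `word` alone, so one `reduce()` call sees all counts for a word already in descending order.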
  • 28. MapReduce MAP / REDUCE - DATA FLOW WITH SINGLE REDUCER Network Transfer Local Transfer Node
  • 29. MapReduce MAP / REDUCE - MULTIPLE REDUCERS
  • 30. MapReduce MAP / REDUCE - PARTITIONER • Defines the key for partitioning • Decides which key goes to which reducer public static class AgePartitioner extends Partitioner<Text, Text> { public int getPartition(Text gender, Text value, int numReduceTasks) { if(gender.toString().equals("M")) return 0; else return 1; } }
  • 31. MapReduce MAP / REDUCE - HOW MANY REDUCERS? • By default, one • With too many reducers, the shuffling effort is high • With too few reducers, computation takes longer • Tune it to the total number of slots
  • 32. MapReduce MAP / REDUCE - COMBINER FUNCTIONS • Runs on the same node after the Map has finished • Processes the output of the Map • Helps minimise the data transfer • Does not replace the reducer • Should be commutative and associative
  • 33. MapReduce MAP / REDUCE - COMBINER FUNCTIONS • Defined using the Reducer class - same signature as a reducer • No matter in what way it is applied, the output should be the same • Examples: Sum, Min, Max • max(0, 20, 10, 25) = max(max(0, 20), max(10, 25)) = max(20, 25) = 25 • = max(max(0, 10), max(20, 25)) = max(10, 25) = 25 • Not: average or mean • avg(0, 20, 10, 25) = 13.75 • but avg(avg(0, 10, 20), avg(25)) = avg(10, 25) = 17.5 ≠ 13.75 • (avg(avg(0, 20), avg(10, 25)) = avg(10, 17.5) = 13.75 only because the groups happen to be equal-sized) • Quiz: is the function f(a, b, c, …) = sqrt(a*a + b*b + c*c + …) commutative and associative? job.setCombinerClass(MaxTemperatureReducer.class);
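The associativity requirement can be checked outside Hadoop. A plain-Java sketch (the `max`/`avg` helpers are illustrative, not Hadoop APIs) showing why max is safe as a combiner while average is not:

```java
public class CombinerCheck {
    // Illustrative aggregation helpers.
    static long max(long... xs) {
        long m = Long.MIN_VALUE;
        for (long x : xs) m = Math.max(m, x);
        return m;
    }

    static double avg(double... xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    public static void main(String[] args) {
        // max is associative: combining partial maxima gives the overall maximum
        System.out.println(max(0, 20, 10, 25));            // 25
        System.out.println(max(max(0, 20), max(10, 25)));  // 25

        // avg is not: averaging unequal-sized partial averages gives a different result
        System.out.println(avg(0, 20, 10, 25));            // 13.75
        System.out.println(avg(avg(0, 10, 20), avg(25)));  // 17.5
    }
}
```

This is exactly why Hadoop may apply a combiner zero, one, or many times per map output: only functions whose result is unchanged by partial pre-aggregation are safe.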
  • 34. MapReduce MAP / REDUCE - Job Chaining Map1 Reduce1 Job1 Map2 Reduce2 Job2 Method1: Using our Java Code: if(job1.waitForCompletion(true)) { job2.waitForCompletion(true); }
  • 35. MapReduce MAP / REDUCE - Job Chaining Method2: Using Unix hadoop jar x.jar Driver1 inputdir outputdir1 && hadoop jar x.jar Driver2 outputdir1 outdir2 Method3: Using Oozie We will discuss it later. Method4: Using dependencies //job2 can’t start until job1 completes job2.addDependingJob(job1); See this project. In this project we chain our previously done wordcount with new job to order the words in descending order of counts
  • 36. MapReduce 1. For running C/C++ code 2. Better than streaming 3. You can run as following: $ bin/hadoop pipes -input inputPath -output outputPath -program path/to/executable MAP / REDUCE - Pipes
  • 37. MapReduce Thank you. Hadoop & Spark +1 419 665 3276 (US) +91 803 959 1464 (IN) support@knowbigdata.com Subscribe to our Youtube channel for latest videos - https://www.youtube.com/channel/UCxugRFe5wETYA7nMH6VGyEA
  • 38. MapReduce Writing Map-Reduce in Java 1. Install Eclipse 2. Create a Java project 3. Add Libs: hadoop-mapreduce-client-core.jar, hadoop-common.jar 4. Change JDK to 7.0 5. Change the Java Compiler settings to 1.6 MAP / REDUCE - JAVA
  • 39. MapReduce MAP / REDUCE - JAVA Checkout & Follow instructions at https://github.com/girisandeep/mrexamples
  • 40. MapReduce MAP / REDUCE - JAVA public class StubTest { @Before public void setUp() { mapDriver = new MapDriver<Object, Text, Text, LongWritable>(); mapDriver.setMapper(new StubMapper()); reduceDriver = new ReduceDriver<Text, LongWritable, Text, LongWritable>(); reduceDriver.setReducer(new StubReducer()); mapReduceDriver = new MapReduceDriver<Object, Text, Text, LongWritable, Text, LongWritable>(); mapReduceDriver.setMapper(new StubMapper()); mapReduceDriver.setReducer(new StubReducer()); } @Test public void testXYZ() { …. } } 14. Create Test Case
  • 41. MapReduce MAP / REDUCE - JAVA @Test public void testMapReduce() throws IOException { mapReduceDriver.addInput(new Pair<Object, Text> ("1", new Text("sandeep giri is here"))); mapReduceDriver.addInput(new Pair<Object, Text> ("2", new Text("teach the map and reduce class is fun."))); List<Pair<Text, LongWritable>> output = mapReduceDriver.run(); for (Pair<Text, LongWritable> p : output) { System.out.print(p.getFirst() + "-" + p.getSecond()); //assert here …. } } 15. Create a test case
  • 42. MapReduce Custom Writable ● Objects that are serialized need to implement Writable ● Examples: Text, IntWritable, LongWritable, FloatWritable, BooleanWritable etc. (See) ● You can define your own
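A minimal sketch of a custom writable, assuming a hypothetical (word, count) record. To stay self-contained it uses only `java.io`; a real implementation would declare `implements Writable` (or `WritableComparable`) from `org.apache.hadoop.io`, with exactly the same `write()`/`readFields()` pattern:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical custom writable holding a word and its count.
public class WordCountWritable {
    String word;
    long count;

    // Serialize the fields, in a fixed order, to the output stream.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeLong(count);
    }

    // Deserialize the fields in the same order they were written.
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        count = in.readLong();
    }

    public static void main(String[] args) throws IOException {
        WordCountWritable w = new WordCountWritable();
        w.word = "cow";
        w.count = 1;

        // Round-trip: serialize to bytes, then deserialize into a fresh object.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        w.write(new DataOutputStream(bytes));

        WordCountWritable back = new WordCountWritable();
        back.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(back.word + " " + back.count); // cow 1
    }
}
```

The framework calls `write()` when shuffling your objects across the network and `readFields()` when reconstructing them on the reducer side, which is why the two methods must read and write the fields in the same order.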
  • 43. MapReduce AVAILABLE INPUT SPLITS • You can directly read files inside your mapper: • FileSystem fs = FileSystem.get(URI.create(uri), conf); • Third party - splittable GZip: http://niels.basjes.nl/splittable-gzip • https://github.com/twitter/hadoop-lzo Notes