HADOOP
Map- Reduce
Prashant Gupta
Combiner
• A Combiner is also known as a mini-reduce or mapper-side reducer.
• The Combiner receives as input all data emitted by the Mapper
instances on a given node, and its output is then sent to the
Reducers.
• The Combiner runs between the Map class and the Reduce class to
reduce the volume of data transferred between the Map and Reduce
phases.
• Usage of the Combiner is optional.
When ?
• If the reduce function is both commutative and associative, the
Reducer can be reused as the Combiner without any additional code:
job.setCombinerClass(Reduce.class);
• The Combiner should be an instance of the Reducer interface; a
combiner does not have a predefined interface of its own.
• If your Reducer itself cannot be used directly as a Combiner
because it is not commutative or associative, you might still be able
to write a third class to use as a Combiner for your job.
• Note – Hadoop does not guarantee how many times the combiner
function will run for a given map output (it may run zero, one, or
more times), so the job must produce correct results without it.
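A minimal WordCount-style driver sketch wiring the combiner in; the TokenizerMapper and
IntSumReducer class names are assumptions standing in for your own mapper and reducer classes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wordcount");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenizerMapper.class);   // assumed mapper class
    job.setCombinerClass(IntSumReducer.class);   // reducer reused as combiner (sum is commutative and associative)
    job.setReducerClass(IntSumReducer.class);    // assumed reducer class
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}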
Hadoop Reducer is used without a Combiner
Speculative execution
• One problem with the Hadoop system is that by dividing the tasks
across many nodes, it is possible for a few slow nodes to rate-limit
the rest of the program.
• The Hadoop platform schedules redundant copies of the remaining
tasks across several nodes which do not have other work to
perform. This process is known as speculative execution.
• When tasks complete, they announce this fact to the JobTracker.
Whichever copy of a task finishes first becomes the definitive copy,
and the other copies are killed.
• Speculative execution is enabled by default. You can disable
speculative execution for the mappers and reducers through the
following configuration properties (see the sketch below):
• mapred.map.tasks.speculative.execution
• mapred.reduce.tasks.speculative.execution
• There is a hard limit of 10% of slots used for speculation across all
Hadoop jobs. This is not configurable right now; however, there is a
per-job option to cap the ratio of speculated tasks to total tasks:
mapreduce.job.speculative.speculativecap=0.1
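A small, hedged sketch of turning speculative execution off per job from the driver, using the
property names listed above (newer Hadoop 2.x releases expose the same switches as
mapreduce.map.speculative and mapreduce.reduce.speculative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NoSpeculationDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
    Job job = Job.getInstance(conf, "no-speculation-job");
    // ... set mapper, reducer, input and output paths as usual, then submit.
  }
}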
Locating Stragglers
 Hadoop monitors each task's progress using a progress score
between 0 and 1
 If a task's progress score is less than (average – 0.2) and the
task has run for at least 1 minute, it is marked as a straggler
COUNTERS
• Counters are used to determine if, and how often, a
particular event occurred during job execution.
• There are 4 categories of counters in Hadoop:
• File System
• Job
• MapReduce Framework
• Custom counters
Custom Counters
• MapReduce allows you to define your own custom
counters. Custom counters are useful for counting
specific records, such as bad records, because the framework
counts only total records. Custom counters can also be
used to track outliers, such as maximum and minimum
values, and for summations.
Steps to write a custom counter
• Step 1: Define an enum (in the mapper or reducer, wherever it is required):
public static enum MATCH_COUNTER {
Score_above_400,
Score_below_20,
Temp_abv_55;
}
• Step 2: Increment the counter where the event occurs (see the mapper sketch below):
context.getCounter(MATCH_COUNTER.Score_above_400).increment(1);
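A hedged mapper sketch that increments the enum counter defined above; the input layout
(one numeric score per line) and the class name are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ScoreMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  public static enum MATCH_COUNTER { Score_above_400, Score_below_20 }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    int score = Integer.parseInt(value.toString().trim());   // assumes one score per line
    if (score > 400) {
      context.getCounter(MATCH_COUNTER.Score_above_400).increment(1);
    } else if (score < 20) {
      context.getCounter(MATCH_COUNTER.Score_below_20).increment(1);
    }
    context.write(new Text("score"), new IntWritable(score));
  }
}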
Data Types
• Hadoop MapReduce uses typed data at all times when it
interacts with user-provided Mappers and Reducers.
• In WordCount, you have already seen LongWritable,
IntWritable and Text. It is fairly easy to understand the
relation between them and Java's primitive types:
LongWritable is equivalent to long, IntWritable to int and
Text to String.
Hadoop writable classes (data types) vs Java data types
Java         Hadoop
byte         ByteWritable
int          IntWritable / VIntWritable
float        FloatWritable
long         LongWritable / VLongWritable
double       DoubleWritable
String       Text
(no value)   NullWritable
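A tiny sketch showing how the Java types in the table wrap and unwrap into their Writable
counterparts:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
  public static void main(String[] args) {
    IntWritable count = new IntWritable(42);      // wraps an int
    LongWritable offset = new LongWritable(7L);   // wraps a long
    Text word = new Text("hadoop");               // wraps a String (stored as UTF-8)

    int c = count.get();                          // unwrap back to Java primitives
    long o = offset.get();
    String w = word.toString();
    System.out.println(c + " " + o + " " + w);
  }
}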
• What is a Writable in Hadoop?
• Why does Hadoop use Writable(s)?
• Limitation of primitive Hadoop Writable classes
• Custom Writable
Writable in Hadoop
• It is fairly easy to understand the relation between these types and
Java's primitives: LongWritable is equivalent to long, IntWritable to int
and Text to String.
• Writable is an interface in Hadoop, and the types used in Hadoop must
implement this interface. Hadoop provides these Writable wrappers for
almost all Java primitive types and for some other types.
• Implementing the Writable interface requires two methods:
public interface Writable {
void write(DataOutput out) throws IOException;
void readFields(DataInput in) throws IOException;
}
Why does Hadoop use
Writable(s)
• As we already know, data needs to be transmitted between different
nodes in a distributed computing environment.
• This requires serialization and deserialization of data, to convert
structured data into a byte stream and vice-versa.
• Hadoop therefore uses a simple and efficient serialization protocol to
serialize data between the map and reduce phases; the types involved
are called Writable(s).
WritableComparable
• The WritableComparable interface is a subinterface of the Writable and
java.lang.Comparable interfaces.
• To implement WritableComparable we must provide a compareTo
method in addition to the readFields and write methods.
• Comparison of types is crucial for MapReduce, where there is a
sorting phase during which keys are compared with one another.
public interface WritableComparable<T> extends Writable, Comparable<T> {
}
• In effect, an implementing class must provide:
void write(DataOutput out) throws IOException;     // from Writable
void readFields(DataInput in) throws IOException;  // from Writable
int compareTo(T o);                                // from Comparable
• WritableComparables can be compared to each other, typically via
Comparators. Any type which is to be used as a key in the Hadoop
Map-Reduce framework should implement this interface.
• Any type which is to be used as a value in the Hadoop Map-Reduce
framework should implement the Writable interface.
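A hedged sketch of a custom key type implementing WritableComparable; the two-field
(year, temperature) layout is only an illustration. In practice you would also override
hashCode() and equals() so the default partitioner groups keys correctly.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearTempKey implements WritableComparable<YearTempKey> {
  private int year;
  private int temperature;

  public void write(DataOutput out) throws IOException {
    out.writeInt(year);
    out.writeInt(temperature);
  }

  public void readFields(DataInput in) throws IOException {
    year = in.readInt();
    temperature = in.readInt();
  }

  public int compareTo(YearTempKey other) {
    int cmp = Integer.compare(year, other.year);
    return (cmp != 0) ? cmp : Integer.compare(temperature, other.temperature);
  }
}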
Limitation of primitive
Hadoop Writable classes
• The built-in Writable classes are fine for simple applications such as
WordCount, but they clearly cannot serve every purpose.
• If you still want to use only the primitive Hadoop Writable(s) for a
composite value, you would have to convert the value into a string and
transmit it, which gets very messy once string manipulation is involved;
writing a custom Writable (see the sketch below) avoids this.
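A minimal custom Writable sketch (adapted from the example in the speaker notes); the
counter and timestamp fields are illustrative.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class MyWritable implements Writable {
  private int counter;     // illustrative fields
  private long timestamp;

  public void write(DataOutput out) throws IOException {
    out.writeInt(counter);
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException {
    counter = in.readInt();      // fields must be read in the same order they were written
    timestamp = in.readLong();
  }
}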
INPUT Format
• The InputFormat class is one of the fundamental classes in the
Hadoop Map Reduce framework. This class is responsible for
defining two main things:
 Data splits
 Record reader
• A data split is a fundamental concept in the Hadoop MapReduce
framework: it defines both the size of an individual Map task and its
potential execution server.
• The RecordReader is responsible for actually reading records from
the input file and presenting them (as key/value pairs) to the
mapper.
• public abstract class InputFormat<K, V> {
public abstract List<InputSplit> getSplits(JobContext context)
throws IOException, InterruptedException;
public abstract RecordReader<K, V>
createRecordReader(InputSplit split, TaskAttemptContext
context) throws IOException, InterruptedException;
}
MultipleInputs
• The MultipleInputs class supports MapReduce jobs that have multiple
input paths, with a different InputFormat and Mapper for each path.
• In other words, MultipleInputs is the feature that lets a single
MapReduce job consume different input formats.
Multiple Input Files
• Step 1: Add the configuration in the driver class:
MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, MyMapper1.class);
MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, MyMapper2.class);
• Step 2: Write a different Mapper for each file path (a full driver sketch follows):
class MyMapper1 extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
}
class MyMapper2 extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
}
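A hedged driver sketch putting the two steps together; MyMapper1, MyMapper2 and
JoinReducer are placeholder class names, and the output types are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultipleInputsDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "multiple-inputs");
    job.setJarByClass(MultipleInputsDriver.class);

    // Each input path gets its own InputFormat and Mapper.
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, MyMapper1.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, MyMapper2.class);

    job.setReducerClass(JoinReducer.class);   // placeholder reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}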
MultipleOutputFormat
• FileOutputFormat and its subclasses generate a set of
files in the output directory.
• There is one file per reducer, and files are named by the
partition number: part-00000, part-00001, etc.
• There is sometimes a need to have more control over
the naming of the files or to produce multiple files per
reducer.
• Step 1: Register the named output in the driver:
MultipleOutputs.addNamedOutput(job, "NAMED_OUTPUT",
TextOutputFormat.class, Text.class, DoubleWritable.class);
• Step 2: Override the setup() method in the reducer class and create a
MultipleOutputs instance:
public void setup(Context context) throws IOException,
InterruptedException {
mos = new MultipleOutputs<Text, DoubleWritable>(context);
}
• Step 3: Use the MultipleOutputs instance in the reduce() method to write
data to the named output (a full reducer sketch follows):
mos.write("NAMED_OUTPUT", outputKey, outputValue);
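A hedged reducer sketch combining the three steps; the "NAMED_OUTPUT" name and the
Text/DoubleWritable types mirror the snippets above, and closing the MultipleOutputs
instance in cleanup() is needed so the extra output files are flushed.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class NamedOutputReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
  private MultipleOutputs<Text, DoubleWritable> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, DoubleWritable>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
      throws IOException, InterruptedException {
    double sum = 0;                        // illustrative aggregation
    for (DoubleWritable v : values) {
      sum += v.get();
    }
    mos.write("NAMED_OUTPUT", key, new DoubleWritable(sum));
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();                           // flush and close the named outputs
  }
}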
DISTRIBUTED CACHE
• When writing MapReduce applications, you sometimes want certain
files to be shared across all nodes in the Hadoop cluster; these can be
simple properties files or executable JAR files.
• The Distributed Cache is configured through the job configuration and
provides read-only data to every machine in the cluster.
• The framework copies the necessary files to a slave node before any
task of the job is executed on that node.
• Step 1: Put the file into HDFS:
# hdfs dfs -put /rakesh/someFolder /user/rakesh/cachefile1
• Step 2: Add the cache file to the job configuration:
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/user/rakesh/cachefile1"),
job.getConfiguration());
• Step 3: Access the cached file (for example in a Mapper's setup()):
Path[] cacheFiles = context.getLocalCacheFiles();
FileInputStream fileStream = new
FileInputStream(cacheFiles[0].toString());
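On Hadoop 2.x the same steps can be done with the newer Job API, where job.addCacheFile(...)
replaces the deprecated DistributedCache calls; a hedged sketch, reusing the path from the
example above:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wordcount");
    job.addCacheFile(new URI("/user/rakesh/cachefile1"));   // register the HDFS file for distribution
    // ... set mapper, reducer, input and output paths as usual, then submit.
    // Inside a task, context.getCacheFiles() returns the registered URIs.
  }
}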
MapReduce 1.0 vs MapReduce 2.0
• One easy way to differentiate between the Hadoop old API and the
new API is the package name.
• Old API package: org.apache.hadoop.mapred
• New API package: org.apache.hadoop.mapreduce
Joins
• Joins are one of the interesting features available in MapReduce.
• When processing large data sets, the ability to join data by a
common key is very useful.
• By joining data you can gain further insight, for example joining with
timestamps to correlate events with a time of day.
• MapReduce can perform joins between very large datasets. The
implementation of a join depends on how large the datasets are and
how they are partitioned. If the join is performed by the mapper, it is
called a map-side join, whereas if it is performed by the reducer it is
called a reduce-side join.
Map-Side Join
• A map-side join between large inputs works by performing the join
before the data reaches the map function.
• For this to work, though, the inputs to each map must be partitioned
and sorted in a particular way.
• Each input data set must be divided into the same number of
partitions, and it must be sorted by the same key (the join key) in
each source.
• All the records for a particular key must reside in the same partition.
This may sound like a strict requirement (and it is), but it actually fits
the description of the output of a MapReduce job.
Reduce side Join
• Reduce-side joins are simpler than map-side joins because the
input datasets do not need to be structured in any particular way,
but they are less efficient as both datasets must go through the
MapReduce shuffle phase. Records with the same key are brought
together in the reducer, and the secondary-sort technique can be
used to control the order of the records.
• How is it done?
The map output key for each dataset being joined has to be the join
key, so that matching records reach the same reducer.
• Each dataset has to be tagged with its identity in the mapper, to
help differentiate between the datasets in the reducer so they can
be processed accordingly (see the sketch after this list).
• In each reducer, the values from both datasets for the keys
assigned to that reducer are available and can be processed as
required.
• A secondary sort may be needed to ensure the ordering of the
values sent to the reducer.
• If the input files are of different formats, we would need
separate mappers, and we would need to use
MultipleInputs class in the driver to add the inputs and
associate the specific mapper to the same.
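A hedged sketch of the tagging idea: each mapper prefixes its value with a dataset tag
("U" for a users file, "L" for a log file, both assumptions about the input layout) so the
reducer can separate the two sources for each join key.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class UserMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");        // assumed layout: userId,name
    context.write(new Text(fields[0]), new Text("U" + fields[1]));
  }
}

class LogMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");        // assumed layout: userId,activity
    context.write(new Text(fields[0]), new Text("L" + fields[1]));
  }
}

class JoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> names = new ArrayList<String>();
    List<String> activities = new ArrayList<String>();
    for (Text v : values) {                               // separate the two datasets by tag
      String s = v.toString();
      if (s.startsWith("U")) names.add(s.substring(1));
      else activities.add(s.substring(1));
    }
    for (String name : names) {                           // emit the joined records
      for (String activity : activities) {
        context.write(key, new Text(name + "\t" + activity));
      }
    }
  }
}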
Improving MapReduce
Performance
• Use Compression technique (LZO,GZIP,Snappy….)
• Tune the number of map and reduce tasks appropriately
• Write a Combiner
• Use the most appropriate and compact Writable type for your data
• Reuse Writables
• Reference: http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
Yet Another Resource Negotiator
(YARN)
• YARN (Yet Another Resource Negotiator) is the resource
management layer of the Apache Hadoop ecosystem.
In a YARN cluster, there are two types of hosts:
• The ResourceManager is the master daemon that communicates
with the client, tracks resources on the cluster, and orchestrates
work by assigning tasks to NodeManagers.
• A NodeManager is a worker daemon that launches and tracks
processes spawned on worker hosts.
• Containers are an important YARN concept. You can think of a
container as a request to hold resources on the YARN cluster.
• Use of a YARN cluster begins with a request from a client consisting
of an application. The ResourceManager negotiates the necessary
resources for a container and launches an ApplicationMaster to
represent the submitted application.
• Using a resource-request protocol, the ApplicationMaster negotiates
resource containers for the application at each node. Upon
execution of the application, the ApplicationMaster monitors the
container until completion. When the application is complete, the
ApplicationMaster unregisters its container with the
ResourceManager, and the cycle is complete.
Thank You
Editor's Notes
  1. The Combiner is a "mini-reduce" process which operates only on data generated by one machine.
  2. Commutativity: a*b = b*a, e.g. 3 + 4 = 4 + 3 or 2 × 5 = 5 × 2. Associativity: (x ∗ y) ∗ z = x ∗ (y ∗ z), e.g. (2+3) + 4 = 2+(3+4).
  3. When a MapReduce job is run on a large dataset, the Hadoop Mapper generates large chunks of intermediate data that are passed on to the Hadoop Reducer for further processing, which leads to massive network congestion. To reduce this network congestion, the MapReduce framework offers the 'Combiner'.
  4. In MapReduce a job is broken into several tasks which execute in parallel. This model of execution is sensitive to slow tasks (even if they are very few in number) as they slow down the overall execution of a job. Therefore, Hadoop detects such slow tasks and runs duplicate (backup) tasks for them. This is called speculative execution. Speculating more tasks can help jobs finish faster - but can also waste CPU cycles. Conversely, speculating fewer tasks can save CPU cycles - but cause jobs to finish slower. The options documented here allow the users to control the aggressiveness of the speculation algorithms and choose the right balance between efficiency and latency.
  5. The FILE_BYTES_WRITTEN counter is incremented for each byte written to the local file system. These writes occur during the map phase when the mappers write their intermediate results to the local file system. They also occur during the shuffle phase when the reducers spill intermediate results to their local disks while sorting. The off-the-shelf Hadoop counters that correspond to MAPRFS_BYTES_READ and MAPRFS_BYTES_WRITTEN are HDFS_BYTES_READ and HDFS_BYTES_WRITTEN. The amount of data read and written will depend on the compression algorithm you use, if any.
  6. The table above describes the counters that apply to Hadoop jobs. The DATA_LOCAL_MAPS indicates how many map tasks executed on local file systems. Optimally, all the map tasks will execute on local data to exploit locality of reference, but this isn’t always possible. The FALLOW_SLOTS_MILLIS_MAPS indicates how much time map tasks wait in the queue after the slots are reserved but before the map tasks execute. A high number indicates a possible mismatch between the number of slots configured for a task tracker and how many resources are actually available. The SLOTS_MILLIS_* counters show how much time in milliseconds expired for the tasks. This value indicates wall clock time for the map and reduce tasks. The TOTAL_LAUNCHED_MAPS counter defines how many map tasks were launched for the job, including failed tasks. Optimally, this number is the same as the number of splits for the job.
  7. The COMBINE_* counters show how many records were read and written by the optional combiner. If you don’t specify a combiner, these counters will be 0. The CPU statistics are gathered from /proc/cpuinfo and indicate how much total time was spent executing map and reduce tasks for a job. The garbage collection counter is reported from GarbageCollectorMXBean.getCollectionTime(). The MAP*RECORDS are incremented for every successful record read and written by the mappers. Records that the map tasks failed to read or write are not included in these counters. The PHYSICAL_MEMORY_BYTES statistics are gathered from /proc/meminfo and indicate how much RAM (not including swap space) was consumed by all the tasks.
  8. All the counters, whether custom or framework, are stored in the JobTracker JVM memory, so there’s a practical limit to the number of counters you should use. The rule of thumb is to use less than 100, but this will vary based on physical memory capacity.
  9. Serialization: it is the mechanism of writing the state of an object into a byte stream. A Java object is serializable if its class or any of its superclasses implements either the java.io.Serializable interface or its subinterface. More technically, to serialize an object means to convert its state to a byte stream so that the byte stream can be reverted back into a copy of the object. The reverse operation of serialization is called deserialization.
  10. Objects which can be marshaled to or from files and across the network must obey a particular interface, called Writable, which allows Hadoop to read and write the data in a serialized form for transmission. Hadoop provides several stock classes which implement Writable: Text (which stores String data), IntWritable, LongWritable, FloatWritable, BooleanWritable, and several others. The entire list is in the org.apache.hadoop.io package of the Hadoop source (see the API reference - http://hadoop.apache.org/docs/current/api/index.html).
  11. Custom Writable:
public class MyWritable implements Writable {
  // Some data
  private int counter;
  private long timestamp;
  public void write(DataOutput out) throws IOException {
    out.writeInt(counter);
    out.writeLong(timestamp);
  }
  public void readFields(DataInput in) throws IOException {
    counter = in.readInt();
    timestamp = in.readLong();
  }
  public static MyWritable read(DataInput in) throws IOException {
    MyWritable w = new MyWritable();
    w.readFields(in);
    return w;
  }
}
  12. public interface Comparable{ public int compareTo(Object obj); }
  13. WritableComparables can be compared to each other, typically via Comparators. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.
  14. Any split implementation extends the Apache base abstract class - InputSplit, defining a split length and locations. A split length is the size of the split data (in bytes), while locations is the list of node names where the data for the split would be local. Split locations are a way for a scheduler to decide on which particular machine to execute this split. A very simple job tracker works as follows: receive a heartbeat from one of the task trackers, reporting map slot availability; find a queued-up split for which the available node is "local"; submit the split to the task tracker for execution. Locality can mean different things depending on storage mechanisms and the overall execution strategy. In the case of HDFS, for example, a split typically corresponds to a physical data block size and locations is a set of machines (with the set size defined by a replication factor) where this block is physically located. This is how FileInputFormat calculates splits.
  15. HIPI is a framework for image processing of image files with MapReduce.
  16. Code example : http://www.lichun.cc/blog/2012/05/hadoop-multipleinputs-usage/
  17. MultipleInputs.addInputPath(job,new Path(args[0]),TextInputFormat.class,CounterMapper.class); MultipleInputs.addInputPath(job,new Path(args[1]),TextInputFormat.class,CountertwoMapper.class);
  18. Its efficiency stems from the fact that the files are only copied once per job, and from the ability to cache archives which are un-archived on the slaves. How big is the DistributedCache? The local.cache.size parameter controls the size of the DistributedCache. By default, it's set to 10 GB. Where does the DistributedCache store data? /tmp/hadoop-<user.name>/mapred/local/taskTracker/archive
  19. If both datasets are too large for either to be copied to each node in the cluster, we can still join them using MapReduce with a map-side or reduce-side join, depending on how the data is structured. One common example of this case is a user database and a log of some user activity (such as access logs). For a popular service, it is not feasible to distribute the user database (or the logs) to all the MapReduce nodes. Before diving into the implementation let us understand the problem thoroughly.
  20. A map-side join can be used to join the outputs of several jobs that had the same number of reducers, the same keys, and output files that are not splittable, which means the output files should not be bigger than the HDFS block size. Using the org.apache.hadoop.mapred.join.CompositeInputFormat class we can achieve this. If we have two datasets, for example, one dataset having user ids and names and the other having the user activity over the application, then in order to find out which users have performed what activity on the application we might need to join these two datasets such that the user names and the user activity are joined together. The join strategy can be chosen based on the dataset size: if one dataset is small enough to be distributed across the cluster then we can use the side data distribution technique.
  21. Almost every Hadoop job that generates a non-negligible amount of map output will benefit from intermediate data compression with LZO. Although LZO adds a little bit of CPU overhead, the reduced amount of disk IO during the shuffle will usually save time overall. Whenever a job needs to output a significant amount of data, LZO compression can also increase performance on the output side. Since writes are replicated 3x by default, each GB of output data you save will save 3GB of disk writes. In order to enable LZO compression, check out our recent guest blog from Twitter. Be sure to set mapred.compress.map.output to true.
  22. The YARN configuration file is an XML file that contains properties. This file is placed in a well-known location on each host in the cluster and is used to configure the ResourceManager and NodeManager. By default, this file is named yarn-site.xml. The basic properties in this file used to configure YARN are covered in the later sections.
  23. Conclusion. Summarizing the important concepts presented in this section: A cluster is made up of two or more hosts connected by an internal high-speed network. Master hosts are a small number of hosts reserved to control the rest of the cluster. Worker hosts are the non-master hosts in the cluster. In a cluster with YARN running, the master process is called the ResourceManager and the worker processes are called NodeManagers. The configuration file for YARN is named yarn-site.xml. There is a copy on each host in the cluster, and it is required by the ResourceManager and NodeManager to run properly. YARN keeps track of two resources on the cluster, vcores and memory. The NodeManager on each host keeps track of the local host's resources, and the ResourceManager keeps track of the cluster's total. A container in YARN holds resources on the cluster. YARN determines where there is room on a host in the cluster for the resources the container requests. Once the container is allocated, those resources are usable by the container. An application in YARN comprises three parts: the application client, which is how a program is run on the cluster; an ApplicationMaster, which provides YARN with the ability to perform allocation on behalf of the application; and one or more tasks that do the actual work (each runs in a process) in the container allocated by YARN. A MapReduce application consists of map tasks and reduce tasks. A MapReduce application running in a YARN cluster looks very much like the MapReduce application paradigm, but with the addition of an ApplicationMaster as a YARN requirement.