MapReduce Script
K. Haripritha
II M.Sc. (IT)
Big Data Analysis
Nadar Saraswathi College of Arts and Science
INTRODUCTION
• MapReduce is a programming model and an associated
implementation for processing and generating big data sets
with a parallel, distributed algorithm on a cluster.
• A MapReduce program is composed of a map procedure (or
method), which performs filtering and sorting, and
a reduce method, which performs a summary operation.
• The "MapReduce System" (also called "infrastructure" or
"framework") orchestrates the processing by marshalling the
distributed servers, running the various tasks in parallel,
managing all communications and data transfers between the
various parts of the system, and providing
for redundancy and fault tolerance.
OVERVIEW
• MapReduce is a framework for
processing parallelizable problems across large
datasets using a large number of computers
(nodes), collectively referred to as a cluster (if
all nodes are on the same local network and
use similar hardware) or a grid (if the nodes
are shared across geographically and
administratively distributed systems, and use
more heterogeneous hardware).
• A MapReduce framework is usually composed of
three operations:
• Map: each worker node applies the map function
to the local data, and writes the output to a
temporary storage. A master node ensures that
only one copy of redundant input data is
processed.
• Shuffle: worker nodes redistribute data based on
the output keys, such that all data belonging to
one key is located on the same worker node.
• Reduce: worker nodes now process each group of
output data, per key, in parallel.
• The Map and Reduce functions of MapReduce are both
defined with respect to data structured in (key, value)
pairs. Map takes one pair of data with a type in one data
domain, and returns a list of pairs in a different domain:
• Map(k1,v1) → list(k2,v2)
• The Map function is applied in parallel to every pair (keyed
by k1) in the input dataset. This produces a list of pairs (keyed
by k2) for each call. After that, the MapReduce framework
collects all pairs with the same key (k2) from all lists and
groups them together, creating one group for each key.
• The Reduce function is then applied in parallel to each group,
which in turn produces a collection of values in the same
domain:
• Reduce(k2, list(v2)) → list(v3)
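The signatures above can be illustrated with a single-process word-count sketch in Python. The names `map_fn`, `reduce_fn`, and `map_reduce` are illustrative, not part of any framework; a real cluster would run the map and reduce calls in parallel on many nodes:

```python
from collections import defaultdict

def map_fn(_, line):
    # Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce(k2, list(v2)) -> list(v3): sum the partial counts for one word.
    return [sum(counts)]

def map_reduce(records, map_fn, reduce_fn):
    # Shuffle: collect all pairs with the same key k2 into one group.
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce each group independently (on a cluster, in parallel).
    return {k2: reduce_fn(k2, values) for k2, values in sorted(groups.items())}

lines = [(0, "the quick brown fox"), (1, "the lazy dog")]
print(map_reduce(lines, map_fn, reduce_fn))
# {'brown': [1], 'dog': [1], 'fox': [1], 'lazy': [1], 'quick': [1], 'the': [2]}
```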
DATA FLOW
• The software framework architecture adheres to the
open-closed principle, where code is effectively divided into
unmodifiable frozen spots and extensible hot spots. The
frozen spot of the MapReduce framework is a large
distributed sort. The hot spots, which the application
defines, are:
• an input reader
• a Map function
• a partition function
• a compare function
• a Reduce function
• an output writer
• Input reader:
• The input reader divides the input into appropriately
sized 'splits' and the framework assigns one split to
each Map function. The input reader reads data from stable
storage and generates key/value pairs.
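A minimal input reader can be sketched in Python. The (byte offset, line) key/value convention here is borrowed from Hadoop's `TextInputFormat`; the function name is illustrative:

```python
def input_reader(path):
    # Read records from stable storage and generate key/value pairs.
    # Key: the byte offset of the line; value: the line's text.
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            yield offset, line.rstrip(b"\n").decode("utf-8")
            offset += len(line)
```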
• Map function:
• The Map function takes a series of key/value pairs,
processes each, and generates zero or more output key/value
pairs. The input and output types of the map can be different
from each other.
• Partition function:
• Each Map function output is allocated to a
particular reducer by the application's partition function
for sharding purposes. The partition function is given the
key and the number of reducers and returns the index of the
desired reducer.
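A typical default partition function is a stable hash of the key modulo the number of reducers, comparable to Hadoop's `HashPartitioner`. A Python sketch (the name `default_partition` is my own; CRC32 stands in for whatever stable hash a framework uses):

```python
import zlib

def default_partition(key, num_reducers):
    # Return the reducer index in [0, num_reducers) for a key. A stable
    # hash guarantees every occurrence of the same key is routed to the
    # same reducer, across processes and runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```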
• Comparison function:
• The input for each Reduce is pulled from the machine where
the Map ran and sorted using the
application's comparison function.
• Reduce function:
• The framework calls the application's Reduce function once
for each unique key in the sorted order. The Reduce can
iterate through the values that are associated with that key
and produce zero or more outputs.
• In the word count example, the Reduce function takes the
input values, sums them and generates a single output of the
word and the final sum.
• Output writer:
• The Output Writer writes the output of the Reduce to the
stable storage.
Performance considerations
• MapReduce programs are not guaranteed to be fast. The
main benefit of this programming model is that it exploits
the platform's optimized shuffle operation, and that the
programmer only has to write the Map and Reduce parts of
the program.
• In practice, however, the author of a MapReduce program
has to take the shuffle step into consideration; in
particular, the partition function and the amount of data
written by the Map function can have a large impact on
performance and scalability.
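One common way to shrink the data the Map function writes is a combiner: pre-aggregating values inside the map task before the shuffle. The slides do not cover combiners, so this is a supplementary sketch; the function name is illustrative:

```python
from collections import Counter

def map_with_combiner(_, line):
    # Combine (word, 1) pairs inside the map task, so each distinct word
    # in the split is emitted once with a partial sum instead of once per
    # occurrence. This cuts the volume of data the shuffle must move.
    return list(Counter(line.split()).items())

pairs = map_with_combiner(0, "to be or not to be")
print(sorted(pairs))  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```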
Distribution and reliability
• MapReduce achieves reliability by parceling out a
number of operations on the set of data to each node in
the network. Each node is expected to report back
periodically with completed work and status updates.
• If a node falls silent for longer than the expected
reporting interval, the master node records the node as
dead and sends the node's assigned work to other nodes.
• Individual operations use atomic operations when naming
file outputs, as a check to ensure that no conflicting
parallel threads are running.
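The atomic file-naming check can be illustrated with a write-then-rename commit, a common pattern for this purpose (an illustrative sketch, not actual framework code):

```python
import os

def commit_output(data, final_path):
    # Write the reducer's output to a task-private temporary file, then
    # atomically rename it into place. If two speculative copies of the
    # same task race, one rename wins cleanly and readers never observe
    # a half-written file.
    tmp_path = "{}.tmp.{}".format(final_path, os.getpid())
    with open(tmp_path, "w") as f:
        f.write(data)
    os.replace(tmp_path, final_path)  # atomic within one filesystem
```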
Uses
• MapReduce is useful in a wide range of
applications, including distributed pattern-
based searching, distributed sorting, web link-
graph reversal, Singular Value
Decomposition, web access log stats, inverted
index construction, document
clustering, machine learning, and statistical
machine translation.
