Big Data Analysis using Hadoop
Map-Reduce – An Introduction
Lecture 2

Brendan Tierney
www.oralytics.com | t: @brendantierney | e: brendan.tierney@oralytics.com
HDFS Architecture [from Hadoop in Practice, Alex Holmes]
MapReduce
•  A batch-based, distributed computing framework modelled on Google's MapReduce paper [http://research.google.com/archive/mapreduce.html]
•  MapReduce decomposes work into small parallelised map and reduce tasks, which are scheduled for remote execution on slave nodes
•  Terminology
•  A job is a full program
•  A task is the execution of a single map or reduce function over a slice of data called a split
•  A Mapper is a map task
•  A Reducer is a reduce task
•  MapReduce works by manipulating key/value pairs in the general format
map(key1, value1) ➝ list(key2, value2)
reduce(key2, list(value2)) ➝ list(key3, value3)
A MapReduce Job [from Hadoop in Practice, Alex Holmes]

The input is divided into fixed-size pieces called input splits; a map task is created for each split.
The role of the programmer is to define the Map and Reduce functions.
The Shuffle & Sort phase between the Map and the Reduce phases combines the map outputs and sorts them for the Reducers.
The Reduce phase merges the data, as defined by the programmer, to produce the outputs.

Map
•  The Map function
•  The Mapper takes as input a key/value pair, which represents a logical record from the input data source (e.g. a line in a file)
•  It produces zero or more output key/value pairs for each input pair
•  e.g. a filtering function may only produce output if a certain condition is met (sketched below)
•  e.g. a counting function may produce multiple key/value pairs, one per element being counted
map(in_key, in_value) ➝ list(temp_key, temp_value)
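As an illustration of the filtering case, a minimal sketch of my own (not part of the word-count example built later in these slides): a Mapper that emits a line only when it contains the term "ERROR". The class name, the search term, and the choice of NullWritable values are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical filtering Mapper: zero outputs for most input pairs,
// one output for each line that contains "ERROR".
public class LogFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.toString().contains("ERROR")) {
      // The input key (the byte offset into the file) is ignored.
      context.write(value, NullWritable.get());
    }
  }
}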
Reduce
•  The Reducer(s)
•  A single Reducer handles all the map output for a unique map output key
•  A Reducer outputs zero to many key/value pairs (sketched below)
•  The output is written to HDFS files, to external DBs, or to any data sink...
reduce(temp_key, list(temp_value)) ➝ list(out_key, out_value)
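A minimal sketch (mine, not part of this deck's example) of the "zero to many" behaviour: a Reducer that sums the counts for a word but only emits the word when the total passes a threshold, so some keys produce no output at all. The class name and threshold are assumptions.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical Reducer: zero or one output per key.
public class FrequentWordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private static final int MIN_COUNT = 10; // assumed threshold, for illustration only

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int total = 0;
    for (IntWritable value : values) {
      total += value.get();   // sum the counts for this key
    }
    if (total >= MIN_COUNT) { // emit only frequent words
      context.write(key, new IntWritable(total));
    }
  }
}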
MapReduce
•  JobTracker (Master)
•  Controls MapReduce jobs
•  Assigns Map & Reduce tasks to the other nodes on the cluster
•  Monitors the tasks as they are running
•  Relaunches failed tasks on other nodes in the cluster
•  TaskTracker (Slave)
•  A single TaskTracker per slave node
•  Manages the execution of the individual tasks on the node
•  Can instantiate many JVMs to handle tasks in parallel
•  Communicates back to the JobTracker (via a heartbeat)
[from Hadoop in Practice, Alex Holmes]

A MapReduce Job [from Hadoop: The Definitive Guide, Tom White]
Monitoring progress [from Hadoop: The Definitive Guide, Tom White]
YARN (Yet Another Resource Negotiator) Framework
Data Locality
"This is a local node for local data"
•  Whenever possible, Hadoop will attempt to ensure that a Mapper on a node is working on a block of data stored locally on that node via HDFS
•  If this is not possible, the Mapper will have to transfer the data across the network as it accesses the data
•  Once all the Map tasks are finished, the map output data is transferred across the network to the Reducers
•  Although Reducers may run on the same node (physical machine) as the Mappers, there is no concept of data locality for the Reducers

Bottlenecks?
•  Reducers cannot start until all Mappers are finished and the output has been transferred to the Reducers and sorted
•  To alleviate bottlenecks in Shuffle & Sort, Hadoop starts to transfer data to the Reducers as the Mappers finish
•  The percentage of Mappers which should finish before the Reducers start retrieving data is configurable (see the sketch below)
•  To alleviate bottlenecks caused by slow Mappers, Hadoop uses speculative execution
•  If a Mapper appears to be running significantly slower than the others, a new instance of the Mapper will be started on another machine, operating on the same data (remember replication)
•  The results of the first Mapper to finish will be used
•  The Mapper which is still running will be terminated by Hadoop
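Both behaviours are tunable through the Job's Configuration. A sketch, assuming the Hadoop 2.x (MRv2) property names; older releases used different names for the same settings:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TuningSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    Configuration conf = job.getConfiguration();

    // Reducers start fetching map output once 80% of the Mappers have
    // finished (the default is 0.05, i.e. 5%).
    conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f);

    // Speculative execution can be switched on or off per phase.
    conf.setBoolean("mapreduce.map.speculative", true);
    conf.setBoolean("mapreduce.reduce.speculative", false);
  }
}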
The MapReduce Job
Let us build up an example
The Scenario
•  Build a Word Counter
•  Using the Shakespeare Poems
•  Count the number of times a word appears in the data set
•  Use Map-Reduce to do this work
•  Step-by-step creation of the MR process

Driver Class
Mapper
Reducer
Setting up the MapReduce Job

•  A Job object forms the specification for the job
•  Job needs to know:
•  the jar file that the code is in, which will be distributed around the cluster; setJarByClass()
•  the input path(s) (in HDFS) for the job; FileInputFormat.addInputPath()
•  the output path (in HDFS) for the job; FileOutputFormat.setOutputPath()
•  the Mapper and Reducer classes; setMapperClass(), setReducerClass()
•  the output key and value classes; setOutputKeyClass(), setOutputValueClass()
•  the Mapper output key and value classes, if they are different from those of the Reducer; setMapOutputKeyClass(), setMapOutputValueClass()
•  the name of the job (the default is the name of the jar file); setJobName()
•  The default input format treats the file as lines of text
•  The default input key is LongWritable (the byte offset into the file)
•  The default input value is Text (the contents of the line read from the file)
Driver Code
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: WordCount <input path> <output path>");
      System.exit(-1);
    }
    Job job = Job.getInstance();
    job.setJarByClass(WordCount.class);
    job.setJobName("WordCount");
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
You will typically import these classes into every MapReduce job you write. Stepping through the driver code:

•  The main method accepts two command-line arguments: the input and output directories. The first step is to ensure that we have been given two command-line arguments; if not, print a usage message and exit.
•  Create a new job, specify the class which will be called to run the job, and give it a job name.
•  Give the Job information about the classes for the Mapper and the Reducer.

•  Specify the format of the intermediate output key and value produced by the Mapper.
•  Specify the types for the Reducer output key and value.
•  Specify the input directory (where the data will be read from) and the output directory (where the data will be written).
File formats - Inputs
•  The default InputFormat (TextInputFormat) will be used unless you specify otherwise
•  To use an InputFormat other than the default, use e.g. job.setInputFormatClass(KeyValueTextInputFormat.class)
•  By default, FileInputFormat.setInputPaths() will read all files from a specified directory and send them to the Mappers
•  Exceptions: items whose names begin with a period (.) or underscore (_)
•  Globs can be specified to restrict the input (see the sketch below)
•  For example, /2010/*/01/*
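Put together with the new (org.apache.hadoop.mapreduce) API used elsewhere in these slides, a sketch; the glob path is a made-up directory layout:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();

    // Treat each line as a tab-separated key/value pair instead of the
    // default TextInputFormat (byte offset / line contents).
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    // A glob restricting the input to the 1st of every month in 2010
    // (hypothetical directory layout).
    FileInputFormat.addInputPath(job, new Path("/2010/*/01/*"));
  }
}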

File formats - Outputs
•  FileOutputFormat.setOutputPath() specifies the directory to which the Reducers will write their final output
•  The driver can also specify the format of the output data
•  The default is a plain text file
•  Could be written explicitly as job.setOutputFormatClass(TextOutputFormat.class); (see the sketch below)
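A corresponding sketch with the new API; the output path is made up, and note that the output directory must not already exist or the job will fail:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputFormatSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();

    // Plain text output, one "key<TAB>value" line per record.
    // TextOutputFormat is the default, so this call is for clarity only.
    job.setOutputFormatClass(TextOutputFormat.class);

    // Hypothetical output directory; Hadoop refuses to overwrite an
    // existing one.
    FileOutputFormat.setOutputPath(job, new Path("/user/demo/wordcount-output"));
  }
}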
•  Finally, submit the job and wait for completion: System.exit(job.waitForCompletion(true) ? 0 : 1);
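Packaged into a jar, the job would then typically be launched from the command line; the jar name and HDFS paths here are assumptions for illustration:

hadoop jar WordCount.jar WordCount shakespeare-poems/ wordcount-output/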
Mapper
•  The Mapper takes as input a key/value pair, which represents a logical record from the input data source (e.g. a line in a file)
•  The Mapper may use or ignore the input key
•  E.g. a standard pattern is to read a file one line at a time
•  Key = byte offset into the file where the line starts
•  Value = contents of the line in the file
•  Typically the key can be considered irrelevant
•  It produces zero or more output key/value pairs for each input pair
•  e.g. a filtering function may only produce output if a certain condition is met
•  e.g. a counting function may produce multiple key/value pairs, one per element being counted

Mapper Class
•  extends the Mapper<K1, V1, K2, V2> class
•  key and value classes implement the WritableComparable and Writable interfaces
•  most Mappers override the map method, which is called once for every key/value pair in the input
•  void map(K1 key, V1 value, Context context) throws IOException, InterruptedException
•  the default map method is the Identity mapper - it maps the inputs directly to the outputs
•  in general, the map input types K1, V1 are different from the map output types K2, V2
Mapper Class
•  Hadoop provides a number of Mapper implementations:
•  InverseMapper - swaps the keys and values
•  TokenCounterMapper - tokenises the input and outputs each token with a count of 1
•  RegexMapper - extracts text matching a regular expression
Example:
job.setMapperClass(TokenCounterMapper.class);
Mapper Code
...
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  // Inputs: (LongWritable, Text); Outputs: (Text, IntWritable)
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Process the input text: split the line into words
    String s = value.toString();
    for (String word : s.split("\\W+")) {
      if (word.length() > 0) {
        // Write the outputs: one (word, 1) pair per word
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}
What the mapper does
•  Input to the Mapper:
("this one I think is called a yink")
("he likes to wink, he likes to drink")
("he likes to drink and drink and drink")
•  Output from the Mapper:
(this, 1) (one, 1) (I, 1) (think, 1) (is, 1) (called, 1) (a, 1) (yink, 1)
(he, 1) (likes, 1) (to, 1) (wink, 1) (he, 1) (likes, 1) (to, 1) (drink, 1)
(he, 1) (likes, 1) (to, 1) (drink, 1) (and, 1) (drink, 1) (and, 1) (drink, 1)

Shuffle and sort
•  Shuffle
•  Integrates the data (key/value pairs) from the outputs of each Mapper
•  For now, integrated into one file
•  Sort
•  The set of intermediate keys on a single node is automatically sorted by Hadoop before being presented to the Reducer
•  Sorted within key
•  Determines what subset of the data goes to which Reducer (see the sketch below)
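The "which Reducer" decision is made by the job's Partitioner. As a sketch (mine, not from the deck), the default hash partitioning works roughly like this; a custom implementation can be plugged in with job.setPartitionerClass():

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Roughly what the default hash partitioning does: the Reducer for a key
// is derived from the key's hash code, so every value for a given key
// lands on the same Reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}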
Mapper output:
(this, 1) (one, 1) (I, 1) (think, 1) (is, 1) (called, 1) (a, 1) (yink, 1)
(he, 1) (likes, 1) (to, 1) (wink, 1) (he, 1) (likes, 1) (to, 1) (drink, 1)
(he, 1) (likes, 1) (to, 1) (drink, 1) (and, 1) (drink, 1) (and, 1) (drink, 1)

Shuffle (group):
(this, [1]) (one, [1]) (I, [1]) (think, [1]) (called, [1]) (is, [1]) (a, [1])
(yink, [1]) (he, [1,1,1]) (likes, [1,1,1]) (to, [1,1,1]) (wink, [1])
(drink, [1,1,1,1]) (and, [1,1])

Sort:
(a, [1]) (and, [1,1]) (called, [1]) (drink, [1,1,1,1]) (he, [1,1,1]) (I, [1])
(is, [1]) (likes, [1,1,1]) (one, [1]) (think, [1]) (this, [1]) (to, [1,1,1])
(wink, [1]) (yink, [1])
Reducer Class
•  extends the Reducer<K2, V2, K3, V3> class
•  key and value classes implement the WritableComparable and Writable interfaces
•  void reduce(K2 key, Iterable<V2> values, Context context) throws IOException, InterruptedException
•  called once for each input key
•  generates a list of output key/value pairs by iterating over the values associated with the input key
•  the reduce input types K2, V2 must be the same types as the map output types
•  the reduce output types K3, V3 can be different from the reduce input types
•  the default reduce method is the Identity reducer - it outputs each input key/value pair directly
•  getConfiguration() - accesses the Configuration for a Job
•  void setup(Context context) - called once at the beginning of the reduce task
•  void cleanup(Context context) - called at the end of the task to wrap up any loose ends: close files, DB connections, etc. (see the sketch below)
•  The default number of Reducers is 1
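A minimal sketch of the setup()/cleanup() lifecycle (my example; the property name "wordcount.min" is an assumed, application-defined setting): the threshold is read from the Configuration once per task in setup(), rather than once per key.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ThresholdReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private int minCount;

  @Override
  public void setup(Context context) {
    // Called once at the beginning of the reduce task.
    minCount = context.getConfiguration().getInt("wordcount.min", 1);
  }

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int total = 0;
    for (IntWritable value : values) {
      total += value.get();
    }
    if (total >= minCount) {
      context.write(key, new IntWritable(total));
    }
  }

  @Override
  public void cleanup(Context context) {
    // Called once at the end of the task: close files, DB connections, etc.
  }
}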
Reducer Class
•  Hadoop provides some Reducer implementations:
•  IntSumReducer - sums the values (integers) for a given key
•  LongSumReducer - sums the values (longs) for a given key
Example:
job.setReducerClass(IntSumReducer.class);
http://hadooptutorial.info/predefined-mapper-and-reducer-classes/
http://www.programcreek.com/java-api-examples/index.php?api=org.apache.hadoop.mapreduce.lib.map.InverseMapper

Reducer Code
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  // Inputs: (Text, Iterable<IntWritable>); Outputs: (Text, IntWritable)
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Process the input: sum the counts for this word
    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get();
    }
    // Write the output: (word, total count)
    context.write(key, new IntWritable(wordCount));
  }
}

Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index construction
 
Hadoop
HadoopHadoop
Hadoop
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptxCLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 
Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pig
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
map reduce Technic in big data
map reduce Technic in big data map reduce Technic in big data
map reduce Technic in big data
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Hadoop – Architecture.pptx
Hadoop – Architecture.pptxHadoop – Architecture.pptx
Hadoop – Architecture.pptx
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
Introduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfIntroduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdf
 

Recently uploaded

[D3T1S02] Aurora Limitless Database Introduction
[D3T1S02] Aurora Limitless Database Introduction[D3T1S02] Aurora Limitless Database Introduction
[D3T1S02] Aurora Limitless Database Introduction
Amazon Web Services Korea
 
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model SafeLajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
aarusi sexy model
 
Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeNehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
butwhat24
 
Introduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdfIntroduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdf
kihus38
 
Maruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekhoMaruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekho
kamli sharma#S10
 
Streamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through ModernizationStreamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through Modernization
sanjay singh
 
Seamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send MoneySeamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send Money
gargtinna79
 
EGU2020-10385_presentation LSTM algorithm
EGU2020-10385_presentation LSTM algorithmEGU2020-10385_presentation LSTM algorithm
EGU2020-10385_presentation LSTM algorithm
fatimaezzahraboumaiz2
 
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model SafeRohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
kumkum tuteja$A17
 
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeDelhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
dipti singh$A17
 
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeNehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
bookmybebe1
 
LLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptxLLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptx
Jyotishko Biswas
 
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeMalviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
butwhat24
 
How We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeachHow We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeach
javier ramirez
 
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model SafeRK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Alisha Pathan $A17
 
Daryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeDaryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
butwhat24
 
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model SafeSaket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
shruti singh$A17
 
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
javier ramirez
 
Cloud Analytics Use Cases - Telco Products
Cloud Analytics Use Cases - Telco ProductsCloud Analytics Use Cases - Telco Products
Cloud Analytics Use Cases - Telco Products
luqmansyauqi2
 
iot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptxiot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptx
KiranKumar139571
 

Recently uploaded (20)

[D3T1S02] Aurora Limitless Database Introduction
[D3T1S02] Aurora Limitless Database Introduction[D3T1S02] Aurora Limitless Database Introduction
[D3T1S02] Aurora Limitless Database Introduction
 
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model SafeLajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Arti Singh Top Model Safe
 
Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeNehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
 
Introduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdfIntroduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdf
 
Maruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekhoMaruti Wagon R on road price in Faridabad - CarDekho
Maruti Wagon R on road price in Faridabad - CarDekho
 
Streamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through ModernizationStreamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through Modernization
 
Seamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send MoneySeamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send Money
 
EGU2020-10385_presentation LSTM algorithm
EGU2020-10385_presentation LSTM algorithmEGU2020-10385_presentation LSTM algorithm
EGU2020-10385_presentation LSTM algorithm
 
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model SafeRohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Vishakha Singla Top Model Safe
 
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeDelhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
 
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeNehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
 
LLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptxLLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptx
 
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeMalviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
 
How We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeachHow We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeach
 
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model SafeRK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
 
Daryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeDaryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Daryaganj @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
 
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model SafeSaket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
 
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
 
Cloud Analytics Use Cases - Telco Products
Cloud Analytics Use Cases - Telco ProductsCloud Analytics Use Cases - Telco Products
Cloud Analytics Use Cases - Telco Products
 
iot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptxiot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptx
 

Introduction to Map-Reduce

  • 1. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Big Data Analysis using Hadoop! ! Map-Reduce – An Introduction! ! Lecture 2! ! ! Brendan Tierney
  • 2. [from Hadoop in Practice, Alex Holmes] HDFS Architecture
  • 3. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com MapReduce •  A batch based, distributed computing framework modelled on Google’s paper on MapReduce [http://research.google.com/archive/mapreduce.html] •  MapReduce decomposes work into small parallelised map and reduce tasks which are scheduled for remote execution on slave nodes •  Terminology •  A job is a full programme •  A task is the execution of a single map or reduce task over a slice of data called a split •  A Mapper is a map task •  A Reducer is a reduce task •  MapReduce works by manipulating key/value pairs in the general format map(key1,value1)➝ list(key2,value2) reduce(key2,list(value2)) ➝ (key3, value3)
  • 4. [from Hadoop in Practice, Alex Holmes] A MapReduce Job
  • 5. [from Hadoop in Practice, Alex Holmes] A MapReduce Job The input is divided into fixed-size pieces called input splits A map task is created for each split
  • 6. [from Hadoop in Practice, Alex Holmes] A MapReduce Job The role of the programmer is to define the Map and Reduce functions
  • 7. [from Hadoop in Practice, Alex Holmes] A MapReduce Job The Shuffle & Sort phases between the Map and the Reduce phases combines map outputs and sorts them for the Reducers...
  • 8. [from Hadoop in Practice, Alex Holmes] A MapReduce Job The Shuffle & Sort phases between the Map and the Reduce phases combines map outputs and sorts them for the Reducers... The Reduce phase merges the data, as defined by the programmer to produce the outputs.
  • 9. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Map •  The Map function •  The Mapper takes as input a key/value pair which represents a logical record from the input data source (e.g. a line in a file) •  It produces zero or more output key/value pairs for each input pair •  e.g. a filtering function may only produce output if a certain condition is met •  e.g. a counting function may produce multiple key/value pairs, one per element being counted map(in_key, in_value) ➝ list(temp_key, temp_value)
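To make the filtering case concrete, here is a minimal sketch (not from the deck; the class name ErrorFilterMapper is hypothetical) of a Mapper that emits an output pair only for lines containing the string "ERROR", i.e. zero or one outputs per input record:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ErrorFilterMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    if (line.contains("ERROR")) {               // the filter condition
      context.write(new Text("ERROR"), value);  // zero or one outputs per input
    }
  }
}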
  • 10. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Reduce •  The Reducer(s) •  A single Reducer handles all the map output for a unique map output key •  A Reducer outputs zero to many key/value pairs •  The output is written to HDFS files, to external DBs, or to any data sink... reduce(temp_key, list(temp_values)) ➝ list(out_key, out_value)
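As a sketch of a Reducer that collapses each key's list of values into a single output pair, consider the following (not from the deck; MaxValueReducer is a hypothetical example):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxValueReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int max = Integer.MIN_VALUE;
    for (IntWritable v : values) {
      max = Math.max(max, v.get());            // fold the list of values into one result
    }
    context.write(key, new IntWritable(max));  // one output pair per input key
  }
}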
  • 11. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com MapReduce •  JobTracker - (Master) •  Controls MapReduce jobs •  Assigns Map & Reduce tasks to the other nodes on the cluster •  Monitors the tasks as they are running •  Relaunches failed tasks on other nodes in the cluster •  TaskTracker - (Slave) •  A single TaskTracker per slave node •  Manages the execution of the individual tasks on the node •  Can instantiate many JVMs to handle tasks in parallel •  Communicates back to the JobTracker (via a heartbeat)
  • 12. [from Hadoop in Practice, Alex Holmes]
  • 13. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com [from Hadoop the Definitive Guide, Tom White] A MapReduce Job
  • 14. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com [from Hadoop the Definitive Guide, Tom White] Monitoring progress
  • 15. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com YARN (Yet Another Resource Negotiator) Framework
  • 16. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Data Locality ! “This is a local node for local Data” •  Whenever possible Hadoop will attempt to ensure that a Mapper on a node is working on a block of data stored locally on that node via HDFS •  If this is not possible, the Mapper will have to transfer the data across the network as it accesses the data •  Once all the Map tasks are finished, the map output data is transferred across the network to the Reducers •  Although Reducers may run on the same node (physical machine) as the Mappers, there is no concept of data locality for Reducers
  • 17. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Bottlenecks? •  Reducers cannot start until all Mappers are finished and the output has been transferred to the Reducers and sorted •  To alleviate bottlenecks in Shuffle & Sort - Hadoop starts to transfer data to the Reducers as the Mappers finish •  The percentage of Mappers which should finish before the Reducers start retrieving data is configurable •  To alleviate bottlenecks caused by slow Mappers - Hadoop uses speculative execution •  If a Mapper appears to be running significantly slower than the others, a new instance of the Mapper will be started on another machine, operating on the same data (remember replication) •  The results of the first Mapper to finish will be used •  The Mapper which is still running will be terminated by Hadoop
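A minimal sketch of how these knobs can be set per job, assuming Hadoop 2.x (MRv2) property names; verify the names against your cluster's version:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TuningSketch {
  public static Job configuredJob() throws Exception {
    Job job = Job.getInstance();
    Configuration conf = job.getConfiguration();
    // Reducers begin fetching map output once 80% of the Mappers have finished
    conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f);
    // Allow Hadoop to launch a second, speculative instance of unusually slow map tasks
    conf.setBoolean("mapreduce.map.speculative", true);
    return job;
  }
}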
  • 19. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com The MapReduce Job! ! Let us build up an example
  • 20. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com The Scenario •  Build a Word Counter •  Using the Shakespeare Poems •  Count the number of times a word appears in the data set •  Use Map-Reduce to do this work •  Step-by-step creation of the MR process
  • 21. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Driver Class, Mapper, Reducer
  • 22. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Setting up the MapReduce Job •  A Job object forms the specification for the job •  Job needs to know: •  the jar file that the code is in, which will be distributed around the cluster; setJarByClass() •  the input path(s) (in HDFS) for the job; FileInputFormat.addInputPath() •  the output path(s) (in HDFS) for the job; FileOutputFormat.setOutputPath() •  the Mapper and Reducer classes; setMapperClass() setReducerClass() •  the output key and value classes; setOutputKeyClass() setOutputValueClass() •  the Mapper output key and value classes, if they are different from those of the Reducer; setMapOutputKeyClass() setMapOutputValueClass() •  the name of the job (default is the name of the jar file); setJobName() •  The default input considers the file as lines of text •  The default input key is LongWritable (the byte offset into the file) •  The default input value is Text (the contents of the line read from the file)
  • 23. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Driver Code

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      System.err.println("Usage: WordCount <input path> <output path>");
      System.exit(-1);
    }

    Job job = Job.getInstance();
    job.setJarByClass(WordCount.class);
    job.setJobName("WordCount");

    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
  • 24. Driver Code: You will typically import these classes into every MapReduce job you write. The import statements are omitted from the following slides for brevity.
  • 25. Driver Code (as above, without the import statements)
  • 26. Driver Code: The main method accepts two command-line arguments: the input and output directories. The first step is to ensure that we have been given two command-line arguments. If not, print a usage message and exit.
  • 27. Driver Code: Create a new Job, specify the class which will be called to run the job, and give it a job name.
  • 28. Driver Code: Give the Job information about the classes for the Mapper and the Reducer.
  • 29. Driver Code: Specify the format of the intermediate output key and value produced by the Mapper.
  • 30. Driver Code: Specify the types for the Reducer output key and value.
  • 31. Driver Code: Specify the input directory (where the data will be read from) and the output directory (where the data will be written).
  • 32. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com File formats - Inputs •  The default InputFormat (TextInputFormat) will be used unless you specify otherwise •  To use an InputFormat other than the default, use e.g. job.setInputFormatClass(KeyValueTextInputFormat.class) •  By default, FileInputFormat.setInputPaths() will read all files from a specified directory and send them to the Mappers •  Exceptions: items whose names begin with a period (.) or underscore (_) •  Globs can be specified to restrict the input •  For example, /2010/*/01/*
  • 33. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com File formats - Outputs •  FileOutputFormat.setOutputPath() specifies the directory to which the Reducers will write their final output •  The driver can also specify the format of the output data •  Default is a plain text file •  Could be explicitly written as job.setOutputFormatClass(TextOutputFormat.class);
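Pulling the two slides together, a minimal sketch with the new (org.apache.hadoop.mapreduce) API; the input glob and output path are hypothetical:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatSketch {
  public static Job configuredJob() throws Exception {
    Job job = Job.getInstance();
    // Non-default input format: each input line is split into a key and a value
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    // Glob restricts the input, e.g. to the files for the 1st of each month
    FileInputFormat.addInputPath(job, new Path("/2010/*/01/*"));
    // Plain text output (the default, written here explicitly)
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("/output"));
    return job;
  }
}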
  • 34. Driver Code: Submit the Job and wait for completion.
  • 35. Driver Code (the complete driver code, shown once more; see slide 23)
  • 36. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Mapper •  The Mapper takes as input a key/value pair which represents a logical record from the input data source (e.g. a line in a file) •  The Mapper may use or ignore the input key •  E.g. a standard pattern is to read a file one line at a time •  Key = byte offset into the file where the line starts •  Value = contents of the line in the file •  Typically the key can be considered irrelevant •  It produces zero or more output key/value pairs for each input pair •  e.g. a filtering function may only produce output if a certain condition is met •  e.g. a counting function may produce multiple key/value pairs, one per element being counted
  • 37. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Mapper Class •  extends the Mapper <K1, V1, K2, V2> class •  key and value classes implement the WritableComparable and Writable interfaces •  most Mappers override the map method, which is called once for every key/value pair in the input •  void map (K1 key, V1 value, Context context) throws IOException, InterruptedException •  the default map method is the Identity mapper - maps the inputs directly to the outputs •  in general the map input types K1, V1 are different from the map output types K2, V2
  • 38. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Mapper Class •  Hadoop provides a number of Mapper implementations: •  InverseMapper - swaps the keys and values •  TokenCounterMapper - tokenises the input and outputs each token with a count of 1 •  RegexMapper - extracts text matching a regular expression •  Example: job.setMapperClass(TokenCounterMapper.class);
  • 39. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Mapper Code

...
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  // Inputs: <LongWritable, Text>   Outputs: <Text, IntWritable>

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    // Processes the input text: split the line into words
    String s = value.toString();
    for (String word : s.split("\\W+")) {
      if (word.length() > 0) {
        // Writes the outputs: one (word, 1) pair per word
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}
  • 40. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com What the mapper does •  Input to the Mapper: (“this one I think is called a yink”) (“he likes to wink, he likes to drink”) (“he likes to drink and drink and drink”) •  Output from the Mapper: (this, 1) (one, 1) (I, 1) (think, 1) (is, 1) (called,1) (a, 1) (yink,1) (he, 1) (likes,1) (to,1) (wink,1) (he,1) (likes,1) (to,1) (drink,1) (he,1) (likes,1) (to,1) (drink,1) (and,1) (drink,1) (and,1) (drink,1)
  • 41. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Shuffle and sort •  Shuffle •  Integrates the data (key/value pairs) from the outputs of each Mapper •  For now, into a single file •  Sort •  The set of intermediate keys on a single node is automatically sorted by Hadoop before being presented to the Reducer •  Sorted by key •  Determines what subset of the data goes to which Reducer
  • 42. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Shuffle and Sort example:
Mapper output: (this, 1) (one, 1) (I, 1) (think, 1) (is, 1) (called,1) (a, 1) (yink,1) (he, 1) (likes,1) (to,1) (wink,1) (he,1) (likes,1) (to,1) (drink,1) (he,1) (likes,1) (to,1) (drink,1) (and,1) (drink,1) (and,1) (drink,1)
Shuffle (Group): (this, [1]) (one, [1]) (I, [1]) (think, [1]) (called,[1]) (is, [1]) (a, [1]) (yink,[1]) (he, [1,1,1]) (likes,[1,1,1]) (to,[1,1,1]) (wink,[1]) (drink,[1,1,1,1]) (and,[1,1])
Sort: (a, [1]) (and,[1,1]) (called,[1]) (drink,[1,1,1,1]) (he, [1,1,1]) (I, [1]) (is, [1]) (likes,[1,1,1]) (one, [1]) (think, [1]) (this, [1]) (to,[1,1,1]) (wink,[1]) (yink,[1])
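The assignment of keys to Reducers is done by the Partitioner; here is a minimal sketch (not from the deck) using the same formula as Hadoop's default HashPartitioner:

import org.apache.hadoop.io.Text;

public class PartitionSketch {
  // Same formula as org.apache.hadoop.mapreduce.lib.partition.HashPartitioner
  static int partitionFor(Text key, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

  public static void main(String[] args) {
    System.out.println(partitionFor(new Text("drink"), 2)); // 0 or 1
    System.out.println(partitionFor(new Text("drink"), 1)); // always 0 with a single Reducer
  }
}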
  • 43. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Reducer Class •  extends the Reducer <K2, V2, K3, V3> class •  key and value classes implement the WritableComparable and Writable interfaces •  void reduce (K2 key, Iterable<V2> values, Context context) throws IOException, InterruptedException •  called once for each input key •  generates a list of output key/value pairs by iterating over the values associated with the input key •  the reduce input types K2, V2 must be the same types as the map output types •  the reduce output types K3, V3 can be different from the reduce input types •  the default reduce method is the Identity reducer - outputs each key/value pair directly •  getConfiguration() - access the Configuration for a Job •  void setup (Context context) - called once at the beginning of the reduce task •  void cleanup (Context context) - called at the end of the task to wrap up any loose ends - close files, DB connections etc. •  Default number of Reducers = 1
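A minimal sketch (not from the deck) of overriding setup() and cleanup(); the wordcount.threshold property and the ThresholdReducer class are hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ThresholdReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private int threshold;

  @Override
  protected void setup(Context context) {
    // Called once at the beginning of the reduce task: read a job setting
    threshold = context.getConfiguration().getInt("wordcount.threshold", 1); // hypothetical property
  }

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    if (sum >= threshold) {
      context.write(key, new IntWritable(sum)); // emit only keys above the threshold
    }
  }

  @Override
  protected void cleanup(Context context) {
    // Called once at the end of the task: close files, DB connections etc.
  }
}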
  • 44. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Reducer Class •  Hadoop provides some Reducer implementations: •  IntSumReducer - sums the values (integers) for a given key •  LongSumReducer - sums the values (longs) for a given key •  Example: job.setReducerClass(IntSumReducer.class); http://hadooptutorial.info/predefined-mapper-and-reducer-classes/ http://www.programcreek.com/java-api-examples/index.php?api=org.apache.hadoop.mapreduce.lib.map.InverseMapper
  • 45. www.oralytics.com t : @brendantierney e : brendan.tierney@oralytics.com Reducer Code

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  // Inputs: <Text, IntWritable>   Outputs: <Text, IntWritable>

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

    // Processes the input values: sum the counts for this key
    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get();
    }

    // Writes the output: one (word, total count) pair per key
    context.write(key, new IntWritable(wordCount));
  }
}

  • 46.-48. Reducer Code (as above, highlighting the inputs and outputs, the processing of the input values, and the writing of the output)