Introduction to Hadoop
Dr. Sandeep G. Deshmukh
DataTorrent
Contents
 Motivation
 Scale of Cloud Computing
 Hadoop
 Hadoop Distributed File System (HDFS)
 MapReduce
 Sample Code Walkthrough
 Hadoop Ecosystem
Motivation - Traditional Distributed Systems
 Processor bound
 Using multiple machines
 Developer is burdened with managing too many things
 Synchronization
 Failures
 Data moves from shared disk to compute node
 Cost of maintaining clusters
 Scalability as and when required is not available
What is the scale we are talking about?
Couple of CPUs?
10s of CPUs?
100s of CPUs?
Hadoop @ Yahoo!
What we need
 Handling failure
 One computer = fails once in 1000 days
 1000 computers = 1 failure per day
 Petabytes of data to be processed in parallel
 1 HDD = 100 MB/sec
 1000 HDDs = 100 GB/sec
 Easy scalability
 Relative increase/decrease of performance with increase/decrease of nodes
What we’ve got: Hadoop!
 Created by Doug Cutting
 Started as a module in Nutch and then matured into an Apache project
 Named after his son's stuffed elephant
What we’ve got: Hadoop!
 Fault-tolerant file system
 Hadoop Distributed File System (HDFS)
 Modeled on the Google File System
 Takes computation to data
 Data locality
 Scalability
 Program remains the same for 10, 100, 1000, … nodes
 Corresponding performance improvement
 Parallel computation using MapReduce
 Other components: Pig, HBase, Hive, ZooKeeper
HDFS
Hadoop Distributed File System
How HDFS works
 NameNode - master
 DataNodes - slaves
 Secondary NameNode
Storing a file on HDFS
Motivation: reliability, availability, network bandwidth
 The input file (say 1 TB) is split into smaller chunks/blocks of 64 MB (or multiples of 64 MB)
 The chunks are stored on multiple slave nodes as independent files
 To ensure that data is not lost, replicas are stored in the following way:
 One on the local node
 One on a remote rack (in case the local rack fails)
 One on the local rack (in case the local node fails)
 Others randomly placed
 Default replication factor is 3
[Diagram: a file is split into blocks B1, B2, … Bn; each block is replicated across DataNodes in two racks (Hub 1, Hub 2), with 8-gigabit and 1-gigabit links connecting the master node and the racks.]
The master node: NameNode
Functions:
 Manages the file system - mapping files to blocks and blocks to DataNodes
 Maintains status of DataNodes
 Heartbeat
 DataNode sends a heartbeat at regular intervals
 If a heartbeat is not received, the DataNode is declared dead
 Blockreport
 DataNode sends the list of blocks stored on it
 Used to check the health of HDFS
NameNode Functions
 Replication
 On DataNode failure
 On disk failure
 On block corruption
 Data integrity
 Checksum for each block
 Stored in a hidden file
 Rebalancing - balancer tool
 Addition of new nodes
 Decommissioning
 Deletion of some files
HDFS Robustness
 Safemode
 At startup: no replication possible
 Receives heartbeats and blockreports from DataNodes
 Checks that a configured percentage of blocks meet the defined replication factor
 All is well  Exit Safemode
 Replicate blocks wherever necessary
HDFS Summary
 Fault tolerant
 Scalable
 Reliable
 Files are distributed in large blocks for
 Efficient reads
 Parallel access
Questions?
MapReduce
What is MapReduce?
 It is a powerful paradigm for parallel computation
 Hadoop uses MapReduce to execute jobs on files in HDFS
 Hadoop intelligently distributes the computation over the cluster
 Takes computation to data
Origin: Functional Programming
map f [a, b, c] = [f(a), f(b), f(c)]
map sq [1, 2, 3] = [sq(1), sq(2), sq(3)]
                 = [1, 4, 9]
 Returns a list constructed by applying a function (the first argument) to all items in a list passed as the second argument
Origin: Functional Programming
reduce f [a, b, c] = f(a, f(b, f(c, …)))
reduce sum [1, 4, 9] = sum(1, sum(4, sum(9, 0)))
                     = 14
 Returns a value constructed by applying a function (the first argument) over the list passed as the second argument
 Can be identity (do nothing)
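These two functional primitives can be sketched in plain Java using streams (purely for illustration; the class and method names below are our own, not part of any Hadoop API):

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustration of MapReduce's functional-programming roots using Java
// streams. Class and method names are ours, not part of any Hadoop API.
public class FuncOrigins {

    // map sq [1, 2, 3] = [1, 4, 9]: apply a function to every list item
    static List<Integer> mapSquare(List<Integer> xs) {
        return xs.stream().map(x -> x * x).collect(Collectors.toList());
    }

    // reduce sum [1, 4, 9] = 14: fold the list down to a single value
    // (0 plays the role of the empty-list seed)
    static int reduceSum(List<Integer> xs) {
        return xs.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<Integer> squares = mapSquare(List.of(1, 2, 3));
        System.out.println(squares);            // [1, 4, 9]
        System.out.println(reduceSum(squares)); // 14
    }
}
```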
Sum of squares example
Input: [1, 2, 3, 4]
MAPPER - four map tasks (M1, M2, M3, M4) compute Sq(1), Sq(2), Sq(3), Sq(4) in parallel
Intermediate output: 1, 4, 9, 16
REDUCER - a single reduce task (R1) sums the intermediate output
Output: 30
Sum of squares of even and odd
Input: [1, 2, 3, 4]
MAPPER - four map tasks (M1, M2, M3, M4) compute Sq(1), Sq(2), Sq(3), Sq(4)
Intermediate output: (odd, 1), (even, 4), (odd, 9), (even, 16)
REDUCER - one reduce task per key (R1, R2)
Output: (even, 20), (odd, 10)
Programming model - key, value pairs
Format of input/output: (key, value)
Map: (k1, v1) → list (k2, v2)
Reduce: (k2, list v2) → list (k3, v3)
Sum of squares of even, odd and prime
Input: [1, 2, 3, 4]
MAPPER - four map tasks compute Sq(1), Sq(2), Sq(3), Sq(4)
Intermediate output: (odd, 1), (even, 4), (prime, 4), (odd, 9), (prime, 9), (even, 16)
REDUCER - one reduce task per key (R1, R2, R3)
Output: (even, 20), (odd, 10), (prime, 13)
Many keys, many values
Format of input/output: (key, value)
Map: (k1, v1) → list (k2, v2)
Reduce: (k2, list v2) → list (k3, v3)
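A toy, single-process sketch of this (k1, v1) → list (k2, v2) model — all names below are ours, not Hadoop's — reproducing the even/odd sum-of-squares example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy single-process model of Map: (k1, v1) -> list(k2, v2) and
// Reduce: (k2, list v2) -> list(k3, v3). Names are ours, not Hadoop's.
public class MiniMapReduce {

    static Map<String, Integer> run(List<Integer> input) {
        // Map phase: for each record emit (parity, square);
        // grouping by key here stands in for the shuffle
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (int x : input) {
            String k2 = (x % 2 == 0) ? "even" : "odd";
            shuffled.computeIfAbsent(k2, k -> new ArrayList<>()).add(x * x);
        }
        // Reduce phase: sum the list of values for each key
        Map<String, Integer> out = new TreeMap<>();
        shuffled.forEach((k, vs) ->
                out.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(1, 2, 3, 4))); // {even=20, odd=10}
    }
}
```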
Fibonacci sequence
 f(n) = f(n-1) + f(n-2), i.e. f(5) = f(4) + f(3)
 0, 1, 1, 2, 3, 5, 8, 13, …
[Diagram: recursion tree of f(5) expanding into f(4), f(3), f(2), f(1), f(0)]
 MapReduce will not work for this kind of calculation
 No inter-process communication
 No data sharing
Input:
 1 TB text file containing color names - Blue, Green, Yellow, Purple, Pink, Red, Maroon, Grey
Desired output:
 Occurrences of the colors Blue and Green
Distributed solution:
 MAPPER: each node Nn filters its local chunk f.00n with grep 'Blue\|Green', keeping only the Blue and Green lines
 COMBINER: each node counts its local matches with sort | uniq -c (or awk '{arr[$1]++;} END{print arr["Blue"], arr["Green"]}'), producing per-node counts such as Blue=500, Green=200 and Blue=420, Green=200
 REDUCER: the per-node counts are summed with awk '{arr[$1]+=$2;} END{print arr["Blue"], arr["Green"]}' to give the final totals, e.g. Blue=3000, Green=5500
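The same mapper/combiner/reducer split can be sketched as a single-process Java simulation (no Hadoop; the class and method names are ours), showing how the combiner shrinks each node's chunk to a few counts before the reducer sums them:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the pipeline above (names are ours): each "node"
// filters its chunk (the grep step) and combines counts locally; a single
// reducer then sums the per-node counts.
public class ColorCount {

    // Mapper + combiner: count Blue/Green occurrences within one node's chunk
    static Map<String, Integer> combine(List<String> chunk) {
        Map<String, Integer> local = new TreeMap<>();
        for (String color : chunk) {
            if (color.equals("Blue") || color.equals("Green")) { // the grep step
                local.merge(color, 1, Integer::sum);
            }
        }
        return local;
    }

    // Reducer: sum the per-node counts
    static Map<String, Integer> reduce(List<Map<String, Integer>> perNode) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> counts : perNode) {
            counts.forEach((k, v) -> total.merge(k, v, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> node1 = List.of("Blue", "Purple", "Blue", "Red", "Green");
        List<String> node2 = List.of("Green", "Blue", "Maroon", "Yellow", "Blue");
        System.out.println(reduce(List.of(combine(node1), combine(node2))));
        // {Blue=4, Green=2}
    }
}
```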
MapReduce Overview
INPUT → MAP → SHUFFLE → REDUCE → OUTPUT
 Map works on one record at a time
 Reduce works on the output of Map
MapReduce Overview (with a Combiner)
INPUT → MAP → COMBINE → REDUCE → OUTPUT
 Combine works on the output of Map
 Reduce works on the output of the Combiner
MapReduce Summary
 Mapper, reducer and combiner act on <key, value> pairs
 Mapper gets one record at a time as an input
 Combiner (if present) works on the output of map
 Reducer works on the output of map (or combiner, if present)
 Combiner can be thought of as a local reducer
 Reduces the output of maps that are executed on the same node
What Hadoop is not…
 It is not a POSIX file system
 It is not a SAN file system
 It is not for interactive file access
 It is not meant for a large number of small files - it is for a small number of large files
 MapReduce cannot be used for any and all applications
Hadoop: Take Home
 Takes computation to data
 Suitable for large, data-centric operations
 Scalable on demand
 Fault tolerant and highly transparent
Questions?
 Coming up next …
 First Hadoop program
 Second Hadoop program
Your first program in Hadoop
Open up any tutorial on Hadoop and the first program you see will be WordCount 
Task:
 Given a text file, generate a list of words with the number of times each of them appears in the file
Input:
 Plain text file
Expected output:
 <word, frequency> pairs for all words in the file

Example input:
hadoop is a framework written in java
hadoop supports parallel processing
and is a simple framework

Expected output:
<hadoop, 2>
<is, 2>
<a, 2>
<java, 1>
<framework, 2>
<written, 1>
<in, 1>
<and, 1>
<supports, 1>
<parallel, 1>
<processing, 1>
<simple, 1>
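The WordCount logic itself is only a few lines; here is a single-JVM sketch in plain Java (no Hadoop involved; the class name is ours) that splits on whitespace and tallies each word:

```java
import java.util.Map;
import java.util.TreeMap;

// Single-JVM sketch of WordCount's logic (no Hadoop): split the text on
// whitespace and count each word. Class name is ours.
public class WordCountSketch {

    static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty()) freq.merge(word, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        String doc = "hadoop is a framework written in java\n"
                   + "hadoop supports parallel processing\n"
                   + "and is a simple framework";
        count(doc).forEach((w, n) -> System.out.println("<" + w + ", " + n + ">"));
    }
}
```

In real Hadoop the same tally happens in a reducer after the framework has grouped identical words together; this sketch only shows the counting step.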
Your second program in Hadoop
Task:
 Given a text file containing numbers, one per line, compute the sum of squares of the odd, even and prime numbers
Input:
 File containing integers, one per line
Expected output:
 <type, sum of squares> for odd, even, prime

Example input (one per line): 1, 2, 5, 3, 5, 6, 3, 7, 9, 4
Expected output:
<odd, 199>
<even, 56>
<prime, 121>
Your second program in Hadoop
File on HDFS (one number per line): 3, 9, 6, 2, 3, 7, 8

Map (square each value; input: value, output: (key, value) pairs):
3 → <odd, 9>, <prime, 9>
9 → <odd, 81>
6 → <even, 36>
2 → <even, 4>, <prime, 4>
3 → <odd, 9>, <prime, 9>
7 → <odd, 49>, <prime, 49>
8 → <even, 64>

Reducer (sum; input: (key, list of values), output: (key, value)):
odd: <9, 81, 9, 49> → <odd, 148>
even: <36, 4, 64> → <even, 104>
prime: <9, 4, 9, 49> → <prime, 71>
Your second program in Hadoop
Map (invoked on a record):
void map(int x){
  int sq = x * x;
  if(odd(x))
    print("odd", sq);
  if(even(x))
    print("even", sq);
  if(prime(x))
    print("prime", sq);
}

Reduce (invoked on a key):
void reduce(List l){
  int sum = 0;
  for(y in l){
    sum += y;
  }
  print(key, sum);
}

Library functions:
boolean odd(int x){ … }
boolean even(int x){ … }
boolean prime(int x){ … }
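The pseudocode can be made runnable as plain Java (a single-process stand-in, no Hadoop; the class name OddEvenPrimeSketch is ours). Note that one record may emit under several keys — 7 emits both <odd, 49> and <prime, 49>:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Runnable stand-in for the pseudocode above (plain Java, no Hadoop).
// Map emits (odd|even|prime, x*x) per record; reduce sums each key's list.
public class OddEvenPrimeSketch {

    static boolean prime(int x) {
        if (x < 2) return false;
        for (int d = 2; d * d <= x; d++) if (x % d == 0) return false;
        return true;
    }

    // Map: invoked on one record; may emit under several keys
    static void map(int x, Map<String, List<Integer>> emit) {
        int sq = x * x;
        emit.computeIfAbsent(x % 2 == 0 ? "even" : "odd",
                             k -> new ArrayList<>()).add(sq);
        if (prime(x)) emit.computeIfAbsent("prime",
                             k -> new ArrayList<>()).add(sq);
    }

    // Reduce: invoked on one key with its list of values
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int y : values) sum += y;
        return sum;
    }

    static Map<String, Integer> run(int... input) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (int x : input) map(x, groups);       // map phase + grouping
        Map<String, Integer> out = new TreeMap<>();
        groups.forEach((k, vs) -> out.put(k, reduce(vs))); // reduce phase
        return out;
    }

    public static void main(String[] args) {
        // {even=104, odd=148, prime=71}
        System.out.println(run(3, 9, 6, 2, 3, 7, 8));
    }
}
```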
Your second program in hadoop
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  // Name the job and identify the jar that contains the job classes
  Job job = new Job(conf, "OddEvenPrime");
  job.setJarByClass(OddEvenPrime.class);
  job.setMapperClass(OddEvenPrimeMapper.class);
  // The reducer doubles as the combiner: summing squares is associative
  job.setCombinerClass(OddEvenPrimeReducer.class);
  job.setReducerClass(OddEvenPrimeReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setInputFormatClass(TextInputFormat.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
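The driver above wires up mapper and reducer classes that the deck shows only as pseudocode. As a runnable stand-in, here is a plain-Java sketch of the same logic with no Hadoop dependency: each input number is squared, and the square is added to the sum of every category it belongs to (odd, even, prime). The class name, method names, and sample input are illustrative assumptions, not the deck's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of the logic inside the hypothetical
// OddEvenPrimeMapper/OddEvenPrimeReducer pair: classify each number,
// square it ("map"), then sum the squares per category ("reduce").
public class OddEvenPrimeLocal {

    static boolean isPrime(int x) {
        if (x < 2) return false;
        for (int d = 2; d * d <= x; d++) {
            if (x % d == 0) return false;
        }
        return true;
    }

    public static Map<String, Integer> sumOfSquares(int[] input) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (int x : input) {
            int sq = x * x;                                  // map: square the value
            if (x % 2 != 0) sums.merge("odd", sq, Integer::sum);
            else            sums.merge("even", sq, Integer::sum);
            if (isPrime(x)) sums.merge("prime", sq, Integer::sum);
        }
        return sums;                                         // reduce: per-key totals
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(new int[]{3, 9, 6, 2, 3, 7, 8}));
    }
}
```

Note that a value such as 3 contributes to both the odd and the prime sums, mirroring the multi-key emit in the slides; and because summation is associative, the same summing function can serve as both combiner and reducer in the real job, which is why the driver registers OddEvenPrimeReducer for both roles.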
45
Questions?
46
 Coming up next …
 More examples
 Hadoop Ecosystem
Example: Counting Fans
47
Example: Counting Fans
48
Problem: Give crowd statistics
Count fans supporting India and Pakistan

49
45882 67917
 Traditional Way
 Central Processing
 Every fan comes to the centre and presses the India/Pak button
 Issues
 Slow/Bottlenecks
 Only one processor
 Processing time determined by the speed at which people (data) can move
Example: Counting Fans
50
Hadoop Way
 Appoint processors per block (MAPPER)
 Send them to each block and ask them to send a signal for each person
 Central processor will aggregate the results (REDUCER)
 Can make the processor smart by asking him/her to aggregate locally and only send the aggregated value (COMBINER)
Example: Counting Fans
Reducer
Combiner
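In code form, the block-wise counting above might look like the following plain-Java simulation (not Hadoop API code; the class name and the crowd data are made up): combine() plays the mapper-plus-combiner role for one stadium block, and reduce() aggregates the per-block counts into the final tally.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simulation of the counting-fans example: one "mapper" per stadium
// block emits (team, 1) per fan, a "combiner" pre-aggregates within the
// block, and a single "reducer" sums the per-block totals.
public class FanCount {

    // Mapper + combiner for one block: emit 1 per fan, aggregate locally
    // so the block sends only one count per team to the reducer.
    static Map<String, Integer> combine(List<String> block) {
        Map<String, Integer> local = new HashMap<>();
        for (String fan : block) {
            local.merge(fan, 1, Integer::sum);
        }
        return local;
    }

    // Reducer: merge the aggregated counts from all blocks.
    static Map<String, Integer> reduce(List<Map<String, Integer>> perBlock) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> block : perBlock) {
            block.forEach((team, n) -> total.merge(team, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> blocks = List.of(
            List.of("India", "India", "Pakistan"),
            List.of("Pakistan", "India"),
            List.of("India", "Pakistan", "Pakistan"));
        List<Map<String, Integer>> combined = new ArrayList<>();
        for (List<String> b : blocks) combined.add(combine(b));
        System.out.println(reduce(combined));
    }
}
```

The payoff of the combiner step is visible in the data flow: each block ships one small map of totals to the reducer instead of one record per fan, just as the slide's "aggregate locally" suggestion describes.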
Homework: Exit Polls 2014
51
Hadoop EcoSystem: Basic
52


Hadoop Distributions
53
Who is using Hadoop
54
http://wiki.apache.org/hadoop/PoweredBy
References
For understanding Hadoop
 Official Hadoop website- http://hadoop.apache.org/
 Hadoop presentation wiki- http://wiki.apache.org/hadoop/HadoopPresentations?action=AttachFile
 http://developer.yahoo.com/hadoop/
 http://wiki.apache.org/hadoop/
 http://www.cloudera.com/hadoop-training/
 http://developer.yahoo.com/hadoop/tutorial/module2.html#basics
55
References
Setup and Installation
 Installing on Ubuntu
 http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)
 http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
 Installing on Debian
 http://archive.cloudera.com/docs/_apt.html
56


Further Reading
 Hadoop: The Definitive Guide, Tom White
 http://developer.yahoo.com/hadoop/tutorial/
 http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial.html
57
Questions?
58
59
Acknowledgements
Surabhi Pendse
Sayali Kulkarni
Parth Shah

