Big Data with Hadoop & Spark Training: http://bit.ly/2skCodH
This CloudxLab Understanding MapReduce tutorial helps you to understand MapReduce in detail. Below are the topics covered in this tutorial:
1) Thinking in Map / Reduce
2) Understanding Unix Pipeline
3) Examples to understand MapReduce
4) Merging
5) Mappers & Reducers
6) Mapper Example
7) Input Split
8) mapper() & reducer() Code
9) Example - Count number of words in a file using MapReduce
10) Example - Compute Max Temperature using MapReduce
11) Hands-on - Count number of words in a file using MapReduce on CloudxLab
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial... (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2sh5b3E
This CloudxLab Hadoop Streaming tutorial helps you to understand Hadoop Streaming in detail. Below are the topics covered in this tutorial:
1) Hadoop Streaming and Why Do We Need it?
2) Writing Streaming Jobs
3) Testing Streaming jobs and Hands-on on CloudxLab
Hadoop - Introduction to map reduce programming - Meeting 12/04/2014 (soujavajug)
Aaron Myers introduces MapReduce and Hadoop. MapReduce is a distributed programming paradigm that allows processing of large datasets across clusters. It works by splitting data, distributing it across nodes, processing it in parallel using map and reduce functions, and collecting the results. Hadoop is an open source software framework for distributed storage and processing of big data using MapReduce. It includes HDFS for storage and Hadoop MapReduce for distributed computing. Developers write MapReduce jobs in Java by implementing map and reduce functions.
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2LCTufA
This CloudxLab Introduction to SparkR tutorial helps you to understand SparkR in detail. Below are the topics covered in this tutorial:
1) SparkR (R on Spark)
2) SparkR DataFrames
3) Launch SparkR
4) Creating DataFrames from Local DataFrames
5) DataFrame Operation
6) Creating DataFrames - From JSON
7) Running SQL Queries from SparkR
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou... (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2kyXPo0
This CloudxLab Writing MapReduce Programs tutorial helps you to understand how to write MapReduce Programs using Java in detail. Below are the topics covered in this tutorial:
1) Why MapReduce?
2) Write a MapReduce Job to Count Unique Words in a Text File
3) Create Mapper and Reducer in Java
4) Create Driver
5) MapReduce Input Splits, Secondary Sorting, and Partitioner
6) Combiner Functions in MapReduce
7) Job Chaining and Pipes in MapReduce
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations (scottcrespo)
Mastering Hadoop Map Reduce was a presentation I gave to Orlando Data Science on April 23, 2015. The presentation provides a clear overview of how Hadoop Map Reduce works, and then dives into more advanced topics of how to optimize runtime performance and implement custom data types.
The examples are written in Python and Java, and the presentation walks through how to create an n-gram count map reduce program using custom data types.
You can get the full source code for the examples on my Github! http://www.github.com/scottcrespo/ngrams
This document discusses various concepts related to Hadoop MapReduce including combiners, speculative execution, custom counters, input formats, multiple inputs/outputs, distributed cache, and joins. It explains that a combiner acts as a mini-reducer between the map and reduce stages to reduce data shuffling. Speculative execution allows redundant tasks to improve performance. Custom counters can track specific metrics. Input formats handle input splitting and reading. Multiple inputs allow different mappers for different files. Distributed cache shares read-only files across nodes. Joins can correlate large datasets on a common key.
Hadoop Streaming allows any executable or script to be used as a MapReduce job. It works by launching the executable or script as a separate process and communicating with it via stdin and stdout. The executable or script receives key-value pairs in a predefined format and outputs new key-value pairs that are collected. Hadoop Streaming uses PipeMapper and PipeReducer to adapt the external processes to the MapReduce framework. It provides a simple way to run MapReduce jobs without writing Java code.
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2kyRTuW
This CloudxLab Advanced Spark Programming tutorial helps you to understand Advanced Spark Programming in detail. Below are the topics covered in this slide:
1) Shared Variables - Accumulators & Broadcast Variables
2) Accumulators and Fault Tolerance
3) Custom Accumulators - Version 1.x & Version 2.x
4) Examples of Broadcast Variables
5) Key Performance Considerations - Level of Parallelism
6) Serialization Format - Kryo
7) Memory Management
8) Hardware Provisioning
This document provides an overview of the MapReduce paradigm and Hadoop framework. It describes how MapReduce uses a map and reduce phase to process large amounts of distributed data in parallel. Hadoop is an open-source implementation of MapReduce that stores data in HDFS. It allows applications to work with thousands of computers and petabytes of data. Key advantages of MapReduce include fault tolerance, scalability, and flexibility. While it is well-suited for batch processing, it may not replace traditional databases for data warehousing. Overall efficiency remains an area for improvement.
Big Data Analytics with Scala at SCALA.IO 2013 (Samir Bessalah)
This document provides an overview of big data analytics with Scala, including common frameworks and techniques. It discusses Lambda architecture, MapReduce, word counting examples, Scalding for batch and streaming jobs, Apache Storm, Trident, SummingBird for unified batch and streaming, and Apache Spark for fast cluster computing with resilient distributed datasets. It also covers clustering with Mahout, streaming word counting, and analytics platforms that combine batch and stream processing.
This document provides a technical introduction to Hadoop, including:
- Hadoop has been tested on a 4000 node cluster with 32,000 cores and 16 petabytes of storage.
- Key Hadoop concepts are explained, including jobs, tasks, task attempts, mappers, reducers, and the JobTracker and TaskTrackers.
- The process of launching a MapReduce job is described, from the client submitting the job to the JobTracker distributing tasks to TaskTrackers and running the user-defined mapper and reducer classes.
MapReduce examples, starting from the basic WordCount and going up to a more complex K-means algorithm. The code contained in these slides is available at https://github.com/andreaiacono/MapReduce
The document provides an introduction to Hadoop, including an overview of its core components HDFS and MapReduce, and motivates their use by explaining the need to process large amounts of data in parallel across clusters of computers in a fault-tolerant and scalable manner. It also presents sample code walkthroughs and discusses the Hadoop ecosystem of related projects like Pig, HBase, Hive and Zookeeper.
This document provides a high-level overview of MapReduce and Hadoop. It begins with an introduction to MapReduce, describing it as a distributed computing framework that decomposes work into parallelized map and reduce tasks. Key concepts like mappers, reducers, and job tracking are defined. The structure of a MapReduce job is then outlined, showing how input is divided and processed by mappers, then shuffled and sorted before being combined by reducers. Example map and reduce functions for a word counting problem are presented to demonstrate how a full MapReduce job works.
MapReduce is a programming model for processing large datasets in parallel. It works by breaking the dataset into independent chunks which are processed by the map function, and then grouping the output of the maps into partitions to be processed by the reduce function. Hadoop uses MapReduce to provide fault tolerance by restarting failed tasks and monitoring the JobTracker and TaskTrackers. MapReduce programs can be written in languages other than Java using Hadoop Streaming.
This document provides an introduction to distributed programming and Apache Hadoop. It discusses sequential, asynchronous, concurrent, and distributed programming. It then describes how Hadoop works as an open source framework for distributed applications, using a cluster of nodes to scale linearly and process large amounts of data reliably in a simple way. Key concepts covered include MapReduce programming, Hadoop installation and usage, and working with files in the Hadoop Distributed File System.
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.
This was the first session about Hadoop and MapReduce. It introduces what Hadoop is and its main components. It also covers how to program your first MapReduce task and how to run it on a pseudo-distributed Hadoop installation.
This session was given in Arabic, and I may provide a video of the session soon.
Speaking of big data analysis, what usually comes to mind is HDFS and MapReduce within Hadoop. But to write a MapReduce program, one must face the problem of learning to write native Java. One might wonder: is it possible to use R, the language most widely adopted by data scientists, to implement MapReduce programs? And through the integration of R and Hadoop, can one truly unleash the power of parallel computing and big data analysis?
These slides introduce how to install RHadoop step by step and how to write a MapReduce program in R. What is more, they discuss whether RHadoop is really a guiding light for big data analysis, or just another way to write MapReduce programs.
Please email me if you find any problem with the slides. EMAIL: tr.ywchiu@gmail.com
When big data comes up, what usually comes to mind is Hadoop's MapReduce and HDFS; but to write MapReduce, one has to learn Java or go through the Thrift interface. Can R run on Hadoop at all? And does R + Hadoop really combine R's powerful analytics with big data analysis?
This talk introduces, step by step, how to install the RHadoop packages on Hadoop and how to write MapReduce programs in R. More importantly, it explores whether RHadoop is a guiding light for big data analysis, or merely another implementation approach.
PyCascading for Intuitive Flow Processing with Hadoop (Gabor Szabo) (PyData)
PyCascading provides a Python API for the Cascading framework to process data flows on Hadoop. It allows defining data flows as Python functions and operations instead of Java code. The document discusses Hadoop concepts, shows how to define a WordCount workflow in PyCascading with fewer lines of code than Java, and walks through a full example of finding friends' most common interests. Key advantages are using Python instead of Java and leveraging any Python libraries, though performance-critical parts require Java.
Created at the University of California, Berkeley, Apache Spark combines a distributed computing system running on computer clusters with a simple and elegant way of writing programs. Spark is considered the first open source software that makes distributed programming really accessible to data scientists. Here you can find an introduction and basic concepts.
This document discusses using Python for Hadoop and data mining. It introduces Dumbo, which allows writing Hadoop programs in Python. K-means clustering in MapReduce is also covered. Dumbo provides a Pythonic API for MapReduce and allows extending Hadoop functionality. Examples demonstrate implementing K-means in Dumbo and optimizing it by computing partial centroids locally in mappers. The document also lists Python books and tools for data mining and scientific computing.
Apache Spark is a fast, general engine for large-scale data processing. It supports batch, interactive, and stream processing using a unified API. Spark uses resilient distributed datasets (RDDs), which are immutable distributed collections of objects that can be operated on in parallel. RDDs support transformations like map, filter, and reduce and actions that return final results to the driver program. Spark provides high-level APIs in Scala, Java, Python, and R and an optimized engine that supports general computation graphs for data analysis.
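As a quick, concrete illustration of the RDD model described above, here is a minimal PySpark word count. This is a sketch only: the input path and the local master setting are placeholders, not details taken from the summarized slides.

from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")            # local mode for experimenting
lines = sc.textFile("books.txt")                      # hypothetical input file
counts = (lines.flatMap(lambda line: line.split())    # transformation: split lines into words
               .map(lambda word: (word, 1))           # transformation: pair each word with 1
               .reduceByKey(lambda a, b: a + b))      # transformation: sum the 1s per word
print(counts.take(10))                                # action: returns results to the driver
sc.stop()

Until take() runs, nothing is computed; the transformations only build up the lineage of the RDD.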
The document provides an overview of MapReduce and how it addresses the problem of processing large datasets in a distributed computing environment. It explains how MapReduce, inspired by functional programming, works by splitting data, mapping functions to the pieces in parallel, and then reducing the results. Examples are given of word count and of sorting word counts to find the most frequent word. Finally, it discusses how Hadoop popularized MapReduce by providing an open-source implementation and ecosystem.
Apache Spark in Depth: Core Concepts, Architecture & Internals (Anton Kirillov)
The slides cover core Apache Spark concepts such as RDDs, the DAG, the execution workflow, the forming of stages of tasks, and the shuffle implementation, and also describe the architecture and main components of the Spark driver. The workshop part covers Spark execution modes and provides a link to a GitHub repo which contains example Spark applications and a dockerized Hadoop environment to experiment with.
Spark Streaming is an extension of the core Spark API that enables continuous data stream processing. It is particularly useful when data needs to be processed in real-time. Carol McDonald, HBase Hadoop Instructor at MapR, will cover:
+ What is Spark Streaming and what is it used for?
+ How does Spark Streaming work?
+ Example code to read, process, and write the processed data
The document describes the Hadoop ecosystem and its core components. It discusses HDFS, which stores large files across clusters and is made up of a NameNode and DataNodes. It also discusses MapReduce, which allows distributed processing of large datasets using a map and reduce function. Other components discussed include Hive, Pig, Impala, and Sqoop.
This document provides an overview of MapReduce, a programming model developed by Google for processing and generating large datasets in a distributed computing environment. It describes how MapReduce abstracts away the complexities of parallelization, fault tolerance, and load balancing to allow developers to focus on the problem logic. Examples are given showing how MapReduce can be used for tasks like word counting in documents and joining datasets. Implementation details and usage statistics from Google demonstrate how MapReduce has scaled to process exabytes of data across thousands of machines.
Distributed Computing with Apache Hadoop. Introduction to MapReduce. (Konstantin V. Shvachko)
Abstract: The presentation describes
- What is the BigData problem
- How Hadoop helps to solve BigData problems
- The main principles of the Hadoop architecture as a distributed computational platform
- History and definition of the MapReduce computational model
- Practical examples of how to write MapReduce programs and run them on Hadoop clusters
The talk is targeted to a wide audience of engineers who do not have experience using Hadoop.
MapReduce is a software framework introduced by Google that enables automatic parallelization and distribution of large-scale computations. It hides the details of parallelization, data distribution, load balancing, and fault tolerance. MapReduce allows programmers to specify a map function that processes key-value pairs to generate intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. It then automatically parallelizes the computation across large clusters of machines.
MapReduce: Recap
•Programmers must specify:
map(k, v) → <k’, v’>*
reduce(k’, v’) → <k’, v’>*
–All values with the same key are reduced together
•Optionally, also:
partition(k’, number of partitions) → partition for k’
–Often a simple hash of the key, e.g., hash(k’) mod n
–Divides up key space for parallel reduce operations
combine(k’, v’) → <k’, v’>*
–Mini-reducers that run in memory after the map phase
–Used as an optimization to reduce network traffic
•The execution framework handles everything else…
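To make the recap concrete, here is a small single-process simulation of those four functions for word counting. It is a sketch of the programming model only (assuming one map call per line and two reduce partitions), not of how Hadoop actually executes jobs.

from collections import defaultdict

def map_fn(key, line):                      # map(k, v) -> <k', v'>*
    for word in line.split():
        yield word, 1

def combine_fn(key, values):                # mini-reducer applied after the map phase
    yield key, sum(values)

def partition_fn(key, num_partitions):      # simple hash of the key: hash(k') mod n
    return hash(key) % num_partitions

def reduce_fn(key, values):                 # reduce(k', v'*) -> <k', v'>*
    yield key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
partitions = [defaultdict(list) for _ in range(2)]
for offset, line in enumerate(lines):
    mapped = defaultdict(list)
    for k, v in map_fn(offset, line):
        mapped[k].append(v)
    for k, vs in mapped.items():            # combine locally, then "shuffle" by partition
        for ck, cv in combine_fn(k, vs):
            partitions[partition_fn(ck, 2)][ck].append(cv)
for part in partitions:
    for k, vs in sorted(part.items()):      # all values with the same key reduced together
        for rk, rv in reduce_fn(k, vs):
            print(rk, rv)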
Big Data, a space adventure - Mario Cartia - Codemotion Rome 2015 (Codemotion)
Codemotion Rome 2015 - Big data is without doubt one of the hottest topics on today's technology landscape. To date, roughly 5 exabytes of data have been produced worldwide, a potential source of "intelligence" that, thanks to recent technologies, can be exploited in fields ranging from medicine to sociology to marketing. Through a virtual trip into space, the talk introduces the concepts, techniques, and tools for starting to exploit the potential of big data in everyday work.
This document provides an overview of using R, Hadoop, and Rhadoop for scalable analytics. It begins with introductions to basic R concepts like data types, vectors, lists, and data frames. It then covers Hadoop basics like MapReduce. Next, it discusses libraries for data manipulation in R like reshape2 and plyr. Finally, it focuses on Rhadoop projects like RMR for implementing MapReduce in R and considerations for using RMR effectively.
The document provides an introduction to MapReduce, describing its motivation as a framework for simplifying large-scale data processing across distributed systems. It outlines MapReduce's programming model and main features, including automatic parallelization, fault tolerance, and locality. The document also provides a detailed example of counting letter frequencies in a large file to illustrate how MapReduce works.
This document summarizes the MapReduce programming model and its associated implementation for processing large datasets across distributed systems. It describes how MapReduce allows users to express computations over large datasets in a simple way, while hiding the complexity of parallelization, fault-tolerance, and data distribution. The core abstractions of Map and Reduce are explained, along with how an implementation like Google's leverages a distributed file system to parallelize tasks across clusters and provide fault-tolerance through replication.
Hadoop and Mapreduce for .NET User Group (Csaba Toth)
This document provides an introduction to Hadoop and MapReduce. It discusses big data characteristics and challenges. It provides a brief history of Hadoop and compares it to RDBMS. Key aspects of Hadoop covered include the Hadoop Distributed File System (HDFS) for scalable storage and MapReduce for scalable processing. MapReduce uses a map function to process key-value pairs and generate intermediate pairs, and a reduce function to merge values by key and produce final results. The document demonstrates MapReduce through an example word count program and includes demos of implementing it on Hortonworks and Azure HDInsight.
This document provides an introduction to Hadoop, including its motivation and key components. It discusses the scale of cloud computing that Hadoop addresses, and describes the core Hadoop technologies - the Hadoop Distributed File System (HDFS) and MapReduce framework. It also briefly introduces the Hadoop ecosystem, including other related projects like Pig, HBase, Hive and ZooKeeper. Sample code is walked through to illustrate MapReduce programming. Key aspects of HDFS like fault tolerance, scalability and data reliability are summarized.
Data Science is concerned with the analysis of large amounts of data. When the volume of data is really large, it requires the use of cooperating, distributed machines. The most popular method of doing this is Hadoop, a collection of programs to perform computations on connected machines in a cluster. Hadoop began life as an open-source implementation of MapReduce, an idea first developed and implemented by Google for its own clusters. Though Hadoop's MapReduce is Java-based, and quite complex, this talk focuses on the "streaming" facility, which allows Python programmers to use MapReduce in a clean and simple way. We will present the core ideas of MapReduce and show you how to implement a MapReduce computation using Python streaming. The presentation will also include an overview of the various components of the Hadoop "ecosystem."
NYC Data Science Academy is excited to welcome Sam Kamin, who will be presenting an Introduction to Hadoop for Python Programmers, as well as a discussion of MapReduce with Streaming Python.
Sam Kamin was a professor in the University of Illinois Computer Science Department. His research was in programming languages, high-performance computing, and educational technology. He taught a wide variety of courses and served as the Director of Undergraduate Programs. He retired as Emeritus Associate Professor and worked at Google until taking his current position as VP of Data Engineering at NYC Data Science Academy.
--------------------------------------
Our fall 12-week Data Science bootcamp starts on Sept 21st, 2015. Apply now to get a spot!
If you are hiring Data Scientists, call us at (1)888-752-7585 or reach info@nycdatascience.com to share your openings and set up interviews with our excellent students.
In these slides we analyze why the aggregate data models change the way data is stored and manipulated. We introduce MapReduce and its open source implementation Hadoop. We consider how MapReduce jobs are written and executed by Hadoop.
Finally, we introduce Spark using a Docker image and show how to use anonymous functions in Spark.
The topics of the next slides will be
- Spark Shell (Scala, Python)
- Shark Shell
- Data Frames
- Spark Streaming
- Code Examples: Data Processing and Machine Learning
The most popular batch processing framework is Apache Hadoop's MapReduce. MapReduce is a Java-based system for processing large datasets in parallel. It reads data from HDFS and divides the dataset into smaller pieces.
Hadoop MapReduce is an open source framework for distributed processing of large datasets across clusters of computers. It allows parallel processing of large datasets by dividing the work across nodes. The framework handles scheduling, fault tolerance, and distribution of work. MapReduce consists of two main phases - the map phase, where the data is processed as key-value pairs, and the reduce phase, where the outputs of the map phase are aggregated together. It provides an easy programming model for developers to write distributed applications for large-scale processing of structured and unstructured data.
This document provides an overview of MapReduce and HBase in big data processing. It discusses how MapReduce distributes tasks across nodes in a cluster and uses map and reduce functions to process large datasets in parallel. It also explains how HBase can be used for storage with MapReduce, providing fast access and retrieval of large amounts of flexible, column-oriented data.
The document discusses MapReduce, a framework for processing large datasets in a distributed manner. It begins by explaining how MapReduce addresses issues around scaling computation across large networks. It then provides details on the key features and working of MapReduce, including how it divides jobs into map and reduce phases that operate in parallel on data blocks. Examples are given to illustrate how MapReduce can be used to count word frequencies in text and tally population statistics from a census.
Sparse matrix computations in MapReduce (David Gleich)
The document announces an ICME MapReduce workshop from April 29 to May 1, 2013. The workshop goals are to learn the basics of MapReduce and Hadoop and be able to process large volumes of scientific data. The workshop overview includes presentations on sparse matrix computations in MapReduce, extending MapReduce for scientific computing, and evaluating MapReduce for science. The document also discusses how to program data computers using MapReduce and Hadoop and provides examples of matrix operations like matrix-vector multiplication in MapReduce.
This document discusses algorithms that can be implemented using MapReduce, including sorting, searching, TF-IDF, breadth-first search, PageRank, and more advanced algorithms. It provides details on how sorting, searching, TF-IDF, breadth-first search, and PageRank algorithms work in MapReduce, including explaining the map and reduce phases. It also discusses graph representations that can be used for algorithms like breadth-first search and PageRank and how the algorithms are distributed across parallel tasks in MapReduce.
Similar to MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLab (20)
Understanding computer vision with Deep Learning (CloudxLab)
Computer vision is a branch of computer science which deals with recognising objects and people and identifying patterns in visuals. It is basically analogous to the vision of an animal.
Topics covered:
1. Overview of Machine Learning
2. Basics of Deep Learning
3. What is computer vision and its use-cases?
4. Various algorithms used in Computer Vision (mostly CNN)
5. Live hands-on demo of either Auto Cameraman or Face recognition system
6. What next?
This document provides an agenda for an introduction to deep learning presentation. It begins with an introduction to basic AI, machine learning, and deep learning terms. It then briefly discusses use cases of deep learning. The document outlines how to approach a deep learning problem, including which tools and algorithms to use. It concludes with a question and answer section.
This document discusses recurrent neural networks (RNNs) and their applications. It begins by explaining that RNNs can process input sequences of arbitrary lengths, unlike other neural networks. It then provides examples of RNN applications, such as predicting time series data, autonomous driving, natural language processing, and music generation. The document goes on to describe the fundamental concepts of RNNs, including recurrent neurons, memory cells, and different types of RNN architectures for processing input/output sequences. It concludes by demonstrating how to implement basic RNNs using TensorFlow's static_rnn function.
Natural Language Processing (NLP) is a field of artificial intelligence that deals with interactions between computers and human languages. NLP aims to program computers to process and analyze large amounts of natural language data. Some common NLP tasks include speech recognition, text classification, machine translation, question answering, and more. Popular NLP tools include Stanford CoreNLP, NLTK, OpenNLP, and TextBlob. Vectorization is commonly used to represent text in a way that can be used for machine learning algorithms like calculating text similarity. Tf-idf is a common technique used to weigh words based on their frequency and importance.
- Naive Bayes is a classification technique based on Bayes' theorem that uses "naive" independence assumptions. It is easy to build and can perform well even with large datasets.
- It works by calculating the posterior probability for each class given predictor values using the Bayes theorem and independence assumptions between predictors. The class with the highest posterior probability is predicted.
- It is commonly used for text classification, spam filtering, and sentiment analysis due to its fast performance and high success rates compared to other algorithms.
An autoencoder is an artificial neural network that is trained to copy its input to its output. It consists of an encoder that compresses the input into a lower-dimensional latent-space encoding, and a decoder that reconstructs the output from this encoding. Autoencoders are useful for dimensionality reduction, feature learning, and generative modeling. When constrained by limiting the latent space or adding noise, autoencoders are forced to learn efficient representations of the input data. For example, a linear autoencoder trained with mean squared error performs principal component analysis.
The document discusses challenges in training deep neural networks and solutions to those challenges. Training deep neural networks with many layers and parameters can be slow and prone to overfitting. A key challenge is the vanishing gradient problem, where the gradients shrink exponentially small as they propagate through many layers, making earlier layers very slow to train. Solutions include using initialization techniques like He initialization and activation functions like ReLU and leaky ReLU that do not saturate, preventing gradients from vanishing. Later improvements include the ELU activation function.
( Machine Learning & Deep Learning Specialization Training: https://goo.gl/5u2RiS )
This CloudxLab Reinforcement Learning tutorial helps you to understand Reinforcement Learning in detail. Below are the topics covered in this tutorial:
1) What is Reinforcement?
2) Reinforcement Learning an Introduction
3) Reinforcement Learning Example
4) Learning to Optimize Rewards
5) Policy Search - Brute Force Approach, Genetic Algorithms and Optimization Techniques
6) OpenAI Gym
7) The Credit Assignment Problem
8) Inverse Reinforcement Learning
9) Playing Atari with Deep Reinforcement Learning
10) Policy Gradients
11) Markov Decision Processes
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori... (CloudxLab)
The document provides information about key-value RDD transformations and actions in Spark. It defines transformations like keys(), values(), groupByKey(), combineByKey(), sortByKey(), subtractByKey(), join(), leftOuterJoin(), rightOuterJoin(), and cogroup(). It also defines actions like countByKey() and lookup() that can be performed on pair RDDs. Examples are given showing how to use these transformations and actions to manipulate key-value RDDs.
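A tiny PySpark sketch of a few of the pair-RDD operations listed above; the data is made up, and output ordering can vary across runs.

from pyspark import SparkContext

sc = SparkContext("local[*]", "PairRDDDemo")
pets = sc.parallelize([("cat", 1), ("dog", 2), ("cat", 3)])
owners = sc.parallelize([("cat", "alice"), ("dog", "bob")])
print(pets.keys().collect())                        # ['cat', 'dog', 'cat']
print(pets.groupByKey().mapValues(list).collect())  # [('cat', [1, 3]), ('dog', [2])]
print(pets.join(owners).collect())                  # [('cat', (1, 'alice')), ('cat', (3, 'alice')), ('dog', (2, 'bob'))]
print(pets.countByKey())                            # {'cat': 2, 'dog': 1}
print(pets.lookup("cat"))                           # [1, 3]
sc.stop()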
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori... (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2sm9c61
This CloudxLab Introduction to Spark SQL & DataFrames tutorial helps you to understand Spark SQL & DataFrames in detail. Below are the topics covered in this slide:
1) Loading XML
2) What is RPC - Remote Process Call
3) Loading AVRO
4) Data Sources - Parquet
5) Creating DataFrames From Hive Table
6) Setting up Distributed SQL Engine
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori... (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2sf2z6i
This CloudxLab Introduction to Spark SQL & DataFrames tutorial helps you to understand Spark SQL & DataFrames in detail. Below are the topics covered in this slide:
1) Introduction to DataFrames
2) Creating DataFrames from JSON
3) DataFrame Operations
4) Running SQL Queries Programmatically
5) Datasets
6) Inferring the Schema Using Reflection
7) Programmatically Specifying the Schema
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2IUsWca
This CloudxLab Running in a Cluster tutorial helps you to understand running Spark in the cluster in detail. Below are the topics covered in this tutorial:
1) Spark Runtime Architecture
2) Driver Node
3) Scheduling Tasks on Executors
4) Understanding the Architecture
5) Cluster Managers
6) Executors
7) Launching a Program using spark-submit
8) Local Mode & Cluster-Mode
9) Installing Standalone Cluster
10) Cluster Mode - YARN
11) Launching a Program on YARN
12) Cluster Mode - Mesos and AWS EC2
13) Deployment Modes - Client and Cluster
14) Which Cluster Manager to Use?
15) Common flags for spark-submit
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
1) NoSQL databases are non-relational and schema-free, providing alternatives to SQL databases for big data and high availability applications.
2) Common NoSQL database models include key-value stores, column-oriented databases, document databases, and graph databases.
3) The CAP theorem states that a distributed data store can only provide two out of three guarantees around consistency, availability, and partition tolerance.
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
This document provides instructions for getting started with TensorFlow using a free CloudxLab. It outlines the following steps:
1. Open CloudxLab and enroll if not already enrolled. Otherwise go to "My Lab".
2. In "My Lab", open Jupyter and run commands to clone an ML repository containing TensorFlow examples.
3. Go to the deep learning folder in Jupyter and open the TensorFlow notebook to get started with examples.
Introduction to Deep Learning | CloudxLab
( Machine Learning & Deep Learning Specialization Training: https://goo.gl/goQxnL )
This CloudxLab Deep Learning tutorial helps you to understand Deep Learning in detail. Below are the topics covered in this tutorial:
1) What is Deep Learning
2) Deep Learning Applications
3) Artificial Neural Network
4) Deep Learning Neural Networks
5) Deep Learning Frameworks
6) AI vs Machine Learning
In this tutorial, we will learn the following topics -
+ The Curse of Dimensionality
+ Main Approaches for Dimensionality Reduction
+ PCA - Principal Component Analysis
+ Kernel PCA
+ LLE
+ Other Dimensionality Reduction Techniques
In this tutorial, we will learn the following topics -
+ Voting Classifiers
+ Bagging and Pasting
+ Random Patches and Random Subspaces
+ Random Forests
+ Boosting
+ Stacking
In this tutorial, we will learn the following topics -
+ Training and Visualizing a Decision Tree
+ Making Predictions
+ Estimating Class Probabilities
+ The CART Training Algorithm
+ Computational Complexity
+ Gini Impurity or Entropy?
+ Regularization Hyperparameters
+ Regression
+ Instability
In this tutorial, we will learn the following topics -
+ Linear SVM Classification
+ Soft Margin Classification
+ Nonlinear SVM Classification
+ Polynomial Kernel
+ Adding Similarity Features
+ Gaussian RBF Kernel
+ Computational Complexity
+ SVM Regression
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2wLh5aF
This CloudxLab Introduction to Linux helps you to understand Linux in detail. Below are the topics covered in this tutorial:
1) Linux Overview
2) Linux Components - The Programs, The Kernel, The Shell
3) Overview of Linux File System
4) Connect to Linux Console
5) Linux - Quick Start Commands
6) Overview of Linux File System
4. MapReduce
BIG DATA PROBLEM - PROCESSING
Q: How fast can a 1 GHz processor sort 1 TB of data? The data is made up of 10 billion 100-byte strings.
A: Around 6-10 hours (plausible as a rough estimate: sorting 10 billion items takes on the order of n log n, about 3x10^11 comparisons over 100-byte strings, and 1 TB far exceeds RAM, so disk I/O dominates).
What's wrong with 6-10 hours? We need:
1. Faster sorting
2. Sorting of bigger data
3. Sorting more often
5. MapReduce
BIG DATA PROBLEM - PROCESSING
Google, 8 Sept 2011:
Sorting 10 PB took 6.5 hours on 8,000 computers
6. MapReduce
Why is Sorting Such a Big Deal?
1. Every SQL query is impacted by sorting:
○ Where clause - index (sorting)
○ Group By - involves sorting
○ Joins - immensely enhanced by sorting
○ Distinct
○ Order By
2. Most algorithms depend on sorting
7. MapReduce
THINKING IN MAP / REDUCE
What is Map/Reduce?
• A programming paradigm
• Helps solve Big Data problems
• Specifically sorting-intensive or disk-read-intensive jobs
• You have to code two functions:
• Mapper - converts input into "key-value" pairs
• Reducer - aggregates all the values for a key
8. MapReduce
THINKING IN MAP / REDUCE
What is Map/Reduce?
• Also supported by many other systems, such as
• MongoDB / CouchDB / Cassandra
• Apache Spark
• Mappers & Reducers in Hadoop
• can be written in Java, shell, Python, or any binary
11. MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file containing 100s of text books [500 MB],
how would you find the frequencies of the words?
12. MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file of all the Lord of the Rings books, how
would you find the frequencies of the words?
Approach 1 (Programmatic):
• Create a frequency hash table / dictionary
• For each word in the file
• Increase its frequency in the hash table
• When no more words are left in the file, print the hash table
Problems?
13. MapReduce
THINKING IN MAP / REDUCE
Problems?
[Flowchart: Start → initialize a dictionary/hashtable of (word, count) → read the next word from the file → if a word is left, look it up in the dictionary, adding it with count 0 if it is new, and increase its count by 1 → when no words are left, print the words and their counts → End.]
1. wordcount = {}
2. for word in file.read().split():
3.     if word not in wordcount:
4.         wordcount[word] = 0
5.     wordcount[word] += 1
6. for k, v in wordcount.items():
7.     print(k, v)
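The same logic in idiomatic Python uses collections.Counter, which handles the "new word" branch automatically (a minimal sketch, assuming the whole file fits in RAM; books.txt is a hypothetical input file):

from collections import Counter

# Count word frequencies in a single pass.
with open("books.txt") as f:
    wordcount = Counter(f.read().split())

for word, count in wordcount.items():
    print(word, count)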
14. MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file of all the Lord of the Rings books, how
would you find the frequencies of the words?
Approach 1 (Programmatic):
• Create a frequency hash table / dictionary
• For each word in the file
• Increase its frequency in the hash table
• When no more words are left in the file, print the hash table
Problems?
Cannot process data beyond the size of RAM.
15. MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file of all the Lord of the Rings books, how
would you find the frequencies of the words?
Approach 2 (SQL):
• Break the books into one word per line
• Insert one word per row in a database table
• Execute: select word, count(*) from table group by word
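A minimal sketch of Approach 2 using Python's built-in sqlite3 module (the table name words and the file books.txt are assumptions for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")

# One word per row, then let the database do the grouping.
with open("books.txt") as f:
    conn.executemany("INSERT INTO words VALUES (?)",
                     ((w,) for w in f.read().split()))

for word, count in conn.execute(
        "SELECT word, COUNT(*) FROM words GROUP BY word"):
    print(word, count)

Note that a real database has to push the sorting/hashing behind GROUP BY to disk once the table outgrows RAM, which is exactly where this approach slows down.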
20. MapReduce
THINKING IN MAP / REDUCE
If you have a plain text file of all the Lord of the Rings books, how
would you find the frequencies of the words?
Approach 3 (Unix):
• Replace each space with a newline
• Order the lines with the sort command
• Then find the frequencies using uniq -c
• It scans from top to bottom
• and prints the count whenever the line value changes
cat myfile | sed -E 's/[\t ]+/\n/g' | sort -S 1g | uniq -c
22. MapReduce
THINKING IN MAP / REDUCE
Problems in Approach 2 (SQL) & Approach 3 (Unix)?
The moment the data grows beyond RAM, the time taken
starts increasing sharply. The following become bottlenecks:
• CPU
• Disk Speed
• Disk Space
23. MapReduce
THINKING IN MAP / REDUCE
Then?
Approach 4: Use an external sort.
• Split the file into pieces that fit in RAM
• Use the previous approaches (2 & 3) to find the frequencies of each piece
• Merge (sort -m) the pieces and sum up the frequencies
[Diagram: a launcher splits the input between Machine 1 ("sa, re, re" → re:2, sa:1) and Machine 2 ("ga, ga, re" → ga:2, re:1); merging the two sorted partial counts gives ga:2, re:3, sa:1.]
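A minimal sketch of Approach 4 in Python, with hypothetical run files standing in for the per-machine outputs: count each RAM-sized chunk separately, write the partial counts sorted by word, then merge the sorted runs (like sort -m) and sum the counts:

import heapq
from collections import Counter
from itertools import groupby

def count_chunk(words, run_path):
    # Phase 1: count one RAM-sized chunk and write
    # "word count" lines sorted by word.
    counts = Counter(words)
    with open(run_path, "w") as out:
        for word in sorted(counts):
            out.write("%s %d\n" % (word, counts[word]))

def merge_runs(run_paths):
    # Phase 2: stream-merge the sorted runs and sum the
    # partial counts for each word.
    files = [open(p) for p in run_paths]
    for word, group in groupby(heapq.merge(*files),
                               key=lambda line: line.split()[0]):
        print(word, sum(int(line.split()[1]) for line in group))
    for f in files:
        f.close()

count_chunk("sa re re".split(), "run1.txt")
count_chunk("ga ga re".split(), "run2.txt")
merge_runs(["run1.txt", "run2.txt"])   # ga 2, re 3, sa 1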
24. MapReduce
THINKING IN MAP / REDUCE
Merging
• Takes O(n) time to merge sorted data
• That is, the time is proportional to the number of
elements to be merged
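For example, merging two sorted lists takes one pass over both (a minimal sketch):

def merge(a, b):
    # One pass over both sorted lists: O(len(a) + len(b)).
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

# merge([1, 4, 6], [3, 5, 7]) -> [1, 3, 4, 5, 6, 7]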
41. MapReduce
THINKING IN MAP / REDUCE
Merging
• For more than two lists
○ Use a min-heap
[Diagram: six sorted lists — (1, 4, 6), (9, 10, 12), (6, 7, 8), (8, 9, 9), (3, 5, 7), (5, 10, 17). A min-heap holds the current head of each list; the smallest head is popped to the output and replaced by the next element of its list.]
42. MapReduce
THINKING IN MAP / REDUCE
Merging
• For more than two lists
○ Use a min-heap
[Same diagram, one step later: the minimum, 1, has been emitted to the output.]
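Python's heapq.merge implements exactly this min-heap strategy; a quick check with the six lists from the diagram:

import heapq

lists = [[1, 4, 6], [9, 10, 12], [6, 7, 8],
         [8, 9, 9], [3, 5, 7], [5, 10, 17]]
# A min-heap holds the head of each list; the smallest head
# is popped to the output and replaced by its list's next element.
print(list(heapq.merge(*lists)))
# [1, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 9, 10, 10, 12, 17]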
43. MapReduce
THINKING IN MAP / REDUCE
Merging
• For more than two lists
○ Or merge two lists at a time
44. MapReduce
THINKING IN MAP / REDUCE
Problems with Approach 4?
[Same diagram as before: the launcher distributes "sa, re, re" to Machine 1 and "ga, ga, re" to Machine 2, and the partial counts are merged into ga:2, re:3, sa:1.]
45. MapReduce
THINKING IN MAP / REDUCE
Problems with the external sort?
Time is consumed in transporting the data.
+
For each requirement we would need to write a
special-purpose, network-oriented program.
+
It would require a lot of engineering.
Solution?
Use Map/Reduce
46. MapReduce
THINKING IN MAP / REDUCE
What is Map/Reduce?
• A programming paradigm
• Helps solve Big Data problems
• Specifically sorting-intensive or disk-read-intensive jobs
• You have to code two functions:
• Mapper - converts input into "key-value" pairs
• Reducer - aggregates all the values for a key
47. MapReduce
THINKING IN MAP / REDUCE
What is Map/Reduce?
• Also supported by many other systems, such as
• MongoDB / CouchDB / Cassandra
• Apache Spark
• Mappers & Reducers in Hadoop
• can be written in Java, shell, Python, or any binary
48. MapReduce
EXAMPLE OF ONLY MAPPER
Directory of profile pictures in HDFS
→ Machine 1, Machine 2, and Machine 3 each run
Function Mapper (Image): convert the image to 100x100 pixels
→ HDFS output directory of 100x100 px profile pictures
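A rough sketch of such a mapper (assuming the Pillow library and a setup where each mapper reads image file paths, one per line, on stdin; the output naming is illustrative):

import sys
from PIL import Image

# Mapper-only job: no key-value pairs are needed, only a
# side effect per input record (one image path per line).
for line in sys.stdin:
    path = line.strip()
    img = Image.open(path)
    img.resize((100, 100)).save("resized_" + path.split("/")[-1])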
51. MapReduce
MAP / REDUCE
Mapper for the word frequency problem:
function map(line):
    foreach(word in line):
        print(word, 1)
[Input in HDFS: "sa re re" and "sa ga" → mapper output: (sa, 1), (re, 1), (re, 1), (sa, 1), (ga, 1)]
52. MapReduce
MAP / REDUCE
Mapper/Reducer for the word frequency problem:
function map(line):
    foreach(word in line):
        print(word, 1)
function reduce(word, freqArray):
    return Array.sum(freqArray)
[Input in HDFS: "sa re re" and "sa ga" → mapper output: (sa, 1), (re, 1), (re, 1), (sa, 1), (ga, 1) → grouped for the reducer: ga [1], re [1, 1], sa [1, 1] → reducer output: ga 1, re 2, sa 2]
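As runnable Hadoop Streaming scripts, the same pair might look like this (a minimal sketch; Streaming feeds the reducer its lines already sorted by key, which is what makes the grouping work):

mapper.py:
#!/usr/bin/env python
import sys

# Emit "word<TAB>1" for every word of every input line.
for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

reducer.py:
#!/usr/bin/env python
import sys
from itertools import groupby

# Input arrives sorted by key, so all lines for a word are adjacent.
for word, group in groupby(sys.stdin, key=lambda l: l.split("\t")[0]):
    print("%s\t%d" % (word, sum(int(l.split("\t")[1]) for l in group)))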
54. MapReduce
Mapper/Reducer for computing the max temperature.
Input lines are "Temp, City, Date":
20, NYC, 2014-01-01
20, NYC, 2015-01-01
21, NYC, 2014-01-02
23, BLR, 2012-01-01
25, SEATTLE, 2016-01-01
21, CHICAGO, 2013-01-05
24, NYC, 2016-05-05
def mapp(line):
    (t, c, date) = line.split(",")
    print(c, (t, date))
def reduce(key, values):
    maxt = -19191919
    date = ''
    for i in values:
        t = i[0]
        if t > maxt:
            maxt = t
            date = i[1]
    return (maxt, date)
[The mapper emits (city, (temp, date)) pairs; after the shuffle/sort, the reducer receives all the (temp, date) pairs for one city and returns the maximum, e.g. NYC → (24, '2016-05-05').]
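In a real Streaming job the reducer does not receive a ready-made values array; it reads sorted "city<TAB>temp,date" lines and must do the grouping itself. A hedged sketch of the equivalent reducer (the line format is an assumption, matching a mapper that emits city as the key):

#!/usr/bin/env python
import sys
from itertools import groupby

# All lines for one city arrive consecutively after the shuffle/sort.
for city, group in groupby(sys.stdin, key=lambda l: l.split("\t")[0]):
    maxt, maxdate = -19191919, ''
    for line in group:
        t, date = line.strip().split("\t")[1].split(",")
        if int(t) > maxt:
            maxt, maxdate = int(t), date
    print("%s\t%d\t%s" % (city, maxt, maxdate))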
55. MapReduce
MAP / REDUCE
Analogous to Group By
function map(line):
    (temp, city, time) = line.split(",")
    print(city, temp)
function reduce(city, arr_temps):
    return max(arr_temps)
select city, max(temp)
from table
group by city
56. MapReduce
MAP / REDUCE
Analogous to Group By
function map(input):
    foreach(word in input):
        print(word, 1)
function reduce(word, freqArray):
    return Array.sum(freqArray)
select word, count(*)
from table
group by word
57. MapReduce
MAP REDUCE - Multiple Reducers
[Diagram: three HDFS blocks feed three map tasks (Split 0, Split 1, Split 2). Each map task sorts its output locally; the sorted map outputs are then copied to the reducers and merged. The reducers write Part 0 and Part 1 back to HDFS. Keys such as Apple, Banana, Apricot, and Carrots are partitioned between the two reducers.]
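Which reducer a given key lands on is decided by the partitioner. Hadoop's default HashPartitioner behaves conceptually like the sketch below (not Hadoop's exact hash function):

def partition(key, num_reducers):
    # The same key always maps to the same reducer, so all
    # values for a key meet in one place.
    return hash(key) % num_reducers

# With 2 reducers, e.g. "Apple" and "Apricot" may land on reducer 0
# while "Banana" and "Carrots" land on reducer 1.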
62. MapReduce
MAP / REDUCE
The data generated by the mapper is given to the
reducer and then sorted/shuffled [Yes/No]?
63. MapReduce
MAP / REDUCE
The data generated by the mapper is given to the
reducer and then sorted/shuffled [Yes/No]?
No. The output of the mapper is first
shuffled/sorted and then given to the reducers.
64. MapReduce
MAP / REDUCE
The mapper can only generate a single key-value
pair for an input value [True/False]?
65. MapReduce
MAP / REDUCE
The mapper can only generate a single key-value
pair for an input value [True/False]?
False. A mapper can generate as many key-value pairs
as it wants for an input.
66. MapReduce
MAP / REDUCE
A mapper always has to generate at least one
key-value pair [Correct/Wrong]?
67. MapReduce
MAP / REDUCE
A mapper always has to generate at least one
key-value pair [Correct/Wrong]?
Wrong.
69. MapReduce
MAP / REDUCE
By default there is only one reducer in a
streaming job [Yes/No]?
Yes. By default there is a single reducer, but the
number of reducers can be changed with the
command-line option mapred.reduce.tasks.
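For example, a Streaming job with two reducers might be launched like this (the jar location and the input/output paths are illustrative):

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
  -D mapred.reduce.tasks=2 \
  -input /data/mr/wordcount/input -output wcout \
  -mapper mapper.py -reducer reducer.py \
  -file mapper.py -file reducer.py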
70. MapReduce
MAP / REDUCE
In Hadoop 1.0, what is the role of the job tracker?
A: Executing the Map/Reduce logic
B: Delegating the Map/Reduce logic to the task
trackers
71. MapReduce
MAP / REDUCE
In Hadoop 1.0, what is the role of the job tracker?
A: Executing the Map/Reduce logic
B: Delegating the Map/Reduce logic to the task
trackers
B.
72. MapReduce
MAP / REDUCE
Q: The Map logic is preferably executed on the
nodes that have the required data [Yes/No]?
73. MapReduce
MAP / REDUCE
Q: The Map logic is preferably executed on the
nodes that have the required data [Yes/No]?
Yes.
74. MapReduce
MAP / REDUCE
Q: The Map logic is always executed on the nodes
that have the required data [Correct/Wrong]?
77. MapReduce
MAP / REDUCE
Where does Hadoop store the result of the reducer?
In HDFS or the local file system?
In HDFS.
78. MapReduce
MAP / REDUCE
Where does Hadoop store the intermediate data,
such as the output of map tasks?
In HDFS, the local file system, or memory?
79. MapReduce
MAP / REDUCE
Where does Hadoop store the intermediate data,
such as the output of map tasks?
In HDFS, the local file system, or memory?
First in memory, then spilled to the
local file system.
The output of the mapper is saved in HDFS directly
only if there is no reduce phase.
80. MapReduce
MAP / REDUCE Assignment For Tomorrow
1. Frequencies of letters [a-z] - do you need Map/Reduce?
2. Find anagrams in a huge text. An anagram is basically a
different arrangement of the letters in a word. An anagram
does not need to have a meaning.
Input:
"the cat act in tic tac toe"
Output:
cat, tac, act
the
toe
in
tic
81. MapReduce
MAP / REDUCE Assignment For Tomorrow
3a. A file contains the DNA sequences of people. Find all the
people who have the same DNA.
Input:
"User1 ACGT"
"User2 TGCA"
"User3 ACG"
"User4 ACGT"
"User5 ACG"
"User6 AGCT"
Output:
User1, User4
User2
User3, User5
User6
82. MapReduce
MAP / REDUCE Assignment For Tomorrow
3b. A file contains the DNA sequences of people. Find all the
people who have the same DNA or a mirror image of it.
Input:
"User1 ACGT"
"User2 TGCA"
"User3 ACG"
"User4 ACGT"
"User5 ACG"
"User6 ACCT"
Output:
User1, User2, User4
User3, User5
User6
83. MapReduce
MAP / REDUCE Assignment For Tomorrow
4. In an unusual democracy, not everyone is equal. The vote count is a
function of the worth of the voter, though anyone may vote for anyone else.
For example, if A with a worth of 5 and B with a worth of 1 both vote
for C, the vote count of C is 6.
You are given a list of people with the worth of their vote (List 2). You are
also given another list describing who voted for whom (List 1).
Find out the vote count of everyone.
List 1:
Voter  Votee
A      C
B      C
C      F
List 2:
Person  Worth
A       5
B       1
C       11
Result:
Person  VoteCount
A       0
B       0
C       6
F       11
89. MapReduce
QUICK - CLUSTER HANDS ON
MapReduce Command
The example is available here.
Remove the old output directory:
hadoop fs -rm -r /user/student/wordcount/output
Execute the MapReduce command:
hadoop jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /data/mr/wordcount/input mrout
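Once the job finishes, the counts can be inspected in the output directory (assuming the job wrote to mrout as above):

hadoop fs -ls mrout
hadoop fs -cat mrout/part-r-00000 | head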