WELCOME TO BIG DATA TRAINING

Abhishek Mukherjee
Utkarsh Srivastava
12th September
Not everything that can be counted counts, and not
everything that counts can be counted.

What are we going to cover today?
• Uses of Big Data
• What is Hadoop?
• A short intro to the HDFS architecture
• What is Map Reduce?
• The components of the Map Reduce algorithm
• The "Hello world" of Map Reduce, i.e. the Word Count algorithm
• Tips and tricks of Map Reduce

What is Big Data?
• Big data is an evolving term that describes any voluminous
amount of structured, semi-structured and
unstructured data that has the potential to be mined for
information.
• Lots of data (terabytes, petabytes or zettabytes)
• Systems and enterprises generate huge amounts of data, from
terabytes to even petabytes of information.
• An airline jet collects 10 terabytes of sensor data for every 30
minutes of flying time.

WHY BIG DATA?
Serial (sequential) vs parallel processing

WHY BIG DATA?
Walmart has exhaustive customer data on
close to 145 million Americans, of which 60%
of the data is on U.S. adults. Walmart tracks
and targets every consumer individually.
Walmart observed a significant 10% to 15%
increase in online sales, for $1 billion in
incremental revenue.

Differentiating Factors:
Accessible
Robust
Scalable
Simple

Map Reduce Algorithm
Key points
• Map Phase
• Combiner Phase (Optional)
• Sort Phase
• Shuffle Phase
• Partition Phase (Optional)
• Reducer Phase

Map Phase
Imagine this as the input file:
• Hello my name is abhishek Hello my name is utsav
• Hello my passion is cricket
This file has 2 lines. Each line has its own byte offset within
the file, which serves as the key passed to the mapper. The
value passed to the mapper is the text present in that
line.
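The map phase described above can be sketched in plain Python. This is only an illustration, not Hadoop code: the helper names `read_lines_with_offsets` and `word_count_map` are hypothetical, and the input is held in memory instead of being read from HDFS.

```python
def read_lines_with_offsets(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of the input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        yield offset, raw_line.decode("utf-8").rstrip("\r\n")
        offset += len(raw_line)  # the next line starts right after this one


def word_count_map(offset, line):
    """Mapper: the offset key is not needed here; emit (word, 1) per word."""
    for word in line.split():
        yield word, 1


# The two-line input file from the slide (assumed to be newline-terminated).
input_file = (
    b"Hello my name is abhishek Hello my name is utsav\n"
    b"Hello my passion is cricket\n"
)

records = list(read_lines_with_offsets(input_file))
map_output = [
    pair
    for offset, line in records
    for pair in word_count_map(offset, line)
]

for word, count in map_output:
    print(word, count)
```

Each record handed to the mapper is one (offset, line) pair, and the mapper emits one (word, 1) pair per word, which is the list shown on the next slide.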

Operation on output of map phase
Output of map phase, Key value:
Hello 1
my 1
name 1
is 1
abhishek 1
Hello 1
my 1
name 1
is 1
utsav 1
Hello 1
my 1
passion 1
is 1
cricket 1
After sort and shuffle, Key(tuple of values):
Hello(1,1,1)
my(1,1,1)
name(1,1,1)
is(1,1,1)
abhishek(1)
utsav(1)
passion(1)
cricket(1)

Explanation of sort and shuffle phase
• The key points are as follows:
• Sort the key-value pairs according to their keys.
• Shuffle the mapped output so that values with the same key are
collected into a tuple of values for that key.
• This output is fed to the reducer, which in turn reduces the
values of each tuple, returning a single value for the list of
values present in the tuple.
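A minimal Python sketch of the sort and shuffle step, under the assumption that all map output fits in memory on one machine (in Hadoop this work is spread across nodes). The function name `sort_and_shuffle` is hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def sort_and_shuffle(map_output):
    """Sort (key, value) pairs by key, then group values that share a key."""
    sorted_pairs = sorted(map_output, key=itemgetter(0))
    return [
        (key, tuple(value for _, value in group))
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    ]


# Map output for the word count example: one (word, 1) pair per word.
text = (
    "Hello my name is abhishek Hello my name is utsav "
    "Hello my passion is cricket"
)
map_output = [(word, 1) for word in text.split()]

grouped = sort_and_shuffle(map_output)
for key, values in grouped:
    print(key, values)
```

The key order here is Python's default string ordering, used only for illustration. What matters is that every key appears once, with all of its values collected into one tuple.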

Reducer phase
Input to the reducer, Key(tuple of values):
Hello(1,1,1)
my(1,1,1)
name(1,1,1)
is(1,1,1)
abhishek(1)
utsav(1)
passion(1)
cricket(1)
Output of the reducer, Key(single value):
abhishek(1)
cricket(1)
Hello(3)
is(3)
my(3)
name(3)
passion(1)
utsav(1)
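The reducer step for word count can be sketched in Python as follows. The function name `word_count_reduce` is hypothetical, and the grouped input is typed in by hand from the slide instead of coming from a real shuffle.

```python
def word_count_reduce(key, values):
    """Reducer: collapse the tuple of values for one key into a single sum."""
    return key, sum(values)


# Key(tuple of values) pairs, as listed on the slide.
grouped = [
    ("Hello", (1, 1, 1)),
    ("my", (1, 1, 1)),
    ("name", (1, 1, 1)),
    ("is", (1, 1, 1)),
    ("abhishek", (1,)),
    ("utsav", (1,)),
    ("passion", (1,)),
    ("cricket", (1,)),
]

counts = dict(word_count_reduce(key, values) for key, values in grouped)
for word, total in counts.items():
    print(word, total)
```

The reducer is called once per key, so each word ends up with exactly one total.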

BASIC HADOOP COMMANDS
• sudo su : temporarily become the super user.
• hadoop fs -ls /
• hadoop fs -mkdir /mycreatedfolderinhdfs
• hadoop fs -put /usr/directoryinlocal /user/root/directoryinhdfs
• hadoop fs -get /user/root/mycreatedfolderinhdfs /usr/folderinlocal
• hadoop fs -rm -r /mycreatedfolderinhdfs (hadoop fs -rmr on older releases)
• hadoop jar com.bigdata.session.hadoop.tool.jar {sourcepath} {destinationpath}

Types of splits (Parallel processing in action):
• Two types of splitting of input files are possible:
• HDFS split: Splitting of files into blocks of fixed size, e.g.
splitting a file into blocks of 64 MB to promote parallel
processing.
• N line split: Splitting of files into pieces with a fixed number of
lines each to promote parallel processing.
• Let's see an example in the next slide.
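As a rough illustration of the fixed-size split, the arithmetic can be sketched in Python. The function name `fixed_size_splits` and the 200 MB file size are assumptions made for this example. Only the 64 MB block size comes from the slide.

```python
MB = 1024 * 1024


def fixed_size_splits(file_size, block_size=64 * MB):
    """Return (start_offset, length) in bytes for each fixed-size block."""
    return [
        (start, min(block_size, file_size - start))
        for start in range(0, file_size, block_size)
    ]


# Hypothetical 200 MB file: three full 64 MB blocks plus one 8 MB remainder.
splits = fixed_size_splits(200 * MB)
for start, length in splits:
    print(start // MB, "MB ->", length // MB, "MB")
```

Only the last block may be smaller than the block size, and each block can be processed by a different mapper.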

N LINE SPLITTING:
• Consider this as the input file:
Map reduce is a framework based on processing of
data in parallel. This algorithm consists of three phases,
namely map, shuffle and sort, and reduce. Here we will
observe the effect of the n line splitter on the number of
map tasks, i.e. the number of mappers created. This will
create a better understanding of how a file splits.
Can you guess what will happen?

N LINE SPLITTING contd.
• Assume the value of n is 3.
Split 1:
Map reduce is a framework based on processing of
data in parallel. This algorithm consists of three phases,
namely map, shuffle and sort, and reduce. Here we will
Split 2:
observe the effect of the n line splitter on the number of
map tasks, i.e. the number of mappers created. This will
create a better understanding of how a file splits.
So these two splits of the file will be sent to two different
mappers, while in the case of an HDFS split the amount of data
sent to each mapper depends on the size of the respective
split.
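The n line split above can be sketched in Python. This is a simplified stand-in, not the Hadoop implementation: the function name `n_line_splits` is hypothetical, and the six-line input file is held in a list.

```python
def n_line_splits(lines, n):
    """Cut a list of lines into consecutive splits of at most n lines each."""
    return [lines[i:i + n] for i in range(0, len(lines), n)]


# The six-line input file from the slide, one list entry per line.
input_lines = [
    "Map reduce is a framework based on processing of",
    "data in parallel. This algorithm consists of three phases,",
    "namely map, shuffle and sort, and reduce. Here we will",
    "observe the effect of the n line splitter on the number of",
    "map tasks, i.e. the number of mappers created. This will",
    "create a better understanding of how a file splits.",
]

splits = n_line_splits(input_lines, 3)
print("number of mappers:", len(splits))
for number, split in enumerate(splits, start=1):
    print("split", number, "has", len(split), "lines")
```

With n = 3 the six lines give two splits, so two mappers are created. Changing n changes the number of mappers, regardless of how many bytes each line holds.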

Data Types available in Map Reduce
• Hadoop uses its own serialization format, Writables, which
is compact and fast. Data needs to be serialized
to be sent over the network.
Thus we see that these
serialized data types
are equivalents of Java
data types.

Tips for optimizing map reduce code:
• Combiner optimization
• Partitioner optimization
• Custom Writables
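The first two tips can be sketched in Python for the word count example. This is an illustration under assumptions, not Hadoop code: `combine` and `partition` are hypothetical names, `zlib.crc32` merely stands in for a hash function, and 2 reducers is an arbitrary choice.

```python
import zlib
from collections import Counter


def combine(map_output):
    """Combiner: pre-sum the counts per word before anything is shuffled."""
    totals = Counter()
    for word, count in map_output:
        totals[word] += count
    return sorted(totals.items())


def partition(key, num_reducers):
    """Partitioner: pick a reducer from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# Output of one mapper for the first line of the example input file.
line = "Hello my name is abhishek Hello my name is utsav"
map_output = [(word, 1) for word in line.split()]

combined = combine(map_output)
print(len(map_output), "pairs before combining,", len(combined), "after")

# Route each combined pair to one of 2 reducers (assumed reducer count).
by_reducer = {}
for word, count in combined:
    by_reducer.setdefault(partition(word, 2), []).append((word, count))
```

The combiner reduces how many pairs cross the network, and the partitioner guarantees that all pairs with the same key reach the same reducer.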

CONTACT DETAILS
Abhishek Mukherjee: scobbyabhi9@gmail.com, No. 9629341857
Utkarsh Srivastava: utkarshsrivastava538@gmail.com, No. 9629341221
