This document provides an introduction to using Hadoop for big data analysis. It discusses the growth of data and the challenges of big data, and introduces the MapReduce programming model and how it was popularized by Apache Hadoop. It describes the core components of Hadoop, including the Hadoop Distributed File System (HDFS) and the MapReduce framework. It also briefly discusses the Hadoop ecosystem, including tools like Pig, Hive, HBase, and ZooKeeper that build on the Hadoop platform.
Slides for the talk given at IEEE BigData 2013, Santa Clara, USA, on October 7, 2013. The full-text paper is available at http://goo.gl/WTJoxm. To cite, please refer to http://dx.doi.org/10.1109/BigData.2013.6691637.
When two of the most powerful innovations in modern analytics come together, the result is revolutionary. This session will provide an overview of R, the open-source programming language, used by more than 2 million people, that was developed specifically for statistical analysis and data visualization. It will discuss the ways that R and Hadoop have been integrated and look at a use case that provides real-world experience. Finally, it will offer suggestions for how enterprises can take advantage of both of these industry-leading technologies.
This document provides an overview of Hadoop and the Hadoop ecosystem. It discusses key Hadoop concepts like HDFS, MapReduce, YARN and data locality. It also summarizes SQL on Hadoop using tools like Hive, Impala and Spark SQL. The document concludes with examples of using Sqoop and Flume to move data between relational databases and Hadoop.
Hadoop is a framework for distributed storage and processing of large datasets across commodity hardware. It consists of HDFS for distributed file storage and MapReduce for distributed computation. HDFS divides files into blocks and replicates them across nodes for reliability. MapReduce processes large datasets in parallel by splitting jobs into tasks executed across the cluster. Hadoop grew out of systems described in Google's papers and was developed largely at Yahoo!; it is designed to handle failures reliably and to deliver high performance at large scale.
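To make the split between map and reduce concrete, here is a minimal word-count sketch against the Hadoop Java MapReduce API; the class names TokenMapper and SumReducer are illustrative, not taken from the original slides.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: runs once per input split and emits (word, 1) for every token it sees.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: receives every count emitted for one word and sums them.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(word, new IntWritable(sum));
    }
}
```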
So you want to get started with Hadoop, but how? This session will show you how to get started with Hadoop development using Pig. Prior Hadoop experience is not needed. Thursday, May 8th, 2:00pm-2:50pm.
What is big data? What is Hadoop? The Hadoop ecosystem. In which sectors can big data be used? Twitter analysis.
The document provides an overview of distributed computing using Apache Hadoop. It discusses how Hadoop uses the MapReduce model to parallelize tasks across large clusters of commodity hardware: jobs are broken into map and reduce phases to distribute the processing of large amounts of data. The document also notes that Hadoop is an open-source framework used by many large companies to solve problems involving petabytes of data through fault-tolerant batch processing.
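A job like this is wired together and submitted through a driver class. The sketch below assumes the hypothetical TokenMapper and SumReducer from the earlier word-count example and takes input and output paths from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);   // mapper from the earlier sketch
        job.setReducerClass(SumReducer.class);   // reducer from the earlier sketch
        job.setCombinerClass(SumReducer.class);  // pre-aggregates on each node to cut shuffle traffic
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // blocks until the job finishes
    }
}
```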
Introduction to Hadoop. What are Hadoop, MapReduce, and the Hadoop Distributed File System? Who uses Hadoop? How do you run Hadoop? What are Pig, Hive, and Mahout?
This document provides an overview of Apache Hadoop, including its architecture, components, and applications. Hadoop is an open-source framework for distributed storage and processing of large datasets. It uses Hadoop Distributed File System (HDFS) for storage and MapReduce for processing. HDFS stores data across clusters of nodes and replicates files for fault tolerance. MapReduce allows parallel processing of large datasets using a map and reduce workflow. The document also discusses Hadoop interfaces, Oracle connectors, and resources for further information.
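As a small illustration of one Hadoop interface, the sketch below reads a file from HDFS through the Java FileSystem API; the path /user/demo/input.txt is a made-up example, and block placement and replication remain invisible to the client.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);          // handle to the configured file system
        Path file = new Path("/user/demo/input.txt");  // hypothetical HDFS path
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);              // blocks are fetched from datanodes transparently
            }
        }
    }
}
```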
The document discusses the family of Hadoop projects. It describes the history and origins of Hadoop, starting with Doug Cutting's work on Nutch and the implementation of Google's papers on MapReduce and the Google File System. It then summarizes several major Hadoop sub-projects, including HDFS for storage, MapReduce for distributed processing, HBase for structured storage, and Hive for data warehousing. For each project, it provides a brief overview of the architecture, data model, and programming interfaces.
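To give a feel for HBase's data model and programming interface, here is a minimal put/get sketch with the HBase Java client; the table name "pages" and column family "content" are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("pages"))) {  // hypothetical table
            // Rows are keyed byte arrays; cells live under a column family and qualifier.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("title"),
                          Bytes.toBytes("Hello HBase"));
            table.put(put);

            // Read the cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("content"), Bytes.toBytes("title"))));
        }
    }
}
```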
Working with Hive and finding data insights from datascience.stackexchange.com. Problem: find the top 10 users on datascience.stackexchange.com.
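One way to run such a query from Java is over HiveServer2's JDBC interface, as in the sketch below; the endpoint, credentials, and the users table schema (display_name, reputation) are assumptions, not taken from the original material.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TopUsersQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // driver jar must be on the classpath
        // Hypothetical HiveServer2 endpoint and anonymous credentials.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT display_name, reputation FROM users " +
                 "ORDER BY reputation DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```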
This document provides an introduction and overview of Apache Hadoop. It begins with an outline and discusses why Hadoop is important given the growth of data. It then describes the core components of Hadoop - HDFS for distributed storage and MapReduce for distributed computing. The document explains how Hadoop is able to provide scalability and fault tolerance. It provides examples of how Hadoop is used in production at large companies. It concludes by discussing the Hadoop ecosystem and encouraging questions.
Dalbey, Timothy. "R, Hadoop and Amazon Web Services (PPT)." Portland R Users Group, 20 December 2012.
The document discusses using Hadoop and Hive at Zing to build a log collection, analysis, and reporting system. Scribe is used for fast log collection, storing the data in Hadoop/Hive, and Hive provides SQL-like queries to analyze large datasets. The system transforms logs into Hive tables, runs analysis jobs in Hive, and then exports the results to MySQL for web reporting, providing a scalable, high-performance alternative to the initial RDBMS-only system.
Rakuten Inc. uses Hadoop for various purposes, including generating recommendation indexes, analyzing logs, and calculating metrics. Their current Hadoop system comprises a cluster with 3 masters and 69 slaves, Ganglia monitoring, and high availability via DRBD and Heartbeat. It provides benefits over their previous system such as lower costs, improved scalability, and faster transaction times. However, they still face challenges around exhausting HDFS space and fully realizing their data warehouse goals with the new system.
Shark is a SQL query engine built on top of Spark, a fast MapReduce-like engine. It extends Spark to support SQL and complex analytics efficiently while maintaining the fault tolerance and scalability of MapReduce. Shark uses techniques from databases like columnar storage and dynamic query optimization to improve performance. Benchmarks show Shark can perform SQL queries and machine learning algorithms faster than traditional MapReduce systems like Hive and Hadoop. The goal of Shark is to provide a unified system for both SQL and complex analytics processing at large scale.
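Shark itself has since been superseded by Spark SQL, but the idea of SQL and programmatic analytics sharing one engine carries over. Below is a minimal sketch with the Spark SQL Java API; the input file events.json and its (user, score) schema are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SqlOnSpark {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("sql-on-spark")
            .master("local[*]")   // local mode keeps the sketch self-contained
            .getOrCreate();

        // Hypothetical input: a JSON file of (user, score) records.
        Dataset<Row> events = spark.read().json("events.json");
        events.createOrReplaceTempView("events");

        // The same engine serves SQL queries and programmatic analytics over one dataset.
        Dataset<Row> top = spark.sql(
            "SELECT user, SUM(score) AS total FROM events GROUP BY user ORDER BY total DESC");
        top.show();

        spark.stop();
    }
}
```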
The document provides information about Hive and Pig, two frameworks for analyzing large datasets using Hadoop. It compares Hive and Pig, noting that Hive uses a SQL-like language called HiveQL to manipulate data, while Pig uses Pig Latin scripts and operates on data flows. The document also includes code examples demonstrating how to use basic operations in Hive and Pig like loading data, performing word counts, joins, and outer joins on sample datasets.
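For flavor, here is a minimal Pig Latin word count driven from Java through the PigServer API; the input file and output directory names are illustrative.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
    public static void main(String[] args) throws Exception {
        // Local mode keeps the sketch self-contained; a cluster run would use ExecType.MAPREDUCE.
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        pig.store("counts", "wordcount_out");  // hypothetical output directory
    }
}
```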
Raymie Stata, former CTO of Yahoo!, talks about YARN, Hadoop's new resource manager, and other improvements in Hadoop 2.0.
The document discusses how MapReduce can be used for various tasks related to search engines, including detecting duplicate web pages, processing document content, building inverted indexes, and analyzing search query logs. It provides examples of MapReduce jobs for normalizing document text, extracting entities, calculating ranking signals, and indexing individual words, phrases, stems and synonyms.
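As a sketch of the inverted-index pattern mentioned above, the Hadoop Java code below emits (word, documentId) pairs in the mapper and merges them into posting lists in the reducer; using the input file name as the document id is an assumption about the corpus layout.

```java
import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: emits (word, documentId) for each token; the document id is the
// name of the input file containing the current split.
class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text word = new Text();
    private final Text docId = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        docId.set(fileName);
        for (String token : line.toString().toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, docId);
            }
        }
    }
}

// Reducer: collects the unique document ids seen for each word and writes
// one sorted posting list per term.
class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> docIds, Context context)
            throws IOException, InterruptedException {
        Set<String> docs = new TreeSet<>();
        for (Text d : docIds) docs.add(d.toString());
        context.write(word, new Text(String.join(",", docs)));
    }
}
```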
Google's MapReduce paper, presented in 2004, is the inspiration for Hadoop. Let's take a deep dive into MapReduce to better understand Hadoop.
This document provides an overview and agenda for a presentation on how Google handles big data. The presentation covers Google Cloud Platform and how it can be used to run Hadoop clusters on Google Compute Engine and leverage BigQuery for analytics. It also discusses how Google processes big data internally using technologies like MapReduce, BigTable and Dremel and how these concepts apply to customer use cases.