Hadoop introduction
- 4. Hadoop Ecosystem
Zookeeper (Coordination)
Oozie (Workflow)
Sqoop (Data Exchange)
HBase (Columnar Store)
Hive (SQL Query)
Pig (Scripting)
Mahout (Data Mining)
MapReduce (Distributed Processing Framework)
HDFS (Hadoop Distributed File System)
Flume (Data Ingestion)
Avro (Data Serialization System)
Storm, Kafka, Tajo, Impala (Real-Time Processing)
Scala (Programming Language)
- 6. Hadoop
An open source project from the Apache Software Foundation
It provides a software framework for distributing and running
applications on clusters of servers
It is inspired by Google's Map-Reduce programming model as well as its
file system (GFS)
Hadoop was originally written for the Nutch search engine project
- 7. Hadoop
Hadoop is an open-source framework written in Java
It efficiently processes large volumes of data on a cluster of commodity
hardware
Hadoop can be set up on a single machine, but the real power of Hadoop
comes with a cluster of machines; it can be scaled from a single machine
to thousands of nodes on the fly
Hadoop consists of two key parts: the Hadoop Distributed File System
(HDFS) and Map-Reduce
- 9. HDFS
HDFS is a highly fault-tolerant, distributed, reliable, and scalable file
system for data storage
HDFS stores multiple copies of data on different nodes; a file is split
into blocks (default 64 MB) and the blocks are stored across multiple machines
A Hadoop cluster typically has a single namenode and a number of
datanodes, which together form the HDFS cluster
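The block-splitting and replication idea above can be sketched in plain Java. The class name and the round-robin placement below are illustrative only (real HDFS uses a rack-aware placement policy); the sketch assumes the 64 MB default block size and 3x replication mentioned in the slides:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of HDFS block splitting and replica placement.
// Not the real HDFS API; names and placement policy are hypothetical.
public class HdfsBlockSketch {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB default block
    static final int REPLICATION = 3;                 // default replication factor

    // Number of blocks needed to store a file of the given size (ceiling division).
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Assign each block to REPLICATION distinct datanodes, round-robin.
    static List<List<Integer>> placeBlocks(long fileSizeBytes, int numDataNodes) {
        List<List<Integer>> placement = new ArrayList<>();
        for (long b = 0; b < blockCount(fileSizeBytes); b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add((int) ((b + r) % numDataNodes));
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024; // a 200 MB file
        // 200 MB / 64 MB -> 4 blocks (64 + 64 + 64 + 8 MB)
        System.out.println("blocks: " + blockCount(fileSize));
        System.out.println("placement: " + placeBlocks(fileSize, 5));
    }
}
```

Because every block lives on several datanodes, the loss of any single machine leaves at least two copies of each block intact, which is the basis of the fault tolerance described above.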
- 11. Map-Reduce
Map-Reduce is a programming model for processing large volumes of
data in parallel by dividing the work into a set of independent tasks
Map-Reduce is a paradigm for distributed processing of large data sets
across a cluster of nodes
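The map, shuffle, and reduce phases can be sketched in plain Java without the Hadoop API; the class and method names here are illustrative, and the shuffle is simulated locally, whereas real Hadoop runs each phase distributed across nodes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative word count in the map/reduce style (not the Hadoop API).
public class MapReduceSketch {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: each line is an independent task emitting (word, 1) pairs.
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) emitted.add(Map.entry(word, 1));
            }
        }
        // Shuffle phase: group the emitted values by key (word).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : emitted) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        // Reduce phase: sum the grouped counts for each word.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).sum());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("hadoop stores data", "hadoop processes data");
        System.out.println(wordCount(input)); // {data=2, hadoop=2, processes=1, stores=1}
    }
}
```

The key property is that each map task touches only its own line and each reduce task touches only its own key, so both phases can run in parallel across many nodes.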
- 12. Hadoop Daemons
Namenode (Master) (HDFS)
SecondaryNameNode (checkpointing helper for the Namenode, typically run on a separate machine; not a failover standby)
JobTracker (Master) (MR)
DataNode (Slave) (HDFS)
TaskTracker (Slave) (MR)
- 13. Hadoop Daemons
[Diagram: Hadoop daemons architecture, showing the two layers of a Hadoop cluster]
Computation layer (Map-Reduce): jobs are submitted to the JobTracker, which distributes tasks to the TaskTrackers
Storage layer (HDFS): the NameNode stores the metadata; the DataNodes store the actual data blocks
- 14. Hadoop Cluster
[Diagram: the master node runs the namenode (storage layer) and the jobtracker (Map-Reduce layer); each slave node runs a datanode and a tasktracker]
- 15. Characteristics
Open-source
◦ Code can be modified according to business requirements
Distributed Processing
◦ Data is processed in parallel on a cluster of nodes in a distributed manner
Fault Tolerance
◦ Failures of nodes or tasks are recovered automatically by the framework
- 16. Characteristics
Reliability
◦ Data is reliably stored on the cluster of machines despite machine failures
High Availability
◦ Data is highly available and accessible despite hardware failure
Scalability
◦ New hardware can be added to the nodes
◦ Horizontal scalability: new nodes can be added on the fly
- 17. Characteristics
Economic
◦ Runs on a cluster of commodity hardware
Easy to use
◦ Clients need not deal with distributed computing; the framework takes care
of all of it
Data Locality
◦ Moves computation to the data instead of moving data to the computation
- 18. Flavors
Apache
Cloudera
MapR
IBM
Pivotal
Connectors
◦ Almost all major databases provide connectors to Hadoop for fast
data transfer