A SOFT INTRODUCTION
Contents 
Hadoop Introduction 
Ecosystem 
Architecture 
HDFS 
Map-Reduce 
Characteristics 
Flavors
A large ecosystem
Zookeeper (Coordination) 
Oozie (Workflow) 
Sqoop (Data Exchange) 
Hbase (Columnar Store) 
HIVE (SQL-Query) 
PIG (Scripting) 
Mahout (Data Mining) 
MapReduce (Distributed Processing Framework) 
HDFS (Hadoop Distributed File System) 
Flume (Data Exchange) 
Avro (Data Serialization System) 
STORM, KAFKA, TAJO, SCALA, IMPALA (Real Time Processing)
Who uses Hadoop ?
Hadoop 
An open source project from the Apache Software Foundation 
It provides a software framework for distributing and running 
applications on clusters of servers 
It is inspired by Google's Map-Reduce programming model as well as its 
file system (GFS) 
Hadoop was originally written for the Nutch search engine project
Hadoop 
Hadoop is an open-source framework written in Java 
It efficiently processes large volumes of data on a cluster of commodity 
hardware 
Hadoop can be set up on a single machine, but the real power of Hadoop 
comes with a cluster of machines; it can be scaled from a single machine 
to thousands of nodes on the fly 
Hadoop consists of two key parts – Hadoop Distributed File System 
(HDFS) and Map-Reduce
Architecture 
[Diagram: Hadoop cluster — a user interacts with the Master node, backed by a Secondary Master; the Master coordinates many Slave nodes]
HDFS 
HDFS is a highly fault tolerant, distributed, reliable, scalable file system 
for data storage 
HDFS stores multiple copies of data on different nodes; a file is split up 
into blocks (default 64 MB) and stored across multiple machines 
A Hadoop cluster typically has a single namenode and a number of 
datanodes that together form the HDFS cluster
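The block arithmetic above can be sketched in a few lines of plain Python. This is an illustration of the idea, not real HDFS code; the 64 MB block size and replication factor of 3 are the classic defaults mentioned on this slide:

```python
# Sketch: how HDFS splits a file into fixed-size blocks and replicates them.
# Illustrative only — not the actual HDFS implementation.
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the classic HDFS default block size
REPLICATION = 3                  # each block is stored on 3 different datanodes

def split_into_blocks(file_size_bytes):
    """Return the sizes of the blocks a file of this size occupies."""
    full, rest = divmod(file_size_bytes, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([rest] if rest else [])

# A 200 MB file becomes three full 64 MB blocks plus one 8 MB block,
# and the cluster stores 4 * 3 = 12 block copies in total.
blocks = split_into_blocks(200 * 1024 * 1024)
print(len(blocks))                    # 4
print(blocks[-1] // (1024 * 1024))    # 8
print(len(blocks) * REPLICATION)      # 12
```

Because every block lives on multiple datanodes, losing one machine never loses data — the namenode simply re-replicates the affected blocks elsewhere.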
HDFS
Map-Reduce 
Map-Reduce is a programming model designed for processing large 
volumes of data in parallel by dividing the work into a set of 
independent tasks 
Map-Reduce is a paradigm for distributed processing of large data sets 
over a cluster of nodes
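The canonical example is word count. A minimal single-process sketch of the map, shuffle, and reduce phases in plain Python (the real Hadoop API is Java and runs each phase on many nodes; this just shows the data flow):

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line of input."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)

lines = ["hadoop stores data", "hadoop processes data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)   # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Each map call is independent, and each reduce call sees only one key's values — this independence is exactly what lets Hadoop run the tasks in parallel across a cluster.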
Hadoop Daemons 
NameNode (Master) (HDFS) 
SecondaryNameNode (3rd system / Slave) 
JobTracker (Master) (MR) 
DataNode (Slave) (HDFS) 
TaskTracker (Slave) (MR)
Hadoop Daemons 
[Diagram: Hadoop daemons architecture — the cluster has a computation layer and a storage layer. In the Map-Reduce (computation) layer, jobs are submitted to a single JobTracker, which drives many TaskTrackers. In the HDFS (storage) layer, a single NameNode stores the metadata and manages many DataNodes.]
[Diagram: Hadoop cluster — the Master node runs the JobTracker (Map-Reduce layer) and the NameNode (storage layer); each Slave node runs a TaskTracker and a DataNode.]
Characteristics 
Open-source 
◦ Code can be modified according to business requirements 
Distributed Processing 
◦ Data is processed in parallel on a cluster of nodes in a distributed manner 
Fault Tolerance 
◦ Failures of nodes or tasks are recovered automatically by the framework
Characteristics 
Reliability 
◦ Data is reliably stored on the cluster of machines despite machine failures 
High Availability 
◦ Data is highly available and accessible despite hardware failure 
Scalability 
◦ New hardware can be added to the nodes 
◦ Horizontal scalability – new nodes can be added on the fly
Characteristics 
Economic 
◦ Runs on a cluster of commodity hardware 
Easy to use 
◦ The client does not need to deal with distributed computing; the framework takes 
care of it 
Data Locality 
◦ Computation is moved to the data instead of moving data to the computation
Flavors 
Apache 
Cloudera 
MapR 
IBM 
Pivotal 
Connectors 
◦ Almost all major databases provide a connector to Hadoop for fast 
data transfer
QUESTIONS ??
