Hadoop introduction
- 4. Hadoop Ecosystem
Zookeeper (Coordination)
Oozie (Workflow)
Sqoop (Data Exchange)
HBase (Columnar Store)
Hive (SQL Query)
Pig (Scripting)
Mahout (Data Mining)
MapReduce (Distributed Processing Framework)
HDFS (Hadoop Distributed File System)
Flume (Data Ingestion)
Avro (Data Serialization System)
Storm, Kafka, Tajo, Impala (Real-Time Processing)
Scala (Programming Language)
- 6. Hadoop
An open source project from the Apache Software Foundation
It provides a software framework for distributing and running
applications on clusters of servers
It is inspired by Google's Map-Reduce programming model as well as its
file system (GFS)
Hadoop was originally written for the Nutch search engine project
- 7. Hadoop
Hadoop is an open-source framework written in Java
It efficiently processes large volumes of data on a cluster of commodity
hardware
Hadoop can be set up on a single machine, but the real power of Hadoop
comes with a cluster of machines; it can be scaled from a single machine
to thousands of nodes on the fly
Hadoop consists of two key parts: the Hadoop Distributed File System
(HDFS) and Map-Reduce
- 9. HDFS
HDFS is a highly fault-tolerant, distributed, reliable, and scalable file
system for data storage
HDFS stores multiple copies of data on different nodes; a file is split
into blocks (default 64 MB) and the blocks are stored across multiple machines
A Hadoop cluster typically has a single namenode and a number of
datanodes, which together form the HDFS cluster
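The block-splitting and replication idea above can be sketched in plain Java. The class name and the round-robin placement below are illustrative only (real HDFS uses a rack-aware placement policy); the sketch assumes the 64 MB default block size and 3x replication mentioned in the slides:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of HDFS block splitting and replica placement.
// Not the real HDFS API; names and placement policy are hypothetical.
public class HdfsBlockSketch {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB default block
    static final int REPLICATION = 3;                 // default replication factor

    // Number of blocks needed to store a file of the given size (ceiling division).
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Assign each block to REPLICATION distinct datanodes, round-robin.
    static List<List<Integer>> placeBlocks(long fileSizeBytes, int numDataNodes) {
        List<List<Integer>> placement = new ArrayList<>();
        for (long b = 0; b < blockCount(fileSizeBytes); b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add((int) ((b + r) % numDataNodes));
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024; // a 200 MB file
        // 200 MB / 64 MB -> 4 blocks (64 + 64 + 64 + 8 MB)
        System.out.println("blocks: " + blockCount(fileSize));
        System.out.println("placement: " + placeBlocks(fileSize, 5));
    }
}
```

Because every block lives on several datanodes, the loss of any single machine leaves at least two copies of each block intact, which is the basis of the fault tolerance described above.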
- 11. Map-Reduce
Map-Reduce is a programming model for processing large volumes of
data in parallel by dividing the work into a set of independent tasks
Map-Reduce is a paradigm for distributed processing of large data sets
across a cluster of nodes
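The map, shuffle, and reduce phases can be sketched in plain Java without the Hadoop API; the class and method names here are illustrative, and the shuffle is simulated locally, whereas real Hadoop runs each phase distributed across nodes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative word count in the map/reduce style (not the Hadoop API).
public class MapReduceSketch {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: each line is an independent task emitting (word, 1) pairs.
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) emitted.add(Map.entry(word, 1));
            }
        }
        // Shuffle phase: group the emitted values by key (word).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : emitted) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        // Reduce phase: sum the grouped counts for each word.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).sum());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("hadoop stores data", "hadoop processes data");
        System.out.println(wordCount(input)); // {data=2, hadoop=2, processes=1, stores=1}
    }
}
```

The key property is that each map task touches only its own line and each reduce task touches only its own key, so both phases can run in parallel across many nodes.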
- 12. Hadoop Daemons
Namenode (Master) (HDFS)
SecondaryNameNode (checkpointing helper for the Namenode, typically run on a separate machine; not a failover standby)
JobTracker (Master) (MR)
DataNode (Slave) (HDFS)
TaskTracker (Slave) (MR)
- 13. Hadoop Daemons
[Diagram: Hadoop daemons architecture, showing the two layers of a Hadoop cluster]
Computation layer (Map-Reduce): jobs are submitted to the JobTracker, which distributes tasks to the TaskTrackers
Storage layer (HDFS): the NameNode stores the metadata; the DataNodes store the actual data blocks
- 14. Hadoop Cluster
[Diagram: the master node runs the namenode (storage layer) and the jobtracker (Map-Reduce layer); each slave node runs a datanode and a tasktracker]
- 15. Characteristics
Open-source
◦ Code can be modified according to business requirements
Distributed Processing
◦ Data is processed in parallel on a cluster of nodes in a distributed manner
Fault Tolerance
◦ Failures of nodes or tasks are recovered automatically by the framework
- 16. Characteristics
Reliability
◦ Data is reliably stored on the cluster of machines despite machine failures
High Availability
◦ Data is highly available and accessible despite hardware failure
Scalability
◦ New hardware can be added to the nodes
◦ Horizontal scalability: new nodes can be added on the fly
- 17. Characteristics
Economic
◦ Runs on a cluster of commodity hardware
Easy to use
◦ Clients need not deal with distributed computing; the framework takes care
of all of it
Data Locality
◦ Moves computation to the data instead of moving data to the computation
- 18. Flavors
Apache
Cloudera
MapR
IBM
Pivotal
Connectors
◦ Almost all major databases provide connectors to Hadoop for fast
data transfer