Hadoop ecosystem
- 2. What is Hadoop?
• Distributed, scalable system that runs on commodity hardware
• HDFS – Distributed file system
• MapReduce – Programming paradigm
• Tools: Hive, Pig, Sqoop, HCatalog, HBase, Flume, Mahout, YARN, Tez, Spark, Stinger, Oozie, ZooKeeper, Storm, etc.
• Ideal for processing huge volumes of data in batch, but inadequate on its own for analyzing that data as a real-time stream
• Low-cost, robust environment providing fault tolerance for extremely large datasets
• Capable of capturing unstructured, semi-structured, and structured data, in batch or in real time
• No need to create data models up front: schema-on-read (schema-on-demand) in a data lake
• Provides scalable analytics via distributed storage and distributed processing
- 6. Not just HADOOP, but with Much More …
Manage & store huge volumes of any data – Hadoop File System, MapReduce
Manage streaming data – Stream Computing
Analyze unstructured data – Text Analytics Engine
Structure and control data – Data Warehousing
Integrate and govern all data sources – Integration, Data Quality, Security, Lifecycle Management, MDM
Understand and navigate federated big data sources – Federated Discovery and Navigation
- 7. 1st Generation Hadoop: Batch Focus
HADOOP 1.0 – Built for Web-Scale Batch Apps
All other usage patterns MUST leverage the same infrastructure, which forces the creation of silos to manage mixed workloads.
[Diagram: separate single-app clusters, each with its own HDFS, dedicated to BATCH, INTERACTIVE, and ONLINE workloads.]
- 9. Hadoop 1 Limitations
Scalability
– Max cluster size ~5,000 nodes
– Max concurrent tasks ~40,000
– Coarse synchronization in the JobTracker
Availability
– A JobTracker failure kills all queued and running jobs
Non-optimal resource utilization
– Hard partition of resources into map and reduce slots
Lacks support for alternate paradigms and services
– Iterative applications in MapReduce are ~10x slower
- 12. Hadoop 2 - YARN Architecture
ResourceManager (RM)
– Central agent: manages and allocates cluster resources
NodeManager (NM)
– Per-node agent: manages tasks and enforces resource allocations
ApplicationMaster (App Mstr)
– Per-application agent: negotiates resources from the RM
[Diagram: a Client submits a job to the ResourceManager; the App Mstr sends resource requests and MapReduce status to the RM; NodeManagers report node status and host the application's containers. A client-side sketch follows.]
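To make the RM's central role concrete, here is a minimal sketch in Java, assuming Hadoop 2.x client libraries on the classpath and a yarn-site.xml pointing at the cluster. It asks the ResourceManager for its cluster-wide view, i.e. the aggregate state NodeManagers report via heartbeats; the class name is ours, not part of YARN:

```java
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.YarnClusterMetrics;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterInfo {
  public static void main(String[] args) throws Exception {
    // YarnClient wraps the Application Client Protocol to the RM;
    // it reads the RM address from yarn-site.xml on the classpath.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();
    try {
      YarnClusterMetrics metrics = yarnClient.getYarnClusterMetrics();
      System.out.println("NodeManagers: " + metrics.getNumNodeManagers());

      // One NodeReport per NM: total capability vs. currently used
      // resources, as last reported to the RM via the NM heartbeat.
      for (NodeReport node : yarnClient.getNodeReports()) {
        System.out.println(node.getNodeId()
            + " capability=" + node.getCapability()
            + " used=" + node.getUsed());
      }
    } finally {
      yarnClient.stop();
    }
  }
}
```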
- 13. Data Processing Engines Run Natively IN Hadoop
Data processing engines running natively on Apache YARN: BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, S4, …), GRAPH (Giraph), MICROSOFT (REEF), SAS (LASR, HPA), ONLINE (HBase), and others
HDFS2 – Redundant, reliable storage
YARN – Cluster resource management
Flexible
– Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
Efficient
– Double the processing in Hadoop on the same hardware while providing predictable performance & quality of service
Shared
– Provides a stable, reliable, secure foundation and shared operational services across multiple workloads
The Data Operating System for Hadoop 2.0
- 14. 5 Key Benefits of YARN
1. New Applications & Services
2. Improved cluster utilization
3. Scale
4. Experimental Agility
5. Shared Services
- 15. Key Improvements in YARN
Framework supporting multiple applications
– Separate generic resource brokering from application logic
– Define protocols/libraries and provide a framework for custom application development
– Share the same Hadoop cluster across applications
Application Agility and Innovation
– Using Protocol Buffers for RPC gives wire compatibility
– MapReduce becomes an application in user space, unlocking safe innovation
– Multiple versions of an app can co-exist, enabling experimentation
– Easier upgrades of the framework and applications
- 16. Key Improvements in YARN
Scalability
– Complex application logic removed from the RM, allowing it to scale further
– State-machine, message-passing based loosely coupled design
Cluster Utilization
– Generic resource container model replaces fixed Map/Reduce slots; container allocations based on locality and memory (CPU coming soon)
– Cluster is shared among multiple applications
Reliability and Availability
– Simpler RM state makes it easier to save and restart (work in progress)
– Application checkpointing allows an app to be restarted; the MapReduce application master saves its state in HDFS
- 17. YARN as Cluster Operating System
[Diagram: one YARN cluster shared by three workloads. The ResourceManager's Scheduler allocates containers across a grid of NodeManagers; Batch containers (map 1.1, map 1.2, reduce 1.1), Interactive SQL containers (vertex 1.1.1, vertex 1.1.2, vertex 1.2.1, vertex 1.2.2), and Real-Time containers (nimbus0, nimbus1, nimbus2) run side by side on the same nodes.]
- 18. YARN APIs & Client Libraries
Application Client Protocol: Client – RM interaction
– Library: YarnClient (submission sketch below)
– Application lifecycle control
– Access cluster information
Application Master Protocol: AM – RM interaction
– Library: AMRMClient / AMRMClientAsync
– Resource negotiation
– Heartbeat to the RM
Container Management Protocol: AM – NM interaction
– Library: NMClient / NMClientAsync
– Launching allocated containers
– Stopping running containers
Or use external frameworks like Weave / REEF / Spring
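As a hedged illustration of the Application Client Protocol via YarnClient, here is a minimal job-submission sketch in Java, assuming Hadoop 2.x client libraries. The application name and the AM launch command are placeholders of ours, not an official example:

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitApp {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Ask the RM for a new application id and submission context.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app"); // placeholder name

    // Describe the container that will run the ApplicationMaster.
    // The command below is a stand-in for launching your AM class.
    ContainerLaunchContext amContainer =
        Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(
        Collections.singletonList("$JAVA_HOME/bin/java MyApplicationMaster"));
    ctx.setAMContainerSpec(amContainer);

    // Resources the AM container itself needs (memory only, per the
    // Hadoop 2.0-era resource model described above).
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(512);
    ctx.setResource(capability);

    // Submit; lifecycle can then be tracked via getApplicationReport().
    ApplicationId appId = yarnClient.submitApplication(ctx);
    System.out.println("Submitted " + appId);
  }
}
```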
- 19. YARN Application Flow
[Diagram: the Application Client drives the ResourceManager over the Application Client Protocol (library: YarnClient) and talks to the Application Master over an app-specific API. The Application Master negotiates with the ResourceManager over the Application Master Protocol (library: AMRMClient) and uses the Container Management Protocol (library: NMClient) to have NodeManagers launch App Containers. An AM-side sketch follows.]
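The AM side of this flow can be sketched the same way. The following is a minimal, hedged Java sketch (not the full protocol dance) of an ApplicationMaster registering with the RM, requesting one container, and launching it through a NodeManager; the resource sizes, priority, and container command are illustrative assumptions:

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class MiniAppMaster {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    // Application Master Protocol: register this AM with the RM.
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();
    rmClient.registerApplicationMaster("", 0, "");

    // Container Management Protocol: for talking to NodeManagers.
    NMClient nmClient = NMClient.createNMClient();
    nmClient.init(conf);
    nmClient.start();

    // Ask for one 256 MB container at default priority (illustrative).
    Priority priority = Records.newRecord(Priority.class);
    priority.setPriority(0);
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(256);
    rmClient.addContainerRequest(
        new ContainerRequest(capability, null, null, priority));

    // allocate() doubles as the heartbeat. You may ask but not get
    // what you want immediately, so poll until a container arrives.
    Container allocated = null;
    while (allocated == null) {
      AllocateResponse response = rmClient.allocate(0.0f);
      for (Container c : response.getAllocatedContainers()) {
        allocated = c;
      }
      Thread.sleep(1000);
    }

    // Launch a placeholder command in the allocated container.
    ContainerLaunchContext launch =
        Records.newRecord(ContainerLaunchContext.class);
    launch.setCommands(Collections.singletonList("sleep 30"));
    nmClient.startContainer(allocated, launch);

    // Unregister once the application's work is done.
    rmClient.unregisterApplicationMaster(
        FinalApplicationStatus.SUCCEEDED, "", "");
  }
}
```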
- 20. YARN Best Practices
Use the provided client libraries
Resource Negotiation
– You may ask, but you may not get what you want immediately.
– Locality requests may not always be met.
– Resources like memory/CPU are guaranteed.
Failure handling
– Remember, anything can fail (or YARN can preempt your containers).
– AM failures are handled by YARN, but container failures are handled by the application.
Checkpointing
– Checkpoint AM state for AM recovery (a sketch follows this slide).
– If tasks are long-running, checkpoint task state too.
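As a hedged sketch of that checkpointing practice, the hypothetical helper below persists serialized AM state to HDFS using the standard FileSystem API. The paths and the write-then-rename pattern are illustrative assumptions of ours, not a facility YARN provides:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AmCheckpointer {
  // Hypothetical paths; a real AM would derive these from its app id.
  private static final Path TMP = new Path("/app/checkpoints/am-state.tmp");
  private static final Path LIVE = new Path("/app/checkpoints/am-state");

  // Persist serialized AM state so a restarted AM attempt can resume
  // instead of redoing completed work (the MapReduce AM does similar).
  public static void checkpoint(byte[] state) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    // Write to a temp file first so a crash mid-write never corrupts
    // the last good checkpoint.
    try (FSDataOutputStream out = fs.create(TMP, true)) {
      out.write(state);
    }
    // Swap the temp file into place. HDFS rename fails if the target
    // exists, so delete the old checkpoint first (small unsafe window).
    if (fs.exists(LIVE)) {
      fs.delete(LIVE, false);
    }
    fs.rename(TMP, LIVE);
  }
}
```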
- 21. YARN Best Practices
Cluster Dependencies
– Try to make zero assumptions about the cluster.
– Your application bundle should deploy everything required using YARN’s local resources.
Client-only installs if possible
– Simplifies cluster deployment and multi-version support
Securing your Application
– YARN does not secure communications between the AM and its containers.
- 22. YARN Future Work
ResourceManager High Availability and work-preserving restart
– Work in progress
Scheduler Enhancements
– SLA-driven scheduling, low-latency allocations
– Multiple resource types: disk/network/GPUs/affinity
Rolling upgrades
Long-running services
– Better support for running services like HBase
– Discovery of services, upgrades without downtime
More utilities/libraries for Application Developers
– Failover/Checkpointing