The document summarizes Hadoop Eagle, a full-stack real-time monitoring framework for eBay's Hadoop clusters. It discusses eBay's large-scale Hadoop environment, with over 10 clusters, 10,000 nodes, and 50,000 jobs per day. It then introduces Eagle, the uniform monitoring framework, which consists of the Eagle framework and Eagle apps. The framework provides scalable real-time monitoring capabilities, and the apps provide domain-specific monitoring for Hadoop, Spark, HBase, etc. It highlights two Eagle apps: JPA for job performance monitoring and DAM for security monitoring.
Next Gen Big Data Analytics with Apache Apex discusses Apache Apex, an open source stream processing framework. It provides an overview of Apache Apex's capabilities for processing continuous, real-time data streams at scale. Specifically, it describes how Apache Apex allows for in-memory, distributed stream processing using a programming model of operators in a directed acyclic graph. It also covers Apache Apex's features for fault tolerance, dynamic scaling, and integration with Hadoop and YARN.
Honu is a large-scale data collection and processing pipeline built using Hadoop, Hive, and Thrift that is running in production at Netflix. It collects over a billion log events per day from applications and processes them in HDFS and Hive for querying. The pipeline includes collectors that gather application logs, a processing system that parses and loads data into structured Hive tables, and a Hive data warehouse where the data is stored. Future work includes open sourcing components, adding multiple writers, and integrating with real-time monitoring systems.
Slim Baltagi, director of Enterprise Architecture at Capital One, gave a presentation at Hadoop Summit on major trends in big data analytics. He discussed 1) increasing portability between execution engines using Apache Beam, 2) the emergence of stream analytics driven by data streams, technology advances, business needs and consumer demands, 3) the growth of in-memory analytics using tools like Alluxio and RocksDB, 4) rapid application development using APIs, notebooks, GUIs and microservices, 5) open sourcing of machine learning systems by tech giants, and 6) hybrid cloud computing models for deploying big data applications both on-premise and in the cloud.
This document summarizes the work done by Yahoo engineers to optimize performance of queries on a mobile analytics data mart hosted on Apache Hive. They implemented several techniques like using Tez, vectorized query execution, map-side aggregations, and ORC file format, which provided significant performance boosts. For high cardinality partitioned tables, they leveraged sketching which reduced query times from over 100 seconds to under 25 seconds. They also implemented a data mart in a box solution for easier setup of custom data marts and funnels analysis using UDFs.
This document discusses end-to-end processing of 3.7 million telemetry events per second using a lambda architecture at Symantec. It provides an overview of Symantec's security data lake infrastructure, the telemetry data processing architecture using Kafka, Storm and HBase, tuning targets for the infrastructure components, and performance benchmarks for Kafka, Storm and Hive.
This document provides an overview of Apache Kylin, an open source distributed analytics engine that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop for extremely large datasets. It discusses Kylin's features such as fast OLAP capabilities, ANSI SQL interface, integration with BI tools, and job management. The document also covers Kylin's architecture, cube building process, storage in HBase, and query planning.
In this talk, we present a comprehensive framework for assessing the correctness, stability, and performance of the Spark SQL engine. Apache Spark is one of the most actively developed open source projects, with more than 1200 contributors from all over the world. At this scale and pace of development, mistakes are bound to happen. To automatically identify correctness issues and performance regressions, we have built a testing pipeline that consists of two complementary stages: randomized testing and benchmarking.
Randomized query testing aims at extending the coverage of the typical unit testing suites, while we use micro and application-like benchmarks to measure new features and make sure existing ones do not regress. We will discuss various approaches we take, including random query generation, random data generation, random fault injection, and longevity stress tests. We will demonstrate the effectiveness of the framework by highlighting several correctness issues we have found through random query generation and critical performance regressions we were able to diagnose within hours due to our automated benchmarking tools.
This document discusses Apache Oozie usage at Yahoo for managing complex data pipelines. It describes how Oozie is deployed at a large scale with high availability. It outlines the types of data pipelines used for tasks like ad targeting and content management. Challenges for large pipelines like dependency management, SLA monitoring, and reprocessing are discussed. User-built monitoring systems are described that integrate with Oozie for tasks like alerting and long job detection. Future work areas like improved testing and coordination are proposed.
First part of the talk will describe the anatomy of a typical data pipeline and how Apache Oozie meets the demands of large-scale data pipelines. In particular, we will focus on recent advancements in Oozie for dependency management among pipeline stages, incremental and partial processing, combinatorial, conditional and optional processing, priority processing, late processing and BCP management. Second part of the talk will focus on out of box support for spark jobs.
Speakers:
Purshotam Shah is a senior software engineer with the Hadoop team at Yahoo, and an Apache Oozie PMC member and committer.
Satish Saley is a software engineer at Yahoo!. He contributes to Apache Oozie.
Building large scale applications in YARN with Apache Twill - Henry Saputra
This document summarizes a presentation about Apache Twill, which provides abstractions for building large-scale applications on Apache Hadoop YARN. It discusses why Twill was created to simplify developing on YARN, Twill's architecture and components, key features like real-time logging and elastic scaling, real-world uses at CDAP, and the Twill roadmap.
This document provides an overview of Apache Apex and real-time data visualization. Apache Apex is a platform for developing scalable streaming applications that can process billions of events per second with millisecond latency. It uses YARN for resource management and includes connectors, compute operators, and integrations. The document discusses using Apache Apex to build real-time dashboards and widgets using the App Data Framework, which exposes application data sources via topics. It also covers exporting and packaging dashboards to include in Apache Apex application packages.
Apache Spark has emerged over the past year as the heir apparent to Hadoop MapReduce. Spark can process data in memory at very high speed, while still being able to spill to disk if required. Spark's powerful yet flexible API allows users to write complex applications very easily without worrying about the internal workings and how the data gets processed on the cluster.
Spark comes with an extremely powerful Streaming API to process data as it is ingested. Spark Streaming integrates with popular data ingest systems like Apache Flume, Apache Kafka, Amazon Kinesis etc. allowing users to process data as it comes in.
In this talk, Hari will discuss the basics of Spark Streaming, its API and its integration with Flume, Kafka and Kinesis. Hari will also discuss a real-world example of a Spark Streaming application, and how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application execution will be presented, which can help understand practices while writing such an application. Hari will finally discuss how to write a custom application and a custom receiver to receive data from other systems.
Opal: Simple Web Services Wrappers for Scientific Applications - Sriram Krishnan
The grid-based infrastructure enables large-scale scientific applications to be run on distributed resources and coupled in innovative ways. In practice, however, grid resources are not very easy to use for end-users, who have to learn how to generate security credentials, stage inputs and outputs, access grid-based schedulers, and install complex client software. There is a pressing need to provide transparent access to these resources so that end-users are shielded from the complicated details and free to concentrate on their domain science. Scientific applications wrapped as Web services alleviate some of these problems by hiding the complexities of the back-end security and computational infrastructure, exposing only a simple SOAP API that can be accessed programmatically by application-specific user interfaces. However, writing the application services that access grid resources can be quite complicated, especially if it has to be replicated for every application. In this presentation, we present Opal, a toolkit for wrapping scientific applications as Web services in a matter of hours, providing features such as scheduling, standards-based grid security, and data management in an easy-to-use and configurable manner.
This document provides lessons learned from optimizing Apache Spark for NoSQL databases like Riak. Some key lessons include:
1. Parallelizing operations whenever possible to avoid overloading Riak with too many direct key-based gets or secondary index queries.
2. Being smart about data mapping between NoSQL data structures and Spark DataFrames/RDDs for efficient processing.
3. Optimizing performance at all levels from the network protocol to data locality optimizations.
4. Being flexible in supporting multiple languages and deployment environments for Spark and NoSQL integrations.
Omid: scalable and highly available transaction processing for Apache Phoenix - DataWorks Summit
Apache Phoenix is an OLTP and operational analytics engine for Hadoop. To ensure operational correctness, Phoenix requires a transaction processor that guarantees all data accesses satisfy the ACID properties. Traditionally, Apache Phoenix has used the Apache Tephra transaction processing technology. Recently, we introduced into Phoenix support for Apache Omid, an open source transaction processor for HBase that is used at Yahoo at large scale.
A single Omid instance sustains hundreds of thousands of transactions per second and provides high availability at zero cost for mainstream processing. Omid and Tephra are now configurable choices for the Phoenix transaction processing backend, enabled by the newly introduced Transaction Abstraction Layer (TAL) API. The integration required introducing many new features and operations to Omid and will become generally available in early 2018.
In this talk, we walk through the challenges of the project, focusing on the new use cases introduced by Phoenix and how we address them in Omid.
Speaker
Ohad Shacham, Yahoo Research, Oath, Senior Research Scientist
James Taylor
Enabling Modern Application Architecture using Data.gov open government data - DataWorks Summit
Big Data and the Internet of Things (IoT) have forced businesses and the Federal Government to reevaluate their existing data strategies and adopt a more modern data architecture. With the advent of the connected data platform, migrating or building data-driven applications that take advantage of data-in-motion and data-at-rest can be a daunting journey to undertake. Scaling, reusability, and achieving operational agility are just some of the common pitfalls associated with existing software architectures. How do we embrace this paradigm shift? Adopting agile methodologies and emerging development practices such as Microservices and DevOps offer greater agility and operational efficiency enabling the government to rapidly build modern data-driven applications.
During this talk and demonstration, we will show how the federal government can unleash the true power of the connected data platform with modern data-driven applications.
Connected Data Platform:
• Hortonworks DataFlow
o Using Apache NiFi for capturing data at the edge of the data lake & managing the flow of data to the data platform
o Apache Storm for complex event processing and stream processing
• Hortonworks Data Platform
o Apache Accumulo for scalability and cell-level security
o Apache YARN for resource management
• Modern Data-Driven Applications
o Microservices: a software architecture practice for designing software applications as suites of independently deployable services, promoting componentization, single responsibility & scalability. Adopting a Microservices mindset enables the government to be technology agnostic: using the best tool or programming language for the job.
- Demoed REST APIs on top of Apache Accumulo (Spark-Java, AngularJS/TypeScript)
o DevOps: A culture and practice that breaks down the silos found between development and operations teams in traditional software practices.
- CI/CD pipelines, automated build kick-offs using containers (Docker, Jenkins)
This talk will lay out a basic environment for promoting greater agility and operational efficiency for the federal government while taking advantage of a connected data platform.
The document discusses improvements made to Apache Flink by Alibaba, called Blink. Blink provides a unified SQL layer for both batch and streaming processes. It supports features like UDF/UDTF/UDAGG, stream-stream joins, windowing, and retraction. Blink also improves Flink's runtime to be more reliable and production-quality when running on large YARN clusters. It has a new architecture using a JobMaster and TaskExecutors. Checkpointing and state management were optimized for incremental backups. Blink has been running in production supporting many of Alibaba's critical systems and processing massive amounts of data.
The document discusses new and upcoming features in Apache Storm including:
1. Apache Storm 1.0 has been released with improved performance and maturity.
2. A new Distributed Cache API allows sharing of files between topologies and updating files from the command line.
3. Nimbus high availability has been improved with ZooKeeper replacement Pacemaker for leader election and use of the Distributed Cache API.
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop - DataWorks Summit
Hadoop Eagle is a full-stack realtime monitoring framework for eBay's Hadoop clusters. It uses task failure ratios to detect node anomalies, and monitors jobs, performance, and metrics across clusters in real-time. The framework addresses challenges of monitoring eBay's large Hadoop environment, which includes 10+ clusters, 10,000+ data nodes, and processing of 50 million+ tasks per day.
Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop created by eBay to address limitations of existing tools in handling large volumes of metrics and logs from Hadoop clusters. It provides data activity monitoring, job performance monitoring, and unified monitoring. Eagle detects anomalies using machine learning algorithms and notifies users through alerts. It has been deployed across multiple eBay clusters with over 10,000 nodes and processes hundreds of thousands of events per day.
Developing Distributed Web Applications, Where does REST fit in? - Srinath Perera
This document discusses distributed web applications and the roles of SOA and REST architectures. It defines distributed applications as those composed of many machines to handle load and provide high availability. SOA uses stateless processing units and a shared data store, while REST (Representational State Transfer) realizes ROA (Resource Oriented Architecture) through resources that support GET, PUT, POST, DELETE operations. The document uses an example of a network management application to illustrate how each approach would structure resources and operations. It also discusses REST principles and implementation, as well as when each approach is most appropriate.
Siddhi: A Second Look at Complex Event Processing Implementations - Srinath Perera
Today there is so much data available from sources like sensors (RFID, Near Field Communication), web activities, transactions, social networks, etc. Making sense of this avalanche of data requires efficient and fast processing. Processing high volumes of events to derive higher-level information is a vital part of taking critical decisions, and Complex Event Processing (CEP) has become one of the most rapidly emerging fields in data processing. e-Science use-cases, business applications, financial trading applications, operational analytics applications and business activity monitoring applications are some use-cases that directly use CEP. This paper discusses different design decisions associated with CEP Engines, and proposes some approaches to improve CEP performance by using more stream-processing-style pipelines. Furthermore, the paper discusses Siddhi, a CEP Engine that implements those suggestions. We present a performance study showing that the resulting CEP Engine, Siddhi, has significantly improved performance. The primary contributions of this paper are a critical analysis of CEP Engine design, the identification of suggestions for improvement, the implementation of those improvements through Siddhi, and the demonstration of the soundness of those suggestions through empirical evidence.
From Beginners to Experts, Data Wrangling for All - DataWorks Summit
The document discusses designing data preparation tools that can support users with different technical proficiencies, from non-technical users to expert users. It proposes using both visual "transform cards" and a script IDE mode to bridge the needs of different users. The tool would use progressive disclosure of scripting capabilities to ease non-technical users into more technical functions. A demo of the tool discussed implementing transform cards and ways to improve predictive data transformations through feedback.
Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop that was created by eBay and later open sourced as an Apache Incubator project. It provides security for Hadoop systems by instantly identifying access to sensitive data, recognizing attacks/malicious activity, and blocking access in real time through complex policy definitions and stream processing. Eagle was designed to handle the huge volume of metrics and logs generated by large-scale Hadoop deployments through its distributed architecture and use of technologies like Apache Storm and Kafka.
Webinar - Pattern Mining Log Data - Vega (20160426) - Turi, Inc.
The document discusses churn prediction using log data. It describes how churn prediction works by observing past user behavior patterns in log data to predict the probability of users stopping engagement. It provides guidance on choosing time boundaries and lookback periods to extract meaningful features for modeling, and how to interpret the results to identify users for retention actions. The key steps are feature generation by analyzing log data patterns before time boundaries, label generation based on engagement after boundaries, and using the predictions to guide targeted retention efforts.
Pattern Mining: Extracting Value from Log Data - Turi, Inc.
Pattern mining is an unsupervised machine learning technique used to discover frequent patterns and relationships in log data. It involves finding the top frequent sets of items that occur together in the data at least a minimum number of times. There are two main approaches - candidate generation which generates and filters candidate patterns in multiple passes over the data, and pattern growth which constructs conditional databases to avoid multiple full scans. Pattern mining can be used to find commonly purchased itemsets, extract features from log data, and derive rules for recommendations.
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio - Alluxio, Inc.
Alluxio Global Online Meetup
Apr 23, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Jiao (Jennie) Wang, Intel
Tsai Louie, Intel
Bin Fan, Alluxio
Today, many people run deep learning applications with training data in separate storage, such as object storage or remote data centers. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced, without the network becoming an I/O bottleneck.
Intel Analytics Zoo is a unified data analytics and AI platform open-sourced by Intel. It seamlessly unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can transparently scale from a laptop to large clusters to process production big data. Alluxio, as an open-source data orchestration layer, accelerates data loading and processing in Analytics Zoo deep learning applications.
This talk, we will go over:
- What is Analytics Zoo and how it works
- How to run Analytics Zoo with Alluxio in deep learning applications
- Initial performance benchmark results using the Analytics Zoo + Alluxio stack
This document provides an overview of Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It describes Hadoop's core components like HDFS for distributed file storage and MapReduce for distributed processing. Key aspects covered include HDFS architecture, data flow and fault tolerance, as well as MapReduce programming model and architecture. Examples of Hadoop usage and a potential project plan for load balancing enhancements are also briefly mentioned.
This document discusses big data concepts like volume, velocity, and variety of data. It introduces NoSQL databases as an alternative to relational databases for big data that does not require data cleansing or schema definition. Hadoop is presented as a framework for distributed storage and processing of large datasets across clusters of commodity hardware. Key Hadoop components like HDFS, MapReduce, Hive, Pig and YARN are described at a high level. The document also discusses using Azure services like Azure Storage, HDInsight and Stream Analytics with Hadoop.
This document provides an overview of Google's Bigtable distributed storage system. It describes Bigtable's data model as a sparse, multidimensional sorted map indexed by row, column, and timestamp. Bigtable stores data across many tablet servers, with a single master server coordinating metadata operations like tablet assignment and load balancing. The master uses Chubby, a distributed lock service, to track which tablet servers are available and reassign tablets if servers become unreachable.
Testing Big Data: Automated Testing of Hadoop with QuerySurge - RTTS
Are You Ready? Stepping Up To The Big Data Challenge In 2016 - Learn why Testing is pivotal to the success of your Big Data Strategy.
According to a new report by analyst firm IDG, 70% of enterprises have either deployed or are planning to deploy big data projects and programs this year due to the increase in the amount of data they need to manage.
The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.
Learn why testing your enterprise's data is pivotal for success with big data and Hadoop. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data - all with one data testing tool.
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010 - Bhupesh Bansal
Jan 22nd, 2010 Hadoop meetup presentation on Project Voldemort and how it plays well with Hadoop at LinkedIn. The talk focuses on the LinkedIn Hadoop ecosystem: how LinkedIn manages complex workflows, data ETL, data storage, and online serving of 100 GB to TB of data.
The document discusses Project Voldemort, a distributed key-value storage system developed at LinkedIn. It provides an overview of Voldemort's motivation and features, including high availability, horizontal scalability, and consistency guarantees. It also describes LinkedIn's use of Voldemort and Hadoop for applications like event logging, online lookups, and batch processing of large datasets.
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interactive - Cloudera, Inc.
Michael Sun presented on CBS Interactive's use of Hadoop for web analytics processing. Some key points:
- CBS Interactive processes over 1 billion web logs daily from hundreds of websites on a Hadoop cluster with over 1PB of storage.
- They developed an ETL framework called Lumberjack in Python for extracting, transforming, and loading data from web logs into Hadoop and databases.
- Lumberjack uses streaming, filters, and schemas to parse, clean, lookup dimensions, and sessionize web logs before loading into a data warehouse for reporting and analytics.
- Migrating to Hadoop provided significant benefits including reduced processing time, fault tolerance, scalability, and cost effectiveness compared to their
6th Session - Application domains in the search for advanced statistical technologies... - Jürgen Ambrosi
In this session we will see, with the usual hands-on demo approach, how to use the R language to perform value-added analyses.
We will experience first-hand the parallelization performance of the algorithms, a fundamental aspect in helping researchers reach their goals.
Joining us in this session will be Lorenzo Casucci, Data Platform Solution Architect at Microsoft.
sudoers: Benchmarking Hadoop with ALOJA - Nicolas Poggi
Presentation for the sudoers Barcelona group, Oct 06 2015, on benchmarking Hadoop with the ALOJA open source benchmarking platform. The presentation was mostly a live demo; posting some slides for the people who could not attend.
http://lanyrd.com/2015/sudoers-barcelona-october/
Testing Big Data: Automated ETL Testing of Hadoop - Bill Hayduk
Learn why testing your enterprise's data is pivotal for success with Big Data and Hadoop. See how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data warehouse - all with one ETL testing tool.
Getting real-time analytics for device, application, and business monitoring from trillions of events and petabytes of data, as companies like Netflix, Uber, Alibaba, PayPal, eBay, and Metamarkets do.
20150704 benchmark and user experience in sahara weiting - Wei Ting Chen
Sahara provides a way to deploy and manage Hadoop clusters within an OpenStack cloud. It addresses common customer needs like providing an elastic environment for data processing jobs, integrating Hadoop with the existing private cloud infrastructure, and reducing costs. Key challenges include speeding up cluster provisioning times, supporting complex data workflows, optimizing storage architectures, and improving performance when using remote object storage.
Big Data Testing Approach - Rohit Kharabe
This presentation speaks about -
1) How to perform big data testing
2) Tools that can be used for testing
3) Different validation stages involved
4) Performance testing
This document discusses using DL4J and DataVec to build deep learning workflows for modeling time series sensor data with recurrent neural networks. It provides an example of loading and transforming sensor data with DataVec, configuring an RNN with DL4J, and training the model both locally and distributed on Spark. The overall workflow involves extracting, transforming, and loading data with DataVec, vectorizing it, modeling with DL4J, evaluating performance, and deploying trained models for execution on Spark/Hadoop platforms.
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016 - MLconf
This document discusses using DL4J and DataVec to build deep learning workflows for modeling time series sensor data with recurrent neural networks. It provides an example of loading and transforming time series data from sensors using DataVec, configuring an RNN using DL4J to classify the trends in the sensor data, and training the network both locally and distributed on Spark. The document promotes DL4J and DataVec as tools that can help enterprises overcome challenges to operationalizing deep learning and producing machine learning models at scale.
Sherlock Homepage - A detective story about running large web services - WebN... - Maarten Balliauw
The site was slow. CPU and memory usage everywhere! Some dead objects in the corner. Something terrible must have happened! We have some IIS logs. Some traces from a witness. But not enough to find out what was wrong. In this session, we’ll see how effective telemetry, a profiler or two as well as a refresher of how IIS runs our ASP.NET web applications can help solve this server murder mystery.
Google Cloud Computing on Google Developer 2008 Day - programmermag
The document discusses the evolution of computing models from clusters and grids to cloud computing. It describes how cluster computing involved tightly coupled resources within a LAN, while grids allowed for resource sharing across domains. Utility computing introduced an ownership model where users leased computing power. Finally, cloud computing allows access to services and data from any internet-connected device through a browser.
Making Hadoop Realtime by Dr. William Bain of ScaleOut Software - Data Con LA
Hadoop has been widely embraced for its ability to economically store and analyze large data sets. Using parallel computing techniques like MapReduce, Hadoop can reduce long computation times to hours or minutes. This works well for mining large volumes of historical data stored on disk, but it is not suitable for gaining real-time insights from live operational data. Still, the idea of using Hadoop for real-time data analytics on live data is appealing because it leverages existing programming skills and infrastructure – and the parallel architecture of Hadoop itself. This presentation will describe how real-time analytics using Hadoop can be performed by combining an in-memory data grid (IMDG) with an integrated, stand-alone Hadoop MapReduce execution engine. This new technology delivers fast results for live data and also accelerates the analysis of large, static data sets.
Similar to Eagle from eBay at China Hadoop Summit 2015 (20)
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat... - Bert Blevins
Today’s digitally connected world presents a wide range of security challenges for enterprises. Insider security threats are particularly noteworthy because they have the potential to cause significant harm. Unlike external threats, insider risks originate from within the company, making them more subtle and challenging to identify. This blog aims to provide a comprehensive understanding of insider security threats, including their types, examples, effects, and mitigation techniques.
Transcript: Details of description part II: Describing images in practice - T... - BookNet Canada
This presentation explores the practical application of image description techniques. Familiar guidelines will be demonstrated in practice, and descriptions will be developed “live”! If you have learned a lot about the theory of image description techniques but want to feel more confident putting them into practice, this is the presentation for you. There will be useful, actionable information for everyone, whether you are working with authors, colleagues, alone, or leveraging AI as a collaborator.
Link to presentation recording and slides: https://bnctechforum.ca/sessions/details-of-description-part-ii-describing-images-in-practice/
Presented by BookNet Canada on June 25, 2024, with support from the Department of Canadian Heritage.
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf - Neo4j
Presented at Gartner Data & Analytics, London, May 2024. BT Group has used the Neo4j Graph Database to enable impressive digital transformation programs over the last 6 years. By re-imagining their operational support systems to adopt self-serve and data-led principles they have substantially reduced the number of applications and complexity of their operations. The result has been a substantial reduction in risk and costs while improving time to value, innovation, and process automation. Join this session to hear their story, the lessons they learned along the way and how their future innovation plans include the exploration of uses of EKG + Generative AI.
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em... - Erasmo Purificato
Slide of the tutorial entitled "Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Emerging Trends" held at UMAP'24: 32nd ACM Conference on User Modeling, Adaptation and Personalization (July 1, 2024 | Cagliari, Italy)
Scaling Connections in PostgreSQL - Postgres Bangalore (PGBLR) Meetup-2 - Mydbops
This presentation, delivered at the Postgres Bangalore (PGBLR) Meetup-2 on June 29th, 2024, dives deep into connection pooling for PostgreSQL databases. Aakash M, a PostgreSQL Tech Lead at Mydbops, explores the challenges of managing numerous connections and explains how connection pooling optimizes performance and resource utilization.
Key Takeaways:
* Understand why connection pooling is essential for high-traffic applications
* Explore various connection poolers available for PostgreSQL, including pgbouncer
* Learn the configuration options and functionalities of pgbouncer
* Discover best practices for monitoring and troubleshooting connection pooling setups
* Gain insights into real-world use cases and considerations for production environments
This presentation is ideal for:
* Database administrators (DBAs)
* Developers working with PostgreSQL
* DevOps engineers
* Anyone interested in optimizing PostgreSQL performance
Contact info@mydbops.com for PostgreSQL Managed, Consulting and Remote DBA Services
Quantum Communications Q&A with Gemini LLM. These are based on Shannon's noisy channel theorem and show how the classical theory applies to the quantum world.
Are you interested in dipping your toes in the cloud native observability waters, but as an engineer you are not sure where to get started with tracing problems through your microservices and application landscapes on Kubernetes? Then this is the session for you, where we take you on your first steps in an active open-source project that offers a buffet of languages, challenges, and opportunities for getting started with telemetry data.
The project is called openTelemetry, but before diving into the specifics, we’ll start with de-mystifying key concepts and terms such as observability, telemetry, instrumentation, cardinality, percentile to lay a foundation. After understanding the nuts and bolts of observability and distributed traces, we’ll explore the openTelemetry community; its Special Interest Groups (SIGs), repositories, and how to become not only an end-user, but possibly a contributor. We will wrap up with an overview of the components in this project, such as the Collector, the OpenTelemetry protocol (OTLP), its APIs, and its SDKs.
Attendees will leave with an understanding of key observability concepts, become grounded in distributed tracing terminology, be aware of the components of openTelemetry, and know how to take their first steps to an open-source contribution!
Key Takeaways: Open source, vendor neutral instrumentation is an exciting new reality as the industry standardizes on openTelemetry for observability. OpenTelemetry is on a mission to enable effective observability by making high-quality, portable telemetry ubiquitous. The world of observability and monitoring today has a steep learning curve and in order to achieve ubiquity, the project would benefit from growing our contributor community.
YOUR RELIABLE WEB DESIGN & DEVELOPMENT TEAM — FOR LASTING SUCCESS
WPRiders is a web development company specialized in WordPress and WooCommerce websites and plugins for customers around the world. The company is headquartered in Bucharest, Romania, but our team members are located all over the world. Our customers are primarily from the US and Western Europe, but we have clients from Australia, Canada and other areas as well.
Some facts about WPRiders and why we are one of the best firms around:
More than 700 five-star reviews! You can check them here.
1500 WordPress projects delivered.
We respond 80% faster than other firms! Data provided by Freshdesk.
We’ve been in business since 2015.
We are located in 7 countries and have 22 team members.
With so many projects delivered, our team knows what works and what doesn’t when it comes to WordPress and WooCommerce.
Our team members are:
- highly experienced developers (employees & contractors with 5 -10+ years of experience),
- great designers with an eye for UX/UI with 10+ years of experience
- project managers with development background who speak both tech and non-tech
- QA specialists
- Conversion Rate Optimisation - CRO experts
They are all working together to provide you with the best possible service. We are passionate about WordPress, and we love creating custom solutions that help our clients achieve their goals.
At WPRiders, we are committed to building long-term relationships with our clients. We believe in accountability, in doing the right thing, as well as in transparency and open communication. You can read more about WPRiders on the About us page.
Advanced Techniques for Cyber Security Analysis and Anomaly Detection - Bert Blevins
Cybersecurity is a major concern in today's connected digital world. Threats to organizations are constantly evolving and have the potential to compromise sensitive information, disrupt operations, and lead to significant financial losses. Traditional cybersecurity techniques often fall short against modern attackers. Therefore, advanced techniques for cyber security analysis and anomaly detection are essential for protecting digital assets. This blog explores these cutting-edge methods, providing a comprehensive overview of their application and importance.
Mitigating the Impact of State Management in Cloud Stream Processing Systems - ScyllaDB
Stream processing is a crucial component of modern data infrastructure, but constructing an efficient and scalable stream processing system can be challenging. Decoupling compute and storage architecture has emerged as an effective solution to these challenges, but it can introduce high latency issues, especially when dealing with complex continuous queries that necessitate managing extra-large internal states.
In this talk, we focus on addressing the high latency issues associated with S3 storage in stream processing systems that employ a decoupled compute and storage architecture. We delve into the root causes of latency in this context and explore various techniques to minimize the impact of S3 latency on stream processing performance. Our proposed approach is to implement a tiered storage mechanism that leverages a blend of high-performance and low-cost storage tiers to reduce data movement between the compute and storage layers while maintaining efficient processing.
Throughout the talk, we will present experimental results that demonstrate the effectiveness of our approach in mitigating the impact of S3 latency on stream processing. By the end of the talk, attendees will have gained insights into how to optimize their stream processing systems for reduced latency and improved cost-efficiency.
Best Programming Language for Civil Engineers - Awais Yaseen
The integration of programming into civil engineering is transforming the industry. We can design complex infrastructure projects and analyse large datasets. Imagine revolutionizing the way we build our cities and infrastructure, all by the power of coding. Programming skills are no longer just a bonus—they’re a game changer in this era.
Technology is revolutionizing civil engineering by integrating advanced tools and techniques. Programming allows for the automation of repetitive tasks, enhancing the accuracy of designs, simulations, and analyses. With the advent of artificial intelligence and machine learning, engineers can now predict structural behaviors under various conditions, optimize material usage, and improve project planning.
3. eBay’s Challenges in Monitoring
Large Scale in Real Time
10+ large Hadoop clusters
10,000+ nodes
50,000+ jobs per day
50,000,000+ tasks per day
500+ types of Hadoop/HBase metrics
Billions of audit events per day
Various Business Logic
Hadoop
HBase
Spark
Data Security
Hardware
Cloud
Database
Complex and Scalable Policy
Join multiple data sources
Threshold based, window based
Multiple metrics correlation
Metrics pre-aggregations
Machine learning based
Engineering Modularization
Varieties of data sources
Varieties of data collectors
Complex business logic
Alert rules can’t be hot deployed
Scalability issues with a single process
4. What’s Eagle
The uniform monitoring and alerting framework to monitor large-scale distributed systems like Hadoop, Spark, cloud, etc. in real time.
Eagle = Eagle Framework + Eagle Apps
5. Eagle Ecosystem
Apps: DAM, JPA, HBase, Spark
Interface: Web Portal, REST Services, Ambari Plugin
Integration: Kafka, Storm, HBase, Druid, Elastic Search
Eagle Framework
Provides a full-stack monitoring framework for efficiently developing highly scalable real-time monitoring applications.
Eagle Apps
Provide built-in monitoring applications for domains like Hadoop, Spark, HBase, Storm and cloud.
Eagle Integration
Integrates with distributed real-time execution environments like Storm, message buses like Kafka, and storage layers like HBase, and also supports extensions.
Eagle Interface
Allows Eagle to be accessed or managed through REST services, the web UI, or the Ambari plugin.
7. JPA: Job Performance Analyzer
Historical job analysis
Running job analysis
Anomaly host detection
Job data skew detection
Job performance suggestion
Anomaly Prediction based on machine learning
Monitor and analyze job performance in real-time
8. Historical Job Analyzer
• Job historical performance trend
• Task and attempt distribution
• Various levels (cluster/job/user/host) of resource utilization
• Anomaly historical performance detection
• TooLowBytesConsumedPerCPUSecond
• JobStatisticLongDuration
• TooLargeReduceNumAlert
• TooLargeShuffleSizeAlert
9. Running Job Analyzer
• Monitoring running job in real time
• Minute-level job progress snapshots
• Minute-level resource usage snapshots
• CPU, HDFS I/O, Disk I/O, slot seconds
• Roll up to user/queue/cluster level
• Anomaly running status detection
• TooLongJobDuration
• NoProgressForLong
• TooManyTaskFailure
10. Task Failure based Anomaly Host Detection
Use Case: Detect node anomalies by analyzing task failure ratios across all nodes
Assumption: The task failure ratio for every node should be approximately equal
Algorithm: Node-by-node comparison (symmetry violation) and per-node trend (see the sketch below)
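A minimal Java sketch of that cluster-wide comparison, based on the recipe in the editor's notes at the end of this deck (crawl job history files, compute a minute-level task failure ratio per node, flag outliers). The thresholds and the ratio-versus-cluster-average test are illustrative assumptions, not Eagle's actual implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Flags nodes whose minute-level task failure ratio stands out from the rest of the cluster. */
public class AnomalousHostDetector {

    // Illustrative thresholds; a real deployment would tune these.
    private static final double RATIO_MULTIPLIER = 3.0; // "significantly higher" than the cluster average
    private static final double MIN_RATIO = 0.1;        // ignore nodes with negligible failure rates

    /** @param failureRatioByHost failedTasks / totalTasks per node over the last minute */
    public List<String> detect(Map<String, Double> failureRatioByHost) {
        double sum = 0.0;
        for (double ratio : failureRatioByHost.values()) {
            sum += ratio;
        }
        double clusterAvg = sum / failureRatioByHost.size();

        List<String> anomalous = new ArrayList<>();
        for (Map.Entry<String, Double> entry : failureRatioByHost.entrySet()) {
            double ratio = entry.getValue();
            // Symmetry violation: this node's failure ratio far exceeds the cluster norm.
            if (ratio >= MIN_RATIO && ratio > clusterAvg * RATIO_MULTIPLIER) {
                anomalous.add(entry.getKey());
            }
        }
        return anomalous;
    }
}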
12. Real-time Data Skew Detection
Use Case: Detect data skew via statistics and distributions of attempt execution durations and counters
Assumption: Durations and counters should follow a normal distribution
Counters & Features: mapDuration, reduceDuration, mapInputRecords, reduceInputRecords, combineInputRecords, mapSpilledRecords, reduceShuffleRecords, mapLocalFileBytesRead, reduceLocalFileBytesRead, mapHDFSBytesRead, reduceHDFSBytesRead
Modeling & Statistics: Avg, Min, Max, Distributions, Max z-score, Top-N, Correlation
Threshold & Detection: Counters with Correlation > 0.9 & Max(Z-Score) > 90% (see the sketch below)
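A minimal Java sketch of that detection rule, pairing one duration counter with one record counter per attempt (for example reduceDuration against reduceInputRecords). The slide's "Max(Z-Score) > 90%" cutoff is rendered here as a plain z-score threshold; both threshold constants are illustrative assumptions rather than Eagle's tuned values.

/** Flags data skew when per-attempt counters correlate strongly and one attempt is an extreme outlier. */
public class SkewDetector {

    private static final double CORRELATION_THRESHOLD = 0.9; // from the slide's detection rule
    private static final double Z_SCORE_THRESHOLD = 3.0;     // illustrative stand-in for "Max(Z-Score) > 90%"

    /** e.g. x = reduceDuration per attempt, y = reduceInputRecords per attempt */
    public boolean isSkewed(double[] x, double[] y) {
        return pearson(x, y) > CORRELATION_THRESHOLD && maxZScore(x) > Z_SCORE_THRESHOLD;
    }

    static double maxZScore(double[] values) {
        double mean = mean(values);
        double stdDev = Math.sqrt(variance(values, mean));
        if (stdDev == 0.0) {
            return 0.0; // perfectly uniform attempts: no skew
        }
        double max = 0.0;
        for (double v : values) {
            max = Math.max(max, Math.abs(v - mean) / stdDev);
        }
        return max;
    }

    static double pearson(double[] x, double[] y) {
        double meanX = mean(x), meanY = mean(y);
        double cov = 0.0;
        for (int i = 0; i < x.length; i++) {
            cov += (x[i] - meanX) * (y[i] - meanY);
        }
        cov /= x.length;
        double sx = Math.sqrt(variance(x, meanX)), sy = Math.sqrt(variance(y, meanY));
        return (sx == 0.0 || sy == 0.0) ? 0.0 : cov / (sx * sy);
    }

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    static double variance(double[] values, double mean) {
        double sum = 0.0;
        for (double v : values) sum += (v - mean) * (v - mean);
        return sum / values.length;
    }
}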
14. Anomaly Prediction based on Machine Learning
• Anomaly Metric Predictive Detection
• Offline: Analyzing and combining 500+ metrics together for causal anomaly detection (IG -> PCA -> GMM -> MCC)
• Online: Predictively alerting on anomalous metrics
(Figures: normal (green) vs. abnormal (red) data points; probability distribution and threshold selection; PCA (Principal Component Analysis))
16. DAM: Data Activity Monitoring
Secure Hadoop in real time
Security Use Cases
Security Architecture Overview
Security Components Highlights
Security Machine Learning Integration
17. Security Use Cases
Data Loss Prevention
Get alerted and stop a malicious user trying to copy, delete, or move sensitive data from the Hadoop cluster.
Malicious Logins
Detect logins where a malicious user tries to guess passwords. Eagle creates user profiles using machine learning algorithms to detect anomalies.
Unauthorized Access
Detect and stop a malicious user trying to access classified data without privilege.
Malicious User Operation
Detect and stop a malicious user trying to delete a large amount of data. Operation type is one parameter of Eagle user profiles; Eagle supports multiple native operation types.
19. Security Component Highlights
Policy Manager
An expressive language to create and modify policies for alerting and remediation on data activity monitoring events.
Data Classification
Integrates with Dataguise and Apache Ranger.
Policy-based Remediation
Ability to detect and stop a threat, improve operational efficiency, and reduce regulatory compliance costs.
User Profiling
Machine learning automatically generates anomaly detection policies.
User Activity Exploration
Ability to drill down into alert details to understand the data security threat.
20. Security Machine Learning Integration
• User Activity Profiling
• Offline: Determine the kernel density function’s bandwidth parameters from the training dataset (KDE)
• Online: If a test data point lies outside the trained bandwidth, it is an anomaly (policy); a sketch follows below
(Figures: kernel density function; PCs (principal components) in EVD (eigenvalue decomposition))
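A minimal Java sketch of the offline/online split for a single numeric activity feature. The Gaussian kernel, Silverman's rule-of-thumb bandwidth, and the low-density anomaly test are common KDE choices assumed here for illustration; the slides do not specify which variants Eagle uses.

/** One-dimensional Gaussian kernel density estimate for scoring a user activity feature. */
public class KdeAnomalyScorer {

    private final double[] trainingPoints;
    private final double bandwidth;
    private final double densityThreshold; // points scoring below this density are flagged

    /** Offline step: fix the bandwidth from the training data (Silverman's rule of thumb). */
    public KdeAnomalyScorer(double[] trainingPoints, double densityThreshold) {
        this.trainingPoints = trainingPoints;
        double sigma = stdDev(trainingPoints);
        this.bandwidth = (sigma == 0.0) ? 1.0 : 1.06 * sigma * Math.pow(trainingPoints.length, -0.2);
        this.densityThreshold = densityThreshold;
    }

    /** Estimated density at x: the average of Gaussian kernels centered on the training points. */
    public double density(double x) {
        double sum = 0.0;
        for (double p : trainingPoints) {
            double u = (x - p) / bandwidth;
            sum += Math.exp(-0.5 * u * u) / Math.sqrt(2 * Math.PI);
        }
        return sum / (trainingPoints.length * bandwidth);
    }

    /** Online step: a test point in a low-density region of the trained profile is an anomaly. */
    public boolean isAnomaly(double x) {
        return density(x) < densityThreshold;
    }

    private static double stdDev(double[] values) {
        double mean = 0.0;
        for (double v : values) mean += v;
        mean /= values.length;
        double sum = 0.0;
        for (double v : values) sum += (v - mean) * (v - mean);
        return Math.sqrt(sum / values.length);
    }
}

For example, a profile trained on a user's historical daily HDFS read volumes would give today's volume a very low density score if it spikes far beyond anything previously observed.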
21. Security Machine Learning Integration
• User Activity Profiling on Spark
(Architecture diagram: offline training on Spark reads historical audit events archived in HDFS, runs batch preprocessing and user profile model generation (KDE + EVD algorithms), and persists models to Eagle storage; online detection on Storm consumes real-time audit events from Kafka through the Eagle security plugins, runs stream preprocessing and the policy engine, which dynamically loads models and policies, and persists alerts for the alert consumer.)
23. Monitoring Programming Paradigm
• Data collector -> data processing -> metric pre-aggregation/alert engine -> storage -> dashboards
• We need to create a framework that covers the full stack of a monitoring system (a minimal sketch of the pipeline stages follows)
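To make the paradigm concrete, here is a small Java sketch wiring the stages together. All interface and class names are hypothetical illustrations of the collector-to-dashboard flow, not Eagle's actual API.

import java.util.List;

/** Illustrative stage contracts for collector -> processing -> alerting -> storage (hypothetical names). */
interface DataCollector<T> { List<T> collect(); }
interface DataProcessor<T, R> { List<R> process(List<T> events); }
interface AlertEngine<R> { List<String> evaluate(List<R> metrics); }
interface MetricStore<R> { void persist(List<R> metrics, List<String> alerts); }

public class MonitoringPipeline<T, R> {
    private final DataCollector<T> collector;
    private final DataProcessor<T, R> processor;
    private final AlertEngine<R> alertEngine;
    private final MetricStore<R> store;

    public MonitoringPipeline(DataCollector<T> collector, DataProcessor<T, R> processor,
                              AlertEngine<R> alertEngine, MetricStore<R> store) {
        this.collector = collector;
        this.processor = processor;
        this.alertEngine = alertEngine;
        this.store = store;
    }

    /** One tick of the full stack: collect, process, evaluate, persist (dashboards read from the store). */
    public void runOnce() {
        List<T> raw = collector.collect();
        List<R> metrics = processor.process(raw);
        List<String> alerts = alertEngine.evaluate(metrics);
        store.persist(metrics, alerts);
    }
}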
28. Extensible & Scalable Policy Framework
Usability
• Declarative Policy Definition Syntax
• Stream Metadata (event attribute name, attribute type, attribute value resolver, …)
Scalability
• Dynamic policy partitioning across compute nodes based on configurable partition class
• Dynamic policy deployment
• Event partitioning by Storm and policy partitioning by Eagle (N events * M policies)
Extensibility
• Support for new policy evaluation engines, for example Siddhi, Esper, machine learning, etc.
29. Usability of Policy Framework
Case: HBase region server high call queue length
Policy: In the past 30 minutes, call queue length > 2000 occurred more than 20 times, expressed in the query below:
from RegionCallQueueLength[value>2000]#window.Extension:messageTimeWindow(30 min)
select host, value, avg(value) as avgValue, count(*) as count
group by host
having count >= 20
insert into HighRegionServerCallQueueLengthStream;
30. Scalability of Policy Evaluation
Dynamic Policy Partition
• N users with 3 partitions and M policies with 2 partitions yield 3 * 2 physical tasks (see the sketch below)
• Physical partition + policy-level partition
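A minimal Java sketch of how such a two-dimensional scheme could map (user, policy) pairs onto physical tasks; the hash-modulo assignment is an illustrative assumption, since the slide only states that event partitions and policy partitions multiply.

/** Maps (event partition, policy partition) pairs onto physical tasks:
 *  3 event partitions * 2 policy partitions = 6 tasks, each evaluating
 *  one slice of users against one slice of policies. */
public class PolicyPartitioner {

    private final int eventPartitions;  // e.g. 3, partitioning users/events
    private final int policyPartitions; // e.g. 2, partitioning policies

    public PolicyPartitioner(int eventPartitions, int policyPartitions) {
        this.eventPartitions = eventPartitions;
        this.policyPartitions = policyPartitions;
    }

    public int numPhysicalTasks() {
        return eventPartitions * policyPartitions;
    }

    /** Which physical task evaluates this (user, policy) combination. */
    public int taskFor(String user, String policyId) {
        int e = Math.floorMod(user.hashCode(), eventPartitions);
        int p = Math.floorMod(policyId.hashCode(), policyPartitions);
        return e * policyPartitions + p;
    }
}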
31. Extensibility of Policy Framework
public interface PolicyEvaluatorServiceProvider {
    public String getPolicyType();                                              // unique name of the policy type
    public Class<? extends PolicyEvaluator> getPolicyEvaluator();               // evaluator implementation class
    public Class<? extends PolicyDefinitionParser> getPolicyDefinitionParser(); // parser for policy definitions
    public Class<? extends PolicyEvaluatorBuilder> getPolicyEvaluatorBuilder(); // builder that constructs evaluators
    public List<Module> getBindingModules();                                    // dependency-injection modules for the engine
}
The policy evaluator provider uses SPI to register policy engine implementations (a hypothetical provider sketch follows the list below).
Built-in Supported Policy Engine
• Siddhi Complex Event Processing Engine
• Machine Learning based Policy Engine
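For illustration, a hypothetical provider implementing the SPI above might look as follows. The MyRule* classes are invented stand-ins for a custom engine, and the snippet compiles only against the SPI types shown on this slide.

public class MyRulePolicyEvaluatorServiceProvider implements PolicyEvaluatorServiceProvider {
    @Override
    public String getPolicyType() {
        return "myRuleEngine"; // type name referenced by policy definitions (hypothetical)
    }
    @Override
    public Class<? extends PolicyEvaluator> getPolicyEvaluator() {
        return MyRulePolicyEvaluator.class; // hypothetical evaluator implementation
    }
    @Override
    public Class<? extends PolicyDefinitionParser> getPolicyDefinitionParser() {
        return MyRulePolicyDefinitionParser.class; // hypothetical definition parser
    }
    @Override
    public Class<? extends PolicyEvaluatorBuilder> getPolicyEvaluatorBuilder() {
        return MyRulePolicyEvaluatorBuilder.class; // hypothetical evaluator builder
    }
    @Override
    public List<Module> getBindingModules() {
        return java.util.Collections.emptyList(); // no extra binding modules in this sketch
    }
}

With Java's standard ServiceLoader-based SPI, such a provider would be listed under the interface's fully qualified name in a META-INF/services file so it can be discovered at runtime; whether Eagle wires providers exactly this way is an assumption here.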
32. Eagle Query Framework
Persistence
• Metric
• Event
• Metadata
• Alert
• Log
• Customized Structure
• …
Query
• Search
• Filter
• Aggregation
• Sort
• Expression
• ….
The lightweight, metadata-driven store layer that serves the storage & query requirements commonly shared by most monitoring systems.
33. Customizable Dashboard
Provides real-time interactive visualization and analytics capability supporting a variety of data sources such as Eagle, Druid, and so on.
• Interactive: IPython-notebook-like interactive visualization analysis and troubleshooting.
• Dashboard: Customizable dashboard layout and drill-down paths; persist and share.
35. Open Source
First Use Case
Eagle to secure Hadoop in real time, built on the Eagle framework
External Partners
Hortonworks, Dataguise, PayPal and Apache Ranger
Components to Open Source Next
JPA (“Job Performance Analyzer”), HBase and GC monitoring, and more are being open sourced soon
36. Reference
Eagle at Hadoop Summit 2015, San Jose
http://2015.hadoopsummit.org
Slides | Video
Eagle at Big Data Summit 2014, Shanghai
http://2014ebay.csdn.net/m/zone/ebay_en
Slides | Video
37. The End & Thanks
If you want to go fast, go alone.
If you want to go far, go together.
-- African Proverb
Hao Chen
hchen9@ebay.com | @haozch
38. We are Hiring Now
https://careers.ebayinc.com
Or contact me: hchen9@ebay.com
Editor's Notes
Anomaly detection algorithm
Continuously crawl job history files immediately after a job completes
Calculate the minute-level task failure ratio for each node
A node is identified as anomalous when either of the following two conditions happens:
The node continuously fails tasks
The node has a significantly higher failure ratio than the rest of the nodes in the cluster
Inspired by TSDB, Ganglia, Nagios, Zabbix, etc. Most of them focus on infrastructure-level data collection and alerting, but they don’t consider business-logic complexity: how to prepare the data.
IG: Information Gain. From probability theory and information theory; it is asymmetric and measures the difference between two probability distributions P and Q, describing the difference between encoding with Q and then encoding with P. Usually P represents the distribution of samples or observations, possibly an exactly computed theoretical distribution, while Q represents a theory, model, description, or approximation of P. Purpose: feature selection.
PCA: Principal Component Analysis, a multivariate statistical method that applies linear transformations to many variables to select a smaller number of important ones. Also known as principal component analysis. http://baike.baidu.com/view/45376.htm?fromtitle=principal+component+analysis&type=syn
GMM: Gaussian Mixture Model (also written MOG, Mixture of Gaussians). It quantifies a phenomenon precisely with Gaussian probability density functions (normal distribution curves), decomposing it into several component models built from Gaussian probability density functions. http://baike.baidu.com/view/3767607.htm
MCC: Matthews Correlation Coefficient. http://baike.baidu.com/view/3767607.html
Data loss prevention
Get alerted and stop a malicious user trying to copy, delete, or move sensitive data from the Hadoop cluster.
Malicious logins
Detect logins where a malicious user tries to guess passwords. Eagle creates user profiles using a machine learning algorithm to detect anomalies. This anomaly detection, together with the policy for user logins, would trigger an alert and block the user from accessing sensitive datasets.
Unauthorized access
Detect and stop a malicious user trying to access classified data without privilege. Unauthorized access is one facet of Eagle user profiles, and the machine learning algorithm will detect the anomaly. This anomaly detection, together with the policy on unauthorized access to classified data, would trigger an alert to the user's manager.
Malicious user operation
Detect and stop a malicious user trying to delete a large amount of data. Operation type is one parameter of Eagle user profiles. Eagle supports multiple native operation types.
Dataguise delivers data privacy protection and risk assessment analytics that allow organizations to safely leverage and share enterprise data. Its solutions simplify governance as they proactively locate sensitive data, automatically protect it with appropriate remediation policies, and provide actionable compliance intelligence to decision makers in real time. In Hadoop deployments, the solutions inspect incoming data and protect it before it is stored. These capabilities simplify risk management, improve operational efficiency, and reduce regulatory compliance costs.
Histogram Density Estimation
Kernel Density Estimation
EVD: in linear algebra, the eigenvalue decomposition of a matrix (Eigenvalue Decomposition). http://www.stats.ox.ac.uk/~sejdinov/teaching/HT15_lecture2-nup.pdf
Gaussian mixture models are used more for classification, while KDE methods such as Parzen windows are used more for probability density estimation.
http://blog.sina.com.cn/s/blog_6923201d01010tjo.html
As a framework, Eagle does not assume :
Data source (where, what)
Business logic execution path (how)
Policy engine implementation (how)
Data sink (where, what)
As a framework, Eagle does the following:
SQL-like service API
High-performing query framework
Lightweight streaming-processing Java API
Extensible policy engine implementation
Scalable and distributed rule evaluation
Metadata driven stream processing
Data source extensibility
Data sink extensibility
Interactive dashboard
Supports syntax:
Search
Aggregate
Time Series Histogram
Expression Filter
Paginations
Metadata definition ORM
High performance RESTful API
SQL-like declarative query syntax
Supporting HBase and RDBMS as storage
Logically partition by tags defined in annotation
Co-processor support
Secondary index support
Generic service client library
Within eBay, as more and more large-scale distributed systems are deployed on the enterprise platform, the need for monitoring large-scale distributed systems is especially strong. Eagle will take the Eagle framework as its core foundation, steadily growing the Eagle Apps ecosystem around business-logic characteristics while continuously optimizing the core framework itself.
We also believe this is not unique to eBay: most enterprise platforms that deploy and maintain such large-scale distributed systems face the same problems, and the larger the cluster, the greater the monitoring challenges in every respect, and the more pronounced Eagle's advantages for monitoring large-scale distributed systems become. We have long looked forward to exchanging ideas with the community, so as a starting point we will open source Eagle's code. On the one hand, eBay's efforts in monitoring large-scale distributed systems may help or serve as a reference for companies solving similar problems; at the same time, we hope for feedback from the industry and deeper discussion of our approach, from which we can learn as well. We might even work together to build an open source monitoring platform aimed at large-scale distributed systems.