Full-stack real-time monitoring framework for eBay Hadoop
Hao Chen | 陈浩
eBay Cloud Service
$ whoami
Hao Chen | 陈浩
Software Engineer
Analytics Data Infrastructure, Cloud Services
eBay Inc.
eBay’s Challenges in Monitoring
10+ large hadoop clusters
10,000+ nodes
50,000+ jobs per day
50,000,000+ tasks per day
500+ types of hadoop/hbase metrics
Billions of audit events per day
Large Scale in Real Time
Various Business Logic
Data Security
Complex and Scalable Policy
Join multiple data sources
Threshold based, windows based
Multiple metrics correlation
Metrics pre-aggregations
Machine learning based
Engineering Modularization
Varieties of data sources
Varieties of data collectors
Complex business logic
Alert rules can’t be hot deployed
Scalability issue with single process
What’s Eagle
A uniform monitoring and alerting framework for monitoring
large-scale distributed systems such as Hadoop, Spark, and
cloud in real time.
Eagle = Eagle Framework + Eagle Apps
Eagle Ecosystem
• HBase
• Spark
• Web Portal
• REST Services
• Ambari Plugin
• Kafka
• Storm
• HBase
• Druid
• Elastic Search
Eagle Framework
Provides a full-stack monitoring framework for efficiently
developing highly scalable real-time monitoring applications.
Eagle Apps
Provides built-in monitoring applications for domains such as Hadoop,
Spark, HBase, Storm, and cloud.
Eagle Integration
Integrates with a distributed real-time execution environment such as
Storm, a message bus such as Kafka, and a storage layer such as HBase,
and also supports extensions.
Eagle Interface
Allows accessing or managing Eagle through a REST service, web UI,
or Ambari plugin.
Eagle App Highlights
JPA: Job Performance Analyzer
DAM: Security Data Activity Monitoring
JPA: Job Performance Analyzer
Historical job analysis
Running job analysis
Anomaly host detection
Job data skew detection
Job performance suggestion
Anomaly Prediction based on machine learning
Monitor and analyze job performance in real-time
Historical Job Analyzer
• Job historical performance trend
• Task and attempt distribution
• Resource utilization at various levels
(cluster/job/user/host)
• Historical performance anomaly detection
• TooLowBytesConsumedPerCPUSecond
• JobStatisticLongDuration
• TooLargeReduceNumAlert
• TooLargeShuffleSizeAlert
Running Job Analyzer
• Monitor running jobs in real time
• Minute-level job progress snapshots
• Minute-level resource usage
• CPU, HDFS I/O, Disk I/O, slot seconds
• Roll up to user/queue/cluster level
• Anomaly running status detection
• TooLongJobDuration
• NoProgressForLong
• TooManyTaskFailure
Use Case: Detect node anomalies by analyzing the task failure ratio across all nodes
Assumption: The task failure ratio of every node should be approximately equal
Algorithm: Node-by-node comparison (symmetry violation) and per-node trend
Task Failure based Anomaly Host Detection
[Figure panels: Alerting: Anomaly Detection; Insight: Task failure drill-down; Counters & Features]
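The node-by-node comparison can be illustrated with a small sketch. This is not Eagle's implementation: the class name, the use of the median as the reference value, and the `minGap` parameter are all assumptions made for illustration, and the per-node trend part of the algorithm is not covered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: flag hosts whose task failure ratio breaks the
// "all nodes should look alike" assumption.
final class HostAnomalySketch {
    /** Failure ratio = failed / total; 0 when the node ran no tasks. */
    static double failureRatio(long failed, long total) {
        return total == 0 ? 0.0 : (double) failed / total;
    }

    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /**
     * countsByHost maps host -> {failedTasks, totalTasks}.
     * A host is flagged when its ratio exceeds the cluster median by more
     * than minGap (an assumed rule, not taken from the slides).
     */
    static List<String> anomalousHosts(Map<String, long[]> countsByHost, double minGap) {
        List<String> flagged = new ArrayList<>();
        if (countsByHost.isEmpty()) {
            return flagged;
        }
        // TreeMap keeps hosts in a stable, sorted order.
        Map<String, Double> ratios = new TreeMap<>();
        for (Map.Entry<String, long[]> e : countsByHost.entrySet()) {
            ratios.put(e.getKey(), failureRatio(e.getValue()[0], e.getValue()[1]));
        }
        double med = median(ratios.values().stream().mapToDouble(Double::doubleValue).toArray());
        for (Map.Entry<String, Double> e : ratios.entrySet()) {
            if (e.getValue() - med > minGap) {
                flagged.add(e.getKey());
            }
        }
        return flagged;
    }
}
```

Using the median rather than the mean keeps a single bad host from shifting the reference value it is compared against.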
Use Case: Detect data skew from the statistics and distributions of attempt execution durations and counters
Assumption: Durations and counters should follow a normal distribution
Real-time Data Skew Detection
Modeling & Statistics
Max z-score
Threshold & Detection
Correlation > 0.9
& Max(Z-Score) > 90%
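A minimal sketch of the two statistics named above, the maximum z-score and a correlation coefficient. The slides do not say which series are correlated or how the "90%" threshold maps onto a z-score, so this sketch assumes the correlation is between attempt durations and one counter (for example input records) and takes both thresholds as parameters. All names are hypothetical.

```java
// Hypothetical sketch of skew statistics over the attempts of one job.
final class DataSkewSketch {
    static double mean(double[] x) {
        double sum = 0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    /** Population standard deviation. */
    static double stdDev(double[] x) {
        double m = mean(x);
        double sq = 0;
        for (double v : x) {
            sq += (v - m) * (v - m);
        }
        return Math.sqrt(sq / x.length);
    }

    /** Largest absolute z-score in the series; 0 when all values are equal. */
    static double maxZScore(double[] x) {
        double m = mean(x);
        double s = stdDev(x);
        if (s == 0) {
            return 0;
        }
        double max = 0;
        for (double v : x) {
            max = Math.max(max, Math.abs(v - m) / s);
        }
        return max;
    }

    /** Pearson correlation; 0 when either series has no variance. */
    static double pearson(double[] x, double[] y) {
        double mx = mean(x);
        double my = mean(y);
        double cov = 0, vx = 0, vy = 0;
        for (int i = 0; i < x.length; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            vx += (x[i] - mx) * (x[i] - mx);
            vy += (y[i] - my) * (y[i] - my);
        }
        double denom = Math.sqrt(vx * vy);
        return denom == 0 ? 0 : cov / denom;
    }

    /** Skew is reported only when both conditions hold, mirroring the "&" on the slide. */
    static boolean isSkewed(double[] durations, double[] counters,
                            double minCorrelation, double minMaxZ) {
        return pearson(durations, counters) > minCorrelation
            && maxZScore(durations) > minMaxZ;
    }
}
```

Requiring both conditions means a slow attempt only counts as skew when its duration also tracks the amount of data it processed.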
Anomaly Prediction based on Machine Learning
• Anomaly Metric Predictive Detection
• Offline: Analyzing and combining 500+ metrics together for causal anomaly
detections (IG -> PCA -> GMM -> MCC)
• Online: Predictively alert for anomaly metrics
Normal (Green) and Abnormal (Red)
Data and Probability Distribution and Threshold
PCA (Principal Component Analysis)
DAM: Data Activity Monitoring
Secure hadoop in real-time
Security Use Cases
Security Architecture Overview
Security Components Highlights
Security Machine Learning Integration
Security Use Cases
Data Loss Prevention
Get alerted and stop a malicious user trying to copy, delete, or move sensitive data from the Hadoop cluster.
Malicious Logins
Detect logins in which a malicious user tries to guess a password. Eagle creates user profiles with a machine
learning algorithm to detect anomalies.
Unauthorized access
Detect and stop a malicious user trying to access classified data without privilege.
Malicious user operation
Detect and stop a malicious user trying to delete a large amount of data. Operation type is one parameter of Eagle
user profiles. Eagle supports multiple native operation types.
Security Architecture Overview
Security Component Highlights
Policy Manager
Expressive language - create and modify policies for alerting and remediation on certain data activity
monitoring events.
Data classification
Integrate with Dataguise & Apache Ranger.
Policy-based Remediation
Ability to detect and stop a threat, improve operational efficiencies, and reduce regulatory compliance costs.
User Profiling
Uses machine learning to automatically generate anomaly detection policies
User Activity Exploration
Ability to drill down into alert details to understand the data security threat
Security Machine Learning Integration
• User Activity Profiling
• Offline: Determine the bandwidth and other kernel density function (KDE)
parameters from the training dataset
• Online: If a test data point lies outside the trained bandwidth, it is an anomaly
PCs (Principal Components) in EVD (Eigenvalue Decomposition)
Kernel Density Function
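The offline/online split for KDE-based profiling can be sketched as follows. This is an illustration under stated assumptions rather than Eagle's model: it uses a one-dimensional Gaussian kernel, picks the bandwidth with the normal-reference rule of thumb (h = 1.06 * sigma * n^(-1/5)), and swaps the slide's "outside the trained bandwidth" test for a simple minimum-density threshold. The class and method names are hypothetical.

```java
// Hypothetical 1-D Gaussian KDE profile for one user activity feature.
final class KdeProfileSketch {
    private final double[] samples;
    private final double bandwidth;

    KdeProfileSketch(double[] samples, double bandwidth) {
        if (samples.length == 0 || bandwidth <= 0) {
            throw new IllegalArgumentException("need samples and a positive bandwidth");
        }
        this.samples = samples.clone();
        this.bandwidth = bandwidth;
    }

    /** Offline step: normal-reference rule of thumb for the bandwidth. */
    static double ruleOfThumbBandwidth(double[] samples) {
        double mean = 0;
        for (double s : samples) {
            mean += s;
        }
        mean /= samples.length;
        double sq = 0;
        for (double s : samples) {
            sq += (s - mean) * (s - mean);
        }
        double sigma = Math.sqrt(sq / samples.length);
        return 1.06 * sigma * Math.pow(samples.length, -0.2);
    }

    /** Estimated probability density at x. */
    double density(double x) {
        double sum = 0;
        for (double s : samples) {
            double u = (x - s) / bandwidth;
            sum += Math.exp(-0.5 * u * u) / Math.sqrt(2 * Math.PI);
        }
        return sum / (samples.length * bandwidth);
    }

    /** Online step: a point is anomalous when its density is below minDensity. */
    boolean isAnomaly(double x, double minDensity) {
        return density(x) < minDensity;
    }
}
```

A profile would be trained per user (and per feature, such as operation counts in a time slot), then consulted for each incoming observation.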
Security Machine Learning Integration
• User Activity Profiling on Spark
Historical Audit
Real-time Audit
Batch Preprocess
User Profile Model Generation (KDE + EVD)
Eagle Storage
HDFS
Policy Engine
Online detection on Storm
Offline training on Spark
Archived data
Real-time stream
Persist model
Dynamically load models & policies
Alert Consumer
Persist alert
Eagle Security
Eagle Monitoring Framework
Full-stack real time monitoring framework
• Data collector -> data processing -> metric pre-agg/alert engine -> storage -> dashboards
• We need to create a framework that covers the full stack of a monitoring system
Monitoring Programming Paradigm
Eagle Monitoring Framework Highlights
Lightweight Streaming Process Framework
Extensible & Scalable Policy Framework
Eagle Query Framework
Customizable Dashboards
Step 1: Task DAG graph setup
Eagle Stream Data Processing API
protected void buildDependency(FlowDef def, DataProcessConfig config) {
Task header = Task.newTask("wordgenerator").setExecutor(source).completeBuild();
Task uppertask = Task.newTask("uppercase").setExecutor(new
Task groupbyUppercaseTask = Task.newTask("groupby_uppercase").setExecutor(new
Step 2: Inter-task data exchange protocol
protected void buildDependency(FlowDef def, DataProcessConfig config) {
Task header = Task.newTask("wordgenerator").setExecutor(source).completeBuild();
Task uppertask = Task.newTask("uppercase").setExecutor(new
Task groupbyUppercaseTask = Task.newTask("groupby_uppercase").setExecutor(new
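The idea behind a task DAG, where named tasks are connected and then executed in dependency order, can be sketched independently of Eagle. The class below is hypothetical and is not the Eagle stream processing API; it only reuses the task names from the snippet above and orders them with Kahn's topological sort.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a task DAG; not Eagle's FlowDef/Task classes.
final class TaskGraphSketch {
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    TaskGraphSketch addTask(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        inDegree.putIfAbsent(name, 0);
        return this;
    }

    /** Declares that output of 'from' feeds 'to'. */
    TaskGraphSketch connect(String from, String to) {
        addTask(from);
        addTask(to);
        downstream.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
        return this;
    }

    /** Kahn's algorithm: a task is emitted once all of its upstream tasks are. */
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String task = ready.poll();
            order.add(task);
            for (String next : downstream.get(task)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("cycle detected in task graph");
        }
        return order;
    }
}
```

For example, `new TaskGraphSketch().connect("wordgenerator", "uppercase").connect("uppercase", "groupby_uppercase").executionOrder()` lists the three tasks from source to sink.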
Execution Graph development, compile and deploy
Development / Compile Phase
Deployment / Runtime Phase
Extensible & Scalable Policy Framework
• Declarative Policy Definition Syntax
• Stream Metadata (event attribute name, attribute type, attribute value resolver, …)
• Dynamic policy partitioning across compute nodes based on configurable partition class
• Dynamic policy deployment
• Event partitioning by storm and policy partitioning by Eagle (N events * M policies)
• Support new policy evaluation engine, for example Siddhi, Esper, Machine learning etc.
Usability of Policy Framework
Case: HBase RegionServer call queue length is high
Policy: In the past 30 minutes, the call queue length exceeded 2000 at least 20 times
from RegionCallQueueLength[value>2000]#window.Extension:messageTimeWindow(30 min)
select host, value, avg(value) as avgValue, count(*) as count
group by host
having count >= 20
insert into HighRegionServerCallQueueLengthStream;
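To make the semantics of this policy concrete, here is a plain-Java approximation of what the query expresses: per host, count the events with value > 2000 inside a 30-minute window and alert at 20 or more. It is a sketch only and does not reflect how Siddhi or Eagle's `messageTimeWindow` extension evaluate the query; the class name is hypothetical, and events are assumed to arrive in timestamp order per host.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sliding-window counter mirroring the policy above.
final class CallQueuePolicySketch {
    private final long windowMillis;
    private final double valueThreshold;
    private final int minCount;
    // Per host: timestamps of events that exceeded the threshold.
    private final Map<String, Deque<Long>> hitsByHost = new HashMap<>();

    CallQueuePolicySketch(long windowMillis, double valueThreshold, int minCount) {
        this.windowMillis = windowMillis;
        this.valueThreshold = valueThreshold;
        this.minCount = minCount;
    }

    /** Feeds one metric event; returns true while the host violates the policy. */
    boolean onEvent(String host, long timestampMillis, double value) {
        Deque<Long> hits = hitsByHost.computeIfAbsent(host, h -> new ArrayDeque<>());
        if (value > valueThreshold) {
            hits.addLast(timestampMillis);
        }
        // Evict hits that have fallen out of the time window.
        while (!hits.isEmpty() && hits.peekFirst() <= timestampMillis - windowMillis) {
            hits.removeFirst();
        }
        return hits.size() >= minCount;
    }
}
```

A policy instance for the case above would be created as `new CallQueuePolicySketch(30 * 60_000L, 2000.0, 20)`.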
Scalability of Policy Evaluation
Dynamic Policy Partition
• N users with 3 partitions and M
policies with 2 partitions give 3 * 2 = 6
physical tasks
• Physical partition + Policy-level
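The 3 * 2 arithmetic can be pictured as a grid: each physical task handles one event partition and one policy partition. The sketch below is an assumption about how such a grid could be addressed, using hash-based partitioning and hypothetical names; it is not Eagle's partitioner.

```java
// Hypothetical grid of physical tasks: rows are event (user) partitions,
// columns are policy partitions.
final class PolicyPartitionSketch {
    private final int eventPartitions;
    private final int policyPartitions;

    PolicyPartitionSketch(int eventPartitions, int policyPartitions) {
        this.eventPartitions = eventPartitions;
        this.policyPartitions = policyPartitions;
    }

    int physicalTaskCount() {
        return eventPartitions * policyPartitions;
    }

    int eventPartitionOf(String userId) {
        return Math.floorMod(userId.hashCode(), eventPartitions);
    }

    int policyPartitionOf(String policyId) {
        return Math.floorMod(policyId.hashCode(), policyPartitions);
    }

    int taskIndex(int eventPartition, int policyPartition) {
        return eventPartition * policyPartitions + policyPartition;
    }

    /** An event must reach every policy partition within its own row. */
    int[] tasksForEvent(String userId) {
        int row = eventPartitionOf(userId);
        int[] tasks = new int[policyPartitions];
        for (int col = 0; col < policyPartitions; col++) {
            tasks[col] = taskIndex(row, col);
        }
        return tasks;
    }

    /** A policy is deployed to every event partition within its own column. */
    int[] tasksForPolicy(String policyId) {
        int col = policyPartitionOf(policyId);
        int[] tasks = new int[eventPartitions];
        for (int row = 0; row < eventPartitions; row++) {
            tasks[row] = taskIndex(row, col);
        }
        return tasks;
    }
}
```

With this layout, any given event and any given policy meet on exactly one physical task, so each (event, policy) pair is evaluated once.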
Extensibility of Policy Framework
public interface PolicyEvaluatorServiceProvider {
    public String getPolicyType();
    public Class<? extends PolicyEvaluator> getPolicyEvaluator();
    public Class<? extends PolicyDefinitionParser> getPolicyDefinitionParser();
    public Class<? extends PolicyEvaluatorBuilder> getPolicyEvaluatorBuilder();
    public List<Module> getBindingModules();
}
Policy evaluator providers use SPI to register policy engine implementations
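A simplified sketch of the registration pattern follows. The interfaces here are deliberately reduced, hypothetical stand-ins and not the Eagle types shown above. `java.util.ServiceLoader` is the standard Java SPI mechanism: it discovers implementations listed in `META-INF/services` files on the classpath, which a self-contained snippet cannot include, so the example also registers a toy engine by hand.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

/** Hypothetical, simplified provider contract keyed by policy type. */
interface PolicyEngineProvider {
    String getPolicyType();
    PolicyCheck newEvaluator(String policyDefinition);
}

/** Hypothetical evaluator: true when a value violates the policy. */
interface PolicyCheck {
    boolean violates(double value);
}

final class PolicyEngineRegistry {
    private final Map<String, PolicyEngineProvider> providers = new HashMap<>();

    void register(PolicyEngineProvider provider) {
        providers.put(provider.getPolicyType(), provider);
    }

    /** Registers every provider that ServiceLoader finds on the classpath. */
    void loadFromClasspath() {
        for (PolicyEngineProvider provider : ServiceLoader.load(PolicyEngineProvider.class)) {
            register(provider);
        }
    }

    PolicyCheck evaluatorFor(String policyType, String policyDefinition) {
        PolicyEngineProvider provider = providers.get(policyType);
        if (provider == null) {
            throw new IllegalArgumentException("No engine registered for type: " + policyType);
        }
        return provider.newEvaluator(policyDefinition);
    }
}

/** Toy engine whose policy definition is just a numeric upper bound. */
final class ThresholdEngineProvider implements PolicyEngineProvider {
    public String getPolicyType() {
        return "threshold";
    }

    public PolicyCheck newEvaluator(String policyDefinition) {
        double limit = Double.parseDouble(policyDefinition);
        return value -> value > limit;
    }
}
```

Keying providers by policy type lets a new engine be added without changing the code that evaluates policies.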
Built-in Supported Policy Engine
• Siddhi Complex Event Processing Engine
• Machine Learning based Policy Engine
Eagle Query Framework
• Metric
• Event
• Metadata
• Alert
• Log
• Customized
• …
• Search
• Filter
• Aggregation
• Sort
• Expression
• ….
A lightweight, metadata-driven storage layer that serves the storage and
query requirements commonly shared by most monitoring systems
• Interactive: IPython notebook-like
interactive visualization analysis and
• Dashboard: Customizable dashboard layout
and drill-down path, persist and share.
Customizable Dashboard
Provides real-time interactive visualization and analytics, supporting a variety of
data sources such as Eagle and Druid.
Eagle in Future
The general monitoring platform for large-scale systems at eBay
Open Source
First Use Case
Eagle to secure Hadoop in real time based on Eagle framework
External Partners
Hortonworks, Dataguise, Paypal and Apache Ranger
Following Components to Open Source
JPA (“Job Performance Analyzer”), HBase and GC monitoring, and other components
will be open-sourced soon
Eagle at Hadoop Summit 2015, San Jose
Slides | Video
Eagle at Big Data Summit 2014, Shanghai
Slides | Video
The End & Thanks
If you want to go fast, go alone.
If you want to go far, go together.
-- African Proverb
Hao Chen | @haozch
We are Hiring Now
Or contact me:

Bert Blevins
20240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 202420240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 2024
Matthew Sinclair
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
20240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 202420240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 2024
Matthew Sinclair
Mitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing SystemsMitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing Systems
The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
Larry Smarr
Manual | Product | Research Presentation
Manual | Product | Research PresentationManual | Product | Research Presentation
Manual | Product | Research Presentation
Best Programming Language for Civil Engineers
Best Programming Language for Civil EngineersBest Programming Language for Civil Engineers
Best Programming Language for Civil Engineers
Awais Yaseen
What's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptxWhat's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptx
Stephanie Beckett

Recently uploaded (20)

Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Understanding Insider Security Threats: Types, Examples, Effects, and Mitigat...
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...
20240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 202420240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 2024
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfBT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdf
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
Quantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLMQuantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLM
Observability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetryObservability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetry
WPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide DeckWPRiders Company Presentation Slide Deck
WPRiders Company Presentation Slide Deck
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly DetectionAdvanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
20240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 202420240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 2024
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
20240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 202420240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 2024
Mitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing SystemsMitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing Systems
The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
Manual | Product | Research Presentation
Manual | Product | Research PresentationManual | Product | Research Presentation
Manual | Product | Research Presentation
Best Programming Language for Civil Engineers
Best Programming Language for Civil EngineersBest Programming Language for Civil Engineers
Best Programming Language for Civil Engineers
What's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptxWhat's New in Copilot for Microsoft365 May 2024.pptx
What's New in Copilot for Microsoft365 May 2024.pptx

Eagle from eBay at China Hadoop Summit 2015

  • 1. HADOOP Full-stack real-time monitoring framework for eBay Hadoop Hao Chen | 陈浩 eBay Cloud Service
  • 2. $ whoami Hao Chen | 陈浩 Software Engineer Analytics Data Infrastructure, Cloud Services eBay Inc.
  • 3. eBay’s Challenges in Monitoring. Large Scale in Real Time: 10+ large Hadoop clusters; 10,000+ nodes; 50,000+ jobs per day; 50,000,000+ tasks per day; 500+ types of Hadoop/HBase metrics; billions of audit events per day. Various Business Logic: Hadoop, HBase, Spark, data security, hardware, cloud, database. Complex and Scalable Policy: join multiple data sources; threshold based and window based; multiple metrics correlation; metrics pre-aggregation; machine learning based. Engineering Modularization: varieties of data sources; varieties of data collectors; complex business logic; alert rules can’t be hot deployed; scalability issue with a single process.
  • 4. What’s Eagle: The uniform monitoring and alerting framework for monitoring large-scale distributed systems such as Hadoop, Spark, and cloud in real time. Eagle = Eagle Framework + Eagle Apps
  • 5. Eagle Ecosystem. Apps: DAM, JPA, HBase, Spark. Interface: Web Portal, REST Services, Ambari Plugin. Integration: Kafka, Storm, HBase, Druid, Elastic Search. Eagle Framework: provides a full-stack monitoring framework for efficiently developing highly scalable real-time monitoring applications. Eagle Apps: provides built-in monitoring applications for domains such as Hadoop, Spark, HBase, Storm, and cloud. Eagle Integration: integrates with distributed real-time execution environments such as Storm, message buses such as Kafka, and storage layers such as HBase, and also supports extensions. Eagle Interface: allows accessing or managing Eagle through the REST service, web UI, or Ambari plugin.
  • 6. Eagle App Highlights. JPA: Job Performance Analyzer. DAM: Security Data Activity Monitoring.
  • 7. JPA: Job Performance Analyzer. Monitor and analyze job performance in real time: historical job analysis; running job analysis; anomaly host detection; job data skew detection; job performance suggestion; anomaly prediction based on machine learning.
  • 8. Historical Job Analyzer • Job historical performance trend • Task and attempt distribution • Resource utilization at various levels (cluster/job/user/host) • Anomaly historical performance detection • TooLowBytesConsumedPerCPUSecond • JobStatisticLongDuration • TooLargeReduceNumAlert • TooLargeShuffleSizeAlert
  • 9. Running Job Analyzer • Monitoring running jobs in real time • Minute-level job progress snapshots • Minute-level resource usage snapshots • CPU, HDFS I/O, Disk I/O, slot seconds • Roll up to user/queue/cluster level • Anomaly running status detection • TooLongJobDuration • NoProgressForLong • TooManyTaskFailure
  • 10. Task Failure based Anomaly Host Detection. Use Case: Detect node anomalies by analyzing the task failure ratio across all nodes. Assumption: The task failure ratio should be approximately equal for every node. Algorithm: Node-by-node comparison (symmetry violation) and per-node trend.
  • 11. Task Failure based Anomaly Host Detection. Alerting: Anomaly Detection & Alerting. Insight: Task failure drill-down.
  • 12. Real-time Data Skew Detection. Use Case: Detect data skew from statistics and distributions of attempt execution durations and counters. Assumption: Durations and counters should follow a normal distribution. Counters & Features: mapDuration, reduceDuration, mapInputRecords, reduceInputRecords, combineInputRecords, mapSpilledRecords, reduceShuffleRecords, mapLocalFileBytesRead, reduceLocalFileBytesRead, mapHDFSBytesRead, reduceHDFSBytesRead. Modeling & Statistics: Avg, Min, Max, Distributions, Max z-score, Top-N, Correlation. Threshold & Detection: Counters Correlation > 0.9 & Max(Z-Score) > 90%
  • 14. Anomaly Prediction based on Machine Learning • Anomaly Metric Predictive Detection • Offline: Analyzing and combining 500+ metrics for causal anomaly detection (IG -> PCA -> GMM -> MCC) • Online: Predictively alert on anomalous metrics. Figures: Normal (Green) and Abnormal (Red) Data; Probability Distribution and Threshold Selection; PCA (Principal Component Analysis)
  • 15. Anomaly Prediction based on Machine Learning • Anomaly Metric Predictive Detection
  • 16. DAM: Data Activity Monitoring. Secure Hadoop in real time: Security Use Cases; Security Architecture Overview; Security Components Highlights; Security Machine Learning Integration.
  • 17. Security Use Cases. Data Loss Prevention: Get alerted and stop a malicious user trying to copy, delete, or move sensitive data from the Hadoop cluster. Malicious Logins: Detect logins when a malicious user tries to guess a password. Eagle creates user profiles using a machine learning algorithm to detect anomalies. Unauthorized Access: Detect and stop a malicious user trying to access classified data without privilege. Malicious User Operation: Detect and stop a malicious user trying to delete a large amount of data. Operation type is one parameter of Eagle user profiles. Eagle supports multiple native operation types.
  • 19. Security Component Highlights. Policy Manager: Expressive language to create and modify policies for alerting and remediation on certain data activity monitoring events. Data Classification: Integrates with Dataguise & Apache Ranger. Policy-based Remediation: Ability to detect and stop a threat, improve operational efficiencies, and reduce regulatory compliance costs. User Profiling: Based on machine learning, to automatically generate anomaly detection policies. User Activity Exploration: Ability to drill down into alert details to understand the data security threat.
  • 20. Security Machine Learning Integration • User Activity Profiling • Offline: Determine the kernel density function parameters (bandwidth) from the training dataset (KDE) • Online: If a test data point lies outside the trained bandwidth, it is an anomaly (Policy). Figures: PCs (Principal Components) in EVD (Eigenvalue Decomposition); Kernel Density Function
  • 21. Security Machine Learning Integration • User Activity Profiling on Spark. Diagram labels: Historical Audit Events; Real-time Audit Events; Batch Preprocess; User Profile Model Generation (KDE + EVD Algorithm); Eagle Storage; HDFS; Stream Preprocess; Policy Engine; Online detection on Storm; Offline training on Spark; Archived data; Real-time stream; Kafka; Persist model; Dynamically load models & policies; Alert Consumer; Persist alert; Eagle Security Plugins
  • 22. Eagle Monitoring Framework. Eagle = Eagle Framework + Eagle Apps. Full-stack real-time monitoring framework
  • 23. Monitoring Programming Paradigm • Data collector -> data processing -> metric pre-agg/alert engine -> storage -> dashboards • We need to create a framework that covers the full stack of a monitoring system
  • 25. Eagle Monitoring Framework Highlights. Eagle = Eagle Framework + Eagle Apps. Lightweight Streaming Process Framework; Extensible & Scalable Policy Framework; Eagle Query Framework; Customizable Dashboards
  • 26. Eagle Stream Data Processing API. Step 1: Task DAG graph setup @Override protected void buildDependency(FlowDef def, DataProcessConfig config) { Task header = Task.newTask("wordgenerator").setExecutor(source).completeBuild(); Task uppertask = Task.newTask("uppercase").setExecutor(new UppercaseExecutor()).connectFrom(header).completeBuild(); Task groupbyUppercaseTask = Task.newTask("groupby_uppercase").setExecutor(new GroupbyCountExecutor()).connectFrom(uppertask).completeBuild(); def.endBy(groupbyUppercaseTask); } Step 2: Inter-task data exchange protocol @Override protected void buildDependency(FlowDef def, DataProcessConfig config) { Task header = Task.newTask("wordgenerator").setExecutor(source).completeBuild(); Task uppertask = Task.newTask("uppercase").setExecutor(new UppercaseExecutor()).connectFrom(header).completeBuild(); Task groupbyUppercaseTask = Task.newTask("groupby_uppercase").setExecutor(new GroupbyCountExecutor()).connectFrom(uppertask).completeBuild(); def.endBy(groupbyUppercaseTask); }
  • 27. Execution Graph: development, compile and deploy. Development / Compile Phase; Deployment / Runtime Phase
  • 28. Extensible & Scalable Policy Framework. Usability • Declarative policy definition syntax • Stream metadata (event attribute name, attribute type, attribute value resolver, …) Scalability • Dynamic policy partitioning across compute nodes based on a configurable partition class • Dynamic policy deployment • Event partitioning by Storm and policy partitioning by Eagle (N events * M policies) Extensibility • Supports new policy evaluation engines, for example Siddhi, Esper, and machine learning
  • 29. Usability of Policy Framework. Case: HBase region server high call queue length. Policy: In the past 30 minutes, call queue length > 2000 occurred more than 20 times. Query: from RegionCallQueueLength[value>2000]#window.Extension:messageTimeWindow(30 min) select host, value, avg(value) as avgValue, count(*) as count group by host having count >= 20 insert into HighRegionServerCallQueueLengthStream;
  • 30. Scalability of Policy Evaluation. Dynamic Policy Partition • N users with 3 partitions and M policies with 2 partitions give 3*2 physical tasks • Physical partition + policy-level partition
  • 31. Extensibility of Policy Framework. public interface PolicyEvaluatorServiceProvider { public String getPolicyType(); public Class<? extends PolicyEvaluator> getPolicyEvaluator(); public Class<? extends PolicyDefinitionParser> getPolicyDefinitionParser(); public Class<? extends PolicyEvaluatorBuilder> getPolicyEvaluatorBuilder(); public List<Module> getBindingModules(); } The Policy Evaluator Provider uses SPI to register policy engine implementations. Built-in Supported Policy Engines • Siddhi Complex Event Processing Engine • Machine Learning based Policy Engine
  • 32. Eagle Query Framework. The lightweight, metadata-driven store layer that serves the commonly shared storage and query requirements of most monitoring systems. Persistence • Metric • Event • Metadata • Alert • Log • Customized Structure • … Query • Search • Filter • Aggregation • Sort • Expression • …
  • 33. Customizable Dashboard. Provides real-time interactive visualization and analytics, supporting a variety of data sources such as Eagle and Druid. • Interactive: IPython notebook-like interactive visualization, analysis, and troubleshooting. • Dashboard: Customizable dashboard layout and drill-down path, which can be persisted and shared.
  • 34. Eagle in Future: The general monitoring platform for large-scale systems at eBay
  • 35. Open Source. First Use Case: Eagle to secure Hadoop in real time, based on the Eagle framework. External Partners: Hortonworks, Dataguise, PayPal, and Apache Ranger. Following Components to Open Source: JPA (“Job Performance Analyzer”), HBase and GC monitoring, and so on will be open sourced soon.
  • 36. Reference: Eagle at Hadoop Summit 2015, San Jose (Slides | Video); Eagle at Big Data Summit 2014, Shanghai (Slides | Video)
  • 37. The End & Thanks. If you want to go fast, go alone. If you want to go far, go together. -- African Proverb. Hao Chen | @haozch
  • 38. We are Hiring Now. Or contact me:
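
Slide 12 combines two signals for data skew: the correlation between attempt durations and a counter, and the maximum z-score of the durations. The following is a minimal Java sketch of that idea, not Eagle's implementation. The class name `SkewCheck`, the method names, and the threshold parameters are hypothetical, and because the slide's "Max(Z-Score) > 90%" threshold is ambiguous, the sketch simply takes a plain z-score cutoff as an argument.

```java
import java.util.Arrays;

/** Illustrative skew check; names and thresholds are assumptions, not Eagle's API. */
final class SkewCheck {
    private SkewCheck() {}

    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(0.0);
    }

    /** Largest absolute z-score in the sample, using the population standard deviation. */
    static double maxZScore(double[] xs) {
        double m = mean(xs);
        double sumSq = 0.0;
        for (double x : xs) sumSq += (x - m) * (x - m);
        double sd = Math.sqrt(sumSq / xs.length);
        if (sd == 0.0) return 0.0; // all values are equal
        double max = 0.0;
        for (double x : xs) max = Math.max(max, Math.abs(x - m) / sd);
        return max;
    }

    /** Pearson correlation coefficient between two equally sized samples. */
    static double pearson(double[] xs, double[] ys) {
        if (xs.length != ys.length) throw new IllegalArgumentException("length mismatch");
        double mx = mean(xs), my = mean(ys);
        double cov = 0.0, vx = 0.0, vy = 0.0;
        for (int i = 0; i < xs.length; i++) {
            cov += (xs[i] - mx) * (ys[i] - my);
            vx += (xs[i] - mx) * (xs[i] - mx);
            vy += (ys[i] - my) * (ys[i] - my);
        }
        if (vx == 0.0 || vy == 0.0) return 0.0; // correlation undefined for constant input
        return cov / Math.sqrt(vx * vy);
    }

    /** Flags skew when duration tracks the counter closely and one attempt is an outlier. */
    static boolean isSkewed(double[] durations, double[] counter, double minCorr, double minZ) {
        return pearson(durations, counter) > minCorr && maxZScore(durations) > minZ;
    }
}
```

For example, `SkewCheck.isSkewed(mapDurations, mapInputRecords, 0.9, 2.0)` would flag a job in which one map attempt reads far more records, and runs far longer, than its peers.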
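
The policy on slide 29 counts, per host, how often the call queue length exceeds 2000 within a 30-minute window and fires once the count reaches 20. The sketch below mimics that logic in plain Java to show what the declarative query evaluates. It is not Siddhi and not Eagle code: the class `CallQueueWindowPolicy` and its parameters are hypothetical, it follows the query's `count >= 20` condition, and it assumes events arrive in timestamp order for each host.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-host sliding-window counter for threshold violations. */
final class CallQueueWindowPolicy {
    private final long windowMillis;
    private final double valueThreshold;
    private final int minCount;
    // Timestamps of violating samples per host, oldest first.
    private final Map<String, Deque<Long>> hitsByHost = new HashMap<>();

    CallQueueWindowPolicy(long windowMillis, double valueThreshold, int minCount) {
        this.windowMillis = windowMillis;
        this.valueThreshold = valueThreshold;
        this.minCount = minCount;
    }

    /** Feeds one sample and returns true when the host currently violates the policy. */
    boolean onEvent(String host, long timestampMillis, double value) {
        Deque<Long> hits = hitsByHost.computeIfAbsent(host, h -> new ArrayDeque<>());
        if (value > valueThreshold) {
            hits.addLast(timestampMillis);
        }
        // Drop violations that have fallen out of the time window.
        while (!hits.isEmpty() && hits.peekFirst() <= timestampMillis - windowMillis) {
            hits.removeFirst();
        }
        return hits.size() >= minCount;
    }
}
```

A policy matching the slide would be built as `new CallQueueWindowPolicy(30L * 60_000L, 2000.0, 20)`. Keeping state per host corresponds to the query's `group by host`.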
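
Slide 30 multiplies event partitions by policy partitions to get the number of physical tasks (3 * 2 = 6 in its example). One way to picture this is a small mapping from a (user, policy) pair to a task index, sketched below. The class `PolicyPartitioner` and the hash-based assignment are assumptions for illustration; the slides only say that the partition class is configurable.

```java
/** Hypothetical mapping of (event key, policy id) pairs onto physical tasks. */
final class PolicyPartitioner {
    private final int eventPartitions;
    private final int policyPartitions;

    PolicyPartitioner(int eventPartitions, int policyPartitions) {
        if (eventPartitions < 1 || policyPartitions < 1) {
            throw new IllegalArgumentException("partition counts must be positive");
        }
        this.eventPartitions = eventPartitions;
        this.policyPartitions = policyPartitions;
    }

    /** Number of physical tasks: one per (event partition, policy partition) pair. */
    int totalTasks() {
        return eventPartitions * policyPartitions;
    }

    int eventPartition(String userKey) {
        return Math.floorMod(userKey.hashCode(), eventPartitions);
    }

    int policyPartition(String policyId) {
        return Math.floorMod(policyId.hashCode(), policyPartitions);
    }

    /** Index of the task that evaluates this policy for this user's events. */
    int taskFor(String userKey, String policyId) {
        return eventPartition(userKey) * policyPartitions + policyPartition(policyId);
    }
}
```

With `new PolicyPartitioner(3, 2)`, each task sees only one third of the users and evaluates only about half of the policies, so neither the event volume nor the policy count has to fit in a single process.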

Editor's Notes

  1. Anomaly detection algorithm: Continuously crawl job history files immediately after each job completes. Calculate the minute-level job failure ratio for each node. A node is identified as anomalous when either of the following two conditions holds: the node continuously fails the tasks running on it, or the node has a significantly higher failure ratio than the rest of the nodes in the cluster.
  4. Inspired by TSDB, Ganglia, Nagios, Zabbix, etc. Most of them focus on infrastructure-level data collection and alerting, but they don’t consider business logic complexity, such as how to prepare data.
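  11. IG: Information Gain. A concept from probability and information theory; it is asymmetric and measures the difference between two probability distributions P and Q. Information gain describes the difference between encoding with Q and then encoding with P. P usually represents the distribution of samples or observations, or possibly an exactly computed theoretical distribution; Q represents a theory, model, description, or approximation of P. Purpose: feature selection. PCA: Principal Component Analysis, a multivariate statistical method that applies a linear transformation to many variables in order to select a smaller number of important ones. GMM: Gaussian Mixture Model, also abbreviated MOG (Mixture of Gaussians). It uses Gaussian probability density functions (normal distribution curves) to quantify things precisely, decomposing one thing into several models built from Gaussian probability density functions. MCC: Matthews correlation coefficient.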
  13. Data loss prevention: Get alerted and stop a malicious user trying to copy, delete, or move sensitive data from the Hadoop cluster. Malicious logins: Detect logins when a malicious user tries to guess a password. Eagle creates user profiles using a machine learning algorithm to detect anomalies. This anomaly detection, together with the policy for user logins, would trigger an alert and block this user from accessing sensitive datasets. Unauthorized access: Detect and stop a malicious user trying to access classified data without privilege. Unauthorized access is one factor in Eagle user profiles, and the machine learning algorithm will detect the anomaly. This anomaly detection, together with the policy on user access to classified data without authorization, would trigger an alert to the user's manager. Malicious user operation: Detect and stop a malicious user trying to delete a large amount of data. Operation type is one parameter of Eagle user profiles. Eagle supports multiple native operation types. Dataguise delivers data privacy protection and risk assessment analytics that allow organizations to safely leverage and share enterprise data. Our solutions simplify governance as they proactively locate sensitive data, automatically protect it with appropriate remediation policies, and provide actionable compliance intelligence to decision makers in real time. In Hadoop deployments, our solutions inspect incoming data and protect it before it is stored. These capabilities simplify risk management, improve operational efficiencies, and reduce regulatory compliance costs.
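  16. Histogram density estimation. Kernel density estimation (KDE). EVD: Eigenvalue Decomposition, a linear algebra technique that decomposes a matrix into a set of matrices. Gaussian mixture models are used more for classification, while KDE methods such as Parzen windows are used more for probability density estimation.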
  19. As a framework, Eagle does not assume: the data source (where, what); the business logic execution path (how); the policy engine implementation (how); the data sink (where, what). As a framework, Eagle provides the following: SQL-like service API; high-performing query framework; lightweight streaming process Java API; extensible policy engine implementation; scalable and distributed rule evaluation; metadata-driven stream processing; data source extensibility; data sink extensibility; interactive dashboard.
  21. Supports syntax: Search, Aggregate, Time Series Histogram, Expression, Filter, Pagination. Metadata definition ORM. High-performance RESTful API. SQL-like declarative query syntax. Supports HBase and RDBMS as storage. Logical partitioning by tags defined in annotations. Co-processor support. Secondary index support. Generic service client library.
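  22. Within eBay, as more and more large distributed systems are deployed on the enterprise platform, the need for monitoring large-scale distributed systems is especially strong. Eagle will build on the Eagle framework as its core foundation, steadily growing the Eagle Apps ecosystem around specific business logic while continuing to optimize the core framework itself. We also believe that not only eBay but most enterprise platforms face the same problems when deploying and maintaining these large distributed systems: the larger the cluster, the greater the monitoring challenges in every respect, and the more Eagle's strengths in monitoring large distributed systems will stand out. We have long looked forward to exchanging ideas on this topic, so as a starting point we will release Eagle's code as open source. On the one hand, eBay's work on monitoring large distributed systems may help, or serve as a reference for, companies that need to solve similar problems. On the other hand, we hope to get feedback from the industry and to discuss our approach in depth, so that we can learn from it as well. We could even work together to create an open source monitoring platform aimed at large distributed systems.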
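
The anomaly detection notes above name two conditions for flagging a node; the second (a node with a significantly higher failure ratio than the rest of the cluster) can be sketched as below. This is an illustration under stated assumptions, not the algorithm Eagle ships: the class `HostAnomalyDetector`, the `factor` multiplier, and the `minRatio` floor are hypothetical stand-ins for whatever "significantly higher" means in practice, and the first condition (continuous failures) is not covered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical node-by-node comparison of failure ratios within one time bucket. */
final class HostAnomalyDetector {
    private HostAnomalyDetector() {}

    static double failureRatio(int failed, int total) {
        return total == 0 ? 0.0 : (double) failed / total;
    }

    /**
     * Returns hosts whose failure ratio is at least minRatio and more than
     * factor times the pooled failure ratio of all other hosts.
     * Each map value holds {failedCount, totalCount}.
     */
    static List<String> findAnomalousHosts(Map<String, int[]> counts, double factor, double minRatio) {
        List<String> anomalous = new ArrayList<>();
        for (Map.Entry<String, int[]> entry : counts.entrySet()) {
            double own = failureRatio(entry.getValue()[0], entry.getValue()[1]);
            int othersFailed = 0;
            int othersTotal = 0;
            for (Map.Entry<String, int[]> other : counts.entrySet()) {
                if (other.getKey().equals(entry.getKey())) continue;
                othersFailed += other.getValue()[0];
                othersTotal += other.getValue()[1];
            }
            double rest = failureRatio(othersFailed, othersTotal);
            if (own >= minRatio && own > factor * rest) {
                anomalous.add(entry.getKey());
            }
        }
        return anomalous;
    }
}
```

The `minRatio` floor keeps a node from being flagged merely because the rest of the cluster had almost no failures in that minute.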