Online Anomaly Detection
Using Bayesian Changepoint, K-means Clustering and Hidden Markov Model
Aditya Gautam,
Master's student at Carnegie Mellon University, Pittsburgh / Silicon Valley.
Title : Summer 2016 Intern, Software Engineering
Team : Diagnostic, Visualization and Analytics team (DVA)
Manager : Paul Brown, VP, Software Engineering.
aditya.gautam@salesforce.com / aditya.gautam@sv.cmu.edu
What is an Anomaly ?
What is a time series anomaly ?
Any data pattern that has not been seen before, or is not expected to be seen
next, in the time series sequence.
Internship Goals :
To develop a robust online anomaly detection technique for time series data.
● Detect changepoints:
Use Bayesian online changepoint detection to detect changepoints.
● Build a machine learning model:
Do data pre-processing and feature extraction, and build a model combining Bayesian methods
with existing machine learning/probabilistic approaches.
● Scalable implementation:
Implement this model in an online streaming, distributed, scalable and fault-tolerant manner
using big data tools like Kafka, Hadoop, Spark, Flink, HDFS, etc.
Basic Techniques :
Running mean and standard deviation :
● Maintain the running average and variance in streaming state.
● Check the probability of each point belonging to the Gaussian distribution and flag it as
an anomaly if the probability is too low.
● Pretty straightforward; a no-brainer (a minimal sketch follows below).
● Works only in naïve cases. Not universal :(
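A minimal sketch of this baseline, assuming a 3-sigma cutoff and Welford's one-pass update (the class name and threshold are illustrative, not from the deck):

import java.util.stream.DoubleStream;

public class RunningStats {
    private long n = 0;
    private double mean = 0.0, m2 = 0.0; // running mean and sum of squared deviations

    // Returns true if x is improbably far from the running Gaussian (3-sigma rule).
    public boolean isAnomaly(double x) {
        boolean anomaly = n > 1 && Math.abs(x - mean) > 3 * Math.sqrt(m2 / (n - 1));
        // Welford's one-pass update keeps mean and variance numerically stable.
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
        return anomaly;
    }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        DoubleStream.of(1.0, 1.1, 0.9, 1.05, 0.95, 9.0) // toy stream with one outlier
                    .forEach(x -> System.out.println(x + " -> anomaly? " + stats.isAnomaly(x)));
    }
}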
But, how would we deal with these ?
[Examples of trickier time series patterns where the simple baseline fails.]
Solution : Learn the patterns in the data and how they are connected.
How ? Bayesian Changepoints + K-means Clustering + Hidden Markov Model
High level model overview
Time Series Data -> Bayesian Changepoint Detector -> K-means Clustering -> Hidden Markov Model:
● Bayesian Changepoint Detector : detects changepoints in the data.
● Extract the following features for each chunk : length, variance and mean.
● K-means clustering of the learned distributions; # of states in the HMM = # of clusters.
● Learn the data patterns and their order to build the HMM.
● Anomaly check : calculate the distance of each new distribution from the clusters (Gaussian
distributions); if it exceeds the threshold for the nearest cluster => anomaly (see the sketch below).
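A hedged sketch of that anomaly check: each detected chunk becomes a (length, mean, variance) feature vector and is compared against the nearest K-means centroid. The centroid values are taken from the "Patterns learned" results slide; the class name, threshold value and plain Euclidean distance are illustrative assumptions:

public class ClusterAnomalyCheck {
    // Centroids (length, mean, variance) per state, from the results slides.
    static final double[][] CENTROIDS = {
        {51.0, 1.52, 0.11},   // State 1
        {79.5, 3.50, 0.12},   // State 2
        {184.3, 5.50, 0.11},  // State 3
    };
    static final double THRESHOLD = 30.0; // assumed distance threshold per cluster

    // Returns -1 for an anomaly, otherwise the index of the nearest state/cluster.
    static int nearestStateOrAnomaly(double length, double mean, double variance) {
        double[] features = {length, mean, variance};
        double best = Double.MAX_VALUE;
        int bestIdx = -1;
        for (int i = 0; i < CENTROIDS.length; i++) {
            double d = 0;
            for (int j = 0; j < features.length; j++) {
                double diff = features[j] - CENTROIDS[i][j];
                d += diff * diff;
            }
            d = Math.sqrt(d);
            if (d < best) { best = d; bestIdx = i; }
        }
        return best > THRESHOLD ? -1 : bestIdx;
    }

    public static void main(String[] args) {
        System.out.println(nearestStateOrAnomaly(80, 3.4, 0.12));   // near State 2 -> 1
        System.out.println(nearestStateOrAnomaly(400, 3.4, 0.12));  // length too large -> -1
    }
}

In practice the features would need scaling before the distance computation, since length dominates mean and variance by orders of magnitude.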
Bayesian Changepoint Basics :
Changepoint : Point at which the data distribution changes from one to another.
Run length : Number of data points since the last changepoint occurred.
Underlying concept : A message-passing algorithm detects changepoints by tracking the
run length's state-change probability (an HMM-style recursion).
Changepoint Prior Probability (for a run-length hazard function $H(\cdot)$, as given in the Adams–MacKay paper in the references):
$$P(r_t \mid r_{t-1}) = \begin{cases} H(r_{t-1}+1) & \text{if } r_t = 0 \\ 1 - H(r_{t-1}+1) & \text{if } r_t = r_{t-1}+1 \\ 0 & \text{otherwise} \end{cases}$$
Marginal Predictive Distribution :
$$P(x_{t+1} \mid x_{1:t}) = \sum_{r_t} P(x_{t+1} \mid r_t, x_t^{(r)}) \, P(r_t \mid x_{1:t})$$
If you are more curious about the Bayesian approach :
Why Bayesian ?
The prediction of future changepoints depends only on the data points seen since the last
changepoint, which makes the method fast and efficient in terms of both time and memory.
Please check the references for more details and the related research paper.
Underlying Algorithm
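The slide shows the algorithm from the Adams and MacKay paper listed in the references. As a rough illustration only, here is a minimal sketch of that run-length recursion for the simplified case of a Gaussian with known variance and a Normal prior on the segment mean; the hazard rate, prior parameters and class name are assumptions:

import java.util.ArrayList;
import java.util.List;

public class BocdSketch {
    static final double HAZARD = 1.0 / 250.0; // constant changepoint hazard (assumed)
    static final double OBS_VAR = 1.0;        // known observation variance (assumed)
    static final double PRIOR_MEAN = 0.0, PRIOR_VAR = 4.0; // prior on the segment mean

    static double gaussianPdf(double x, double mean, double var) {
        return Math.exp(-(x - mean) * (x - mean) / (2 * var)) / Math.sqrt(2 * Math.PI * var);
    }

    public static void main(String[] args) {
        double[] xs = {0.1, -0.2, 0.05, 0.15, 5.1, 4.9, 5.2}; // toy stream with a mean shift
        List<Double> runProb = new ArrayList<>();  // P(r_t | x_1:t), indexed by run length
        List<Double> postMean = new ArrayList<>(); // posterior mean of the segment mean, per run
        List<Double> postVar = new ArrayList<>();  // posterior variance, per run
        runProb.add(1.0); postMean.add(PRIOR_MEAN); postVar.add(PRIOR_VAR);

        for (double x : xs) {
            int n = runProb.size();
            double[] growth = new double[n];
            double cpMass = 0.0;
            for (int r = 0; r < n; r++) {
                // Predictive for x under run length r: Normal(postMean, postVar + OBS_VAR).
                double w = gaussianPdf(x, postMean.get(r), postVar.get(r) + OBS_VAR) * runProb.get(r);
                cpMass += w * HAZARD;         // mass flowing into a changepoint (r -> 0)
                growth[r] = w * (1 - HAZARD); // mass flowing into run growth (r -> r + 1)
            }
            List<Double> newProb = new ArrayList<>(), newMean = new ArrayList<>(), newVar = new ArrayList<>();
            newProb.add(cpMass); newMean.add(PRIOR_MEAN); newVar.add(PRIOR_VAR); // r = 0: reset to prior
            for (int r = 0; r < n; r++) {
                newProb.add(growth[r]);
                // Conjugate Normal update of the segment-mean posterior with x.
                double v = 1.0 / (1.0 / postVar.get(r) + 1.0 / OBS_VAR);
                double m = v * (postMean.get(r) / postVar.get(r) + x / OBS_VAR);
                newMean.add(m); newVar.add(v);
            }
            double z = newProb.stream().mapToDouble(Double::doubleValue).sum();
            for (int r = 0; r < newProb.size(); r++) newProb.set(r, newProb.get(r) / z);
            runProb = newProb; postMean = newMean; postVar = newVar;

            int map = 0; // the MAP run length resets right after the shift to ~5.0,
            for (int r = 1; r < runProb.size(); r++) // signaling a changepoint
                if (runProb.get(r) > runProb.get(map)) map = r;
            System.out.printf("x = %5.2f  MAP run length = %d%n", x, map);
        }
    }
}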
How to detect anomalies on streaming data ?
KAFKA + FLINK + HDFS/Cassandra
Kafka :
● To get data points from input sources like
web servers/data centers etc.
● To divide the stream based on topics.
● To feed the topic streams to different consumers.
Flink :
● To create windows of data points.
● To run anomaly detection.
● To find the adaptive window size.
HDFS/Cassandra :
● To store the data points.
● To send alarms etc.
Why Apache Flink ?
Reasons for choosing Apache Flink over other streaming frameworks like Spark, Storm, etc. :
● Provides the flexibility to create windows based on either the number of data points or time.
● The order of the data stays intact as at the source (makes it faster -> no sorting needed).
● True online streaming (vs. micro-batch streaming).
● Efficient mechanism for state maintenance and transitions.
● Low latency and high throughput compared to other streaming platforms.
Thus, Apache Flink is well suited for online anomaly detection or any other real-time
streaming application.
Windowing of streaming data through Apache Flink :
● A tumbling window of size 100 is used.
● Previous points, i.e. max(30, window_len - last_changepoint_pos) of them, are retained in the previous-state variable.
● The retained points are prepended to the new window's points before they are fed to the Bayesian changepoint detector.
● After processing, the new state is updated with the number of points to retain for the next window operation.
So, virtually, processing is done on a sliding window of variable length, i.e. 100 + max(30, window_len - last_changepoint_pos). A sketch of this logic follows.
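A small sketch of the retention arithmetic just described (method and variable names are assumptions):

public class WindowOverlap {
    // Combine retained points from the previous window with the new tumbling window.
    static double[] combineWithOverlap(double[] previous, double[] newWindow,
                                       int windowLen, int lastChangepointPos) {
        // Keep at least 30 points, or everything after the last changepoint.
        int overlap = Math.min(previous.length,
                               Math.max(30, windowLen - lastChangepointPos));
        double[] combined = new double[overlap + newWindow.length];
        // Carry over the tail of the previous window ...
        System.arraycopy(previous, previous.length - overlap, combined, 0, overlap);
        // ... then append the new tumbling window of 100 points.
        System.arraycopy(newWindow, 0, combined, overlap, newWindow.length);
        // Effective sliding window: 100 + max(30, window_len - last_changepoint_pos).
        return combined;
    }
}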
Overall System architecture
Data Source (Kafka/S3 etc.) -> Flink Manager -> Data Sink (HDFS/HBase etc.) -> Dashboard / Alarms & Alerts -> End user output.
Inside the Flink Manager, for each sliding window :
● New window (data points) arrives from the source.
● Old window state : number of total data points, number of overlap points, window idx, last changepoint length, etc.
● Bayesian changepoint : detect changepoints and update the state variables.
● New window state : total data points, number of overlap points, window idx -> window idx + 1, changepoint length, etc.
● Update the previous state, i.e. the cluster idx for the HMM.
● New distributions / detected anomalies are written to the sink.
Apache Flink Main Execution Code :
// Imports needed by this excerpt (PowerTextTupleSplitter and
// DetectBayesianChangepoints are the author's own classes, not shown here):
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public static void main(String[] args) throws Exception {
    // Create a Flink streaming environment.
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    // Get the data from a file / Kafka / HDFS / Amazon S3 or any other source.
    DataStreamSource<String> data = env.readTextFile("file:///Users/aditya.gautam/data.txt");
    // Window the streaming data by count (WindowSize points) and apply the Bayesian method per window.
    DataStream<String> changePoints = data
        .flatMap(new PowerTextTupleSplitter())
        .keyBy(1)
        .countWindow(WindowSize)
        .apply(new DetectBayesianChangepoints()); // returns the detected distributions
    // Create a sink and write the results (changepoints detected after every window).
    changePoints.writeAsText("file:///Users/aditya.gautam/Desktop/Work/Bayesian/DataSet/result.txt");
    // Execute the program: this is where the DAG is formed and the Flink job manager starts the work.
    env.execute("Online Anomaly Detection");
}
Results : Original Data Sequence (Toy Data)
Contains three types of data sequences/states/distributions, each with a different :
● Mean
● Variance
● Length
Generated in a fashion similar to a mixture of Gaussian distributions.
Results : Data chunks after Bayesian changepoint
The streaming sequence is divided into individual distributions at the changepoints detected
by the Bayesian algorithm running on Apache Flink in a sliding-window fashion.
Window Size = 100.
Overlap Size = 30.
Results : Flink streaming result per window
Steps performed while processing every window (a skeleton follows below) :
1) Add the overlap points from the previous window, i.e. those preserved in the state variables.
2) Find the changepoints.
3) Get the params from the previous state.
4) Write the new distributions to the sink.
5) Check which cluster each distribution belongs to and update prev_state for building
the HMM, i.e. to learn the state transition probabilities.
6) Save the new window state to be used for the next window.
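A hedged skeleton of how these six steps could map onto the Flink window function. The deck does not show the real DetectBayesianChangepoints body, so the element type (Tuple2<String, Double>) and the step placement are assumptions:

import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.util.Collector;

public class DetectBayesianChangepoints
        implements WindowFunction<Tuple2<String, Double>, String, Tuple, GlobalWindow> {

    @Override
    public void apply(Tuple key, GlobalWindow window,
                      Iterable<Tuple2<String, Double>> input, Collector<String> out) {
        // 1) Prepend the overlap points preserved in the previous window's state.
        // 2) Run the Bayesian changepoint detector over the combined points.
        // 3) Read the params (window idx, overlap count, ...) from the previous state.
        // 4) Emit each completed distribution (length, mean, variance) via out.collect(...).
        // 5) Assign each distribution to its nearest cluster and record the transition
        //    from the previous cluster, to learn the HMM transition probabilities.
        // 6) Store the new state (points to retain, window idx + 1, ...) for the next window.
    }
}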
Results : Patterns learned from the data
Length = 51
Mean = 1.52
Variance = 0.11
Length = 79.5
Mean = 3.5
Variance = 0.12
Length = 184.3
Mean = 5.5
Variance = 0.11
State 1 State 2 State 3
Averaged over all the values for a particular state(cluster)
Results : Clustering of states/distributions
[Cluster plot of the learned distributions, labeled State 1, State 2 and State 3.]
Results : Clustering of states/distributions
[Cluster plot showing the 3 states/distributions: State 1, State 2, State 3.]
Results : Data with anomaly
[Plot of the data sequence, with each segment labeled S1, S2 or S3
(S1 -> state 1, S2 -> state 2, S3 -> state 3). The highlighted anomalous
segment's length is abnormally high.]
Results : Anomaly detection
[Cluster plot of State 1, State 2 and State 3 with one point marked as an anomaly.]
The anomalous point corresponds to sequence #4 highlighted on the previous slide: its length
is far too large for the given mean and variance.
The point is too far from any cluster centroid, so it is tagged as an anomaly the moment it
is seen in the streaming data on Flink.
Results : HMM learned
[State transition diagram over State 1, State 2 and State 3: the learned
transitions carry probability 0.99; all remaining transitions carry 0.01.]
● Shows the state transition probabilities learned from the ordering of the distribution sequences.
● 0.01 is the smoothing factor.
Results : Why is the HMM needed ?
● Using only Bayesian changepoints and K-means clustering is not enough. Why ? See the example below.
● Graphical model : learns the sequence ordering and how the states are connected to each other.
● Future prediction : predicts the next sequence of data given some previous sequences.
[Example sequence in which State 3 is missing: clustering alone would not flag this, but the
HMM's transition probabilities would. A sketch of this check follows.]
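A sketch of the check the HMM enables: score an observed state sequence under the learned transition matrix and flag sequences whose likelihood is too low. The cyclic order 1 -> 2 -> 3 is an assumption read off the toy-data slides, and the matrix rows are normalized here (the slide reports 0.99 learned transitions with 0.01 smoothing):

public class HmmSequenceCheck {
    // Learned transition matrix (row = from-state, column = to-state).
    static final double[][] TRANS = {
        {0.01, 0.98, 0.01}, // State 1 almost always goes to State 2
        {0.01, 0.01, 0.98}, // State 2 almost always goes to State 3
        {0.98, 0.01, 0.01}, // State 3 almost always goes back to State 1
    };

    // Log-likelihood of a 0-indexed state sequence under the transition matrix.
    static double logLikelihood(int[] states) {
        double ll = 0.0;
        for (int t = 1; t < states.length; t++)
            ll += Math.log(TRANS[states[t - 1]][states[t]]);
        return ll;
    }

    public static void main(String[] args) {
        int[] normal  = {0, 1, 2, 0, 1, 2};  // S1 S2 S3 S1 S2 S3: expected ordering
        int[] skipped = {0, 1, 0, 1, 2, 0};  // State 3 missing after the first S2
        System.out.println(logLikelihood(normal));  // close to 0: likely sequence
        System.out.println(logLikelihood(skipped)); // much lower: ordering anomaly
    }
}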
Some use cases :
[Example anomaly plots from a Google search on "time series anomaly".]
References :
❏ Anomaly techniques and research papers :
❏ https://hips.seas.harvard.edu/files/adams-changepoint-tr-2007.pdf
❏ http://cucis.ece.northwestern.edu/projects/DMS/publications/AnomalyDetection.pdf
❏ https://www.autonlab.org/tutorials/biosurv.html (Lecture on anomaly by Andrew Moore)
❏ http://reports-archive.adm.cs.cmu.edu/anon/ml2009/CMU-ML-09-101.pdf
❏ Flink and other big data related relevant document/links :
❏ https://flink.apache.org
❏ http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
❏ http://spark.apache.org/streaming/
❏ http://kafka.apache.org/documentation.html
❏ http://www.slideshare.net/sbaltagi/flink-vs-spark [and similar slides on slideshare]
❏ Source of some pictures used in this presentation :
❏ https://flink.apache.org
❏ Images from Google search on “time series anomaly”
Thank you!