Analyzing Oracle Performance Using Time Series Models
Chen (Gwen) Shapira
http://prodlife.wordpress.com
Why?
Abnormal Data
Changes
Trends
SLAs
See:
Techniques
Use Cases
Real Data
Techniques
Database Performance Analysis with Time Series
Trend
Trend
Moving Average Trend
Remove Trend
Seasonality
Seasonal Effect
Components
More AutoCorrelation
Xt = 0.33·Xt-1 + 0.07·Xt-2 − 0.09·Xt-3 + e
Test Model
Use Cases
Fake Incident
Detect By:
Remove trend
Remove seasonality
Mark “normal” data
What’s left?
Spot the Incident
“I have seen the future and it is very much like the present, only longer” (Kehlog Albran)
Exponential Smoothing:
Calculate moving average of future
Add seasonality
AutoCorrelation
Use the model Xt = a·Xt-1 … to calculate Xt+1, Xt+2, …
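As a rough illustration of forecasting with such a model (a minimal Python sketch, not the R code used in the talk): each new point is a weighted sum of the previous p points, and the noise term e is taken at its expected value of zero. The AR(3) coefficients 0.33, 0.07, −0.09 below are the ones from the earlier slide.

```python
def ar_forecast(series, coeffs, steps):
    """Forecast with an AR(p) model: each new point is the weighted sum
    of the previous p points (coefficients ordered newest lag first).
    The noise term is dropped, i.e. replaced by its expected value, 0."""
    extended = list(series)
    for _ in range(steps):
        nxt = sum(c * extended[-1 - i] for i, c in enumerate(coeffs))
        extended.append(nxt)
    return extended[len(series):]

# Coefficients from the AR(3) model on the slide (illustrative data):
print(ar_forecast([1.0, 2.0, 3.0], [0.33, 0.07, -0.09], 2))
```

Iterating further and further into the future reuses the model's own predictions as inputs, which is why confidence shrinks with the horizon.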
Real Data 1:Redo Blocks per Hour
Holiday
Seasonality
Abnormal Data
Real Data 2:CPU on DB Server
Seasonality?
Partial AutoCorrelation
Check Fit of Model
Prediction
Conclusions:
Use moving average to describe trend
Look for seasonality
Predict with Exponential Smoothing
AutoCorrelation?
Seasonality-aware monitoring
Questions?

Editor's Notes

  1. Time Series – data that is collected sequentially, usually at regular intervals. Time series are all around us – weather, stocks, CPU, disk space…
  2. Recognize abnormal data and send alerts. Recognize changes and be proactive. Analyze long-term trends for planning. Set realistic SLAs.
  3. One question we’ll keep asking ourselves: Which techniques are really useful?
  4. All kinds of data issues can prevent analysis. You can, and sometimes should, fix the data so analysis is possible: replace missing data with average values (or maximum values where that makes sense), remove outliers when it makes sense, and analyze the two sides of a discontinuity separately.
  5. Linear trend. Easy to fit and use, but rarely makes sense in real life.
  6. Moving Average requires picking a window size and weights. Small window: matches the data better, but may include noise. Large window: more of a general trend, but will lag behind the data.
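The window-size trade-off above can be sketched with an equal-weight moving average (a minimal Python example; the talk itself works in R):

```python
def moving_average(series, window):
    """Equal-weight moving average: one smoothed value per full window.
    A small window tracks the data closely (including its noise); a
    large window smooths more but lags behind level changes."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# Hypothetical hourly CPU readings, just for illustration:
cpu = [50, 52, 80, 55, 53, 51, 90, 54]
print(moving_average(cpu, 3))
```

Weighted variants simply replace the equal 1/window weights with weights that favor recent points.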
  7. Remove trend to allow analyzing other components.
  8. 50 degrees Fahrenheit is cold for August but hot for January. How about 60% CPU? Is it always OK or always a problem?
  9. Reminder: Correlation is a measure of the strength of the relation between two variables. How much do the variables change together?
  10. How does the data in our series correlate with itself? We see strong correlation between data points 24 hours apart.
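Auto-correlation at a given lag is just the Pearson correlation between the series and a shifted copy of itself. A minimal Python sketch of the standard estimator (the talk uses R's built-in acf):

```python
def autocorrelation(series, lag):
    """Correlation between the series and itself shifted by `lag`,
    normalized by the full-series variance (the usual ACF estimator)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var
```

On hourly data with a daily cycle, this function peaks at lag 24, which is what the slide's chart shows.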
  11. Average CPU for each hour. Similar to the average-temperature-per-month charts you sometimes see in tour guides.
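Estimating the seasonal effect this way is just grouping by hour of day and averaging. A minimal Python sketch (hypothetical (hour, value) pairs; the talk does this in R):

```python
def hourly_means(readings):
    """Average value per hour-of-day; `readings` are (hour, value) pairs.
    The resulting per-hour averages are the estimated seasonal effect."""
    sums, counts = {}, {}
    for hour, value in readings:
        sums[hour] = sums.get(hour, 0) + value
        counts[hour] = counts.get(hour, 0) + 1
    return {h: sums[h] / counts[h] for h in sums}
```

Subtracting these hourly means from each observation removes the seasonal component.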
  12. One chart to rule them all – data, trend, seasonality and all the rest.
  13. “All the rest” is not completely random – there is still some auto-correlation: the data correlates with points at lags of one and two.
  14. R used the auto-correlations to model the data.
  15. We test the model. We can see that the residuals no longer have auto-correlation, and the statistical test for the fit shows that the result is likely not random.
  16. I added a couple of hours with high CPU here. Can you spot them?
  17. After removing the seasonality and the average, we can clearly see the data point that is an outlier. It stands out.
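The detection steps (remove seasonality, then flag what stands out) can be sketched in Python as follows. This is an illustration, not the talk's R code, and the 3-standard-deviation threshold is an assumption of mine, not a value from the slides:

```python
def deseasonalize(series, period):
    """Subtract the per-position seasonal mean from each point,
    leaving residuals in which outliers stand out."""
    seasonal = [0.0] * period
    counts = [0] * period
    for i, x in enumerate(series):
        seasonal[i % period] += x
        counts[i % period] += 1
    seasonal = [s / c for s, c in zip(seasonal, counts)]
    return [x - seasonal[i % period] for i, x in enumerate(series)]

def outliers(residuals, threshold=3.0):
    """Indices of residuals more than `threshold` standard deviations
    from the mean (illustrative threshold, not from the talk)."""
    n = len(residuals)
    mean = sum(residuals) / n
    std = (sum((r - mean) ** 2 for r in residuals) / n) ** 0.5
    return [i for i, r in enumerate(residuals)
            if abs(r - mean) > threshold * std]
```

A spike injected into otherwise periodic data is flagged while the repeating pattern is not.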
  18. Calculate the moving average of the future by appending the moving average of the last 20 points as an additional point, then using the last 19 real points and the new one to calculate another point, and so on. Obviously this gets less accurate the further out you go. Adding seasonality is a matter of adding the hourly average to the appropriate new points.
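The iterative procedure above can be sketched as follows (a minimal Python illustration; the window size is a parameter, with 20 being the value mentioned in the note):

```python
def forecast(series, window, steps, seasonal=None):
    """Extend the series by repeatedly appending the moving average of
    the last `window` points (each new point then feeds the next), then
    add back per-position seasonal averages if given."""
    extended = list(series)
    for _ in range(steps):
        extended.append(sum(extended[-window:]) / window)
    predicted = extended[len(series):]
    if seasonal:
        period = len(seasonal)
        predicted = [p + seasonal[(len(series) + i) % period]
                     for i, p in enumerate(predicted)]
    return predicted
```

Because later steps are computed from earlier predictions rather than real data, accuracy degrades with the forecast horizon, exactly as the note warns.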
  19. Red – the model matched to existing data. Blue – predicted data. Green – 99% probability that we will not get data outside these lines.
  20. A bit like moving average but with very specific weights.
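Those "very specific weights" decay exponentially: each smoothed value mixes the newest observation with the previous smoothed value, so older points get geometrically shrinking weight. A minimal Python sketch of simple exponential smoothing (the smoothing parameter alpha is chosen by the user; the talk's R tooling fits it automatically):

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: each value is
    alpha * observation + (1 - alpha) * previous smoothed value,
    i.e. a moving average with exponentially decaying weights."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed
```

A large alpha tracks the data closely; a small alpha smooths heavily, mirroring the small-vs-large-window trade-off of the plain moving average.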
  21. Blue – predicted data. Green – 99% probability that we will not get data outside these lines.
  22. The redo data is very noisy, but adding a moving average trend allows us to see a point where redo generation drops. This happened to be Dec 20, when many users left for vacation.
  23. Correlation every 6 hours and stronger correlation every 24. These are the times we recalculate materialized views: a few views every 6 hours and a bunch every 24.
  24. Removing the seasonality allows us to notice abnormal data. Worth investigating – what was running at that time? Is it likely to happen again?
  25. Not exactly trend, but we do have changing levels of data.
  26. There are periodic correlations, but they are not regular, so it is not seasonality. This graph does indicate extremely strong auto-correlation.
  27. Partial autocorrelation graph. This is similar to autocorrelation, but when we calculate the auto-correlation for lag 2, we remove the correlation already explained by lag 1, and so on. Using this graph we can see auto-correlation up to lag 17. Once the CPU climbs, it may take over 3 hours until it is back to normal!
  28. Checking that the AR(17) model fits.