
I want to have information about HTTP response times: the average, and the 95th and 99th percentiles.

Response times are collected on each web server and sent every second to a central server, which aggregates the data and generates a report every 2 seconds. The report includes the total number of requests, the average response time, and the 95th and 99th percentiles, but (for size efficiency) not the raw data. The reports are stored and graphed.

Now the problem: I want this information for the last minute, hour, day, and week. The idea is to aggregate the reports. The average response time, weighted by the number of requests in each report, is easy. But I have the feeling that percentiles cannot be aggregated the same way, by simply weighting each report by its request count.
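For reference, the weighted average I mean is the usual one: if report $i$ covers $n_i$ requests with mean response time $\bar{t}_i$, the aggregate mean is $\bar{t} = \frac{\sum_i n_i \bar{t}_i}{\sum_i n_i}$. The question is whether anything comparable exists for the percentiles.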

Questions:

  • Is there any exact way to calculate an aggregated 95th percentile?
  • Is there an approximation that is good enough for this use case?

3 Answers


There is no math for meaningfully aggregating percentiles. Once you've summarized things as percentiles (and discarded the raw data or the histogram behind them), there is no way to aggregate the summarized percentiles into anything useful at the same percentile levels. And yes, this means that the "average percentile" legend numbers that show up in various percentile monitoring charts are completely bogus.

A simple way to demonstrate why any attempt at aggregating percentiles by averaging them (weighted or not) is useless: try it with a percentile that is easy to reason about, the 100%'ile (the max).

E.g., suppose I had the following 100%'iles reported for each one-minute interval, each with the same overall event count: [1, 0, 3, 1, 601, 4, 2, 8, 0, 3, 3, 1, 1, 0, 2]

The (weighted or not) average of this sequence is 42, and it has as much relation to the overall 100%'ile as the phase of the moon does. No amount of fancy averaging (weighted or not) will produce a correct answer to "what is the 100%'ile of the overall 15-minute period?". There is only one correct answer: 601 was the 100%'ile seen during the 15-minute period.
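To make the arithmetic concrete, a minimal Python check of that example:

```python
maxes = [1, 0, 3, 1, 601, 4, 2, 8, 0, 3, 3, 1, 1, 0, 2]  # per-minute 100%'iles

print(sum(maxes) / len(maxes))  # 42.0 -- the bogus "average percentile"
print(max(maxes))               # 601  -- the actual 100%'ile of the 15 minutes
```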

There are only two percentiles for which you can actually find math that works for accurate aggregation across intervals:

  • the 100%'ile (for which the answer is "the max is the max of the maxes")
  • the 0%'ile (for which the answer is "the min is the min of the mins")

For all other percentiles, the only correct answer is "the aggregate N%'ile is somewhere between the lowest and highest N%'ile seen in any interval of the aggregate time period". And that's not a very useful answer, especially when the range can cover the entire spectrum. In many real-world data sets it often amounts to something close to "it's somewhere between the overall min and the overall max".

For more ranting on this subject:

  • http://latencytipoftheday.blogspot.com/2014/06/latencytipoftheday-q-whats-wrong-with_21.html
  • http://latencytipoftheday.blogspot.com/2014/06/latencytipoftheday-you-cant-average.html


No, there is no exact way to aggregate percentiles -- you need all the data, or at least a histogram of all the data.

Yes, there are approximations; Q-digest seems very promising.

In my spare time I hacked up a histogram of quantised log response times, with the domain limited to some arbitrary min and max. That answers my queries in O(log N) space; considering the imposed limits, it's actually O(1). Roughly, the histogram looks like this:

```
buckets = 1ms|2ms|4ms|...|1h|2h|4h
counts  = 0,0,12,145,...,1,0,0,0
```

In my case, I used the series [2**(0.1*i) for i in range(100)], which gives nice ~7% increments from 1 to about 1000; 1 being considered impossibly fast and 1000 unbearably slow.

Faster lookup is achieved by keeping a cumulative histogram, since that allows binary search.
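A minimal sketch of this kind of structure, in Python (the class and method names are illustrative, not from any particular library):

```python
import bisect

# Bucket upper bounds: 2**(0.1*i) for i in 0..99, i.e. ~7% increments
# from 1 to about 1000 (units are whatever you measure in, e.g. ms).
BOUNDS = [2 ** (0.1 * i) for i in range(100)]

class LogHistogram:
    def __init__(self):
        self.counts = [0] * len(BOUNDS)

    def record(self, value):
        # Clamp to the domain: values beyond the last bound land in the
        # final ("unbearably slow") bucket.
        i = bisect.bisect_left(BOUNDS, value)
        self.counts[min(i, len(BOUNDS) - 1)] += 1

    def merge(self, other):
        # Aggregation across servers or intervals is exact: just add counts.
        for i, c in enumerate(other.counts):
            self.counts[i] += c

    def percentile(self, p):
        """Approximate p-th percentile (0 < p <= 100); the error is
        bounded by the ~7% bucket width."""
        total = sum(self.counts)
        if total == 0:
            return None  # no data recorded yet
        target = total * p / 100.0
        cumulative = 0
        for bound, count in zip(BOUNDS, self.counts):
            cumulative += count
            if cumulative >= target:
                return bound
        return BOUNDS[-1]
```

Because merging is exact, the central server can keep one such histogram per reporting interval and sum them over any window (minute, hour, day, week) before reading percentiles off the merged counts.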

  • The log quantization is the approach I used in Maatkit's mk-query-digest (now Percona Toolkit's pt-query-digest), and it is both fast and useful. It is not strictly accurate in all cases, but in most cases it does give exact answers. I did not invent this; I saw it previously in the sysbench system benchmarking tool. – user381310, Dec 12, 2017

It would be very nice to have data for more percentiles. If you had the $90, 91, 92, \dots, 99$ percentiles for each block of time you could do very well. Say you had this data, along with the number of requests, for each of the $168$ hours of a week, and want to report the aggregate $95$th percentile for the week. Average all the $95$th percentiles as a starting point. Say that time is the $93$rd percentile for the first hour; then $7\%$ of the requests that hour took longer. Total up the number of requests that took longer and compare to the total number of requests in the week. Say you find that this time is the $96$th percentile overall: shorten it a bit and try again. You can use bisection to guide your search, as in the sketch below.
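A sketch of that search in Python, assuming each hour reports its request count and its $90$th through $99$th percentile values (the data layout and function names here are illustrative):

```python
def fraction_above(t, pcts):
    """pcts[k] is the (90+k)-th percentile value for one hour, k = 0..9.
    Returns the estimated fraction of that hour's requests slower than t."""
    if t < pcts[0]:
        return 0.10                 # below the 90th percentile: clamp at 10%
    if t >= pcts[-1]:
        return 0.01                 # at or above the 99th: clamp at 1%
    for k in range(9):
        lo, hi = pcts[k], pcts[k + 1]
        if lo <= t < hi:
            # Linearly interpolate t's percentile rank between 90+k and 91+k.
            rank = 90 + k + (t - lo) / (hi - lo)
            return 1.0 - rank / 100.0
    return 0.01                     # only reached with degenerate (equal) values

def aggregate_p95(hours):
    """hours: list of (request_count, pcts) pairs, one per hour.
    Bisects on t until roughly 5% of all requests lie above t."""
    total = sum(n for n, _ in hours)
    lo = min(p[0] for _, p in hours)    # below every hour's 90th percentile
    hi = max(p[-1] for _, p in hours)   # above every hour's 99th percentile
    for _ in range(50):                 # plenty of iterations for float precision
        mid = (lo + hi) / 2.0
        above = sum(n * fraction_above(mid, p) for n, p in hours)
        if above / total > 0.05:
            lo = mid                    # too many requests above t: move up
        else:
            hi = mid                    # too few: move down
    return (lo + hi) / 2.0
```

The clamps at the edges of the table are where the estimate degrades; the scheme works well only when the aggregate percentile you want lies inside the range of percentiles you actually recorded.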

If you don't have the additional data, I don't see what you can do better than averaging. If the need is just to report a number, that will be good enough. It will probably also be good enough to spot trends, but it may not be an accurate $95$th percentile. Do you need that?
