35

I want to estimate the quantiles of some data. The data are so large that they cannot be held in memory, and they are not static: new observations keep arriving. Does anyone know an algorithm for monitoring the quantiles of the data observed so far with very limited memory and computation? I find the P2 algorithm useful, but it does not work well for my data, which are extremely heavy-tailed.

3

6 Answers

19

The P2 algorithm is a nice find. It works by making several estimates of the quantile, updating them periodically, and using quadratic (not linear, not cubic) interpolation to estimate the quantile. The authors claim quadratic interpolation works better in the tails than linear interpolation and cubic would get too fussy and difficult.

You do not state exactly how this approach fails for your "heavy-tailed" data, but it's easy to guess: estimates of extreme quantiles for heavy-tailed distributions will be unstable until a large amount of data are collected. But this is going to be a problem (to a lesser extent) even if you were to store all the data, so don't expect miracles!

At any rate, why not set auxiliary markers--let's call them $x_0$ and $x_6$--within which you are highly certain the quantile will lie, and store all data that lie between $x_0$ and $x_6$? When your buffer fills you will have to update these markers, always keeping $x_0 \le x_6$. A simple algorithm to do this can be devised from a combination of (a) the current P2 estimate of the quantile and (b) stored counts of the number of data less than $x_0$ and the number of data greater than $x_6$. In this fashion you can, with high certainty, estimate the quantile just as well as if you had the entire dataset always available, but you only need a relatively small buffer.

Specifically, I am proposing a data structure $(k, \mathbf{y}, n)$ to maintain partial information about a sequence of $n$ data values $x_1, x_2, \ldots, x_n$. Here, $\mathbf{y}$ is a linked list

$$\mathbf{y} = (x^{(n)}_{[k+1]} \le x^{(n)}_{[k+2]} \le \cdots \le x^{(n)}_{[k+m]}).$$

In this notation $x^{(n)}_{[i]}$ denotes the $i^\text{th}$ smallest of the $n$ $x$ values read so far. $m$ is a constant, the size of the buffer $\mathbf{y}$.

The algorithm begins by filling $\mathbf{y}$ with the first $m$ data values encountered and placing them in sorted order, smallest to largest. Let $q$ be the quantile to be estimated; e.g., $q$ = 0.99. Upon reading $x_{n+1}$ there are three possible actions:

  • If $x_{n+1} \lt x^{(n)}_{[k+1]}$, increment $k$.

  • If $x_{n+1} \gt x^{(n)}_{[k+m]}$, do nothing.

  • Otherwise, insert $x_{n+1}$ into $\mathbf{y}$.

In any event, increment $n$.

The insert procedure puts $x_{n+1}$ into $\mathbf{y}$ in sorted order and then eliminates one of the extreme values in $\mathbf{y}$:

  • If $k + m/2 \lt n q$, then remove $x^{(n)}_{[k+1]}$ from $\mathbf{y}$ and increment $k$;

  • Otherwise, remove $x^{(n)}_{[k+m]}$ from $\mathbf{y}$.

Provided $m$ is sufficiently large, this procedure will bracket the true quantile of the distribution with high probability. At any stage $n$ it can be estimated in the usual way in terms of $x^{(n)}_{[\lfloor{q n}\rfloor]}$ and $x^{(n)}_{[\lceil{q n}\rceil]}$, which will likely lie in $\mathbf{y}$. (I believe $m$ only has to scale like the square root of the maximum amount of data ($N$), but I have not carried out a rigorous analysis to prove that.) At any rate, the algorithm will detect whether it has succeeded (by comparing $k/n$ and $(k+m)/n$ to $q$).

Testing with up to 100,000 values, using $m = 2\sqrt{N}$ and $q=.5$ (the most difficult case) indicates this algorithm has a 99.5% success rate in obtaining the correct value of $x^{(n)}_{[\lfloor{q n}\rfloor]}$. For a stream of $N=10^{12}$ values, that would require a buffer of only two million (but three or four million would be a better choice). Using a sorted doubly linked list for the buffer requires $O(\log(\sqrt{N}))$ = $O(\log(N))$ effort while identifying and deleting the max or min are $O(1)$ operations. The relatively expensive insertion typically needs to be done only $O(\sqrt{N})$ times. Thus the computational costs of this algorithm are $O(N + \sqrt{N} \log(N)) = O(N)$ in time and $O(\sqrt{N})$ in storage.
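For concreteness, here is a minimal Python sketch of the buffer scheme described above. It is one possible rendering of the scheme, not code from the answer: it uses a plain sorted list via `bisect` rather than the doubly linked list assumed in the cost analysis, and the class and method names (`StreamingQuantile`, `add`, `estimate`) are illustrative.

```python
import bisect
import math

class StreamingQuantile:
    """Track the q-th quantile of a stream using a buffer of about m values."""

    def __init__(self, q, m):
        self.q = q      # quantile to track, e.g. 0.99
        self.m = m      # buffer size, e.g. on the order of 2*sqrt(expected N)
        self.k = 0      # count of observations known to lie below the buffer
        self.n = 0      # total number of observations seen
        self.y = []     # sorted buffer bracketing the target quantile

    def add(self, x):
        if self.n < self.m:                        # initial fill
            bisect.insort(self.y, x)
        elif x < self.y[0]:                        # falls below the buffer
            self.k += 1
        elif x > self.y[-1]:                       # falls above the buffer: ignore
            pass
        else:                                      # insert, then evict one extreme
            bisect.insort(self.y, x)
            if self.k + self.m / 2 < self.n * self.q:
                self.y.pop(0)                      # drop the smallest; it is now "below"
                self.k += 1
            else:
                self.y.pop()                       # drop the largest
        self.n += 1

    def estimate(self):
        r = max(1, math.floor(self.q * self.n))    # 1-based rank of the order statistic
        i = r - self.k - 1                         # its position inside the buffer
        if 0 <= i < len(self.y):
            return self.y[i]
        raise RuntimeError("buffer no longer brackets the quantile; increase m")
```

The `estimate` method doubles as the success check mentioned above: if the rank $\lfloor qn \rfloor$ falls outside $(k, k+m]$, the buffer has drifted off the target quantile and a larger $m$ (or a restart) is needed.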

9
  • This is an extension of the P2 algorithm: sim.sagepub.com/content/49/4/159.abstract. The storage is still too much for my application, which runs on small sensors with a total of 10K RAM. I can spend a few hundred bytes at most on quantile estimation alone. Commented Mar 7, 2011 at 18:50
  • @whuber Actually, I implemented the extended P2 and tested it with samples generated from various distributions such as the uniform and exponential, where it works great. But when I apply it to data from my application, whose distribution is unknown, it sometimes fails to converge and yields relative errors (abs(estimate - actual) / actual) of up to 300%. Commented Mar 7, 2011 at 18:55
  • 2
    @sino The quality of the algorithm compared to using all the data shouldn't depend on the heaviness of the tails. A fairer way to measure error is this: let $F$ be the empirical cdf. For an estimate $\hat{q}$ of the $q$ percentile, what is the difference between $F(\hat{q})$ and $F(q)$? If it is on the order of $1/n$ you're doing awfully well. In other words, just what percentile is the P2 algorithm returning for your data?
    – whuber
    Commented Mar 7, 2011 at 19:20
  • 1
    You are right. I just measured $F(\hat{q})$ and $F(q)$ for the case I mentioned with relative error up to 300%. For $q$ of 0.7, $\hat{q}$ is almost 0.7, resulting in negligible error. However, for $q$ of 0.9, $\hat{q}$ seems to be around 0.95. I guess that's why I have a huge error of up to 300%. Any idea why it's 0.95, not 0.9? BTW, can I post a figure here, and how can I post mathematical formulas as you did? Commented Mar 8, 2011 at 2:35
  • 2
    @whuber I'm pretty confident that my implementation conforms to extended P2. 0.9 still goes to 0.95 or even larger when I simultaneously estimate the 0.8, 0.85, 0.9, and 0.95 quantiles. However, 0.9 comes very close to 0.9 if the 0.8, 0.85, 0.9, 0.95, and 1.0 quantiles are tracked at the same time. Commented Mar 8, 2011 at 4:29
8

I think whuber's suggestion is great and I would try that first. However, if you find you really can't accommodate the $O(\sqrt N)$ storage, or it doesn't work out for some other reason, here is an idea for a different generalization of P2. It is not as detailed as whuber's proposal; it's more a research idea than a solution.

Instead of tracking the quantiles at $0$, $p/2$, $p$, $(1+p)/2$, and $1$, as the original P2 algorithm suggests, you could simply keep track of more quantiles (but still a constant number). It looks like the algorithm allows for that in a very straightforward manner; all you need to do is work out the correct "bucket" for each incoming point and the right way to update the quantile estimates (quadratic interpolation using adjacent markers).

Say you keep track of $25$ points. You could try tracking the quantile at $0$, $p/12$, $\dotsc$, $p \cdot 11/12$, $p$, $p + (1-p)/12$, $\dotsc$, $p + 11\cdot(1-p)/12$, $1$ (picking the points equidistantly in between $0$ and $p$, and between $p$ and $1$), or even using $22$ Chebyshev nodes of the form $p/2 \cdot (1 + \cos \frac{(2 i - 1)\pi}{22})$ and $p + (1 - p)/2 \cdot (1 + \cos \frac{(2i-1)\pi}{22})$. If $p$ is close to $0$ or $1$, you could try putting fewer points on the side where there is less probability mass and more on the other side.
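As a small illustration (my own sketch, not part of the answer), the marker probabilities described above could be computed as follows; the `per_side` parameter and the inclusion of the endpoints $0$, $p$, $1$ alongside the Chebyshev nodes are my assumptions.

```python
import math

def equidistant_markers(p, per_side=12):
    """0, p/12, ..., 11p/12, p, p + (1-p)/12, ..., p + 11(1-p)/12, 1."""
    lower = [p * i / per_side for i in range(per_side)]              # includes 0
    upper = [p + (1 - p) * i / per_side for i in range(per_side)]    # includes p
    return lower + upper + [1.0]

def chebyshev_markers(p, n=22):
    """n Chebyshev-style nodes clustered near 0, p, and 1, plus those endpoints."""
    half = n // 2
    lower = [p / 2 * (1 + math.cos((2 * i - 1) * math.pi / n))
             for i in range(1, half + 1)]
    upper = [p + (1 - p) / 2 * (1 + math.cos((2 * i - 1) * math.pi / n))
             for i in range(1, half + 1)]
    return sorted([0.0, p, 1.0] + lower + upper)
```

With the defaults, both functions return $25$ probability levels, matching the example above.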

If you decide to pursue this, I (and possibly others on this site) would be interested in knowing if it works...

1
  • 1
    +1 I think this is a great idea given the OP's constraints. All one can hope for is an approximation, so the trick is to pick bins that have a high likelihood of being narrow and containing the desired quantile.
    – whuber
    Commented Mar 7, 2011 at 21:51
6

Press et al., Numerical Recipes, Section 8.5.2 "Single-pass estimation of arbitrary quantiles", p. 435, give a C++ class IQAgent which updates a piecewise-linear approximate cdf.

1
  • books.google.com/… for a version that doesn't require Flash.
    – ZachB
    Commented Jun 13, 2018 at 22:05
3

I'd look at quantile regression. You can use it to obtain a parametric estimate of whichever quantiles you want to look at. It makes no assumptions about normality, so it handles heteroskedasticity pretty well, and it can be used on a rolling-window basis. It's basically an L1-norm penalized regression, so it's not too numerically intensive, and there are fairly full-featured R, SAS, and SPSS packages, plus a few MATLAB implementations, out there. Here are the main and R-package wikis for more info.

Edited:

Check out the Math Stack Exchange crosslink: someone cited a couple of papers that essentially lay out the very simple idea of just using a rolling window of order statistics to estimate quantiles. Literally all you have to do is sort the values in the window from smallest to largest, select which quantile you want, and take the highest value within that quantile. You can obviously give more weight to the most recent observations if you believe they are more representative of actual current conditions. This will probably give rough estimates, but it's fairly simple to do and you don't have to go through the motions of quantitative heavy lifting. Just a thought.
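A minimal sketch of that rolling-window idea (my own illustration; the window size and names are arbitrary):

```python
from collections import deque

def rolling_quantile(stream, q, window=1000):
    """Yield the q-th quantile of the most recent `window` observations."""
    buf = deque(maxlen=window)          # keeps only the last `window` values
    for x in stream:
        buf.append(x)
        s = sorted(buf)                 # O(w log w) per update; fine for small windows
        yield s[min(int(q * len(s)), len(s) - 1)]   # order statistic nearest the q-th quantile
```

Weighting recent observations more heavily, as suggested above, would mean replacing the plain order statistic with a weighted one.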

2

This can be adapted from algorithms that determine the median of a dataset online. For more information, see this Stack Overflow post: https://stackoverflow.com/questions/1387497/find-median-value-from-a-growing-set
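One common approach to the growing-set median problem (not spelled out in this answer, so treat this as my own sketch) is the two-heap method. Note whuber's caveat in the comment below: it stores every observation, so it does not meet the memory constraint in the question.

```python
import heapq

class RunningMedian:
    """Online median via two heaps: max-heap for the lower half, min-heap for the upper half."""

    def __init__(self):
        self.lo = []   # max-heap (values stored negated)
        self.hi = []   # min-heap

    def add(self, x):
        if not self.lo or x <= -self.lo[0]:
            heapq.heappush(self.lo, -x)
        else:
            heapq.heappush(self.hi, x)
        # rebalance so that len(lo) == len(hi) or len(lo) == len(hi) + 1
        if len(self.lo) > len(self.hi) + 1:
            heapq.heappush(self.hi, -heapq.heappop(self.lo))
        elif len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2.0
```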

1
  • The computational resources required of the algorithm you link to are unnecessarily large and do not meet the requirements of this question.
    – whuber
    Commented Feb 13, 2019 at 12:59
2

It is possible to estimate (and track) quantiles online (the same applies to the parameters of a quantile regression). In essence, this boils down to stochastic gradient descent on the check-loss function that defines quantile regression (a quantile being represented by a model containing only an intercept), i.e., updating the unknown parameters as and when observations arrive.

See the Bell Labs paper "Incremental Quantile Estimation for Massive Tracking" (ftp://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/kdd/p516-chen.pdf).
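As a rough illustration of the stochastic-approximation idea (my sketch, not necessarily the exact procedure in the cited paper), the intercept-only case reduces to a one-line subgradient update on the check loss. The step-size schedule below is an arbitrary choice, and `step` should be on the scale of the data.

```python
def online_quantile(stream, q, step=1.0):
    """Track the q-th quantile by stochastic (sub)gradient descent on the check loss."""
    theta = None
    for t, x in enumerate(stream, start=1):
        if theta is None:
            theta = x                      # initialize at the first observation
            continue
        gamma = step / t ** 0.5            # decaying learning rate (illustrative)
        # subgradient step on rho_q(x - theta): theta <- theta + gamma * (q - 1{x < theta})
        theta += gamma * (q - (1.0 if x < theta else 0.0))
        yield theta
```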

