0
$\begingroup$

Marcos López de Prado writes the following in his book Advances in Financial Machine Learning:

In general, we need at least $\frac{1}{2} N (N+1)$ independent and identically distributed (IID) observations in order to estimate a covariance matrix of size $N$ that is not singular. For example, estimating an invertible covariance matrix of size 50 requires, at the very least, 5 years of daily IID data.

What is the reasoning for that number of observations? Where can I find some sources related to this that I can cite?

$\endgroup$
5
  • 1
    $\begingroup$ Could he be referring to something like this? A covariance matrix of $N$ assets has $N^2$ entries but because the matrix is symmetric, you need to estimate fewer covariances. There are $N$ variances (the ones on the main diagonal) and $\sum\limits_{i=1}^{N} (i-1) = \frac{N(N-1)}{2}$ covariances (either upper or lower triangular matrix). In total, you thus have $N+ \frac{N(N-1)}{2}=\frac{N(N+1)}{2}$ variance-covariance terms. $\endgroup$
    – Kevin
    Commented Aug 24, 2021 at 17:14
  • $\begingroup$ @Kevin I don't think so. See the sentence that I have added to the quote in my question. Thank you. $\endgroup$
    – Nick
    Commented Aug 24, 2021 at 17:40
  • $\begingroup$ While there are $N(N+1)/2$ "unknowns" in an $N \times N$ covariance matrix, it's not as if they can take any old value. There are obvious limits on how low the entries in a correlation matrix can be and still have a "physical" meaning. So in general, I'd expect valid covariance matrices to actually require fewer than $N(N+1)/2$ observations. $\endgroup$ Commented Aug 24, 2021 at 17:57
  • 2
    $\begingroup$ Check out these answers and remember that a matrix needs to be of full rank to be invertible. $\endgroup$
    – Bob Jansen
    Commented Aug 24, 2021 at 18:13
  • 1
    $\begingroup$ @BobJansen Cool, never thought about it that way, but it does make sense. Thanks a lot :) $\endgroup$
    – Kevin
    Commented Aug 24, 2021 at 18:18

2 Answers

1
$\begingroup$

The covariance matrix of $N$ stocks (or any other assets) has $N(N+1)/2$ distinct elements. To estimate these elements reasonably well, the number of independent observations $ND$ (with $D$ the number of days) should be well above $O(N^2)$, i.e. $D\gg N$. This requirement is more stringent than requiring the covariance $C_{ij}=\sum_d R_{di}R_{dj}$ to be full rank, which takes only $D=N$ days.

But the catch is elsewhere: random matrix theory (RMT) indicates that the distribution of eigenvalues, which governs the covariance condition number and any inversion, solving, or optimization task, converges to the "true" distribution very slowly, with the error decreasing only as $\sqrt{N/D}$. López de Prado's book discusses the RMT aspects as well. If the observations are not exactly independent, the required number of days increases further; one can introduce a statistic for the effective number of independent observations.
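
As a rough illustration of the $\sqrt{N/D}$ scaling, here is a sketch under idealized assumptions that go beyond what is stated above: IID returns whose true covariance is the identity, and $N, D \to \infty$ at a fixed ratio $q = N/D < 1$ (the symbol $q$ is introduced here only for illustration). In that setting the Marchenko–Pastur law confines the sample eigenvalues to $[\lambda_-, \lambda_+]$ with

$$\lambda_\pm = \left(1 \pm \sqrt{q}\right)^2, \qquad \frac{\lambda_+}{\lambda_-} = \left(\frac{1+\sqrt{q}}{1-\sqrt{q}}\right)^2 .$$

For $N = 50$ and $D = 1275$ this gives $q \approx 0.039$ and a ratio of about $2.2$, even though the true condition number is 1. At $D = 100$ ($q = 0.5$) the ratio is about $34$.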

$\endgroup$
3
$\begingroup$

Let $f(N) = \frac{1}{2} N (N + 1)$ then $f(50) = 1275$. A year has approximately 255 trading days. So you need at least 1275 / 255 = 5 years.

I believe the rule above is used in practice, but I think the text is not quite correct (which surprises me, maybe I should have a ☕). If the returns are IID, 51 observations ought to be enough; see the proof in this answer, or "Improved estimation of the covariance matrix of stock returns with an application to portfolio selection" (Ledoit and Wolf, 2003) for something citable but short.
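
The count of 51 can be sketched as a rank argument. The notation is introduced here for illustration and is not taken from the book: $X$ is the $T \times N$ matrix of returns and $\mathbf{1}$ is the $T$-vector of ones. The sample covariance with an estimated mean is

$$\hat\Sigma = \frac{1}{T-1}\, X^\top M X, \qquad M = I_T - \frac{1}{T}\mathbf{1}\mathbf{1}^\top, \qquad \operatorname{rank}(\hat\Sigma) \le \min\left(N, \operatorname{rank} M\right) = \min(N,\, T-1).$$

So $\hat\Sigma$ can only be invertible if $T \ge N + 1$, that is $T \ge 51$ for $N = 50$. For IID returns from a continuous distribution with nonsingular true covariance (Gaussian, for example), the bound is attained with probability one. If the mean is known rather than estimated, $M$ drops out and $T = N$ observations suffice.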

However, empirically, daily returns of multiple stocks are not IID, and it helps to have more data, a more advanced covariance estimation scheme, or both. To illustrate, here is a plot of condition numbers:

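set.seed(42)
nAssets <- 50
sampleSizes <- 50:1500  # number of observations in each simulated sample
# For each sample size, simulate IID N(0, 0.2^2) returns and compute the
# condition number of the sample covariance matrix (exact = TRUE requests
# the exact value rather than a fast estimate).
conditionNumbers <- sapply(
  sampleSizes,
  function(x) {
    kappa(
      cov(matrix(rnorm(n = nAssets * x, mean = 0, sd = 0.2), ncol = nAssets)),
      exact = TRUE
    )
  }
)
plot(sampleSizes, log10(conditionNumbers))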

kappa calculates the condition number.

Plot of condition numbers of sample covariance matrices for 50 N(0, 0.2) return series of lengths 50 to 1500

$\endgroup$
1
  • 1
    $\begingroup$ As Bob shows here, the (required) number of observations is linked to the invertibility of the covariance matrix in downstream operations. If the matrix is (nearly) rank deficient, it may still be invertible (numerically speaking), but the result will be nonsensical. $\endgroup$ Commented Aug 25, 2021 at 6:12
