
How can I show that the variance of local polynomial regression is increasing with the degree of the polynomial (Exercise 6.3 in Elements of Statistical Learning, second edition)?

This question has been asked before, but the answer there just states that it follows easily.

More precisely, we consider $y_{i}=f(x_{i})+\epsilon_{i}$ with the $\epsilon_{i}$ independent, each with standard deviation $\sigma$.

The estimator is given by

$$ \hat{f}(x_{0})=\left(\begin{array}{ccccc} 1 & x_{0} & x_{0}^{2} & \dots & x_{0}^{d}\end{array}\right)\left(\begin{array}{c} \alpha\\ \beta_{1}\\ \vdots\\ \beta_{d} \end{array}\right) $$ for $\alpha,\beta_{1},\dots,\beta_{d}$ solving the weighted least squares problem $$ \min\left(y-\underbrace{\left(\begin{array}{ccccc} 1 & x_{1} & x_{1}^{2} & \dots & x_{1}^{d}\\ \vdots & \vdots & \vdots & & \vdots\\ 1 & x_{n} & x_{n}^{2} & \dots & x_{n}^{d} \end{array}\right)}_{X}\left(\begin{array}{c} \alpha\\ \beta_{1}\\ \vdots\\ \beta_{d} \end{array}\right)\right)^{t}W\left(y-X\left(\begin{array}{c} \alpha\\ \beta_{1}\\ \vdots\\ \beta_{d} \end{array}\right)\right) $$ for $W=\text{diag}\left(K(x_{0},x_{i})\right)_{i=1,\dots,n}$ with $K$ being the regression kernel. The solution of the weighted least squares problem can be written as $$ \left(\begin{array}{cccc} \alpha & \beta_{1} & \dots & \beta_{d}\end{array}\right)^{t}=\left(X^{t}WX\right)^{-1}X^{t}WY. $$ Thus, for $l(x_{0})=\left(\begin{array}{ccccc} 1 & x_{0} & x_{0}^{2} & \dots & x_{0}^{d}\end{array}\right)\left(X^{t}WX\right)^{-1}X^{t}W$ we obtain $$ \hat{f}(x_{0})=l(x_{0})Y, $$ implying that $$ \text{Var }\hat{f}(x_{0})=\sigma^{2}\left\Vert l(x_{0})\right\Vert ^{2}=\sigma^{2}\left(\begin{array}{ccccc} 1 & x_{0} & x_{0}^{2} & \dots & x_{0}^{d}\end{array}\right)\left(X^{t}WX\right)^{-1}X^{t}W^{2}X\left(X^{t}WX\right)^{-1}\left(\begin{array}{ccccc} 1 & x_{0} & x_{0}^{2} & \dots & x_{0}^{d}\end{array}\right)^{t}. $$ My approach: an induction using the formula for the inverse of a block matrix, but I did not succeed.
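For concreteness, here is a small numerical sketch of this variance formula (my own, not part of the exercise; the Gaussian kernel, the bandwidth `h` and the function name `local_poly_variance` are arbitrary choices):

```python
# Numerical sketch: evaluate Var f_hat(x0) = sigma^2 * ||l(x0)||^2 for d = 0, 1, 2, ...
# using a Gaussian kernel for W.  All names and parameter values here are my own choices.
import numpy as np

def local_poly_variance(x, x0, d, h=0.5, sigma2=1.0):
    X = np.vander(x, N=d + 1, increasing=True)            # rows (1, x_i, ..., x_i^d)
    W = np.diag(np.exp(-0.5 * ((x - x0) / h) ** 2))       # W = diag(K(x0, x_i))
    b0 = np.vander(np.array([x0]), N=d + 1, increasing=True)[0]   # (1, x0, ..., x0^d)
    l = b0 @ np.linalg.inv(X.T @ W @ X) @ X.T @ W         # the weight vector l(x0)
    return sigma2 * (l @ l)                               # sigma^2 * ||l(x0)||^2

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1.0, 1.0, 50))
for d in range(6):
    print(d, local_poly_variance(x, x0=0.3, d=d))
# The exercise's claim is that these values are non-decreasing in d.
```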

The paper Multivariate Locally Weighted Least Squares Regression by D. Ruppert and M. P. Wand derives an asymptotic expression for the variance as $n\rightarrow\infty$ in Theorem 4.1, but it is not clear that it is increasing in the degree.


2 Answers


If the variance increases for every weighting matrix $W$, then in particular it does so for $W=I$; henceforth I will treat this special case and use the notation of OLS. We have $$y=X\beta+u\textrm{,} \qquad \textrm{with} \qquad X\in\mathbb{R}^{n\times k};\, y,u\in\mathbb{R}^{n};\, \beta\in\mathbb{R}^{k}$$ and with the standard assumptions. For a polynomial regression, let $\begin{pmatrix}x_{1},x_{2},\ldots,x_{n}\end{pmatrix}^{T}=:x\in\mathbb{R}^{n}$ be some vector; then we have $$X:=\begin{bmatrix}x^{0},x^{1},x^{2},\ldots,x^{k-1}\end{bmatrix}\textrm{,}$$ where exponentiation is understood element-wise.

The OLS estimate for the polynomial weights is $$\hat{\beta} := \left(X^{T}X\right)^{-1}X^{T}y\textrm{.}$$ For any $t\in\mathbb{R}$ we can set $$z:=\begin{pmatrix}t^{0}\\t^{1}\\t^{2}\\\vdots\\t^{k-1}\end{pmatrix}\in\mathbb{R}^{k}\textrm{.}$$ An estimate of $y$ at $t$ is then given by $\hat{y}_{t}:=z^{T}\hat{\beta}$. For the variance of $\hat{y}_{t}$ we need to know its expected value: \begin{align} \mathbb{E}\left[\hat{y}_{t}\right] &= \mathbb{E}\left[z^{T}\hat{\beta}\right] =\mathbb{E}\left[z^{T}\left(X^{T}X\right)^{-1}X^{T}y\right] \\ &=\mathbb{E}\left[z^{T}\left(X^{T}X\right)^{-1}X^{T}X\beta + z^{T}\left(X^{T}X\right)^{-1}X^{T}u\right] \\ &=z^{T}\beta + z^{T}\left(X^{T}X\right)^{-1}X^{T}\mathbb{E}\left[u\right] = z^{T}\beta \end{align} From this calculation we see that $$\hat{y}_{t}-\mathbb{E}\left[\hat{y}_{t}\right] = z^{T}\left(X^{T}X\right)^{-1}X^{T}u\textrm{.}$$ Now we can calculate the variance of $\hat{y}_{t}$: \begin{align} \operatorname{Var}\left[\hat{y}_{t}\right] &= \mathbb{E}\left[\left(\hat{y}_{t}-\mathbb{E}\left[\hat{y}_{t}\right]\right)\left(\hat{y}_{t}-\mathbb{E}\left[\hat{y}_{t}\right]\right)^{T}\right] \\ &= \mathbb{E}\left[\left(z^{T}\left(X^{T}X\right)^{-1}X^{T}u\right)\left(z^{T}\left(X^{T}X\right)^{-1}X^{T}u\right)^{T}\right] \\ &= \mathbb{E}\left[\left(z^{T}\left(X^{T}X\right)^{-1}X^{T}u\right)\left(u^{T}X\left(X^{T}X\right)^{-1}z\right)\right] \\ &= z^{T}\left(X^{T}X\right)^{-1}X^{T}\mathbb{E}\left[uu^{T}\right]X\left(X^{T}X\right)^{-1}z \\ &= \sigma^{2}z^{T}\left(X^{T}X\right)^{-1}z \textrm{.} \end{align} If we increase $k\mapsto k+1$, we will have $$X_{*}:=\begin{bmatrix}x^{0},x^{1},x^{2},\ldots,x^{k-1},x^{k}\end{bmatrix}\in\mathbb{R}^{n\times\left(k+1\right)}\textrm{,}$$ and therefore $\hat{\beta}_{*}\in\mathbb{R}^{k+1}$ and $$z_{*}:=\begin{pmatrix}t^{0}\\t^{1}\\t^{2}\\\vdots\\t^{k-1}\\t^{k}\end{pmatrix}\in\mathbb{R}^{k+1}\textrm{.}$$ The variance of $\hat{y}_{t}^{*}$ is now \begin{equation} \operatorname{Var}\left[\hat{y}_{t}^{*}\right]=\sigma^{2}z_{*}^{T}\left(X_{*}^{T}X_{*}\right)^{-1}z_{*}\textrm{,} \end{equation} which involves the inverse of a $\left(k+1\right)\times\left(k+1\right)$ matrix and has to be compared to $\operatorname{Var}\left[\hat{y}_{t}\right]$, which involves a $k\times k$ inverse.
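Before comparing the two, here is a quick Monte-Carlo sanity check (a sketch of mine, not part of the argument; the design, $\beta$ and all names are arbitrary) of the formula $\operatorname{Var}\left[\hat{y}_{t}\right]=\sigma^{2}z^{T}\left(X^{T}X\right)^{-1}z$ derived above:

```python
# Monte-Carlo check of Var[y_hat_t] = sigma^2 * z^T (X^T X)^{-1} z with X held fixed.
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma, t = 40, 4, 1.0, 0.25
x = rng.uniform(-1.0, 1.0, n)
X = np.vander(x, N=k, increasing=True)                    # columns x^0, ..., x^(k-1)
z = np.vander(np.array([t]), N=k, increasing=True)[0]     # (t^0, ..., t^(k-1))
beta = np.ones(k)

preds = []
for _ in range(20000):
    y = X @ beta + sigma * rng.standard_normal(n)         # y = X beta + u
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)          # OLS estimate
    preds.append(z @ beta_hat)                            # y_hat_t = z^T beta_hat

print("empirical variance :", np.var(preds))
print("closed-form formula:", sigma**2 * z @ np.linalg.solve(X.T @ X, z))
# The two numbers agree up to Monte-Carlo error.
```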
Since we have inverses, the Schur complement will help: \begin{equation} \begin{pmatrix}A & B\\C & D\end{pmatrix}^{-1} = \begin{pmatrix} \left(A-B D^{-1} C \right)^{-1} & -\left(A-B D^{-1} C \right)^{-1} B D^{-1} \\ -D^{-1}C\left(A-B D^{-1} C \right)^{-1} & D^{-1}+ D^{-1} C \left(A-B D^{-1} C \right)^{-1} B D^{-1} \end{pmatrix} \end{equation} Since we have $$X_{*} := \begin{bmatrix}X,x^{k}\end{bmatrix}$$ and $$z_{*} := \begin{pmatrix}z\\t^{k}\end{pmatrix}$$ we can write, with the abbreviation $q:=x^{k}$, \begin{equation} \operatorname{Var}\left[\hat{y}_{t}^{*}\right] = \sigma^{2} \begin{pmatrix}z^{T},t^{k}\end{pmatrix} \begin{pmatrix}X^{T}X & X^{T}q \\ q^{T}X & q^{T}q\end{pmatrix}^{-1} \begin{pmatrix}z\\t^{k}\end{pmatrix} \textrm{.} \end{equation} We can now invert this block matrix using the aforementioned Schur complement and get \begin{align} \operatorname{Var}\left[\hat{y}_{t}^{*}\right] &= \sigma^{2} \begin{pmatrix}z^{T},t^{k}\end{pmatrix} \begin{pmatrix} \left(X^{T}X - X^{T}q \left(q^{T}q\right)^{-1} q^{T}X \right)^{-1} & B_{*} \\ B_{*}^{T} & D_{*} \end{pmatrix} \begin{pmatrix}z\\t^{k}\end{pmatrix} \\ &= \sigma^{2} \left( z^{T}\left(X^{T}X - X^{T}q \left(q^{T}q\right)^{-1} q^{T}X \right)^{-1} z + t^{k}z^{T}B_{*} + t^{k}B_{*}^{T}z + t^{2k}D_{*} \right) \textrm{.} \end{align} The matrix $X^{T}q \left(q^{T}q\right)^{-1} q^{T}X$ is positive semi-definite, because it can be written as $$X^{T}q \left(q^{T}q\right)^{-1} q^{T}X = \left(q^{T}q\right)^{-1}X^{T}qq^{T}X$$ and $qq^{T}$ is a rank $1$ matrix with its only non-vanishing eigenvalue equal to $q^{T}q$. The matrix $q \left(q^{T}q\right)^{-1} q^{T}$ is the projection onto the subspace spanned by $q=x^{k}$, so $X^{T}X \succeq X^{T}q \left(q^{T}q\right)^{-1} q^{T}X$, i.e. the difference $X^{T}X - X^{T}q \left(q^{T}q\right)^{-1} q^{T}X$ is positive semi-definite (and positive definite when $X_{*}$ has full column rank). Inversion reverses the ordering of positive definite matrices, so $$X^{T}X \succeq X^{T}X - X^{T}q \left(q^{T}q\right)^{-1} q^{T}X \implies \left(X^{T}X\right)^{-1} \preceq \left(X^{T}X - X^{T}q \left(q^{T}q\right)^{-1} q^{T}X\right)^{-1}\textrm{.} $$ So in \begin{equation} \operatorname{Var}\left[\hat{y}_{t}^{*}\right] = \sigma^{2}z^{T}\left(X^{T}X - X^{T}q \left(q^{T}q\right)^{-1} q^{T}X \right)^{-1} z + 2\sigma^{2}t^{k}z^{T}B_{*} + \sigma^{2}t^{2k}D_{*} \end{equation} the first term alone is already at least $\sigma^{2}z^{T}\left(X^{T}X\right)^{-1}z = \operatorname{Var}\left[\hat{y}_{t}\right]$. Working out the remaining terms (equivalently, using the Schur complement of $X^{T}X$ instead, which gives $\operatorname{Var}\left[\hat{y}_{t}^{*}\right]-\operatorname{Var}\left[\hat{y}_{t}\right]=\sigma^{2}\left(t^{k}-q^{T}X\left(X^{T}X\right)^{-1}z\right)^{2}\big/\left(q^{T}q-q^{T}X\left(X^{T}X\right)^{-1}X^{T}q\right)\ge 0$) shows that with increasing polynomial degree the variance is non-decreasing.
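The conclusion can also be checked numerically for $W=I$. The following sketch (my own; the data and names are arbitrary) evaluates $z^{T}\left(X^{T}X\right)^{-1}z$ while columns $x^{k}$ are appended one by one:

```python
# Check (for W = I) that z^T (X^T X)^{-1} z cannot decrease when a column x^k is appended.
# The constant factor sigma^2 is omitted.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, 40)
t = 0.25
values = []
for k in range(1, 8):                                     # number of columns x^0, ..., x^(k-1)
    X = np.vander(x, N=k, increasing=True)
    z = np.vander(np.array([t]), N=k, increasing=True)[0]
    values.append(z @ np.linalg.solve(X.T @ X, z))
print(values)
# As shown above, this sequence is non-decreasing (up to floating-point error).
```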

  • Dear Marco, thank you for your answer. I am a bit confused about you saying that $qq^{T}$ is rank 1. With your notation, $q$ is just a scalar.
    – warsaga
    Commented Mar 2, 2015 at 12:14
  • I adopted the convention that every vector $v$ is a column vector and every vector $v^{T}$ is a row vector. So $qq^{T}$ is a rank-one matrix whose only non-trivial eigenvalue is $q^{T}q$. But you are right: when I first defined $x$, I did not write the transpose sign. I will add it in my answer.
    Commented Mar 3, 2015 at 6:41
  • I thought your arguments would trivially generalise to the weighted case, but because of the $W^2$ in the expression for $\text{Var}\,\hat{f}(x_{0})$ they don't. Do you have any advice on that?
    – warsaga
    Commented Mar 8, 2015 at 9:49

I'd like to share my incomplete solution: it works for $W=I$, but I failed to prove the claim for a general weight matrix $W$.

In contrast to Marco's answer, we can start from the estimator for the higher degree, say $d$, and treat the estimator for degree $d-1$ as a constrained least squares estimator of degree $d$.

For simplicity, consider $W = I$ first. Let $\hat \beta$ be the least squares estimator of degree $d$, i.e., the estimator based on the design matrix

$$ X = \begin{bmatrix} 1 & x_1 & \cdots & x_1^d\\ 1 & x_2 & \cdots & x_2^d\\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_n & \cdots & x_n^d \end{bmatrix} $$

and let $\hat\beta_c$ be the constrained least squares estimator under the constraint

$$ A\beta:= \begin{bmatrix} 0 & 0 &\cdots & 0 & 1 \end{bmatrix}\beta = 0 $$

In other words, $\hat\beta_c$ has the same dimension as $\hat\beta$, but its last element is $0$.

Using a Lagrange multiplier, it can be shown (see, e.g., https://en.wikipedia.org/wiki/Ordinary_least_squares#Constrained_estimation) that

$$ \hat\beta_c = [I - (X^TX)^{-1}A^T(A(X^TX)^{-1}A^T)^{-1}A]\hat\beta\triangleq (I-\Delta)\hat\beta\,. $$
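A quick numerical check (my own sketch; the data, the degree $d$ and all names are arbitrary) that this Lagrange-multiplier formula really reproduces the ordinary degree-$(d-1)$ fit with a zero appended:

```python
# Verify: the constrained estimator beta_c equals the unconstrained degree-(d-1) fit padded with 0.
import numpy as np

rng = np.random.default_rng(3)
n, d = 30, 4
x = rng.uniform(-1.0, 1.0, n)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(n)

X = np.vander(x, N=d + 1, increasing=True)                # degree-d design matrix
A = np.zeros((1, d + 1))
A[0, -1] = 1.0                                            # constraint: last coefficient = 0

G = np.linalg.inv(X.T @ X)
beta_hat = G @ X.T @ y
Delta = G @ A.T @ np.linalg.inv(A @ G @ A.T) @ A
beta_c = (np.eye(d + 1) - Delta) @ beta_hat               # constrained estimator

X_low = np.vander(x, N=d, increasing=True)                # degree-(d-1) design matrix
beta_low = np.linalg.solve(X_low.T @ X_low, X_low.T @ y)

print(np.allclose(beta_c, np.append(beta_low, 0.0)))      # True
```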

Denote

$$ (X^TX)^{-1} = \begin{bmatrix} * & * & \cdots & \sigma_{0d}\\ * & * & \cdots & \sigma_{1d}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_{d0} & \sigma_{d1} & \cdots & \sigma_{dd} \end{bmatrix}_{(d+1)\times (d+1)} $$

then $A(X^TX)^{-1}A^T=\sigma_{dd}$, and

$$ A^T(A(X^TX)^{-1}A^T)^{-1}A = \begin{bmatrix} 0 & 0 & \cdots & 0\\ 0 & 0 & \cdots & 0\\ \vdots & \vdots & \ddots & 0\\ 0 & 0 & \cdots & \frac{1}{\sigma_{dd}} \end{bmatrix} $$

where only the last diagonal element is nonzero, so only the last column of $\Delta$ is nonzero, with values

$$ \Delta_d = \begin{bmatrix} \frac{\sigma_{0d}}{\sigma_{dd}} & \frac{\sigma_{1d}}{\sigma_{dd}} & \cdots & 1 \end{bmatrix}^T\,. $$
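The structure of $\Delta$ can be confirmed directly (again a sketch of mine with arbitrary data):

```python
# Confirm: only the last column of Delta is nonzero, and it equals (sigma_0d, ..., sigma_dd)^T / sigma_dd.
import numpy as np

rng = np.random.default_rng(4)
n, d = 30, 4
x = rng.uniform(-1.0, 1.0, n)
X = np.vander(x, N=d + 1, increasing=True)
A = np.zeros((1, d + 1))
A[0, -1] = 1.0

G = np.linalg.inv(X.T @ X)                                # entries sigma_{ij}
Delta = G @ A.T @ np.linalg.inv(A @ G @ A.T) @ A

print(np.allclose(Delta[:, :-1], 0.0))                    # True: the first d columns vanish
print(np.allclose(Delta[:, -1], G[:, -1] / G[-1, -1]))    # True: last column = sigma_{kd} / sigma_{dd}
```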

Note that

$$ \newcommand\Var{\mathrm{Var}} \newcommand\E{\mathrm{E}} \begin{align} \Var(\hat\beta_c) &= \E(\hat\beta_c-\beta_c)(\hat\beta_c-\beta_c)^T\\ &=(I-\Delta)\E(\hat\beta-\beta)(\hat\beta-\beta)^T(I-\Delta^T)\\ &=(I-\Delta)\Var(\hat\beta)(I-\Delta^T) \end{align} $$

Then, with a new observation $z = [1, x, \cdots, x^d]^T$, and since $\Var(\hat\beta)=\sigma^2(X^TX)^{-1}$, we get $$ \begin{align} \Var(z^T\hat\beta_c) &= z^T\Var(\hat\beta_c)z\\ &=z^T\Var(\hat\beta)z-2z^T\Delta\Var(\hat\beta)z + z^T\Delta\Var(\hat\beta)\Delta^Tz\\ &=\Var(z^T\hat\beta)-2\sigma^2z^T\Delta(X^TX)^{-1}z + \sigma^2z^T\Delta(X^TX)^{-1}\Delta^Tz \end{align}\,, $$

where

$$ \begin{align} z^T\Delta(X^TX)^{-1}z &= \begin{bmatrix}0,0,\ldots,\sum\limits_{k=0}^dx^k\dfrac{\sigma_{kd}}{\sigma_{dd}}\end{bmatrix}(X^TX)^{-1}z\\ &= \sum_{k=0}^dx^k\frac{\sigma_{kd}}{\sigma_{dd}}\begin{bmatrix}\sigma_{d0},\sigma_{d1},\ldots,\sigma_{dd}\end{bmatrix}z \\ &=\left(\sum_{k=0}^dx^k\frac{\sigma_{kd}}{\sigma_{dd}}\right)\cdot \left(\sum_{k=0}^dx^k\sigma_{kd}\right)\\ &=\left(\sum_{k=0}^dx^k\frac{\sigma_{kd}}{\sigma_{dd}}\right)^2\sigma_{dd} \end{align} $$ and $$ z^T\Delta(X^TX)^{-1}\Delta^Tz = \left(\sum_{k=0}^dx^k\frac{\sigma_{kd}}{\sigma_{dd}}\right)^2\sigma_{dd}\,. $$ Thus, $$ \Var(z^T\hat\beta_c) = \Var(z^T\hat\beta) - \sigma^2z^T\Delta(X^TX)^{-1}\Delta^Tz \le \Var(z^T\hat\beta)\,, $$ which means the variance does not decrease as the degree of the local polynomial increases.
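To see this identity in numbers (my own sketch, arbitrary data), one can compute both quadratic forms and the closed-form gap $\sigma^2\left(\sum_{k}x^{k}\sigma_{kd}/\sigma_{dd}\right)^{2}\sigma_{dd}\ge 0$:

```python
# Check: Var(z^T beta_hat) - Var(z^T beta_hat_c) = sigma^2 * (sum_k x^k sigma_kd / sigma_dd)^2 * sigma_dd.
import numpy as np

rng = np.random.default_rng(5)
n, d, sigma2, x_new = 30, 4, 1.0, 0.3
xs = rng.uniform(-1.0, 1.0, n)
X = np.vander(xs, N=d + 1, increasing=True)
z = np.vander(np.array([x_new]), N=d + 1, increasing=True)[0]     # (1, x, ..., x^d)
A = np.zeros((1, d + 1))
A[0, -1] = 1.0

G = np.linalg.inv(X.T @ X)                                # entries sigma_{ij}
Delta = G @ A.T @ np.linalg.inv(A @ G @ A.T) @ A
I = np.eye(d + 1)

var_full = sigma2 * z @ G @ z                             # Var(z^T beta_hat)
var_constrained = sigma2 * z @ (I - Delta) @ G @ (I - Delta).T @ z
gap = sigma2 * (z @ G[:, -1] / G[-1, -1]) ** 2 * G[-1, -1]

print(var_full - var_constrained, gap)                    # equal up to rounding, and gap >= 0
```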

For a general weight matrix $W$, the constrained least squares estimator can again be obtained via a Lagrange multiplier,

$$ \hat\beta_c = [I - (X^TWX)^{-1}A^T(A(X^TWX)^{-1}A^T)^{-1}A]\hat\beta\triangleq (I-\Delta)\hat\beta\,, $$

and we can continue with the above procedure, but since

$$ \Var(\hat\beta) = (X^TWX)^{-1}(X^TW^2X)(X^TWX)^{-1}\,, $$

is more complex than in the case $W=I$, I failed to compare $2z^T\Delta\Var(\hat\beta)z$ and $z^T\Delta\Var(\hat\beta)\Delta^Tz$. I would appreciate any hints.
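Not an answer to the open question, but the weighted case can at least be probed numerically. The sketch below (all names, the Gaussian kernel and the bandwidth are my own choices) computes both quadratic forms for a random design, so the two terms above can be compared on concrete data; this is only a spot check, not a proof:

```python
# Spot check for a kernel weight matrix W: compare Var(z^T beta_hat_c) with Var(z^T beta_hat),
# where Var(beta_hat) = sigma^2 (X^T W X)^{-1} X^T W^2 X (X^T W X)^{-1}.
import numpy as np

rng = np.random.default_rng(6)
n, d, sigma2, x0, h = 30, 4, 1.0, 0.3, 0.5
xs = rng.uniform(-1.0, 1.0, n)
X = np.vander(xs, N=d + 1, increasing=True)
z = np.vander(np.array([x0]), N=d + 1, increasing=True)[0]
W = np.diag(np.exp(-0.5 * ((xs - x0) / h) ** 2))          # Gaussian kernel weights

G = np.linalg.inv(X.T @ W @ X)
V = sigma2 * G @ X.T @ W @ W @ X @ G                      # sandwich form of Var(beta_hat)
A = np.zeros((1, d + 1))
A[0, -1] = 1.0
Delta = G @ A.T @ np.linalg.inv(A @ G @ A.T) @ A
I = np.eye(d + 1)

var_full = z @ V @ z                                      # degree-d local fit at x0
var_constrained = z @ (I - Delta) @ V @ (I - Delta).T @ z # degree-(d-1) local fit at x0
print(var_constrained, var_full)                          # compare the two on this data
```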

