This is a question from a mathematical statistics textbook used in a first, introductory course in mathematical statistics for undergraduate students. The exercise follows the chapter on nonparametric inference. Part 1 is quite straightforward, but I am stuck on parts 2–6. I've looked at pages 344–346 of van der Vaart's book Asymptotic Statistics, but that seems to be the course book of a more advanced course. An attempt at a solution is given below. Any help is appreciated.
Exercise
Suppose $x_1, ..., x_n$ are independent and identically distributed (i.i.d.) observations of a random variable $X$ with unknown distribution function $F$ and probability density function $f\in C^m$, for some $m>1$ fixed. Let $$f_n(t)=\frac{1}{n}\sum_{i=1}^n \frac{1}{h}k\left(\frac{t-x_i}{h}\right)$$ be a kernel estimator of $f$, with $k\in C^{m+1}$ a given fixed function such that $k\geq 0$, $\int_{\mathbb{R}} k(u)\mathrm{d}u=1$, $\mathrm{supp} (k)=[-1,1]$ and bandwidth $h=h(u)$ (for the time being unspecified).
- Show that $\mathbb{E}[f_n(t)]=\int_{\mathbb{R}} k(u) f(t-hu)\mathrm{d}u$.
- Make a series expansion of $f$ around $t$ in terms of $hu$ in the expression for $\mathbb{E}[f_n(t)]$. Suppose that $k$ satisfies $\int_{\mathbb{R}} k(u)\mathrm{d}u=1$, $\int_{\mathbb{R}} k(u)u^l\mathrm{d}u=0$ for all $1<l<m$ and $\int_{\mathbb{R}} k(u)u^m\mathrm{d}u<\infty$. Determine the bias $\mathbb{E}[f_n(t)]-f(t)$ as a function of $h$.
- Suppose that $\mathrm{Var}[k(X_1)]<\infty$ and determine $\mathrm{Var}[f_n(t)]$ as a function of $h$.
- Determine the mean square error $\mathrm{mse}[f_n(t)]$ from 2 and 3 as a function of $h$.
- For what value of $h$, as a function of $n$, is $\mathrm{mse}[f_n(t)]$ smallest?
- For the value of $h=h(n)$ obtained from 5, how fast does $\mathrm{mse}[f_n(t)]$ converge to 0, when $n$ converges to $\infty$?
Note: $h=h(u)$ and $1<l<m$ are most likely typos for $h=h(n)$ and $1\leq l<m$, respectively.
Attempt
- By linearity of the expectation, identical distribution of $x_1,...,x_n$, the law of the unconscious statistician and the change of variables $u=(t-x)/h$, \begin{align} \mathbb{E}[f_n(t)]&=\frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\frac{1}{h}k\left(\frac{t-x_i}{h}\right)\right]\\ &=\mathbb{E}\left[\frac{1}{h}k\left(\frac{t-x}{h}\right)\right]\\ &=\int_{\mathbb{R}}\frac{1}{h}k\left(\frac{t-x}{h}\right)f(x)\mathrm{d}x\\ &=\int_{\mathbb{R}}\frac{1}{h}k(u)f(t-hu)h\mathrm{d}u\\ &=\int_{\mathbb{R}}k(u)f(t-hu)\mathrm{d}u. \end{align}
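As a numerical sanity check of this identity, here is a short simulation sketch. The concrete choices (standard normal $f$, Epanechnikov kernel $k$, $t=0.5$, $h=0.3$) are hypothetical, picked only because this $k$ satisfies $k\geq 0$, $\int k = 1$ and $\mathrm{supp}(k)=[-1,1]$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical concrete choices: f standard normal, k the Epanechnikov kernel
f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
k = lambda u: np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)  # supp(k) = [-1, 1]

t, h = 0.5, 0.3

# Left-hand side: Monte Carlo estimate of E[(1/h) k((t - X)/h)] with X ~ f
x = rng.standard_normal(200_000)
lhs = np.mean(k((t - x) / h) / h)

# Right-hand side: numerical integral of k(u) f(t - h u) over supp(k) = [-1, 1]
u = np.linspace(-1.0, 1.0, 20_001)
rhs = np.sum(k(u) * f(t - h * u)) * (u[1] - u[0])

print(lhs, rhs)  # the two numbers should agree up to Monte Carlo error
```

Both sides compute the same quantity, so they should agree up to Monte Carlo and quadrature error.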
- From $f\in C^m$, it follows that $$f(t-hu)=\sum_{l=0}^m \frac{f^{(l)}(t)}{l!} (-hu)^l+o((hu)^m).$$ Then from part 1 and linearity of integration, \begin{align} \mathbb{E}[f_n(t)]&=\int_{\mathbb{R}}k(u)\left(\sum_{l=0}^m \frac{f^{(l)}(t)}{l!} (-hu)^l+o((hu)^m)\right)\mathrm{d}u \\ &=\sum_{l=0}^m\int_{\mathbb{R}}k(u)\frac{f^{(l)}(t)(-hu)^l}{l!}\mathrm{d}u+\int_{\mathbb{R}}k(u)o((hu)^m)\mathrm{d}u. \label{remain} \end{align} From the given conditions on $k$, the $l=0$ term reads \begin{equation} \int_{\mathbb{R}} k(u)f(t)\mathrm{d}u=f(t)\int_{\mathbb{R}} k(u) \mathrm{d}u=f(t). \end{equation} The $1\leq l<m$ terms are $$\int_{\mathbb{R}} k(u)\frac{f^{(l)}(t)}{l!} (-hu)^l\mathrm{d}u=\frac{f^{(l)}(t)(-h)^l}{l!}\int_{\mathbb{R}} k(u)u^l\mathrm{d}u=0.$$ Finally, the $l=m$ term is $$ \frac{f^{(m)}(t)(-h)^m}{m!}\int_{\mathbb{R}} k(u)u^m\mathrm{d}u<\infty.$$ The remainder term is given in Misius's answer (+1). Putting it all together: $$\mathbb{E}[f_n(t)] = f(t) + \frac{f^{(m)}(t)(-h)^m}{m!} \int_{\mathbb{R}}k(u)u^m \mathrm{d}u + o(h^m),$$ and thus $$\mathbb{E}[f_n(t)]-f(t)=\frac{f^{(m)}(t)(-h)^m}{m!} \int_{\mathbb{R}}k(u)u^m \mathrm{d}u + o(h^m)=A(t)h^m+o(h^m),$$ where $A(t)=\frac{f^{(m)}(t)(-1)^m}{m!} \int_{\mathbb{R}}k(u)u^m \mathrm{d}u<\infty.$
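The bias order can also be seen concretely by checking that $(\mathbb{E}[f_n(t)]-f(t))/h^m$ stabilises at $A(t)$ as $h\to 0$. A sketch for the hypothetical case $m=2$ (standard normal $f$, Epanechnikov kernel, for which $\int_{\mathbb{R}} u\,k(u)\,\mathrm{d}u=0$ and $\int_{\mathbb{R}} u^2k(u)\,\mathrm{d}u=1/5$):

```python
import numpy as np

# Hypothetical concrete case with m = 2: f standard normal, k Epanechnikov
f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
fpp = lambda x: (x**2 - 1) * f(x)              # f''(t) for the standard normal
k = lambda u: np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

t = 0.5
u = np.linspace(-1.0, 1.0, 100_001)
du = u[1] - u[0]

mu2 = np.sum(k(u) * u**2) * du                 # ∫ k(u) u² du = 1/5 here
A = fpp(t) * mu2 / 2                           # A(t) with m = 2, since (-1)² = 1

for h in (0.2, 0.1, 0.05):
    bias = np.sum(k(u) * f(t - h * u)) * du - f(t)   # E[f_n(t)] - f(t), via part 1
    print(h, bias / h**2)                      # should approach A as h shrinks
print(A)
```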
- See Misius's answer; the result used in part 4 below is $$\mathrm{Var}[f_n(t)]=\frac{f(t)}{nh}\int_{\mathbb{R}}k^2(u)\mathrm{d}u+o\left(\frac{1}{nh}\right).$$
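The leading-order variance used in part 4, $\mathrm{Var}[f_n(t)]=\frac{f(t)}{nh}\int_{\mathbb{R}}k^2(u)\mathrm{d}u+o(1/(nh))$, can be checked against the exact expression $$\mathrm{Var}[f_n(t)]=\frac{1}{n}\left[\frac{1}{h}\int_{\mathbb{R}}k^2(u)f(t-hu)\mathrm{d}u-\left(\int_{\mathbb{R}}k(u)f(t-hu)\mathrm{d}u\right)^2\right],$$ which follows from $\mathrm{Var}=\mathbb{E}[\cdot^2]-\mathbb{E}[\cdot]^2$ and the same change of variables as in part 1. A sketch with hypothetical choices (standard normal $f$, Epanechnikov $k$):

```python
import numpy as np

# Compare the exact Var[f_n(t)] with the asymptotic f(t)/(n h) ∫k² du
# (hypothetical choices: f standard normal, k Epanechnikov)
f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
k = lambda u: np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

t, n = 0.5, 100_000
u = np.linspace(-1.0, 1.0, 100_001)
du = u[1] - u[0]
Rk = np.sum(k(u) ** 2) * du                    # ∫ k²(u) du = 3/5 here

for h in (0.2, 0.1, 0.05):
    m1 = np.sum(k(u) * f(t - h * u)) * du      # E[f_n(t)], from part 1
    m2 = np.sum(k(u) ** 2 * f(t - h * u)) * du / h
    var_exact = (m2 - m1**2) / n
    var_asym = f(t) * Rk / (n * h)
    print(h, var_exact / var_asym)             # ratio → 1 as h → 0
```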
\begin{align} \mathrm{mse}[f_n(t)]&=\mathrm{Var}[f_n(t)]+\mathrm{Bias}^2[f_n(t)] \\ &=\left(\frac{f(t)}{nh}\int_{\mathbb{R}}k^2(u)\mathrm{d}u+o\left(\frac{1}{nh}\right)\right)+ \left(A(t)h^m+o(h^m)\right)^2 \\ &=\left(\frac{f(t)}{nh}\int_{\mathbb{R}}k^2(u)\mathrm{d}u+o\left(\frac{1}{nh}\right)\right)+\left(A(t)^2h^{2m}+2A(t)h^mo(h^m)+o(h^{2m})\right) \\ &=\left(\frac{f(t)}{nh}\int_{\mathbb{R}}k^2(u)\mathrm{d}u+o\left(\frac{1}{nh}\right)\right)+\left(A(t)^2h^{2m}+o(h^{2m})+o(h^{2m})\right)\\ &=\left(\frac{f(t)}{nh}\int_{\mathbb{R}}k^2(u)\mathrm{d}u+o\left(\frac{1}{nh}\right)\right)+\left(A(t)^2h^{2m}+o(h^{2m})\right)\\ &\approx \frac{f(t)}{nh}\int_{\mathbb{R}}k^2(u)\mathrm{d}u+A(t)^2h^{2m}. \end{align}
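The approximation can also be checked by direct simulation: draw many samples, compute $f_n(t)$ for each, and compare the empirical mean squared error with the asymptotic formula. A sketch for the hypothetical $m=2$ case (standard normal $f$, Epanechnikov kernel, for which $\int k^2=3/5$ and $\int u^2k=1/5$):

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo check of mse[f_n(t)] ≈ f(t)∫k²/(n h) + A(t)² h^{2m} for m = 2
# (hypothetical choices: f standard normal, k Epanechnikov)
f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
k = lambda u: np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

t, n, h, reps = 0.5, 5_000, 0.1, 800
Rk, mu2 = 3 / 5, 1 / 5                     # ∫k²(u) du and ∫u²k(u) du for this kernel
A = (t**2 - 1) * f(t) * mu2 / 2            # A(t) = f''(t) μ₂ / 2 for m = 2

x = rng.standard_normal((reps, n))         # reps independent samples of size n
fn = np.mean(k((t - x) / h) / h, axis=1)   # f_n(t) for each replication

mse_emp = np.mean((fn - f(t)) ** 2)
mse_asym = f(t) * Rk / (n * h) + A**2 * h**4   # h^{2m} with m = 2
print(mse_emp, mse_asym)                   # same order, ratio close to 1
```

The agreement is only up to the $o$-terms and Monte Carlo noise, so the two numbers match in order of magnitude rather than exactly.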
From the approximation obtained in part 4, it follows that $\mathrm{mse}[f_n(t)](h)$ has an absolute minimum for $h\in(0,\infty)$, since $\mathrm{mse}[f_n(t)](h)\to\infty$ for $h\to 0$ and $h\to \infty$. The absolute minimum is found by differentiating $\mathrm{mse}[f_n(t)](h)$ and solving for $h$ when the derivative equals $0$, that is \begin{equation} h=\left(\frac{f(t)\int_{\mathbb{R}}k^2(u)\mathrm{d}u}{A^2(t)2mn}\right)^{1/(2m+1)}. \end{equation}
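As a sanity check, the closed-form minimiser can be compared with a grid search over the approximate $\mathrm{mse}(h)=C_1/(nh)+A^2h^{2m}$; the constants below are hypothetical stand-ins for $C_1=f(t)\int_{\mathbb{R}}k^2(u)\mathrm{d}u$ and $A=A(t)$:

```python
import numpy as np

# Verify the closed-form minimiser of mse(h) = C1/(n h) + A² h^{2m} by grid search
m, n = 2, 10_000
C1, A = 0.21, -0.026                        # hypothetical stand-in constants

mse = lambda h: C1 / (n * h) + A**2 * h ** (2 * m)
h_star = (C1 / (A**2 * 2 * m * n)) ** (1 / (2 * m + 1))   # formula from part 5

hs = np.linspace(0.5 * h_star, 2.0 * h_star, 100_001)
h_grid = hs[np.argmin(mse(hs))]
print(h_star, h_grid)                       # grid minimiser ≈ closed form
```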
Plugging in the value of $h$ obtained in part 5 into the approximation obtained in part 4, one finds that \begin{equation} \mathrm{mse}[f_n(t)]\propto n^{-2m/(2m+1)}, \end{equation} since both $1/nh$ and $h^{2m}$ reduce to $n^{-2m/(2m+1)}$ for $h\propto n^{-1/(2m+1)}$.
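The collapse of both terms to a single power of $n$ is easy to verify numerically: with $h(n)\propto n^{-1/(2m+1)}$, increasing $n$ tenfold shrinks the approximate mse by exactly a factor $10^{-2m/(2m+1)}$. A sketch with hypothetical stand-in constants:

```python
# Rate check: with h(n) from part 5, mse[f_n(t)] scales as n^{-2m/(2m+1)}
m = 2
C1, A2 = 0.21, 0.026**2                     # stand-ins for f(t)∫k² and A(t)²

def mse_opt(n):
    h = (C1 / (A2 * 2 * m * n)) ** (1 / (2 * m + 1))   # optimal h from part 5
    return C1 / (n * h) + A2 * h ** (2 * m)

ratio = mse_opt(10 * 1_000) / mse_opt(1_000)
print(ratio, 10 ** (-2 * m / (2 * m + 1)))  # both ≈ 0.158 for m = 2
```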