
I am using the Kalman filter in a very standard way. The system is represented by the state equation $x_{t+1}=Fx_{t}+v_{t+1}$ and the observation equation $y_{t}=Hx_{t}+Az_{t}+w_{t}$.

Textbooks teach that after applying the Kalman filter and getting the "one-step-ahead forecasts" $\hat{x}_{t|t-1}$ (or "filtered estimate"), we should use them to compute the likelihood function:

$$ f_{y_{t}|\mathcal{I}_{t-1},z_{t}}\left(y_{t}\mid\mathcal{I}_{t-1},z_{t}\right)=\det\left[2\pi\left(HP_{t|t-1}H^{\prime}+R\right)\right]^{-\frac{1}{2}}\exp\left\{ -\frac{1}{2}\left(y_{t}-H\hat{x}_{t|t-1}-Az_{t}\right)^{\prime}\left(HP_{t|t-1}H^{\prime}+R\right)^{-1}\left(y_{t}-H\hat{x}_{t|t-1}-Az_{t}\right)\right\} $$

My question is: Why is the likelihood function computed using the "filtered estimate" $\hat{x}_{t|t-1}$ and not the "smoothed estimate" $\hat{x}_{t|T}$? Isn't $\hat{x}_{t|T}$ a better estimate of the state vector?
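For concreteness, here is a minimal Python sketch of how one term of this density is evaluated from the filter's prediction step. The matrices, dimensions, and numbers below are made up purely for illustration; only the structure (prediction step, innovation, innovation covariance, Gaussian density) matters.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical model matrices, purely for illustration
F = np.array([[0.9, 0.1],
              [0.0, 0.8]])        # state transition
Q = 0.1 * np.eye(2)               # Var(v_t), state noise covariance
H = np.array([[1.0, 0.0]])        # observation matrix
A = np.array([[0.5]])             # loading on the exogenous z_t
R = np.array([[0.2]])             # Var(w_t), observation noise covariance

# Filtered moments from the previous step (assumed given here)
x_filt = np.array([0.3, -0.1])    # \hat{x}_{t-1|t-1}
P_filt = np.array([[0.5, 0.1],
                   [0.1, 0.4]])   # P_{t-1|t-1}

# Prediction step: one-step-ahead mean and covariance of the state
x_pred = F @ x_filt               # \hat{x}_{t|t-1}
P_pred = F @ P_filt @ F.T + Q     # P_{t|t-1}

# Likelihood contribution f(y_t | I_{t-1}, z_t): Gaussian density of the innovation
y_t, z_t = np.array([0.7]), np.array([1.0])
innovation = y_t - H @ x_pred - A @ z_t
S = H @ P_pred @ H.T + R          # innovation covariance  H P_{t|t-1} H' + R
log_lik_t = multivariate_normal.logpdf(innovation, mean=np.zeros(1), cov=S)
```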


3 Answers

Answer 1

To answer your question: you can use the smoothing density, but you don't have to. Jarle Tufto's answer gives the decomposition you're using, but there are others.

Using the Kalman Recursions

Here you're evaluating the likelihood as $$ f(y_1, \ldots, y_n) = f(y_1)\prod_{i=2}^nf(y_i|y_1, \ldots, y_{i-1}). $$

In general, means and variances don't fully determine a probability distribution. But in this linear-Gaussian model every conditional distribution is normal, so the means and variances produced by the Kalman recursions characterize each factor completely. The following is the decomposition you're using to go from filtering distributions $f(x_{i-1}|y_1,\ldots,y_{i-1})$ to conditional likelihoods $f(y_i|y_1,\ldots,y_{i-1})$:

$$ f(y_i|y_1, \ldots, y_{i-1}) = \iint f(y_i|x_i)f(x_i|x_{i-1})f(x_{i-1}|y_1, \ldots, y_{i-1})dx_{i} dx_{i-1} \tag{1}. $$

Here $f(x_i|x_{i-1})$ is the state-transition density and $f(y_i|x_i)$ is the observation density; both are part of the model. In your question you write these as $x_{t+1}=Fx_{t}+v_{t+1}$ and $y_{t}=Hx_{t}+Az_{t}+w_{t}$, respectively; it's the same model, just written as densities.

When you get the one-step-ahead state prediction distribution, that's computing the inner integral $\int f(x_i|x_{i-1})f(x_{i-1}|y_1, \ldots, y_{i-1})\, dx_{i-1}$. When you integrate over $x_i$ as well, you obtain (1). The density you write out in your question is exactly this quantity.
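In the linear-Gaussian case both integrals have closed forms. Writing $Q=\mathrm{Var}(v_t)$ for the state-noise covariance and $\hat{x}_{i-1|i-1}$, $P_{i-1|i-1}$ for the filtered mean and covariance from the previous step, the two steps are the standard prediction formulas

$$ f(x_i\mid y_1,\dots,y_{i-1}) = \int f(x_i\mid x_{i-1})\,f(x_{i-1}\mid y_1,\dots,y_{i-1})\,dx_{i-1} = \mathcal{N}\!\left(x_i;\; F\hat{x}_{i-1|i-1},\; FP_{i-1|i-1}F' + Q\right), $$

$$ f(y_i\mid y_1,\dots,y_{i-1}) = \int f(y_i\mid x_i)\,f(x_i\mid y_1,\dots,y_{i-1})\,dx_i = \mathcal{N}\!\left(y_i;\; H\hat{x}_{i|i-1} + Az_i,\; HP_{i|i-1}H' + R\right). $$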

Here you're only using decompositions of probability distributions and the assumptions of the model. This likelihood calculation is exact; there is nothing discretionary here that could be done better or worse.
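As an illustration of how little is needed beyond the recursions, here is a minimal Python sketch (the function name, initialization, and interface are my own, not from the answer) that runs the filter and accumulates $\log f(y_1)+\sum_{i\ge 2}\log f(y_i\mid y_1,\dots,y_{i-1})$:

```python
import numpy as np

def kalman_loglik(y, z, F, H, A, Q, R, x0, P0):
    """Exact log-likelihood via the prediction-error decomposition.

    y: (n, p) observations; z: (n, q) exogenous regressors.
    x0, P0: mean and covariance of the initial state x_1 given no data.
    """
    loglik = 0.0
    x, P = x0, P0                               # predictive moments of the current state
    for t in range(len(y)):
        # Predictive density of y_t given the past: N(Hx + Az_t, HPH' + R)
        innovation = y[t] - H @ x - A @ z[t]
        S = H @ P @ H.T + R
        _, logdet = np.linalg.slogdet(2 * np.pi * S)
        loglik += -0.5 * (logdet + innovation @ np.linalg.solve(S, innovation))

        # Measurement update (filtering): condition the state on y_t
        K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
        x = x + K @ innovation
        P = P - K @ H @ P

        # Time update (prediction): push the filtered state forward
        x = F @ x
        P = F @ P @ F.T + Q
    return loglik
```

Maximizing this over the unknown parameters, for example by handing the negative log-likelihood to a numerical optimizer such as scipy.optimize.minimize, is the usual "Kalman filter" route to maximum likelihood.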

Using the EM Algorithm

To my knowledge, there is no other way to evaluate the likelihood directly in this kind of state space model. However, you can still do maximum likelihood estimation by evaluating a different function: you can use the EM algorithm. In the Expectation step (E-step) you compute $$ \int f(x_1, \ldots, x_n|y_1,\ldots, y_n) \log f(y_1,\ldots,y_n,x_1, \ldots,x_n)\, dx_{1:n} = E_{\text{smooth}}[\log f(y_1,\ldots,y_n,x_1, \ldots,x_n)]. $$ Here $f(y_1,\ldots,y_n,x_1, \ldots,x_n)$ is the "complete data" likelihood, and you're taking the expectation of its logarithm with respect to the joint smoothing density. Because the log of the complete-data likelihood splits into sums, and because of the linearity of expectation, you end up taking expectations with respect to the marginal smoothing distributions (the ones you mention in your question).
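To see why only smoothed moments are needed (a standard fact about EM for linear-Gaussian state space models, stated here for reference), note that the complete-data log-likelihood splits into a sum,

$$ \log f(y_1,\ldots,y_n,x_1,\ldots,x_n) = \log f(x_1) + \sum_{i=2}^{n}\log f(x_i\mid x_{i-1}) + \sum_{i=1}^{n}\log f(y_i\mid x_i), $$

and each term is a quadratic form in the states. Its expectation under the joint smoothing density therefore involves only the smoothed means $\hat{x}_{i|n}$, the smoothed variances $P_{i|n}$, and the lag-one cross-covariances $\mathrm{Cov}(x_i,x_{i-1}\mid y_1,\ldots,y_n)$, all of which the RTS smoother (plus the usual lag-one covariance recursion) provides.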

Other things

I've read in places that EM is a "more stable" way to maximize the likelihood, but I've never seen this point argued well, or the word "stable" defined, and I haven't examined it further. Neither approach gets around the local/global maxima problem. I personally tend to use the Kalman filter likelihood more often, just out of habit.

It's true that smoothed estimates of the state typically have smaller variance than filtered ones, so your intuition is reasonable, but you're not really using the states here: the likelihood you're trying to maximize is not a function of the states.

Comments:

  • How different are KF and EM? They end up doing the same thing in vaguely similar manners. – Mitch
  • @Mitch That's probably something that deserves more than a comment. It will depend on what general-purpose optimizer you use with the KF, and what type of EM you use; I'm not going to be too sure without looking into it. – Taylor
Answer 2

In general, by the product rule, the exact likelihood can be written $$ f(y_1,\dots,y_n)=f(y_1)\prod_{t=2}^n f(y_t\mid y_1,\dots,y_{t-1}). $$ From the assumptions of the state space model, it follows that the expectation vector and variance matrix of each $y_t$ conditional on past observations can be expressed as \begin{align} E(y_t\mid y_1,\dots,y_{t-1}) &= E(Hx_{t}+Az_{t}+w_{t}\mid y_1,\dots,y_{t-1}) \\&= HE(x_{t}\mid y_1,\dots,y_{t-1})+Az_{t}+E(w_{t}) \\&= H\hat x_{t|t-1}+Az_{t}, \end{align} and \begin{align} \mathrm{Var}(y_t\mid y_1,\dots,y_{t-1}) &= \mathrm{Var}(Hx_{t}+Az_{t}+w_{t}\mid y_1,\dots,y_{t-1}) \\&= H\mathrm{Var}(x_{t}\mid y_1,\dots,y_{t-1})H'+ \mathrm{Var}(w_t) \\&= HP_{t|t-1}H'+R. \end{align} So this gives you the exact likelihood without computing any smoothed estimates.
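Written out, combining these conditional moments with the product-rule decomposition gives the prediction error decomposition of the exact log-likelihood,

$$ \log f(y_1,\dots,y_n) = -\frac{1}{2}\sum_{t=1}^{n}\left[\log\det\left(2\pi S_t\right) + e_t' S_t^{-1} e_t\right], \qquad e_t = y_t - H\hat{x}_{t|t-1} - Az_t, \quad S_t = HP_{t|t-1}H' + R, $$

which is exactly the product of the Gaussian densities in the question, accumulated term by term as the filter runs.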

While you could of course use the smoothed estimates, which are indeed better estimates of the unknown states, this would not give you the likelihood function. In effect, you would be using the observed value of $y_t$ to estimate its own expected value, so this would likely introduce some bias into the resulting parameter estimates.

Answer 3

I think a better answer as to why the smoothing distribution is not used (typically) is efficiency. It is in principle straightforward to calculate a "smoothing" marginal likelihood in a leave-one-out sense as follows: delete observation $j$, run the Kalman smoother on the remaining data, then evaluate the likelihood of the unseen $y_j$; repeat this for all $j$ and sum the log-likelihoods (a sketch follows below). Faster versions of this work with (randomized) blocks of held-out samples, as in k-fold cross-validation. Notice that this scheme requires a more general implementation of the Kalman filter/smoother that can skip measurement updates wherever required. The backward/smoothing pass does not access the measurements (in the RTS algorithm, at any rate) and remains the same.
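A minimal Python sketch of this scheme. Here `smooth_with_missing` is a hypothetical stand-in (not from any particular library) for a filter/smoother that skips measurement updates at masked time points and returns the smoothed mean and covariance of every state; the model matrices follow the question's notation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def loo_smoothing_loglik(y, z, H, A, R, smooth_with_missing):
    """Leave-one-out 'smoothing' log-likelihood.

    For each j: hide y[j], smooth the remaining data, then score the
    held-out y[j] under its predictive density given all other observations.
    smooth_with_missing(mask) is a user-supplied smoother that skips the
    measurement update wherever mask[t] is True and returns, for every t,
    the smoothed state means x_smooth and covariances P_smooth.
    """
    n = len(y)
    total = 0.0
    for j in range(n):
        mask = np.zeros(n, dtype=bool)
        mask[j] = True                        # hold out observation j
        x_smooth, P_smooth = smooth_with_missing(mask)
        # Predictive density of the held-out y[j] given all other observations
        mean_j = H @ x_smooth[j] + A @ z[j]
        S_j = H @ P_smooth[j] @ H.T + R
        total += multivariate_normal.logpdf(y[j], mean=mean_j, cov=S_j)
    return total
```

A blocked or k-fold version simply masks several indices at once and scores the whole held-out block.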

If the time series is "long enough", there is likely little benefit in doing this, since the filtering likelihood "burns off" its initial transient. But if the dataset is short, the more expensive smoothing likelihood may be worth it. A fixed-lag smoother could be an in-between solution.
