
I have implemented the Viterbi and Forward algorithms, but strangely I can't understand how the Backward algorithm works. Intuitively I feel like I need to do the same thing as in Forward, only backwards, using the values calculated during the forward pass.

Is my intuition correct?

I have read a lot of slides and am sick of math notation at this point; it does not help. I need something that explains, in plain English, the difference between the Backward and Forward algorithms.

Can you provide a short explanation of how the Backward algorithm works?

Assume the following small HMM and the results of Forward algorithm for "BB" sequence below:

START -> 1
H: 0.5 * 0.8 = 0.4
L: 0.5 * 0.6 = 0.3

1 -> 2
H: 0.4 * 0.2 * 0.8 + 0.3 * 0.6 * 0.8 = 0.208
L: 0.4 * 0.8 * 0.6 + 0.3 * 0.4 * 0.6 = 0.264

2 -> END
END: 0.208 * 0.3 + 0.264 * 0.7 = 0.2472
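To make the forward numbers above concrete, here is a minimal Python sketch of the forward pass for this model. The variable names are my own; the initial, transition, emission, and END probabilities are read off the worked example (only the emission probability for observation B is shown in the question, so that is all the sketch encodes):

```python
# Model parameters as implied by the worked example (names are hypothetical)
init  = {"H": 0.5, "L": 0.5}                    # P(x_1 = i)
trans = {"H": {"H": 0.2, "L": 0.8},             # P(x_{t+1} = j | x_t = i)
         "L": {"H": 0.6, "L": 0.4}}
end   = {"H": 0.3, "L": 0.7}                    # P(END | x_T = i)
emit  = {"H": {"B": 0.8}, "L": {"B": 0.6}}      # P(z_t = B | x_t = i)

def forward(obs):
    # alpha[t][i] = P(x_t = i, z_{1:t})
    alpha = [{i: init[i] * emit[i][obs[0]] for i in init}]
    for z in obs[1:]:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * trans[i][j] for i in prev) * emit[j][z]
                      for j in init})
    return alpha

alpha = forward("BB")
p_seq = sum(alpha[-1][i] * end[i] for i in end)
# alpha[0] ≈ {'H': 0.4, 'L': 0.3}
# alpha[1] ≈ {'H': 0.208, 'L': 0.264}
# p_seq ≈ 0.2472
```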



1 Answer


Looks like your observation sequence is B,B. Let's denote the observation at time $t$ as $z_t$ and hidden state at time $t$ as $x_t$. If we denote $\alpha_t(i)$ as the forward values and $\beta_t(i)$ as the backward values, ($i$ is one of the possible hidden states)

$\alpha_t(i)=P(x_t=i,z_{1:t})$

This means $\alpha_t(i)$ is the probability of arriving at state $i$ at time $t$ while emitting the observations up to time $t$. Then,

$\beta_t(i) = P(z_{t+1:T}\mid x_t=i)$, which is the probability of emitting the remaining sequence from $t+1$ to the end, given that the hidden state at time $t$ is $i$.

To do the recursion on $\beta_t(i)$ we can write,

$P(z_{t+1:T}\mid x_t=i)=\sum\limits_jP(x_{t+1}=j,z_{t+1:T}\mid x_{t}=i)$

Using the chain rule, $P(x_{t+1}=j,z_{t+1:T}\mid x_{t}=i) = P(z_{t+2:T},z_{t+1},x_{t+1}=j\mid x_{t}=i)\\ =P(z_{t+2:T}\mid z_{t+1},x_{t+1}=j, x_{t}=i)P(z_{t+1}\mid x_{t+1}=j, x_{t}=i)P(x_{t+1}=j\mid x_{t}=i)$

From the conditional independencies of the HMM, the above probability simplifies to $P(z_{t+2:T}\mid x_{t+1}=j)P(z_{t+1}\mid x_{t+1}=j)P(x_{t+1}=j\mid x_{t}=i)$

Note that $P(z_{t+2:T}\mid x_{t+1}=j) =\beta_{t+1}(j) $ from our definition.

Substituting to $P(z_{t+1:T}\mid x_t=i)$ we get,

$\beta_t(i) = P(z_{t+1:T}\mid x_t=i) = \sum\limits_j \beta_{t+1}(j)P(z_{t+1}\mid x_{t+1}=j)P(x_{t+1}=j\mid x_t=i)$

Now you have a recursion for $\beta$. The last two factors in the equation are known from your model: the emission and transition probabilities. Starting from the end of the chain ($t=T$), we move backward, calculating all the $\beta_t$; hence the name backward algorithm. In the forward algorithm you start at the beginning and move toward the end of the chain.

In your model you have to initialize $\beta_T(i) = P(\emptyset \mid x_T=i)=1$ for all $i$. This is the probability of emitting no further observations after $T=2$.
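The recursion and initialization above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the model dictionaries are my own naming, with the transition and emission values read off the question's worked example, and only the emission probability for observation B is encoded. Following the initialization $\beta_T(i)=1$:

```python
# Transition and emission probabilities as implied by the question's example
trans = {"H": {"H": 0.2, "L": 0.8},          # P(x_{t+1} = j | x_t = i)
         "L": {"H": 0.6, "L": 0.4}}
emit  = {"H": {"B": 0.8}, "L": {"B": 0.6}}   # P(z_t = B | x_t = i)

def backward(obs):
    # beta[t][i] = P(z_{t+1:T} | x_t = i), initialized with beta_T(i) = 1
    T = len(obs)
    beta = [dict.fromkeys(trans, 1.0)]
    for t in range(T - 2, -1, -1):
        nxt, z = beta[0], obs[t + 1]
        beta.insert(0, {i: sum(nxt[j] * emit[j][z] * trans[i][j] for j in trans)
                        for i in trans})
    return beta

beta = backward("BB")
# beta[1] = {'H': 1.0, 'L': 1.0}
# beta[0]['H'] = 1*0.8*0.2 + 1*0.6*0.8 ≈ 0.64
# beta[0]['L'] = 1*0.8*0.6 + 1*0.6*0.4 ≈ 0.72
```

As a sanity check, combining these with the forward values $\alpha_1(H)=0.4$, $\alpha_1(L)=0.3$ gives $\sum_i \alpha_1(i)\beta_1(i) = 0.4\cdot0.64 + 0.3\cdot0.72 = 0.472 = \alpha_2(H)+\alpha_2(L)$, the same quantity $P(z_{1:T})$ at every $t$ (this excludes the transitions into the question's explicit END state, which the forward table multiplies in as a final step).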

