
I am confused about the way that the Jurafsky and Martin book (Appendix A, page 6) explains the relationship between the observations and hidden states:

Each cell of the forward algorithm trellis $\alpha_t(j)$ represents the probability of being in state $j$ after seeing the first $t$ observations, given the automaton $\lambda$

Then they define $\alpha_t(j)$ as $P(o_1,o_2,\dots,o_t,q_t = j|\lambda)$.

I am not sure why the quoted text refers to the joint probability of the observations and the state at time $t$. I would rather define it as $\alpha_t(j) = P(q_t = j|o_1,o_2,\dots,o_t,\lambda)$. The conditional distribution seems to make more sense here: we first see the $t$ observations, and only then ask for the probability of being in state $j$.

A separate question: although I am not convinced that the joint distribution is the right choice here, I believe they obtained it by marginalizing over all previous states:

\begin{align} \alpha_t(j) &= \sum_{q_1,\dots,q_{t-1}} P(q_1, q_2,\dots,q_t=j,o_1,o_2,\dots,o_t|\lambda) \\ &= P(o_1,o_2,\dots,o_t,q_t = j|\lambda) \end{align}

Is this correct?


2 Answers


There are two questions here - I'll give the short answers, then elaborate.

  1. Should $\alpha_t$ be defined in terms of the joint distribution of the state and the observations, or as the conditional distribution of the state given the observations? It should be the joint.
  2. Are the two formulas you wrote equivalent? Yes, by the law of total probability.

You may prefer to define $\alpha_t$ conditionally, but the J&M book is right. In the forward algorithm, you marginalize over all of the possible previous states to compute the probability for each new state. (That's your separate question, but it's relevant here.) Your conditional distribution can be defined in terms of the joint distribution:

$$ P(q_t = j|o_1,o_2,\dots,o_t,\lambda) = \frac{ P(o_1,o_2,\dots,o_t,q_t = j|\lambda) }{ P(o_1,o_2,\dots,o_t|\lambda) } $$

But then, how do you compute the denominator? Only by marginalizing the joint distribution over all hidden states, which is exactly what the forward algorithm computes. Defining $\alpha_t$ conditionally therefore lands you in a circularity: the conditional is built out of the joint anyway. We instead take the joint distribution as the basic quantity and recover the conditional probabilities from it, and the forward algorithm lets us compute the joint efficiently.
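To make this concrete, here is a minimal sketch of the forward recursion in NumPy. The names `pi`, `A`, `B`, and `obs`, and the assumption of discrete observations, are mine rather than the book's:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Fill in the forward trellis for a discrete-observation HMM.

    pi  : (N,) initial distribution, pi[j] = P(q_1 = j)
    A   : (N, N) transitions, A[i, j] = P(q_{t+1} = j | q_t = i)
    B   : (N, M) emissions, B[j, k] = P(o_t = k | q_t = j)
    obs : length-T sequence of observation indices

    Returns alpha where alpha[t, j] = P(o_1, ..., o_{t+1}, q_{t+1} = j | lambda)
    (row t holds the trellis column for time step t + 1, since Python is 0-indexed).
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]  # alpha_1(j) = pi_j * b_j(o_1)
    for t in range(1, T):
        # Marginalize over the previous state i:
        # alpha_t(j) = sum_i alpha_{t-1}(i) * a_{ij} * b_j(o_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha
```

Once the joint trellis is filled in, the conditional you wanted comes for free by normalizing a column: $P(q_t = j|o_1,\dots,o_t,\lambda) = \alpha_t(j) / \sum_k \alpha_t(k)$, i.e. `alpha[t-1, j] / alpha[t-1].sum()`, and the denominator $P(o_1,\dots,o_t|\lambda)$ is just `alpha[t-1].sum()`.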

  • So defining $\alpha_t$ conditionally is intrinsically right, but we opt for the joint to overcome the problem of circularity. Right?
    – rando
    Commented Oct 7, 2021 at 15:05
  • I'm not sure I'd call either definition "intrinsically right" (that's more a question of your modeling aesthetics), but yes, the joint does avoid the circularity problem.
    Commented Oct 7, 2021 at 15:06

One reason it is more intuitive to define $\alpha_t(j)$ as conditional on just $\lambda$, and not also on the observations $o_1,\dots,o_t$, is that ultimately we are building up to finding the model parameterization $\lambda$ that gives the highest probability of generating the observations.

See, for example, the final equation on page 7, where the book lays out pseudocode for the algorithm. The quantity we are interested in at the end is $P(O|\lambda)$.
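With the joint definition, that final quantity is simply the sum over the last column of the trellis; in the book's notation, with $N$ hidden states and $T$ observations, the termination step is

$$P(O|\lambda) = \sum_{j=1}^{N} \alpha_T(j)$$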

