
I am confused about the way that the Jurafsky and Martin book (Appendix A, page 6) explains the relationship between the observations and hidden states:

Each cell of the forward algorithm trellis $\alpha_t(j)$ represents the probability of being in state $j$ after seeing the first $t$ observations, given the automaton $\lambda$

Then they define $\alpha_t(j)$ as $P(o_1,o_2,\dots,o_t,q_t = j|\lambda)$.

I am not sure why the quoted text refers to the joint probability of the observations and the state at time $t$. I would rather define it as $\alpha_t(j) = P(q_t = j|o_1,o_2,\dots,o_t,\lambda)$. The conditional distribution seems to make more sense here: we first see the $t$ observations, and only then ask for the probability of being in state $j$.

A separate question: although I am not convinced that the joint distribution is the right choice here, I believe they obtained it by marginalizing over all previous states:

\begin{align} \alpha_t(j) &= \sum_{q_1,\dots,q_{t-1}} P(q_1, q_2,\dots,q_t=j,o_1,o_2,\dots,o_t|\lambda) \\ &= P(o_1,o_2,\dots,o_t,q_t = j|\lambda) \end{align}

Is this correct?


2 Answers


There are two questions here - I'll give the short answers, then elaborate.

  1. Should $\alpha_t$ be defined in terms of the joint distribution of the state and the observations, or as the conditional distribution of the state given the observations? It should be the joint.
  2. Are the two formulas you wrote equivalent? Yes, by the law of total probability.

You may prefer to define $\alpha_t$ conditionally, but the J&M book is right. In the forward algorithm, you marginalize over all of the possible previous states to compute the probability for each new state. (That's your separate question, but it's relevant here.) Your conditional distribution can be defined in terms of the joint distribution:

$$ P(q_t = j|o_1,o_2,\dots,o_t,\lambda) = \frac{ P(o_1,o_2,\dots,o_t,q_t = j|\lambda) }{ P(o_1,o_2,\dots,o_t|\lambda) } $$

But then, how do you compute the denominator? Only by marginalizing the joint distribution over all hidden states, which is exactly what the forward algorithm computes. Defining $\alpha_t$ conditionally therefore lands you in a circularity: the conditional is built out of the joint anyway. We instead take the joint distribution as the basic quantity and recover the conditional probabilities from it, and the forward algorithm lets us compute the joint efficiently.
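To make this concrete, here is a minimal sketch of the forward recursion in NumPy. The names `pi`, `A`, `B`, and `obs`, and the assumption of discrete observations, are mine rather than the book's:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Fill in the forward trellis for a discrete-observation HMM.

    pi  : (N,) initial distribution, pi[j] = P(q_1 = j)
    A   : (N, N) transitions, A[i, j] = P(q_{t+1} = j | q_t = i)
    B   : (N, M) emissions, B[j, k] = P(o_t = k | q_t = j)
    obs : length-T sequence of observation indices

    Returns alpha where alpha[t, j] = P(o_1, ..., o_{t+1}, q_{t+1} = j | lambda)
    (row t holds the trellis column for time step t + 1, since Python is 0-indexed).
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]  # alpha_1(j) = pi_j * b_j(o_1)
    for t in range(1, T):
        # Marginalize over the previous state i:
        # alpha_t(j) = sum_i alpha_{t-1}(i) * a_{ij} * b_j(o_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha
```

Once the joint trellis is filled in, the conditional you wanted comes for free by normalizing a column: $P(q_t = j|o_1,\dots,o_t,\lambda) = \alpha_t(j) / \sum_k \alpha_t(k)$, i.e. `alpha[t-1, j] / alpha[t-1].sum()`, and the denominator $P(o_1,\dots,o_t|\lambda)$ is just `alpha[t-1].sum()`.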

  • So defining $\alpha_t$ conditionally is intrinsically right, but we opt for the joint to overcome the problem of circularity. Right?
    – rando
    Commented Oct 7, 2021 at 15:05
  • I'm not sure I'd call either definition "intrinsically right" (that's more a question of your modeling aesthetics), but yes, the joint does avoid the circularity problem.
    Commented Oct 7, 2021 at 15:06

One reason it is more intuitive to define $\alpha_t(j)$ as conditional on just $\lambda$, and not also on the observations $o_1,\dots,o_t$, is that ultimately we are building up to finding the model parameterization $\lambda$ that gives the highest probability of generating the observations.

See, for example, the final equation on page 7, where the book lays out pseudocode for the algorithm. The quantity we are interested in at the end is $P(O|\lambda)$.
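With the joint definition, that final quantity is simply the sum over the last column of the trellis; in the book's notation, with $N$ hidden states and $T$ observations, the termination step is

$$P(O|\lambda) = \sum_{j=1}^{N} \alpha_T(j)$$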

