In performing MCMC simulations, it is standard practice to 'equilibrate' or 'thermalize' the system and then discard the initial data before useful sampling is done.
My question is about the concept of thermalization itself, not its implementation. In any such MCMC simulation, discarding the first few states amounts to choosing a new initial condition for the Markov Chain. Loosely, what are the features that we expect this initial condition to possess?
I have some ideas:
- It should not be an unlikely configuration. Starting from an unlikely configuration could heavily bias the first several measurements, especially in local update algorithms.
- The update algorithm should be able to leave the state easily. What I mean is that the algorithm shouldn't remain stuck in this state for a long time; that would also bias the measurements.
So it looks like being either highly likely or highly unlikely is bad. In the example of the Ising model, this means neither the 'chessboard' configuration nor the all-aligned configuration is a good starting point. I surmise that states with energy close to the average energy would be better suited for the job.
Am I on the right track here? If the 'average energy' conclusion holds, I don't understand how the thermalization procedure is supposed to return such a state with high probability (unless the thermal variance of the energy is small?).
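To make the Ising example concrete, here is a minimal single-spin-flip Metropolis sketch (the function name, parameters, and defaults are my own choices, not from any reference) that tracks the energy per spin starting from the 'chessboard' configuration:

```python
import math
import random

def metropolis_ising(L=8, beta=0.5, sweeps=50, start="random", seed=0):
    """Minimal single-spin-flip Metropolis for the 2D Ising model
    (ferromagnetic, J = 1, periodic boundaries).

    start: 'up' (all-aligned), 'chessboard', or 'random'.
    Returns the energy per spin after each sweep of L*L attempted flips.
    """
    rng = random.Random(seed)
    if start == "up":
        spins = [[1] * L for _ in range(L)]
    elif start == "chessboard":
        spins = [[1 if (i + j) % 2 == 0 else -1 for j in range(L)]
                 for i in range(L)]
    else:
        spins = [[rng.choice((-1, 1)) for _ in range(L)] for _ in range(L)]

    def energy_per_spin():
        e = 0
        for i in range(L):
            for j in range(L):  # count each bond once (right and down)
                e -= spins[i][j] * (spins[(i + 1) % L][j] + spins[i][(j + 1) % L])
        return e / (L * L)

    energies = []
    for _ in range(sweeps):
        for _ in range(L * L):
            i, j = rng.randrange(L), rng.randrange(L)
            nn = (spins[(i + 1) % L][j] + spins[(i - 1) % L][j]
                  + spins[i][(j + 1) % L] + spins[i][(j - 1) % L])
            dE = 2 * spins[i][j] * nn
            if dE <= 0 or rng.random() < math.exp(-beta * dE):
                spins[i][j] = -spins[i][j]
        energies.append(energy_per_spin())
    return energies

# The 'chessboard' start has energy +2 per spin (the maximum); at
# beta = 0.5 (ordered phase) the chain relaxes toward negative energies.
trace = metropolis_ising(start="chessboard")
```

Plotting such traces from 'up', 'chessboard', and 'random' starts is one way to see how long the energy takes to settle near its equilibrium value from each initial condition.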
I couldn't find any references discussing this (at least not the way I've worded it). Sorry if I've missed something obvious.
Edit: The bulk of my confusion came from thinking of each step of the Markov chain as a single configuration rather than a random variable, a distinction pointed out by Yvan Velenik in his answer below.
Let $X_t$ be the random variable corresponding to step $t$ of a chain. Starting a chain from configuration $S_0$ is not the same as equilibrating a chain for $\tau$ steps and discarding the first $\tau$ (even if one obtains $S_{\tau} = S_0$). The latter chain is expected to produce samples from the stationary distribution $D$ because $X_{\tau} \sim D$; the former isn't, because $X_0 \sim \delta(S_0)$, the distribution assigning probability $1$ to $S_0$. This is what I'd missed.
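A toy two-state chain (my own example) makes this explicit: iterating the transition matrix on the *distribution* of $X_t$ drives it to the stationary distribution, even though any single realization of the chain sits in one definite state at each step.

```python
# Toy two-state Markov chain; for this P the stationary distribution
# is pi = (2/3, 1/3), since pi P = pi.
P = [[0.9, 0.1],
     [0.2, 0.8]]

def step(dist):
    """Propagate the distribution of X_t one step: dist -> dist P."""
    return [sum(dist[i] * P[i][j] for i in range(2)) for j in range(2)]

dist = [1.0, 0.0]  # X_0 ~ delta(S_0): the chain was started in state 0
for _ in range(50):
    dist = step(dist)
# dist is now essentially (2/3, 1/3): X_50 ~ D. Restarting a fresh chain
# from whatever single state the run happens to occupy at step 50 would
# reset the distribution to a point mass again.
```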
A helpful reference is Alan Sokal's notes on Monte Carlo methods, where he also makes the useful distinction between the exponential and integrated autocorrelation times.