$\begingroup$

In performing MCMC simulations, it is standard practice to 'equilibrate' or 'thermalize' the system and then discard the initial data before useful sampling is done.

My question is about the concept of thermalization itself, not its implementation. In any such MCMC simulation, discarding the first few states amounts to choosing a new initial condition for the Markov Chain. Loosely, what are the features that we expect this initial condition to possess?

I have some ideas:

  1. It is not an unlikely configuration. Starting from an unlikely configuration could bias the first several measurements heavily, especially in local update algorithms.
  2. The update algorithm should be able to transition easily out of the state. What I mean is that the chain shouldn't remain stuck in this state for a long time, since that would also bias the measurements.

So it looks like being highly likely/highly unlikely is bad. In the example of the Ising model, this means neither the 'chessboard' configuration nor the all-aligned configuration is a good starting point. I surmise that states having energy closer to the average energy would be better for the job.

Am I on the right track here? If the 'average energy' conclusion works, I don't understand how the thermalization procedure is supposed to return such a state with high likelihood (unless the thermal variance of the energy is small (?)).

I couldn't find any references discussing this (at least the way I've worded it). Sorry if I've missed something obvious.


Edit: The bulk of my confusion came from thinking of each step of the Markov chain as a single configuration rather than a random variable, a distinction pointed out by Yvan Velenik in his answer below.

Let $X_t$ be the random variable corresponding to step $t$ of a chain. Starting a chain with configuration $S_0$ is not the same as equilibrating a chain for $\tau$ steps and discarding the first $\tau$ (even if one obtains $S_{\tau} = S_0$). The latter chain is expected to produce samples according to the stationary distribution $D$ because $X_{\tau} \sim D$, but the former isn't, because $X_0 \sim \delta(S_0)$, where $\delta(S_0)$ is the distribution in which $P(S_0) = 1$. This is what I'd missed.
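
To make this concrete, here is a toy two-state chain (the transition probabilities are made up purely for illustration) in which one can track the distribution of $X_t$ directly and see it approach the stationary distribution $D$, whereas starting from $D$ leaves it unchanged:

```python
import numpy as np

# Toy 2-state Markov chain; the transition probabilities are arbitrary.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Stationary distribution D solves D P = D (left eigenvector for eigenvalue 1).
eigvals, eigvecs = np.linalg.eig(P.T)
D = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
D /= D.sum()

# X_0 ~ delta(S_0): all probability mass on state 0.
mu = np.array([1.0, 0.0])
for t in range(8):
    tv = 0.5 * np.abs(mu - D).sum()   # total variation distance to D
    print(f"t = {t}: P(X_t) = {mu}, distance to D = {tv:.4f}")
    mu = mu @ P                       # distribution of X_{t+1}

# Starting with X_0 ~ D instead, the distribution never moves at all:
print("D @ P =", D @ P, " (equals D: stationarity)")
```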

A helpful reference is Alan Sokal's notes on Monte Carlo methods, where he also makes the useful distinction between the exponential and integrated autocorrelation times.
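
For concreteness, a minimal estimator of the integrated autocorrelation time along the lines described in those notes might look like the sketch below; the automatic-windowing constant $c$ and the stand-in data are illustrative choices of mine, not taken from the notes:

```python
import numpy as np

def integrated_autocorr_time(x, c=6.0):
    """Estimate the integrated autocorrelation time of a time series x,
    summing the normalized autocorrelation up to the first window W
    satisfying W >= c * tau_int(W) (automatic windowing)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    # Autocovariance for lags 0..n-1 via FFT of the zero-padded series.
    f = np.fft.rfft(x, 2 * n)
    acov = np.fft.irfft(np.abs(f) ** 2)[:n] / n
    rho = acov / acov[0]              # normalized autocorrelation function
    tau = 0.5
    for t in range(1, n):
        tau += rho[t]
        if t >= c * tau:              # stop once the window is large enough
            return tau
    return tau

# Usage: `series` would be measurements along the Markov chain; here an
# uncorrelated stand-in, for which tau_int should come out close to 0.5.
series = np.random.default_rng(0).normal(size=10000)
print("tau_int ~", integrated_autocorr_time(series))
```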

$\endgroup$
  • $\begingroup$ I'm going to guess that "thermal" depends on the nature of the thing you are trying to simulate. (By which I am hinting to please say what you are simulating.) A nuclear reactor, for example, can have thermal neutrons. A photon gas can be thermal. It will be quite a different situation. $\endgroup$
    – Dan
    Commented Mar 4, 2022 at 18:43
  • $\begingroup$ @Dan In the present context, the term thermalization has a pretty clear meaning: reaching thermal equilibrium, which is what I address in my answer. I don't think that the answer to the aspects raised in the OP depends much on the system to be simulated (of course, there are all kinds of intricate implementation issues for realistic, complex systems, but the basic issues remain the same). $\endgroup$ Commented Mar 4, 2022 at 18:56

1 Answer

$\begingroup$

One does not discard the first few states in order to "choose a new initial condition for the Markov Chain". One discards them because the goal of the simulation is to sample configurations from the Gibbs measure.

The Markov chain is constructed in such a way that its unique stationary distribution is given by the Gibbs measure. The ergodic theorem for Markov chains then guarantees that if one lets the chain run for infinitely many steps, then it will be found in a given configuration with a probability given by the Gibbs measure.

Of course, one cannot let the chain run for an infinite time, so one stops it after a finite number of steps and one hopes that one has waited long enough for the chain to be distributed approximately according to its stationary distribution.

A second, closely related, reason for discarding steps is that, even if the chain has reached equilibrium, you have to wait long enough between measurements if you want the latter to be approximately independent. More precisely, if you wish to compute the expected value of some observable under the Gibbs measure by averaging along a trajectory of the Markov chain, then you need a long piece of the trajectory, because the chain has to explore a large part of the state space (this is less of a problem for macroscopic observables, since the latter usually take nearly constant values over most of the state space, as measured by the Gibbs probability measure). Measuring the observable over successive steps will give you highly correlated values, not much better than what you get from a single measurement. The ergodic theorem again guarantees that the average over an infinite trajectory of the Markov chain will almost surely coincide with the expectation under the Gibbs measure. But you only have access to a finite piece of the trajectory, so it has to be long enough for the approximation to be reasonable.
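
To make both points concrete, here is a minimal (and deliberately unoptimized) single-spin-flip Metropolis sketch for the 2D Ising model; the lattice size, temperature, burn-in length and spacing between measurements are purely illustrative choices, not recommendations:

```python
import numpy as np

def metropolis_sweep(spins, beta, rng, J=1.0, h=0.0):
    """One sweep of single-spin-flip Metropolis updates on a 2D Ising
    lattice with periodic boundaries (minimal sketch, not optimized)."""
    L = spins.shape[0]
    for _ in range(L * L):
        i, j = rng.integers(L, size=2)
        nn = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
              + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
        dE = 2.0 * spins[i, j] * (J * nn + h)   # energy cost of flipping spin (i, j)
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            spins[i, j] *= -1

L, beta = 16, 0.35
burn_in, spacing, n_samples = 1000, 10, 200     # illustrative values only
rng = np.random.default_rng(1)
spins = rng.choice([-1, 1], size=(L, L))        # arbitrary initial condition

for _ in range(burn_in):                        # thermalization: these sweeps are discarded
    metropolis_sweep(spins, beta, rng)

mags = []
for _ in range(n_samples):
    for _ in range(spacing):                    # let the chain decorrelate between measurements
        metropolis_sweep(spins, beta, rng)
    mags.append(abs(spins.mean()))              # measure |magnetization| per spin

# Naive error bar, valid only if the spaced samples are roughly independent.
print("<|m|> ~", np.mean(mags), "+/-", np.std(mags) / np.sqrt(len(mags)))
```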

That being said, you can indeed accelerate convergence (to a good approximation of the equilibrium measure) by starting from a configuration that is not too far from equilibrium. For instance, if you consider the Ising model with a small positive magnetic field, but start with a configuration consisting entirely of $-$ spins, then the time to relax will be much longer than if you start with a configuration consisting of only $+$ spins. Indeed, the former will first relax to the metastable $-$ phase and you will have to wait for the creation of a supercritical droplet in order to reach true equilibrium.
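
A rough numerical illustration of this effect, using the same kind of single-spin-flip Metropolis dynamics as in the sketch above (lattice size, temperature and field strength are again arbitrary choices on my part):

```python
import numpy as np

def sweep(s, beta, h, rng):
    """One Metropolis sweep for the 2D Ising model with external field h."""
    L = s.shape[0]
    for _ in range(L * L):
        i, j = rng.integers(L, size=2)
        nn = s[(i+1) % L, j] + s[(i-1) % L, j] + s[i, (j+1) % L] + s[i, (j-1) % L]
        dE = 2.0 * s[i, j] * (nn + h)
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            s[i, j] *= -1

L, beta, h = 16, 0.6, 0.05        # low temperature, small positive field (illustrative)
rng = np.random.default_rng(2)
for label, start in [("all minus", -np.ones((L, L), int)),
                     ("all plus ", np.ones((L, L), int))]:
    s = start.copy()
    for _ in range(1000):
        sweep(s, beta, h, rng)
    print(f"start {label}: magnetization per spin after 1000 sweeps = {s.mean():+.3f}")

# The all-plus start equilibrates essentially immediately, while the all-minus
# start typically remains trapped near m = -1 (the metastable phase) for a very
# long time, until a supercritical droplet of + spins nucleates.
```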

In particular, it is true that if you start with a configuration sampled from the Gibbs measure, then your chain is immediately distributed according to the stationary distribution (that's precisely why it is called a stationary distribution). In that case, you don't have to wait for thermalization (you are already distributed correctly), but you still have to wait between successive measurements for the reasons explained above. Of course, the whole point is that you need the MC to sample the initial configuration according to the Gibbs measure, so such an approach is practically useless (and is replaced by the initial run of the Markov chain).

Let me mention, to conclude, that there exists a special type of implementation of MCMC algorithms, the so-called perfect simulation algorithms, that run for a random but finite amount of time and are guaranteed to output a configuration distributed exactly according to the Gibbs measure. But such algorithms are not always available in a form that is computationally efficient. The best known such algorithm is coupling from the past.
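
For completeness, here is what a monotone coupling-from-the-past implementation can look like for a small 1D ferromagnetic Ising chain with heat-bath updates; all parameters are illustrative, and this is a sketch of the idea rather than an efficient implementation:

```python
import numpy as np

def heat_bath(spins, site, u, beta, h, J=1.0):
    """Heat-bath update at `site` driven by the uniform variate `u`.
    For ferromagnetic J > 0 this update is monotone in the configuration."""
    n = len(spins)
    field = J * (spins[(site - 1) % n] + spins[(site + 1) % n]) + h
    spins[site] = 1 if u < 1.0 / (1.0 + np.exp(-2.0 * beta * field)) else -1

def cftp_ising_1d(n=16, beta=0.5, h=0.1, seed=0):
    """Propp-Wilson coupling from the past for a 1D Ising chain.
    Returns a configuration distributed exactly according to the Gibbs measure."""
    rng = np.random.default_rng(seed)
    moves = []                          # moves[k] is the randomness used at time -(k+1)
    T = 1
    while True:
        while len(moves) < T:           # extend the *same* randomness further into the past
            moves.append((int(rng.integers(n)), rng.random()))
        top = np.ones(n, dtype=int)     # maximal configuration at time -T
        bot = -np.ones(n, dtype=int)    # minimal configuration at time -T
        for k in range(T - 1, -1, -1):  # run both chains from time -T up to time 0
            site, u = moves[k]
            heat_bath(top, site, u, beta, h)
            heat_bath(bot, site, u, beta, h)
        if np.array_equal(top, bot):    # coalescence: every possible start has merged
            return top                  # exact sample from the Gibbs measure
        T *= 2                          # otherwise, restart from further in the past

print(cftp_ising_1d())
```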

$\endgroup$
  • $\begingroup$ You said that one stops the chain after a finite number of steps in the hope that the resulting state is distributed according to the Gibbs distribution. Does this waiting need to be done between every measurement/sample, then? Most references I know only mention "waiting for a long time" at the very beginning of the chain's run. I don't understand how that is any different from picking a 'good' initial condition. $\endgroup$
    – Grifter
    Commented Mar 4, 2022 at 17:53
  • $\begingroup$ Well, how do you define "good"? The only reasonable way of getting a "good" configuration is to sample it from the Gibbs measure, which is precisely what you are trying to do in the first place. Moreover, even starting from a "good" configuration does not remove the need to wait before successive measurements in order to remove unwanted statistical dependencies between them. $\endgroup$ Commented Mar 4, 2022 at 18:10
  • $\begingroup$ Concerning the time you need to wait between successive measurements: you don't need stricto sensu to wait at all between the measurements, since the ergodic theorem again guarantees that the time-average of the quantity you measure will converge to the average under the Gibbs measure. However, if you measure without waiting long enough, then you will get strong correlations. In order to get the correct equilibrium value, you need to explore a large part of the state space and this requires letting the chain explore the latter, which takes time. $\endgroup$ Commented Mar 4, 2022 at 18:14
  • $\begingroup$ Re. how to define a 'good' state, that's what I've attempted to do in points 1 and 2 of my question. Thanks for the clarification about correlations, I think I get the post-thermalization picture now. $\endgroup$
    – Grifter
    Commented Mar 5, 2022 at 3:05
  • $\begingroup$ But I'm still missing the significance of the thermalization. Suppose I initialize my system (say 2D Ising) to a state $S_0$ and then run it until it has thermalized. I now have a state $S_t$ (which should be distributed according to Gibbs) and discard everything prior to it. Is there not a non-zero likelihood that $S_0 = S_t$? If so, and $S_0$ is indeed the same as $S_t$, isn't it the same as not having thermalized at all? $\endgroup$
    – Grifter
    Commented Mar 5, 2022 at 3:09
