
Consider models M0, M1, M2. Let M0 $\subset$ M1 $\subset$ M2, i.e. let the models nest each other. I test the following pairs of models using the likelihood-ratio (LR) test: M0 vs. M2, M0 vs. M1, M1 vs. M2. I obtain test statistics $\chi^2_{\text{LR},02}$, $\chi^2_{\text{LR},01}$, $\chi^2_{\text{LR},12}$, respectively. Obviously, $\chi^2_{\text{LR},02} = \chi^2_{\text{LR},01} + \chi^2_{\text{LR},12}$.

If I were to use the Wald test instead, would a similar relationship hold for its test statistics $\chi^2_{\text{Wald},ij}$? What would it be, exactly?


2 Answers


Partial answer for linear models (I do not argue that things are different for nonlinear models; I just do not present results for them):

In linear models, when using the chi-square version of the Wald test, the additive relationship nearly holds for large $n$ when the null is true (and, as far as I know, under local alternatives). For the F-test version it does not hold as stated: the statistics first require multiplication by a correction for the number of hypotheses tested (which is precisely what the chi-square version builds in).

The remaining differences stem from different ways of estimating the error variance, as discussed e.g. in the questions linked below. (So, to return to your purpose of validating code, "close" may be too vague to be helpful.) More specifically, the Wald statistics in the linear model for $M1$ or $M0$ vs. $M2$ use the error variance estimate from the large model $M2$, whereas the statistic for $M0$ vs. $M1$ uses the error variance estimate based on model $M1$.

In formulae, following Ranking of Wald, LR and score statistic in the normal linear regression model, with $X_j$ the regressor matrix of model $j$ and $M_{X_j}=I-X_j(X_j'X_j)^{-1}X_j'$ the corresponding residual maker matrix, we can write the Wald statistic in the linear model for $M1$ vs. $M2$ as
$$
\mathcal{W}_{12}=n\frac{y'M_{X_1}y-y'M_{X_2}y}{y'M_{X_2}y}.
$$
Correspondingly, the Wald statistic for $M0$ vs. $M1$ is (and analogously for $M0$ vs. $M2$)
$$
\mathcal{W}_{01}=n\frac{y'M_{X_0}y-y'M_{X_1}y}{y'M_{X_1}y},
$$
so that, even though adding the numerator terms of $\mathcal{W}_{01}$ and $\mathcal{W}_{12}$ would indeed allow us to cancel $y'M_{X_1}y$, we cannot simply do so because the two statistics do not share the same denominator.

Hence, we do not have $\mathcal{W}_{01}+\mathcal{W}_{12}=\mathcal{W}_{02}$.
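To make the algebra concrete, here is a minimal R sketch (my own, not taken from the linked post) that computes the chi-square Wald statistics directly from the residual sums of squares $y'M_{X_j}y$, using the formulas above with the plain $n$ scaling:

set.seed(1)
n  <- 3000
X1 <- rnorm(n); X2 <- rnorm(n); y <- rnorm(n)    # pure noise, so the null M0 is true

rss <- function(fit) sum(residuals(fit)^2)       # y'M_X y for a fitted model
r0 <- rss(lm(y ~ 1)); r1 <- rss(lm(y ~ X1)); r2 <- rss(lm(y ~ X1 + X2))

W01 <- n * (r0 - r1) / r1   # denominator: variance estimate from M1
W12 <- n * (r1 - r2) / r2   # denominator: variance estimate from M2
W02 <- n * (r0 - r2) / r2   # denominator: variance estimate from M2

W01 + W12   # close to W02 under the null, but not equal
W02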

Remark 1: If we knew the error variance, the result would be true, as Wald and LR are then identical - no differences can arise from estimating the variance in different ways. See also Exact equivalence of LR and Wald in linear regression under known error variance.
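As a quick numerical check of Remark 1 (my own sketch, assuming a known $\sigma^2=1$): both Wald and LR then reduce to $(RSS_{\text{restricted}}-RSS_{\text{unrestricted}})/\sigma^2$, and the additivity is exact:

set.seed(2)
n <- 500; sigma2 <- 1                     # error variance treated as known
X1 <- rnorm(n); X2 <- rnorm(n); y <- rnorm(n, sd = sqrt(sigma2))

rss <- function(fit) sum(residuals(fit)^2)
r0 <- rss(lm(y ~ 1)); r1 <- rss(lm(y ~ X1)); r2 <- rss(lm(y ~ X1 + X2))

W01 <- (r0 - r1) / sigma2                 # = LR statistic for M0 vs. M1
W12 <- (r1 - r2) / sigma2                 # = LR statistic for M1 vs. M2
W02 <- (r0 - r2) / sigma2                 # = LR statistic for M0 vs. M2

all.equal(W01 + W12, W02)                 # TRUE: exact additivity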

Remark 2: If the null is not true, the sum need not even be close to $\mathcal{W}_{02}$: from Ranking of Wald, LR and score statistic in the normal linear regression model consider $M1$ and $M2$, letting $$x_{12}:=\frac{y'M_{X_1}y/n}{y'M_{X_2}y/n}=\hat\sigma^2_{M1}/\hat\sigma^2_{M2}.$$

Similarly, $$x_{01}:=\frac{y'M_{X_0}y/n}{y'M_{X_1}y/n}=\hat\sigma^2_{M0}/\hat\sigma^2_{M1}.$$

Then, $n(x_{12}-1)=\mathcal{W}_{12}$ and $n(x_{01}-1)=\mathcal{W}_{01}$. If the null $M0$ is true, $x_{ij}\to1$ as all models consistently estimate the error variance $\sigma^2$.

One might then go on to prove that the sum of the $\chi^2$ random variables of the two submodel comparisons is indeed asymptotically equivalent to the random variable associated with $M0$ vs. $M2$ (not done here).

If, however, $M2$ is true, $\hat\sigma^2_{M0}$ and $\hat\sigma^2_{M1}$ will not be consistent estimators of $\sigma^2$ anymore. Denote $\sigma_0^2=\text{plim}\hat\sigma^2_{M0}$ and $\sigma_1^2=\text{plim}\hat\sigma^2_{M1}$.

Hence, $$x_{01}-1\to_p \frac{\sigma_0^2}{\sigma_1^2}-1$$ and $$x_{12}-1\to_p \frac{\sigma_1^2}{\sigma^2}-1,$$ so that $$\mathcal{W}_{01}+\mathcal{W}_{12}\sim n\left(\frac{\sigma_0^2}{\sigma_1^2}+\frac{\sigma_1^2}{\sigma^2}-2\right)$$ while $$ \mathcal{W}_{02}\sim n \left(\frac{\sigma_0^2}{\sigma^2}-1\right) $$
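A rough simulation check of this divergence (my own construction, with coefficients chosen so that $\sigma_0^2=3$, $\sigma_1^2=2$, $\sigma^2=1$, giving limits $3/2+2-2=1.5$ for $(\mathcal{W}_{01}+\mathcal{W}_{12})/n$ and $3-1=2$ for $\mathcal{W}_{02}/n$):

rss <- function(fit) sum(residuals(fit)^2)
set.seed(3)
for (n in c(1e3, 1e4, 1e5)) {
  X1 <- rnorm(n); X2 <- rnorm(n)
  y  <- X1 + X2 + rnorm(n)                # M2 is the true model
  r0 <- rss(lm(y ~ 1)); r1 <- rss(lm(y ~ X1)); r2 <- rss(lm(y ~ X1 + X2))
  W01 <- n * (r0 - r1) / r1
  W12 <- n * (r1 - r2) / r2
  W02 <- n * (r0 - r2) / r2
  cat("n =", n, " (W01+W12)/n =", round((W01 + W12) / n, 3),
      " W02/n =", round(W02 / n, 3), "\n")
}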

Remark 3: That the result holds approximately under the null (so that $\mathcal{W}_{ij}=O_p(1)$) when $n$ is large can also be motivated via the asymptotic equivalence of $\mathcal{W}$ and $LR$: from $\mathcal{W}_{ij}=n(x_{ij}-1)$ and $LR_{ij}=n\log(x_{ij})$, write $LR_{ij}=n\log(1+\mathcal{W}_{ij}/n)$. A Taylor expansion around 1 yields $$ \begin{eqnarray*} LR_{ij}&=&n[\mathcal{W}_{ij}/n+o_p(\mathcal{W}_{ij}/n)]\\ &=&n[\mathcal{W}_{ij}/n+o_p(O_p(1)/n)]\\ &=&n[\mathcal{W}_{ij}/n+o_p(n^{-1})]\\ &=&\mathcal{W}_{ij}+o_p(1) \end{eqnarray*} $$

Remark 4: For those who, like me, needed a second to see why the result is "obviously" correct for LR: note that we can rewrite $$ LR_{02}=n\log(x_{02})=n\log(x_{01}x_{12})=n\log(x_{01})+n\log(x_{12})=LR_{01}+LR_{12} $$

Remark 5: $\mathcal{W}_{ij}=O_p(1)$ also under local alternatives, as the statistics are noncentral chi-square distributed (see e.g. Sampling distribution of Coefficient of determination in general for a related discussion for the F statistic). Therefore, Remark 3 should also go through under such local alternatives.

Here is an illustration where the null is true:

library(lmtest)
library(sandwich)   # loaded in the original; not actually needed for this example

set.seed(1)         # added for reproducibility
n <- 3000
X1 <- rnorm(n)
X2 <- rnorm(n)
y <- rnorm(n)       # y is pure noise, so the null model M0 is true

M0 <- lm(y ~ 1)
M1 <- lm(y ~ X1)
M2 <- lm(y ~ X1 + X2)

## F versions of the Wald test
Wald.M1M0 <- waldtest(M1, M0)$F[2]
Wald.M2M1 <- waldtest(M2, M1)$F[2]
Wald.M2M0 <- waldtest(M2, M0)$F[2]

Wald.M2M0
Wald.M1M0 + Wald.M2M1 # only close if we multiply the previous line by 2, the number of restrictions tested

## LR tests
LR.M1M0 <- lrtest(M1, M0)$Chisq[2]
LR.M2M1 <- lrtest(M2, M1)$Chisq[2]
LR.M2M0 <- lrtest(M2, M0)$Chisq[2]

LR.M2M0
LR.M1M0 + LR.M2M1 # exactly the same

## chi-square versions of the Wald test
Wald.M1M0 <- waldtest(M1, M0, test = "Chisq")$Chisq[2]
Wald.M2M1 <- waldtest(M2, M1, test = "Chisq")$Chisq[2]
Wald.M2M0 <- waldtest(M2, M0, test = "Chisq")$Chisq[2]

Wald.M2M0
Wald.M1M0 + Wald.M2M1 # close, but not exactly equal
  • Thank you! I was expecting this sort of asymptotic behavior due to the asymptotic equivalence of the LR and Wald (and LM) tests. In small samples the results can be quite far off, though. By the way, I got some answers to my gmm questions from Pierre Chausse. They are on point but quite brief. I could use some help deciphering this one, if/when you find some time. (Commented Apr 26 at 6:40)
  • GMM theory tells us that the optimal weighting matrix is the inverse of the variance(-covariance) matrix of the moment conditions. E.g., under homoskedasticity, we get the weighting matrix $[\hat\sigma^2(Z'Z)/n]^{-1}$ (e.g. stats.stackexchange.com/questions/89378/…). If you, say, have near-multicollinearity among your instruments, this matrix would be "badly conditioned" in the terminology of the answer. (Commented Apr 26 at 10:38)
  • Regarding the present question, I added some subtleties. (Commented Apr 26 at 16:50)
  • Got it, thank you! (Commented Apr 26 at 17:01)

For linear models with a Gaussian distribution, the Wald test is similar to the LR test because both use the sum of squares to compute the standard error (Wald test) or the likelihood (LR test). The LR statistic can be computed directly from the F-statistic and t-statistic.

However, the standard error used for the Wald test is not necessarily computed from the sum of squares of the residuals. Other estimation methods are possible, especially for other distributions. So already among Wald tests there may be variations.

An example is GLMs, where the dispersion may be estimated by the method of moments.

In general, the likelihood-ratio and Wald statistics are equivalent (and chi-squared distributed) because, approximately, $\log \mathcal{L}(\theta_{ML}) - \log \mathcal{L}(\theta_0) \propto (\theta_{ML} - \theta_0)^2$. But this holds only approximately.

Distributions from the natural exponential family show this easily. Their log-likelihood is

$$\log \mathcal{L}(\theta;\boldsymbol{x}) = \theta \cdot {T}(\boldsymbol{x}) + A(\theta)$$

and the maximum is defined by

$$\frac{\partial}{\partial \theta} \log \mathcal{L}(\theta;\boldsymbol{x}) = {T}(\boldsymbol{x}) + A'(\theta) = 0$$

And with ${T}(\boldsymbol{x})$ approaching a normal distribution with variance approaching zero, we can use the following linear approximation around $\theta = \theta_0$:

$$\frac{\partial}{\partial \theta} \log \mathcal{L}(\theta;\boldsymbol{x}) \approx {T}(\boldsymbol{x}) + A'(\theta_0) + A''(\theta_0) (\theta-\theta_0)$$

For this approximately linear function we have

  • the root is the maximum likelihood estimate,
  • the square of the intercept (the score at $\theta_0$) is, up to scaling, the likelihood-ratio statistic.

So the relationship between these two is due to a linear approximation.
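As a small numeric sketch of this point (my own example, assuming a Poisson mean as the natural-exponential-family parameter of interest), the Wald and LR statistics for $H_0:\lambda=\lambda_0$ agree closely because the log-likelihood is approximately quadratic around the MLE:

set.seed(4)
n       <- 2000
lambda0 <- 2                          # hypothesized mean under H0
x       <- rpois(n, lambda = 2.05)    # true mean close to lambda0
lambda_hat <- mean(x)                 # MLE of the Poisson mean

wald <- (lambda_hat - lambda0)^2 / (lambda_hat / n)           # Wald statistic
lr   <- 2 * n * (lambda_hat * log(lambda_hat / lambda0) -
                 (lambda_hat - lambda0))                      # LR statistic
c(Wald = wald, LR = lr)               # nearly identical for large n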

  • Thank you! Good points. (Commented Apr 27 at 10:57)
