
Consider models M0, M1, M2. Let M0 $\subset$ M1 $\subset$ M2, i.e. let the models nest each other. I test the following pairs of models using the likelihood-ratio (LR) test: M0 vs. M2, M0 vs. M1, M1 vs. M2. I obtain test statistics $\chi^2_{\text{LR},02}$, $\chi^2_{\text{LR},01}$, $\chi^2_{\text{LR},12}$, respectively. Obviously, $\chi^2_{\text{LR},02} = \chi^2_{\text{LR},01} + \chi^2_{\text{LR},12}$.

If I were to use the Wald test instead, would a similar relationship hold for its test statistics $\chi^2_{\text{Wald},ij}$? What would it be, exactly?


2 Answers


Partial answer for linear models (I do not argue that things are different for nonlinear models; I just do not present results for them):

In linear models, when using the chi-square version of the Wald test, the additive relationship nearly holds for large $n$ when the null is true (and, as far as I know, under local alternatives). For the F-test version it does not hold as stated: the statistics first require multiplication by a correction for the number of hypotheses tested (which is precisely what the chi-square version builds in).

The remaining differences stem from different ways of estimating the error variance, as discussed e.g. in the questions linked below. (So, to return to your purpose of validating code, "close" may be too vague to be helpful.) More specifically, the Wald statistics in the linear model for $M1$ or $M0$ vs. $M2$ use the error variance estimate from the large model $M2$, whereas the statistic for $M0$ vs. $M1$ uses the error variance estimate based on model $M1$.

In formulae, following Ranking of Wald, LR and score statistic in the normal linear regression model, with $X_j$ the regressor matrix of model $j$ and $M_{X_j}=I-X_j(X_j'X_j)^{-1}X_j'$ the corresponding residual maker matrix, we can write the Wald statistic in the linear model for $M1$ vs. $M2$ as
$$
\mathcal{W}_{12}=n\frac{y'M_{X_1}y-y'M_{X_2}y}{y'M_{X_2}y}.
$$
Correspondingly, the Wald statistic for $M0$ vs. $M1$ is (and analogously for $M0$ vs. $M2$)
$$
\mathcal{W}_{01}=n\frac{y'M_{X_0}y-y'M_{X_1}y}{y'M_{X_1}y},
$$
so that, even though adding the numerator terms of $\mathcal{W}_{01}$ and $\mathcal{W}_{12}$ would indeed allow us to cancel $y'M_{X_1}y$, we cannot simply do so because the two statistics do not share the same denominator.

Hence, we do not have $\mathcal{W}_{01}+\mathcal{W}_{12}=\mathcal{W}_{02}$.
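To make the algebra concrete, here is a minimal R sketch (my own, not taken from the linked post) that computes the chi-square Wald statistics directly from the residual sums of squares $y'M_{X_j}y$, using the formulas above with the plain $n$ scaling:

set.seed(1)
n  <- 3000
X1 <- rnorm(n); X2 <- rnorm(n); y <- rnorm(n)    # pure noise, so the null M0 is true

rss <- function(fit) sum(residuals(fit)^2)       # y'M_X y for a fitted model
r0 <- rss(lm(y ~ 1)); r1 <- rss(lm(y ~ X1)); r2 <- rss(lm(y ~ X1 + X2))

W01 <- n * (r0 - r1) / r1   # denominator: variance estimate from M1
W12 <- n * (r1 - r2) / r2   # denominator: variance estimate from M2
W02 <- n * (r0 - r2) / r2   # denominator: variance estimate from M2

W01 + W12   # close to W02 under the null, but not equal
W02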

Remark 1: If we knew the error variance, the result would be true, as Wald and LR are then identical - no differences can arise from estimating the variance in different ways. See also Exact equivalence of LR and Wald in linear regression under known error variance.
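As a quick numerical check of Remark 1 (my own sketch, assuming a known $\sigma^2=1$): both Wald and LR then reduce to $(RSS_{\text{restricted}}-RSS_{\text{unrestricted}})/\sigma^2$, and the additivity is exact:

set.seed(2)
n <- 500; sigma2 <- 1                     # error variance treated as known
X1 <- rnorm(n); X2 <- rnorm(n); y <- rnorm(n, sd = sqrt(sigma2))

rss <- function(fit) sum(residuals(fit)^2)
r0 <- rss(lm(y ~ 1)); r1 <- rss(lm(y ~ X1)); r2 <- rss(lm(y ~ X1 + X2))

W01 <- (r0 - r1) / sigma2                 # = LR statistic for M0 vs. M1
W12 <- (r1 - r2) / sigma2                 # = LR statistic for M1 vs. M2
W02 <- (r0 - r2) / sigma2                 # = LR statistic for M0 vs. M2

all.equal(W01 + W12, W02)                 # TRUE: exact additivity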

Remark 2: If the null is not true, the sum need not even be close to $\mathcal{W}_{02}$: from Ranking of Wald, LR and score statistic in the normal linear regression model consider $M1$ and $M2$, letting $$x_{12}:=\frac{y'M_{X_1}y/n}{y'M_{X_2}y/n}=\hat\sigma^2_{M1}/\hat\sigma^2_{M2}.$$

Similarly, $$x_{01}:=\frac{y'M_{X_0}y/n}{y'M_{X_1}y/n}=\hat\sigma^2_{M0}/\hat\sigma^2_{M1}.$$

Then, $n(x_{12}-1)=\mathcal{W}_{12}$ and $n(x_{01}-1)=\mathcal{W}_{01}$. If the null $M0$ is true, $x_{ij}\to1$ as all models consistently estimate the error variance $\sigma^2$.

One might then go on to prove that the sum of the $\chi^2$ random variables of the two submodel comparisons is indeed asymptotically equivalent to the random variable associated with $M0$ vs. $M2$ (not done here).

If, however, $M2$ is true, $\hat\sigma^2_{M0}$ and $\hat\sigma^2_{M1}$ will not be consistent estimators of $\sigma^2$ anymore. Denote $\sigma_0^2=\text{plim}\hat\sigma^2_{M0}$ and $\sigma_1^2=\text{plim}\hat\sigma^2_{M1}$.

Hence, $$x_{01}-1\to_p \frac{\sigma_0^2}{\sigma_1^2}-1$$ and $$x_{12}-1\to_p \frac{\sigma_1^2}{\sigma^2}-1,$$ so that $$\mathcal{W}_{01}+\mathcal{W}_{12}\sim n\left(\frac{\sigma_0^2}{\sigma_1^2}+\frac{\sigma_1^2}{\sigma^2}-2\right)$$ while $$ \mathcal{W}_{02}\sim n \left(\frac{\sigma_0^2}{\sigma^2}-1\right) $$
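A rough simulation check of this divergence (my own construction, with coefficients chosen so that $\sigma_0^2=3$, $\sigma_1^2=2$, $\sigma^2=1$, giving limits $3/2+2-2=1.5$ for $(\mathcal{W}_{01}+\mathcal{W}_{12})/n$ and $3-1=2$ for $\mathcal{W}_{02}/n$):

rss <- function(fit) sum(residuals(fit)^2)
set.seed(3)
for (n in c(1e3, 1e4, 1e5)) {
  X1 <- rnorm(n); X2 <- rnorm(n)
  y  <- X1 + X2 + rnorm(n)                # M2 is the true model
  r0 <- rss(lm(y ~ 1)); r1 <- rss(lm(y ~ X1)); r2 <- rss(lm(y ~ X1 + X2))
  W01 <- n * (r0 - r1) / r1
  W12 <- n * (r1 - r2) / r2
  W02 <- n * (r0 - r2) / r2
  cat("n =", n, " (W01+W12)/n =", round((W01 + W12) / n, 3),
      " W02/n =", round(W02 / n, 3), "\n")
}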

Remark 3: That the result holds approximately under the null (so that $\mathcal{W}_{ij}=O_p(1)$) when $n$ is large can also be motivated via the asymptotic equivalence of $\mathcal{W}$ and $LR$: from $\mathcal{W}_{ij}=n(x_{ij}-1)$ and $LR_{ij}=n\log(x_{ij})$, write $LR_{ij}=n\log(1+\mathcal{W}_{ij}/n)$. A Taylor expansion around 1 yields $$ \begin{eqnarray*} LR_{ij}&=&n[\mathcal{W}_{ij}/n+o_p(\mathcal{W}_{ij}/n)]\\ &=&n[\mathcal{W}_{ij}/n+o_p(O_p(1)/n)]\\ &=&n[\mathcal{W}_{ij}/n+o_p(n^{-1})]\\ &=&\mathcal{W}_{ij}+o_p(1) \end{eqnarray*} $$

Remark 4: For those who, like me, needed a second to see why the result is "obviously" correct for LR: note that we can rewrite $$ LR_{02}=n\log(x_{02})=n\log(x_{01}x_{12})=n\log(x_{01})+n\log(x_{12})=LR_{01}+LR_{12} $$

Remark 5: $\mathcal{W}_{ij}=O_p(1)$ also under local alternatives, as the statistics are noncentral chi-square distributed (see e.g. Sampling distribution of Coefficient of determination in general for a related discussion for the F statistic). Therefore, Remark 3 should also go through under such local alternatives.

Here is an illustration where the null is true:

library(lmtest)
library(sandwich)   # loaded in the original; not actually needed for this example

set.seed(1)         # added for reproducibility
n <- 3000
X1 <- rnorm(n)
X2 <- rnorm(n)
y <- rnorm(n)       # y is pure noise, so the null model M0 is true

M0 <- lm(y ~ 1)
M1 <- lm(y ~ X1)
M2 <- lm(y ~ X1 + X2)

## F versions of the Wald test
Wald.M1M0 <- waldtest(M1, M0)$F[2]
Wald.M2M1 <- waldtest(M2, M1)$F[2]
Wald.M2M0 <- waldtest(M2, M0)$F[2]

Wald.M2M0
Wald.M1M0 + Wald.M2M1 # only close if we multiply the previous line by 2, the number of restrictions tested

## LR tests
LR.M1M0 <- lrtest(M1, M0)$Chisq[2]
LR.M2M1 <- lrtest(M2, M1)$Chisq[2]
LR.M2M0 <- lrtest(M2, M0)$Chisq[2]

LR.M2M0
LR.M1M0 + LR.M2M1 # exactly the same

## chi-square versions of the Wald test
Wald.M1M0 <- waldtest(M1, M0, test = "Chisq")$Chisq[2]
Wald.M2M1 <- waldtest(M2, M1, test = "Chisq")$Chisq[2]
Wald.M2M0 <- waldtest(M2, M0, test = "Chisq")$Chisq[2]

Wald.M2M0
Wald.M1M0 + Wald.M2M1 # close, but not exactly equal
  • Thank you! I was expecting this sort of asymptotic behavior due to the asymptotic equivalence of the LR and Wald (and LM) tests. In small samples the results can be quite far off, though. By the way, I got some answers to my gmm questions from Pierre Chausse. They are on point but quite brief. I could use some help deciphering this one, if/when you find some time. (Commented Apr 26 at 6:40)
  • GMM theory tells us that the optimal weighting matrix is the inverse of the variance(-covariance) matrix of the moment conditions. E.g., under homoskedasticity, we get the weighting matrix $[\hat\sigma^2(Z'Z)/n]^{-1}$ (e.g. stats.stackexchange.com/questions/89378/…). If you, say, have near-multicollinearity among your instruments, this matrix would be "badly conditioned" in the terminology of the answer. (Commented Apr 26 at 10:38)
  • Regarding the present question, I added some subtleties. (Commented Apr 26 at 16:50)
  • Got it, thank you! (Commented Apr 26 at 17:01)

For linear models with a Gaussian distribution, the Wald test is similar to the LR test because both use the sum of squares to compute the standard error (Wald test) or the likelihood (LR test). The LR statistic can be computed directly from the F-statistic and t-statistic.

However, the standard error used for the Wald test is not necessarily computed from the sum of squares of the residuals. Other estimation methods are possible, especially for other distributions. So already among Wald tests there may be variations.

An example is GLMs, where the dispersion may be estimated by the method of moments.

In general, the likelihood-ratio and Wald statistics are equivalent (and chi-squared distributed) because, approximately, $\log \mathcal{L}(\theta_{ML}) - \log \mathcal{L}(\theta_0) \propto (\theta_{ML} - \theta_0)^2$. But this holds only approximately.

Distributions from the natural exponential family show this easily. Their log-likelihood is

$$\log \mathcal{L}(\theta;\boldsymbol{x}) = \theta \cdot {T}(\boldsymbol{x}) + A(\theta)$$

and the maximum is defined by

$$\frac{\partial}{\partial \theta} \log \mathcal{L}(\theta;\boldsymbol{x}) = {T}(\boldsymbol{x}) + A'(\theta) = 0$$

And with ${T}(\boldsymbol{x})$ approaching a normal distribution with variance approaching zero, we can use the following linear approximation around $\theta = \theta_0$:

$$\frac{\partial}{\partial \theta} \log \mathcal{L}(\theta;\boldsymbol{x}) \approx {T}(\boldsymbol{x}) + A'(\theta_0) + A''(\theta_0) (\theta-\theta_0)$$

For this approximately linear function we have

  • the root is the maximum likelihood estimate,
  • the square of the intercept (the score at $\theta_0$) is, up to scaling, the likelihood-ratio statistic.

So the relationship between these two is due to a linear approximation.
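As a small numeric sketch of this point (my own example, assuming a Poisson mean as the natural-exponential-family parameter of interest), the Wald and LR statistics for $H_0:\lambda=\lambda_0$ agree closely because the log-likelihood is approximately quadratic around the MLE:

set.seed(4)
n       <- 2000
lambda0 <- 2                          # hypothesized mean under H0
x       <- rpois(n, lambda = 2.05)    # true mean close to lambda0
lambda_hat <- mean(x)                 # MLE of the Poisson mean

wald <- (lambda_hat - lambda0)^2 / (lambda_hat / n)           # Wald statistic
lr   <- 2 * n * (lambda_hat * log(lambda_hat / lambda0) -
                 (lambda_hat - lambda0))                      # LR statistic
c(Wald = wald, LR = lr)               # nearly identical for large n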

  • Thank you! Good points. (Commented Apr 27 at 10:57)
