
We usually get an estimate of $\beta$ in logistic regression by finding the MLE from the observed random sample $X_1, X_2, \ldots, X_N$. Then we use the Wald test, i.e. ${[\hat\beta / \operatorname{SE}(\hat\beta)]}^2$, to test whether that variable is significant or not.
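
For concreteness, here is a minimal sketch of the computation I mean (simulated data, plain NumPy, Newton-Raphson for the MLE; the standard error comes from the inverse of the Fisher information matrix evaluated at $\hat\beta$ — all names and numbers here are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: intercept plus one predictor, true beta = (-0.5, 1.0)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.0]))))
y = rng.binomial(1, p)

# Newton-Raphson for the logistic-regression MLE
beta = np.zeros(2)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    score = X.T @ (y - mu)                       # gradient of the log-likelihood
    info = X.T @ (X * (mu * (1 - mu))[:, None])  # Fisher information matrix
    beta = beta + np.linalg.solve(info, score)

# Standard errors from the inverse information, then the Wald statistic
mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
info = X.T @ (X * (mu * (1 - mu))[:, None])
se = np.sqrt(np.diag(np.linalg.inv(info)))
wald = (beta[1] / se[1]) ** 2                    # [beta_hat / SE(beta_hat)]^2
print(beta, se, wald)
```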

From what I have read, the Wald test is based on two facts (or assumptions, I am not sure):

  1. $\hat \beta$ follows a normal distribution
  2. The variance of this normal distribution is given by the inverse of the Fisher information matrix

Can someone explain the proofs behind these two facts (or assumptions)? I have read these notes, but most of the intuition went over my head.

The two assumptions that you're referring to are properties of a maximum likelihood estimator, that's all. – call-in-co, Sep 14, 2017 at 16:58

1 Answer


Let's just say we have one parameter $\theta$ and univariate data $x_1, \ldots, x_n$.

  1. The maximum likelihood estimate is obtained by solving the score equation: $$ \sum_i l'(\hat\theta,x_i) = 0 $$ where $l(\theta,x_i)$ is the log-likelihood associated with the $i$-th observation, evaluated at parameter value $\theta$, and $l'$ is its derivative with respect to $\theta$.
  2. Near the true value $\theta_0$, we can Taylor-expand those scores: $$ \sum_i \bigl[ l'(\hat\theta,x_i)-l'(\theta_0,x_i) \bigr] = - \sum_i l'(\theta_0,x_i) = \sum_i l''(\theta_0,x_i)(\hat\theta-\theta_0) + o(|\hat\theta-\theta_0|) $$ where the first equality follows from the definition of $\hat\theta$ in the first step.
  3. Asymptotics means we are ignoring the small term $o(|\hat\theta-\theta_0|)$.
  4. Asymptotics means we are approximating $\sum_i l''(\theta_0,x_i)$ with what we know, $\sum_i l''(\hat\theta,x_i)$, assuming that $l''(\theta,x_i)$ is a sufficiently smooth function of both $\theta$ and $x$ and does not bounce around unpredictably. Or with $n \, \mathbb{E}\, l''(\hat\theta,x)$, plugging in $\hat\theta$ and integrating over the distribution of $x$.
  5. Asymptotics means that the most interesting remaining term, $\sum_i l'(\theta_0,x_i)$, is a sum of i.i.d. random variables with mean zero (the expected score at the true parameter is zero), and hence asymptotically normal; its variance is what the smart books derive to be the Fisher information. The proper scaling, according to the CLT, is then $\frac{1}{\sqrt{n}} \sum_i l'(\theta_0,x_i) \to N(0,\omega^2)$ for some $\omega$.
  6. Our interest is actually in $\hat\theta-\theta_0$. Let's express it out of step 2, with these approximations in mind: $$ \hat\theta-\theta_0 \approx - \sum_i l'(\theta_0,x_i) \Bigl/ \sum_i l''(\theta_0,x_i) $$ The numerator is asymptotically normal with mean 0 and known (sort of) variance. The denominator is a non-zero quantity, and in large samples is supposed to be a reasonably stable thing (see above about bouncing around).
  7. We thus conclude that $\sqrt{n} (\hat\theta-\theta_0) \to N(0,\sigma^2)$, where $\sigma^2$ is a function of the asymptotic variance of the scores and something like $\mathbb{E} \, l''(\theta_0,x)$. It turns out these cancel each other when the model is true, leaving $\sigma^2 = 1/I(\theta_0)$, the inverse Fisher information (and if the model is not true, you get the sandwich variance estimator instead). A small numerical check of these steps is sketched below.
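
Here is a minimal simulation sketch of steps 1–7 for a concrete one-parameter model (an Exponential rate, chosen only because its score and information are easy to write down; nothing here is specific to logistic regression):

```python
import numpy as np

rng = np.random.default_rng(1)

# One-parameter example: x_i ~ Exponential(rate = theta0), so
#   l(theta, x)   = log(theta) - theta * x
#   l'(theta, x)  = 1/theta - x            (the score)
#   l''(theta, x) = -1/theta**2            (so Fisher information I(theta) = 1/theta**2)
# The score equation sum_i l'(theta_hat, x_i) = 0 gives theta_hat = 1 / mean(x).
theta0, n, reps = 2.0, 500, 5000
estimates = np.empty(reps)
for r in range(reps):
    x = rng.exponential(scale=1.0 / theta0, size=n)
    estimates[r] = 1.0 / x.mean()

# Step 7 says sqrt(n) * (theta_hat - theta0) is approximately N(0, sigma^2)
# with sigma^2 = 1 / I(theta0) = theta0**2 here.  Compare the two:
print(np.var(np.sqrt(n) * (estimates - theta0)))   # Monte Carlo variance
print(theta0 ** 2)                                  # 1 / Fisher information
```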

And that gives you the Wald test, more or less. In the multivariate case you need to track vectors and matrices, and multiplication on the left and on the right, but that's the gist of it.
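
For the multivariate version, the "left and right" multiplication is just the quadratic form $\hat\beta_S^\top \,[\widehat{\mathrm{Var}}(\hat\beta_S)]^{-1}\, \hat\beta_S$, compared to a $\chi^2$ distribution. A small sketch, assuming you already have the estimated coefficients and their covariance matrix (the inverse Fisher information) from somewhere; the numbers below are made up purely for illustration:

```python
import numpy as np
from scipy.stats import chi2

def wald_test(beta_hat, cov_beta, idx):
    """Joint Wald test of H0: beta[idx] = 0.

    beta_hat : estimated coefficient vector
    cov_beta : its estimated covariance matrix (inverse Fisher information)
    idx      : indices of the coefficients being tested
    """
    b = beta_hat[idx]
    V = cov_beta[np.ix_(idx, idx)]
    W = b @ np.linalg.solve(V, b)        # b' V^{-1} b: multiply on the left and right
    return W, chi2.sf(W, df=len(idx))    # statistic and p-value

# Made-up numbers, for illustration only
beta_hat = np.array([0.3, 1.2, -0.8])
cov_beta = np.diag([0.04, 0.09, 0.16])
print(wald_test(beta_hat, cov_beta, idx=[1, 2]))
```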

