$\begingroup$

I have $n$ observations of pairs $(x, y)$ and three different models I would like to compare. Model0 is nested within Model1, and Model0 is also nested within Model2. I would like to run a hypothesis test for each of Models 1 and 2, using Model0 as the null hypothesis. My original approach was a likelihood ratio test, but I am getting inflated type I error rates. I believe the reason for this is explained in this Stack Overflow answer. I am therefore trying to use the corrections suggested in that answer, such as the Satorra and Bentler 2010 correction or the Bartlett correction, but to no avail: the corrections all seem to be for differently specified problems than my own. So I need help either converting the corrections into formulations for my problem or reformulating my problem in a way that is compatible with the corrections. And honestly, I wouldn't mind someone double-checking my analysis that the cited answer is actually related to my problem, because maybe I am mistaken about that.

Thus my three models are as follows:

First, all three models specify a multivariate normal distribution for the vector of observations $(x_{1},\dots,x_{n},y_{1},\dots,y_{n})$, with a $(2n \times 2n)$ positive-semidefinite covariance matrix whose entry in the $i$th row, $j$th column is the covariance between observations $i$ and $j$.

Model0: $$ \begin{bmatrix} \sigma_{1}^{2}A_{n}+\sigma_{2}^{2}\mathbb{I}_{n} & 0_{n \times n} \\ 0_{n \times n} & \sigma_{3}^{2}A_{n}+\sigma_{4}^{2}\mathbb{I}_{n} \\ \end{bmatrix}, $$ where $A_{n}$ and $\mathbb{I}_{n}$ are $(n \times n)$ known matrices ($\mathbb{I}_{n}$ is the identity matrix), and $\sigma_{1}^2, \sigma_{2}^2, \sigma_{3}^2, \sigma_{4}^2$ are unknown variance scaling factors.

Model1: $$ \begin{bmatrix} \sigma_{1}^{2}A_{n}+\sigma_{2}^{2}\mathbb{I}_{n} & c_{1}C_{n} \\ c_{1}C_{n} & \sigma_{3}^{2}A_{n}+\sigma_{4}^{2}\mathbb{I}_{n} \\ \end{bmatrix}, $$ where $A_{n}$, $\mathbb{I}_{n}$, and $C_{n}$ are $(n \times n)$ known matrices, $\sigma_{1}^2, \sigma_{2}^2, \sigma_{3}^2, \sigma_{4}^2$ are unknown variance scaling factors, and $c_{1}$ is an unknown correlation in $[-1,1]$.

Model2: $$ \begin{bmatrix} \sigma_{1}^{2}A_{n}+\sigma_{2}^{2}\mathbb{I}_{n} & c_{1}K_{n} \\ c_{1}K_{n} & \sigma_{3}^{2}A_{n}+\sigma_{4}^{2}\mathbb{I}_{n} \\ \end{bmatrix}, $$ where $A_{n}$, $\mathbb{I}_{n}$, and $K_{n}$ are $(n \times n)$ known matrices, $\sigma_{1}^2, \sigma_{2}^2, \sigma_{3}^2, \sigma_{4}^2$ are unknown variance scaling factors, and $c_{1}$ is an unknown correlation in $[-1,1]$.
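In code, the two covariance structures are assembled like this (a sketch with a made-up toy $A_{n}$ and $C_{n}$; my real matrices come from the graph distances):

```python
import numpy as np

def cov_model0(A, s1, s2, s3, s4):
    """Block-diagonal covariance of Model0 for (x_1..x_n, y_1..y_n)."""
    n = A.shape[0]
    I, Z = np.eye(n), np.zeros((n, n))
    return np.block([[s1 * A + s2 * I, Z],
                     [Z, s3 * A + s4 * I]])

def cov_model1(A, C, s1, s2, s3, s4, c1):
    """Model1: same diagonal blocks, plus the cross-covariance block c1 * C."""
    n = A.shape[0]
    I = np.eye(n)
    return np.block([[s1 * A + s2 * I, c1 * C],
                     [c1 * C, s3 * A + s4 * I]])

# Toy example: a made-up A and C (the real ones come from the graph distances).
A = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])
C = 0.5 * A  # placeholder cross-correlation structure
S0 = cov_model0(A, 1.0, 0.5, 1.0, 0.5)
S1 = cov_model1(A, C, 1.0, 0.5, 1.0, 0.5, 0.3)
```

Model2 has exactly the same form as Model1 with $K_{n}$ substituted for $C_{n}$.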

In summary, I have a few different hypotheses about the type of correlation that could exist between $X$ and $Y$, each of which I'd like to be able to favor by rejecting the null hypothesis. I greatly appreciate any help; thank you.

EDIT:

More information about $A_{n}$: it is derived by measuring distances in space between the points. The closer together two points are in space, the higher their correlation; thus the diagonal of $A_{n}$ is all 1s, because an observation $x_{i}$ always has perfect correlation with itself. $A_{n}$ is specifically derived from measuring distances on a network/graph relative to a fixed point: the more of the path from the fixed point to themselves that $x_{i}$ and $x_{j}$ have in common, the higher their correlation. In a related question from some months back, I drew out an example of such a network and gave the exact formulation of all the matrices for that specific problem. Although, in that question the rows and columns are rearranged so that the $(2n \times 2n)$ matrix describes the vector of observations $(x_{1},y_{1},\dots,x_{n},y_{n})$ rather than $(x_{1},\dots,x_{n},y_{1},\dots,y_{n})$. I think it's important to note that in the formulation for this question, I do not have any of the mins or maxes or anything like that as in the prior formulation. This is a different model of evolution I'm studying, where all the matrices are completely known.

$\endgroup$
  • $\begingroup$ Most standard arguments are asymptotic, for $n\to\infty$. In your problem there are $A_n, C_n$ and $K_n$, and as long as it is not clear how they behave for $n\to\infty$, asymptotic arguments aren't possible. If we knew what they are, asymptotic arguments may or may not apply (the $n$-dimensional unit matrix for example can be handled asymptotically). If you only can specify them for a certain fixed $n$ that you have in your application, asymptotic arguments are off. $\endgroup$ Commented Jun 22 at 9:33
  • $\begingroup$ Did you try bootstrapping the distribution of the likelihood ratio statistic? $\endgroup$ Commented Jun 22 at 9:35
  • $\begingroup$ I should add to my first comment that as everything is normally distributed here, you may well get away without asymptotics, but it may take some time to figure out mathematically whether and if so how it works. $\endgroup$ Commented Jun 22 at 9:37
  • $\begingroup$ I like your idea of bootstrapping, but I don't think I can do it correctly. For example, I found this approach (geostatisticslessons.com/lessons/spatialbootstrap), which looks very promising, but it involves Monte Carlo simulation, which I already posted an unanswered question about here (stats.stackexchange.com/q/649407/398015). The summary is: say I want to study the difference in means between two groups; the obvious choice of looking at the distribution of mean differences is wrong, so how do I know other obvious choices won't be similarly wrong? $\endgroup$ Commented Jun 22 at 16:35
  • $\begingroup$ Also, one other thing with bootstrapping: the main computational cost I have is fitting the parameter values by maximum likelihood, which involves thousands of guess-and-check steps over the parameter space. Actually simulating data is cheap in comparison; could it therefore be better to do Monte Carlo simulation rather than the bootstrap? Or I guess they are equivalent? $\endgroup$ Commented Jun 22 at 16:37

1 Answer

$\begingroup$

A simple parametric bootstrap would estimate the parameters of Model0 from the data, and then simulate many data sets from the estimated distribution. If you want to test against Model1, you would then for each bootstrapped data set compute the likelihood ratio between Model1 and Model0. That would be your estimated null distribution, and you'd reject Model0 at level $\alpha$ if the likelihood ratio on the real data is larger than a $(1-\alpha)$-quantile of the bootstrapped null distribution.
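In code, the procedure might look like this. This is only a sketch: I use placeholder matrices for $A_n$ and $C_n$ (your real graph-derived matrices would go there), a log/tanh parametrisation to keep the variances positive and $c_1 \in (-1,1)$, and a naive Nelder-Mead ML fit, which you may well be able to improve on:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Placeholder known matrices -- in the real problem these come from the graph.
n = 4
A = np.eye(n)
C = 0.5 * np.eye(n)
I = np.eye(n)

def cov0(p):
    # Model0: log-parametrised so the variance components stay positive.
    s1, s2, s3, s4 = np.exp(p)
    Z = np.zeros((n, n))
    return np.block([[s1 * A + s2 * I, Z], [Z, s3 * A + s4 * I]])

def cov1(p):
    # Model1: adds the cross-covariance c1 * C, with c1 mapped into (-1, 1).
    s1, s2, s3, s4 = np.exp(p[:4])
    c1 = np.tanh(p[4])
    return np.block([[s1 * A + s2 * I, c1 * C], [c1 * C, s3 * A + s4 * I]])

def nll(p, cov_fn, z):
    try:
        return -multivariate_normal(mean=np.zeros(2 * n), cov=cov_fn(p)).logpdf(z)
    except (ValueError, np.linalg.LinAlgError):
        return np.inf  # reject non-positive-definite proposals

def lr_stat(z):
    f0 = minimize(nll, np.zeros(4), args=(cov0, z), method="Nelder-Mead")
    # Warm-start Model1 at the Model0 fit (c1 = 0), so the statistic is >= 0.
    f1 = minimize(nll, np.append(f0.x, 0.0), args=(cov1, z), method="Nelder-Mead")
    return 2.0 * (f0.fun - f1.fun), f0.x

# "Observed" data -- simulated from Model0 here purely for illustration.
z_obs = rng.multivariate_normal(np.zeros(2 * n), cov0(np.zeros(4)))
t_obs, p0_hat = lr_stat(z_obs)

# Parametric bootstrap: simulate from Model0 at its fitted parameters,
# refit both models on each simulated data set, and collect the LR statistics.
boot = []
for _ in range(50):
    z_b = rng.multivariate_normal(np.zeros(2 * n), cov0(p0_hat))
    boot.append(lr_stat(z_b)[0])

p_value = np.mean(np.array(boot) >= t_obs)
```

With your actual data you would of course use many more bootstrap replications than 50, and the test against Model2 is identical with $K_n$ in place of $C_n$.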

In some literature this is called Monte Carlo, but the term parametric bootstrap is more precise, as standard Monte Carlo would not use parameters estimated from the same data on which the test is run.

I can't give you a theoretical guarantee that this will work, though; the situation is nonstandard. Also, if computing the likelihood ratio is difficult (it requires ML estimators), this will be computationally expensive, although looking at your model I'd be surprised if it weren't possible to do it more efficiently than by running "thousands of guess and check steps over the parameter space". Maybe there is even an analytical solution, although I won't have the time to check that.

$\endgroup$
  • $\begingroup$ This looks promising, thank you. Can I ask why there is not a theoretical guarantee? You write the situation is "nonstandard", but why? Is it that I have a finite sample instead of an infinite one? Is it that the likelihood ratio test is only valid when the data are iid? Is it that the likelihood ratio test statistic is unbounded (and what does that even mean)? Is it some fourth thing? What kind of things could theoretically go wrong, so that having a likelihood ratio in the top 5% would not actually "guarantee" a significant finding at the 5% level? $\endgroup$ Commented Jun 22 at 18:45
  • $\begingroup$ As for the existence of an analytical solution, I'm not sure; this is an extremely well-studied field in biology. You definitely should not waste time trying to develop a solution yourself until I at least finish reviewing the literature on analytical solutions. It is complicated, but I think you are right that solutions are known for certain special situations. I will keep researching this. :) $\endgroup$ Commented Jun 22 at 18:51
  • $\begingroup$ I think at least one problem is that in a lot of situations, at least one of the variance components (e.g. $\sigma_{1}^{2}$) will be zero, and thus the ML fit will be on the boundary rather than in the interior? Although I do not actually think this is the problem, because the null hypothesis is that $c_{i}=0$, not that any of the variance components do or do not equal zero. $\endgroup$ Commented Jun 22 at 19:47
  • $\begingroup$ @AFriendlyFish "Based on what you said about LR always being correct, would it make sense..." I don't really understand what you propose here. Neither do I understand how you want to diagnose that "the bootstrap method doesn't work", or what you mean by "parameters which perform better" - in what sense? Personally I don't have a better method on offer than what I have proposed here, but that doesn't necessarily mean that none exists. $\endgroup$ Commented Jun 23 at 10:13
  • $\begingroup$ @AFriendlyFish Correct in principle, but I'd think you really wouldn't want to try to explore systematically the space of possible parameters in this way. $\endgroup$ Commented Jun 24 at 9:17