20
$\begingroup$

My hypothesis concerns intervention versus control in a randomised controlled trial (between-subjects, n=500 per group, online survey experiment). I pre-registered that my primary test would be a regression model with group as the variable of interest, plus covariates included to explain variance and increase power. My pre-registered test did not detect an effect (perhaps the model is a mess, with too many covariates), but a simple two-sample permutation test does (p<0.05). The effect size is small (bootstrapped 95% CI for Cohen's d is 0.07 to 0.33), but even a small effect is of great interest. I can't claim detection - that would be p-hacking - so I have to collect more data.

My question is whether it would be legitimate to include my first sample in any way in my analysis after having collected a second sample. I can collect another 500+500 sample, but power is never going to be good for this effect. It feels very conservative to not be able to take the first sample into account at all and to have to go right back to the drawing board. But leaning too heavily on the first sample in a new analysis would be a second bite of the same apple, also inappropriate.

Are there suggestions for an appropriate compromise in this kind of situation?

EDITS TO CLARIFY

(1) the question is only about what analysis would be justifiable to conduct when I have a second sample. For example, would it be valid to conduct an analysis that includes (in some way) both the first and second sample? I would not try to argue that my non-preregistered first-sample analysis is primary - the bottom line is that an effect was not detected in the first sample, even if we have strong hints.

(2) yes I tend to think frequentist but if there is a Bayesian solution to this difficult dilemma, that could be enough to make me go Bayesian - I'd appreciate tips on where to start with a Bayesian two-sample test (preferably non-parametric, e.g. permutation) that could incorporate the first sample results in the prior, for example.

$\endgroup$
6
  • 1
    $\begingroup$ not exactly the same scenario but you might look at some of the literature around 'internal pilots' $\endgroup$ Commented Apr 29 at 8:56
  • $\begingroup$ Thanks @GeorgeSavva that was useful. I just read the first paper I found that looked relevant (journals.sagepub.com/doi/full/10.1177/0962280217696116). It quickly got too technical for me but I get the general idea. I think this method is not appropriate for me now because I should have pre-registered the intention to operate like this, but this could be useful for future studies when I'm chasing small effects. $\endgroup$
    – Amorphia
    Commented Apr 29 at 10:19
  • 2
    $\begingroup$ Perhaps others can comment on the viability of the following approach. Pre-register a second trial with a two-sample test with a non-positive effect null. In short, you are using the first sample only to justify changing to a one-sided test. $\endgroup$ Commented Apr 29 at 12:03
  • $\begingroup$ 1. "I can't claim detection, that would be p-hacking - I have to collect more data" - Responding to a non-rejection by collecting a new sample and testing again ... is also a kind of p-hacking, unless you collect all the non-reject results along with any rejections. 2. Omitting potentially important covariates will lead to omitted variable bias, so a rejection in the second model doesn't necessarily help your case; an alternative explanation for your results is simply that you're now picking up the biasing effects your covariates were there to remove. $\endgroup$
    – Glen_b
    Commented May 1 at 23:54
  • 1
    $\begingroup$ @Glen_b, point 1 is fair and made me think again, thanks; point 2 is also fair but I don't think it applies here. I have n = 500 per group, so it's very unlikely that randomisation hasn't smoothed away the effects of covariates, and I think I just went over the top in my preregistration - the model has like 20 terms and almost none of them are significant. There are so many terms mainly because I have a multi-category geographical region variable that isn't actually doing much. What's it called when a model has too many terms, each soaking up small amounts of variance without really helping? That's what I have. $\endgroup$
    – Amorphia
    Commented May 3 at 9:20

5 Answers

17
$\begingroup$

I will stay within a frequentist framework for this answer because that clearly informs your planning, but I should note that some of these decisions are different (and arguably simpler) in a Bayesian framework.


Analysing only the second sample with the new model would provide unbiased p-values, because the decision about the model to analyse is not contingent on the data within that sample. However, the decision is dependent on the data from the first sample, so it would be p-hacking to include the first sample's data.

But even if you did not collect a second sample, it is perfectly acceptable to report both analyses as long as you are very clear in all communications (including the abstract of a talk or paper) that

  1. this permutation test deviated from the pre-registration plan, and
  2. you decided to do it after your planned analysis did not show the results you were looking for.

Basically, you need to give people the information they need to fairly evaluate the analyses.


UPDATE:

I'm afraid switching to a Bayesian approach now would be yet another data-dependent decision and violate your pre-registration plan, so switching is not a simple solution to the basic problem (also, a Bayesian may not even see the need for a second sample, as they rarely follow a strict dichotomous rule like "p < 0.05 is evidence in favour while p > 0.05 is not"). There's no getting around the need for clearly reporting your choices.

Also, I think a principled Bayesian approach would involve sticking with the full model (with appropriate priors) and applying it to the combined dataset, not switching to a reduced model out of a desire for narrower standard errors for one parameter. Dropping terms is sometimes reasonable but doing so can introduce biases. I don't think we can advise about how justifiable it would be in your case without knowing a lot more about the experiment and data.
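
If you did eventually go that route, a minimal sketch of "the full model on the combined data" could look like the following, here using the brms package. The outcome, covariate, and data-frame names are placeholders (not taken from the question), and the weakly informative priors are only an example:

# Sketch only: the pre-registered model, fitted to both samples combined,
# with weakly informative priors on the coefficients.
# All variable and data-frame names below are placeholders.
library(brms)

combined <- rbind(
  transform(study1, study = "first"),
  transform(study2, study = "second")
)

fit <- brm(
  outcome ~ group + region + covariate1 + covariate2,  # pre-registered model
  data   = combined,
  family = gaussian(),
  prior  = set_prior("normal(0, 1)", class = "b")      # weakly informative slopes
)

summary(fit)   # posterior for the group effect, given the whole model

How much the first sample is allowed to influence the prior (rather than only entering the likelihood) is exactly the point debated in the comments below, so any such choice would need to be reported as openly as the frequentist deviations discussed above.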

$\endgroup$
9
  • 3
    $\begingroup$ There's quite a heated debate between frequentists and Bayesians, and 'optional stopping' (that's the name, I believe, of what you (the OP) describe) is one of the contentious issues. As a practitioner of statistics myself, I don't see a neat, automatic solution from either camp. And either camp gives you options to torture the data, if you want to. Ultimately it comes down to fair reporting and judgment calls (I'm glad to see that mkt's answer is consistent with mine, more or less). $\endgroup$
    – dariober
    Commented Apr 29 at 10:24
  • 3
    $\begingroup$ @Amorphia It is true AFAIK that optional stopping isn't an issue for Bayesians (see e.g. pubmed.ncbi.nlm.nih.gov/24659049) and so you could legitimately include both samples in a Bayesian model. But I'm afraid switching to a Bayesian approach now would be yet another data-dependent decision and violate your pre-registration plan, so it doesn't fully solve the basic problem. As dariober says (+1 to their answer), the need here is for reporting of data-dependent decisions and actions. $\endgroup$
    – mkt
    Commented Apr 29 at 11:06
  • 2
    $\begingroup$ @Amorphia , mkt: "you could legitimately include both samples in a Bayesian model" - what you cannot legitimately do is specify a prior that depends on what you see in the first sample, then use this for analysing data from both samples. What you can do is to have the first sample inform a prior that then is applied to second sample data only. $\endgroup$ Commented May 1 at 10:34
  • 2
    $\begingroup$ @ChristianHennig If (a big If) the first sample is used to create a prior strongly supporting a positive effect, for use with the second sample, then we can almost guarantee a strong posterior positive effect. Thus the Bayesian approach is highly susceptible to bad practice: prior-hacking can be as dangerous as p-hacking. $\endgroup$ Commented May 2 at 0:15
  • 3
    $\begingroup$ Having thought about this for a bit, my proposal in the previous comment is subject to mkt's correct remark that "I'm afraid switching to a Bayesian approach now would be yet another data-dependent decision and violate your pre-registration plan". If you only decide to do this after having seen the first sample, in the hope that this gives you some "unbiased claim of Bayesian significance", the data dependent decision to go Bayesian will actually bias it. (Some Bayesians may not agree with this but that'd be worrying...) $\endgroup$ Commented May 7 at 10:11
12
$\begingroup$

In my opinion, how you report results matters more than making dichotomous claims about whether an effect is detectable or not. If you have the option to collect more data, you may go for it and combine it with the previous sample. However, it is important that in a publication you state what led you to collect more data and to analyse them differently than you pre-registered, and I would certainly report the pre-registered analysis.

It is likely that your p-values will be inflated by deciding what to do after you have seen the results, but as long as you present it to your audience in a fair way it should be ok. Also, can you justify a deviation from what you pre-registered? If the answer is simply that you get a smaller p-value, then that would be dubious reasoning. But if after seeing the data you realise that some assumptions definitely don't hold, then it may be more acceptable to deviate.

In other words, p-hacking occurs when you try many things, report only the findings you like, and make it appear as if you decided before seeing the data. But if you are open about your reasoning and reporting, then it's acceptable.

At the end of the day, it may be that the limitations of an online survey overwhelm the difference between p = 0.04 and p = 0.06 so I wouldn't get too obsessed about cutoffs anyway.

$\endgroup$
1
  • 2
    $\begingroup$ Agree with all of this, thanks (including the cutoff issue but my discipline is obsessed with cutoffs). Bottom line is, I didn't detect. But my question is more forward looking. I don't want to find excuses to p-hack, I want to know how to analyse once I have a second sample - can the first sample also be involved at that point? Question edited to clarify. $\endgroup$
    – Amorphia
    Commented Apr 29 at 10:08
6
$\begingroup$

I think the first thing you need to check is why you now believe this simple test might be more valid, in terms of inference, than your preregistered one. Maybe you have done that already, but questioning your own (causal) assumptions is a good idea before going forward, especially in light of surprising positive results. Preregistration is a good tool for estimating and communicating test severity and error rates, but it is not a good tool for judging validity of inference, because the latter does not depend on the temporal order of data and analysis - which is why improving inference is one of the good reasons to deviate from an analysis plan. So if you are quite sure that there are very good theoretical reasons for this switch, you might not even need more data.

Assuming you still want more data:

What you seem to want in general is some kind of analysis that yields a p-value incorporating your decision to bite the apple a second time (great title btw). There is a method for this, and that is sequential analysis. It is usually conceptualized the other way around, though: you want to find a significant effect with as few subjects as possible, so you devise a plan that lets you look at the data at multiple interim time points, adjusting the threshold for each test. This can keep the overall error rate constant while allowing you to stop when significance is found (https://lakens.github.io/statistical_inferences/10-sequential.html). It is however important to prespecify how often to look. The issue is that the per-look thresholds have to be set below your overall alpha in advance, so once you have already done a test at your desired threshold, I do not think you can use all your data and keep the error rate the same.*
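
To make the adjustment concrete, here is a small base-R simulation (an illustration, not based on the question's data) comparing two looks at an unadjusted threshold of 0.05 with a Pocock-style per-look threshold of about 0.0294; only the latter keeps the overall Type I error near 5%:

# Sketch: Type I error of a two-look sequential design under the null,
# with and without a Pocock-style per-look threshold (values illustrative).
set.seed(1)
n_per_stage  <- 500      # per group, per stage
n_sim        <- 10000
alpha_naive  <- 0.05
alpha_pocock <- 0.0294   # approximate Pocock threshold for two looks

reject_naive <- reject_pocock <- logical(n_sim)
for (s in 1:n_sim) {
  x1 <- rnorm(n_per_stage); y1 <- rnorm(n_per_stage)  # stage 1, no true effect
  x2 <- rnorm(n_per_stage); y2 <- rnorm(n_per_stage)  # stage 2, no true effect
  p1 <- t.test(x1, y1)$p.value                        # look 1: stage 1 only
  p2 <- t.test(c(x1, x2), c(y1, y2))$p.value          # look 2: all data
  reject_naive[s]  <- (p1 < alpha_naive)  || (p2 < alpha_naive)
  reject_pocock[s] <- (p1 < alpha_pocock) || (p2 < alpha_pocock)
}
mean(reject_naive)   # roughly 0.08: inflated overall error rate
mean(reject_pocock)  # roughly 0.05: controlled overall error rate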

In general, I am not convinced that these considerations will do much good for what counts, which is your research question and how best to answer it. At some point you will have to make assumptions anyway, and be prepared to defend them, and in my opinion this is more about why a certain model is a good way of answering the question, and why and under what conditions the other one might be better. Especially if the goal is practical recommendations, this might be important to consider.

Look at it from this perspective: in science, someone finds something, reports it, and someone else builds on top of it, and in the long run there are conclusions about theories. In this case, both of these people are you (and it is very laudable that you do not want to just report this single result and move on), but I do not think someone else would try to adjust their p-values based on the fact that they ran a new study on your results, which might have been a false positive - so I would not expect that from you either. Instead, they build on it because they find it plausible that such an effect might exist, and that your model was a useful one. And they might test it again indirectly with their new study, or do a meta-analysis of related findings. It is not all hypothesis testing, or statistical considerations in general. Coming back to my first paragraph, when you can run two or more different models for the same research question (which is basically what created this multiple-comparisons-like problem), I think it is better to spend more time on the theoretical side of things than on fiddling with statistical methods.

So depending on your resources, I think I would do these things (in this order):

  1. (You might have done this already, but for reference.) Justify your theoretical model.

  2. Run a second study, but incorporate knowledge from the first. Define a smallest effect size of interest and do an equivalence test (Daniel Lakens has published on this, in case you don't know the concept), use the first study as a measure of the expected variability, and use that for a fairly precise power calculation. (A rough sketch of this, and of the meta-analysis in step 4, follows after the list.)

  3. Extend the second study to corroborate the main finding(s) in a different way, if possible. Maybe also in contrast to the original model with the covariates.

  4. Do a meta-analysis with both studies to estimate the overall effect.
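
A minimal base-R sketch of steps 2 and 4. The smallest effect size of interest (d = 0.2), the effect estimates, and the group sizes below are placeholders, not values taken from the question:

# Step 2: power the second study for a smallest effect size of interest (SESOI)
# of d = 0.2 (assumed here purely for illustration).
power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.9)
# delta/sd = 0.2 corresponds to Cohen's d = 0.2; this returns n per group.

# Equivalence test (TOST) against the SESOI on the new data:
# x and y are the intervention and control outcomes of the second study.
tost <- function(x, y, sesoi_d) {
  sd_pooled <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
                      (length(x) + length(y) - 2))
  bound   <- sesoi_d * sd_pooled
  p_lower <- t.test(x, y, mu = -bound, alternative = "greater")$p.value
  p_upper <- t.test(x, y, mu =  bound, alternative = "less")$p.value
  max(p_lower, p_upper)  # small value: effect is statistically smaller than the SESOI
}

# Step 4: fixed-effect (inverse-variance) meta-analysis of the two studies.
d   <- c(0.20, 0.15)                    # Cohen's d from study 1 and study 2 (placeholders)
ng1 <- c(500, 500); ng2 <- c(500, 500)  # group sizes in each study
se  <- sqrt((ng1 + ng2) / (ng1 * ng2) + d^2 / (2 * (ng1 + ng2)))  # approximate SE of d
w   <- 1 / se^2
d_pooled  <- sum(w * d) / sum(w)
se_pooled <- sqrt(1 / sum(w))
c(estimate = d_pooled,
  lower    = d_pooled - 1.96 * se_pooled,
  upper    = d_pooled + 1.96 * se_pooled)

Whether the meta-analytic estimate can be treated as confirmatory, given that the second study exists only because of the first result, comes back to the reporting and error-rate caveats discussed above.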


* Side note: You should be aware that even the decision to collect more data is in principle a data-dependent one that affects your error rates, just like p-hacking is. This is because you would not have thought about a new study had the first not come out significant. But not all studies with a true effect come out significant (especially with small effect sizes), so only running replications of significant ones will miss a bunch of stuff (there might be more research on this in the literature on the reproducibility crisis in psychology, because you are essentially asking how to judge replications of studies and how they are selected, which is a big topic there). So you are right to describe the approach of only using the second data set as conservative, because it has less power compared to a combined analysis. But even if you use all the data from both studies, the overall power is lower than in a scenario where you always run a second study, because you only ever follow up the results that happened to come out significant.

So taking the problem of data-contingent analysis very seriously, your situation might even have an additional layer of complexity, because in principle you would need to figure out what you would have done in other, similar situations: for instance, you might have asked the same question here at CV if the model with just one covariate, or two, etc. had come out significant, but maybe not if the p-value had been even smaller than the one you found now, and so on... I think you could easily go down a rabbit hole here and never come to an exactly valid conclusion, because it is impossible to know all the hypotheticals. And even if you could, they might be hard to sell to others. Again, in my opinion the problem lies more in the fact that multiple models might be valid answers to a research question.

$\endgroup$
1
  • 1
    $\begingroup$ Many thanks - I learned a lot from reading this reply. Although the main thing I am learning (from this reply and the others) is that there are no easy answers here! $\endgroup$
    – Amorphia
    Commented Apr 30 at 7:07
6
$\begingroup$

The current answers seem to contain good advice, but they omit an important perspective, and they do not fully address the underlying weakness of the framework that likely dominates your area of research. Critically, they do not ask you for clarification on important things:

  1. Is this a preliminary study?
  2. Assuming this is not a preliminary study, was it designed with a well-informed power study to determine sample size?
  3. What was the actual p-value(s) for the registered analysis?

You do not give a p-value, presumably because you 'know' that it is 'best' to rely on pre-specified or conventional less-thans. The result of that is that we have an all-or-none dichotomisation of the results into 'significant' and 'not significant' (positive and negative). Your result was p>0.05 and so it is a nothing. To do a different (more powerful, perhaps) analysis at this time increases the overall rate of false positive errors implicit in the design and so it is effectively forbidden. You 'know' that, and yet you feel that something more reasonable should be possible. And more reasonable approaches are available.

Do you know that the dichotomisation into 'significant' and 'not significant' forces you to privilege long run global error rate above the evidence in your data? Even though your p-value is above the pre-registered cutoff there is evidence in your study and if you are a scientist then you can (should!) react to that evidence. And you might find that there is evidence in the data concerning other hypotheses.

The preregistration process forces you towards a one-and-done attitude: your results are 'negative' and you cannot do anything about it without p-hacking. That is a serious flaw in the whole pre-registration movement. (In my opinion a lot of the impetus towards setting up preregistration came from a well-placed mistrust of drug company reporting of clinical trials, but that is a different topic.)

1. Preliminary studies

Preliminary (pilot) studies inform future studies by the group doing them, and, if published, by others. They are critical both for developing scientific hypotheses and for devising statistical hypotheses that can help with the desired scientific inferences. They allow you to know what interventions and analyses are most likely to yield valid and interesting inferences. Sometimes published work of others can be used in place of a pilot study, but that is not optimal practice, in my opinion.

If you use a dichotomous 'significant'/'not significant' type of analysis for a preliminary study then you will have an elevated risk of false negative results because a preliminary study is rarely well-designed and powered. A 'negative' result in a pilot study will often preclude any further data gathering, but a false positive result will be corrected when the subsequent, designed, study is analysed.

2. Study design with preliminary results

If you have a preliminary set of data that led you to choose a sample size of 500 per group then you should look at the design and implementation of the two stages of your study to see if there are things that might have diluted the effect.

Given that you say "a small effect is of great interest", can you justify the sample size of 500? Did the preliminary study show a larger effect, or less variability? That will be critical for the design of any follow-on experiments. (Do not think in terms of Cohen's effect size scale. After all, the actual effect size in real-world units is what is relevant to conclusions about the real world.)

3. Actual p-values

You will have noticed my disdain for the dichotomisation of statistical results into 'significant' and 'not significant'. (Why else would I persist with the scare quotes, right?) If the observed p-value from the registered analysis is close to 0.05, then the evidence that you obtained in that analysis against the null hypothesis is quite similar (almost the same) to what it would be for a p-value just less than 0.05. A p-value of 0.049 is virtually the same as 0.051, and neither points to strong evidence. The dichotomisation into 'significant' and 'not significant' is about control of global error rates, not about evidence, and it hinders scientific consideration of the relevant information when making scientific inferences.

The weakness of evidence implied by an observed p=0.049 is critically relevant to the extrapolation from a statistical inference to a scientific inference about the real world, just as the strength of evidence implied by p=0.0000005 would be. And neither should be the sole consideration.

Conclusion

Report exactly what you did and found (including mention of the preliminary results that led you to the hypotheses and design chosen for the study). Include the post-hoc analysis and form sensible conclusions using all of the information. You are unlikely to be able to defend strong conclusions, but further study should be informed by your experience. In the future, be careful about whether pre-registration is appropriate for the nature of the study being undertaken.

$\endgroup$
2
  • $\begingroup$ Thank you very much - this was a very useful set of reflections. Your points bring context to the fore. The context is that I'm a university academic who also operates outside academia supporting practice by non-academic groups. In that non-academic context, I need to communicate internally (advise on practice) and externally (defend practice against external critics). On the basis that the pre-registered p is .08 but the more powerful p is .02, I'm happy to advise internally that the intervention is probably good. The tricky part is defending the intervention against external criticism! $\endgroup$
    – Amorphia
    Commented Apr 30 at 7:16
  • 1
    $\begingroup$ @Amorphia Even the difference between 0.02 and 0.08 is not that large; both indicate somewhat against the null hypothesis but not strongly so. The statement that "the intervention is probably good" is a Bayesian one anyway - in frequentism and with p-values, there is no well defined probability of having an effect. The truth is your data give a weak indication but nothing more. This statement should not be hard to defend... in any case no standard theory (frequentist or Bayesian) will allow you to involve data in the analysis that gave rise to choosing this analysis in the first place. $\endgroup$ Commented May 1 at 10:21
6
$\begingroup$

I think that this post is full of answers with great suggestions and excellent discussion. However, you seem very interested in a direct approach, and so I'd like to offer a mathematical, meta-analysis-style approach that answers your question more directly. For various reasons, you may not want to actually use it.


For simplicity, let's focus on a one-sided test \begin{align*} H_0: \mu_x &= \mu_y \\ H_1: \mu_x &> \mu_y \\ \end{align*}

Now let's get some notation out of the way. Suppose our first sample consists of data $X_{11}, \ldots, X_{1n_1}$ and $Y_{11}, \ldots, Y_{1m_1}$, and our second sample consists of $X_{21}, \ldots, X_{2n_2}$ and $Y_{21}, \ldots, Y_{2m_2}$. Let's define \begin{align*} D_1 &= \frac{1}{n_1}\sum_{i=1}^{n_1}X_{1i} - \frac{1}{m_1}\sum_{i=1}^{m_1}Y_{1i} \\ D_{12} &= \frac{1}{n_1+n_2}\left(\sum_{i=1}^{n_1}X_{1i}+ \sum_{i=1}^{n_2}X_{2i}\right) - \frac{1}{m_1+m_2}\left(\sum_{i=1}^{m_1}Y_{1i}+ \sum_{i=1}^{m_2}Y_{2i}\right). \end{align*}

Since we already know that the first sample leads to a p-value of less than $0.05$, we can condition on the fact that $D_1 \in \mathcal R_1$, where $\mathcal R_1 = \{d : d > d_0\}$ is a rejection region of size $0.05$ for the appropriate choice of $d_0$. Conditioning on this fact, we can calculate an adjusted p-value using the following formula: \begin{align*} \text{p-val} &= P(D_{12} > \hat D_{12} | D_1 > d_0, H_0) \\[1.5ex] &= \frac{P(D_{12} > \hat D_{12} \cap D_1 > d_0 | H_0)}{P(D_1 > d_0 | H_0)} \\[1.2ex] &= \frac{P(D_{12} > \hat D_{12} \cap D_1 > d_0 | H_0)}{0.05}, \end{align*} where $\hat D_{12}$ is the value actually obtained in your sample. Thus, to compute the adjusted p-value, you can run a permutation test similar to the one before, but we need to (i) throw away all cases where $D_1$ is not significant and (ii) multiply the obtained p-value by $20$.

I will give R code to perform this below, but first let me describe the results. I simulated data with $n_1=n_2=m_1=m_2 = 500$ from normal distributions with $\mu_x=10$, $\mu_y=9.9$ and standard deviations equal to $1$. I then wrote code to approximate the p-value for a permutation test with and without the correction. Without the correction (possible p-hacking) I get a p-value of $0.011$. With the correction, I get a p-value of $0.133$. Note that, for this particular simulation, you would have gotten a smaller p-value $(0.082)$ had you just used the second sample by itself. Results may vary.


R Code

Note: Finding $d_0$ can be a challenge. In the code below, it is estimated via permutation, based on the sample (effectively a form of bootstrapping). In reality, it is probably better to find it via Monte Carlo simulation. This requires making assumptions about the distributions of $X$ and $Y$, such as the common value that $\mu_x=\mu_y$ should take in the simulation.
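
For what it's worth, a minimal sketch of that Monte Carlo alternative, assuming (purely for illustration) normal outcomes with unit standard deviation and a common null mean of 10:

# Sketch: estimate d0 by Monte Carlo under an assumed null model
# (normal outcomes, common mean mu0 = 10, sd = 1 -- all assumptions).
set.seed(42)
n1  <- 500
m1  <- 500
mu0 <- 10
M   <- 100000
D1_null <- replicate(M, mean(rnorm(n1, mu0, 1)) - mean(rnorm(m1, mu0, 1)))
d0_mc   <- quantile(D1_null, 0.95)   # one-sided 5% threshold for D1
d0_mc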

# Simulate data
set.seed(1209102)
n1 <- 500
n2 <- 500
m1 <- 500
m2 <- 500
X1 <- rnorm(n1, 10, 1)
X2 <- rnorm(n2, 10, 1)
Y1 <- rnorm(m1, 9.9, 1)
Y2 <- rnorm(m2, 9.9, 1)

# Compute statistics
Tx1 <- sum(X1)
Tx2 <- sum(X2)
Ty1 <- sum(Y1)
Ty2 <- sum(Y2)

D1  <- Tx1/n1 - Ty1/m1
D2  <- Tx2/n2 - Ty2/m2     # second sample alone (used below for its p-value)
D12 <- Tx1/(n1+n2) + Tx2/(n1+n2) - Ty1/(m1+m2) - Ty2/(m1+m2)

# Get d0 threshold
M <- 100000
D1_perm <- rep(NA, M)
XY <- c(X1, Y1)
for(i in 1:M){
  ind <- sample(n1+m1, n1, replace=FALSE)
  D1_perm[i] <- sum(XY[ind])/n1 - sum(XY[-ind])/m1
}
d0 <- quantile(D1_perm, 0.95)
D1 - d0   # positive difference: the first-sample test is significant at the 5% level

# Get joint p-value
M <- 100000
cnt1 <- cnt2 <- cnt12 <- 0
XY <- c(X1, X2, Y1, Y2)
for(i in 1:M){
  # Assign new groups to data
  indx1 <- sample(n1+n2+m1+m2, n1, FALSE)
  indx2 <- sample((1:(n1+n2+m1+m2))[-indx1], n2, FALSE)
  indy1 <- sample((1:(n1+n2+m1+m2))[-c(indx1, indx2)], m1, FALSE)
  indy2 <- (1:(n1+n2+m1+m2))[-c(indx1, indx2, indy1)]
    
  D1_perm <- sum(XY[indx1])/n1 - sum(XY[indy1])/m1
  D2_perm <- sum(XY[indx2])/n2 - sum(XY[indy2])/m2
  D12_perm <- (sum(XY[indx1]) + sum(XY[indx2]))/(n1+n2) - (sum(XY[indy1]) + sum(XY[indy2]))/(m1+m2)
  
  if(D1_perm > d0 & D12_perm > D12){
    cnt1 <- cnt1 + 1
  }
  
  if(D2_perm > D2){
    cnt2 <- cnt2 + 1
  }
  
  if(D12_perm > D12){
    cnt12 <- cnt12 + 1
  }
}
cat("Uncorrected p-value: ", cnt12/M,
    "\nCorrected p-value: ", 20*cnt1/M,
    "\np-value for 2nd sample alone: ", cnt2/M)

$\endgroup$
5
  • $\begingroup$ Thanks! I am very interested in this approach but I need to understand it better. I understand the general idea and much of the maths but I think the main thing preventing me from a full understanding is that your definition of d0 is too implicit for me. I don't really get exactly what parameter that is. Also, do you have a reference for this approach, or a descriptive phrase for it that I could use to find references? $\endgroup$
    – Amorphia
    Commented May 1 at 8:45
  • 1
    $\begingroup$ This is nice and elegant, but does it have a better power than just analysing the second data set alone? Looking at the factor 20, I have my doubts. $\endgroup$ Commented May 1 at 10:29
  • 1
    $\begingroup$ @ChristianHennig There is a little bit of a counterbalance to the factor of $20$, since $P(A\cap B) \leq P(A)$, but yes I think you are generally going to be right. Although the exact answer will depend on sample/effect sizes, this approach will often be less powerful than just analyzing the second data set. $\endgroup$
    – knrumsey
    Commented May 1 at 14:56
    $\begingroup$ @Amorphia. $d_0$ is just the number that satisfies the equation $P(D_1 > d_0) = 0.05$, where $D_1$ is defined in the answer (and the probability is taken under the null hypothesis). Finding $d_0$ is very problem-specific, so it didn't make sense to discuss it in too much detail. I don't know of any references, sorry, but it may be helpful to search for the keyword meta-analysis. I'll update if I can find anything more useful. $\endgroup$
    – knrumsey
    Commented May 1 at 14:59
  • $\begingroup$ Thanks so much for the further explanation @knrumsey, this is beginning to make more sense to me, although it's at the edge of my ability as an average psychology researcher! $\endgroup$
    – Amorphia
    Commented May 3 at 9:14
