
In multiple places, the p-value is defined as something that quantifies the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct. A p-value < 0.05 is usually considered evidence in favour of the alternative hypothesis.
If I understand correctly 'more extreme' are results in favour of the null hypothesis, because then a high p-value is in favour of it.
My actual question is that I feel a p-value threshold of 0.05 is way too low to reject the null hypothesis. To have conclusive favour in support of the null hypothesis, it should be more than that... maybe 0.7 or 0.9? Perhaps I don't understand the p-value well enough intuitively. Any help would be appreciated. Thanks.

  • You either accept the alternative hypothesis or you keep your null hypothesis (but you do not accept it). – Commented Feb 19 at 11:24
  • Welcome to Cross Validated! Could you please elaborate on why you think $0.7$ or $0.9$ would be better than $0.05$? – Dave, Feb 19 at 11:27
  • It sounds like you need a clear beginner's guide to p-values and their roles in inference. Can I recommend my own? link.springer.com/chapter/10.1007/164_2019_286 – Commented Feb 20 at 6:31
  • Please check this wonderful answer to the question 'What is the meaning of p values and t values in statistical tests?'. – Commented Feb 20 at 6:33

3 Answers


'More extreme results are in favor of the null hypothesis' --> This is not true!

The point of the p-value is to quantify how extreme your observation is if you assume the null hypothesis to be true. Assume your null hypothesis states that some mean is equal to 3 and the alternative hypothesis states that it is larger than 3. You now collect data, build a test statistic, and calculate a p-value. If the mean in your sample is close to 3, the p-value is going to say that it is quite normal to find a sample with a mean at least as extreme as yours, and thus the null hypothesis seems plausible (note: you NEVER accept it!).

However, if you have a sample with a mean equal to 5, the p-value will be very low, since, assuming the true mean is 3 (your null hypothesis), it is very unlikely to find a sample with a mean of at least 5. A low p-value (say 0.01) then tells you: there is a 1% chance of seeing a result at least this extreme when the null hypothesis is actually true. So this is good evidence that the null hypothesis might not be true at all. (Note again: you also don't accept the alternative, since you can never be 100% sure it is true; there is still a 1% chance that you found this extreme result by luck and that the null hypothesis is still true.)
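
As a minimal sketch of those two situations in R (the sample values below are made up for illustration, and `near_null` / `far_null` are hypothetical names; `t.test` with `mu = 3, alternative = "greater"` runs the one-sided test described above):

near_null <- c(2.8, 3.4, 2.9, 3.3, 3.1, 2.7, 3.2, 3.0, 2.6, 3.5)  # sample mean about 3.05
far_null  <- c(4.6, 5.3, 4.9, 5.4, 5.1, 4.8, 5.2, 5.0, 4.7, 5.5)  # sample mean about 5.05

t.test(near_null, mu = 3, alternative = "greater")  # p around 0.3: nothing unusual under H0
t.test(far_null,  mu = 3, alternative = "greater")  # p far below 0.001: very unusual under H0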

I hope this clarifies that extreme results do not favor the null hypothesis, since they are exactly the evidence against it that you need.

If you instead set the significance level to 0.9, as you suggest (so you reject the null hypothesis for all p-values lower than 0.9), you are going to draw a lot of wrong conclusions. A p-value of 0.85 says that there is an 85% probability of finding a result at least as extreme as yours under the null hypothesis, so such a result is very common and gives no evidence against the null hypothesis at all. Yet because 0.85 is below your significance level of 0.9, you would still reject the null hypothesis and quite possibly make a type I error. Note also that your significance level is your type I error probability: setting it to 0.9 means that, whenever the null hypothesis is actually true, you will wrongly reject it about 90% of the time.
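
A quick simulation sketch of that last point (the sample size, mean, sd, and number of repetitions are arbitrary choices for illustration):

# Simulate many experiments in which the null hypothesis (true mean = 3) is TRUE
set.seed(1)
p_vals <- replicate(10000, t.test(rnorm(20, mean = 3, sd = 1), mu = 3)$p.value)

mean(p_vals < 0.05)  # about 0.05: rejecting at the 5% level is wrong ~5% of the time
mean(p_vals < 0.90)  # about 0.90: rejecting at the 90% level is wrong ~90% of the time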

In conclusion: you really want to require a low p-value before rejecting, so that type I errors stay as rare as possible.

  • I mostly agree, but see my answer. Sometimes type 2 error is much worse than type 1. – Peter Flom, Feb 19 at 14:52
  • Certainly! I focussed here on the type I error since it seems that the OP was struggling with the basic concepts, but it is equally important to be aware of the trade-off between type 1 and type 2 errors and their uses and (dis)advantages. – Commented Feb 19 at 15:45

You wrote:

If I understand correctly 'more extreme' are results in favour of the null hypothesis, because then a high p-value is in favour of it.

This is just the opposite of the case: more extreme results give a lower p. Let's demonstrate:

set.seed(1234)

x1 <- rnorm(100, 1, 5) #Random normal, mean 1, sd 5
x2 <- rnorm(100, 5, 5)  #Random normal, mean 5, sd 5

t.test(x1) # one-sample t-test of H0: true mean = 0; p = .67
t.test(x2) # p = 0.0000000000000002

Of course, p is also affected by sample size, and the effect size is affected not just by the difference from 0, but by the sd.

However, you are on to something when you complain about p < 0.05 being so generally accepted. If you lower the threshold for p, you decrease the type 1 error rate, but you increase the type 2 error rate. Often we accept power of 0.8, i.e. $\beta = 0.2$, which, combined with $\alpha = 0.05$, implicitly treats type 1 errors as four times as serious as type 2 errors. This may be reasonable, or it may be nonsensical. Suppose, for example, that you have come up with a new treatment for a terminal disease that has no known treatments. Then a type 1 error means you give an ineffective treatment to someone who is dying, while a type 2 error means you fail to cure people.
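
To see the trade-off numerically, here is a small sketch (the effect size, sd, and sample size are arbitrary; `power.t.test` comes with base R's stats package):

# Same design at two significance levels: a stricter alpha buys a lower type 1
# error rate at the price of lower power (i.e. a higher type 2 error rate)
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05)$power   # higher power
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.005)$power  # lower power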

Even Ronald Fisher once said that researchers should vary the threshold to suit the case, but, while I remember reading this, I cannot find the citation.

However, a threshold of 0.9 seems preposterous; it's hard to think of a case where that could be useful.

  • Nice answer (+1), Peter. Out of curiosity, what was the statement sort of like that you attributed to Fisher? I would love to dig in some sources, if you can frame the statement. – Commented Feb 20 at 9:19
  • @User1865345 It was something like "no sane researcher uses the same significance level in all cases." – Peter Flom, Feb 20 at 12:04
  • Thanks, Peter. That's a reasonable statement. – Commented Feb 20 at 12:12
  • Peter, I asked a question at hsm to which I got a comprehensive answer. I think you would find it interesting: Did Ronald Fisher ever say anything on varying the threshold of significance level? – Commented Feb 21 at 9:50
  • Thanks @User1865345. That looks like an interesting site! – Peter Flom, Feb 21 at 11:51

Sometimes it is easier to look at a visual representation to better understand the 'as extreme as or more extreme' part of the definition of the p-value, or the p-value itself.

Suppose you repeat an experiment over and over again under the exact same conditions. In each experiment you measure a sample of 10 observations and calculate the sample mean, standard error, and t-statistic. The t-statistic in its simplest form is just the difference between the sample mean and some hypothesized parameter value (let's say it's zero, for simplicity) divided by the standard error of the mean of the respective sample:

$$t=\frac{\bar{x}-\mu_{0}}{s/\sqrt{n}}$$
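
For instance (a made-up simulated sample of 10 values; `t.test` reports exactly the statistic this formula gives):

set.seed(42)
x <- rnorm(10, mean = 0, sd = 5)                         # one simulated experiment
t_by_hand <- (mean(x) - 0) / (sd(x) / sqrt(length(x)))   # the formula above
t_by_hand
t.test(x, mu = 0)$statistic                              # identical to t_by_hand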

Then you plot the t-statistics in a histogram or density plot (see the purple line in the plot below). This distribution is known as a sampling distribution, and it characterizes your experiments with respect to the underlying unknown parameter. As you repeat the experiment more and more often, the sampling distribution converges to Student's t-distribution, whose area under the curve integrates to 1 (the blue and red shaded areas in the plot below).

[Plot: density of the simulated t-statistics (purple) overlaid on Student's t-distribution, with the central 95% shaded blue and the 2.5% tails shaded red.]

Now let's forget about the sampling distribution, which we never directly observe anyway, and instead focus on Student's t-distribution, which serves as a very good approximation of the sampling distribution (provided the assumption of independent and random sampling is satisfied). The shape of the distribution depends only on the degrees of freedom of your sample ($n-1$).

Now you can ask, for example: which values will I observe 95% of the time, given this experiment? This is the blue-shaded area, and all values that fall within it are in agreement with the null hypothesis, i.e. with the claim that those values were generated by the same data generating process (experimental condition). But we still have to account for the remaining 5% under the curve. We can shove those into the tails, i.e. 2.5% left and 2.5% right, which are the red-shaded areas.

Any value that falls into those extremes (extreme in the sense of far away from the mean of the distribution) is not in agreement with the null hypothesis, and we would reject the null hypothesis, meaning that we believe such a value likely originated from a different experimental population. This would be a significant result, and the probability of observing such values is 0.05 or smaller. Remember that areas under the curve represent probabilities, so if you land in the red extremes, the area is small and hence the probability, the p-value, is small.
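
In numbers, for the 9 degrees of freedom used here (the observed t of 3 is a made-up value for illustration):

qt(c(0.025, 0.975), df = 9)   # boundaries of the blue region, roughly -2.26 and 2.26
2 * (1 - pt(3, df = 9))       # two-sided tail area for an observed t of 3, about 0.015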

From that you can also see that the p-value is always a probability statement about the observed data, not a probability that the null or alternative hypothesis is true. A small p-value indicates that data at least this extreme are unlikely (but not impossible) if the null hypothesis is true.

Here's the code I used, just in case you were wondering. You can change those values and play around with it. Simulation is very helpful in understanding these concepts.

# required package
library(tidyverse)

# create a simulation function for 95% confidence intervals and p-values
simulation <- function(n, mu, stdev) {
  s <- rnorm(n, mu, stdev)
  tibble(
    N = length(s),
    sample_mean = mean(s),
    sample_sd = sd(s),
    sample_se = sample_sd / sqrt(N),
    confint_95 = sample_se * qt(0.975, N - 1),
    t_statistic = (sample_mean - mu) / sample_se,
    p_value = (1 - pt(abs(t_statistic), N - 1)) * 2
  )
}

# specify parameters and sample size for each draw or experiment
n <- 10 # sample size
mu <- 0 # population mean
stdev <- 5 # population standard deviation


############ SIMULATION ##############

# choose number of repetitions
sim <- 1:1000

# rerun experiment under exact same conditions
map(sim, ~ simulation(n, mu, stdev)) %>%
  bind_rows() %>%
  mutate(experiment_id = 1:length(sim)) -> draws

# plot: Student's t density with the central 95% shaded blue and the 2.5% tails shaded red,
# overlaid with the density of the simulated t-statistics (purple)
ggplot(data = draws, aes(x = t_statistic)) + 
  geom_area(fun = dt, args = list(df = n-1), stat = "function", fill = "red") +
  geom_area(fun = dt, args = list(df = n-1), stat = "function", fill = "steelblue", xlim = c(qt(p = 0.025, df = n-1), qt(p = 0.975, df = n-1))) +
  stat_function(fun = dt, args = list(df = n-1), color = "steelblue", linewidth = 1) +
  geom_density(color = "purple", linewidth = 1.5) +
  scale_x_continuous(limits = c(-5,5)) +
  labs(x = "t-values", y = "Probability density", title = paste("t-distribution with", n-1, "degrees of freedom"))
