$\begingroup$

I have created a plot of the regression slope of sea surface temperatures (x) and an atmospheric variable (y). However, I need to test the statistical significance of these trends using a non-parametric test (one that doesn't assume the data are normally distributed). Specifically, I am trying to use the Mann-Whitney U-test, as it was suggested by a reviewer (but I'm open to whatever will work and allow me to compare against results from the Student's t-test). To do this, I have already calculated the regression slope to compare against a situation where the slope is zero. Then, I created an array of zeros the same size as my regression array to pass into the statistical function:

### Try out the Mann-Whitney U-test: #### DOESN'T WORK RIGHT NOW!!!!!
import numpy as np
from scipy.stats import mannwhitneyu

zero = np.zeros(np.shape(regression))   ### shape: (721, 1440)
test = mannwhitneyu(zero, regression)   ### regression shape: (721, 1440)

However, after running the test, I end up with a one-dimensional array of size 1440:

psave = test[1]   ### This array is only a single dimension (1440)

Ultimately, I would like to end up with an array of p-values from a non-parametric test of the shape (721,1440) testing the significance of my regression slope. Thank you for any and all help!!
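A minimal sketch of where the size-1440 result comes from (assuming a recent SciPy; the small shapes here are stand-ins for (721, 1440)): `mannwhitneyu` compares its two inputs along `axis=0` by default, so each column of a 2D array is tested separately and you get one p-value per column, not a p-value per element:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
data = rng.normal(size=(6, 4))   # stand-in for the (721, 1440) regression array
zero = np.zeros_like(data)

# By default mannwhitneyu compares the samples along axis 0, so each
# of the 4 columns is tested separately -> 4 p-values, not a (6, 4) array.
res = mannwhitneyu(zero, data)
print(res.pvalue.shape)   # (4,)
```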

Referencing my question from StackOverflow: https://stackoverflow.com/q/76996090/22121415

$\endgroup$
  • $\begingroup$ First, regression does NOT assume the data are normally distributed. Second, if you want to do nonparametric regression, then do that, not OLS regression. $\endgroup$
    – Peter Flom
    Commented Aug 29, 2023 at 20:53
  • $\begingroup$ It's not clear to me where your shape of (721, 1440) comes from; 1440 might be from the number of minutes in a day, but why would those be in columns? $\endgroup$
    – jbowman
    Commented Aug 29, 2023 at 22:11
  • $\begingroup$ I am more concerned with the data not being normally distributed for a t-test... are you saying that if I am doing a regression, I don't need to worry about normality? Sorry, statistics is clearly not my strong suit. As for the shape of the data, the dimensions are lat and lon. It started as 3D data (time, lat, lon), and after regressing over the time dimension, I'm left with just lat and lon (721, 1440) respectively. $\endgroup$ Commented Aug 29, 2023 at 22:32
  • $\begingroup$ Did you run one regression for each lat, lon combination? $\endgroup$
    – jbowman
    Commented Aug 29, 2023 at 22:37
  • $\begingroup$ You're asking about regression slope (which is PAIRED data, measured with different variables, and different units). It makes no sense to use a Mann-Whitney (unpaired data on the same variable in the same units). I agree with Peter Flom, the marginal distribution of the variables isn't relevant. Did you definitely want a test of a linear relationship? or would any monotonic relationship work? Or a non-monotonic relation? $\endgroup$
    – Glen_b
    Commented Aug 29, 2023 at 22:38

1 Answer

$\begingroup$

First, I'm not sure that the Mann-Whitney U Test is the right approach in this instance, but I'd be happy to be informed otherwise!

In a simple linear regression you are estimating the regression coefficients, the $\beta$'s, in the formula

$$ y_{i} = \beta_0 + \beta_1 x_{i} + \epsilon_{i} $$

Software will often assume that $\epsilon_{i} \overset{iid}{\sim} N(0, \sigma_{\epsilon}^{2})$ (i.e., that the error terms are independent, identically distributed with mean zero, and follow a Normal distribution).

As @peter-flom points out, this is slightly different from assuming that the data themselves are normally distributed: the normality assumption, when it is made, applies to the conditional distribution $Y \mid X$, not to the marginal distributions of the variables.

At any rate, the distribution of $\hat{\beta}_1$, the slope estimator, depends on the distribution of the error terms, and it sounds like those are not normally distributed. That is perhaps not the end of the world, as you have the Central Limit Theorem to rely on; depending on the number of observations and how much the error distribution deviates from normality, a $t$-test might be fine.

Assuming it's not "fine", however, a straightforward non-parametric test of the regression coefficients is contained in Introduction to Modern Statistics, Section 24.2 on p. 453 (a pdf of the book can be obtained for free from its website).

The idea is to obtain many observations of $\hat{\beta}_{1}$ under the null hypothesis, $H_{0}: \beta_{1} = 0$, via simulation. You do this by:

  1. permuting the order of your $y$-variable (which breaks any association between $x$ and $y$);
  2. estimating the regression coefficients on the permuted data and storing the slope;
  3. repeating steps 1 and 2 many times (e.g., thousands);
  4. comparing your observed $\hat{\beta}_{1}$ estimate to the estimates you have just simulated via randomization.

For a two-sided alternative hypothesis, $H_{A}: \beta_{1} \neq 0$, this comparison part in 4. boils down to something like:

pval_sim = (sum(abs(beta_simulated) >= abs(beta_observed)) + 1) / (number_of_simulations + 1)

The $+1$'s in the numerator and denominator are because you treat the original data as one of the permutations. In practice this correction rarely changes the result much.
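The steps above can be sketched in Python for a single (lat, lon) series. The data here are made up for illustration (deliberately heavy-tailed errors), so substitute your own `x` and `y`:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data for one grid cell; replace x and y with your own series.
n = 100
x = rng.normal(size=n)
y = 0.5 * x + rng.standard_t(df=3, size=n)   # non-normal (heavy-tailed) errors

def ols_slope(x, y):
    # OLS slope estimate for a single predictor: cov(x, y) / var(x)
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

beta_obs = ols_slope(x, y)

# Permuting y breaks the x-y association, simulating H0: beta_1 = 0
n_sim = 2000
beta_sim = np.array([ols_slope(x, rng.permutation(y)) for _ in range(n_sim)])

# +1 in numerator and denominator: the observed data count as one permutation
pval_sim = (np.sum(np.abs(beta_sim) >= np.abs(beta_obs)) + 1) / (n_sim + 1)
print(pval_sim)
```

Looping this over all 721 × 1440 grid cells is then a matter of applying the same recipe cell by cell (or vectorizing the slope computation over the grid).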

You end up with a simulated $p$-value that you can compare to your chosen significance level, and also compare with the parametric $p$-value reported by your software (Python, it looks like, though sadly I am not familiar with the output it provides).
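For the parametric side of that comparison, one option (an assumption on my part that SciPy is available in your toolchain) is `scipy.stats.linregress`, which reports the usual $t$-test $p$-value for $H_{0}: \beta_{1} = 0$ alongside the slope:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)   # made-up data for illustration

# linregress returns the slope estimate and the t-test p-value for the slope,
# which you can place next to the permutation p-value.
res = linregress(x, y)
print(res.slope, res.pvalue)
```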

You should also make sure to look at diagnostic plots (Python won't produce these automatically, but residuals-vs-fitted and Q-Q plots are straightforward to make yourself).

$\endgroup$
