Most appropriate correlation test for continuous and binary variables for non-normally distributed dataset with a high sample size

Question

I have a dataset with N ≈ 12800 and two types of variables: an independent continuous variable (distance in m) and a dependent binary variable (yes/no) associated with each distance. I have to test the correlation between the two variables. The test that seemed appropriate was the point-biserial correlation. However, the dataset violates the assumptions underlying this test: the continuous variable is not normally distributed, it has outliers that cannot be removed, and its variance is not equal across the two binary groups.

I have read conflicting arguments about whether normality can be assumed given the very large sample size. However, the normality tests (Jarque-Bera, Kolmogorov-Smirnov) indicate that the data are not normally distributed.

Considering this, which test would be the most appropriate for testing the correlation between the two variables? From my research, Spearman's ρ seems the most appropriate; however, I am not sure whether it is valid when one of the variables is binary.

There are many similar questions on site stats.stackexchange.com/questions/363543/…, stats.stackexchange.com/questions/558842/…, stats.stackexchange.com/questions/108007/…, stats.stackexchange.com/questions/522944/… — kjetil b halvorsen, Commented Jul 8, 2022 at 12:55
Do you have measurements at multiple distances? It sounds like a logistic regression may be the best approach for you. — mkt, Commented Jul 8, 2022 at 12:55

kqr · Accepted Answer · 2022-07-08 13:24:14Z


One option is percent concordant. I don't know how common it is, but Kahneman sometimes uses it as an alternative to Pearson's r. It's defined as "for two data points a and b in which X_a is false and X_b is true, what's the probability that Y_a is smaller than Y_b?"
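As a rough illustration of that definition, here is a minimal Python sketch. The function name `percent_concordant`, the toy data, and the choice to count tied Y values as half concordant are my own assumptions; the definition above does not say how ties should be handled.

```python
import numpy as np


def percent_concordant(y, x):
    """Estimate P(Y_a < Y_b) for a pair (a, b) with X_a false and X_b true.

    y: continuous values (e.g. distances in m), x: binary labels.
    Ties in y are counted as half concordant (an assumption, see lead-in).
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=bool)
    y_false = np.sort(y[~x])  # sorted so searchsorted can count by bisection
    y_true = y[x]
    if y_false.size == 0 or y_true.size == 0:
        raise ValueError("both binary groups must contain at least one point")
    # For every 'true' value, count 'false' values strictly below it
    # and 'false' values at or below it; the difference is the tie count.
    below = np.searchsorted(y_false, y_true, side="left")
    at_or_below = np.searchsorted(y_false, y_true, side="right")
    ties = at_or_below - below
    concordant = below.sum() + 0.5 * ties.sum()
    return float(concordant / (y_false.size * y_true.size))


# Hypothetical toy data: 3 of the 4 (false, true) pairs are concordant.
y_demo = [1.0, 3.0, 2.0, 4.0]
x_demo = [False, False, True, True]
print(percent_concordant(y_demo, x_demo))  # 0.75
```

A value of 0.5 means the binary variable carries no information about the ordering of Y, while values near 0 or 1 indicate a strong association. With ties counted as one half, this is the same number as the Mann-Whitney U statistic divided by the product of the two group sizes, so it needs no normality or equal-variance assumption and depends only on ranks, which limits the influence of outliers.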

