$\begingroup$

I have a dataset (N ≈ 12,800) with two variables: a continuous independent variable (distance in m) and a dependent binary variable (yes/no) recorded for each distance. I need to test the correlation between the two. Point-biserial correlation seemed appropriate, but the data violate its underlying assumptions: the continuous variable is not normally distributed, it has outliers that cannot be removed, and its variance is not equal across the two binary groups.

I have read conflicting arguments about whether normality can be assumed given the very large sample size. However, the normality tests (Jarque-Bera, Kolmogorov-Smirnov) indicate that the data are not normally distributed.

Given this, which test would be most appropriate for the correlation between the two variables? From what I have read, Spearman's ρ seems the most appropriate, but I am not sure whether it is valid when one of the variables is binary.

$\endgroup$

1 Answer

$\begingroup$

One option is percent concordant. I don't know how common it is, but Kahneman sometimes uses it as an alternative to Pearson's r. It is defined as follows: for two data points a and b in which X_a is false and X_b is true, what is the probability that Y_a is smaller than Y_b? (Here X is the binary variable and Y the continuous one.)
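As a rough illustration, the definition above can be computed directly with NumPy. This is only a sketch: the function name `percent_concordant` and the toy arrays are made up for the example, and ties are counted as not concordant because the definition says "smaller than". When there are no ties, I believe this is the same quantity as the Mann-Whitney U statistic divided by the number of pairs (i.e. the AUC), so it makes no normality or equal-variance assumption.

```python
import numpy as np

def percent_concordant(y, x):
    """Fraction of (a, b) pairs with x[a] False and x[b] True where y[a] < y[b].

    y: continuous values (e.g. distances); x: binary values (False/True).
    Hypothetical helper written for illustration only.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=bool)
    y_false = np.sort(y[~x])   # continuous values in the "no" group, sorted
    y_true = y[x]              # continuous values in the "yes" group
    if y_false.size == 0 or y_true.size == 0:
        raise ValueError("both binary groups must be non-empty")
    # For each "yes" value, count the "no" values that are strictly smaller.
    n_smaller = np.searchsorted(y_false, y_true, side="left")
    return n_smaller.sum() / (y_false.size * y_true.size)

# Toy data (not from the question): "no" group {1, 3}, "yes" group {2, 4}.
toy_y = [1.0, 3.0, 2.0, 4.0]
toy_x = [False, False, True, True]
result = percent_concordant(toy_y, toy_x)  # 3 of 4 pairs concordant -> 0.75
```

A value near 0.5 means the binary outcome tells you little about the ordering of the distances; values near 0 or 1 indicate a strong association in one direction or the other.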

$\endgroup$
