
I am playing around with the Spambase dataset from UCI (4601 instances), which is also available in the bayesreg R package.

The target is binary with a class distribution of roughly $60{-}40\%$ $(0/1)$. There are $57$ predictors, which are min-max scaled. The predictors are heavily right skewed, as each is the frequency of a given word in an email, so naturally most values fall near $0$.

Would Spearman or point-biserial correlation be the more appropriate metric for quickly checking which features are decent predictors? From my understanding, Spearman can be used with continuous and ordinal variables, while point-biserial is explicitly for a continuous variable vs. a dichotomous one. I understand that I am comparing parametric vs. nonparametric methods, but why does a parametric method seem to describe non-linear data better?

I don't really understand the benefits of using one over the other in this situation.


2 Answers


As Peter noted, Spearman and the point-biserial correlation are just special cases of the Pearson correlation: the former is the Pearson correlation of the ranks, while the latter simply applies the Pearson formula to a dummy-coded variable (Cohen et al., 2003, pp. 60-62). And as he said, correlations on their own aren't great indicators of which variables belong in a model. In fact, they can be quite misleading depending on the reasons for the associations, the classic example being the correlation between ice cream sales and shark attacks. Point-biserial seems the more obvious option here, given that it is meant for binary variables, whereas Spearman will end up with many tied ranks (the outcome takes only the values $0$ and $1$, and most of the frequency predictors are exactly $0$). Sometimes the difference between them is pretty trivial, so it may not matter much.
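
For a quick look, both can be computed directly with cor(). Here is a minimal sketch, assuming the bayesreg copy of Spambase loads as a data frame named spambase with the $0/1$ spam indicator in the last column (adjust the name or index if your copy is laid out differently):

```r
# Quick bivariate screen with both correlations (column layout of spambase assumed).
library(bayesreg)
data(spambase)

y <- as.numeric(spambase[[ncol(spambase)]])   # 0/1 target, assumed to be the last column
X <- spambase[, -ncol(spambase)]              # the 57 word/character frequency predictors

screen <- data.frame(
  predictor      = names(X),
  point_biserial = sapply(X, function(x) cor(x, y)),                      # Pearson on the dummy-coded target
  spearman       = sapply(X, function(x) cor(x, y, method = "spearman"))  # Pearson on the ranks
)

# Predictors with the strongest bivariate associations
head(screen[order(-abs(screen$point_biserial)), ], 10)
```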

There are some important points you may want to consider with respect to modeling, since that seems to be the implied primary focus here. First, it is well known in the linguistics and psycholinguistics literature that word frequency follows a roughly log-normal distribution (hence the skew you see). Because a model using the raw frequencies will see a very compressed association, the variable is often transformed with the natural log, log2, or log10 (each with pros and cons depending on what you are after). See this page on how to interpret log-transformed predictors, as this seems to fly under the radar for people who transform. Note that you don't have to do this, and whether it helps depends on what your associations actually look like.
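
As an illustration (not something your data necessarily need), log1p, i.e. $\log(1+x)$, is one common way to spread out zero-inflated frequency variables while keeping the zeros finite. Same assumption about the spambase layout as above:

```r
# Sketch: log1p transform for heavily right-skewed, zero-inflated frequencies.
library(bayesreg)
library(e1071)                        # one of several packages with a skewness() function
data(spambase)

X     <- spambase[, -ncol(spambase)]  # predictors, target assumed to be the last column
X_log <- as.data.frame(lapply(X, log1p))

# Skewness before and after, for the first few predictors
rbind(raw    = sapply(X[, 1:5], skewness),
      logged = sapply(X_log[, 1:5], skewness))
```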

Though you have a fair number of observations, you also have a pretty obscene number of predictors, and I'm guessing not all of them are terribly important. If your sole goal is prediction, you could probably get away with a regularization technique like the lasso or ridge, but it may be helpful to first think about which predictors actually make sense for your problem. I doubt all 50-something of them matter, so I would decide which ones are at least worthy of inclusion (your $n$-to-$k$ ratio nets you about $80$ observations per predictor, which some may consider very low). Brysbaert (2019), for example, recommends a minimum of $100$ observations for an association between two variables when that association is weak, and about $100$ more per added predictor in a regression.
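
If prediction really is the main goal, a regularized fit over all predictors is a simple alternative to bivariate screening. A sketch with the lasso via glmnet, under the same assumption about the spambase layout:

```r
# Sketch: lasso-penalised logistic regression as a whole-model alternative to screening.
library(bayesreg)
library(glmnet)
data(spambase)

y <- spambase[[ncol(spambase)]]               # 0/1 target, assumed to be the last column
X <- as.matrix(spambase[, -ncol(spambase)])   # predictor matrix

set.seed(1)
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)   # alpha = 1 gives the lasso

# Coefficients at the 1-SE lambda; predictors shrunk to exactly zero drop out
coef(cvfit, s = "lambda.1se")
```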

References

  • Brysbaert, M. (2019). How many participants do we have to include in properly powered experiments? A tutorial of power analysis with reference tables. Journal of Cognition, 2(1), 1–38. https://doi.org/10.5334/joc.72
  • Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Lawrence Erlbaum Associates.
  • Thank you for the extended explanation. This makes sense! I really appreciate it. Commented May 5 at 23:20

First, there's the question of whether any correlation is a good screening tool here. Bivariate screening isn't really a good method, because it can leave out variables that are important (for instance, ones that only matter in combination with others). But leaving that aside (variable selection has been discussed extensively here):

Point-biserial is for when one variable is dichotomous. Back in the old days it was used because it has an easier hand-calculation formula than the Pearson, to which it is mathematically equivalent (see Wikipedia and many other sources). So, treat the dichotomous variable as if it were continuous and ask yourself whether you would use Spearman (which works on ranks) or Pearson (which works on the actual values).
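
As a small check of that equivalence (toy data here, not the Spambase variables; any 0/1 variable and numeric predictor will do), the textbook point-biserial formula reproduces cor() exactly:

```r
# Point-biserial from its textbook formula vs. Pearson on a 0/1 dummy (toy data).
set.seed(1)
y <- rbinom(200, 1, 0.4)        # dichotomous variable
x <- rexp(200) + 0.5 * y        # right-skewed numeric variable, shifted by group

m1 <- mean(x[y == 1]); m0 <- mean(x[y == 0])
n1 <- sum(y == 1);     n0 <- sum(y == 0); n <- n1 + n0

r_pb <- (m1 - m0) / sd(x) * sqrt(n1 * n0 / (n * (n - 1)))   # point-biserial formula
r_p  <- cor(x, y)                                           # ordinary Pearson

all.equal(r_pb, r_p)   # TRUE: they are the same quantity
```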

The choice between Spearman and Pearson has also been discussed here. See this thread and this one (and more that are referenced within those).

