As Peter noted, the Spearman and point-biserial correlations are just special cases of the Pearson correlation: the former is the Pearson correlation of the ranks, while the latter is the Pearson correlation with a dummy-coded variable (Cohen et al., 2003, pp. 60-62). And like he said, correlations on their own aren't great indicators of which variables are good for a model. In fact, they can be quite misleading depending on the reasons for the association, the classic example being the correlation between ice cream sales and shark attacks. Point-biserial seems the more obvious option given that it is meant for binary variables, whereas Spearman will naturally end up with many ties (since the binary variable only takes the values $0$ and $1$). Sometimes the difference between the two will be pretty trivial, so it may not matter much.
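To make the "special cases of Pearson" point concrete, here is a minimal sketch in plain Python. The data are made up purely for illustration, and the helper names (`pearson`, `average_ranks`, `point_biserial`) are my own rather than anything from a library. It checks that the textbook point-biserial formula gives the same number as Pearson on the dummy-coded variable, and computes Spearman as Pearson on tie-averaged ranks.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def point_biserial(binary, y):
    """Textbook formula: group mean difference scaled by the population SD."""
    n = len(y)
    y1 = [v for b, v in zip(binary, y) if b == 1]
    y0 = [v for b, v in zip(binary, y) if b == 0]
    mean_all = sum(y) / n
    sd = math.sqrt(sum((v - mean_all) ** 2 for v in y) / n)
    m1, m0 = sum(y1) / len(y1), sum(y0) / len(y0)
    return (m1 - m0) / sd * math.sqrt(len(y1) * len(y0) / n ** 2)

# Hypothetical data: a 0/1 indicator and a skewed, count-like variable.
group = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
freq = [3, 12, 5, 40, 150, 80, 9, 600, 220, 7]

r_pearson = pearson(group, freq)           # Pearson on the dummy code
r_pb = point_biserial(group, freq)         # same value, different formula
r_spearman = pearson(average_ranks(group), average_ranks(freq))

print(r_pearson, r_pb, r_spearman)
```

Note that the binary variable collapses to just two tied rank values, which is the "many ties" issue mentioned above.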
There are some important points you may want to consider with respect to modeling, since that seems to be the implied primary focus here. First off, it is well known in linguistics and psycholinguistics that word frequency is approximately log-normally distributed (hence the skew you see). Because the raw values bunch up at the low end with a long right tail, any model using raw frequency will show a very compressed association, so it is often transformed with the natural log, log2 or log10 (each of which has some pros and cons depending on what you are after). See this page on how to interpret log-transformed predictors, as this seems to fly under the radar for people who transform. Note that one doesn't have to do this, and it will depend on what your associations actually look like.
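As a rough illustration of what the transformation does, the sketch below simulates a log-normal variable as a stand-in for word frequency (the parameters `2.0` and `1.5` are arbitrary assumptions, not estimates from any corpus) and compares the skewness before and after taking log10. It also shows the usual reading of a log2 predictor using hypothetical coefficients: doubling the predictor shifts the prediction by exactly the slope.

```python
import math
import random

def skewness(values):
    """Population skewness: third central moment divided by SD cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(1)
# Simulated stand-in for word frequencies; mu and sigma are arbitrary.
raw = [random.lognormvariate(2.0, 1.5) for _ in range(5000)]
logged = [math.log10(v) for v in raw]

skew_raw = skewness(raw)      # strongly right-skewed
skew_log = skewness(logged)   # roughly symmetric after the transform

# Interpretation with a log2 predictor: in y = b0 + b1 * log2(x),
# doubling x raises log2(x) by exactly 1, so y changes by b1.
b0, b1 = 1.0, 0.5  # hypothetical coefficients

def predict(x):
    return b0 + b1 * math.log2(x)

effect_of_doubling = predict(200) - predict(100)

print(skew_raw, skew_log, effect_of_doubling)
```

The same logic holds for log10 (a one-unit change is a tenfold increase) and the natural log (a one-unit change is a factor of about 2.72), which is part of why the choice of base mostly affects interpretation rather than fit.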
Though you have a fair number of observations, you also have a pretty obscene number of predictors, and I doubt all 50-something of them are important. If your sole goal is prediction, you could probably get away with a regularization technique like lasso or ridge, but it may be more helpful to simply think about which predictors actually make sense for your problem in the first place and keep only the ones that are at least worthy of inclusion. As it stands, your ratio of $n$ to $k$ nets you about $80$ observations per predictor, which some may consider very low. Brysbaert (2019), for example, states that you should have a minimum of $100$ observations for the association between two variables in a low-association scenario, plus $100$ per added predictor in a regression.
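The arithmetic behind that rule of thumb can be sketched as follows. The counts `4000` and `50` are hypothetical round numbers chosen only to reproduce a ratio of about $80$; plug in your own $n$ and $k$. The function names are mine, and the second function is just one simple reading of the "100 per predictor" guideline.

```python
def observations_per_predictor(n_obs, n_predictors):
    """Average number of observations available per predictor."""
    return n_obs / n_predictors

def max_predictors(n_obs, per_predictor=100):
    """Largest predictor count that keeps at least `per_predictor` observations each."""
    return n_obs // per_predictor

# Hypothetical counts, not taken from the question.
n_obs, n_predictors = 4000, 50

ratio = observations_per_predictor(n_obs, n_predictors)
budget = max_predictors(n_obs)

print(ratio)   # 80.0
print(budget)  # 40
```

Under these assumed numbers, you would need to drop roughly ten predictors (or collect more data) to reach $100$ observations per predictor.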
References
- Brysbaert, M. (2019). How many participants do we have to include in properly powered experiments? A tutorial of power analysis with reference tables. Journal of Cognition, 2(1), 1–38. https://doi.org/10.5334/joc.72
- Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). L. Erlbaum Associates.