I have a sample of 1 million articles from the web with various features. I'm in the process of selecting features to use in a metric/predictor for article quality. To get some insight into the data and into which features might be best, I computed correlations between features.
I ran into the following issue: for the features A = "article views" and B = "thumbs up count", the correlation is 0.32 (Pearson) or 0.26 (Spearman). Intuition suggests that the two should indeed be correlated. The features themselves look exponentially distributed to me (very many small values, very few large values). I wanted a graphical view of the association, but the scatter plot did not reveal anything, let alone a linear relationship.
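For reference, this is roughly what I did to compute the correlations and the scatter plot (a minimal sketch; the file path and the column names `views` and `thumbs_up` are placeholders for my actual data):

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr

# One row per article; column names are placeholders
df = pd.read_csv("articles.csv")
a = df["views"]       # feature A: article views
b = df["thumbs_up"]   # feature B: thumbs up count

print("Pearson: ", pearsonr(a, b)[0])   # ~0.32 on my data
print("Spearman:", spearmanr(a, b)[0])  # ~0.26 on my data

# Raw scatter plot of 1M heavily skewed points: no visible structure
plt.scatter(a, b, s=1, alpha=0.1)
plt.xlabel("article views (A)")
plt.ylabel("thumbs up count (B)")
plt.show()
```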
So I aggregated the data as follows:
- Order all data points by A (article views).
- Divide the list into n = 100 equally sized chunks.
- Compute sum(A) and sum(B) for all chunks.
Now when I plot the 100 value pairs, sum(B) against sum(A), the result is an almost perfectly straight line (except for a minor aberration at the beginning). The Pearson correlation of the chunk sums is almost 1.
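In code, the aggregation looks roughly like this (again a sketch, with the same placeholder file path and column names as above):

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

df = pd.read_csv("articles.csv")  # same placeholder data as above

# Sort by A and assign each article to one of 100 equally sized chunks
df_sorted = df.sort_values("views").reset_index(drop=True)
df_sorted["chunk"] = df_sorted.index * 100 // len(df_sorted)

# Per-chunk sums of A and B
sums = df_sorted.groupby("chunk")[["views", "thumbs_up"]].sum()

print("Pearson on chunk sums:",
      pearsonr(sums["views"], sums["thumbs_up"])[0])  # almost 1 on my data

# Plot sum(B) against sum(A) for the 100 chunks
plt.plot(sums["views"], sums["thumbs_up"], "o")
plt.xlabel("sum(A) = total views per chunk")
plt.ylabel("sum(B) = total thumbs up per chunk")
plt.show()
```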
What does this show / What should I make of this?
Does this mean that there is a strong dependency between A and B "in general", but that for individual articles there is "noise"? Could this have something to do with the ecological fallacy? Would you suggest a different way of exploring the association between these variables?