$\begingroup$

I have a sample of 1 million articles from the web with various features. I'm in the process of selecting features to use in a metric/predictor for article quality. To get some insight into the data and which features may be best, I computed correlations between features.

The following issue occurred: for the features A="article views" and B="thumbs up count" the correlation is 0.32 (Pearson) or 0.26 (Spearman). Intuition suggests that the two should indeed be correlated. Each feature by itself looks exponentially distributed to me (very many small values, very few large values). I wanted to look at the correlation graphically, but the scatter plot did not reveal any pattern, let alone a linear association.

So I aggregated the data as follows:

  1. Order all data points by A (article views).
  2. Divide the list into n=100 equal-sized chunks.
  3. Compute sum(A) and sum(B) for each chunk.

Now when I plot the 100 value pairs, sum(B) against sum(A), I get an almost perfect straight line (except for a minor aberration at the beginning). The Pearson correlation is almost 1.

What does this show / What should I make of this?

Does this mean that there is a strong dependency between A and B "in general", but that for the individual articles there is "noise"? Could it have something to do with the ecological fallacy? Would you suggest a different way of exploring the association between these variables?

$\endgroup$
  • $\begingroup$ Can you post the 2 scatterplots? What are the 2 r's? $\endgroup$ Commented May 5, 2015 at 16:50
  • $\begingroup$ @gung Yes, this is the raw plot and this the aggregated one. $\endgroup$
    – PhillipM
    Commented May 6, 2015 at 7:27
  • $\begingroup$ The formula in the second graph does not conform to data. The intercept and slope are way off. Maybe that is a problem? $\endgroup$
    – mpiktas
    Commented May 6, 2015 at 7:34
  • $\begingroup$ @mpiktas I'm sorry, I forgot to label the axes: in the aggregated plot the X-axis is in millions and the Y-axis is in thousands. $\endgroup$
    – PhillipM
    Commented May 6, 2015 at 7:47

1 Answer

$\begingroup$

The issue is with the binning. When you order the data by $A$, divide them into 100 equal-sized bins, and then sum the values within each bin, you impose an ordering on the bin sums: the first bins have the lowest sums of $A$ and the last bins the highest. This is perfectly normal, because that is how the bins were constructed.

Here is a simple simulation for illustration.

Generate 1 million random values from an exponential distribution:

    library(dplyr)
    a <- rexp(1e6)


Divide the values into 100 equal-sized bins using quantiles:

    q <- quantile(a, seq(0, 1, length.out = 101))
    q[1] <- 0
    q[101] <- Inf
    bin <- cut(a, q)

Sum the values in the bins and plot them:

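    # tibble() replaces the deprecated data_frame()
    dd <- tibble(a = a, bin = bin)
    ee <- dd %>% group_by(bin) %>% summarise(a = sum(a))
    # lower bin edges on the x-axis, bin sums on the y-axis
    plot(q[-101], ee$a)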


Compare the two graphs. The first looks completely random, while the second shows an almost perfect relationship, purely because of the way we constructed the bins.

Now, if we have another variable that is correlated with the original one, this induced ordering does not disappear.

    b <- rexp(1e6) + a / 3


Here we observe a linear relationship with a lot of noise, which is no surprise, because this is how we constructed the second variable.

If we perform the binning, the relationship appears much stronger:

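    # tibble() replaces the deprecated data_frame()
    dd <- tibble(a = a, b = b, bin = bin)
    # explicit sums replace the deprecated summarise_each(funs(sum), a:b)
    ee <- dd %>% group_by(bin) %>% summarise(a = sum(a), b = sum(b))
    plot(ee$a, ee$b)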


So the binning you performed accentuated an existing relationship, but this does not mean that the relationship is actually that strong.
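
To put numbers on this, one can compare the Pearson correlation of the individual observations with that of the bin sums. The following is a minimal self-contained sketch that repeats the simulation in base R; the seed and the names `sum_a`, `sum_b`, `r_raw` and `r_binned` are my own choices for illustration. With this construction the correlation of the individual values is $1/\sqrt{10} \approx 0.32$ in theory, while the correlation of the bin sums should come out very close to 1.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
n <- 1e6
a <- rexp(n)
b <- rexp(n) + a / 3             # same construction as in the simulation above

# 100 equal-sized bins defined by the quantiles of a
q <- quantile(a, seq(0, 1, length.out = 101))
q[1] <- 0
q[101] <- Inf
bin <- cut(a, q)

# sum both variables within each bin
sum_a <- as.numeric(tapply(a, bin, sum))
sum_b <- as.numeric(tapply(b, bin, sum))

r_raw    <- cor(a, b)            # correlation of the individual observations
r_binned <- cor(sum_a, sum_b)    # correlation of the 100 bin sums
print(c(raw = r_raw, binned = r_binned))
```

The underlying data are the same in both cases. Summing within bins averages out the noise in `b`, and that alone pushes the second correlation towards 1.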

Given that your data are article views and thumbs up, it is natural to expect that articles with a high number of views tend to have more thumbs up. But this relationship is very noisy, as evidenced by your initial scatter plot of the data.

You should probably fit a regression to figure out the relationship and how strong it is.
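
As a rough sketch of what that could look like, here is an ordinary least-squares fit on the simulated variables (not on your data; the seed and the name `fit` are arbitrary). Plain `lm()` is only one option and assumes a linear relationship, which holds here by construction. The slope estimates how `b` changes with `a`, and the R-squared shows how much of the variation in the individual values is explained.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
a <- rexp(1e6)
b <- rexp(1e6) + a / 3           # same construction as in the simulation above

fit <- lm(b ~ a)                 # ordinary least squares on the raw observations

slope     <- coef(fit)[["a"]]        # constructed to be 1/3 in this simulation
r_squared <- summary(fit)$r.squared  # share of the variance of b explained by a
print(c(slope = slope, r_squared = r_squared))
```

In this simulation the slope is recovered well, while the R-squared stays around 0.1: the relationship is real but explains only a small part of the variation between individual observations.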

$\endgroup$
  • $\begingroup$ Thank you for the great answer! I've never used R before, just MATLAB, but it was no problem to copy your example into an online interpreter and try out some variations. So my takeaway for now is: yes, the bin plot does show that there is a relationship at large, but it does not say much about its magnitude. $\endgroup$
    – PhillipM
    Commented May 6, 2015 at 10:58
