$\begingroup$

I used the following pandas code to find the correlation between the output and each of the inputs I am feeding in:

dataframe.corrwith(dataframe['output']).plot(kind='barh',figsize=[20,10], legend=True,grid=True)

I got the following image:
[image: output]

I was trying to understand which columns affect the result most, and whether each effect is positive, negative, or neutral.

I am not able to work out what the image above actually means.
Can someone please tell me whether I am heading in the right direction for what I am trying to achieve, and what the image means?
Here is a link to the sample data set: Training.csv

$\endgroup$

2 Answers

$\begingroup$

In the correlation framework above, the biggest driver of the output, in the sense of strongest association, is the input with the greatest absolute correlation value.

Correlation lies in the range [-1,1], and:

  • Negative correlation (correlation < 0) implies that the input and output move in opposite directions - i.e. as the input increases, the output decreases (and vice versa).
  • Nil correlation (correlation == 0) implies that there is no linear relationship between the two variables. They may still be related in a non-linear way.
  • Positive correlation (correlation > 0) implies that the input and output move in the same direction - i.e. as the input increases, the output increases (and vice versa).
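
One caveat is worth checking by hand: Pearson correlation measures linear association only, so a value near zero does not by itself show that two variables are unrelated. The sketch below uses made-up data (the series `x` and the three derived outputs are illustrative, not taken from Training.csv) to show one case of each sign, plus a quadratic relationship whose correlation is essentially zero.

```python
import numpy as np
import pandas as pd

# Illustrative input: evenly spaced values, symmetric around zero.
x = pd.Series(np.linspace(-1, 1, 201))

# Output rises with the input: correlation close to +1.
r_linear_up = x.corr(2 * x + 1)

# Output falls as the input rises: correlation close to -1.
r_linear_down = x.corr(-3 * x)

# Output depends entirely on the input, but not linearly:
# the Pearson correlation is close to 0 anyway.
r_quadratic = x.corr(x ** 2)

print(r_linear_up, r_linear_down, r_quadratic)
```

If an input shows a correlation near zero, a scatter plot of that input against the output is a quick way to rule out a relationship like the quadratic one.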

In the chart above, it looks like all-but-one of the inputs are negatively correlated with the output. This implies that as these inputs increase, the output decreases and vice versa.

A few things about your approach:

  1. There is more than one type of correlation - Spearman (rank) and Pearson (linear) correlation are two examples. Be mindful of which you are using: pandas' corrwith computes Pearson correlation unless you pass a different method.
  2. It would be helpful to rank/sort the result before plotting it. It would be easier for you to visually identify the drivers of the output if your chart was sorted.
  3. You may also want to drop output from the chart. A variable's correlation with itself is always 1, and so this does not add any value to the graphic.
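
The three suggestions above can be sketched roughly as follows. This is a minimal example under assumptions: the data is synthetic, the column names `input1` to `input4` and `output` mirror the names mentioned in this thread rather than the real Training.csv, and `correlations_with_output` is a hypothetical helper, not a pandas function.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data set (assumed column names).
rng = np.random.default_rng(0)
n = 200
dataframe = pd.DataFrame(
    {f"input{i}": rng.normal(size=n) for i in range(1, 5)}
)
dataframe["output"] = (
    0.8 * dataframe["input1"]
    - 0.5 * dataframe["input2"]
    + rng.normal(scale=0.5, size=n)
)


def correlations_with_output(df, target="output", method="pearson"):
    """Correlate every column except `target` with `target`, sorted ascending."""
    inputs = df.drop(columns=[target])  # point 3: leave the output out
    corr = inputs.corrwith(df[target], method=method)  # point 1: pick the type
    return corr.sort_values()  # point 2: sort before plotting


pearson = correlations_with_output(dataframe, method="pearson")
spearman = correlations_with_output(dataframe, method="spearman")

# Ranking by absolute value shows the strongest associations first.
strongest_first = pearson.abs().sort_values(ascending=False)
print(pearson, spearman, strongest_first, sep="\n\n")
```

With matplotlib installed, `pearson.plot(kind='barh', grid=True)` draws the sorted bars in the same style as the original chart.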
$\endgroup$
  • $\begingroup$ This is a great explanation, @bradS. I would also add that a heat map can be used to inspect correlations, and it gives more information. $\endgroup$
    – Sunil
    Commented Jan 16, 2019 at 9:35
  • $\begingroup$ brad, thank you for the explanation. I was in a dilemma, but it is a bit clearer now. I will check the other correlation metrics. $\endgroup$ Commented Jan 16, 2019 at 9:41
  • $\begingroup$ @Sunil When there is only one output to correlate against, how can a heatmap help? Please let me know. When correlations between multiple columns are needed, then yes, a heatmap may help. $\endgroup$ Commented Jan 16, 2019 at 9:42
  • $\begingroup$ I believe in the input.csv files there are multiple inputs @JafferWilson $\endgroup$
    – Sunil
    Commented Jan 18, 2019 at 8:19
  • $\begingroup$ @Sunil Yes, you are right there, amigo. I have multiple inputs. $\endgroup$ Commented Jan 18, 2019 at 9:47
$\begingroup$

You are on the right track! The length of each bar is the correlation you are looking for: positive if the bar extends to the right of the zero position, negative if it extends to the left. You can see that the grey bar has length 1, which is expected because output is perfectly correlated with itself. Regarding the features: they all appear to have a small negative correlation with the output, since they all sit to the left of zero, except input4, whose correlation with the output is slightly above zero.

$\endgroup$
