$\begingroup$

I'm writing a research paper and using R for my quantitative analysis. I'm running OLS regressions and needed to log-transform my dependent variables for linearity; however, I no longer understand my regression table and was hoping someone could help me interpret it.

For the log regression to run I needed to filter out the zeros using this code, which, together with the effect of the log itself, could explain why my results look different: Code to get rid of zeros

Without the log, it used to look like this: Pre-log regression table

After the log transformation, it looks like this: Post-log regression

$\endgroup$

1 Answer

$\begingroup$

Very nice analysis, thanks for sharing it.

Several suggestions:

  • To make the reports comparable, consider applying the zero filter in both. Currently you applied the filter only for the log models.
  • Alternatively, instead of the transformation log(tax), use log(tax + 1). Then tax = 0 is mapped to log(1 + 0) = 0, which lets you keep even the zero data points. Of course, this will change the models' accuracy.
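Both options can be sketched in R. This is a minimal, purely illustrative example; the data frame `df` and the column names `tax` and `num_employees` are assumptions standing in for your real data:

```r
# Toy data frame standing in for the real dataset (illustrative values)
df <- data.frame(tax = c(0, 15, 120, 900, 0, 45),
                 num_employees = c(1, 3, 10, 40, 2, 5))

# Option A: apply the same zero filter before fitting BOTH models,
# so the two reports are built on the same observations
df_pos <- df[df$tax > 0, ]
model_raw <- lm(tax ~ num_employees, data = df_pos)
model_log <- lm(log(tax) ~ num_employees, data = df_pos)

# Option B: keep the zero observations with log(tax + 1);
# log1p(x) computes log(1 + x), so tax = 0 maps to exactly 0
df$log_tax <- log1p(df$tax)
model_log1p <- lm(log_tax ~ num_employees, data = df)
```

Note that `log1p()` is preferable to writing `log(df$tax + 1)` by hand because it stays numerically accurate for values of `tax` close to zero.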

Back to your reports. They are very brief compared to a standard OLS regression report, but based on the numbers you provided you can judge: 1) the quality of the models and 2) the significance of the variables.

Model quality

  • First of all, look at the Adjusted R². The higher the R², the better the model: R² shows the share of variance in the dependent variable explained by the factors. For example, 44.5% of the variance in log-tax is explained by the factors. In summary, the log models show higher predictive power, but of course this might be because you filtered out part of the dataset when building the log models.
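For reference, adjusted R² can be pulled straight out of `summary()`. A hedged sketch with simulated data (all names and numbers here are illustrative, not from the question's dataset):

```r
# Toy data where the true relationship is log-linear
set.seed(1)
df <- data.frame(num_employees = 1:50)
df$tax <- exp(0.05 * df$num_employees + rnorm(50, sd = 0.2))

m_raw <- lm(tax ~ num_employees, data = df)
m_log <- lm(log(tax) ~ num_employees, data = df)

# Adjusted R^2 for each model, the same number printed in the report
adj_raw <- summary(m_raw)$adj.r.squared
adj_log <- summary(m_log)$adj.r.squared
```

One caveat: R² values of models with different dependent variables (tax vs. log-tax) are not strictly comparable, since they measure explained variance on different scales.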

Effects direction and effects significance

  • Then check which coefficients turned to zero. For example, Num Employees was an important factor (non-zero, with three stars) in predicting tax but is zero in the log-tax model. Here you should use expert judgement to decide which model shows the true relationship, i.e. whether the number of employees has any effect on tax.
  • Pay attention to the sign of each factor. For example, in the tax model the Num Employees coefficient is -14.347. This means Num Employees has a negative effect on taxes, i.e. the more employees you have, the lower the taxes.
  • Finally, pay attention to the stars next to each coefficient. Three stars next to a factor's coefficient mean the factor is important and the model would be much weaker if you excluded it. Coefficients without stars are called insignificant: such coefficients may be excluded from the model without a significant drop in R² (predictive accuracy). Put more simply, we are not sure they are different from zero. Yes, the table may show a non-zero value, but there is a high probability that the true effect is zero; usually you need more data to become more confident about such factors. For example, take EM 2022 in the log-tax model: the coefficient is -13.998, which is a large value, but it has no stars and is therefore insignificant. There is still a high probability that the true coefficient is zero, which would mean EM 2022 does not affect log-tax. In such cases you cannot say that EM 2022 has a negative effect on log-tax; you say the estimated effect is negative but insignificant, so we cannot rely on this negative relationship.
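The stars are just thresholds on the p-value column of the coefficient table that `summary()` produces. A hypothetical sketch with simulated data (`x_strong` and `x_weak` are invented names, not factors from the question):

```r
# Toy data: x_strong truly affects y, x_weak does not
set.seed(42)
n <- 100
df <- data.frame(x_strong = rnorm(n), x_weak = rnorm(n))
df$y <- 2 * df$x_strong + rnorm(n)

fit <- lm(y ~ x_strong + x_weak, data = df)
coefs <- summary(fit)$coefficients  # Estimate, Std. Error, t value, Pr(>|t|)

# The printed stars map to p-value thresholds:
# '***' p < 0.001, '**' p < 0.01, '*' p < 0.05, '.' p < 0.1
p <- coefs[, "Pr(>|t|)"]
p["x_strong"]  # tiny p-value -> '***', the factor matters
p["x_weak"]    # no true effect, so typically no stars at all
```

This is exactly the situation with EM 2022 above: the estimate can be large in magnitude while its p-value says the data are consistent with a true effect of zero.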
$\endgroup$
  • $\begingroup$ Thanks for the detailed reply Johnny, it was super helpful. I tried log(var+1) on my logged DVs. However, it now won't let me run the regression as I get an error: Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : NA/NaN/Inf in 'y' I know I have zero and negative values in my df, which may explain the error, but I don't know how to combat it - do you have any ideas? $\endgroup$
    – Bella
    Commented Apr 17 at 12:13
  • $\begingroup$ Hello Bella, thanks for the comment. Check the stats of your dependent variables: calculate the % of zeros and negative values for 'tax', 'profits', etc. Maybe it's worth simply excluding negative values from both datasets. I would say that how you filter/modify the input data depends very much on the context. If you want to compare the performance of both models, it's highly recommended to use the same input dataset (the same filters applied to 'tax', 'profits', etc.). $\endgroup$ Commented Apr 18 at 11:42
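The checks suggested in that comment could be sketched like this (the data frame and column names are purely illustrative, not the question's actual data):

```r
# Toy data frame standing in for the real one
df <- data.frame(tax     = c(-5, 0, 0, 10, 100),
                 profits = c(-1, 2, 0, 3, 4))

# Share of zeros and negatives, plus mean and sd, for each DV
for (v in c("tax", "profits")) {
  x <- df[[v]]
  cat(sprintf("%s: %.0f%% zeros, %.0f%% negative, mean = %.2f, sd = %.2f\n",
              v, 100 * mean(x == 0), 100 * mean(x < 0), mean(x), sd(x)))
}
```

Note that log(tax + 1) is undefined (NaN) for tax <= -1, which is consistent with the NA/NaN/Inf error above when negative values are present.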
  • $\begingroup$ I just tried excluding negative values for all three of my DVs, but it reduced my original df from 5019 obs to 195 obs, so the analysis would be futile. $\endgroup$
    – Bella
    Commented Apr 18 at 13:44
  • $\begingroup$ That's quite strange. What are the mean and standard deviation of each target variable? Did you scale them somehow during data processing? $\endgroup$ Commented Apr 18 at 18:58
