I'm analyzing the age variable in a dataset of all Italian physicians (~470,000 observations), and I want to check whether age differs significantly between three groups defined by another variable (job). There are 34,348 observations for job 1, 6,404 for job 2 and 432,963 for job 3. My first idea was to use one-way ANOVA to test whether the difference between the group means is significant, but ANOVA assumes that:
- the data are normally distributed within each group
- the data are homoscedastic (the groups have equal variances)
My data violate both (age is not normally distributed, and Bartlett's test rejects equal variances), so my next idea was to use the Kruskal-Wallis test. Then I came across the Handbook of Biological Statistics by John H. McDonald, where I read:
> While Kruskal-Wallis does not assume that the data are normal, it does assume that the different groups have the same distribution, and groups with different standard deviations have different distributions. If your data are heteroscedastic, Kruskal–Wallis is no better than one-way anova, and may be worse. Instead, you should use Welch's anova for heteroscedastic data.
So I looked up Welch's ANOVA, and it seems that the normality assumption applies to it too:
> The assumptions are pretty much the same for Welch’s ANOVA as for the classic ANOVA. For example, the assumption of normality still holds. However, you should run Welch’s when you violate the assumption of equal variances. You can run it with unequal sample sizes.
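For reference, here is a minimal sketch of the Welch's ANOVA statistic computed by hand, to make clear what the test does with unequal variances. The function name `welch_anova` and the samples are assumptions for illustration: the samples are synthetic normal draws that only borrow the group means and standard deviations from the summary table, so they do not reproduce the real, non-normal age distribution.

```python
import numpy as np
from scipy import stats


def welch_anova(*groups):
    """Welch's one-way ANOVA (hypothetical helper). Returns (F, df1, df2, p)."""
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])

    # Each group is weighted by the precision of its mean, n_j / s_j^2.
    w = n / variances
    grand_mean = np.sum(w * means) / np.sum(w)

    numerator = np.sum(w * (means - grand_mean) ** 2) / (k - 1)
    lam = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    f_stat = numerator / denominator
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    p_value = stats.f.sf(f_stat, df1, df2)
    return f_stat, df1, df2, p_value


# Synthetic stand-ins for the three jobs (assumed sizes, NOT the real data).
rng = np.random.default_rng(0)
job1 = rng.normal(56.8, 11.9, size=3000)
job2 = rng.normal(58.3, 9.3, size=600)
job3 = rng.normal(53.1, 16.3, size=40000)

f_stat, df1, df2, p_value = welch_anova(job1, job2, job3)
print(f"F = {f_stat:.1f}, df = ({df1}, {df2:.1f}), p = {p_value:.3g}")
```

With only two groups this reduces to the square of Welch's t statistic, which is a convenient sanity check against `scipy.stats.ttest_ind(..., equal_var=False)`.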
Getting back to my original problem: how can I test whether age differs between the three groups when the data are neither normal nor homoscedastic?
I came across this question on Cross Validated, but I'm really out of my depth here. I understand the concept of bootstrapping, but I can't see how to use it to check for a significant difference between the groups.
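As a sketch of what bootstrapping a group difference could look like: resample each group with replacement, recompute the difference in medians each time, and read a percentile interval off the resulting distribution. If the interval excludes 0, the two medians differ at roughly that level. The helper `bootstrap_median_diff`, the sample sizes and the synthetic ages below are all assumptions for illustration, not the real data.

```python
import numpy as np


def bootstrap_median_diff(x, y, n_boot=2000, alpha=0.05, rng=None):
    """Percentile bootstrap CI for median(x) - median(y) (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    y = np.asarray(y)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Resample each group independently, keeping its original size.
        xb = rng.choice(x, size=x.size, replace=True)
        yb = rng.choice(y, size=y.size, replace=True)
        diffs[b] = np.median(xb) - np.median(yb)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return np.median(x) - np.median(y), lo, hi


# Synthetic integer ages standing in for two of the jobs (NOT the real data).
rng = np.random.default_rng(0)
ages_a = rng.normal(56.8, 11.9, size=3000).round()
ages_b = rng.normal(53.1, 16.3, size=5000).round()

observed, lo, hi = bootstrap_median_diff(ages_a, ages_b, rng=rng)
print(f"median difference = {observed:.1f}, 95% CI = [{lo:.1f}, {hi:.1f}]")
```

With three groups, the same function could be applied to each pair, with a multiple-comparison correction on the interval level. The interval also shows the size of the difference, which matters with ~470,000 observations, where even tiny differences come out as significant.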
job | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
1 | 34348 | 56.8 | 11.9 | 24.0 | 50.0 | 61.0 | 66.0 | 83.0 |
2 | 6404 | 58.3 | 9.3 | 27.0 | 54.0 | 61.0 | 65.0 | 79.0 |
3 | 432963 | 53.1 | 16.3 | 23.0 | 38.0 | 55.0 | 67.0 | 107.0 |
I'm doing my analyses in Python.
import statsmodels.api as sm
import matplotlib.pyplot as plt
from matplotlib import style

style.use('ggplot')

# Normal Q-Q plots of age for the three jobs, overlaid on the same axes.
# df is the physicians DataFrame; job is stored as a string ('1', '2', '3').
fig, ax = plt.subplots(figsize=(8, 8))
sm.qqplot(df[df.job == '1'].age, fit=True, marker='.', line='45', markerfacecolor='C0', markeredgecolor='C0', alpha=0.2, ax=ax, label='job1')
sm.qqplot(df[df.job == '2'].age, fit=True, marker='.', line='45', markerfacecolor='C1', markeredgecolor='C1', alpha=0.2, ax=ax, label='job2')
sm.qqplot(df[df.job == '3'].age, fit=True, marker='.', line='45', markerfacecolor='C2', markeredgecolor='C2', alpha=0.2, ax=ax, label='job3')
ax.legend()
fig.tight_layout()
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style

style.use('ggplot')

# Percentage of physicians at each age, computed separately within each job.
tdf = (pd.crosstab(
    df.age,
    df.job,
    normalize='columns',
) * 100).sort_index()

fig, ax = plt.subplots(2, 2, figsize=(8, 8))
ax = ax.flatten()

# One bar chart per job, then all three overlaid as lines in the last panel.
ax[0].bar(tdf.index, tdf['1'], width=1, color='C0')
ax[1].bar(tdf.index, tdf['2'], width=1, color='C1')
ax[2].bar(tdf.index, tdf['3'], width=1, color='C4')
ax[3].plot(tdf.index, tdf['1'], color='C0', alpha=1, label='job1')
ax[3].plot(tdf.index, tdf['2'], color='C1', alpha=1, label='job2')
ax[3].plot(tdf.index, tdf['3'], color='C4', alpha=1, label='job3')

# Same y-axis (0% to 9%) on every panel so the shapes are comparable.
for _ax in ax:
    _ax.set_ylim(0, 9)
    _ax.set_yticks(
        np.arange(0, 10, 1),
        labels=['{}%'.format(x) for x in np.arange(0, 10, 1)]
    )
    _ax.set_xlabel('age', fontsize='small')
ax[-1].legend()
fig.tight_layout()
Bartlett's test for homoscedasticity gave p < 0.001.
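Bartlett's test is itself sensitive to departures from normality, so a median-centered Levene test (the Brown-Forsythe variant) may be a more robust check of equal variances here. A minimal sketch with `scipy.stats.levene`, run on synthetic samples that only borrow the group means and standard deviations from the summary table (assumed sizes, not the real data):

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the three jobs (NOT the real data).
rng = np.random.default_rng(0)
job1 = rng.normal(56.8, 11.9, size=3000)
job2 = rng.normal(58.3, 9.3, size=600)
job3 = rng.normal(53.1, 16.3, size=5000)

# center='median' gives the Brown-Forsythe variant, which is less affected
# by skewed or heavy-tailed data than Bartlett's test.
stat, p_value = stats.levene(job1, job2, job3, center='median')
print(f"Brown-Forsythe W = {stat:.1f}, p = {p_value:.3g}")
```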
EDIT 1: Since age is not normally distributed, I'm interested in testing the difference in median age rather than mean age, and in checking whether the whole distribution differs across groups.
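Two candidate tests for these two questions, sketched under assumptions: Mood's median test (`scipy.stats.median_test`) compares the medians directly, and a chi-square test of homogeneity on age bands (`scipy.stats.chi2_contingency`) checks whether the whole distribution differs across jobs. The helper `fake_ages`, the band edges and the synthetic samples are illustrative choices, not the real data, and other tests may suit the problem better.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)


def fake_ages(mean, sd, size):
    """Synthetic integer ages (hypothetical), clipped to the 23-107 range of the data."""
    return np.clip(rng.normal(mean, sd, size).round(), 23, 107)


job1 = fake_ages(56.8, 11.9, 3000)
job2 = fake_ages(58.3, 9.3, 600)
job3 = fake_ages(53.1, 16.3, 5000)

# Mood's median test: counts values above / not above the pooled median
# in each group and runs a chi-square test on that 2 x k table.
stat_med, p_med, grand_median, table_med = stats.median_test(job1, job2, job3)
print(f"median test: chi2 = {stat_med:.1f}, p = {p_med:.3g}")

# Chi-square test of homogeneity: one row per job, one column per age band.
edges = [20, 35, 45, 55, 65, 75, 110]  # assumed age bands
counts = np.vstack([np.histogram(g, bins=edges)[0] for g in (job1, job2, job3)])
chi2, p_chi2, dof, expected = stats.chi2_contingency(counts)
print(f"homogeneity: chi2 = {chi2:.1f}, dof = {dof}, p = {p_chi2:.3g}")
```

The result of the homogeneity test depends on how the age bands are chosen, so the edges above should be treated as a parameter to vary rather than a fixed choice.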