5
$\begingroup$

I'm currently analyzing the age variable in a dataset of all Italian physicians (~470,000 obs) and I'm trying to check whether age differs significantly between three groups defined by another variable (job). There are 34,348 obs for job 1, 6,404 obs for job 2 and 432,963 for job 3. My first idea was to use one-way ANOVA to check whether the difference between the means was significant, but ANOVA assumes that:

  1. Data is normally-distributed
  2. Data is homoscedastic

So my next idea was to use the Kruskal-Wallis test. Then I came across the Handbook of Biological Statistics by John H. McDonald, where I read:

While Kruskal-Wallis does not assume that the data are normal, it does assume that the different groups have the same distribution, and groups with different standard deviations have different distributions. If your data are heteroscedastic, Kruskal-Wallis is no better than one-way anova, and may be worse. Instead, you should use Welch's anova for heteroscedastic data.

So I looked up Welch's ANOVA, and it seems that the assumption of normality applies to Welch's ANOVA too:

The assumptions are pretty much the same for Welch’s ANOVA as for the classic ANOVA. For example, the assumption of normality still holds. However, you should run Welch’s when you violate the assumption of equal variances. You can run it with unequal sample sizes

Getting back to my original problem: how can I answer my original question?

I came across this question on Cross Validated but I'm really out of my depth here. I mean, I understood the concept of bootstrapping but I can't understand how to check for a significant difference between the groups.

job   count    mean   std    min    25%    50%    75%    max
1     34348    56.8   11.9   24.0   50.0   61.0   66.0   83.0
2     6404     58.3   9.3    27.0   54.0   61.0   65.0   79.0
3     432963   53.1   16.3   23.0   38.0   55.0   67.0   107.0

I'm doing my analyses in python.

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
fig, ax = plt.subplots(figsize=(8, 8))

sm.qqplot(df[df.job == '1'].age, fit=True, marker='.', line='45', markerfacecolor='C0', markeredgecolor='C0', alpha=0.2, ax=ax, label='job1')
sm.qqplot(df[df.job == '2'].age, fit=True, marker='.', line='45', markerfacecolor='C1', markeredgecolor='C1', alpha=0.2, ax=ax, label='job2')
sm.qqplot(df[df.job == '3'].age, fit=True, marker='.', line='45', markerfacecolor='C2', markeredgecolor='C2', alpha=0.2, ax=ax, label='job3')

ax.legend()
fig.tight_layout()

Q-Q plot

import numpy as np
import pandas as pd  # needed for pd.crosstab below
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')

tdf = (pd.crosstab(
    df.age,
    df.job,
    normalize='columns',
) * 100).sort_index()
fig, ax = plt.subplots(2, 2, figsize=(8, 8))
ax = ax.flatten()

ax[0].bar(tdf.index, tdf['1'], width=1, color='C0')
ax[1].bar(tdf.index, tdf['2'], width=1, color='C1')
ax[2].bar(tdf.index, tdf['3'], width=1, color='C4')


ax[3].plot(tdf.index, tdf['1'], color='C0', alpha=1, label='job1')
ax[3].plot(tdf.index, tdf['2'], color='C1', alpha=1, label='job2')
ax[3].plot(tdf.index, tdf['3'], color='C4', alpha=1, label='job3')

for i, _ax in enumerate(ax):
    _ax.set_ylim(0, 9)
    _ax.set_yticks(
        np.arange(0, 10, 1),
        labels=['{}%'.format(x) for x in np.arange(0, 10, 1)]
    )
    _ax.set_xlabel('age', fontsize='small')

ax[-1].legend()
fig.tight_layout()

age relative distribution by job and overall

Bartlett's test for homoscedasticity gave p < 0.001.

EDIT 1: I'm interested in checking the difference in the median age instead of mean age (since age is not normally-distributed) and in checking if the whole distribution differs across groups.
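For the median comparison described in this edit, one possible approach is a percentile bootstrap confidence interval for the difference in medians between two groups. The sketch below is only an illustration on synthetic data, since the real `df` is not reproduced here; the helper name `bootstrap_median_diff`, the group parameters and the number of resamples are all assumptions.

```python
import numpy as np


def bootstrap_median_diff(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for median(a) - median(b) (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a)
    b = np.asarray(b)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group independently, with replacement, at its own size.
        resample_a = rng.choice(a, size=a.size, replace=True)
        resample_b = rng.choice(b, size=b.size, replace=True)
        diffs[i] = np.median(resample_a) - np.median(resample_b)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return np.median(a) - np.median(b), (lo, hi)


# Synthetic stand-ins for two job groups (made-up parameters, not the real data).
rng = np.random.default_rng(42)
ages_a = rng.normal(57, 12, size=3000)
ages_b = rng.normal(53, 16, size=3000)

estimate, (ci_low, ci_high) = bootstrap_median_diff(ages_a, ages_b)
print(f"median difference: {estimate:.2f}, 95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

If the interval excludes 0, the two medians differ at roughly the 5% level. With integer ages there are many ties, so the bootstrap medians take only a few distinct values and the interval will be coarse.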

$\endgroup$
3
  • 2
    $\begingroup$ Hi Zeno and welcome to CV! I think a good start would be for you to clarify what parameter of age you're interested in? The mean/average? The median? The whole distribution? What exactly do you want to compare between the groups? That's what will guide the choice of hypothesis test first and foremost. Please edit your question to clarify this rather than posting it in the comments as not everyone reads those. $\endgroup$ Commented Jun 14, 2023 at 20:48
  • 2
    $\begingroup$ Clearly the mean ages are significantly different. Why is that obvious? Because you can readily compute the standard errors of the three means from the data and will find that any two of those means differ by a large multiple of either standard error. You might want to reformulate your investigation in terms of identifying how the three age distributions differ from each other. Viewing the QQ plots of pairs of groups could be a good start. Even more simply, plotting true histograms (relative frequency per unit age) rather than the raw bar charts shown here would be quite revealing. $\endgroup$
    – whuber
    Commented Jun 14, 2023 at 22:03
  • 2
    $\begingroup$ The histograms might be easier to compare when you just plot the outline instead of the area. The overlapping areas are currently difficult to compare. $\endgroup$ Commented Jun 15, 2023 at 8:10
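The standard-error argument in the comment above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the rounded counts, means and standard deviations from the summary table in the question, so the results are approximate; the names `summary`, `standard_error` and `z_for_difference` are illustrative.

```python
from math import sqrt

# (count, mean, std) per job, copied from the rounded summary table in the question.
summary = {
    "1": (34348, 56.8, 11.9),
    "2": (6404, 58.3, 9.3),
    "3": (432963, 53.1, 16.3),
}


def standard_error(n, sd):
    """Standard error of a sample mean."""
    return sd / sqrt(n)


def z_for_difference(a, b):
    """Difference of two group means divided by the standard error of that difference."""
    (n_a, mean_a, sd_a), (n_b, mean_b, sd_b) = summary[a], summary[b]
    se_diff = sqrt(standard_error(n_a, sd_a) ** 2 + standard_error(n_b, sd_b) ** 2)
    return (mean_a - mean_b) / se_diff


for a, b in [("1", "2"), ("1", "3"), ("2", "3")]:
    print(f"job {a} vs job {b}: z = {z_for_difference(a, b):.1f}")
```

Every pairwise difference is more than ten standard errors from zero, which is why the difference in means is "obvious" without a formal test.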

2 Answers

2
$\begingroup$

A side-by-side plot of the densities or histograms will give anyone sufficient insight into the differences.

When you do this, make sure you have:

  • The same scale for the different plots. For example, in your QQ-plots you have different scales on the vertical axis which makes it difficult to compare the distributions.

    example of different scales

    Job 3 has a much heavier tail at older ages, but this is not obvious because its scale reaches up to 150 years, whereas the second plot reaches only up to 90 years. In your plot, the 80 years in the middle panel sits higher than the 100 years in the right panel.

  • The histogram in terms of frequency instead of absolute numbers. Currently your histograms are difficult to compare because there are many more in job 3 than the others. If you normalize them such that they are more or less the same level, then it is easier to see the relative differences between the different jobs.

    Also, you seem to have plotted a bar chart instead of a histogram and there are several wider and smaller gaps between the bars. The reason for this is not very clear. Is there a meaning behind it?

You could compare some parameters like a mean or median. This makes sense if,

  • These means or other population descriptions have a special meaning. E.g. a cruise ship operator might be interested in the average amount of food consumed by the tourists and not so much in the exact distribution.
  • We wish to compare large amounts of distributions like in a table, and comparing the entire distribution becomes too cluttered. A summary that expresses the essence of the distribution might work as well.
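The plotting suggestions above (same scale, relative frequency, outlines rather than filled bars) could be sketched roughly as follows. The data are synthetic with made-up parameters, and `relative_frequencies` and `plot_outlines` are hypothetical helper names; only the binning idea, one shared bin per integer age, is the point.

```python
import numpy as np


def relative_frequencies(ages_by_group, bin_edges):
    """Relative frequency per unit age for each group, on shared bins."""
    return {
        group: np.histogram(ages, bins=bin_edges, density=True)[0]
        for group, ages in ages_by_group.items()
    }


def plot_outlines(densities, bin_edges):
    """Draw each group's histogram as an outline on one shared axis."""
    import matplotlib.pyplot as plt  # imported here so the numeric part runs without it

    fig, ax = plt.subplots(figsize=(8, 4))
    for group, density in densities.items():
        ax.stairs(density, bin_edges, label=f"job{group}")
    ax.set_xlabel("age")
    ax.set_ylabel("relative frequency per year of age")
    ax.legend()
    return fig, ax


# Synthetic integer ages with very different group sizes (not the real data).
rng = np.random.default_rng(0)
ages_by_group = {
    "1": np.clip(np.round(rng.normal(57, 12, 3000)), 23, 107),
    "2": np.clip(np.round(rng.normal(58, 9, 600)), 23, 107),
    "3": np.clip(np.round(rng.normal(53, 16, 40000)), 23, 107),
}
bin_edges = np.arange(22.5, 108.5, 1.0)  # one bin per integer age, 23..107
densities = relative_frequencies(ages_by_group, bin_edges)
# fig, ax = plot_outlines(densities, bin_edges)  # uncomment to draw the figure
```

Because every group is normalised to unit area on the same bins, the curves are directly comparable even though the group sizes differ by orders of magnitude.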
$\endgroup$
7
  • 1
    $\begingroup$ These are just suggestions to improve your current example. The question in your title is very broad and different answers might work for different questions. $\endgroup$ Commented Jun 14, 2023 at 21:34
  • $\begingroup$ I edited the question according to your right and precise comments. $\endgroup$ Commented Jun 15, 2023 at 7:37
  • $\begingroup$ I'm interested in understanding how you would tackle the problem. I can see some differences in the age distribution. Considering that in Italy the retirement age for doctors is 68 years, the median would answer "how many years from now will half of the doctors retire?". The mean would answer "how many years from now can doctors work?" but it doesn't consider the number of doctors, as if having 10 who will work 10 years and 10 who will work 20 years were the same as having 10 who will work 5 years and 10 who will work 25 years. What good summary would you consider? All that given, I also need to choose a test to prove that the differences are significant. $\endgroup$ Commented Jun 15, 2023 at 8:11
  • $\begingroup$ @ZenoDallaValle Why do you need a summary? Aren't the three curves for the age distributions providing a good enough insight? If you need a summary, then how to tackle that would depend on the reasons for the summary. $\endgroup$ Commented Jun 15, 2023 at 8:19
  • $\begingroup$ Regarding testing of significance; for that you need to formulate a specific hypothesis. Was testing a hypothesis the goal of getting and comparing these data, that is, to quantitatively test a specific expected difference? What was that hypothesised difference? Or were the observations just meant to explore qualitatively how the three groups differ?... $\endgroup$ Commented Jun 15, 2023 at 8:21
1
$\begingroup$

Assumptions for statistical models are always a somewhat controversial topic, and some people care more about them than others.

When you check for normality, you usually want the residuals to be normally distributed; you do not necessarily have to check the variables themselves. Homoscedasticity is the assumption that the variance of the dependent variable remains the same across all levels of an independent variable (heteroscedasticity is its violation). So in the case of an ANOVA you want the variance of the dependent variable to be the same across the groups.

If that does not hold, you can use Welch's ANOVA.
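As a rough illustration, Welch's F statistic can be computed directly from group sizes, means and standard deviations. The sketch below plugs in the rounded values from the summary table in the question, so the result is approximate; `welch_anova_from_summary` is a hypothetical helper name, not a library function.

```python
# (count, mean, std) per job, copied from the rounded summary table in the question.
groups = [(34348, 56.8, 11.9), (6404, 58.3, 9.3), (432963, 53.1, 16.3)]


def welch_anova_from_summary(groups):
    """Welch's F statistic and degrees of freedom from (n, mean, sd) triples."""
    k = len(groups)
    # Each group is weighted by the inverse of its squared standard error.
    weights = [n / sd**2 for n, _, sd in groups]
    total_w = sum(weights)
    weighted_mean = sum(w * m for w, (_, m, _) in zip(weights, groups)) / total_w
    between = sum(
        w * (m - weighted_mean) ** 2 for w, (_, m, _) in zip(weights, groups)
    ) / (k - 1)
    lam = sum(
        (1 - w / total_w) ** 2 / (n - 1) for w, (n, _, _) in zip(weights, groups)
    )
    f_stat = between / (1 + 2 * (k - 2) / (k**2 - 1) * lam)
    df1 = k - 1
    df2 = (k**2 - 1) / (3 * lam)
    return f_stat, df1, df2


f_stat, df1, df2 = welch_anova_from_summary(groups)
print(f"Welch F = {f_stat:.1f} on ({df1}, {df2:.0f}) degrees of freedom")
```

If SciPy is available, a p-value can be obtained from the upper tail of the F distribution with `scipy.stats.f.sf(f_stat, df1, df2)`. With samples this large the statistic is enormous, so the test mostly confirms what the standard errors already show.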

Using statistical tests to determine whether an assumption is met is also somewhat controversial. It is often argued that this is not good practice, because when your sample size gets large (like yours) any small deviation will produce a significant test result, so the test is not really useful.

Further, when talking about normality, you might find the Central Limit Theorem handy:

In probability theory, the central limit theorem (CLT) establishes that, in many situations, for independent and identically distributed random variables, the sampling distribution of the standardized sample mean tends towards the standard normal distribution even if the original variables themselves are not normally distributed.
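A small simulation can make the quoted statement concrete. This is only a sketch under assumed settings: the exponential distribution, the sample size of 500 and the 5,000 replications are arbitrary choices, and `skewness` is a hand-written moment estimator.

```python
import numpy as np


def skewness(x):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5


rng = np.random.default_rng(0)
n_per_sample, n_samples = 500, 5000

# Strongly right-skewed raw data: each row is one sample of size n_per_sample.
raw = rng.exponential(scale=1.0, size=(n_samples, n_per_sample))
sample_means = raw.mean(axis=1)

print(f"skewness of raw values:   {skewness(raw.ravel()):.2f}")
print(f"skewness of sample means: {skewness(sample_means):.2f}")
```

The raw values are heavily skewed while the sample means are close to symmetric. This concerns the sampling distribution of the mean only; it says nothing about whether the age distributions themselves are similar.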

$\endgroup$
