How likely is a random sample to be significantly similar to the full population?

Question

I am looking for a way to compare the proportions between two data sets of surveys, where individuals can be placed in more than one classification.

One of my groups contains the survey results of 10,000 individuals, and the other is a subset of 100 individuals that I randomly selected from the first data set (so the two groups are not independent).

             pop                samp
Dutch        0.03539377         0.05
French       0.13623071         0.18
English      0.98779873         0.98

I want to know how likely it is that a random sample (of 100 individuals) will have significantly different proportions when compared to the full population.

For this I have generated 50000 random sets of 100 individuals each, and now I want to calculate how many of those are significantly different from the whole population (with a p.value <= 0.05).

chisq.test, prop.test and wilcox.test all gave similarly inappropriate results: out of 50000 random subsets of 100 individuals each, only 4% were significantly different. I am now trying multinomial.test, which I think might be closer to what I want, but I am not really sure.

Any suggestions?

You'll have to define "significantly similar" in a mathematically formal way if you want to get a mathematical answer... - jbowman, Commented Oct 24, 2018 at 15:59
@jbowman That is a good point, I edited the question. Do you think it is clearer now, or did this not help at all? - Sergio Henriques, Commented Oct 24, 2018 at 16:26
You don't compare subsets to the whole sample (it's not a "population"), because--as you note--they are not independent. You need to compare a subset to the rest of the sample. Then, by construction, the chance that an appropriate statistical test determines they are different is only $100\alpha\%$, where $\alpha$ is the size of the test (and $100(1-\alpha)\%$ is its confidence level). Also, could you explain in what sense you have "paired" data? It looks like you're just not conducting these tests correctly, but without the details it's hard to determine what all the mistakes might be. - whuber, Commented Oct 24, 2018 at 16:50
I have tried to make the question clearer and address your suggestions. Do you think it is clearer now? Or am I being too vague? - Sergio Henriques, Commented Oct 31, 2018 at 16:36
As @MartijnWeterings points out in an answer, the overlap among the 3 classes is posing a problem. Do you have the data broken down into what seem to be the 7 possible distinct combinations (D = Dutch, E = English, F = French): D, E, F, D+E, D+F, E+F, D+E+F? - EdM, Commented Oct 31, 2018 at 17:00
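
A rough simulation can illustrate whuber's point that a valid test flags about $100\alpha\%$ of random subsets by construction. The sketch below is in Python (standard library only) rather than R, and everything in it is an assumption made for illustration: a single yes/no class, a hypothetical 1362 "yes" individuals out of 10,000 (roughly the French proportion), and a Pearson chi-squared test without continuity correction that compares each subset of 100 to the remaining 9,900.

```python
import random

# 95th percentile of the chi-squared distribution with 1 degree of freedom.
CHI2_CRIT_DF1 = 3.841


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no evidence of a difference
    return n * (a * d - b * c) ** 2 / denom


def rejection_rate(pop_size=10_000, n_yes=1362, sample_size=100,
                   n_sim=2000, seed=1):
    """Fraction of random subsets that differ 'significantly' from the rest.

    n_yes is a HYPOTHETICAL count chosen to mimic a proportion near 0.136.
    """
    rng = random.Random(seed)
    population = [1] * n_yes + [0] * (pop_size - n_yes)
    rejections = 0
    for _ in range(n_sim):
        # Draw the subset without replacement, then compare it to the REST.
        idx = rng.sample(range(pop_size), sample_size)
        yes_in = sum(population[i] for i in idx)
        yes_out = n_yes - yes_in
        no_in = sample_size - yes_in
        no_out = (pop_size - sample_size) - yes_out
        if chi2_2x2(yes_in, no_in, yes_out, no_out) > CHI2_CRIT_DF1:
            rejections += 1
    return rejections / n_sim


print(rejection_rate())
```

Under these assumptions the printed rate should fall in the neighbourhood of the nominal 0.05 (counts are discrete, so it need not match exactly). A figure such as 4% is therefore what a working test is expected to produce for truly random subsets.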

Sextus Empiricus · Accepted Answer · 2018-10-31 17:15:09Z

Your table is not a contingency table as used in the $\chi^2$-test.

$$\begin{array}{ccc|c} Dutch & French & English & Total \\ \hline 5 & 18 & 98 & 100 \end{array}$$

The total is not the sum of the three types (5 + 18 + 98 = 121, not 100), because individuals can be placed in more than one class.
Also, the counts may be correlated (e.g. people who know Dutch are likely to know English as well).
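
A small sketch may make the overlap problem concrete. It is written in Python (standard library only), and the ten responses are entirely hypothetical, not taken from the question. It tallies the same respondents twice: once per language, as in the table above, and once per distinct set of languages.

```python
from collections import Counter

# HYPOTHETICAL responses: each respondent lists every language they speak.
responses = [
    {"English"}, {"English"}, {"English"}, {"English"}, {"English"},
    {"English", "French"}, {"English", "French"},
    {"Dutch", "English"},
    {"French"},
    {"Dutch", "English", "French"},
]

# Per-language counts: a multilingual respondent is counted once per language.
marginal = Counter(lang for r in responses for lang in r)

# Counts per distinct combination: every respondent is counted exactly once.
combos = Counter("+".join(sorted(r)) for r in responses)

print(len(responses), sum(marginal.values()), sum(combos.values()))
```

The per-language counts add up to more than the number of respondents, so they cannot serve as the cells of a single contingency table, whereas the combination counts add up to exactly the number of respondents.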

(Please specify exactly how you perform the chi-squared test: how you calculate the statistic, how many degrees of freedom you use, etc.)

I imagine that you have similar problems with the other tests (although I do not know how you would apply the Wilcoxon test here).

As EdM mentions in the comments, you can instead use the seven categories formed by all possible combinations of one, two, or three languages. This might still be problematic because some cells may have small counts, but instead of the chi-squared test you can use the multinomial distribution to compute an exact probability (which is what you were already considering yourself).
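
As a minimal sketch of that idea, the Python code below (standard library only) runs a multinomial test on the seven exclusive categories. Several things in it are assumptions and not part of the question. The population proportions and the sample breakdown are hypothetical, chosen only to be roughly consistent with the marginal proportions shown in the question. Everyone is assumed to speak at least one of the three languages. The p-value is a Monte Carlo approximation, because enumerating every possible table of 100 people over seven cells is impractical. Finally, the multinomial model treats the draws as if made with replacement, which is only an approximation for 100 individuals drawn from 10,000.

```python
import math
import random
from collections import Counter

# Seven mutually exclusive combinations (D = Dutch, E = English, F = French).
CATEGORIES = ["D", "E", "F", "D+E", "D+F", "E+F", "D+E+F"]

# HYPOTHETICAL population proportions per combination (all > 0, sum to 1).
POP_PROBS = [0.002, 0.840, 0.010, 0.020, 0.001, 0.115, 0.012]

# HYPOTHETICAL breakdown of one sample of 100 into the same combinations.
SAMPLE_COUNTS = [0, 79, 2, 3, 0, 14, 2]


def log_multinomial_pmf(counts, probs):
    """Log-probability of `counts` under a multinomial with positive `probs`."""
    log_p = math.lgamma(sum(counts) + 1)
    for k, p in zip(counts, probs):
        log_p += k * math.log(p) - math.lgamma(k + 1)
    return log_p


def draw_counts(n, probs, rng):
    """Simulate one multinomial sample of size n and return the cell counts."""
    tally = Counter(rng.choices(range(len(probs)), weights=probs, k=n))
    return [tally[i] for i in range(len(probs))]


def multinomial_mc_pvalue(observed, probs, n_sim=2000, seed=1):
    """Monte Carlo p-value: share of simulated tables that are no more
    probable under `probs` than the observed table."""
    rng = random.Random(seed)
    n = sum(observed)
    obs_lp = log_multinomial_pmf(observed, probs)
    hits = 0
    for _ in range(n_sim):
        sim = draw_counts(n, probs, rng)
        if log_multinomial_pmf(sim, probs) <= obs_lp + 1e-9:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_sim + 1)


print(multinomial_mc_pvalue(SAMPLE_COUNTS, POP_PROBS))
```

Because each respondent falls into exactly one category, the counts sum to the sample size, which is the structure a multinomial test requires. With real data, the proportions and counts would come from tabulating each respondent's full set of languages.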
