Background:
This question is based on one asked on the statistics Stack Exchange site, CrossValidated.SE, here. Alas, a rigorous answer to the statistical question seems to require enormous computational resources, so I gather the only way forward is a stochastic simulation instead.
Hence this question on Mathematica.SE.
How would one use Mathematica and its curated geographic and census data to determine how many randomly chosen Americans are needed for a 50% chance that at least two of them live a) in the same state or b) in the same or an adjacent state?
One can determine the population of each state with:

    WolframAlpha["US state population table",
     {{"PropertyRanking:USStateData", 1}, "QuantityData"},
     PodStates -> {"PropertyRanking:USStateData__More",
       "PropertyRanking:USStateData__More",
       "PropertyRanking:USStateData__More"}]

One can determine the adjacency matrix of the states (or the undirected graph $g$) using GeoData and neighbor queries, as described in this analogous problem with the counties of Florida:

    counties=EntityList[US counties in Florida (administrative divisions)];

and

    Cases[GeoNearest["USCounty", counties[[16]]], Except[counties[[16]]]]

(Alas, the generalization to states within the US does not seem to work directly.)
So the approach to part b) would be to run a large simulation: choose $n=2$ people at random, with probabilities proportional to the state populations, and find the percentage of trials in which these two people live in the same or adjacent states. Surely for $n=2$ this will be a small number, say $1\%$. Then repeat with $n=3$, $n=4$, and so on, each time checking whether at least two of the $n$ people live in the same or adjacent states, until the probability reaches roughly $50\%$. Is there a more efficient approach?
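To make the simulation concrete, here is a minimal sketch on hypothetical toy data rather than the real state populations and borders. The region names, weights, and adjacency are made up for illustration, and the helper names `related`, `hit`, `estimate`, and `smallestN` are my own, not anything built in. I assume the real populations and the real graph $g$ could be substituted for `weights` and `g`.

```mathematica
(* Hypothetical toy data: four regions with made-up population weights. *)
weights = <|"A" -> 40, "B" -> 30, "C" -> 20, "D" -> 10|>;

(* Hypothetical adjacency: a path A - B - C, with D isolated. *)
g = Graph[Keys[weights], {UndirectedEdge["A", "B"], UndirectedEdge["B", "C"]}];

(* Look up each region's row/column index in the adjacency matrix. *)
adj = AdjacencyMatrix[g];
idx = AssociationThread[VertexList[g] -> Range[VertexCount[g]]];

(* related[x, y] is True when x and y are the same or adjacent regions. *)
related[x_, y_] := x === y || adj[[idx[x], idx[y]]] == 1;

(* Draw n people, each region weighted by population, and test whether
   any pair of them lives in the same or adjacent regions. *)
hit[n_] := AnyTrue[
   Subsets[RandomChoice[Values[weights] -> Keys[weights], n], {2}],
   related @@ # &];

(* Monte Carlo estimate of the hit probability for a group of size n. *)
estimate[n_, trials_ : 10^4] := N[Count[Table[hit[n], trials], True]/trials];

(* Smallest n whose estimated probability reaches 50%. *)
smallestN[] := NestWhile[# + 1 &, 2, estimate[#] < 0.5 &];
```

Evaluating `smallestN[]` steps through $n=2, 3, \dots$ as described above. Replacing `g` with an edgeless graph on the same vertices, `Graph[Keys[weights], {}]`, reduces the check to part a).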
Given the population statistics in Wolfram's curated data and the neighboring-state data inherent in GeoData, what are the numerical values of the solutions to parts a) and b)?