$\begingroup$

For my analysis, I'm interested in a particular subset of a non-normally distributed population. I would therefore like to generate a sample from that population. The sample will have a drastically different mean/variance and has a "known" shape that may itself be normal, "seminormal" (looking normal-ish but not a smooth bell curve), or, more likely, something else entirely.

Currently, I am constructing my sample by taking multiple (arbitrarily constructed) subsamples of varying shapes and combining those until I get the desired "known" shape. I'm providing a quick sketch to illustrate the idea. The analysis is not actually about sleep; I just chose something to contextualize the units/shape.

Distribution Sketch

Hopefully you can see that if you combine subsamples 1, 2, and 3, you get something that approximates the Goal Distribution. I've drawn a random shape, but the goal could also be normal or seminormal.

With all that background, my question is simply: does this process have a name and/or is it reasonable? In case this is a very common thing that I'm just not aware of, I would love to know some packages in R/Python that can help with this since I don't just need one such sample, but multiple, each with different shapes.

From what I understand it is a similar concept to this post, but they are looking for a distribution that is at least similar and I am looking for something that is completely different.

$\endgroup$
  • $\begingroup$ Are subsamples 1-3 already collected from your sampling frame? What you have drawn as a "goal" distribution looks like an equal mixture distribution. Why not just combine a stratified sample from your 3 subsamples? $\endgroup$
    – AdamO
    Commented Mar 18 at 16:55
  • $\begingroup$ No, the order is reversed. I am generating the subsamples after the fact (so they can be anything I want) with the intent that their mixture will create the goal distribution. The subsamples are created by randomly selecting observations, e.g. randomly selecting 10k observations in [75, 150] for subsample 1. $\endgroup$
    – Linkray
    Commented Mar 18 at 17:15

1 Answer

$\begingroup$

In earlier posts it was established that you can sample from $\mathcal{F}$ (the distribution of your population or parent frame) to generate a subsample with distribution $\mathcal{F}^*$ by assigning sampling probabilities according to the likelihood ratio, i.e. sample observation $x_i$ with probability $\pi(x_i) = f^*(x_i)/f(x_i)$. (Note that I am calling these "probabilities" because most sampling algorithms will in fact scale these weights to become probabilities, but more precisely it's $\pi(x_i) = \dfrac{f^*(x_i)/f(x_i)}{\sum_{j=1}^n f^*(x_j)/f(x_j)}$.)
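As a rough illustration of the likelihood-ratio weighting described above, here is a minimal Python sketch using only the standard library. Everything concrete in it is an assumption made for the example, not something from the question: the parent population is taken to be exponential with mean 100 (so $f$ is known exactly), the target $f^*$ is taken to be normal with mean 110 and standard deviation 15, and the helper names (`normal_pdf`, `exponential_pdf`, `resample_to_target`) are hypothetical. It resamples with replacement; in practice $f$ would usually have to be estimated.

```python
import math
import random
import statistics


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution (assumed target shape f*)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def exponential_pdf(x, mean):
    """Density of an exponential distribution (assumed population shape f)."""
    return math.exp(-x / mean) / mean if x >= 0 else 0.0


def resample_to_target(population, f, f_star, k, rng):
    """Draw k observations (with replacement) with probability
    proportional to the likelihood ratio f*(x_i) / f(x_i)."""
    weights = [f_star(x) / f(x) for x in population]
    total = sum(weights)
    probs = [w / total for w in weights]  # scale weights to probabilities
    return rng.choices(population, weights=probs, k=k)


rng = random.Random(0)

# Hypothetical non-normal parent population: exponential with mean 100.
population = [rng.expovariate(1 / 100) for _ in range(50_000)]

sample = resample_to_target(
    population,
    f=lambda x: exponential_pdf(x, 100.0),
    f_star=lambda x: normal_pdf(x, 110.0, 15.0),
    k=5_000,
    rng=rng,
)

# The resample should look roughly like the target, not like the population.
print(statistics.mean(sample), statistics.stdev(sample))
```

The same function works for any target shape, as long as you can evaluate $f^*$ and $f$ at each observation and the population has enough observations wherever the target puts mass.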

If you have more than one subsample (call your 3 densities $f$, $g$, and $h$), it becomes a multinomial probability problem, so that the sampling probabilities are proportional to $\pi(x_i) \propto [\, f^*(x_i)/f(x_i),\ f^*(x_i)/g(x_i),\ f^*(x_i)/h(x_i) \,]$, i.e. the more likely the observation is to belong to sample $f$, $g$ or $h$, the closer the probability is to 1.
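One possible reading of the three-subsample case can be sketched as follows: pool the subsamples, weight each observation by the target density divided by the density of the subsample it came from, and resample from the pool. This is only an interpretation, and the setup is assumed for illustration: the three subsample densities are taken to be normal with means 90, 110, and 130 (standard deviation 20), the target is normal with mean 110 and standard deviation 10, and each subsample density covers the whole range of the target, which this weighting relies on.

```python
import math
import random
import statistics


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


rng = random.Random(1)

# Hypothetical subsample densities f, g, h as (mean, sd) pairs.
sources = [(90.0, 20.0), (110.0, 20.0), (130.0, 20.0)]
# Hypothetical target density f* as a (mean, sd) pair.
target = (110.0, 10.0)

pooled, weights = [], []
for mu, sigma in sources:
    for _ in range(10_000):
        x = rng.gauss(mu, sigma)
        pooled.append(x)
        # Weight = target density over the density of the originating subsample.
        weights.append(normal_pdf(x, *target) / normal_pdf(x, mu, sigma))

# Scale the weights so they sum to 1, then resample with replacement.
total = sum(weights)
probs = [w / total for w in weights]
goal_sample = rng.choices(pooled, weights=probs, k=5_000)

print(statistics.mean(goal_sample), statistics.stdev(goal_sample))
```

If the subsamples instead cover disjoint ranges (for example, uniform draws on separate intervals), the weights would also need to account for how many observations each subsample contributes, so treat this as a starting point rather than a general recipe.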

$\endgroup$
  • $\begingroup$ Thanks, I think this is close to the idea I'm trying to capture. From your explanation, I understand it to say I don't actually need multiple subsamples, I can just assign weights and generate the desired sample in the first pass? $\endgroup$
    – Linkray
    Commented Mar 18 at 17:08
