Biased Sampling from a Non-Normal Dataset

Question

For my analysis, I'm interested in a particular subset of a non-normally distributed population. I would therefore like to generate a sample from that population. The sample will have a drastically different mean and variance and has a "known" shape that may itself be normal, "seminormal" (looking normal-ish but not a smooth bell curve), or, more likely, something else entirely.

Currently, I am constructing my sample by taking multiple (arbitrarily constructed) subsamples of varying shapes and combining those until I get the desired "known" shape. I'm providing a quick sketch to illustrate the idea. The analysis is not actually about sleep, I just chose something to contextualize the units/shape.

Hopefully you can see that if you combine subsamples 1, 2, and 3 you will get something that approximates the Goal Distribution. I've drawn a random shape, but the goal could also be normal or seminormal.

With all that background, my question is simply: does this process have a name and/or is it reasonable? In case this is a very common thing that I'm just not aware of, I would love to know some packages in R/Python that can help with this since I don't just need one such sample, but multiple, each with different shapes.

From what I understand it is a similar concept to this post, but they are looking for a distribution that is at least similar and I am looking for something that is completely different.

Are subsamples 1-3 already collected from your sampling frame? What you have drawn as a "goal" distribution looks like an equal mixture distribution. Why not just combine a stratified sample from your 3 subsamples? — AdamO, Commented Mar 18 at 16:55
No, the order is reversed. I am generating the subsamples after the fact (and thus, can be anything I want) with the intent that their mixture will create the goal distribution. The subsamples are created by randomly selecting observations, e.g. randomly selecting 10k observations between [75,150] for subsample 1 — Linkray, Commented Mar 18 at 17:15

AdamO · Accepted Answer · 2024-03-18 17:03:07Z

In earlier posts it was established that you can sample from $\mathcal{F}$ (the distribution of your population or parent frame) to generate a subsample with distribution $\mathcal{F}^*$ by assigning sampling probabilities according to the likelihood ratio, i.e. sample observation $x_i$ according to probability $\pi(x_i) = f^*(x_i)/f(x_i)$. (Note I am calling these "probabilities" because most sampling algorithms will in fact scale these weights to become probabilities, but more precisely it's $\pi(x_i) = \frac{f^*(x_i)/f(x_i)}{\sum_{j=1}^n f^*(x_j)/f(x_j)}$.)
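As a rough illustration of the weighting above, here is a minimal Python sketch. Everything specific in it is an assumption made for the example, not something from the question: the gamma-shaped population, the normal target with mean 120 and standard deviation 15, and the helper names `normal_pdf` and `histogram_density`. Since $f$ is usually unknown in practice, the sketch estimates it with a simple histogram; a kernel density estimate would be another option.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parent population: right-skewed, clearly non-normal.
population = rng.gamma(shape=2.0, scale=50.0, size=200_000)


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2); stands in for the target density f*."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


def histogram_density(values, query, bins=200):
    """Crude estimate of the density of `values`, evaluated at `query`."""
    heights, edges = np.histogram(values, bins=bins, density=True)
    # Map each query point to the bin that contains it.
    idx = np.clip(np.searchsorted(edges, query, side="right") - 1, 0, bins - 1)
    return heights[idx]


# f(x_i): estimated population density at each observation.
f_hat = histogram_density(population, population)
# f*(x_i): assumed target density at each observation.
f_star = normal_pdf(population, mu=120.0, sigma=15.0)

# Likelihood-ratio weights, scaled so they sum to one.
weights = f_star / f_hat
probs = weights / weights.sum()

# Weighted resample whose shape should approximate the target.
sample = rng.choice(population, size=10_000, replace=True, p=probs)
print(sample.mean(), sample.std())
```

Sampling with `replace=True` keeps the draw probabilities fixed at `probs`; drawing without replacement is also possible but distorts the shape once the requested size gets close to the number of population members in the target's range.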

If you have more than one subsample (call your 3 densities $f$, $g$, and $h$), it becomes a multinomial probability problem so that the sampling probabilities are proportional to: $\pi(x_i) \propto [\, f^*(x_i)/f(x_i),\ f^*(x_i)/g(x_i),\ f^*(x_i)/h(x_i) \,]$, i.e. the more likely the observation is to belong to sample $f$, $g$ or $h$, the closer the probability is to 1.
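One possible reading of the multi-subsample case, sketched in Python: pool the subsamples, weight each observation by $f^*$ divided by the density of the subsample it came from, and resample from the pool. This is an interpretation, not necessarily the exact scheme intended above. The three normal densities standing in for $f$, $g$, $h$, the normal target, and the helper `normal_pdf` are all hypothetical, and the sketch assumes each subsample density covers the whole range of the target (otherwise overlapping regions get over-represented).

```python
import numpy as np

rng = np.random.default_rng(1)


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


# Hypothetical subsample densities f, g, h as (mean, sd) pairs.
sources = {"f": (100.0, 40.0), "g": (150.0, 40.0), "h": (200.0, 40.0)}
# Hypothetical target density f* as a (mean, sd) pair.
target = (150.0, 15.0)

values, weights = [], []
for name, (mu, sigma) in sources.items():
    x = rng.normal(mu, sigma, size=20_000)  # one subsample
    values.append(x)
    # Ratio of the target density to the density of this observation's source.
    weights.append(normal_pdf(x, *target) / normal_pdf(x, mu, sigma))

pooled = np.concatenate(values)
w = np.concatenate(weights)
probs = w / w.sum()  # scale the weights into sampling probabilities

resample = rng.choice(pooled, size=10_000, replace=True, p=probs)
print(resample.mean(), resample.std())
```

With known densities the weights are exact; with real data each source density would have to be estimated first, which adds noise to the weights.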

Thanks, I think this is close to the idea I'm trying to capture. From your explanation, I understand it to say I don't actually need multiple subsamples, I can just assign weights and generate the desired sample in the first pass? — Linkray, Commented Mar 18 at 17:08
