A survey collected a complex sample designed to provide estimates at the national level. Because of stratification and clustering, individuals from one state were more likely to be sampled than those from another, with certain groups intentionally oversampled for better representation.
The final dataset looks as follows:
ID | PSU (Primary Sampling Unit) | Stratum | Sampling Weight | Age | Ethnicity | Income |
---|---|---|---|---|---|---|
001 | 102 | A | 1.5 | 35 | Hispanic | 65000 |
002 | 203 | B | 2.0 | 45 | Caucasian | 40000 |
003 | 102 | A | 1.5 | 28 | Asian | 90000 |
... | ... | ... | ... | ... | ... | ... |
Given that I have information about the location of each individual, I want to add a column `state` and regress (at the individual level) `Income` on `state` and other variables, to estimate the influence of location on salaries.
Naturally, the regression would take the unit (sampling) weights into account, using a package for the analysis of complex samples, such as the `survey` package from R.
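A minimal sketch of such a design-based regression with the `survey` package might look like the following. The toy data, the column names `Weight` and `state`, the state labels `S1`/`S2`, and the income model are all made up for illustration; only the columns shown in the table above come from the actual dataset. It assumes PSU identifiers are unique across strata.

```r
library(survey)

# Hypothetical toy data shaped like the table above, plus a `state` column.
set.seed(1)
n <- 200
dat <- data.frame(
  ID        = seq_len(n),
  PSU       = rep(101:120, each = 10),   # 20 clusters of 10 respondents
  Age       = sample(20:65, n, replace = TRUE),
  Ethnicity = rep(c("Hispanic", "Caucasian", "Asian"), length.out = n)
)
dat$Stratum <- ifelse(dat$PSU <= 110, "A", "B")     # PSUs nested in strata
dat$Weight  <- ifelse(dat$Stratum == "A", 1.5, 2.0) # sampling weights
dat$state   <- ifelse(dat$PSU %% 2 == 0, "S1", "S2") # assumed location column
dat$Income  <- 40000 + 500 * dat$Age +
  5000 * (dat$state == "S2") + rnorm(n, sd = 8000)

# Declare the design: clusters, strata, and weights.
des <- svydesign(ids = ~PSU, strata = ~Stratum, weights = ~Weight, data = dat)

# Weighted regression with design-based standard errors.
fit <- svyglm(Income ~ state + Age + Ethnicity, design = des)
summary(fit)
```

Here `state` enters only as a covariate; the design object still describes the sample through `PSU`, `Stratum`, and the weights, so the standard errors reflect the clustering and stratification rather than treating respondents as independent.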
What are the implications of that, considering the sampling design?
`state`, add it on. There aren't any negative consequences of this.