I have cross-sectional data (for four different time periods) and I am trying to create a pseudo-panel based on four different variables. However, after trying the code (see below) I get more observations than I had in the beginning (from 4687 observations across the different datasets, now I have 202245 observations), which does not make sense as individuals with the same values on the four variables ("sex", "age_group", "education", "sector_eu") should appear in one row.
You can find a shorter version of my datasets here (as I can not post them directly in this forum as with less than 100 observations per dataset the code wouldn't work).
With only 100 observations per dataset, the resulting number of the pseudo-panel is of 12 observations, so in this case we don't have the problem of more observations (as described above).
A similar question to this one was asked in this post, from which I have the code snippet that I use to create the pseudo-panel:
list_df <- list(DT15_sl, dt17_sl,DATA17_sl,dt18_sl)
result <- Reduce(function(x, y)
merge(x, y, by=c("sex", "age_group", "education", "sector_eu")),
lapply(list_df, function(x)
{names(x)[10:14] <- paste(names(x)[10:14], x$year_month[4], sep = '_');x[-4]}))
result
This is what result looks like:
sex age_group education sector_eu ESTU.x PROV.x date.x sector_eu_without.x
1 1 35-39 4 Non-tradable sectors 3113 8 2015-09-30 <NA>
2 1 35-39 4 Non-tradable sectors 3113 8 2015-09-30 <NA>
3 2 25-29 4 Non-tradable sectors 3113 8 2015-09-30 <NA>
4 2 25-29 4 Non-tradable sectors 3113 8 2015-09-30 <NA>
5 2 80+ 1 Import-oriented sectors 3113 8 2015-09-30 Import-oriented sectors
6 2 80+ 1 Import-oriented sectors 3113 8 2015-09-30 Import-oriented sectors
7 2 80+ 1 Import-oriented sectors 3113 8 2015-09-30 Import-oriented sectors
8 2 80+ 1 Import-oriented sectors 3113 8 2015-09-30 Import-oriented sectors
9 2 80+ 1 Import-oriented sectors 3113 8 2015-09-30 Import-oriented sectors
10 2 80+ 1 Import-oriented sectors 3113 8 2015-09-30 Import-oriented sectors
11 2 80+ 1 Import-oriented sectors 3113 8 2015-09-30 Import-oriented sectors
12 2 80+ 1 Import-oriented sectors 3113 8 2015-09-30 Import-oriented sectors
indep_2015-09 ideology_2015-09 identity_2015-09 income_2015-09 work_sit_fact_2015-09 ESTU.y PROV.y
1 0 5 3 7 Currently working 3191 8
2 0 5 3 3 Currently working 3191 8
3 0 5 3 7 Currently working 3191 8
4 0 5 3 7 Currently working 3191 8
5 0 3 1 6 Retired 3191 8
6 0 3 1 6 Retired 3191 8
7 0 NA 4 4 Retired 3191 8
8 0 NA 4 4 Retired 3191 8
9 1 NA 3 5 Retired 3191 8
10 1 NA 3 5 Retired 3191 8
11 0 6 1 NA Retired 3191 8
12 0 6 1 NA Retired 3191 8
date.y sector_eu_without.y indep_2017-10 ideology_2017-10 identity_2017-10 income_2017-10
1 2017-10-02 <NA> 0 4 3 4
2 2017-10-02 <NA> 0 4 3 4
3 2017-10-02 <NA> NA NA 5 3
4 2017-10-02 <NA> NA NA 5 3
5 2017-10-02 Import-oriented sectors 1 6 3 NA
6 2017-10-02 Import-oriented sectors 1 6 4 4
7 2017-10-02 Import-oriented sectors 1 6 3 NA
8 2017-10-02 Import-oriented sectors 1 6 4 4
9 2017-10-02 Import-oriented sectors 1 6 3 NA
10 2017-10-02 Import-oriented sectors 1 6 4 4
11 2017-10-02 Import-oriented sectors 1 6 3 NA
12 2017-10-02 Import-oriented sectors 1 6 4 4
work_sit_fact_2017-10 ESTU.x PROV.x date.x sector_eu_without.x indep_2017-12 ideology_2017-12
1 Unemployed 3202 8 2017-12-28 <NA> 0 1
2 Unemployed 3202 8 2017-12-28 <NA> 0 1
3 Currently working 3202 8 2017-12-28 <NA> 0 5
4 Currently working 3202 8 2017-12-28 <NA> 1 2
5 Retired 3202 8 2017-12-28 Import-oriented sectors NA NA
6 Retired 3202 8 2017-12-28 Import-oriented sectors NA NA
7 Retired 3202 8 2017-12-28 Import-oriented sectors NA NA
8 Retired 3202 8 2017-12-28 Import-oriented sectors NA NA
9 Retired 3202 8 2017-12-28 Import-oriented sectors NA NA
10 Retired 3202 8 2017-12-28 Import-oriented sectors NA NA
11 Retired 3202 8 2017-12-28 Import-oriented sectors NA NA
12 Retired 3202 8 2017-12-28 Import-oriented sectors NA NA
identity_2017-12 income_2017-12 work_sit_fact_2017-12 ESTU.y PROV.y date.y
1 3 9 Currently working 3226 8 2018-10-01
2 3 9 Currently working 3226 8 2018-10-01
3 3 NA Currently working 3226 8 2018-10-01
4 3 NA Currently working 3226 8 2018-10-01
5 3 NA Retired 3226 8 2018-10-01
6 3 NA Retired 3226 8 2018-10-01
7 3 NA Retired 3226 8 2018-10-01
8 3 NA Retired 3226 8 2018-10-01
9 3 NA Retired 3226 8 2018-10-01
10 3 NA Retired 3226 8 2018-10-01
11 3 NA Retired 3226 8 2018-10-01
12 3 NA Retired 3226 8 2018-10-01
sector_eu_without.y indep_2018-10 ideology_2018-10 identity_2018-10 income_2018-10
1 <NA> NA 5 3 NA
2 <NA> NA 5 3 NA
3 <NA> NA NA 3 NA
4 <NA> NA NA 3 NA
5 Import-oriented sectors 0 3 5 NA
6 Import-oriented sectors 0 3 5 NA
7 Import-oriented sectors 0 3 5 NA
8 Import-oriented sectors 0 3 5 NA
9 Import-oriented sectors 0 3 5 NA
10 Import-oriented sectors 0 3 5 NA
11 Import-oriented sectors 0 3 5 NA
12 Import-oriented sectors 0 3 5 NA
work_sit_fact_2018-10
1 Currently working
2 Currently working
3 Currently working
4 Currently working
5 Retired
6 Retired
7 Retired
8 Retired
9 Retired
10 Retired
11 Retired
12 Retired
As you can see, the first two rows should be one individual as they have the same values on "sex", "age_group", "education" and "sector_eu". Looking at the values of these four variables, "result" should only contain 3 observations (instead of 12) as there are only 3 different unique combinations of the four variables.
What am I doing wrong or missing?
Further, if I had two additional variables in two of the four datasets (e.g. in DATA17_sl and DT15_sl but not in dt18_sl and dt17_sl), let's say in position 15 and 16, and I wanted to include them in the pseudo-panel as well, how would I have to change my code?
I thank you in advance for every help you can provide.