Reshaping/transposing data in Stata

Question

I have a table with the first column case IDs and the following columns matched control IDs. Some control IDs will be paired to more than one case ID. There's an index date which corresponds to all IDs in that row. I need to reshape my data so that I have all case and control IDs in one column, with the index date preserved. (Additionally I need a column which flags any IDs from the 'CaseA' column as 1 - however this isn't the part I'm struggling with.)

CaseA	ControlB1	ControlB2	ControlB3	ControlB4	IndexDate
ID_x	ID_1	ID_2	ID_3	ID_4	1/1/01
ID_y	ID_5	ID_2	ID_4	ID_6	1/2/01
ID_z	ID_1	ID_5	ID_7	ID_8	1/5/01

IDs	IndexDate	Case/control
ID_x	1/1/01	1
ID_1	1/1/01	0
ID_2	1/1/01	0
ID_3	1/1/01	0
ID_4	1/1/01	0
ID_y	1/2/01	1
ID_5	1/2/01	0
ID_2	1/2/01	0
ID_4	1/2/01	0
ID_6	1/2/01	0

... and so on.

I have looked into transpose and have been trying approaches with reshape:

reshape long CaseA ControlB*, i(IndexDate)

But I don't think the data is quite in the right format for this.

Nick Cox · Accepted Answer · 2024-07-05 15:04:48Z

The layout (you say "format") is fine. You just need some renaming first. Furthermore, the code misses the point that reshape long needs stub names, not variable names. That is, variables to be stacked vertically will ideally have the same prefix, as when foo1 and foo2 have the same prefix foo, but foo and bar need renaming. (There are other acceptable inputs, but your names don't match that either.)

Stata tip: the spreadsheet terminology of rows and columns is better replaced by talking of observations and variables.

You've presented your date variable as if it were string. It may well be, and certainly should be, a numeric variable with a date display format.

* Example generated by -dataex-. For more info, type help dataex
clear
input str5(casea controlb1 controlb2 controlb3 controlb4) str6 indexdate
"ID_1"  "ID_2"  "ID_3"  "ID_4"  "ID_5"  "1/1/01"
"ID_6"  "ID_7"  "ID_8"  "ID_9"  "ID_10" "1/2/01"
"ID_11" "ID_12" "ID_13" "ID_14" "ID_15" "1/5/01"
end

rename (casea-controlb4) (id=)

reshape long id, i(indexdate) j(which) string 

gen case_control = strpos(which, "case") > 0 

list

. * Example generated by -dataex-. For more info, type help dataex
. clear

. input str5(casea controlb1 controlb2 controlb3 controlb4) str6 indexdate

         casea  controlb1  controlb2  controlb3  controlb4  indexdate
  1. "ID_1"  "ID_2"  "ID_3"  "ID_4"  "ID_5"  "1/1/01"
  2. "ID_6"  "ID_7"  "ID_8"  "ID_9"  "ID_10" "1/2/01"
  3. "ID_11" "ID_12" "ID_13" "ID_14" "ID_15" "1/5/01"
  4. end

. 
. rename (casea-controlb4) (id=)

. 
. reshape long id, i(indexdate) j(which) string 
(j = casea controlb1 controlb2 controlb3 controlb4)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations                3   ->   15          
Number of variables                   6   ->   3           
j variable (5 values)                     ->   which
xij variables:
    idcasea idcontrolb1 ... idcontrolb4   ->   id
-----------------------------------------------------------------------------

. 
. gen case_control = strpos(which, "case") > 0 

. 
. list 

     +-----------------------------------------+
     | indexd~e       which      id   case_c~l |
     |-----------------------------------------|
  1. |   1/1/01       casea    ID_1          1 |
  2. |   1/1/01   controlb1    ID_2          0 |
  3. |   1/1/01   controlb2    ID_3          0 |
  4. |   1/1/01   controlb3    ID_4          0 |
  5. |   1/1/01   controlb4    ID_5          0 |
     |-----------------------------------------|
  6. |   1/2/01       casea    ID_6          1 |
  7. |   1/2/01   controlb1    ID_7          0 |
  8. |   1/2/01   controlb2    ID_8          0 |
  9. |   1/2/01   controlb3    ID_9          0 |
 10. |   1/2/01   controlb4   ID_10          0 |
     |-----------------------------------------|
 11. |   1/5/01       casea   ID_11          1 |
 12. |   1/5/01   controlb1   ID_12          0 |
 13. |   1/5/01   controlb2   ID_13          0 |
 14. |   1/5/01   controlb3   ID_14          0 |
 15. |   1/5/01   controlb4   ID_15          0 |
     +-----------------------------------------+

EDIT 5 July

This is a reshape long of your modified data. Whether the duplicates are problematic otherwise is a substantive question: you have some built-in dependence.

* Example generated by -dataex-. For more info, type help dataex
clear
input str4(casea controlb1 controlb2 controlb3 controlb4) str6 indexdate
"ID_x" "ID_1" "ID_2" "ID_3" "ID_4" "1/1/01"
"ID_y" "ID_5" "ID_2" "ID_4" "ID_6" "1/2/01"
"ID_z" "ID_1" "ID_5" "ID_7" "ID_8" "1/5/01"
end

reshape long controlb , i(casea indexdate) j(which) 

list, sepby(casea)

     +-------------------------------------+
     | casea   indexd~e   which   controlb |
     |-------------------------------------|
  1. |  ID_x     1/1/01       1       ID_1 |
  2. |  ID_x     1/1/01       2       ID_2 |
  3. |  ID_x     1/1/01       3       ID_3 |
  4. |  ID_x     1/1/01       4       ID_4 |
     |-------------------------------------|
  5. |  ID_y     1/2/01       1       ID_5 |
  6. |  ID_y     1/2/01       2       ID_2 |
  7. |  ID_y     1/2/01       3       ID_4 |
  8. |  ID_y     1/2/01       4       ID_6 |
     |-------------------------------------|
  9. |  ID_z     1/5/01       1       ID_1 |
 10. |  ID_z     1/5/01       2       ID_5 |
 11. |  ID_z     1/5/01       3       ID_7 |
 12. |  ID_z     1/5/01       4       ID_8 |
     +-------------------------------------+

Hi Nick, Thanks so much for your speedy help. I do still have a problem however, which I think is due to having duplicated IDs in the controlb1-b4 columns, and differing indexdates for these. (Some controls were matched to multiple cases, and the indexdate is decided based on the case.) Is there any way around this? Thanks. Error: variable id does not uniquely identify the observations Your data are currently wide. You are performing a reshape long. You specified i (index_date) and j(which). In the current wide form, variable index_date should uniquely identify the observations. — Emma, Commented Jul 5 at 14:00
You need to revise your question so that you show a dataset that is complicated enough to show the problem and compact enough to show here. — Nick Cox, Commented Jul 5 at 14:32
That said, I fear that you're asking for a data layout that won't work for your data. I would post a version of this on Statalist where you're much more likely to get attention from medical statisticians. — Nick Cox, Commented Jul 5 at 14:44
Sure I've edited my question to reflect this. But I think you might be right, I have also been thinking I may need to look into a different data layout! — Emma, Commented Jul 5 at 14:46
I'm unfortunately unable to register for statalist as my version of stata is based on a server without internet access, and so cannot get the authkey to register or contact them! — Emma, Commented Jul 5 at 15:01

Collectives™ on Stack Overflow

Reshaping/transposing data in Stata

1 Answer 1

Not the answer you're looking for? Browse other questions tagged
stata
or ask your own question.

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Not the answer you're looking for? Browse other questions tagged stata or ask your own question.

Related

Not the answer you're looking for? Browse other questions tagged
stata
or ask your own question.