1

I have a table with the first column case IDs and the following columns matched control IDs. Some control IDs will be paired to more than one case ID. There's an index date which corresponds to all IDs in that row. I need to reshape my data so that I have all case and control IDs in one column, with the index date preserved. (Additionally I need a column which flags any IDs from the 'CaseA' column as 1 - however this isn't the part I'm struggling with.)

CaseA ControlB1 ControlB2 ControlB3 ControlB4 IndexDate
ID_x ID_1 ID_2 ID_3 ID_4 1/1/01
ID_y ID_5 ID_2 ID_4 ID_6 1/2/01
ID_z ID_1 ID_5 ID_7 ID_8 1/5/01
IDs IndexDate Case/control
ID_x 1/1/01 1
ID_1 1/1/01 0
ID_2 1/1/01 0
ID_3 1/1/01 0
ID_4 1/1/01 0
ID_y 1/2/01 1
ID_5 1/2/01 0
ID_2 1/2/01 0
ID_4 1/2/01 0
ID_6 1/2/01 0

... and so on.

I have looked into transpose and have been trying approaches with reshape:

reshape long CaseA ControlB*, i(IndexDate)

But I don't think the data is quite in the right format for this.

1 Answer 1

0

The layout (you say "format") is fine. You just need some renaming first. Furthermore, the code misses the point that reshape long needs stub names, not variable names. That is, variables to be stacked vertically will ideally have the same prefix, as when foo1 and foo2 have the same prefix foo, but foo and bar need renaming. (There are other acceptable inputs, but your names don't match that either.)

Stata tip: the spreadsheet terminology of rows and columns is better replaced by talking of observations and variables.

You've presented your date variable as if it were string. It may well be, and certainly should be, a numeric variable with a date display format.

* Example generated by -dataex-. For more info, type help dataex
clear
input str5(casea controlb1 controlb2 controlb3 controlb4) str6 indexdate
"ID_1"  "ID_2"  "ID_3"  "ID_4"  "ID_5"  "1/1/01"
"ID_6"  "ID_7"  "ID_8"  "ID_9"  "ID_10" "1/2/01"
"ID_11" "ID_12" "ID_13" "ID_14" "ID_15" "1/5/01"
end

rename (casea-controlb4) (id=)

reshape long id, i(indexdate) j(which) string 

gen case_control = strpos(which, "case") > 0 

list 
. * Example generated by -dataex-. For more info, type help dataex
. clear

. input str5(casea controlb1 controlb2 controlb3 controlb4) str6 indexdate

         casea  controlb1  controlb2  controlb3  controlb4  indexdate
  1. "ID_1"  "ID_2"  "ID_3"  "ID_4"  "ID_5"  "1/1/01"
  2. "ID_6"  "ID_7"  "ID_8"  "ID_9"  "ID_10" "1/2/01"
  3. "ID_11" "ID_12" "ID_13" "ID_14" "ID_15" "1/5/01"
  4. end

. 
. rename (casea-controlb4) (id=)

. 
. reshape long id, i(indexdate) j(which) string 
(j = casea controlb1 controlb2 controlb3 controlb4)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations                3   ->   15          
Number of variables                   6   ->   3           
j variable (5 values)                     ->   which
xij variables:
    idcasea idcontrolb1 ... idcontrolb4   ->   id
-----------------------------------------------------------------------------

. 
. gen case_control = strpos(which, "case") > 0 

. 
. list 

     +-----------------------------------------+
     | indexd~e       which      id   case_c~l |
     |-----------------------------------------|
  1. |   1/1/01       casea    ID_1          1 |
  2. |   1/1/01   controlb1    ID_2          0 |
  3. |   1/1/01   controlb2    ID_3          0 |
  4. |   1/1/01   controlb3    ID_4          0 |
  5. |   1/1/01   controlb4    ID_5          0 |
     |-----------------------------------------|
  6. |   1/2/01       casea    ID_6          1 |
  7. |   1/2/01   controlb1    ID_7          0 |
  8. |   1/2/01   controlb2    ID_8          0 |
  9. |   1/2/01   controlb3    ID_9          0 |
 10. |   1/2/01   controlb4   ID_10          0 |
     |-----------------------------------------|
 11. |   1/5/01       casea   ID_11          1 |
 12. |   1/5/01   controlb1   ID_12          0 |
 13. |   1/5/01   controlb2   ID_13          0 |
 14. |   1/5/01   controlb3   ID_14          0 |
 15. |   1/5/01   controlb4   ID_15          0 |
     +-----------------------------------------+

EDIT 5 July

This is a reshape long of your modified data. Whether the duplicates are problematic otherwise is a substantive question: you have some built-in dependence.

* Example generated by -dataex-. For more info, type help dataex
clear
input str4(casea controlb1 controlb2 controlb3 controlb4) str6 indexdate
"ID_x" "ID_1" "ID_2" "ID_3" "ID_4" "1/1/01"
"ID_y" "ID_5" "ID_2" "ID_4" "ID_6" "1/2/01"
"ID_z" "ID_1" "ID_5" "ID_7" "ID_8" "1/5/01"
end

reshape long controlb , i(casea indexdate) j(which) 

list, sepby(casea)

     +-------------------------------------+
     | casea   indexd~e   which   controlb |
     |-------------------------------------|
  1. |  ID_x     1/1/01       1       ID_1 |
  2. |  ID_x     1/1/01       2       ID_2 |
  3. |  ID_x     1/1/01       3       ID_3 |
  4. |  ID_x     1/1/01       4       ID_4 |
     |-------------------------------------|
  5. |  ID_y     1/2/01       1       ID_5 |
  6. |  ID_y     1/2/01       2       ID_2 |
  7. |  ID_y     1/2/01       3       ID_4 |
  8. |  ID_y     1/2/01       4       ID_6 |
     |-------------------------------------|
  9. |  ID_z     1/5/01       1       ID_1 |
 10. |  ID_z     1/5/01       2       ID_5 |
 11. |  ID_z     1/5/01       3       ID_7 |
 12. |  ID_z     1/5/01       4       ID_8 |
     +-------------------------------------+

7
  • Hi Nick, Thanks so much for your speedy help. I do still have a problem however, which I think is due to having duplicated IDs in the controlb1-b4 columns, and differing indexdates for these. (Some controls were matched to multiple cases, and the indexdate is decided based on the case.) Is there any way around this? Thanks. Error: variable id does not uniquely identify the observations Your data are currently wide. You are performing a reshape long. You specified i (index_date) and j(which). In the current wide form, variable index_date should uniquely identify the observations.
    – Emma
    Commented Jul 5 at 14:00
  • You need to revise your question so that you show a dataset that is complicated enough to show the problem and compact enough to show here.
    – Nick Cox
    Commented Jul 5 at 14:32
  • That said, I fear that you're asking for a data layout that won't work for your data. I would post a version of this on Statalist where you're much more likely to get attention from medical statisticians.
    – Nick Cox
    Commented Jul 5 at 14:44
  • Sure I've edited my question to reflect this. But I think you might be right, I have also been thinking I may need to look into a different data layout!
    – Emma
    Commented Jul 5 at 14:46
  • I'm unfortunately unable to register for statalist as my version of stata is based on a server without internet access, and so cannot get the authkey to register or contact them!
    – Emma
    Commented Jul 5 at 15:01

Not the answer you're looking for? Browse other questions tagged or ask your own question.