1

Hello I want to use National Science Foundation dataset but the raw excel file variable names does not transpose the data properly. Does anyone have any sample code on how to transpose the dataset so that it fit the stata format for analysis. Here is the raw excel file so you can understand the problem.

Raw Excel File from NSF website

2
  • 3
    Sorry, but we need much more than this. Most questions here focus on the OP's code and pose a specific problem based on that. SO is not a code-writing service. or even an advice-providing bureau.
    – Nick Cox
    Commented Oct 21, 2021 at 18:01
  • 1
    In any case your link is NOT a link to the file. It's an image.
    – Nick Cox
    Commented Oct 21, 2021 at 18:03

1 Answer 1

2
+50

*EDITED to add view of import

Keeping in mind Nick's apt point about the nature of SO, I do have some example code that may be helpful here. Without knowing what you mean by "transpose" I cannot give an exact answer, but you can adapt the reshape commands below to your purposes.

Import and view

// Import Excel File defining TL and BR of table
    import excel "nsb20197-tabs02-012.xlsx", cellrange(A6:K177) clear
    list in 1/15

     +----------------------------------------------------------------------------------------------------------------------------------------+
     |                                A       B       C       D      E                   F       G       H       I                   J      K |
     |----------------------------------------------------------------------------------------------------------------------------------------|
  1. |                Associate's level                                                                                                       |
  2. | American Indian or Alaska Native                                                                                                       |
  3. |                       All fields    6282    4131    2151   65.8                34.2    8935    5697    3238                63.8   36.2 |
  4. |                              S&E     608     386     222   63.5                36.5     942     495     447                52.5   47.5 |
  5. |                      Engineering      17       4      13   23.5                76.5      50       8      42                  16     84 |
     |----------------------------------------------------------------------------------------------------------------------------------------|
  6. |                 Natural sciences     384     220     164   57.3                42.7     453     174     279                38.4   61.6 |
  7. |   Social and behavioral sciences     207     162      45   78.3                21.7     439     313     126                71.3   28.7 |
  8. |                          Non-S&E    5674    3745    1929     66                  34    7993    5202    2791   65.09999999999999   34.9 |
  9. |        Asian or Pacific Islander                                                                                                       |
 10. |                       All fields   27313   15522   11791   56.8                43.2   54809   30916   23893                56.4   43.6 |
     |----------------------------------------------------------------------------------------------------------------------------------------|
 11. |                              S&E    2649    1284    1365   48.5                51.5    7862    3492    4370                44.4   55.6 |
 12. |                      Engineering     160      23     137   14.4   85.59999999999999     574     111     463                19.3   80.7 |
 13. |                 Natural sciences    2010     939    1071   46.7                53.3    4419    1562    2857                35.3   64.7 |
 14. |   Social and behavioral sciences     479     322     157   67.2                32.8    2869    1819    1050                63.4   36.6 |
 15. |                          Non-S&E   24664   14238   10426   57.7                42.3   46947   27424   19523                58.4   41.6 |



Structuring the data

// Create supercategories
    
    * level = Column A if column A contains the word level (ignoring case)
    gen level = word(A,1) if ustrregexm(A, "level", 1), before(A)
    
    * demographic = Column A if next ob in column A contains the word all fields (ignoring case)
    gen demographic = A if ustrregexm(A[_n+1], "all fields", 1), before(A)
    
    * fill down demographic and level, and drop blank rows
    foreach v of varlist level demographic {
        replace `v' = `v'[_n-1] if missing(`v')
    }
    drop if mi(demographic) | demographic == A | regexm(A, level)
    
// rename variables 

    * rename A
    rename A field
    
    * rename count columns
    local list "B C D E F" 
    local year = 2000
    rename (`list') (all_`year' female_`year' male_`year' perc_female_`year' perc_male_`year' )
    
    local list "G H I J K" 
    local year = 2017
    rename (`list') (all_`year' female_`year' male_`year' perc_female_`year' perc_male_`year' )
    
    * destring 
    destring *_2000 *_2017, replace

Reshaping to long

    * reshape long
    drop perc*
    reshape long all_ male_ female_, i(level demographic field) j(year)
    
    rename *_ degrees*
    reshape long degrees,  i(level demographic field year) j(gender) string

An example of how one might reshape wide by field

    * test reshape wide by field + ensure the variable name is less than 32 characters post reshape 
    replace field = lower(strtoname(field))
    replace field = substr(field, 1, 32 - strlen("degrees") - 1) 
    reshape wide degrees, i(level demographic year gender) j(field) string
3
  • 1
    One technique here which is more general is that in Excel files it is often implied by an empty cell that the first non-empty cell above applies. Stata has no idea of anything similar but copying values downwards is easy enough. See also stata.com/support/faqs/data-management/replacing-missing-values
    – Nick Cox
    Commented Oct 22, 2021 at 14:57
  • No-one should be alarmed by numbers like 85.59999999999999 as Stata will do its level best to regard that as 85.6. Surprise or not, 0.6 can't be held exactly in binary.
    – Nick Cox
    Commented Oct 22, 2021 at 15:31
  • Thank you all for the feedback and suggestions to improve my comments on stack overflow. Commented Oct 27, 2021 at 0:09

Not the answer you're looking for? Browse other questions tagged or ask your own question.