SDA IN THE REPOSITORY
Repository Fringe 2016
2016-08-2
Laine Ruus, University of Edinburgh, EDINA and Data Library
laine.ruus@ed.ac.uk or laine.ruus@utoronto.ca
OUTLINE
 Weaknesses of current repository systems
 What is SDA?
 What SDA does for researchers
 What SDA does for teachers
 What SDA does for repositories
 SDA and sensitive data
 Why repositories need SDA or equivalent
 Homework
WEAKNESSES OF CURRENT REPOSITORY SYSTEMS
 Pass-through systems
 User takes all or nothing in many cases
 Metadata – one size fits all?
 SDA won’t provide a solution to all these issues
for all data, but can resolve some problems
for some types of data
WHAT IS SDA?
 SDA stands for Survey Documentation and Analysis
 SDA does for encoded numeric data what Windows Media Player, Photoshop, etc. do for
sound, video and image files: the software's job is rendering, our job is interpretation
 Winner of the following awards:
 American Association for Public Opinion Research (AAPOR): Warren J. Mitofsky Innovators
Award
 American Political Science Association (APSA): Best Instructional Software Award
Unless you are Russell Crowe, or John Forbes
Nash…
… you probably can’t make
much sense
of this
We make sense of numeric
microdata by creating summary
descriptive statistics
- and by generating inferential
statistics to establish and
describe relationships among
characteristics
SDA can do all of the above…
SO WHAT IS SDA?
 a server-side application, accessed through any forms-capable web browser (IE,
Firefox, Chrome, etc)
 a user-friendly interface, with lots of context-specific help screens.
 provides statistical analysis capability for microdata, and to a certain extent, for
aggregate and time-series data,
 generates descriptive and inferential statistics, manipulates data, and generates basic
visualisations of the content of numeric data
 provides "slice-and-dice" access to numeric data
 University of Edinburgh Data Library has installed an SDA server accessible from:
http://www.ed.ac.uk/information-services/research-support/data-library
WHAT SDA DOES FOR RESEARCHERS:
 all metadata about a variable can be consolidated in one location
 univariate descriptive statistics, with/without standard measures of shape, variance, skewness, etc
 multivariate descriptive statistics, with/without standard measures of central tendency, dispersion,
significance
 inferential statistics: comparison of means, correlations and regressions (multivariate, logit or probit)
 recode variables, and/or compute new variables, and share them with others (or not)
 analyse with/without control and/or filter variables
 compute 90%, 95%, or 99% confidence intervals (asymmetric)
 turn weighting off/on (it is on by default, if weight variables are defined)
 compute design effects (deft) for complex sample surveys with stratum/cluster variables
 download either the whole dataset or a bespoke subset (including recoded/computed variables) for
analysis in other software
 basic data visualisations, such as histograms, pie charts, line charts
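To illustrate what turning weighting off or on changes in a univariate table, here is a minimal Python sketch. It is not SDA code: the function name, the `wt` field and the toy cases are all invented for illustration, and it assumes only that a weighted percentage sums case weights instead of counting cases.

```python
from collections import defaultdict

def percent_distribution(records, variable, weight=None):
    """Return {category: percent} for one variable.

    If `weight` names a field, each case counts by that weight;
    otherwise every case counts once (weighting turned off).
    """
    totals = defaultdict(float)
    for rec in records:
        totals[rec[variable]] += rec[weight] if weight else 1.0
    grand_total = sum(totals.values())
    return {cat: 100.0 * t / grand_total for cat, t in totals.items()}

# Invented toy cases, not real survey data; "wt" is a hypothetical weight field.
cases = [
    {"health": "good", "wt": 1.0},
    {"health": "good", "wt": 1.0},
    {"health": "bad", "wt": 2.0},
]
unweighted = percent_distribution(cases, "health")
weighted = percent_distribution(cases, "health", weight="wt")
```

In this toy example the single "bad" case carries twice the weight of the others, so its share rises from one third of cases to half of the weighted total.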
WHAT SDA DOES FOR TEACHERS:
 an accessible source of data for exercises/assignments
 teach numeracy (e.g. how to read tables) without having to teach software
 teach introductory and intermediate level statistics without having to teach software
 teach the difference between simple random sample (SRS) and complex sample designs and
how they affect measures (design effects (deft)).
 saved output files contain information about variables, recodes and computes, control, filter,
stratum and/or cluster, and weight variables to document what the student did
 variable recodes and new, computed variables, can be shared with students or other
researchers/teachers
 a vehicle for distance education, without software licensing issues
 a vehicle with which to share your own data with other researchers or with a class.
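The SRS versus complex-sample point can be made concrete with a small Python sketch of a design effect, taking deft as the ratio of the standard error under the complex design to the standard error under SRS. This is an illustration under stated assumptions, not a description of how SDA computes it: it assumes a single stratum, equal weights, and a with-replacement cluster variance estimator; all function names and values are invented.

```python
import math

def srs_standard_error(values):
    """Standard error of the mean, treating cases as a simple random sample."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance / n)

def cluster_standard_error(clusters):
    """Standard error of the mean from between-cluster variation.

    `clusters` is a list of lists, one inner list of values per cluster.
    Assumes one stratum, equal weights, clusters sampled with replacement.
    """
    values = [v for cluster in clusters for v in cluster]
    n = len(values)
    k = len(clusters)
    mean = sum(values) / n
    # Sum of deviations from the overall mean within each cluster.
    totals = [sum(v - mean for v in cluster) for cluster in clusters]
    variance = (k / (k - 1)) * sum(t ** 2 for t in totals) / n ** 2
    return math.sqrt(variance)

def deft(clusters):
    """Design effect on the standard error: complex SE divided by SRS SE."""
    values = [v for cluster in clusters for v in cluster]
    return cluster_standard_error(clusters) / srs_standard_error(values)

# Invented values: cases within a cluster are identical, so clustering hurts.
homogeneous = deft([[1.0, 1.0], [3.0, 3.0]])
```

When cases within a cluster resemble each other, as in the toy data, deft exceeds 1: the sample carries less information than an SRS of the same size.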
WHAT SDA DOES FOR REPOSITORIES:
 access to numeric data as a source for descriptive/inferential statistics, without
requiring users to have expensive/hard to learn statistical analysis software
 access to data with all relevant variable-level metadata in one interface
 stores a generic-format data file and DDI-compliant metadata file, as well as
syntax files for ingesting the data into SAS, SPSS, and/or Stata (ie a long-term
preservation format)
 access to data without having to remove sensitive variables
 can be configured to provide only pre-defined tables (cross-tabulations)
 can be configured to allow users to load their own data files
 provide access to enhanced version(s) of data files, to facilitate analysis
SDA AND SENSITIVE DATA
 for sensitive data, SDA is FISMA-moderate compliant:
 individual variables or variable combinations can be embargoed,
 cell count limits can be imposed,
 downloading data and listing cases can be disabled, etc.
 analysis with control and/or filter variables can be disabled
 for additional capabilities, see http://sda.berkeley.edu/man40h/disclosure.htm
 account and password protection at the file level
 IP-address range protection at the file level
 for even more sensitive data, SDA Quick Tables facility allows making available only pre-defined tables
 Ie, SDA provides privacy protection at the point of analysis, not at the point of ingest
 the repository can store the full dataset, and provide access to a ‘sanitized’ version without maintaining
separate versions
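The idea of protecting privacy at the point of analysis can be sketched in Python with a cell count limit: the stored cases stay complete, and only the published table is restricted. This is a hypothetical illustration, not SDA's implementation; the function names, the `region` field, the threshold of 3 and the toy cases are all invented.

```python
from collections import Counter

def crosstab_counts(records, row, col):
    """Count cases in each (row category, column category) cell."""
    return Counter((rec[row], rec[col]) for rec in records)

def apply_cell_limit(counts, min_count):
    """Blank out (None) every cell holding fewer than min_count cases."""
    return {cell: (n if n >= min_count else None) for cell, n in counts.items()}

# Invented toy cases; the stored data keep every case.
cases = (
    [{"health": "good", "region": "A"}] * 5
    + [{"health": "bad", "region": "A"}] * 2
    + [{"health": "good", "region": "B"}] * 4
)
published = apply_cell_limit(crosstab_counts(cases, "health", "region"), min_count=3)
```

Here the two-case cell is withheld from the output while all eleven cases remain in the stored file. A real disclosure rule might instead withhold the whole table.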
SDA AND SENSITIVE DATA – CHECK IT OUT
 individual statistical procedures can be disabled
 eg: http://stats.datalib.edina.ac.uk/sdaweb/analysis/?dataset=sda_test
 use of control and filter variables can be disabled
 eg: http://stats.datalib.edina.ac.uk/sdaweb/analysis/?dataset=sda_test
 individual variables or variable combinations can be embargoed
 eg: Scottish school leavers survey, 1981 – any variables labelled ‘not public’
 downloading and listing cases can be disabled, etc.
 eg: http://stats.datalib.edina.ac.uk/sdaweb/analysis/?dataset=sda_test
 account and password protection
 eg: Growing up in Scotland, cohort 1, sweep 6, 2010-2011 (subset)
 Quick Tables:
 eg: http://sda.berkeley.edu:8080/quicktables/quickconfig.do?datasetKey=gss04
WHY REPOSITORIES NEED SDA OR EQUIVALENT
 Be responsive to the needs of your users, ie those researchers/students who will
eventually use the data in your repository
 Encourage secondary usage of numeric data by providing
 enhanced, DDI-compliant metadata in one location
 ‘slice-and-dice’ functionality
 analytic functionality
 Minimise the work involved in privacy-proofing human/corporate-based data, and
checking it, on the part of the researcher, as well as yourself
 The full utility of a dataset should not be compromised – in time, those legal privacy
protections for human-based data will expire. Store the whole dataset, just proscribe the
analyses – your grandchildren and great-grandchildren will thank you!
SDA ISN’T FOR EVERY DATASET
 Since 2008, the Univ. of Edinburgh DataShare repository has ingested the following types
of files:
collection (1)
dataset (237) – most of these are not numeric data in this sense
Image (1)
image (974)
interactive resource (3)
moving image (26)
software (11)
sound (153)
text (8)
 Nor do you necessarily need your own SDA server – you can piggy-back
on someone else’s.
HOMEWORK: MAKE LIKE A RESEARCHER - 1
 U Edinburgh’s SDA server: http://stats.datalib.edina.ac.uk/sda/
 Univariate descriptive statistics
 Q: did the UK population in 2011 perceive itself to be in good/bad health?
 Dataset: Census microdata teaching file, 2011
 Row variable: health
 Output options: summary statistics
 Chart options: line chart or bar chart
HOMEWORK: MAKE LIKE A RESEARCHER - 2
 Cross-tabulation (bivariate descriptive statistics)
 Q: which UK country in 2011 had most people in excellent health?
 Dataset: Census microdata teaching file, 2011
 Row variable: health; Column variable: country
 Output options: summary statistics
 Cross-tabulation (bivariate descriptive statistics)
 Q: which UK country in 2011 had most people in very bad health?
 Dataset: Census microdata teaching file, 2011
 Row variable: health; Column variable: country
 Output options: cell contents: Percentaging - Row
 Output options: summary statistics
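The two exercises above differ mainly in the direction of percentaging. The following Python sketch shows the difference on invented data; it is not SDA code, the countries are replaced by placeholder labels "A" and "B", and the function name is hypothetical.

```python
from collections import Counter

def crosstab_percent(records, row, col, by="column"):
    """Return {(row value, column value): percent}.

    by="column": each column sums to 100 (compare columns within a row).
    by="row":    each row sums to 100 (compare rows within a column).
    """
    counts = Counter((rec[row], rec[col]) for rec in records)
    totals = Counter()
    for (row_value, col_value), n in counts.items():
        totals[col_value if by == "column" else row_value] += n
    return {
        (row_value, col_value): 100.0 * n / totals[col_value if by == "column" else row_value]
        for (row_value, col_value), n in counts.items()
    }

# Invented toy cases; "A" and "B" stand in for countries.
cases = (
    [{"health": "good", "country": "A"}] * 3
    + [{"health": "bad", "country": "A"}]
    + [{"health": "good", "country": "B"}]
    + [{"health": "bad", "country": "B"}]
)
column_pct = crosstab_percent(cases, "health", "country", by="column")
row_pct = crosstab_percent(cases, "health", "country", by="row")
```

In the toy table, column percentages show "bad" health at 25% in A and 50% in B, while row percentages show the "bad" cases split evenly between A and B. The two views answer different questions, so the choice should match the question being asked.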
HOMEWORK: MAKE LIKE A RESEARCHER - 3
 Cross-tabulation (bivariate descriptive statistics) - 3
 Q: might there be an association between socio-economic status and perceived health?
 Dataset: Census microdata teaching file, 2011
 Row variable: health; Column variable: socgrd, Control variable: country
 Output options: cell contents: Percentaging - Column
 Output options: summary statistics
 Chart options: type of chart: bar chart
 Q2: what other characteristics that are available in this dataset might have an association with
perceived health?
HOMEWORK: MAKE LIKE A RESEARCHER - 4
 Comparison of means
 Q: might there be an association between cultural/material possessions in the home, and
enjoying maths? In what direction? How is Scotland different?
 Dataset: PISA 2012: student questionnaire data set
 Dependent variable: cultpos; Row variable: st29q04; Column variable: nc; Selection filter:
nc(82610,82620,12400,75200)
 Output options: SRS std errs, Z/T-statistic, P-value, ANOVA stats
 Chart option: bar chart or line chart
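A comparison of means with a selection filter amounts to: keep the cases whose filter variable has one of the listed codes, then average the dependent variable within each row-by-column cell. The Python sketch below shows that logic. The variable names and filter codes come from the exercise, but the function, the record values and the excluded code 99999 are invented, and nothing here reflects real PISA results or SDA's internals.

```python
from collections import defaultdict

def mean_by_group(records, dependent, groups, keep=None):
    """Mean of `dependent` within each combination of the `groups` fields.

    `keep` is an optional predicate acting as the selection filter.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        if keep is not None and not keep(rec):
            continue
        key = tuple(rec[g] for g in groups)
        sums[key] += rec[dependent]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Filter codes as listed in the exercise: nc(82610,82620,12400,75200).
selected = {82610, 82620, 12400, 75200}

# Invented toy records, not PISA data.
toy = [
    {"cultpos": 0.5, "st29q04": 1, "nc": 82620},
    {"cultpos": 1.5, "st29q04": 1, "nc": 82620},
    {"cultpos": -0.5, "st29q04": 4, "nc": 12400},
    {"cultpos": 9.0, "st29q04": 4, "nc": 99999},  # dropped by the filter
]
means = mean_by_group(
    toy, "cultpos", ("st29q04", "nc"), keep=lambda rec: rec["nc"] in selected
)
```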
HOMEWORK: MAKE LIKE A RESEARCHER - 5
 Regression
 Q: do gender and father’s socio-economic class have an effect on success in school, measured
as the number of O-level A-C grades achieved?
 Dataset: Scottish school leavers (1980) survey, 1981
 Dependent variable: totoac; Independent variables: sex(d:1), dadclass(m:50)
 Sample design: SRS
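For readers who want to see what a regression with a categorical predictor involves, here is a minimal ordinary least squares sketch in plain Python. It does not model SDA's `sex(d:1)` or `dadclass(m:50)` specifications, whose exact meaning should be checked in the SDA help screens. It assumes only that a two-category predictor enters as a 0/1 indicator alongside a numeric predictor. The predictor names and all values are invented.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...].

    `rows` holds one list of predictor values per case (no intercept column).
    """
    x = [[1.0] + list(r) for r in rows]
    p = len(x[0])
    xtx = [[sum(xi[j] * xi[k] for xi in x) for k in range(p)] for j in range(p)]
    xty = [sum(xi[j] * yi for xi, yi in zip(x, y)) for j in range(p)]
    return solve(xtx, xty)

# Invented toy cases: (indicator, numeric score) per case; outcome built
# exactly as 2 + 1*indicator + 0.5*score so the fit is easy to check.
predictors = [(0, 0), (1, 0), (0, 2), (1, 2), (0, 4)]
outcome = [2.0, 3.0, 3.0, 4.0, 4.0]
intercept, b_indicator, b_score = ols(predictors, outcome)
```

The coefficient on the 0/1 indicator is the expected difference in the outcome between the two categories, holding the other predictor constant.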
WHAT QUESTIONS DO YOU HAVE?
