Preservation of Astronomical Data
Arnold Rots
Smithsonian Astrophysical Observatory
Virtual Astronomical Observatory
Context
  • I am Archive Astrophysicist for the Chandra Data Archive (CDA)
      • Chandra is an X-ray telescope, one of NASA’s Great Observatories
      • The CDA is operated by the Smithsonian Astrophysical Observatory under contract with NASA
      • That already makes two separate federal masters
      • As such, the CDA is one of the NASA astrophysics data centers
  • But I am also the lead for Data Curation and Preservation for the Virtual Astronomical Observatory (VAO)
      • The CDA is compliant with VAO standards
      • The VAO is a member of the International Virtual Observatory Alliance (IVOA)
The Astronomical Data Universe
  • International Virtual Observatory Alliance
  • Virtual Astronomical Observatory (USA)
  • Chandra Data Archive
The Other Data Universe
  • Smithsonian Institution
  • Smithsonian Astrophysical Observatory
  • Chandra Data Archive
Our Complex Relations
  • As an observatory data archive we have multiple federal masters:
      • Smithsonian Institution
      • NASA
      • NSF (through VAO)
  • And more non-federal masters:
      • IVOA
      • Interoperability standards and protocols
      • US user community
      • International user community
  • Bright points:
      • No privacy issues
      • No national security issues
      • No commercial value
The Smithsonian Side
  • SI is going digital
  • There is a world of difference between collections and scientific research data
  • On the one hand, our experience is valuable for the non-research units
  • On the other hand, there are legitimate differences in approach and requirements
Virtual Observatory
  • Objective:
      • Make all astronomical data seamlessly accessible
      • Provide analysis tools
  • This requires:
      • Interoperability standards: metadata and protocols
      • Ubiquity of the FITS data file format helps
      • Development of general tools
  • IVOA
      • Standards authority
      • Collaborative consortium of national VO organizations
  • VAO
      • USA member of the IVOA
      • NSF- and NASA-funded collaboration of nine institutions
VAO: Virtual Astronomical Observatory
  • Standards and protocols
      • Collaborating in the IVOA framework
  • Tools development
      • Compliant with IVOA standards, US priorities, IVOA-coordinated
  • User support
      • Documentation, portal, …
  • Operations
      • Provides the necessary framework
  • Technology assessment
      • What technologies do we use/introduce?
  • EPO
  • Data Curation and Preservation
VAO DC&P Components
  • Mission/Observatory data archives
      • NASA centers: more or less OAIS- and TDR-compliant
      • For CDA, we keep it all (all versions, multiple copies) on spinning disk
  • Contributed datasets
      • This is a problem area
  • Bibliographic repository
      • The Astrophysics Data System (ADS) has the entire astronomical literature (excepting books) online for the entire international community
  • Semantic linking
      • Linking datasets with datasets, papers with papers, datasets with papers; another problem area
      • Dataset identifiers
      • Discovery tools
Ontology/Semantic Linking
  • Triplestore prototype discovery tool:
      • http://adslabs.harvard.edu/semantic/publications.html
  • Just a factoid from collecting bibliographic links:
      • The amount of Chandra data published each year is the equivalent of 5-6 years of observations
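To make the idea of semantic linking concrete, here is a minimal sketch of a triple store in plain Python. It is not the ADS prototype, and it does not reflect how that tool is implemented. The class, the predicate names (usesDataset, cites, derivedFrom), and the identifiers are all hypothetical, chosen only to illustrate linking datasets with datasets, papers with papers, and datasets with papers.

```python
class TripleStore:
    """Tiny in-memory store of (subject, predicate, object) triples.

    Illustrative only: names and predicates are hypothetical.
    """

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        """Record one link between two identifiers."""
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching the pattern; None is a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


def papers_using(store, dataset_id):
    """List the papers that declare a link to the given dataset."""
    return [s for s, _, _ in store.match(predicate="usesDataset", obj=dataset_id)]


# Hypothetical identifiers: two papers, two datasets.
store = TripleStore()
store.add("paper:A", "usesDataset", "dataset:obs-1")   # dataset with paper
store.add("paper:B", "usesDataset", "dataset:obs-1")
store.add("paper:B", "cites", "paper:A")               # paper with paper
store.add("dataset:obs-2", "derivedFrom", "dataset:obs-1")  # dataset with dataset

print(papers_using(store, "dataset:obs-1"))
```

A discovery tool built on links like these can answer questions such as "which papers used this observation?" with a single pattern match, which is why getting authors to supply the links matters so much.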
Challenges (and some solutions)
  • Data Management Plans
      • We have experience, we will provide guidance and support
  • Establish public repositories
      • If you want people to contribute datasets, you have to give them a place to put them; who is going to provide/run this?
  • Funding
      • DMPs should help here; beware of unfunded mandates
  • Make distributed repositories interoperate and transparently look like one
      • This is easier for the established mission/observatory archives than for repositories of contributed datasets
  • Define metadata requirements
      • Probably the most crucial challenge
More Challenges
  • Get users to contribute their datasets
      • Highly processed products; data behind the plots and figures
      • One has to make this easy: special tools needed
  • Get the users to provide adequate and correct metadata
      • Even harder, but also even more important *
  • Get users to provide links
      • Equally hard, but essential for semantic data discovery
  • These three need to be part of the manuscript submission process
  * The AVM standard for astronomical JPEGs is a step in the right direction
And More Challenges
  • Get the community to release research data
      • Not a problem for NASA-funded data
      • The NSF DMP requirement is a good step forward
      • The culture is changing, but private institutions are problematic
  • OAIS compliance will probably come around naturally
      • Agency IT standards and requirements
  • Certification as a Trusted Digital Repository
      • It is harder to convince people that this is a good thing
      • But it may come to be required at some point
Repository Requirements
  • A Trusted Digital Repository should provide these services:
      • Storage
      • Standardized metadata
      • Access
      • Authorization
      • Provenance
      • Curation
      • Authentication
      • Policy enforcement
Repository Requirements
  • Preservation metadata need to cover:
      • Authenticity
      • Original arrangement
      • Integrity
      • Chain of custody and history
      • Trustworthiness
  • Data processing, archiving, and all else will cease at some point, but there are three areas that will never be finished:
      • Preservation
      • User support
      • User interface development
  • Preservation of software?
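Two of the preservation metadata items above, integrity and chain of custody, lend themselves to a small illustration. The following is a minimal sketch under stated assumptions, not a description of how the CDA or any certified repository does it: it assumes integrity is checked with a SHA-256 checksum recorded at ingest, and that custody history is kept as an append-only list of events. The record layout, field names, and identifiers are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class PreservationRecord:
    """Hypothetical preservation metadata for one archived dataset."""
    dataset_id: str
    checksum: str                                      # integrity: digest at ingest
    custody_log: list = field(default_factory=list)    # chain of custody and history

    def record_event(self, actor: str, action: str) -> None:
        """Append a custody event; earlier events are never rewritten."""
        self.custody_log.append((actor, action))

    def verify(self, data: bytes) -> bool:
        """Check that the data still matches the checksum taken at ingest."""
        return sha256_hex(data) == self.checksum


# Hypothetical dataset contents and custody events.
payload = b"example dataset contents"
record = PreservationRecord("dataset:example-1", sha256_hex(payload))
record.record_event("archive-ingest", "received")
record.record_event("archive-ops", "copied to second storage site")

print(record.verify(payload))          # unchanged data passes the check
print(record.verify(payload + b"x"))   # altered data fails the check
```

Re-running the verification on every stored copy at regular intervals is one reason preservation is among the areas that never finish.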


Rots RDAP11 Data Archives in Federal Agencies
