Prototypes of pro-active approaches to support the
archiving of web references for scholarly
communications
Richard Wincewicz (1), Peter Burnhill (1)
& Herbert Van de Sompel (2)
(1) EDINA, University of Edinburgh, (2) Los Alamos National Laboratory
The Project Team
2013 – 2015, funded by the
Andrew W. Mellon
Foundation
• Los Alamos National Laboratory:
Research Library: Herbert Van de Sompel
Harihar Shankar, [Martin Klein, Rob Sanderson]
• University of Edinburgh:
Language Technology Group: Claire Grover,
Beatrice Alex, Colin Matheson, Richard Tobin, [Ke “Adam” Zhou]
EDINA * : Peter Burnhill, Muriel Mewissen (Project Manager),
Tim Stickland, Richard Wincewicz, [Neil Mayo]
Centre for Service Delivery & Digital Expertise
Overview
1. Introduction
2. Evidence
3. Remedy
1. Introduction
Reference Rot
Links to Web at Large resources are subject to
Reference Rot. This is a combination of two factors:
• Link Rot: Link stops working
• e.g. HTTP 404 “Not Found”
• Content Drift: Linked content changes over time
• Possibly to the extent that it is no longer
representative of the content that was initially
referenced
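The two factors can be told apart mechanically. The following is a minimal sketch, not the method used in the cited study: it assumes a copy of the content as originally referenced is available, and it treats any byte-level difference as drift, which is a crude proxy (a real measurement would need a similarity threshold). The function name and its parameters are hypothetical.

```python
import hashlib
from typing import Optional


def classify_reference(status_code: int,
                       original_body: Optional[bytes],
                       current_body: Optional[bytes]) -> str:
    """Label a reference as 'link rot', 'content drift', or 'intact'.

    status_code   -- HTTP status returned when the URI is requested today
    original_body -- content as it was when referenced (None if unknown)
    current_body  -- content returned today (None if nothing came back)
    """
    # Link rot: the link no longer works (e.g. HTTP 404 "Not Found").
    if status_code >= 400 or current_body is None:
        return "link rot"
    # Content drift: the link works, but the content has changed.
    # Assumption: an exact hash comparison stands in for a similarity test.
    if original_body is not None:
        before = hashlib.sha256(original_body).digest()
        after = hashlib.sha256(current_body).digest()
        if before != after:
            return "content drift"
    return "intact"
```

For example, `classify_reference(404, b"page", None)` is labelled as link rot, while a 200 response with changed bytes is labelled as content drift.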
2. Evidence
Articles that Link to Articles & to Web At Large Resources
(PMC)
Martin Klein et al. (2014) Scholarly context not found
http://dx.doi.org/10.1371/journal.pone.0115253
Articles that Link to Articles & to Web At Large Resources
(Elsevier)
Martin Klein et al. (2014) Scholarly context not found
http://dx.doi.org/10.1371/journal.pone.0115253
Articles with URI References (PMC)
Articles 479,194
with URI references 399,005
with URI references to articles 240,857
with URI references to Web at Large 156,160
Martin Klein et al. (2014) Scholarly context not found
http://dx.doi.org/10.1371/journal.pone.0115253
Link Rot (PMC)
Martin Klein et al. (2014) Scholarly context not found
http://dx.doi.org/10.1371/journal.pone.0115253
Link Rot (Elsevier)
Martin Klein et al. (2014) Scholarly context not found
http://dx.doi.org/10.1371/journal.pone.0115253
Links from arXiv, Elsevier, PMC to TLD Targets
Martin Klein et al. (2014) Scholarly context not found. In: PLOS ONE
http://dx.doi.org/10.1371/journal.pone.0115253
Grey is Link Rot – Referenced Content Not Accessible
Martin Klein et al. (2014) Scholarly context not found. In: PLOS ONE
http://dx.doi.org/10.1371/journal.pone.0115253
Grey is Not Archived - Referenced Content Lost
Martin Klein et al. (2014) Scholarly context not found. In: PLOS ONE
http://dx.doi.org/10.1371/journal.pone.0115253
Content Drift – http://dl00.org
[Figure: page snapshots labelled 2000, 2004, 2005 and 2008]
(a) Dynamic content: values on the webpage change over time
(b) Static content, but very different (often unrelated) web pages
3. Remedy
Create Snapshots of Referenced Resources
Various web archives support on-demand creation of
snapshots of URIs (manual, API):
• archive.today
• Internet Archive
• perma.cc
• webcitation.org
When creating snapshots, maintain:
• Original URI
• Snapshot URI
• Date/Time of snapshot
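The three items to maintain can be held together in one record. This is a minimal sketch; the class name, field names, and helper method are hypothetical and not part of any Hiberlink tool. The example values are the ones used on the Link Decorations slide.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SnapshotRecord:
    """What to keep for every snapshot that is created."""
    original_uri: str      # the URI that was referenced
    snapshot_uri: str      # where the archived copy lives
    captured_at: datetime  # timezone-aware date/time of the snapshot

    def version_date(self) -> str:
        # Render the capture time as a UTC calendar date (YYYY-MM-DD).
        return self.captured_at.astimezone(timezone.utc).strftime("%Y-%m-%d")


record = SnapshotRecord(
    original_uri="http://www.stanford.edu",
    snapshot_uri="http://archive.is/FAy6o",
    captured_at=datetime(2014, 8, 15, tzinfo=timezone.utc),
)
print(record.version_date())
```

Keeping the record immutable (`frozen=True`) is a design choice for the sketch: a snapshot describes a fixed moment, so its metadata should not change afterwards.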
Create Snapshots of Referenced Resources
Snapshots can be created at various stages. The closer to
the moment of referencing, the better the image captured.
Stage             Actor                           Snapshot Quality
Preparation       Author/reference tool           best
Submission/Issue  Editor/manuscript system        good
Publication       Aggregator/publisher platform   ok
Post-publication  Librarian/IR, journal archive   better than nothing
Authoring - Zotero Plugin Demonstrator
Richard Wincewicz (2014) Prototype Hiberlink plugin for Zotero for pro-active
archiving and temporal references
https://www.youtube.com/v/ZYmi_Ydr65M%26vq
Publication - OJS
Publication - HiberActive Service Demonstrator
Martin Klein et al. (2014) HiberActive: Pro-Active Archiving of web references from scholarly
articles
Open Repositories 2014 http://www.slideshare.net/martinklein0815/hiberactive
Reference Resources Robustly
When referencing resources, include:
• Original URI – Allows the user to revisit the URI as it
is at the time of reading, if the URI is still operational
• Snapshot URI – Allows the user to visit the snapshot,
if one was created, and if the web archive in which it
was created is still operational
• Date/Time – Together with the original URI, allows the
user to visit any snapshot created around the Date/Time
in any web archive around the world (using Memento
infrastructure)
(2015) Robust Links - Motivation
http://robustlinks.mementoweb.org/about/
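The Date/Time plus original URI combination maps onto the Memento protocol (RFC 7089): a client asks a TimeGate for the original URI and states the desired date in an Accept-Datetime header. The sketch below only builds such a request and does not send it. The TimeGate base URL is an assumption (the public Time Travel aggregator); any Memento TimeGate could be substituted, and the function name is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Assumption: the Memento Time Travel aggregator's TimeGate endpoint.
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"


def timegate_request(original_uri: str, when: datetime) -> Request:
    """Build (but do not send) a Memento TimeGate request.

    `when` should be timezone-aware; a naive datetime would be
    interpreted as local time by astimezone().
    """
    # Accept-Datetime uses the HTTP date format, always expressed in GMT.
    accept_datetime = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(TIMEGATE + original_uri,
                   headers={"Accept-Datetime": accept_datetime})


req = timegate_request("http://www.stanford.edu",
                       datetime(2014, 8, 15, tzinfo=timezone.utc))
print(req.full_url)
print(req.get_header("Accept-datetime"))
```

Passing the request to `urllib.request.urlopen` would follow the TimeGate's redirect to the snapshot closest to the requested date, if one exists in a participating archive.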
Reference Resources Actionably
When referencing resources, use Link Decorations to convey
Original URI, Snapshot URI, Date/Time
Link to the original URI, decorated with snapshot URI and date:
<a href="http://www.stanford.edu"
data-versionurl="http://archive.is/FAy6o"
data-versiondate="2014-08-15">
Link to the original URI, decorated with date only:
<a href="http://www.stanford.edu"
data-versiondate="2014-08-15">
Link to the snapshot URI, decorated with original URI and date:
<a href="http://archive.is/FAy6o"
data-originalurl="http://www.stanford.edu"
data-versiondate="2014-08-15">
Herbert Van de Sompel et al. (2015) Robust Links - Link Decorations
http://robustlinks.mementoweb.org/spec/
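A publishing platform could generate such decorated links programmatically. The sketch below follows the attribute naming of the Robust Links specification cited on this slide, where data-versionurl carries the snapshot URI on a link that points at the original URI. The function name and signature are hypothetical.

```python
from html import escape


def robust_link(original_uri: str, snapshot_uri: str,
                version_date: str, text: str) -> str:
    """Return an <a> element that links to the original URI and is
    decorated with the snapshot URI and the date of the snapshot."""
    return '<a href="{}" data-versionurl="{}" data-versiondate="{}">{}</a>'.format(
        escape(original_uri),   # escape() also quotes " so attributes stay valid
        escape(snapshot_uri),
        escape(version_date),
        escape(text),
    )


link = robust_link("http://www.stanford.edu", "http://archive.is/FAy6o",
                   "2014-08-15", "Stanford University")
print(link)
```

Escaping every value keeps the markup well-formed even when a URI contains characters such as `&` or `"`.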
Robust Links Using Link Decorations, JavaScript,
Memento API
Demo - http://robustlinks.mementoweb.org/demo/uri_references_js.html
robustlinks.js - https://github.com/mementoweb/robustlinks
Activate Robust Links
There are currently no Link Decorations, but there is an
article publication date:
• Express the article publication date in an actionable
manner (‘datePublished’ or ‘dateModified’
Schema.org properties) in HTML pages that contain
URI references
• Tailor robustlinks.js to exclude links to articles
• Inject robustlinks.js in HTML pages that contain URI
references
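To illustrate what "actionable" means here, the sketch below reads a publication date back out of a page that expresses it as Schema.org microdata. It assumes the date is carried in an itemprop attribute with the value in either a content or a datetime attribute, and it simply takes the first match; this is not how robustlinks.js itself is implemented, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class PublicationDateParser(HTMLParser):
    """Find the first datePublished/dateModified microdata value."""

    def __init__(self) -> None:
        super().__init__()
        self.date: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if self.date is None and attributes.get("itemprop") in (
                "datePublished", "dateModified"):
            # <meta> carries the value in "content", <time> in "datetime".
            self.date = attributes.get("content") or attributes.get("datetime")


def publication_date(html_text: str) -> Optional[str]:
    parser = PublicationDateParser()
    parser.feed(html_text)
    parser.close()
    return parser.date


page = '<html><head><meta itemprop="datePublished" content="2014-12-26"></head></html>'
print(publication_date(page))
```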
Users Follow Robust Links into Web
Archives
The combination of the referenced URI and the article
publication date:
• Leads users to a snapshot in a web archive, created
as close as possible to the article publication date
• Addresses link rot
• Addresses content drift
Create Archive Copies
When ingesting new content into the platform:
• Parse for URI references
• Create snapshots in web archives of select URIs
• For these URIs, use Link Decorations in HTML to
convey:
  • original URI
  • snapshot URI
  • snapshot Date/Time
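The first ingest step, parsing for URI references and selecting those worth archiving, could look like the sketch below. The selection rule is an assumption made for illustration: it keeps http(s) links and skips hosts that resolve to articles (here only DOI resolvers). Class, function, and constant names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urlparse

# Assumption: links to these hosts point at articles, not Web at Large resources.
ARTICLE_HOSTS = {"dx.doi.org", "doi.org"}


class UriReferenceParser(HTMLParser):
    """Collect href values of <a> elements that are candidates for archiving."""

    def __init__(self) -> None:
        super().__init__()
        self.uris: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        parsed = urlparse(href)
        # Keep absolute web links only; skip fragments, mailto:, and article hosts.
        if parsed.scheme in ("http", "https") and parsed.hostname not in ARTICLE_HOSTS:
            self.uris.append(href)


def web_at_large_references(html_text: str) -> List[str]:
    parser = UriReferenceParser()
    parser.feed(html_text)
    parser.close()
    return parser.uris


sample = ('<p><a href="http://dx.doi.org/10.1371/journal.pone.0115253">paper</a> '
          '<a href="http://www.stanford.edu">site</a> <a href="#top">top</a></p>')
print(web_at_large_references(sample))
```

Each URI returned would then be submitted to a web archive, and the resulting snapshot URI and date/time written back into the HTML as Link Decorations.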
Users Follow Robust Links into Web
Archives
The Link Decorations:
• Lead users to the created snapshot, if the web
archive is operational
• Lead users to a snapshot in any web archive, created
as close as possible to the snapshot Date/Time
• Address link rot
• Address content drift
http://hiberlink.org #hiberlink