Web Today, Gone Tomorrow? Transactional archiving of web content [Long Version]
- 1. Web Today, Gone Tomorrow?
=> transactional archiving of web content
Peter Burnhill
University of Edinburgh
Professional/Scholarly Publishing (PSP) Division, Association of American Publishers (AAP)
Washington DC, 1-4 February 2017
- 2. Overview: Good News & Bad News + Remedy
3 Good News on Archiving e-Journals
&
3 Red Alerts of Bad News on Integrity of e-Journals
& their authors, customers & readers
+
5 Options to improve integrity of published product
that will help their authors, customers & readers
+ Amber
Alerts
- 3. 12 ‘Dark Archive Nodes’
in long-lived research institutions
in 8 different countries/jurisdictions:
North America: Indiana, Rice,
Stanford, Virginia, OCLC; Alberta (Ca.)
Europe: Edinburgh (UK);
Humboldt (Ger); Cattolica dSC (It.)
Asia/Pacific: ANU; NII (Japan); UHK
Triggered 29 titles so far
[1.1 m downloads in 2016]
Triggered release at Stanford & EDINA
via OpenURLs to local library link-resolvers & CrossRef
CLOCKSS Archive Network
Library Stewardship: Global & Decentralized
not-for-profit joint venture
Board: 12 publishers & 12 libraries
Cross-sectoral
collaboration &
innovation
Stanford
TRAC Certified
- 4. ① Web-scale not-for-profit archiving agencies:
① National institutions (usually national libraries) …
① Consortia of university libraries & specialist centres …
National Science Library,
Chinese Academy of Sciences
1. We now have a variety of digital shelving
Good News: a lot of online e-journal content is being kept safe
Swiss National Library
… to discover who is looking after what
An established Global Monitor
thekeepers.org
2. We have means to search ‘holdings’ on digital shelves
12 ‘keepers’
(+ Swiss
National
Library)
Funded by:
Developed &
managed by:
on Title or ISSN, using
the ISSN Register
& ISSN-L as kernel field
- 6. Search for Origins of Life
… but coverage
of volumes is
partial & patchy
This e-journal is being archived
by 5 archiving agencies …
free to use @ thekeepers.org
- 8. 3. Good News: # Titles known to be archived is increasing
more archiving + more archives reporting into Registry!
Archived by:   … at least 1   … 3 or more
Dec 2013       22,196         8,618
Nov 2014       26,195         9,656
Dec 2015       29,663         10,710
Dec 2016       33,711         12,644
Kept Safer
Up by c.50% over past 3 years
Registry as ‘Observatory’: provide evidence on progress
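The ‘c.50%’ growth figure can be checked against the Registry counts on this slide. A minimal sketch in Python; the function name `pct_growth` is illustrative only and is not part of the Keepers Registry.

```python
def pct_growth(start: int, end: int) -> float:
    """Percentage increase from start to end, rounded to one decimal place."""
    return round((end - start) / start * 100, 1)


# Titles archived by at least 1 keeper, Dec 2013 vs Dec 2016.
at_least_one = pct_growth(22_196, 33_711)

# Titles archived by 3 or more keepers, same dates.
three_or_more = pct_growth(8_618, 12_644)

print(at_least_one, three_or_more)
```

Both columns grew by roughly half over the three years, which is what the slide summarises as ‘c.50%’.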
- 10. ‘e-journals’
‘book-length work’
conference proceedings
e-theses
Continuing Resources = ‘SERIALS’
(issued in Parts)
‘ONGOING INTEGRATING RESOURCES’
(changes over Time)
Updating websites,
repositories,
databases
Govt. publications ‘issued on web’
trade magazines,
etc.
ISSN assigned to:
‘e-newsmedia’
‘data as findings’
‘The Scholarly
Record ….’
+
Practical focus: what ISSN identifies as
‘continuing resource’ issued online
- 11. Big variation by Country of Publication, 2016
ISSN Count   In 1+ Archive   Country        Ingest Ratio %   KeepSafe Ratio %
187,445                      Globally       18.0             6.7
5295         3613            Netherlands    68.2             45.7
1109         671             Egypt          60.5             12.9
16624        7230            UK             43.5             22.7
33848        11123           USA            32.9             7.7
8167         1997            Germany        24.5             10.1
2048         409             Switzerland    20.0             4.1
902          133             Austria        14.8             6.5
731          73              Belgium        10.0             0.5
8054         797             Brazil         10.0             0.2
2895         278             Poland         9.6              1.0
1041         97              Sweden         9.3              2.6
414          54              Bulgaria       8.4              0.2
663          646             South Africa   8.1              2.9
9373         49              Canada         6.9              0.4
760          385             Mexico         6.4              0.4
6034         76              India          6.4              2.1
3+
Elsevier
Hindawi
Wiley etc
Springer
Karger
T&F, CUP etc
very low
KeepSafe
Ratios
Amber Alert
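The two ratios on this slide appear to be simple proportions of the ISSN count. The sketch below assumes that Ingest Ratio = titles in 1+ archive / ISSN count and KeepSafe Ratio = titles in 3+ archives / ISSN count; these definitions are inferred from the figures, not stated on the slide, and `ratio_pct` is a hypothetical helper name.

```python
def ratio_pct(archived: int, issn_count: int) -> float:
    """Share of ISSN-assigned titles that are archived, as a percentage (1 d.p.)."""
    return round(archived / issn_count * 100, 1)


# Netherlands row: 3613 of 5295 titles are in at least one archive.
print(ratio_pct(3_613, 5_295))

# Global figures, using the Dec 2016 Registry counts
# (33,711 titles in 1+ archive, 12,644 in 3 or more).
print(ratio_pct(33_711, 187_445))
print(ratio_pct(12_644, 187_445))
```

Under those assumed definitions the global results agree with the 18.0 and 6.7 shown in the ‘Globally’ row.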
- 12. Arts & Humanities
are very much
‘at risk’
‘elite’ Journals for some disciplines at risk
KeepSafe Ratio (%), from UK University submissions to Research Excellence Framework (REF 2014)
74.2  Agriculture, Veterinary and Food Science
74.2  Sociology
73.8  Economics and Econometrics
73.2  Earth Systems and Environmental Sciences
43.5  Theology and Religious Studies
41.1  History
39.1  Music, Drama, Dance and Performing Arts
38.3  Modern Languages and Linguistics
37.2  English Language and Literature
37.1  Law
17.6  Classics
STEM Journals well archived
Amber Alert
- 13. very many ‘at risk’ e-journals
from the “65% of publishers”:
the hardest to reach & work with
BIG publishers
act early but
incompletely
** Amber Alert **
a lot of Arts, Humanities,
Law & ‘applied’ literature
not being archived
STEM Journals
well archived
- 14. Progress as archiving agencies form a Keepers Network
to tackle that Long Tail and ensure completeness
=> Their recent Statement * endorsed by library community
• ARL + CARL + LIBER + RLUK + AUL
IARLA : International Alliance of Research Library Associations
• Ivy Plus Libraries Collections Group, USA
+ library groups in Canada, Australasia, South America and Europe
* ‘Working Together to Ensure the Future of the Digital Scholarly Record’
http://thekeepers.blogs.edina.ac.uk/keepers-extra/ensuringthefuture
=> Need support from Publishers & Publisher Associations
1. To read and endorse the Keepers Statement *
• be vocal to all publishers in your support of archiving agencies
• make it easier for archives to ingest your content & keep it safe
2. To double-check actual ingest of your content via Keepers Registry
- 15. Useful links in addition to thekeepers.org
‘Ensuring the Future of the Digital Scholarly Record’,
The Signal, Library of Congress
https://blogs.loc.gov/thesignal/2017/01/the-keepers-registry-ensuring-the-future-of-the-digital-scholarly-record/
‘Tales from The Keepers Registry: Serial Issues About Archiving & the Web’
Serials Review 39 2013
http://dx.doi.org/10.1016/j.serrev.2013.02.003
Author’s copy: https://www.era.lib.ed.ac.uk/handle/1842/6682
‘Helping to ensure ease & continuity of access to digital resources’
Digital Future and You: Library of Congress, Washington DC, 10 December, 2012
Burnhill-Keepers-LibCongress.pdf
‘Building a Social Compact for Preserving E-Journals’
The Serials Librarian 70, 2016
http://dx.doi.org/10.1080/0361526X.2016.1141630
Anne Kenney NASIG 2015, https://www.youtube.com/watch?v=03H376Npm0w
“that really great thing called the Keepers Registry.”
- 16. References to Content
Now The Bad News: 3 Red Alerts for Publishers
Threat to Integrity of scholarly publication => References to Content
=> Back into Scholarly Publications
Has ‘fixity’: DOI, ISSN; CLOCKSS, Portico, CrossRef, etc
E-Journal Archiving
=> Out onto the Web at Large
Dynamic, lacks fixity: URLs
‘Web today, gone tomorrow’
Reference Rot
#keepers #hiberlink
- 17. Project 2 years: March 2013 to June 2015
Funder Andrew W. Mellon Foundation
Partners University of Edinburgh
EDINA & Language Technology Group,
School of Informatics
Los Alamos National Laboratory
ambition
1. Define and measure the extent of ‘Reference Rot’
2. Scope possible intervention opportunities to stop the rot
we did that and went further to
3. Devise sustainable solutions capable of maximal reach
The aim today is to
4. Prompt action by those who can make a difference …
- 18. arXiv
Elsevier
corpus PMC
Dark solid lines represent URIs to Web-at-large, from 1997/2011
Red Alert 1 Scholarly Articles increasingly link to
Web Resources, not just back to other Articles
Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers from
Reference Rot. PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253
http://journals.plos.org/plosone/article?id=info:doi/10.1371/journal.pone.0115253
Data:
1.2m articles with URI references, of which 393,000 to ‘Wild Web’ => 1million URIs
- 19. Reference Rot = Link Rot + Content Drift
When what was referenced & cited
ceases to say the same thing, or ‘has ceased to be’
http://www.snorgtees.com/this-parrot-has-ceased-to-be
1. Link Rot: Link stops working
=> two questions about the
1 million URLs to Web-at-large
1. Do those links (URLs) still work?
- on the ‘Live Web’?
2. Is there a ‘Memento’
of that reference
in the ‘Archived Web’?
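The two questions above give four possible outcomes for any referenced URL. A minimal sketch of that classification follows; the category names are illustrative and are not taken from the Hiberlink studies, and the two boolean inputs would in practice come from a live-web check and a web-archive lookup. It covers link rot only and does not detect content drift.

```python
from enum import Enum


class RefStatus(Enum):
    LIVE_AND_ARCHIVED = "live, with a Memento"
    LIVE_NOT_ARCHIVED = "live, but not archived (at risk)"
    ROTTEN_BUT_ARCHIVED = "link rot, recoverable from a Memento"
    LOST = "link rot, no Memento (missing, presumed lost)"


def classify(is_live: bool, has_memento: bool) -> RefStatus:
    """Combine the answers to the two questions into one status."""
    if is_live:
        return RefStatus.LIVE_AND_ARCHIVED if has_memento else RefStatus.LIVE_NOT_ARCHIVED
    return RefStatus.ROTTEN_BUT_ARCHIVED if has_memento else RefStatus.LOST


# A link that no longer resolves and was never archived.
print(classify(is_live=False, has_memento=False).value)
```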
- 20. Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014)
Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot.
PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253 http://journals.plos.org/plosone/article?id=info:doi/10.1371/journal.pone.0115253
Within 14 days of publication date …
                                 PMC      Elsevier
‘Not Archived’                   74.5%    75.2%
Of those ‘Not Archived’:
  still ‘Live’ on the Web        80%      67.3%
  ‘No longer Live’ on the Web    20%      32.7%
Many ‘missing, presumed lost’
Most referenced URIs at risk of loss
Team at Harvard Law School established similar evidence
• 70% of the URLs within [law] journals & 50% of the URLs within U.S. Supreme Court opinions
… “do not produce the information originally cited.”
Jonathan Zittrain, Kendra Albert and Lawrence Lessig (2014). Perma: Scoping and Addressing the Problem of Link and
Reference Rot in Legal Citations. Legal Information Management 14. doi:10.1017/S1472669614000255.
Red Alert 2
Reference Rot is
already significant
- 21. Content Drift is even scarier! Red Alert 3
when what is at end of cited URL has changed, or gone!!
http://dl00.org
2000
http://dl00.org
2004
http://dl00.org
2005
http://dl00.org
2008
(a) Dynamic content
as values on the webpage
change over time
(b) Static content
but very different
(often unrelated) web pages
- 22. Content Drift (UK Web Archive, BL)
Andy Jackson (2015) Ten years of the UK web archive: what have we saved?
http://netpreserve.org/sites/default/files/attachments/2015_IIPC-GA_Slides_03_Jackson.pptx
After 2 years: 40% of URIs gone from the live web (link rot)
& 40% of URIs changed “unrecognizably” (content drift)
- 23. ‘Similarity’ of Representative Mementos & Live Web Content as at August 2015 by Year of Publication
655,000 Elsevier articles, 1997 to 2012
Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, et al. (2016) Scholarly Context Adrift: Three out
of Four URI References Lead to Changed Content. PLOS ONE 11(12): e0167475.
doi:10.1371/journal.pone.0167475 http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0167475
‘Similarity’ decreases over time
After 3 years, only ¼
of URIs lead
to unchanged content
+ increase in Link Rot
25%
* fresh evidence
on ‘Content Drift’ *
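Content drift can be thought of as the similarity between an archived snapshot and today's live page falling below 100%. The sketch below uses Python's standard `difflib.SequenceMatcher` purely as a stand-in; it is not the set of similarity measures used in the study cited on this slide, and the function names and threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher


def similarity(memento_text: str, live_text: str) -> float:
    """Similarity in [0, 1] between archived text and current live text."""
    return SequenceMatcher(None, memento_text, live_text).ratio()


def has_drifted(memento_text: str, live_text: str, threshold: float = 1.0) -> bool:
    """True when the live content is less similar than the chosen threshold.

    threshold=1.0 (an assumption) treats any change at all as drift.
    """
    return similarity(memento_text, live_text) < threshold


archived = "Digital Libraries 2000: conference programme and call for papers"
live_now = "This domain is for sale. Click here for great offers."
print(has_drifted(archived, archived), has_drifted(archived, live_now))
```

A lower threshold would tolerate cosmetic changes such as a new footer while still flagging a page that has been replaced by unrelated content.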
- 24. Only about 25% of referenced resources
in articles published in 2012
remain unchanged by August 2015
Confirmed in all 3 datasets (25% in each)
- 25. => Content of Citations Rot over Time!!
… leading to rotten references for the reader
Get Smell Out Copyright © 2017
- 26. Rot in References means a Defective Article!
undermines the integrity of the scholarly record
http://www.fao.org/wairdocs/tan/x5883e/x5883e01.htm
- 27. So what should we expect of the Publisher?
Beyond the assurance that
the fish / references / articles
sold are not rotten
Kind permission from Manchester Evening News
- 28. 5 Options to Remedy Reference Rot
Hint: Remedy for fish is ‘Quick Freeze & Store with Date Stamp’
Kind permission from Asia Quality Control
Always end on the +ve … !!
- 29. ① Take Snapshot of what is at end of URL
& put in safe place until needed by reader
• Various web archives support on-demand creation of
snapshots of URLs:
– archive.is / Internet Archive / perma.cc / webcitation.org
Archive-It @archiveitorg perma.cc @permacc
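As an illustration of on-demand snapshotting, the sketch below builds a capture request for the Internet Archive. The endpoint `https://web.archive.org/save/` followed by the target URL is an assumption based on the publicly visible ‘Save Page Now’ service; other archives have their own interfaces and terms, so check each service's current documentation. The function names are hypothetical, and `request_snapshot` is defined but not called here because it needs network access.

```python
import urllib.request

# Assumption: 'Save Page Now' style endpoint; not confirmed by the slides.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def snapshot_request_url(cited_url: str) -> str:
    """URL that asks the archive to capture cited_url now."""
    return SAVE_ENDPOINT + cited_url


def request_snapshot(cited_url: str, timeout: float = 30.0) -> str:
    """Send the capture request and return the final URL after redirects.

    Not called in this sketch: it performs a real network request.
    """
    with urllib.request.urlopen(snapshot_request_url(cited_url), timeout=timeout) as resp:
        return resp.geturl()


print(snapshot_request_url("http://dl00.org"))
```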
- 30. Decide where to intervene for best effect?
Activity                        Actor                            Snapshot Quality
1. Preparation                  Author/reference tool            best
2. Submission/Issue             Editor/manuscript system         good
3. Access (post-publication)    Aggregator/publisher platform    better late than not
4. Shelving                     Librarian/IR, journal archive    better than nothing
Need to put the means of re-creating fixity within
the software being used in each workflow
- 31. ‘Best’ would be to help authors do right thing
- at earliest moment of capture!
http://the-animals-biography.blogspot.co.uk/2014/04/kingfisher.html
- 33. • Preparation -> Study -> Compose -> Submission
=> Good News: something already exists …
• Hiberlink Project: EDINA developed code for Zotero [open source]
Note
University of Edinburgh now investigating how to assist doctoral
students with their references to web resources in e-theses
② Help the Author record their dependencies?
• ‘transactional archiving’ of referenced web content
• do it when noted & citation created
• OK, but how to effect change in note-taking software?
eg EndNote, Mendeley, Reference Manager, RefMe, Zotero
- 34. Need to create a time-based record of what an Author
regards as significant …
- 35. … or needs to provide as evidence!
Alexander Lexén
https://www.flydreamers.com/en/photo/alexander-lexen-s-fly-fishing-catch-of-a-european-brown-trout-fly-dreamers-pic291999
- 36. More Good News:
Metadata for the citation of that Snapshot
Three key elements should be recorded in the citation:
1. Original URL
2. Snapshot URL where the web content was archived
3. Date/Time when the snapshot was taken (& archived)
A proposed standard ‘Robust Links’ syntax is set out at
http://robustlinks.mementoweb.org/
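The three elements can be carried on an ordinary HTML link. The sketch below assumes the attribute names `data-originalurl`, `data-versionurl` and `data-versiondate` from the Robust Links proposal; verify them against the specification before relying on them. The helper name `robust_link` and the example snapshot URL are hypothetical.

```python
from datetime import date
from html import escape


def robust_link(original_url: str, snapshot_url: str, snapshot_date: date, text: str) -> str:
    """Render an HTML anchor carrying the three Robust Link elements."""
    return (
        f'<a href="{escape(original_url)}" '
        f'data-originalurl="{escape(original_url)}" '      # 1. Original URL
        f'data-versionurl="{escape(snapshot_url)}" '       # 2. Snapshot URL
        f'data-versiondate="{snapshot_date.isoformat()}">' # 3. Date of snapshot
        f"{escape(text)}</a>"
    )


# Hypothetical snapshot location, used for illustration only.
print(robust_link(
    "http://dl00.org",
    "https://archive.example/20000602/http://dl00.org",
    date(2000, 6, 2),
    "Digital Libraries 2000",
))
```

Keeping the original URL in `href` means the link still behaves normally for readers whose software knows nothing about the extra attributes.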
- 37. ③ Adapt the publisher process to ‘stop the rot’
• Submission -> Editing -> (Revision) -> Acceptance -> Issue
a) Publishers should create Snapshots in web archives
• Editors to use citations with the 3 Robust Link elements
b) Submission systems should accept citations submitted
with Robust Link syntax!
• Engage / amend / use ‘Robust Links’ syntax
=> Yet More Good News: something already exists …
Hiberlink Project: algorithm created for OJS [open source] ; code in GitHub
- 38. ④ Value in having ‘Hibernator’ Infrastructure
Publishing
platform
‘Hibernator’
External archival
service
e.g. Internet Archive,
Perma.cc
• Asynchronous - returns Hiberlink in Robust Link format
• Distributed - archived in different locations
• Lightweight - leveraging HTTP & what already exists
as middleware which simplifies interaction
between publisher systems & web archives
Note
University of Edinburgh is building the Hibernator for its doctoral
students to support references used as evidence in e-theses
- 39. ⑤ Act to help the Reader, given rot
Activity: 3. Access; Responsibility: Platform; Snapshot Quality: better late than not
Access/Post-Publication -> Reader Access -> Use
• Install ‘Link Decoration’: enable readers to employ Memento
to search web archives for content ‘around time of submission’
Finish on this Good News:
Herbert Van de Sompel et al. (2015) Robust Links - Link Decorations
http://robustlinks.mementoweb.org/spec/
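Looking up archived content ‘around time of submission’ relies on Memento datetime negotiation: the client sends an `Accept-Datetime` header to a TimeGate, which redirects to the closest snapshot. The sketch below only prepares such a request. The TimeGate base URL is an assumption (any Memento-compliant TimeGate could be substituted), and `timegate_request` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumption: a Memento aggregator TimeGate; substitute your preferred one.
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"


def timegate_request(original_url: str, when: datetime) -> tuple[str, dict[str, str]]:
    """Return (request URL, headers) asking for a Memento close to `when`.

    `when` must be timezone-aware; it is converted to GMT for the header.
    """
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE + original_url, {"Accept-Datetime": stamp}


url, headers = timegate_request("http://dl00.org", datetime(2015, 8, 1, tzinfo=timezone.utc))
print(url)
print(headers)
```

A link decoration script could take the date from the link's version-date attribute and pass it as `when`, so the reader lands on the snapshot nearest to what the author saw.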
- 40. Thank You: Questions Welcome
p.burnhill@ed.ac.uk
With kind permission from 'Feather Saturnfly' on flickr, All Rights Reserved
- 41. Useful links – that still work
Hiberlink.org
Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content.
PLOS ONE 11(12) doi:10.1371/journal.pone.0167475
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0167475
The Cobweb: Can the Internet be archived? New Yorker, Annals of Technology, January 2015
http://www.newyorker.com/magazine/2015/01/26/cobweb
The growing problem of Internet “link rot” and best practices for media and online publishers
https://journalistsresource.org/studies/society/internet/website-linking-best-practices-media-online-publishers
Law Library of Congress Implements Solution for Link and Reference Rot
https://www.digitalgov.gov/2016/04/13/law-library-of-congress-implements-solution-for-link-and-reference-rot/