
The sad news of Vaughan Jones's death has reminded me of a question I've been wanting to ask for a while.

A significant part of knowledge is stored on personal websites of scientists: preprints, lecture notes, errata... Quite often this "grey literature" gets cited in papers. These sources have an unfortunate tendency to disappear even at the best of times, but things get really bad when the author dies. Universities being what they are in the 21st Century, these websites are usually regarded as cost and liability centers by the IT departments, who waste no time in removing them.

While the Internet Archive often has parts of a website archived, this is never guaranteed. Left to its own devices, its crawler often goes only one level deep into a site, if that. If it does get a copy of a file, it might not come back for updates for years. Some universities even block the crawler via robots.txt or make the archives inaccessible a posteriori. (This seems to be particularly widespread in Canada for some reason.)

I am looking for best practices to prevent this loss of written knowledge. From personal experience, I have a few ideas to share, but I suspect they are far from the most efficient ones available.

1. When to archive?

When is a good time to archive a website? "As often as time allows" is the trivial answer, but of course there are better and worse times. I do think that once you hear of an author's death, archiving is a no-brainer -- you certainly won't miss out on any updates, and the danger of the material disappearing is higher than ever.

Now to the more substantial question: how to archive.

The ultimate source on archiving websites is a post by gwern. What worries me is the amount of sophistication it requires. I am unable to run his archiver-bot, nor can I easily find what I need among the links. I would highly appreciate some dumbed-down version that does 80% of the job with 20% of the effort; but until then, here are my manual tricks.

Stone soup warning! The following is probably less than optimal.

2. Personal archival

The simplest form of archival is personal archival, i.e., downloading the website to one's own computer. In most cases, the wget tool (available on every major OS) does the trick. For example, let's download the website of Thomas M. Liggett (1944-2020) while it is still around:

wget -r -l 0 -e robots=off -np -nc --html-extension --no-check-certificate https://www.math.ucla.edu/~tml/

A few words about the litany of options in this command (I am no expert; see the full wget documentation): "-r -l 0" means "download recursively, with no depth limit (i.e., follow links as long as they don't lead away from the website you gave)"; "-np" ("no parent") keeps wget from climbing above the starting directory; "-nc" ("no clobber") prevents re-downloading files that are already on disk; "-e robots=off" ignores the robots.txt (you are downloading for yourself; usually unnecessary anyway); "--html-extension" avoids the occasional HTML file being saved without the .html extension (not sure why this happens); "--no-check-certificate" tells wget not to be pedantic about SSL certificates.

This doesn't always work. Some websites confuse wget by hiding links in JavaScript. Some websites span several servers. Some have 500MB videos. Some have different files whose names differ only in capitalization, so when you download them on a Windows system they will overwrite one another. But the above simple wget incantation works well enough often enough that I find it worth mentioning.
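For sites that trip up the basic incantation, wget has a few extra options that mitigate some of these issues (though not the capitalization clashes). The following is only a sketch; the domain list and the reject patterns are placeholders to be adapted to the site at hand:

# A more defensive variant of the wget call above. Placeholders: adjust the
# --domains list and the --reject suffixes to the site you are mirroring.
#   --span-hosts / --domains : follow links onto sibling servers of the same site
#   --reject                 : skip huge media files by suffix
#   --wait / --random-wait   : throttle requests to be gentle with the server
wget -r -l 0 -np -nc -e robots=off --html-extension --no-check-certificate \
     --span-hosts --domains=math.ucla.edu \
     --reject "mp4,avi,mov,wmv" \
     --wait=1 --random-wait \
     https://www.math.ucla.edu/~tml/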

3. Internet Archive

So let's say the Internet Archive is unaware of a website, or has only partial or obsolete copies of it. Can we poke it to wake it up?

Yes, most of the time. It might be necessary to have an account at the Archive (not sure about it, but not a big hurdle anyway). Once you do so, you take the website you want to archive -- let's say this set of errata from the above-mentioned website:

https://www.math.ucla.edu/~tml/bookcorrections2.pdf

and put its URL into the Internet Archive Save site.

[EDIT: For the following paragraph, you need to be logged in with an account on the Internet Archive. I can recommend signing up.] But wait -- before you click "Save Page", you should check the "Save outlinks" checkbox (yes, even PDF files can have outlinks!). This tells the Archive to capture and save not just the URL you are giving it, but also every URL linked from it. (I think it won't go deeper down the recursion tree.) So if you do this with a course page, you won't have to do it again for every single homework set on it. Thus, if you want the whole site, you only need to throw the Archive at a dominating set of its pages. (Make sure you get the order right: don't do the start page before the course pages, or the Archive will save the course pages without the PDFs, and you won't be able to throw it at those course pages for the next 20 minutes because it will think it has already archived them.)

You can close the window once it starts saving; you don't need to wait until it finishes.

I have just done this with Liggett's website as I was writing this post. I had to manually enter each of his course pages (and a couple more pages) and finally the start page into the Archive's Save page, but everything else was pulled in automatically. It took me about 3 minutes in total. Don't go too fast, as the Archive will block you if it gets too many requests from you in a short time; but the block doesn't last long. (I have personally only managed to trigger this block when I was rabidly cycling through a dozen open tabs and pressing the "Save" button on each.)
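If you have many URLs to feed to the Archive, this can also be scripted rather than clicked. The sketch below uses curl against the Archive's Save Page Now API; I am going by the published SPN2 API description, so treat the details as assumptions to verify. ACCESSKEY and SECRET are placeholders for the S3-style API keys from your archive.org account settings, urls.txt is a hypothetical file with one URL per line, and capture_outlinks=1 is the API counterpart of the "Save outlinks" checkbox.

# Sketch (unofficial): save each URL listed in urls.txt via the Save Page Now API,
# pausing between requests to stay under the Archive's rate limit.
# ACCESSKEY:SECRET are the S3-style API keys from your archive.org account.
while read -r url; do
  curl -s "https://web.archive.org/save" \
       -H "Accept: application/json" \
       -H "Authorization: LOW ACCESSKEY:SECRET" \
       --data-urlencode "url=${url}" \
       --data-urlencode "capture_outlinks=1"
  echo                # newline after the JSON response
  sleep 15            # throttle to avoid the Archive's rate limit
done < urls.txt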

Incidentally, you can easily get into the habit of saving each of your own preprints on the Archive this way. If you created the hyperlinks in a reasonable way (e.g., using the "hyperref" package) and didn't forget to check "Save outlinks", this will have the effect that every source you hyperlink in that preprint is saved as well.

4. Contacting the universities

Possibly the lowest-tech option for preserving/resurrecting university websites that were taken down is nicely asking the (IT service of the maths department of the) university that used to host them. I've tried this 3 times so far and have been successful twice (Dan Laksov and Jean-Louis Loday -- props to the University of Strasbourg and to KTH). The third time (Rudolf Fritsch's website, partly on the Archive), I was told they did not have the data anymore. Note that all these universities were European; your mileage will likely vary with American ones. (I would fear the latter would be more bureaucratic about it.) The upshot is "worth trying, but don't rely on it to work". Either way, once the site has been restored, make sure to save it using the two methods above, so that it is not lost when it disappears again.

Question. What is missing here; what is improvable?

PS. I am specifically talking about mathematicians' websites, since those are the ones I am most familiar with. If your first reaction on seeing this question is "why would one want to save anything from a personal homepage?" or "no one I know has a homepage", this is most likely because websites are less widespread in your discipline. If it is "who would care about a preprint from 30 years ago?", that's another respect in which disciplines can differ considerably.

I am also not very interested in the legal or ethical side of archiving websites; in my experience, there has never been any controversy about what I am describing, and the worst externalities that can occur are on the level of "a university server gets more traffic than usual" and "future generations get to sneer at Professor Bigshot's messy website". Meanwhile, the upside of saving even the unpolished ideas of mildly known researchers can be substantial. If you feel there is more to these issues, please start a new question.

  • A very good question. And I myself would also worry about the findability of archived versions if the archiving is done privately, even if at the former home university. E.g., a newly created archive at an odd address, especially if blocked from robots, will be nearly invisible to search engines, if only because of being masked by older one-item hits for the author, I'd fear. Commented Sep 8, 2020 at 20:38
  • Sort of a secondary "big data" issue: how to find needles in haystacks... Commented Sep 8, 2020 at 20:39
  • What is the question you're actually asking, here? – nick012000 Commented Sep 8, 2020 at 23:54
  • @nick012000 See the third-to-last paragraph of my post. I am fairly sure my methods are significantly improvable. Commented Sep 9, 2020 at 6:21
  • I would recommend the long-term solution of having authors publish public content at persistent repositories like arxiv.org, osf.io, etc. This kind of scraped information will be a nightmare of unclear licensing. Commented Oct 22, 2020 at 7:38

1 Answer


A general solution to this problem is probably not easy. However, if there is a specific professor whose webpage you are following, it would be a good idea to use httrack to archive it for yourself, at least, re-running it every two or three months to keep an updated copy of the website.
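As a minimal sketch (the URL, output directory, and filter below are placeholders based on the example site from the question; re-running the command later should refresh the existing mirror, but verify this on your httrack version):

# Sketch: mirror one professor's site into a local directory, restricting the
# crawl to that subtree via the "+..." filter.
httrack "https://www.math.ucla.edu/~tml/" \
        -O ~/archives/liggett \
        "+*math.ucla.edu/~tml/*" \
        -v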

  • This is good practice for personal use, but something public would be better. Commented Jul 16, 2021 at 9:51
