one out of five STM articles suffering from reference rot, meaning it is impossible to revisit the web context that surrounds them some time after their publication. When only considering STM articles that contain references to web resources, this fraction increases to seven out of ten.

Reference rot comes in two forms:
- Link rot: The resource identified by a URI vanishes from the web. As a result, a URI reference to the resource ceases to provide access to referenced content.
- Content drift: The resource identified by a URI changes over time. The resource’s content evolves and can change to such an extent that it ceases to be representative of the content that was originally referenced.
I expected the rot rate to be high, but I was shocked by how quickly link rot and content drift come to dominate the scene. 50% of the content is lost after just one year, with more being lost each subsequent year. However, it's worth noting that the loss rate is not maintained at 50%/year; if it were, the loss after two years would be 75% rather than the observed 60%. This indicates that there are some islands of stability, and that any broad ‘average lifetime’ for web resources is likely to be a little misleading.

Clearly, the problem is very serious. Below the fold, details on just how serious, and discussion of a proposed mitigation.
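The compounding arithmetic behind that observation can be made explicit. A minimal sketch, assuming a memoryless model in which each year's 50% loss is independent of the last (the function name is mine):

```python
def surviving_fraction(annual_survival: float, years: int) -> float:
    """Fraction of linked resources still resolving after `years`,
    assuming a constant, memoryless annual survival rate."""
    return annual_survival ** years

# If the first year's 50% loss rate were sustained, two years would
# leave only 25% of links alive (75% lost) ...
assert surviving_fraction(0.5, 2) == 0.25

# ... but the observed two-year loss is about 60%, i.e. roughly 40%
# survive, so survival is not memoryless: stable islands persist.
observed_two_year_survival = 0.40
assert observed_two_year_survival > surviving_fraction(0.5, 2)
```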
This work is enabled by Web archives' support for RFC 7089, the Memento protocol, which allows access to preserved versions (Mementos) of web pages by [url,datetime]. The basic question to ask is "does the web-at-large URI still resolve to the content it did when it was published?".
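For a concrete sense of such a lookup, here is a sketch against the Internet Archive's Wayback "availability" API, a simpler cousin of full RFC 7089 datetime negotiation. The endpoint is real, but the helper names and the canned response below are mine:

```python
import json
from urllib.parse import urlencode

WAYBACK_API = "https://archive.org/wayback/available"

def availability_query(uri: str, timestamp: str) -> str:
    """Build a query for the snapshot closest to `timestamp` (YYYYMMDD)."""
    return WAYBACK_API + "?" + urlencode({"url": uri, "timestamp": timestamp})

def closest_snapshot(response_text: str):
    """Return the closest archived snapshot URL from the API's JSON
    response, or None if the URI was never archived."""
    closest = json.loads(response_text).get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

# A canned response in the API's documented shape (no network needed here).
sample = json.dumps({"archived_snapshots": {"closest": {
    "available": True,
    "url": "http://web.archive.org/web/20120615000000/http://example.com/",
    "timestamp": "20120615000000"}}})
assert closest_snapshot(sample) == (
    "http://web.archive.org/web/20120615000000/http://example.com/")
assert closest_snapshot('{"archived_snapshots": {}}') is None
```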
The earlier paper:
estimated the existence of representative Mementos for those URI references using an intuitive technique: if a Memento for a referenced URI existed with an archival datetime in a temporal window of 14 days prior and after the publication date of the referencing paper, the Memento was regarded as representative.

The new paper takes a more careful approach:
For each URI reference, we poll multiple web archives in search of two Mementos: a Memento Pre that has a snapshot date closest and prior to the publication date of the referencing article, and a Memento Post that has a snapshot date closest and past the publication date. We then assess the similarity between these Pre and Post Mementos using a variety of similarity measures.
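The Pre/Post comparison can be sketched with a single stdlib measure standing in for the paper's battery of measures; `difflib`'s word-level ratio is my stand-in, and the stringent threshold mirrors the "maximum similarity" criterion quoted below:

```python
from difflib import SequenceMatcher

def normalized_similarity(pre_text: str, post_text: str) -> float:
    """Similarity in [0, 1] between the textual content of a Pre and a
    Post Memento (word-level difflib ratio; a stand-in for the paper's
    several normalized measures)."""
    return SequenceMatcher(None, pre_text.split(), post_text.split()).ratio()

def is_representative(pre_text: str, post_text: str,
                      threshold: float = 1.0) -> bool:
    """Stringent criterion: a Pre/Post pair is representative only when
    it scores maximum similarity."""
    return normalized_similarity(pre_text, post_text) >= threshold

assert is_representative("results as of June 2012", "results as of June 2012")
assert not is_representative("results as of June 2012", "results as of May 2015")
```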
[Figure: Incidence of web-at-large URIs]
From those articles, 3,983,985 URIs were extracted, of which 1,059,742 were identified as web-at-large URIs; for 680,136 of these it was possible to identify [Memento Pre, Memento Post] pairs. Eliminating non-text URIs left 648,253. The authors use four different techniques to estimate similarity. By comparing the results they set an aggregate similarity threshold, then:
We apply our stringent similarity threshold to the collection of 648,253 URI references for which Pre/Post Memento pairs can be compared ... and find 313,591 (48.37%) for which the Pre/Post Memento pairs have the maximum similarity score for all measures; these Mementos are considered representative.

Then they:
use the resulting subset of all URI references for which representative Mementos exist and look up each URI on the live web. Predictably, and as shown by extensive prior link rot research, many URIs no longer exist. But, for those that still do, we use the same measures to assess the similarity between the representative Memento for the URI reference and its counterpart on the live web.

This revealed that over 20% of the URIs had suffered link rot, leaving 246,520. More had a different content type, or no longer contained text to be compared. 241,091 URIs remained for which:
we select the Memento with an archival date closest to the publication date of the paper in which the URI reference occurs and compare it to that URI’s live web counterpart using each of the normalized similarity measures.

The aggregated result is:
a total of 57,026 (23.65%) URI references that have not been subject to content drift. In other words, the content on the live web has drifted away from the content that was originally referenced for three out of four references (184,065 out of 241,091, which equals 76.35%).

Another way of looking at this result is that, of the 313,591 URIs for which matching Pre/Post Memento pairs existed, the authors could show only 57,026 (18.18%) not to have rotted. For 334,662 of the 648,253 references with Pre/Post Memento pairs (51.63%), the referenced URI changed significantly between the Pre and Post Mementos, showing that it was probably unstable even as the authors were citing it. The problem gets worse through time:
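The funnel of numbers quoted above is internally consistent, as a few lines of arithmetic confirm (the variable names are mine):

```python
# Counts quoted from the paper, stage by stage.
pairs = 648_253           # URI references with comparable Pre/Post Memento pairs
representative = 313_591  # pairs at maximum similarity on all measures
live_compared = 241_091   # representative URIs still live and textually comparable
undrifted = 57_026        # live content still matching the representative Memento

assert round(100 * representative / pairs, 2) == 48.37
assert round(100 * undrifted / live_compared, 2) == 23.65
assert round(100 * (live_compared - undrifted) / live_compared, 2) == 76.35
assert round(100 * undrifted / representative, 2) == 18.18

# References whose content changed between the Pre and Post Mementos:
assert pairs - representative == 334_662
assert round(100 * (pairs - representative) / pairs, 2) == 51.63
```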
even for articles published in 2012 only about 25% of referenced resources remain unchanged by August of 2015. This percentage steadily decreases with earlier publication years, although the decline is markedly slower for arXiv for recent publication years. It reaches about 10% for 2003 through 2005, for arXiv, and even below that for both Elsevier and PMC.
[Figure: Similarity over time at arXiv]
- Archiving Mementos of cited web-at-large URIs during publication, for example using Web archive nomination services such as
- The use of "robust links":
a link can be made more robust by including:
- The URI of the original resource for which the snapshot was taken;
- The URI of the snapshot;
- The datetime of linking, of taking the snapshot.
These can be conveyed as link attributes:
- href for the URI of the original resource for which the snapshot was taken;
- data-versionurl for the URI of the snapshot;
- data-versiondate for the datetime of linking, of taking the snapshot.
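Concretely, under this first model the anchor keeps the original URI in href while the snapshot rides along in data-* attributes; the URIs below are made-up placeholders:

```html
<!-- First model: href stays the original URI; the snapshot travels
     in data-* attributes. All URIs here are illustrative placeholders. -->
<a href="http://example.com/report"
   data-versionurl="https://web.archive.org/web/20150121000000/http://example.com/report"
   data-versiondate="2015-01-21">the report</a>
```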
The robust link proposal also describes a different model of link decoration:
- href for the URI that provides the specific state, i.e. the snapshot or resource version;
- data-originalurl for the URI of the original resource;
- data-versiondate for the datetime of the snapshot, of the resource version.
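Under this second model the snapshot itself is the link target; again with made-up placeholder URIs:

```html
<!-- Second model: href points at the snapshot (the specific state);
     the original URI travels in data-originalurl. -->
<a href="https://web.archive.org/web/20150121000000/http://example.com/report"
   data-originalurl="http://example.com/report"
   data-versiondate="2015-01-21">the report</a>
```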
With some HTML editing, I made the links to the papers above point to their DOIs at dx.doi.org, so they should persist, although DOIs have their own problems (and also here). I could not archive the URLs to which the DOIs currently resolve, apparently because PLOS blocks the Internet Archive's crawler. With more editing I decorated the link to Andy Jackson's talk in the way Martin suggested - the BL's blog should be fairly stable, but who knows? I saved the two external graphs to Blogger and linked to them there, as is my habit. Andy's graph was captured by the Internet Archive, so I decorated the link to it with that copy. I nominated the arXiv graph and my graph to the Internet Archive and decorated the links with their copy.
The difficulty of actually implementing these links, and the increased understanding of how unlikely it is that the linked-to content will be unchanged, reinforce the arguments in my post from last year entitled The Evanescent Web:
All the proposals depend on actions being taken either before or during initial publication by either the author or the publisher. There is evidence in the paper itself ... that neither authors nor publishers can get DOIs right. Attempts to get authors to deposit their papers in institutional repositories notoriously fail. The LOCKSS team has met continual frustration in getting publishers to make small changes to their publishing platforms that would make preservation easier, or in some cases even possible. Viable solutions to the problem cannot depend on humans to act correctly. Neither authors nor publishers have anything to gain from preservation of their work.

It is worth noting that discussions with publishers about the related set of changes discussed in Improving e-Journal Ingest (among other things) are ongoing. Nevertheless, this proposal is more problematic for them. Journal publishers are firmly opposed to pointing to alternate sources for their content, such as archives, so they would never agree to supply that information in their links to journal articles. Note that very few DOIs resolve to multiple targets. They would therefore probably be reluctant to link to alternate sources for web-at-large content from other for-profit or advertising-supported publishers, even if it were open access. The idea that journal publishers would spend the effort needed to identify whether a web-at-large link in an article pointed to for-profit content seems implausible.
Update: Michael Nelson alerted me to two broken links, which I fixed. The first was my mistake during hand-editing the HTML to insert the Memento links. The second is both interesting and ironic. The link with text "point to their DOIs at dx.doi.org" referred to
Persistent URIs Must Be Used To Be Persistent, a paper by Herbert van de Sompel, Martin Klein and Shawn Jones which shows:
a significant number of references to papers linked by their locating URI instead of their identifying URI. For these links, the persistence intended by the DOI persistent identifier infrastructure was not achieved.

I.e. how important it is for links to papers to be doi.org links. I looked up the paper using Google Scholar and copied the doi.org link from the landing page at the ACM's Digital Library, which was https://dl.acm.org/citation.cfm?id=2889352. The link I copied was https://doi.org/10.1145/2872518.2889352, which is broken!
Update 2: Rob Baxter commented on another broken link, which I have fixed.
Time to bring back the zotero commons! https://www.zotero.org/blog/zotero-and-the-internet-archive-join-forces/
FYI: specific versions of arXiv papers can be referenced:
The data is horrid, but not unexpected: like our blogger, I have been using the WWW since the very beginning (actually since Gopher, or earlier with FTP sites), and I have saved a lot of links, and of course lots and lots of them go bad.
The problem is fundamental, both in the sense of funds (as our blogger says), and of foundations: the web is a showcase/publishing system, not an archival/reference one, and cannot be turned into one.
URLs are like giving references as "the book currently in the middle of the 3rd shelf in bookshop X in town Y". Eventually the bookshop rearranges the shelves, moves them, moves location or closes altogether. And bookshops or web sites are interested in "selling" copies, not in providing a detailed archival history of the copies they sell. Other URI types really just do the reverse: they provide identity rather than the location of a copy.
The core issue is that any scheme that both identifies content and provides the location of one or more copies must rely on a central cross-reference, and that's pragmatically unfeasible, even if Google sort of tries by spidering:
The best that can be done is like in the real world of books to give identity and location of a copy in some well known library where it is likely to be stable.
Long story short: I make $100-$200 in donations to the Internet Archive every year. I wish that there were more completely independent Internet Archives, because it is a big single point of (political, funding) failure.
And (you couldn't make it up) the Crossref.org link behind "DOIs have their own problems." resolves to a 404... :-(
The 404 Rob spotted is at http://crosstech.crossref.org/2015/01/problems-with-dx-doi-org-on-january-20th-2015-what-we-know.html - a Geoff Bilder post discussing the great DOI outage of 2015. One can see why Crossref might not want this link to resolve, or to be preserved in the Wayback Machine.
But by searching for the link we find the post in question and the follow-up report. Crossref is devoted to persistent identifiers, which survive changes to website structure such as the one that broke the original link. It is therefore a bit odd that they don't "eat their own dog food".
I will fix the link, and its occurrence in at least one earlier post, shortly.