Tuesday, December 20, 2016

Reference Rot Is Worse Than You Think

At the Fall CNI Martin Klein presented a new paper from LANL and the University of Edinburgh, Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content. Shawn Jones, Klein and the co-authors followed on from the earlier work on web-at-large citations from academic papers in Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot, which found:
one out of five STM articles suffering from reference rot, meaning it is impossible to revisit the web context that surrounds them some time after their publication. When only considering STM articles that contain references to web resources, this fraction increases to seven out of ten.
Reference rot comes in two forms:
  • Link rot: The resource identified by a URI vanishes from the web. As a result, a URI reference to the resource ceases to provide access to referenced content.
  • Content drift: The resource identified by a URI changes over time. The resource’s content evolves and can change to such an extent that it ceases to be representative of the content that was originally referenced.
Source
The British Library's Andy Jackson analyzed the UK Web Archive and found:
I expected the rot rate to be high, but I was shocked by how quickly link rot and content drift come to dominate the scene. 50% of the content is lost after just one year, with more being lost each subsequent year. However, it’s worth noting that the loss rate is not maintained at 50%/year. If it was, the loss rate after two years would be 75% rather than 60%. This indicates there are some islands of stability, and that any broad ‘average lifetime’ for web resources is likely to be a little misleading.
Clearly, the problem is very serious. Below the fold, details on just how serious and discussion of a proposed mitigation.

This work is enabled by Web archive support for RFC 7089 (Memento), which allows access to preserved versions (Mementos) of web pages by [url, datetime]. The basic question to ask is "does the web-at-large URI still resolve to the content it did when the article citing it was published?".
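As a concrete illustration of RFC 7089's datetime negotiation, the sketch below builds the request elements a Memento lookup needs: a TimeGate URI and an Accept-Datetime header. The TimeGate URI pattern shown is the Internet Archive's Wayback Machine; other archives use their own patterns, and the URI here is only an example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def memento_request(uri, when):
    """Return (timegate_uri, headers) for a Memento lookup of `uri`
    as it existed at datetime `when` (RFC 7089 datetime negotiation)."""
    # The Wayback Machine exposes a TimeGate at this URI pattern.
    timegate = "http://web.archive.org/web/" + uri
    # RFC 7089 reuses the HTTP date format for Accept-Datetime.
    headers = {"Accept-Datetime": format_datetime(when, usegmt=True)}
    return timegate, headers

gate, hdrs = memento_request("http://example.com/",
                             datetime(2012, 6, 1, tzinfo=timezone.utc))
```

The archive's response would carry a `Memento-Datetime` header and a `Location` pointing at the best available snapshot.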

The earlier paper:
estimated the existence of representative Mementos for those URI references using an intuitive technique: if a Memento for a referenced URI existed with an archival datetime in a temporal window of 14 days prior and after the publication date of the referencing paper, the Memento was regarded representative.
The new paper takes a more careful approach:
For each URI reference, we poll multiple web archives in search of two Mementos: a Memento Pre that has a snapshot date closest and prior to the publication date of the referencing article, and a Memento Post that has a snapshot date closest and past the publication date. We then assess the similarity between these Pre and Post Mementos using a variety of similarity measures.
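The Pre/Post selection the authors describe reduces to a simple rule: among all known snapshot dates for a URI, take the closest one at or before the article's publication date and the closest one after it. A minimal sketch (function name and sample dates are illustrative, not from the paper):

```python
from datetime import datetime

def pre_post(memento_dates, published):
    """Pick the Memento dates closest before and after `published`."""
    pre = max((d for d in memento_dates if d <= published), default=None)
    post = min((d for d in memento_dates if d > published), default=None)
    return pre, post

snapshots = [datetime(2010, 1, 1), datetime(2011, 5, 1), datetime(2012, 3, 1)]
pre, post = pre_post(snapshots, datetime(2011, 6, 15))
```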
Incidence of web-at-large URIs
They worked with three corpora (arXiv, Elsevier and PubMed Central) with a total of about 1.8M articles referencing web-at-large URIs. This graph, whose data I took from Tables 4, 5 and 6 of the earlier paper, shows that the proportion of articles with at least one web-at-large URI was increasing rapidly through 2012. It would be interesting to bring this analysis up to date, and to show not merely the proportion through time of articles with at least one web-at-large URI as in this graph, but also histograms through time of the proportion of citations that were to web-at-large URIs.

From those articles 3,983,985 URIs were extracted. 1,059,742 were identified as web-at-large URIs, for 680,136 of which it was possible to identify [Memento Pre, Memento Post] pairs. Eliminating non-text URIs left 648,253. They use four different techniques to estimate similarity. By comparing the results they set an aggregate similarity threshold, then:
We apply our stringent similarity threshold to the collection of 648,253 URI references for which Pre/Post Memento pairs can be compared ... and find 313,591 (48.37%) for which the Pre/Post Memento pairs have the maximum similarity score for all measures; these Mementos are considered representative.
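To make "similarity measure" concrete, here is one simple normalized measure, Jaccard similarity over word sets. This is a stand-in for illustration only; it is not claimed to be one of the four measures the paper actually aggregates.

```python
def jaccard(text_a, text_b):
    """Normalized similarity in [0, 1]: 1.0 means identical word sets."""
    a, b = set(text_a.split()), set(text_b.split())
    if not a and not b:
        return 1.0  # two empty documents are trivially identical
    return len(a & b) / len(a | b)
```

A Pre/Post pair would count as "representative" under the paper's stringent rule only if every measure, not just one like this, returned its maximum score.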
Then they:
use the resulting subset of all URI references for which representative Mementos exist and look up each URI on the live web. Predictably, and as shown by extensive prior link rot research, many URIs no longer exist. But, for those that still do, we use the same measures to assess the similarity between the representative Memento for the URI reference and its counterpart on the live web.
This revealed that over 20% of the URIs had suffered link rot, leaving 246,520. Some of these now had a different content type, or no longer contained text to compare. 241,091 URIs remained, for which:
we select the Memento with an archival date closest to the publication date of the paper in which the URI reference occurs and compare it to that URI’s live web counterpart using each of the normalized similarity measures.
The aggregated result is:
a total of 57,026 (23.65%) URI references that have not been subject to content drift. In other words, the content on the live web has drifted away from the content that was originally referenced for three out of four references (184,065 out of 241,091, which equals 76.35%).
Another way of looking at this result is that, of the 313,591 URIs with representative Pre/Post Memento pairs, only 57,026 (18.18%) could be shown not to have rotted. For 334,662 of the 648,253 references with Pre/Post Memento pairs, or 51.63%, the referenced content differed significantly even between the Pre and Post Mementos, showing that it was probably unstable even as the authors were citing it. The problem gets worse through time:
even for articles published in 2012 only about 25% of referenced resources remain unchanged by August of 2015. This percentage steadily decreases with earlier publication years, although the decline is markedly slower for arXiv for recent publication years. It reaches about 10% for 2003 through 2005, for arXiv, and even below that for both Elsevier and PMC.
Similarity over time at arXiv
Thus, as this arXiv graph shows, they find that, after a few years, it is very unlikely that a reader clicking on a web-at-large link in an article will see what the author intended. They suggest that this problem can be addressed by:
  • Archiving Mementos of cited web-at-large URIs during publication, for example using Web archive nomination services such as http://www.archive.org/web.
  • The use of "robust links":
    a link can be made more robust by including:
    • The URI of the original resource for which the snapshot was taken;
    • The URI of the snapshot;
    • The datetime of linking, of taking the snapshot.
The robust link proposal describes the model of link decoration that Klein discussed in his talk:
information is conveyed as follows:
  • href for the URI of the original resource for which the snapshot was taken;
  • data-versionurl for the URI of the snapshot;
  • data-versiondate for the datetime of linking, of taking the snapshot.
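Using the attribute names above, a link decorated under this first model might look like the following (the URIs are hypothetical examples, not from the paper):

```html
<!-- Model 1: href keeps the original URI; the snapshot is carried
     in data- attributes. URIs here are hypothetical. -->
<a href="http://example.com/page"
   data-versionurl="https://web.archive.org/web/20160115000000/http://example.com/page"
   data-versiondate="2016-01-15">the cited page</a>
```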
But this has a significant problem. The eventual reader will click on the link and be taken to the original URI, which as the paper shows, even if it resolves is very unlikely to be what the author intended. The robust links site also includes JavaScript to implement pop-up menus giving users a choice of Mementos, which they assume a publisher implementing robust links would add to their pages. An example of this is Reminiscing About 15 Years of Interoperability Efforts. Note the paper-clip and down-arrow appended to the normal underlined blue link rendering. Clicking on this provides a choice of Mementos.

The eventual reader, who has not internalized the message of this research, will click on the link. If it returns 404, they might click on the down-arrow and choose an alternate Memento. But far more often they will click on the link and get to a page that they have no way of knowing has drifted. They will assume it hasn't, so will not click on the down-arrow and not get to the page the author intended. The JavaScript has no way to know that the page has drifted, so cannot warn the user that it has.

The robust link proposal also describes a different model of link decoration:
information is conveyed as follows:
  • href for the URI that provides the specific state, i.e. the snapshot or resource version;
  • data-originalurl for the URI of the original resource;
  • data-versiondate for the datetime of the snapshot, of the resource version.
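Under this second model the same citation would instead be decorated as follows (again, the URIs are hypothetical examples):

```html
<!-- Model 2: href points directly at the snapshot; the original
     URI is carried in a data- attribute. URIs are hypothetical. -->
<a href="https://web.archive.org/web/20160115000000/http://example.com/page"
   data-originalurl="http://example.com/page"
   data-versiondate="2016-01-15">the cited page</a>
```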
If this model were to be used, the eventual reader would end up at the preserved Memento, which for almost all archives would be framed with information from the archive. This would happen whether or not the original URI had rotted or the content had drifted. The reader would both access, and would know they were accessing, what the author intended. JavaScript would be needed only for the case where the linked-to Memento was unavailable, and other Web archives would need to be queried for the best available Memento.

The robust links specification treats these two models as alternatives, but in practice only the second provides an effective user experience without significant JavaScript support beyond what has been demonstrated. Conservatively, these two papers suggest that between a quarter and a third of all articles will contain at least one web-at-large citation, and that after a few years that citation is very unlikely to resolve to the content the article was citing. Given the very high probability that the URI has suffered content drift, it is better to steer the user to the contemporaneous version if one exists.

With some HTML editing, I made the links to the papers above point to their DOIs at dx.doi.org, so they should persist, although DOIs have their own problems (and also here). I could not archive the URLs to which the DOIs currently resolve, apparently because PLOS blocks the Internet Archive's crawler. With more editing I decorated the link to Andy Jackson's talk in the way Martin suggested - the BL's blog should be fairly stable, but who knows? I saved the two external graphs to Blogger and linked to them there, as is my habit. Andy's graph was captured by the Internet Archive, so I decorated the link to it with that copy. I nominated the arXiv graph and my graph to the Internet Archive and decorated the links with their copy.

The difficulty of actually implementing these links, and the increased understanding of how unlikely it is that the linked-to content will be unchanged, reinforce the arguments in my post from last year entitled The Evanescent Web:
All the proposals depend on actions being taken either before or during initial publication by either the author or the publisher. There is evidence in the paper itself ... that neither authors nor publishers can get DOIs right. Attempts to get authors to deposit their papers in institutional repositories notoriously fail. The LOCKSS team has met continual frustration in getting publishers to make small changes to their publishing platforms that would make preservation easier, or in some cases even possible. Viable solutions to the problem cannot depend on humans to act correctly. Neither authors nor publishers have anything to gain from preservation of their work.
It is worth noting that discussions with publishers about the related set of changes discussed in Improving e-Journal Ingest (among other things) are ongoing. Nevertheless, this proposal is more problematic for them. Journal publishers are firmly opposed to pointing to alternate sources for their content, such as archives, so they would never agree to supply that information in their links to journal articles; note that very few DOIs resolve to multiple targets. They would therefore probably also be reluctant to link to alternate sources for web-at-large content from other for-profit or advertising-supported publishers, even if it were open access. The idea that journal publishers would spend the effort needed to determine whether a web-at-large link in an article pointed to for-profit content seems implausible.

Update: Michael Nelson alerted me to two broken links, which I fixed. The first was my mistake during hand-editing the HTML to insert the Memento links. The second is both interesting and ironic. The link with text "point to their DOIs at dx.doi.org" referred to
Persistent URIs Must Be Used To Be Persistent, a paper by Herbert van de Sompel, Martin Klein and Shawn Jones which shows:
a significant number of references to papers linked by their locating URI instead of their identifying URI. For these links, the persistence intended by the DOI persistent identifier infrastructure was not achieved.
I.e., it shows how important it is for links to papers to be doi.org links. I looked up the paper using Google Scholar and copied the doi.org link from the landing page at the ACM's Digital Library, which was https://dl.acm.org/citation.cfm?id=2889352. The link I copied was https://doi.org/10.1145/2872518.2889352, which is broken!

Update 2: Rob Baxter commented on another broken link, which I have fixed.

5 comments:

Unknown said...

Time to bring back the zotero commons! https://www.zotero.org/blog/zotero-and-the-internet-archive-join-forces/

Unknown said...

FYI: specific versions of arXiv papers can be referenced:

https://arxiv.org/abs/1612.00010v2

blissex said...

The data is horrid, but not unexpected: I have been using the WWW since the very beginning like our blogger, actually since Gopher, or earlier with FTP sites, and I have saved a lot of links, and of course lots and lots of them go bad.

The problem is fundamental, both in the sense of funds (as our blogger says), and of foundations: the web is a showcase/publishing system, not an archival/reference one, and cannot be turned into one.

URLs are like giving references as "the book currently in the middle of the 3rd shelf in bookshop X in town Y". Eventually the bookshop rearranges the shelves, moves them, moves location or closes altogether. And bookshops or web sites are interested in "selling" copies, not providing detailed archiving of the history of the copies they sell. Other URI types really just do the reverse: they provide identity rather than the location of a copy.

The core issue is that any scheme that both identifies content and provides the location of one or more copies must rely on a central cross-reference, and that's pragmatically unfeasible, even if Google sort of tries by spidering:

http://www.sabi.co.uk/blog/0705may.html#070513b

The best that can be done is, as in the real world of books, to give the identity and location of a copy in some well-known library where it is likely to be stable.

Long story short: I make every year $100-$200 donations to the Internet Archive. I wish that there were more completely independent Internet Archives, because it is a big single point of (political, funding) failure.

Rob Baxter said...

And (you couldn't make it up) the Crossref.org link behind "DOIs have their own problems." resolves to a 404... :-(

David. said...

The 404 Rob spotted is at http://crosstech.crossref.org/2015/01/problems-with-dx-doi-org-on-january-20th-2015-what-we-know.html - a Geoff Bilder post discussing the great DOI outage of 2015. One can see why Crossref might not want this link to resolve, or to be preserved in the Wayback Machine.

But by searching for the link we find the post in question and the followup report. Crossref is devoted to persistent identifiers, which survive changes to website structure such as broke the original link. It is therefore a bit odd that they don't "eat their own dog-food".

I will fix the link, and its occurrence in at least one earlier post, shortly.