Tuesday, February 10, 2015

The Evanescent Web

Papers drawing attention to the decay of links in academic papers have quite a history, i blogged about three relatively early ones six years ago. Now Martin Klein and a team from the Hiberlink project have taken the genre to a whole new level with a paper in PLoS One entitled Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. Their dataset is 2-3 orders of magnitude bigger than previous studies, their methods are far more sophisticated, and they study both link rot (links that no longer resolve) and content drift (links that now point to different content). There's a summary on the LSE's blog.

Below the fold, some thoughts on the Klein et al paper.

As regards link rot, they write:
In order to combat link rot, the Digital Object Identifier (DOI) was introduced to persistently identify journal articles. In addition, the DOI resolver for the URI version of DOIs was introduced to ensure that web links pointing at these articles remain actionable, even when the articles change web location.
But even when used correctly, such as http://dx.doi.org/10.1371/journal.pone.0115253, DOIs introduce a single point of failure. This became obvious on January 20th when the doi.org domain name briefly expired. DOI links all over the Web failed, illustrating yet another fragility of the Web. It hasn't been a good time for access to academic journals for other reasons either. Among the publishers unable to deliver content to their customers in the last week or so were Elsevier, Springer, Nature, HighWire Press and Oxford Art Online.

I've long been a fan of Herbert van de Sompel's work, especially Memento. He's a co-author on the paper and we have been discussing it. Unusually, we've been disagreeing. We completely agree on the underlying problem of the fragility of academic communication in the Web era as opposed to its robustness in the paper era. Indeed, in the introduction of another (but much less visible) recent paper entitled Towards Robust Hyperlinks for Web-Based Scholarly Communication Herbert and his co-authors echo the comparison between the paper and Web worlds from the very first paper we published on the LOCKSS system a decade and a half ago. Nor am I critical of the research underlying the paper, which is clearly of high quality and which reveals interesting and disturbing properties of Web-based academic communication. All I'm disagreeing with Herbert about is the way the research is presented in the paper.

My problem with the presentation is that this paper, which has a far higher profile than other recent publications in this area, and which comes at a time of unexpectedly high visibility for web archiving, seems to me to be excessively optimistic, and to fail to analyze the roots of the problem it is addressing. It thus fails to communicate the scale of the problem.

The paper is, for very practical reasons of publication in a peer-reviewed journal, focused on links from academic papers to the web-at-large. But I see it as far too optimistic in its discussion of the likely survival of the papers themselves, and the other papers they link to (see Content Drift below). I also see it as far too optimistic in its discussion of proposals to fix the problem of web-at-large references that it describes (see Dependence on Authors below).

All the proposals depend on actions being taken either before or during initial publication by either the author or the publisher. There is evidence in the paper itself (see Getting Links Right below) that neither authors nor publishers can get DOIs right. Attempts to get authors to deposit their papers in institutional repositories notoriously fail. The LOCKSS team has met continual frustration in getting publishers to make small changes to their publishing platforms that would make preservation easier, or in some cases even possible. Viable solutions to the problem cannot depend on humans to act correctly. Neither authors nor publishers have anything to gain from preservation of their work.

In addition, the paper fails to even mention the elephant in the room, the fact that both the papers and the web-at-large content are copyright. The archives upon which the proposed web-at-large solutions rest, such as the Internet Archive, are themselves fragile. Not just for the normal economic and technical reasons we outlined nearly a decade ago, but because they operate under the DMCA's "safe harbor" provision and thus must take down content upon request from a claimed copyright holder. The archives such as Portico and LOCKSS that preserve the articles themselves operate instead with permission from the publisher, and thus must impose access restrictions.

This is the root of the problem. In the paper world in order to monetize their content the copyright owner had to maximize the number of copies of it. In the Web world, in order to monetize their content the copyright owner has to minimize the number of copies. Thus the fundamental economic motivation for Web content militates against its preservation in the ways that Herbert and I would like.

None of this is to suggest that developing and deploying partial solutions is a waste of time. It is what I've been doing the last quarter of my life. There cannot be a single comprehensive technical solution. The best we can do is to combine a diversity of partial solutions. But we need to be clear that even if we combine everything anyone has worked on we are still a long way from solving the problem. Now for some details.

Content Drift

As regards content drift, they write:
Content drift is hardly a matter of concern for references to journal articles, because of the inherent fixity that, especially PDF-formated, articles exhibit. Nevertheless, special-purpose solutions for long-term digital archiving of the digital journal literature, such as LOCKSS, CLOCKSS, and Portico, have emerged to ensure that articles and the articles they reference can be revisited even if the portals that host them vanish from the web. More recently, the Keepers Registry has been introduced to keep track of the extent to which the digital journal literature is archived by what memory organizations. These combined efforts ensure that it is possible to revisit the scholarly context that consists of articles referenced by a certain article long after its publication.
While I understand their need to limit the scope of their research to web-at-large resources, the last sentence is far too optimistic.

First, research using the Keepers Registry and other resources shows that at most 50% of all articles are preserved. So future scholars depending on archives of digital journals will encounter large numbers of broken links.

Second, even the 50% of articles that are preserved may not be accessible to a future scholar. CLOCKSS is a dark archive and is not intended to provide access to future scholars unless the content is triggered. Portico is a subscription archive, future scholars' institutions may not have a subscription. LOCKSS provides access only to readers at institutions running a LOCKSS box. These restrictions are a response to the copyright on the content and are not susceptible to technical fixes.

Third, the assumption that journal articles exhibit "inherent fixity" is, alas, outdated. Both the HTML and PDF versions of articles from state-of-the-art publishing platforms contain dynamically generated elements, even when they are not entirely generated on-the-fly. The LOCKSS system encounters this on a daily basis. As each LOCKSS box collects content from the publisher independently, each box gets content that differs in unimportant respects. For example, the HTML content is probably personalized ("Welcome Stanford University") and updated ("Links to this article"). PDF content is probably watermarked ("Downloaded by"). Content elements such as these need to be filtered out of the comparisons between the "same" content at different LOCKSS boxes. One might assume that the words, figures, etc. that form the real content of articles do not drift, but in practice it would be very difficult to validate this assumption.

Soft-404 Responses

I've written before about the problems caused for archiving by "soft-403 and soft-404" responses by Web servers. These result from Web site designers who believe their only audience is humans, so instead of providing the correct response code when they refuse to supply content, they return a pretty page with a 200 response code indicating valid content. The valid content is a refusal to supply the requested content. Interestingly, PubMed is an example, as I discovered when clicking on the (broken) PubMed link in the paper's reference 58.

Klein et al define a live web page thus:
On the one hand, the HTTP transaction chain could end successfully with a 2XX-level HTTP response code. In this case we declared the URI to be active on the live web.
Their estimate of the proportion of links which are still live is thus likely to be optimistic, as they are likely to have encountered at least soft-404s if not soft-403s.

Getting Links Right

Even when the dx.doi.org resolver is working, its effectiveness in persisting links depends on its actually being used. Klein et al discover that in many cases it isn't:
one would assume that URI references to journal articles can readily be recognized by detecting HTTP URIs that carry a DOI, e.g., http://dx.doi.org/10.1007/s00799-014-0108-0. However, it turns out that references rather frequently have a direct link to an article in a publisher's portal, e.g. http://link.springer.com/article/10.1007%2Fs00799-014-0108-0, instead of the DOI link.
The direct link may well survive relocation of the content within the publisher's site. But journals are frequently bought and sold between publishers, causing the link to break. I believe there are two causes for these direct links, publisher's platforms inserting them so as not to risk losing the reader, but more importantly the difficulty for authors to create correct links. Cutting and pasting from the URL bar in their browser necessarily gets the direct link, creating the correct one via dx.doi.org requires the author to know that it should be hand-edited, and to remember to do it.

Attempts to ensure linked materials are preserved suffer from a similar problem:
The solutions component of Hiberlink also explores how to best reference archived snapshots. The common and obvious approach, followed by Webcitation and Perma.cc, is to replace the original URI of the referenced resource with the URI of the Memento deposited in a web archive. This approach has several drawbacks. First, through removal of the original URI, it becomes impossible to revisit the originally referenced resource, for example, to determine what its content has become some time after referencing. Doing so can be rather relevant, for example, for software or dynamic scientific wiki pages. Second, the original URI is the key used to find Mementos of the resource in all web archives, using both their search interface and the Memento protocol. Removing the original URI is akin to throwing away that key: it makes it impossible to find Mementos in web archives other than the one in which the specific Memento was deposited. This means that the success of the approach is fully dependent on the long term existence of that one archive. If it permanently ceases to exist, for example, as a result of legal or financial pressure, or if it becomes temporally inoperative as a result of technical failure, the link to the Memento becomes rotten. Even worse, because the original URI was removed from the equation, it is impossible to use other web archives as a fallback mechanism. As such, in the approach that is currently common, one link rot problem is replaced by another.
The paper, and a companion paper, describe Hiberlink's solution, which is to decorate the link to the original resource with an additional link to its archived Memento. Rene Voorburg of the KB has extended this by implementing robustify.js
robustify.js checks the validity of each link a user clicks. If the linked page is not available, robustify.js will try to redirect the user to an archived version of the requested page. The script implements Herbert Van de Sompel's Memento Robust Links - Link Decoration specification (as part of the Hiberlink project) in how it tries to discover an archived version of the page. As a default, it will use the Memento Time Travel service as a fallback. You can easily implement robustify.js on your web pages in so that it redirects pages to your preferred web archive.
Note, however, that soft-403s and soft-404s pose the same problem for robustify.js as they do for all Web archiving technologies.

Dependence on Authors

Many of the solutions that have been proposed to the problem of reference rot also suffer from dependence on authors:
Webcitation was a pioneer in this problem domain when, years ago, it introduced the service that allows authors to archive, on demand, web resources they intend to reference. ... But Webcitation has not been met with great success, possibly the result of a lack of authors' awareness regarding reference rot, possibly because the approach requires an explicit action by authors, likely because of both.
Webcitation is not the only one:
To a certain extent, portals like FigShare and Zenodo play in this problem domain as they allow authors to upload materials that might otherwise be posted to the web at large. The recent capability offered by these systems that allows creating a snapshot of a GitHub repository, deposit it, and receive a DOI in return, serves as a good example. The main drivers for authors to do so is to contribute to open science and to receive a citable DOI, and, hence potentially credit for the contribution. But the net effect, from the perspective of the reference rot problem domain, is the creation of a snapshot of an otherwise evolving resource. Still, these services target materials created by authors, not, like web archives do, resources on the web irrespective of their authorship. Also, an open question remains to which extent such portals truly fulfill a long term archival function rather than being discovery and access environments.
Hiberlink is trying to reduce this dependence:
In the solutions thread of Hiberlink, we explore pro-active archiving approaches intended to seamlessly integrate into the life cycle of an article and to require less explicit intervention by authors. One example is an experimental Zotero extension that archives web resources as an author bookmarks them during note taking. Another is HiberActive, a service that can be integrated into the workflow of a repository or a manuscript submission system and that issues requests to web archives to archive all web at large resources referenced in submitted articles.
But note that these services (and Voorburg's) depend on the author or the publisher installing them. Experience shows that authors are focused on getting their current paper accepted, large publishers are reluctant to implement extensions to their publishing platforms that offer no immediate benefit, and small publishers lack the expertise to do so.

Ideally, these services would be back-stopped by a service that scanned recently-published articles for web-at-large links and submitted them for archiving, thus requiring no action by author or publisher. The problem is that doing so requires the service to have access to the content as it is published. The existing journal archiving services, LOCKSS, CLOCKSS and Portico have such access to about half the published articles, and could in principle be extended to perform this service. In practice doing so would need at least modest funding. The problem isn't as simple as it appears at first glance, even for the articles that are archived. For those that aren't, primarily from less IT-savvy authors and small publishers, the outlook is bleak.


Finally, the solutions assume that submitting a URL to an archive is enough to ensure preservation. It isn't. The referenced web site might have a robots.txt policy preventing collection. The site might have crawler traps, exceed the archive's crawl depth, or use Javascript in ways that prevent the archive collecting a usable representation. Or the archive may simply not process the request in time to avoid content drift or link rot.


I have to thank Herbert van de Sompel for greatly improving this post through constructive criticism. But it remains my opinion alone.


René Voorburg said...

"Note, however, that soft-403s and soft-404s pose the same problem for robustify.js as they do for all Web archiving technologies."

I just uploaded a new version of the robustify.js helper script (https://github.com/renevoorburg/robustify.js) that attempts to recognize soft-404s. It does so by forcing a '404' with a random request and comparing the results of that with the results of the original request (using fuzzy hashing). It seems to work very well but I am missing a good test set of soft 404's.

David. said...

Good idea, René!

Peter Burnhill said...

As ever, a good and challenging read. Although I am not one of the authors of the paper you review I have been involved in a lot of the underlying thinking as one of the PIs in the project, described at Hiberlink.org and would like to add a few comments, especially on the matter of potential remedy.

We were interested in the prospect of change & intervention in three simple workflows (for the author; for the issuing body; for the hapless library/repository) in order to enable transactional archiving of referenced content - reasoning that it was best that this was done as early as possible after the content on the web was regarded as important, and also that such archiving was best done when the actor in question had their mind in gear.

The prototyping using Zotero and OJS was done via plug-ins because having access to the source code our colleague Richard Wincewicz could mock this up as a demonstrator. One strategy was that would then invite ‘borrowing’ of the functionality (of snapshot/DateTimeStamp/archive/‘decorate’ with DateTimeStamp of URI within the citation) by commercial reference managers and editorial software so that authors and/or publishers (editors?) did not have to do something special.

Reference rot is a function of time: the sooner the fish (fruit?) is flash frozen the less it has chance to rot. However, immediate post-publication remedy is better than none. The suggestion that there is pro-active fix for content ingested into LOCKSS, CLOCKSS and Portico (and other Keepers of digital content) by archiving of references is very much welcomed. This is part of our thinking for remodelling Repository Junction Broker which supports machine ingest into institutional repositories but what you suggest could have greater impact.

Martin Klein said...

A comment on the issue of soft404s:

Your point is well taken and the paper's methodology section would clearly have benefited from mentioning this detriment and why we chose to not address it. My co-authors and I are very well aware of the soft404 issue, common approaches to detect them (such as introduced in [1] and [2]), and have, in fact, applied such methods in the past [3].

However, given the scale of our corpus of 1 million URIs, and the soft404 ratio found in previous studies (our [3] found a ratio of 0.36% and [4] found 3.41%), we considered checking for soft404s too expensive in light of potential return. Especially since, as you have pointed out in the past [5], web archives also archive soft404s, we would have had to detect soft404s on the live web as well as in web archives.

Regardless, I absolutely agree that our reference rot numbers for links to web at large resources likely represent a lower bound. It would be interesting to investigate the ratio of soft404s and build a good size corpus to evaluate common and future detection approaches.

The soft404 on the paper's reference 58 (which is introduced by the publisher) seems to "only" be a function of the PubMed search as a request for [6] returns a 404.

[1] http://dx.doi.org/10.1145/988672.988716
[2] http://dx.doi.org/10.1145/1526709.1526886
[3] http://arxiv.org/abs/1102.0930
[4] http://dx.doi.org/10.1007/978-3-642-33290-6_22
[5] http://blog.dshr.org/2013/04/making-memento-succesful.html
[6] http://www.ncbi.nlm.nih.gov/pubmed/aodfhdskjhfsjkdhfskldfj

David. said...

Peter Burnhill supports the last sentence of my post with this very relevant reference<:

thoughts of (Captain) Clarence Birdseye

Some advice on quick freezing references to Web caught resources:

Better done when references are noted (by the author), and then could be re-examined at point of issue (by the editor / publisher). When delivered by the crate (onto digital shelves) the rot may have set in for some of these fish ...

David. said...

Geoffrey Bilder has a very interesting and detailed first instalment of a multi-part report on the DOI outage that is well worth reading.

David. said...

As reported on the UK Serials Group listserv, UK Elsevier subscribers encountered a major outage last weekend due to "unforeseen technical issues".

David. said...

The outages continued sporadically through Tuesday.

This brings up another issue about the collection of link rot statistics. The model behind these studies so far is that a Web resource appears at some point in time, remains continually accessible for a period, then becomes inaccessible and remains inaccessible "for ever". Clearly, the outages noted here show that this isn't the case. Between the resource's first appearance and its last, there is some probably time-varying probability that it is available that is less than 1.

David. said...

Timothy Geigner at TechDirt supplies the canonical example of why depending on the DMCA "safe harbor" is risky for preservation. Although in this case the right thing happened in response to a false DMCA takedown notice, detecting them is between difficult and impossible.

David. said...

Herbert Van de Sompel, Martin Klein and Shawn Jones revisit the issue of why DOIs are not in practice used to refer to articles in a poster for WWW2016 Persistent URIs Must Be Used To Be Persistent. Note that this link is not a DOI, in this case because the poster doesn't have one (yet?).