Wednesday, December 23, 2015

Signposting the Scholarly Web

At the Fall CNI meeting, Herbert Van de Sompel and Michael Nelson discussed an important paper they had just published in D-Lib, Reminiscing About 15 Years of Interoperability Efforts. The abstract is:
Over the past fifteen years, our perspective on tackling information interoperability problems for web-based scholarship has evolved significantly. In this opinion piece, we look back at three efforts that we have been involved in that aptly illustrate this evolution: OAI-PMH, OAI-ORE, and Memento. Understanding that no interoperability specification is neutral, we attempt to characterize the perspectives and technical toolkits that provided the basis for these endeavors. With that regard, we consider repository-centric and web-centric interoperability perspectives, and the use of a Linked Data or a REST/HATEOAS technology stack, respectively. We also lament the lack of interoperability across nodes that play a role in web-based scholarship, but end on a constructive note with some ideas regarding a possible path forward.
They describe their evolution from OAI-PMH, a custom protocol that used the Web simply as a transport for remote procedure calls, to Memento, which uses only the native capabilities of the Web. They end with a profoundly important proposal they call Signposting the Scholarly Web which, if deployed, would be a really big deal in many areas. Some further details are on GitHub, including this somewhat cryptic use case:
Use case like LOCKSS is the need to answer the question: What are all the components of this work that should be preserved? Follow all rel="describedby" and rel="item" links (potentially multiple levels perhaps through describedby and item).
Below the fold I explain what this means, and why it would be a really big deal for preservation.

Much of the scholarly Web consists of articles, each of which has a Digital Object Identifier (DOI). Herbert and Michael's paper's DOI is 10.1045/november2015-vandesompel. You can access it by dereferencing this link: CrossRef's DOI resolver will redirect you to the current location of the article, providing location-independence. The importance of location-independent links, and the fact that they are frequently not used, was demonstrated by Martin Klein and a team from the Hiberlink project in Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. I discussed this article in The Evanescent Web.

But Herbert and Michael's paper is an anomaly. The DOI resolution service redirects you to the full text HTML of the paper. This is not what usually happens. A more representative but very simple example is: You are redirected to a "landing page" that contains the abstract, some information about the journal, and a lot of links. Try "View Source" to get some idea of how complex this simple example is; it links to 36 other resources. Some, such as stylesheets, should be collected for preservation. Others, such as the home pages of the journal's funders, should not be. Only one of the linked resources is the PDF of the article, which is the resource most needing preservation.

If a system is asked to ingest and preserve this DOI it needs to be sure that, whatever else it got, it did get the article's PDF. In this very simple, well-organized case there are two ways to identify the link leading to that PDF:
  • The link the reader would click on to get the PDF whose target is and whose anchor text is "PDF".
  • A meta-tag with name="citation_pdf_url" content="".
So we have two heuristics for finding the article's PDF, the anchor text and the citation_pdf_url meta-tag. Others might include anchor text such as "Download" or "Full Text". Similarly, the system needs to use heuristics to decide which links, such as those to the funders' home pages, not to follow. Sites vary a lot, and in practice preservation crawlers need a range of such heuristics. Most landing pages are far more complex than this example.
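To make the two heuristics concrete, here is a simplified sketch of how a crawler might apply them to a landing page: prefer the citation_pdf_url meta-tag, and fall back to anchor-text matching. This is illustrative only; the names and the tiny set of anchor texts are my own, and real crawlers (including LOCKSS) are far more elaborate.

```python
# Sketch of the two PDF-finding heuristics described above (hypothetical,
# simplified): prefer the citation_pdf_url meta-tag, then fall back to
# matching the anchor text of links ("PDF", "Download", "Full Text").
from html.parser import HTMLParser

PDF_ANCHOR_TEXTS = {"pdf", "download", "full text"}

class PdfLinkFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta_pdf = None        # from the citation_pdf_url meta-tag
        self.anchor_pdf = None      # from anchor-text heuristics
        self._current_href = None   # href of the <a> tag we are inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "citation_pdf_url":
            self.meta_pdf = attrs.get("content")
        elif tag == "a":
            self._current_href = attrs.get("href")

    def handle_data(self, data):
        # Anchor text heuristic: keep the href if the link text matches.
        if self._current_href and data.strip().lower() in PDF_ANCHOR_TEXTS:
            self.anchor_pdf = self._current_href

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None

def find_pdf_url(html):
    """Best guess at the article PDF's URL, meta-tag first."""
    finder = PdfLinkFinder()
    finder.feed(html)
    return finder.meta_pdf or finder.anchor_pdf
```

The point of the sketch is how fragile this is: every publisher variation ("Full text PDF", JavaScript-generated links, and so on) needs another special case.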

Top half of Figure 5.
The LOCKSS system's technology for extracting metadata, such as which URLs are articles, abstracts, figures, and so on, was outlined in a talk at the IIPC's 2013 GA, and detailed in one of the documents submitted for the CLOCKSS Archive's TRAC audit. It could be much simpler and more reliable if Herbert and Michael's proposal were adopted. Figure 5 in their paper shows two examples of signposting; the relevant one is the top half. It shows the normal case of accessing an article via a DOI. The DOI redirects to a landing page whose HTML text, as before, links to many resources. Some, such as A, are not part of the article. Some, such as the PDF, are. These resources are connected by typed links, as shown in the diagram. These typed links are implemented as HTTP Link headers whose rel attribute expresses the type of the link using an IANA-registered type such as describes or item.
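In concrete terms, a landing page response carrying such typed links might look something like this (the URLs are hypothetical; the rel values are the IANA-registered types the paper mentions):

```http
HTTP/1.1 200 OK
Content-Type: text/html;charset=utf-8
Link: <https://example.org/article/7/paper.pdf>; rel="item"; type="application/pdf",
      <https://example.org/article/7/figure1.png>; rel="item"; type="image/png",
      <https://example.org/article/7/metadata.json>; rel="describedby"
```

Each component of the article is one link entry, typed so that a machine can tell the PDF and figures apart from, say, the journal's stylesheets.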

Now, when the preservation crawler is redirected to and fetches the landing page, the HTTP headers contain a set of link entries. Fetching each of them ensures that all the resources the publisher thinks are part of the article are collected for preservation. No heuristics are needed; there is no need even to parse the landing page HTML to find links to follow.
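Under that assumption, the crawler's job reduces to parsing one header. A minimal sketch (the parser is simplified and does not handle every corner of RFC 5988 syntax, and the header value is hypothetical):

```python
# Sketch of a signposting-aware collection step: extract from an HTTP Link
# header every target whose rel type marks it as part of the article.
# A real crawler would fetch each returned URL, and follow rel="describedby"
# and rel="item" links recursively, as the LOCKSS use case describes.
import re

# Matches one "<url>; param; param" entry in a Link header value.
LINK_ENTRY_RE = re.compile(r'<([^>]*)>\s*((?:;[^,<]*)*)')

def signposted_targets(link_header, rels=("item", "describedby")):
    """Return the URLs in an HTTP Link header whose rel is in `rels`."""
    targets = []
    for url, params in LINK_ENTRY_RE.findall(link_header):
        m = re.search(r'rel="([^"]*)"', params)
        if m and m.group(1) in rels:
            targets.append(url)
    return targets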

Of course, this describes an ideal world. Experience with the meta-tags that publishers use to include bibliographic metadata suggests some caution in relying solely on these Link headers. Large publishing platforms could be expected to get them right most of the time; headers on smaller platforms might be less reliable. Some best practices would be needed. For example, are script tags enough to indicate JavaScript that is part of the article, or do the JavaScript files that are part of the article need a separate link header?

Despite these caveats it is clear that even if this way of unambiguously defining the boundaries of the artefact identified by a DOI was not universal, it would significantly reduce the effort needed to consistently and completely collect these artefacts for preservation. Ingest is the major cost of preservation, and "can't afford to collect" is the major cause of content failing to reach future readers. Thus anything that can significantly reduce the cost of collection is truly a big deal for preservation.