Over the past fifteen years, our perspective on tackling information interoperability problems for web-based scholarship has evolved significantly. In this opinion piece, we look back at three efforts that we have been involved in that aptly illustrate this evolution: OAI-PMH, OAI-ORE, and Memento. Understanding that no interoperability specification is neutral, we attempt to characterize the perspectives and technical toolkits that provided the basis for these endeavors. With that regard, we consider repository-centric and web-centric interoperability perspectives, and the use of a Linked Data or a REST/HATEOAS technology stack, respectively. We also lament the lack of interoperability across nodes that play a role in web-based scholarship, but end on a constructive note with some ideas regarding a possible path forward.

They describe their evolution from OAI-PMH, a custom protocol that used the Web simply as a transport for remote procedure calls, to Memento, which uses only the native capabilities of the Web. They end with a profoundly important proposal they call Signposting the Scholarly Web which, if deployed, would be a really big deal in many areas. Some further details are on GitHub, including this somewhat cryptic use case:
Use case like LOCKSS is the need to answer the question: What are all the components of this work that should be preserved? Follow all rel="describedby" and rel="item" links (potentially multiple levels perhaps through describedby and item).

Below the fold I explain what this means, and why it would be a really big deal for preservation.
Much of the scholarly Web consists of articles, each of which has a Digital Object Identifier (DOI). Herbert and Michael's paper's DOI is 10.1045/november2015-vandesompel. You can access it by dereferencing this link: http://dx.doi.org/10.1045/november2015-vandesompel. CrossRef's DOI resolver will redirect you to the current location of the article, providing location-independence. The importance of location-independent links, and the fact that they are frequently not used, was demonstrated by Martin Klein and a team from the Hiberlink project in Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. I discussed this article in The Evanescent Web.
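The mechanics of DOI resolution are simple: a DOI is turned into a URL on the resolver, which answers with an HTTP redirect to the article's current location. A minimal sketch in Python, using only the standard library (the `resolve` helper name is mine, and actually following the redirect requires network access):

```python
# Sketch: resolving a DOI to its current landing-page URL.
# The dx.doi.org resolver answers a GET with an HTTP redirect whose
# Location header carries the article's current address; following
# redirects gives location-independence.
import urllib.request

def doi_to_resolver_url(doi: str) -> str:
    """Build the location-independent link for a DOI."""
    return "http://dx.doi.org/" + doi

def resolve(doi: str) -> str:
    """Follow the resolver's redirects to the current location.
    (Needs network access; urlopen follows redirects automatically.)"""
    with urllib.request.urlopen(doi_to_resolver_url(doi)) as resp:
        return resp.geturl()

print(doi_to_resolver_url("10.1045/november2015-vandesompel"))
# → http://dx.doi.org/10.1045/november2015-vandesompel
```

If the article later moves, only the resolver's database needs updating; links citing the DOI keep working.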
But Herbert and Michael's paper is an anomaly. The DOI resolution service redirects you to the full-text HTML of the paper. This is not what usually happens. A more representative but very simple example is: http://dx.doi.org/10.2218/ijdc.v8i1.248. You are redirected to a "landing page" that contains the abstract, some information about the journal, and a lot of links. Try "View Source" to get some idea of how complex this simple example is; it links to 36 other resources. Some, such as stylesheets, should be collected for preservation. Others, such as the home pages of the journal's funders, should not be. Only one of the linked resources is the PDF of the article, which is the resource most needing preservation.
If a system is asked to ingest and preserve this DOI, it needs to be sure that, whatever else it got, it did get the article's PDF. In this very simple, well-organized case there are two ways to identify the link leading to that PDF:
- The link the reader would click on to get the PDF, whose target is http://www.ijdc.net/index.php/ijdc/article/view/8.1.107/300 and whose anchor text is "PDF".
- A meta-tag with name="citation_pdf_url" content="http://www.ijdc.net/index.php/ijdc/article/view/8.1.107/300".
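The second approach is the easier one to automate. A crawler can scan the landing page's HTML for the citation_pdf_url meta-tag with the standard library's parser; the HTML snippet below is a made-up miniature of a real landing page:

```python
# Sketch: finding the article PDF on a landing page via the
# citation_pdf_url meta-tag (the second heuristic above).
from html.parser import HTMLParser

class PdfMetaFinder(HTMLParser):
    """Record the content of <meta name="citation_pdf_url" ...>."""
    def __init__(self):
        super().__init__()
        self.pdf_url = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name") == "citation_pdf_url":
            self.pdf_url = a.get("content")

# A tiny stand-in for the real, 36-link landing page.
landing_page = """
<html><head>
<meta name="citation_pdf_url"
      content="http://www.ijdc.net/index.php/ijdc/article/view/8.1.107/300">
</head><body>...</body></html>
"""

finder = PdfMetaFinder()
finder.feed(landing_page)
print(finder.pdf_url)
```

Note that this still requires fetching and parsing the whole landing page, and it only identifies the PDF, not the other resources that belong to the article.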
Top half of Figure 5.
Now, when the preservation crawler is redirected to and fetches the landing page, the HTTP response headers contain a set of Link entries. Fetching each of them ensures that all the resources the publisher thinks are part of the article are collected for preservation. No heuristics are needed; there is no need even to parse the landing page HTML to find links to follow.
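A crawler's side of this is straightforward: parse the Link header and fetch every target with a relevant rel value. A simplified sketch (the header value and example.org URLs are hypothetical; the rel values "item" and "describedby" are the ones the Signposting use case names):

```python
# Sketch: how a preservation crawler might read Signposting links
# from a landing page's HTTP Link header.
import re

def parse_link_header(value):
    """Parse an HTTP Link header into (url, rel) pairs.
    Simplified: assumes one rel per link and no commas inside URLs."""
    links = []
    for part in value.split(","):
        m = re.search(r'<([^>]+)>\s*;\s*rel="?([^";]+)"?', part)
        if m:
            links.append((m.group(1), m.group(2)))
    return links

# Hypothetical Link header on a landing-page response.
header = ('<http://example.org/article.pdf>; rel="item", '
          '<http://example.org/article.bib>; rel="describedby"')

# Everything the publisher marked as part of the article.
to_fetch = [url for url, rel in parse_link_header(header)
            if rel in ("item", "describedby")]
print(to_fetch)
```

Repeating this on each fetched resource gives the multi-level traversal the LOCKSS use case describes; no landing-page HTML ever needs to be parsed.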
Of course, this describes an ideal world. Experience with the meta-tags that publishers use to include bibliographic metadata suggests some caution in relying solely on these Link headers. Large publishing platforms could be expected to get them right most of the time, but headers on smaller platforms might be less reliable. Some best practices would be needed. For example, are script tags enough to indicate JavaScript that is part of the article, or do the JavaScript files that are part of the article need a separate Link header?
Despite these caveats it is clear that even if this way of unambiguously defining the boundaries of the artefact identified by a DOI were not universal, it would significantly reduce the effort needed to consistently and completely collect these artefacts for preservation. Ingest is the major cost of preservation, and "can't afford to collect" is the major cause of content failing to reach future readers. Thus anything that can significantly reduce the cost of collection is truly a big deal for preservation.
Related efforts at the W3C include Packaging on the Web and Portable Web Publications for the Open Web Platform.