Tuesday, April 11, 2017

The Orphans of Scholarship

This is the third of my posts from CNI's Spring 2017 Membership Meeting. Predecessors are Researcher Privacy and Research Access for the 21st Century.

Herbert Van de Sompel, Michael Nelson and Martin Klein's To the Rescue of the Orphans of Scholarly Communication reported on an important Mellon-funded project to investigate how all the parts of a research effort that appear on the Web other than the eventual article might be collected for preservation using Web archiving technologies. Below the fold, a summary of the 67-slide deck and some commentary.

The authors show that sites such as GitHub, wikis and WordPress are commonly used to record artifacts of research, but that these sites do not serve as archives. Further, the artifacts on these sites are poorly preserved by general Web archives. The authors therefore investigate the prospects for providing institutions with tools they can use to capture their own researchers' Web artifacts. They divide the problem of ingesting these artifacts for preservation into four steps.

First, discovering a researcher's Web identities. This is difficult, because the fragmented nature of the research infrastructure leads to researchers having accounts, and thus identities, at many different Web sites (ORCID, GitHub, ResearchGate, ...). There's no consistent way to discover and link them. They discuss two approaches:
  • EgoSystem, developed at LANL, takes biographical information about an individual and uses heuristics to find Web identities for them in a set of target Web sites such as Twitter and LinkedIn.
  • Mining ORCID for identities. Ideally, researchers would have ORCID IDs and their ORCID profiles would point to their research-related Web identities. Alas, ORCID's coverage outside the sciences, and outside the US and UK, is poor, and there is no standard for the information included in ORCID profiles (a sketch of this kind of mining follows below).
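
The sketch below uses ORCID's public API; the endpoint and JSON field names are my assumptions about the v3.0 public API rather than anything from the slides, and the ORCID iD shown is just a placeholder:

```python
import requests

ORCID_PUBLIC_API = "https://pub.orcid.org/v3.0"

def web_identities_from_orcid(orcid_id):
    """Return (name, URL) pairs a researcher has listed in their ORCID profile."""
    resp = requests.get(
        f"{ORCID_PUBLIC_API}/{orcid_id}/person",
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    person = resp.json()
    identities = []
    # "researcher-urls" is where profile owners list their other Web presences
    # (GitHub, personal site, ResearchGate, ...); field names assumed, not checked.
    for entry in (person.get("researcher-urls") or {}).get("researcher-url") or []:
        url = (entry.get("url") or {}).get("value")
        if url:
            identities.append((entry.get("url-name"), url))
    return identities

if __name__ == "__main__":
    # Placeholder ORCID iD; substitute a real researcher's.
    for name, url in web_identities_from_orcid("0000-0002-1825-0097"):
        print(name, url)
```
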
Second, discovering artifacts per Web identity. This is easier. Once you have a researcher's Web identities, conventional Web searching and page analysis techniques can harvest artifact links quite effectively. However, there is potentially a serious problem of over-collection. For example, which of the images in a researcher's Flickr account are research-related as opposed to vacation-related?
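
For the easy cases, something along these lines is all this step needs: fetch a known profile page and treat its outlinks as candidate artifacts. The profile URL is hypothetical, and the sketch deliberately stops where the hard part begins, namely deciding which links are actually research-related:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def harvest_candidate_artifacts(profile_url):
    """Return the absolute outlinks of a profile page as candidate artifacts."""
    resp = requests.get(profile_url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    links = set()
    for anchor in soup.find_all("a", href=True):
        links.add(urljoin(profile_url, anchor["href"]))
    # Everything from here on is the hard part: filtering out navigation,
    # advertising and vacation photos to leave only research artifacts.
    return sorted(links)

# e.g. harvest_candidate_artifacts("https://github.com/some-researcher")
```
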

Third, determining the Web boundary per artifact. This is the domain of Signposting, which I wrote about here. The issues are very similar to those in Web Infrastructure to Support e-Journal Preservation (and More) by Herbert, Michael and myself.
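
For readers unfamiliar with Signposting, a rough sketch of how a harvester might use it: fetch the artifact's landing page and collect the typed links from its HTTP Link header. The relation types checked for here are my reading of the Signposting pattern; any given site may expose only some of them, or none, and some sites only include the header on GET rather than HEAD:

```python
import requests

# Link relation types I understand Signposting to use for boundary discovery.
BOUNDARY_RELS = {"item", "collection", "describedby", "cite-as"}

def signposting_boundary(landing_page_url):
    """Return {rel: [urls]} for Signposting relations on a landing page."""
    resp = requests.head(landing_page_url, allow_redirects=True, timeout=30)
    resp.raise_for_status()
    boundary = {}
    link_header = resp.headers.get("Link")
    if not link_header:
        return boundary  # the site doesn't expose Signposting links (yet)
    # Parse the raw header ourselves so repeated rels aren't collapsed.
    for link in requests.utils.parse_header_links(link_header):
        for rel in link.get("rel", "").split():
            if rel in BOUNDARY_RELS:
                boundary.setdefault(rel, []).append(link["url"])
    return boundary
```
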

Fourth, capturing the artifacts within the artifact's Web boundary. After noting the legal uncertainties caused by the widely varying license terms of the sites hosting research artifacts (a significant barrier in practice), they show that capture tools vary widely in their ability to collect usable Mementos from these sites. Building on Not all mementos are created equal: measuring the impact of missing resources, they describe a system for automatically scoring the quality of Mementos. This may be applicable to the LOCKSS harvest ingest pipeline; the team hopes to evaluate it soon.
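
A much-simplified version of such scoring can be sketched as follows: fetch the Memento, enumerate its embedded resources, and report the fraction the archive can actually serve. The real damage metric in the paper weights resources by their importance to the page; this sketch treats them all equally:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def memento_completeness(memento_url):
    """Fraction of a Memento's embedded resources that the archive resolves."""
    resp = requests.get(memento_url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    embedded = set()
    for tag, attr in (("img", "src"), ("script", "src"), ("link", "href")):
        for element in soup.find_all(tag):
            if element.get(attr):
                embedded.add(urljoin(memento_url, element[attr]))
    if not embedded:
        return 1.0
    resolved = 0
    for url in embedded:
        try:
            if requests.head(url, allow_redirects=True,
                             timeout=30).status_code == 200:
                resolved += 1
        except requests.RequestException:
            pass  # network failure counts as a missing resource
    return resolved / len(embedded)
```
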

The authors end on the question of how the authenticity of Mementos can be established. What does it mean for a Memento to be authentic? In an ideal world it would mean that the Memento was the same as the content of the Web site. But even setting aside the difficulties of collection, in the real world this isn't possible: Web pages are different every time they are visited, so the Platonic ideal of "the content of the Web site" doesn't exist.

The authors mean by "authentic" that the content obtained from the archive by a later reader is the same as was originally ingested by the archive; it hasn't been replaced by some evil-doer during its stay in the archive. They propose to verify this via a separate service recording the hashes of Mementos obtained by the archive at the time of ingest, perhaps even a blockchain of them.
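
As a toy illustration of the idea (mine, not the authors' design), an ingest-time service might record a SHA-256 of each Memento's raw content in an append-only, hash-chained log, so that a later reader can re-hash what they receive and check it against the entry made at ingest:

```python
import hashlib
import json
import time

class IngestHashLog:
    """Append-only, hash-chained record of Memento content hashes at ingest."""

    def __init__(self):
        self.entries = []

    def record(self, memento_uri, content):
        """Append a chained entry for content ingested from memento_uri."""
        prev = self.entries[-1]["entry_hash"] if self.entries else "0" * 64
        body = {
            "memento_uri": memento_uri,
            "content_hash": hashlib.sha256(content).hexdigest(),
            "ingest_time": int(time.time()),
            "prev_entry_hash": prev,  # the chaining that makes rewrites evident
        }
        entry_hash = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.entries.append({**body, "entry_hash": entry_hash})
        return entry_hash

    def verify(self, memento_uri, content):
        """Check content delivered later against the hash recorded at ingest."""
        digest = hashlib.sha256(content).hexdigest()
        return any(e["memento_uri"] == memento_uri and
                   e["content_hash"] == digest for e in self.entries)
```
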

There are a number of tricky issues here. First, it must be possible to request an archive to deliver an unmodified Memento, as opposed to the modified ones that the Wayback Machine (for example) delivers, with links re-written, the content framed, etc.  Then there are the problems associated with any system that relies on stored hashes for long-term integrity. Then there is the possibility that the evil-doer was doing evil during the original ingestion, so that the hash stored in the separate service is of the replacement, not the content obtained from the Web site.
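
On the first issue, the Wayback Machine does provide a way to request the raw capture, by inserting an "id_" flag after the timestamp in the Memento's URL, and it is against that raw form that any hash check has to run. A small sketch, with the URL layout reflecting my understanding of the Wayback convention rather than anything in the slides:

```python
import hashlib
import requests

def raw_memento_hash(timestamp, original_url):
    """SHA-256 of an unmodified Wayback capture (no link re-writing or framing)."""
    # The "id_" flag after the timestamp asks for the raw capture.
    raw_url = f"https://web.archive.org/web/{timestamp}id_/{original_url}"
    resp = requests.get(raw_url, timeout=60)
    resp.raise_for_status()
    return hashlib.sha256(resp.content).hexdigest()

# e.g. raw_memento_hash("20170411000000", "http://example.com/")
```
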

