Thursday, April 18, 2019

Personal Pods and Fatcat

Sir Tim Berners-Lee's Solid project envisages a decentralized Web in which people control their own data stored in personal "pods":
The basic idea of Solid is that each person would own a Web domain, the "host" part of a set of URLs that they control. These URLs would be served by a "pod", a Web server controlled by the user that implemented a whole set of Web API standards, including authentication and authorization. Browser-side apps would interact with these pods, allowing the user to:
  • Export a machine-readable profile describing the pod and its capabilities.
  • Write content for the pod.
  • Control others' access to the content of the pod.
Pods would have inboxes to receive notifications from other pods. For example, if Alice writes a document and Bob writes a comment in his pod that links to it in Alice's pod, a notification appears in the inbox of Alice's pod announcing that event. Alice can then link from the document in her pod to Bob's comment in his pod. In this way, users are in control of their content, which, if access is allowed, can be used by Web apps elsewhere.
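The Alice and Bob exchange can be sketched in a few lines of Python. This is only an illustrative in-memory model, not Solid's actual API: the `Pod` class, its methods, the notification fields and the `example` URLs are all invented for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Hypothetical stand-in for a personal pod: documents plus an inbox."""
    owner: str
    documents: dict = field(default_factory=dict)  # path -> {"body", "links"}
    inbox: list = field(default_factory=list)      # notifications from other pods

    def url(self, path):
        # Invented URL scheme: each owner controls their own domain.
        return f"https://{self.owner}.example/{path}"

    def write(self, path, body, links=()):
        """Store a document in the pod and return its URL."""
        self.documents[path] = {"body": body, "links": list(links)}
        return self.url(path)

    def notify(self, actor, source_url, target_url):
        """Record that source_url (in another pod) links to target_url (here)."""
        self.inbox.append(
            {"actor": actor, "object": source_url, "target": target_url}
        )

    def link(self, path, target_url):
        """Add an outgoing link from one of this pod's documents."""
        self.documents[path]["links"].append(target_url)


alice = Pod("alice")
bob = Pod("bob")

# Alice writes a document; Bob writes a comment in his pod that links to it.
doc_url = alice.write("paper", "Alice's document")
comment_url = bob.write("comment", "Bob's comment", links=[doc_url])

# Bob's pod announces the event in Alice's inbox.
alice.notify("bob", comment_url, doc_url)

# Alice decides to link back from her document to each announced comment.
for note in alice.inbox:
    alice.link("paper", note["object"])
```

The point of the sketch is that each pod only ever writes to its own documents; the only cross-pod write is the notification landing in an inbox.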
In his Paul Evan Peters Award Lecture, my friend Herbert Van de Sompel applied this concept to scholarly communication, envisaging a world in which access, for both humans and programs, to all the artifacts of research would be greatly enhanced.
In Herbert's vision, institutions would host their researchers' "research pods", which would be part of their personal domain but would have extensions specific to scholarly communication, such as automatic archiving upon publication.
Follow me below the fold for an update to my take on the practical possibilities of Herbert's vision.

This improved access would be enabled by metadata, generated both by the decentralized Web infrastructure and by the researchers, connecting the multifarious types of digital objects representing the progress of their research.

The key access improvements in Herbert's vision are twofold:
  • Individuals, not platforms such as Google or Elsevier, control access to their digital objects.
  • Digital objects in pods are described by, and linked by, standardized machine-actionable metadata.
Their importance is in allowing much improved access to the digital objects by machines, not so much by humans. Text mining from published papers has already had significant results, so much so that publishers are selling the service on their platforms. This balkanization isn't helpful. Herbert's vision is of a world in which all digital research objects are uniformly accessible via a consistent, Web-based API.

Herbert was skeptical that transitioning scholarly communication in this way was achievable. I agreed with him at length in both "Herbert Van de Sompel's Paul Evan Peters Award Lecture" and "It Isn't About The Technology", but didn't address the obvious question:
How much of the improved access in Herbert's vision could be implemented in the Web we have right now, rather than waiting for the pie-in-the-sky-by-and-by decentralized Web?
Clearly, the academic publishing oligopoly and the copyright maximalists aren't going to allow us to implement the first part. Even were Open Access to become the norm, their track record shows it will be Open Access to digital objects they host (and in many cases under a restrictive license).
Elsevier's Research Infrastructure
The Web we have lacks the mechanisms for automatically generating the necessary metadata. Experience shows that the researchers we have are unable to generate the necessary metadata. How could implementing the second part of Herbert's vision be possible?

Thanks to generous funding from the Andrew W. Mellon Foundation (I helped write the grant proposal), a team at the Internet Archive is working on a two-pronged approach. Prong 1 starts from Web objects known to be scholarly outputs because, for example, they have been assigned a DOI and:
  • Ensures that, modulo paywall barriers, they and the objects to which they link are properly archived by the Wayback Machine.
  • Extracts and, as far as possible, verifies the bibliographic metadata for the archived objects.
  • Implements access to the archived objects in the Wayback Machine via bibliographic rather than URL search.
Prong 2 takes the opposite approach, using machine learning techniques to identify objects in the Wayback Machine that appear to be scholarly outputs and:
  • Extracts and, as far as possible, verifies the bibliographic metadata for the archived objects.
  • Implements access to the archived objects in the Wayback Machine via bibliographic rather than URL search.
The goal of this work is to improve archiving of the "long tail" of scholarly communication by applying "big data" automation to ensure that objects are discovered, archived, and accessible via bibliographic metadata. Current approaches (LOCKSS, Portico, national library copyright deposit programs) involve working with publishers, which works well for large commercial publishers but is too resource intensive to cover more than a small fraction of the long tail. Thus current efforts to archive scholarly outputs are too focused on journal articles, and too focused on expensive journals, and thus too focused on content that is at low risk of loss.
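To make "bibliographic rather than URL search" concrete, here is a minimal sketch of an index that maps extracted metadata to archived captures. It makes no claim about how the Internet Archive implements this; the `BibliographicIndex` class, the sample DOI and the `archive.example` capture URLs are all hypothetical.

```python
from collections import defaultdict


def normalize(text):
    """Lower-case and collapse whitespace so trivially different titles match."""
    return " ".join(text.lower().split())


class BibliographicIndex:
    """Hypothetical lookup from bibliographic metadata to archived captures."""

    def __init__(self):
        self._by_doi = {}
        self._by_title = defaultdict(list)

    def add(self, capture_url, title, doi=None):
        """Register one archived capture under its extracted metadata."""
        if doi:
            self._by_doi[doi.lower()] = capture_url
        self._by_title[normalize(title)].append(capture_url)

    def find_by_doi(self, doi):
        """Return the capture for a DOI, or None if it is unknown."""
        return self._by_doi.get(doi.lower())

    def find_by_title(self, title):
        """Return every capture whose normalized title matches."""
        return list(self._by_title.get(normalize(title), []))


index = BibliographicIndex()

# Prong 1 style: an object already known to be scholarly because it has a DOI.
index.add("https://archive.example/capture/1",
          "An Example Journal Article", doi="10.1234/EXAMPLE")

# Prong 2 style: a long-tail object with no DOI, found by a classifier.
index.add("https://archive.example/capture/2", "An example   blog post")

by_doi = index.find_by_doi("10.1234/example")
by_title = index.find_by_title("An Example Blog Post")
```

A reader who knows the citation but not the URL can now reach the capture, which is exactly what URL-keyed access to a Web archive cannot offer.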

Fatcat entry for Joi Ito blog post
The team at the Internet Archive have an initial version of the first prong up at The home page includes links to examples of preliminary Fatcat content for various types of research objects, such as a well-known blog post by Joi Ito. The Fatcat "About" page starts:
Fatcat is a versioned, publicly-editable catalog of research publications: journal articles, conference proceedings, pre-prints, blog posts, and so forth. The goal is to improve the state of preservation and access to these works by providing a manifest of full-text content versions and locations.

This service does not directly contain full-text content itself, but provides basic access for human and machine readers through links to copies in web archives, repositories, and the public web.

Significantly more context and background information can be found in The Guide.
Now, suppose Fatcat succeeds in its goals. It would provide a metadata infrastructure that could be enhanced to provide many of the capabilities Herbert envisaged, albeit in a centralized rather than a decentralized manner. The pod example above could be rewritten for the enhanced Fatcat environment thus:
If Alice posts a document to the Web that Fatcat recognizes in the Wayback Machine's crawls as a research output, Fatcat will index it, ensure it and the things it links to are archived, and create a page for it. Suppose Bob, a researcher with a blog which Fatcat indexes via Bob's ORCID entry, writes a comment in one of his blog's posts that links to Alice's document. Fatcat's crawls will notice the comment and:
  • Update the page for Bob's blog post to include a link to Alice's document.
  • Update the page for Alice's document to include a link to Bob's comment.
Because Fatcat exports its data via an API as JSON, the information about each document, including its links to other documents, is available in machine-actionable form to third-party services. They can create their own UIs, and aggregate the data in useful ways.
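As a sketch of what such a third-party service might do, the following takes a JSON export and derives the "linked from" view for every document. The record layout here (`id`, `title`, `links_to`) is invented for illustration and is not Fatcat's actual schema, which is described in The Guide.

```python
import json
from collections import defaultdict

# Hypothetical export: one record per catalogued object, with outgoing links.
EXPORT = json.dumps([
    {"id": "alice-document", "title": "Alice's document", "links_to": []},
    {"id": "bob-comment", "title": "Bob's comment",
     "links_to": ["alice-document"]},
])


def linked_from(raw_json):
    """Invert the outgoing links so each target lists the records citing it."""
    incoming = defaultdict(list)
    for record in json.loads(raw_json):
        for target in record["links_to"]:
            incoming[target].append(record["id"])
    return dict(incoming)


backlinks = linked_from(EXPORT)
```

Here `backlinks` maps `alice-document` to the list containing `bob-comment`, the same pair of updates the crawl makes to the two pages in the example above.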
As a manually-created demonstration of what this enhanced Fatcat would look like, take this important paper in Science's 27th January 2017 issue, "Gender stereotypes about intellectual ability emerge early and influence children’s interests" by Lin Bian, Sarah-Jane Leslie and Andrei Cimpian. The authors' affiliations are the University of Illinois, Champaign, New York University, and Princeton University. Here are the things I could find in about 90 minutes that the enhanced Fatcat would link to and from:
[I'm sorry I don't have time to encode all this as JSON as specified in The Guide.]

Linking together the various digital objects representing the outputs of a single research effort is at the heart of Herbert's vision. It is true that the enhanced Fatcat would be centralized, and thus potentially a single point of failure. And that it would be less timely, less efficient, and would lack granular access control (it can only deal with open access objects). But it's also true that the enhanced Fatcat avoids many of the difficulties of the decentralized version that I raised. They are caused by the presence of multiple copies of objects, for example in the personal pods of each member of a multitudinous research team, or at their various institutions.

Given that both Herbert and I express considerable skepticism as to the feasibility of implementing his vision even were a significant part of the Web to become decentralized, exploring ways to deliver at least some of its capabilities on a centralized infrastructure seems like a worthwhile endeavor.

Update: Herbert points out that related work is also being funded by the Mellon Foundation in a collaborative project between Los Alamos and Old Dominion called
The modules in the pipeline are as follows:
  • Discovery of new artifacts deposited by a researcher in a portal is achieved by a Tracker that recurrently polls the portal's API using the identity of the researcher in each portal as an access key. If a new artifact is discovered, its URI is passed on to the capture process.
  • Capturing an artifact is achieved by using web archiving techniques that pay special attention to generating representative high fidelity captures. A major project finding in this realm is the use of Traces that abstractly describe how a web crawler should capture a certain class of web resources. A Trace is recorded by a curator through interaction with a web resource that is an instance of that class. The result of capturing a new artifact is a WARC file in an institutional archive. The file encompasses all web resources that are an essential part of the artifact, according to the curator who recorded the Trace that was used to guide the capture process.
  • Archiving is achieved by ingesting WARC files from various institutions into a cross-institutional web archive that supports the Memento "Time Travel for the Web" protocol. As such, the Mementos in this web archive integrate seamlessly with those in other web archives.
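The discovery step of that pipeline can be sketched as follows. This is a toy model under stated assumptions, not the project's code: the `Tracker` class here, the `list_artifacts` and `capture` callables, and the fake portal are all hypothetical, and a real capture step would write a WARC file rather than append to a list.

```python
from typing import Callable, Iterable, List


class Tracker:
    """Hypothetical poller: finds new artifact URIs and hands them to capture."""

    def __init__(self,
                 list_artifacts: Callable[[str], Iterable[str]],
                 capture: Callable[[str], None]):
        self._list_artifacts = list_artifacts  # stands in for a portal's API
        self._capture = capture                # stands in for the capture process
        self._seen = set()                     # URIs already passed on

    def poll(self, researcher_id: str) -> List[str]:
        """Poll once; pass each previously unseen URI to capture and return them."""
        new_uris = []
        for uri in self._list_artifacts(researcher_id):
            if uri not in self._seen:
                self._seen.add(uri)
                self._capture(uri)
                new_uris.append(uri)
        return new_uris


# Fake portal keyed by the researcher's identity in that portal.
portal = {"alice": ["https://portal.example/alice/slides-1",
                    "https://portal.example/alice/dataset-1"]}
captured = []

tracker = Tracker(lambda who: portal.get(who, []), captured.append)

first = tracker.poll("alice")    # both artifacts are new
portal["alice"].append("https://portal.example/alice/slides-2")
second = tracker.poll("alice")   # only the newly deposited artifact
```

Remembering which URIs have been seen is what lets the Tracker poll recurrently without re-capturing everything each time.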
Major differences between the two include:
  • Targeted at specific platforms vs. generic Web.
  • Researcher-centric vs. object-centric.
  • Content-focused vs. metadata-focused.
  • Curator-driven vs. automated collection.

1 comment:

Anonymous said...

Thanks for the write-up!

I manually added a couple of the additional resources you linked to (namely, dataset locations):

Though of course the entire point is to automate these processes... or better yet, have "the crowd" automate the parts they are interested in, and combine the results.