Tuesday, January 23, 2018

Herbert Van de Sompel's Paul Evan Peters Award Lecture

In It Isn't About The Technology, I wrote about my friend Herbert Van de Sompel's richly-deserved Paul Evan Peters award lecture entitled Scholarly Communication: Deconstruct and Decentralize?, but only in the context of the push to "decentralize the Web". I believe Herbert's goal for this lecture was to spark discussion. In that spirit, below the fold, I have some questions about Herbert's vision of a future decentralized system for scholarly communications built on existing Web protocols. They aren't about the technology but about how it would actually operate.

My questions fall into two groups: questions about how the collaborative nature of today's research maps onto the decentralized vision, and questions about how the decentralized vision copes with the abuse and fraud that are regrettably so prevalent in today's scholarly communication. For some of the questions I believe I spotted the answer in Herbert's talk; my answers are in italic.

Collaboration

Tabby's Star Flux
I was one of the backers of the "Where's The Flux" Kickstarter, astronomer Tabetha Boyajian's successful effort to crowdfund monitoring the bizarre variations of "Tabby's Star" (KIC 8462852) which, alas, are no longer thought to be caused by "alien megastructures".

A 2016 paper arising from this research is Planet Hunters X. KIC 8462852 – Where’s the flux?, with Boyajian as primary author and 48 co-authors. 11 of the authors list their affiliation as "Amateur Astronomer"; the discovery was made by the Planet Hunters citizen science project. The affiliations of the remaining 38 authors list 28 different institutions. 7 of the authors list multiple affiliations, one of them lists 3. The paper acknowledges support from at least 25 different grants from, by my count, 10 different funders. The paper combines observations from at least 7 different instruments and 2 major astronomical databases. (There is actually a more recent paper about this star with authors listing 114 different affiliations, but I was lazy).

As I understand it, Herbert's vision is to have researchers in control of access to and the content of their own output on the Web, in "pods" hosted by their institution. This sounds like a laudable goal. But in this case the output is from 49 authors who each have from 0 to 3 institutions. The question is, where is the paper? Is it:
  • In Boyajian's pod at Yale, the primary author's sole institution? She was post-doc there at the time of publication. Or at Louisiana State, where she is now faculty? Do we really want one person to have sole control over access to and the content of this collaborative work? What, for example, is to stop the primary author of a popular paper erecting a paywall? Or preventing access by a hated rival?
  • In each institutional author's pod at each of their institutions? That's 46 copies. But what about the citizen scientists? Are only credentialed researchers entitled to credit for and control of their work? Is the Planet Hunters project an institution?
  • In each author's pod at each of their 46 institutions, plus 11 personal pods some place else? That's a total of 57 copies of one paper. Lots Of Copies Keep Stuff Safe, but how can we ensure that all the copies remain the same? Presumably, each change to one copy notifies the others via their inbox, but this seems to be a security risk. To which of the putatively identical copies does the DOI resolve?
    • The primary author's? If so, what is the point of the other 56 copies?
    • One chosen at random?
    • All 57? What is the user to do with a DOI multiple resolution page with a list of 57 entries?
  • At a single pod created specifically to hold the collaborative work? Is the primary author in control of the pod? Is there some method by which all authors vote on control of the pod?
  • At one pod at each of the 10 funders of the research? What about the citizen scientists? Which of the multiple researchers funded by each funder is in control of the pod at that funder? Or is the funder in control of the pod? How are the copies maintained in sync?
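The question of whether putatively identical copies really are identical can be made concrete with content digests. The following is a minimal sketch, not part of Herbert's proposal: the pod URLs, the sample contents and the function names are all hypothetical, and it assumes some auditor has already fetched each copy as bytes.

```python
import hashlib
from collections import defaultdict


def digest(content: bytes) -> str:
    """Return the SHA-256 hex digest of one copy's bytes."""
    return hashlib.sha256(content).hexdigest()


def group_copies_by_digest(copies: dict[str, bytes]) -> dict[str, list[str]]:
    """Group pod URLs by the digest of the copy each one holds."""
    groups: dict[str, list[str]] = defaultdict(list)
    for pod_url, content in copies.items():
        groups[digest(content)].append(pod_url)
    return dict(groups)


def copies_agree(copies: dict[str, bytes]) -> bool:
    """True when every pod holds byte-identical content."""
    return len(group_copies_by_digest(copies)) <= 1


# Hypothetical pods and contents, for illustration only.
copies = {
    "https://pods.example.edu/author-a/paper": b"Where's the flux? v1",
    "https://pods.example.org/author-b/paper": b"Where's the flux? v1",
    "https://pods.example.net/author-c/paper": b"Where's the flux? v1 (edited)",
}
groups = group_copies_by_digest(copies)
```

A check like this only reveals that the copies have diverged. It says nothing about which group holds the "right" version, nor who is entitled to decide, which is the question at issue.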
Presumably, the raw data resides at the instruments and databases from which it was sourced. Is this raw data linked by Signposting to the processed data reported in the paper? To the paper itself? How do the instruments and databases know to create the links, and from which parts of the data? Presumably, this requires the databases and instruments to implement their own pods which implement the inbox and notification protocols.

If there are multiple copies of the processed data, are they linked by Signposting? If so, how does this happen? Presumably again via the pods notifying each other.

Where does the processed data reside?
  • In Boyajian's pod at Yale? Or at Louisiana State? Much if not all the work of processing was done by others, so why does Boyajian get to control it, and move it around with her?
  • In each of the pod(s) of the authors who did the processing? The data might be voluminous, and we're now storing many copies of it, so who gets to pay the storage costs (for ever)? Which of these pods does which copy of the paper point to? If it's the copy in each researcher's pod, and those who did the processing point to their own copy of the processed data, the copies of the paper in the multiple pods are no longer the same, so how are they kept in sync?
  • In a single pod created to hold the processed data (and the paper itself)? Again, who gets to control this pod and, presumably, pay the storage costs (for ever)?
These questions relate to a paper with a lot of authors. A different set of questions relate to a paper with a single author, but these are left as an exercise for the reader.

Fraud and Abuse

Digital information is malleable. Let's assume for the sake of argument that papers (and data) once published are intended to be immutable. Do authors sign their papers (and the data)? If so, there are a number of previously unsolved problems:
  • How do we implement a decentralized Public Key Infrastructure? 
  • How do we validate signatures from the long past? Doing so requires a trusted database of the time periods during which keys were valid (see Petros Maniatis' Ph.D. thesis).
  • How do we update signatures (which are based on hashes) when the hash algorithm becomes vulnerable? Do we need to get all 49 authors to update their signatures, and if so how? Or is it Boyajian's responsibility to update her signature? She's not a computer scientist, so how does she know the hash algorithm that was used and that it is now vulnerable? And now 48 of the signatures are outdated. This is easy to detect, so does it matter?
  • How do we detect, and recover from, leakage of a researcher's private key?
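To illustrate why validating old signatures needs a trusted record of key validity periods, here is a minimal sketch. The `KeyValidity` record, the key identifier and the dates are hypothetical. It assumes that both the validity record and the signing time come from a source the verifier trusts, which is exactly the part that is hard to provide in a decentralized system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class KeyValidity:
    """One entry in a (hypothetical) trusted database of key validity periods."""
    key_id: str
    valid_from: datetime
    revoked_at: Optional[datetime] = None  # None means not (yet) revoked


def key_was_valid(record: KeyValidity, signed_at: datetime) -> bool:
    """True if the key was valid at the claimed signing time.

    A signature made before the key existed, or at or after its revocation
    (for example because the private key leaked), should not be trusted.
    """
    if signed_at < record.valid_from:
        return False
    return record.revoked_at is None or signed_at < record.revoked_at


# Hypothetical key that was revoked in mid-2019.
record = KeyValidity(
    key_id="author-key-1",
    valid_from=datetime(2015, 1, 1, tzinfo=timezone.utc),
    revoked_at=datetime(2019, 6, 1, tzinfo=timezone.utc),
)
signed_while_valid = key_was_valid(record, datetime(2016, 3, 1, tzinfo=timezone.utc))
signed_after_revocation = key_was_valid(record, datetime(2020, 1, 1, tzinfo=timezone.utc))
```

The check itself is trivial. The difficulty lies in who maintains the database, and in how a verifier decades later can trust both it and the timestamp on the signature.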
Signatures on papers and data allow for post-publication changes to be detected. If such a change is detected, we would like to recover the previous state. Herbert envisages that publication would automatically cause the published version to be archived, creating a copy that is outside the control of the author. In the case where a copy resides in each author's pod, does publication trigger archiving of each of them? If not, how does the archive know which pod's copy to archive? I believe Herbert views signatures as impractical, and depends on archiving at the point of publication as the way to detect and recover from post-publication changes.

Since archives are required to have mechanisms to detect and recover from damage or loss, does not the archived copy become in effect the canonical one? If so, why does not the DOI resolve to the reliable, persistent, canonical copy in the archive? If so, what is the role of the copies in the authors' pods? How do we implement and sustain a decentralized archiving infrastructure? If, as is highly likely, the archiving infrastructure is or becomes centralized, what has the decentralized pod structure achieved?

If authors do not sign their papers, how are post-publication changes detected? By comparing the copies in the pods with the copy in the archive(s)? Who is responsible for regularly making this comparison, and how are they funded? What happens when a change to a copy in an author's pod is detected? How does this result in the status quo being restored?

In the real world papers and data, once published, are not immutable. According to whether the changes are in the author's or the public interest, there are four different kinds of change, shown in the table.

Four kinds of post-publication change

                          Public interest: Yes   Public interest: No
Author's interest: Yes             1                      3
Author's interest: No              4                      2

Let's look at each of these kinds of change:
  1. There are changes that the author wishes to make and are in the public interest that they should make, such as errata. Each author is in control of their own pod and its content, so they can make the change there. Does this change trigger a new version at the archive? But in the case where the paper resides in each author's pod, how are these changes made consistently? Does each one cause a new version at the archive, requiring the archive to deduplicate versions? I believe Herbert would answer yes, the archive should deduplicate the multiple versions.
  2. There are changes that are in the interest of neither the author nor the public. For example, suppose some malefactor guesses the author's password and alters some content. If the papers aren't signed, how is this distinguished from an authorized change? Even if they are signed, how are readers and the archive to know that the author's private key hasn't leaked?
  3. There are changes that the author wishes to make but which are not in the public interest. For example, the author may wish to make changes to conceal fraud. How can such changes be distinguished from those that are in the public interest (case 1)? Do we depend on the author's pod notifying the world about a change that the author does not want the world to know about?
  4. There are changes that should be made in the public interest but which are against the wishes of the author, for example the analog in this vision of a retraction. If the DOI resolves to the copy in the author's pod, how are readers made aware that the paper has been retracted?
The last case is the most interesting. Herbert's example of Bob commenting in his pod on a document Alice published mentions that Alice gets notified and "if all is well" links to Bob's comment. What if Alice doesn't like Bob's comment? Alice is in control of her own storage, so there is no way to force her to link to Bob's comment. Presumably, in this vision, the "certification" function happens because there are entities (journals in the current system) that "like" or not papers. What's to stop authors claiming that their paper has a "like" from Nature? Of course, Nature would not have a link "like"-ing the paper. But who is going to detect the conflict between the author and Nature? And how is it to be resolved, given that only the author(s) can remove the false claim?
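Detecting a false "like" claim amounts to checking that every claim made in an author's pod is reciprocated by a link at the claimed certifier. The sketch below assumes some third party has already crawled both sides. The URLs, the certifier names and the `unconfirmed_claims` function are hypothetical, not part of Herbert's proposal.

```python
def unconfirmed_claims(
    claims: dict[str, set[str]],
    endorsements: dict[str, set[str]],
) -> list[tuple[str, str]]:
    """Return (paper, certifier) pairs the paper claims but the certifier does not confirm.

    claims:       paper URL -> certifiers the paper's pod claims have "liked" it
    endorsements: certifier -> paper URLs that certifier actually links to
    """
    conflicts = []
    for paper, certifiers in claims.items():
        for certifier in certifiers:
            if paper not in endorsements.get(certifier, set()):
                conflicts.append((paper, certifier))
    return sorted(conflicts)


# Hypothetical crawl results, for illustration only.
claims = {
    "https://pods.example.edu/alice/paper-1": {"journal-a"},
    "https://pods.example.edu/carol/paper-2": {"journal-a", "journal-b"},
}
endorsements = {
    "journal-a": {"https://pods.example.edu/carol/paper-2"},
    "journal-b": {"https://pods.example.edu/carol/paper-2"},
}
conflicts = unconfirmed_claims(claims, endorsements)
```

Here the claim on `paper-1` is flagged because `journal-a` has no link back to it. The sketch finds the conflict but cannot resolve it: the false claim lives in storage only the author controls, and someone still has to run and fund the crawl.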

Herbert envisages that reviews would also be archived at the point of publication. Would the archive link from Alice's paper to Bob's unfavorable review that the copy in Alice's pod did not link to? This does not seem archive-like behavior.

Conclusion

I've raised a lot of questions here. It seems to me that there are several fundamental issues underlying them:
  • The vision depends on a trusted archival infrastructure which, as I've pointed out, we have so far failed to implement adequately, or in a decentralized enough manner.
  • The vision depends on automated processes run by some trusted institution that regularly scan looking for changes, missing links, etc.
  • The pods' inboxes are discoverable, so will get spammed and attacked. Thus they will need anti-spam and anti-virus capabilities. Incoming messages will have to be reviewed by the owner before being acted upon.
  • In general, like the original (and decentralized) Web, the vision depends on everyone involved acting from the best motives. Which, in some sense, is how we got where we are.
I agree with Herbert that his vision, as it stands, has little chance of being implemented. But he has raised a very large number of interesting questions. It is worth noting that many of them are not in fact specific to a decentralized system.

3 comments:

Dragan Espenschied said...

Many of the questions posed here, which are mostly concerned with the objecthood and object boundaries of collaborative work, have been discussed and experimented with in net art for quite some time. For instance, just like academic institutions require clearly identified responsibilities and a fixed state of a work, the art world is still pretty fixated on single creators and stable, "authentic" artworks.

I've witnessed how ideas of how art is made, evaluated, and preserved have failed the cause they're trying to support, and artists wrangling and straightening out their work in order for it to enter a museum. In the case at hand, I would propose to examine how many of these questions are based on assumptions which are not really reflecting how research is done or should be done today. At the same time, researchers could adopt some techniques pioneered by artists, like forming collectives or creating shared identities to approach issues with authorship and responsibilities; or postulate research as an ongoing performance that is never really concluded, but adaptive to for example new data coming in.

Such things have been discussed before, unfortunately without much consequence. Maybe recently the actual pressure caused by failures to contemporary research became high enough to take "where science and art meet" more seriously :)

IlyaK said...

I haven't heard the original lecture so perhaps am missing some of the context, but it seems that at least some of these issues are not unique to publications, but apply to any distributed work. If "publication/paper" is replaced with "source code/software", then couldn't a distributed source control system, e.g. git or mercurial, address some of these issues?

Each institution could host its own git installation (perhaps using a web service built on top of git, like GitLab) and allow researchers to each have their own fork of the work. When there are multiple authors, the git repos of each institution could be referenced, and additional forks could be created as needed for archiving. (The archive then contains the entire version history as well.) Updates to one could be resolved through merging, pull requests, etc. On the surface, it seems that some of these problems are very similar to distributed software development, which already has established workflows of distributed work. Git also includes support for signing commits, as a starting point.

There are issues with "large files" that need not be tracked for changes in git, and that's an issue that extensions like git-annex and git-lfs are trying to solve. But I wonder if the core issues raised here couldn't be at least partially addressed through a distributed version control system like git? Is git not "decentralized" enough?

David. said...

IlyaK, just to be clear, this post is not asking how scholarly communication should work, but only how Herbert's proposed system would work, in order to flush out the issues that a discussion of how it should work needs to address.

Your analogy between software development and research is worth pursuing, but there is a significant difference. From personal experience of both, credit for open source contributions is nice, but at least for the academics on Boyajian's team and their institutions credit for publication is life-and-death important.