Tuesday, December 19, 2017

Bad Identifiers

This post on persistent identifiers (PIDs) has been sitting in my queue in note form for far too long. Its re-animation was sparked by an excellent post at PLOS Biologue by Julie McMurry, Lilly Winfree and Melissa Haendel entitled Bad Identifiers are the Potholes of the Information Superhighway: Take-Home Lessons for Researchers, which draws attention to a paper, Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data, of which they are three of the many authors. In addition, there were two papers at this year's iPRES on the topic, discussed below.
Below the fold, some thoughts on PIDs.

There are two schools of thought about PIDs:
  • Hash-based names, as used for example in IPFS, are thought to be persistent because they are computed from the content itself and are extremely likely to be unique. However, as I discussed in Blockchain as the Infrastructure for Science, on the timescales needed for digital preservation it is inevitable that the chosen hash algorithm will become vulnerable to collisions (as we saw recently with SHA-1) and thus the names will no longer be unique.
  • Minted names, as used for example by DOIs. Identifiers for the 21st century: ... is an excellent overview in ten lessons of the issues around minted names.
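The hash-based approach, and its fragility, can be illustrated with a few lines of Python. The content bytes here are invented for illustration; the point is that the name is a deterministic function of both the content and the chosen algorithm, so migrating away from a weakened algorithm such as SHA-1 necessarily changes every name in the system:

```python
import hashlib

content = b"Example paper content, byte-for-byte."

# A hash-based name is derived from the content itself; anyone can
# recompute it, and (until the algorithm breaks) two different
# contents are overwhelmingly unlikely to share a name.
name_sha1 = hashlib.sha1(content).hexdigest()
name_sha256 = hashlib.sha256(content).hexdigest()

# The name is bound to the algorithm as well as the content: the
# same bytes get entirely unrelated names under SHA-1 and SHA-256,
# so abandoning a broken algorithm breaks every existing reference.
assert name_sha1 != name_sha256
print(name_sha1)
print(name_sha256)
```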
The paper and the PLOS Biologue post argue for having third-party resolvers (such as CrossRef) mint URIs as persistent IDs (such as DOIs), noting that URIs not specifically intended to persist don't:
It is not without special irony that our paper on community-based identifiers was done by authors who (taken together) were supported by small portions of 19 different grants. Links to these 19 grants were documented only two years ago; however, 11 links of the 19 are now dead and had to be re-curated by hand.
They follow Herbert Van de Sompel et al in noting that Persistent URIs Must Be Used To Be Persistent and that they aren't:
when referencing with links, you should be careful not to just copy the link you happen to find in the browser address bar, as that address may a) not be designed to persist, and — even if persistent — b) may nevertheless make it difficult for data integrators to know that you and someone else are referring to the same exact thing. Take for instance a cursory investigation of linked references to the SNCA gene record in NCBI; most references were just an accession (“gene:6622”) or symbol (“SNCA”) with no link. There were 38 such distinct short text representations. Links, when included at all, were represented one of twelve different ways:
The problem here is that third-party resolvers work via HTTP redirects, so the address that shows up in the browser's address bar will never be the canonical persistent URI; it will be a URI at the publisher. Because there is no metadata connecting any of the publisher's variant, non-persistent identifiers to the canonical persistent URI, the burden is on authors, reviewers and editors to know what the canonical persistent URI for each of the variants is, and to correct the easy copy-and-paste from the browser's address bar.

At least for DOIs, there is agreement among authors, editors and reviewers that the https://doi.org/ form is the canonical persistent URI, but as Van de Sompel et al and the LOCKSS Program observe, that doesn't prevent a significant amount of noise in DOI references found in the wild. Much of this is caused by authors copy-and-pasting from the browser address bar, and is not caught by editors or reviewers.

The post advises:
Even more permutations are bound to exist. The point is that a) we providers need to make referencing easier, and b) users should be aware of the links they use when referencing. Follow the provider’s documentation where available, or look up the durable representation at identifiers.org.
But using https://identifiers.org/ doesn't do what the author needs. If I enter:
it hands me back:
which resolves (presumably via https://dx.doi.org/) to Identifiers for the 21st century: .... But if I enter the article's publisher URI:
I get "Unknown prefix". Case b) above doesn't address the need.

It is true that PLOS Biology puts a link and the text of the canonical https://doi.org/10.1371/journal.pbio.2001414 in tiny font below the authors' names in the paper, but there is no agreement among publishers about where to find this, or whether it is a link or simply the text doi:10.1371/journal.pbio.2001414, so many authors will just copy-and-paste from the address bar.

What is needed is a service authors, reviewers and editors could use to map from the 38 short identifier variants and the 12 URI variants for the SNCA gene to the canonical one. One might think that the third-party resolvers could do this; whenever one of their identifiers was minted they could record what it resolved to in an inverse mapping. But this isn't adequate:
  • In some cases, the third-party resolver redirects not to the eventual target but to a delegated resolver. This is the case, for example, with many major journal publishers who maintain their own internal resolver for their DOIs. The resolver whose identifier was minted may never find out that the variant URI has changed in the delegated resolver's map, and thus that the inverse database needs to be updated.
  • More fundamentally, anyone, not just the original publisher, can create a variant URI pointing to the publisher's resource, or even to a copy of it somewhere else (for example, to a paper in an institutional or subject repository). The resolver owning the identifier would never know that these URIs also needed to be added to the inverse database.
This is a classic chicken vs. egg problem. If everyone were already using the canonical identifier, the inverse database would exist and work, and everyone would use the canonical identifier. But since they aren't, the inverse database doesn't exist, so everyone is using variant identifiers.
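For the easy cases, where the variant identifier actually carries the DOI string, the mapping to the canonical URI can be approximated by pure string normalization. A minimal sketch (the function name and the simplified DOI regex are mine, not from any standard library): it handles doi: prefixes and dx.doi.org URIs, but a publisher landing-page URL that doesn't embed the DOI at all is exactly the case that needs the missing inverse database.

```python
import re
from typing import Optional

# Simplified pattern for a DOI: "10." + registrant code + "/" + suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")

def canonical_doi_uri(reference: str) -> Optional[str]:
    """Map variants such as 'doi:10.x/y' or 'https://dx.doi.org/10.x/y'
    to the canonical 'https://doi.org/10.x/y' form, or return None if
    no DOI is recoverable from the string itself."""
    match = DOI_PATTERN.search(reference)
    if match is None:
        return None
    return "https://doi.org/" + match.group(0).rstrip(".,;")

# Variants that embed the DOI normalize cleanly:
assert canonical_doi_uri("doi:10.1371/journal.pbio.2001414") == \
    "https://doi.org/10.1371/journal.pbio.2001414"
assert canonical_doi_uri("https://dx.doi.org/10.1371/journal.pbio.2001414") == \
    "https://doi.org/10.1371/journal.pbio.2001414"
# A publisher URL with no embedded DOI defeats string normalization;
# only an inverse database could resolve it:
assert canonical_doi_uri(
    "https://www.computer.org/csdl/proceedings/jcdl/2017/3861/00/07991576.pdf"
) is None
```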

Herbert Van de Sompel and Michael Nelson's Signposting proposal showed how the use of rel= links in HTTP headers could provide the inverse mapping needed, and they and I went into much more detail in our paper Web Infrastructure to Support e-Journal Preservation (and More). Alas, although the effort would be minimal, there has been little enthusiasm from publishers for implementing these links.
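To make the idea concrete, here is a sketch of what a Signposting-aware client might do. The example Link header and the "cite-as" relation name are assumptions based on the Signposting proposal (the DOI is the one from this post); the parser is deliberately simplified and ignores the edge cases of full Link-header syntax:

```python
import re

# Example HTTP response header a Signposting-enabled publisher might
# emit on an article landing page, advertising both the canonical
# persistent identifier and the content itself.
link_header = (
    '<https://doi.org/10.1371/journal.pbio.2001414>; rel="cite-as", '
    '<https://journals.plos.org/plosbiology/article/file'
    '?id=10.1371/journal.pbio.2001414&type=printable>; '
    'rel="item"; type="application/pdf"'
)

def links_by_rel(header: str) -> dict:
    """Parse a Link header into {rel: target URI} (simplified: one
    target per rel, no quoted-comma or multi-rel edge cases)."""
    links = {}
    for target, params in re.findall(r'<([^>]+)>([^,<]*)', header):
        rel = re.search(r'rel="([^"]+)"', params)
        if rel:
            links[rel.group(1)] = target
    return links

# Any client -- an author's reference manager, a crawler, an archive --
# can now recover the canonical persistent URI without guessing:
print(links_by_rel(link_header)["cite-as"])
# https://doi.org/10.1371/journal.pbio.2001414
```

This is the inverse mapping delivered in a Webby way: the publisher's page itself tells the client which persistent URI to cite, instead of requiring a third-party lookup service.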

Remco van Veenendaal et al's Getting Persistent Identifiers Implemented By ‘Cutting In The Middle-Man’ describes an attack on one of the set of interconnected PID problems: the fact that smaller publishers of resources, such as many heritage institutions, lack both awareness of the need for PIDs and the ability to implement them. The Dutch Heritage Network worked with system vendors and produced training materials. They conclude:
The PID information, training and education material used in workshops and presentations contributed to raising awareness, together with PID Roadmap and best practice documents. The unique PID Guide is still actively used by organisations for learning about PIDs and taking the first steps towards selecting a suitable PID solution. SURFsara’s Handle System-based PID service has become available to cultural heritage organisations at an affordable price, ensuring that all major PID solutions in the Netherlands are now available for less than €1,000 per year. Most importantly, by cutting in the middle-man, PIDs have been implemented in (4) leading collection management systems in the Netherlands, and are available to all their (inter)national customers.
This is a praiseworthy effort, but €1,000 per year is still much more than the cost to host the website the identifiers are pointing to. In discussions with new small journals, the LOCKSS Program has been told that this problem affects DOIs; the cost of a DOI is greater than the cost to host an article "for ever". And yet, if DOIs are to persist, CrossRef needs a viable business model.

Dappert and Farquhar ask four questions:
  1. What role should PID stakeholders play in order to ensure long-term preservation?
  2. How can one manage the distributed long-term responsibility?
  3. How can PIDs support entities that evolve over time?
  4. How can we preserve the PID graph as it grows over time as more links are established through incremental improvements, use and reuse?
These are good questions. As to the first, they argue that much better metadata flows are needed:
For example, currently, there are no well-established links that inform Libraries and Archives about metadata updates in external PID Services, or that enables researchers to export PID-related metadata from Libraries and Archives into Citation Managers, or that pass rights metadata from PID Services to Researchers or from Data Centers to Libraries and Archives.
It would certainly help long-term preservation if PID providers alerted archives to newly available material and other changes to their database. As we wrote in Web Infrastructure to Support e-Journal Preservation (and More):
the use of existing infrastructure (CrossRef API) combined with the introduction of new infrastructure (ResourceSync Change Lists and/or Change Notifications, Signposting) promises significant cost savings, and the potential for improvement in the quality of both the completeness of archived content, and of the metadata (including identity) describing it.
As to the last question they note that:
some PID solutions conflate resolution to a landing page that holds descriptive information (similar to the intellectual entity) with resolution to the content itself (which could be a representation, file, or bitstream). This lack of clarity hinders automatic crawls, machine-harvesting and machine-interpretation of parts of the scholarly record. Van de Somple et al. [9, 10] analyze this situation for web use of PIDs. Adopting the PREMIS model would resolve this challenge.
Experience shows that publishers are not willing to implement the rel= links required for Signposting. The idea that they would provide the much more detailed metadata required by PREMIS is unrealistic, and it ignores Van de Sompel and Nelson's argument that for metadata on the Web to be effective it must be presented in a Webby way, as links.

In their introduction they write:
There has been much work towards creating the connected scholarly record and enabling validation and reuse of research outputs. In contrast, relatively little effort has gone towards ensuring long-term access to the scholarly record.
The problem here is short-termism. The benefits to investment in "creating the connected scholarly record and enabling validation and reuse of research outputs" accrue in the near future. The benefits to "ensuring long-term access to the scholarly record" accrue in the far future, and are thus discounted.


Hvdsomp said...

Great post, as always, David. I recently did a presentation that touches on the same problem domain: Achieving Link Integrity for Managed Collections. Slides are available at http://www.slideshare.net/hvdsomp/achieving-link-integrity-for-managed-collections

Hvdsomp said...

Regarding your observation that there is no uniformity among publishers regarding where to find the persistent HTTP URI (i.e. https://doi.org/xxx): The situation is significantly worse:

* Many publishers will not display the HTTP version of the DOI. They will just show a DOI:xxx string hyperlinked with the HTTP version of the DOI.
* Many publishers will not have the HTTP version of the DOI in their downloadable citations. Instead, these will have a DOI:xxx string or a URI in the publisher's domain that somehow carries the DOI, e.g. http://publisher.com/xxx
* The worst thing I've come across thus far is the DOI Bookmark in the landing pages of IEEE Computer Society. They look like this: "DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/JCDL.2017.7991576", see, for example, the paper https://www.computer.org/csdl/proceedings/jcdl/2017/3861/00/07991576.pdf. The DOI brand is (ab)used to point to a local publisher URI.

As you cited: "Persistent URIs must be used to be persistent." Publishers that actually assign DOIs don't seem to be helping to make them usable and persistent on the web.

Michael L. Nelson said...

As always, an excellent post.

Anticipating "just use hash-based names" responses... while hashes would work for PDFs, MP4s, and otherwise static files, it's also not entirely clear they would generally work for web pages with embedded resources, javascript, etc. There are trade-offs in deciding what to hash.

See our post describing our most recent tech report "Difficulties of Timestamping Archived Web Pages".

David. said...

Since the early days of the LOCKSS system it has been necessary to filter content such as personalizations out of the hashes computed during polls. See, for example, Section 3.1 of Enhancing the LOCKSS Digital Preservation Technology. So in your sense they haven't worked for many PDFs in quite a while. And I wouldn't bet that MP4s aren't being watermarked. Genuinely static content is getting rarer by the day.

But the issue of whether this is a problem for using hashes as names depends upon whether you think the hash names "the content of this URL" generically, for which they don't work, or "the content received by this browser at this time", for which they do work. An example of the difference would be "soft 404s" in subscription e-journals. One would want to name the content a subscriber would see, but in the absence of specific permission such as the LOCKSS system acquires, the archive would likely see content at the URL saying "you don't have a subscription but you can purchase this article for $100".

Martin Klein said...

Just FYI, the link to "Permanence of the Scholarly Record: Persistent Identification and Digital Preservation – A Roadmap" currently points at:

where it should probably point at:

and MLN's link to his preprint points at:

which results in a 404 due to the trailing double quotes.


David. said...

My bad on the first link, which is now fixed - thanks Martin!

Blogger won't let me fix MLN's bad link in his comment so here it is fixed (I hope).