Monday, May 23, 2016

Improving e-Journal Ingest (among other things)

Herbert Van de Sompel, Michael Nelson and I have a new paper entitled Web Infrastructure to Support e-Journal Preservation (and More) that:
  • describes the ways archives ingest e-journal articles,
  • shows the areas in which these processes use heuristics, which makes them fallible and expensive to maintain,
  • and shows how the use of DOIs, ResourceSync, and Herbert and Michael's "Signposting" proposal could greatly improve these and other processes that need to access e-journal content.
It concludes with a set of recommendations for CrossRef and the e-journal publishers that would be easy to adopt. They would not merely improve these processes but also help remedy the deficiencies in the way DOIs are used in practice, as identified in Martin Klein et al.'s paper in PLoS One entitled Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot, and in Persistent URIs Must Be Used To Be Persistent, presented by Herbert and co-authors at the 25th International World Wide Web Conference.


David. said...

A footnote that was excised from the final version of this paper reported one result from a study we are conducting (slowly) as a background exercise:

"For example, approximately 1.5% (7/455) of a random sample of DOIs from the CLOCKSS archive resolve to valid articles but return 404 from the CrossRef metadata API."

The study uses the metadata that LOCKSS and CLOCKSS extract from the content as input. The output is intended to be a paper entitled Noise in e-Journal Metadata.
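
The check behind the footnote can be sketched roughly as follows. This is an illustrative sketch, not the code used in our study: it assumes the public DOI resolver at https://doi.org/ and the CrossRef REST API's works endpoint at https://api.crossref.org/works/, and the function names, the User-Agent string and the category labels are all hypothetical. Note that a 2xx response from the resolver chain is only a rough proxy for "resolves to a valid article"; confirming that still needs a look at the landing page.

```python
import urllib.error
import urllib.parse
import urllib.request

# Assumed endpoints: the public DOI resolver and the CrossRef REST API.
DOI_RESOLVER = "https://doi.org/"
CROSSREF_WORKS = "https://api.crossref.org/works/"


def resolver_url(doi):
    """URL that asks the DOI resolver to redirect to the article."""
    return DOI_RESOLVER + urllib.parse.quote(doi, safe="/")


def crossref_url(doi):
    """URL that asks the CrossRef metadata API for the DOI's record."""
    return CROSSREF_WORKS + urllib.parse.quote(doi, safe="/")


def http_status(url, timeout=30):
    """Fetch url (following redirects) and return the final HTTP status."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "doi-noise-sketch/0.1"}  # hypothetical
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def classify(resolver_status, crossref_status):
    """Label a DOI by comparing the two status codes (labels are hypothetical)."""
    resolves = 200 <= resolver_status < 300
    if resolves and crossref_status == 404:
        return "resolves-but-missing-from-crossref"
    if resolves and crossref_status == 200:
        return "consistent"
    return "other"


def check_doi(doi):
    """Run both lookups for one DOI; this performs network requests."""
    return classify(http_status(resolver_url(doi)),
                    http_status(crossref_url(doi)))
```

Running check_doi over a random sample of DOIs and counting the "resolves-but-missing-from-crossref" results would give a ratio of the kind quoted above (7/455). Publisher sites that refuse automated requests end up in "other", so those DOIs would need checking by hand.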

Our going-in assumption was that the various kinds of noise we found were accidents, the results of sloppy processes. But no: Rik Smith-Unna has now discovered that at least one large publisher is deliberately inserting noise by using fake DOIs as crawler traps.

David. said...

The fake DOI culprit is Wiley, see here. Eric Hellman's and Library Loon's warnings of the horribles to come are starting to look prophetic.

David. said...

The awesome Eric Hellman has a detailed dissection of Wiley's "fake DOIs", which are in fact simply trap URLs that, because of the URL structure of Wiley's platform, look like DOIs. Eric comments:

"the trap URLs being deployed by Wiley are sophomoric and a technical embarrassment."

because, among other things, they provide a simple way for an attacker to get a university banned from accessing even Wiley's open access content and, because robots.txt is misconfigured, they are indexed by search engines.

David. said...

See Geoff Bilder's discussion of "fake DOIs" on the CrossRef blog.