I'm David Rosenthal, and this is a place to discuss the work I'm doing in Digital Preservation.
Monday, May 23, 2016
Improving e-Journal Ingest (among other things)
Herbert Van de Sompel, Michael Nelson and I have a new paper entitled Web Infrastructure to Support e-Journal Preservation (and More) that:
- describes the ways archives ingest e-journal articles,
- shows the areas in which these processes use heuristics, which makes them fallible and expensive to maintain,
- and shows how the use of DOIs, ResourceSync, and Herbert and Michael's "Signposting" proposal could greatly improve these and other processes that need to access e-journal content.
4 comments:
A footnote that was excised from the final version of this paper reported one result
from a study we are conducting (slowly) as a background exercise:
"For example, approximately 1.5% (7/455) of a random sample of DOIs from the CLOCKSS archive resolve to valid articles but return 404 from the CrossRef metadata API."
The study uses the metadata that LOCKSS and CLOCKSS extract from the content as input. The output is intended to be a paper entitled Noise in e-Journal Metadata.
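The kind of check behind that footnote can be sketched roughly as follows. This is not the study's actual code, just a minimal illustration that assumes the public doi.org resolver and the CrossRef REST API `works` endpoint. The function names, the User-Agent string, and the category labels are hypothetical, and treating an HTTP 200 from the resolver as "resolves to a valid article" is a simplification, since a publisher may answer 200 with an error page or refuse robots altogether.

```python
import urllib.error
import urllib.parse
import urllib.request


def http_status(url, timeout=10):
    """Return the final HTTP status for a GET of url, following redirects.

    Network failures (urllib.error.URLError) are not handled here and propagate.
    """
    request = urllib.request.Request(
        url, headers={"User-Agent": "doi-noise-sketch/0.1"}  # hypothetical agent string
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as exceptions; report their code.
        return err.code


def classify_doi(doi, fetch_status=http_status):
    """Classify a DOI by comparing the resolver with the CrossRef metadata API.

    fetch_status is injectable so the logic can be exercised without network access.
    """
    quoted = urllib.parse.quote(doi, safe="/")
    resolves = fetch_status("https://doi.org/" + quoted) == 200
    in_crossref = fetch_status("https://api.crossref.org/works/" + quoted) == 200
    if resolves and in_crossref:
        return "ok"
    if resolves:
        # The case the footnote reports: the article is there, the metadata is not.
        return "missing_from_crossref"
    return "unresolvable"


def percent(count, total):
    """Express count/total as a percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)
```

With the footnote's own numbers, `percent(7, 455)` gives 1.5, matching the "approximately 1.5%" figure. Passing a stub as `fetch_status` keeps the classification logic testable without hitting either service.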
Our going-in assumption was that the various kinds of noise we find were accidents, the results of sloppy processes. But no, now Rik Smith-Unna discovers that at least one large publisher is deliberately inserting noise by using fake DOIs as crawler traps.
The fake DOI culprit is Wiley, see here. Eric Hellman's and Library Loon's warnings of the horribles to come are starting to look prophetic.
The awesome Eric Hellman has a detailed dissection of Wiley's "fake DOIs", which are in fact simply trap URLs that, because of the URL structure of Wiley's platform, look like DOIs. Eric comments:
"the trap URLs being deployed by Wiley are sophomoric and a technical embarrassment."
because, among other things, they give an attacker a simple way to get a university banned from accessing even Wiley's open access content and, because robots.txt is misconfigured, they are indexed by search engines.
See Geoff Bilder's discussion of "fake DOIs" on the CrossRef blog.