Thursday, September 1, 2016

CrossRef on "fake DOIs"

I should have pointed to Geoff Bilder's post to the CrossRef blog, DOI-like strings and fake DOIs when it appeared at the end of June. It responds to the phenomena described in Eric Hellman's Wiley's Fake Journal of Constructive Metaphysics and the War on Automated Downloading, to which I linked in the comments on Improving e-Journal Ingest (among other things). Details below the fold.

Geoff distinguishes between "DOI-like strings" and "fake DOIs", presenting three ways DOI-like strings have been (ab)used:
  • As internal identifiers. Many publishing platforms use the DOI they're eventually going to register as their internal identifier for the article. Typically it appears in the URL at which it is eventually published. The problem is that:
    the unregistered DOI-like strings for unpublished (e.g. under review or rejected manuscripts) content ‘escape’ into the public as well. People attempting to use these DOI-like strings get understandably confused and angry when they don’t resolve or otherwise work as DOIs.
    Platforms should use internal IDs that can't be mistaken for external IDs, because they can't guarantee that the internal ones won't leak.
  • As spider- or crawler-traps. This is the usage that Eric Hellman identified. Strings that look like DOIs but are not even intended to eventually be registered but which have bad effects when resolved:
    When a spider/bot trap includes a DOI-like string, then we have seen some particularly pernicious problems as they can trip-up legitimate tools and activities as well. For example, a bibliographic management browser plugin might automatically extract DOIs and retrieve metadata on pages visited by a researcher. If the plugin were to pick up one of these spider traps DOI-like strings, it might inadvertently trigger the researcher being blocked- or worse- the researcher’s entire university being blocked. In the past, this has even been a problem for Crossref itself. We periodically run tools to test DOI resolution and to ensure that our members are properly displaying DOIs, CrossMarks, and metadata as per their member obligations. We’ve occasionally been blocked when we ran across the spider traps as well.
    Sites using these kinds of crawler traps should expect a lot of annoyed customers whose legitimate operations caused them to be blocked.
  • As proxy bait. These unregistered DOI-like strings can be fed to system such as Sci-Hub in an attempt to detect proxies. If they are generated afresh on each attempt, the attacker knows that Sci-Hub does not have the content. So it will try to fetch it using a proxy or other technique. The fetch request will be routed via the proxy to the publisher, who will recognize the DOI-like string, know where the proxy is located and can take action, such as blocking the institution:
    In theory this technique never exposes the DOI-like strings to the public and automated tools should not be able to stumble upon them. However, recently one of our members had some of these DOI-like strings “escape” into the public and at least one of them was indexed by Google. The problem was compounded because people clicking on these DOI-like strings sometimes ended having their university’s IP address banned from the member’s web site. ... We think this just underscores how hard it is to ensure DOI-like strings remain private and why we recommend our members not use them.
    As we see every day, designing computer systems that in the real world never leak information is way beyond the state of the art.
And Bilder explains why what CrossRef means by "fake DOIs" isn't what the general public means by "fake DOIs", how to spot them, and why they are harmless.
The following is what we have sometimes called a “fake DOI”

10.5555/12345678

It is registered with Crossref, resolves to a fake article in a fake journal called The Journal of Psychoceramics (the study of Cracked Pots) run by a fictitious author (Josiah Carberry) who has a fake ORCID (http://orcid.org/0000-0002-1825-0097) but who is affiliated with a real university (Brown University).

Again, you can try it.

http://doi.org/10.5555/12345678

And you can even look up metadata for it.

http://api.crossref.org/works/10.5555/12345678

Our dirty little secret is that this “fake DOI” was registered and is controlled by Crossref.
These "starting with 5" DOIs are used by Crossref to test their systems. They too can confuse legitimate software, but the bad effects of the confusion are limited. And now that the secret is out, legitimate software can know to ignore them, and thus avoid the confusion.

No comments: