Comments on DSHR's Blog: Bad Identifiers

2017-12-21T20:07:58.509-08:00

My bad on the first link, which is now fixed - thanks Martin!

Blogger won't let me fix MLN's bad link in his comment so here it is fixed (I hope).

2017-12-21T10:46:18.822-08:00

Just FYI, the link to "Permanence of the Scholarly Record: Persistent Identification and Digital Preservation – A Roadmap" currently points at:
https://www.blogger.com/XXX

where it should probably point at:
https://ipres2017.jp/wp-content/uploads/36Angela-Dappert.pdf

and MLN's link to his preprint points at:
https://arxiv.org/abs/1712.03140"

which results in a 404 due to the trailing double quotes.
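A stray trailing quote like this can be caught mechanically before a link is published. The following is only a minimal sketch; the function name `strip_stray_quotes` is hypothetical, and it assumes that quote characters at either end of a URL are never intentional.

```python
def strip_stray_quotes(url: str) -> str:
    """Remove surrounding whitespace and stray quote characters from a URL.

    Assumption: straight or curly quotes at either end of the string are
    copy-and-paste residue, not part of the URL itself.
    """
    return url.strip().strip("\"'\u201c\u201d")


print(strip_stray_quotes('https://arxiv.org/abs/1712.03140"'))
```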

M

2017-12-20T19:40:23.052-08:00

Since the early days of the LOCKSS system it has been necessary to filter stuff like personalizations out of the hashes computed during polls. See, for example, Section 3.1 of Enhancing the LOCKSS Digital Preservation Technology. So in your sense hashes haven't worked for many PDFs in quite a while. And I wouldn't bet that MP4s aren't being watermarked. Genuinely static content is getting rarer by the day.
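To illustrate the general idea of filtering before hashing, here is a minimal sketch. It is not the LOCKSS implementation; the `PERSONALIZATION` pattern and the "Downloaded by" watermark line are hypothetical stand-ins for whatever per-download content a real publisher injects.

```python
import hashlib
import re

# Hypothetical personalization: a per-download watermark line.
PERSONALIZATION = re.compile(rb"Downloaded by [^\n]*\n")


def filtered_hash(content: bytes) -> str:
    """Hash the content after removing the assumed personalization."""
    stable = PERSONALIZATION.sub(b"", content)
    return hashlib.sha256(stable).hexdigest()


copy_a = b"Article text\nDownloaded by alice on 2017-12-20\n"
copy_b = b"Article text\nDownloaded by bob on 2017-12-21\n"

# The raw hashes differ, but the filtered hashes agree.
print(hashlib.sha256(copy_a).hexdigest() == hashlib.sha256(copy_b).hexdigest())
print(filtered_hash(copy_a) == filtered_hash(copy_b))
```

The catch is that the filtered hash no longer names the exact bytes either copy contained, which is the trade-off at issue here.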

But whether this is a problem for using hashes as names depends upon whether you think the hash names "the content of this URL" generically, for which hashes don't work, or names "the content received by this browser at this time", for which they do work. An example of the difference is "soft 404s" in subscription e-journals. One would want to name the content a subscriber would see, but in the absence of specific permission such as the LOCKSS system acquires, the archive would likely see content at the URL saying "you don't have a subscription but you can purchase this article for $100".

2017-12-20T15:44:57.145-08:00

As always, an excellent post.

Anticipating "just use hash-based names" responses... while hashes would work for PDFs, MP4s, and otherwise static files, it's not entirely clear they would generally work for web pages with embedded resources, JavaScript, etc. There are trade-offs in deciding what to hash.

See our post describing our most recent tech report "Difficulties of Timestamping Archived Web Pages".

2017-12-20T12:28:58.254-08:00

Regarding your observation that there is no uniformity among publishers regarding where to find the persistent HTTP URI (i.e. https://doi.org/xxx): The situation is significantly worse:

* Many publishers will not display the HTTP version of the DOI. They will just show a DOI:xxx string hyperlinked with the HTTP version of the DOI.
* Many publishers will not have the HTTP version of the DOI in their downloadable citations. Instead, these will have a DOI:xxx string or a URI in the publisher's domain that somehow carries the DOI, e.g. http://publisher.com/xxx
* The worst thing I've come across thus far is the DOI Bookmark in the landing pages of IEEE Computer Society. They look like this: "DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/JCDL.2017.7991576"; see, for example, the paper https://www.computer.org/csdl/proceedings/jcdl/2017/3861/00/07991576.pdf. The DOI brand is (ab)used to point to a local publisher URI.
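A reader or tool faced with these variants can often recover the HTTP version of the DOI. The following is a rough sketch only: `DOI_PATTERN` is a common heuristic rather than the official DOI syntax, `doi_to_https_uri` is a hypothetical helper, and no percent-encoding of unusual characters is attempted.

```python
import re
from typing import Optional

# Heuristic: "10.", a 4-9 digit registrant code, "/", then a non-space suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def doi_to_https_uri(text: str) -> Optional[str]:
    """Find a DOI in a citation string or URI and return its https://doi.org/ form."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Assumption: trailing punctuation or quotes are not part of the DOI.
    return "https://doi.org/" + match.group(0).rstrip(".,;\"'")


print(doi_to_https_uri("DOI: 10.1109/JCDL.2017.7991576"))
print(doi_to_https_uri("http://doi.ieeecomputersociety.org/10.1109/JCDL.2017.7991576"))
```

Both calls should yield the same https://doi.org/ URI, but a workaround like this does not replace publishers exposing that URI themselves.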

As you cited: "Persistent URIs must be used to be persistent." Publishers that actually assign DOIs don't seem to be helping to make them usable and persistent on the web.

2017-12-20T05:18:23.531-08:00

Great post, as always, David. I recently did a presentation that touches on the same problem domain: Achieving Link Integrity for Managed Collections. Slides are available at http://www.slideshare.net/hvdsomp/achieving-link-integrity-for-managed-collections