Wednesday, November 24, 2010

The Half-Life of Digital Formats

I've argued for some time that there are no longer any plausible scenarios by which a format will ever go obsolete if it has been in wide use since the advent of the Web in 1995. In that time no one has shown me a convincing counter-example: a format in wide use since 1995 in which content is no longer practically accessible. I accept that many formats from before 1995 need software archeology, and that there are special cases such as games and other content protected by DRM which pose primarily legal rather than technical problems. Here are a few updates on the quest for a counter-example:
Never is a very long time. Black vs. white arguments of the kind that pit "never happens" against "the sky is falling" may be interesting but there are also insights to be gained from looking in the middle. Below the fold are some thoughts on what a middle ground argument might tell us.

Thursday, November 18, 2010

The Anonymity of Crowds

In an earlier post I discussed the consulting contract that Ithaka S+R is working on for GPO to project the future of the Federal Depository Library Program (FDLP). As part of this, Roger Schonfeld asked us:
about minimum levels of replication required in order to ensure long-term reliability

We get asked this all the time, because people think this is a simple question that should have a simple answer. We reply that experience leads us to believe that for the LOCKSS system the minimum number of copies is about 7, and surround this answer with caveats. But this answer is almost always quoted out of context as being applicable to systems in general, and being a hard number. It may be useful to give my answer to Schonfeld's question wider distribution; it is below the fold.

Monday, November 15, 2010

Open Access via PubMed Central

One major achievement of the movement for open access to publicly funded research was the National Library of Medicine's (NLM) open access PubMed Central repository (PMC). Researchers funded by the National Institutes of Health (NIH) are required by law to deposit the final versions of their papers in PMC within a year of publication. Some other funders (e.g. the Wellcome Trust) have similar mandates. Although these papers represent a small fraction of the biomedical literature, they represent a high-quality source of open-access content because these funding sources are competitive, and free from industry bias.

NLM also runs what look like two indexing services, PubMed and MEDLINE. In reality, MEDLINE is a subset of PubMed:
MEDLINE is the largest component of PubMed, the freely accessible online database of biomedical journal citations and abstracts created by the U.S. National Library of Medicine (NLM®). Approximately 5,400 journals published in the United States and more than 80 other countries have been selected and are currently indexed for MEDLINE. A distinctive feature of MEDLINE is that the records are indexed with NLM's controlled vocabulary, the Medical Subject Headings (MeSH®).
These services are important traffic generators, so journals are anxious to be indexed. One of the requirements electronic-only journals must satisfy to be indexed is:
we must be satisfied that you will submit all articles published in a digital archive. We seek to ensure that our users will always have access to the full text of every article that we cite. The permanent archive must be PubMed Central or another site that is acceptable to NLM.
Thus, to be indexed and obtain the additional traffic, journals had to deposit "all articles" in PMC, which is open access, and was the only "site acceptable to NLM". A recent development threatens to cut off this supply of open access articles, leaving only the small proportion whose funder requires deposit in an open access repository. Details below the fold.

The Portico archiving service is a product of ITHAKA, a not-for-profit organization that originally spun out of the Andrew W. Mellon Foundation. JSTOR is another product of ITHAKA. JSTOR and Portico make content available only to readers of libraries that subscribe to these services. Libraries subscribe to Portico to obtain post-cancellation access to e-journal content, i.e. content to which they once subscribed but no longer do. Libraries subscribe to JSTOR to obtain access to digitized back content from journals. In a world where journals were open access, there would be no reason to do either and thus neither of these services' current business models would be viable. Although ITHAKA is not-for-profit, it is necessarily opposed to open access. William Bowen, the first chair of ITHAKA's board and the source of its initial funding from the Mellon Foundation, recently suggested that
the ultimate [funding] solution may be "a blended funding model" involving user fees, contributions, and "some ways of imposing taxes."
[My emphasis. The video of this session is mysteriously missing from the conference website so I am currently unable to verify the exact quote.]

ITHAKA has negotiated a deal with the NLM under which e-only articles deposited in Portico will be indexed in MEDLINE without being deposited in PMC. There is a caveat that publishers must also provide NLM with a PDF of the article to be kept in a "vault" at NLM – but this NLM copy will not be open access.

Before this deal, articles had to be deposited in PMC to be indexed in PubMed, and were thus made publicly available, even though doing so was not required by the law (or by the funder). After the deal, articles whose deposit is not required may be deposited in Portico instead of PMC, and thus no longer be publicly available although still indexed in MEDLINE. This agreement does not violate the Public Access Law, but it decreases public access to research. As more publishers move to e-only, the implications of this agreement will continue to grow. In particular, NLM has given up the leverage (indexing in MEDLINE) it used to have in negotiations with the publishers.