Wednesday, October 20, 2010

Four years ahead of her time

As I pointed out in my JCDL 2010 keynote, Vicky Reich put this fake Starbucks web page together four years ago to predict that libraries without digital collections of their own, acting only as a distribution channel for digital content from elsewhere, would end up competing with, and losing to, Starbucks.

This prediction has now come true.

Saturday, October 16, 2010

The Future of the Federal Depository Libraries

Governments through the ages have signally failed to resist the temptation to rewrite history. The redoubtable Emptywheel pointed me to excellent investigative reporting by ProPublica's Dafna Linzer which reveals a disturbing current example of rewriting history.

By being quick to notice, and take a copy of, a new entry in an on-line court docket, Linzer was able to reveal that the Obama administration, in the name of national security, forced a Federal judge to create and enter into the court's record a misleading replacement for an opinion he had earlier issued. ProPublica's comparison of the original opinion, which Linzer copied, with the later replacement reveals that in reality the government wanted to hide the fact that its case against an alleged terrorist was extraordinarily thin, based primarily on statements from detainees who had been driven mad by their interrogation or had committed suicide as a result of it. Although the fake opinion reaches the same conclusion as the real one, the arguments are significantly different. Judges in other cases could be misled into relying on witnesses whom this judge had discredited for reasons that were subsequently removed from the record.

Linzer's exposé of government tampering with a court docket is an example of the problem on which the LOCKSS Program has been working for more than a decade: how to make the digital record resistant to tampering and other threats. The only reason this tampering was detected is that Linzer created and kept a copy of the information the government published, a copy that was not under the government's control. Maintaining copies under multiple independent administrations (i.e. not all under the control of the original publisher) is a fundamental requirement for any scheme that can recover from tampering (and, in practice, from many other threats). Techniques such as those developed by Stuart Haber can detect tampering without keeping a copy, but cannot recover from it.
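To make the detect-versus-recover distinction concrete, here is a minimal Python sketch. The document text and variable names are mine, purely illustrative, and the bare hash stands in for the far more elaborate timestamping machinery: a stored fingerprint can prove that a later version differs from what was published, but only an independently administered copy lets you restore the original.

    import hashlib

    def digest(doc: bytes) -> str:
        # Fingerprint a document, e.g. at publication time.
        return hashlib.sha256(doc).hexdigest()

    # At publication: the original opinion and its fingerprint.
    original = b"Opinion: the government's case rests on discredited witnesses."
    published_digest = digest(original)

    # Later, the copy under the publisher's control is silently replaced.
    replacement = b"Opinion: the government's case is sound."

    # A stored digest DETECTS the substitution ...
    assert digest(replacement) != published_digest

    # ... but cannot RECOVER the original text from the digest alone.
    # Recovery needs an independently administered copy, like Linzer's:
    independent_copy = original            # kept outside the publisher's control
    assert digest(independent_copy) == published_digest
    restored = independent_copy            # recovery is possible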

In the paper world, the proliferation of copies in the Federal Depository Library Program (FDLP) made the system somewhat tamper-resistant. A debate has been underway for some time as to the future of the FDLP in the electronic world; the Clinton and Bush administrations were hostile to copies not under their control, but the Obama administration may be more open to the idea. The critical point this debate has reached is illuminated by an important blog post by James Jacobs. He points out that the situation since 1993 has been tamper-friendly:
But, since 1993, when The Government Printing Office Electronic Information Access Enhancement Act (Public Law 103-40) was passed, GPO has arrogated to itself the role of permanent preservation of government information and essentially prevented FDLP libraries from undertaking that role by refusing to deposit digital materials with depository libraries.
On the other hand, GPO does now make its documents available for bulk download, and efforts are under way to capture them.

The occasion for Jacobs' blog post is that GPO has contracted with Ithaka S+R to produce a report on the future of the FDLP. The fact that GPO is willing to revisit this issue is a tribute to the efforts of government document librarians, but there are a number of reasons for concern that the report's conclusions are effectively pre-determined:
  • Like Portico and JSTOR, Ithaka S+R is a subsidiary of ITHAKA. It is thus in the business of replacing libraries that hold their own collections, such as the paper FDLP libraries, with libraries that act instead as a distribution channel for ITHAKA's collections. Jacobs points out:
    You might call this the "libraries without collections" or the "librarians without libraries" model. This is the model designed by GPO in 1993. It is the model that ITHAKA, the parent organization of Ithaka S+R, has used as its own business model for Portico and JSTOR. This model is favored by the Association of Research Libraries, by many library administrators who apparently believe that it would be better if someone else took the responsibility of preserving government information and ensuring its long-term accessibility and usability, and by many depository librarians who do not have the support of their institutions to build and manage digital collections.
  • Ithaka S+R is already on record as proposing a model for the FDLP which includes GPO and Portico, but not the FDLP libraries. Jacobs:
    Ithaka S+R has already written a report with a model for the FDLP (Documents for a Digital Democracy: A Model for the Federal Depository Library Program in the 21st Century). In that report, it recommended that "GPO should develop formal partnerships with a small number of dedicated preservation entities -- such as organizations like HathiTrust or Portico or individual libraries -- to preserve a copy of its materials".
  • As Jacobs points out, the FDLP libraries are devoted to free, open access to their collections. By contrast, GPO is allowed to charge access fees, and charging fees for access is the basis of ITHAKA's business models:
    Where private sector companies limit access to those who pay and GPO is specifically authorized in the 1993 law to "charge reasonable fees," FDLP libraries are dedicated to providing information without charging.
  • The process by which Ithaka S+R ended up with the contract is unclear to me. Were there other bidders? If so, were their positions on the future of the FDLP on the record as Ithaka S+R's was? If Ithaka S+R was the only bidder, why was that?
It is important to note that although a system for preserving government documents consisting of the GPO and "formal partnerships with a small number of dedicated preservation entities" might well improve the resistance of government documents to some threats, it provides much less resistance to government tampering than the massively distributed paper FDLP. The "small number of dedicated preservation entities" dependent on "formal partnerships" with the government in the form of the GPO will be in a poor position to resist government arm-twisting aimed at suppressing or tampering with embarrassing information.

Wednesday, October 6, 2010

"Petabyte for a Century" Goes Main-Stream

I started writing about the insights to be gained from the problem of keeping a Petabyte for a century four years ago, in September 2006. More than three years ago, in June 2007, I blogged about them. Two years ago, in September 2008, these ideas became a paper at iPRES 2008 (PDF). After an unbelievable 20-month delay from the time it was presented at iPRES, the International Journal of Digital Curation finally published almost exactly the same text (PDF) in June 2010.

Now, an expanded and improved version of the paper, including material from my 2010 JCDL keynote, has appeared in ACM Queue.

Alas, I'm not quite finished writing on this topic. I was too busy while preparing this article and failed to notice an excellent paper by Kevin Greenan, James Plank and Jay Wylie, "Mean time to meaningless: MTTDL, Markov models, and storage system reliability".

They agree with my point that MTTDL is a meaningless measure of storage reliability, and that bit half-life isn't a great improvement on it. They propose instead NOMDL (NOrmalized Magnitude of Data Loss), i.e. the expected number of bytes that the system will lose in a specified interval, divided by its usable capacity. As they point out, NOMDL can be computed by Monte Carlo simulation based on distributions of component failures that experiments have shown to fit the real world. These simulations produce estimates that are relatively credible, especially compared to the ludicrous estimates I pillory in the article.
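To make the metric concrete, here is a toy Monte Carlo sketch in Python for a two-drive mirror. Every parameter in it (exponential drive lifetimes, a fixed 24-hour rebuild, a 2TB capacity) is an assumption chosen for brevity, not Greenan et al.'s model; real simulations use failure and repair distributions fitted to field data and far richer system models.

    import random

    # Toy Monte Carlo estimate of NOMDL for a 2-way mirror.
    # All parameters are invented for illustration.
    CAPACITY_BYTES = 2 * 10**12      # usable capacity of the mirror (2 TB)
    MISSION_YEARS  = 10.0            # interval over which NOMDL is defined
    MTTF_YEARS     = 5.0             # mean time to failure per drive
    REBUILD_YEARS  = 24 / 8760.0     # 24-hour rebuild window
    TRIALS         = 100_000

    def drive_lifetime():
        # Exponential lifetimes keep the sketch simple; real drives are not exponential.
        return random.expovariate(1.0 / MTTF_YEARS)

    def bytes_lost_one_trial():
        t = 0.0
        while True:
            # Time until the first of the two healthy drives fails.
            t += min(drive_lifetime(), drive_lifetime())
            if t >= MISSION_YEARS:
                return 0                     # survived the mission
            # During the rebuild only one copy exists; if it fails, data is lost.
            if drive_lifetime() < REBUILD_YEARS:
                return CAPACITY_BYTES        # the whole mirror is lost
            t += REBUILD_YEARS               # rebuild done, back to two copies

    expected_loss = sum(bytes_lost_one_trial() for _ in range(TRIALS)) / TRIALS
    print("NOMDL over the mission:", expected_loss / CAPACITY_BYTES)

Even this toy makes the shape of the metric clear: NOMDL is the expected fraction of the usable capacity lost over a stated mission, which is far easier to interpret and compare than an MTTDL of some astronomical number of hours.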

NOMDL is a far better measure than MTTDL. Greenan, Plank and Wylie are to be congratulated for proposing it. However, it is not a panacea. It is still the result of models based on data, rather than experiments on the system in question. The major points of my article still stand:
  • That the reliability we need is so high that benchmarking systems to assure that they exceed it is impractical.

  • That projecting the reliability of storage systems from simulations based on component reliability distributions is likely to be optimistic, given both the observed auto- and long-range correlations between failures, and the inability of the models to capture the major causes of data loss, such as operator error.


Further, there is still a use for bit half-life. Careful readers will note subtle changes in the discussion of bit half-life between the iPRES and ACM versions. These are due to incisive criticism of the earlier version by Tsutomu Shimomura. The ACM version describes the use of bit half-life thus:
"Even if we are sublimely confident that every source of data loss other than bit rot has been totally eliminated, we still have to run a benchmark of the system’s bit half-life to confirm that it is longer than [required]"
However good simulations of the kind Greenan et al. propose may be, at some point we need to compare them to the reliability that the systems actually deliver.
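For a sense of the scale hiding behind "[required]", here is the back-of-the-envelope arithmetic from the article (a petabyte kept for a century with a 50% chance of zero bit flips), expressed as a few lines of Python:

    import math

    bits     = 8 * 10**15    # one petabyte, in bits
    years    = 100.0         # the mission
    p_intact = 0.5           # probability that every bit survives

    # If each bit independently survives time t with probability (1/2)**(t/H),
    # then all N bits survive with probability (1/2)**(N*t/H) = p_intact.
    # Solving for the half-life H:
    half_life = bits * years * math.log(2) / -math.log(p_intact)

    print("required bit half-life: %.1e years" % half_life)
    print("about %.0e times the age of the universe" % (half_life / 1.4e10))

The answer is roughly 8x10^17 years, some sixty million times the age of the universe; no feasible benchmark can confirm a half-life that long, which is the first of the two points above.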

A Level Playing-Field For Publishers

Stuart Shieber has an interesting paper in PLoS Biology on the economics of open-access publishing. He observes the moral hazard implicit in the separation between the readers of peer-reviewed science and the libraries that pay the subscriptions to the publishers that make peer review possible. His proposal for dealing with this is that grant funders and institutions should make dedicated funds available to authors that can be used only to pay processing fees for open-access journals. After all, he observes, these funders already support the subscriptions that allow subscription journals not to charge processing fees (although some still do). His proposal would provide a more level playing field between the subscription and open-access publishing channels. Below the fold is my take on how we can measure the level-ness of this field.