Wednesday, December 29, 2010

Migrating Microsoft Formats

Microsoft formats are routinely cited as examples where prophylactic format migration is required as part of a responsible digital preservation workflow. Over at Groklaw, PJ has a fascinating, long post up using SEC filings and Microsoft internal documents revealed in the Comes case to demonstrate that Microsoft's strategic use of incompatibility goes back at least 20 years and continues to this day. Her current example is the collaboration between Microsoft and Novell around Microsoft's "Open XML". This strategy poses problems that the advocates of format migration as a preservation strategy need to address. For details, follow me below the fold.

Monday, December 27, 2010

The Importance of Discovery in Memento

There is now an official Internet Draft of Memento, the important technique by which preserved versions of web sites may be accessed. The Memento team deserve congratulations not just for getting to this stage of the RFC process, but also for, on Dec. 1st, being awarded the 2010 Digital Preservation Award. Follow me below the fold for an explanation of one detail of the specification which, I believe, will become very important.

Friday, December 10, 2010

Rob Sharpe's Case For Format Migration

I'm grateful to Rob Sharpe of Tessella for responding, both on Tessella's web-site and here, to my post on the half-life of digital formats. It is nice to have an opportunity to debate. To rephrase my original post, the two sides of the debate are really:
  • Those who believe that format obsolescence is common, and thus format migration is the norm in digital preservation.
  • Those who believe that format obsolescence is rare, and thus format migration is the exception in digital preservation.
Rob makes three points in his comment:
  • Formats do go obsolete and the way to deal with this is format migration.
  • Digital preservation customers require format migration.
  • Format migration isn't expensive. (He expands on this on Tessella's web-site).
Follow me below the fold for a detailed discussion of these three points.

Monday, December 6, 2010

Machine-Readable Licenses vs. Machine-Readable Rights?

In the article Archiving Supplemental Materials (PDF) that Vicky Reich and I published recently in Information Standards Quarterly (a download is here), we point out that intellectual property considerations are a major barrier to preserving these increasingly common adjuncts to scholarly articles:
  • Some of them are data. Some data is just facts, so is not copyright. In some jurisdictions, collections of facts are copyright. In Europe, databases are covered by database right, which is different from copyright.
  • The copyright releases signed by authors differ, and the extent to which they cover supplemental materials may not be clear.
Groups such as Science Commons (a Creative Commons project) and the Open Data Commons are working to create suitable analogs of the set of simple, widely accepted licenses that Creative Commons has created for copyright material.

For material that is subject to copyright, we strongly encourage use of Creative Commons licenses. They permit all activities required for preservation without consultation with the publisher. The legal risks of interpreting other license terms as permitting these activities are considerable, so even if material was released under some other license we would generally not depend on its terms but seek explicit permission from the publisher instead. Obtaining explicit permission from the publisher is time-consuming and expensive. So is having a lawyer analyze the terms of a new license to determine whether it covers the required activities.

Efforts, such as those we cite in the article, are under way to develop suitable licenses for data, but they have yet to achieve even the limited penetration of Creative Commons for copyright works. Until a simple, clear, widely accepted license is in place, difficulties will lie in the path of any broad approach to preserving supplemental materials, especially data. Creating such a license will be a more difficult task than the one Creative Commons faced, since it will not be able to draw on the firm legal foundation of copyright. Note that the analogs of Creative Commons licenses for software, the various Open Source licenses, are also based on copyright.

When and if suitable licenses become common, one or more machine-readable ways to identify content published under the licenses will be useful. We're agnostic as to how this is done; the details will have little effect on the archiving process once we have implemented a parser for the machine-readable rights expressions that we encounter. We have already done this using the various versions of the Creative Commons license for the Global LOCKSS Network.
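To show how little machinery such a parser needs, here is a minimal sketch in Python's standard library. This is my illustration of the ccREL convention of tagging a link with rel="license", not the actual LOCKSS implementation:

```python
from html.parser import HTMLParser

class LicenseFinder(HTMLParser):
    """Collect the href targets of rel="license" links (the ccREL convention)."""
    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        rels = (a.get("rel") or "").split()
        if tag in ("a", "link") and "license" in rels and a.get("href"):
            self.licenses.append(a["href"])

page = ('<html><body><a rel="license" '
        'href="http://creativecommons.org/licenses/by/3.0/">CC BY</a></body></html>')
finder = LicenseFinder()
finder.feed(page)
print(finder.licenses)  # ['http://creativecommons.org/licenses/by/3.0/']
```

Once the convention is fixed, the parser is trivial; the hard part is agreeing on the small set of licenses the extracted URL may point to.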

The idea of a general "rights language" that would express the terms of a wide variety of licenses in machine-readable form is popular. But it is not a panacea. If there were a wide variety of license terms, even if they were encoded in machine-readable form, we would be reluctant to depend on them. There are few enough Creative Commons licenses and they are simple enough that they can be reviewed and approved by human lawyers. It would be too risky to depend on software interpreting the terms of licenses that had never had this review. So, a small set of simple clear licenses is essential for preservation. Encoding these licenses in machine-readable form is a good thing. That is what the Creative Commons license in machine-readable form does; it does not express the specific rights but simply points to the text of the license in question.

Encoding the specific terms of a wide variety of complex licenses in a rights language is much less useful. The software that interprets these encodings will not end up in court, nor will the encodings. The archives that use the software will end up in court facing the text of the license in a human language.

Saturday, December 4, 2010

A Puzzling Post From Rob Sharpe

I'm sometimes misquoted as saying "formats never become obsolete", but that isn't the argument I am making. Rather, I am arguing that basing the entire architecture of digital preservation systems on preparing for an event, format obsolescence, which is unlikely to happen to the vast majority of the content in the system in its entire lifetime is not good engineering. The effect of this approach is to raise the cost per byte of preserving content by investing resources in activities, such as collecting and validating format metadata, that are unlikely to generate a return. This ensures that vastly more content will be lost because no-one can afford to preserve it than will ever be lost through format obsolescence.

Tessella is a company in the business of selling digital preservation products and services based on the idea that content needs "Active Preservation", their name for the idea that the formats will go obsolete and that the way to deal with this prospect is to invest resources into treating all content as if it were in immediate need of format migration. Their market is
managing projects for leading national archives and libraries. These include ... the UK National Archives ... the British Library [the] US National Archives and Records Administration ... [the] Dutch National Archief and the Swiss Federal Archives.
It isn't a surprise to find that on Tessella's official blog Rob Sharpe disagrees with my post on format half-lives. Rob points out that
at Tessella we have a lot of old information trapped in Microsoft Project 98 files.
The obsolescence of Microsoft Project 98's format was first pointed out to me at the June 2009 PASIG meeting in Malta, possibly by Rob himself. I agree that this is one of the best of the few examples of an obsolete format, but I don't agree that it was a widely used format. What proportion of the total digital content that needs preservation is Project 98?

But there is a more puzzling aspect to Rob's post. Perhaps someone can explain what is wrong with this analysis.

Given that Tessella's sales pitch is that "Active Preservation" is the solution to your digital preservation needs, one would expect them to use their chosen example of an obsolete format to show how successful "Active Preservation" is at migrating it. But instead
at Tessella we have a lot of old information trapped in Microsoft Project 98 files.
Presumably, this means that they are no longer able to access the information "using supported software". Of course, they could access it using the old Project 98 software, but that wouldn't meet Rob's definition of obsolescence.

Are they unable to access the information because they didn't "eat their own dog-food" in the Silicon Valley tradition, using their own technology to preserve their own information? Or are they unable to access it because they did use their own technology and it didn't work? Or is Project 98 not a good example of
a format for which no supported software that can interpret it exists
so it is neither a suitable subject for their technology, nor for this debate?

Wednesday, November 24, 2010

The Half-Life of Digital Formats

I've argued for some time that there are no longer any plausible scenarios by which a format will ever go obsolete if it has been in wide use since the advent of the Web in 1995. In that time no-one has shown me a convincing counter-example: a format in wide use since 1995 in which content is no longer practically accessible. I accept that many formats from before 1995 need software archeology, and that there are special cases, such as games and other content protected by DRM, which pose primarily legal rather than technical problems. Here are a few updates on the quest for a counter-example.
Never is a very long time. Black vs. white arguments of the kind that pit "never happens" against "the sky is falling" may be interesting, but there are also insights to be gained from looking in the middle. Below the fold are some thoughts on what a middle ground argument might tell us.

Thursday, November 18, 2010

The Anonymity of Crowds

In an earlier post I discussed the consulting contract that Ithaka S+R is working on for GPO to project the future of the Federal Depository Library Program (FDLP). As part of this, Roger Schonfeld asked us:
about minimum levels of replication required in order to ensure long-term reliability

We get asked this all the time, because people think this is a simple question that should have a simple answer. We reply that experience leads us to believe that for the LOCKSS system the minimum number of copies is about 7, and surround this answer with caveats. But this answer is almost always quoted out of context as being applicable to systems in general, and being a hard number. It may be useful to give my answer to Schonfeld's question wider distribution; it is below the fold.
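As a toy illustration of why the question has no simple answer, consider the naive model in which each of n copies is lost independently with probability p per audit interval; the chance of losing all of them is then p to the power n, which falls off very fast with n. The numbers below are my back-of-envelope sketch, and the independence assumption is exactly the kind of caveat that gets dropped when "about 7" is quoted out of context — correlated failures make the real answer system-dependent:

```python
def p_all_lost(n_copies: int, p_loss: float) -> float:
    """Probability that every copy is lost in one audit interval,
    assuming (unrealistically) independent per-copy failures."""
    return p_loss ** n_copies

# With a 1% per-interval loss rate per copy, each extra copy
# buys two more orders of magnitude -- but only under independence.
for n in (1, 3, 7):
    print(n, p_all_lost(n, 0.01))
```

Correlated threats (a common software bug, a shared administrator, a legal takedown) collapse the effective n, which is why the caveats matter more than the number.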

Monday, November 15, 2010

Open Access via PubMed Central

One major achievement of the movement for open access to publicly funded research was the National Library of Medicine's (NLM) open access PubMed Central repository (PMC). Researchers funded by the National Institutes of Health (NIH) are required by law to deposit the final versions of their papers in PMC within a year of publication. Some other funders (e.g. the Wellcome Trust) have similar mandates. Although these papers represent a small fraction of the biomedical literature, they represent a high-quality source of open-access content because these funding sources are competitive, and free from industry bias.

NLM also runs what look like two indexing services, PubMed and MEDLINE. In reality, MEDLINE is a subset of PubMed:
MEDLINE is the largest component of PubMed, the freely accessible online database of biomedical journal citations and abstracts created by the U.S. National Library of Medicine (NLM®). Approximately 5,400 journals published in the United States and more than 80 other countries have been selected and are currently indexed for MEDLINE. A distinctive feature of MEDLINE is that the records are indexed with NLM's controlled vocabulary, the Medical Subject Headings (MeSH®).
These services are important traffic generators, so journals are anxious to be indexed. One of the requirements electronic-only journals must satisfy to be indexed is:
we must be satisfied that you will submit all articles published in a digital archive. We seek to ensure that our users will always have access to the full text of every article that we cite. The permanent archive must be PubMed Central or another site that is acceptable to NLM.
Thus, to be indexed and obtain the additional traffic, journals had to deposit "all articles" in PMC, which is open access, and was the only "site acceptable to NLM". A recent development threatens to cut off this supply of open access articles, leaving only the small proportion whose funder requires deposit in an open access repository. Details below the fold.

The Portico archiving service is a product of ITHAKA, a not-for-profit organization that originally spun out of the Andrew W. Mellon Foundation. JSTOR is another product of ITHAKA. JSTOR and Portico make content available only to readers of libraries that subscribe to these services. Libraries subscribe to Portico to obtain post-cancellation access to e-journal content, i.e. content to which they once subscribed but no longer do. Libraries subscribe to JSTOR to obtain access to digitized back content from journals. In a world where journals were open access, there would be no reason to do either and thus neither of these services' current business models would be viable. Although ITHAKA is not-for-profit, it is necessarily opposed to open access. William Bowen, the first chair of ITHAKA's board and the source of its initial funding from the Mellon Foundation, recently suggested that
the ultimate [funding] solution may be "a blended funding model" involving user fees, contributions, and "some ways of imposing taxes."
[My emphasis. The video of this session is mysteriously missing from the conference website so I am currently unable to verify the exact quote.].

ITHAKA has negotiated a deal with the NLM under which e-only articles deposited in Portico will be indexed in MEDLINE without being deposited in PMC. There is a caveat that publishers must also provide NLM with a PDF of the article to be kept in a "vault" at NLM – but this NLM copy will not be open access.

Prior to this deal, to be indexed in PubMed articles had to be deposited in PMC, and thus made publicly available, even though doing so was not required by the law (or by the funder). After the deal, articles whose deposit is not required may be deposited in Portico, not deposited in PMC, and thus no longer be publicly available although still indexed in MEDLINE. This agreement does not violate the Public Access Law, but it decreases public access to research. As more publishers move to e-only the implications of this agreement will continue to grow. In particular, NLM has given up the leverage (indexing in MEDLINE) it used to have in negotiations with the publishers.

Wednesday, October 20, 2010

Four years ahead of her time

As I pointed out in my JCDL2010 keynote, Vicky Reich put this fake Starbucks web page together four years ago to predict that libraries without digital collections, acting only as a distribution channel for digital content from elsewhere, would end up competing with, and losing to, Starbucks.

This prediction has now come true.

Saturday, October 16, 2010

The Future of the Federal Depository Libraries

Governments through the ages have signally failed to resist the temptation to rewrite history. The redoubtable Emptywheel pointed me to excellent investigative reporting by ProPublica's Dafna Linzer which reveals a disturbing current example of rewriting history.

By being quick to notice, and take a copy of, a new entry in an on-line court docket Linzer was able to reveal that the Obama administration, in the name of national security, forced a Federal judge to create and enter into the court's record a misleading replacement for an opinion he had earlier issued. ProPublica's comparison of the original opinion which Linzer copied and the later replacement reveals that in reality the government wanted to hide the fact that their case against an alleged terrorist was extraordinarily thin, and based primarily on statements from detainees who had been driven mad or committed suicide as a result of their interrogation. Although the fake opinion comes to the same conclusion as the real one, the arguments are significantly different. Judges in other cases could be misled into relying on witnesses this judge discredited for reasons that were subsequently removed.

Linzer's exposé of government tampering with a court docket is an example of the problem on which the LOCKSS Program has been working for more than a decade, how to make the digital record resistant to tampering and other threats. The only reason this case was detected was because Linzer created and kept a copy of the information the government published, and this copy was not under their control. Maintaining copies under multiple independent administrations (i.e. not all under control of the original publisher) is a fundamental requirement for any scheme that can recover from tampering (and in practice from many other threats). Techniques such as those developed by Stuart Haber can detect tampering without keeping a copy, but cannot recover from it.
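A minimal sketch of why independently administered copies allow both detection and recovery: compare digests of the copies, let the majority win, and flag dissenters for repair. This is my toy illustration, not the LOCKSS polling protocol, which must also defend the voting itself against adversaries:

```python
import hashlib
from collections import Counter

def digest(doc: bytes) -> str:
    return hashlib.sha256(doc).hexdigest()

def audit(copies):
    """Compare independently held copies; the majority digest wins,
    and any dissenting copy is flagged for repair from the majority."""
    tally = Counter(digest(c) for c in copies)
    winner, _ = tally.most_common(1)[0]
    tampered = [i for i, c in enumerate(copies) if digest(c) != winner]
    return winner, tampered

original = b"opinion of the court, as first published"
copies = [original, original, b"opinion of the court, as quietly replaced"]
winner, tampered = audit(copies)
print(tampered)  # [2] -- the altered copy, repairable from the other two
```

A timestamping scheme in Haber's style can prove that copy 2 no longer matches what was published, but only the surviving copies let you restore the original text.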

In the paper world the proliferation of copies in the Federal Depository Library Program made the system somewhat tamper-resistant. A debate has been underway for some time as to the future of the FDLP in the electronic world - the Clinton and Bush administrations were hostile to copies not under their control but the Obama administration may be more open to the idea. The critical point that this debate has reached is illuminated by an important blog post by James Jacobs. He points out that the situation since 1993 has been tamper-friendly:
But, since 1993, when The Government Printing Office Electronic Information Access Enhancement Act (Public Law 103-40) was passed, GPO has arrogated to itself the role of permanent preservation of government information and essentially prevented FDLP libraries from undertaking that role by refusing to deposit digital materials with depository libraries.
On the other hand, GPO does now make their documents available for bulk download and efforts are under way to capture them.

The occasion for Jacobs' blog post is that GPO has contracted with Ithaka S+R to produce a report on the future of the FDLP. The fact that GPO is willing to revisit this issue is a tribute to the efforts of government document librarians, but there are a number of reasons for concern that this report's conclusions will have been pre-determined:
  • Like Portico and JSTOR, Ithaka S+R is a subsidiary of ITHAKA. It is thus in the business of replacing libraries with their own collections, such as the paper FDLP libraries, with libraries which act instead as a distribution channel for ITHAKA's collections. Jacobs points out:
    You might call this the "libraries without collections" or the "librarians without libraries" model. This is the model designed by GPO in 1993. It is the model that ITHAKA, the parent organization of Ithaka S+R, has used as its own business model for Portico and JSTOR. This model is favored by the Association of Research Libraries, by many library administrators who apparently believe that it would be better if someone else took the responsibility of preserving government information and ensuring its long-term accessibility and usability, and by many depository librarians who do not have the support of their institutions to build and manage digital collections.
  • Ithaka S+R is already on record as proposing a model for the FDLP which includes GPO and Portico, but not the FDLP libraries. Jacobs:
    Ithaka S+R has already written a report with a model for the FDLP (Documents for a Digital Democracy: A Model for the Federal Depository Library Program in the 21st Century). In that report, it recommended that "GPO should develop formal partnerships with a small number of dedicated preservation entities -- such as organizations like HathiTrust or Portico or individual libraries -- to preserve a copy of its materials".
  • As Jacobs points out, the FDLP libraries are devoted to free, open access to their collections. By contrast GPO is allowed to charge access fees. Charging fees for access is the basis for ITHAKA's business models.
    Where private sector companies limit access to those who pay and GPO is specifically authorized in the 1993 law to "charge reasonable fees," FDLP libraries are dedicated to providing information without charging.
  • The process by which Ithaka S+R ended up with the contract is unclear to me. Were there other bidders? If so, was their position on the future of FDLP on the record as Ithaka S+R's was? If Ithaka S+R was the only bidder, why was this?
It is important to note that although a system for preserving government documents consisting of the GPO and "formal partnerships with a small number of dedicated preservation entities" might well improve the resistance of government documents to some threats, it provides much less resistance to government tampering than the massively distributed paper FDLP. The "small number of dedicated preservation entities" dependent on "formal partnerships" with the government in the form of the GPO will be in a poor position to resist government arm-twisting aimed at suppressing or tampering with embarrassing information.

Wednesday, October 6, 2010

"Petabyte for a Century" Goes Main-Stream

I started writing about the insights to be gained from the problem of keeping a Petabyte for a century four years ago in September 2006. More than three years ago in June 2007 I blogged about them. Two years ago in September 2008 these ideas became a paper at iPRES 2008 (PDF). After an unbelievable 20-month delay from the time it was presented at iPRES, the International Journal of Digital Curation finally published almost exactly the same text (PDF) in June 2010.

Now, an expanded and improved version of the paper, including material from my 2010 JCDL keynote, has appeared in ACM Queue.

Alas, I'm not quite finished writing on this topic. I was too busy when I was preparing this article and so I failed to notice an excellent paper by Kevin Greenan, James Plank and Jay Wylie, Mean time to meaningless: MTTDL, Markov models, and storage system reliability.

They agree with my point that MTTDL is a meaningless measure of storage reliability, and that bit half-life isn't a great improvement on it. They propose instead NOMDL (NOrmalized Magnitude of Data Loss), i.e. the expected number of bytes that the storage will lose in a specified interval divided by its usable capacity. As they point out, it is possible to compute this using Monte Carlo simulation based on distributions of component failures that experiments have shown to fit the real world. These simulations produce estimates that are relatively credible, especially compared to the ludicrous estimates I pillory in the article.

NOMDL is a far better measure than MTTDL. Greenan, Plank and Wylie are to be congratulated for proposing it. However, it is not a panacea. It is still the result of models based on data, rather than experiments on the system in question. The major points of my article still stand:
  • That the reliability we need is so high that benchmarking systems to assure that they exceed it is impractical.

  • That projecting the reliability of storage systems based on simulations based on component reliability distributions is likely to be optimistic, given both the observed auto- and long-range correlations between failures, and the inability of the models to capture the major causes of data loss, such as operator error.
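To make NOMDL concrete, here is a deliberately simplified Monte Carlo sketch. It is my own toy model, not Greenan et al.'s methodology: a two-way mirror with exponential component failures and a fixed rebuild window, ignoring precisely the failure correlations and operator errors the article warns about. Real estimates would use field-measured failure distributions:

```python
import random

def simulate_nomdl(trials=100_000, mission_hours=87_600,  # ten-year mission
                   mttf_hours=1_000_000, rebuild_hours=24,
                   capacity_bytes=4e12):
    """Toy NOMDL estimate for a two-way mirror with exponential failures.
    NOMDL = E[bytes lost over the mission] / usable capacity."""
    lost = 0.0
    for _ in range(trials):
        t = 0.0
        while True:
            # time until either of the two disks fails
            t += random.expovariate(2 / mttf_hours)
            if t > mission_hours:
                break  # mission completed without incident
            # does the surviving disk fail during the rebuild window?
            if random.expovariate(1 / mttf_hours) < rebuild_hours:
                lost += capacity_bytes  # both copies gone
                break
            # otherwise the failed disk is replaced and we continue
    return lost / trials / capacity_bytes

print(simulate_nomdl())
```

Even this toy model shows the virtue of the measure: its output is in the units we actually care about, expected fraction of the archive lost, rather than an MTTDL number with no operational meaning.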

Further, there is still a use for bit half-life. Careful readers will note subtle changes in the discussion of bit half-life between the iPRES and ACM versions. These are due to incisive criticism of the earlier version by Tsutomu Shimomura. The ACM version describes the use of bit half-life thus:
"Even if we are sublimely confident that every source of data loss other than bit rot has been totally eliminated, we still have to run a benchmark of the system’s bit half-life to confirm that it is longer than [required]"
However good simulations of the kind Greenan et al. propose may be, at some point we need to compare them to the reliability that the systems actually deliver.
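The back-of-envelope arithmetic behind that requirement, under the simplifying assumption that bits decay independently, shows how implausibly long the required bit half-life is. These are my own round numbers for the petabyte-for-a-century case:

```python
PETABYTE_BITS = 8 * 10**15   # bits in a petabyte
YEARS = 100                  # mission: keep it for a century

# Each bit independently survives time t with probability 2**(-t/H),
# where H is the bit half-life.  For a 50% chance that *every* bit
# survives the century we need 2**(-N*t/H) >= 0.5, i.e. H >= N*t.
required_half_life_years = PETABYTE_BITS * YEARS
AGE_OF_UNIVERSE_YEARS = 1.4e10

print(f"{required_half_life_years:.1e}")  # 8.0e+17 years
print(f"{required_half_life_years / AGE_OF_UNIVERSE_YEARS:.0e}")  # ~6e+07 universe ages
```

Benchmarking a system to confirm a half-life tens of millions of times the age of the universe is, to put it mildly, impractical, which is why comparing the models against delivered reliability matters.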

A Level Playing-Field For Publishers

Stuart Shieber has an interesting paper in PLoS Biology on the economics of open-access publishing. He observes the moral hazard implicit in the separation between the readers of peer-reviewed science and the libraries that pay the subscriptions to the publishers that make the peer review possible. His proposal to deal with this is that grant funders and institutions should make dedicated funds available to authors that can be used only for paying processing fees for open access journals. After all, he observes, these funders already support the subscriptions that allow subscription journals not to charge processing fees (although some still do charge such fees). His proposal would provide a more level playing field between subscription and open access publishing channels. Below the fold is my take on how we can measure the level-ness of this field.

Tuesday, September 21, 2010

How Green Is Digital Preservation?

At iPRES 2010 I was on a panel chaired by Neil Grindley of JISC entitled How Green is Digital Preservation?. Each of the panelists gave a very brief introduction; below the fold is an edited version of mine.

Wednesday, September 8, 2010

Reinforcing my point

Reinforcing the point I have made over and over and over again, that the threat of software and format obsolescence is vastly over-blown, here is a year-old Slashdot story pointing to a contemporary blog post listing a selection of ancient operating systems that simply refuse to die, with comments providing many more examples.

Tuesday, July 13, 2010

Yet more bad news about disks

Chris Mellor at The Register reviews the prospects for the 4TB disk generation and reports that manufacturers are finding the transition to the technologies it needs more difficult and expensive than expected.

This reinforces the argument of my earlier post, based on Dave Anderson's presentation (PDF) to the 2009 Library of Congress Storage workshop, that the exponential drop in cost per byte we expect from disks is about to flatten out.

Thursday, June 24, 2010

JCDL 2010 Keynote

On June 23 I gave the keynote address to the joint JCDL/ICADL 2010 conference at Surfer's Paradise in Queensland, Australia. Below the fold is an edited text of the talk, with links to the resources.

Wednesday, February 10, 2010

Even more bad news about disks

I attended the 25th (plus one) anniversary celebrations for the Andrew project at Carnegie-Mellon University. As part of these James Gosling gave a talk. He stressed the importance of parallelism in programming, reinforcing the point with a graph of CPU clock rate against time. For many years, clock rate increased. Some years ago, it stopped increasing. Did Moore's Law stop working? Not at all; there were no strong technological barriers to increasing the clock rate. What happened was that increasing the clock rate stopped being a way to make money. Mass-market customers wanted lower-power, lower-price CPUs, not faster ones. So that's what the manufacturers made.

Follow me below the fold to see the analogous phenomenon happening to disks, and why this is bad news for digital preservation.