Wednesday, December 29, 2010

Migrating Microsoft Formats

Microsoft formats are routinely cited as examples where prophylactic format migration is required as part of a responsible digital preservation workflow. Over at Groklaw, PJ has a fascinating, long post up using SEC filings and Microsoft internal documents revealed in the Comes case to demonstrate that Microsoft's strategic use of incompatibility goes back at least 20 years and continues to this day. Her current example is the collaboration between Microsoft and Novell around Microsoft's "Open XML". This strategy poses problems that the advocates of format migration as a preservation strategy need to address. For details, follow me below the fold.

Monday, December 27, 2010

The Importance of Discovery in Memento

There is now an official Internet Draft of Memento, the important technique by which preserved versions of web sites may be accessed. The Memento team deserve congratulations not just for getting to this stage of the RFC process, but also for being awarded the 2010 Digital Preservation Award on Dec. 1st. Follow me below the fold for an explanation of one detail of the specification which, I believe, will become very important.

Friday, December 10, 2010

Rob Sharpe's Case For Format Migration

I'm grateful to Rob Sharpe of Tessella for responding, both on Tessella's web-site and here, to my post on the half-life of digital formats. It is nice to have an opportunity to debate. To rephrase my original post, the two sides of the debate are really:
  • Those who believe that format obsolescence is common, and thus format migration is the norm in digital preservation.
  • Those who believe that format obsolescence is rare, and thus format migration is the exception in digital preservation.
Rob makes three points in his comment:
  • Formats do go obsolete and the way to deal with this is format migration.
  • Digital preservation customers require format migration.
  • Format migration isn't expensive. (He expands on this on Tessella's web-site.)
Follow me below the fold for a detailed discussion of these three points.

Monday, December 6, 2010

Machine-Readable Licenses vs. Machine-Readable Rights?

In the article Archiving Supplemental Materials (PDF) that Vicky Reich and I published recently in Information Standards Quarterly (a download is here), we point out that intellectual property considerations are a major barrier to preserving these increasingly common adjuncts to scholarly articles:
  • Some of them are data. Some data is just facts, so is not subject to copyright. In some jurisdictions, collections of facts are protected by copyright. In Europe, databases are covered by database right, which is different from copyright.
  • The copyright releases signed by authors differ, and the extent to which they cover supplemental materials may not be clear.
Groups such as Science Commons (a Creative Commons project) and the Open Data Commons are working to create suitable analogs of the set of simple, widely accepted licenses that Creative Commons has created for copyright material.

For material that is subject to copyright, we strongly encourage use of Creative Commons licenses. They permit all activities required for preservation without consultation with the publisher. The legal risks of interpreting other license terms as permitting these activities without explicit permission are considerable, so even if the material was released under some other license terms we would generally prefer not to depend on them but seek explicit permission from the publisher instead. Obtaining explicit permission from the publisher is time-consuming and expensive. So is having a lawyer analyze the terms of a new license to determine whether it covers the required activities.

Efforts, such as those we cite in the article, are under way to develop suitable licenses for data, but they have yet to achieve even the limited penetration of Creative Commons for copyright works. Until there is a simple, clear, widely-accepted license in place, difficulties will lie in the path of any broad approach to preserving supplemental materials, especially data. Creating such a license will be a more difficult task than Creative Commons faced, since it will not be able to draw on the firm legal foundation of copyright. Note that the analogs of Creative Commons licenses for software, the various Open Source licenses, are also based on copyright.

When and if suitable licenses become common, one or more machine-readable ways to identify content published under the licenses will be useful. We're agnostic as to how this is done; the details will have little effect on the archiving process once we have implemented a parser for the machine-readable rights expressions that we encounter. We have already done this using the various versions of the Creative Commons license for the Global LOCKSS Network.

The idea of a general "rights language" that would express the terms of a wide variety of licenses in machine-readable form is popular. But it is not a panacea. If there were a wide variety of license terms, even if they were encoded in machine-readable form, we would be reluctant to depend on them. There are few enough Creative Commons licenses and they are simple enough that they can be reviewed and approved by human lawyers. It would be too risky to depend on software interpreting the terms of licenses that had never had this review. So, a small set of simple, clear licenses is essential for preservation. Encoding these licenses in machine-readable form is a good thing. That is what the Creative Commons license in machine-readable form does; it does not express the specific rights but simply points to the text of the license in question.
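To make the "pointer, not rights expression" idea concrete, here is a minimal sketch in Python. It assumes a page marks its license with an HTML rel="license" link, which is the convention Creative Commons recommends, and that the archive keeps a short list of license families its lawyers have already reviewed. The names REVIEWED_LICENSE_PREFIXES, find_license_urls and may_preserve are hypothetical, chosen for illustration; this is not the parser used in the Global LOCKSS Network.

```python
from html.parser import HTMLParser

# ASSUMPTION: a hypothetical allowlist of license URL prefixes (scheme removed)
# that human lawyers have already reviewed and approved for preservation.
REVIEWED_LICENSE_PREFIXES = (
    "creativecommons.org/licenses/",
)


class LicenseLinkFinder(HTMLParser):
    """Collects the href of every <a> or <link> element whose rel includes 'license'."""

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # rel is a space-separated token list; a valueless attribute arrives as None.
        rel_tokens = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "license" in rel_tokens and href:
            self.license_urls.append(href)


def find_license_urls(html):
    """Return the license URLs that the page points to, in document order."""
    finder = LicenseLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.license_urls


def is_reviewed_license(url):
    """True if the URL falls under a license family on the reviewed allowlist."""
    normalized = url.strip().lower()
    for scheme in ("https://", "http://"):
        if normalized.startswith(scheme):
            normalized = normalized[len(scheme):]
            break
    return normalized.startswith(REVIEWED_LICENSE_PREFIXES)


def may_preserve(html):
    """True only if the page points to at least one already-reviewed license."""
    return any(is_reviewed_license(url) for url in find_license_urls(html))
```

Note that the software never interprets license terms. It only recognizes a pointer to a license text that humans have already read, and anything it does not recognize falls back to the slow path of asking the publisher for explicit permission.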

Encoding the specific terms of a wide variety of complex licenses in a rights language is much less useful. The software that interprets these encodings will not end up in court, nor will the encodings. The archives that use the software will end up in court facing the text of the license in a human language.

Saturday, December 4, 2010

A Puzzling Post From Rob Sharpe

I'm sometimes misquoted as saying "formats never become obsolete", but that isn't the argument I am making. Rather, I am arguing that basing the entire architecture of digital preservation systems on preparing for an event, format obsolescence, which is unlikely to happen to the vast majority of the content in the system in its entire lifetime is not good engineering. The effect of this approach is to raise the cost per byte of preserving content, by investing resources in activities such as collecting and validating format metadata, that are unlikely to generate a return. This ensures that vastly more content will be lost because no-one can afford to preserve it than will ever be lost through format obsolescence.

Tessella is a company in the business of selling digital preservation products and services based on the idea that content needs "Active Preservation", their name for the idea that the formats will go obsolete and that the way to deal with this prospect is to invest resources into treating all content as if it were in immediate need of format migration. Their market is
"managing projects for leading national archives and libraries. These include ... the UK National Archives ... the British Library [the] US National Archives and Records Administration ... [the] Dutch National Archief and the Swiss Federal Archives."
It isn't a surprise to find that on Tessella's official blog Rob Sharpe disagrees with my post on format half-lives. Rob points out that
"at Tessella we have a lot of old information trapped in Microsoft Project 98 files."
The obsolescence of Microsoft Project 98's format was first pointed out to me at the June 2009 PASIG meeting in Malta, possibly by Rob himself. I agree that this is one of the best of the few examples of an obsolete format, but I don't agree that it was a widely used format. What proportion of the total digital content that needs preservation is Project 98?

But there is a more puzzling aspect to Rob's post. Perhaps someone can explain what is wrong with this analysis.

Given that Tessella's sales pitch is that "Active Preservation" is the solution to your digital preservation needs, one would expect them to use their chosen example of an obsolete format to show how successful "Active Preservation" is at migrating it. But instead
"at Tessella we have a lot of old information trapped in Microsoft Project 98 files."
Presumably, this means that they are no longer able to access the information "using supported software". Of course, they could access it using the old Project 98 software, but that wouldn't meet Rob's definition of obsolescence.

Are they unable to access the information because they didn't "eat their own dog-food" in the Silicon Valley tradition, using their own technology to preserve their own information? Or are they unable to access it because they did use their own technology and it didn't work? Or is Project 98 not a good example of
"a format for which no supported software that can interpret it exists"
so it is neither a suitable subject for their technology, nor for this debate?