Friday, December 10, 2010

Rob Sharpe's Case For Format Migration

I'm grateful to Rob Sharpe of Tessella for responding, both on Tessella's web-site and here, to my post on the half-life of digital formats. It is nice to have an opportunity to debate. To rephrase my original post, the two sides of the debate are really:
  • Those who believe that format obsolescence is common, and thus format migration is the norm in digital preservation.
  • Those who believe that format obsolescence is rare, and thus format migration is the exception in digital preservation.
Rob makes three points in his comment:
  • Formats do go obsolete and the way to deal with this is format migration.
  • Digital preservation customers require format migration.
  • Format migration isn't expensive. (He expands on this on Tessella's web-site).
Follow me below the fold for a detailed discussion of these three points.


Rob: Formats Go Obsolete

Rob's own chosen example of format obsolescence doesn't help his case. It turns out that, even on his own definition, he is simply wrong about Project 98 being obsolete, in that there are several supported viewers for Project 98. When Chris pointed them out, Rob changed his story and admitted that Tessella can access Project 98 files by running Project 98.

I said we couldn’t read the Project 98 format but it is actually the previous version(s) we can’t read directly. What we can do (and why I got confused) is to use the Project 98 software (which is no longer supported) to convert it into a format that modern versions of Project can still read.
It turns out once more that I should have investigated the alleged obsolescence of the Project 98 format more closely. This Google search reveals the informative web site of a Java Open Source tool, Memoranda, that reads and writes the format from before Project 98 (MPX), reads the Project 98 format (MPP), and reads and writes the XML-based format used from Project 2002 on (MSPDI). So, Rob is saying they can't read MPX but they can use Project 98 to convert it to MPP, and Project 2002 to convert it to MSPDI. But in the same sentence, he is saying they can read MPX, because Project 98 can read it, and they can run Project 98.

Thus when Tessella says these formats are "trapped", their reason is not that the necessary tools don't exist, but that Microsoft no longer supports these tools and they appear unwilling to use any third-party tools even if they are supported by their respective vendors, or are Open Source. By any reasonable definition of "obsolete", none of the Microsoft Project formats are obsolete and, since there are Open Source renderers for all of them, there is no credible scenario by which they will ever go obsolete. Tessella's apparent dependence on Microsoft support is puzzling, except as a marketing strategy (raising unjustified alarms about format obsolescence in order to sell systems that claim to defend against it).

This illustrates fundamental contradictions in the norm approach, in that if a format is popular enough to be worth preserving, it is unlikely that the vendor can cease supporting it. Even if the vendor does cease supporting it, they need to provide a migration path to their own successor product. In order to make a format migration necessary, this successor product has to stop working, since until it does it can be used to access the old format. But if it stops working it cuts off the migration path. If necessary, emulation can be used to ensure that the successor, and indeed the predecessor, product continues to work. Thus the norm approach can be expected to work only in those cases where it is not needed, and fail in those cases where it is needed, because not even the vendor thinks it worth providing a migration path.

So far, neither Rob nor anyone else has produced evidence to refute my claim that widely used formats stopped going obsolete 15 years ago, and that those on the norm side of the argument are preparing to fight the last battle.

Rob: Customers Want Format Migration

I am sure that Rob is right that his customers want format migration. There are two possible reasons for this:
  • They want format migration because they believe that it is essential for digital preservation. One can't blame them for believing this; Tessella and the rest of the digital preservation industry on the norm side have been telling them this for the last 15 years. There have been few voices on the exception side, despite the overwhelming evidence in its favor. The customers who want format migration because they believe it essential to digital preservation are simply misguided (see the section above).
  • They want format migration for reasons other than digital preservation. They are, in other words, customers not just for digital preservation but also other services. The costs associated with these other services should not be attributed to digital preservation, but to these other services.
Rob admits that at least the second of these is actually the case:
albeit, as I said in my last post, it is more common to do this for presentation rather than preservation reasons
As we see in the next section, by buying systems that bundle other services with preservation it is likely that the customers are paying way too much for preservation.

Rob: Format Migration Is Cheap

I am sure that for Tessella's systems the additional cost of performing format migrations is small. After all, they are on the norm side of the argument so have designed their systems in the expectation that format migration will be a frequent occurrence. There would have to be something seriously wrong with their engineering if doing format migrations in their system were expensive, because the costs of doing them have been buried in the base cost of the system. Rob says:
I also don't believe that this requirement imposes huge costs on the system.
but he also says that their ingest system runs with:
about a third of the time being spent performing one form of characterisation or another.
He is not worried by this because:
most systems I’ve seen have a huge amount of spare processing capacity.
Rob's business is:
managing projects for leading national archives and libraries. These include ... the UK National Archives ... the British Library [the] US National Archives and Records Administration ... [the] Dutch National Archief and the Swiss Federal Archives.
I can quite believe that organizations like these can afford to vastly over-provision their systems, but that is not the world in which most digital preservation systems operate.

Unlike Rob, whose "about a third" is the only number he provides, I have actually produced figures comparing the cost per byte of a norm system (Portico) and an exception system (Internet Archive), both operating at scale in organizations with limited budgets whose only function is digital preservation. I admit that this is a coarse comparison, but the norm system is about 20 times more expensive per byte than the exception system. Rob's arguments to the contrary amount to "developing software is free because it is amortized across many purchasers" and "running software is free because customers over-provision their systems anyway".

I'd be interested to see other cost comparisons of the two approaches but, in the absence of evidence to the contrary it seems that Rob's belief is wrong, and that the norm approach to digital preservation is significantly more expensive than the exception approach.

Summary

To sum up, Rob's case is that even though there are no longer examples of formats going obsolete, and thus format migration is not necessary for preservation, some customers might like to be able to do migrations, and thus preservation systems should be designed around preparing to, and actually doing, migrations. The extra cost of doing so doesn't matter, because customers are so unconcerned about cost that they massively over-provision their systems anyway.

My case is that, as we see from the last few years' focus on "sustainability of digital preservation", the major problems in digital preservation are economic. The only evidence we have so far directly comparing the costs of systems that treat format migration as the norm with those that treat it as an exception shows that the former are considerably more expensive when operating at scale. Thus it is likely that by treating the exception as the exception it is, we can significantly reduce the cost per byte of digital preservation, and thus preserve a lot more stuff.

3 comments:

David. said...

At the Open Planets Foundation blog, Andy Jackson posts a commentary on this discussion. He agrees that format obsolescence is rare, and that the developers of tools such as PLANETS need to justify their efforts in this light. This is encouraging.

He writes "I think there are some important cases where this approach may not be sufficient." I'd be very interested in the details behind this claim.

He also writes "even when obsolescence is an not an issue, there are still very many cases where I need better tools in order to manage our content". These "very many cases" may well exist, but given that they aren't related to format obsolescence and, given the tools that PLANETS is developing, aren't related to bit preservation either, the "better tools" need to be justified by reasons other than preservation.

David. said...

Oops - I butchered the link in the comment above. It should go here.

Robert Sharpe said...

David,

I’ve made a general comment on the dangers I see with relying on obscure readers to deal with old formats back on Tessella’s site (http://www.digital-preservation.com/2011/01/migration-by-any-other-name/). However, here I will respond to the specific case of MS Project.

First I should make it clear again that the versions we can’t read are pre-Project 98 versions (which I think were called Project 2 and Project 95). I know I got this wrong in my first posting on this subject so apologies that this continues to cause confusion.

Again, to be clear, these pre-98 formats can indeed be read by Project 98 but I don’t consider this a reliable solution since Project 98 has not been supported for 8 years (see http://support.microsoft.com/lifecycle/?p1=2689). Later versions of Microsoft Project (2000+) do not support the pre-98 format.

With regard to other reader tools, I downloaded the tool that you cite (Memoranda) but, unfortunately, it does not appear to offer any migration support to or from any Microsoft formats. In fact you seem to be quoting from the site http://java-source.net/open-source/project-management. This lists various tools that deal with MS Project (of which Memoranda is just the first in the list) and the description you quote appears to be from the MPXJ tool on that page. This tool is a migration library not a reader and it doesn’t claim to deal with the relevant pre-Project 98 formats.

To some extent the existence or not of a reader that can read the formats I identified is not directly relevant to the general thrust of my argument since I am sure that someone could write one with enough effort, just as someone could write a migration tool. My point is:
(i) Writing a new piece of software to interpret an old format is necessary because formats do otherwise become obsolete.
(ii) Whether that software writes its output to a screen (i.e. it is a “reader”) or to another format (i.e. it is a “migration tool”), it is actually doing a similar thing and is subject to similar potential errors. The real question then becomes how can you avoid just blindly trusting it to preserve information?