Saturday, October 13, 2012

Cleaning up the "Formats through tIme" mess

As I said in this comment on my post Formats through time, time pressure meant that I made enough of a mess of it to need a whole new post to clean up. Below the fold is my attempt to remedy the situation.

Format usage vs. format obsolescence

One of the things Jeff Rothenberg and I agree about is that the subject of our disagreement is format obsolescence, defined as the inability of current software to render preserved content in a format. Andy Jackson's study, unlike Matt Holden's study, is not about format obsolescence at all. It is about format usage: the formats current tools use to create new content. Andy's draft paper is clear that this is what he is studying, but that clarity comes only after he starts out discussing format obsolescence. As Andy says:
While it is clear that the observation of a format falling out of use does not imply it is obsolete, the fact that a format is still being used to create new material necessarily means that that format is not yet obsolete.
For example, Andy's Figure 3.5 shows that PDF 1.3 is a significant part of the PDF being generated 10 years after it was introduced. Andy interprets this as refuting Jeff's aphorism "Digital Information Lasts Forever—Or Five Years, Whichever Comes First". I tend to believe that Jeff was exaggerating for effect, so I don't pay the aphorism much attention. But it is true that the observed life-spans of Web format usage lend indirect support to my argument that format obsolescence on the Web is slow and rare.

Browser support for old formats

Matt's study, on the other hand, shows that 15-year-old Web image, video and audio content is still rendered perfectly well by current software, especially open-source software. This is direct and substantial support for my argument.

It is important to distinguish between the evolution of a format family, such as HTML or PDF, through successive versions and the replacement of a format by a competitor in a completely different family. Here is the current vanilla Chrome with no additional plugins rendering the very first capture by the Internet Archive of the BBC News site which was, I believe, in HTML 3.2.

Current Chrome doesn't have a problem rendering the 14-year-old HTML. That is because the designers of the successive HTML versions were constrained to remain compatible with the previous versions. They could not force an incompatible change on the browsers in the field, because "there is no flag day on the Internet". In the context of format obsolescence it is misleading to treat HTML 2.0, 3.2, 4.0, 4.01 and so on as if they were different formats. A single HTML renderer will render them all. The introduction of, say, HTML 4.0 has no effect on the render-ability of HTML 3.2.

Note: The missing plugin for the space at the top right is Java.

Format identification tools

Several people responded to my criticism of format identification tools. Matt said:
I do agree that identification of textual formats is increasingly important, and further efforts are probably needed in this area.
I don't agree and have said so in the past. As regards Web formats, to the extent to which format identification tools agree with the code in the browsers they don't tell us anything useful, and to the extent to which they disagree with the code in the browsers they are simply wrong. Applying these tools as part of a Web preservation pipeline is at best a waste of resources and at worst actively harmful.

The reasons why this would be a waste of resources are laid out in my post on The Half-Life of Digital Formats. Briefly, adding costs, such as format identification, to the ingest pipeline in anticipation of future format obsolescence is an investment of resources whose benefits will come only when the format goes obsolete. You cannot know whether or when this will happen, and the evidence we have suggests that, if it ever does, it will be far in the future. You cannot know whether and how much the knowledge the tools generate will assist in mitigating the format's obsolescence, and the evidence we have is that it won't make a significant difference.

These factors make the cost-benefit analysis of the investment come down heavily against making it. To justify the investment you would have to be very certain of things you cannot know, and for which the evidence we have points the other way. Matt's study and the arguments in The Half-Life of Digital Formats suggest that the time before the investment would pay off is at least 15 and perhaps as much as 100 years. The net present value of the (uncertain) benefit this far in the future is very small. The right approach to the possibility of format obsolescence is watchful waiting.
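To make the discounting argument concrete, here is a rough back-of-the-envelope sketch. The formula is the standard one for net present value; the benefit amount and 5% discount rate are illustrative assumptions of mine, not figures from the post:

```python
# The present value of a benefit B received t years from now,
# discounted at annual rate r, is B / (1 + r)**t.

def present_value(benefit, rate, years):
    """Discount a future benefit back to today's value."""
    return benefit / (1 + rate) ** years

# Illustrative numbers (my assumptions): a benefit worth 100 units
# when the format finally goes obsolete, at a 5% discount rate.
for years in (15, 50, 100):
    pv = present_value(100.0, 0.05, years)
    print(f"benefit in {years:3d} years -> present value {pv:.2f}")
```

At these assumed rates the benefit 15 years out is worth less than half its face value, and 100 years out it is effectively zero, which is the sense in which the net present value of a far-future payoff is very small.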

The reasons why this would be actively harmful are laid out in my post on Postel's Law. Briefly, if the tools are used to reject files that they believe are malformed or have incorrect MIME types, content that in fact renders perfectly will not be preserved. For example, Andy's draft paper cites the example of DROID rejecting mis-terminated PDF, which renders just fine. Not preserving mis-terminated PDFs would be a mistake, not to mention a violation of Postel's Law.
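The difference between strict validation and a Postel's-Law ingest policy can be sketched as follows. This is my illustration, not DROID's actual logic: the `%%EOF` trailer check stands in for whatever a strict identification tool enforces, and the spec does require that marker at the end of a PDF:

```python
# Sketch: a strict validator rejects a PDF lacking the %%EOF marker
# required by the spec; a tolerant policy preserves it anyway and
# leaves judging render-ability to the renderers of the future.

def has_pdf_eof_marker(data: bytes, window: int = 1024) -> bool:
    """Strict check: is %%EOF present in the file's final bytes?"""
    return b"%%EOF" in data[-window:]

def should_preserve(data: bytes) -> bool:
    """Postel's-Law policy: accept anything that even looks like a PDF."""
    return data.startswith(b"%PDF-")

mis_terminated = b"%PDF-1.4\n...content...\ntrailer\n"  # no %%EOF
assert not has_pdf_eof_marker(mis_terminated)  # strict tool rejects it
assert should_preserve(mis_terminated)         # tolerant policy keeps it
```

The point of the sketch is that the tolerant predicate, not the strict one, matches what browsers and PDF viewers actually do with such files.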

Evolution of web formats

Andy puts words that I don't think I have ever said into my mouth when he says:
Our initial analysis supports Rosenthal’s position; that most formats last much longer than five years, that network effects to appear to stabilise formats, and that new formats appear at a modest, manageable rate.
If by "formats" Andy means "successful formats" I believe there are plausible arguments in favor of each of these positions, but I don't recall ever making them. Here are my versions of these arguments:
  • "most formats last much longer than five years" The question is, what does a format "lasting" mean in this context? For Andy's study, it appears to mean that a format forms a significant proportion of the resources crawled in a span of years. A successful new format takes time to infect the Web sufficiently to feature in crawls. Andy's data suggests several years. Although studies suggest that the mean life of a Web resource is short, this is a long-tailed distribution and there is a significant population of web resources whose life is many years. Putting these together, one would expect that it would be many years between the introduction of a successful new format and its no longer forming a significant proportion of the resources crawled.
  • "network effects to appear to stabilise formats" The preceding and following arguments make it likely that only a few of the potentially many new formats that arise will gain significant market share and thus that formats will appear to be relatively stable. The role of network effects, or rather Brian Arthur style increasing returns to scale, in this is indirect. They mean that new ecological niches for formats that arise are likely to be occupied by a single format. Since new niches arise less frequently than new formats, this acts to reduce the appearance of successful new formats.
  • "new formats appear at a modest, manageable rate" The barriers to a new format succeeding are significant. It has to address needs that the existing formats don't. It has to have easy-to-use tools for creators to use. It has to be supported by browsers, or at least plugins for them. It and its tools have to be marketed to creators. All these make it likely that, however many new formats are introduced each year, only a few will succeed in achieving significant market share.

Non-web formats

I have discussed this problem as it relates to Web formats. Jonathan Rochkind asks about non-Web formats:
Do you think this is less likely to happen in the post-web world? There are still popular formats in use that are not 'of the web', such as MS Word docs (which can not generally be displayed by web browsers) -- are old (but not that old, post-web) versions of MS Word more likely to remain renderable in this post-web world than my old WordPerfect files?
First, as regards old WordPerfect files, I addressed this in my 2009 CNI plenary here:
The Open Office that I use has support for reading and writing Microsoft formats back to Word 6 (1993), full support for reading WordPerfect formats back to version 6 (1993) and basic support back to version 4 (1986).
I believe that licensing issues mean that this comment now applies to Libre Office not Open Office.

Second, Microsoft has effectively lost the ability to drive the upgrade cycle by introducing gratuitous format incompatibility because their customers rebelled against the costs this strategy imposed on them. So, the reasons are slightly different but the result in terms of format obsolescence is similar.

Disagreeing with Jeff Rothenberg, Round 2

Andy cites a talk by Jeff Rothenberg at Future Perfect 2012 that hadn't come to my attention. A quick scan through the slides suggests that Jeff persists in being wrong about format obsolescence, and that I need to prepare another refutation.


Jonathan Rochkind said...

Thanks. For what it's worth, I have personal WordPerfect files on my hard drive that are older than 1993, probably as old as 1988 or so. I don't believe there is any good way of accessing them with original formatting -- although I have succeeded in extracting the 'text stream' from them, with reasonable fidelity, which is good enough for my purposes.

(It's also possible that a preservation organization could find ways to render/translate my old documents with formatting more-or-less intact that I have not).

I tend to think you're right that currently resources are being put into obsolescence-planning out of proportion with actual risks.

But I'm not sure that argument is well-served by the implication that no common format for electronic cultural heritage has or will become obsolete (infeasible or impossible to view due to lack of available software that runs on contemporary systems). I think content has gone obsolete, and will continue to.

But I agree that treating _all_ formats as if they could become obsolete at any time is probably not a good use of resources. It makes more sense to monitor for formats in your collection that are in _actual_ danger of becoming obsolete at any given time. Although that does require resources to know what you have (oops, format identification again) and pay attention to threatening obsolescence.

It will require some judgement calls. But archival work always has, right? It also means giving up the idea that you can ensure the continued preservation and access of _everything_ -- sometimes your judgement calls will be wrong and you'll lose things or lose the ability to access them. This too is not actually anything new for archival work (only with non-existent infinite resources could you ensure the preservation of 'everything', not even 'everything' that you intentionally accessioned -- it's always a question of economics, which is where much of your work has been so useful) -- but neither is the desire to ignore this fact and think/act as if you can in fact preserve everything.

Matt said...


I misinterpreted your flagging of the failure of DROID to identify textual formats well as an admonition to do better! As an ex-manager of the DROID project (versions 4-6), I hope I can be forgiven for this reaction...

In fact, I agree with your position that format obsolescence should not be a principal focus of digital preservation activity.

A bigger driver for file migration at the National Archives was the requirement to make the collection broadly accessible. This involved converting many different formats into a much smaller set of formats, for which most people would already have direct or plugin support in their browsers. Clearly, text formats in this context need no migration - they are already broadly accessible. I also agree that web formats are better interpreted by browsers in the first place.

I do still believe there is a need for better file identification, and particularly for textual formats. Some of the use cases we were looking at in the DROID 5 and 6 projects were around information risk management.

It was common to find organisations with terabytes of expensive data stuffed with largely unknown aging content. They wanted to be able to profile the content to get a handle on what they had, with the ultimate goal of being able to delete large portions of it, and to only store and ultimately archive what was essential.

Getting an idea of the types of files in an aging file store, and where they are located helps considerably in guiding more focussed manual review of the content. Duplicate detection also plays a part, which is why DROID 6 included the ability to generate an MD5 hash of the file content.

File identification alone does not provide all (or even most) of the answers in this sort of activity, but it can help. The absence of good text format recognition was flagged as important by several organisations who piloted using DROID in this way. Essentially, there was just a big hole in the profile results given back by DROID, with fairly large numbers of files going unrecognised. This is increasingly important as so many distinct formats are textual these days.

More sophisticated tools that focus on content rather than file type may ultimately do better here, for example, the sorts of e-discovery software used by legal teams to trawl through vast document stores. But these are currently very expensive, and information managers do not typically have vast budgets.

Matt said...

I should add that my last reply is entirely my own opinion, and does not represent the position of the National Archives, either now or when I was employed there.

euanc said...

Hi David,

In regard to your statements that if open source software can "open" files of a format then that format is not obsolete:

"Opening" is not always enough. Often open source software will alter the content presented to users (alter relative to how it was when it was presented through the original rendering software). I posted some examples of this here.

Equally important (but slightly off topic), it is difficult to automate testing to know when it is or is not enough, meaning that we may not be able to rely on assuming that it will be good enough without manual testing of each file (something untenable at large scales).

I would argue (but I won't put words into Jeff's mouth - though I have discussed this with him), that this is what Jeff means by the formats being obsolete, i.e. that no currently supported software renders files of those formats with verifiable integrity. It is only that last portion that differs from your understanding of his stance ("with verifiable integrity").

I believe that the community has an opportunity to ensure formats will never go obsolete by continuing to support the original intended rendering software. By using emulators to run the old software, the formats that depend on it will never go obsolete. But in order to do that there has to be active engagement with emulation (and in particular the software licensing issues) on a large scale.

I appreciated your comments about format identification. I have suggested (and presented on this at the same Future Perfect conference Jeff spoke at) that format identification and, in particular, characterisation are often unnecessary if you pursue a completely emulation-based preservation strategy. If you can ascertain the original/intended rendering software then in many cases you don't need to know much more about the files for preservation purposes. I have a speculative post on the implications of this here.


David. said...

Thanks to all for the comments. I'm sorry for the slow moderation - I'm still on my travels. I'll respond when I get time to think before typing.

euanc said...

An amendment to my comment above: you never used the term "open". I should have referenced your understanding of the term "render" and asked what you thought that meant, rather than suggesting you used the term "open". It does come across as though you use them relatively synonymously, though.

David. said...

Jonathan: I'm very far from suggesting that "no common format for electronic cultural heritage has or will become obsolete". Jeff Rothenberg was right in 1995 to point out that formats in use before that time had gone obsolete. There are current formats that are problematic because they are DRM-ed (e.g. games) or highly proprietary (e.g. mechanical CAD). It is possible that widely used Web formats will eventually go obsolete. However:

- We know that for Web formats (and suspect that for popular non-Web formats) obsolescence is a rare and very slow process.
- As a result, investing resources now in preparing for the possible obsolescence of these formats is simply a waste of resources.
- The techniques being devoted to preparing for obsolescence, such as format identification tools and format specification registries, are unlikely to be effective if and when the formats go obsolete.
- Techniques (running old browsers in virtual machines, and on-demand migration) likely to be effective have been demonstrated. They do not require significant investment of resources now.

I would be interested to know the formats behind your belief that "content has gone obsolete, and will continue to." On the other hand, I agree that we don't have the resources to preserve everything that should be preserved. That makes wasting resources on preparing for the unlikely events a serious problem, since those resources represent content that should have been preserved but won't be.

David. said...

Matt: It may well be that format identification tools are useful in the process of throwing stuff away, although I would have thought that their observed error rates would suggest caution in this application. But throwing stuff away isn't preservation. Preservation resources shouldn't be diverted to assisting it.

David. said...

euanc: I think setting "verifiable integrity" as the goal for preservation is a serious mistake. The Web does not provide "verifiable integrity" in rendering resources. Different browsers render content differently. The same browser on different hardware renders content differently (would you want a pixel-identical rendering on Macs with old and retina displays?). Different versions of the same browser render the same content differently. And in any case no two visitors to the same Web page see the same content these days. Chasing after the unattainable goal of "verifiable integrity" is both a vast resource sink, and beside the point. Is the Web's lack of "verifiable integrity" something that prevents users from getting what they need from it?

The Internet succeeded where X.25 failed because it took a minimal, low-cost, "best-efforts" approach. The Web succeeded where Project Xanadu failed because it took a minimal, low-cost, "best-efforts" approach. We should learn from these examples. It was easy for proponents of X.25 to disparage TCP/IP as unreliable and inadequate. It was easy for Ted Nelson to disparage the Web as unreliable and inadequate. Many of these criticisms were true, but they were beside the point. Insisting on perfection in an imperfect world is a recipe for failure. Would the world be better off if we preserved a small fraction of our heritage perfectly, or a much larger fraction in a rough and ready way that matched the way users experienced it in the real world?

We need minimal, low-cost, "best-efforts" approaches to digital preservation. I believe that in many cases this is what institutions are actually doing. But they don't talk about it because they will be criticized as not doing "real preservation". See, for example, Jeff Rothenberg's dismissal of LOCKSS networks and Portico.

euanc said...

Hi again David,

There are many problems I see with your stance (which is a very popular position on this topic), such as the potentially excessive cost of just-in-case migration vs just-in-time emulation. But I'll address just one core point in this comment: that migration-based preservation essentially cannot preserve the things we want to preserve without excessive manual intervention and therefore cost:

1) We know that many objects can be altered in ways that change the content presented to users when the object is rendered using different software than the intended rendering environment, or migrated using standard migration tools (I consider it best practice to assume the rendering/interaction environment is part of the object until proven otherwise).

2) Changing the content that is presented to users is changing the object (what else is the object but the set of content?) and therefore not preserving it.

3) We have no adequate tools for automatically checking whether the content we are trying to preserve has changed from the original. To check manually is practically impossible at the scales any significant archive deals with.

4) We therefore have no way of automatically knowing whether or not any particular object has had its meaning changed and therefore has not been preserved.

5) Knowing that the content of an object may be different to the original (and often is), and not knowing that it is not different, introduces a level of uncertainty to the integrity of every object that has undergone a similar change process. This uncertainty will in many/most cases be unacceptable, as it equates to not knowing whether we have preserved the object or not (because of 4) above).

6) I believe that not knowing whether we have preserved the object or not is no better than not preserving it at all.

7) We can and ought to do better. Emulation is practical, scalable, benefits heavily from economies of scale, requires little initial investment on most institutions' behalf, costs can be end-user loaded (preventing the issue of contemporary users having to pay for services that they don't benefit from), and finally, emulation provides a better chance of being certain that we have preserved the objects (not changed their meaning).

I agree that
"We need minimal, low-cost, "best-efforts" approaches to digital preservation".
However I don't believe that migration fits any of those criteria as well as emulation does.


Jonathan: It is quite straightforward to find and run old versions of WordPerfect in an emulator. If you want to try that, I'd be happy to help.

Matt said...


The observed identification error rates don't matter so much in this sort of application, as no automatic deletion is being done on the basis of the results. The goal is to provide a profile of what kinds of files and their associated metadata exist in which locations in a large file store. Any decision to delete information is only taken after a review of the content.

In fact, digital preservation resource was not expended on this. The DROID 5 and 6 projects were co-funded by the PLANETS project and by the Digital Continuity project, with Digital Continuity paying the lion's share. Digital Continuity was explicitly set up to look at how organisations manage digital data before archiving, and this money did not come from an existing digital preservation budget.

This does have a big side benefit when it comes to archiving, as one of the problems of digital preservation at TNA was the sheer volume of data which was being thrown at the archive, due to organisations simply not knowing what they had, and essentially washing their hands of their legal obligations by giving it all (i.e. dumping it) to the archives. This considerably increased the expense and complication of any eventual digital preservation.

So in fact, digital preservation benefited directly (in terms of funding for DROID development), and indirectly (in terms of better and lower volumes of data arriving at the archives) from a non-digital preservation resource!

David. said...

euanc: you are still assuming that there is "one true rendering" of the object, which migration may change into something else. This may have been true pre-Web, but it definitely is not true in the Web environment. Which of the thousands of views of a web page, each of which is different, is the "one true rendering" that you are attempting to preserve? How do you know which were the "one true browser with the one true set of plugins running on the one true hardware" that you need to preserve in order to emulate the "one true rendering"? In the Web environment the rendering pipeline is under the control of the reader, not the publisher, and is not even knowable by the archive.

I said we have two techniques for rendering preserved content that are known to work well enough, emulation and on-demand migration. In particular cases one might work a bit better than the other. I'm not interested in the emulation vs. migration religious wars; neither is perfect, both are useful.

The point I'm making is that since we have two usable techniques neither of which needs significant up-front investment, making significant up-front investments is a waste of resources.

euanc said...

Thanks for the reply David.

I agree with most of what you said except that:
In cases where there may have been multiple renderings we should try to save the important ones, if we can decide which they are, and shouldn't limit ourselves to preserving just one. And with web sites that can be quite easy: just install a bunch of different browsers with different add-ins on the same (or a few) emulated desktop(s) and let the end user try them all.