Format usage vs. format obsolescence
One of the things Jeff Rothenberg and I agree about is that the subject of our disagreement is format obsolescence, defined as the inability of current software to render preserved content in a format. Unlike Matt Holden's study, Andy Jackson's study is not about format obsolescence at all. It is about format usage: the formats that current tools use to create new content. Andy's draft paper is clear that this is what he is studying, but that clarity comes only after he starts out discussing format obsolescence. As Andy says:
"While it is clear that the observation of a format falling out of use does not imply it is obsolete, the fact that a format is still being used to create new material necessarily means that that format is not yet obsolete."
For example, Andy's Figure 3.5 shows that PDF 1.3 is a significant part of the PDF being generated 10 years after it was introduced. Andy interprets this to refute Jeff's aphorism "Digital Information Lasts Forever—Or Five Years, Whichever Comes First". I tend to believe that Jeff was exaggerating for effect, so I don't pay the aphorism much attention. But it is true that the observed life-spans of Web format usage lend indirect support to my argument that format obsolescence on the Web is slow and rare.
Browser support for old formats
Matt's study, on the other hand, shows that 15-year-old Web image, video and audio content is still rendered perfectly well by current software, especially open-source software. This is direct and substantial support for my argument.
It is important to distinguish between the evolution of a format family, such as HTML or PDF, through successive versions and the replacement of a format by a competitor in a completely different family. Here is the current vanilla Chrome with no additional plugins rendering the very first capture by the Internet Archive of the BBC News site which was, I believe, in HTML 3.2.
Current Chrome doesn't have a problem rendering the 14-year-old HTML. That is because the designers of the successive HTML versions were constrained to remain compatible with the previous versions. They could not force an incompatible change on the browsers in the field, because "there is no flag day on the Internet". In the context of format obsolescence it is misleading to treat HTML 2.0, 3.2, 4.0, 4.01 and so on as if they were different formats. A single HTML renderer will render them all. The introduction of, say, HTML 4.0 has no effect on the render-ability of HTML 3.2.
Note: The missing plugin for the space at the top right is Java.
Format identification tools
Several people responded to my criticism of format identification tools. Matt said:
"I do agree that identification of textual formats is increasingly important, and further efforts are probably needed in this area."
I don't agree and have said so in the past. As regards Web formats, to the extent to which format identification tools agree with the code in the browsers, they don't tell us anything useful, and to the extent to which they disagree with the code in the browsers, they are simply wrong. Applying these tools as part of a Web preservation pipeline is at best a waste of resources and at worst actively harmful.
The reasons why this would be a waste of resources are laid out in my post on The Half-Life of Digital Formats. Briefly, adding costs, such as format identification, to the ingest pipeline in anticipation of future format obsolescence is an investment of resources whose benefits will come only when the format goes obsolete. You cannot know whether or when this will happen, and the evidence we have suggests that, if it ever does, it will be far in the future. You cannot know whether and how much the knowledge the tools generate will assist in mitigating the format's obsolescence, and the evidence we have is that it won't make a significant difference.
These factors make the cost-benefit analysis of the investment come down heavily against making it. To justify the investment you would have to be very certain of things you cannot know, and for which the evidence we have points the other way. Matt's study and the arguments in The Half-Life of Digital Formats suggest that the time before the investment would pay off is at least 15 and perhaps as much as 100 years. The net present value of the (uncertain) benefit this far in the future is very small. The right approach to the possibility of format obsolescence is watchful waiting.
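The discounting argument can be made concrete with a minimal sketch. The 5% discount rate, the 50% chance that the benefit ever materializes, and the function name `present_value` are illustrative assumptions, not figures from either study; only the 15-year and 100-year horizons come from the text above.

```python
def present_value(benefit, years, discount_rate, probability=1.0):
    """Expected present value of a benefit received `years` from now.

    `probability` is the (assumed) chance that the benefit is ever realized.
    """
    return probability * benefit / (1.0 + discount_rate) ** years


# Illustrative assumptions only: a benefit worth 1.0 at payoff time,
# a 5% annual discount rate, and a 50% chance the payoff ever happens.
for years in (15, 50, 100):
    pv = present_value(benefit=1.0, years=years, discount_rate=0.05, probability=0.5)
    print(f"payoff in {years:>3} years: present value {pv:.4f}")
```

Under these assumed numbers, a payoff 100 years out is worth well under one percent of its nominal value today, while the cost of running the tools is paid in full at ingest time.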
The reasons why this would be actively harmful are laid out in my post on Postel's Law. Briefly, if format identification tools are used to reject files that they believe are malformed or have incorrect MIME types, the result will be that content which in fact renders perfectly will not be preserved. For example, Andy's draft paper cites the example of DROID rejecting mis-terminated PDF, which renders just fine. Not preserving mis-terminated PDFs would be a mistake, not to mention a violation of Postel's Law.
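As a rough sketch of what being liberal in what you accept could look like at ingest, the following never rejects content. If a checker is run at all, its complaints are stored as advisory notes alongside the bytes as served. Everything here is hypothetical: `IngestRecord`, `check_pdf_terminator` and `ingest` are invented for illustration and are not DROID's API or any real pipeline's.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IngestRecord:
    url: str
    payload: bytes
    served_mime_type: str
    warnings: List[str] = field(default_factory=list)


def check_pdf_terminator(payload: bytes) -> List[str]:
    """Hypothetical stand-in for a format tool: flags PDFs lacking a final %%EOF."""
    if payload.startswith(b"%PDF-") and not payload.rstrip().endswith(b"%%EOF"):
        return ["PDF is missing its %%EOF terminator"]
    return []


def ingest(url, payload, served_mime_type, checker=check_pdf_terminator) -> IngestRecord:
    """Always preserve the bytes as served; checker output is advisory only."""
    record = IngestRecord(url, payload, served_mime_type)
    record.warnings.extend(checker(payload))
    return record


# A mis-terminated PDF is still preserved, with the complaint noted.
record = ingest("http://example.org/report.pdf", b"%PDF-1.3\n1 0 obj\n", "application/pdf")
print(record.warnings)
```

The design choice is that no code path drops the payload, so a checker that disagrees with the browsers cannot cause renderable content to be lost.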
Evolution of web formats
Andy puts words that I don't think I have ever said into my mouth when he says:
"Our initial analysis supports Rosenthal’s position; that most formats last much longer than five years, that network effects to appear to stabilise formats, and that new formats appear at a modest, manageable rate."
If by "formats" Andy means "successful formats" I believe there are plausible arguments in favor of each of these positions, but I don't recall ever making them. Here are my versions of these arguments:
- "most formats last much longer than five years" The question is, what does a format "lasting" mean in this context? For Andy's study, it appears to mean that a format forms a significant proportion of the resources crawled in a span of years. A successful new format takes time to infect the Web sufficiently to feature in crawls. Andy's data suggests several years. Although studies suggest that the mean life of a Web resource is short, this is a long-tailed distribution and there is a significant population of web resources whose life is many years. Putting these together, one would expect that it would be many years between the introduction of a successful new format and its no longer forming a significant proportion of the resources crawled.
- "network effects to appear to stabilise formats" The preceding and following arguments make it likely that only a few of the potentially many new formats that arise will gain significant market share and thus that formats will appear to be relatively stable. The role of network effects, or rather Brian Arthur style increasing returns to scale, in this is indirect. They mean that new ecological niches for formats that arise are likely to be occupied by a single format. Since new niches arise less frequently than new formats, this acts to reduce the appearance of successful new formats.
- "new formats appear at a modest, manageable rate" The barriers to a new format succeeding are significant. It has to address needs that the existing formats don't. It has to have easy-to-use tools for creators to use. It has to be supported by browsers, or at least plugins for them. It and its tools have to be marketed to creators. All these make it likely that, however many new formats are introduced each year, only a few will succeed in achieving significant market share.
I have discussed this problem as it relates to Web formats. Jonathan Rochkind asks about non-Web formats:
"Do you think this is less likely to happen in the post-web world? There are still popular formats in use that are not 'of the web', such as MS Word docs (which can not generally be displayed by web browsers) -- are old (but not that old, post-web) versions of MS Word more likely to remain renderable in this post-web world than my old WordPerfect files?"
First, as regards old WordPerfect files, I addressed this in my 2009 CNI plenary here:
"The Open Office that I use has support for reading and writing Microsoft formats back to Word 6 (1993), full support for reading WordPerfect formats back to version 6 (1993) and basic support back to version 4 (1986)."
I believe that licensing issues mean that this comment now applies to Libre Office, not Open Office.
Second, Microsoft has effectively lost the ability to drive the upgrade cycle by introducing gratuitous format incompatibility because their customers rebelled against the costs this strategy imposed on them. So, the reasons are slightly different but the result in terms of format obsolescence is similar.
Disagreeing with Jeff Rothenberg, Round 2
Andy cites a talk by Jeff Rothenberg at Future Perfect 2012 that hadn't come to my attention. A quick scan through the slides suggests that Jeff persists in being wrong about format obsolescence, and that I need to prepare another refutation.