Tuesday, October 9, 2012

Formats through time

Two interesting and important recent studies provide support for the case I've been making for at least the last 5 years that Jeff Rothenberg's pre-Web analysis of format obsolescence is itself obsolete. Details below the fold.

The British Library's Andy Jackson presented a study at IPRES that looked at the formats included in the 2.5-billion resource UK Web Domain dataset from 1994 to 2010. He used DROID and Apache Tika to identify the formats, apparently ignoring the MIME types recorded for them.

Tika failed to identify fewer than 1% of the files, and DROID's failure rate was well under 10% for more recent files. The most interesting observation was that both tools were bad at identifying non-HTML/XML text formats such as CSS and JavaScript. This is very poor performance. Browsers do not complain that they can't identify 10% or even 1% of the files they encounter on the Web, even when browsing files from the early days of the Web from the Internet Archive's Wayback Machine. The fact that actual renderers are far better at identifying formats than the format identification tools upon which the digital preservation literature places so much emphasis should be a concern. I discussed the risks posed by failures in these tools in posts in 2007 and 2009.

Andy's presentation contains some interesting graphs showing the evolution of formats such as HTML, with newer versions taking over through time. But it would be misleading to treat this normal evolution within a backwards-compatible family of format versions as being an instance of format obsolescence; the changes between versions do not in these cases impact the render-ability of older content by browsers.

Andy concludes:
  • Network effects do appear to stabilize formats.
  • Once popular formats are fading nevertheless.
  • More sophisticated approach required.
Andy's first conclusion supports my argument. The Web is a hostile environment for format obsolescence for easily understandable reasons. I agree with Andy's second and third conclusions, but they don't affect my argument.

For preservation purposes, the question is not whether the formats being used by newly created Web resources are different from those used by older resources. It is whether the old resources can be rendered by modern browsers, and Andy's study doesn't investigate the render-ability of different formats.

Matt Holden of INA in his presentation to UNESCO's "Memory of the World in the Digital Age" conference addressed precisely this question. INA collects French audio-visual Web history, and Matt:
  • Extracted all files with MIME types indicating audio, video or image content collected in 1996 and 1997. These files are 15+ years old, from the very early days of audio and video on the Web, so would be expected to suffer most from format obsolescence. He found 4607 files from 1996 and 12206 from 1997.
  • Used file identification software to verify the content. FIDO identified 97.3% of the 1996 files and 98.3% of the 1997 files; Tika identified 85.1% of the 1996 files and 87.4% of the 1997 files. These numbers are not directly comparable to Andy's numbers, since the files were pre-filtered by MIME type. But like Andy's, they aren't encouraging numbers.
  • Identified the remaining files by hand. Overall, the 1996 files used 9 formats, none of which were problematic. The 1997 files used 11 formats, one of which was problematic.
  • Rendered and migrated the files in the problematic format, a video format called VivoActive, using the open source MPlayer/MEncoder combo. This required a little bit of juggling between 32-bit and 64-bit libraries.
This is a severe test of my predictions. The files in question are from the early history of the Web and are audio-visual, a genre that would be expected to be the most subject to format obsolescence. The result, as Matt says, is that the:

Vast majority of the audiovisual files identified can apparently still be [rendered] without difficulty.

When Matt says "vast majority" he means 99.8%; only 34 of 16813 files were in the VivoActive format. Even those 0.2% could be rendered by common open-source software.
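
As a quick check of that arithmetic, here is a minimal sketch using only the file counts quoted above. The variable names are mine, chosen for illustration.

```python
# File counts quoted from Matt's presentation.
files_1996 = 4607
files_1997 = 12206
vivoactive_files = 34  # files in the one problematic format

total_files = files_1996 + files_1997
problematic_share = vivoactive_files / total_files

print(total_files)                      # 16813
print(f"{problematic_share:.1%}")       # 0.2%
print(f"{1 - problematic_share:.1%}")   # 99.8%
```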

I believe these two studies are a significant experimental test of, and support for, the following predictions that I made on theoretical grounds:
  • The focus on, and resources devoted to, format obsolescence are disproportionate to the threat.
  • The real threat is economic.
  • The approach that the LOCKSS system demonstrated in 2005 of transparent, on-access format migration is the appropriate, economical way to deal with the format obsolescence that does occur.




18 comments:

bibwild said...

> Browsers do not complain that they can't identify 10% or even 1% of the files they encounter on the Web, even when browsing files from the early days of the Web from the Internet Archive's Wayback Machine.

Generally, browsers do not have access to only the file on disk -- they have access to HTTP content-type headers which advertise the format of that file.

In the absence of those headers (or when presented with _incorrect_ headers), browsers are more often going to do the wrong thing, or throw up their hands and say "I don't know what to do with this" too. As often as the 'preservation' tools? I don't know, but I wouldn't _assume_ they'd do better.

I think you are over-estimating the potency of browsers at identifying formats though. Browsers encounter files in a particular context -- if the browser is downloading a file because it was referenced in a 'link rel=stylesheet' tag, then it's either CSS, or it's an error. And when the browser follows that URL, it gets HTTP headers. If the HTTP headers say it's CSS but are wrong, then the browser will assume it's CSS anyway, it won't magically figure out it's something else. If the HTTP headers do NOT say it's CSS, then the browser will either assume it's CSS _anyway_, or the browser will refuse to do anything with it, under no obligation to figure out what it 'really' is.

If a browser is directed to a Microsoft Word document, it knows it's a Microsoft Word document _only_ via the HTTP headers; if the HTTP headers are missing, the browser won't figure out it's an MS Word document anyway, it'll just throw up its hands. If the HTTP headers say it's an MS Word doc but it's not, the browser will _still_ hand it off to MS Word, it won't have any way to figure out what it 'really' is (nor is it expected to).

Certainly one lesson might be trying to capture files _from_ the web, and preserve the context that browsers have available to them in this scenario, such as HTTP headers. But I think you are over-estimating the success rate of browsers here, and browsers operate in a particular constrained environment with particularly constrained expectations/requirements as well.

David. said...

You might have noticed that in the post I specifically point out that Andy Jackson's study was of the 2.5-billion resource UK Web Domain dataset and Matt Holden's was of French audio-visual Web history ... files with MIME types indicating audio, video or image content.

Both of these datasets are of files collected from the Web together with their headers.

David. said...

The Register reports on Andy Jackson's study, and correctly points out that it supports my case. Like Andy, they confuse incremental evolution of file formats within a family such as HTML with format obsolescence. The one does not imply and almost never causes the other.

David. said...

I forgot to link in this post to my post from 2 years ago that tried to estimate the half-life of digital formats, and showed the awful economics of spending money now to prepare for future obsolescence given that you can't be sure whether or when it will happen.

David. said...

Andy Jackson's paper is here.

Andy Jackson said...

Just wanted to point out that I do not 'confuse' format version usage with format obsolescence. I presented that trend as an interesting feature that may deserve further study, but was careful not to draw conclusions about obsolescence from it. As you say, and as I stated in the talk, a proper understanding of obsolescence requires a better understanding of the consumption of digital media.

I did use the server-supplied MIME type to compensate for the issue with both tools failing to detect text formats. Note I agree with bibwild in that browsers are not _that_ good at this. They can spot CSS and JS using various hints plus a full parser. They are no better at spotting CSV than DROID or Tika. This is a hard, unsolved problem.

Matt said...

It is entirely true that DROID (and probably most other signature based file identification tools) have real problems with text formats.

The issue is that a signature based identification is looking for particular sequences at known locations in the file. Text formats do not typically have this feature, or they are hard to disambiguate from other text files which just happen to contain similar text.
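
A minimal sketch of what signature-based identification amounts to, assuming a tiny hand-written table of well-known magic numbers. The table and the function name `identify_by_signature` are illustrative only; this is not DROID's signature format or engine.

```python
from typing import Optional

# Illustrative (offset, magic bytes, label) signatures. Real tools rely on
# far larger, curated registries; this table is an assumption for the sketch.
SIGNATURES = [
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"GIF87a", "GIF image"),
    (0, b"GIF89a", "GIF image"),
    (0, b"%PDF-", "PDF document"),
    (0, b"\xff\xd8\xff", "JPEG image"),
]

def identify_by_signature(data: bytes) -> Optional[str]:
    """Return the first label whose magic bytes occur at the expected offset."""
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return None  # no fixed byte sequence matched

print(identify_by_signature(b"GIF89a" + b"\x00" * 10))  # GIF image
print(identify_by_signature(b"body { color: red; }"))    # None
```

The second call shows the problem: a CSS file has no required byte sequence at any fixed position, so a lookup of this kind has nothing to match.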

In the DROID 6 project at the UK National Archives, we did look at better forms of text identification. We did quite a bit of research into this area, and came up with sample code which performed *much* better than existing solutions. Unfortunately, due to time and resource, this research was not productionised in DROID 6. Possibly some of it will appear in the new DROID 7 project.

The research approach focussed on several aspects of textual formats.

1) Identification of text encoding, and distinguishing binary files of unknown format from textual files of unknown format.

2) Structured text files: e.g. CSV. These have characteristics which can be heuristically detected (although it's surprisingly complicated to avoid false positives but still identify real files for such 'simple' formats).

3) Syntactic text files: e.g. HTML, Javascript, CSS, RTF. These files rely on certain keywords being present. Scanning for a lot of these is very time consuming using naive approaches. Instead, an Aho-Corasick automaton was created with thousands-to-millions of keywords related to particular formats, and a statistical approach was taken to distinguish formats on this basis. This was surprisingly successful, for example allowing the near perfect disambiguation of RTF format versions based purely on which keywords were located.
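
To make the third approach concrete, here is a rough sketch of keyword scanning with an Aho-Corasick automaton. It is not the DROID research code: the keyword lists are tiny and hypothetical, and a raw hit count per format stands in for the statistical step, which the description above does not spell out.

```python
from collections import Counter, deque

# Hypothetical, lowercase keyword lists (the real work used thousands to millions).
KEYWORDS = {
    "HTML": ["<html", "<body", "<div", "</p>"],
    "CSS": ["font-family", "margin:", "color:", "@media"],
    "JavaScript": ["function", "var ", "document.", "return"],
}

def build_automaton(keywords):
    """Build trie transitions, failure links and per-state outputs."""
    goto, fail, out = [{}], [0], [[]]
    for fmt, words in keywords.items():
        for word in words:
            state = 0
            for ch in word:
                if ch not in goto[state]:
                    goto.append({})
                    fail.append(0)
                    out.append([])
                    goto[state][ch] = len(goto) - 1
                state = goto[state][ch]
            out[state].append(fmt)
    # Breadth-first pass: a state's failure link points at the longest proper
    # suffix of its path that is also a path from the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def score_formats(text, automaton):
    """Count keyword hits per format in a single pass over the text."""
    goto, fail, out = automaton
    counts = Counter()
    state = 0
    for ch in text.lower():
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        counts.update(out[state])
    return counts

automaton = build_automaton(KEYWORDS)
sample = "body { font-family: serif; margin: 0 } a:hover { color: red }"
scores = score_formats(sample, automaton)
print(scores.most_common(1)[0][0])  # CSS
```

The point of the automaton is that the text is read once regardless of how many keywords are loaded, which is what makes very large keyword sets practical.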

Anyway, I do agree that identification of textual formats is increasingly important, and further efforts are probably needed in this area... assuming, of course, that we care about format obsolescence any more...?

Jonathan Rochkind said...

I don't actually disagree with your fundamental thesis, that formats on the web are less likely to go 'obsolescent' for useful definitions of 'obsolescent' than originally thought.

I am just suspicious of the in-passing suggestion that browsers have some special ability to identify formats (or correctly display files of originally unknown format) that format identification tools do not. I'd want to at least see some study of that that attempts to set things up to draw validly useful conclusions.

If one has MIME/IANA content-type available, AND one considers "success" to be determining the format corresponding to the reported content-type (regardless of whether the content-type was actually correct for the file stream), then obviously it's trivial to create a tool with a 100% success rate. This may be all that browsers do.

If on the other hand it matters when the content-type is missing or incorrect -- say 1% of files in the corpus had an incorrect content-type header -- it would not surprise me if most browsers' 'error rate', in terms of the actual file stream not just the reported content type, approached 1% too. I wouldn't assume otherwise, anyway, without some investigation.
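
A small sketch of that argument, assuming a made-up four-item corpus in which the true format of each resource is known. The helper `declared_type` and the corpus are hypothetical; they are not measurements of any browser.

```python
from typing import Optional

def declared_type(headers: dict) -> Optional[str]:
    """Return the media type from a Content-Type header, without parameters."""
    value = headers.get("Content-Type")
    if value is None:
        return None
    return value.split(";", 1)[0].strip().lower()

# Hypothetical corpus: (response headers, what the bytes really are).
corpus = [
    ({"Content-Type": "text/css"}, "text/css"),
    ({"Content-Type": "text/html; charset=utf-8"}, "text/html"),
    ({"Content-Type": "text/css"}, "text/html"),  # mislabeled by the server
    ({}, "image/gif"),                             # header missing
]

# A tool that simply trusts the header is "right" about the header every
# time, but its error rate against the real bytes equals the rate of
# missing or wrong headers.
wrong = sum(1 for headers, actual in corpus if declared_type(headers) != actual)
print(f"{wrong} of {len(corpus)} resources misidentified")  # 2 of 4 resources misidentified
```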

Jonathan Rochkind said...

Formats not 'of the web' though may be more likely to go obsolescent, what are your thoughts? I myself have old WordPerfect files around which I have no convenient way to display, and which it's quite possible there is no good way to display (conveniently or not) with fidelity to original formatting (short of emulating old OS's and getting an old copy of WordPerfect).

Of course, these files pre-date the web. Do you think this is less likely to happen in the post-web world? There are still popular formats in use that are not 'of the web', such as MS Word docs (which can not generally be displayed by web browsers) -- are old (but not that old, post-web) versions of MS Word more likely to remain renderable in this post-web world than my old WordPerfect files?

DClipsham said...

Hello David, thank you for your post.

At The National Archives we are acutely aware of the current limitations of DROID. To these ends we have recently employed a full time File Format Researcher/Signature Developer (me), and my main focus is to improve the coverage of PRONOM/DROID.

We agree with your assertion that the focus on the perceived threat of obsolescence has historically been disproportionate to the actual threat, which is why the focus of PRONOM research has somewhat shifted from obsolescence ‘solutions’, such as migration pathways, and subjective declarations that format X or Y is ‘at risk’, to a more signature-centric approach geared primarily towards maximising identification coverage and accuracy.

We have an open Wiki for users of DROID to submit requirements for the next major release, DROID 7, and Requirement0013 aims to tackle the limitations of identifying text-based formats, including CSS, JS etc. http://droid7.wikispaces.com/requirement0013

The crux of the issue with such formats is thus:

The current DROID engine relies on pattern matching of certain byte sequences in order to assert that a file is an example of a particular format type.

Text based formats, such as XML, HTML, CSS, JS and so forth are harder to characterise, because they either do not necessarily include explicit, required sequences of bytes or strings, or when they do, the ‘rules’ are often loosely applied.

In the case of CSS, a perfectly valid CSS document can contain a sparse amount of data. A CSS file is a CSS file because the browser has been told to treat it as such. While we can determine certain commonalities within a format like CSS, these commonalities are by no means unique to CSS and share characteristics with many other programming languages, so the possibility for false-positives is great.

So it’s an issue we’ll continue to wrangle with and hope to one day soon get right.

Any ideas, from yourself or your readers will be gratefully received and I would encourage anybody with an interest in the future development of DROID to contribute to the Wiki - http://droid7.wikispaces.com

All the best,
David

David. said...

Thank you all. The reason for the regrettable delay in moderating your excellent comments is that I'm travelling with limited net access. I'll respond in detail when I arrive, but I should quickly apologize to Andy for my poor choice of words.

Gary McGath said...

Lots of good points here. Another point to consider is that the real test of obsolescence isn't just whether a file can be recognized, but whether it can be rendered successfully. In the days of HTML 2, this was often a problem as soon as the file hit the Web, with sites displaying messages that amounted to "You're using Netscape? Get Internet Explorer if you want to read our site, loser."

Rendering obsolescence is a harder thing to test, since it's rarely all-or-nothing with HTML. And it points in a direction other than format obsolescence as a primary concern.

David. said...

Sigh! This post turns out to be an object lesson in why I should not write blog posts while I'm scrambling to get stuff off my plate before leaving on a trip. Sorting out the mess will take another whole post, but here are some ideas that will go into it.

Now I've had the chance to read Andy's paper and think about it I understand things better. I believe Andy's study is interesting and useful, but Andy's use in his paper of my disagreement with Jeff Rothenberg as introductory material to illustrate the need for studies of formats through time, while a valid argument, confuses the reader.

One of the things Jeff and I agree about is that the subject of our disagreement is format obsolescence, defined as the inability of current software to render preserved content in a format. Andy's study is not about format obsolescence at all, unlike Matt's study. It is about format usage, the formats current tools use to create new content. Andy's paper is clear that this is what he is studying, but that clarity comes after he starts out discussing format obsolescence.

Many readers miss the distinction. Gary's comment is an example. The HTML2 problems he describes were caused by the introduction of a new format, not by the obsolescence of an old one. The Register's article is another example.

Andy and my post are both wrong in saying that his study supports the case I have been arguing against Jeff Rothenberg. Andy puts words that I don't think I have ever said into my mouth when he says "Our initial analysis supports Rosenthal’s position; that most formats last much longer than five years, that network effects to appear to stabilise formats, and that new formats appear at a modest, manageable rate." If by "formats" Andy means "successful formats" I believe there are plausible arguments in favor of each of these positions, but I don't recall ever making them.

These positions are not relevant to my disagreement with Jeff, since neither of us have been predicting anything about the usage of formats, only about the obsolescence of formats. The usage of formats is interesting but it doesn't affect the preservation of content.

Fortunately, Matt's study actually does support my case against Jeff!

Andy Jackson said...

Hm, yes, that phrasing was rather bad, and I've mixed secondary conclusions in and made them appear as if you were saying them. I'll revise the text.

While it is clear that the observation of a format falling out of use does not imply it is obsolete, the fact that a format is still being used to create new material necessarily means that that format is not yet obsolete. This is why I argued that my evidence supports your position over Rothenberg's, rather than arguing that it proved it.

Clearly, fully understanding obsolescence requires us to understand how resources are consumed rather than created, but I must admit my talk was somewhat clearer about that than the paper is. I'll try to improve this situation as I revise it.

David. said...

In the revision it would be good to provide the reader with the "plausible arguments" I referred to above for the positions you implicitly attributed to me, since they do appear to be supported by your data. I will try to find time to blog my version of these arguments, although of course unlike my arguments about format obsolescence they won't be predictions but an attempt to explain existing data.

Andy Jackson said...

Also, in response to Matt, I'd like to say that even if obsolescence is less of an issue, reliable identification is still useful. Many formats allow various dependencies on other resources to be embedded within them which should be evaluated during ingest. Normalisation is still a perfectly reasonable 'hedge', and in certain cases validation is critical. All these things require identification as a first step, in order to pass the resources down the chain correctly.

In response to Gary, I'm not sure what else you could call this software dependence other than format obsolescence. I can't come up with any sensible, workable and realistic definition of 'format' that does not involve conceding that formats are always defined by software. Format may also be documented by specification, but I've yet to come across a case where the specification does anything other than lag implementation. i.e. as far as I can tell, all specs are 'retro-specs'.

David. said...

My definition of format obsolescence is that content in the format can't be rendered by current software, so I agree with Andy's comment. I also agree that format specifications aren't important in practice.

David. said...

A follow-up post is here.