The British Library's Andy Jackson presented a study at IPRES that looked at the formats included in the 2.5-billion resource UK Web Domain dataset from 1994 to 2010. He used DROID and Apache Tika to identify the formats, apparently ignoring the MIME types recorded for them.
Andy's presentation contains some interesting graphs showing the evolution of formats such as HTML, with newer versions taking over through time. But it would be misleading to treat this normal evolution within a backwards-compatible family of format versions as being an instance of format obsolescence; the changes between versions do not in these cases impact the render-ability of older content by browsers.
- Network effects do appear to stabilize formats.
- Once-popular formats are fading nevertheless.
- More sophisticated approach required.
For preservation purposes, the question is not whether the formats used by newly created Web resources differ from those used by older resources. It is whether the old resources can be rendered by modern browsers, and Andy's study doesn't investigate the render-ability of different formats.
Matt Holden of INA in his presentation to UNESCO's "Memory of the World in the Digital Age" conference addressed precisely this question. INA collects French audio-visual Web history, and Matt:
- Extracted all files with MIME types indicating audio, video or image content collected in 1996 and 1997. These files are 15+ years old, from the very early days of audio and video on the Web, so would be expected to suffer most from format obsolescence. He found 4607 files from 1996 and 12206 from 1997.
- Used file identification software to verify the content. FIDO identified 97.3% of the 1996 files and 98.3% of the 1997 files; Tika identified 85.1% of the 1996 files and 87.4% of the 1997 files. These numbers are not directly comparable to Andy's numbers, since the files were pre-filtered by MIME type. But like Andy's, they aren't encouraging numbers.
- Identified the remaining files by hand. Overall, the 1996 files used 9 formats, none of which were problematic. The 1997 files used 11 formats, one of which was problematic.
- Rendered and migrated the files in the problematic format, a video format called VivoActive, using the open source MPlayer/MEncoder combo. This required a little bit of juggling between 32-bit and 64-bit libraries.
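The first step in the list above, pre-filtering by recorded MIME type, can be sketched roughly as follows. This is only an illustration: the `Record` type, its field names, and `select_av_records` are hypothetical and are not taken from INA's actual tooling, and the sketch assumes each harvested file carries a recorded MIME type and a collection year.

```python
from typing import Iterable, List, NamedTuple, Set

# Top-level MIME types treated as audio-visual content (assumption based on
# the description: audio, video or image).
AV_PREFIXES = ("audio/", "video/", "image/")


class Record(NamedTuple):
    """Hypothetical description of one harvested file."""
    url: str
    mime_type: str  # MIME type recorded at collection time
    year: int       # year the file was collected


def select_av_records(records: Iterable[Record], years: Set[int]) -> List[Record]:
    """Keep records from the given years whose recorded MIME type is audio, video or image."""
    return [
        r for r in records
        if r.year in years and r.mime_type.lower().startswith(AV_PREFIXES)
    ]
```

Filtering on the recorded MIME type only narrows the candidate set; the later steps (tool-based identification, then identification by hand) are what actually verify the content.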
Matt's conclusion: "Vast majority of the audiovisual files identified can apparently still be [rendered] without difficulty."
When Matt says "vast majority" he means 99.8%; only 34 of 16813 files were in the VivoActive format. Even that 0.2% could be rendered by common open-source software.
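The percentages follow directly from the file counts Matt reports. A quick check of the arithmetic, using only the numbers quoted above (the variable names are mine):

```python
# File counts reported for the two collection years
files_1996 = 4607
files_1997 = 12206
vivoactive_files = 34  # files in the one problematic format

total_files = files_1996 + files_1997
problem_share = 100 * vivoactive_files / total_files  # percentage of all files

print(total_files)                     # 16813
print(round(problem_share, 1))         # 0.2
print(round(100 - problem_share, 1))   # 99.8
```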
I believe these two studies are a significant experimental test of, and support for, the following predictions that I made on theoretical grounds:
- Format obsolescence, in the sense of the inability of current software to render the format, is a rare, slow process on the Web. (2009)
- What format obsolescence does occur on the Web will be restricted to formats with very little content in them. (2009)
- Formats with open source renderers are effectively immune from format obsolescence. (2007)
- Format identification tools are an inadequate basis for digital preservation strategies. (2009)