Wednesday, September 16, 2015

"The Prostate Cancer of Preservation" Re-examined

My third post to this blog, more than 8 years ago, was entitled Format Obsolescence: the Prostate Cancer of Preservation. In it I argued that format obsolescence for widely used formats, such as those on the Web, would be rare. If it ever happened, it would be a very slow process, allowing plenty of time for preservation systems to respond.

Thus devoting a large proportion of the resources available for preservation to obsessively collecting metadata intended to ease eventual format migration was economically unjustifiable, for three reasons. First, the time value of money meant that paying the cost later would allow more content to be preserved. Second, the format might never suffer obsolescence, so the cost of preparing to migrate it would be wasted. Third, if the format ever did suffer obsolescence, the technology available to handle it when obsolescence occurred would be better than when it was ingested.
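The first two reasons can be put in back-of-the-envelope form. The sketch below is only an illustration: the discount rate, the time horizon, and the probability of obsolescence are assumed numbers chosen for the example, not figures from the original post, and the function names are hypothetical.

```python
def present_value(cost: float, annual_rate: float, years: float) -> float:
    """Discount a cost paid `years` from now back to today's money."""
    return cost / (1.0 + annual_rate) ** years


def expected_deferred_cost(cost: float, annual_rate: float,
                           years: float, p_obsolete: float) -> float:
    """Expected present value of migrating only if obsolescence actually occurs."""
    return p_obsolete * present_value(cost, annual_rate, years)


# Illustrative assumptions only: 5% discount rate, obsolescence (if any)
# arriving after 15 years, and a 10% chance that it happens at all.
upfront = 1.00
deferred = expected_deferred_cost(cost=1.00, annual_rate=0.05,
                                  years=15, p_obsolete=0.10)
print(f"prepare now: {upfront:.3f}  expected cost if deferred: {deferred:.3f}")
```

The third reason would lower the deferred figure further, since the `cost` paid later would itself be smaller if the available tools improve.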

Below the fold, I ask how well the predictions have held up in the light of subsequent developments.

Research by Matt Holden at INA in 2012 showed that the vast majority of even 15-year-old audio-visual content was easily rendered with current tools. The audio-visual formats used in the early days of the Web would be among the most vulnerable to obsolescence. The UK Web Archive's Interject prototype's Web site claims that these formats are obsolete and require migration:
  • image/x-bitmap and image/x-pixmap, both rendered in my standard Linux environment via Image Viewer.
  • x-world/x-vrml, versions 1 and 2, not rendered in my standard Linux environment, but migration tools available.
  • ZX Spectrum software, not suitable for migration.
These examples support the prediction that archives will contain very little content in formats that suffer obsolescence.

The prediction that technology for access to preserved content would improve is borne out by recent developments. Two and a half years ago the team from Freiburg University presented their emulation framework bwFLA which, like those from the Olive Project at CMU and the Internet Archive, is capable of delivering an emulated environment to the reader as a part of a normal Web page. An example of this is Rhizome's presentation of Jan Robert Leegte's art piece from 2000, untitled[scrollbars]. To display the artist's original intent, it is necessary to view the piece using a version of Internet Explorer contemporary with it, which Rhizome does using bwFLA.

Viewed with Safari on OS X
Increasingly, scrollbars are not permanent but pop up when needed. Viewing the piece with, for example, Safari on OS X is baffling because the scrollbars are not visible.

The prediction that if obsolescence were to happen to a widely used format it would happen very slowly is currently being validated, but not for the expected reason and not as a demonstration of the necessity of format migration. Adobe's Flash has been a very widely used Web format. It is not obsolete in the sense that it can no longer be rendered. It is becoming obsolete in the sense that browsers are following Steve Jobs' lead and deprecating its use, because it is regarded as too dangerous in today's Internet threat environment:
Five years ago, 28.9% of websites used Flash in some way, according to Matthias Gelbmann, managing director at web technology metrics firm W3Techs. As of August, Flash usage had fallen to 10.3%.

But larger websites have a longer way to go. Flash persists on 15.6% of the top 1,000 sites, Gelbmann says. That’s actually the opposite situation compared to a few years ago, when Flash was used on 22.2% of the largest sites, and 25.6% of sites overall.
If browsers won't support Flash because it poses an unacceptable risk to the underlying system, much of the currently preserved Web will become unusable. It is true that some of that preserved Web is Flash malware, so simply asking users to enable Flash in their browsers is not a good idea. But if Web archives emulated a browser with Flash, either remotely or locally, the risk would be greatly reduced.

Even if the emulation fell victim to the malware, the underlying system would be at much less risk. If the goal of the malware was to use the compromised system as part of a botnet, the emulation's short life-cycle would render it ineffective. Users would have to be warned against entering any sensitive information that the malware might intercept, but it seems unlikely that many users would send passwords or other credentials via a historical emulation. And, because the malware was captured before the emulation was created, the malware authors would be unable to update it to target the emulator itself rather than the system it was emulating.

So, how did my predictions hold up?
  • It is clear that obsolescence of widely used Web formats is rare. Flash is the only example in two decades, and it isn't obsolete in the sense that advocates of preemptive migration meant.
  • It is clear that if it occurs, obsolescence of widely used Web formats is a very slow process. For Flash, it has taken half a decade so far, and isn't nearly complete.
  • The technology for accessing preserved content has improved considerably. I'm not aware of any migration-based solution for safely accessing preserved Flash content. It seems very likely that a hypothetical technique for migrating Flash would migrate the malware as well, vitiating the reason for the migration.
Three out of three, not bad!


Dragan Espenschied said...

Great post again!

I think it might be productive to re-think file formats and obsolescence.

The presented examples, image/x-bitmap, image/x-pixmap, x-world/x-vrml, and ZX Spectrum software, are very different types of things.

While classic X11 bitmaps can be rendered with viewers available on *nix, hardly anybody is using these operating systems. For Macintosh and Windows users, these images do not render, so they are unavailable to most people. It can be called obsolete or whatever else, but these users cannot see these images.

Second, VRML is possible to play on current versions of Windows. The VRML standard is incredibly complex, and I haven't seen a tool that would convert VRML97 to another format, including all the scripting, interaction and networking. There is not even a "format" to migrate to, I would say. VRML is, like the Spectrum software, software. There were only three players, like VMs or operating systems, that could actually do everything VRML97 demands: Cosmo, Cortona and blaxxun.

The most widely used format on the web, at least before the JavaScript explosion, was HTML, a pretty well-documented standard. But web authors never cared about standards. They wrote HTML that could only be interpreted fully by Netscape or Internet Explorer, or made assumptions about default font pixel sizes, or the resolution of screens, or all of the above combined.

For some of the archived HTML out there that is not that terrible, but for many pages, accessing them on a contemporary system has a totally distorting effect. The "file format" is not "obsolete", but the files are, or there is just a mismatch between the file and the software used. The scrollbar piece is a great example.

I think obsolescence is becoming a weird concept. During the years of Windows XP monoculture, it was easy to assert obsolescence. Today I think we need to understand in what context something does or doesn't work. Can you see VRML on a Windows 10 workstation without administrator privileges, or a jailbroken iPhone 6s? :) And if it technically works, does it make sense there?

IlyaK said...

Very relevant post!

I wanted to add a new emulator-like tool, just released today, which I'm calling Netcapsule:

Not quite an emulator, but it allows anyone using the Docker container system (on Linux or in a VM) to run virtual browsers in containers, which are delivered to a user's real browser.

For now, the system supports Mosaic, Netscape and latest Firefox, running on Linux. This is just a proof-of-concept and there exists the possibility to extend it to other browsers, including actual emulators.

The system uses the Memento API to allow for cross-archive browsing, and dynamic switching of temporal space as well as the URL.

Additional browsers could be configured with specific extensions or plugins and deployed as custom Docker images for specialized use cases.
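
For readers unfamiliar with Memento, here is a minimal sketch of what a datetime-negotiated request could look like. It is not Netcapsule's code. It assumes a TimeGate reachable at a placeholder prefix (`timegate.example.org` is hypothetical) that accepts the target URL appended to it, and it only builds the request rather than sending it. The `Accept-Datetime` header is the one defined by the Memento protocol (RFC 7089).

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Hypothetical TimeGate prefix; substitute the archive or aggregator you use.
TIMEGATE_PREFIX = "https://timegate.example.org/timegate/"


def build_timegate_request(url: str, when: datetime) -> Request:
    """Build (but do not send) a request asking for `url` as it was near `when`."""
    # Memento expects an RFC 1123 date in GMT in the Accept-Datetime header.
    accept_datetime = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(TIMEGATE_PREFIX + url,
                   headers={"Accept-Datetime": accept_datetime})


req = build_timegate_request("http://example.com/",
                             datetime(2000, 1, 1, tzinfo=timezone.utc))
print(req.full_url)
print(req.header_items())
```

A TimeGate would normally answer such a request with a redirect to the archived copy closest to the requested datetime, which is what makes switching the "temporal space" independently of the URL possible.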