Saturday, April 27, 2013

Software obsolescence doesn't imply format obsolescence

Tim Anderson at The Register celebrates the 20th anniversary of Mosaic:
Using the DOSBox emulator (the Megabuild version which has network connectivity via an emulated NE2000 NIC) I ran up Windows 3.11 with Trumpet Winsock and got Mosaic 1.0 running.
This illustrates two important points:
  • Tim had no trouble resuscitating a 20-year-old software environment using off-the-shelf emulation.
  • The 20-year-old browser struggled to make sense of today's web. But today's browsers have no difficulty at all with vintage web pages.
The fact that the software that originally interpreted the content is obsolete (a) does not mean that there is significant difficulty in running it, and (b) does not mean that you need to use emulation to run it in order to interpret the content, because the obsolescence of the software does not imply the obsolescence of the format. Backwards compatibility is a feature of the Web, for reasons I have been pointing out for many years.


euanc said...

That is a great example of how straightforward emulation can be. That is one of the reasons I believe emulation has great promise for preserving digital content.

Unfortunately, while today's browsers may interpret much of the web very well:
a) we know they are not fully backwards compatible
b) we don't know when they are misrepresenting content
c) browsers' ability to successfully present old content with a high level of integrity does not hold true for many other software environments, e.g. modern word processors on modern operating systems.

euanc said...

The more important point I should have made in that last comment is that while software obsolescence does not imply format obsolescence it does imply content loss.

Software going obsolete is like the death of a language, or worse, it is like the last person who understood a language dying. While others may understand similar languages and be able to interpret much of what is in a written text, they may miss or misinterpret parts of the text, and those parts may be vital to conveying its meaning, or there may be no way to make sense of the text at all.

Software obsolescence implies content loss because files in any format have to be interpreted and have their content presented to users using software. When that software is no longer available (or no longer practically available) then the content that relies on that software interpreting those files is effectively lost.

On the other hand, as far as format obsolescence goes, I agree with you wholeheartedly, it never really seems to happen.

David. said...

You should have been explicit that your claim "browsers are not fully backwards compatible" rests on the blink tag, which was never generally supported or part of the standard, and which has been officially deprecated. Content using blink would not have blinked on many widely used browsers at the time it was created. It is therefore a very weak example for a blanket claim that "browsers are not fully backward compatible".

The blink tag is an example of the fact that for Web content there is no canonical rendering, because there is no one "official" renderer. That is part of the reason why, for Web content, it is meaningless to say "we don't know when they are misrepresenting content" because that implies that, when the content was created, there was a canonical rendering up to which your disfavored one of the many current browsers fails to live.

There wasn't. Even if you emulate a computing environment contemporary with the content you still have to choose one of the many browsers then current. Your choice of this in the future doesn't reflect reality in the past; each reader would have made their own choice of browser and got their own different rendering. Who are you in the future to say that the readers who disagreed with you were wrong?

As far as content published on the Web or elsewhere goes, you need to let go of the idea that it, or its format, are owned by some particular version of some particular software. It is unrealistic and unhelpful.

As far as the problems described in Visual Rendering Matters are concerned, they relate to document formats such as Word Perfect 5.1 from 1991. Formats such as these from before the phase change from private to published are indeed a problem, a problem which is different from the current problem of digital preservation. It is digital archaeology. It uses different techniques, and has a very different cost structure. I made the importance of the phase change clear in my 2009 Spring CNI Plenary.

Emulation is certainly a very valuable tool for digital archaeology because before the phase change formats were indeed owned by particular versions of particular software, so resurrecting that software is essential. But showing that in some special cases from two decades ago software obsolescence causes format obsolescence does not mean that the general statement "software obsolescence doesn't imply format obsolescence" is false.

Criticizing current digital preservation efforts for not using the techniques of digital archaeology is as useful as criticizing historians for finding documents in archives rather than digging them up.

euanc said...

Hi David,

The "blink" example was the first example that came up when I searched in Google. It may be a non-standard element but the fact was it was widely used so the digital preservation community needs to have a solution for preserving pages that include it. Also, I'm sure there are other examples but regardless, the point is that we do not know and we should know what has and hasn't changed if we are going to say we have preserved something.
Your point about there being multiple possible renderings is perfectly correct. However, just because there were multiple possible renderings of an item doesn't mean we shouldn't be trying to provide at least one representative and accurate rendering even if we can't provide every possible rendering. It is surely better than not even addressing the issue at all, and possibly (probably) presenting content to users as "true and accurate" even though it never existed in that form when the content was in use.

There has been very little research done into these rendering issues with more recent digital objects but we have many anecdotal examples of rendering issues from, e.g. using word processors on OSX vs using word processors on Windows or in the cloud (see the comments on the second linked article also). Such examples at least imply that the issue has not gone away. To simply dismiss the issues identified in that report because the examples are old does not resolve the issue. Furthermore the community still needs solutions for those old files. Such files are often the exact types of files Archives are receiving due to their age and the lag in archival transfer policies.

I also don't understand the (seemingly arbitrary) differentiation between digital archaeology and digital preservation. If your preservation processes change the content then how are you doing digital preservation? If instead, to do digital preservation properly, you need to use tools from digital archaeology, why is this a problem? That is simply the reality that has to be addressed. And that is the problem here, nobody appears to be addressing this reality and, for example, trying to establish cost models and processes to enable it to be managed.

David. said...

Euan, we have multiple representative and authentic renderings of Web content that uses blink. Some current browsers support it (despite its being deprecated) and some don't, which is exactly the situation that obtained when the content was created. In other words, nothing about the rendering has changed since the content was created, which I think is the goal of preservation.

Why are you insisting on nominating one of these renderings as the "official" one, and thus changing the situation? Presumably, you think that the content should always blink, which would falsify history.

It is not possible to certify a rendering of published content as "true and accurate" in a world where, when it was published, the choice of rendering engine was up to the reader. So, if in your mind nothing less than "true and accurate" rendering is acceptable, there is no point in preserving any Web content at all. That is a consistent but not very practical viewpoint, and if you stick to it we can stop discussing things because we have nothing to say to each other.

Further, this demand for a "true and accurate" rendering is impossible even in your world. Let's take the example of a document that JHOVE says is Word Perfect 5.1. First, JHOVE could be wrong. If you certify a rendering as "true and accurate" based on a wrong identification by JHOVE, you are wrong. Second, even if JHOVE is right about the format, there is no guarantee that the tool used to write that format was in fact Word Perfect 5.1. You cannot certify a rendering as "true and accurate" based on its being Word Perfect 5.1 and your belief that therefore it was written by version 5.1 of Word Perfect, because there were many tools that could have written a Word Perfect 5.1 document, including later versions of Word Perfect and other competing software.
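To make the identification point concrete, here is a minimal sketch of signature-based format identification. It is not how JHOVE or DROID are actually implemented; the `SIGNATURES` table, the `identify` function, and the sample byte strings are all hypothetical and chosen only for illustration. It shows that a check on leading bytes reports a format family, says nothing about which tool wrote the file, and can be fooled by arbitrary data that happens to start with the right bytes.

```python
# Hypothetical signature table (an assumption for this sketch, not a real
# registry): leading bytes mapped to a coarse format label.
SIGNATURES = {
    b"\xffWPC": "WordPerfect document family",
    b"%PDF-": "PDF",
    b"\xd0\xcf\x11\xe0": "OLE2 compound file",
}


def identify(data: bytes) -> str:
    """Return a best-guess format label based on the leading bytes only."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    return "unknown"


# Two made-up byte strings sharing a signature, imagined as written by
# different tools. The identifier cannot tell them apart.
from_wp51 = b"\xffWPC" + b"body written by Word Perfect 5.1"
from_other_tool = b"\xffWPC" + b"body written by a competing product"

# Arbitrary bytes that merely begin with the signature are misidentified.
not_a_document = b"\xffWPC" + b"\x00" * 16

for sample in (from_wp51, from_other_tool, not_a_document):
    print(identify(sample))
```

All three samples receive the same label, which is the limit of what such an identification can support: a statement about the format, not about the software that produced the file.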

Even if you were right that the author used Word Perfect 5.1, and you run Word Perfect in an emulation, the emulation is just software and may not be perfect. It almost certainly does not match the hardware the writer used to run Word Perfect 5.1, because the emulator writers choose the hardware they emulate on the basis of ease of emulation (NE2000 NICs, VGA displays, ...), not because it was the most popular. And in any case you probably don't know the hardware configuration the document author used, even if that hardware was available in your emulation.

So your "true and accurate" certification is not worth much. Of course, you can argue that these nit-picking details are irrelevant to the big picture of digital preservation. But then why are you allowed to ignore a few nit-picking details, but I am not allowed to ignore a few very similar nit-picking details? At least I am being honest in pointing out that all that is feasible in the real world is best-efforts preservation.

euanc said...

"we have multiple representative and authentic renderings of Web content that uses blink. Some current browsers support it (despite its being deprecated) and some don't"

For the particular "blink" function we may well have representative and authentic renderings available. However more generally we don't know that we have any representative and authentic renderings if we never test them.

I'm not suggesting we nominate a single environment as official, rather that we provide at least one that we believe is authentic and represents one of the environments that was used to interact with the objects when they were in use originally.
At the moment we rarely, if ever, seem to do that. Ideally we would provide multiple environments and users could try them all if they wanted to.

The interaction environment also does not need to be the creating environment. A representative interaction environment for PDF files is unlikely to include Adobe Creative Suite; rather, it is likely to include a contemporaneous version of Acrobat Reader. Identifying representative interaction environments can be done with the aid of tools like JHOVE but these should be confirmed with the content creators/owners upon ingest. And yes, JHOVE, DROID etc. may be wrong. But we should not be relying on such tools anyway. It just further emphasises the need, where possible, to get information from donors/transferring agencies about the software environments needed to interact with the objects to be ingested into the archive.

You might argue that that is impractical. But I would reply that it is not necessarily as difficult as it might seem initially. For things like websites the user gets to choose the environment so it would likely be ok for the archive to choose a representative environment that was contemporaneous with the sites. For documents etc. it may often be perfectly ok to go with the leading product that reasonably interacted with the files when they were in use. Either way, providing such an environment provides a much more authentic and representative experience of the content than using modern software that didn't exist at the time the objects were used.
For government/business archives it will likely be quite feasible to take regular snapshots of standard desktop images and use these as the environments in the future. And with the growing prevalence of desktops-as-a-service it will likely be even easier in the future as archivists will merely have to take one of the desktop images used in the virtualisation tool.

All of this does highlight a need to raise awareness in the wider community about documenting interaction environments when objects are created. This may take time but that doesn't mean it isn't worth doing.

As far as the hardware issues are concerned, I have addressed those here. It basically comes down to the assumption that a representative environment does not have to use the exact same hardware, as it is reasonable to assume that that was not expected of environments at the time the objects were in use.

Nit picking about the details is important because, as they say, the devil is in the details. So I'm pleased you are doing that. All I'm trying to push for is that we give users at least one environment that was likely (or ideally definitely, as specified by the donor/creator/owner etc.) used to interact with the objects at the time they were created/in use. It seems to be a reasonable thing to request given the changes in content that can be seen when objects are interacted with using modern or otherwise inappropriate environments.