Comments on DSHR's Blog: Postel's Law

Comment from Gary McGath, 2009-03-18:

In my experience, the most common cause of JHOVE's reporting files as "not valid" is that dates are not represented in the correct format. This is harmless with all software I've heard of.

JHOVE 2 is planned to do better at reporting just what is wrong with a file, and to allow configuration to overlook problems that are deemed irrelevant. (I wrote most of the code for JHOVE 1.x, but am involved only in an advisory capacity for JHOVE 2.)

Comment from David, 2009-02-18:

Thanks to everyone for a stimulating and useful discussion on both this and the previous post (http://blog.dshr.org/2009/01/are-format-specifications-important-for.html). I'm now preparing a talk on these and related issues which will draw from these discussions.
Once I'm over the hump of this, and when my hands permit, I will return to these topics in a new post.

Comment from Richard Wright (Richard in London), 2009-02-02:

David has asked for references to the University of Cologne work, under Prof. Manfred Thaller, on the effects of bit errors on various types of files. I heard about this work in detail at a conference, Archiving 2008, in Bern in July 2008 (http://www.imaging.org/conferences/archiving2008/program.cfm). The specific paper was "Analysing the Impact of File Formats on Data Integrity" by Volker Heydegger, University of Cologne (Germany). A PPT of that talk is at old.hki.uni-koeln.de/people/herrmann/forschung/heydegger_archiving2008.ppt and a PDF of the text is at http://old.hki.uni-koeln.de/people/herrmann/forschung/heydegger_archiving2008_40.pdf

That work is about the effects of unrecoverable errors in a file, with so-called bit rot being one cause. The effect of such errors is random: some bits cause more problems than others. So Heydegger iterated tests thousands of times, and the metric for "effect of the error" was how many bytes were affected per bit of original error. There is one major finding, which is obvious with hindsight: on files with a simple structure (uncompressed TIFF images, and I've done the same for WAV audio), usually only one to three bytes are affected per bit of "bit rot" (one pixel or one audio sample is erroneous). There is no spreading or magnification of the error. On compressed files (JPG, MPG ...) the magnification factors are enormous -- around 1000 for JPEG, much higher for PNG and JPEG2000.
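That magnification effect is easy to reproduce in miniature. The sketch below is a toy stand-in, not Heydegger's actual test harness: it uses zlib as the compressor, flips one random bit, and counts how many bytes of the decoded payload change.

```python
import random
import zlib

def flip_random_bit(data: bytes, rng: random.Random) -> bytes:
    """Simulate one bit of 'bit rot' at a random position."""
    buf = bytearray(data)
    i = rng.randrange(len(buf))
    buf[i] ^= 1 << rng.randrange(8)
    return bytes(buf)

def bytes_affected(original: bytes, damaged: bytes) -> int:
    """Count differing bytes; a length change counts as damage too."""
    diff = sum(a != b for a, b in zip(original, damaged))
    return diff + abs(len(original) - len(damaged))

rng = random.Random(1)
payload = b"The quick brown fox jumps over the lazy dog. " * 200

# Uncompressed 'format': one flipped bit damages exactly one byte.
damage_raw = bytes_affected(payload, flip_random_bit(payload, rng))

# Compressed 'format': flip a bit in the zlib stream, then try to decode.
stream = zlib.compress(payload)
damage_compressed = []
for _ in range(20):
    try:
        restored = zlib.decompress(flip_random_bit(stream, rng))
        damage_compressed.append(bytes_affected(payload, restored))
    except zlib.error:
        # The stream no longer decodes at all: count the whole payload as lost.
        damage_compressed.append(len(payload))
```

With the uncompressed payload the damage is always a single byte; with the compressed stream most trials lose the entire payload, because the flipped bit either derails the decode or trips zlib's integrity check -- the magnification Heydegger measured.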
One bit of error would affect 1/6 of a lossless JPEG2000 file, and 1/4 to 1/3 of a lossy one.

Measuring the actual functional impact of errors, not just counting the number of bytes affected, is the next stage of this work. Regards, Richard Wright, BBC R&D

Comment from Sheila Morrissey, 2009-02-01:

Hello, David,

There are many threads entwined in this and in your previous post: the importance of format specifications, the usefulness of characterization tools, and the nature of preservation activity. As we have discussed in previous emails growing out of the original question about PDF files and their validity, the usefulness and limitations of those tools, and of the first version of JHOVE in particular, are a matter of active research at Portico and at other preservation institutions. What I would like to do here is try to drill down a little on some of the generalizations in your thought-provoking post.

One might well question, as commenter drpizza has, whether Postel's Law is applicable across all layers of the OSI stack -- does it in fact yield the same benefits at the application layer as at the datagram layer for which it was formulated? The challenges faced by those who must conserve web-harvested files, especially, as you note, HTML files, are precisely one of those economic "externalities" you mention in other posts -- the hidden costs of a practice for which the (preservation) bill is now coming due. In part, this is because observance of Postel's Law, at least so far as HTML is concerned, has been decidedly backwards. Browsers, for reasons concerned more with business advantage than good web citizenship, were conservative in what they accepted -- they did what they could to conserve the brand.
And the poor developers, writing all that JavaScript code to detect browser brand and version and to provide variant pages accordingly, were obliged to be liberal in what they sent. The problem is compounded by the often differing degree to which these same browsers were liberally accepting of (formally) invalid HTML, and the different ways in which they handled or rendered these malformations. There may be nothing to be done about existing artifacts. But we should at least consider whether tolerance of non-normative behavior leads towards promotion of that behavior.

Even assuming that Postel's Law is operative at the application layer (i.e. an instance of a file format), I would argue that it is a simplification to say that digital preservation is always and only on the "accept" side of the law -- or that such acceptance must always be "liberal". That is a matter of use cases, or, perhaps more properly, business cases. What "side" preservation is on, and whether its mode of behavior is blue-state or red-state, is at least in part a function of the remove at which preservation occurs from the origination of the object, and of the leverage which the preservation community of interest has on the producer of the object.

Sometimes preservation is on the "accept" side -- and in "liberal" mode -- only. General web harvesting can be an instance of this type, being remote in time and in influence from the producer of the artifacts being preserved. This will be a matter of degree, but on the whole, such preservation might have no choice but to be liberal in what it accepts. It saves what it can harvest, and hopes, like Mr. Micawber, that "something will turn up" to ensure the objects are viewable in the distant future.
It might or might not be useful, for the management of such a collection, to have available the characterization information supplied by JHOVE or some other tool, but regardless of what those data indicate about the "health" of the object, it will be accepted.

Sometimes preservation is on the "accept" side, but is "conservative" in what it accepts. A repository undertaking a large-scale digitization project is an instance of such a case. It will reject artifacts whose technical metadata indicate they do not conform to the format specification (either the public specification of the format, or an institution-specific profile of that format). In such an undertaking, technical metadata from a tool like JHOVE would be of great use in enforcing quality control before ingest into the archive, whether the digitization is performed in-house or by a vendor.

Sometimes preservation is on both sides. A corporation in a position to standardize internal practice, and with a mandate to preserve business records and business communications, might well be conservative in both what it "sends" (the original wording in Postel's Law), and in what it accepts. Portico, as you mentioned, does in fact "liberally accept" and preserve publisher-provided PDF files irrespective of the validation category returned by JHOVE. But we are also "conservative in what we do." Along with those PDF files, we also preserve multiple manifestations of the content of those files, some in better (JHOVE) shape than the corresponding PDF file; and we migrate manifestations in publisher-specific SGML and XML formats to an NLM standard format (all the while preserving those original manifestations as well), along with the technical (JHOVE) metadata for each of these artifacts. We see the technical metadata as part of an arsenal of tools to assess and ultimately to mitigate risks to assets in the archive.
Such mitigation has on occasion included a feedback loop to content providers who are willing and able to provide corrections for technically defective files.

Preservation of electronic journals, by Portico and by other preservation institutions, is at least potentially closer to the corporate (conservative sender/conservative accepter) end of the spectrum than a general web harvest. The academic library community has made a substantial investment in electronic scholarly journals. This surely constitutes a customer base with at least some leverage on the publishing community to produce preservation-friendly artifacts -- to pressure producers to, in fact, be conservative in what they send, thereby making it less expensive, over the life of the artifact, for preservation institutions to be liberal in what they accept.

As I remarked when discussing your post with digital preservation colleagues last week, it is important to probe and question, as you have done here, the usefulness and the limits of the tools we employ. How do we make them better? Should we be using them at all? We do not want to be in the position of the man looking for his lost keys under the lamppost because that's where the light is. But I think we need to be equally chary of categorical assertions of a single right way, a single tool, a single approach to preservation. The house of preservation, like the house of fiction, has not one window, but a million.
We have still to learn, in part from inspection of those tools and the data they supply, which windows will provide us the best view of our past in the future.

Best regards,
Sheila

Sheila Morrissey
Senior Research Developer, Portico
sheila.morrissey@portico.org

Comment from David, 2009-01-27:

The key difference between the files that Richard is dealing with (uncompressed audiovisual material) and the kind of materials published on the Web (for example) is that Richard's have large amounts of redundancy in them. That's why Richard's collections are so huge. It is also why video and audio compression work so well: they can remove that redundancy.

The "timebase corrector" exploits the redundancy in an uncompressed signal, and the simple formats Richard describes, to repair damage. Thus, unlike the case of PDF, it is possible to build a tool capable of "converting these non-conforming files into conforming files whose renderings are the same", or rather, whose renderings are acceptable.

Unfortunately, storage isn't free. Even Richard struggles to buy enough for his needs. In most cases archives will either get content in complex formats with very little redundancy, or succumb to the temptation to compress out what redundancy exists. Thus, as with PDF, in most cases there will not be a "repair" tool and the Postel's Law problem will apply.

Richard has in the past cited interesting work by, I think, Manfred Thaller of Cologne on the extent of the damage caused by single-byte changes. I haven't found a link to it [Richard?].
Basically, the higher the compression, the greater the span of damage from a single erroneous byte.

Comment from Richard Wright (Richard in London), 2009-01-25:

All -- regarding what to do with Class B and D files (the files that don't render correctly): it may not be a correct analogy, but I still think in terms of how playback equipment coped with problems in the signal coming back from, especially, videotape machines. We used a "timebase corrector" -- which essentially took a signal with gaps and instabilities, and did the best it could to produce, on output, a conforming signal that would record and transmit (i.e. render) properly. I know there are such things as file recovery/repair software programmes, but they aren't in general use (in my experience). In EVERY professional videotape transfer setup, a timebase corrector was used. It was a mark of a professional job. I'm expecting a BBC digital library to do the same: not just test files at input, but correct them (if I can find, or BBC R&D can develop, appropriate "correctors"). The huge difference between audiovisual files and text is that there IS a definition of what a video signal should be, at the pixel and line and frame level, so such files are fully correctable -- IF there is structural information about how the video information is placed in the file. A software "timebase corrector" for video is perfectly understandable and achievable, and indeed something like it is often incorporated in video players. But nothing like it is incorporated in the "stack" of software that handles files in a traditional digital storage system.
Such systems have their own error detection and correction, but nothing specific to the audiovisual structure -- and hence no ability to partly recover a corrupted, damaged, or "improperly written in the first place" file. So for me, files are checked not for "conformance" but literally for errors, and in hopes of correcting those errors. That's an approach which depends entirely on files having well-defined errors -- but I see that as a key characteristic of audiovisual and any other "rigid-format" file (providing, of course, the files are uncompressed and laid down in a simple raster format).

Comment from David, 2009-01-25:

The list of things Leonard and I agree on seems to be:

1. Whatever we feel about it, in the real world there are a lot of PDF files that don't conform to the standards.

2. Whatever we feel about it, we expect most of these non-conforming files to render properly in the renderers that are in use.

3. There is no practical way of converting these non-conforming files into conforming files whose renderings are the same.

These points of agreement are enough to pose the problems for digital preservation. Note that this argument is only about preservation, not about other uses.

1. What is to be done with the knowledge that a file is non-conforming? Clearly, preserving only conforming files is the wrong answer.

2. Can we build an adequate renderer based only on the knowledge contained in the specification? Clearly, the answer is no. We also need the techniques that have been developed to render non-conforming files.
As far as I know, these techniques are documented only in the source code of the renderers.

Comment from Leonard Rosenthol, 2009-01-24:

I agree with your categorization of files (A, B, C and D), and I also agree with you that the majority of files fall into A or B, with concerns on C and D.

While I don't have actual numbers to back this up, I can say that from personally dealing with hundreds of thousands of PDFs in the decade-plus that I've been working with PDF, the VAST majority are C. This is because some viewers (such as Adobe Acrobat/Reader) take GREAT pains to ensure that broken files will render -- if at all possible. Why? Because users will ALWAYS blame the "reader" and not the "producer" -- and that leads to tech support calls that (like all companies) we'd prefer to avoid, especially on our free products.

That said, we've included a PDF validator in Acrobat Professional for a number of years now (since Acrobat Pro 7). While some developers have been using it, as they should, to check their documents -- not enough do :(.

Also, it constantly amazes me the number of developers who insist on writing a new PDF creation library (be it open source or commercial) rather than leveraging the work of folks who have come before and already dealt with the problems.
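The leniency/strictness trade-off can be made concrete with PDF date strings, the very field Gary notes JHOVE most often flags. This is a toy sketch using a simplified grammar (the "D:YYYYMMDDHHmmSS" core of the PDF date format, without the timezone suffix), not Acrobat's or JHOVE's actual logic: the conservative parser rejects any deviation, while the liberal one salvages what digits it can.

```python
import re

# Core of the PDF date grammar: "D:" then year, with each finer-grained
# field (month, day, hour, minute, second) optional.
STRICT = re.compile(r"^D:\d{4}(\d{2}(\d{2}(\d{2}(\d{2}(\d{2})?)?)?)?)?$")

def parse_strict(s: str):
    """Conservative: any deviation from the grammar is a hard failure."""
    return s if STRICT.match(s) else None

def parse_liberal(s: str):
    """Liberal: tolerate a missing 'D:' prefix and stray separators,
    keeping whatever leading digits survive (at least a year)."""
    if s.startswith("D:"):
        s = s[2:]
    digits = re.sub(r"\D", "", s)
    return ("D:" + digits[:14]) if len(digits) >= 4 else None
```

A strict checker flags "2009-03-18" as invalid even though the intended date is perfectly recoverable; the liberal parser repairs it to "D:20090318" -- harmless leniency of exactly the kind Gary describes.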
Leonard

Comment from David, 2009-01-24:

I think I'm not being as clear as I hoped.

There are four possible classes of files: [A] files that conform to the standard and render correctly, [B] files that conform to the standard but don't render correctly, [C] files that don't conform to the standard but do render correctly, [D] files that don't conform to the standard and don't render correctly.

I believe the majority of files are in A and pose no problems for preservation.

In theory there are no files in class B, but due to bugs in the renderers it's likely that the set isn't completely empty.

At least for PDF, the union of C and D is known to contain a significant number of files. What we don't know is how many are in C and how many in D. My hypothesis is that the majority are in C, and I am suggesting an experiment that could disprove this hypothesis: randomly sample the files in (C union D), render them, and look at them.

Alas, I don't have access to the files, and even if I did, I don't have the resources to do the work. It concerns me that the teams who have both aren't doing the experiment. This is perhaps because of the embarrassment that would result if my hypothesis were not disproved.

Richard seems to be saying that he has a significant number of files in the union of B and D -- i.e. he knows they don't render but he doesn't know whether they conform to the standard. I suspect from talking to Richard in the past that in his case they are in D and result from corruption of compressed data.

But the more important question is what to do with files that aren't in class A?
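The four-way classification, and the sampling experiment proposed above, can be sketched as follows. This is a toy illustration: the `render_ok` callable stands in for what would really be a human inspecting rendered output, and the conformance judgment for a real validator such as JHOVE.

```python
import random

def classify(conforms: bool, renders: bool) -> str:
    """Map (conforms-to-spec, renders-correctly) to classes A-D."""
    return {(True, True): "A", (True, False): "B",
            (False, True): "C", (False, False): "D"}[(conforms, renders)]

def estimate_c_fraction(nonconforming_files, render_ok, sample_size, seed=0):
    """Randomly sample (C union D) and estimate the fraction that is in C,
    i.e. non-conforming files that nonetheless render correctly."""
    rng = random.Random(seed)
    sample = rng.sample(nonconforming_files,
                        min(sample_size, len(nonconforming_files)))
    in_c = sum(1 for f in sample if render_ok(f))
    return in_c / len(sample)
```

If the estimated C fraction is high, the hypothesis that most non-conforming files render properly survives; a low estimate would disprove it.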
If there is a way to push back on the creator to clean up the mess, then detecting the problem is clearly a good thing.

But in the case of e-journal PDF and, I believe, in most of Richard's cases, there's no way to get the creator to fix it. So, should the file be rejected or preserved as-is? Advocates of rejection seem to place unwarranted faith in the checking tools.

Comment from Richard Wright (Richard in London), 2009-01-21:

I'm no expert on PDFs, but I had assumed checking a file for "errors" at time of input to a digital repository -- which then had the job of being responsible for the content held within the file forever and ever -- was A Good Thing.

But that means knowing what should and should not be worried about, which is of course what you're addressing.

I would certainly think it useful to know that a file would NOT render properly. I see an alarming number of MPEG files of various sorts that hang, or play inconsistently, or throw error messages of various sorts (from various players). For permanence, I'd certainly like to send any audiovisual file to a checker that would give me high confidence that the file was indeed playable, and unlikely to have embedded defects that would cause problems in the future.

We have a new EC project, PrestoPRIME, that's just started, and I hope to write something sensible about the requirements for such a "checker". I was assuming that how JHOVE (2) handled image and audio files would be my starting point. It's not about the "letter of the law" -- it's about identifying problems at "point of entry".
Which comes back to the thorny issue you raise, of what is and isn't a problem.

But clearly (from all our experience of dodgy audiovisual files) there are problems, and I hope, with help from the text-file people and all your years of experience, that PrestoPRIME can come up with a meaningful "checker". Richard Wright, BBC R&D

Comment from David, 2009-01-20:

Sorry for the delay in responding -- I'm still having hand problems.

In response to DrPizza, "perpetuating the problems of today" is what digital preservation is all about.

In response to leonardr, it isn't me that is "looking at file format standards as ONLY focusing on the file format and NOT on the behavior of conforming readers". It is the developers and users of tools such as JHOVE. I am pointing out that the use of such tools is not helping the task of digital preservation. In particular, I am pointing out that these tools do not tell us how legible the preserved content will eventually be.

Comment from drpizza (posted anonymously), 2009-01-16:

Hrm.

I think Postel's Law is one of the main culprits for the shockingly poor quality of software in general.

Postel's Law requires programmers -- who, to be blunt, find it hard enough to write programs that meet, let alone surpass, the specification -- to write programs that not only deal with predictable, in-spec input, but also arbitrary, unpredictable, out-of-spec input.

I do not think that placing this burden on developers is a good idea.
Writing software to deal with known inputs is hard enough. Writing software to deal with unknown inputs takes a task that's already at the limit of human ability and pushes it over the edge.

I am quite sure that if, for example, web browsers had taken a hard-line stance from day one, then we would not today see browsers with the same stability or compatibility problems. It's true that the specs themselves are not perfect, but I suspect that hard-line adherence would also motivate better detection and resolution of flaws in the specifications.

Perhaps you would say that the ship has already sailed, and that the data already exists. Well, OK, there is some amount of extant out-of-spec data. But that's nothing compared to the amount of data that will be created in the future. There is no sense in perpetuating the problems of today.

Comment from Leonard Rosenthol, 2009-01-15:

I have many things to comment on about your post, and will try to do so soon -- but I did want to respond immediately to one thing.

You seem to be looking at file format standards as ONLY focusing on the file format and NOT on the behavior of "conforming readers". The various ISO PDF standards (PDF/X -- ISO 15930, PDF/A -- ISO 19005, PDF/E -- ISO 24517 and PDF 1.7 -- ISO 32000) ALL include NOT ONLY the requirements for the file format BUT ALSO specific requirements for a conforming reader of that format.
So it's NOT JUST about parsing the format according to the specification BUT ALSO about following the rules for rendering it correctly and reliably.

Leonard Rosenthol
PDF Standards Architect
Adobe Systems

Comment from David, 2009-01-15:

This post sheds an interesting light on its predecessor (http://blog.dshr.org/2009/01/are-format-specifications-important-for.html), in which I discuss the plight of the poor sucker chosen to implement a renderer for a format, based only on a preserved specification, long after the renderers that were originally used for it have become obsolete.

If the poor sucker believes the specification, he will produce something that conforms to it. This renderer will object to non-conforming input, and so will not render a substantial proportion of the preserved files in the format at all. The correct approach to building a renderer is to observe Postel's Law, and do the best you can with whatever input you get. The techniques for doing so will be embedded in the code of the renderers that were in use when the format was current; they will not be in the specification at all. If there is an open-source renderer, these techniques can be used directly. Otherwise, they have to be re-discovered ab initio, a difficult and error-prone task, especially since there is no way of comparing the output of the new renderer with any of the originals.

Also, I should have tipped the hat to Keith Johnson, who pointed me to the history of the Robustness Principle (http://ironick.typepad.com/ironick/2005/05/my_history_of_t.html).