"Be conservative in what you do; be liberal in what you accept from others."
It's important not to lose sight of the fact that digital preservation is on the "accept" side of Postel's Law, but it seems that people often do.
On the Digital Curation Centre Associates mail list, Adil Hasan started a discussion by asking:
"Does anyone know [whether] there has been a study to estimate how many PDF documents do not comply with the PDF standards?"
No-one in the subsequent discussion knew of a comprehensive study, but Sheila Morrissey reported on the results of Portico's use of JHOVE to classify the 9 million PDFs they have received from 68 publishers into one of three categories: not well-formed, well-formed but not valid, and well-formed and valid. A significant proportion were classified as either not well-formed or well-formed but not valid.
These results are not unexpected. It is well known that much of the HTML on the Web fails the W3C validation tests. Indeed, a 2001 study reportedly concluded that less than 1% of it was valid SGML. Alas, I couldn't retrieve the original document via this link, but our experience confirms that much HTML is poorly formed. For this very reason LOCKSS uses a crawler based on work by James Gosling at Sun Microsystems to develop techniques for extracting links from HTML that are very tolerant of malformed input; an application of Postel's Law.
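To make the idea of tolerant link extraction concrete, here is a minimal sketch in Python. It is not the LOCKSS crawler's code and is not derived from Gosling's work; it simply leans on the standard library's html.parser, which does not insist on well-formed input. The names TolerantLinkExtractor and extract_links are made up for this illustration.

```python
from html.parser import HTMLParser
from typing import List


class TolerantLinkExtractor(HTMLParser):
    """Collect href values from <a> tags without validating the markup."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names for us, and it
        # reports start tags even when elements are never closed.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> List[str]:
    """Return every link found, doing the best we can with bad input."""
    parser = TolerantLinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Unclosed elements, an unquoted attribute, upper-case names and
    # stray end tags: none of this would pass a validator.
    messy = '<html><body><a href="a.html">one<p><A HREF=b.html>two</b></i>'
    print(extract_links(messy))
```

The point of the sketch is the design choice: the extractor never asks whether the page is valid, only whether it can find something that looks like a link.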
Follow me below the fold to see why, although questions like Adil's are frequently asked, devoting resources to answering them or acting upon the answers is unlikely to help digital preservation.
Why, in a forum devoted to digital curation, would anyone ask about the proportion of PDF files that don't conform to the standards? After all, the PDF files they are asking about are generated by tools. No-one writes PDF by hand. So if they don't conform to the standards, it is because the tool that generated them had a bug in it. Why not report the bug to the tool creator? Because even if the tool creator fixed the bug, the files the tool generated before the fix was propagated would still be wrong. There's no way to recall and re-create them, so digital curators simply have to deal with them.
The saving grace in this situation is that the software, such as Adobe Reader, that renders PDF is constructed according to Postel's Law. It does the best it can to render even non-standard PDF legibly. Because it does so, it is very unlikely that a bug in the generation tool will have a visible effect. And if the bug doesn't have a visible effect, it is very unlikely to be detected, reported and fixed.
Thus we see that a substantial proportion of non-conforming PDF files is to be expected. And it is also to be expected that the non-conforming files will render correctly, since they will have been reviewed by at least one human (the author) for legibility.
Is the idea to report the bugs, which don't have visible effects, to the appropriate tool vendors? This would be a public-spirited effort to improve tool quality, but a Sisyphean task. And it wouldn't affect digital curation of PDF files since, as we have seen, it would have no effect on the existing population of PDF files.
Is the idea to build a PDF repair tool, which takes non-conforming PDF files as input and generates a conforming PDF file that has an identical visual rendering as output? That would be an impressive feat of programming, but futile. After all, the non-conforming file is highly likely to render correctly without modification. And if it doesn't, how would the repair tool know what rendering the author intended?
Is the idea to reject non-conforming files for preservation or curation? This is the possibility that worried me, as it would be a violation of Postel's Law. To see why I was worried, substitute HTML for PDF. It is well-known that a proportion, perhaps the majority, of web sites contain HTML that fails the W3C conformance tests but that is perfectly legible when rendered by all normal browsers. This isn't a paradox; the browsers are correctly observing Postel's Law. They are doing their best with whatever they are given, and are to be commended for doing so. Web crawls by preservation institutions such as national libraries and the Internet Archive would be very badly advised to run the W3C tests on the HTML they collect and reject any that failed. Such nit-picking would be a massive waste of resources and would cause them to fail in their mission of preserving the Web as a cultural artifact.
And how would an archive reject non-conforming files? By returning them to the submitter with a request to fix the problem? In almost all cases there's nothing the submitter can do to fix the problem. It was caused by a bug in a tool he used, not by error on his part. All the submitter could do would be to transmit the error report to the tool vendor and wait for an eventual fix. This would not be a very user-friendly archive.
So why do digital curators think it is important to use tools such as JHOVE to identify and verify the formats of files? Identifying the format is normally justified on the basis of knowing what formats are being preserved (interesting) and flagging those thought to be facing obsolescence (unlikely to happen in the foreseeable future to the formats we're talking about). But why do curators care that the file conforms to the format specification rather than whether it renders legibly?
The discussion didn't answer this question but it did reveal some important details:
First, although it is true that JHOVE flags a certain proportion of PDF files as not conforming to the standards, it is known that in some cases these are false positives. It is not known what JHOVE's rate of false negatives is, which would be cases in which it did not flag a file that in fact did not conform. It is hoped that JHOVE2 (PDF), the successor to JHOVE which is currently under development, will have lower error rates. But there don't appear to be any plans to measure these error rates, so it'll be hard to be sure that JHOVE2 is actually doing better.
Second, no-one knows what proportion of files that JHOVE flags as not conforming are not legible when rendered using standard tools such as Adobe Reader or Ghostscript. There are no plans to measure this proportion, either for JHOVE or for JHOVE2. So there is no evidence that the use of these tools contributes to future readers' ability to read the files which is, after all, the goal of curation. Wouldn't it be a good idea to choose a random sample among the Portico PDFs that JHOVE flags, render them with Ghostscript, print the results and have someone examine them to see if they were legible?
Third, although Portico classifies the PDF files it receives into the three JHOVE categories, it apparently observes Postel's Law by accepting PDF files for preservation irrespective of the category they are in. If so, they are to be commended.
Fourth, there doesn't seem to be much concern about the inevitable false positives and false negatives in the conformance testing process. The tool that classifies the files isn't magic; it is just a program that purports to implement the specification which, as I pointed out in a related post, is not perfect. And why would we believe that the programmer writing the conformance tester was capable of flawless implementation of the specification when his colleagues writing the authoring tools generating the non-conformances were clearly not? Lastly, absence of evidence is not evidence of absence. If the program announces that the file does not conform, it presumably identifies the non-conforming elements. They can be checked to confirm that the program is correct. Otherwise, it presumably says OK. But what it is really saying is "I didn't find any non-conforming elements". So the estimate from running the program is likely to be an under-estimate - there will be false negatives, non-conforming files that the program fails to detect.
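Measuring these error rates need not be complicated. Here is a minimal sketch, assuming someone has hand-checked a sample of files against the specification and recorded, for each file, what the tool said and what the truth was. The function name checker_error_rates and the (flagged, really_nonconforming) pair layout are assumptions made for this illustration; nothing here is part of JHOVE.

```python
from typing import Iterable, Optional, Tuple


def checker_error_rates(
    checked: Iterable[Tuple[bool, bool]]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (false_positive_rate, false_negative_rate).

    Each item is (flagged, really_nonconforming):
      flagged              - the tool reported the file as non-conforming
      really_nonconforming - a human check found it truly non-conforming
    A rate is None when the sample has no files of the relevant kind.
    """
    false_pos = false_neg = conforming = nonconforming = 0
    for flagged, really_nonconforming in checked:
        if really_nonconforming:
            nonconforming += 1
            if not flagged:
                false_neg += 1  # tool said OK, but the file was bad
        else:
            conforming += 1
            if flagged:
                false_pos += 1  # tool complained about a good file
    fp_rate = false_pos / conforming if conforming else None
    fn_rate = false_neg / nonconforming if nonconforming else None
    return fp_rate, fn_rate


if __name__ == "__main__":
    # Invented verdicts, purely to show the call.
    sample = [(True, False), (False, False), (False, False),
              (False, False), (True, True), (False, True)]
    print(checker_error_rates(sample))
```

The expensive part is the hand-checking, not the arithmetic. Note that the false negative rate can only be measured by examining files the tool passed, which is exactly the work that tends not to get done.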
The real question for people who think that JHOVE-like tools are important, either as gatekeepers or as generators of metadata, is "what if the tool is wrong?" There are two possible answers. Something bad happens. That makes the error rate of the tool a really important, but unknown, number. Alternatively, nothing bad happens. That makes the tool irrelevant, since not using it can't be worse than using it and having it give wrong answers.
Thus, to be blunt for effect, we have a part of the ingest pipeline that is considered to be important which classifies files into three categories with some unknown error rate. There is no evidence that these categories bear any relationship to the current or eventual legibility of these files by readers. And the categories are ignored in subsequent processing. Why are we bothering to do this?
This post sheds an interesting light on its predecessor, in which I discuss the plight of the poor sucker chosen to implement a renderer for a format, based only on a preserved specification, long after the renderers that were originally used for it have become obsolete.
If the poor sucker believes the specification, he will produce something that conforms to it. This renderer will object to non-conforming input, so will not render a substantial proportion of the preserved files in the format at all. The correct approach to building a renderer is to observe Postel's Law, and do the best you can with whatever input you get. The techniques for doing so will be embedded in the code of the renderers that were in use when the format was current; they will not be in the specification at all. If there is an open-source renderer, these techniques can be used directly. Otherwise, they have to be re-discovered ab initio, a difficult and error-prone task, especially since there is no way of comparing the output of the new renderer with any of the originals.
Also, I should have tipped the hat to Keith Johnson, who pointed me to the history of the Robustness Principle.
I have many things to comment on about your post, and will try to do so soon - but I did want to respond immediately to one thing...
You seem to be looking at file format standards as ONLY focusing on the file format and NOT on the behavior of "conforming readers". Each of the various ISO PDF standards (PDF/X - 15930, PDF/A - 19005, PDF/E - 24517 and PDF 1.7 - 32000) _ALL_ include NOT ONLY the requirements for the file format BUT ALSO specific requirements for a conforming reader of that format. So it's NOT JUST about parsing the format according to the specification BUT ALSO about following the rules for rendering it correctly and reliably.
PDF Standards Architect
I think Postel's law is one of the main culprits for the shockingly poor quality of software in general.
Postel's law requires programmers--who, to be blunt, find it hard enough to write programs that meet, let alone surpass, the specification--to write programs that not only deal with predictable, in-spec input, but also arbitrary, unpredictable, out-of-spec input.
I do not think that placing this burden on developers is a good one. Writing software to deal with known inputs is hard enough. Writing software to deal with unknown inputs takes this task that's already at the limit of human ability and pushes it over the edge.
I am quite sure that if, for example, web browsers had taken a hard-line stance from day one, then we would not today see browsers with the same stability or compatibility problems. It's true that the specs themselves are not perfect, but I suspect that hard-line adherence here will also motivate better detection and resolution of flaws in the specification anyway.
Perhaps you would say that the ship has already sailed, and that the data already exists. Well, OK, there is some amount of extant out-of-spec data. But that's nothing compared to the amount of data that will be created in the future. There is no sense in perpetuating the problems of today.
Sorry for the delay in responding - I'm still having hand problems.
In response to DrPizza, "perpetuating the problems of today" is what digital preservation is all about.
In response to leonardr, it isn't me that is "looking at file format standards as ONLY focusing on the file format and NOT on the behavior of conforming readers". It is the developers and users of tools such as JHOVE. I am pointing out that the use of such tools is not helping the task of digital preservation. In particular, I am pointing out that these tools do not tell us how legible the preserved content will eventually be.
I'm no expert on PDFs, but I had assumed checking a file for 'errors' at time of input to a digital repository -- which then had the job of being responsible for the content held within the file forever and ever -- was A Good Thing.
But that means knowing what should and should not be worried about, which is of course what you're addressing.
I would certainly think it useful to know that a file would NOT render properly. I see an alarming number of MPEG files of various sorts that hang or play inconsistently or throw error messages of various sorts (from various players). For permanence, I'd certainly like to send any audiovisual file to a checker that would give me high confidence that the file was indeed playable, and unlikely to have embedded defects that would cause problems in the future.
We have a new EC project PrestoPRIME that's just started, and I hope to write something sensible about the requirements for such a 'checker'. I was assuming that how JHOVE (2) handled image and audio files would be my starting point. It's not about the 'letter of the law' -- it's about identifying problems at 'point of entry'. Which comes back to the thorny issue you raise, of what is and isn't a problem.
But clearly (from all our experience of dodgy audiovisual files) there are problems, and I hope with help from the text-file people and all your years of experience that PrestoPRIME can come up with a meaningful 'checker'.
Richard Wright, BBC R&D
I think I'm not being as clear as I hoped.
There are four possible classes of files: [A] files that conform to the standard and render correctly, [B] files that conform to the standard but don't render correctly, [C] files that don't conform to the standard but do render correctly, [D] files that don't conform to the standard and don't render correctly.
I believe the majority of files are in A and pose no problems for preservation.
In theory there are no files in class B, but due to bugs in the renderers it's likely that the set isn't completely empty.
At least for PDF, the union of C and D is known to contain a significant number of files. What we don't know is how many are in C and how many in D. My hypothesis is that the majority are in C, and I am suggesting an experiment that could disprove this hypothesis. That is, to randomly sample the files in (C union D), render them, and look at them.
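The bookkeeping for such an experiment is small. Here is a minimal sketch, assuming a list of file names that the checker flagged and a human verdict (legible or not) for each sampled file once it has been rendered. The names classify, draw_sample and wilson_interval are invented for this illustration, and the choice of a Wilson score interval for the uncertainty is mine, not something from the discussion.

```python
import math
import random
from typing import List, Sequence, Tuple


def classify(conforms: bool, renders: bool) -> str:
    """Map the two yes/no facts about a file onto classes A to D."""
    if conforms:
        return "A" if renders else "B"
    return "C" if renders else "D"


def draw_sample(flagged_files: Sequence[str], k: int, seed: int = 0) -> List[str]:
    """Pick k flagged files at random; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(list(flagged_files), min(k, len(flagged_files)))


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% interval (for z = 1.96) on a proportion."""
    if n == 0:
        raise ValueError("empty sample")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    flagged = [f"file{i:04d}.pdf" for i in range(5000)]  # hypothetical names
    sample = draw_sample(flagged, 100)
    # A reviewer would record True/False per rendered file; these verdicts
    # are invented purely to exercise the arithmetic.
    verdicts = [True] * 90 + [False] * 10
    in_c = sum(1 for ok in verdicts if classify(False, ok) == "C")
    print(len(sample), wilson_interval(in_c, len(verdicts)))
```

If the lower end of the interval sits well above one half, the hypothesis that most flagged files are in C survives; if the upper end sits below one half, it is disproved.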
Alas, I don't have access to the files and even if I did, don't have resources to do the work. It concerns me that the teams who have both aren't doing the experiment. This is perhaps because of the embarrassment that would result if my hypothesis was not disproved.
Richard seems to be saying that he has a significant number of files in the union of B and D - i.e. he knows they don't render but he doesn't know whether they conform to the standard. I suspect from talking to Richard in the past that in his case they are in D and result from corruption of compressed data.
But the more important question is what to do with files that aren't in class A? If there is a way to push back on the creator to clean up the mess, then detecting the problem is clearly a good thing.
But in the case of e-journal PDF and, I believe, in most of Richard's cases, there's no way to get the creator to fix it. So, should the file be rejected or preserved as-is? Advocates of rejection seem to place unwarranted faith in the checking tools.
I agree with your categorization of files (A, B, C and D) and I also agree with you that the majority of files fall into A or B with concerns on C and D.
While I don't have actual numbers to back this up, I can say that from personally dealing with hundreds of thousands of PDFs in the decade+ that I've been working with PDFs, the VAST majority are C. This is because some viewers (such as Adobe Acrobat/Reader) take GREAT pains to ensure that broken files will render - if at all possible. Why? Because users will ALWAYS blame the "reader" and not the "producer" - and that leads to tech support calls that (like all companies) we'd prefer to avoid, especially on our free products.
That said, we've included a PDF validator in Acrobat Professional for a number of years now (since Acrobat Pro 7). While some developers have been using it, as they should, to check their documents - not enough do :(.
Also, it constantly amazes me the number of developers who insist on writing a new PDF creation library (be it open source or commercial) rather than leveraging the work of folks that have come before and already dealt with the problems.
The list of things Leonard and I agree on seems to be:
1. Whatever we feel about it, in the real world there are a lot of PDF files that don't conform to the standards.
2. Whatever we feel about it, we expect most of these non-conforming files to render properly in the renderers that are in use.
3. There is no practical way of converting these non-conforming files into conforming files whose renderings are the same.
These points of agreement are enough to pose the problems for digital preservation. Note, this argument is only about preservation, not about other uses.
1. What is to be done with the knowledge that a file is non-conforming? Clearly, preserving only conforming files is the wrong answer.
2. Can we build an adequate renderer based only on the knowledge contained in the specification? Clearly, the answer is no. We also need the techniques that have been developed to render non-conforming files. As far as I know, these techniques are documented only in the source code of the renderers.
All -- Regarding what to do with Class B and D files (the files that don't render correctly): it may not be a correct analogy, but I still think in terms of how playback equipment coped with problems in the signal coming back from, especially, videotape machines. We used a "timebase corrector" -- which essentially took a signal with gaps and instabilities, and did the best it could to produce, on output, a conforming signal that would record and transmit (i.e. render) properly. I know there are such things as file recovery/repair software programmes, but they aren't in general use (in my experience). In EVERY professional videotape transfer setup, a timebase corrector was used. It was a mark of a professional job. I'm expecting a BBC digital library to do the same: not just test files at input, but correct them (if I can find, or BBC R&D can develop, appropriate "correctors").
The huge difference between audiovisual files and text is that there IS a definition of what a video signal should be, at the pixel and line and frame level, so such files are fully correctable -- IF there is structural information about how the video information is placed in the file. A software 'timebase corrector' for video is perfectly understandable and achievable, and indeed something like it is often incorporated in video players. But nothing like it is incorporated in the 'stack' of software that handles files in a traditional digital storage system. They have their own error detection and correction, but nothing specific about the audiovisual structure -- and hence no ability to partly recover a corrupted or damaged or "improperly written in the first place" file. So for me, files are checked not for 'conformance' but literally for errors, and in hopes of correcting those errors.
That's an approach which depends entirely on files with well-defined errors - but I see that as a key characteristic of audiovisual and any other "rigid-format" file (providing of course the files are uncompressed and 'laid down' in a simple raster format).
The key difference between the files that Richard is dealing with (uncompressed audiovisual material) and the kind of materials published on the Web (for example) is that Richard's have large amounts of redundancy in them. That's why Richard's collections are so huge. It is also why video and audio compression work so well, because they can remove the redundancy.
The "timebase corrector" exploits the redundancy in an uncompressed signal and the simple formats Richard describes to repair damage. Thus, unlike the case of PDF, it is possible to build a tool capable of "converting these non-conforming files into conforming files whose renderings are the same", or rather, whose renderings are acceptable.
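To illustrate how redundancy makes this kind of repair possible, here is a minimal sketch in Python. It is only a loose software analogue, not a description of how a hardware timebase corrector works: it assumes an uncompressed stream of samples in which damaged samples have already been detected and marked as None, and it fills each gap by interpolating between the nearest good neighbours. The function name conceal_dropouts is invented for this illustration.

```python
from typing import List, Optional


def conceal_dropouts(samples: List[Optional[float]]) -> List[float]:
    """Fill missing samples (None) from their nearest valid neighbours.

    Gaps between two good samples are filled by linear interpolation;
    gaps at either end copy the nearest good sample.
    """
    known = [i for i, s in enumerate(samples) if s is not None]
    if not known:
        raise ValueError("no valid samples to interpolate from")
    repaired: List[float] = []
    for i, s in enumerate(samples):
        if s is not None:
            repaired.append(float(s))
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            repaired.append(float(samples[right]))
        elif right is None:
            repaired.append(float(samples[left]))
        else:
            # Position of the gap between its two good neighbours, 0..1.
            t = (i - left) / (right - left)
            repaired.append(samples[left] + t * (samples[right] - samples[left]))
    return repaired


if __name__ == "__main__":
    print(conceal_dropouts([0.0, None, 2.0, 2.5, None]))
```

This only works because neighbouring samples in an uncompressed signal predict one another. Once the redundancy has been compressed out, there are no neighbours to borrow from.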
Unfortunately, storage isn't free. Even Richard struggles to buy enough for his needs. In most cases archives will either get content in complex formats with very little redundancy, or succumb to the temptation to compress out what redundancy exists. Thus, as with PDF, in most cases there will not be a "repair" tool and the Postel's Law problem will apply.
Richard has in the past cited interesting work by, I think, Manfred Thaller of Koln on the extent of the damage caused by single byte changes. I haven't found a link to it [Richard?]. Basically, the higher the compression the greater the span of damage from a single erroneous byte.
There are many threads entwined in this and in your previous post, about the importance of format specifications, the usefulness of characterization tools, and the nature of preservation activity. As we have discussed in previous emails growing out of the original question about PDF files and their validity, the usefulness and limitations of those tools and of the first version of JHOVE in particular are a matter of active research at Portico and at other preservation institutions. What I would like to do here is try to drill down a little on some of the generalizations in your thought-provoking post.
Even assuming that Postel's law is operative at the application layer (i.e. an instance of a file format), I would argue that it is a simplification to say that digital preservation is always and only on the “accept” side of the law – or that such acceptance must always be “liberal”. That is a matter of use cases, or, perhaps more properly, business cases. What “side” preservation is on, and whether its mode of behavior is blue-state or red-state, is at least in part a function of the remove at which preservation occurs from the origination of the object, and of the leverage which the preservation community of interest has on the producer of the object.
Sometimes preservation is on the “accept” side -- and in “liberal” mode -- only. General web harvesting can be an instance of this type, being remote in time and in influence from the producer of the artifacts being preserved. This will be a matter of degree, but on the whole, such preservation might have no choice but to be liberal in what it accepts. It saves what it can harvest, and hopes, like Mr. Micawber, that “something will turn up” to ensure the objects are viewable in the distant future. It might or might not be useful, for the management of such a collection, to have available the characterization information supplied by JHOVE or some other tool, but regardless of what those data indicate about the “health” of the object, it will be accepted.
Sometimes preservation is on the “accept” side, but is “conservative” in what it accepts. A repository undertaking a large-scale digitization project is an instance of such a case. It will reject artifacts whose technical metadata indicate they do not conform to the format specification (either the public specification of the format, or an institution-specific profile of that format). In such an undertaking, technical metadata from a tool like JHOVE would be of great use in enforcing quality control before ingest into the archive, whether the digitization is performed in-house or by a vendor.
Sometimes preservation is on both sides. A corporation in a position to standardize internal practice, and with a mandate to preserve business records and business communications, might well be conservative in both what it “sends” (the original wording in Postel's law), and in what it accepts. Portico, as you mentioned, does in fact, “liberally accept” and preserve publisher-provided PDF files irrespective of the validation category returned by JHOVE. But we are also “conservative in what we do.” Along with those PDF files, we also preserve multiple manifestations of the content of those files, some in better (JHOVE) shape than the corresponding PDF file; and we migrate manifestations in publisher-specific SGML and XML formats to an NLM standard format (all the while preserving those original manifestations as well), along with the technical (JHOVE) metadata for each of these artifacts. We see the technical metadata as part of an arsenal of tools to assess and ultimately to mitigate risks to assets in the archive. Such mitigation has on occasion included a feedback loop to content providers who are willing and able to provide corrections for technically defective files.
Preservation of electronic journals, by Portico and by other preservation institutions, is at least potentially closer to the corporate (conservative sender/conservative accepter) end of the spectrum than a general web harvest. The academic library community has made a substantial investment in electronic scholarly journals. This surely constitutes a customer base with at least some leverage on the publishing community to produce preservation-friendly artifacts – to pressure producers to, in fact, be conservative in what they send, and thereby making it less expensive, over the life of the artifact, for preservation institutions to be liberal in what they accept.
As I remarked when discussing your post with digital preservation colleagues last week, it is important to probe and question, as you have done here, the usefulness and the limits of the tools we employ. How do we make them better? Should we be using them at all? We do not want to be in the position of the man looking for his lost keys under the lamppost because that’s where the light is. But I think we need to be equally chary of categorical assertions of a single right way, a single tool, a single approach to preservation. The house of preservation, like the house of fiction, has not one window, but a million. We have still to learn, in part from inspection of those tools and the data they supply, which windows will provide us the best view of our past in the future.
Senior Research Developer, Portico
David has asked for references to the Univ of Cologne work, under Prof Manfred Thaller, on effects of bit errors on various types of files.
I heard about this work in detail at a conference, Archiving2008, in Bern in July 2008.
The specific paper was:
Analysing the Impact of File Formats on Data Integrity
Volker Heydegger, University of Cologne (Germany)
A PPT of that talk is here:
And a PDF of the text is here:
That work is about the effects of unrecoverable errors in a file, with so-called bit rot being one cause. The effect of such errors is random: some bits cause more problems than others. So Heydegger iterated tests thousands of times, and the metric for 'effect of the error' was how many bytes were affected per bit of original error. There is one major finding, which is obvious with hindsight: on files with a simple structure (TIFF uncompressed images, and I've done the same for WAV audio), usually only one to three bytes are affected per bit of 'bit rot' (one pixel or one audio sample is erroneous). There is no spreading or magnification of the error. On compressed files (JPG, MPG ...) the magnification factors are enormous -- around 1000 for JPEG, much higher for PNG and JPEG2000. One bit of error would affect 1/6 of a lossless JPEG2K file, and 1/4 to 1/3 of a lossy one.
Measuring the actual functional impact of errors, not just counting the number of bytes affected, is the next stage of this work.
Regards, Richard Wright, BBC R&D
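The "bytes affected per bit of error" metric is easy to try at small scale. Here is a minimal sketch in Python, with loud caveats: it is not Heydegger's method or code, it uses raw DEFLATE from the standard zlib module as a stand-in for the image formats he studied, and the function names are invented for this illustration. It flips a single bit, decodes as much as it can, and counts how many bytes of the result differ from the original.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


def bytes_affected(original: bytes, damaged: bytes) -> int:
    """Count differing bytes; missing or extra bytes count as affected."""
    differing = sum(a != b for a, b in zip(original, damaged))
    return differing + abs(len(original) - len(damaged))


def deflate(data: bytes) -> bytes:
    """Compress to a raw DEFLATE stream (no header, no checksum)."""
    c = zlib.compressobj(9, zlib.DEFLATED, -15)
    return c.compress(data) + c.flush()


def inflate_tolerantly(stream: bytes) -> bytes:
    """Decode byte by byte, keeping whatever came out before any error."""
    d = zlib.decompressobj(-15)
    out = bytearray()
    try:
        for i in range(len(stream)):
            out += d.decompress(stream[i:i + 1])
    except zlib.error:
        pass  # stop at the first undecodable point, keep what we have
    return bytes(out)


def damage_after_one_bit(data: bytes, bit_index: int, compressed: bool) -> int:
    """Bytes of content affected by one flipped bit in the stored form."""
    if not compressed:
        return bytes_affected(data, flip_bit(data, bit_index))
    stream = deflate(data)
    damaged = flip_bit(stream, bit_index % (len(stream) * 8))
    return bytes_affected(data, inflate_tolerantly(damaged))


if __name__ == "__main__":
    content = bytes(range(256)) * 8
    print(damage_after_one_bit(content, 100, compressed=False))
    print(damage_after_one_bit(content, 100, compressed=True))
```

In the uncompressed case the answer is always exactly one byte. In the compressed case the answer depends on which bit is hit, which is why a real study has to repeat the test many times, as the paper did.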
Thanks to everyone for a stimulating and useful discussion on both this and the previous post. I'm now preparing a talk on these and related issues which will draw from these discussions. Once I'm over the hump of this, and when my hands permit, I will return to these topics in a new post.
In my experience, the most common cause of JHOVE's reporting files as "not valid" is that dates are not represented in the correct format. This is harmless with all software I've heard of.
JHOVE 2 is planned to do better at reporting just what is wrong with a file, and to allow configuration to overlook problems that are deemed irrelevant. (I wrote most of the code for JHOVE 1.x, but am involved only in an advisory capacity for JHOVE 2.)