Monday, July 13, 2009

Spring CNI Plenary: The Video

CNI has now posted the video of Cliff Lynch's introduction, my plenary presentation, and the questions.

How Are We Ensuring the Longevity of Digital Documents? from CNI Video Editor on Vimeo.



I gave a significantly shortened version of this talk at the Sun PASIG meeting in Malta June 26.


Thursday, June 4, 2009

Hard Disk Drives: The Good, the Bad and the Ugly

Jon Elerath just published a wonderful paper in the June 2009 Communications of the ACM entitled "Hard Disk Drives: The Good, the Bad and the Ugly". Everyone, especially anyone who believes bit preservation is a solved problem, should read it. He clearly communicates the incredible complexity of the technology inside the familiar 3.5" drive form factor.

Elerath reviews the range of hard disk failure modes, and shows how difficult it will be for disk manufacturers to keep drive reliability constant as disks get bigger. And even if they succeed in doing so, the bit reliability they deliver goes down. He says:

"Multi-terabyte capacity drives using perpendicular recording will be available soon, increasing the probability of both correctable and uncorrectable errors by virtue of the narrowed track widths, lower flying heads, and susceptibility to scratching by softer particle contaminants."

Thus, as I have been saying for a while, just as we are trying to preserve larger and larger numbers of bits, the technologies we use to make those bits reliable are not keeping pace. Elerath concludes:

"Only when these high-probability [failure] events are included in the optimization of the RAID operation will reliability improve. Failure to address them is a recipe for disaster."

I agree that RAID technology needs to adapt to the decreasing bit reliability and longer time to repair of newer disk drives. But, as I argued in my iPRES2008 paper (pdf), even if we do a good job of adapting RAID to cope with these problems we will still be many orders of magnitude below the reliability levels digital preservation needs.
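
A rough way to see why growing capacity strains RAID is to estimate the chance of hitting at least one unrecoverable read error while reading an entire drive, as a rebuild must. The sketch below assumes a fixed unrecoverable bit error rate of 1 in 10^14 bits read, a figure commonly quoted on consumer drive data sheets and not taken from Elerath's paper. The function name and the capacities are illustrative only, and errors are treated as independent, which real drives do not guarantee.

```python
import math

def p_unrecoverable_read(capacity_bytes, uber=1e-14):
    """Probability of at least one unrecoverable read error when every
    bit of a drive is read once. Assumes independent errors at a fixed
    per-bit rate `uber` (an assumed value, not a measured one)."""
    bits = capacity_bytes * 8
    # 1 - (1 - uber)**bits, computed in a numerically stable way
    return -math.expm1(bits * math.log1p(-uber))

# Hypothetical capacities in decimal terabytes
for tb in (0.5, 1, 2, 4):
    p = p_unrecoverable_read(tb * 1e12)
    print(f"{tb} TB: {p:.1%} chance of an unrecoverable error on a full read")
```

Under these assumptions the chance is about 8% for a 1 TB drive and about 27% for a 4 TB drive, so the per-bit error rate would have to improve as fast as capacity grows just to stand still.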


Wednesday, May 6, 2009

Sheila Morrissey's comment

Portico's Sheila Morrissey posted a valuable comment on the post that provided the background and sources for my CNI plenary. It set out the conventional wisdom against which I was arguing, but at such length that I felt it was inhibiting discussion. It was also difficult to respond to by adding a comment, among other reasons because there was no easy way to connect my responses to their targets in the comment. I therefore saved the text of Sheila's comment, deleted it from the original post, and reproduced it below the fold, together with my responses. Portico has posted a version of her comment here.



My talk was in three parts. They argued:
  • that Jeff Rothenberg, whose Scientific American article drew attention to the problem of preserving digital documents, naturally saw format obsolescence as the major problem because the desktop publishing paradigm that was predominant as he wrote in 1995 saw documents and their formats as the property of applications,

  • that developments since, primarily the Web, have transformed documents and their formats into publishing infrastructure, and that we have many years of experience showing that economic and technical forces prevent infrastructure, such as network protocols, file systems and operating systems kernels, changing in incompatible ways, thus diminishing the threat of format obsolescence,

  • and that our current techniques, based on Jeff's analysis, are inadequate to society's needs for digital preservation, in part because they are too expensive, and in part because they do not address the technical and even conceptual problems of the effect of the Web on publishing.

To argue against the second part, Sheila would need to show that format obsolescence has in fact been causing documents to be lost. To argue against the third part, Sheila would need to show that our current techniques are in fact affordable and adequate to the technical problems. Sheila does neither of these.

In my overview of the talk I compared it to Stephen Sondheim's Into The Woods, because in both cases the happy ending is in the middle. Anyone who was there, as Sheila was not, would tell you that the part that looked forward was at least as pessimistic about the future as Jeff's article. Sheila writes as if this invited CNI plenary were a sales pitch for a particular system, rather than an indictment of all current systems. For example,
LOCKSS is exactly co-extensive with what you propose as the totally sufficient preservation strategy: just get the bits and save the bits, and use open-source software to do so.
Sheila seems unable to address the actual text of my post, erecting instead a straw-man created from manipulated quotes.

Sheila, it must be pointed out, works for an organization whose business model depends on persuading libraries that the problems of digital preservation are so complex and difficult that they cannot and should not attempt to tackle them for themselves, but should pay experts to handle them instead. The techniques they use to perform this persuasion can be divided into three groups:

  • Denigration: downplaying libraries' capabilities,

  • Exaggeration: making the problems look more difficult than they are, primarily by claiming that many tasks are essential which are in fact optional,

  • Obfuscation: confusing the uninitiated reader by using undefined jargon and failing to cite definitions and sources, thus making the arguments accessible only to experts.

Note in particular that Sheila provides only three links, two to a comment she posted earlier, and one to a paper I wrote 6 years ago. She makes many sweeping assertions but fails to back any of them up with evidence, a common way to Obfuscate, and especially ironic in a comment on a post supplying the sources for the assertions I made in my talk.

Hello David,

Just some thoughts on your slides and comments, as together we all grapple with the challenges we face to ensure the legacy of digital artifacts --

In the words of that profound cultural historian, Yogi Berra,

“If you don't know where you're going, you'll wind up somewhere else.” And, as he might have added: “If you don't know where you are, you'll have a hard time getting there.”
Sheila starts with Obfuscation. Or is it just because I'm an Englishman that the relevance of this escapes me?
The lesson of what you call modern history (1995 to 2009), you say, is that all those preservation risks we anticipated in the bad old days of ancient history (the pre-1995 assumptions that we'll have “documents in app-specific formats, hardware & O/S will change rapidly in ways that break applications, apps for rendering formats have a short life”) have been conjured away by the market's invisible hand.
Well, no, I did not say that. The idea that "all those preservation risks ... have been conjured away by the market's invisible hand" is Sheila's Exaggeration. The slide from which she takes the quote was describing Jeff Rothenberg's dystopian vision of the future for documents, not my views. Further, she omits two key bullets. The full quote is:
  • Documents are in app-specific formats,

    • Typical formats are proprietary

    • Attempts to standardize formats will fail

  • Hardware & O/S will change rapidly

    • In ways that break applications

  • Apps for rendering formats have a short life
I provided examples and sources to back up the contention that these predictions have not come true.
Sadly, it's not so.

It was not so in 2003, as you have described elsewhere, when it was found necessary to send old Ethernet cards to enthusiastic early adopters of LOCKSS, because the Ethernet chips in the inexpensive but new hardware they purchased were incompatible with the older OpenBSD drivers in the LOCKSS software distribution.
Implying that a few support calls more than 6 years ago (resolved, if memory serves, at a cost of $2.99 plus postage and packing) show that libraries can't do digital preservation is an example of Denigration. But in any case, Sheila is also using Exaggeration here. Her example is of forwards incompatibility: the old software was incapable of driving the new chips. The digital preservation problem is one of backwards, not forwards, incompatibility: new software incapable of rendering old formats. One release later, OpenBSD could drive those chips.
And it's not so in 2009, either. We still have multiple hardware architectures and operating systems – not just mainframes and servers and desktops, not just Z10 and Windows XP and Windows Vista and Solaris and the many flavors of Linux – but also cell phones and PDAs, IPODs and Walkmans and Zunes, navigation appliances like TomTom, eBook readers like Kindle, game boxes like Xbox and Wii. And we have whole zoo of new applications and new formats to go with them. Some are merely delivery formats, but some are actual content formats. Some are open; some are proprietary. We have the many variants of geospatial data (GDF, CARiN, S-Dal, PSF, and the myriad sub-species of ShapeFiles). We have formats for eBook readers (proprietary ones like Kindle's MobiPocket-based AZW format, in addition to open ones like OPF).
Are you feeling Obfuscated yet? The implication is that special, expert digital preservation techniques are needed for each and every one of the devices and operating systems named. But there is no explanation of, or cited sources for, why this should be so. Note for example that she points to various MP3 players without noting that the very success of MP3 players as a product is based on their sharing a format which, because of their success, is now very hard to change.
The many proprietary CAD tools and formats currently in use, which have no satisfactory non-proprietary open-source substitutes, already complicate the present communication, not to say the future preservation, of architectural design artifacts.
Mackenzie Smith raised this very point in the question period after my talk. I answered that the CAD market is still comparatively immature, and that one can expect that the customers in this market will eventually rebel against the costs of proprietary lock-in, as Microsoft's have done. But whether or not they do is actually irrelevant to the problem of preservation. All these systems run on hardware for which there are open source emulators. If the binaries of the CAD tools and their O/S are preserved, the tools can be run using emulation. As I said in the talk, Jeff was right about emulation, just wrong about the reason for it.
The fact is, new content formats are being created all the time – some of them proprietary; some with only proprietary rendition tools; many in wide use in both online and offline content repositories of interest to one digital preservation community or another. The technology market does not stand still. The market will always recapitulate exactly the same process that W. Brian Arthur analyzed and that you cite in your talk, describing the "office suite wars": get out there with your own product, do what you have to do to suck up market share, and crowd out whomever you can, whether they got there before you, or came in after you.
Again, Sheila Exaggerates by confusing the introduction of new formats (necessarily accompanied by renderers) with the obsolescence of formats, which is the problem Jeff identified as key to digital preservation.

I pointed out in my post that the market does not "always recapitulate exactly the same process that W. Brian Arthur analyzed", using the Boeing/Airbus, Intel/AMD, and Nvidia/ATI duopolies as examples. Arthur's book did explain the behavior of markets such as desktop publishing, but more sophisticated customers have since developed more sophisticated strategies.
And after the market finishes strip-mining a particular application or content domain, what then? The digital preservation community will be left, then as now, to clean up the mess. There will always be artifacts in some defunct format, for which there will have been no sufficient market to ensure freely available open-source renderers. And there will be artifacts nominally in the format that prevails that will be defective. So, for example, we’ll likely be dealing with "crufty" self-published eBooks that fly below Google's content-acquisition radar, or with “crufty” geospatial databases, in whatever dominant format, just as today web harvesters have to contend with "crufty" HTML.
The talk argues that our current digital preservation techniques are massively inadequate to the task at hand. One major reason is that the focus in research and development of digital preservation has been on problems that are not relevant to the overwhelming majority of the content to be preserved. Clearly, there will always be some amount of content in formats so far from the mainstream that the cost of preserving them outweighs the benefits of doing so. But that fact doesn't invalidate the argument. If we spend all our R&D solving the problems of the cruft and by doing so fail to address the big problems set out in the third part of the talk we will have achieved nothing useful.
Simply depositing contemporary open source renderers in a Source-Forge-like depository is no warrant those renderers will perform as needed in the future, even assuming your proposed comprehensive emulation infrastructure. And if those tools fail then, when they are called on “just in time”, where will the living knowledge of these formats be found to make up the deficit?
Sheila here neither explains any mechanism by which open source renderers would become obsolete, nor explains how the "living knowledge", by which she appears to mean format specifications, would be used to make up the deficit when it could not do so by creating an open source renderer before the format went obsolete. This topic was covered in these posts.
The market is not going to provide us with a silver bullet that will solve all the problems the market itself creates. Nor will the market alone afford us a way around the unhappy fact that preservation assets have to be managed. And that means, especially for scholarly artifacts, that we're still going to need metadata: technical metadata, descriptive metadata, rights metadata, provenance metadata.
Sheila Exaggerates again, implying that I believe "The market is ... going to provide us with a silver bullet" but failing to cite anywhere I said anything like that. And she Obfuscates by providing a list of metadata types with no indication of what value they add to the preservation process and whether it justifies the cost of acquiring and preserving them.

Also, I am not proposing a "comprehensive emulation infrastructure". I am merely observing that providing a "comprehensive emulation infrastructure" is now a necessary part of the IT mainstream.
An earlier post explains why I don't think an open-source renderer, even if one can be found, is a sufficient substitute for technical metadata.
Sheila Exaggerates her comment to which she links, which doesn't explain that at all. It instead recommends that preservation systems should either (somehow) force publishers to produce content that conforms more closely to format specifications, or do as Portico does in some cases and expensively store multiple versions of the content.
I think the library community, and the academic community in general, would at least want to consider how much less effective search tools will be without descriptive metadata (which, incidentally, need not be, and are not currently, necessarily hand-crafted). Even in an ideal world of open access, it is possible to imagine categories of digital assets for which we will be obliged to track rights metadata. And given, as you say, that the Web is Winston Smith's dream machine, perhaps provenance metadata, event metadata, collection management metadata, all still have important roles to play in digital preservation, too.

So we have to ask: Is the lesson of history, ancient and modern, that “formats and metadata are non-problems”? That “all we have to do is just collect and keep the bits, and all will be well”? Or does history caution us, with respect to formats and metadata, that it might be a little early yet to be standing in our flight suits on the deck of the good ship Digital Preservation, under a banner reading “Mission Accomplished”?
Once again, Sheila Exaggerates by being a bit careless with quotes. The actual quotes in the slides are:

  • What are the non-problems?

    • Or rather, the problems not big enough to matter

Non-problems

  • Formats

    • Any format with an open-source renderer is not at risk

  • Metadata (at least for documents)

    • Hand-generated metadata

      • Too expensive, search is better & more up-to-date

    • Program-generated metadata

      • Why save the output? You can save the program!

and:

  • Just collect and keep the bits

    • Not collecting is the major reason for stuff being lost

  • If you keep the bits, all will be well

    • Current tools will let you access them for a long time

Sheila hasn't provided any actual evidence to rebut the history of open source stability I provided, nor any showing that the collection and preservation of all of the multifarious types of metadata adds enough value to the preservation process to justify the costs involved. Nor has she refuted my contention that the major reason stuff is being lost is because it isn't being collected. Nor has she provided any evidence that anyone, least of all me, is "standing in our flight suits on the deck of the good ship Digital Preservation, under a banner reading 'Mission Accomplished'". She cannot have read the last third of the talk if she believes I think the mission is accomplished.
Are there perhaps some other lessons we can learn from a wider scan of the past and present of digital preservation?
There certainly are, and the entire third part of the talk was devoted to them. Unfortunately, Sheila proceeds to ignore most of them, such as the problems of intellectual property, and the problems caused by the shift from static documents to dynamic, interlinked collections of services.
Digital preservation is not free. This is one of the key points of your talk, and it's a really crucial insight. We need more and better numbers to help us ensure that we get the biggest possible preservation bang for our always-too-scarce preservation bucks.

So I am sorry, given that you are perhaps the person best positioned to have the necessary data to hand, that, along with giving your estimates of the costs of the Internet Archive and Portico, you did not take the occasion to detail, or indeed even to mention, the cost of C|LOCKSS as a preservation solution. This is all the more a pity as LOCKSS is exactly co-extensive with what you propose as the totally sufficient preservation strategy: just get the bits and save the bits, and use open-source software to do so.
The entire last third of my talk was devoted to reasons why there is currently no such thing as a "totally sufficient preservation strategy". One major reason is that all current approaches are far too expensive to meet society's needs. To show that I took two examples, the Internet Archive and Portico. The Internet Archive is clearly the lowest-cost system I could have chosen. I am not claiming that Portico is the highest-cost system; indeed it is probably the most efficient system using the conventional approach. I used these two examples to show two things:

  • That even the Internet Archive's costs are far too high to meet society's need with funds that could reasonably be expected to be available.

  • That the additional tasks required by the conventional approach Portico implements so efficiently are very large, and thus that they should either be eliminated, or made the subject of aggressive cost-reduction.
Adding other examples would have been redundant; the talk was already an hour long.
Open source projects after all are not magically exempt from economic laws. This is as true of the open-source tools that Portico uses, for example, as it is of the fascinating open-source emulators under development in the Dioscuri project, or, as you have elsewhere observed, of open-source software originally intended as a commercial product (such as Open Office, which, as you noted, entailed a 10-year development effort, and significant on-going subsidy from SUN Microsystems). Free software, as the saying goes, means “free as in speech, not free as in beer”. All vital, useful open source projects entail costs to get them going, and costs to keep them alive. If we are to judge how economical a preservation solution is, or to amortize the cost of preserved content over its lifetime, we need to know

  • what the sunk costs are

  • what the institutional subsidy in the form of what might be termed charge-backs was and is and is projected to be

  • what the ongoing costs are

  • what the projected costs are

  • what the incremental costs of adding both content nodes and subscribers to the network are

  • what fiscal reserves have to be set aside to ensure the continuance of the solution

Obfuscation in action - of course it would be good if the experts knew all these things and could decide which systems were sustainable in advance. But it doesn't address the point the talk was making, which is that we know for certain that all our current approaches, even the Open Source ones such as the Internet Archive's, are far too expensive.
Digital preservation is not one thing. It is many use cases, and many tiers of many solutions. It is many, and many different kinds of, participants, making many different kinds of preservation choices. Different content, different needs, different valuations, different scenarios dictate different cost/benefit analyses, different actions, different solutions.
Obfuscation again - the subject is so diverse that there is really nothing definite that can be said, even by an expert such as Sheila.
Of course we have to consider the challenges of scale, as you say in your talk. And of course this means we have to find ways to employ the economies of scale, and then to consider any risks the inevitable centralization to employ those economies entails. But we would do well to remember that, in Arthur's analysis of the convergence of the high technology market to a single market “solution”, that single solution is often neither the “best” nor the most “efficient” one.

The values of the marketplace, and the solutions of the marketplace, are not the values and solutions of the digital preservation community. The digital preservation community cannot afford a preservation monoculture, which could well be a single point of failure, whether that means a monolithic technology that could fail, or malignant political control of intellectual and cultural assets. We need CLOCKSS and Portico, the Stanford Digital Library and Florida's DAITSS, the British Library and the Koninklijke Bibliotheek and the German National Library, the Internet Archive, LOCKSS private networks, consortia collections, “retail” preservation-in-the-cloud.
Sheila Exaggerates as usual, attacking the straw man of the market. Nowhere in the talk do I argue that the market will solve the digital preservation problem, or that a single solution is either good or inevitable. But I would point out that, as Sheila says, "Digital preservation is not free", and thus that it cannot ignore the market by simply assuming that anything anyone wants to try will be funded by magic. She also Obfuscates, citing a slew of systems without linking to them so that the uninitiated could easily determine what their differences and similarities might be.
Digital preservation is not a technology.

It is a human activity. It is a collective action. It is people fulfilling the social compact to preserve and to pass along the socially meaningful digital creations of our time, our place, our cultures, our communities. It means people making choices – choices about what to preserve and what not to preserve; choices about what level of investment is appropriate for what categories of objects; choices about how to “divide and conquer” the universe of potentially preservable digital objects; choices about what entities and what technologies are trustworthy agents of preservation.

The market has its place. But surely if we have learned anything from recent events, it is to distrust the glamour of “the new new thing”, the can’t-fail digital preservation appliance that will hoover up all the bits in the world, and collect them in a preservation dust bag. The complexity of the digital preservation solution will have to match the complexity of the problem.

I agree with Sheila that digital preservation is "fulfilling the social compact to preserve and to pass along the socially meaningful digital creations of our time, our place, our cultures, our communities." I'm just pointing out that the techniques we currently use to do the job, mostly based on a decade-and-a-half-old analysis of the problem which history has largely invalidated, are completely inadequate to the task. And that relentlessly adding complexity is not going to help solve that problem.
Best regards,

Sheila


Sheila Morrissey
Senior Research Developer, Portico
sheila.morrissey@portico.org


Friday, April 10, 2009

Spring CNI Plenary: The Remix

This post provides the text of the slides, sources and commentary for the opening plenary that I just gave at the CNI Spring Task Force meeting. The actual slides are available here (PDF). Follow me below the fold for the full details.

Kirk McKusick's IEEE Award

  • 30 years of the Unix file system
    • Disks 1,000,000x bigger
    • Code 4x bigger, much faster, more reliable
  • Reads every disk it ever wrote
    • No incompatible change to on-disk format
    • No incompatible change to API
  • For widely used software
    • Costs of incompatibility outweigh benefits
    • Strict compatibility makes Kirk's life easier

Kirk McKusick was awarded the 2009 IEEE Reynold B. Johnson Information Storage Systems Award at Usenix's 2009 FAST conference.

Shifting Sands
"... digital documents are evolving so rapidly that shifts in the forms of documents must inevitably arise. New forms do not necessarily subsume their predecessors or provide compatibility with previous formats."
  • Jeff Rothenberg "Ensuring the Longevity of Digital Documents" Scientific American Vol. 272 No. 1 1995
As Jeff wrote this, Kirk's file system was 16 years old, with no incompatible changes to the API or on-disk format.

The quotation is from Jeff Rothenberg's original article "Ensuring the Longevity of Digital Documents" Scientific American Vol. 272, No. 1, 1995. A 1999 update is here, but the update doesn't change the argument of the talk.

The Meme
  • Incompatibility is inevitable, a force of nature
    • Why did Jeff think this in 1995?
    • Is it true in 2009?
  • If this meme isn't true
    • What causes incompatibility?
    • Are these causes operating now?
Incompatibility is not inevitable; it is a choice someone made. If they are rational, they assessed the costs and the benefits. Incompatible changes to widely used software impose costs on each user; if there are many users, aggregating these costs overwhelms any possible benefit. This is especially true when the benefits, even if large, accrue only to a few users.
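
The arithmetic behind this argument can be sketched in a few lines. Every number below is invented for illustration; none comes from measured data, and the function name is hypothetical.

```python
def net_benefit(users, cost_per_user, beneficiaries, benefit_each):
    """Aggregate benefit minus aggregate cost of an incompatible change.
    Every user pays the cost; only the beneficiaries gain."""
    return beneficiaries * benefit_each - users * cost_per_user

# Hypothetical widely used software: a small cost per user, a large
# benefit to a few.
widely_used = net_benefit(users=10_000_000, cost_per_user=5.0,
                          beneficiaries=1_000, benefit_each=10_000.0)

# Hypothetical niche software: same per-user numbers, far fewer users.
niche = net_benefit(users=200, cost_per_user=5.0,
                    beneficiaries=150, benefit_each=10_000.0)

print(f"widely used: net {widely_used:+,.0f}")
print(f"niche:       net {niche:+,.0f}")
```

With these made-up figures the same change is a clear loss for the widely used software and a clear gain for the niche one, which is why incompatibility tends to survive only where the user base is small.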

Talk in 3 Parts
  • Ancient History: before 1995
    • Jeff Rothenberg's 50-year look forward from 1995
    • What he predicted & why
  • Modern History: from 1995 to 2009
    • Impacts of Jeff's article
    • What else happened
    • How Jeff rates as a prophet & why
  • The Future: following Jeff's example
    • Looking forward to identify the real problems


Ancient History
"History is not what you thought. It is what you can remember. All other history defeats itself."
  • From the Compulsory Preface to 1066 And All That, W. C. Sellar & R. J. Yeatman
1066 And All That is a classic of English humor. Anyone baffled by it should consult the po-faced Wikipedia entry.

Jeff Rothenberg's Scenario
  • In 2045, descendants find a CD
    • Try to recover document from it leading to Jeff's fortune
  • Threat: Media degradation
    • Bits on the CD suffer "bit rot"
  • Threat: Media obsolescence
    • No hardware capable of reading the bits available
  • Threat: Format obsolescence
    • No software capable of rendering the bits available
The first two threats are easy to explain and defend against by regularly migrating the bits from older to newer media. The third threat was harder to explain and defend against, so it dominates the article.

Jeff on Format Obsolescence
  • Defenses
    • Format Migration
    • Emulation
  • Format migration disapproved
    • "Finally, [format migration] suffers from a fatal flaw. ... Shifts of this kind make it difficult or impossible to translate old documents into new standard forms."
  • Emulation approved subject to caveat
    • "specifications for the outdated hardware ... must be saved in a digital form independent of ... software"
Note how, because the hardware specifications are themselves digital documents to be preserved, Jeff has deftly reduced the emulation strategy to a previously unsolved problem.

Jeff's Dystopian Vision
  • Documents survive in off-line media
  • The media have a short lifetime
  • The media readers have a short lifetime
  • Documents are in app-specific formats
    • Typical formats are proprietary
    • Attempts to standardize formats will fail
  • Hardware & O/S will change rapidly
    • In ways that break applications
  • Apps for rendering formats have a short life


Two Words: Desktop Publishing
  • The publishing medium was paper
  • Design goal of Word & WordPerfect files:
    • Save the state of the word processor
  • Formats - exclusive property of applications
    • Other apps interpreting them - threat to biz model
  • Then people started e-mailing the files:
    • Got there quicker, could be edited & returned
It is evident reading Jeff's article that the way the document describing his hidden fortune got on to the CD was via a desktop publishing system. If you think back to 1995, desktop publishing was all the rage.

IT in 1995


Modern History
"A preoccupation with the future not only prevents us from seeing the present as it is but often prompts us to rearrange the past."
  • Eric Hoffer


Impacts of Jeff's Vision
  • Scientific American article = lots of attention
  • Governments, foundations started funding
    • Mellon Foundation
    • NSF, Library of Congress, National Archives ...
  • Now have systems in production
    • Using both strategies Jeff identified
  • Internet Archive started the next year
    • Using neither of them
It is somewhat odd that, despite Jeff's preference for emulation, many more of the existing systems use format migration.

The Web
  • May 1995: HighWire puts JBC on-line
    • Pioneers academic e-journals
The graph is from Netcraft. It shows that Netcraft didn't even start tracking the Web until after Jeff's article had been published, and that the real explosive growth of the Web didn't start until after Jeff's update appeared in 1999.

Off-line or On-Line
  • In Jeff's vision documents survived off-line
    • Coming on-line for occasional manipulation or copying
    • Copy-ability was extrinsic to the medium
  • Now, if it is worth keeping, it is on-line
  • Off-line backups are temporary
  • Copy-ability is intrinsic to the on-line medium
  • No-one cares what the physical medium is
    • Disk, flash memory, RAM, ...
    • Just that it obeys the access protocols
To be sure, some material worth keeping is not on-line, at least not in the sense of being accessible via the Web. For example, the Stanford Digital Repository contains material that has been deposited on condition that it not be made accessible. Some of this represents preservation masters for content that is on-line in a presentation format. In other cases, it is content that ideally would be on-line if only that were permitted, for example content embargoed for a period, or material that would be on-line if only the resources to put it on-line were available.

Microsoft vs. its Users
  • MSFT Office biz model has to drive upgrades
    • Introduce gratuitous format incompatibility by default
    • New machine writes document old machine can't read
    • Old machine buys upgrade, MSFT happy
  • Users carry the cost of incompatibility
    • Unhappy - anti-trust probe ('90) & consent decree ('94)
    • Users ('02-'05) force ODF standard for documents
    • MSFT ('07) does OOXML, but concedes the basic point
  • Experience with MSFT misled Jeff
    • Even MSFT's ability to obsolete formats now limited
Two books about Microsoft's anti-trust struggle with the US Justice Dept. are Ken Auletta's World War 3.0 and John Heilemann's Pride Before The Fall.

Note that format obsolescence happens when support for a format is removed, not when support for a successor format is added. Microsoft's business model depended on adding support for new formats, not on removing support for old formats; making the new version of Office incapable of reading documents produced by its predecessor would have been self-defeating.

Evidence that Microsoft can no longer remove support for old formats, as opposed to adding support for new formats, is in this post from last year.

Documents or Content
  • Jeff's documents were property of a program
    • A Word file is data to be manipulated (only) by Word
    • Proprietary format changeable on a whim
  • Now documents are content to be published
    • Charge to upgrade browser so it can't read old content?
    • Browser free, content free, Office biz model dead
  • Goal of publishing: reach as many readers as you can
    • Gratuitous incompatibility is now self-defeating
    • Publishing IE-only pages gets you flamed


Virtual Machines
  • H/W virtualization has long history (VM/370!)
    • Software too (Basic!)
  • In 1995 it wasn't mainstream
    • Intel was just putting necessary stuff into X86
  • Now virtual hardware is mainstream
    • Old hardware can be emulated easily with open source
  • Mainstream software now written for VMs
    • Java, C#, ...
  • Jeff was right about emulation
    • But preservation wasn't the reason for doing it


Open Source
  • In 1995 Open Source wasn't mainstream
    • Now it's basic strategy for all but 2 big IT companies
  • Open Source renderers for all major formats
    • Even those with DRM! (Legal status obscure)
  • Open Source is best preserved of all
    • ASCII, source code control, can rebuild stack as it was
  • Open Source isn't backwards incompatible
    • For same reason as "no flag day on the Internet"
  • Format with Open Source renderer is safe
    • Executable "preservation metadata"
For a discussion of the importance of open source for preservation, see this post.

This argument may not apply to console games and other forms of content protected by Digital Rights Management (DRM). In practice most forms of DRM have been cracked; for a particularly revealing description of the necessary reverse-engineering process, see Bunnie Huang's fascinating book Hacking the Xbox. Thus, although in most cases it is technically possible to preserve access to DRM-protected content, the legality of doing so is often challenged. Presumably, the challenges wouldn't be mounted if the open source renderers didn't render the content. There is more on DRM in this post.

20/20 Hindsight
  • Documents survive on-line, on the Web
    • Off-line used only for temporary backups
  • Migration between on-line media is inherent
    • Readers are bundled with storage technology
  • Formats are standard & app-independent
    • Proprietary formats get open-source renderers
  • Format obsolescence never happens
    • No flag day on the Internet
  • I.e: Jeff wrong in every particular


The Big Picture

  • IT markets have increasing returns
    • Usually called "network effects" - Metcalfe's Law
  • IT markets have path dependence
    • Many players early
      • Randomly one gets bigger, network effects take over
    • IT markets subject to capture (MSFT, INTC)
      • Captured markets slow change down (e.g. Vista)
    • History misled Jeff to overestimate change
W. Brian Arthur's book Increasing Returns and Path Dependence in the Economy is an important description of the behavior of technology markets. It explains how, as illustrated in the graph that I created, they are initially fragmented, with multiple products competing with comparable shares of a small market. At some point, for random reasons, one product gains enough extra market share for the increasing returns to scale (or network effects) to take over. Once they do, that product rapidly gains share in a rapidly expanding market. Others initially benefit from the growing market even as they lose market share, but rapidly start losing their existing customers to the winner.

At this point, as shown by the arrow on the graph, it is in the interest of the winning product to make switching from its competitors' products as easy as possible.

This analysis works very well for markets with large numbers of relatively unsophisticated customers. Markets with a small number of sophisticated customers have figured out strategies for fighting back. For example, in the airliner business the airlines have understood that it is in their long-term interest to buy from both Boeing and Airbus; allowing either to fail would impose unacceptable monopoly costs. Similar behavior can be seen in the market for CPU chips (Intel vs. AMD) and graphics chips (NVIDIA vs. ATI).

Yes We Can!
  • Jeff being wrong is Good News!
    • Collections that survive aren't as hard as we thought
  • Just collect and keep the bits
    • Not collecting is the major reason for stuff being lost
  • If you keep the bits, all will be well
    • Current tools will let you access them for a long time
  • Just go do it!


The Future
"Prediction is very difficult, especially about the future."
  • Niels Bohr


The Real Problems Were ...
  • Scale
    • Not individual documents but vast collections of them
  • Cost
    • Preservation not by individuals but large organizations
  • Intellectual Property
    • If content worth saving someone is making money from it


Scale
  • Jeff looked at micro-level preservation
    • A single document on a single CD
  • Society needs macro-level preservation
    • Information is now industrial scale
    • Data centers the size of car factories
    • As much power as an aluminum smelter
  • 1 copy of 1 important database = $1M/yr
    • In storage costs alone
  • Document-at-a-time preservation impractical
    • Curators must get huge collections per day's work
Storage cost issues are addressed in the series of posts on A Petabyte For A Century and the resulting iPRES paper (190K PDF).

Metcalfe's Law
  • The lesson of Google
    • More value in connections than in documents themselves
    • Preserving individual documents loses this value
    • Need to preserve collections including the connections
  • Another instance of Metcalfe's Law
    • Value of a network goes as # of nodes squared
    • Isolated document is a network of 1 node
  • Google's other lesson - it's expensive
    • We lack good cost data for digital preservation at scale
    • Use two extremes to get a ballpark estimate
The two extremes are archive.org and Portico. I should stress that both systems are well engineered to meet their different goals using their chosen techniques. I am not criticizing them, I'm simply using them as bounds on the costs of operating at scale.

Scale Implies Cost
  • Internet Archive:
    • contains 2PB, growing 240TB/yr
    • Google collects the Web monthly then discards it
    • archive.org collects the Web monthly then keeps it
    • 2 snapshot copies + 1 coming up
    • $10-14M/yr operation so ~$5 per GB per year
  • Portico:
    • All academic literature ~50TB, growing ~5TB/yr
    • Portico still working on ingesting back content
    • $6-8M/yr operation so >$100 per GB per year
My cost numbers for archive.org come from a recent article in The Economist's Technology Quarterly, and for Portico from a guesstimate based on their tax returns.

How Many $ Do We Need?
  • archive.org should be cheaper than Portico
    • It isn't doing all that "preservation" stuff
    • Better bit preservation than archive.org important
  • But does all the other stuff justify 20x cost per byte?
  • How much do we need to save? An exabyte?
    • 0.3% of the data generated in 2007, 0.05% of 2011
    • @ archive.org = $5B/yr, @ Portico = $100B/yr
    • The world doesn't have even $5B/yr to spend on this
Much less $100B/yr. The point is that, even if we could do adequate quality preservation with archive.org's cost structure, we'd still be much too expensive to address society's need for preservation. With the cost structures more normally associated with preservation at scale, we're much, much further away from addressing it.
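
The arithmetic behind these estimates is easy to check. The sketch below is only a back-of-envelope calculation from the collection sizes and annual budgets quoted on the slides; it assumes decimal units (1 TB = 1,000 GB), and the function and variable names are illustrative rather than taken from any real tool.

```python
# Back-of-envelope cost per GB per year, assuming decimal units.
GB_PER_TB = 1_000
GB_PER_PB = 1_000_000
GB_PER_EB = 1_000_000_000


def cost_per_gb_year(annual_budget_usd, collection_gb):
    """Annual operating budget divided by the amount of data held."""
    return annual_budget_usd / collection_gb


def exabyte_cost(per_gb_year):
    """Scale a per-GB annual cost up to one exabyte."""
    return per_gb_year * GB_PER_EB


# Figures quoted on the slides: (low budget, high budget, collection size in GB).
archives = {
    "archive.org": (10e6, 14e6, 2 * GB_PER_PB),
    "Portico": (6e6, 8e6, 50 * GB_PER_TB),
}

for name, (low, high, size_gb) in archives.items():
    low_rate = cost_per_gb_year(low, size_gb)
    high_rate = cost_per_gb_year(high, size_gb)
    print(f"{name}: ${low_rate:.0f}-{high_rate:.0f} per GB per year, "
          f"at least ${exabyte_cost(low_rate) / 1e9:.0f}B/yr for an exabyte")
```

At the low end of each budget range this works out to about $5 and $120 per GB per year, which is where the round exabyte figures of $5B/yr and $100B/yr on the slide come from.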

Intellectual Property
  • Most content worth saving is making money
    • Lawyers won't risk that; don't want you to keep a copy
  • They have massaged the law to their ends
    • You must get permission, so you must talk to lawyers
      • Or you are vulnerable to DMCA take-down like IA
  • 1 hour of 1 lawyer ~ 5TB of disk
    • 10 hours of 1 lawyer could store the academic literature
  • For preservation, much uncertainty
    • Effort devoted to high byte/lawyer-hour content
  • Please use Creative Commons licenses!
The real problem is that the need to talk to the copyright owner's lawyers, which applies even if the content is open access, motivates preservation of content for which a single conversation with a lawyer obtains permission for a great deal of content. So, for example, even if it takes a lot of lawyer time to talk to Elsevier, the cost per unit of content preserved is small. Whereas even if the cost to talk to a small open access publisher is small, the cost per unit of content will be prohibitive. Once again, the economic forces push towards preservation of the content that is not at risk of loss.

Looking Forwards
  • What are the non-problems?
    • Or rather, the problems not big enough to matter
  • What are the big problems?
    • Preserving the world the way it is now
      • Not the way it used to be
    • Finding enough money
      • And working out how much that is
    • Surviving not having enough money
      • By turning more things into non-problems


Non-Problems
  • Formats
    • Any format with an open-source renderer is not at risk
  • Metadata (at least for documents)
    • Hand-generated metadata
      • Too expensive, search is better & more up-to-date
    • Program-generated metadata
      • Why save the output? You can save the program!
There are extended discussions of the usefulness of format metadata in this post, and of the relative value of open source renderers as against format specifications in this post. For a discussion of the questionable value of format metadata for preservation see this post.

Services not Documents
  • "Preservation" implies static, isolated object
  • Web 2.0 is dynamic, interconnected
    • Each page view is unique, mashed up from services
    • Pages change as you watch them
  • What does it mean to preserve a unique, dynamic page?
For a discussion of the importance of context in preserving the Web see this post.

Things Worth Preserving
  • User Generated Content
    • To understand 2008 election you need to save blogs
    • To do that you need to save YouTube, photo sites, ...
      • So that the links to them keep working ...
    • Technical, legal, scale obstacles almost insuperable
  • Multi-player games & virtual worlds
    • Even if you could get the data and invest in the servers
    • They're dead without the community - Myst (1993)
  • Dynamic databases & links to them
    • e.g. Google Earth mash-ups - is Google Earth forever?
Do you remember Myst from 1993? It was a beautiful virtual world that you explored. Pretty soon you figured out that you were the only person there. Some time after that you figured out that the goal of the game was to figure out why you were the only person there. We've come a long way since then; Myst would not make it against World of Warcraft or Second Life.

For a discussion of the problem of preserving the materials future scholars will need to study elections, see this post.

Economics
  • 2008 Preservation Buzzword: Sustainability
    • We can't afford to preserve the stuff we know how to
  • Future stuff will be much more expensive
    • There'll be a lot more bytes of it
    • Each byte will be more difficult & more expensive
  • Bytes vulnerable to money supply glitches
    • Data needs to be endowed if it is to survive hard times
    • Endowing up front means preserving less
  • Collection development: what must be kept?
    • But it has really bad scaling problems
Bytes are a lot more vulnerable to disruptions in the money supply than paper. They are like divers in old-fashioned diving suits, dependent on air continuously pumped down from the surface. We need to make preserved bytes more like SCUBA divers, carrying their own tank of air with them that only needs to be refilled at intervals. Endowing data is discussed in this post.

Digital Preservation Difficult
  • Conceptually
    • What does it mean to preserve dynamic content?
  • Technically
    • Need to preserve services not content. How?
  • Legally
    • Preservation requires permission
    • How do you even find everyone you need to ask?
  • Economically
    • Just storing the bits needs industrial infrastructure
      • Beyond resources of universities, national libraries
    • Are services like S3 reliable enough?
Alyssa Henry's FAST keynote, in which she offered numbers for availability but pointedly not for reliability is discussed in this post.

Digital Preservation Important
  • Paper's attributes built in to society
    • Durable, write-once, tamper-evident, highly replicated, ...
  • Society needs fixed, tamper-evident record
    • E.g. laws, contracts, evidence, ...
      • Paper provides this as a side-effect
  • The Web is Winston Smith's dream machine
    • All govt. information on a single web server (FDsys)
    • Point-&-click to rewrite history
FDsys is discussed in this post.

Practical Next Steps
  • Everyone - just go collect the bits:
    • Not hard or costly to do a good enough job
    • Please use Creative Commons licenses
  • Preserve Open Source repositories:
    • Easy & vital: no legal, technical or scale barriers
  • Support Open Source renderers & emulators
  • Support research into preservation tech:
    • How to preserve bits adequately & affordably?
    • How to preserve this decade's dynamic web of services?
      • Not just last decade's static web of pages


      Additional Material

      Here is some additional material I prepared but which I cut to get down to the time allowed.

      Did Documents Get Lost?

      I was expecting a question asserting that I was wrong to suggest that formats in wide use in 1995 had not gone obsolete.

      The Open Office that I use has support for reading and writing Microsoft formats back to Word 6 (1993), full support for reading WordPerfect formats back to version 6 (1993) and basic support back to version 4 (1986).

      I am sure that there are many formats that were in use in 1995 that are now difficult to render because current tools lack support for them. I have argued for a long time that there are few, if any, formats in wide use in 1995 that are difficult to render with current tools. I'm still looking for counter-examples.

      But even if there were counter-examples, it wouldn't invalidate my case. It is easy to emulate 1995 PCs, and quite possible to emulate most other architectures current in 1995 using virtual machine technology. See, for example, this BBC story about a collaboration between Microsoft, the British Library and the British National Archives to access old formats by running virtual instances of old Microsoft operating systems and the relevant applications.

      The only question is, did someone keep the bits for the operating system and the application as well as the document?

      As regards media, the media in wide use in 1995 that are less common today are 3.5" floppies (still on the shelves at Fry's), ZIP drives (as I write this there are 306 of the original ZIP drives on eBay), and DAT tape (40 drives on eBay).


      Monday, March 30, 2009

      Spring CNI Plenary

      I can finally reveal the mysterious talk I referred to in this comment; it is the opening plenary at CNI's Spring Task Force meeting one week from today. In essence, the talk is a look back at Jeff Rothenberg's 1995 Scientific American article "Ensuring the Longevity of Digital Documents" which asks:

      • What led Jeff to his dire predictions?
      • Would one make the same dire predictions now?
      • If not, what dire predictions would one make, and why?
      This talk arose because Cliff Lynch invited me to give a talk at UC Berkeley's School of Information last November. He liked it enough to ask me to give it at CNI and I agreed, thinking I had already written it. But once I thought about it more, and considered the very different audience, I realized that it needed to be almost completely rewritten. I'm still revising it, based on feedback from giving it to the Stanford Library staff.

      CNI will post the slides after the talk, and I plan to post here a commentary on them providing links to sources and additional details. You will be able to see how the discussions here were a very valuable resource. Thank you all.


      Monday, March 23, 2009

      2009 FAST conference

      I attended the 2009 FAST Conference in San Francisco almost a month ago. This post was delayed because, as I mentioned in this comment, I've been working on an important talk which draws from recent discussions on this blog. The papers are on the web following Usenix's commendable open access policy. Follow me below the fold for my highlights.

      FAST opened with a richly-deserved IEEE award for my friend Kirk McKusick, recognizing his 30-year stewardship of the Fast File System and his many contributions to the evolution of file systems in general. There's a prevalent mindset that software is unstable and evanescent; it is nice to draw attention to areas that have demonstrated long-term stability. In 30 years of consistent gradual improvement that have kept FFS competitive with the best in the industry, as the underlying disks have grown from megabytes to terabytes and the code has grown from 12K to 55K lines, there have been no incompatible changes in either the application interface or the on-disk format. Today's FFS could read any disk it has ever written, even had one by some miracle survived from 30 years ago.

      This was followed by a fascinating keynote by Alyssa Henry, who manages Amazon's S3 storage service. As of a few months ago, it held over 40 billion objects and was growing exponentially. Designing systems to survive at these scales and growth rates is a major challenge; you will never know the important parameters, and failures of all kinds are an everyday occurrence. Alyssa enumerated the system's design goals:

      • Don't lose or corrupt objects
      • 99.99% uptime
      • Scale as the competitive advantage
      • Security, authentication and logs
      • Low latency compared to the Internet's latency
      • Simple API
      • Cheap and pay-as-you-go so as to eliminate under-utilization.

      She also identified a number of important techniques:

      • Redundancy: but only as much as you need; although it increases durability and availability, it also increases cost and complexity.
      • Retry: making operations idempotent means that they can be retried to leverage redundancy.
      • Surge Protection: rate limiting, exponential backoff and cache TTL reduce the impact of the inevitable load peaks.
      • Eventual consistency: sacrifice some consistency to improve availability, sacrifice some availability to improve durability. For example, writes are not acknowledged to the application until enough redundant writes have completed, but at that time only some of the indexes that provide access to the written data may have been updated.
      • Routine failure: make sure to exercise all code paths all the time, so for example when equipment or software is end-of-lifed or taken down for maintenance they just pull the plug, don't try to do a graceful shutdown.
      • Integrity checking: the application delivers a checksum with the data which is checked on ingest, regularly during storage, and on dissemination.
      • Telemetry: internal, external, real-time and historical, per host and aggregate data is essential to managing the system.
      • Autopilot: humans fail and are slow. Don't blame the person, blame the tool design.
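
Three of these techniques (idempotent retry, exponential backoff and end-to-end integrity checking) fit together naturally, and a toy example may make that concrete. The sketch below is emphatically not how S3 works or what its API looks like; `ToyStore`, `TransientError` and `with_retry` are hypothetical names, the store is an in-memory dictionary, and SHA-256 is simply an assumed choice of checksum.

```python
import hashlib
import time


class IntegrityError(Exception):
    """Raised when bytes do not match the checksum that accompanies them."""


class TransientError(Exception):
    """Stands in for a timeout or an overloaded replica (hypothetical)."""


class ToyStore:
    """In-memory stand-in for an object store; not a real service API."""

    def __init__(self):
        self._objects = {}  # key -> (data, checksum)

    def put(self, key, data, checksum):
        # Integrity check on ingest: the application supplies the checksum.
        if hashlib.sha256(data).hexdigest() != checksum:
            raise IntegrityError(f"checksum mismatch on ingest of {key!r}")
        # Idempotent: repeating the same put leaves the store in the same state.
        self._objects[key] = (data, checksum)

    def get(self, key):
        data, checksum = self._objects[key]
        # Integrity check on dissemination.
        if hashlib.sha256(data).hexdigest() != checksum:
            raise IntegrityError(f"stored copy of {key!r} is corrupt")
        return data


def with_retry(operation, attempts=5, base_delay=0.01, sleep=time.sleep):
    """Retry an idempotent operation, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay


store = ToyStore()
payload = b"an object worth keeping"
digest = hashlib.sha256(payload).hexdigest()
with_retry(lambda: store.put("doc-1", payload, digest))
print(store.get("doc-1") == payload)
```

Retrying is only safe because `put` is idempotent: a retry after an ambiguous failure cannot leave the store in a different state than a single successful call would. A real system would also re-verify checksums periodically while the data is at rest, which this sketch omits.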

      Her conclusion was that "Storage is a lasting relationship, it requires trust. Reliability comes from engineering, experience, and scale." It was interesting that although she gave a numeric goal of 4-nines availability, and said that S3 offers a guarantee of 3-nines with financial penalties if they don't deliver, she carefully avoided giving any numbers for reliability except that their goal was 100%. Well, Duh! Although she did state that they measured it, she didn't give any indication of what the measurements showed. Nor did she show any willingness to offer any stronger statement for reliability than their existing EULA, which basically states that if they lose your data it is your problem, not theirs. It is hard for me to see any responsible way to use S3 as a long-term data repository without some commitment to or measurement of a level of reliability.

      The papers were interesting, but there was only one directly relevant to digital preservation. This was Tiered Fault Tolerance for Long-Term Integrity by Byung-Gon Chun, Petros Maniatis, Scott Shenker and John Kubiatowicz (disclosure, I have been a co-author with Petros). The paper essentially answers the question:
      What is the minimum necessary hardware support that would allow a guarantee of long-term integrity in a distributed, replicated system equivalent to the guarantee Byzantine Fault Tolerance (BFT) offers for short-term integrity?
      The answer turns out to be surprisingly little. Just a small amount of state to hold the root of a tree of hashes, protected in a specific way that is easy to implement in hardware.
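
For readers unfamiliar with the idea, a tree of hashes lets one small value vouch for an arbitrarily large collection: change any block and the root changes. The sketch below is a generic hash-tree root computation, not the construction in the paper. The choice of SHA-256, the leaf and interior prefixes, and the rule of carrying an unpaired node up unchanged are all assumptions made for illustration.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data).digest()


def merkle_root(blocks):
    """Return the root hash of a binary hash tree over a list of byte strings."""
    if not blocks:
        raise ValueError("need at least one block")
    # Leaves and interior nodes get different prefixes so that one
    # cannot be passed off as the other.
    level = [_h(b"\x00" + block) for block in blocks]
    while len(level) > 1:
        paired = [_h(b"\x01" + level[i] + level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # an unpaired node is carried up unchanged
        level = paired
    return level[0]


blocks = [b"article-1", b"article-2", b"article-3"]
root = merkle_root(blocks)
print(root.hex())
```

Only the 32-byte root needs the specially protected storage; everything else can live on ordinary, untrusted disks and be checked against it.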

      I'm particularly interested in this paper because, even before I started work on LOCKSS, I had realized that long-term integrity with a BFT-like guarantee wasn't possible without hardware assistance, and that I had no idea what such hardware would look like. The work that led to the major LOCKSS papers was a response to the fact that requiring special-purpose hardware that hadn't even been designed yet would have been a major barrier to entry for libraries wishing to preserve content. Our papers addressed the question:
      What is the strongest reassurance we can offer without special-purpose hardware?
      Taking the long view, it is likely that the work of Chun and his co-authors will eventually prove to be the starting point of the right track for preservation.


      Thursday, January 15, 2009

      Postel's Law

      In RFC 793 (1981) the late, great Jon Postel laid down one of the basic design principles of the Internet, Postel's Law or the Robustness Principle:


      "Be conservative in what you do; be liberal in what you accept from others."

      It's important not to lose sight of the fact that digital preservation is on the "accept" side of Postel's Law, but it seems that people often do.

      On the Digital Curation Centre Associates mail list, Adil Hasan started a discussion by asking:

      "Does anyone know [whether] there has been a study to estimate how many PDF documents do not comply with the PDF standards?"


      No-one in the subsequent discussion knew of a comprehensive study, but Sheila Morrissey reported on the results of Portico's use of JHOVE to classify the 9 million PDFs they have received from 68 publishers into one of three categories: not well-formed; well-formed but not valid; and well-formed and valid. A significant proportion were classified as either not well-formed or well-formed but not valid.

      These results are not unexpected. It is well known that much of the HTML on the Web fails the W3C validation tests. Indeed, a 2001 study reportedly concluded that less than 1% of it was valid SGML. Alas, I couldn't retrieve the original document via this link, but our experience confirms that much HTML is poorly formed. For this very reason LOCKSS uses a crawler based on work by James Gosling at Sun Microsystems to develop techniques for extracting links from HTML that are very tolerant of malformed input; an application of Postel's Law.
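
To show what being liberal in what you accept looks like for link extraction, here is a minimal sketch. It is not the LOCKSS crawler's code; it simply uses the tolerant `html.parser` module from Python's standard library, and the `LinkExtractor` class, `extract_links` function and sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href and src values without caring how well-formed the markup is."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Attribute names arrive lower-cased; values may have been unquoted.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Unclosed tags, a stray end tag, mixed case and unquoted attribute values.
malformed = (
    '<html><body><a href="a.html">one'
    "<A HREF='b.html'>two</b><p><img src=c.png>"
    "<a href=d.html>three"
)
print(extract_links(malformed))
```

A validating parser would reject this input outright; the tolerant one still recovers all four links, which is what a preservation crawler needs.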

      Follow me below the fold to see why, although questions like Adil's are frequently asked, devoting resources to answering them or acting upon the answers is unlikely to help digital preservation.

      Why, in a forum devoted to digital curation, would anyone ask about the proportion of PDF files that don't conform to the standards? After all, the PDF files they are asking about are generated by tools. No-one writes PDF by hand. So if they don't conform to the standards, it is because the tool that generated them had a bug in it. Why not report the bug to the tool creator? Because even if the tool creator fixed the bug, the files the tool generated before the fix was propagated would still be wrong. There's no way to recall and re-create them, so digital curators simply have to deal with them.

      The saving grace in this situation is that the software, such as Adobe Reader, that renders PDF is constructed according to Postel's Law. It does the best it can to render even non-standard PDF legibly. Because it does so, it is very unlikely that a bug in the generation tool will have a visible effect. And if the bug doesn't have a visible effect, it is very unlikely to be detected, reported and fixed.

      Thus we see that a substantial proportion of non-conforming PDF files is to be expected. And it is also to be expected that the non-conforming files will render correctly, since they will have been reviewed by at least one human (the author) for legibility.

      Is the idea to report the bugs, which don't have visible effects, to the appropriate tool vendors? This would be a public-spirited effort to improve tool quality, but a Sisyphean task. And it wouldn't affect digital curation of PDF files since, as we have seen, it would have no effect on the existing population of PDF files.

      Is the idea to build a PDF repair tool, which takes non-conforming PDF files as input and generates a conforming PDF file that has an identical visual rendering as output? That would be an impressive feat of programming, but futile. After all, the non-conforming file is highly likely to render correctly without modification. And if it doesn't, how would the repair tool know what rendering the author intended?

      Is the idea to reject non-conforming files for preservation or curation? This is the possibility that worried me, as it would be a violation of Postel's Law. To see why I was worried, substitute HTML for PDF. It is well-known that a proportion, perhaps the majority, of web sites contain HTML that fails the W3C conformance tests but that is perfectly legible when rendered by all normal browsers. This isn't a paradox; the browsers are correctly observing Postel's Law. They are doing their best with whatever they are given, and are to be commended for doing so. Web crawls by preservation institutions such as national libraries and the Internet Archive would be very badly advised to run the W3C tests on the HTML they collect and reject any that failed. Such nit-picking would be a massive waste of resources and would cause them to fail in their mission of preserving the Web as a cultural artifact.

      And how would an archive reject non-conforming files? By returning them to the submitter with a request to fix the problem? In almost all cases there's nothing the submitter can do to fix the problem. It was caused by a bug in a tool he used, not by error on his part. All the submitter could do would be to transmit the error report to the tool vendor and wait for an eventual fix. This would not be a very user-friendly archive.

      So why do digital curators think it is important to use tools such as JHOVE to identify and verify the formats of files? Identifying the format is normally justified on the basis of knowing what formats are being preserved (interesting) and flagging those thought to be facing obsolescence (unlikely to happen in the foreseeable future to the formats we're talking about). But why do curators care that the file conforms to the format specification rather than whether it renders legibly?

      The discussion didn't answer this question but it did reveal some important details:

      First, although it is true that JHOVE flags a certain proportion of PDF files as not conforming to the standards, it is known that in some cases these are false positives. It is not known what JHOVE's rate of false negatives is, which would be cases in which it did not flag a file that in fact did not conform. It is hoped that JHOVE2 (PDF), the successor to JHOVE which is currently under development, will have lower error rates. But there don't appear to be any plans to measure these error rates, so it'll be hard to be sure that JHOVE2 is actually doing better.

      Second, no-one knows what proportion of files that JHOVE flags as not conforming are not legible when rendered using standard tools such as Adobe Reader or Ghostscript. There are no plans to measure this proportion, either for JHOVE or for JHOVE2. So there is no evidence that the use of these tools contributes to future readers' ability to read the files which is, after all, the goal of curation. Wouldn't it be a good idea to choose a random sample among the Portico PDFs that JHOVE flags, render them with Ghostscript, print the results and have someone examine them to see if they were legible?
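
The experiment suggested above needs very little machinery. The sketch below is a hypothetical outline, not anything Portico or the JHOVE developers have built: it draws a simple random sample of flagged files, constructs (without running) a Ghostscript command to render each one to PNG pages for a human to inspect, and turns the reviewer's verdicts into a proportion. The function names and file paths are invented, and `gs` is assumed to be the name of the Ghostscript executable, as it usually is on Unix systems.

```python
import random


def sample_flagged(flagged_paths, sample_size, seed=None):
    """Pick a simple random sample (without replacement) of flagged files."""
    paths = list(flagged_paths)
    rng = random.Random(seed)
    return rng.sample(paths, min(sample_size, len(paths)))


def ghostscript_command(pdf_path, output_pattern, dpi=100):
    """Build, but do not run, a Ghostscript command rendering pages to PNG."""
    return [
        "gs", "-dBATCH", "-dNOPAUSE", "-dSAFER",
        "-sDEVICE=png16m", f"-r{dpi}",
        f"-sOutputFile={output_pattern}",
        pdf_path,
    ]


def illegible_fraction(verdicts):
    """Fraction of reviewed files that a human judged illegible (True = illegible)."""
    verdicts = list(verdicts)
    return sum(verdicts) / len(verdicts)


# Hypothetical list of files that the format checker flagged.
flagged = [f"flagged/{i:07d}.pdf" for i in range(10_000)]
for path in sample_flagged(flagged, 3, seed=2009):
    print(" ".join(ghostscript_command(path, path.replace(".pdf", "-%03d.png"))))
```

Each command could be handed to `subprocess.run(cmd, check=True)`, and a file that makes Ghostscript fail outright is itself a data point. The proportion that matters is the one `illegible_fraction` computes from the human reviewer's judgments.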

      Third, although Portico classifies the PDF files it receives into the three JHOVE categories, it apparently observes Postel's Law by accepting PDF files for preservation irrespective of the category they are in. If so, they are to be commended.

      Fourth, there doesn't seem to be much concern about the inevitable false positives and false negatives in the conformance testing process. The tool that classifies the files isn't magic, it is just a program that purports to implement the specification which, as I pointed out in a related post, is not perfect. And why would we believe that the programmer writing the conformance tester was capable of flawless implementation of the specification when his colleagues writing the authoring tools generating the non-conformances were clearly not? Lastly, absence of evidence is not evidence of absence. If the program announces that the file does not conform, it presumably identifies the non-conforming elements. They can be checked to confirm that the program is correct. Otherwise, it presumably says OK. But what it is really saying is "I didn't find any non-conforming elements". So the estimate from running the program is likely to be an under-estimate - there will be false negatives, non-conforming files that the program fails to detect.

      The real question for people who think that JHOVE-like tools are important, either as gatekeepers or as generators of metadata, is "what if the tool is wrong?" There are two possible answers. Something bad happens. That makes the error rate of the tool a really important, but unknown, number. Alternatively, nothing bad happens. That makes the tool irrelevant, since not using it can't be worse than using it and having it give wrong answers.

      Thus, to be blunt for effect, we have a part of the ingest pipeline that is considered to be important which classifies files into three categories with some unknown error rate. There is no evidence that these categories bear any relationship to the current or eventual legibility of these files by readers. And the categories are ignored in subsequent processing. Why are we bothering to do this?
