Thursday, April 4, 2019

Digitized Historical Documents

Josh Marshall of Talking Points Memo trained as a historian. From that perspective, he has a great post entitled Navigating the Deep Riches of the Web about the way digitization and the Web have transformed our access to historical documents. Below the fold, I bestow both praise and criticism.

The really good part of Marshall's post is the explanation, with many examples, of the beneficial results of libraries' and archives' digitization programs. The treasures they have digitized used to be locked away in archives or, at best, presented a page at a time in display cases. All but a tiny fraction of the interested public had to make do with small reproductions. Even scholars might need months to get permission for brief access.

Now anyone, but especially scholars, can, wherever they feel the need, examine in close-up detail treasures such as ancient codices, illuminated manuscripts, Gutenberg Bibles, ancient maps, or Isaac Newton's own notebook. Even more recent resources, such as photographs of jazz greats, are available. All this without asking permission, traveling to the library, or even donning white gloves. And, as a side-effect, the risk to the originals has been significantly reduced. Truly something worth celebrating, as Marshall does.

Marshall describes the problem of preserving the digital versions of these treasures thus:
Happily, for those of us who are merely consumers of these riches in the present, it’s someone else’s problem. But it is a big, fascinating problem for librarians and digital archivists around the world.
Well, yes, it is a much more interesting problem than I thought when I started work on it more than two decades ago. The not so good part of Marshall's post is that, like most of the public and even some in the digital preservation community, he is simply wrong about what the problem is. He writes eloquently:
There is a more complex process underneath all these digital riches which is just how to preserve digitized collections to stand the test of time. With books, by and large, you just take care of them. Easier said than done and world class libraries now have a complex set of practices to preserve physical artifacts from acid-free containers to climate control and the like. But there’s an entirely different set of issues with digitization. It would certainly suck if you’d digitized your whole collection in 1989 and just had a big collection of 5 1/4 inch floppy disks produced on OS/2, the failed IBM-backed PC operating system that officially died in 2006.

That’s just an example for illustration. But you can see the challenge. Over the last 30 or 40 years we’ve had Betamax, VHS, vinyl albums, CDs, DVDs, BluRay, various downloadable video and audio formats. These are all a positive terror if you’re trying to organize and preserve artifacts of the past that people will have some hope of using in a century or five centuries. What formats do you use? How do you store them – not simply to make them available today but to ensure they aren’t lost in some digital transition or societal disruption in the future?
Back in 2007, in the second and third posts to this blog, I wrote about some of the many reasons why format obsolescence has only been a problem for the archaeology of pre-90s IT. Here is a summary from 2011's Are We Facing a "Digital Dark Age?":
In the pre-Web world digital information lived off-line, in media like CD-Rs. The copy-ability needed for media refresh and migration was provided by a reader extrinsic to the medium, such as a CD drive. In the Web world, information lives on-line. Copy-ability is intrinsic to on-line media; media migration is routine but insignificant. The details of the storage device currently in use may change at any time without affecting reader's access. Access is mediated not by physical devices such as CD readers but by network protocols such as TCP/IP and HTTP. These are the most stable parts of cyberspace. Changing them in incompatible ways is effectively impossible; even changing them in compatible ways is extremely hard. This is because they are embedded in huge numbers of software products which it is impossible to update synchronously and whose function is essential to the Internet's operation.
Research by the BL's Andy Jackson and INA's Matt Holden (discussed here and here) shows that Web formats are extremely slow to change and backwards compatibility is generally well-maintained. Even if it isn't, Ilya Kreymer's oldweb.today shows that accessing preserved Web content with a contemporaneous Web browser is easy.
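To make the protocol point concrete, here is a minimal sketch (in Python, assuming the Internet Archive's Wayback Machine "availability" endpoint as one example of a Memento-style archive; the specific API is an illustrative assumption, not something either post discusses) of how a reader today reaches decades-old preserved Web content with nothing more than an ordinary HTTP request:

import json
import urllib.parse
import urllib.request

# The point of this sketch: access to preserved Web content is mediated by
# stable protocols (TCP/IP, HTTP), not by any particular storage medium.
# The archive.org "availability" endpoint used here is an assumption chosen
# for illustration; any Memento-style Web archive works the same way.

def nearest_snapshot(url, timestamp="20080101"):
    """Return the URL of the archived snapshot closest to `timestamp`, or None."""
    query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
    api = "https://archive.org/wayback/available?" + query
    with urllib.request.urlopen(api) as response:
        data = json.load(response)
    snapshot = data.get("archived_snapshots", {}).get("closest")
    return snapshot["url"] if snapshot and snapshot.get("available") else None

if __name__ == "__main__":
    # Fetching a page captured years ago uses the same ordinary HTTP request a
    # browser makes today; no format migration is required of the reader.
    print(nearest_snapshot("example.com"))

Whether the page renders exactly as it did when captured is a separate question, which is where tools like oldweb.today come in; the retrieval itself depends only on HTTP remaining backwards compatible which, as argued above, it does.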

[Chart: British Library budget, in year-2000 £]
As I keep saying, for example here, the fundamental problem of digital preservation is economic; we know how to keep digital content safe for the long term, we just don't want to pay enough to have it done. The budgets of society's memory institutions - libraries, museums and archives - have been under sustained pressure for many years. For each of them, caring for their irreplaceable legacy treasures must take priority in their shrinking budget over caring for digitized access surrogates, which are, after all, replaceable if the originals survive. It may be expensive to re-digitize them with, presumably, better technology in the future. But it can't be done at all if the originals succumb to a budget crisis.
