Tuesday, December 30, 2008

Persistence of Poor Peer Reviewing

Another thing I've been doing during the hiatus is serving as a judge for Elsevier's Grand Challenge. Anita de Waard and her colleagues at Elsevier's research labs set up this competition with a substantial prize for the best demonstration of what could be done to improve science and scientific communication given unfettered access to Elsevier's vast database of publications. I think the reason I'm on the panel of judges is that after Anita's talk at the Spring CNI she and I had an interesting discussion. The talk described her team's work to extract information from full-text articles to help authors and readers. I asked who was building tools to help reviewers and make their reviews better. This is a bête noire of mine, both because I find doing reviews really hard work, and because I think the quality of reviews (including mine) is really poor. For an ironic example of the problem, follow me below the fold.

Sunday, December 28, 2008

Foot, meet bullet

The gap in posting since March was caused by a bad bout of RSI in my hands. I believe it was triggered by the truly terrible ergonomics of the mouse buttons on the first-generation Asus EEE (which otherwise fully justifies its reputation as a game-changing product). It took a long time to recover and even longer to catch up with all the work I couldn't do when I couldn't type for more than a few minutes.

One achievement during this enforced hiatus was to turn my series of posts on A Petabyte for a Century into a paper entitled Bit Preservation: A Solved Problem? (190KB PDF) and present it at the iPRES 2008 conference last September at the British Library.

I also attended the 4th International Digital Curation Conference in Edinburgh. As usual these days, for obvious reasons, sustainability was at the top of the agenda. Brian Lavoie of OCLC talked (461KB .ppt) about the work of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access which he co-chairs with Fran Berman of the San Diego Supercomputer Center. NSF, the Andrew W. Mellon Foundation and others are sponsoring this effort; the LOCKSS team have presented to the Task Force. Their interim report has just been released.

Listening to Brian talk about the need to persuade funding organizations of the value of digital preservation efforts, I came to understand the extent to which the field has shot itself in the foot through its tendency to present simply preserving the bits as a trivial, solved problem.

The activities that the funders are told they need to support are curation-focused, such as generating metadata to prepare for possible format obsolescence, and helping future readers find the content. The problem is that, as a result, the funders see a view of the future in which, even if they do nothing, the bits will survive. There might be problems in the distant future if formats go obsolete, but there might not be. There might be problems finding content in the future, but there might not be. After all, funders might think, if the bits survive and Google can index them, how much worse than the current state could things be? Why should they pour money into activities intended to enhance the data? The future can figure out what to do with the bits when it needs them; they'll be there whatever happens.

A more realistic view of the world, as I showed in my iPRES paper, is that there are huge volumes of data that need to be preserved, that simply storing a few copies of all of it costs more than we can currently afford, and that even if we spend enough to use the best available technology we cannot be sure the bits will be safe. If the funders were instead told that, unless they provide funds now, important information will gradually be lost, they might be scared into actually doing something.

Thursday, March 6, 2008

More bad news on storage reliability

Last year's FAST conference had depressing news for those who think that bits are safe in modern storage systems, with two papers showing that disks in use in large-scale storage facilities are much less reliable than the manufacturers claim, and a keynote (PDF) reporting that errors in file system code are endemic. This year's FAST had more sobering news. I'll return to these papers in more detail, but here are the take-away messages.

Jiang et al from UIUC and NetApp took a detailed look at the various subsystems in modern storage systems, showing that 45-75% of the apparent disk unreliability in last year's papers is probably due to the unreliability of other components in the storage system, and that the correlations between errors are even worse than last year's papers suggested.

Gunawi et al from Wisconsin analyzed one of the root causes of the incorrect response of file systems to errors in the underlying storage reported in earlier papers from Wisconsin (PDF) and Stanford (PDF), namely the way file systems propagate reported errors between functions and modules, showing that correct handling of these errors is so hard that implementors often throw up their hands.

Bairavasundaram et al from Wisconsin, NetApp and Toronto presented a massive study of silent data corruption in storage systems, reinforcing the earlier study (PDF) from CERN in showing an alarming incidence of both these errors and of correlations between them.

Krioukov et al from Wisconsin and NetApp analyzed the techniques RAID-based storage systems use to tolerate silent data corruption, showing that in various ways all current systems fall short of an adequate solution to this problem.

Greenan and Wylie (PDF) from HP Labs gave a work-in-progress presentation showing that the Markov models which are pretty much the exclusive technique for analyzing failures in storage systems give results that are systematically optimistic because they depend on assumptions that are known to be untrue.
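To make the concern concrete, here is a minimal sketch of the kind of Markov model they are criticizing. It is my own illustration, not Greenan and Wylie's, and the failure and repair rates are made-up numbers; the closed-form answer is only valid if failures and repairs are independent and exponentially distributed, which is exactly the kind of assumption known to be untrue.

```python
# Toy Markov-model MTTDL estimate for a two-disk mirror (RAID-1).
# States: 2 = both disks good, 1 = one disk failed (rebuilding), 0 = data loss.
# Transitions: 2 -> 1 at rate 2*lam, 1 -> 2 at rate mu, 1 -> 0 at rate lam.
# Closed form for mean time to data loss starting in state 2:
#   MTTDL = (3*lam + mu) / (2 * lam**2)
# The model assumes independent, exponentially distributed failures and repairs,
# with no correlated or latent errors. The numbers below are illustrative only.

MTTF_HOURS = 1_000_000      # assumed per-disk mean time to failure
MTTR_HOURS = 24             # assumed rebuild time after a failure

lam = 1.0 / MTTF_HOURS      # per-disk failure rate (exponential assumption)
mu = 1.0 / MTTR_HOURS       # repair rate (exponential assumption)

mttdl_hours = (3 * lam + mu) / (2 * lam ** 2)
print(f"Markov MTTDL: {mttdl_hours / 8766:.1e} years")  # 8766 hours ~ 1 year

# Violating either assumption (e.g. correlated failures during rebuild, or
# Weibull-distributed infant mortality) makes the real mean time to data loss
# much lower than this optimistic figure.
```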

Considerable kudos is due to NetApp for their many contributions to both data and analysis and to the University of Wisconsin, which is building an impressive track record in this important area.

Thursday, January 31, 2008

Lawyers and Mass-Market Scholarly Communication

There's some evidence that lawyers are even faster than scientists to adopt mass-market scholarly communication. This is not surprising, given the number of lawyers who blog (or rather blawg). In a post at the outstanding law blog Balkinization Jack Balkin writes:
Orin Kerr found that law review citations to our friends at the Volokh Conspiracy have been increasing significantly over the years. Using the same methodology (citations to balkin.blogspot.com in Westlaw JLR database limited to each year) I discovered the same thing is true of Balkinization. In 2003 we received 1 cite; in 2004 3 cites; in 2005 14 cites; in 2006 36 cites; and in 2007 49 cites. As Orin reminds us, some law journals have not yet published all their 2007 issues, so the final number for 2007 may be slightly higher.

These results suggest that blogging has become a more widespread and accepted practice in the legal academy. It's important to remember that people cite for many different reasons: to give credit for ideas, to criticize ideas, and as (persuasive) authority. My guess is that most of these citations fall into the first two categories, but that is true of many citations to law review articles as well.

On the Law Librarian Blog, Joe Hodnicki provides more evidence for this.
This built-in undercount does not diminish from the fact that the below statistics do give a sense of the magnitude of the growth rate of blog citations. According to this estimate, blog citations in law reviews and court opinions have grown from about 70 in 2004 to over 500 in 2007 (and still counting since many law reviews have not completed their 2007 publishing cycle). I believe it is fair to say that for 2005 and 2006 blog citations probably grew exponentially on a document count basis, doubling each year.

It is unlikely, however, that any final count for 2007 will show a similar rate of growth. If the case, would this mean that blogs are "on the decline." Doubtful. It would simply mean that the blogging phenomenon is maturing. As with other forms of publication, with age comes acceptance and recognition of place within the structure of legal literature.

Wednesday, January 30, 2008

Does Preserving Context Matter?

As a Londoner, I really appreciate the way The Register brings some of the great traditions of Fleet Street to technology. In a column that appeared there just before Christmas, Guy Kewney asks his version of Provost O'Donnell's question, "Who's archiving IT's history?", and raises the important issue of whether researchers need only the "intellectual content" to survive, or whether they need the context in which it originally appeared.

There is now an unusual opportunity to discuss this issue, because the same content has been preserved both by a technique that preserves the context and by one that does not, and both versions have been made available in the wake of a trigger event. Some people, though not everyone, will be able to draw real comparisons.

Kewney writes:
One of my jobs recently has been to look back into IT history and apply some 20-20 hindsight to events five years ago and ten years ago.
Temporarily unable to get to his library of paper back issues of IT Week for inspiration, he turned to the Internet Archive's Wayback Machine to look back five years at his NewsWireless site:
I won't hear a word against the WayBackMachine. But I will in honesty have to say a few words against it: it's got holes.
What it's good at is holding copies of "That day's edition" just the way a newspaper archive does. I can, for example, go back to NewsWireless by opening up this link; and there, I can find everything that was published on December 6th 2002 - five years ago! - more or less. I can even see that the layout was different, if I look at the story of how NewsWireless installed a rogue wireless access point in the Grand Hotel Palazzo Della Fonte in Fiuggi, ...

Now, have a look at the same story, as it appears on NewsWireless today. The words are there, but it looks nothing like it used to look.

Unusually, NewsWireless does give you the same page you would have seen five years ago. When you're reading the Fiuggi story, the page shows you contemporary news... It's the week's edition, in content at least.

Most websites don't do this.

You can, sometimes, track back a particular five-year-old story (though sadly you'll often find it's been deleted), but if you go to the original site you're likely to find that the page you see is surrounded by modern stories. It's not a five-year-old edition. Take, for example Gordon Laing's Christmas 2002 article ... and you'll find exactly no stories at all relating to Christmas 2002. They were published, yes, but they aren't archived together anywhere - except the WayBackMachine.
Look at the two versions of the Fiuggi story linked from the quote above - although the words are the same the difference is striking. It reveals a lot about the changes in the Web over the past five years.

A much more revealing example than Kewney's is now available. SAGE publishes many academic journals. Some succeed, others fail. One of the failures was Graft: Organ and Cell Transplantation, of which SAGE published three volumes from 2001 to 2003. SAGE participates in both the major e-journal archiving efforts, CLOCKSS and Portico, and both preserve the content of these three volumes. SAGE decided to cease publishing these volumes, and has allowed both CLOCKSS and Portico to trigger the content, i.e. to go through the process each defines for making preserved content available.

The Graft content in CLOCKSS is preserved using LOCKSS technology, which uses the same basic approach as the Internet Archive. The system carefully crawls the e-journal website, collecting the content of every URL that it thinks of as part of the journal. After the trigger event all these collected URLs are reassembled to re-constitute the e-journal website, which is made freely available to all under a Creative Commons license.
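As a rough sketch of that crawl-and-preserve approach (my own simplification, not the actual LOCKSS crawler; the journal URL is hypothetical, and politeness, robots.txt and error handling are omitted), a collector might fetch every page reachable within the journal's URL space and keep the raw bytes keyed by URL, so the site can later be re-served as it originally appeared:

```python
# Minimal sketch of crawl-and-preserve: collect every page under a journal's
# URL prefix and keep the raw bytes keyed by URL. Not the real LOCKSS crawler;
# the start URL is hypothetical.
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

START = "https://journal.example.org/content/vol1/"   # hypothetical journal root

class LinkParser(HTMLParser):
    def __init__(self, base):
        super().__init__()
        self.base, self.links = base, []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urldefrag(urljoin(self.base, value))[0])

def crawl(start, prefix):
    archive, queue, seen = {}, [start], {start}
    while queue:
        url = queue.pop()
        try:
            with urllib.request.urlopen(url) as resp:
                body = resp.read()
        except OSError:
            continue
        archive[url] = body                      # preserve the bytes as served
        parser = LinkParser(url)
        parser.feed(body.decode("utf-8", errors="replace"))
        for link in parser.links:
            if link.startswith(prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
    return archive

if __name__ == "__main__":
    pages = crawl(START, START)
    print(f"collected {len(pages)} URLs")
```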

You can see the result at the CLOCKSS web site. The page at that link is an introduction, but if you follow the links on that page to the Graft volumes, you will be seeing preserved content extracted from the CLOCKSS system via a script that arranges it in a form suitable for Apache to serve. Please read the notes on the introductory page describing ways in which content preserved in this way may surprise you.

The Graft content in Portico is preserved by a technique that aims only to preserve the "intellectual content", not the context. Content is obtained from the publisher as source files, typically the SGML markup used to generate the HTML, PDF and other formats served by the e-journal web site. It undergoes a process of normalization that renders it uniform. In this way the same system at Portico can handle content from many publishers consistently, because the individual differences such as branding have been normalized away. The claim is that this makes the content easier to preserve against the looming crisis of format obsolescence. It does, however, mean that the eventual reader sees the "intellectual content" as published by Portico's system now, not as originally published by SAGE's system. Since the trigger event, readers at institutions which subscribe to Portico can see this version of Graft for themselves. Stanford isn't a subscriber, so I can't see it; I'd be interested in comments from those who can make the comparison.
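To illustrate the general idea of normalization (a toy sketch only; this is not Portico's actual pipeline, and the element names and mapping are invented), the process maps each publisher's markup onto a single neutral schema, discarding branding and presentation along the way:

```python
# Toy illustration of "normalization": map publisher-specific markup onto a
# single neutral schema so that one system can handle content from many
# publishers uniformly. The tag names and mapping below are invented.
import xml.etree.ElementTree as ET

# Hypothetical mapping from one publisher's element names to a neutral schema.
PUBLISHER_TO_NEUTRAL = {
    "artTitle": "title",
    "auth": "author",
    "abstractText": "abstract",
    "bodyMatter": "body",
}

def normalize(publisher_xml: str) -> ET.Element:
    src = ET.fromstring(publisher_xml)
    article = ET.Element("article")
    for child in src:
        neutral_tag = PUBLISHER_TO_NEUTRAL.get(child.tag)
        if neutral_tag:                      # keep only the "intellectual content"
            ET.SubElement(article, neutral_tag).text = (child.text or "").strip()
    return article                           # branding and layout are gone

sample = """<publisherArticle>
  <artTitle>An Example Article</artTitle>
  <auth>A. Researcher</auth>
  <brandingBanner>Journal branding, navigation, ads...</brandingBanner>
  <bodyMatter>...</bodyMatter>
</publisherArticle>"""

print(ET.tostring(normalize(sample), encoding="unicode"))
```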

It is pretty clear that Kewney is on the LOCKSS side of this issue:
Once upon a time, someone offered me all the back numbers of a particular tech magazine I had contributed to. He said: "I don't need it anymore. If I want to search for something I need to know, I Google it."

But what if you don't know you need to know it? What sort of records of the present are we actually keeping? What will historians of the future get to hear about contemporary reactions to stories of the day, without the benefit of hindsight?

Maybe, someone in the British Library ought to be solemnly printing out all the content on every news website every day, and storing them in boxes, labelled by date?
The LOCKSS technology can in some respects do better than that, but in other respects it can't. For example, every reader of a Web page containing advertisements may see a different ad. Printing the page gets one of them; the LOCKSS technology has to exclude the ads. But, as you can see, it does a reasonable job of capturing the context in which the "intellectual content" appeared. Notice, for example, the difference between the headline bar of a typical table of contents page extracted from an Edinburgh University CLOCKSS node and a Stanford University CLOCKSS node. This is an artifact of the two institutions' different subscriptions to SAGE journals.

This isn't a new argument. The most eloquent case for the importance of preserving what the publisher actually published was made by Nicholson Baker in Double Fold: Libraries and the Assault on Paper. He recounts how microfilm vendors convinced librarians of a looming crisis: their collections of newspapers were rapidly decaying, it was urgently necessary to microfilm them or their "intellectual content" would be lost to posterity, and since the microfilm would take up much less space, they would save money in the long run. The looming crisis turned out to be a bonanza for the microfilm companies but a disaster for posterity. Properly handled, newspapers were not decaying; improperly handled, they were. And although properly handled microfilm would not decay, improperly handled it decayed as badly as paper. The process of microfilming itself destroyed both "intellectual content" and context.

I'd urge anyone tempted to believe that the crisis of format obsolescence looms so menacingly that it can be solved only through the magic of "normalization" to read Nicholson Baker.

Sunday, January 20, 2008

How Hard Is "A Petabyte for a Century"?

In a comment on my "Petabyte for a Century" post Chris Rusbridge argues that, because his machine holds 100GB of data and he expects much better than a 50% chance of it surviving undamaged for a year (which would correspond to a bit half-life of about 100 times the age of the universe), the Petabyte for a Century challenge is not a big deal.

It is true that disk, tape and other media are remarkably reliable, and that we can not merely construct systems with a bit half-life of the order of 100 times the age of the universe, but also conduct experiments showing that we have done so. Watching a terabyte of data for a year is clearly a feasible experiment, and at a bit half-life of 100 times the age of the universe one would expect to see about 5 bit flips.
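For the record, here is the back-of-the-envelope arithmetic behind that figure; it is a sketch, and the exact result depends on the values taken for the age of the universe and the size of a terabyte:

```python
# Back-of-the-envelope check of the "handful of bit flips per terabyte-year"
# figure. The result depends on the constants chosen; these are illustrative.
import math

BITS = 8e12                                  # 1 terabyte = 8 x 10^12 bits
AGE_OF_UNIVERSE_YEARS = 1.37e10              # ~13.7 billion years
half_life_years = 100 * AGE_OF_UNIVERSE_YEARS

# With exponential decay, the chance a given bit flips within one year is
# 1 - 2**(-t/half_life), which for t << half_life is about ln(2)*t/half_life.
p_flip_per_year = math.log(2) / half_life_years

expected_flips = BITS * p_flip_per_year
print(f"expected flips in a year: {expected_flips:.1f}")   # roughly 4-5
```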

Nevertheless, it is important to note that this is an experiment very few people actually do. Does Chris maintain checksums of every bit of his 100GB? Does he check them regularly? How certain is he that at the end of the year every single bit is the same as it was at the start? I suspect Chris assumes that, because he has 100GB of data, most of it is over a year old, and he hasn't noticed anything bad, the problem isn't that hard. Even if all these assumptions were correct, the petabyte-for-a-century problem is one million times harder. Chris' argument amounts to saying "I have a problem one-millionth the size of the big one, and because I haven't looked very carefully I believe that it is solved. So the big problem isn't scary after all."
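Actually doing the experiment is not conceptually hard; a minimal sketch of what "maintaining checksums and checking them regularly" involves might look like the following. This is my own illustration; the manifest filename is arbitrary, and a real system would also have to protect the manifest itself against corruption.

```python
# Minimal sketch of the experiment almost nobody runs: record a checksum for
# every file, then re-verify on a schedule and report any file whose bits have
# silently changed.
import hashlib, json, os, sys

MANIFEST = "checksums.json"   # arbitrary location for the stored digests

def sha256(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def files_under(root):
    for dirpath, _, names in os.walk(root):
        for name in names:
            yield os.path.join(dirpath, name)

def record(root):
    with open(MANIFEST, "w") as out:
        json.dump({p: sha256(p) for p in files_under(root)}, out, indent=1)

def verify():
    with open(MANIFEST) as f:
        expected = json.load(f)
    for path, digest in expected.items():
        if not os.path.exists(path):
            print("MISSING", path)
        elif sha256(path) != digest:
            print("CORRUPT", path)    # the bits are no longer what they were

if __name__ == "__main__":
    # usage: python bitcheck.py record <directory>   or   python bitcheck.py verify
    if sys.argv[1] == "record":
        record(sys.argv[2])
    else:
        verify()
```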

The few people who have actually measured silent data corruption in large operational data storage systems have reported depressing results. For example, the excellent work at CERN described in a paper (pdf) and summarized at StorageMojo showed that the error rate delivered to applications from a state-of-the-art storage farm is of the order of ten million times worse than the quoted bit error rate of the disks it uses.

We know that assembling large numbers of components into a system normally results in a system much less reliable than the components. And we have evidence from CERN, Google and elsewhere (pdf) that this is what actually happens when you assemble large numbers of disks, controllers, busses, memories and CPUs into a storage system. And we know that these systems contain large amounts of software which contains large (pdf) amounts (pdf) of bugs. And we know that it is economically and logistically impossible to do the experiments that would be needed to certify a system as delivering a bit error rate low enough to provide a 50% probability of keeping a petabyte uncorrupted for a century.
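A one-line calculation shows why assembling many components degrades reliability so quickly; the per-component figure below is made up purely for illustration:

```python
# Illustrative only: if stored data is intact only when every one of n
# components behaved correctly, per-component reliability compounds away fast.
per_component_reliability = 0.9999     # 99.99% chance a component does its job
for n in (10, 100, 1000, 10000):
    system = per_component_reliability ** n
    print(f"{n:>6} components -> system reliability {system:.3f}")
# 10000 nominally excellent components leave only ~37% chance of no failure,
# before counting correlated failures, firmware bugs, or operator error.
```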

The basic point I was making was that even if we ignore all the evidence that we can't, and assume that we could actually build a system reliable enough to preserve a petabyte for a century, we could not prove that we had done so. No matter how easy or hard you think a problem is, if it is impossible to prove that you have solved it, scepticism about proposed solutions is inevitable.

Friday, January 18, 2008

Digital Preservation for the "Google Generation"

A study by researchers at University College London sponsored by the British Library and JISC supports a point I've been making since the start of the LOCKSS program nearly a decade ago.
"The report Information Behaviour of the Researcher of the Future (PDF format; 1.67MB) also shows that research-behaviour traits that are commonly associated with younger users – impatience in search and navigation, and zero tolerance for any delay in satisfying their information needs – are now becoming the norm for all age-groups, from younger pupils and undergraduates through to professors."

What this means for digital preservation is that transparency of access is essential. Readers don't have the patience and attention span to jump through hoops to obtain access to preserved materials. The Web is training users that if they click on a link and nothing happens within about 10 seconds, they should forget that link and click elsewhere. These few seconds are all a digital preservation system has to satisfy its readers.

If preserved materials are not instantly available through their normal finding techniques, primarily search engines such as Google, they will not be used. We made this observation in late 1998 during the initial design of the LOCKSS system. It motivated us to make access to preserved content completely transparent, by having an institution's LOCKSS box behave as a persistent Web cache. Content thus remains instantly available at its original URL. From our 2000 Freenix paper (pdf):
"Unless links to pages continue to resolve, the material will effectively be lost because no-one will have the knowledge or patience to retrieve it."

Dark archives are thus not useful to the general readership. Although they may provide useful insurance for the content, their complex and time-consuming access methods mean that readers will require a separate access copy. For small collections the cost of an extra copy is insignificant, but for collections big enough that the cost of storage matters, it is a problem.

Thursday, January 3, 2008

Format Obsolescence: Right Here Right Now?

time961 at Slashdot reports that:
In Service Pack 3 for Office 2003, Microsoft disabled support for many older file formats. If you have old Word, Excel, 1-2-3, Quattro, or Corel Draw documents, watch out!

Is this yet another format obsolescence horror story of the kind I discussed in an earlier post? Follow me below the fold for reassurance; this story is less scary than it seems.

The field of digital preservation has been heavily focussed on the problem of format obsolescence, paying little attention to the vast range of other threats to which digital content is vulnerable. I have long argued that the reason is that most people's experience of format obsolescence is heavily skewed; it comes from Microsoft's Office suite. Microsoft's business model depends almost entirely on driving its customers endlessly around the upgrade cycle, extracting more money from their existing customer base each time around the loop. They do this by deliberately introducing gratuitous format obsolescence. In the comments LuckyLuke58 makes my point:
Doubt it's really about security at all; I'm guessing it's probably more about 'nudging' the few people still using old versions of the software to upgrade: Those who currently exchange documents with users on newer versions will find suddenly they won't be able to send documents to anyone anymore without getting complaints that people can't open them. Deliberately making it too cumbersome and complex for most people to ever work around this, i.e. leaving it technically (but not really practically for almost everyone) an option, for now at least gives MS an excuse, while still taking a big step towards getting rid of support for those old formats entirely, which is not all that unreasonable I suppose for formats greater than 10 years old.


LuckyLuke58 gets the basic idea right. New instances of Office entering the installed base are set up to save documents in a format that older versions cannot understand. The only way to maintain compatibility is to use a deliberately awkward sequence of commands and ignore warnings. Every time someone with an older version gets one of these new documents, they get a forceful reminder of why they need to spend the money to get upgraded. This works particularly well in organizations, where the people with the power tend to have their computers upgraded most frequently. Telling your boss that you need an upgrade in order to read the documents he's sending you makes it hard for him to deny the request.

Because almost everyone encounters this kind of deliberate format obsolescence regularly, and because the cure for it (buy a more recent version of Office) is essentially forced upon them, they make two natural assumptions:

  • Format obsolescence happens when the software vendor says it does.

  • Format obsolescence happens frequently and regularly to all formats.


Both of these are wrong. The fact that Microsoft has ended support for old formats does not mean they can no longer be read. It just means that you can't use up-to-date versions of Microsoft's tools to read them. Microsoft's announcement hasn't magically removed the support for these formats from any preserved binaries of the pre-upgrade tools, and these can be run using emulation. The Open Source tools that support these formats still work (see my post on Format Obsolescence: Scenarios).
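For instance (a sketch only: the directory names are hypothetical, and the exact command-line options can vary between LibreOffice versions), the surviving open-source tools can be scripted to batch-convert old binary documents into a current format:

```python
# Sketch of one escape route from Microsoft-only formats: drive an open-source
# office suite (LibreOffice here) to batch-convert old binary documents.
# Directory names are hypothetical; command-line flags may vary by version.
import pathlib
import subprocess

OLD_DOCS = pathlib.Path("old_documents")      # hypothetical input directory
CONVERTED = pathlib.Path("converted")         # hypothetical output directory
CONVERTED.mkdir(exist_ok=True)

for doc in OLD_DOCS.glob("*.doc"):
    subprocess.run(
        ["libreoffice", "--headless",
         "--convert-to", "odt",               # OpenDocument Text output
         "--outdir", str(CONVERTED), str(doc)],
        check=True,
    )
    print("converted", doc.name)
```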

Even if Microsoft did have magic powers to tamper with old binaries and source, it is pretty much only Microsoft formats that are subject to rapid gratuitous format obsolescence. A business model dependent on driving the upgrade cycle to extract money from existing customers is available only to a monopolist; everyone else needs to attract new customers, and a reputation for frequent format obsolescence isn't a good way to do that. In fact, it isn't even a good way to keep existing customers. The formidable resistance Microsoft has encountered in trying to "standardize" OOXML in a way that allows them to continue to use proprietary lock-in and gratuitous format obsolescence to milk their customer base shows that even a monopolist's customers will eventually reach their pain threshold.

At first glance, this announcement from Microsoft appears to support those who think dealing with format obsolescence is the be-all and end-all of digital preservation. But it doesn't. Content preserved in these formats can still be rendered, and converted to more modern formats, using easily available tools. There are at least two ways to do this: open source tools, and emulated environments running preserved Microsoft tools. It's hard to construct a scenario in which either would stop working in the foreseeable future. And what has happened is not typical of formats in general; it's typical of Microsoft. It is true that access to older content will be a little less convenient, but that is exactly what Microsoft is trying to achieve.