Sunday, December 30, 2007

Mass-market scholarly communication revisited

Again, I need to apologize for the gap in posting. It turns out that I'm not good at combining travel and blogging, and I've been doing a lot of travel. One of my trips was to DC for CNI and the 3rd International Digital Curation Conference. One of the highlights was a fascinating talk (abstract) by Prof. Carole Goble of the University of Manchester's School of Computer Science. She's a great speaker, with a vivid turn of phrase, and you have to like a talk about science on the web in which a major example is a shoe shopping site.

Carole and her team work on enabling in silico experiments by using workflows to compose Web and other services. Their myGrid site uses their Taverna workflow system to provide scientists with access to over 3000 services. Their myExperiment "scientists' social network and Virtual Research Environment" encourages community-wide sharing and curation of workflows.

Two things I found really interesting about Carole's talk were:

  • myExperiment is an implementation of the ideas I discussed in my post on Mass-Market Scholarly Communication, enhanced with the concepts of workflows.

  • The emerging world of web services is the big challenge facing digital preservation. Her talk was a wonderful illustration both of why this is an important problem, in that much of readers' experience of the Web is already mediated by services, and of why the barriers to preserving those services are almost insurmountable.

Carole's talk was like a high-speed recapitulation of the history of the Web, with workflows taking the place of pages. More generally, it was an instance of the way Web 2.0 evolution is like Web 1.0 evolution with services instead of static content. Carole described how scientists discovered that they could link together services (pages) using workflows (links). There were soon so many services that directory sites arose (think the early Yahoo!). Then there were so many of those that search engines arose. Then enough time elapsed that people started noticing that workflows (links) decayed quite rapidly. There was, however, one important piece of Web 1.0 missing from her presentation - advertising. Follow me below the jump for an explanation of why this omission is important and some suggestions about what can be done to remedy it.

Advertising has played a much underestimated role in driving the evolution of the Web. It provides direct, almost instantaneous, feedback that rewards successful mutations. Because the feedback is monetary, it ensures that successful mutations will get the resources they need to thrive, and unsuccessful ones will not. We take for granted that a Web site that attracts a large readership will be able to sustain itself and grow, but the only reason this is so is because of advertising. Daily Kos, a political blogging site, was started on a shoestring in 2002 and hit a million page views a month after its first year. It now gets 15 million a month, consumes a respectable-size server farm and employs a growing staff. Darwin would recognize this process instantly.

Despite a 1998 harangue from Tim Berners-Lee, research showed that pages had a half-life of a month or two, and that even links in academic journals decayed rapidly. But that was before advertising effects had been felt. Now, it may still be true that pages have a short half-life. But pages that the readership judges to be important, and which thus bring in advertising dollars, do not decay rapidly. Nor do the links that bring traffic to them. Site administrators have been schooled by advertising's reward and punishment system, and they know that gratuitously moving pages breaks links and impairs search engine rankings, which decreases income. So the problem of decaying links has been solved, not by persistent URL technology but by rewarding good behavior and punishing bad behavior.
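For a sense of scale, a half-life implies exponential decay. This sketch takes only the two-month half-life figure from the research above; everything else is an idealized model:

```python
def surviving_fraction(months, half_life_months=2.0):
    """Fraction of links still resolving after `months`, assuming
    exponential decay with the given half-life (an idealized model)."""
    return 0.5 ** (months / half_life_months)

# With a two-month half-life, under 2% of links survive a year:
for m in (2, 6, 12):
    print(m, round(surviving_fraction(m), 4))
```

Of course, the point of the argument above is that decay is not uniform: the advertising feedback loop keeps the valuable links alive far longer than this simple model predicts.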

Although the analogy between Web 1.0 pages and Web 2.0 workflows was apparent from Carole's talk, there is one big difference. Web 1.0 pages are read by people advertisers will pay to reach. Workflows are read by programs, whose discretionary spending is zero. There is thus no effective mechanism in the workflow world rewarding good behavior and punishing bad. Suppose that putting a service on-line that attracted a large workflow-ership rapidly caused a flow of money to arrive at the site hosting the service, sufficient to sustain and grow it. Many of the problems currently plaguing workflows would vanish overnight. Site maintainers would find, for example, that a poor availability record or non-upwards-compatible changes to their site's API would rapidly reduce their income, and being the smart young scientists they are, they would learn not to do these things.

Funding agencies and others interested in the progress of e-science need an equivalent of advertising to drive the evolution of services and workflows. Without it the field will continue to be plagued by poor performance, fragility, unreliability and instability, with much effort being wasted. More critically, one major key to scientific progress is the requirement that experiments be replicable by later researchers. The current workflow environment makes it almost impossible to replicate experiments after even a short period, no matter how well they were published.

What key aspects of Web advertising are needed in a system to drive the evolution of scientific workflows?

  • It must provide money directly to the maintainers of the services and workflows.

  • The amount of money must be based on automated, third-party measures of usage (think AdSense or Doubleclick or even SiteMeter for scientific workflows) and importance (think PageRank for scientific workflows).

  • The cycle time of the reward system must match the Web, not the research funding process. myExperiment has been in public beta less than six months. In that time it has evolved significantly. A feedback process that involves writing grant proposals, having them peer-reviewed, and processed through an annual budget cycle is far too slow to have any effect on the evolution of a workflow environment.
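An importance measure of the kind suggested above could work like PageRank over the graph of which workflows and services invoke which services. Here is a toy power-iteration sketch; the service names and the dependency graph are invented purely for illustration:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank over a dict mapping node -> list of outbound links.
    Illustrative only; a real 'ServiceRank' would combine this with
    automated usage measures."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new = {node: (1.0 - damping) / n for node in nodes}
        for node, outs in links.items():
            if outs:
                share = damping * rank[node] / len(outs)
                for target in outs:
                    new[target] += share
            else:  # dangling node: spread its rank evenly
                for target in nodes:
                    new[target] += damping * rank[node] / n
        rank = new
    return rank

# Hypothetical dependency graph: each entry lists the services it calls.
services = {
    "align": ["blast", "genbank"],
    "blast": ["genbank"],
    "genbank": [],
}
ranks = pagerank(services)  # "genbank", called by everything, ranks highest
```

The design point is that, as with Web pages, importance flows to the services that many workflows depend on, which is exactly where the money should flow too.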

Funders need to put some infrastructure money into a pot that is doled out automatically via these measures. Doing so will pay great benefits in scientific productivity.

Saturday, October 13, 2007

Who's looking after the snowman?

In a post to the liblicense mailing list James O'Donnell, Provost of Georgetown University, asks:

"So when I read ten years from now about this cool debate among Democratic candidates that featured video questions from goofy but serious viewers, including a snowman concerned about global warming, and people were watching it on YouTube for weeks afterwards: how will I find it? Who's looking after the snowman?"

This is an important question. Clearly, future scholars will not be able to understand the upcoming election without access to YouTube videos, blog posts and other ephemera. In this particular case, I believe there are both business and technical reasons why Provost O'Donnell can feel somewhat reassured, and legal and library reasons why he should not. Follow me below the fold for the details.

Here is the pre-debate version of the snowman's video, and here are the candidates' responses. CNN, which broadcast the debate, has the coverage here. As far as I can tell the Internet Archive doesn't collect videos like these.

From a business point of view, YouTube videos are a business asset of Google, and will thus be preserved with more than reasonable care and attention. As I argued here, content owned by major publishing corporations (which group now includes Google) is at very low risk of accidental loss; the low and rapidly decreasing cost per byte of storage makes the business decision to keep it available rather than take it down a no-brainer. And that is ignoring the other aspects of the Web's Long Tail economics which mean that the bulk of the revenue comes from the less popular content.

Technically, YouTube video is Flash Video. It can easily be downloaded, for example by this website. The content is in a widely used web format that has an open-source player, in this case at least two (MPlayer and VLC). It is thus perfectly feasible to preserve it, and for the reasons I describe here the open source players make it extraordinarily unlikely that it would not be possible to play the video in 10, or even 30 years. If someone collects the video from YouTube and preserves the bits, it is highly likely that the bits will be viewable indefinitely.

But, will anyone other than Google actually collect and preserve the bits? Provost O'Donnell's library might want to do so, but the state of copyright law places some tricky legal obstacles in the way. Under the DMCA, preserving a copy of copyright content requires the copyright owner's permission. Although I heard rumors that CNN would release video of the debate under a Creative Commons license, on their website there is a normal "All Rights Reserved" copyright notice. And on YouTube, there is no indication of the copyright status of the videos. A library downloading the videos would have to assume it didn't have permission to preserve them. It could follow the example of the Internet Archive and depend on the "safe harbor" provision, complying with any "takedown letters" by removing them. This is a sensible approach for the Internet Archive, which aims to be a large sample of the Web, but not for the kind of focused collections Provost O'Donnell has in mind.

The DMCA, patents and other IP restrictions place another obstacle in the way. I verified that an up-to-date Ubuntu Linux system using the Totem media player plays downloaded copies of YouTube videos very happily. Totem uses the GStreamer media framework with plugins for specific media. Playing the YouTube videos used the FFmpeg library. As with all software, it is possible that some patent holder might claim that it violated their patents, or that in some way it could be viewed as evading some content protection mechanism as defined by the DMCA. As with all open source software, there is no indemnity from a vendor against such claims. Media formats are so notorious for such patent claims that Ubuntu segregates many media plugins into separate classes and provides warnings during the install process that the user may be straying into a legal gray area. The uncertainty surrounding the legal status is carefully cultivated by many players in the media market, as it increases the returns they may expect from what are, in many cases, very weak patents and content protection mechanisms. Many libraries judge that the value of the content they would like to preserve doesn't justify the legal risks of preserving it.

Tuesday, October 9, 2007

Workshop on Preserving Government Information

Here is an announcement of a workshop in Oxford on Preserving & Archiving Government Information. Alas, our invitation arrived too late to be accepted, but the results should be interesting. It's sponsored by the Portuguese Management Centre for an e-Government Network (CEGER). Portugal's recent history of dictatorship tends to give them a realistic view of government information policies.

Wednesday, October 3, 2007

Update on Preserving the Record

In my post Why Preserve E-Journals? To Preserve the Record I used the example of government documents to illustrate why trusting web publishers to maintain an accurate record is fraught with dangers. The temptation to mount an "insider attack" to make the record less inconvenient or embarrassing is too much to resist.

Below the fold I report on two more examples, one from the paper world and one from the pre-web electronic world, showing the value of a tamper-evident record.

For the first example I'm indebted to Prof. Jeanine Pariser Plottel of Hunter College, who has compared the pre- and post-WWII editions of books published by right-wing authors in France and shown that the (right-wing) publishers sanitized the post-WWII editions to remove much of the anti-semitic rhetoric. Note that this analysis was possible only because the pre-WWII editions survived in libraries and private collections. They were widely distributed on durable, reasonably tamper-evident media. They survived invasion, occupation, counter-invasion and social disruption. It would have been futile for the publishers to claim that the pre-WWII editions had somehow been faked after the war to discredit the right. Prof. Plottel points to two examples of "common practice":

1. The books of Robert Brasillach (who was executed), edited by his brother-in-law Maurice Bardèche, Professor of 19th-Century French Literature at the Sorbonne during the war and stripped of his post afterwards. The two men published an Histoire du cinéma in 1935. In subsequent editions published several times after the war beginning in 1947, the term "fascism" is replaced by "anti-communisme."

2. Lucien Rebatet's Les décombres (1942) was one of the best-sellers of the Occupation, and it is virulently anti-Semitic. A new expurgated version was later published under the title Mémoire d'un fasciste. Who was Rebatet? you ask. Relegated to oblivion, I hope. Still, you may remember Truffaut's film, Le dernier métro (wonderful and worth seeing, if you haven't). The character Daxiat is modeled upon Rebatet.

In a web-only world it would have been much easier for the publishers to sanitize history. Multiple libraries keeping copies of the original editions would have been difficult under the DMCA. It must be doubtful whether the library copies would have survived the war. The publisher's changes would likely have remained undetected. Had they been detected the critics would have been much easier to discredit.

The second example is here. This fascinating paper is based on Will Crowther's original source code for ADVENT, the pioneering work of interactive fiction that became, with help from Don Woods, the popular Adventure game. The author, Dennis Jerz, shows that the original was based closely on a real cave, part of Kentucky's Colossal Cave system. This observation was obscured by Don Woods' later improvements.

As the swift and comprehensive debunking of the allegations in SCO vs. IBM shows, archaeology of this kind for Open Source software is now routine and effective. This is because the code is preserved in third-party archives which use Source Code Control systems derived from Marc Rochkind's 1972 SCCS, and provide a somewhat tamper-evident record. Although Jerz shows Crowther's original ADVENT dates from the 1975-6 academic year, SCCS had yet to become widely used outside Bell Labs, and the technology needed for third-party repositories was a decade in the future. Jerz's work depended on Stanford's ability to recover data from backups of Don Woods' student account from 30 years ago; an impressive feat of system administration! Don Woods vouches for the recovered code, so there's no suspicion that it isn't authentic.

How likely is it that other institutions could recover 30-year old student files? Absent such direct testimony, how credible would allegedly recovered student files that old be? Yet they have provided important evidence for the birth of an entire genre of fiction.

Sunday, September 16, 2007

Sorry for the gap in posting

It was caused by some urgent work for the CLOCKSS project, vacation and co-authoring a paper which has just been submitted. The paper is based on some interesting data, but I can't talk about it for now. I hope to have more time to blog in a week or two after some upcoming meetings.

In the meantime, I want to draw attention to some interesting discussion about silent corruption in large databases that relates to my "Petabyte for a Century" post. Here (pdf) are slides from a talk by Peter Kelemen of CERN describing an on-going monitoring program at CERN using fsprobe(8). It randomly probes 4000 of CERN's file systems, writing a known pattern then reading it back looking for corruption. They find a steady flow of 1-3 silent corruptions per day; that is, the data read back doesn't match what was written, and there is no error indication.
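In outline the technique is simple: write a known pattern, read it back, compare. Here is a minimal sketch of the idea; it is not the real fsprobe, which among other things must defeat the page cache (otherwise the read-back may never touch the disk at all) and schedule probes across thousands of filesystems:

```python
import os
import tempfile

PATTERN = b"\xca\xfe\xba\xbe" * 256  # a 1 KiB known pattern

def probe(path):
    """Write the pattern, force it toward disk, read it back.
    Returns True if the data survived intact."""
    with open(path, "wb") as f:
        f.write(PATTERN)
        f.flush()
        os.fsync(f.fileno())  # push past the application buffers
    with open(path, "rb") as f:
        return f.read() == PATTERN  # False would be a silent corruption

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    print("intact" if probe(path) else "CORRUPTED")
finally:
    os.remove(path)
```

Run continuously across a large storage farm, even this crude loop will eventually surface the kind of unsignalled mismatches CERN reports.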

Peter sparked a discussion and a post at KernelTrap. The slides, the discussion and the post are well worth reading, especially if you are among the vast majority who believe that data written to storage will come back undamaged when you need it.

Also, in a development related to my "Mass-market Scholarly Communication" post, researchers at UC's Office of Scholarly Communication released a report that apparently contradicts some of the findings of the UC Berkeley study I referred to. I suspect, without having read the new study, that this might have something to do with the fact that they studied only "ladder-rank" faculty, where the Berkeley team studied a more diverse group.

Thursday, August 9, 2007

The Mote In God's Eye

I gave a "Tech Talk" at Google this week. Writing it, I came up with two analogies that are worth sharing, one based on Larry Niven and Jerry Pournelle's 1974 science fiction classic The Mote In God's Eye and the other on global warming. Warning: spoilers below the fold.

The Mote In God's Eye describes humanity's first encounter with intelligent aliens, called Moties. Motie reproductive physiology locks their society into an unending cycle of over-population, war, societal collapse and gradual recovery. They cannot escape these Cycles, the best they can do is to try to ensure that each collapse starts from a higher level than the one before by preserving the record of their society's knowledge through the collapse to assist the rise of its successor. One technique they use is museums of their technology. As the next war looms, they wrap the museums in the best defenses they have. The Moties have become good enough at preserving their knowledge that the next war will feature lasers capable of sending light-sails to the nearby stars, and the use of asteroids as weapons. The museums are wrapped in spheres of two-meter thick metal, highly polished to reduce the risk from laser attack.

"Horst, this place is fantastic! Museums within museums; it goes back incredibly far - is that the secret? That civilization is very old here? I don't see why you'd hide that."

"You've had a lot of wars," Potter said slowly.

The Motie bobbed her head and shoulder. "Yah."

"Big wars."

"Right. Also little wars."

"How many?"

"God's sake, Potter! Who counts? Thousands of Cycles. Thousands of collapses back to savagery."

One must hope that humanity's problems are less severe than those of the Moties, but it is clear that preserving the record of society's knowledge is, and always has been, important. At first, societies developed specialist bards and storytellers whose job it was to memorize the knowledge and pass it on to succeeding generations. The invention of writing led to the development of libraries full of manuscripts. Most libraries at this stage both collected copies of manuscripts, and also employed scribes to copy them for exchange with other libraries. It took many man-years of work to create a copy, but they were extremely robust. Vellum, papyrus and silk can last a millennium or more.

Printing made copies cheap enough for the consumer market, thereby eliminating the economic justification for libraries to create copies. They were reduced to collecting the mass-market products. But it was much cheaper to run a library, so there were many more of them, and readers' access to information improved greatly. The combination of a fairly durable paper medium and large numbers of copies in library collections made the system remarkably effective at preserving society's knowledge. It has worked this way for about 550 years and, in effect, no-one really had to pay for it. Preservation was just a side-effect of the way readers got access to knowledge; in economic jargon, an externality.

Humanity is only now, arguably much too late, coming to terms with the externalities (global warming, acidification of the oceans and so on) involved in burning fossil fuels. The difficulty is that technological change means that something which was once free (the externality) must now be paid for. And those whose business models benefited most from the free externality (e.g. the fossil fuel industry and its customers) have a natural reluctance to do so. Governments are considering imposing carbon taxes, cap-and-trade schemes or other ways to ensure that the real costs of maintaining a survivable environment are paid. Similarly, the technology industry and in particular highly profitable information providers such as Google, Elsevier and News Corporation are unlikely to fund the necessary two-meter thick metal shells without encouragement.

Sunday, July 15, 2007

Update to "Petabyte for a Century"

In a paper (abstract only) at the Archiving 2007 conference Richard Moore and his co-authors report that the San Diego Supercomputer Center's cost to sustain one disk plus three tape replicas is $3K per terabyte per year. The rapidly decreasing disk media cost is only a small part of this, so that the overall cost is not expected to drop rapidly. Consider our petabyte of data example. Simply keeping it on-line with bare-bones backup, ignoring all access and update costs, will cost $3M per year. The only safe funding mechanism is endowment. Endowing the petabyte at a 7% rate of return is a $43M investment.

There are probably already many fields of study for which the cost of generating a petabyte of useful data is less than $43M. The trend in the cost per byte of generating data is down, in part because of the increased productivity of scholarship based on data rather than directly on experiment. Thus the implied and unacknowledged cost of preserving the data generated may in many cases overwhelm the acknowledged cost of the project that generated it.

Further, if all the data cannot be saved, a curation process is needed to determine what should be saved and add metadata describing (among other things) what has been discarded. This process is notoriously hard to automate, and thus expensive. The curation costs are just as unacknowledged as the storage costs. The only economically feasible thing to do with the data may be to discard it.

An IDC report sponsored by EMC (pdf) estimates that the world created 161 exabytes of data in 2006. Using SDSC's figures it would cost almost half a trillion dollars per year to keep one on-line and three tape backup copies. Endowing this amount of data for long-term preservation would take nearly seven trillion dollars in cash. It's easy to see that a lot of data isn't going to survive.
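The arithmetic behind these figures, using SDSC's $3K per terabyte per year and the 7% rate of return assumed above, can be checked directly:

```python
COST_PER_TB_YEAR = 3_000   # SDSC figure: one disk copy plus three tape replicas
RATE_OF_RETURN = 0.07      # assumed return on the endowment

def preservation_cost(terabytes):
    """Annual storage cost, and the endowment whose returns would cover it."""
    annual = terabytes * COST_PER_TB_YEAR
    return annual, annual / RATE_OF_RETURN

pb_annual, pb_endowment = preservation_cost(1_000)        # one petabyte
ex_annual, ex_endowment = preservation_cost(161_000_000)  # 161 exabytes

print(f"1 PB: ${pb_annual/1e6:.0f}M/year, ${pb_endowment/1e6:.0f}M endowment")
print(f"161 EB: ${ex_annual/1e12:.2f}T/year, ${ex_endowment/1e12:.1f}T endowment")
```

The petabyte works out to $3M per year and roughly a $43M endowment; the world's 2006 output to about $0.48 trillion per year and a $6.9 trillion endowment.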

The original "Petabyte for a Century" post is here.

[Edited to correct broken link]

Friday, July 13, 2007

Update on Post-Cancellation Access

In my June 10 post on post-cancellation access to e-journals I said:
big publishers increasingly treat their content not as separate journals but as a single massive database. Subscription buys access to the whole database. If a library cancels their subscription, they lose access to the whole database. This bundling, or "big deal", leverages a small number of must-have journals to ensure that cancellation of even low-value journals, the vast majority in the bundle, is very unlikely. It is more expensive to subscribe individually to the few high-value journals than to take the "big deal". Thus cancellation of large publisher journals is a low risk, which is the goal of the "big deal" scheme.
On July 5 Elsevier mailed their subscribers about 2008 pricing. The mail confirmed that both individual print journals and individual e-journal subscriptions are history.

On July 6 the Association of Subscription Agents (an interested party) issued a press release that clarified the impact of Elsevier's move:
Libraries face a choice between Science Direct E-Selects (a single journal title purchased in electronic format only on a multiple password basis rather than a site licence), or a Science Direct Standard or Complete package (potentially a somewhat more expensive option but with the virtue of a site licence).
Libraries must now pay the full rate for both the print and the E-Select (electronic) option if they require both formats. This separation of the print from the electronic also leaves European customers open to Value Added Tax on the E-Select version which, depending on the EU country involved, could add substantially to the cost (17.5% in the UK, 19% in Germany for example).
It is clear that Elsevier, at least, is determined to ensure that its customers subscribe only to the electronic "big deal" because, once subscribed, libraries will find cancellation effectively impossible.

Sunday, June 10, 2007

Why Preserve E-Journals? To Preserve The Record

In the previous post in this series I examined one of the two answers to the question "why preserve e-journals?". Libraries subscribing to e-journals want post-cancellation access for their readers to the content they purchased. Given that the problem only arises for subscription journals that don't operate a moving wall, that cancellation is a fairly rare event, and that canceled content can't have been that important anyway (or why was it canceled?), it is hard to see this reason for preservation justifying current levels of investment in e-journal preservation.

I will now turn to the other answer to the question, that is in order to preserve the integrity of the record of scholarship. This is, or should be, a concern not of individual libraries, but of libraries working together in the interests of society as a whole.

To understand what is needed to preserve the integrity of the record of scholarship as it migrates to the digital world, we need to understand the threats to the integrity. The LOCKSS team set out our model of the threats to which digital content is subject in a paper in D-Lib. Briefly, they are:

  • Media Failure or loss.

  • Hardware Failure.

  • Software Failure.

  • Communication Errors.

  • Failure of Network Services.

  • Media and Hardware Obsolescence.

  • Software and Format Obsolescence.

  • Operator Error.

  • Natural Disaster.

  • External Attack.

  • Internal Attack.

  • Economic Failure.

  • Organizational Failure.

All digital content is subject to all these threats, but different threats affect different content differently. The major publishers have been in business a long time, most for well over a century, and Oxford University Press for well over five centuries. They understand that their content is their core business asset, and are highly motivated to ensure that these threats don't destroy its value. Irrespective of external efforts, they already devote significant resources (which ultimately come from their subscribers, the libraries) to preserving these assets. E-journals in their custody are not likely to die of media, hardware, software or format obsolescence or failure. Smaller publishers, whether for-profit or not, as long as they survive the consolidation of the publishing market, will also understand that both their chances of survival and their value as an acquisition depend on maximizing the value of their content, and will be motivated to preserve it.

Yet much of the external money being invested in preserving e-journals targets precisely this low-risk content, from the largest to the medium-sized publishers. The first question librarians ask of an e-journal preservation system is "do you have Elsevier content?" Elsevier has been in business since 1880, longer than many of the libraries asking the question. Their content would not be at significant risk even if copies of it were not, as they actually are, already on-site at about a dozen libraries around the world. This investment is merely adding to the already substantial existing investment from the libraries via the publishers in preserving their content. Are these sensible investment policies?

The content that is at the most risk from these threats is that of the smallest publishers. In the humanities, this is the content that is being preserved by the LOCKSS Humanities Project. This is the content that, were it on paper, research libraries would be collecting to provide future scholars with the material they need to study the culture of our times. The publishers and their content are visibly at risk; among those initially selected by a group of humanities collection specialists at major research libraries in 2002, several disappeared from the Web before the project could collect them, and others have disappeared since. For example, World Haiku Review's content was lost by the journal's hosting platform and had to be recovered from the LOCKSS system.

The e-journals from these tiny publishers are almost all open access. Libraries have no financial incentive, in the form of concern about post-cancellation access, to preserve them. Some of this content is not being collected by the Internet Archive (e.g. Princeton Report on Knowledge) and some is, although the Archive's collection and preservation (pdf) is somewhat haphazard. In many fields communication has moved almost completely to the Web. Poetry, dance, media criticism and many other areas find the Web, with its low cost, multimedia capability and rapid, informal exchanges a much more congenial forum than print. There seem to be three main reasons why research libraries are no longer collecting the record of these fields:

  • Faculty have little awareness of the evanescent nature of Web content.

  • Thus there is little demand on the collections specialists, whose skills have been eroded as the "big deal" lease licenses take over.

  • The "big deal" licenses have consumed more and more of the budget.

Both the LOCKSS system and the Internet Archive's Heritrix crawler provide tools libraries could use, if they wanted to. The provisions of the DMCA mean that more systematic collection and preservation than the Internet Archive can manage requires either a Creative Commons license or specific permission from the publisher. Awareness and use of Creative Commons licenses is limited in these areas. Experience in the LOCKSS Humanities Project has shown that, although permission is always granted, the process of tracking down the publisher, having the necessary conversation, and getting them to add the permission statement to their web site is time-consuming. A small fraction of the resources going into preserving low-risk e-journals from large publishers could preserve a huge amount of high-risk content for future scholars, by encouraging and supporting librarians in using the available tools.

And yet, simply trusting even the biggest publishers to preserve their own content is not a responsible approach. To make this clear, I'll start by looking at the problem of preserving the integrity of the record in a different, easier to explain area, that of US federal government documents.

Since 1813 federal documents have been printed by an executive branch agency, the Government Printing Office (GPO), and distributed to a network of over 1,000 libraries around the US under the Federal Depository Library Program (FDLP). These are University, State and public libraries, and each collects a subset of the documents matching the interests of their readers. The documents remain the property of the US government, but copies may be made as they are not copyright. Under various reciprocal arrangements, many documents are also supplied to the national libraries of other countries.

The goal of the FDLP was to provide citizens with ready access to their government's information. But, even though this wasn't the FDLP's primary purpose, it provided a remarkably effective preservation system. It created a large number of copies of the material to be preserved, the more important the material, the more copies. These copies were on low-cost, durable, write-once, tamper-evident media. They were stored in a large number of independently administered repositories, some in different jurisdictions. They are indexed in such a way that it is easy to find some of the copies, but hard to be sure that you have found them all.

Preserved in this way, the information was protected from most of the threats to which stored information is subject. The FDLP's massive degree of replication protected against media decay, fire, flood, earthquake, and so on. The independent administration of the repositories protected against human error, incompetence and many types of process failures. But, perhaps most important, the system made the record tamper-evident.

Winston Smith in "1984" was "a clerk for the Ministry of Truth, where his job is to rewrite historical documents so that they match the current party line". George Orwell wasn't a prophet. Throughout history, governments of all stripes have found the need to employ Winston Smiths and the US government is no exception. Government documents are routinely recalled from the FDLP, and some are re-issued after alteration.

An illustration is Volume XXVI of Foreign Relations of the United States, the official history of the US State Department. It covers Indonesia, Malaysia, Singapore and the Philippines between 1964 and 1968. It was completed in 1997 and underwent a 4-year review process. Shortly after publication in 2001, the fact that it included official admissions of US complicity in the murder of at least 100,000 Indonesian "communists" by Suharto's forces became an embarrassment, and the CIA attempted to prevent distribution. This effort became public, and was thwarted when the incriminating material was leaked to the National Security Archive and others.

The important property of the FDLP is that in order to suppress or edit the record of government documents, the administration of the day has to write letters, or send US Marshals, to a large number of libraries around the country. It is hard to do this without attracting attention, as happened with Volume XXVI. Attracting attention to the fact that you are attempting to suppress or re-write history is self-defeating. This deters most attempts to do it, and raises the bar of desperation needed to try. It also ensures that, without really extraordinary precautions, even if an attempt succeeds it will not do so without trace. That is what tamper-evident means. It is almost impossible to make the record tamper-proof against the government in power, but the paper FDLP was a very good implementation of a tamper-evident record.

It should have become evident by now that I am using the past tense when describing the FDLP. The program is ending and being replaced by FDSys. This is in effect a single huge web server run by the GPO on which all government documents will be published. The argument is that through the Web citizens have much better and more immediate access to government information than through an FDLP library. That's true, but FDSys is also Winston Smith's dream machine, providing a point-and-click interface to instant history suppression and re-writing.

It isn't just official government documents with which governments may have problems. In recent testimony (pdf) to the House Committee on Oversight and Government Reform entitled "Political Interference with Government Climate Change Science", NASA climate scientist James Hansen described the lengths to which the Bush administration was prepared to go in its attempts to suppress or adjust the speech and writings of scientists. These included suppression or editing of the testimony of individuals to congress, press releases, conference presentations, press interviews and web postings. The administration also used budgetary and public pressure (pdf) to persuade scientists to self-censor papers before submission. Although we have no evidence that the government has changed journal papers after publication, it seems likely that, if this administration thought it could get away with it, it would be tempted to do so.

Just as it seems imprudent for all copies of each element of the record of government information to be in the custody of a single government agency, it seems imprudent for all copies of each element of the record of scholarship to be in the custody of a single publisher. Governments are far from alone in potentially being tempted to tamper with the record. Consider drug and other companies involved in patent and other lawsuits, for example. Arcane details of the published record can be essential evidence in these cases. Just as it would be easy for the government to change information on FDSys, a very small percentage of the lawyers' fees, let alone the potential settlement, would be more than enough to bribe or coerce the system administrators of the publisher's web platform to correct "flaws" in the record. As these systems are currently implemented, the probability that the change would not be detected is overwhelming. Post-publication changes to e-journal content are routine, typically to correct errors such as typos.

What lessons can we take from the FDLP and the paper library system that has protected the record of scholarship until recently?

The key architectural features we need to copy from paper systems to the digital systems are massive replication in many independent repositories that do not trust each other and are implemented and operated transparently. The one architectural feature that is needed in the digital system that the paper system lacks is mutual audit among the replicas. This is necessary because, unlike paper, digital content is stored on media that is easily and tracelessly rewritable, and because technology exists (pdf) that could potentially re-write many of the on-line copies in a short period of time.
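The mutual audit idea can be sketched as a nonce-based majority vote among named replicas. This is a minimal illustration of the principle, not the actual LOCKSS polling protocol; the replica names and simple majority rule are assumptions for the example:

```python
import hashlib
import secrets

def challenge_hash(content: bytes, nonce: bytes) -> str:
    # Hashing a fresh nonce together with the content forces each
    # replica to recompute from its stored bits; a cached hash of the
    # content alone could not answer an unpredictable challenge.
    return hashlib.sha256(nonce + content).hexdigest()

def audit(replicas: dict) -> list:
    """Compare every replica's copy; return those disagreeing with the majority."""
    nonce = secrets.token_bytes(16)
    votes = {name: challenge_hash(data, nonce) for name, data in replicas.items()}
    tally = {}
    for digest in votes.values():
        tally[digest] = tally.get(digest, 0) + 1
    majority = max(tally, key=tally.get)
    return [name for name, digest in votes.items() if digest != majority]

# A tampered replica disagrees with the majority and is flagged for repair.
replicas = {"lib-a": b"original", "lib-b": b"original", "lib-c": b"rewritten"}
print(audit(replicas))  # ['lib-c']
```

Because no replica trusts any other, detection depends only on an attacker being unable to corrupt a majority of the independently administered copies before an audit notices the disagreement.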

To sum up both this and the preceding post, investment in preserving e-journals is going primarily into establishing a very few, centralized, monolithic "trusted third party" repositories, implemented using expensive, proprietary, enterprise-scale technology and incurring high staff costs. These expensive facilities, quite naturally, focus on preserving premium, high-cost e-journals. Doing so seems natural, because a single negotiation with a large publisher brings a vast amount of content, which makes the repository look attractive to the librarians whose serials budget must in the end bear the cost. Although the LOCKSS system's distributed, peer-to-peer architecture has the features needed, market pressures mean that it too is being targeted at low-risk content.

This is a poor use of resources. The investment in technology and staff replicates investments the publishers have already made, and for which the same librarians have already paid via subscription costs. The content being preserved is at very low risk of loss through accident or incompetence. It is even at low risk of cancellation, since these publishers use the "big deal" bundled lease approach to make cancellation extremely painful. The investments are not effective at preventing the record of scholarship from being tampered with, since the "trusted third party" architecture lacks the essential tamper-evidence features of the paper system. The investments are not even particularly effective at ensuring post-cancellation access, since a significant proportion of publishers won't allow the repositories to provide it, not to mention the untested legal and operational obstacles in the path of third-party repositories providing such access. Meanwhile, the content future scholars will need that is actually at serious risk of loss through accident, incompetence or economic failure is not being effectively collected or preserved.

Why Preserve E-Journals? Post-Cancellation Access

Much of the investment currently going into digital preservation concentrates on preserving e-journals. The LOCKSS technology was originally developed in response to librarians' concerns about e-journals. It is now being used for a range of other content types, but the bulk of the system's use is for e-journals, both in the worldwide LOCKSS network and in the CLOCKSS program. Other efforts target e-journals specifically, such as Ex Libris' JOS (pdf), Portico, the Koninklijke Bibliotheek's e-Depot, the British Library, and others.

The reasons why e-journals became the target of choice include history, economics and technical convenience. In this post I will analyze these reasons in the light of what is now almost a decade of experience, and argue that they make less sense today than they did then.

There are two main answers to the question "why preserve e-journals?":

  • Post-cancellation access to subscription material.

  • Maintaining the integrity of the record of scholarship.

In this post I'll look at post-cancellation access. I'll return to the problem of maintaining the integrity of the record in a subsequent post.

Many libraries' interest in preserving e-journals arose when it became obvious that a side-effect of the transition of academic publishing to the Web was to change what the libraries were buying with their subscription dollars.

In the paper world the library purchased a physical copy of the content. Their readers' continued access to the content did not depend on continuing subscription payments, only on the library's decision whether or not to de-accession it. In the Web world, the library's subscription leases access to the publisher's copy of the content. Their readers' continued access to the content is perpetually at the mercy of the publisher's pricing policy.

This uncertainty didn't make the librarians happy, and since they write the checks that keep the publishers churning out content, they had various ways to communicate their unhappiness to the publishers. The first, immediate response was to insist on receiving paper copies as well as web access. It rapidly became obvious that this wasn't an acceptable solution to anyone. The libraries' readers rapidly found that they were vastly more productive working with web content. On-line use greatly outpaced use of paper. The publishers soon realized, not just that their readers preferred the Web, but more importantly that it was much cheaper to publish on the Web than on paper. Could librarians be persuaded to accept electronic-only publishing, while maintaining the same subscription pricing?

The major impediment to this tempting prospect was the librarians' insecurity about future access to the content to which they subscribed. Even publishers' promises that they would provide ex-subscribers free access to the content they had paid for weren't convincing; librarians were rightly skeptical of promises whose costs weren't covered.

Two broad approaches to post-cancellation access have been tried. One is to restore the paper model by preserving local copies: libraries pay to receive a copy of the content, which they can keep and use to satisfy future requests for access. The other is to devise and implement an escrow service: a third party which receives a copy of the content from the publisher and which, subject to the publisher's agreement, can provide ex-subscribers with access.

After about a decade of concern about post-cancellation access, we have a small number of partial solutions to the problem. Some are local copy solutions and have been in production use for some years. A few libraries including University of Toronto (.ppt) and Los Alamos use a commercial system from Elsevier (now sold to Ex Libris) called JOS (Journals On Site) to preserve local copies of journals from Elsevier and some other major publishers. About 200 libraries use the LOCKSS system to preserve content from a wide range of publishers, largely disjoint from those in JOS. Others are escrow services, including some copyright deposit schemes at national libraries, and the Portico system. None has yet achieved anything approaching full coverage of the field, all are at a nascent stage. None is routinely providing readers with post-cancellation access.

As time has gone by with no simple, affordable, one-size-fits-all system for post-cancellation access, the world has changed.

First, paper journals are no longer the version of record; for many of the most cited, highest impact journals the version delivered over the network has more information and more highly valued functions. The paper version is incomplete.

Second, the various ways publishers have tried to deliver physical copies of e-journal content, for example on CD-ROM, have proved to be so much trouble to deal with that they have been discredited as a means of post-cancellation access.

Third, the continual increase in subscription costs and the availability of cheap Web publishing platforms is driving a movement for open access to scholarship. It isn't certain that this will continue, and the effect varies greatly from field to field, but to the extent to which the trend continues it again reduces the importance of a solution to post-cancellation access. There is no subscription to cancel.

Fourth, the pressure for open access to the scientific literature has led many subscription journals to adopt a moving wall. Access to the content is restricted to subscribers for a period of time after it is first published, ranging from a few months to five years. After that, access is opened to anyone. The idea is that researchers active in a field will need immediate access to new content, and will justify the subscription to their librarians. Thus librarians will believe that, when their future readers want access, the moving wall will still be in effect to satisfy them. Thus they will be satisfied with a Web-only subscription.

Fifth, some other publishers have decided that charging for their back content on a pay-per-view basis is an important revenue source. These publishers are unlikely to participate in any solution for post-cancellation access.

Sixth, big publishers increasingly treat their content not as separate journals but as a single massive database. Subscription buys access to the whole database. If a library cancels their subscription, they lose access to the whole database. This bundling, or "big deal", leverages a small number of must-have journals to ensure that cancellation of even low-value journals, the vast majority in the bundle, is very unlikely. It is more expensive to subscribe individually to the few high-value journals than to take the "big deal". Thus cancellation of large publisher journals is a low risk, which is the goal of the "big deal" scheme.

Publishers who charge for back content typically do not allow their journals to be preserved using the LOCKSS system. They may provide their content to the nascent schemes for electronic copyright deposit at national libraries, but under very restrictive terms for access. For example, the Koninklijke Bibliotheek and the forthcoming British Library schemes both provide full access only to readers physically at the library itself; others get no or minimal access. National libraries are not a realistic solution to providing post-cancellation access to readers at subscribing libraries. Again, although these publishers may deposit content in the Portico system they're unlikely to sign the:
"rider to the agreement that a participating publisher signs if they choose to name Portico as a mechanism to fill post-cancellation access claims submitted by participating libraries." (emphasis added)

The rider in question is as follows:
"Perpetual Access. [Publisher] agrees that Portico shall provide access to the [content] to [Publisher]'s former purchasers or subscribers. Participating [Libraries] may submit perpetual access claims to Portico by certifying, either electronically or in writing, that they were a purchaser or subscriber to [the content] to which they are seeking access. ... Portico may Deliver the requested [content] if [publisher] has not notified Portico and the [library] of its objection ... in writing within thirty (30) days."

Thus for each library and each publisher, post-cancellation access is subject to the agreement of the publisher after the subscription has been canceled. Even with a current subscription to Portico, and even if the publisher has signed the rider, a library that cancels its subscription to the publisher's content can't be fully confident of receiving post-cancellation access. For example, suppose that a publisher signs the rider and is then sold to another that regards charging for post-cancellation access as important to its business model. The new owner could simply institute a policy of objecting to all perpetual access claims.

About 1/3 of Portico's publishers currently have not signed the rider. The only access a library obtains to their content is described here:
"The participating Library may designate up to four staff members per campus or system branch that will be provided password protected full access to the Portico archive for verification and testing purposes only." (emphasis added)

It is clear that a scrupulous library cannot look on Portico as a universal, robust solution for post-cancellation access.

There are two fundamental contradictions in the attempt to solve the problem of access to content after a subscription to a service (the publisher) is canceled by subscribing to a service (the preservation system) which prevents access after its subscription is canceled. First, it is not a solution, it is another instance of the same problem. Second, to the extent to which the subscription to the second service is regarded as insurance, it suffers from the same moral hazard as someone who takes out fire insurance then burns down the building himself. Insurance is being purchased against the direct consequences of voluntary actions by the insured. In other areas claims against such policies are treated as insurance fraud.

So we see that no matter how ingenious the proponents of digital preservation for e-journals, there is no realistic prospect of a single solution that provides post-cancellation access for 100% of subscription content the way that paper did. Generally speaking, the smaller publishers will be more likely to allow one or more preservation systems to provide post-cancellation access, and the larger for-profit publishers will be less likely. There will always be some level of uncertainty as to whether access will actually be available when it is needed.

The following post looks at the second reason for preserving e-journals, maintaining the integrity of the record of scholarship.

Saturday, June 9, 2007

A Petabyte For A Century

In a talk at the San Diego Supercomputer Center in September 2006 I started arguing (pdf) that one of the big problems in digital preservation is that we don't know how to measure how well we are doing it, which makes it difficult to improve. Because supercomputer people like large numbers, I started using the example of keeping a petabyte of data for a century to illustrate the problem. This post expands on my argument.

Let's start by assuming an organization has a petabyte of data that will be needed in 100 years. They want to buy a preservation system good enough that there will be a 50% chance that at the end of the 100 years every bit in the petabyte will have survived undamaged. This requirement sounds reasonable, but it is actually very challenging. They want 0.8 exabit-years of preservation with a 50% chance of success. Suppose the system they want to buy suffers from bit rot, a process that has a very small probability of flipping a bit at random. By analogy with the radioactive decay of atoms, they need the half-life of bits in the system to be at least 0.8 exa-years, or roughly 100,000,000 times the age of the universe.
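The arithmetic here can be checked in a few lines. This is a back-of-the-envelope sketch; the only number not from the text is the assumed ~1.4×10^10-year age of the universe:

```python
bits = 8e15                    # bits in a petabyte
years = 100
bit_years = bits * years       # 8e17 bit-years, i.e. 0.8 exabit-years

# If bits flip independently with half-life T, the chance that all
# survive for `years` years is 2 ** -(bit_years / T). Setting that
# equal to 0.5 gives T = bit_years, expressed in years.
half_life = bit_years          # 8e17 years = 0.8 exa-years

age_of_universe = 1.4e10       # years (assumed round figure)
print(half_life / age_of_universe)  # ~5.7e7, of the order of 100,000,000
```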

In order to be confident that they are spending money wisely, the organization commissions an independent test lab to benchmark the competing preservation systems. The goal is to measure the half-life of bits in each system to see whether it meets the 0.8 exa-year target. The contract for the testing specifies that results are needed in a year. What does the test lab have to do?

The lab needs to assemble a big enough test system so that, if the half-life is exactly 0.8 exa-year, it will see enough bit flips to be confident that the measurement is good. Say it needs to see 5 bit flips or fewer to claim that the half-life is long enough. Then the lab needs to test an exabyte of data for a year.

The test consists of writing an exabyte of data into the system at the start of the year and reading it back several times, let's say 9 times, during the year to compare the bits that come out with the bits that went in. So we have 80 exabits of I/O to do in one year, or roughly 10 petabits/hour, which is an I/O rate of about 3 terabits/sec. That is about 3,000 gigabit Ethernet interfaces running at full speed continuously for the whole year.
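These rates follow directly from the test's parameters; a quick sanity check of the figures (the text rounds the results upward a little):

```python
exabyte_bits = 8e18                # bits in an exabyte
passes = 1 + 9                     # write once, read back nine times
io_bits = passes * exabyte_bits    # 80 exabits of I/O over the year

seconds = 365 * 24 * 3600
per_hour = io_bits / (365 * 24)    # ~9.1e15 bits/hour, roughly 10 petabits/hour
per_sec = io_bits / seconds        # ~2.5e12 bits/sec, roughly 3 terabits/sec
gig_e_links = per_sec / 1e9        # ~2,500 fully loaded gigabit Ethernet links
print(per_hour, per_sec, gig_e_links)
```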

At current storage prices just the storage for the test system will cost hundreds of millions of dollars. When you add on the cost of the equipment to sustain the I/O and do the comparisons, and the cost of the software, staff, power and so on, it's clear that the test to discover whether a system would be good enough to keep a petabyte of data for a century with a 50% chance of success would cost in the billion-dollar range. This is of the order of 1,000 times the purchase price of the system, so the test isn't feasible.

I'm not an expert on experimental design, and this is obviously a somewhat simplistic thought-experiment. But, suppose that the purchasing organization was prepared to spend 1% of the purchase price per system on such a test. The test would then have to cost roughly 100,000 times less than my thought-experiment to be affordable. I leave this 100,000-fold improvement as an exercise for the reader.

Monday, May 7, 2007

Format Obsolescence: the Prostate Cancer of Preservation

This is the second post in a series on format obsolescence. In the first I argued that it is hard to find a plausible scenario in which it would no longer be possible to render a format for which there is an open-source renderer.

In the long run, we are all dead. In the long run, all digital formats become obsolete. Broadly, reactions to this dismal prospect have taken two forms:

- The aggressive form has been to do as much work as possible as soon as possible in the expectation that when format obsolescence finally strikes, the results of this meticulous preparation will pay dividends.

- The relaxed form has been to postpone doing anything until it is absolutely essential, in the expectation that the onward march of technology will mean that the tools available for performing the eventual migration will be better than those available today.

I will argue that format obsolescence is the prostate cancer of digital preservation. It is a serious and ultimately fatal problem. If you were to live long enough you would eventually be diagnosed with it, and some long time later you would die from it. Once it is diagnosed there is as yet no certain cure for it. No prophylactic measures are known to be effective in preventing its onset. All prophylactic measures are expensive and have side-effects. But it is highly likely that something else will kill you first, so "watchful waiting", another name for the "relaxed" approach, is normally the best course of action.

The most important threat to digital preservation is not the looming but distant and gradual threat of format obsolescence. It is rather the immediate and potentially fatal threat of economic failure. No-one has enough money to preserve the materials that should be preserved with the care they deserve. That is why current discussions of digital preservation prominently feature the problem of "sustainability".

The typical aggressive approach involves two main activities that take place while a repository ingests content:

- An examination of the incoming content to extract and validate preservation metadata which is stored along with the content. The metadata includes a detailed description of the format. The expectation is that detailed knowledge of the formats being preserved will provide the repository with advance warning as they approach obsolescence, and assist in responding to the threat by migrating the content to a less obsolete format.

- The preemptive use of format migration tools to normalize the content by creating and preserving a second version of it in a format the repository considers less likely to suffer obsolescence.

The PREMIS dictionary of preservation metadata is a 100-page report defining 84 different entities, although only about half of them are directly relevant to format migration. Although it was initially expected that humans would provide much of this metadata, the volume and complexity of the content to be preserved meant that human-generated preservation metadata was too expensive and too unreliable to be practical. Tools such as JHOVE were developed to extract and validate preservation metadata automatically. Similarly, tools are needed to perform the normalization. These tools are typically open source. Is there a plausible scenario in which it is no longer possible to run these tools? If not, what is the benefit of running them now and preserving their output, as opposed to running them whenever the output is needed?
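To illustrate why such tools can simply be re-run when their output is needed, format identification itself is a small, deterministic, repeatable computation. Here is a toy sketch (not JHOVE; the signature table is an assumed handful of well-known magic numbers):

```python
# A tiny table of file-format signatures ("magic numbers").
SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"GIF89a": "image/gif",
    b"PK\x03\x04": "application/zip",  # also OOXML, EPUB and JAR containers
}

def identify(data: bytes) -> str:
    """Return a MIME type guessed from the leading bytes, else a fallback."""
    for signature, mime in SIGNATURES.items():
        if data.startswith(signature):
            return mime
    return "application/octet-stream"

print(identify(b"%PDF-1.4 ..."))  # application/pdf
```

Since identification like this is cheap and produces the same answer every time, running it at access time instead of ingest time trades a little latency for much lower up-front cost.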

Since none of the proponents of the "aggressive" approach are sufficiently confident to discard the original bits, their storage costs are more than double those of the "relaxed" approach. The normalized copy of the original will be about the same size, plus storage is needed for the preservation metadata. Further, the operational costs of the "aggressive" ingest pipeline are significantly higher, since it undertakes much more work, and humans are needed to monitor progress and handle exceptions. The best example of a "relaxed" ingest pipeline is the Internet Archive, which has so far ingested over 85 billion web pages with minimal human intervention.

Is there any hard evidence that either preservation metadata or normalization actually increases the chance of content surviving format obsolescence by enough to justify the increased costs it imposes? Even the proponents of the "aggressive" approach would have to admit that the answer is "not yet". None of the formats in wide use when serious efforts at digital preservation of published content started a decade ago have become obsolete. Nor, as I argued in the previous post, is there any realistic scenario in which they will become obsolete in the near future. Thus both preservation metadata and normalization will remain speculative investments for the foreseeable future.

A large speculative investment in preparations of this kind might be justified if it were clear that format obsolescence was the most significant risk facing the content being preserved. Is that really the case? In the previous post I argued that for the kinds of published content currently being preserved format obsolescence is not a plausible threat, because all the formats being used to publish content have open source renderers. There clearly are formats for which obsolescence is the major risk, but content in those formats is not being preserved. For example, console games use encryption and other DRM techniques (see Bunnie Huang's amazing book for the Xbox example) that effectively prevent both format migration and emulation. Henry Lowood at Stanford Library is working to preserve these games, but using very different approaches.

Many digital preservation systems define levels of preservation; the higher the level assigned to a format, the stronger the "guarantee" of preservation the system offers. For example, PDF gets a higher level than Microsoft Word. Essentially, the greater the perceived difficulty of migrating a format, the lower the effort that will be devoted to preserving it. But the easier the format is to migrate, the lower the risk it is at. So investment, particularly in the "aggressive" approach, concentrates on the low-hanging fruit. This is neither at significant risk of loss, nor at significant risk of format obsolescence. A risk-based approach would surely prefer the "relaxed" approach, minimizing up-front and storage costs, and thereby freeing up resources to preserve more, and higher-risk content.

In my next post, I plan to look at what a risk-based approach to investing in preservation would look like, and whether it would be feasible.

Sunday, April 29, 2007

Format Obsolescence: Scenarios

This is the first of a series of posts in which I'll argue that much of the discussion of digital preservation, which focuses on the problem of format obsolescence, has failed to keep up with the evolution of the market and the technology. The result is that the bulk of the investment in the field is going to protecting content that is not at significant risk from events that are unlikely to occur, while at-risk content is starved of resources.

There are several format obsolescence "horror stories" often used to motivate discussion of digital preservation. I will argue that they are themselves now obsolete. The community of funders and libraries is currently investing primarily in preserving academic journals and related materials published on the Web. Are there realistic scenarios in which the formats of this content would become obsolete?

The most frequently cited "horror story" is that of the BBC Micro and the Domesday Book. In 1986 the BBC created a pair of video disks, hardware enhancements and software for the Acorn-based BBC Micro home computer. It was a virtual exhibition celebrating the 900th anniversary of the Domesday Book. By 2002 the hardware was obsolete and the video disks were decaying. In a technical tour de force, the CAMiLEON project, a collaboration among Leeds University, the University of Michigan and the UK National Archives, rescued it by capturing the video from the media and building an emulator for the hardware that ran on a Windows PC.

The Domesday Book example shares certain features with almost all the "horror stories" in that it involves (a) off-line content, (b) in little-used, proprietary formats, (c) published for a limited audience and (d) a long time ago. The market has moved on since these examples; the digital preservation community now focuses mostly on on-line content published in widely-used, mostly open formats for a wide audience. This is the content that, were it on paper, would be in library collections. It matches the Library of Congress collection practice, which is the "selection of best editions as authorized by copyright law. Best editions are generally considered to be works in their final state." By analogy with libraries' paper collections, the loss or unreadability of this content would severely impact our culture. Mitigating these risks surely justifies significant investment.

How might this content be lost? Experience starting with the Library of Alexandria shows that the way to ensure that content survives is to distribute copies across a range of independent repositories. This was the way printed paper worked for hundreds of years, but the advent of the Web changed the ground rules. Now, readers gain temporary access to the original publisher's copy; there is no distribution of long-lived copies as a side-effect of providing access to the content. As we have seen with music, and as we are seeing with video, once this mechanism becomes established its superior economics rapidly supplant any distribution channel involving physical artefacts. Clearly, no matter how careful web publishers intend to be with their content, the risk of loss is greater than with a proliferation of physical copies. Simply keeping the bits from being lost is the sine qua non of digital preservation, and it's not as easy as people think (a subject of future posts).

Let's assume we succeed in avoiding loss of the bits; how might those bits become unreadable? Let's look at how they can be rendered now, and try to construct a scenario in which this current rendering process would become impossible.

I'm writing this on my desktop machine. It runs the Ubuntu version of Linux, with the Firefox browser. Via the Stanford network I have access through Stanford's subscriptions to a vast range of e-journals and other web resources, as well as the huge variety of open access resources. I've worked this way for several years, since I decided to eliminate Microsoft software from my life. Apart from occasionally lower rendering quality than on my PowerBook, I don't have problems reading e-journals or other web resources. Almost all formats are rendered using open source software in the Ubuntu distribution; for a few, such as Adobe's Flash, the browser uses a closed-source binary plugin.

Let's start by looking at the formats for which an open source renderer exists (HTML, PDF, the Microsoft Office formats, and so on). The source code for an entire software stack capable of rendering each of these formats, from the BIOS through the boot loader, the operating system kernel, the browser, the PostScript and PDF interpreters and the Open Office suite, is in ASCII, a format that will not itself become obsolete. The code is carefully preserved in a range of source code repositories. The developers of the various projects don't actually rely on the repositories; they also keep regular backups. The LOCKSS program is typical: we keep multiple backup copies of our SourceForge repository. They are synchronized nightly. We could switch to any one of them at a moment's notice. All the tools needed to build a working software stack are also preserved in the same way, and regularly exercised (most open source projects have automatic build and test processes that are run at least nightly).

As if this wasn't safe enough, in most cases there are multiple independent implementations of each layer of functionality in the stack. For example, at the kernel layer there are at least 5 independent open source implementations capable of supporting this stack (Linux, FreeBSD, NetBSD, OpenBSD and Solaris). As if even this wasn't safe enough, this entire stack can be built and run on a large number of different CPU architectures (NetBSD supports 16 of them). Even if the entire base of Intel architecture systems stopped working overnight, in which case format obsolescence would be the least of our problems, this software stack would still be able to render the formats just as it always did, although on a much smaller total number of computers. In fact, almost all the Windows software would continue to run (albeit a bit slower) since there are open source emulations of the Intel architecture. Apple used similar emulation technology during their transitions from the Motorola 68000 to PowerPC, and PowerPC to Intel architectures.

What's more, the source code is preserved in source code control systems, such as Subversion. These systems ensure that the state of the system as it was at any point in the past can be reconstructed. Since all the code is handled this way, the exact state of the entire stack at the time that some content was rendered correctly can be recreated.
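In Subversion terms, reconstructing a past state is a single date-based checkout (e.g. `svn checkout -r {2005-06-01} URL`). The essential mechanism is simple enough to sketch: every commit is an immutable, timestamped snapshot, so "the state as of time T" is just the latest snapshot at or before T. A toy illustration of that idea (not Subversion's actual implementation; the file names and timestamps are invented):

```python
import bisect

class History:
    """Toy timestamped snapshot store: 'state as of T' is the last
    snapshot committed at or before T (the idea behind svn -r {DATE})."""
    def __init__(self):
        self.times = []      # commit timestamps, ascending
        self.snapshots = []  # full tree state at each commit

    def commit(self, t, tree):
        self.times.append(t)
        self.snapshots.append(dict(tree))  # store an immutable copy

    def as_of(self, t):
        # Find the last commit whose timestamp is <= t.
        i = bisect.bisect_right(self.times, t)
        if i == 0:
            raise KeyError("no snapshot at or before %r" % t)
        return self.snapshots[i - 1]

h = History()
h.commit(100, {"render.c": "v1"})
h.commit(200, {"render.c": "v2"})
print(h.as_of(150))  # the tree as it stood at t=150, i.e. {'render.c': 'v1'}
```

Because commits are never rewritten, the repository is effectively an append-only log, which is what makes "recreate the stack exactly as it was when the content rendered correctly" a mechanical operation rather than an archaeological one.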

But what of the formats for which there is no open source renderer, only a closed-source binary plugin? Flash is the canonical example, but in fact there is an open source Flash player; it is just some years behind Adobe's current one. This is very irritating for partisans of open source, who are forced to use Adobe's plugin to view recent content, but it may not be critical for digital preservation. After all, if preservation needs an open source renderer it will, by definition, be many years after the original release of the new format. There will be time for the open source renderer to emerge. But even if it doesn't, and even if subsequent changes to the software into which the plugin is plugged make it stop working, we have seen that the entire software stack at a time when it was working can be recreated. So provided that the binary plugin itself survives, the content can still be rendered.

Historically, the open source community has developed rendering software for almost all proprietary formats that achieve wide use, if only after a significant delay. The Microsoft Office formats are a good example. Several sustained and well-funded efforts, including Open Office, have resulted in adequate, if not pixel-perfect, support for these formats. The Australian National Archives' preservation strategy is based on using these tools to preemptively migrate content from proprietary formats to open formats before preservation. Indeed, the availability of open source alternatives is now making it difficult for Microsoft to continue imposing proprietary formats on their customers.

Even the formats which pose the greatest problems for preservation, those protected by DRM technology, typically have open source renderers, normally released within a year or two of the DRM-ed format's release. The legal status of a preservation strategy that used such software, or some software arguably covered by patents such as MP3 players, would be in doubt. Until the legal issues are clarified, no preservation system can make well-founded claims as to its ability to preserve these formats against format obsolescence. However, in most but not all cases these formats are supported by binary plugins for open source web browsers. If these binary plugins are preserved, we have seen that the software stack into which they plugged could be recreated in order to render content in that format.

It is safe to say that the software environment needed to support rendering of most current formats is preserved much better than the content being rendered.

If we ask "what would have to happen for these formats no longer to be renderable?" we are forced to invent implausible scenarios in which not just all the independent repositories holding the source code of the independent implementations of one layer of the stack were lost, but also all the backup copies of the source code at the various developers of all these projects, and also all the much larger number of copies of the binaries of this layer.

What has happened to make the predictions of the impending digital dark ages less menacing, at least as regards published content? First, off-line content on hardware-specific media has come to be viewed simply as a temporary backup for the primary on-line access copy. Second, publishing information on-line in arcane, proprietary formats is self-defeating. The point of publishing is to get the content to as many readers as possible, so publishers use popular formats. Third, open source environments have matured to the point where, with their popular and corporate support, only the most entrenched software businesses can refuse to support their use. Fourth, experience has shown that, even if a format is proprietary, if it is popular enough the open source community will support it effectively.

The all-or-nothing question that has dominated discussion of digital preservation has been how to deal with format obsolescence, whether by emulating the necessary software environment, or by painstakingly collecting "preservation metadata" in the hope that it will make future format migration possible. It turns out that:
the "preservation metadata" that is really needed for a format is an open source renderer for that format.
The community is creating these renderers for reasons that have nothing to do with preservation.

Of course, one must admit that reconstructing the entire open source software stack is not very convenient for the eventual reader, and could be expensive. Thus the practical questions about the obsolescence of the formats used by today's readers are really how convenient it will be for the eventual reader to access the content, and how much will need to be spent, and when, in order to reach that level of convenience. The next post in this series will take up these questions.

These ideas have evolved from those in a paper called Transparent Format Migration of Preserved Web Content we published in 2005. It explained the approach the LOCKSS program takes to format migration. LOCKSS is a trademark of Stanford University.

Saturday, April 21, 2007

Mass-market scholarly communication

I attended the Workshop on Repositories sponsored by the NSF (US) and the JISC (UK). I apologize in advance for the length of this post, which is a follow-up. As I wrote it new aspects kept emerging and more memories of the discussion came back.

In his perceptive position paper for the workshop, Don Waters cites a fascinating paper by Harley et al. entitled "The Influence of Academic Values on Scholarly Publication and Communication Practices". I'd like to focus on two aspects of the Harley et al paper:
  • They describe a split between "in-process" communication which is rapid, flexible, innovative and informal, and "archival" communication. The former is more important in establishing standing in a field, while the latter is more important in establishing standing in an institution.
  • They suggest that "the quality of peer review may be declining" with "a growing tendency to rely on secondary measures", "difficult[y] for reviewers in standard fields to judge submissions from compound disciplines", "difficulty in finding reviewers who are qualified, neutral and objective in a fairly closed academic community", "increasing reliance ... placed on the prestige of publication rather than ... actual content", and that "the proliferation of journals has resulted in the possibility of getting almost anything published somewhere" thus diluting "peer-reviewed" as a brand.

In retrospect, I believe Malcolm Read made the most important observation of the workshop when he warned about the coming generational change in the scholarly community, to a generation which has never known a world without Web-based research and collaboration tools. These warnings are particularly important because of the inevitable time lags in developing and deploying any results from the policy changes that the workshop's report might advocate.

Late in the workshop I channeled my step-daughter, who is now a Ph.D. student. Although I was trying to use her attitudes to illuminate the coming changes, in fact she is already too old to be greatly impacted by any results from the workshop. She was in high school as the Web was exploding. The target generation is now in high school, and their equivalent experience includes blogs and MySpace.

I'd like to try to connect these aspects to Malcolm's warnings and to the points I was trying to communicate by channeling my step-daughter. In my presentation I used as an example of "Web 2.0 scholarship" a post by Stuart Staniford, a computer scientist, to The Oil Drum blog, a forum for discussion of "peak oil" among a diverse group of industry professionals and interested outsiders, like Stuart. See comments and a follow-on post for involvement of industry insiders.

I now realize that I missed my own basic point, which is:

Blogs are bringing the tools of scholarly communication to the mass market, and with the leverage the mass market gives the technology, may well overwhelm the traditional forms.

Why is it that Stuart feels 2-3 times as productive doing "blog-science"? Based on my blog experience of reading (a lot) and writing (a little) I conjecture as follows:
  • The process is much faster. A few hours to a few days to create a post, then a few hours of intensive review, then a day or two in which the importance of the reviewed work becomes evident as other blogs link to it. Stuart's comment came 9 hours into a process that accumulated 217 comments in 30 hours. Contrast this with the ponderous pace of traditional academic communication.
  • The process is much more transparent. The entire history of the review is visible to everyone, in a citable and searchable form. Contrast this with the confidentiality-laden process of traditional scholarship.
  • Priority is obvious. All contributions are time-stamped, so disputes can be resolved objectively and quickly. They're less likely to fester and give rise to suspicions that confidentiality has been violated.
  • The process is meritocratic. Participation is open to all, not restricted to those chosen by mysterious processes that hide agendas. Participants may or may not be pseudonymous but their credibility is based on the visible record. Participants put their reputation on the line every time they post. The credibility of the whole blog depends on the credibility and frequency of other blogs linking to it - in other words the same measures applied to traditional journals, but in real time with transparency.
  • Equally, the process is error-tolerant. Staniford says "recognition on all our parts that this kind of work will have more errors in any given piece of writing, and its the collaborative debate process that converges towards the truth." This tolerance is possible because the investment in each step is small, and corrections can be made quickly. Because the penalty for error is lower, participants can afford to take more creative risk.
  • The process is both cooperative and competitive. Everyone is striving to improve their reputation by contributing. Of course, some contributions are negative, but the blog platforms and norms are evolving to cope with this inevitable downside of openness.
  • Review can be both broad and deep. Staniford says "The ability for anyone in the world, with who knows what skill set and knowledge base, to suddenly show up ... is just an amazing thing". And the review is about the written text, not about the formal credentials of the reviewers.
  • Good reviewing is visibly rewarded. Participants make their reputations not just by posting, but by commenting on posts. It's as easy to assess the quality of a participant's reviews as to assess their authorship; both are visible in the public record.

Returning to the Harley et al. paper's observations, it is a commonplace that loyalty to employers is decreasing, with people expecting to move jobs frequently and often involuntarily. Investing in your own skills and success makes more sense than investing in the success of your (temporary) employer. Why would we be surprised that junior faculty and researchers are reluctant to put effort into institutional repositories for no visible benefit except to the institution? More generally, it is likely that as the mechanisms for establishing standing in the field diverge from those for establishing standing in the institution, investment will focus on standing in the field as being more portable, and more likely to be convertible into standing in their next host institution.

It is also very striking how many of the problems of scholarly communication are addressed by Staniford's blog-science:

  • "the proliferation of journals has resulted in the possibility of getting almost anything published somewhere" - If scholarship is effectively self-published then attention focusses on tools for rating the quality of scholarship, which can be done transparently, rather than tools for preventing low-rated scholarship being published under the "peer-reviewed" brand. As the dam holding back the flood of junk leaks, the brand loses value, so investing in protecting it becomes less rewarding. Tools for rating scholarship, on the other hand, reward investment. They will be applied to both branded and non-branded material (cf. Google), and will thus expose the decreased value of the brand, leading to a virtuous circle.
  • "increasing reliance ... placed on the prestige of publication rather than ... actual content" - Blog-style self-publishing redirects prestige from the channel to the author. Clearly, a post to a high-traffic blog such as Daily Kos (500,000 visits/day) can attract more attention, but this effect is lessened by the fact that it will compete with all the other posts to the site. In the end the citation index effect works, and quickly.
  • "a growing tendency to rely on secondary measures" - If the primary measures of quality were credible, this wouldn't happen. The lack of transparency in the traditional process makes it difficult to regain credibility. The quality rating system for blogs is far from perfect, but it is transparent, it is amenable to automation, and there is an effective incentive system driving innovation and improvement for the mass market.
  • "difficult[y] for reviewers in standard fields to judge submissions from compound disciplines" - This is only a problem because the average number of reviewers per item is small, so each needs to span most of the fields. If, as with blogs, there are many reviewers with transparent reputations, the need for an individual reviewer to span fields is much reduced.
  • "difficulty in finding reviewers who are qualified, neutral and objective in a fairly closed academic community" - This is only a problem because the process is opaque. Outsiders have to trust the reviewers; they cannot monitor their reviews. With a completely transparent, blog-like process it is taken for granted that many reviewers will have axes to grind; the process exists to mediate these conflicting interests in public.

Of the advantages I list above, I believe the most important is sheer speed. John Boyd, the influential military strategist, stressed the importance of accelerating the OODA (Observation, Orientation, Decision, Action) loop. Taking small, measurable steps quickly is vastly more productive than taking large steps slowly, especially when the value of the large step takes even longer to become evident.

Why did arXiv arise? It was a reaction to a process so slow as to make work inefficient. Successive young generations lack patience with slow processes; they will work around processes they see as too slow just as the arXiv pioneers did. Note that once arXiv became institutionalized, it ceased to evolve and is now in danger of losing relevance as newer technologies with the leverage of the mass market overtake it. Scientists no longer really need arXiv; they can post on their personal web sites and Google does everything else (see Peter Suber), which reinforces my case that mass-market tools will predominate. The only mass-market tool missing is preservation of personal websites, which blog platforms increasingly provide. Almost nothing in the workshop was about speeding up the scholarly process, so almost everything we propose will probably get worked around and become irrelevant.

The second most important factor is error tolerance. The key to Silicon Valley's success is the willingness to fail fast, often and in public; the idea that learning from failure is more important than avoiding failure. Comments in the workshop about the need for every report to a funding agency to present a success illustrate the problem. If the funding agencies are incapable of hearing about failures they can't learn much.

What does all this mean for the workshop's influence on the future?

  • Unless the institutions' and agencies' efforts are focussed on accelerating the OODA loop in scholarship, they will be ignored and worked-around by a coming generation notorious for its short attention span. No-one would claim that institutional repositories are a tool for accelerating scholarship; thus those workshop participants describing their success as at best "mixed" are on the right track. Clearly, making content at all scales more accessible to scholars and their automated tools is a way to accelerate the process. In this respect Peter Murray-Rust's difficulties in working around restrictions on automated access to content that is nominally on-line are worthy of particular attention.
  • Academic institutions and funding agencies lack the resources, expertise and mission to compete head-on with mass market tools. Once the market niche has been captured, academics will use the mass market tools unless the productivity gains from specialized tools are substantial. Until recently, there were no mass-market tools for scholarly communication, but that's no longer true. In this case the mass-market tools are more productive than the specialized ones, not less. Institutions and agencies need to focus on ways to leverage these tools, not to deprecate their use and arm-twist scholars into specialized tools under institutional control.
  • Institutions and agencies need to learn from John Boyd and Silicon Valley themselves. Big changes which will deliver huge value but only in the long term are unlikely to be effective. Small steps that may deliver a small increment in value but will either succeed or fail quickly are the way to go.
  • Key to effective change are the incentive and reward systems, since they close the OODA loop. The problem for institutions and agencies in this area is that the mass-market tools have very effective incentive and reward systems, based on measuring and monetizing usage. Pay attention to the way Google runs vast numbers of experiments every day, tweaking their systems slightly and observing the results on users' behavior. Their infrastructure for conducting these experiments is very sophisticated, because the rewards for success flow straight to the bottom line. The most important change institutions and agencies can make is to find ways to leverage the Web's existing reward systems by measuring and rewarding use of scholarly assets. Why does the academic structure regard the vast majority of accesses to the Sloan Digital Sky Survey as being an unintended, uninteresting by-product? Why don't we even know what's motivating these accesses? Why aren't we investing in increasing these accesses?

I tend to be right about the direction things are heading and very wrong about how fast they will get there. With that in mind, here's my prediction for the way future scholars will communicate. The entire process, from lab notebook to final publication, will use the same mass-market blog-like tools that everyone uses for everyday cooperation. Everything will be public, citable, searchable, accessible by automated scholarly tools, time-stamped and immutable. The big problem will not be preservation, because the mass-market blog-like platforms will treat the scholarly information as among the most valuable of their business assets. It will be more credible, and thus more used, and thus generate more income, than less refined content. The big problem will be a more advanced version of the problems currently plaguing blogs, such as spam, abusive behavior, and deliberate subversion. But, again, since the mass-market systems have these problems too, scholars will simply use the mass-market solutions.