DSHR's Blog: April 2007

Sunday, April 29, 2007

Format Obsolescence: Scenarios

This is the first of a series of posts in which I'll argue that much of the discussion of digital preservation, which focuses on the problem of format obsolescence, has failed to keep up with the evolution of the market and the technology. The result is that the bulk of the investment in the field is going to protecting content that is not at significant risk from events that are unlikely to occur, while at-risk content is starved of resources.

There are several format obsolescence "horror stories" often used to motivate discussion of digital preservation. I will argue that they are themselves now obsolete. The community of funders and libraries is currently investing primarily in preserving academic journals and related materials published on the Web. Are there realistic scenarios in which this content would become obsolescent?

The most frequently cited "horror story" is that of the BBC Micro and the Domesday Book. In 1986 the BBC created a pair of video disks, hardware enhancements and software for the Acorn-based BBC Micro home computer. It was a virtual exhibition celebrating the 900th anniversary of the Domesday Book. By 2002 the hardware was obsolete and the video disks were decaying. In a technical tour de force, the CAMiLEON project, a collaboration among Leeds University, the University of Michigan and the UK National Archives, rescued it by capturing the video from the media and building an emulator for the hardware that ran on a Windows PC.

The Domesday Book example shares certain features with almost all the "horror stories" in that it involves (a) off-line content, (b) in little-used, proprietary formats, (c) published for a limited audience and (d) a long time ago. The market has moved on since these examples; the digital preservation community now focuses mostly on on-line content published in widely-used, mostly open formats for a wide audience. This is the content that, were it on paper, would be in library collections. It matches the Library of Congress collection practice, which is the "selection of best editions as authorized by copyright law. Best editions are generally considered to be works in their final state." By analogy with libraries' paper collections, the loss or unreadability of this content would severely impact our culture. Mitigating these risks surely justifies significant investment.

How might this content be lost? Experience starting with the Library of Alexandria shows that the way to ensure that content survives is to distribute copies across a range of independent repositories. This was the way printed paper worked for hundreds of years, but the advent of the Web changed the ground rules. Now, readers gain temporary access to the original publisher's copy; there is no distribution of long-lived copies as a side-effect of providing access to the content. As we have seen with music, and as we are seeing with video, once this mechanism becomes established its superior economics rapidly supplant any distribution channel involving physical artefacts. Clearly, no matter how careful web publishers intend to be with their content, the risk of loss is greater than with a proliferation of physical copies. Simply keeping the bits from being lost is the sine qua non of digital preservation, and it's not as easy as people think (a subject of future posts).

Let's assume we succeed in avoiding loss of the bits; how might those bits become unreadable? Let's look at how they can be rendered now, and try to construct a scenario in which this current rendering process would become impossible.

I'm writing this on my desktop machine. It runs the Ubuntu version of Linux, with the Firefox browser. Via the Stanford network I have access through Stanford's subscriptions to a vast range of e-journals and other web resources as well as the huge variety of open access resources. I've worked this way for several years, since I decided to eliminate Microsoft software from my life. Apart from occasional lower quality than on my PowerBook, I don't have problems reading e-journals or other web resources. Almost all formats are rendered using open source software in the Ubuntu distribution; for a few such as Adobe's Flash the browser uses a closed-source binary plugin.

Let's start by looking at the formats for which an open source renderer exists (HTML, PDF, the Microsoft Office formats, and so on). The source code for an entire software stack capable of rendering each of these formats, from the BIOS through the boot loader, the operating system kernel, the browser, the PostScript and PDF interpreters and the Open Office suite, is in ASCII, a format that will not itself become obsolete. The code is carefully preserved in a range of source code repositories. The developers of the various projects don't actually rely on the repositories; they also keep regular backups. The LOCKSS program is typical: we keep multiple backup copies of our SourceForge repository. They are synchronized nightly. We could switch to any one of them at a moment's notice. All the tools needed to build a working software stack are also preserved in the same way, and regularly exercised (most open source projects have automatic build and test processes that are run at least nightly).

As if this wasn't safe enough, in most cases there are multiple independent implementations of each layer of functionality in the stack. For example, at the kernel layer there are at least 5 independent open source implementations capable of supporting this stack (Linux, FreeBSD, NetBSD, OpenBSD and Solaris). As if even this wasn't safe enough, this entire stack can be built and run on a large number of different CPU architectures (NetBSD supports 16 of them). Even if the entire base of Intel architecture systems stopped working overnight, in which case format obsolescence would be the least of our problems, this software stack would still be able to render the formats just as it always did, although on a much smaller total number of computers. In fact, almost all the Windows software would continue to run (albeit a bit slower) since there are open source emulations of the Intel architecture. Apple used similar emulation technology during their transitions from the Motorola 68000 to PowerPC, and PowerPC to Intel architectures.

What's more, the source code is preserved in source code control systems, such as Subversion. These systems ensure that the state of the system as it was at any point in the past can be reconstructed. Since all the code is handled this way, the exact state of the entire stack at the time that some content was rendered correctly can be recreated.
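The idea of recreating the stack "as of" some date can be sketched in a few lines. This is only an illustration of the lookup, not a description of how Subversion or any other system works internally; the layer names, revision identifiers and dates are all invented for the example.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical commit histories: for each layer of the stack, a list of
# (commit date, revision id) pairs sorted by date. All values are made up.
HISTORY = {
    "kernel": [(date(2005, 1, 10), "r100"), (date(2006, 3, 2), "r180"), (date(2007, 2, 14), "r240")],
    "browser": [(date(2005, 6, 1), "r12"), (date(2006, 11, 20), "r57")],
    "pdf-interpreter": [(date(2004, 9, 9), "r3"), (date(2006, 5, 5), "r9")],
}


def revision_as_of(commits, when):
    """Return the latest revision committed on or before `when`, or None."""
    dates = [d for d, _ in commits]
    i = bisect_right(dates, when)
    return commits[i - 1][1] if i else None


def stack_as_of(history, when):
    """Pick, for every layer, the revision that was current on `when`."""
    return {layer: revision_as_of(commits, when) for layer, commits in history.items()}


# The stack as it stood when some content was known to render correctly.
snapshot = stack_as_of(HISTORY, date(2006, 6, 30))
print(snapshot)
```

In practice the version control client does this lookup itself; Subversion, for example, accepts a date in place of a revision number when checking out.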

But what of the formats for which there is no open source renderer, only a closed-source binary plugin? Flash is the canonical example, but in fact there is an open source Flash player; it is just some years behind Adobe's current one. This is very irritating for partisans of open source, who are forced to use Adobe's plugin to view recent content, but it may not be critical for digital preservation. After all, if preservation needs an open source renderer it will, by definition, be many years after the original release of the new format. There will be time for the open source renderer to emerge. But even if it doesn't, and even if subsequent changes to the software into which the plugin is plugged make it stop working, we have seen that the entire software stack at a time when it was working can be recreated. So provided that the binary plugin itself survives, the content can still be rendered.

Historically, the open source community has developed rendering software for almost all proprietary formats that achieve wide use, if only after a significant delay. The Microsoft Office formats are a good example. Several sustained and well-funded efforts, including Open Office, have resulted in adequate, if not pixel-perfect, support for these formats. The Australian National Archives preservation strategy is based on using these tools to preemptively migrate content from proprietary formats to open formats before preservation. Indeed, the availability of open source alternatives is now making it difficult for Microsoft to continue imposing proprietary formats on their customers.

Even the formats which pose the greatest problems for preservation, those protected by DRM technology, typically have open source renderers, normally released within a year or two of the DRM-ed format's release. The legal status of a preservation strategy that used such software, or some software arguably covered by patents such as MP3 players, would be in doubt. Until the legal issues are clarified, no preservation system can make well-founded claims as to its ability to preserve these formats against format obsolescence. However, in most but not all cases these formats are supported by binary plugins for open source web browsers. If these binary plugins are preserved, we have seen that the software stack into which they plugged could be recreated in order to render content in that format.

It is safe to say that the software environment needed to support rendering of most current formats is preserved much better than the content being rendered.

If we ask "what would have to happen for these formats no longer to be renderable?" we are forced to invent implausible scenarios in which not just all the independent repositories holding the source code of the independent implementations of one layer of the stack were lost, but also all the backup copies of the source code at the various developers of all these projects, and also all the much larger number of copies of the binaries of this layer.
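A back-of-envelope calculation shows why such scenarios are implausible, under the strong assumption that copies are lost independently of one another. The loss probability and copy counts below are invented purely for illustration; they are not measurements of any real repository.

```python
from fractions import Fraction


def prob_all_lost(p_loss, copies):
    """Probability that every one of `copies` independent copies is lost,
    assuming each is lost with the same probability `p_loss`."""
    return p_loss ** copies


# Hypothetical: each copy has a 1-in-10 chance of being lost over some period.
p = Fraction(1, 10)
for n in (1, 3, 5, 10):
    print(n, prob_all_lost(p, n))
```

Real copies are never perfectly independent, which is why the independence of the repositories, backups and implementations matters as much as their number.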

What has happened to make the predictions of the impending digital dark ages less menacing, at least as regards published content? First, off-line content on hardware-specific media has come to be viewed simply as a temporary backup for the primary on-line access copy. Second, publishing information on-line in arcane, proprietary formats is self-defeating. The point of publishing is to get the content to as many readers as possible, so publishers use popular formats. Third, open source environments have matured to the point where, with their popular and corporate support, only the most entrenched software businesses can refuse to support their use. Fourth, experience has shown that, even if a format is proprietary, if it is popular enough the open source community will support it effectively.

The all-or-nothing question that has dominated discussion of digital preservation has been how to deal with format obsolescence, whether by emulating the necessary software environment, or by painstakingly collecting "preservation metadata" in the hope that it will make future format migration possible. It turns out that:

the "preservation metadata" that is really needed for a format is an open source renderer for that format.

The community is creating these renderers for reasons that have nothing to do with preservation.

Of course, one must admit that reconstructing the entire open source software stack is not very convenient for the eventual reader, and could be expensive. Thus the practical questions about the obsolescence of the formats used by today's readers are really how convenient it will be for the eventual reader to access the content, and how much will be spent, and when, in order to reach that level of convenience. The next post in this series will take up these questions.

These ideas have evolved from those in a paper called Transparent Format Migration of Preserved Web Content we published in 2005. It explained the approach the LOCKSS program takes to format migration. LOCKSS is a trademark of Stanford University.

Saturday, April 21, 2007

Mass-market scholarly communication

I attended the Workshop on Repositories sponsored by the NSF (US) and the JISC (UK). I apologize in advance for the length of this post, which is a follow-up. As I wrote it new aspects kept emerging and more memories of the discussion came back.

In his perceptive position paper for the workshop, Don Waters cites a fascinating paper by Harley et al. entitled "The Influence of Academic Values on Scholarly Publication and Communication Practices". I'd like to focus on two aspects of the Harley et al. paper:

They describe a split between "in-process" communication which is rapid, flexible, innovative and informal, and "archival" communication. The former is more important in establishing standing in a field, whereas the latter is more important in establishing standing in an institution.
They suggest that "the quality of peer review may be declining" with "a growing tendency to rely on secondary measures", "difficult[y] for reviewers in standard fields to judge submissions from compound disciplines", "difficulty in finding reviewers who are qualified, neutral and objective in a fairly closed academic community", "increasing reliance ... placed on the prestige of publication rather than ... actual content", and that "the proliferation of journals has resulted in the possibility of getting almost anything published somewhere" thus diluting "peer-reviewed" as a brand.

In retrospect, I believe Malcolm Read made the most important observation of the workshop when he warned about the coming generational change in the scholarly community, to a generation which has never known a world without Web-based research and collaboration tools. These warnings are particularly important because of the inevitable time lags in developing and deploying any results from the policy changes that the workshop's report might advocate.

Late in the workshop I channeled my step-daughter, who is now a Ph.D. student. Although I was trying to use her attitudes to illuminate the coming changes, in fact she is already too old to be greatly impacted by any results from the workshop. She was in high school as the Web was exploding. The target generation is now in high school, and their equivalent experience includes blogs and MySpace.

I'd like to try to connect these aspects to Malcolm's warnings and to the points I was trying to communicate by channeling my step-daughter. In my presentation I used as an example of "Web 2.0 scholarship" a post by Stuart Staniford, a computer scientist, to The Oil Drum blog, a forum for discussion of "peak oil" among a diverse group of industry professionals and interested outsiders, like Stuart. See comments and a follow-on post for involvement of industry insiders.

I now realize that I missed my own basic point, which is:

Blogs are bringing the tools of scholarly communication to the mass market, and with the leverage the mass market gives the technology, may well overwhelm the traditional forms.

Why is it that Stuart feels 2-3 times as productive doing "blog-science"? Based on my blog experience of reading (a lot) and writing (a little) I conjecture as follows:

The process is much faster. A few hours to a few days to create a post, then a few hours of intensive review, then a day or two in which the importance of the reviewed work becomes evident as other blogs link to it. Stuart's comment came 9 hours into a process that accumulated 217 comments in 30 hours. Contrast this with the ponderous pace of traditional academic communication.
The process is much more transparent. The entire history of the review is visible to everyone, in a citable and searchable form. Contrast this with the confidentiality-laden process of traditional scholarship.
Priority is obvious. All contributions are time-stamped, so disputes can be resolved objectively and quickly. They're less likely to fester and give rise to suspicions that confidentiality has been violated.
The process is meritocratic. Participation is open to all, not restricted to those chosen by mysterious processes that hide agendas. Participants may or may not be pseudonymous but their credibility is based on the visible record. Participants put their reputation on the line every time they post. The credibility of the whole blog depends on the credibility and frequency of other blogs linking to it - in other words the same measures applied to traditional journals, but in real time with transparency.
Equally, the process is error-tolerant. Staniford says "recognition on all our parts that this kind of work will have more errors in any given piece of writing, and its the collaborative debate process that converges towards the truth." This tolerance is possible because the investment in each step is small, and corrections can be made quickly. Because the penalty for error is lower, participants can afford to take more creative risk.
The process is both cooperative and competitive. Everyone is striving to improve their reputation by contributing. Of course, some contributions are negative, but the blog platforms and norms are evolving to cope with this inevitable downside of openness.
Review can be both broad and deep. Staniford says "The ability for anyone in the world, with who knows what skill set and knowledge base, to suddenly show up ... is just an amazing thing". And the review is about the written text, not about the formal credentials of the reviewers.
Good reviewing is visibly rewarded. Participants make their reputations not just by posting, but by commenting on posts. It's as easy to assess the quality of a participant's reviews as to assess their authorship; both are visible in the public record.

Returning to the Harley et al. paper's observations, it is a commonplace that loyalty to employers is decreasing, with people expecting to move jobs frequently and often involuntarily. Investing in your own skills and success makes more sense than investing in the success of your (temporary) employer. Why would we be surprised that junior faculty and researchers are reluctant to put effort into institutional repositories for no visible benefit except to the institution? More generally, it is likely that as the mechanisms for establishing standing in the field diverge from those for establishing standing in the institution, investment will focus on standing in the field as being more portable, and more likely to be convertible into standing in their next host institution.

It is also very striking how many of the problems of scholarly communication are addressed by Staniford's blog-science:

"the proliferation of journals has resulted in the possibility of getting almost anything published somewhere" - If scholarship is effectively self-published then attention focusses on tools for rating the quality of scholarship, which can be done transparently, rather than tools for preventing low-rated scholarship being published under the "peer-reviewed" brand. As the dam holding back the flood of junk leaks, the brand loses value, so investing in protecting it becomes less rewarding. Tools for rating scholarship, on the other hand, reward investment. They will be applied to both branded and non-branded material (cf. Google), and will thus expose the decreased value of the brand, leading to a virtuous circle.
"increasing reliance ... placed on the prestige of publication rather than ... actual content" - Blog-style self-publishing redirects prestige from the channel to the author. Clearly, a post to a high-traffic blog such as Daily Kos (500,000 visits/day) can attract more attention, but this effect is lessened by the fact that it will compete with all the other posts to the site. In the end the citation index effect works, and quickly.
"a growing tendency to rely on secondary measures" - If the primary measures of quality were credible, this wouldn't happen. The lack of transparency in the traditional process makes it difficult to regain credibility. The quality rating system for blogs is far from perfect, but it is transparent, it is amenable to automation, and there is an effective incentive system driving innovation and improvement for the mass market.
"difficult[y] for reviewers in standard fields to judge submissions from compound disciplines" - This is only a problem because the average number of reviewers per item is small, so each needs to span most of the fields. If, as with blogs, there are many reviewers with transparent reputations, the need for an individual reviewer to span fields is much reduced.
"difficulty in finding reviewers who are qualified, neutral and objective in a fairly closed academic community" - This is only a problem because the process is opaque. Outsiders have to trust the reviewers; they cannot monitor their reviews. With a completely transparent, blog-like process it is taken for granted that many reviewers will have axes to grind; the process exists to mediate these conflicting interests in public.

Of the advantages I list above, I believe the most important is sheer speed. John Boyd, the influential military strategist, stressed the importance of accelerating the OODA (Observation, Orientation, Decision, Action) loop. Taking small, measurable steps quickly is vastly more productive than taking large steps slowly, especially when the value of the large step takes even longer to become evident.

Why did arXiv arise? It was a reaction to a process so slow as to make work inefficient. Successive young generations lack patience with slow processes; they will work around processes they see as too slow just as the arXiv pioneers did. Note that once arXiv became institutionalized, it ceased to evolve and is now in danger of losing relevance as newer technologies with the leverage of the mass market overtake it. Scientists no longer really need arXiv; they can post on their personal web sites and Google does everything else (see Peter Suber), which reinforces my case that mass-market tools will predominate. The only mass-market tool missing is preservation of personal websites, which blog platforms increasingly provide. Almost nothing in the workshop was about speeding up the scholarly process, so almost everything we propose will probably get worked around and become irrelevant.

The second most important factor is error tolerance. The key to Silicon Valley's success is the willingness to fail fast, often and in public; the idea that learning from failure is more important than avoiding failure. Comments in the workshop about the need for every report to a funding agency to present a success illustrate the problem. If the funding agencies are incapable of hearing about failures they can't learn much.

What does all this mean for the workshop's influence on the future?

Unless the institutions' and agencies' efforts are focussed on accelerating the OODA loop in scholarship, they will be ignored and worked around by a coming generation notorious for its short attention span. No-one would claim that institutional repositories are a tool for accelerating scholarship; thus those workshop participants describing their success as at best "mixed" are on the right track. Clearly, making content at all scales more accessible to scholars and their automated tools is a way to accelerate the process. In this respect Peter Murray-Rust's difficulties in working around restrictions on automated access to content that is nominally on-line are worthy of particular attention.
Academic institutions and funding agencies lack the resources, expertise and mission to compete head-on with mass market tools. Once the market niche has been captured, academics will use the mass market tools unless the productivity gains from specialized tools are substantial. Until recently, there were no mass-market tools for scholarly communication, but that's no longer true. In this case the mass-market tools are more productive than the specialized ones, not less. Institutions and agencies need to focus on ways to leverage these tools, not to deprecate their use and arm-twist scholars into specialized tools under institutional control.
Institutions and agencies need to learn from John Boyd and Silicon Valley themselves. Big changes which will deliver huge value but only in the long term are unlikely to be effective. Small steps that may deliver a small increment in value but will either succeed or fail quickly are the way to go.
Key to effective change are the incentive and reward systems, since they close the OODA loop. The problem for institutions and agencies in this area is that the mass-market tools have very effective incentive and reward systems, based on measuring and monetizing usage. Pay attention to the way Google runs vast numbers of experiments every day, tweaking their systems slightly and observing the results on users' behavior. Their infrastructure for conducting these experiments is very sophisticated, because the rewards for success flow straight to the bottom line. The most important change institutions and agencies can make is to find ways to leverage the Web's existing reward systems by measuring and rewarding use of scholarly assets. Why does the academic structure regard the vast majority of accesses to the Sloan Digital Sky Survey as being an unintended, uninteresting by-product? Why don't we even know what's motivating these accesses? Why aren't we investing in increasing these accesses?

I tend to be right about the direction things are heading and very wrong about how fast they will get there. With that in mind, here's my prediction for the way future scholars will communicate. The entire process, from lab notebook to final publication, will use the same mass-market blog-like tools that everyone uses for everyday cooperation. Everything will be public, citable, searchable, accessible by automated scholarly tools, time-stamped and immutable. The big problem will not be preservation, because the mass-market blog-like platforms will treat the scholarly information as among the most valuable of their business assets. It will be more credible, and thus more used, and thus generate more income, than less refined content. The big problem will be a more advanced version of the problems currently plaguing blogs, such as spam, abusive behavior, and deliberate subversion. But, again, since the mass-market systems have these problems too, scholars will simply use the mass-market solutions.