Sunday, December 30, 2007

Mass-market scholarly communication revisited

Again, I need to apologize for the gap in posting. It turns out that I'm not good at combining travel and blogging, and I've been doing a lot of travel. One of my trips was to DC for CNI and the 3rd International Digital Curation Conference. One of the highlights was a fascinating talk (abstract) by Prof. Carole Goble of the University of Manchester's School of Computer Science. She's a great speaker, with a vivid turn of phrase, and you have to like a talk about science on the web in which a major example is VivaLaDiva.com, a shoe shopping site.

Carole and her team work on enabling in silico experiments by using workflows to compose Web and other services. Their myGrid site uses their Taverna workflow system to provide scientists with access to over 3000 services. Their myExperiment site, a "scientists' social network and Virtual Research Environment", encourages community-wide sharing and curation of workflows.

Two things I found really interesting about Carole's talk were:

  • myExperiment is an implementation of the ideas I discussed in my post on Mass-Market Scholarly Communication, enhanced with the concept of workflows.

  • The emerging world of web services is the big challenge facing digital preservation. Her talk was a wonderful illustration both of why this is an important problem, in that much of readers' experience of the Web is already mediated by services, and of why the barriers to preserving those services are almost insurmountable.


Carole's talk was like a high-speed recapitulation of the history of the Web, with workflows taking the place of pages. More generally, it was an instance of the way Web 2.0 evolution is like Web 1.0 evolution, with services instead of static content. Carole described how scientists discovered that they could link together services (pages) using workflows (links). There were soon so many services that directories of them arose (think the early Yahoo!). Then there were so many that search engines arose. Then enough time elapsed that people started noticing that workflows (links) decayed quite rapidly. There was, however, one important piece of Web 1.0 missing from her presentation - advertising. Follow me below the jump for an explanation of why this omission is important and some suggestions about what can be done to remedy it.

Advertising has played a much underestimated role in driving the evolution of the Web. It provides direct, almost instantaneous, feedback that rewards successful mutations. Because the feedback is monetary, it ensures that successful mutations will get the resources they need to thrive, and unsuccessful ones will not. We take for granted that a Web site that attracts a large readership will be able to sustain itself and grow, but the only reason this is so is because of advertising. Daily Kos, a political blogging site, was started on a shoestring in 2002 and hit a million page views a month after its first year. It now gets 15 million a month, consumes a respectable-size server farm and employs a growing staff. Darwin would recognize this process instantly.

Despite a 1998 harangue from Tim Berners-Lee, research showed that pages had a half-life of a month or two, and that even links in academic journals decayed rapidly. But that was before advertising effects had been felt. Now, it may still be true that pages have a short half-life. But pages that the readership judges to be important, and which thus bring in advertising dollars, do not decay rapidly. Nor do the links that bring traffic to them. Site administrators have been schooled by advertising's reward and punishment system, and they know that gratuitously moving pages breaks links and impairs search engine rankings, which decreases income. So the problem of decaying links has been solved, not by persistent URL technology but by rewarding good behavior and punishing bad behavior.

Although the analogy between Web 1.0 pages and Web 2.0 workflows was apparent from Carole's talk, there is one big difference. Web 1.0 pages are read by people whom advertisers will pay to reach. Workflows are read by programs, whose discretionary spending is zero. There is thus no effective mechanism in the workflow world rewarding good behavior and punishing bad. Suppose that putting a service on-line that attracted a large workflow-ership rapidly caused a flow of money to arrive at the site hosting the service, sufficient to sustain and grow it. Many of the problems currently plaguing workflows would vanish overnight. Site maintainers would find, for example, that a poor availability record or non-upwards-compatible changes to their site's API would rapidly reduce their income, and being the smart young scientists they are, they would learn not to do these things.

Funding agencies and others interested in the progress of e-science need an equivalent of advertising to drive the evolution of services and workflows. Without it the field will continue to be plagued by poor performance, fragility, unreliability and instability, with much effort being wasted. More critically, one major key to scientific progress is the requirement that experiments be replicable by later researchers. In the current workflow environment it appears to be almost impossible to replicate an experiment after even a modest interval, no matter how well it was published.

What key aspects of Web advertising are needed in a system to drive the evolution of scientific workflows?

  • It must provide money directly to the maintainers of the services and workflows.

  • The amount of money must be based on automated, third-party measures of usage (think AdSense or Doubleclick or even SiteMeter for scientific workflows) and importance (think PageRank for scientific workflows); a sketch of how such measures might drive allocation appears after this list.

  • The cycle time of the reward system must match the Web, not the research funding process. myExperiment has been in public beta less than six months. In that time it has evolved significantly. A feedback process that involves writing grant proposals, having them peer-reviewed, and processed through an annual budget cycle is far too slow to have any effect on the evolution of a workflow environment.
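
As an illustration of how the second and third requirements might work together, here is a minimal sketch in Python. It is entirely hypothetical: the service names, usage counts and dependency graph are invented, the usage share stands in for an AdSense-style meter, and the importance share is a PageRank-style score computed over which services call which. Nothing here reflects myExperiment's actual data or interfaces.

```python
# Hypothetical sketch only: the service names, usage numbers and dependency
# graph below are made up, to show how a pot of infrastructure money could be
# split automatically using third-party measures of usage and importance.

POT = 100_000.0  # annual infrastructure pot, in dollars (illustrative)

# Toy dependency graph: each service lists the services its workflows call.
graph = {
    "blast":          ["sequence_fetch"],
    "align":          ["sequence_fetch", "blast"],
    "visualize":      ["align"],
    "sequence_fetch": [],
}

# Toy usage counts, as a third-party meter (an AdSense analogue) might report.
usage = {"blast": 900, "align": 400, "visualize": 150, "sequence_fetch": 1200}

def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; rank flows from callers to the services they call."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n, deps in graph.items():
            targets = deps if deps else nodes  # dangling nodes spread evenly
            for d in targets:
                new[d] += damping * rank[n] / len(targets)
        rank = new
    return rank

importance = pagerank(graph)
total_usage = sum(usage.values())

# Combine a usage share and an importance share 50/50, then split the pot.
score = {n: 0.5 * usage[n] / total_usage + 0.5 * importance[n] for n in graph}
total = sum(score.values())
for name, s in sorted(score.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} ${POT * s / total:10,.2f}")
```

The point is not the particular weights, which are arbitrary, but that the allocation runs automatically, on Web timescales, from measures that no single site controls.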

Funders need to put some infrastructure money into a pot that is doled out automatically via these measures. Doing so will pay great benefits in scientific productivity.

Saturday, October 13, 2007

Who's looking after the snowman?

In a post to the liblicense mailing list James O'Donnell, Provost of Georgetown University, asks:

"So when I read ten years from now about this cool debate among Democratic candidates that featured video questions from goofy but serious viewers, including a snowman concerned about global warming, and people were watching it on YouTube for weeks afterwards: how will I find it? Who's looking after the snowman?


This is an important question. Clearly, future scholars will not be able to understand the upcoming election without access to YouTube videos, blog posts and other ephemera. In this particular case, I believe there are both business and technical reasons why Provost O'Donnell can feel somewhat reassured, and legal and library reasons why he should not. Follow me below the fold for the details.

Here is the pre-debate version of the snowman's video, and here are the candidates' responses. CNN, which broadcast the debate, has the coverage here. As far as I can tell the Internet Archive doesn't collect videos like these.

From a business point of view, YouTube videos are a business asset of Google, and will thus be preserved with more than reasonable care and attention. As I argued here, content owned by major publishing corporations (which group now includes Google) is at very low risk of accidental loss; the low and rapidly decreasing cost per byte of storage makes the business decision to keep it available rather than take it down a no-brainer. And that is ignoring the other aspects of the Web's Long Tail economics which mean that the bulk of the revenue comes from the less popular content.

Technically, YouTube video is Flash Video. It can easily be downloaded, for example via this website. The content is in a widely used web format with at least two open-source players (MPlayer and VLC). It is thus perfectly feasible to preserve it, and for the reasons I describe here the open source players make it extraordinarily unlikely that playing the video would be impossible in 10, or even 30, years. If someone collects the video from YouTube and preserves the bits, it is highly likely that the bits will be viewable indefinitely.

But, will anyone other than Google actually collect and preserve the bits? Provost O'Donnell's library might want to do so, but the state of copyright law places some tricky legal obstacles in the way. Under the DMCA, preserving a copy of copyright content requires the copyright owner's permission. Although I heard rumors that CNN would release video of the debate under a Creative Commons license, on their website there is a normal "All Rights Reserved" copyright notice. And on YouTube, there is no indication of the copyright status of the videos. A library downloading the videos would have to assume it didn't have permission to preserve them. It could follow the example of the Internet Archive and depend on the "safe harbor" provision, complying with any "takedown letters" by removing them. This is a sensible approach for the Internet Archive, which aims to be a large sample of the Web, but not for the kind of focused collections Provost O'Donnell has in mind.

The DMCA, patents and other IP restrictions place another obstacle in the way. I verified that an up-to-date Ubuntu Linux system using the Totem media player plays downloaded copies of YouTube videos very happily. Totem uses the GStreamer media framework with plugins for specific media. Playing the YouTube videos used the FFmpeg library. As with all software, it is possible that some patent holder might claim that it violated their patents, or that in some way it could be viewed as evading some content protection mechanism as defined by the DMCA. As with all open source software, there is no indemnity from a vendor against such claims. Media formats are so notorious for such patent claims that Ubuntu segregates many media plugins into separate classes and provides warnings during the install process that the user may be straying into a legal gray area. The uncertainty surrounding the legal status is carefully cultivated by many players in the media market, as it increases the returns they may expect from what are, in many cases, very weak patents and content protection mechanisms. Many libraries judge that the value of the content they would like to preserve doesn't justify the legal risks of preserving it.

Tuesday, October 9, 2007

Workshop on Preserving Government Information

Here is an announcement of a workshop in Oxford on Preserving & Archiving Government Information. Alas, our invitation arrived too late to be accepted, but the results should be interesting. It's sponsored by the Portuguese Management Centre for an e-Government Network (CEGER). Portugal's recent history of dictatorship tends to give the Portuguese a realistic view of government information policies.

Wednesday, October 3, 2007

Update on Preserving the Record

In my post Why Preserve E-Journals? To Preserve the Record I used the example of government documents to illustrate why trusting web publishers to maintain an accurate record is fraught with dangers. The temptation to mount an "insider attack" to make the record less inconvenient or embarrassing is too much to resist.

Below the fold I report on two more examples, one from the paper world and one from the pre-web electronic world, showing the value of a tamper-evident record.


For the first example I'm indebted to Prof. Jeanine Pariser Plottel of Hunter College, who has compared the pre- and post-WWII editions of books by right-wing authors in France and shown that the (right-wing) publishers sanitized the post-WWII editions to remove much of the anti-Semitic rhetoric. Note that this analysis was possible only because the pre-WWII editions survived in libraries and private collections. They were widely distributed on durable, reasonably tamper-evident media. They survived invasion, occupation, counter-invasion and social disruption. It would have been futile for the publishers to claim that the pre-WWII editions had somehow been faked after the war to discredit the right. Prof. Plottel points to two examples of "common practice":

1. The books of Robert Brasillach (who was executed) were edited by his brother-in-law Maurice Bardèche, Professor of 19th Century French Literature at the Sorbonne during the war and stripped of his post after it. The two men published an Histoire du cinéma in 1935. In subsequent editions published after the war, beginning in 1947, the term "fascism" is replaced by "anti-communisme."

2. Lucien Rebatet's Les décombres (1942) was one of the best-sellers of the Occupation, and it is virulently anti-Semitic. A new expurgated version was later published under the title Mémoire d'un fasciste. Who was Rebatet? you ask. Relegated to oblivion, I hope. Still, you may remember Truffaut's film, Le dernier métro (wonderful and worth seeing, if you haven't). The character Daxiat is modeled upon Rebatet.

In a web-only world it would have been much easier for the publishers to sanitize history. Multiple libraries keeping copies of the original editions would have been difficult under the DMCA. It is doubtful whether the library copies would have survived the war. The publishers' changes would likely have remained undetected, and had they been detected, the critics would have been much easier to discredit.

The second example is here. This fascinating paper is based on Will Crowther's original source code for ADVENT, the pioneering work of interactive fiction that became, with help from Don Woods, the popular Adventure game. The author, Dennis Jerz, shows that the original was based closely on a real cave, part of Kentucky's Colossal Cave system. This observation was obscured by Don Woods' later improvements.

As the swift and comprehensive debunking of the allegations in SCO vs. IBM shows, archaeology of this kind for Open Source software is now routine and effective. This is because the code is preserved in third-party archives which use Source Code Control systems derived from Marc Rochkind's 1972 SCCS, and thus provide a somewhat tamper-evident record. Jerz shows that Crowther's original ADVENT dates from the 1975-6 academic year, when SCCS had yet to become widely used outside Bell Labs and the technology needed for third-party repositories was a decade in the future. Jerz's work depended on Stanford's ability to recover data from backups of Don Woods' student account from 30 years ago; an impressive feat of system administration! Don Woods vouches for the recovered code, so there's no suspicion that it isn't authentic.

How likely is it that other institutions could recover 30-year old student files? Absent such direct testimony, how credible would allegedly recovered student files that old be? Yet they have provided important evidence for the birth of an entire genre of fiction.

Sunday, September 16, 2007

Sorry for the gap in posting

It was caused by some urgent work for the CLOCKSS project, vacation and co-authoring a paper which has just been submitted. The paper is based on some interesting data, but I can't talk about it for now. I hope to have more time to blog in a week or two after some upcoming meetings.

In the meantime, I want to draw attention to some interesting discussion about silent corruption in large databases that relates to my "Petabyte for a Century" post. Here (pdf) are slides from a talk by Peter Kelemen of CERN describing an on-going monitoring program using fsprobe(8). It randomly probes 4000 of CERN's file systems, writing a known pattern then reading it back to look for corruption. They find a steady flow of 1-3 silent corruptions per day; that is, the data read back doesn't match what was written and there is no error indication.
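
For readers unfamiliar with this kind of monitoring, here is a minimal sketch of the write-a-pattern, read-it-back idea in Python. It is not CERN's fsprobe(8), just an illustration of how silent corruption can be detected without relying on the storage stack to report an error.

```python
import os
import secrets

BLOCK = 1024 * 1024  # size of the probe file: 1 MiB of known data

def probe(directory):
    """Write a known pattern, read it back, and flag silent mismatches."""
    pattern = secrets.token_bytes(BLOCK)
    path = os.path.join(directory, "probe_test.bin")
    try:
        with open(path, "wb") as f:
            f.write(pattern)
            f.flush()
            os.fsync(f.fileno())   # push the bytes out through the page cache
        with open(path, "rb") as f:
            readback = f.read()
        if readback != pattern:    # no I/O error reported, yet the bytes differ
            print(f"silent corruption detected in {path}")
            return False
        return True
    finally:
        os.remove(path)

if __name__ == "__main__":
    probe("/tmp")
```

A real probe also has to defeat the operating system's page cache (for example with O_DIRECT, or by re-reading long after writing); an immediate read-back like the one above is likely to be satisfied from memory and never exercise the disks at all.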

Peter sparked a discussion and a post at KernelTrap. The slides, the discussion and the post are well worth reading, especially if you are among the vast majority who believe that data written to storage will come back undamaged when you need it.

Also, in a development related to my "Mass-market Scholarly Communication" post, researchers at UC's Office of Scholarly Communication released a report that apparently contradicts some of the findings of the UC Berkeley study I referred to. I suspect, without having read the new study, that this might have something to do with the fact that they studied only "ladder-rank" faculty, whereas the Berkeley team studied a more diverse group.

Thursday, August 9, 2007

The Mote In God's Eye

I gave a "Tech Talk" at Google this week. Writing it, I came up with two analogies that are worth sharing, one based on Larry Niven and Jerry Pournelle's 1974 science fiction classic The Mote In God's Eye and the other on global warming. Warning: spoilers below the fold.

The Mote In God's Eye describes humanity's first encounter with intelligent aliens, called Moties. Motie reproductive physiology locks their society into an unending cycle of over-population, war, societal collapse and gradual recovery. They cannot escape these Cycles; the best they can do is to try to ensure that each collapse starts from a higher level than the one before, by preserving the record of their society's knowledge through the collapse to assist the rise of its successor. One technique they use is museums of their technology. As the next war looms, they wrap the museums in the best defenses they have. The Moties have become good enough at preserving their knowledge that the next war will feature lasers capable of sending light-sails to the nearby stars, and the use of asteroids as weapons. The museums are wrapped in spheres of two-meter thick metal, highly polished to reduce the risk from laser attack.

"Horst, this place is fantastic! Museums within museums; it goes back incredibly far - is that the secret? That civilization is very old here? I don't see why you'd hide that."

"You've had a lot of wars," Potter said slowly.

The Motie bobbed her head and shoulder. "Yah."

"Big wars."

"Right. Also little wars."

"How many?"

"God's sake, Potter! Who counts? Thousands of Cycles. Thousands of collapses back to savagery."


One must hope that humanity's problems are less severe than those of the Moties, but it is clear that preserving the record of society's knowledge is, and always has been, important. At first, societies developed specialist bards and storytellers whose job it was to memorize the knowledge and pass it on to succeeding generations. The invention of writing led to the development of libraries full of manuscripts. Most libraries at this stage both collected copies of manuscripts and employed scribes to copy them for exchange with other libraries. It took many man-years of work to create a copy, but the copies were extremely robust. Vellum, papyrus and silk can last a millennium or more.

Printing made copies cheap enough for the consumer market, thereby eliminating the economic justification for libraries to create copies. They were reduced to collecting the mass-market products. But it was much cheaper to run a library, so there were many more of them, and readers' access to information improved greatly. The combination of a fairly durable paper medium and large numbers of copies in library collections made the system remarkably effective at preserving society's knowledge. It has worked this way for about 550 years and, in effect, no-one really had to pay for it. Preservation was just a side-effect of the way readers got access to knowledge; in economic jargon, an externality.

Humanity is only now, arguably much too late, coming to terms with the externalities (global warming, acidification of the oceans and so on) involved in burning fossil fuels. The difficulty is that technological change means that something which once was free (the externality) must now be paid for. And those whose business models benefited most from the free externality (e.g. the fossil fuel industry and its customers) have a natural reluctance to pay. Governments are considering imposing carbon taxes, cap-and-trade schemes or other ways to ensure that the real costs of maintaining a survivable environment are paid. Similarly, the technology industry, and in particular highly profitable information providers such as Google, Elsevier and News Corporation, is unlikely to fund the necessary two-meter thick metal shells without encouragement.

Sunday, July 15, 2007

Update to "Petabyte for a Century"

In a paper (abstract only) at the Archiving 2007 conference Richard Moore and his co-authors report that the San Diego Supercomputer Center's cost to sustain one disk plus three tape replicas is $3K per terabyte per year. The rapidly decreasing disk media cost is only a small part of this, so that the overall cost is not expected to drop rapidly. Consider our petabyte of data example. Simply keeping it on-line with bare-bones backup, ignoring all access and update costs, will cost $3M per year. The only safe funding mechanism is endowment. Endowing the petabyte at a 7% rate of return is a $43M investment.
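
As a back-of-the-envelope check of those figures, taking the SDSC cost of $3K per terabyte per year and the 7% rate of return as given:

```python
COST_PER_TB_YEAR = 3_000   # SDSC figure: one disk copy plus three tape replicas
RETURN_RATE = 0.07         # assumed rate of return on the endowment

terabytes = 1_000          # one petabyte
annual_cost = terabytes * COST_PER_TB_YEAR
endowment = annual_cost / RETURN_RATE

print(f"annual cost: ${annual_cost:,}")    # $3,000,000 per year
print(f"endowment:   ${endowment:,.0f}")   # $42,857,143, i.e. about $43M
```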

There are probably already many fields of study for which the cost of generating a petabyte of useful data is less than $43M. The trend in the cost per byte of generating data is down, in part because of the increased productivity of scholarship based on data rather than directly on experiment. Thus the implied and unacknowledged cost of preserving the data may in many cases overwhelm the acknowledged cost of the project that generated it.

Further, if all the data cannot be saved, a curation process is needed to determine what should be saved and add metadata describing (among other things) what has been discarded. This process is notoriously hard to automate, and thus expensive. The curation costs are just as unacknowledged as the storage costs. The only economically feasible thing to do with the data may be to discard it.

An IDC report sponsored by EMC (pdf) estimates that the world created 161 exabytes of data in 2006. Using SDSC's figures it would cost almost half a trillion dollars per year to keep one on-line copy and three tape backup copies. Endowing this amount of data for long-term preservation would take nearly seven trillion dollars in cash. It's easy to see that a lot of data isn't going to survive.
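
The same back-of-the-envelope arithmetic, scaled up to IDC's estimate and again assuming $3K per terabyte per year and a 7% return:

```python
COST_PER_TB_YEAR = 3_000            # SDSC figure, as above
RETURN_RATE = 0.07

terabytes = 161 * 1_000_000         # 161 exabytes; 1 EB = 10^6 TB
annual_cost = terabytes * COST_PER_TB_YEAR
endowment = annual_cost / RETURN_RATE

print(f"annual cost: ${annual_cost / 1e12:.2f} trillion")  # about $0.48 trillion
print(f"endowment:   ${endowment / 1e12:.2f} trillion")    # about $6.90 trillion
```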

The original "Petabyte for a Century" post is here.

[Edited to correct broken link]