Sunday, December 30, 2007

Mass-market scholarly communication revisited

Again, I need to apologize for the gap in posting. It turns out that I'm not good at combining travel and blogging, and I've been doing a lot of travel. One of my trips was to DC for CNI and the 3rd International Digital Curation Conference. One of the highlights was a fascinating talk (abstract) by Prof. Carole Goble of the University of Manchester's School of Computer Science. She's a great speaker, with a vivid turn of phrase, and you have to like a talk about science on the web in which a major example is a shoe shopping site.

Carole and her team work on enabling in silico experiments by using workflows to compose Web and other services. Their myGrid site uses their Taverna workflow system to provide scientists with access to over 3000 services. Their myExperiment "scientists social network and Virtual Research Environment encourages community-wide sharing, and curation, of workflows".

Two things I found really interesting about Carole's talk were:

  • myExperiment is an implementation of the ideas I discussed in my post on Mass-Market Scholarly Communication, enhanced with the concepts of workflows.

  • The emerging world of web services is the big challenge facing digital preservation. Her talk was a wonderful illustration both of why this is an important problem, in that much of readers' experience of the Web is already mediated by services, and of why the barriers to preserving those services are almost insurmountable.

Carole's talk was like a high-speed recapitulation of the history of the Web, with workflows taking the place of pages. More generally, it was an instance of the way Web 2.0 evolution is like Web 1.0 evolution with services instead of static content. Carole described how scientists discovered that they could link together services (pages) using workflows (links). There were soon so many services that directories of them arose (think the early Yahoo!). Then there were so many of them that search engines arose. Then enough time elapsed that people started noticing that workflows (links) decayed over time quite rapidly. There was, however, one important piece of Web 1.0 missing from her presentation - advertising. Follow me below the jump for an explanation of why this omission is important and some suggestions about what can be done to remedy it.

Advertising has played a much underestimated role in driving the evolution of the Web. It provides direct, almost instantaneous, feedback that rewards successful mutations. Because the feedback is monetary, it ensures that successful mutations will get the resources they need to thrive, and unsuccessful ones will not. We take for granted that a Web site that attracts a large readership will be able to sustain itself and grow, but the only reason this is so is because of advertising. Daily Kos, a political blogging site, was started on a shoestring in 2002 and hit a million page views a month after its first year. It now gets 15 million a month, consumes a respectable-size server farm and employs a growing staff. Darwin would recognize this process instantly.

Despite a 1998 harangue from Tim Berners-Lee, research showed that pages had a half-life of a month or two, and that even links in academic journals decayed rapidly. But that was before advertising effects had been felt. Now, it may still be true that pages have a short half-life. But pages that the readership judges to be important, and which thus bring in advertising dollars, do not decay rapidly. Nor do the links that bring traffic to them. Site administrators have been schooled by advertising's reward and punishment system, and they know that gratuitously moving pages breaks links and impairs search engine rankings, which decreases income. So the problem of decaying links has been solved, not by persistent URL technology but by rewarding good behavior and punishing bad behavior.

Although the analogy between Web 1.0 pages and Web 2.0 workflows was apparent from Carole's talk, there is one big difference. Web 1.0 pages are read by people, whom advertisers will pay to reach. Workflows are read by programs, whose discretionary spending is zero. There's thus no effective mechanism rewarding good behavior in the workflow world and punishing bad. Suppose that putting a service on-line that attracted a large workflow-ership rapidly caused a flow of money to arrive at the site hosting the service, sufficient to sustain and grow it. Many of the problems currently plaguing workflows would vanish overnight. Site maintainers would find, for example, that a poor availability record or non-upwards-compatible changes to their site's API would rapidly reduce their income, and being the smart young scientists they are, would learn not to do these things.

Funding agencies and others interested in the progress of e-science need an equivalent of advertising to drive the evolution of services and workflows. Without it the field will continue to be plagued by poor performance, fragility, unreliability and instability, with much effort being wasted. More critically, one major key to scientific progress is the requirement that experiments be replicable by later researchers. In the current workflow environment it appears to be almost impossible to replicate experiments after some time has passed, no matter how well they were published.

What key aspects of Web advertising are needed in a system to drive the evolution of scientific workflows?

  • It must provide money directly to the maintainers of the services and workflows.

  • The amount of money must be based on automated, third-party measures of usage (think AdSense or DoubleClick or even SiteMeter for scientific workflows) and importance (think PageRank for scientific workflows).

  • The cycle time of the reward system must match the Web, not the research funding process. myExperiment has been in public beta less than six months. In that time it has evolved significantly. A feedback process that involves writing grant proposals, having them peer-reviewed, and processed through an annual budget cycle is far too slow to have any effect on the evolution of a workflow environment.

Funders need to put some infrastructure money into a pot that is doled out automatically via these measures. Doing so will pay great benefits in scientific productivity.
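
To make the idea of an automatically doled-out pot a little more concrete, here is a minimal sketch in Python. Everything in it is my assumption for illustration, not a description of any existing system: the function names, the 50/50 blend of usage and importance, the damping factor, and the example workflow and service names are all hypothetical. It computes a simple PageRank-style importance score over a graph of "which workflow invokes which service", blends it with each service's share of measured invocations, and splits a fixed pot in proportion to the result.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [nodes it invokes]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Pass this node's rank on to the things it invokes.
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node (invokes nothing): spread its rank evenly.
                share = damping * rank[node] / len(nodes)
                for n in nodes:
                    new_rank[n] += share
        rank = new_rank
    return rank


def allocate(pot, usage, links, usage_weight=0.5):
    """Split `pot` among the services in `usage`, in proportion to a blend
    of usage share and PageRank-style importance (weights are assumptions)."""
    importance = pagerank(links) if links else {}
    total_usage = sum(usage.values())
    total_importance = sum(importance.get(s, 0.0) for s in usage)
    scores = {}
    for service, count in usage.items():
        u = count / total_usage if total_usage else 0.0
        i = importance.get(service, 0.0) / total_importance if total_importance else 0.0
        scores[service] = usage_weight * u + (1.0 - usage_weight) * i
    total = sum(scores.values())
    if total == 0:
        return {service: 0.0 for service in usage}
    return {service: pot * score / total for service, score in scores.items()}


# Hypothetical example: two workflows invoking two services.
example_links = {"workflow_a": ["seq_search", "align"], "workflow_b": ["seq_search"]}
example_usage = {"seq_search": 900, "align": 100}  # third-party invocation counts
example_payouts = allocate(1000.0, example_usage, example_links)
```

The important property is not the particular formula but that the payout can be recomputed as often as the usage measurements arrive, which is what lets the cycle time match the Web rather than the grant process.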