Saturday, October 29, 2011

What Problems Does Open Access Solve?

The library at the University of British Columbia invited me to speak during their Open Access Week event. Thinking about the discussions at the recent Dagstuhl workshop I thought it would be appropriate to review the problems with research communication and ask to what extent open access can help solve them. Below the fold is an edited text of the talk with links to the sources.

Who Am I?

I'm honored by your invitation to take part in this celebration of Open Access. I am impressed by the changes that the Open Access movement has already wrought in the system of research communication, and believe that there are more to come. Nevertheless, the system of research communication is dysfunctional in many ways. Open access can help with some of them, but it isn't a panacea, and it may make some problems worse.

But who am I to pontificate on this subject? I'm an engineer. I spent my career in academia and Silicon Valley startups. Although I have published many papers, even prize-winning ones, doing so wasn't important compared to making systems that worked in the real world and, ideally, made money. In general, publishing isn't as critical even for academic engineers as it is for other subjects; there tend to be more objective ways of measuring our impact. Since publishing isn't a matter of life and death for me, I may be able to take a more dispassionate view of the subject.


After doing three startups, all of which IPO-ed, I was pretty burnt-out. I got involved in the academic publishing business by accident. In 1995 the Stanford Library pioneered the transition of academic journals from paper to the Web when HighWire Press put the Journal of Biological Chemistry on-line. Readers rapidly found features such as links for citations and citing papers, spreadsheets of the data behind graphs, and instant access from anywhere essential. But librarians, who actually paid for the journals, were much less happy. The Web forced them to switch from purchasing a copy of the journal to renting access to the publisher's copy. They worried about access to the material in the long term.

In 1998 Vicky Reich and I came up with the program called LOCKSS (Lots Of Copies Keep Stuff Safe) as a way of implementing the purchase model in the Web world; allowing librarians to take custody of their own copy of the e-journals to which they subscribed and keep it to guarantee future readers access. Now, about 150 libraries around the world use the LOCKSS system to collect and preserve subscription e-journals. The LOCKSS system has proved useful beyond its original target of subscription e-journals and now preserves a wide range of content including open access journals and websites, e-books and library special collections.

Research Communication

This summer, I was invited to a workshop on the "Future of Research Communication" at Schloss Dagstuhl and ended up responsible for the part of the workshop report that described the problems of the current system. Although the report isn't finished, I posted my draft section to my blog, where it drew considerable attention. This talk is based on that section. I should here, as I did in the post, express my debt to the British House of Commons Science and Technology Committee's report on Peer review in scientific publications.

After all, if we were proposing changing the system, we were presumably dissatisfied with things about it. Based on this, and other discussions at the workshop, when I was invited to give this talk I thought it would be appropriate to ask "to what extent open access can help solve the problems with the current system?"

To answer that question, we need to know what the problems are. My approach to identifying them was to look at how the system addresses the needs of the various interested parties:

  • The general public
  • Researchers
  • Libraries
  • Publishers
  • Software Developers
I'll address each of these in turn, looking first at the problems and then at the extent to which open access addresses them.

The General Public

In the U.S., the general public pays for a large proportion of research, either directly, or via the R&D tax credit, or via the tax exemptions provided to foundations. Things are similar elsewhere. It is a matter of simple fairness that the public should be able to see the results of their investments. But more than just fairness is at issue. The public needs to be able to extract reliable information from the deluge of mostly ill-informed, self-serving or commercial messages that forms their information environment. They have been educated to believe that content branded "peer-reviewed" is a gold standard on which they can rely.

It would be in the public interest if it were reliable but high-profile examples show this isn’t always the case. For example, it took 12 years before the notorious Wakefield paper linking MMR vaccine to autism was retracted, and another 11 months before the full history of the fraud was revealed. The delay had serious effects on public health; UK MMR uptake went from about 90% to about 65%.

The additional quality denoted by the "peer-reviewed" brand has been decreasing:
"False positives and exaggerated results in peer-reviewed scientific studies have reached epidemic proportions in recent years."
One major cause has been that the advent of the Internet, by reducing the cost of distribution, encouraged publishers to switch libraries from subscribing to individual journals to the "big deal", in which they paid a single subscription to access all of a publisher's content. In the world of the big deal, many publishers discovered the effectiveness of this Microsoft-like "bundling" strategy. By proliferating cheap, low-quality journals, thus inflating the perceived value of their deal to the librarians, they could grab more of the market.

This intuitive conclusion is supported by detailed economic analysis of the "big deal":
"Economists are familiar with the idea that a monopoly seller can increase its profits by bundling. This possibility was discussed by W.J. Adams and Janet Yellen and by Richard Schmalensee. Hal Varian noted that academic journals are well suited for this kind of bundling. Mark Armstrong and Yannis Bakos and Erik Brynjolfsson demonstrated that bundling large collections of information goods such as scholarly articles will not only increase a monopolist's profits, but will also decrease net benefits to consumers."
Researchers cooperated with the proliferation of journals. They were seduced by extra opportunities to publish and extra editorial board slots. They did not see the costs, which were paid by their librarians or funding agencies. The big deal deprived librarians of their economic ability to reward high quality journals and punish low quality journals:
"'Libraries find the majority of their budgets are taken up by a few large publishers,' says David Hoole, director of brand marketing and institutional relations at [Nature Publishing Group]. 'There is [therefore] little opportunity [for libraries] to make collection decisions on a title-by-title basis, taking into account value-for-money and usage.'"
The inevitable result of stretching the "peer-reviewed" brand in this way has been to devalue it. Almost anything, even commercial or ideological messages, can be published under the brand:
"BIO-Complexity is a peer-reviewed scientific journal with a unique goal. It aims to be the leading forum for testing the scientific merit of the claim that intelligent design (ID) is a credible explanation for life."
Authors submit papers repeatedly, descending the quality hierarchy to find a channel with lax enough reviewing to accept them. PLoS ONE publishes every submission that meets its criteria for technical soundness; despite this, 40% of the submissions it rejects as unsound are eventually published elsewhere. Even nonsense can be published if page charges are paid.

Commercial and ideological interests are using their ability to brand their messages as "peer-reviewed" as part of a campaign to discredit science, portraying scientists as "insiders" conspiring to defraud taxpayers by creating bogus scares such as global warming. To the extent that these efforts are successful, taxpayers will be reluctant to continue to have their hard-earned dollars fund research.

Open access to research communication is clearly in the public interest. Hiding research, especially the best research, behind paywalls from the people who paid for it is, in the context of the campaign to discredit scientists as "insiders", shooting ourselves in the foot. Objections are often raised that the material is so arcane that the general public has no business reading it; this is not merely condescending but actively harmful.

How will the general public find open access peer-reviewed content? The answer is clear: via Google, just as researchers do. As soon as journals published by HighWire Press allowed Google to index them, their traffic exploded. A recent survey of Chinese scientists found that even in China, where Google doesn't dominate the search market (Baidu does), it is the first place they go to find papers.
"More than 80% use the search engine to find academic papers; close to 60% use it to get information about scientific discoveries or other scientists' research programmes; and one-third use it to find science-policy and funding news ... 'The findings are very typical of most countries in the world,' says David Bousfield, ... 'Google and Google Scholar have become indispensable tools for scientists.'"
This makes it pretty much irrelevant whether a paper is accessed from the journal publisher, from an institutional or subject repository, or from the author's own website. The brand, or tag, of the journal that published it still matters as a proxy for quality, but the journal isn't essential for access. But this also raises an issue; publishing open access peer-reviewed research is a good thing, especially in controversial areas, but it is unlikely to have much effect without investments in search engine optimization to match those of the opponents.

Researchers

Researchers play many roles in the flow of research communication, as authors, reviewers, readers, reproducers and re-users. As readers, researchers are in a better position to judge the quality of material they read, and are normally not impeded by paywalls. Their institution's library typically pays for their access from some overhead budget with which they are not concerned. Scholars have been complaining of information overload for more than two centuries. Online access provides much better discovery and aggregation tools, but these tools struggle against the fragmentation of research communication caused by the rapid proliferation of increasingly specialized and overlapping journals with decreasing quality of reviewing.

Open access will, of course, improve the access of some researchers to some of the literature, especially in developing countries. But because most researchers already have unimpeded access to the literature that is most important for their work, this effect is likely to be marginal. A much bigger potential effect is on the tools that make readers more productive. The more of the literature that is open access, the lower the barriers to entry for tool makers, the more competitive the market for tools, and thus the more effective the tools are likely to be.

Google came out of the NSF-funded Digital Library project at Stanford's CS department. My wife, Vicky Reich, was the librarian on the project; she was the one who explained citation indexes, the basis for PageRank, to Larry and Sergey. The extraordinary development of search technology since then is an example of what open access to the literature and a competitive environment can do. Imagine how useful Google would be if it had to negotiate individually with each web site to be allowed to index it.

The evaluations that determine researchers' career success are, in most cases, based on their role as authors of papers (for scientists) or books and monographs (for humanists), to the exclusion of their other roles. In the sciences, credit is typically based on the number of papers and the "impact factor" of the journal in which they appeared. Journal impact factor is generally agreed to be a seriously flawed measure of the quality of the research described by a paper. Although impact factors are based on citation counts for their articles, journal impact factors do not predict article citation counts, which are in any case easily manipulated. For example, a citation pointing out that an article had been retracted acts to improve the impact factor of the journal that had to retract it.

There are already competitors for impact factor, but their impact has been limited. Neither the impact factor nor its competitors have unimpeded access to the entire literature. Uniform open access would encourage competition and rapid evolution of better tools for evaluating the work of researchers, which would not merely benefit the researchers worth rewarding, but also improve the "bang for the buck" of public research funding.

Peer review depends on reviewers, who are only very indirectly rewarded for their essential efforts. The anonymity of reviews makes it impossible to build a public reputation as a high-quality reviewer. If articles had single authors and averaged three reviewers, authors would need to do an average of three reviews per submission. Multiple authorship reduces this load, so if it were evenly distributed it would be manageable. In practice, the distribution is heavily skewed, loading some reviewers enough to interfere with their research:
"Academic stars are unlikely to be available for reviewing; hearsay suggests that sometimes professors ask their assistants or PhD students to do reviews which they sign! Academics low down in the pecking order may not be asked to review. Most reviews are done by academics in the middle range of reputation and specifically by those known to editors and who have a record of punctuality and rigour in their reviews: the willing and conscientious horses are asked over and over again by overworked and—sometimes desperate—editors."
The cost is significant:
"In 2008, a Research Information Network report estimated that the unpaid non-cash costs of peer review, undertaken in the main by academics, is £1.9 billion globally each year."
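The reviewing-load arithmetic above is easy to make explicit. A toy steady-state model; the three-reviewer figure comes from the text, while the four-authors-per-paper value is purely an illustrative assumption:

```python
def reviews_owed_per_author(reviewers_per_paper, authors_per_paper):
    """In steady state each submission consumes `reviewers_per_paper` reviews,
    supplied collectively by its `authors_per_paper` authors."""
    return reviewers_per_paper / authors_per_paper

# Single-author papers: every author owes three reviews per submission.
print(reviews_owed_per_author(3, 1))  # 3.0
# With four authors per paper (illustrative), the average load drops below
# one review per author per submission - manageable if evenly spread,
# which, as the quote above notes, it is not.
print(reviews_owed_per_author(3, 4))  # 0.75
```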
Reviewers rarely have access to the raw data and enough information on methods and procedures to be able to reproduce the results, even if they had adequate time and resources to do so. Lack of credit for thorough reviews means there is little motivation to do so. Reviewers are thus in a poor position to detect falsification or fabrication. Experimental evidence suggests that they aren’t even in a good position to detect significant errors:
"Indeed, an abundance of data from a range of journals suggests peer review does little to improve papers. In one 1998 experiment designed to test what peer review uncovers, researchers intentionally introduced eight errors into a research paper. More than 200 reviewers identified an average of only two errors. That same year, a paper in the Annals of Emergency Medicine showed that reviewers couldn't spot two-thirds of the major errors in a fake manuscript. In July 2005, an article in JAMA showed that among recent clinical research articles published in major journals, 16% of the reports showing an intervention was effective were contradicted by later findings, suggesting reviewers may have missed major flaws."
My favorite example of poor reviewing appeared in AAAS Science in 2003. It was Dellavalle RP et al: Going, Going, Gone: Lost Internet References, a paper about the decay of Internet links. The authors failed to acknowledge that the paper repeated, with smaller samples and somewhat worse techniques, two earlier studies that had been published in Communications of the ACM 9 months before, and in IEEE Computer 32 months before. Neither of these are obscure journals. It is particularly striking that neither the reviewers nor the editors bothered to feed the keywords from the article abstract into Google; had they done so they would have found both of these earlier papers at the top of the search results.

Clearly, if the reviewers and editors of a premier journal can't be bothered even to feed the keywords for a paper into Google, the peer review process is just broken. Complaining about reviewers won't fix it. They need better tools so they can do a better job with less effort. Open access to the literature would enable a competitive market for reviewer tools aimed at detecting not just plagiarism, as is common today, but the kind of duplicative research demonstrated by the Science example. Tools of this kind can be effective only if they have the access to the entire literature provided by uniform open access.

Peer review is often said to be the gold standard of science, but this is not the case. The gold standard in experimental science is reproducibility, ensuring that anyone repeating the experiment gets the same result. When even a New York Times op-ed points out that, in practice, scientists almost never reproduce published experiments it is clear that there is a serious problem. Articles in high-impact journals are regularly retracted; there is even a blog tracking retractions. Lower-impact journals retract articles less frequently, but this probably reflects the lesser scrutiny that their articles receive rather than a lower rate of error. These retractions are rarely based on attempts to reproduce the experiments in question. Researchers are not rewarded for reproducing previous experiments. Causing a retraction does not normally count as a publication, and it can be impossible to publish refutations:
"Three teams of scientists promptly tried to replicate his results. All three teams failed. One of the teams wrote up its results and submitted them to [the original journal]. The team's submission was rejected — but not because the results were flawed. As the journal's editor [explained], the journal has a longstanding policy of not publishing replication studies. 'This policy is not new and is not unique to this journal,' he said. As a result, the original study stands."
The lack of recognition for reproducing experiments is the least of the barriers to reproducibility. Publications only rarely contain all the information an independent researcher would need in order to reproduce the experiment in question:
"The article summarises the experiment ... - the data are often missing or so emasculated as to be useless. It is the film review without access to the film."
Quite apart from the difficulty of reproducing the experiment, this frequently prevents other researchers from re-using the techniques in future, related experiments. Isaac Newton famously "stood on the shoulders of giants"; it is becoming harder and harder for today's researchers to stand on their predecessors' shoulders.

Re-using the data that forms the basis of a research communication is harder than it should be. Even in the rare cases when the data is part of the research communication, it typically forms "supplementary material", whose format and preservation are inadequate. In other cases the data are in a separate data repository, tenuously linked to the research communication. Data submissions are only patchily starting to be citable via DOIs.

Open access to data is thus even more important than open access to the literature. It poses two classes of problem:
  • Technical: in that the formats and metadata needed to render the data useful to others are far more diverse and complex than those needed to make the literature readable.
  • Legal: in that the intellectual property framework for data is far less clear and well-established than that for the literature.
The world of Linked Open Data is starting to address the technical problems. Among its many advantages is that, just as the Web provides an infrastructure for measuring the world of words, Linked Open Data provides an infrastructure for measuring the world of data. One of the measurements that has already been made shows the extent of the legal problem. Over 80% of all linked data sources provide no license information. Anyone using these data sources is placing themselves in legal jeopardy.

Libraries

Libraries used to play an essential role in research communication. They purchased and maintained local collections of journals, monographs and books, reducing the latency and cost of access to research communications for researchers in the short term. As a side effect of doing so, they safeguarded access for scholars in the long term. A large number of identical copies in independently managed collections provided a robust preservation infrastructure for the scholarly record.

The transition to the Web as the medium for scholarly communication has ended the role local library collections used to play in short-term access. In many countries, such as the US, libraries (sometimes in consortia) retain their role as the paying customers of the publishers. In other countries, such as the UK, negotiations as to the terms of access and payment for it are now undertaken at a national level. But neither provides librarians much ability to be discriminating customers of individual journals, because both are subject to the "big deal". Libraries bought into the big deal despite warnings from a few perceptive librarians who saw the threat:
"Academic library directors should not sign on to the Big Deal or any comprehensive licensing agreement with commercial publishers ... the Big Deal serves only the Big Publishers ... increasing our dependence on publishers who have already shown their determination to monopolize the marketplace"
Libraries and archives have been forced to switch from purchasing a copy of the research communications of interest to their readers, to leasing access to the publisher's copy. Librarians did not find publishers’ promises of "perpetual access" to the subscribed materials convincing as a replacement for libraries’ role as long-term stewards of the record. Two approaches to this problem of long-term access have emerged:
  • A single third-party subscription archive called Portico. Portico collects and preserves a copy of published material. Libraries subscribe to Portico and, as long as their subscription continues, can have access to material they used to but no longer subscribe to. Portico has been quite widely adopted, despite not actually implementing a solution to the problem of post-cancellation access (logically, it is a second instance of the same problem), but has yet to achieve economic sustainability.
  • A distributed network of local library collections called LOCKSS (Lots Of Copies Keep Stuff Safe), modeled on the way libraries work in the paper world. Publishers grant permission for LOCKSS boxes at subscribing libraries to collect and preserve a copy of the content to which they subscribe. Fewer libraries are using the LOCKSS system to build collections than subscribe to Portico for post-cancellation access. Despite this the LOCKSS program has been financially sustainable since 2007.
Open access makes preservation easier in some ways; the Creative Commons license pre-authorizes everything that is needed for preservation. The motivation for participation in Portico is post-cancellation access to subscription e-journals, which goes away with open access, or indeed with journals that have a "moving wall". However, in general it is harder to justify spending to preserve open access material, because there is no sunk investment to protect.

Bereft of almost all their role in the paper world, libraries are being encouraged both to compete in the electronic publishing market and to take on the task of running "institutional repositories", in effect publishing their scholars' data and research communications. Both tasks are important; neither has an attractive business model. Re-publishing an open access version of their scholars' output may seem redundant, but it is essential if the artificial barriers that intellectual property restrictions have erected to data-mining and other forms of automated processing are to be overcome.

In many fields the volumes of data to be published, and thus the costs of doing so, are formidable:
"Adequate and sustained funding for long-lived data collections ... remains a vexing problem ... the widely decentralized and nonstandard mechanisms for generating data ... make this problem an order of magnitude more difficult than our experiences to date ..."
Because these vast collections of data are, in many cases, the output of scholars at many institutions, the motivation for an individual institution to expend the resources needed for publishing is weak. The business models for subject repositories are fragile; the UK's Arts and Humanities Data Service failed when its central funding was withdrawn, and arXiv's finances are shaky at best. A "Blue Ribbon Task Force" recently addressed the issue of sustainable funding for long-term access to data; its conclusions were not encouraging.

As we have seen, the potential gains from open access to data dwarf those from open access to the literature. To realize these gains, some way must be found to fund the storage and preservation of the data for the long term. There are three possible business models for storage:
  • Rent: An example is Amazon's S3. Rent requires a flow of funds indefinitely, which does not match the project-based, term-limited model of research funding.
  • Monetization: An example is Google's Gmail. Google pays for the storage by running ads against readers' access to their mail. Data is read by programs, not by people with credit cards, so the scope for monetizing it is negligible.
  • Endowment: An example is Princeton's Pay Once, Store Endlessly service, in which data is deposited together with a capital sum believed to be adequate to pay for its storage "forever".
All three are problematic, especially for data. Advocates for endowment correctly note that it is the only model that matches the way research is funded. But the economics of endowment depend critically on the decades-long exponential drop in disk cost, known as Kryder's Law, continuing. Even Kryder himself projects it ending before 2026. I'm more skeptical about this and other assumptions behind the endowment model.
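The dependence on Kryder's Law can be made concrete. A sketch with illustrative numbers (none from the talk): if storage is repurchased yearly and its price falls by a constant fraction, the total cost is a convergent geometric series and a finite endowment suffices; if the decline stops, costs accumulate without bound.

```python
def total_storage_cost(initial_cost, annual_price_drop, years):
    """Cumulative cost of holding a fixed amount of data for `years` years,
    repurchasing storage annually as prices fall by `annual_price_drop`."""
    total, cost = 0.0, initial_cost
    for _ in range(years):
        total += cost
        cost *= 1.0 - annual_price_drop
    return total

# A 30%/yr Kryder-style decline: the century-long total approaches the
# geometric-series limit initial_cost / annual_price_drop.
print(round(total_storage_cost(100.0, 0.30, 100), 2))  # 333.33
# If the decline ends, the "endowment" must grow linearly with time:
print(total_storage_cost(100.0, 0.0, 100))  # 10000.0
```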

Publishers

Academic publishing is a multi-billion dollar business. For at least some of the large publishers, it is extraordinarily lucrative, which means that any significant change will face heavily-funded opposition. Industry analyst Sami Kassab says:
"legal publishing (Westlaw and LexisNexis) operate on lower operating profit margins (c. 25-30% for West and 15-20% for LexisNexis), financial information (Reuters, Bloomberg) are on around 20%, educational publishing (i.e school textbooks and college textbooks) operates on c. 10-15% for school and 20-25% for college textbooks. Within Media, the marketing services industry (Omnicom, Interpublic, Publicis, WPP) generates 12-17%. Newspapers (when not dead) tend to generate 10-15% operating profit margins. TV Broadcasting is on 10-15%.

Google operates on similar operating profit margins at 30-35%. The only Media segment that I know off with higher margins is the Yellow Pages industry with 45-50% but rapidly declining."
Publishers' financial reports are tricky to interpret, but my calculations suggest that:
  • Elsevier's academic publishing operation generates $870M after-tax profit on $3160M gross revenue. In other words, 27 cents of every subscription dollar flows directly to Reed Elsevier's shareholders.
  • Wiley's academic publishing operation generates $200M after-tax profit on $990M of gross revenue, so 20 cents of every subscription dollar flows to Wiley's shareholders.
  • Springer's academic publishing operation generates $330M after-tax profit on gross revenue of $950M, so about 35 cents of every subscription dollar flows to Springer's shareholders.
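The cents-per-dollar figures are simply after-tax profit divided by gross revenue; a quick arithmetic check against the reported numbers:

```python
# (after-tax profit, gross revenue) in $M, as reported above
publishers = {
    "Elsevier": (870, 3160),
    "Wiley": (200, 990),
    "Springer": (330, 950),
}
for name, (profit, revenue) in publishers.items():
    print(f"{name}: {100 * profit / revenue:.1f} cents per subscription dollar")

# Combined after-tax profit across the three, in $M:
print(sum(profit for profit, _ in publishers.values()))  # 1400
```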
Despite this cornucopia of cash, the big publishers' search for additional revenue has become extreme:
"At the beginning of 2011, researchers in Bangladesh, one of the world’s poorest countries, received a letter announcing that four big publishers would no longer be allowing free access to their 2500 journals through the Health InterNetwork for Access to Research Initiative (HINARI) system. It emerged later that other countries are also affected."
The world's research and education budgets pay these three companies alone about $3.2B/yr for management, editorial and distribution services. Over and above that, the world's research and education budgets pay the shareholders of these three companies nearly $1.5B for the privilege of reading (and writing, and reviewing) the results of research that these budgets already paid for.

The over three billion dollars a year might be justified if the big publishers' journals were of higher quality than those of competing not-for-profit publishers, but:
"Surveys of [individual] journal [subscription] pricing ... show that the average price per page charged by commercial publishers is several times higher than that which is charged by professional societies and university presses. These price differences do not reflect differences in quality. If we use citation counts as a measure of journal quality ... we see that the prices charged per citation differ by an even greater margin."
It is hard to justify the one and a half billion dollars a year on any basis. It does not represent a competitive return on past investments, since publishers such as HighWire Press have shown that it is possible to deploy competitive publishing platforms for less than 1% of this annual return.

Not-for-profit publishers can be as rapacious as the for-profit giants. For example, the American Chemical Society has revenues of $460M and rewards its executives so lavishly that working chemists are protesting.

Publishers' major customers, libraries, are facing massive budget cuts and thus are unlikely to be a major source of additional revenue:
"The Elsevier science and medical business ... saw modest growth reflecting a constrained customer budget environment."
The bundling model of the big publishers means that, in response to these cuts, libraries typically cancel their subscriptions to smaller and not-for-profit publishers, so that tough times increase the market dominance of the big publishers.

Can these generous margins be sustained? They ought to be an invitation to disruption by low-cost competitors, but they demonstrate that the three big publishers have effective monopoly power in their market:
"... despite the absence of obvious legal barriers to entry by new competing journals. Bergstrom argues that journals achieve monopoly power as the outcome of a 'coordination game' in which the most capable authors and referees are attracted to journals with established reputations. This market power is sustained by copyright law, which restricts competitors from selling 'perfect substitutes' for existing journals by publishing exactly the same articles. In contrast, sellers of shoes or houses are not restrained from producing nearly identical copies of their competitors' products."
Note that this has nothing to do with subscriptions versus open access. For-profit publishers are starting open access journals. 40% of Nature Communications authors pay $5K for open access, based on Nature's brand, which is probably enough to sustain Nature's traditional margins.

Nevertheless, some not-for-profit competitors are emerging that could pose a significant threat to both for-profit and not-for-profit society publishers. In 2010, PLoS achieved sustainability on a 6% operating margin despite author charges of only $1350. David Crotty of Oxford University Press and Kent Anderson, a society publisher, project that PLoS could achieve Elsevier-level margins by maintaining their current pricing and increasing volume (I believe they are wrong). They are outraged by this prospect, but should be far more worried by the prospect of PLoS maintaining their current margins and reaping economies of scale, which would reduce the author charges.

Even if the shift from subscriptions to author-pays open access left publishers' shareholders and not-for-profit executives extracting the same lavish sums from the world's research and education budgets, it would still be a good thing. The reason is that it would transfer the visibility of these charges from the librarians, who are no longer able to reward good journals and punish bad ones, to the authors, who are in a position to do so through their choice of where to publish. At least funds for publishing bad research would come from the budgets supporting bad research. The effect would likely be to increase the competitive threat of a low-price PLoS model.

So far, I've been talking about e-journals. But what about e-books? Last year, The Economist published one of its excellent reviews, this one on the media business entitled "A World Of Hits". The theme was that traditional publishers of all kinds were increasingly devoting their resources to the big hits, while the long tail of minority interest content was self-publishing via the Internet. The mid-range, where most of the long-term cultural value lives, was in trouble.

I'm less pessimistic about the mid-range than The Economist. Here are two examples:
  • Pomplamoose is Jack Conte and Nataly Dawn, a pair of musicians who graduated from Stanford and became a YouTube success. Their songs have always been free on YouTube and $0.99 on iTunes; they also make money by selling tchotchkes and playing gigs. Recently, Nataly wanted to make a solo album. She needed $20K to make it. She used KickStarter to raise $105K from 2315 fans. Pomplamoose has marketed themselves far better than any music label would have.
  • Joe Konrath and John Locke are successful published writers of mystery novels. A fascinating blog post from March this year reveals that the Kindle is providing them a profitable business model independent of publishers. At the time, John held the #1, #4 and #10 spots on the Amazon Top 100, with another 3 books in the top 40. Joe had the #35 spot. Of the top 100, 26 slots were held by independent authors, 7 of them by Amanda Hocking. John and Joe had been charging $2.99 per download, of which Amazon gave them 70%. When they dropped the price to $0.99 per download, of which Amazon gives them only 35%, not just their sales but also their income exploded. John is making $1800/day from $0.99 downloads. Kevin Kelly predicts that in 5 years the average price of e-books will be $0.99. As he points out:
    $1 is near to the royalty payment that an author will receive on, say, a paperback trade book. So in terms of sales, whether an author sells 1,000 copies themselves directly, or via a traditional publishing house, they will make the same amount of money.
If publishers were doing all the things they used to do to promote books, maybe this would not be a problem. But they aren't. It is true that neither Joe's and John's books nor Pomplamoose's music downloads are open access, but at $0.99 who cares?
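The royalty arithmetic behind that price cut is worth spelling out. Here is a quick sketch; the 70% and 35% splits are the Amazon terms quoted above, and the break-even factor is my own calculation from them:

```python
# Per-download royalty under Amazon's two tiers, as quoted above:
# 70% of a $2.99 price vs. 35% of a $0.99 price.
def royalty(price, share):
    """Author's income from a single download."""
    return price * share

high = royalty(2.99, 0.70)  # about $2.09 per download
low = royalty(0.99, 0.35)   # about $0.35 per download

# Sales must grow by roughly this factor before the cheaper price
# even matches the old income; Joe and John's sales grew far more.
breakeven = high / low
print(f"${high:.2f} vs ${low:.2f} per download; break-even at {breakeven:.1f}x sales")
```

The striking part is that the price cut paid off despite needing a six-fold increase in sales just to break even.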

Eric Hellman is trying to make truly open access e-books work with his new company Gluejar. The idea is simple:
Anyone will be able to kick off a pledge drive for a favorite book on our upcoming web site. Gluejar will work with rightsholders to determine a good price, and anyone can contribute toward meeting it. When the goal is met, rightsholders will be paid in exchange for making their works available under a Creative Commons license. The book becomes free for everyone to read and share.
The analogy to KickStarter is obvious.

To sum up, the advent of the Internet should have greatly reduced the monetary value that can be extracted from academic content. Publishers who have depended on extracting this value face a crisis. The crisis is being delayed only by the continued willingness of universities and research funders to pay. They have the power in their hands to insist on alternative models for access to the results of research, such as open access and self-archiving, but have in most cases been reluctant to do so. My suspicion is that the decision-makers in this area also tend to be on editorial boards of major journals, and thus generously treated by the big publishers.

Software Developers

A large and active movement is developing tools and network services intended to improve the effectiveness of research communication, and thus the productivity of researchers. These efforts are to a great extent hamstrung by two related problems, access to the flow of research communication, and the formats in which research is communicated.

Both problems can be illustrated by the example of mouse genomics. Researchers in the field need a database allowing them to search for experiments that have been performed on specific genes, and their results. The value of this service is such that it has been developed. However, because the format in which these experiments are reported is the traditional journal paper, this and similar databases are maintained by a whole class of scientists, generally post-Ph.D. biologists, funded by NIH, who curate information from published papers into the databases. These expensive people are spending time on tasks that could in principle be automated, time they should be spending on research.
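To make the point concrete, here is a toy sketch of the kind of extraction step a curator performs by hand. The gene-symbol pattern and the sample text are my own illustrative assumptions, not a real curation pipeline; a production system would need full-text access and far better entity recognition:

```python
import re
from collections import Counter

# Crude pattern for mouse gene symbols such as Sox2 or Pax6:
# a capital letter, some lowercase letters or digits, ending in a digit.
# This is a deliberately simplistic stand-in for real entity recognition.
MOUSE_GENE = re.compile(r"\b[A-Z][a-z0-9]{1,9}\d\b")

def extract_gene_mentions(full_text):
    """Return candidate gene symbols in a paper with mention counts."""
    return Counter(MOUSE_GENE.findall(full_text))

paper = ("We examined Sox2 and Pax6 expression in the developing "
         "cortex; Sox2 knockouts showed reduced proliferation.")
print(extract_gene_mentions(paper))
```

Even this trivial script only works if the software can read the full text of every relevant paper, which is exactly the access that fragmented copyright denies it.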

Automating this process would require providing software with access to the journal papers, replacing the access the curators get via their institution's journal subscriptions. Unfortunately, these papers are copyrighted, and the copyright is fragmented among a large number of publishers. The developers of such an automated system would have to negotiate individually with each publisher. If even a single publisher refused to permit access, the value of the automation would be greatly reduced, and humans would still be needed. Uniform open access would greatly reduce the barriers to entry for tool developers, creating a competitive market.

Equally, because the mechanism that enforces compliance with the current system of research communication attaches value only to publications in traditional formats, vast human and machine efforts are required to extract the factual content of the communication from the traditional format. Were researchers to publish their content in formats better adapted to information technology, these costs could be avoided.


Generalizing, we can say that improving the current system requires:
  • More information to be published. Open access, by reducing the costs of publishing, may help with this.
  • In formats more suited to information technology. Open access, by reducing the barriers to entry of new publishers less trapped in legacy systems, may help with this.
  • Less encumbered with intellectual property restrictions. Obviously, open access helps with this for human readers, but depending upon the particular license involved, may not help with this for programs.
  • More cheaply. Depending on the market for author charges, open access may or may not help with this.
  • With better discovery and aggregation mechanisms. Open access may help with this if the licenses involved allow for data mining.
  • Better quality metrics. Open access may help with this if the licenses involved allow for data mining.
  • Better mechanisms for improving quality. Open access, with appropriate licenses, may help with this by encouraging a competitive market in post-publication review platforms.
  • And sustainably preserved for future scholars. With suitable licenses, open access can eliminate one barrier to entry for competitive preservation systems, having to negotiate individually with copyright holders for permission.
However, it seems to me that open access is not the critical issue for the future of research communication. Looking at the big picture, we see that in the past the goal of the system was to prevent bad research being published. The cost of print meant that publishing was expensive, with high barriers to entry, and limited space. Publishers were motivated to spend their limited publication slots on only the good stuff.

The transition to the Web meant that the cost fell, and space became unlimited. The big deal was a response to this. The bundling strategy meant that publishers were now motivated to publish anything they could get their hands on, just tagged with different journal brands to maintain an impression of quality.

Open access did not cause this switch, but it helped drive its later stages. As we see from the Bentham affair, publishers now sell their brands for profit, irrespective of the quality. The premier journal brands now compete for speed of publication and attention-grabbing research, never mind that it may well be retracted in a year or so. The journal is no longer a good predictor of article quality as measured by citations.

This switch can't be reversed. We can't go back to preventing people from finding bad research by not publishing it. Everything that gets written will get published somewhere, the bad and the good. The legacy publishers will fight this tooth and nail, but unless they can improve their reviewing radically enough that the journal starts to predict article quality, the fact that it doesn't will eventually enable some better tool to out-compete the journal as a way to find good research.

In a world where everything gets published somewhere, research communication becomes a search engine optimization problem. There is hope in that, because although Google dominates the search market, they know that the barriers to entry are low, and they need to work furiously to keep their advantage. So search technology progresses rapidly. But there is also danger in this. Search engine optimization is an even more competitive, and much less ethical, market. Black-hat SEO techniques can be applied by the enemies of science, or even by unscrupulous researchers to advance their careers. Who are the white hats in this battle?


David. said...

I hope no-one thinks I'm kidding about a campaign to discredit science.

David. said...

There's an interesting NPR interview with Pomplamoose from last April which touches on their business model, including their involvement with the YouTube Musicians Wanted ad revenue sharing program.

David. said...

A good post on the effect of computers on reproducibility. Tip of the hat to Carole Goble.