Tuesday, January 31, 2012

The 5 Stars of Online Journal Articles

David Shotton, another participant in last summer's Dagstuhl workshop on Future of Research Communications, has an important article in D-Lib entitled The Five Stars of Online Journal Articles — a Framework for Article Evaluation. By analogy with Tim Berners-Lee's Five Stars of Linked Open Data, David suggests assessing online articles against five criteria:

  • peer review
  • open access
  • enriched content
  • available datasets
  • machine-readable metadata
For each criterion, he provides a five-point scale. For example, the open access scale goes from 0 for no open access to 4 for Creative Commons licensing. The full article is well worth a read, especially for David's careful explanation of the impacts of each point on the scale of each criterion on the usefulness of the content.

The article concludes by applying the evaluation to a number of articles (including itself). In this spirit, here is my evaluation of our SOSP '03 paper:
  • peer review: 2 - Responsive peer review
  • open access: 1 - Self-archiving green/gratis open access
  • enriched content: 1 - Active Web links
  • available datasets: 1 - Supplementary information files available
  • machine-readable metadata: 1 - Structural markup available


Friday, January 27, 2012

Yahoo's HTML5 Tools

Last August I wrote about the way HTML5 would accelerate the transition of the Web from static to dynamic content, from a document model to a programming environment. Now, via Slashdot, we learn of Yahoo's plans to open-source most of their tools for publishing HTML5. Details below the fold.

These tools aim to provide a "write once, run everywhere" capability for developers of apps for the Web. Content is developed once in JavaScript and HTML5. It can run in a browser, but also as an app in environments such as iOS and Android. When a user invokes an app, such as Yahoo's Livestand, what they are actually doing is invoking what Yahoo calls a "chromeless" browser, an app that is generic in the way a browser is, in that it is the same irrespective of the content. Unlike a browser, the chromeless browser provides no user interface, just a JavaScript VM and a rendering engine. This downloads and runs the content, just as a browser would, but allows the content developer total control over the user experience. Yahoo's tools also address one major problem with this approach, the amount of code that needs to be downloaded and run at the client before the user experience is functional. They run the code at the server first to provide an initial, simplified user interface that runs while the full version is being downloaded and executed.

As I've been saying for some time, techniques like this are making our current approaches to collecting and preserving Web content less and less effective as time goes by. It is time to invest in some R&D.


Friday, January 20, 2012

Mass-Market Scholarly Communication Revisited

The very first post to this blog in 2007 was entitled "Mass-Market Scholarly Communication". Its main point was:

Blogs are bringing the tools of scholarly communication to the mass market, and with the leverage the mass market gives the technology, may well overwhelm the traditional forms.
Now, Annotum: An open-source authoring and publishing platform based on WordPress is proving me a prophet.

It was developed based on experience with PLOS Currents, a rapid publishing journal hosted at Google. After a detailed review of the alternatives, the developers decided to implement Annotum as a WordPress theme providing the capabilities needed for journal publishing, such as multiple authors, strict adherence to JATS (the successor to the NLM DTD), tables, figures, equations, references and review. The leverage of mass-market publishing technology is considerable. The paper describing Annotum is well worth a read.


Wednesday, December 28, 2011

Adding cloud storage to the economic model

The next stage in building the economic model of long-term storage is to add the ability to model cloud storage, and to use it to investigate the circumstances under which it is cheaper than local storage. The obvious first step is to collect historical data on cloud storage prices, to compare how rapidly they are decreasing against the Kryder's Law decrease in disk cost. The somewhat surprising results from looking at Amazon S3's price history are below the fold. I'd be grateful if anyone could save me the trouble of getting equivalent price histories for other cloud storage providers.

When Amazon launched S3 in March 2006 they charged $0.15 per GB per month. Nearly 5 years later, S3 charges $0.14 per GB per month for the first TB. For the first TB this is a price drop of less than 1.5%/yr. For the first month of the first TB of storage, you will pay $140. Even after the impact of the Thai floods, a 1TB Western Digital Green drive is $100 at Fry's. If we continue to assume that the media represent 1/3 of the 3-year cost of ownership, S3 would cost over $4800 for a TB over 3 years, where a raw local disk would cost $300, a factor of 16 difference.
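The comparison above can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the figures quoted in the post; holding the S3 price flat for all 36 months is a simplifying assumption, and the function names are purely illustrative. With a flat price the S3 total comes out close to $5000, in line with the "over $4800" quoted above.

```python
# Figures quoted in the post. Holding the S3 price flat for 36 months is an
# assumption made only for this sketch.
S3_PRICE_PER_GB_MONTH = 0.14   # first-TB tier
GB_PER_TB = 1000
DISK_PRICE_PER_TB = 100.0      # 1TB retail drive
TCO_MULTIPLE = 3               # media assumed to be 1/3 of 3-year cost of ownership
MONTHS = 36


def s3_cost(tb, months=MONTHS):
    """Total S3 storage charge for `tb` terabytes held for `months` months."""
    return tb * GB_PER_TB * S3_PRICE_PER_GB_MONTH * months


def local_cost(tb):
    """3-year cost of ownership for `tb` terabytes of raw local disk."""
    return tb * DISK_PRICE_PER_TB * TCO_MULTIPLE


s3 = s3_cost(1)
local = local_cost(1)
print(f"S3: ${s3:,.0f}  local: ${local:,.0f}  ratio: {s3 / local:.1f}x")
```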

What Amazon seems to have been doing is using the drop in storage prices to keep the price of a small amount of storage stable and introducing new, cheaper tiers for large amounts. At launch, a PB would have cost $150K/month. At current prices, it would cost about $103K/month, a drop of 31% over nearly 5 years, or about 7%/yr. Above 5PB the cost is now $0.055 per GB per month, only 37% of the launch price. Nevertheless, over the next 3 years a PB would cost about $3.36M versus about $300K assuming current inflated 1-off retail prices for local disk and the same assumption about other costs.
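A tiered price schedule of the kind described here can be evaluated with a short function. In this sketch only the first-TB price ($0.14) and the above-5PB price ($0.055) come from the post; the intermediate tier sizes and prices are illustrative assumptions, as are the names `TIERS` and `monthly_cost`, and Amazon's published schedule should be checked before relying on them.

```python
GB_PER_TB = 1000

# (tier size in TB, price per GB per month).
# First and last prices are quoted in the post; the middle tiers are
# ASSUMED for illustration and are not taken from the post.
TIERS = [
    (1, 0.140),
    (49, 0.125),             # assumed
    (450, 0.110),            # assumed
    (500, 0.095),            # assumed
    (4000, 0.080),           # assumed
    (float("inf"), 0.055),   # above 5PB
]

LAUNCH_PRICE = 0.15  # flat price per GB per month at launch


def monthly_cost(tb, tiers=TIERS):
    """Monthly charge for `tb` terabytes, filling tiers from cheapest-volume up."""
    remaining = tb
    total = 0.0
    for size_tb, price in tiers:
        used = min(remaining, size_tb)
        total += used * GB_PER_TB * price
        remaining -= used
        if remaining <= 0:
            break
    return total


def launch_monthly_cost(tb):
    """Monthly charge at the flat launch price."""
    return tb * GB_PER_TB * LAUNCH_PRICE


pb = 1000  # TB in a PB
print(f"launch: ${launch_monthly_cost(pb):,.0f}/month")
print(f"tiered: ${monthly_cost(pb):,.0f}/month")
```

The point of the structure is that adding cheaper tiers lowers the average price for large volumes while leaving the first-TB price, and hence the cost to small users, almost unchanged.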

We can draw two conclusions from this quick look at S3 pricing. S3 is competitive with local storage over the medium term only if:

  • extremely large demands for storage can be aggregated
  • and either Amazon starts decreasing the cost of a given tier rather than simply adding lower cost tiers, or the Kryder's Law decrease in disk costs slows dramatically.
Amazon is pricing against value for smaller users, and pricing somewhat closer to cost for large users. Most S3 customers obviously value things other than cost of ownership.

Services such as Duracloud that act as brokers between customers and cloud storage providers thus depend in the medium term on aggregating very large, and rapidly increasing, amounts of storage, and are assuming that cloud storage provider pricing policies change to more closely reflect media costs. Storing a TB in Duracloud for 3 years would cost $21K, a factor of 70 over the cost of raw local storage. Storing a PB in Duracloud for 3 years would cost over $3M, suggesting that they have negotiated favorable pricing with Amazon, or are using cheaper providers, or are using their current pricing as a loss leader to attract enough demand to get themselves into the cheapest tiers.

Of course, cloud storage providers such as S3 provide replication to enhance reliability, and brokers such as Duracloud or Oxygen Cloud layer additional services on top. We should expect them to cost several times the cost of raw local disk. But the factors are large, and at least for S3 appear to increase significantly as the price of disk decreases.


Tuesday, December 13, 2011

CNI Talk on the Economic Model

I gave a talk at the Fall CNI meeting on the work I've been doing on economic models of long-term storage. CNI recorded the talk and I'm expecting them to post the video and the slides. Much of the talk expanded on the talk I gave at the Library of Congress Storage Workshop. The new part was that I managed to remove the assumption that storage prices could never go up, so I was able to model the effect of spikes in storage costs, such as those caused by the floods in Thailand. Below the fold is the graph.

The Z-axis shows the ratio between the endowment needed for 98% probability of not running out of money in 100 years. The X axis shows the annual percentage rate at which the cost of storage decreases in the absence of the spike. The Y=0 axis assumes no spike in costs. The rest of the Y axis shows the effect of a spike that doubles costs, and takes two years to drop back to its pre-spike value, occurring at Y years after the start.

As you see, if storage costs drop rapidly, the spike has little effect, but if they drop slowly it can have a big impact. Note the "ridge" at Y=4, which is caused by the model's assumption of a 4-year service life for storage hardware. If costs spike just as your current hardware gets to the end of its service life, you are in a world of hurt.
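The interaction between a spike and the replacement cycle can be illustrated with a toy price path. This is only a sketch, not the model itself: the linear decay of the spike back to the pre-spike trend is an assumption, as are the function and parameter names.

```python
def unit_cost(t, kryder_rate, spike_start=None, spike_factor=2.0, spike_length=2.0):
    """Relative cost of a unit of storage at year `t`.

    Costs fall by `kryder_rate` per year. An optional spike multiplies the
    cost by `spike_factor` at `spike_start` and decays linearly (an assumed
    shape) back to the trend over `spike_length` years.
    """
    base = (1 - kryder_rate) ** t
    if spike_start is None or not (spike_start <= t < spike_start + spike_length):
        return base
    frac = (t - spike_start) / spike_length
    return base * (spike_factor - (spike_factor - 1) * frac)


SERVICE_LIFE = 4  # years, as assumed in the model

# Cost of the first replacement purchase (at year 4) for spikes at various years.
for spike_year in (None, 2, 3, 4):
    cost = unit_cost(SERVICE_LIFE, kryder_rate=0.10, spike_start=spike_year)
    print(f"spike at {spike_year}: replacement costs {cost:.3f} of initial price")
```

In this toy version a spike that begins at year 2 has fully decayed by the time the hardware is replaced at year 4, while a spike beginning at year 4 doubles the cost of that replacement, which is the intuition behind the ridge.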


Thursday, November 17, 2011

Progress on the Economic Model of Storage

I've been working more on the economic model of long-term storage. As an exercise, I tried to model the effect on the long-term cost of storage on disk of the current floods in Thailand. The more I work on this model, the more complex the whole problem of predicting the cost of long-term storage becomes. This time, what emerged is that, despite my skepticism about Kryder's Law, in a totally non-obvious way I had wired into the model the assumption that disk prices could never rise! So when I tried to model the current rise in disk prices, things went very wrong. So, until I get this fixed, the best I can do is to model a pause of a varying number of years before disk prices resume their Kryder's Law decrease.

For this simulation, I assume that interest rates reflect the history of the last 20 years, that the service life of disks is 4 years, that the planning horizon is 7 years, that the disk cost is 2/3 of the 3-year cost of ownership, and that the initial cost of the unit of storage is $100. The graph plots the endowment required to have a 98% probability of surviving 100 years (z-axis) against the length of the initial pause in disk cost decrease in years (y-axis), and the percentage annual decrease in disk cost thereafter (x-axis).
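To give a feel for how the pause and the rate of decrease interact, here is a much-simplified deterministic sketch. It is not the model described above: it uses a fixed interest rate (an assumed 4%) rather than rates drawn from history, counts only hardware replacement purchases, ignores the planning horizon and running costs, and computes a single present value rather than a 98% survival probability. All names are hypothetical.

```python
def endowment(kryder_rate, pause_years, interest_rate=0.04,
              service_life=4, horizon=100, initial_cost=100.0):
    """Present value of buying storage now and replacing it every
    `service_life` years for `horizon` years.

    Disk cost stays flat for `pause_years`, then falls by `kryder_rate`
    per year. `interest_rate` is an assumed constant discount rate.
    """
    total = 0.0
    for year in range(0, horizon, service_life):
        years_of_decline = max(0, year - pause_years)
        price = initial_cost * (1 - kryder_rate) ** years_of_decline
        total += price / (1 + interest_rate) ** year
    return total


for rate in (0.10, 0.20, 0.30):
    row = [endowment(rate, pause) / 100.0 for pause in (0, 2, 4, 8)]
    print(f"decrease {rate:.0%}/yr:", "  ".join(f"{x:.2f}" for x in row))
```

Even this crude version shows the qualitative behavior: the required endowment, expressed as a multiple of the initial cost, grows as the pause lengthens and as the subsequent rate of decrease slows.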

As expected, the faster the disk price drops and the shorter the pause before it does, the lower the endowment needed. In this simulation the endowment needed ranges from 4.2 to 17.6 times the initial cost of storage, but these numbers should be taken with a grain of salt. It is early days and the model has many known deficiencies.


Monday, October 31, 2011

PLoS Is Not As Lucrative As Elsevier

David Crotty of Oxford University Press made the headline-grabbing charge that PLoS will this year be more profitable than Elsevier. I responded skeptically in comments, and Kent Anderson, a society publisher, joined in to support David. Comments appear to have closed on this post, but I have more to say. Below the fold I present a more complete version of my analysis and respond to David's objections.

It has been clear from the start that high-volume open access publishers such as BioMed Central and PLoS are far more of a threat to smaller society and not-for-profit publishers than to the vastly more robust finances of the major commercial publishers. It is thus understandable that David and Kent are concerned by the competition from PLoS.

I agree with them that more transparency in financial reporting from journal publishers is desirable, but note that PLoS is no worse than other publishers in this respect. Because of this, it is necessary to make some rather heroic assumptions in order to extract meaning from their published numbers. The fact that I, and David and Kent, can draw opposite conclusions from the same numbers indicates both that we are making different assumptions, and that making dramatic charges based on this kind of projection is unsound.

The following analysis asks what would have to happen for David's charge that PLoS would be more profitable than Elsevier to come true. I am not making a projection myself, I am analyzing David's projection, and being explicit about the assumptions I am making to do so.

In 2010 PLoS ONE published 6800 papers at $1350 each. Thus the author fee income for PLoS ONE was $9.18M. This is 76% of the total net author fee income for PLoS. I assume that PLoS is not cross-subsidizing PLoS ONE from its other journals, which means that PLoS ONE should account for 76% of PLoS expenses, or $9.28M. Thus the cost per paper of PLoS ONE is $1365. The $15 per paper loss is more than made up by PLoS ONE's share of PLoS's 2010 $1M other income from advertising, membership and interest, which would be $760K, or $112 per paper.
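The 2010 baseline can be checked directly. The sketch below uses only the numbers in the paragraph above, including the assumption that PLoS ONE bears 76% of expenses; the variable names are illustrative.

```python
PAPERS_2010 = 6800
AUTHOR_FEE = 1350            # dollars per paper
PLOS_ONE_SHARE = 0.76        # share of author fee income, assumed to apply to costs too
PLOS_ONE_EXPENSES = 9.28e6   # 76% of PLoS expenses, as stated above
PLOS_OTHER_INCOME = 1.0e6    # advertising, membership and interest

fee_income = PAPERS_2010 * AUTHOR_FEE
cost_per_paper = PLOS_ONE_EXPENSES / PAPERS_2010
other_income = PLOS_ONE_SHARE * PLOS_OTHER_INCOME
other_income_per_paper = other_income / PAPERS_2010

print(f"author fee income:      ${fee_income / 1e6:.2f}M")
print(f"cost per paper:         ${cost_per_paper:.0f}")
print(f"loss per paper on fees: ${cost_per_paper - AUTHOR_FEE:.0f}")
print(f"other income per paper: ${other_income_per_paper:.0f}")
```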

Suppose PLoS ONE publishes 12,000 papers in 2011. What would it take for PLoS ONE to achieve a 35% operating margin? Some linear combination of these three possibilities would have to take place:

  • If the author fee remained the same and other income totaled the same $760K, there would be income of $16.96M, of which costs would have to represent $11.02M, so the cost per paper would have to be $919. This would represent a decrease of 33% in per-paper costs.
  • If the per-paper cost remained the same, total costs would be $16.38M. So total income would have to be $25.2M. If the author fee remained the same, it would contribute $16.2M, leaving $9M to be supplied by other income, or $750 per paper. This would represent a 570% increase.
  • If the per-paper costs remained the same, and other income remained at $760K, the author fees would have to contribute $24.44M, or $2037 per paper, an increase of 51%.
Alternatively, if we are to treat grants as operating income, grants attributable to PLoS ONE would have to rise from $1.6M (76% of $2.1M) to $8.24M, an increase of 415%. I find it hard to describe any of these possibilities as "the trends continue".
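The scenarios above all follow from one identity, margin = (income - costs) / income, solved for a different unknown each time. The sketch below reproduces the dollar figures under the same assumptions (12,000 papers, a 35% target margin, and the 2010 per-paper cost, fee and other income); the variable names are illustrative.

```python
PAPERS = 12_000
FEE = 1350                 # 2010 author fee per paper
COST_PER_PAPER = 1365      # 2010 cost per paper
OTHER_INCOME = 760_000     # PLoS ONE's share of other income
TARGET_MARGIN = 0.35

fee_income = PAPERS * FEE
total_cost = PAPERS * COST_PER_PAPER

# Scenario 1: fees and other income unchanged, so costs must fall.
income_1 = fee_income + OTHER_INCOME
required_cost_per_paper = income_1 * (1 - TARGET_MARGIN) / PAPERS

# Scenarios 2 and 3: costs unchanged, so income must rise to this level.
required_income = total_cost / (1 - TARGET_MARGIN)

# Scenario 2: the extra income comes from other income.
required_other_per_paper = (required_income - fee_income) / PAPERS

# Scenario 3: the extra income comes from author fees.
required_fee = (required_income - OTHER_INCOME) / PAPERS

# Grants alternative: the shortfall against unchanged fees and other income.
required_grants = required_income - income_1

# Margin if nothing changes except volume.
unchanged_margin = (income_1 - total_cost) / income_1

print(f"1. cost per paper must be    ${required_cost_per_paper:.0f}")
print(f"2. other income per paper    ${required_other_per_paper:.0f}")
print(f"3. author fee per paper      ${required_fee:.0f}")
print(f"   grants needed             ${required_grants / 1e6:.2f}M")
print(f"   margin with no changes    {unchanged_margin:.1%}")
```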

In response, David writes:
But I have a hard time accepting your central premise, that if PLoS ONE brings in 76% of author fee revenue that it must automatically generate 76% of costs. I think this is a flawed assumption. PLoS has (I believe) 9 publications. Each must have its own staff, its own office space, electricity, etc. Given that PLoS ONE is handling a larger number of papers than the others, it likely generates some higher costs, but it’s unclear if there’s a linear 1:1 relationship between the two. Note also that because of PLoS ONE’s high acceptance rate, the papers published there have to pay for a much smaller percentage of rejected papers as compared with the higher end PLoS journals, reducing the likely per-article cost.
He misses the point that the other PLoS journals, precisely because their cost per paper is higher, have higher author charges. I agree that it is an assumption that PLoS is not cross-subsidizing its journals. But given that the other journals' charges are 67% or 115% higher than PLoS ONE's, their costs would also have to be massively higher than PLoS ONE's to need subsidy. David continues:
If your assumption holds true, then PLoS ONE is actually losing money on each paper and the massive increase in publication volume would be a financial disaster rather than a boon.
No, PLoS ONE generates income from advertising, memberships and interest just as the other journals do. Even if we unrealistically assume no increase in these income sources, PLoS ONE would have a 3.4% margin if it published 12,000 papers in 2011. David ends:
That to me doesn’t sound like a journal that must rely on grants to break even.
I am not suggesting that PLoS ONE relies on treating grants as recurring income. David was the one doing that to support his charge that it is more profitable than Elsevier. I am simply pointing out that David's charge requires him to make assumptions that appear to me completely implausible, even if we treat grants as recurring income.

It seems clear that PLoS is comfortably above the break-even point without grant funding. Some combination of reduced per-paper costs through economies of scale and increased income from advertising, memberships and interest should enable them to continue to be comfortably above break-even even if PLoS ONE continues to double its output every year. But that is very different from running Elsevier-like margins.
