DSHR's Blog: 2011

Wednesday, December 28, 2011

Adding cloud storage to the economic model

The next stage in building the economic model of long-term storage is to add the ability to model cloud storage, and to use it to investigate the circumstances under which it is cheaper than local storage. The obvious first step is to collect historical data on cloud storage prices, to compare how rapidly they are decreasing against the Kryder's Law decrease in disk cost. The somewhat surprising results from looking at Amazon S3's price history are below the fold. I'd be grateful if anyone could save me the trouble of getting equivalent price histories for other cloud storage providers.

CNI Talk on the Economic Model

I gave a talk at the Fall CNI meeting on the work I've been doing on economic models of long-term storage. CNI recorded the talk and I'm expecting them to post the video and the slides. Much of the talk expanded on the talk I gave at the Library of Congress Storage Workshop. The new part was that I managed to remove the assumption that storage prices could never go up, so I was able to model the effect of spikes in storage costs, such as those caused by the floods in Thailand. Below the fold is the graph.

Progress on the Economic Model of Storage

I've been working more on the economic model of long-term storage. As an exercise, I tried to model the effect on the long-term cost of storage on disk of the current floods in Thailand. The more I work on this model, the more complex the whole problem of predicting the cost of long-term storage becomes. This time, what emerged is that, despite my skepticism about Kryder's Law, in a totally non-obvious way I had wired in to the model the assumption that disk prices could never rise! So when I tried to model the current rise in disk prices, things went very wrong. So, until I get this fixed, the best I can do is to model a pause of a varying number of years before disk prices resume their Kryder's Law decrease.

For this simulation, I assume that interest rates reflect the history of the last 20 years, that the service life of disks is 4 years, that the planning horizon is 7 years, that the disk cost is 2/3 of the 3-year cost of ownership, and that the initial cost of the unit of storage is $100.

The graph plots the endowment required to have a 98% probability of surviving 100 years (z-axis) against the length of the initial pause in disk cost decrease in years (y-axis), and the percentage annual decrease in disk cost thereafter (x-axis).

As expected, the faster the disk price drops and the shorter the pause before it does, the lower the endowment needed. In this simulation the endowment needed ranges from 4.2 to 17.6 times the initial cost of storage, but these numbers should be taken with a grain of salt. It is early days and the model has many known deficiencies.
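To make the shape of this calculation concrete, here is a deliberately simplified, deterministic sketch in Python. It is not the model described above, which is probabilistic and draws on the history of interest rates. The fixed `interest_rate`, the function names, and the way the 2/3 disk share is applied are all assumptions made for illustration, and the planning horizon is ignored entirely.

```python
def disk_price(year, initial_price, pause_years, annual_decrease):
    """Price of one unit of storage bought in `year`.

    The price stays flat for `pause_years`, then falls by the fraction
    `annual_decrease` each year (a Kryder's Law style decrease).
    """
    years_falling = max(0, year - pause_years)
    return initial_price * (1.0 - annual_decrease) ** years_falling


def endowment(pause_years, annual_decrease, initial_price=100.0,
              service_life=4, horizon=100, interest_rate=0.02,
              disk_share_of_cost=2.0 / 3.0):
    """Money needed up front to keep buying storage for `horizon` years.

    Assumptions (not taken from the real model): a constant interest rate,
    a replacement purchase every `service_life` years, and a total cost of
    ownership equal to the disk price divided by `disk_share_of_cost`.
    """
    total = 0.0
    for year in range(0, horizon, service_life):
        price = disk_price(year, initial_price, pause_years, annual_decrease)
        cost_of_ownership = price / disk_share_of_cost
        # Discount the future purchase back to today.
        total += cost_of_ownership / (1.0 + interest_rate) ** year
    return total


if __name__ == "__main__":
    for pause in (0, 2, 5):
        for decrease in (0.10, 0.25, 0.40):
            multiple = endowment(pause, decrease) / 100.0
            print(f"pause={pause}y decrease={decrease:.0%} "
                  f"endowment={multiple:.2f}x initial cost")
```

Even this toy version reproduces the qualitative result: a longer pause or a slower decrease raises the endowment. The numbers it prints should not be compared with the ones quoted above.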

Monday, October 31, 2011

PLoS Is Not As Lucrative As Elsevier

David Crotty of Oxford University Press made the headline-grabbing charge that PLoS will this year be more profitable than Elsevier. I responded skeptically in comments, and Kent Anderson, a society publisher, joined in to support David. Comments appear to have closed on this post, but I have more to say. Below the fold I present a more complete version of my analysis and respond to David's objections.

What Problems Does Open Access Solve?

The library at the University of British Columbia invited me to speak during their Open Access Week event. Thinking about the discussions at the recent Dagstuhl workshop I thought it would be appropriate to review the problems with research communication and ask to what extent open access can help solve them. Below the fold is an edited text of the talk with links to the sources.

Seminar at UC Santa Cruz

I gave a seminar in Ethan Miller's CMPS290S course at UC Santa Cruz. It was a mash-up of my blog posts on Paying For Long-Term Storage and Modeling the Economics of Long-Term Storage. The links to the sources are all in those posts, except for Wide-Body: The Triumph of the 747, which I used to illustrate the importance for successful engineering of understanding the market for your product.

Wednesday, October 19, 2011

Do Digital Signatures Assure Long-Term Integrity?

Duane Dunston has posted a long description of the use of digital signatures to assure the integrity of preserved digital documents. I agree that maintaining the integrity of preserved documents is important. I agree that digital signatures are very useful. For example, the fact that the GPO is signing government documents is important and valuable. It provides evidence that the document contains information the federal government currently wants you to believe. Similarly, the suggestion by Eric Hellman to use signatures to verify that Creative Commons licenses have been properly applied is a useful one.

However, caution is needed when applying digital signatures to the problem of maintaining the integrity of digital documents in the long term. Details below the fold.

Wiley's Financials

Two postings by Ann Okerson to the "liblicense" mail alias about Wiley's latest financial report reveal a detail about the way Wiley reports its financial data that I missed, and that means I may have somewhat over-estimated its profitability in my post on What's Wrong With Research Communication?. Follow me below the fold for the details.

ACM & Copyrights

ACM's copyright policy has been a subject of controversy. The latest Communications of the ACM has an article by Ronald Boisvert and Jack Davidson, the co-chairs of the ACM Publications Board, outlining recent changes. One change they feature is:

One new feature that ACM will roll out in the fall will enable authors to obtain a special link for any of their ACM articles that they may post on their personal page. Anyone who clicks on this link can freely download the definitive version of the paper from the DL. In addition, authors will receive a code snippet they can put on their Web page that will display up-to-date citation counts and download statistics for their article from the DL.

Below the fold I look this gift horse in the mouth.

Preserving Linked Data

I attended the workshop on Semantic Digital Archives that was part of the Theory and Practice of Digital Libraries conference in Berlin. Only a few of the papers were interesting from the preservation viewpoint:

Kun Qian of the University of Magdeburg addressed the fact that the OAIS standard does not deal with security issues, proposing an interesting framework for doing so.
Manfred Thaller described work in the state of North-Rhine Westphalia to use open source software such as IRODS to implement a somewhat LOCKSS-like distributed preservation network for cultural heritage institutions using their existing storage infrastructure. Information in the network will be aggregated by a single distribution portal implemented with Fedora that will feed content to sites such as Europeana.
Felix Ostrowski of Humboldt University, who works on the LuKII project, discussed an innovative approach to handling metadata in the LOCKSS system using RDFa to include the metadata in the HTML files that LOCKSS boxes preserve. Unlike the normal environment in which LOCKSS boxes operate, where they simply have to put up with whatever the e-journal publisher decides to publish, LuKII has control over both the publisher and the LOCKSS boxes. They can therefore use RDFa to tightly bind metadata to the content it describes.

My take on the preservation issues of linked data is as follows.

Linked data uses URIs. Linked data can thus be collected for preservation by archives other than the original publisher using existing web crawling techniques such as the Internet Archive’s Heritrix.
Enabling multiple archives to collect and preserve linked data will be essential; some of the publishers will inevitably fail for a variety of reasons.
Motivating web archives to do this will be important, as will tools to measure the extent to which they succeed.
The various archives preserving linked data items can republish them, but only at URIs different from the original one, since they do not control the original publisher’s DNS entry. Links to the original will not resolve to the archive copies, removing them from the world of linked data.
This problem is generic to web archiving. Solving it is enabled by the Memento technology, which is on track to become an IETF/W3C standard.
It will be essential that both the archives preserving linked data and the tools accessing it implement Memento.
There are some higher level issues in the use of Memento, but as it gets wider use they are likely to be resolved before they become critical for linked data.
Collection using web crawlers and re-publishing using Memento provide archives with a technical basis for linked open data preservation, but they also need a legal basis. Over 80% of current data sources do not provide any license information; these sources will be problematic to archive.
Even those data sources that do provide license information may be problematic; their license may not allow the operations required for preservation.
Open data licenses do not merely permit and encourage re-use of data; they permit and encourage its preservation.
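For readers unfamiliar with Memento, the sketch below shows roughly what a tool accessing linked data would do: ask a TimeGate for the original URI as it was at a given time, carrying the time in the `Accept-Datetime` request header. The TimeGate address and the example URI are hypothetical placeholders, and the code only builds the request; it does not contact any archive.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Hypothetical TimeGate endpoint; a real archive publishes its own.
TIMEGATE_PREFIX = "http://timegate.example.org/timegate/"


def memento_request(original_uri, when):
    """Build (but do not send) a request for `original_uri` as of `when`.

    `when` must be a timezone-aware datetime in UTC. Memento datetime
    negotiation carries it in the Accept-Datetime request header.
    """
    request = Request(TIMEGATE_PREFIX + original_uri)
    request.add_header("Accept-Datetime", format_datetime(when, usegmt=True))
    return request


if __name__ == "__main__":
    req = memento_request("http://data.example.org/resource/42",
                          datetime(2011, 9, 27, tzinfo=timezone.utc))
    print(req.full_url)
    print(req.header_items())
```

The point of the design is that the client keeps using the original URI; the archive, not the link, decides which preserved copy to return.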

Tuesday, September 27, 2011

Modeling the Economics of Long-Term Storage

I gave a talk at the Library of Congress workshop on Designing Storage Architectures entitled Modeling the Economics of Long-Term Storage. It was a talk about work in progress with Library of Congress funding, expanding ideas I described in these two blog posts about ways to compare the costs of different approaches to long term storage. I had only 10 minutes to speak, so below the fold is an expanded and edited text of the talk with links to the sources.

Ithaka Does A Good Thing

I've been very critical of Ithaka, so it is only fair that I point out that yesterday they did a very good thing:

"Today, we are making journal content on JSTOR published prior to 1923 in the United States and prior to 1870 elsewhere freely available to the public for reading and downloading. This content includes nearly 500,000 articles from more than 200 journals, representing approximately 6% of the total content on JSTOR."

It was, after all, hard to justify charging JSTOR's punitive or obscure per-article access charges for content that had entered the public domain. These charges may seem like a bug, in that they are so high as to almost completely deter access and thus generate very little income. But in fact they are a feature. Suppose Ithaka was to decide that, since there was very little use of per-article access and thus very little income, they would charge for it on a cost-recovery basis. The marginal cost of an extra access to the JSTOR system is minimal, so per-article access would be very cheap. Libraries that currently subscribe to JSTOR would drop their subscriptions and go pay-per-view, destroying the subscription business model. Punitive per-article charges are essential to preserving Ithaka's cash cow.

They are, however, part of a much bigger problem that I touched on in my post on the problems of research communication. It is well illustrated by Kent Anderson's vehement response to George Monbiot's diatribe against academic publishers:

"Let’s assume I can read the whole paper. Like 99.9% of the population, I’m not going to know what to make of it. It’s for specialists, or better, subspecialists ... There is no price in the world that’s going to make that scientific paper, or thousands of others, intelligible, relevant, or meaningful to me in any way that’s going to affect my ability to function in a democracy."

In other words, the general public has no business reading the academic literature.

But it is the general public that pays for the research that results in the papers that Kent thinks they shouldn't be reading, and that JSTOR and other academic publishers price beyond their means. If the general public is going to continue to pay for the research, and pay for the entire research communication system that includes both Ithaka, and Kent's Society for Scholarly Publishing, they need to believe that they're getting benefits in return.

Increasingly, thanks for example to well-funded campaigns of denial (e.g. tobacco, global warming, evolution) or fraud (e.g. MMR vaccine), the public is coming to believe that science is a conspiracy of insiders to feather their own nests at public expense. Even if I agreed with Kent that lay people would be unable to understand his example paper, the pricing model that ensures they can't afford to read it, and the attitude that they shouldn't be allowed to read it, are both very unhelpful.

There is a lot of research into the effect of Internet information on patients. The conclusions in terms of outcomes are fewer, but positive:

"Provision of information to persons with cancer has been shown to help patients gain control, reduce anxiety, improve compliance, create realistic expectations, promote self-care and participation, and generate feelings of safety and security. Satisfaction with information has been shown to correlate with quality of life, and patients who feel satisfied with the adequacy of information given are more likely to feel happy with their level of participation in the overall process of decision making."

These studies don't distinguish between the academic literature and sites targeted at lay readers, but it is clear that patients searching for information who encounter paywalls are less likely to "feel satisfied with the adequacy of information given" and thus have poorer quality of life.

The story of the illness of Larry Lessig's newborn daughter Samantha Tess (about 11:10 into the video) makes the case against the elitist view of access to research. To be sure, Larry is at Harvard and thus has free access to most of the literature. But consider an equally smart lawyer with an equally sick daughter in a developing country. He would no longer have free access via HINARI. According to Larry, he would have had to pay $435 for the 20 articles Larry read for free. Would he "feel satisfied with the adequacy of information given"?

Friday, September 2, 2011

What's Wrong With Research Communication

The recent Dagstuhl workshop on the Future of Research Communication has to produce a report. I was one of those tasked with a section of the report entitled What's Wrong With The Current System?, intended to motivate later sections describing future activities aimed at remedying the problems. Below the fold is an expanded version of my draft of the section, not to be attributed to the workshop or any other participant. Even my original draft was too long; in the process of getting to consensus, it will likely be cut and toned down. I acknowledge a debt to the very valuable report of the UK House of Commons' Science & Technology Committee entitled Peer review in scientific publications.

Future's So Bright, We Gotta Wear Shades

In Sun Microsystems' first decade the company was growing exponentially and one of Scott McNealy's favorite slogans was "The Future's So Bright, We Gotta Wear Shades", adapted from a song by Timbuk 3. McNealy missed the ironic intent of the song, but the slogan was appropriate. Exponential growth continued most of the way through the company's second decade. Eventually, however, growth stopped.

It is always easy to believe that exponential growth will continue indefinitely, but it won't. It can't. There are always resource or other constraints that, at best, cause it to flatten. More likely, the growth will overshoot and collapse, as Sun's did. This problem is common to all exponential growth; a wonderful illustration is this post by Prof. Tom Murphy of UCSD:

U.S. energy use since 1650 shows a remarkably steady growth trajectory, characterized by an annual growth rate of 2.9%

...

For a matter of convenience, we lower the energy growth rate from 2.9% to 2.3% per year so that we see a factor of ten increase every 100 years. We start the clock today, with a global rate of energy use of 12 terawatts (meaning that the average world citizen has a 2,000 W share of the total pie).

...

No matter what the technology, a sustained 2.3% energy growth rate would require us to produce as much energy as the entire sun within 1400 years. A word of warning: that power plant is going to run a little warm. Thermodynamics require that if we generated sun-comparable power on Earth, the surface of the Earth—being smaller than that of the sun—would have to be hotter than the surface of the sun!
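The arithmetic in the quoted passage is easy to check. The sketch below is mine, not Prof. Murphy's; the figure used for the sun's total power output (about 3.8e26 watts) is a standard approximate value that does not appear in the quote.

```python
import math

SOLAR_OUTPUT_WATTS = 3.8e26   # approximate total power output of the sun
WORLD_POWER_WATTS = 12e12     # 12 terawatts, the starting point in the quote


def growth_factor(annual_rate, years):
    """Total multiplication after compounding `annual_rate` for `years`."""
    return (1.0 + annual_rate) ** years


def years_to_reach(target, start, annual_rate):
    """Years of compound growth needed to get from `start` to `target`."""
    return math.log(target / start) / math.log(1.0 + annual_rate)


if __name__ == "__main__":
    # Roughly a factor of ten per century at 2.3% a year.
    print(growth_factor(0.023, 100))
    # Comes out a little under the 1400 years mentioned in the quote.
    print(years_to_reach(SOLAR_OUTPUT_WATTS, WORLD_POWER_WATTS, 0.023))
```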

Views of the future for digital preservation are formed by two exponential growth projections; Moore's Law, projecting that density of transistors on a chip will increase exponentially, and Kryder's Law, projecting that the density of bits in storage will increase exponentially. Thus we plan on the basis that future costs of processing and storage will be much less than current costs.

An accessible explanation of the troubles Moore's Law is encountering is in Rik Myslewski's report for The Register on Simon Segars' keynote at the Hot Chips conference. Segars is employee #16 and head of the Physical IP division at ARM.

"Silicon scaling has been great." Segars reminisces. "We've gotten huge gains in power, performance, and area, but it's going to end somewhere, and that's going to affect how we do design and how we run our businesses, so my advice to you is get ready for that.

"It's coming sooner than a lot of people want to recognize."

I've been warning that we're likely to fall behind the Kryder's Law curve for some time, because the transition to the next disk drive technology is proving harder than expected and because the advantages of solid state storage come with higher costs. One thing I failed to anticipate is:

Multiple manufacturers in the IT industry have been keeping a wary eye on China's decision to cut back on rare earth exports and the impact it may have on component prices. An article from DigiTimes suggests consumers will see that decision hit the hard drive industry this year, with HDD prices trending upwards an estimated 5-10 percent depending on capacity.

This isn't likely to be as significant a factor as the others, but it is yet another straw in the wind.

Monday, August 22, 2011

Moonalice plays Palo Alto

On July 23rd the band Moonalice played in Rinconada Park as part of Palo Alto's Twilight Concert series. The event was live-streamed over the Internet. Why is this interesting? Because Moonalice streams and archives their gigs in HTML5. Moonalice is Roger McNamee's band. Roger is a long-time successful VC (he invested in Facebook and Yelp, for example) and, based on his experience using HTML5 for the band, believes that HTML5 will transform the Web. On June 28th he spoke at the Paley Center for the Media and explained why. An overview appeared at Business Insider, but you really need to watch the full video of the talk. Watch it then follow me below the fold. I'll explain some implications for digital preservation.

A Brief History of E-Journal Preservation

The workshop on the Future of Research Communication opened with a set of talks about the past, providing background for the workshop discussions. My talk covered the history of e-journal preservation, and the lessons that can be drawn from it. An edited text of the talk, with links to the sources, is below the fold.

Rebuttals

I'll save those who might point to evidence against some of my positions the trouble by pointing to the evidence myself:

I've used the comparison between libraries and Starbucks several times. According to Jim Romenesko via Gawker and Yves Smith we learn that Starbucks is succeeding so well that the students and other "laptop hobos" are driving out people who want to drink coffee. So maybe I was wrong and libraries are showing how hard it is to compete with free.
On the other hand, I have been very skeptical of the New York Times' paywall. In the Columbia Journalism Review Felix Salmon points to Seth Mnookin reporting on the first 4 months of the paywall.
its digital-subscription plan has thus far been an enormous success. The ... the goal was to amass 300,000 online subscribers within a year of launch. On Thursday, the company announced that after just four months, 224,000 users were paying for access to the paper’s website. Combined with the 57,000 Kindle and Nook readers who were paying for subscriptions and the roughly 100,000 users whose digital access was sponsored by Ford’s Lincoln division, that meant the paper had monetized close to 400,000 online users. (Another 756,000 print subscribers have registered their accounts on the Times’ website.)
Clearly that is encouraging, and Salmon goes on to point out that:
Those paying digital subscribers, however, are much more valuable than their subscription streams alone would suggest. They’re hugely loyal, they read loads of stories, they’re well-heeled, and advertisers will pay a premium to reach them. ... it seems as though total digital ad revenues are going up, not down, as subscriptions get introduced: the holy grail of paywalls.
However, this is only one digital subscriber for every two paper subscribers, and the NYT is making much less from each digital subscriber even after advertising revenue, so I'm sticking to my position.
I have been skeptical of the long-term prospects of large publishers such as Elsevier, although I was careful to note my track record of being right way too soon. In the short term, Elsevier's parent Reed Elsevier recently turned in modestly better results for the first half of 2011. Its stock has been performing roughly in line with the S&P500. I still believe that the scope for Elsevier to acquire large numbers of lucrative new customers is limited, and their ability to squeeze more income from the existing customers is obviously limited by research and teaching budgets.

Monday, August 8, 2011

Fujitsu agrees with me

I've been saying for some time that flash memory is unlikely to be the long-term solid state memory technology. In an interview with Chris Mellor at The Register, Fujitsu's CTO agrees with me:

[Joseph Reger] reckons Phase Change Memory (PCM) is the closest, in terms of time to become a usable technology, than other post-flash contenders such as HP's Memristor.

In the more interesting part of the story, he agrees with me that the real potential of these post-flash technologies is that they can be packaged as persistent RAM rather than block storage:

[Reger] asks if everything will be rewritten and re-orchestrated to work with data memory management. Is there effectively only going to be one tier, memory in one form or another?

"Currently, having data in storage means it's not in memory. Is it going to stay like that?" After all, storage was invented to deal with memory-size limitations. If those limitations go away then who needs storage?

Reger said: "I truly believe we are going to have a data orientation rather than memory and storage orientations." But this is really far out in the future.

I'm old enough to remember when computer memory persisted across power cycles because it was magnetic cores. I'd love to see this feature return. The major software change that would be needed is far more than simply using in-memory databases. The RAM data structures would need to be enhanced with metadata and backups, especially for long-term integrity, if we were to get rid of block storage entirely.

Friday, July 29, 2011

Flash more reliable than hard disk?

Andrew Ku has an interesting article at Tom's Hardware showing that, despite claims by the vendors, the evidence so far available as to the reliability of flash-based SSDs in the field shows that they fail at rates comparable with hard drives. These failures have nothing to do with the limited write lifetime of flash; these are enterprise SSDs which have been in service a year or two and are thus nowhere close to their write life.

Although it is easy to believe that SSDs' lack of moving parts should make them more reliable, these results are not surprising:

Experience shows that vendors tend to exaggerate the reliability of their products.
Root-cause analysis of failures in hard-drive systems shows that 45-75% of failures are not due to the disks themselves, so that even if SSDs were a lot more reliable than hard drives but were used in similar systems, the effect on overall system reliability would be small. In fact, most of the SSDs surveyed were used in hybrid systems alongside disks.
The flash SSDs have even more software embedded in them than disk drives. Thus even if their raw storage of bits was more reliable than the disk platters, they would be more vulnerable to software errors.
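The second point can be illustrated with a little arithmetic. The sketch below assumes, purely for illustration, that swapping disks for SSDs changes only the device failures and leaves every other cause of failure untouched; the function name and the factor of two are my own choices, not figures from the studies.

```python
def remaining_failure_fraction(non_device_share, device_improvement):
    """Fraction of system failures left after swapping in better devices.

    `non_device_share` is the share of failures not caused by the storage
    devices themselves; `device_improvement` is how many times less often
    the new devices fail. Assumes the other failure causes are unchanged.
    """
    device_share = 1.0 - non_device_share
    return non_device_share + device_share / device_improvement


if __name__ == "__main__":
    # Both ends of the 45-75% range, with devices assumed twice as reliable.
    for share in (0.45, 0.75):
        print(share, remaining_failure_fraction(share, 2.0))
```

At the 75% end of the range, devices that fail half as often remove only one system failure in eight.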

These are early days for the kinds of mass deployments of SSDs that can generate useful data on field reliability, so there is no suggestion that they should be avoided. The article concludes:

The only definitive conclusion we can reach right now is that you should take any claim of reliability from an SSD vendor with a grain of salt.

"Paying for Long-Term Storage" Revisited

In this comment to my post on Paying for Long-Term Storage I linked to this speech by Andrew Haldane and Richard Davies of the Bank of England. They used historical data from 624 UK and US companies to estimate the extent to which short-term thinking is reflected in the prices of their stock. In theory, the price of each company's stock should reflect the net present value of the stream of future dividends. They find that:

First, there is statistically significant evidence of short-termism in the pricing of companies’ equities. This is true across all industrial sectors. Moreover, there is evidence of short-termism having increased over the recent past. Myopia is mounting. Second, estimates of short-termism are economically as well as statistically significant. Empirical evidence points to excess discounting of between 5% and 10% per year.

In other words, the interest rate being charged when using standard discounted cash flow computations to justify current investments is systematically 5-10% too high, resulting in investments that would be profitable not being made.
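A small example shows how much difference this makes over preservation timescales. The sketch below is an illustration, not the Bank of England's method: the 3% baseline rate and the 30-year delay are arbitrary assumptions, and I treat the excess discounting as percentage points added to the rate.

```python
def present_value(future_value, annual_rate, years):
    """Value today of `future_value` received after `years`."""
    return future_value / (1.0 + annual_rate) ** years


if __name__ == "__main__":
    baseline_rate = 0.03   # assumed "correct" discount rate
    years = 30             # assumed delay before the benefit arrives
    for excess in (0.0, 0.05, 0.10):
        value = present_value(1000.0, baseline_rate + excess, years)
        print(f"excess discounting {excess:.0%}: "
              f"1000 in {years} years is worth {value:.2f} today")
```

The further in the future the benefit lies, the more an inflated rate shrinks it, which is why the effect bears so heavily on preservation.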

This is obviously a serious problem for all efforts to preserve content for future readers; the value the content eventually delivers to the future readers has to be enormous to justify even minimal investment now in preserving it. Below the fold I describe how even this analysis fails to capture the scale of the problem of short-termism.

More on de-duplicating flash controllers

My ACM Queue piece describing the problems caused by storage devices invisibly doing de-duplication attracted the attention of Robin Harris who actually asked SandForce and other manufacturers to comment. The details of SandForce's response are in Robin's StorageMojo article but the key claims are:

There is no more likelihood of DuraWrite loosing data than if it was not present.

and:

That is why SandForce created RAISE (Redundant Array of Independent Silicon Elements) and includes it on every SSD that uses a SandForce SSD Processor. ... if the ECC engine is unable to correct the bit error RAISE will step in to correct a complete failure of an entire sector, page, or block. ...This combination of ECC and RAISE protection provides a resulting UBER of 10^-29 virtually eliminates the probabilities of data corruption.

I would regard both claims with considerable skepticism:

SandForce has not disclosed the details of their technology, but the research performed by Michael Wei and his co-authors at UCSD revealed that it definitely includes de-duplication. Thus the first claim is not credible; there is only a single copy of the supposedly redundant data in the flash array instead of multiple copies. The metadata must therefore be at higher risk.
As regards the second claim, in my earlier ACM Queue article I show that manufacturers' claims of error rates such as 10^-29 are not credible because they are not the result of experiments, but of models which are unrealistic and unverified.
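To see why a figure like 10^-29 cannot come from experiment, consider how long one would have to read data to expect even a single uncorrectable error at that rate. The sketch below is a back-of-the-envelope illustration; the sustained read rate of 1 GB/s is an assumption, not a SandForce specification.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_expect_one_error(uber, bytes_per_second):
    """Years of continuous reading before one error is expected.

    `uber` is the claimed uncorrectable bit error rate (errors per bit
    read); `bytes_per_second` is an assumed sustained read rate.
    """
    bits_needed = 1.0 / uber
    bits_per_second = bytes_per_second * 8
    return bits_needed / bits_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    # Assumed read rate of 1 GB/s, for illustration only.
    print(years_to_expect_one_error(1e-29, 1e9))
```

Under that assumption the answer is hundreds of billions of years, so no test program could have measured the claimed rate.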

Wednesday, June 15, 2011

Library of Congress interview

An interview with me inaugurates an interview series called Insights on the Library of Congress' digital preservation blog The Signal.

Tuesday, May 31, 2011

Solid State Memory for Archival Use

In last year's JCDL keynote I pointed to work at Carnegie-Mellon on FAWN, the Fast Array of Wimpy Nodes and suggested that the cost savings FAWN realizes by distributing computation across a very large number of very low-power nodes might also apply to storage. Now, Ian Adams and Ethan Miller of UC Santa Cruz's Storage Systems Research Center and I have looked at this possibility more closely in a Technical Report entitled Using Storage Class Memory for Archives with DAWN, a Durable Array of Wimpy Nodes. We show that it is indeed plausible that, even at current flash memory prices, the total cost of ownership over the long term of a storage system built from very low-power system-on-chip technology and flash memory would be competitive with disk. More on this below the fold.

Amazon's Outage

I've been looking at the problems of specifying, measuring and auditing (PDF) the reliability of storage technologies since 2006. When I heard that Amazon's recent outage had lost customers' data I hoped that I could use this example as an Awful Warning that my Cassandra-like prophecies were coming true.

After some research, I can't claim that this is an example of the doom that awaits unsuspecting customers. But the outage and data loss does illustrate a number of interesting aspects of cloud storage. Details below the fold.

Disk Drive Mergers vs. Kryder's Law

A few weeks ago there were five disk drive manufacturers. With the announcements of Western Digital's acquisition of Hitachi's disk drive business, and Seagate's acquisition of Samsung's disk drive business, there will shortly be only three. Western Digital will have about 48% of the market, Seagate will have about 40% of the market, and Toshiba will be in a marginally viable place with the remaining 12% or less.

Part of the reason for this abrupt consolidation must surely be the much higher than expected costs of the transition to the next recording technology, which I have been discussing for more than a year. This consolidation will allow the two market leaders to amortize the costs of the transition across more of the market. These costs will be much harder for Toshiba to bear, so they are likely not to survive. Even if Toshiba survives, its impact on the ability of the market leaders to increase margins will be negligible. Paying for the transition and increasing margins will both act to slow the Kryder's Law cost per byte decrease, as I have been predicting.

Update: here is the view from industry commentator Tom Coughlin.

Wednesday, April 13, 2011

Update on Open Access and NLM's Policy Change

We now have an interesting illustration of the effects of the change of policy at the National Library of Medicine which I discussed last November. I've been told that questions to NLM about the policy change have been met with claims that few if any e-only journals will choose to deposit content in Portico in order to be indexed in MedLine without their content being in PubMed Central. Below the fold I describe a counter-example.

Technologies Don't Die

Kevin Kelly met the same reaction of incredulity when he pointed out that physical technologies do not die as I did when I pointed out that digital formats are not becoming obsolete. Robert Krulwich of NPR challenged Kelly, but had to retire defeated when he and the NPR listeners failed to find any but trivial examples of dead technology.

And, in related news, The Register has two articles on a working 28-year-old Seagate ST-412 disk drive from an IBM 5156 PC expansion box. They point out, as I have, that disk drives are not getting faster as fast as they are getting bigger:

The 3TB Barracuda still has one read/write head per platter surface and each head now has 300,000MB to look after, whereas the old ST-412 heads each have just 5MB to look after.

The Barracuda will take longer today to read or write an entire platter surface's capacity than the 28-year-old ST-412 will. We have increased capacity markedly but disk I/O has become a bottleneck at the platter surface level, and is set to remain that way. The Register

Revised 4/12/11 to make clear that the disk drive still works.

Monday, April 4, 2011

The New York Times Paywall

In my 2010 JCDL keynote I reported on Marc Andreessen's talk at the Stanford Business School describing how legacy thinking prevented the New York Times from turning off the presses and making much more profit from a smaller company. Part of the problem Marc described was how the legacy cost base led the NYT to want to extract more money from the website than it was really capable of generating. In other words, the problem was to get costs in line with income rather than to increase income to match existing costs. This is essentially the same insight that the authors self-publishing on Amazon have had; the ability to price against a very low cost base creates demand that generates income. The NYT's paywall demonstrates that Marc was right that they are hamstrung by legacy thinking.

First, in order not to suffer a crippling loss in traffic (Murdoch's London Times lost 90% of its readership) and thus advertising revenue, the paywall cannot extract revenue from almost everyone who reads the NYT website:

Subscribers to the print edition, who get free access.
People who read the website a lot, who get free access sponsored by advertisers.
People who don't read the website a lot, who can read 20 articles a month for free.
People who get to articles free via links from search engines, blogs, etc.
Technically sophisticated readers, who can get free access by evading the paywall.

From whom can the paywall generate income? Technically unsophisticated non-subscribers who read a moderate amount but not via links from elsewhere. That's a big market. Legacy thinking requires generating lots more income but the ways to do it are all self-defeating.

Second, the paywall, which would be unnecessary if the cost base were addressed, needs all these holes, so it is unnecessarily complex. Therefore, as Philip Greenspun reveals, it is unnecessarily expensive. How did an organization one would think was tech-savvy end up paying $40-50M to implement a paywall, even if it is a complex one? The answer appears to be that some time ago, presumably to reduce costs on the digital side of the business, the NYT outsourced their website to Atypon. Apparently, they have repented and have taken the website back in-house, so the $40-50M covers not just implementing the paywall but also insourcing.

These costs for undoing a decision to outsource, together with Boeing's 787 outsourcing fiasco, should be warnings to libraries currently being seduced to outsource their collections and functions.

Tuesday, March 15, 2011

Bleak Future of Publishing

In my JCDL2010 keynote last June I spoke about the bleak future for publishers in general, and academic journal publishers such as Elsevier in particular. As I expected, I was met with considerable skepticism. Two recent signs indicate that I was on the right track:

A few days ago an analyst's report on Reed Elsevier points out, as I did, that Elsevier cannot generate organic growth from their existing market because their customers don't have the money.
A fascinating blog interview between two self-publishing e-book authors reveals that the Kindle is providing them a profitable business model. John Locke then held the #1, #4 and #10 spots on the Amazon Top 100, with another 3 books in the top 40. Joe Konrath had the #35 spot. Of the top 100, 26 slots were held by independent authors. John and Joe had been charging $2.99 per download, of which Amazon gave them 70%. When they dropped the price to $0.99 per download of which Amazon only gives them 35%, not just their sales but also their income exploded. John is making $1800/day from $0.99 downloads. Kevin Kelly predicts that in 5 years, the average price of e-books will be $0.99. As he points out:
$1 is near to the royalty payment that an author will receive on, say, a paperback trade book. So in terms of sales, whether an author sells 1,000 copies themselves directly, or via a traditional publishing house, they will make the same amount of money.
If publishers were doing all the things they used to do to promote books, maybe this would not be a problem. But they aren't. Tip of the hat to Slashdot.
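The arithmetic behind that price cut can be checked with a short sketch. It uses only the figures quoted above ($2.99 at 70%, $0.99 at 35%, $1800/day); the function and variable names are illustrative, and it assumes no per-download delivery fees or other deductions.

```python
def royalty_per_download(price, share):
    """Author's income from one download at a given list price and royalty share."""
    return price * share


old_royalty = royalty_per_download(2.99, 0.70)  # royalty at the old price
new_royalty = royalty_per_download(0.99, 0.35)  # royalty at the new price

# Factor by which unit sales must grow for income to stay level after the cut.
break_even_multiple = old_royalty / new_royalty

# Downloads per day implied by $1800/day at the new price.
downloads_per_day = 1800 / new_royalty

print(f"old royalty ${old_royalty:.2f}, new royalty ${new_royalty:.2f}")
print(f"sales must rise about {break_even_multiple:.1f}x to break even")
print(f"$1800/day implies about {downloads_per_day:.0f} downloads/day")
```

On these assumptions, income only "explodes" if unit sales rise by considerably more than the roughly sixfold increase needed to break even.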

How Few Copies?

I spoke at the Screening the Future 2011 conference at the Netherlands Beeld en Geluid in Hilversum, on the subject of "How Few Copies?". Below the fold is an edited text of the talk with links to the resources.

ACM/IEEE copyright policy

Matt Blaze is annoyed at the ACM and IEEE copyright policy. So am I. In an update to his post he reports:

A prominent member of the ACM asserted to me that copyright assignment and putting papers behind the ACM's centralized "digital library" paywall is the best way to ensure their long-term "integrity". That's certainly a novel theory; most computer scientists would say that wide replication, not centralization, is the best way to ensure availability, and that a centrally-controlled repository is more subject to tampering and other mischief than a decentralized and replicated one.

This is deeply ironic, because ACM bestowed both a Best Paper award and an ACM Student Research award on Petros Maniatis, Mema Roussopoulos, TJ Giuli, David S.H. Rosenthal, Mary Baker, and Yanto Muliadi, "Preserving Peer Replicas By Rate-Limited Sampled Voting", 19th ACM Symposium on Operating Systems Principles (SOSP), Bolton Landing, NY, October 2003, for demonstrating that the "prominent member" is wrong and Matt is right.

If the "prominent member" wants the full details, they are available in the ACM's own Transactions on Computer Systems Vol. 23 No. 1, February 2005, pp 2-50.

Deduplicating Devices Considered Harmful

In my brief report from FAST11 I mentioned that Michael Wei's presentation of his paper on erasing information from flash drives (PDF) revealed that at least one flash controller was, by default, doing block-level deduplication of data written to it. I e-mailed Michael about this, and learned that the SSD controller in question is the SandForce SF-1200. This sentence is a clue:

DuraWrite technology extends the life of the SSD over conventional controllers, by optimizing writes to the Flash memory and delivering a write amplification below 1, without complex DRAM caching requirements.

This controller is used in SSDs from, for example, Corsair, ADATA and Mushkin.

It is easy to see the attraction of this idea. Flash controllers need a block re-mapping layer, called the Flash Translation Layer (FTL) (PDF) and, by enhancing this layer to map all logical blocks written with identical data to the same underlying physical block, the number of actual writes to flash can be reduced, the life of the device improved, and the write bandwidth increased. However, it was immediately obvious to me that this posed risks for file systems. Below the fold is an explanation.

File systems write the same metadata to multiple logical blocks as a way of avoiding a single block failure causing massive, or in some cases total, loss of user data. An example is the superblock in UFS. Suppose you have one of these SSDs with a UFS file system on it. Each of the multiple alternate logical locations for the superblock will be mapped to the same underlying physical block. If any of the bits in this physical block goes bad, the same bit will go bad in every alternate logical superblock.
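A toy model makes the failure mode concrete. This is a minimal sketch, not a description of how the SandForce controller or any real FTL is implemented: the `DedupFTL` class, the block addresses and the use of SHA-256 to detect identical blocks are all assumptions chosen for illustration.

```python
import hashlib


class DedupFTL:
    """Toy flash translation layer that stores identical blocks only once."""

    def __init__(self):
        self.logical_to_physical = {}  # logical block address -> physical block id
        self.physical = {}             # physical block id -> block contents
        self.by_digest = {}            # content digest -> physical block id

    def write(self, lba, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.by_digest:
            # First time this content is seen: allocate a new physical block.
            pid = len(self.physical)
            self.physical[pid] = bytearray(data)
            self.by_digest[digest] = pid
        # Identical content written elsewhere just points at the existing block.
        self.logical_to_physical[lba] = self.by_digest[digest]

    def read(self, lba):
        return bytes(self.physical[self.logical_to_physical[lba]])

    def flip_bit(self, pid, byte_index, bit):
        """Simulate a media error in one physical block."""
        self.physical[pid][byte_index] ^= 1 << bit


ftl = DedupFTL()
superblock = b"UFS superblock contents"
backup_lbas = [16, 8208, 16400]  # hypothetical alternate superblock locations
for lba in backup_lbas:
    ftl.write(lba, superblock)

physical_blocks_used = len(ftl.physical)  # three logical copies, one physical block
ftl.flip_bit(0, 0, 0)                     # a single bit goes bad
damaged_copies = sum(ftl.read(lba) != superblock for lba in backup_lbas)
print(physical_blocks_used, damaged_copies)
```

The file system believes it has three independent copies, but one bit error in the single shared physical block damages all of them.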

I discussed this problem with Kirk McKusick, and he with the ZFS team. In brief, that devices sometimes do this is very bad news indeed, especially for file systems such as ZFS intended to deliver the level of reliability that large file systems need.

Thanks to the ZFS team, here is a more detailed explanation of why this is a problem for ZFS. For critical metadata (and optionally for user data) ZFS stores up to 3 copies of each block. The checksum of each block is stored in its parent, so that ZFS can ensure the integrity of its metadata before using it. If corrupt metadata is detected, it can find an alternate copy and use that. Here are the problems:

If the stored metadata gets corrupted, the corruption will apply to all copies, so recovery is impossible.
To defeat this, we would need to put a random salt into each of the copies, so that each block would be different. But the multiple copies are written by scheduling multiple writes of the same data in memory to different logical block addresses on the device. Changing this to copy the data into multiple buffers, salt them, then write each one once would be difficult and inefficient.
Worse, it would mean that the checksum of each of the copies of the child block would be different; at present they are all the same. Retaining the identity of the copy checksums would require excluding the salt from the checksum. But ZFS computes the checksum of every block at a level in the stack where the kind of data in the block is unknown. Losing the identity of the copy checksums would require changes to the on-disk layout.
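The checksum problem in the last point can be sketched in a few lines. This is an illustration only, not ZFS code: the 8-byte salt prepended to each copy, the use of SHA-256 and the variable names are assumptions.

```python
import hashlib
import os


def checksum(block):
    return hashlib.sha256(block).hexdigest()


metadata = b"critical metadata block"

# Today: every copy is byte-identical, so one checksum stored in the parent
# verifies whichever copy is read.
copies = [metadata, metadata, metadata]
unsalted_sums = {checksum(c) for c in copies}

# With a random per-copy salt (hypothetical layout: 8-byte salt, then payload)
# the device can no longer deduplicate, but every copy has its own checksum.
SALT_LEN = 8
salted_copies = [os.urandom(SALT_LEN) + metadata for _ in range(3)]
salted_sums = {checksum(c) for c in salted_copies}

# Identity returns only if the checksum skips the salt, which means the
# checksumming code has to know where the salt sits inside the block.
payload_sums = {checksum(c[SALT_LEN:]) for c in salted_copies}

print(len(unsalted_sums), len(salted_sums), len(payload_sums))
```

Either the parent has to store one checksum per copy, or the checksum layer has to understand the block's internal layout; both are the kind of change described above.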

This isn't something specific to ZFS; similar problems arise for all file systems that use redundancy to provide robustness. The bottom line is that drivers for devices capable of doing this need to turn it off. But the whole point of SSDs is that they live behind the same generic disk driver as all SATA devices. It may be possible to use mechanisms such as FreeBSD's quirks to turn deduplication off, but that assumes that you know the devices with controllers that deduplicate, that the controllers support commands to disable deduplication, and that you know what the commands are.

Friday, February 25, 2011

Paying for Long-Term Storage

I was part of a panel on Economics at the Personal Digital Archiving 2011 conference at the Internet Archive. I talked about the idea of endowing data with a sum sufficient to pay for its indefinite storage. I first blogged about this in 2007. Below the fold is an edited and expanded version of my talk with links to sources.

FAST'11

I attended USENIX's File And Storage Technologies conference. Here's a brief list of the things that caught my attention:

The first paper, and one of the Best Paper awardees, was "A Study of Practical Deduplication" (PDF), an excellent overview of deduplication applied to file systems. It makes available much valuable new data. In their environment whole-file deduplication achieves about 3/4 of the total savings from aggressive block-level deduplication.
In fact, deduplication and flash memory dominated the conference. "Reliably Erasing Data From Flash-Based Solid State Drives" from a team at UCSD revealed that, because flash memories effectively require copy-on-write techniques, they contain many logically inaccessible copies of a file. These copies are easily accessible by de-soldering the chips and thus gaining a physical view of the storage. Since existing "secure delete" techniques can't go around the controller, and most controllers either don't or don't correctly implement the "sanitization" commands, it is essential to use encrypted file systems on flash devices if they are to store confidential information.
Even worse, Michael Wei's presentation of this paper revealed that at least one flash controller was doing block deduplication "under the covers". This is very tempting, in that it can speed up writes and extend the device lifetime considerably. But it can play havoc with the techniques file systems use to improve robustness.
"AONT-RS: Blending Security and Performance in Dispersed Storage Systems" was an impressive overview of how all-or-nothing transforms can provide security in Cleversafe's k-of-n dispersed storage system, without requiring complex key management schemes. I will write more on this in subsequent posts.
"Exploiting Memory Device Wear-Out Dynamics to Improve NAND Flash Memory System Performance" from RPI provides much useful background on the challenges flash technology faces in maintaining reliability as densities increase.
Although it is early days, it was interesting that several papers and posters addressed the impacts that non-volatile RAM technologies such as Phase Change Memory and memristors will have.
"Repairing Erasure Codes" was an important Work In Progress talk from a team at USC, showing how to reduce one of the more costly functions of k-of-n dispersed storage systems, organizing a replacement when one of the n slices fails. Previously, this required bringing together at least k slices, but they showed that it was possible to manage it with many fewer slices for at least some erasure codes, though so far none of the widely used ones. The talk mentioned this useful Wiki of papers about storage coding.

Tuesday, February 15, 2011

Disk growth

The Register interprets a recent analyst briefing by Seagate as predicting that this year could see the long-awaited 4TB 3.5" drive introduction. This is based on Seagate's claim of a 6th generation of Perpendicular Magnetic Recording (PMR) technology, and The Register's guess that this would provide a 30% increase in areal density. Thomas Coughlin makes similar projections but with only an 18% increase in areal density. These projections can be viewed optimistically, as continuing the somewhat slower growth in capacity of recent years, or pessimistically, as the industry being forced to stretch PMR technology because the transition to newer technologies (HAMR and BPM) offering much higher densities is proving much more difficult and expensive than anticipated.

On a related note, Storage Newsletter reports on Trend Focus's estimate that the industry shipped 88 Exabytes of disk capacity in the last quarter of 2010, made up of 29.3 Exabytes of mobile drives, 48.4 Exabytes of desktop drives, 2.6 Exabytes of enterprise drives, and 8 Exabytes of drives for consumer equipment (primarily DVRs). There were 73 million mobile and 64 million desktop drives, confirming that the market is moving strongly to the (lower capacity) 2.5" form factor.

Cisco estimates that the global IP traffic was 15 Exabytes/month at the start of 2010 growing at 45%/year. If they were right, the rate at the end of 2010 would be 66 Exabytes per quarter. The 88 Exabytes per quarter rate of disk shipments is still capable of storing all the IP traffic in the world. Because unit shipments of disks are growing slowly, and the capacity of each unit is growing less than 45%/year, they will shortly become unable to do so.
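The comparison can be reproduced with a short calculation. This is a rough sketch using only the figures quoted above; it assumes simple annual compounding, 1 Exabyte = 10^9 GB, and, for the crossover estimate, a constant annual growth rate in shipped disk capacity, which is a hypothetical parameter rather than a figure from any of the sources.

```python
import math

traffic_start_2010 = 15.0  # EB/month, Cisco estimate quoted above
traffic_growth = 0.45      # per year, Cisco estimate quoted above
shipped_q4_2010 = 88.0     # EB/quarter, Trend Focus estimate quoted above

# IP traffic rate at the end of 2010, converted from EB/month to EB/quarter.
traffic_end_2010 = traffic_start_2010 * (1 + traffic_growth) * 3

# Average capacity per drive shipped in Q4 2010, in GB.
mobile_gb = 29.3e9 / 73e6
desktop_gb = 48.4e9 / 64e6


def years_until_crossover(disk_growth):
    """Years until quarterly IP traffic exceeds quarterly disk shipments,
    assuming both grow at constant annual rates. disk_growth is a
    hypothetical input and must be below traffic_growth."""
    gap = shipped_q4_2010 / traffic_end_2010
    relative_growth = (1 + traffic_growth) / (1 + disk_growth)
    return math.log(gap) / math.log(relative_growth)


print(f"traffic at end of 2010: {traffic_end_2010:.1f} EB/quarter")
print(f"average mobile drive {mobile_gb:.0f} GB, desktop drive {desktop_gb:.0f} GB")
print(f"crossover in {years_until_crossover(0.30):.1f} years at 30%/year disk growth")
```

Under these assumptions the traffic rate comes out at about 65 Exabytes per quarter, close to the figure above, and any disk capacity growth rate below 45%/year closes the remaining gap within a few years.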

Tuesday, February 8, 2011

Are We Facing a "Digital Dark Age?"

Last October I gave a talk to the Alumni of Humboldt University in Berlin as part of the celebrations of their 200th anniversary. It was entitled "Are We Facing A 'Digital Dark Age?'". Below the fold is an edited text of this talk, which was aimed at a non-technical audience.

Threats to preservation

More than 5 years ago we published the LOCKSS threat model, the set of threats against which the LOCKSS system was designed to protect preserved content. We encouraged other digital preservation systems to do likewise; it is hard to judge how effective systems are in achieving their goal of preserving content unless you know what they are intended to preserve content against. We said:

We concur with the recent National Research Council recommendations to the National Archives that the designers of a digital preservation system need a clear vision of the threats against which they are being asked to protect their system's contents, and those threats under which it is acceptable for preservation to fail.

I don't recall any other system rising to the challenge; I'd be interested in any examples of systems that have documented their threat model that readers could provide in comments.

This lack of clarity as to the actual threats involved is a major reason for the misguided focus on format obsolescence that consumes such a large proportion of digital preservation attention and resources. As I write this two ongoing examples illustrate the kinds of real threats attention should be focused on instead.

In an attempt to damp down anti-government protests, the Egyptian government shut down the Internet in their country. One copy of the Internet Archive's Wayback Machine is hosted at the Bibliotheca Alexandrina. As I write it is accessible, but the risk is clear. But, you say, the US government would never do such a thing, so the Internet Archive is quite safe. Think again. Senators Joe Lieberman and Susan Collins are currently pushing a bill, the Protecting Cyberspace as a National Asset Act of 2010, to give the US government the power to do exactly that whenever it feels like doing so.

Also as I write this SourceForge is unavailable, shut down in the aftermath of a compromise. The LOCKSS software, in common with many other digital preservation technologies, is preserved in SourceForge's source code control system. Other systems essential to digital preservation use one of a small number of other similar repositories. When SourceForge comes back up, we will have to audit the copy it contains of our source code against our backups and working copies to be sure that the attackers did not tamper with it.

I have argued for years, again with no visible effect, that national libraries should preserve these open source repositories. Not merely because, as the SourceForge compromise illustrates, their contents are the essential infrastructure for much of digital preservation, and because there are no economic, technical or legal barriers to doing so, but even more importantly because they are major cultural achievements, just as worthy of future scholars' attention as books, movies and even tweets.

Monday, January 17, 2011

Why Migrate Formats? The Debate Continues

I am grateful for two recent contributions to the debate about whether format obsolescence is an exception, or the rule, and whether migration is a viable response to it:

Andy Jackson posts an argument for format migration to improve access rather than for preservation.
Rob Sharpe critiques my discussion of Microsoft Project 98 in a comment.

I respond to Andy below the fold. Responding to Rob involves some research to clear up what appears to be confusion on my part, so I will postpone that to a later post.

Andy gives up the position that format migration is essential for preservation and moves the argument to access, correctly quoting an earlier post of mine saying that the question about access is how convenient it is for the eventual reader. As Andy says:

What is the point of keeping the bits safe if your user community cannot use the content effectively?

In this shift Andy ends up actually agreeing with much, but not quite all, of my case.

He says, quite correctly, that I argue that a format with an open source renderer is effectively immune from format obsolescence. But that isn't all I'm saying. Rather, the more important observation is that formats are not going obsolete, they are continuing to be easily render-able by the normal tools that readers use. Andy and I agree that reconstructing the entire open source stack as it was before the format went obsolete is an imposition on an eventual reader. That isn't actually what would have to happen if obsolescence happened, but the more important point is that obsolescence isn't going to happen.

The digital preservation community has failed to identify a single significant format that has gone obsolete in the 15+ years since the advent of the Web, which is one quarter of the entire history of computing. I have put forward a theory that explains why format obsolescence ceased; I have yet to see any competing theory that both explains the lack of format obsolescence since the advent of the Web and, as it would have to in order to support the case for format migration, predicts a resumption in the future. There is unlikely to be any reason for a reader to do anything but use the tools they have to hand to render the content, and thus no need to migrate it to a different format to provide "sustainable access".

Andy agrees with me that the formats of the bulk of the British Library's collection are not going obsolete in the foreseeable future:

The majority of the British Library's content items are in formats like PDF, TIFF and JP2, and these formats cannot be considered 'at risk' on any kind of time-scale over which one might reasonably attempt to predict. Therefore, for this material, we take a more 'relaxed' approach, because provisioning sustainable access is not difficult.

This relaxed approach to format obsolescence, preserving the bits and dealing with format obsolescence if and when it happens, is the one I have argued for since we started the LOCKSS program.

Andy then goes on to discuss the small proportion of the collection whose problem is not that he expects its formats to go obsolete in the future, but that they are hard to render with current tools:

Unfortunately, a significant chunk of our collection is in formats that are not widely used, particularly when we don't have any way to influence what we are given (e.g. legal deposit material).

The BL eases access to this content by using migration tools on ingest to create an access surrogate and, as the proponents of format migration generally do, keeping the original.

Naturally, we wish to keep the original file so that we can go back to it if necessary,

Thus, Andy agrees with me that it is essential to preserve the bits. Preserving the bits will ensure that these formats stay as hard to render as they are right now. Creating an access surrogate in a different format may be a convenient thing to do, but it isn't a preservation activity.

Where we may disagree is on the issue of whether it is necessary to preserve the access surrogate. It isn't clear whether the BL does, but there is no real justification for doing so. Unlike the original bits, the surrogate can be re-created at any time by re-running the tool that created it in the first place. If you argue for preserving the access surrogate, you are in effect saying that you don't believe that you will be able to re-run the tool in the future. The LOCKSS strategy for handling format obsolescence, which was demonstrated and published more than 6 years ago, takes advantage of the transience of access surrogates; we create an access surrogate if a reader ever accesses content that is preserved in an original format that the reader regards as obsolete. Note that this approach has the advantage of being able to tailor the access surrogate to the reader's actual capabilities; there is no need to guess which formats the eventual reader will prefer. These access surrogates can be discarded immediately, or cached for future readers; there is no need to preserve them.
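The shape of migration on access can be sketched as follows. This is not the LOCKSS implementation; the `OnAccessMigrator` class, the format names and the stand-in converter are hypothetical, chosen only to show the original being preserved untouched while surrogates are built on demand and optionally cached.

```python
class OnAccessMigrator:
    """Serve originals when possible; otherwise build a surrogate on demand."""

    def __init__(self, converters, cache_surrogates=True):
        # converters maps (source_format, target_format) -> conversion function
        self.converters = converters
        self.cache_surrogates = cache_surrogates
        self.cache = {}       # (item_id, target_format) -> surrogate bytes
        self.conversions = 0  # how many times a converter actually ran

    def serve(self, item_id, original, original_format, reader_accepts):
        # Return the preserved original whenever the reader can use it.
        if original_format in reader_accepts:
            return original_format, original
        # Otherwise try the reader's formats in their order of preference.
        for target in reader_accepts:
            convert = self.converters.get((original_format, target))
            if convert is None:
                continue
            key = (item_id, target)
            if key not in self.cache:
                surrogate = convert(original)
                self.conversions += 1
                if not self.cache_surrogates:
                    return target, surrogate  # discard after use
                self.cache[key] = surrogate
            return target, self.cache[key]
        raise LookupError(f"no converter from {original_format} to {reader_accepts}")


# Stand-in converter: re-encode Latin-1 text as UTF-8.
converters = {
    ("text/x-legacy", "text/plain"): lambda b: b.decode("latin-1").encode("utf-8"),
}
migrator = OnAccessMigrator(converters)
original = "caf\xe9".encode("latin-1")

first = migrator.serve("item-1", original, "text/x-legacy", ["text/plain"])
second = migrator.serve("item-1", original, "text/x-legacy", ["text/plain"])
untouched = migrator.serve("item-1", original, "text/x-legacy", ["text/x-legacy"])
print(first, migrator.conversions, untouched)
```

The cache is purely an optimization: emptying it loses nothing, because any surrogate can be regenerated from the preserved original.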

The distinction between preservation and access is valuable, in that it makes clear that applying preservation techniques to access surrogates is a waste of resources.

One of the most interesting features of this debate has been detailed examinations of claims that this or the other format is obsolete; the claims have often turned out to be exaggerated. Andy says:

The original audio 'master' submitted to us arrives in one of a wide range of formats, depending upon the make, model and configuration of the source device (usually a mobile phone). Many of these formats may be 'exceptional', and thus cannot be relied upon for access now (never mind the future!).

But in the comments he adds:

The situation is less clear-cut in case of the Sound Map, partly because I'm not familiar enough with the content to know precisely how wide the format distribution really is.

The Sound Map page says:

Take part by publishing recordings of your surroundings using the free AudioBoo app for iPhone or Android smartphones or a web browser.

This implies that, contra Andy, the BL is in control of the formats used for recordings. It would be useful if someone with actual knowledge would provide a complete list of the formats ingested into Sound Map, and specifically identify those which are so hard to render as to require access surrogates.

Tuesday, January 4, 2011

Apology to Safari Users

I'm a Firefox user, so I have only just noticed that in Safari the "front page" of this blog does not render correctly. The first part of the material "below the fold" correctly does not appear, but mysteriously as soon as I use "blockquote", "ol" or "ul" tags, it reappears. So in Safari most of the post appears on the "front page" but with a chunk in the middle elided. Fortunately, clicking on the headline gets you a properly rendered version. Sorry about this. I'm looking into the problem.

Monday, January 3, 2011

Memento & the Marketplace for Archiving

In a recent post I described how Memento allows readers to access preserved web content, and how, just as accessing current Web content frequently requires the Web-wide indexes from keywords to URLs maintained by search engines such as Google, access to preserved content will require Web-wide indexes from original URL plus time of collection to preserved URL. These will be maintained by search-engine-like services that Memento calls Aggregators (which will, I predict, end up being called something snappier and less obscure).
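The kind of index such a service would maintain can be sketched as a small data structure. This is an illustration, not the Memento protocol or any Aggregator's actual design: the `AggregatorIndex` class, the example hosts and the policy of returning the capture closest in time to the requested date are all assumptions.

```python
import bisect
from datetime import datetime


class AggregatorIndex:
    """Map (original URL, time of collection) to a preserved URL."""

    def __init__(self):
        self._times = {}   # original URL -> sorted list of collection times
        self._copies = {}  # original URL -> preserved URLs, in the same order

    def add(self, original_url, collected_at, preserved_url):
        times = self._times.setdefault(original_url, [])
        copies = self._copies.setdefault(original_url, [])
        i = bisect.bisect_left(times, collected_at)
        times.insert(i, collected_at)
        copies.insert(i, preserved_url)

    def lookup(self, original_url, wanted):
        """Return the preserved URL collected closest to the wanted time."""
        times = self._times.get(original_url)
        if not times:
            return None
        i = bisect.bisect_left(times, wanted)
        # Only the captures just before and just after can be the closest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        best = min(candidates, key=lambda j: abs(times[j] - wanted))
        return self._copies[original_url][best]


index = AggregatorIndex()
page = "http://example.org/page"
index.add(page, datetime(2009, 1, 1), "http://archive-a.example/2009/page")
index.add(page, datetime(2011, 1, 1), "http://archive-b.example/2011/page")

late = index.lookup(page, datetime(2010, 10, 1))
early = index.lookup(page, datetime(2009, 3, 1))
missing = index.lookup("http://example.org/other", datetime(2010, 1, 1))
print(late, early, missing)
```

Note that the two captures in the example live at different archives, which is why the index has to span the whole Web rather than a single archive.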

As we know, a complex ecology of competition, advertising, optimization and spam has grown up around search engines, and we can expect something similar to happen around Aggregators. Below the fold I use an almost-real-life example to illustrate my ideas about how this will play out.
