Thursday, September 27, 2012

Notes from Designing Storage Architectures workshop

Below the fold are some notes from this year's Library of Congress Designing Storage Architectures meeting.

Friday, September 21, 2012

Talk at "Designing Storage Architectures"

I gave a talk at the Library of Congress' Designing Storage Architectures workshop entitled The Truth Is Out There: Long-Term Economics in the Cloud. Below the fold is an edited text with links to the resources. 

Wednesday, September 19, 2012

Two New Papers

A preprint of our paper The Economics of Long-Term Digital Storage (PDF) is now on-line. It was accepted for the UNESCO conference The Memory of the World in the Digital Age: Digitization and Preservation in Vancouver, BC, and I am scheduled to present it on the morning of September 27 (PDF). The paper pulls together the work we had done on economic models up to the submission deadline earlier this summer, and the evidence that Kryder's Law is slowing.
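The impact of a slowing Kryder's Law can be illustrated with a toy "endowment" calculation: the present value of the stream of annual storage charges for a fixed amount of data, assuming the unit price of storage falls at a constant Kryder rate and future costs are discounted at an interest rate. This is a sketch only; all prices and rates below are hypothetical placeholders, not figures from our model:

```python
# Toy endowment calculation: present value of the cost of storing
# 1TB for many years, with the annual price falling at a constant
# Kryder rate. All prices and rates are hypothetical assumptions.

def endowment(price_per_tb_year, interest, kryder, years=100):
    """Sum of annual storage charges over `years`, with the price
    declining by `kryder` each year and future charges discounted
    at `interest`."""
    total = 0.0
    price = price_per_tb_year
    for year in range(years):
        total += price / (1 + interest) ** year
        price *= (1 - kryder)  # Kryder's Law price decline
    return total

# A fast Kryder rate (30%/yr) makes long-term storage cheap...
fast = endowment(120.0, interest=0.03, kryder=0.30)
# ...a slowing Kryder rate (10%/yr) makes it much more expensive.
slow = endowment(120.0, interest=0.03, kryder=0.10)
print(fast, slow)
```

Even this crude sketch shows why the Kryder rate matters: with these made-up numbers, halting the price decline from 30% to 10% per year roughly triples the endowment needed.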

LOCKSS Boxes in the Cloud (PDF) is also on-line. It is the final report of the project we have been working on for the Library of Congress' NDIIPP program to investigate the use of cloud storage for LOCKSS boxes.

Friday, September 14, 2012

Correction, please! Thank you!

I'm a critic of the way research communication works much of the time, so it is nice to draw attention to an instance where it worked well.

The UK's JISC recently announced the availability of the final version of a report the Digital Curation Centre put together for a workshop last March. "Digital Curation and the Cloud" cited my work on cloud storage costs thus:
David Rosenthal has presented a case to suggest that cost savings associated with the cloud are likely to be negligible, although local costs extend far beyond those associated just with procuring storage media and hardware.
Apparently this was not the authors' intention, but the clear implication is that my model ignores all costs for local storage except those "associated just with procuring storage media and hardware". Clearly, there are many other costs involved in storing data locally. But in the cited blog post I say, describing the simulations leading to my conclusion:
A real LOCKSS box has both capital and running costs, whereas a virtual LOCKSS box in the cloud has only running costs. For an apples-to-apples comparison, I need to compare cash flows through time.
Further, I didn't "suggest that cost savings associated with the cloud are likely to be negligible". In the cited blog post I showed that for digital preservation, not merely were there no "cost savings" from cloud storage but rather:
Unless things change, cloud storage is simply too expensive for long-term use.
Over the weekend I wrote to the primary author asking for a correction. It is now Friday, and the report reads:
David Rosenthal has presented a case to suggest that cloud storage is currently "too expensive for long term use" in comparison with the capital and running costs associated with local storage.
Kudos to all involved for the swift and satisfactory resolution of this issue. But, looking back at my various blog posts, I haven't been as clear as I should have been in describing the ways in which my model of local storage costs leans over backwards to be fair to cloud storage. Follow me below the fold for the details.
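The apples-to-apples comparison described in the quoted post can be sketched as a discounted-cash-flow calculation: an owned box incurs a capital cost up front plus running costs, while a cloud rental incurs running costs only, so the fair comparison is between the net present values of the two cash-flow streams. The prices, lifetime, and discount rate below are hypothetical placeholders, not figures from my model:

```python
# Compare the net present value (NPV) of owning a storage box
# (capital cost up front plus annual running costs) with renting
# cloud storage (annual running costs only).
# All dollar amounts and rates are hypothetical assumptions.

def npv(cash_flows, discount_rate):
    """Net present value of a list of annual cash flows,
    with cash_flows[0] occurring now."""
    return sum(cf / (1 + discount_rate) ** year
               for year, cf in enumerate(cash_flows))

YEARS = 10
DISCOUNT = 0.05  # assumed 5%/year cost of capital

# Local box: $2000 capital now, plus $500/year running (assumed).
local = [2000 + 500] + [500] * (YEARS - 1)

# Cloud: $1200/year rental, no capital cost (assumed).
cloud = [1200] * YEARS

print(f"Local NPV: ${npv(local, DISCOUNT):,.0f}")
print(f"Cloud NPV: ${npv(cloud, DISCOUNT):,.0f}")
```

The point of the exercise is the method, not the made-up numbers: only by discounting both streams to present value can the one-time capital cost of the local box be compared fairly with the recurring rental charge.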

Tuesday, September 11, 2012

More on Glacier Pricing

I used our prototype long-term economic model to investigate Amazon's recently announced Glacier archival storage service. The results are less dramatic than the hype. Unlike S3, it appears that in practice Glacier can have long-term costs roughly the same as local storage, subject to some caveats. Follow me below the fold for the details.

Tuesday, September 4, 2012

Threat Models For Archives

I spent some time this week looking at a proposed technical architecture for a digital archive. It seems always to be the case that these proposals lack any model of the threats against which the data has to be protected. This one was no exception.

It is a mystery how people can set out to design a system without specifying what the system is supposed to do. In 2005 a panel of the National Science Board reported on the efforts of the US National Archives and Records Administration to design an Electronic Records Archive (ERA):
"It is essential that ERA design proposals be analyzed against a threat model in order to gain an understanding of the degree to which alternative designs are vulnerable to attack."
That same year the LOCKSS team published the threat model the LOCKSS design uses. Despite these examples from 7 years ago, it seems that everyone thinks the threats to stored data are so well understood and agreed upon that there is no need to specify them; they can simply be assumed.

From 4 years ago, here is an example of the threats paper archives must guard against. The historian Martin Allen published a book making sensational allegations about the Duke of Windsor's support for the Nazis. It was based on "previously unseen documents" from the British National Archives, but:
"... an investigation by the National Archives into how forged documents came to be planted in their files ... uncovered the full extent of deception. Officials discovered 29 faked documents, planted in 12 separate files at some point between 2000 and 2005, which were used to underpin Allen's allegations."
Is this a threat your digital preservation system could prevent, or even detect? Could your system detect it if the perpetrator had administrative privileges? The proposed system I was looking at would definitely have failed the second test, because insider abuse was not part of its assumed threat model. If the designers had taken the time to write down their threat model, perhaps using our paper as a template, they would either have caught the omission or have had to explain how it came to be that their system administrators were guaranteed to be both infallible and incorruptible.