Friday, December 26, 2014

Crypto-currency as a basis for preservation

Although I have great respect for the technology underlying crypto-currencies such as Bitcoin, I've been skeptical for some time as to its viability as a product in the market both as a currency and as the basis for peer-to-peer storage proposals such as Permacoin and MaidSafe. The attraction of crypto-currencies is their decentralized nature, but if they become successful enough to be generally useful, economies of scale lead to their centralization. It was easy to get caught up in the enthusiasm as Bitcoin grew rapidly, but:
  • Bitcoin was the worst investment of 2014, as its value halved.
  • Bitcoin's hash rate had been growing exponentially since the start of 2013 but has been approximately flat for the last quarter, indicating that investment in new mining hardware has dried up.
  • The reason for investment drying up is likely that the revenue from mining is less than a third of what it was.
  • The Bitcoin market capitalization dropped from $11B to $4.4B.
Even if you don't accept my economies of scale arguments, these numbers should temper your enthusiasm for basing peer-to-peer storage on a crypto-currency.

Thursday, December 18, 2014

Economic Failures of HTTPS

Bruce Schneier points me to Assessing legal and technical solutions to secure HTTPS, a fascinating, must-read analysis of the (lack of) security on the Web from an economic rather than a technical perspective by Axel Arnbak and co-authors from Amsterdam and Delft universities. Do read the whole paper, but below the fold I provide some choice snippets.

Tuesday, December 16, 2014

Hardware I/O Virtualization

Timothy Prickett Morgan has an interesting post entitled A Rare Peek Into The Massive Scale Of AWS. It is based on a talk by Amazon's James Hamilton at the re:Invent conference. Morgan's post provides a hierarchical, network-centric view of the AWS infrastructure:
  • Regions, 11 of them around the world, contain Availability Zones (AZ).
  • The 28 AZs are arranged so that each Region contains at least 2 and up to 6 datacenters.
  • Morgan estimates that there are close to 90 datacenters in total, each with 2000 racks, burning 25-30MW.
  • Each rack holds 25 to 40 servers.
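Multiplying out the figures in the list gives a feel for the scale. The sketch below is only that arithmetic; it assumes every datacenter is fully populated at 2000 racks, which probably overstates the real fleet, and the function name is made up for illustration.

```python
# Figures taken from the list above (Morgan's estimates).
DATACENTERS = 90               # "close to 90", so treat this as an upper bound
RACKS_PER_DATACENTER = 2000
SERVERS_PER_RACK = (25, 40)    # low and high ends of the quoted range


def server_range(datacenters, racks, servers_per_rack):
    """Return (low, high) server counts for the given fleet shape."""
    low, high = servers_per_rack
    return datacenters * racks * low, datacenters * racks * high


per_dc = server_range(1, RACKS_PER_DATACENTER, SERVERS_PER_RACK)
fleet = server_range(DATACENTERS, RACKS_PER_DATACENTER, SERVERS_PER_RACK)
print(f"Servers per datacenter: {per_dc[0]:,} to {per_dc[1]:,}")
print(f"Servers in total:       {fleet[0]:,} to {fleet[1]:,}")
```

On these assumptions each datacenter holds 50,000 to 80,000 servers, and the fleet as a whole would be bounded above by a few million.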
AZs are no more than 2ms apart measured in network latency, allowing for synchronous replication. This means the AZs in a region are only a couple of kilometres apart, which is less geographic diversity than one might want, but a disaster still has to have a pretty big radius to take out more than one AZ. The datacenters in an AZ are not more than 250us apart in latency terms, close enough that a disaster might take all the datacenters in one AZ out.

Below the fold, some details and the connection between what Amazon is doing now, and what we did in the early days of NVIDIA.

Thursday, December 11, 2014

"Official" Senate CIA Torture Report

Please go and read James Jacobs' post The Official Senate CIA Torture Report to understand the challenges government documents librarians face. You would think that a document generating such worldwide interest would be easy to find and preserve. In your dreams, as it turns out.

Tuesday, December 9, 2014

Talk at Fall CNI

I gave a talk at the Fall CNI meeting entitled Improving the Odds of Preservation, with the following abstract:
Attempts have been made, for various types of digital content, to measure the probability of preservation. The consensus is about 50%. Thus the rate of loss to future readers from "never preserved" vastly exceeds that from all other causes, such as bit rot and format obsolescence. Will persisting with current preservation technologies improve the odds of preservation? If not, what changes are needed to improve them?
It covered much of the same material as Costs: Why Do We Care, with some differences in emphasis. Below the fold, the text with links to the sources.

Thursday, December 4, 2014

A Note of Thanks

I have a top-of-the-line MacBook Air, which is truly a work of art, but I discovered fairly quickly that subjecting a machine that cost almost $2000 to the vicissitudes of today's travel is worrying. So for years now the machine I've travelled with is a netbook, an Asus Seashell 1005PE. It is small, light, has almost all-day battery life and runs Ubuntu just fine. It cost me about $250, and with both full-disk encryption and an encrypted home directory, I just don't care if it gets lost, broken or seized.

But at last the signs of the hard life of a travelling laptop are showing. I looked around for a replacement and settled on the Acer C720 Chromebook. This cost me $387 including tax and same-day delivery from Amazon. Actually, same-day isn't accurate. It took less than 9 hours from order to arrival! If I'd waited until Black Friday to order it would have been more than $40 cheaper.

For that price, the specification is amazing:
  • 1.7GHz 4-core Intel Core i3
  • 4GB RAM
  • 32GB SSD
  • 11.6" 1366x768 screen
Thanks to these basic instructions from Jack Wallen and the fine work of HugeGreenBug in assembling a version of Ubuntu for the C720, 24 hours after ordering I had a light, thin, powerful laptop with a great display running a full 64-bit installation of Ubuntu 14.04. I'm really grateful to everyone who contributed to getting Linux running on Chromebooks in general and on the C720 in particular. Open source is wonderful.

Of course, there are some negatives. The bigger screen is great, but it makes the machine about an inch bigger in width and depth. Like the Seashell and unlike full-size laptops, it will be usable in economy seats on the plane even if the passenger in front reclines their seat. But it'll be harder than it was with the Seashell to claim that the computer and the drink can co-exist on the economy seat-back table.

Below the fold, some details for anyone who wants to follow in my footsteps.

Tuesday, December 2, 2014

Henry Newman's Farewell Column

Henry Newman has been writing a monthly column on storage technology for Enterprise Storage Forum for 12 years, and he's decided to call it a day. His farewell column is entitled Follow the Money: Picking Technology Winners and Losers and it starts:
I want to leave you with a single thought about our industry and how to consistently pick technology winners and losers. This is one of the biggest lessons I’ve learned in my 34 years in the IT industry: follow the money.
It's an interesting read. Although Henry has been a consistent advocate for tape for "almost three decades", he uses tape as an example of the money drying up. He has a table showing that the LTO media market is less than half the size it was in 2008. He estimates that the total tape technology market is currently about $1.85 billion, whereas the disk technology market is around $35 billion.
Following the money also requires looking at the flip side and following the de-investment in a technology. If customers are reducing their purchases of a technology, how can companies justify increasing their spending on R&D? Companies do not throw good money after bad forever, and at some point they just stop investing.
Go read the whole thing and understand why Henry's regular column will be missed, and how perceptive the late Jim Gray was when in 2006 he stated that Tape is Dead, Disk is Tape, Flash is Disk.

Tuesday, November 25, 2014

Dutch vs. Elsevier

The discussions between libraries and major publishers about subscriptions have only rarely been actual negotiations. In almost all cases the libraries have been unwilling to walk away and the publishers have known this. This may be starting to change; Dutch libraries have walked away from the table with Elsevier. Below the fold, the details.

Friday, November 21, 2014

Steve Hetzler's "Touch Rate" Metric

Steve Hetzler of IBM gave a talk at the recent Storage Valley Supper Club on a new, scale-free metric for evaluating storage performance that he calls "Touch Rate". He defines this as the proportion of the store's total content that can be accessed per unit time. This leads to some very illuminating graphs that I discuss below the fold.
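Taking the definition above literally, a touch rate is just sustained bandwidth divided by capacity. The sketch below assumes a year as the unit of time and uses hypothetical device figures (4 TB, 150 MB/s); neither the units nor the numbers come from Hetzler's talk.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def touch_rate_per_year(capacity_bytes, bytes_per_second):
    """Fraction of the store's total content that can be accessed in one year."""
    return bytes_per_second * SECONDS_PER_YEAR / capacity_bytes


# Hypothetical device: 4 TB capacity, 150 MB/s sustained transfer rate.
single_disk = touch_rate_per_year(4e12, 150e6)

# Two such devices side by side: twice the capacity, twice the bandwidth.
pair_of_disks = touch_rate_per_year(8e12, 300e6)

print(f"One disk:  {single_disk:.1f} times per year")
print(f"Two disks: {pair_of_disks:.1f} times per year")
```

The two results are the same, which is the sense in which the metric is scale-free: it describes the technology rather than the size of the installation.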

Tuesday, November 18, 2014

Talk "Costs: Why Do We Care?"

Investing in Opportunity: Policy Practice and Planning for a Sustainable Digital Future sponsored by the 4C project and the Digital Preservation Coalition featured a keynote talk each day. The first, by Fran Berman, is here.

Mine was the second, entitled Costs: Why Do We Care? It was an update and revision of The Half-Empty Archive, stressing the importance of collecting, curating and analyzing cost data. Below the fold, an edited text with links to the sources.

Monday, November 17, 2014

Andrew Odlyzko Strikes Again

Last year I blogged about Andrew Odlyzko's perceptive analysis of the business of scholarly publishing. Now he's back with an invaluable, must-read analysis of the economics of the communication industry entitled Will smart pricing finally take off?. Below the fold, a taste of the paper and a validation of one of his earlier predictions from the Google Scholar team.

Friday, November 14, 2014

Talk at Storage Valley Supper Club

I gave a very short talk to the Storage Valley Supper Club's 8th meeting. Below the fold, an edited text with links to the sources.

Wednesday, November 12, 2014

Five Minutes Of Fame

On Monday, Chris Mellor at The Register had a piece with a somewhat misleading title that provides a good summary of the argument we've been making since at least early 2011 that the Kryder rate, the rate of annual decrease in the cost per byte of storage, had slowed dramatically. As we have shown, this slowing has huge implications for the cost of long-term storage.
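To see why the Kryder rate matters so much for long-term storage, consider a deliberately crude model: the cost of keeping a fixed amount of data falls by the Kryder rate every year, and the total cost is the sum of the yearly costs. This is a toy illustration under that single assumption, not the economic model behind the posts referred to above, and the two rates are chosen only for contrast.

```python
def cumulative_cost(first_year_cost, kryder_rate, years):
    """Total spend if each year costs (1 - kryder_rate) times the year before."""
    return sum(first_year_cost * (1 - kryder_rate) ** t for t in range(years))


# Hypothetical rates: a historically fast 40%/yr versus a slowed 10%/yr.
fast = cumulative_cost(1.0, 0.40, 100)
slow = cumulative_cost(1.0, 0.10, 100)

print(f"100 years at 40%/yr: {fast:.2f} x the first year's cost")
print(f"100 years at 10%/yr: {slow:.2f} x the first year's cost")
```

With a fast Kryder rate the long tail of future costs is almost free, so the total is only a small multiple of the first year's cost. With a slow rate the tail dominates, and the same century of storage costs several times as much.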

Today, Chris is back with a similar summary of Preeti Gupta et al's MASCOTS paper, An Economic Perspective of Disk vs. Flash Media in Archival Storage. This paper reports on some more sophisticated economic modelling that supports the argument of DAWN: a Durable Array of Wimpy Nodes. This 2011 technical report showed that, using a similar fabric to Carnegie-Mellon's 2009 FAWN: a Fast Array of Wimpy Nodes for long-term storage instead of computation, the running costs would be low enough to overcome the much higher cost of the flash media as compared to disk.

Monday, November 10, 2014

Gossip protocols: a clarification

a subtype of “gossip” protocols" and cites LOCKSS as an example, saying:
Not coincidentally, LOCKSS “consists of a large number of independent, low-cost, persistent Web caches that cooperate to detect and repair damage to their content by voting in “opinion polls” (PDF). In other words, gossip and anti-entropy.
The main use for gossip protocols is to disseminate information in a robust, randomized way, by having each peer forward information it receives from other peers to a random selection of other peers. As the function of LOCKSS boxes is to act as custodians of copyright information, this would be a very bad thing for them to do.
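The forwarding behaviour described above can be sketched as a small simulation. This is a generic push-gossip toy, not any LOCKSS protocol; the function name, the fanout parameter and the round-based timing are all assumptions made for illustration.

```python
import random


def gossip_rounds(num_peers, fanout, seed=0):
    """Simulate push gossip; return the rounds needed to inform every peer.

    Each round, every informed peer forwards the rumour to `fanout` peers
    chosen at random from all the other peers.
    """
    rng = random.Random(seed)
    informed = {0}                      # peer 0 starts with the rumour
    rounds = 0
    while len(informed) < num_peers:
        newly_informed = set()
        for peer in informed:
            others = [p for p in range(num_peers) if p != peer]
            newly_informed.update(rng.sample(others, fanout))
        informed |= newly_informed
        rounds += 1
    return rounds


print(gossip_rounds(num_peers=100, fanout=3))
```

The point of the sketch is how quickly and indiscriminately the information spreads: after a handful of rounds every peer holds a copy, which is exactly the property a custodian of restricted content does not want.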

It is true that LOCKSS peers communicate via an anti-entropy protocol, and it is even true that the first such protocol they used, the one I implemented for the LOCKSS prototype, was a gossip protocol in the sense that peers forwarded hashes of content to each other. Alas, that protocol was very insecure. Some of the ways in which it was insecure related directly to its being a gossip protocol.

An intensive multi-year research effort in cooperation with Stanford's CS department to create a more secure anti-entropy protocol led to the current protocol, which won "Best Paper" at the 2003 Symposium on Operating Systems Principles. It is not a gossip protocol in any meaningful sense (see below the fold for details). Peers never forward information they receive from other peers; all interactions are strictly pair-wise and private.

For the TRAC audit of the CLOCKSS Archive we provided an overview of the operation of the LOCKSS anti-entropy protocol; if you are interested in the details of the protocol this, rather than the long and very detailed paper in ACM Transactions on Computer Systems (PDF), is the place to start.

Monday, November 3, 2014

First US web page

Stanford's Web Archiving team of Nicholas Taylor and Ahmed AlSum have brought up SWAP, the Stanford Web Archive Portal, using the Open Wayback code developed under IIPC auspices from the Internet Archive's original. And, thanks to the Stanford staff's extraordinary ability to recover data from old backups, it features the very first US web page, brought up by Paul Kunz at SLAC around 6th Dec. 1991.

Friday, October 31, 2014

This is what an emulator should look like

Via hackaday, [Jörg]'s magnificently restored PDP10 console, connected via a lot of wiring to a BeagleBone running the SIMH PDP10 emulator. He did the same for a PDP11. Two computers that gave me hours of harmless fun back in the day!

Kids today have no idea what a computer should look like. But even they can run [Jörg]'s Java virtual PDP10 console!

Tuesday, October 28, 2014

Familiarity Breeds Contempt

In my recent Internet of Things post I linked to Jim Gettys' post Bufferbloat and Other Challenges. In it Jim points to a really important 2010 paper by Sandy Clark, Matt Blaze, Stefan Frei and Jonathan Smith entitled Familiarity Breeds Contempt: The Honeymoon Effect and the Role of Legacy Code in Zero-Day Vulnerabilities.

Clark et al analyze databases of vulnerabilities to show that the factors influencing the rate of discovery of vulnerabilities are quite different from those influencing the rate of discovery of bugs. They summarize their findings thus:
We show that the length of the period after the release of a software product (or version) and before the discovery of the first vulnerability (the ’Honeymoon’ period) is primarily a function of familiarity with the system. In addition, we demonstrate that legacy code resulting from code re-use is a major contributor to both the rate of vulnerability discovery and the numbers of vulnerabilities found; this has significant implications for software engineering principles and practice.
Jim says:
our engineering processes need fundamental reform in the face of very long lived devices.
Don't hold your breath. The paper's findings also have significant implications for digital preservation, because external attack is an important component of the threat model for digital preservation systems:
  • Digital preservation systems are, like devices in the Internet of Things (IoT), long-lived.
  • Although they are designed to be easier to update than most IoT devices, they need to be extremely cheap to run. Resources to make major changes to the code base within the "honeymoon" period will be inadequate.
  • Scarce resources and adherence to current good software engineering practices already mean that much of the code in these systems is shared.
Thus it is likely that digital preservation systems will be more vulnerable than the systems whose content they are intended to preserve. This is a strong argument for diversity of implementation, which has unfortunately turned out to increase costs significantly. Mitigating the threat from external attack increases the threat of economic failure.

Thursday, October 23, 2014

Facebook's Warm Storage

Last month I was finally able to post about Facebook's cold storage technology. Now, Subramanian Muralidhar and a team from Facebook, USC and Princeton have a paper at OSDI that describes the warm layer between the two cold storage layers and Haystack, the hot storage layer. f4: Facebook's Warm BLOB Storage System is perhaps less directly aimed at long-term preservation, but the paper is full of interesting information. You should read it, but below the fold I relate some details.

Monday, October 20, 2014

Journal "quality"

Anurag Acharya and co-authors from Google Scholar have a pre-print entitled Rise of the Rest: The Growing Impact of Non-Elite Journals in which they use article-level metrics to track the decreasing importance of the top-ranked journals in their respective fields from 1995 to 2013. I've long argued that the value that even the globally top-ranked journals add is barely measurable and may even be negative; this research shows that the message is gradually getting out. Authors of papers subsequently found to be "good" (in the sense of attracting citations) are slowly but steadily choosing to publish away from the top-ranked journals in their field. You should read the paper, but below the fold I have some details.

Wednesday, October 15, 2014

The Internet of Things

In 1996, my friend Steven McGeady gave a fascinating and rather prophetic keynote address to the Harvard Conference on the Internet and Society. In his introduction, Steven said:
I was worried about speaking here, but I'm even more worried about some of the pronouncements that I have heard over the last few days, ... about the future of the Internet. I am worried about pronouncements of the sort: "In the future, we will do electronic banking at virtual ATMs!," "In the future, my car will have an IP address!," "In the future, I'll be able to get all the old I Love Lucy reruns - over the Internet!" or "In the future, everyone will be a Java programmer!"

This is bunk. I'm worried that our imagination about the way that the 'Net changes our lives, our work and our society is limited to taking current institutions and dialling them forward - the "more, better" school of vision for the future.
I have the same worries that Steven did about discussions of the Internet of Things that looms so large in our future. They focus on the incidental effects, not on the fundamental changes. Barry Ritholtz points me to a post by Jon Evans at TechCrunch entitled The Internet of Someone Else's Things that is an exception. Jon points out that the idea that you own the Smart Things you buy is obsolete:
They say “possession is nine-tenths of the law,” but even if you physically and legally own a Smart Thing, you won’t actually control it. Ownership will become a three-legged stool: who physically owns a thing; who legally owns it; …and who has the ultimate power to command it. Who, in short, has root.
What does this have to do with digital preservation? Follow me below the fold.

Tuesday, October 7, 2014

Economies of Scale in Peer-to-Peer Networks

In a recent IEEE Spectrum article entitled Escape From the Data Center: The Promise of Peer-to-Peer Cloud Computing, Ozalp Babaoglu and Moreno Marzolla (BM) wax enthusiastic about the potential for Peer-to-Peer (P2P) technology to eliminate the need for massive data centers. Even more exuberance can be found in Natasha Lomas' TechCrunch piece The Server Needs To Die To Save The Internet (LM) about the MaidSafe P2P storage network. I've been working on P2P technology for more than 16 years, and although I believe it can be very useful in some specific cases, I'm far less enthusiastic about its potential to take over the Internet.

Below the fold I look at some of the fundamental problems standing in the way of a P2P revolution, and in particular at the issue of economies of scale. After all, I've just written a post about the huge economies that Facebook's cold storage technology achieves by operating at data center scale.

Tuesday, September 30, 2014

More on Facebook's "Cold Storage"

So far this year I've attended two talks that were really revelatory; Krste Asanović's keynote at FAST 13, which I blogged about earlier, and Kestutis Patiejunas' talk about Facebook's cold storage systems. Unfortunately, Kestutis' talk was off-the-record, so I couldn't blog about it at the time. But he just gave a shorter version at the Library of Congress' Designing Storage Architectures workshop, so now I can blog about this fascinating and important system. Below the fold, the details.

Thursday, September 25, 2014

Plenary Talk at 3rd EUDAT Conference

I gave a plenary talk at the 3rd EUDAT Conference's session on sustainability entitled Economic Sustainability of Digital Preservation. Below the fold is an edited text with links to the sources.

Tuesday, September 23, 2014

A Challenge to the Storage Industry

I gave a brief talk at the Library of Congress Storage Architecture meeting, pulling together themes from a number of recent blog posts. My goal was twofold:
  • to outline the way in which current storage architectures fail to meet the needs of long-term archives,
  • and to set out what an architecture that would meet those needs would look like.
Below the fold is an edited text with links to the earlier posts here that I was condensing.

Monday, September 22, 2014

Three Good Reads

Below the fold I'd like to draw your attention to two papers and a post worth reading.

Saturday, September 20, 2014

Utah State Archives has a problem

A recent thread on the NDSA mailing list discussed the Utah State Archives' struggle with the costs of being forced to use Utah's state IT infrastructure for preservation. Below the fold, some quick comments.

Tuesday, September 16, 2014

Two Sidelights on Short-Termism

I've often referred to the empirical work of Haldane & Davies and the theoretical work of Farmer and Geanakoplos, both of which suggest that investors using Discounted Cash Flow (DCF) to decide whether an investment now is justified by returns in the future are likely to undervalue the future. This is a big problem in areas, such as climate change and digital preservation, where the future is some way off.
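For readers unfamiliar with the mechanics, DCF divides each future cash flow by a factor that compounds with time, so returns far in the future shrink quickly. The sketch below shows only that standard formula; the amounts and discount rates are hypothetical and say nothing about what rates investors actually use.

```python
def present_value(cash_flows, discount_rate):
    """Discount a list of (years_from_now, amount) pairs back to today."""
    return sum(amount / (1 + discount_rate) ** years
               for years, amount in cash_flows)


# Hypothetical benefit: 100 units received 30 years from now.
benefit = [(30, 100.0)]

for rate in (0.03, 0.05, 0.10):
    print(f"discount rate {rate:.0%}: worth {present_value(benefit, rate):.2f} today")
```

The higher the discount rate, the less a distant benefit counts today. An investor who applies too high a rate will therefore reject investments whose payoff is decades away, which is the undervaluing of the future described above.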

Now Harvard's Greenwood & Shleifer, in a paper entitled Expectations of Returns and Expected Returns, reinforce this:
We analyze time-series of investor expectations of future stock market returns from six data sources between 1963 and 2011. The six measures of expectations are highly positively correlated with each other, as well as with past stock returns and with the level of the stock market. However, investor expectations are strongly negatively correlated with model-based expected returns.
They compare investors' beliefs about the future of the stock market, as reported in various opinion surveys, with the outputs of various models used by economists to predict the future based on current information about stocks. They find that when these models, all enhancements to DCF of one kind or another, predict low performance, investors expect high performance, and vice versa. If they have experienced poor recent performance and see a low market, they expect this to continue and are unwilling to invest. If they see good recent performance and a high market they expect this to continue. Their expected return from investment will be systematically too high, or in other words they will suffer from short-termism.

Yves Smith at Naked Capitalism has a post worth reading critiquing a Washington Post article entitled America’s top execs seem ready to give up on U.S. workers. It reports on a Harvard Business School survey of its graduates entitled An Economy Doing Half Its Job. Yves writes:
In the early 2000s, we heard regularly from contacts at McKinsey that their clients had become so short-sighted that it was virtually impossible to get investments of any sort approved, even ones that on paper were no-brainers. Why? Any investment still has an expense component, meaning some costs will be reported as expenses on the income statement, as opposed to capitalized on the balance sheet. Companies were so loath to do anything that might blemish their quarterly earnings that they’d shun even remarkably attractive projects out of an antipathy for even a short-term lowering of quarterly profits.
Note "Companies were so loath". The usually careful Yves falls into the common confusion between companies (institutions) and their managers (individuals). Managers evaluate investments not in terms of their longer-term return to the company, but in terms of their short-term effect on the stock price, and thus on their stock-based compensation. It's the IBGYBG (I'll Be Gone, You'll Be Gone) phenomenon, which amplifies the underlying problems of short-termism.

Tuesday, September 9, 2014

Tuesday, September 2, 2014

Interesting report on Digital Legal Deposit

Last month the International Publishers Association (IPA) put out an interesting report about the state of digital legal deposit for copyright purposes, with extended status reports from the national libraries of Germany, the Netherlands, the UK, France and Italy, and short reports from many other countries. The IPA's conclusions echo some themes I have mentioned before:
  • "It is clear that the more voluntary a digital legal deposit scheme is at the outset, the better."
  • "The best schemes are those where an emphasis has been put on publishers and librarians collaborating to address key concerns"
My reason for saying these things is based on experience. It shows that, no matter what the law says, if the publishers don't want you to collect their stuff, you will have a very hard time collecting it. On-line publishers need to have robust defenses against theft, which even national libraries would have difficulty overcoming without the publishers' cooperation.

The publishers' reason for saying these things is different. What are the publishers' "key concerns" on which voluntary collaboration is needed?
  • "copyright protection, digital security and monitored access"
  • "clear, mutually agreed and flexible rules on access which protect publishers' normal exploitation; who is authorized to use deposited material, where they can access it and what they can lawfully do with it."
In other words, they are happy to deposit their content only under conditions that make it almost useless, such as that it only be accessible to one reader at a time physically at the library, just like a paper book.

Given that the finances of many national libraries are in dire straits, the publishers have a helpful suggestion:
"Countries might usefully consider other models, such as larger publishers self-archiving material, agreeing to make it available on request to libraries."
Or, in other words, let's just forget the whole idea of legal deposit.

Note: everything in quotes is from the report, emphasis in the original.

Friday, August 22, 2014

"Cloud Storage Is Eating The World Alive" Really?

I'm naturally happy when someone cites my blog and uses my data, as Alex Teu did in his post Cloud Storage Is Eating The World Alive on TechCrunch. I'm less happy with some of the conclusions Alex drew. Below the fold, I argue with him.

Thursday, August 21, 2014

Is This The Dawn of DAWN?

More than three years ago, Ian Adams, Ethan Miller and I were inspired by a 2009 paper FAWN: A Fast Array of Wimpy Nodes from David Andersen et al at C-MU. They showed how a fabric of nodes, each with a small amount of flash memory and a very low-power processor, could process key-value queries as fast as a network of beefy servers using two orders of magnitude less power.

We put forward a storage architecture called DAWN: Durable Array of Wimpy Nodes, similar hardware but optimized for long-term storage. Its advantages were small form factor, durability, and very low running costs. We argued that these would outweigh the price premium for flash over disk. Recent developments are starting to make us look prophetic - details below the fold.

Tuesday, August 19, 2014

TRAC Audit: Do-It-Yourself Demos

In my post TRAC Audit: Process I explained how we demonstrated the LOCKSS Polling and Repair Protocol to the auditors, and linked to the annotated logs we showed them. These demos have been included in the latest release of the LOCKSS software. Below the fold, and now in the documentation, are step-by-step instructions allowing you to replicate this demo.

Thursday, August 14, 2014

"National Hosting" of archives

The LOCKSS team are working with some countries to build in-country Private LOCKSS Networks (PLNs) to preserve the content such as e-journals and e-books that they pay for. Other countries are considering outsourcing their national archive of this content to foreign providers. One of the questions that countries ask about these efforts is "where is the data stored?" Recent developments in the US and the UK mean that this is no longer the right question to ask. Follow me below the fold to find out what the right question has become.

Tuesday, August 12, 2014

TRAC Audit: Lessons

This is the third in a series of posts about CRL's TRAC audit of the CLOCKSS Archive. Previous posts announced the release of the certification report, and recounted the audit process. Below the fold I look at the lessons we and others can learn from our experiences during the audit.

Tuesday, August 5, 2014

TRAC Audit: Process

This is the second in a series of posts about CRL's audit of the CLOCKSS Archive. In the first, I announced the release of the certification report. In this one I recount the process of being audited and what we did during it. Follow me below the fold for a long story, but not as long as the audit process.

Update: the third post discussing the lessons to be drawn is here.

Monday, August 4, 2014

Post-Flash Solid State Storage Gets Real-er

HGST announced today that they are demonstrating an SSD that is based on Phase-Change Memory (PCM), one of the technologies competing to take over as flash runs out of steam. The selling point of the SSD is that it is extremely fast:
The demonstration shows unprecedented SSD performance levels that are achieved by utilizing a combination of HGST's new, latency-optimized interface protocols with next-generation non-volatile memory components.
The SSD demonstration utilizes a PCIe interface and delivers three million random read IOs per second of 512 bytes each when operating in a queued environment and a random read access latency of 1.5 microseconds (us) in non-queued settings, delivering results that cannot be achieved with existing SSD architectures and NAND Flash memories. This performance is orders of magnitude faster than existing Flash based SSDs, resulting in a new class of block storage devices. 
The SSD is based on 1Gb PCM chips. The new protocols that are needed to squeeze this performance out of PCIe were described by Dejan Vučinić et al in their paper DC Express: Shortest Latency Protocol for Reading Phase Change Memory over PCI Express at this year's FAST conference.
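The quoted figures can be turned into two more familiar numbers with a little arithmetic. The throughput follows directly from the announcement. The concurrency estimate uses Little's Law and assumes, purely for illustration, that the 1.5 microsecond non-queued latency also held under load, which the announcement does not claim.

```python
# Figures from HGST's announcement quoted above.
IOPS = 3_000_000             # random reads per second, queued
IO_SIZE_BYTES = 512
LATENCY_SECONDS = 1.5e-6     # random read latency, non-queued

# Aggregate data rate implied by the queued IOPS figure.
throughput_bytes_per_second = IOPS * IO_SIZE_BYTES

# Little's Law: requests in flight = arrival rate x time in system.
# Assumption: latency under load equals the non-queued latency.
in_flight = IOPS * LATENCY_SECONDS

print(f"Throughput: {throughput_bytes_per_second / 1e9:.3f} GB/s")
print(f"Implied requests in flight: {in_flight:.1f}")
```

So the headline number corresponds to about 1.5 GB/s of small random reads. On the stated assumption only a handful of requests would need to be outstanding at once to sustain it.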

Monday, July 28, 2014

TRAC Certification of the CLOCKSS Archive

The CLOCKSS Archive is a dark archive of e-journal and e-book content, jointly managed by publishers and libraries, implemented using the LOCKSS technology and operated on behalf of the CLOCKSS not-for-profit by the LOCKSS team at the Stanford Library. For well over a year the LOCKSS team and CLOCKSS management have been preparing for and undergoing the Trustworthy Repositories Audit and Certification (TRAC) process for the CLOCKSS Archive with the Center for Research Libraries (CRL).

CRL just released the Certification Report on the CLOCKSS Archive. I'm happy to report that our work was rewarded with an overall score that equals the previous best, and the first ever perfect score in the "Technologies, Technical Infrastructure, Security" category. We are grateful for this wonderful endorsement of the LOCKSS technology.

In the interests of transparency the LOCKSS team have released all the non-confidential documentation submitted during the audit process. As you will see, there is a lot of it. What you see at the link is not exactly what we submitted. It has been edited to correct errors and obscurities we found during the audit, and to add material from the confidential part of the submission that we decided was not really confidential. These documents will continue to be edited as the underlying reality changes, to keep them up-to-date and satisfy one of the on-going requirements of the certification.

This is just a news item. In the near future I will follow up with posts describing the process of being audited, what we did to make the process work, and the lessons we learned that may be useful for future audits.

Update: the post describing the audit process is here and the post discussing the lessons to be drawn is here.

Friday, July 25, 2014

Coronal Mass Ejections

In my talk What Could Possibly Go Wrong last April I referred to a paper on the 2012 Coronal Mass Ejection (CME) that missed Earth by only nine days:
Most of the information needed to recover from such an event exists only in digital form on magnetic media. These days, most of it probably exists only in "the cloud", which is this happy place immune from the electromagnetic effects of coronal mass ejections and very easy to access after the power grid goes down.
NASA has a post discussing recent research into CMEs which is required reading:
Analysts believe that a direct hit by an extreme CME such as the one that missed Earth in July 2012 could cause widespread power blackouts, disabling everything that plugs into a wall socket.  Most people wouldn't even be able to flush their toilet because urban water supplies largely rely on electric pumps.
An extreme CME called the "Carrington Event" actually did hit the Earth in September 1859:
Intense geomagnetic storms ignited Northern Lights as far south as Cuba and caused global telegraph lines to spark, setting fire to some telegraph offices and thus disabling the "Victorian Internet."
A similar storm today could have a catastrophic effect. According to a study by the National Academy of Sciences, the total economic impact could exceed $2 trillion or 20 times greater than the costs of a Hurricane Katrina. 
Not to worry, because:
In February 2014, physicist Pete Riley of Predictive Science Inc. published a paper in Space Weather entitled "On the probability of occurrence of extreme space weather events."  In it, he analyzed records of solar storms going back 50+ years.  By extrapolating the frequency of ordinary storms to the extreme, he calculated the odds that a Carrington-class storm would hit Earth in the next ten years.

The answer: 12%.
Only 12%. I'd say that CMEs need to be part of the threat model of digital preservation systems.
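To see what a 12% chance per decade means over the time horizons that matter for preservation, here is a minimal sketch. It assumes, purely for illustration, that the risk is the same and independent in every year; Riley's paper does not necessarily model it that way, and the function names are mine.

```python
# Probability of a Carrington-class hit in the next ten years, from the quote.
P_DECADE = 0.12


def annual_probability(p_period, years_in_period):
    """Constant per-year probability that compounds to p_period.

    Assumption: each year is independent and equally risky.
    """
    return 1.0 - (1.0 - p_period) ** (1.0 / years_in_period)


def probability_over(years, p_annual):
    """Chance of at least one event in the given number of years."""
    return 1.0 - (1.0 - p_annual) ** years


if __name__ == "__main__":
    p_year = annual_probability(P_DECADE, 10)
    print(f"Implied annual probability: {p_year:.2%}")
    for horizon in (10, 50, 100):
        print(f"At least one event in {horizon} years: "
              f"{probability_over(horizon, p_year):.1%}")
```

Under that assumption the annual risk is a little over 1%, and the chance of at least one such event in 50 years is close to a coin flip.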

Tuesday, July 1, 2014

Discounting the far future

In 2011 Andrew Haldane and Richard Davies of the Bank of England (HD) presented research showing that, when making investment decisions, investors applied discount rates much higher than the prevailing interest rates, and that this gap was increasing through time. One way of looking at their results was as an increase in short-termism; investors were increasingly reluctant to make investments with a long-term payoff. This reluctance clearly has many implications, including making dealing with climate change even more difficult. Their work has influenced our efforts to build an economic model of long-term storage, another area where the benefits accrue over a long period of time.

Now, Stefano Giglio of the Booth School and Matteo Maggiori and Johannes Stroebel of the Stern School (GMS) have a post entitled Discounting the very distant future announcing a paper entitled Very Long-Run Discount Rates. Their work, at first glance, seems to contradict HD. Below the fold, I look into this apparent disagreement.

Tuesday, June 24, 2014


For a long time there have been a number of possible "holy grails" for digital preservation, ideas that, if they could be implemented, would transform the problem. One of them has been the idea of an Internet-scale peer-to-peer network that would use excess disk storage on everyone's computers, in the same way that networks like Folding@Home use excess CPU, to deliver a robust, attack-resistant, decentralized storage infrastructure. Intermemory, from NEC's Princeton lab in 1998, was one of the first, but the concept is so attractive that there have been many others, such as Berkeley's Oceanstore. None have succeeded in attracting the mass participation of projects such as Folding@Home. None have become a widely-used infrastructure for digital preservation because without mass participation none provides the needed robustness or capacity.

By far the most successful peer-to-peer network in attracting participation has been Bitcoin, because the reward for participation is monetary. Now, it seems to me that Andrew Miller and his co-authors from the University of Maryland and Microsoft Research have taken a giant step towards this "holy grail" with their paper Permacoin: Repurposing Bitcoin Work for Data Preservation (hereafter MJSPK). This is despite the fact that, as I predicted in a comment last April, the current Bitcoin implementation has now definitively failed in its goal of establishing a decentralized currency because GHash has, for extended periods, controlled an absolute majority of the mining power. Follow me below the fold for my analysis of Permacoin and how this failure affects it.

Friday, June 20, 2014

X Window System turns 30

Yesterday was the 30th anniversary of Bob Scheifler's announcement of the first release of the X Window System. Congratulations to Bob and the other pioneers! That doesn't include me - I started work at Sun on X a couple of years later by doing the first port (of Version 10) to non-DEC hardware.

Thursday, June 19, 2014

More on long-lived media

I've already written skeptically about the concept of quasi-immortal media as a solution to the problem of digital preservation. But the misplaced enthusiasm continues. The latest wave surrounds Facebook's prototype Petabyte Blu-Ray jukebox; one of its touted features was that the media had a 50-year life. The prototype is extraordinarily interesting, and I hope to write more about it soon. But I doubt Facebook or anyone expects that the hardware will still be in use in 10 years, let alone 50. After all, you can search any large-scale data center in vain for 10-year-old hardware. So why is a 50-year media life interesting in this application? Follow me below the fold for yet another dose of skepticism.

Tuesday, June 17, 2014

Digital New York Times

More than four years ago Marc Andreessen gave a talk at Stanford's Business School in which, among many other interesting topics, he talked about the problems the New York Times had dealing with digital media. The recently leaked NYT Innovation Report 2014, the result of a six-month review headed by the Times' heir apparent, shows how prescient Andreessen was. Below the fold, some evidence.

Tuesday, June 3, 2014

Rare good news on Intellectual Property

Two recent pieces of good news on the intellectual property front:
  • The Hargreaves process in the UK has resulted in copyright reform legislation which has passed Parliament and is due to receive Royal Assent this month. Among the numerous improvements it contains are that data mining is included in the right to access, that libraries can make fair dealing copies for their readers, and that sound and video are treated the same as text in most cases. Of particular importance is that in most cases contracts will not be able to override these permissions.
  • In the recent Octane Fitness case, the US Supreme Court changed the rules for awards of fees in patent cases to deter patent trolls. The first such case has just been decided, and an obvious patent troll has not merely lost the case, but has had to pay for the victim's lawyers! Congratulations to FindTheBest CEO Kevin O'Connor for fighting back. Here's hoping that the loss of his RICO suit against the troll can be reversed on appeal.

Friday, May 23, 2014

Bezos' Law

Greg O'Connor had a piece at Gigaom entitled Moore's Law Gives Way To Bezos' Law sparked by the recent price cuts by Google and Amazon in which he claimed that:
The latest cuts make it clear there’s a new business model driving cloud that is every bit as exponential in growth — with order of magnitude improvements to pricing — as Moore’s Law has been to computing.
If you need a refresher, Moore’s Law is “the observation that, over the history of computing hardware, the number of transistors on integrated circuits doubles approximately every two years.” I propose my own version, Bezos’s law. Named for Amazon CEO Jeff Bezos, I define it as the observation that, over the history of cloud, a unit of computing power price is reduced by 50 percent approximately every three years.
Both Moore's and Kryder's laws held for multiple decades. Below the fold I ask whether a putative Bezos' law could be equally long-lived.
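As a rough illustration of what O'Connor's definition implies, the sketch below converts "reduced by 50 percent approximately every three years" into an annual rate and compounds it over a multi-decade span. The only input is the halving period from the quote; the function names and the 30-year horizon are my own choices.

```python
# Halving period from O'Connor's definition of "Bezos's law".
HALVING_PERIOD_YEARS = 3.0


def annual_decline(halving_period_years):
    """Fractional price drop per year implied by a fixed halving period."""
    return 1.0 - 0.5 ** (1.0 / halving_period_years)


def relative_price(years, halving_period_years):
    """Price after the given number of years, as a fraction of today's."""
    return 0.5 ** (years / halving_period_years)


if __name__ == "__main__":
    drop = annual_decline(HALVING_PERIOD_YEARS)
    print(f"Implied annual price decline: {drop:.1%}")
    # Illustrative horizon: three decades, comparable to how long
    # Moore's and Kryder's laws held.
    print(f"Relative price after 30 years: "
          f"{relative_price(30, HALVING_PERIOD_YEARS):.6f}")
```

A three-year halving period works out to a decline of roughly 20% per year, and sustaining it for 30 years would mean prices about a thousand times lower than today's.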

Wednesday, May 21, 2014

DAWN is breaking

I posted last October on Seagate's announcement of Kinetic, their object storage architecture for Ethernet-connected hard drives (and ultimately other forms of storage). This is a conservative approach to up-levelling the interface to storage media, providing an object storage architecture with a fixed but generally useful set of operations. In that way it is similar to, but less ambitious than, our proposed DAWN architecture.

The other half of the disk drive industry has now responded with a much more radical approach. Western Digital's HGST unit has announced Ethernet connected drives that run Linux. This approach has some significant advantages:
  • It sounds great as a marketing pitch.
  • It gets computing as close as possible to the data, which is the architecturally correct direction to be moving. This is something that DAWN does but Kinetic doesn't.
  • It will be easy to make HGST's drives compatible with Seagate's by running an implementation of the Kinetic protocol on them.
  • It provides a great deal of scope for researching and developing suitable protocols for communicating with storage media over IP.
But it is also very risky:
  • In many cases manufacturers find disks returned under warranty work fine; the cause of the failure was an unrepeatable bug in the disk firmware. Running Linux on the drive will provide a vastly increased scope for such failures, and make diagnosing them much harder for the manufacturer.
  • If the interface between the Linux and the drive hardware emulates the existing SATA or other interface, the benefits of the architecture will be limited to some extent. On the other hand, to the extent it exposes more of the hardware it will increase the risk that applications will screw up the hardware.
  • Kinetic's approach takes security of the communication with the drives seriously. HGST's "anything goes" approach leaves this up to the application.
On balance I think that HGST's acceptance that up-levelling the interface to media is important is a very positive development.

Thursday, May 15, 2014

Stored safe in the Cloud

Steve Kolowich at The Chronicle of Higher Education reports on a major outage and data loss on May 6 at Dedoose:
Dedoose, a cloud-based application for managing research data, suffered a “devastating” technical failure last week that caused academics across the country to lose large amounts of research work, some of which may be gone for good.
The crash nonetheless has dealt frustrating setbacks to a number of researchers, highlighting the risks of entrusting data to third-party stewards.
Below the fold, I look at what has been reported and discuss some of these risks.

Wednesday, May 14, 2014

Talk at Seagate

I gave a talk at Seagate entitled:
Storage Will Be
Much Less Free
Than It Used To Be
Below the fold is an edited text with links to the sources.

Tuesday, May 13, 2014

Named Data Networking gets major grant

Gigaom explains some great news for the future of the Internet from yesterday. The Named Data Networking project is one of three projects originally funded under the NSF's Future Internet Architecture program to share a $15M grant to support trial deployments of their new architectures.

Named Data Networking is inspired by Van Jacobson's work on Content-Centric Networking (CCN), which continues. They just announced a further code release. I explained the importance of CCN for digital preservation in a long blog post early last year.

Tuesday, May 6, 2014

On the Economics of Throwing Stuff Away

I've been arguing for some time that storing bits will be a lot less free than it used to be. The Big Data zealots who say:
Save it all—you never know when it might come in handy for a future data-mining expedition.
will have to adapt to this new reality. Below the fold I look at possible adaptations.

Tuesday, April 29, 2014

Another endowed data service launches

A friend pointed me to a Wired piece on Longaccess, a new endowed data service for archiving personal files. Below the fold I look at their numbers.

Monday, April 21, 2014

Skeptical about emulation?

If you're skeptical about two trends I've been pointing to, the rapid rise of emulation technology, and the evolution of the Web's language from HTML to JavaScript, you need to watch Gary Bernhardt's video that fell through a time-warp from 2035.

Also, at the recent EverCloud workshop Mahadev Satyanarayanan, my colleague from the long-gone days of the Andrew Project, gave an impressive live demo of C-MU's Olive emulation technology. The most impressive part was that the emulations started almost instantly, despite the fact that they were demand-paging over the hotel's not super-fast Internet.

Thursday, April 17, 2014

Henry Newman on HD vs. SSD Economics

Henry Newman has an excellent post entitled SSD vs. HDD Pricing: Seven Myths That Need Correcting. His seven myths are:
  • First, some assume that the price of MLC NAND flash will continue to decrease at a rapid and predictable rate that will make it competitive with HDDs for bandwidth, and nearly for capacity, by 2014 or 2015. This downward trend, it is assumed, will make flash a viable alternative for large storage and to act as a memory or “buffer” to improve performance.
  • Second, there is a general assumption that prices for bandwidth ($/GB/s) for SSDs is much lower than for HDDs, and that enterprises will measure costs in these terms instead of capacity.
  • Third, there is no distinction made between flash in general, such as consumer SSDs, and enterprise storage SSDs. It is assumed that MLC NAND will not only reduce in price ($/GB) but also that it will increase in density and larger capacity drives will be developed.
  • Fourth, it is assumed that the quality of MLC NAND will either remain constant or increase as prices decrease and densities increase, allowing it to improve not only performance, but also reliability and power consumption of the systems it is used in.
  • Fifth, it is assumed that power consumption for SSDs is, or will shortly be, significantly lower than that of HDDs overall, on a per GB basis and on a per GB/s basis.
  • Sixth, they assume disk performance will grow at a constant rate of about 20 percent per generation and not improve.
  • Seventh, they assume file system data layout will not improve to allow better disk utilization.
Henry is looking at the market for performance storage, not for long-term storage, but given that limitation I agree with nearly everything he writes. However, I think there is a simpler argument that ends up at the same place that Henry did:
  • Flash can do everything that hard disk can, but there are many markets where hard disk cannot do what flash can do.
  • The supply of both flash and hard disk is constrained. Flash is constrained because investing in new flash fabs would not be profitable, especially given the obviously limited scope for shrinking flash cells. Hard disk is constrained because the market is effectively a duopoly, and both players are struggling to transition from the current PMR technology to HAMR.
  • Thus flash will command a premium over hard disk prices so that the market directs the limited supply of flash to those applications, such as tablets, smartphones, and high-performance servers, where its added value is highest.

Monday, April 7, 2014

What Could Possibly Go Wrong?

I gave a talk at UC Berkeley's Swarm Lab entitled "What Could Possibly Go Wrong?" It was an initial attempt to summarize for non-preservationistas what we have learnt so far about the problem of preserving digital information for the long term in the more than 15 years of the LOCKSS Program. Follow me below the fold for an edited text with links to the sources.

Wednesday, April 2, 2014

EverCloud workshop

I was invited to a workshop sponsored by ISAT/DARPA entitled The EverCloud: Anticipating and Countering Cloud-Rot that arose from Yale's EverCloud project. I gave a brief statement on an initial panel; an edited text with links to the sources is below the fold.

Monday, March 31, 2014

The Half-Empty Archive

Cliff Lynch invited me to give one of UC Berkeley iSchool's "Information Access Seminars" entitled The Half-Empty Archive. It was based on my brief introductory talk at ANADP II last November, an expanded version given as a staff talk at the British Library last January, and the discussions following both. An edited text with links to the sources is below the fold.

Friday, March 28, 2014


We were asked if the CLOCKSS Archive uses PREMIS metadata. The answer is no, and a detailed explanation is below the fold.

Tuesday, March 18, 2014

Krste Asanović Keynote at FAST14

The standout presentation at Usenix's FAST conference this year was Krste Asanović's keynote on UC Berkeley's ASPIRE Project. His introduction was:
The first generation of Warehouse-Scale Computers (WSC) built everything from commercial off-the-shelf (COTS) components: computers, switches, and racks. The second generation, which is being deployed today, uses custom computers, custom switches, and even custom racks, albeit all built using COTS chips. We believe the third generation of WSC in 2020 will be built from custom chips. If WSC architects are free to design custom chips, what should they do differently?
There is much to think about in the talk, which stands out because it treats the entire stack, from hardware to applications, in a holistic way. It is well worth your time to read the slides and watch the video. Below the fold, I have some comments.

Monday, March 17, 2014

Seagate's Kinetic hard drives

I was impressed by Seagate's announcement last October of their Kinetic Open Storage Platform and blogged about it at the time. I should have paid more attention. My ex-colleague at Sun, Geoff Arnold, who knows far more than I do about scale-out systems (he worked at Amazon) also blogged about the announcement, and his post is so worth reading that it got over 40K hits in the first two weeks! You should join the crowd.

And this week Seagate and the German storage firm Rausch announced at CeBIT the first storage system product I've seen based on Kinetic drives, the BigFoot Object Storage Solution. A rack of these 4U units would hold 2.8PB.

Wednesday, March 12, 2014

Dan Geer at RSA

Dan Geer gave a must-read talk at the recent RSA conference. Dan is especially strong on the fragility of the systems upon which society is coming to depend, a theme that also ran through my friend Dewayne Hendricks' recent talk at Stanford's EE380 seminar series. Below the fold, I quibble with one part of Dan's talk to show how persistent projections based on exponential growth are in the face of facts.

Wednesday, March 5, 2014

Windows XP

The idea that format migration is integral to digital preservation was for a long time reinforced by people's experience of format incompatibility in Microsoft's Office suite. Microsoft's business model used to depend on driving the upgrade cycle by introducing gratuitous forward incompatibility, new versions of the software being set up to write formats that older versions could not render. But what matters for digital preservation is backwards incompatibility; newer versions of the software being unable to render content written by older versions. Six years ago the limits of Microsoft's ability to introduce backwards incompatibility were dramatically illustrated when they tried to remove support for some really old formats.

The reason for this fiasco was that Microsoft greatly over-estimated its ability to impose the costs of migrating old content on its customers, and under-estimated those customers' ability to resist. Old habits die hard. Microsoft is trying to end support of Windows XP and Office 2003 on April 8 but it isn't providing cost-effective upgrade paths for what is now Microsoft's fastest-growing installed base. Joel Hruska writes:
Microsoft has come under serious fire for some significant missteps in this process, including a total lack of actual upgrade options. What Microsoft calls an upgrade involves completely wiping the PC and reinstalling a fresh OS copy on it — or ideally, buying a new device. Microsoft has misjudged how strong its relationship is with consumers and failed to acknowledge its own shortcomings. Not providing an upgrade utility is one example — but so is the general lack of attractive upgrade prices or even the most basic understanding of why users haven't upgraded.
This resistance to change has obvious implications for digital preservation.

Wednesday, February 19, 2014

Talk at FAST BoF

I gave a talk during a Birds-of-a-Feather session on "Long-Term Storage" at Usenix's FAST (File And Storage Technologies) meeting. Below the fold is an edited text with links to the sources.

Thursday, February 6, 2014

Worth Reading

I'm working on a long post about a lot of interesting developments in storage, but right now they are happening so fast I have to keep re-writing it. In the meantime, follow me below the fold for links to some recent posts on other topics that are really worth reading.

Thursday, January 30, 2014

Amazon's Q4 2013 Results

Jack Clark at The Register estimates that Amazon's cloud computing business put over a billion dollars on the bottom line in Q4 2013. The competition was left in the dust:
This compares with a claim by Microsoft that its Azure cloud wing was a billion-dollar business when measured on an annual basis, and Rackspace's most recent quarterly earnings of $108.4m for its public cloud. Google also operates its own anti-Amazon cloud products via Google App Engine and Google Compute Engine, but doesn't break out revenue in a meaningful format.
Of course, much of this profit comes from selling computing rather than storage, but this is further evidence against the idea that "the cloud is cheaper". Cloud services can save money in a situation of spiky demand, but for base-load tasks such as preservation they are uneconomic.

Monday, January 13, 2014

Economics of the PC Market

Charles Arthur has an interesting, well-researched piece at The Guardian detailing the terrible economics faced by makers of Windows PCs, and the resulting threat to Microsoft posed by Chromebooks:
The PC business is in a slump which has seen year-on-year shipments (and so sales) of Windows PCs fall for five (imminently, six) quarters in a row, after seven quarters where they barely grew by more than 2%.
And it's not only growth that's fallen. Analysis by the Guardian suggests that as well as falling sales, the biggest PC manufacturers now have to contend with falling prices and dwindling margins on the equipment they sell.
In the first quarter of 2010, the weighted average profit per PC was $15.71 - a 2.55% margin. (So the overall per-PC cost of manufacture, sales and marketing was just under $599.)
So much so that by the third quarter of 2013, the weighted average profit had fallen to $14.87. That actually marks an improvement in margin, to 2.73%
It has been true for a long time that Microsoft made more money from each Windows PC than the makers.
The most obvious beneficiary of every Windows PC sale is Microsoft. It gets revenue from the sale of the Windows licence - but it then captures extra value through the high likelihood that even consumer buyers of PCs will buy its Office suite, and probably buy another version of Windows at some point in that computer's life. It's the reason why Microsoft is so fabulously profitable, while PC manufacturers are struggling.
The makers used to get enough to live on. Now they don't. Selling Chromebooks is a way of cutting Microsoft out of the picture.

Friday, January 10, 2014

Alex Stamos at EE380

Alex Stamos gave an excellent talk yesterday in Stanford's EE380 course. The video is linked from the EE380 schedule page. His title was Building a Trustworthy Business in the Post-Snowden Era, and the talk was based on analyzing the source material that has been released, rather than the media interpretation of those materials. The video is well worth your time to watch because, as Alex says, even if you are sure you will never do anything to attract the attention of the NSA:
  • You have to assume that, in a few years, many of the capabilities the NSA has today will be available in the market for exploits and be usable by the average bad guy.
  • Among the few products whose markets the US still dominates are Internet services and networking hardware. Success in these markets depends heavily on trust, and the revelations have destroyed this trust.
  • In particular, you have to assume that much of the software on which the integrity of your archive depends has backdoors inserted at the request of the three-letter agencies.
More generally, Robert Putnam in Making Democracy Work and Bowling Alone has shown the vast difference in economic success between high-trust and low-trust societies. The way the revelations have been able to repeatedly disprove successive Government denials is, together with the too-big-to-jail banksters, a serious threat to the US and other developed nations remaining high-trust societies. So even if you think you don't care about this stuff, you do.

Matt Blaze's piece in The Guardian is well worth a read too.

Saturday, January 4, 2014

Threat Model for Archives

Discussing the recent vulnerability in the Bitcoin protocol, I pointed out that:
One of the key ideas of the LOCKSS system was to decentralize custody of content to protect against powerful adversaries' attempts to modify it. Governments and other powerful actors have a long history of censorship and suppression of inconvenient content. A centralized archive system allows them to focus their legal, technical or economic power on a single target.
Today Boing-Boing points us to a current example of government suppression of inconvenient content that drives home the point.
Scientists say the closure of some of the world's finest fishery, ocean and environmental libraries by the Harper government has been so chaotic that irreplaceable collections of intellectual capital built by Canadian taxpayers for future generations has been lost forever.
Many collections such as the Maurice Lamontagne Institute Library in Mont-Joli, Quebec ended up in dumpsters while others such as Winnipeg's historic Freshwater Institute library were scavenged by citizens, scientists and local environmental consultants. Others were burned or went to landfills, say scientists.
Read the whole piece, especially if you think single, government-funded archives are a solution to anything.

Wednesday, January 1, 2014

Implementing DAWN?

In a 2009 paper "FAWN: A Fast Array of Wimpy Nodes" David Andersen and his co-authors from C-MU showed that a network of large numbers of small CPUs coupled with modest amounts of flash memory could process key-value queries at the same speed as the networks of beefy servers used by, for example, Google, but using two orders of magnitude less power. In 2011, Ian Adams, Ethan Miller and I proposed extending this concept to long-term storage in a paper called “Using Storage Class Memory for Archives with DAWN, a Durable Array of Wimpy Nodes”. DAWN was just a concept; we never built a system.

Now, in a fascinating talk at the Chaos Communication Congress called "On Hacking MicroSD Cards" the amazing Bunnie Huang and his colleague xobs revealed that much of the hardware for a DAWN system may already be on the shelves at your local computer store. Below the fold, details of the double-edged sword that is extremely low-cost hardware, to encourage you to read the whole post and watch the video of their talk.