Friday, September 6, 2013

"Preservation at Scale" at iPRES2013

I took part in the Preservation at Scale (PDF) workshop at iPRES2013. Below the fold is an edited text of my presentation, entitled Diversity and Risk at Scale, with links to the sources.

I'm David Rosenthal from the LOCKSS Program at the Stanford Libraries, which next month will celebrate its 15th anniversary. I'm going to start with the economics of operating at scale in general, drawing some lessons from general technology markets for the digital preservation business. Then I'll discuss some lessons we have learned as the various networks using the LOCKSS technology have scaled up in ways that are different from other digital preservation systems.

Technology markets are hostile to diversity. In 1994 the economist W. Brian Arthur published Increasing Returns and Path Dependence in the Economy. The two key concepts are in the title:
  • Technology markets normally have increasing returns to scale, meaning that they tend to be dominated by one big player.
  • Path dependence, otherwise known as the butterfly wing effect, means that random influences early in a market's development can be amplified by increasing returns to scale and determine the winner.
One obvious example of this phenomenon has been Amazon's dominance of the cloud computing market:
more than five times the compute capacity in use than the aggregate total of the other fourteen providers
I've been researching the economics of long-term digital storage for some time, and this has meant paying close attention to Amazon. Daniel Vargas and I documented our experiments running a LOCKSS box in Amazon's cloud at the last IDCC, and I presented work by a team including UC Santa Cruz, NetApp and Stony Brook (PDF) on the overall picture at UNESCO's Memory of the World in the Digital Age conference last year.

There are two parts to increasing returns to scale:
  • Network effects, otherwise known as Metcalfe's Law, which states that the value of a network goes as n²; a toy illustration follows this list. The more people look on Amazon when they want to buy, the more sellers it will attract, which makes it more likely that people will look there to buy. As far as I can see, no-one has identified strong network effects in the marketplace for digital preservation.
  • Economies of scale. It is clear that the massive scale at which Amazon, and other large Internet companies such as Google and Facebook, operate has provided them with very large economies of scale. For them, even small innovations in reducing costs such as system administration and energy consumption translate into large dollar savings. This makes it worth, for example, designing and building custom hardware, software and even data centers.
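
Metcalfe's Law is easy to see with a toy calculation. The sketch below only illustrates the n² effect; the marketplace sizes are invented:

    # Toy illustration of Metcalfe's Law: value of a network ~ n^2.
    # The user counts are invented purely for illustration.

    def metcalfe_value(n):
        """Relative value of a network with n participants."""
        return n * n

    big, small = 1_000_000, 100_000  # hypothetical marketplaces, 10x apart in size

    print(metcalfe_value(big) / metcalfe_value(small))  # 100.0: a 10x size lead is a 100x value lead

A tenfold lead in participants becomes a hundredfold lead in value, which is why such markets tip towards a single winner.
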
Conventional wisdom is that there are large economies of scale in digital preservation. Here, for example, is Bill Bowen (a well-known economist, but with an axe to grind) enthusing about the prospects for Portico in 2005:
Fortunately, the economies of scale are so pronounced that if the community as a whole steps forward, cost per participant (scaled to size) should be manageable.
The idea that there are large economies of scale also leads to the recent enthusiasm for using cloud storage for preservation. Here is Duracloud from 2012:
give Duracloud users the ability to be in control of their content ... taking advantage of cloud economies of scale
But there are two important questions that are often overlooked:
  • Who benefits from the economies of scale, the customer or the supplier? Our research shows that, at least in cloud services, Amazon does. We're not alone in this conclusion: Cade Metz in Wired reports on the many startups figuring this out, and Jack Clark at The Register reports that even cloud providers concede the point. Amazon's edge is that they price against their customers' marginal or peak cost, whereas they aggregate a large number of peaks into a steady base load that determines their costs. Peak pricing and base-load costs mean killer margins.
  • Are the economies of scale monotonically increasing, and if not, how big do you need to be to get them? Backblaze, the PC backup service, publishes detailed cost breakdowns for their custom-designed storage hardware, which other institutions confirm. We use them to show that, even at the 100TB scale and with assumptions favorable to Glacier, doing it yourself is cheaper, despite Glacier's attention-grabbing 1c/GB/mo headline price (NB - read the small print). The reason is that, if Amazon is grabbing most of the economies of scale, you only have to capture more than the crumbs Amazon leaves on the table to come out a winner. A rough worked comparison follows this list.
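
To make the second point concrete, here is the shape of the back-of-the-envelope comparison. Only Glacier's 1c/GB/mo headline price comes from the discussion above; every other number in this sketch is an invented placeholder, not Backblaze's figures nor our actual model, and Glacier's small-print retrieval charges are ignored:

    # Back-of-the-envelope comparison of 100TB stored for 3 years.
    # Only the 1c/GB/mo Glacier headline price comes from the text;
    # every other number is an invented placeholder for illustration.

    TB = 1000                          # GB per TB (decimal, as storage vendors count)
    capacity_tb = 100
    months = 36

    glacier_price_per_gb_month = 0.01  # headline price; ignores retrieval fees
    glacier_cost = capacity_tb * TB * glacier_price_per_gb_month * months

    diy_hardware = 12_000              # placeholder: hardware purchase, lasting the 3 years
    diy_power_admin_per_month = 300    # placeholder: power, space and admin share
    diy_cost = diy_hardware + diy_power_admin_per_month * months

    print(f"Glacier headline: ${glacier_cost:,.0f}")   # $36,000
    print(f"DIY placeholder : ${diy_cost:,.0f}")       # $22,800

With these placeholders the do-it-yourself option wins comfortably; the point is that you only need to beat the crumbs Amazon leaves, not match Amazon's internal costs.
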
It is easy to assume that bigger is always better: that because economies of scale exist, scaling up your operations will always capture them. The Amazon example shows that the economies are captured, but not by you. And alongside the economies of scale there are dis-economies of scale, problems that arise or grow with scale, which you may well capture.

The Amazon example shows the most obvious dis-economy of scale; the concentration of market power allows the dominant supplier to extract monopoly rents. Even though much of digital preservation is not-for-profit, the same mechanism operates, as we see for example with JSTOR, ITHAKA's cash-cow.

But there are other, arguably more important dis-economies. The whole point of digital preservation is to find ways to mitigate risks to content. I have long argued that the most important risk is economic, in that no-one has the money to preserve even the stuff they think is important. So the economies of scale are important; they act to mitigate this risk. But they reduce diversity by causing content to concentrate in fewer and fewer hands, which increases many other risks, including the impact of an economic failure or an arbitrary decision of the dominant supplier.

Where do economies of scale come from? They come from stamping out diversity and imposing a mono-culture; in other words, from increasing the correlation between things. Differences cost money; applying the same operations to everything saves money. This is good if the operations are correct, bad if they are erroneous or being invoked by the bad guys. The flip side of the cost savings is increased risk from increased correlation. For example:
  • If all copies of all of the content are preserved by the same software, a bug in the software can affect all of it irrecoverably. If the software preserving the copies of the content is diverse, this is much less likely. Note that this applies not just to the top-level software, but to the libraries and operating system as well.
  • If all copies of all of the content are preserved in the same security domain, a breach can cause irrecoverable loss. For example, Stanford has a single-sign-on system for almost all University IT; a recent breach of this system compromised many of Stanford's systems. Fortunately, the LOCKSS team has never had the time (or the inclination) to integrate our systems with the single-sign-on system.
  • If all copies of all of the content are under a single administration, human error or a disgruntled administrator can cause irrecoverable loss.
A good example of an institution taking steps to mitigate these risks is the British Library, whose 4 (I believe) replica repositories are under completely separate administration, and are upgraded asynchronously. Note the resulting cost increase.
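
The effect of correlation on risk is easy to see with a minimal probability sketch. The per-period failure rate and the all-or-nothing correlation model below are invented for illustration; real failure processes are far messier:

    # Monte Carlo sketch: probability of losing ALL copies of an item,
    # with independent vs. perfectly correlated replica failures.
    # The 1% per-period failure rate is an invented illustration.
    import random

    P_FAIL, COPIES, TRIALS = 0.01, 3, 1_000_000

    def all_lost(correlated):
        if correlated:
            # one shared cause (e.g. a common software bug) takes out every copy together
            return random.random() < P_FAIL
        # otherwise each copy fails (or not) on its own
        return all(random.random() < P_FAIL for _ in range(COPIES))

    for correlated in (False, True):
        losses = sum(all_lost(correlated) for _ in range(TRIALS))
        print(f"correlated={correlated}: loss rate ~ {losses / TRIALS:.6f}")
    # independent copies: ~1e-6 (0.01**3); fully correlated copies: ~1e-2

Three independent copies turn a 1% risk into a one-in-a-million risk; three perfectly correlated copies are no better than one.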

One of the reasons the global paper library system was so successful at preserving information was its distributed, highly replicated but de-correlated and diverse architecture. The LOCKSS technology was consciously designed 15 years ago (PDF) to reproduce these attributes. The protocol that LOCKSS boxes currently use exploits these attributes to detect and repair damage while resisting attack even by a powerful adversary.
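
The real protocol uses nonced, sampled polls and a lot of other machinery to resist attack, so the sketch below is emphatically not the LOCKSS protocol; it only illustrates the basic detect-and-repair idea of replicas voting and the disagreeing minority repairing from the majority, using made-up box contents:

    # Drastically simplified sketch of detect-and-repair by voting among replicas.
    # This is NOT the LOCKSS protocol, which uses nonced, sampled polls to resist attack;
    # it only illustrates the idea that a disagreeing minority repairs from the majority.
    import hashlib
    from collections import Counter

    boxes = {                      # box name -> its copy of one preserved file (made up)
        "box-a": b"the canonical content",
        "box-b": b"the canonical content",
        "box-c": b"the canonical c0ntent",   # bit rot
    }

    def digest(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    votes = Counter(digest(content) for content in boxes.values())
    winning_hash, _ = votes.most_common(1)[0]
    good_copy = next(c for c in boxes.values() if digest(c) == winning_hash)

    for name, content in boxes.items():
        if digest(content) != winning_hash:
            print(f"{name} disagrees with the majority; repairing")
            boxes[name] = good_copy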

There are now at least 13 independent networks of LOCKSS boxes, with well over 200 boxes in libraries around the world, collecting and preserving everything from e-journals and e-books to government documents and social science datasets. The first of them was the Global LOCKSS Network (GLN), which consists of about 150 LOCKSS boxes collecting and preserving the e-journals to which the libraries subscribe, and open-access e-journals. About 62K volumes from about 600 publishers have been released for preservation.

The network preserving the most digital objects is the CLOCKSS Archive, a dark archive of (currently) 12 large LOCKSS boxes at institutions around the world. It is intended to preserve the entire output of e-journal and e-book publishers, currently about 125 of them. Each box is configured to preserve all the content publishers submit. Content is disseminated from this network only if it becomes unavailable from any publisher; so far 8 journals have been triggered in this way.

Between these two networks we have ingested well over 11M articles and 400K e-books. What lessons have we learned since the LOCKSS software went into production use in 2004? The most important one is that the relationship between diversity and risk is very complex.

Distribution at our current scale, together with the randomization that is an important feature of the LOCKSS technology, is an effective form of diversity. At any given time, each box is doing something different. If a box encounters a problem, it is very unlikely that any other box is encountering the same problem at the same time. We do not observe "convoys", in which the communication between peers in a network synchronizes their behavior. Convoys would be bad; they might lead, for example, to many boxes requesting the same content from a publisher at the same time. This might look like a Denial of Service attack.
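
The role randomization plays can be sketched with a toy example. The recrawl interval and jitter below are invented, and the real daemon's scheduling is far more involved:

    # Toy sketch: why randomized scheduling avoids "convoys".
    # Without jitter, every box would recrawl the same publisher at the same instant;
    # with jitter, their next-crawl times spread out. All numbers are invented.
    import random

    BASE_INTERVAL_HOURS = 24 * 7        # nominal weekly recrawl
    JITTER_FRACTION = 0.25              # up to +/-25% random spread

    def next_crawl_delay(jittered: bool) -> float:
        if not jittered:
            return BASE_INTERVAL_HOURS
        spread = BASE_INTERVAL_HOURS * JITTER_FRACTION
        return BASE_INTERVAL_HOURS + random.uniform(-spread, spread)

    boxes = [f"box-{i}" for i in range(150)]     # roughly the GLN's size
    synchronized = {next_crawl_delay(False) for _ in boxes}
    randomized = {round(next_crawl_delay(True), 1) for _ in boxes}

    print(f"distinct crawl times without jitter: {len(synchronized)}")  # 1 -> looks like a DoS
    print(f"distinct crawl times with jitter   : {len(randomized)}")    # many, spread over days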

On the other hand, this diversity of action means that one cannot infer what the network as a whole is doing from observing any one box. Monitoring the system involves collecting data from every box. A lot of data. As part of the work for our current grant from the Mellon Foundation, we have developed and are testing a third generation of our network monitoring technology. The database it builds from the GLN grows at nearly 10M rows/week. It takes Amazon's cloud to analyze this database.

For obvious reasons we can only sample the network's performance at this full level of detail intermittently. Here is a graph of 40K polls extracted from 18M data items collected from the CLOCKSS network over 5 weeks. The overwhelming majority of polls, at the right end, are fine. The only ones that might repay examination are the tiny minority at the left end. The long-tailed distribution means there is a significant signal-to-noise problem.
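
The filtering involved is conceptually simple; the sketch below uses a hypothetical record layout and a hypothetical 95% agreement threshold, not the monitoring system's actual schema:

    # Sketch of pulling the interesting tail out of a mass of poll results.
    # The record layout and the 95% agreement threshold are hypothetical.

    polls = [                                    # in reality, millions of rows in a database
        {"auid": "org|publisher|journal|vol1", "agreement": 1.00},
        {"auid": "org|publisher|journal|vol2", "agreement": 0.999},
        {"auid": "org|publisher|journal|vol3", "agreement": 0.62},   # worth a look
    ]

    THRESHOLD = 0.95

    suspect = [p for p in polls if p["agreement"] < THRESHOLD]
    print(f"{len(suspect)} of {len(polls)} polls fall below {THRESHOLD:.0%} agreement")
    for p in suspect:
        print("investigate:", p["auid"], f"{p['agreement']:.1%}")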

Everyone thinks logging the actions of a system is a good thing; the criteria for the TRAC audit we're currently undergoing certainly do. The LOCKSS daemon software has extensive and highly configurable logging capabilities that can provide an enormous amount of detailed diagnostic information about what each box is doing. We configure almost all of this logging off almost all the time, turning the relevant parts on only when we are diagnosing a specific problem. Keeping vast volumes of log data is both wasteful and counter-productive, since it buries the signal in the noise. We have implemented a separate system of "Alerts", which boxes send via e-mail when they notice something that deserves attention by humans.
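
The pattern is "quiet by default, verbose on demand, alert on anomaly". The sketch below illustrates it in generic Python; the module names, alert address and anomaly check are made up, and this is not the LOCKSS daemon's actual configuration syntax:

    # Sketch of the "quiet by default, verbose on demand, alert on anomaly" pattern.
    # Module names, the alert address and the anomaly check are all made up;
    # this is not the LOCKSS daemon's actual logging configuration.
    import logging
    from email.message import EmailMessage

    logging.basicConfig(level=logging.WARNING)            # almost everything off, almost all the time
    logging.getLogger("poller").setLevel(logging.DEBUG)   # turned on only while diagnosing a problem

    log = logging.getLogger("poller")
    log.debug("detailed diagnostic output, normally suppressed")

    def send_alert(subject: str, body: str) -> None:
        """Build an alert e-mail for something that deserves human attention."""
        msg = EmailMessage()
        msg["To"] = "lockss-admins@example.org"            # made-up address
        msg["Subject"] = f"ALERT: {subject}"
        msg.set_content(body)
        print(msg)                                         # a real box would hand this to an SMTP server

    disk_free_fraction = 0.03                              # pretend measurement
    if disk_free_fraction < 0.05:
        send_alert("disk nearly full", f"only {disk_free_fraction:.0%} free")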

The most important lesson about diversity is that it is inherent in the real world artifacts that we need to preserve. A lot of the time it shows up in long-tailed distributions. Faria et al (PDF) report:
With 21 large publishers we cover 50% of the journal titles listed by EBSCO.
But they also show that we have to face a huge long tail, with 80% of the publishing companies publishing only one title.
Preserving the big publishers' output doesn't cost a lot, because all their journals look much the same, and the counts of journals, articles, bytes and so on make it look like you're achieving a lot. In fact, the resources devoted to preserving the big publishers are essentially wasted. There are two reasons for preserving e-journals:
  • Post-cancellation access. Big publishers, with their "Big Deals", are very skilled at making sure that libraries cancel only smaller publishers. Even if this fails, big publishers provide post-cancellation access from their own systems. They understand that in the Web world, hits are worth money even if they aren't from a paying subscriber.
  • Preserving the record. Elsevier is older than the vast majority of libraries. They aren't going away. They have both a strong incentive, and the resources, to preserve and continue to provide access to an asset that puts around $1B on the bottom line each year. Even if they did go bankrupt, the journals would be purchased by another company, which would thereby acquire the same motivations.
The output of the big publishers isn't at risk. The output of the small publishers is at risk. Their business models are fragile. Much of their output is open access, invalidating one of the two reasons for preservation and, sadly, the one that librarians care about. And their output is diverse, making it expensive to ingest. It is expensive to get permission to ingest it, because one negotiation gets one journal. And it is expensive to ingest, because the technical work needed to configure the system for it and test that it has been done right, gets only one journal.
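
A rough sketch of the arithmetic makes the point; every dollar figure below is an invented placeholder, and what matters is the shape, not the numbers:

    # Why the long tail is expensive: fixed per-publisher costs amortize well for a
    # big publisher and not at all for a one-journal publisher.
    # Every dollar figure is an invented placeholder.

    NEGOTIATION = 5_000      # one permission negotiation per publisher
    CONFIG_AND_TEST = 2_000  # configuration and testing per publisher platform

    def cost_per_journal(journals: int) -> float:
        return (NEGOTIATION + CONFIG_AND_TEST) / journals

    print(f"2,000-journal publisher: ${cost_per_journal(2000):,.2f} per journal")  # $3.50
    print(f"    1-journal publisher: ${cost_per_journal(1):,.2f} per journal")     # $7,000.00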

Thus the belief in economies of scale, and the resulting disinclination to tackle diversity, have two malign effects:
  • It misdirects preservation resources to content that is at low risk, and away from content that is at high risk.
  • It reduces the effectiveness of the resources by deploying them in concentrated, highly correlated forms.
It is hard to see how to re-direct resources in a more productive way; perverse incentives and unsupported assumptions are everywhere you look.

I want to finish by asking the "Preservation at Scale" workshop a question: have we in fact achieved "scale"?
  • In 2010 the ARL reported that the median research library received about 80K serials. Stanford's numbers support this. The Keepers Registry, across its 8 reporting repositories, reports just over 21K serials "preserved" and about 10.5K "in progress", a total of roughly 31.5K. Thus under 40% of the median research library's serials are at any stage of preservation (the arithmetic is sketched after this list).
  • Luis Faria and co-authors (PDF) at this meeting compare information extracted from publishers' web sites with the Keepers Registry and conclude:
    We manually repeated this experiment with the more complete Keepers Registry and found that more than 50% of all journal titles and 50% of all attributions were not in the registry and should be added.
  • Scott Ainsworth and his co-authors tried to estimate the probability that a publicly-visible URI was preserved, as a proxy for the question "How Much of the Web is Archived?" Their results are somewhat difficult to interpret, but for their two more-random samples they report:
    URIs from search engine sampling have about 2/3 chance of being archived [at least once] and bit.ly URIs just under 1/3.
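
Pulling the numbers in this list together gives a rough sense of the gap; the sketch below simply restates the figures quoted above:

    # Pulling together the numbers cited above: how far from "scale" are we?
    # All inputs are the figures quoted in the text.

    median_arl_serials = 80_000
    keepers_preserved = 21_000
    keepers_in_progress = 10_500

    coverage = (keepers_preserved + keepers_in_progress) / median_arl_serials
    print(f"serials at any stage of preservation: {coverage:.0%}")          # ~39%
    print(f"factor away from full coverage      : {1 / coverage:.1f}x")     # ~2.5x

    # Ainsworth et al.'s samples tell a similar story for the public Web:
    for name, archived in (("search-engine sample", 2 / 3), ("bit.ly sample", 1 / 3)):
        print(f"{name}: about {archived:.0%} of URIs archived at least once")
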
Thus it seems likely that across the whole spectrum we're currently at least a factor of 2, and probably much more, away from operating at the scale we need.
