- to outline the way in which current storage architectures fail to meet the needs of long-term archives,
- and to set out what an architecture that would meet those needs would look like.
What is the biggest problem facing digital archiving efforts today? It is, as I discussed in The Half-Empty Archive, that we aren't preserving enough of the stuff we should be preserving.
Measuring what proportion of the stuff that should be preserved actually is being preserved is hard; as Cliff Lynch has pointed out, neither the numerator nor the denominator is easy to measure. Nevertheless, attempts have been made in two areas in which at least the denominator is fairly well-known:
- In 2010 the ARL reported that the median research library received about 80K serials. Stanford's numbers support this. The Keepers Registry, across its 8 reporting repositories, reports just over 21K "preserved" and about 10.5K "in progress". Thus under 40% of the median research library's serials are at any stage of preservation.
- Luis Faria and co-authors (PDF) compare information extracted from publishers' web sites with the Keepers Registry and conclude:
We manually repeated this experiment with the more complete Keepers Registry and found that more than 50% of all journal titles and 50% of all attributions were not in the registry and should be added.
- Scott Ainsworth and his co-authors tried to estimate the probability that a publicly-visible URI was preserved, as a proxy for the question "How Much of the Web is Archived?" They generated lists of "random" URLs using several different techniques, including sending random words to search engines and random strings to the bit.ly URL shortening service. They then:
- tried to access the URL from the live Web.
- used Memento to ask the major Web archives whether they had at least one copy of that URL.
URIs from search engine sampling have about 2/3 chance of being archived [at least once] and bit.ly URIs just under 1/3.
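As an aside, the shape of this experiment is easy to sketch in code. The following is only an illustration: the aggregator endpoint, the sample URIs and the helper names are placeholders I have invented, not part of Ainsworth et al.'s study; a real reproduction would query a Memento TimeMap aggregator.

```python
# Sketch of the sampling experiment: for each sampled URI, check whether it is
# still on the live Web, then ask a Memento aggregator whether any web archive
# holds at least one capture. The aggregator URL below is a placeholder, not a
# real service.
import requests

AGGREGATOR = "https://aggregator.example/timemap/link/"  # hypothetical endpoint

def is_live(uri, timeout=10):
    """True if the URI still resolves on the live Web."""
    try:
        return requests.head(uri, timeout=timeout, allow_redirects=True).ok
    except requests.RequestException:
        return False

def is_archived(uri, timeout=10):
    """True if the aggregator reports at least one memento (capture) of the URI."""
    try:
        resp = requests.get(AGGREGATOR + uri, timeout=timeout)
        return resp.status_code == 200 and "memento" in resp.text
    except requests.RequestException:
        return False

if __name__ == "__main__":
    sample = ["http://example.com/", "http://example.org/some/page"]  # stand-in URIs
    archived = sum(is_archived(u) for u in sample)
    print(f"{archived}/{len(sample)} of the sample is archived at least once")
```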
Even numbers like these paint too rosy a picture of the state of digital preservation, for several reasons:
- It isn't risk-adjusted. We are preserving the stuff that isn't at risk and not preserving the stuff that will get lost.
- It isn't adjusted for difficulty. We only preserve the easy stuff.
- It is backwards-looking. We preserve content the way it used to be delivered, not the way it is being delivered now.
- The measurements are biased.
There are three possible responses:
- We can do nothing, in which case we needn't worry about bit rot, format obsolescence, and all the other risks any more, because they only lose a few percent. The reason why more than 50% of the stuff won't make it to future readers would be "can't afford to preserve".
- We can more than double the budget for digital preservation. This is so not going to happen; we will be lucky to sustain current funding levels. Worse, the cost per unit content is projected to rise, so the proportion of stuff lost to "can't afford to preserve" will rise too.
- We can more than halve the cost per unit content. It is safe to say that we won't do this with current system architectures.
Where does the money go in current architectures? This is the subject of a good deal of research, but precise conclusions are hard to come by. My rule of thumb to summarize most of the research is that, in the past, approximately half the cost was in ingest, one-third in preservation (mostly storage) and one-sixth in access. Of the preservation cost, roughly one-third was media and two-thirds running costs. In The Half-Empty Archive I made projections of these costs going forward:
- Ingest costs will rise, both because we've already ingested the easy stuff, and because the evolution of the Web is making ingest more difficult.
- Preservation costs will rise, because Kryder's Law has slowed dramatically.
- Access costs will rise. In the past, access to archives was typically rare and sparse. Increasingly, scholars want to data-mine across large collections, not just access a few individual objects. Current web archives struggle to provide basic keyword search, let alone data-mining.
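To make the rule of thumb above concrete, here is a toy worked example; the numbers simply restate the fractions, they don't come from any particular cost study.

```python
# Toy illustration of the rule-of-thumb lifecycle cost split described above.
# The figures are illustrative only, not taken from any particular cost study.
budget = 100.0                        # total lifecycle cost, arbitrary units

ingest       = budget / 2             # ~half of the cost is ingest
preservation = budget / 3             # ~one-third is preservation (mostly storage)
access       = budget / 6             # ~one-sixth is access

media   = preservation / 3            # ~one-third of preservation is media
running = preservation * 2 / 3        # ~two-thirds is power, space and admin

print(f"ingest {ingest:.1f}  preservation {preservation:.1f}  access {access:.1f}")
print(f"media {media:.1f} ({media / budget:.0%} of total)  "
      f"running {running:.1f} ({running / budget:.0%} of total)")
# Media is only ~11% of lifecycle cost, so cheaper media alone cannot come
# close to halving the cost per unit content; running costs matter more.
```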
Looking forward, what the primary copy needs is a storage architecture that provides data-mining capabilities with moderate performance, reasonable reliability, and very low running costs. Capital cost can be somewhat higher than, for example, Backblaze's boxes, provided running costs are very low. Unfortunately, the funding model for most archives makes trading higher capital cost for lower running costs hard; the payback time for the investment must be short.
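A back-of-the-envelope sketch shows why the payback-time constraint bites; all the numbers here are assumptions chosen only to illustrate the shape of the trade-off.

```python
# Back-of-the-envelope comparison of two storage fabrics over their life:
# conventional (cheap to buy, expensive to run) vs. low-power (the reverse).
# All numbers are illustrative assumptions, not measurements.
def lifetime_cost(capital, running_per_year, years):
    return capital + running_per_year * years

def payback_years(extra_capital, running_savings_per_year):
    """Years until the extra capital outlay is recovered by lower running costs."""
    return extra_capital / running_savings_per_year

conventional = lifetime_cost(capital=100, running_per_year=30, years=10)  # 400
low_power    = lifetime_cost(capital=150, running_per_year=10, years=10)  # 250

print("10-year cost, conventional:", conventional)
print("10-year cost, low-power:   ", low_power)
print("payback on the extra capital:", payback_years(150 - 100, 30 - 10), "years")  # 2.5
# Even a clear lifetime win is rejected by a funding model that demands the
# investment pay for itself within, say, a single annual budget cycle.
```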
What would such an architecture look like? Let's work backwards from access. We need to run queries against the data. We can either move the data to the query or the query to the data. Current architectures pair powerful CPUs with many terabytes of storage each, and thus require expensive, power-sucking I/O hardware to move the data to the query.
Five years ago David Andersen and co-authors at CMU took the opposite approach. They published FAWN: A Fast Array of Wimpy Nodes, showing that a large number of very low-power CPUs, each paired with a small amount of flash memory, could run queries as fast as the conventional architecture using two orders of magnitude less power. Three years ago Ian Adams, Ethan Miller and I were inspired by this to publish DAWN: A Durable Array of Wimpy Nodes. We showed that, using a similar fabric for long-term storage, the running costs would be low enough to overcome the much higher cost of the flash media as compared to disk. And it would run the queries better than disk.
Disk is not a good medium for a move-the-query-to-the-data architecture. Here, from an interesting article at ACM Queue, is a much prettier graph explaining why. It plots the time it would take to write (approximately equals read) the entire content of a state-of-the-art disk against the year. Although disks have been getting bigger rapidly, they haven't been getting correspondingly faster. In effect, the stored data has been getting further and further away from the code. There's a fundamental reason for this - the data rate depends on the inverse of the diameter of a bit, but the capacity depends on the inverse of the area of a bit. The reason that FAWN-like systems can out-perform traditional PCs with conventional hard disks is that the bandwidth between the data and the CPU is so high and the amount of data per CPU so small that it can all be examined in a very short time.
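The scaling argument can be made concrete: if the linear size of a bit shrinks by a factor f, capacity grows as the square of f but the data rate only linearly in f, so whole-drive scan time grows by f, i.e. with the square root of capacity. A rough sketch, with assumed starting numbers:

```python
# Time to read an entire drive scales as capacity / data-rate. If the linear
# size of a bit shrinks by a factor f, capacity grows by f**2 but the data
# rate only by f, so whole-drive scan time grows by f, i.e. with the square
# root of capacity. The starting point below is an assumption for illustration.
def scan_hours(capacity_tb, rate_mb_per_s):
    return capacity_tb * 1e6 / rate_mb_per_s / 3600

base_capacity_tb, base_rate = 1.0, 100.0   # assumed 1TB drive at 100MB/s
print(f"{base_capacity_tb:>4.0f} TB at {base_rate:.0f} MB/s -> "
      f"{scan_hours(base_capacity_tb, base_rate):.1f} h")

for f in (2, 4, 8):                        # successive shrinks of the bit size
    cap, rate = base_capacity_tb * f**2, base_rate * f
    print(f"{cap:>4.0f} TB at {rate:.0f} MB/s -> {scan_hours(cap, rate):.1f} h")
# The scan time doubles at each step: the data keeps getting 'further away'.
```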
A year and a half ago Micron announced a very small TLC (Triple-Level Cell) flash memory chip, 128Gb in a chip 12mm on a side, built in a 20nm process. It was very low cost but slow, with very limited write endurance. Facebook talked about using this TLC flash for cold data, where high speed and high write endurance aren't needed. Chris Mellor at The Register wrote:
There's a recognition that TLC flash is cheap as chips and much faster to access than disk or even, wash your mouth out, tape. Frankovsky, speaking to Ars Technica, said you could devise a controller algorithm that tracked cell status and maintained, in effect, a bad cell list like a disk drive's bad block list. Dead TLC flash cells would just be ignored. By knowing which cells were good and which were bad you could build a cold storage flash resource that would be cheaper than disk, he reckons, because you wouldn't need techies swarming all over the data centre replacing broken disk drives from the tens of thousands that would be needed.

At the recent Flash Memory Summit, a startup called NxGnData, founded two years after we proposed DAWN, announced a flash memory controller intended to pair with very low-cost Triple-Level-Cell flash such as Micron's to support both cold storage and "In-Situ Processing":
“The company believes that by moving computational tasks closer to where the data resides, its intelligent storage products will considerably improve overall energy efficiency and performance by eliminating storage bandwidth bottlenecks and the high costs associated with data movement.”

On average, the output of a query is much smaller than its input, otherwise what's the point? So moving queries and their results needs much less communication infrastructure, and so wastes much less power.
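As a minimal sketch of what move-the-query-to-the-data means in practice, consider the toy fabric below; every name and interface in it is invented for illustration, it is not NxGnData's design.

```python
# Minimal sketch of moving the query to the data: each wimpy node runs a small
# predicate against its locally attached objects and ships back only the hits.
# All class and function names here are invented for illustration.
class StorageNode:
    """A little CPU directly attached to a small shard of objects."""
    def __init__(self, objects):
        self.objects = objects                      # object_id -> bytes, held locally

    def run_query(self, predicate):
        """Evaluate the predicate where the data lives; return only matching ids."""
        return [oid for oid, blob in self.objects.items() if predicate(blob)]

# Four nodes, each holding 256 one-kilobyte objects (about 1MB in total).
nodes = [StorageNode({f"obj-{n}-{j}": bytes([j]) * 1024 for j in range(256)})
         for n in range(4)]

# Fan the query out; the result is a handful of ids, not the ~1MB of raw data
# a move-the-data-to-the-query design would have had to haul over the network.
hits = [oid for node in nodes for oid in node.run_query(lambda blob: blob[0] == 7)]
print(len(hits), "matching objects:", hits)
```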
The storage devices in architectures like these are actually (system-on-chip + media) modules. The storage industry has been moving in that direction, with first Seagate and then Western Digital announcing disk drives connected via Ethernet. Storage is part of the Internet of Things, and recent events such as the Doomed Printer show the need for a security-first approach.
What protocols are we going to talk over IP to the media? WD ducks that question; their drives run Linux. This provides ultimate flexibility, but it also gives the media an unpleasantly large attack surface. Seagate's answer is an object storage protocol, which provides a minimal attack surface but doesn't address the data-mining requirement; it's still a move-the-data-to-the-query architecture. NxGnData's presentation shows they've been thinking about this question:
The challenges:
- Protection against malicious code
- Fast runtime environment setup time
- Freedom in selecting programming language (C, Python, Perl...)
- Multi-tenant environments
- Security and data isolation

Rackspace and others have been thinking along the same lines too, and their ZeroVM virtualization technology may be a viable compromise. One possible approach is sketched below.
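One compromise, sketched below purely as an illustration of the challenges above (it is neither ZeroVM's nor NxGnData's design), is to refuse arbitrary code entirely and accept only a tiny declarative query that the node interprets itself, keeping the attack surface small:

```python
# Sketch: the node never executes tenant-supplied code, only a small
# declarative query built from a whitelist of operations it interprets itself.
ALLOWED_OPS = {
    "eq":       lambda field, value, obj: obj.get(field) == value,
    "contains": lambda field, value, obj: value in obj.get(field, ""),
}

def run_untrusted_query(query, objects):
    """query = {"op": ..., "field": ..., "value": ...}; anything else is rejected."""
    op = ALLOWED_OPS.get(query.get("op"))
    if op is None:
        raise ValueError("operation not permitted")
    return [oid for oid, obj in objects.items()
            if op(query["field"], query["value"], obj)]

objects = {
    "obj-1": {"mime": "text/html",       "body": "hello world"},
    "obj-2": {"mime": "application/pdf", "body": "annual report"},
}
print(run_untrusted_query({"op": "eq", "field": "mime", "value": "text/html"}, objects))
```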
In these architectures there is no separate computer and operating system mediating access to the media. There are no "drivers" hiding device-specific details. The media are talking directly to each other and the outside world. The only need for centralized computing is for monitoring and management, the sysadmin's UI to the storage fabric.
Also, the fabric itself must evolve; it has to be upgraded incrementally, and it will contain media from different generations and different vendors. We know that in the life of the fabric both of the current media technologies, disk and flash, will hit technological limits, but we aren't sure of the characteristics of the technologies that will replace them. If the architecture is to succeed, these considerations place some strong requirements on the media protocol (a sketch of such an interface follows the list):
- The media protocols must be open, long-lived industry standards; there is no good place to put adaptors between proprietary protocols.
- The protocols must be at the object storage level; they have to hide the specific details of how the object is represented in the medium so that they can survive technology transitions.
- The protocols must cover the essential archive functions: access, query, anti-entropy, media migration and storage management.
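To make that last requirement concrete, here is a sketch of the surface such an object-level protocol might expose; the method names are invented and don't correspond to Kinetic or any shipping protocol.

```python
# Sketch of the operations an open, object-level media protocol for archives
# would need to expose. Method names are invented for illustration only.
from abc import ABC, abstractmethod
from typing import Iterable

class ArchivalMediaProtocol(ABC):
    # access
    @abstractmethod
    def get(self, object_id: str) -> bytes: ...
    @abstractmethod
    def put(self, object_id: str, data: bytes) -> None: ...

    # query: run a restricted predicate where the data lives
    @abstractmethod
    def query(self, predicate: dict) -> Iterable[str]: ...

    # anti-entropy: prove stored objects are still intact without shipping them out
    @abstractmethod
    def audit(self, object_id: str, nonce: bytes) -> bytes: ...

    # media migration: stream objects to a successor device
    @abstractmethod
    def migrate(self, object_ids: Iterable[str], target: "ArchivalMediaProtocol") -> None: ...

    # storage management: health, capacity, wear
    @abstractmethod
    def status(self) -> dict: ...
```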
To sum up, the storage architecture archives need is well within current technological capabilities. Please, storage industry, stop selling us stuff designed to do something completely different, and build it for us.