Tuesday, September 23, 2014

A Challenge to the Storage Industry

I gave a brief talk at the Library of Congress Storage Architecture meeting, pulling together themes from a number of recent blog posts. My goal was twofold:
  • to outline the way in which current storage architectures fail to meet the needs of long-term archives,
  • and to set out what an architecture that would meet those needs would look like.
Below the fold is an edited text with links to the earlier posts here that I was condensing.

What is the biggest problem facing digital archiving efforts today? It is, as I discussed in The Half-Empty Archive, that we aren't preserving enough of the stuff we should be preserving.

Measuring the proportion of the stuff that should be preserved that actually is being preserved is hard; as Cliff Lynch has pointed out, neither the numerator nor the denominator is easy to measure. Nevertheless, attempts have been made in two areas in which at least the denominator is fairly well-known:
  • In 2010 the ARL reported that the median research library received about 80K serials. Stanford's numbers support this. The Keepers Registry, across its 8 reporting repositories, reports just over 21K "preserved" and about 10.5K "in progress". Thus under 40% of the median research library's serials are at any stage of preservation.
  • Luis Faria and co-authors (PDF) compare information extracted from publishers' web sites with the Keepers Registry and conclude:
    We manually repeated this experiment with the more complete Keepers Registry and found that more than 50% of all journal titles and 50% of all attributions were not in the registry and should be added.
  • Scott Ainsworth and his co-authors tried to estimate the probability that a publicly-visible URI was preserved, as a proxy for the question "How Much of the Web is Archived?" They generated lists of "random" URLs using several different techniques, including sending random words to search engines and random strings to the bit.ly URL shortening service. They then:
    • tried to access the URL from the live Web.
    • used Memento to ask the major Web archives whether they had at least one copy of that URL.
    Their results are somewhat difficult to interpret, but for their two more random samples they report:
    URIs from search engine sampling have about 2/3 chance of being archived [at least once] and bit.ly URIs just under 1/3.
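The serials estimate above is simple arithmetic, and a minimal sketch makes the inputs explicit. The counts are the rounded figures quoted in the text ("about 80K", "just over 21K", "about 10.5K"), so the result is approximate. The function name `coverage` is illustrative only, and the calculation assumes, as the comparison in the text does, that the registry's titles can be set directly against the median library's holdings.

```python
# Rounded figures quoted in the text above (approximate by construction).
MEDIAN_LIBRARY_SERIALS = 80_000   # ARL median research library, 2010
KEEPERS_PRESERVED = 21_000        # Keepers Registry, "preserved"
KEEPERS_IN_PROGRESS = 10_500      # Keepers Registry, "in progress"


def coverage(preserved: int, in_progress: int, total: int) -> float:
    """Fraction of titles that are at any stage of preservation."""
    return (preserved + in_progress) / total


serials_fraction = coverage(
    KEEPERS_PRESERVED, KEEPERS_IN_PROGRESS, MEDIAN_LIBRARY_SERIALS
)
print(f"Serials at any stage of preservation: {serials_fraction:.1%}")
```

With these rounded inputs the fraction comes out a little under 40%, which matches the bullet on serials.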
Unfortunately, there are a number of reasons why this simplistic assessment is wildly optimistic:
  • It isn't risk-adjusted. We are preserving the stuff that isn't at risk and not preserving the stuff that will get lost.
  • It isn't adjusted for difficulty. We only preserve the easy stuff.
  • It is backwards-looking. We preserve content the way it used to be delivered, not the way it is being delivered now.
  • The measurements are biased.
Overall, it's clear that we are preserving much less than half of the stuff that we should be preserving. What can we do to preserve the rest of it?

We can do nothing, in which case we needn't worry any more about bit rot, format obsolescence, and all the other risks, because they only lose a few percent. The reason why more than 50% of the stuff won't make it to future readers would be "can't afford to preserve".

We can more than double the budget for digital preservation. This is so not going to happen; we will be lucky to sustain the current funding levels. Worse, the cost per unit content is projected to rise, so the proportion of stuff lost to "can't afford to preserve" will rise too.

We can more than halve the cost per unit content. It is safe to say that we won't do this with current system architectures.

Where does the money go in current architectures? This is the subject of a good deal of research, but precise conclusions are hard to come by. My rule of thumb to summarize most of the research is that, in the past, approximately half the cost was in ingest, one-third in preservation (mostly storage) and one-sixth in access. Of the preservation cost, roughly one-third was media and two-thirds running costs. In The Half-Empty Archive I made projections of these costs going forward:
  • Ingest costs will rise, both because we've already ingested the easy stuff, and because the evolution of the Web is making ingest more difficult.
  • Preservation costs will rise, because Kryder's Law has slowed dramatically.
  • Access costs will rise. In the past, access to archives was typically rare and sparse. Increasingly, scholars want to data-mine large collections, not just access a few data objects. Current web archives struggle to provide basic keyword search, let alone data-mining.
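The rule-of-thumb split of historical costs can be worked through with exact fractions. This is only a sketch of the arithmetic in the rule of thumb stated above; it says nothing about the projected future costs, and the variable names are illustrative.

```python
from fractions import Fraction

# Rule-of-thumb split of historical cost, as stated in the text.
ingest = Fraction(1, 2)
preservation = Fraction(1, 3)
access = Fraction(1, 6)
assert ingest + preservation + access == 1

# Within preservation: one-third media, two-thirds running costs.
media = preservation * Fraction(1, 3)
running = preservation * Fraction(2, 3)

shares = {"ingest": ingest, "media": media, "running": running, "access": access}
for name, share in shares.items():
    print(f"{name:8s} {share}  ({float(share):.1%})")

# Preservation plus access together, as a share of the historical total.
storage_share = preservation + access
print(f"preservation + access: {storage_share}")
```

On these figures, media accounts for one-ninth of the historical total and running costs for two-ninths, while preservation and access together make up half.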
Ingest costs are archiving's problem, and we're working on it. Preservation and access costs are the storage industry's problem. Facebook, Amazon and others are working on the problem of low-cost, reliable cold storage. But archives are not cold. The Blue Ribbon Task Force and others have shown that in almost all cases the only way to justify the cost of preservation is to provide access to the preserved content. I'm on the staff of a truly dark archive, CLOCKSS (recently certified as a Trusted Repository), and I can testify that sustainable funding for dark archives that can use cold storage technologies is very difficult to obtain. For light archives, cold storage technologies work fine for the second and subsequent copies, but they don't work for the primary copy, the one that supports the access that justifies the funding.

What the primary copy needs looking forward is a storage architecture that provides data-mining capabilities with moderate performance, reasonable reliability, and very low running costs. Capital cost can be somewhat higher than, for example, Backblaze's boxes, provided running costs are very low. Unfortunately, the funding model for most archives makes trading higher capital for lower running costs hard; the payback time for the investment must be short.

What would such an architecture look like? Let's work backwards from access. We need to run queries against the data. We can either move the data to the query or the query to the data. Current architectures pair powerful CPUs with many terabytes of storage each, and thus require expensive, power-sucking I/O hardware to move the data to the query.

Five years ago David Andersen and co-authors at C-MU took the opposite approach. They published FAWN: a Fast Array of Wimpy Nodes, showing that a large number of very low-power CPUs each paired with a small amount of flash memory could run queries as fast as the conventional architecture using two orders of magnitude less power. Three years ago Ian Adams, Ethan Miller and I were inspired by this to publish DAWN: a Durable Array of Wimpy Nodes. We showed that, using a similar fabric for long-term storage, the running costs would be low enough to overcome the much higher cost of the flash media as compared to disk. And it would run the queries better than disk.

Disk is not a good medium for a move-the-query-to-the-data architecture. Here, from an interesting article at ACM Queue, is a much prettier graph explaining why. It plots the time it would take to write (approximately the same as the time to read) the entire content of a state-of-the-art disk against the year. Although disks have been getting bigger rapidly, they haven't been getting correspondingly faster. In effect, the stored data has been getting further and further away from the code. There's a fundamental reason for this: the data rate depends on the inverse of the diameter of a bit, but the capacity depends on the inverse of the area of a bit, so capacity grows as the square of the data rate. The reason that FAWN-like systems can out-perform traditional PCs with conventional hard disks is that the bandwidth between the data and the CPU is so high and the amount of data per CPU so small that it can all be examined in a very short time.
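The scaling argument above can be sketched numerically. If the bit diameter shrinks by some factor, the data rate rises by that factor but the capacity rises by its square, so the time to scan the whole disk rises by the factor itself. The baseline drive below (1 TB at 100 MB/s) is a hypothetical round number chosen for illustration, not a figure from the ACM Queue article, and the function names are illustrative.

```python
def scale_disk(capacity_tb: float, rate_mb_s: float, shrink: float):
    """Scale a disk when the bit diameter shrinks by `shrink` (2.0 = half the diameter).

    Capacity goes as 1/area (shrink squared); data rate goes as 1/diameter (shrink).
    """
    return capacity_tb * shrink ** 2, rate_mb_s * shrink


def full_scan_hours(capacity_tb: float, rate_mb_s: float) -> float:
    """Hours needed to read (or write) the entire disk at its sequential rate."""
    megabytes = capacity_tb * 1e6
    return megabytes / rate_mb_s / 3600


# Hypothetical baseline drive, for illustration only.
base_capacity_tb, base_rate_mb_s = 1.0, 100.0
new_capacity_tb, new_rate_mb_s = scale_disk(base_capacity_tb, base_rate_mb_s, 2.0)

base_hours = full_scan_hours(base_capacity_tb, base_rate_mb_s)
new_hours = full_scan_hours(new_capacity_tb, new_rate_mb_s)
print(f"baseline: {base_hours:.2f} h, after halving bit diameter: {new_hours:.2f} h")
```

In this toy model, halving the bit diameter quadruples the capacity but only doubles the data rate, so the full-disk scan takes twice as long.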

A year and a half ago Micron announced a very small TLC (Triple-Level Cell) flash memory chip, 128Gb in a chip 12mm on a side, built in a 20nm process. It was very low cost but slow, with very limited write endurance. Facebook talked about using this TLC flash for cold data, where high speed and high write endurance aren't needed. Chris Mellor at The Register wrote:

There's a recognition that TLC flash is cheap as chips and much faster to access than disk or even, wash your mouth out, tape. Frankovsky, speaking to Ars Technica, said you could devise a controller algorithm that tracked cell status and maintained, in effect, a bad cell list like a disk drive's bad block list. Dead TLC flash cells would just be ignored. By knowing which cells were good and which were bad you could build a cold storage flash resource that would be cheaper than disk, he reckons, because you wouldn't need techies swarming all over the data centre replacing broken disk drives from the tens of thousands that would be needed.
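The bad cell list idea in the quote can be illustrated with a toy model. This is a hypothetical sketch, not how any real flash controller works: the class `ColdFlash` and its methods are invented for illustration, and it tracks whole blocks rather than individual cells.

```python
class ColdFlash:
    """Toy model of write-once cold flash with a bad-block list (illustrative only)."""

    def __init__(self, n_blocks: int) -> None:
        self.blocks = [None] * n_blocks
        self.bad = set()       # indices of blocks known to be dead
        self.next_free = 0     # next block index to consider for writing

    def mark_bad(self, index: int) -> None:
        """Record that a block has failed; it will simply be skipped from now on."""
        self.bad.add(index)

    def write(self, data: bytes) -> int:
        """Store data in the next good, unused block and return its index."""
        while self.next_free < len(self.blocks):
            index = self.next_free
            self.next_free += 1
            if index not in self.bad:
                self.blocks[index] = data
                return index
        raise IOError("no good blocks left")

    def usable_blocks(self) -> int:
        """Blocks that are not on the bad list."""
        return len(self.blocks) - len(self.bad)


flash = ColdFlash(n_blocks=4)
flash.mark_bad(0)
flash.mark_bad(2)
first = flash.write(b"object A")    # skips dead block 0
second = flash.write(b"object B")   # skips dead block 2
```

The point of the sketch is that dead blocks reduce capacity but never require anyone to replace hardware, which is the operational saving the quote describes.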
At the recent Flash Memory Summit, a startup called NxGnData, founded two years after we proposed DAWN, announced a flash memory controller intended to pair with very low-cost Triple-Level-Cell flash such as Micron's to support both cold storage and "In-Situ Processing":
“The company believes that by moving computational tasks closer to where the data resides, its intelligent storage products will considerably improve overall energy efficiency and performance by eliminating storage bandwidth bottlenecks and the high costs associated with data movement.”
On average, the output of a query is much smaller than its input, otherwise what's the point? Moving queries and their results therefore needs much less communication infrastructure than moving the data, and so wastes much less power.
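A minimal sketch of the traffic difference, under invented assumptions: a fabric of 4 nodes, each holding 1,000 records of 100 bytes, with one matching record per node. The function names and the byte-substring "query" are hypothetical and stand in for whatever query mechanism a real fabric would offer.

```python
def bytes_moved_data_to_query(node_data) -> int:
    """Move-the-data-to-the-query: every stored record crosses the network."""
    return sum(len(record) for node in node_data for record in node)


def bytes_moved_query_to_data(node_data, query: bytes) -> int:
    """Move-the-query-to-the-data: the query goes out, only matches come back."""
    sent = len(query) * len(node_data)
    returned = sum(
        len(record) for node in node_data for record in node if query in record
    )
    return sent + returned


# Hypothetical fabric: 4 nodes x 1000 records x 100 bytes, one match per node.
nodes = [
    [b"x" * 100 for _ in range(999)] + [b"needle" + b"x" * 94]
    for _ in range(4)
]

central = bytes_moved_data_to_query(nodes)
in_situ = bytes_moved_query_to_data(nodes, b"needle")
print(f"data to query: {central} bytes; query to data: {in_situ} bytes")
```

In this toy case the in-situ approach moves 424 bytes instead of 400,000. The ratio depends entirely on how selective the query is.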

In move-the-query-to-the-data architectures, we need to talk IP all the way to the media, which is perfectly feasible since even the cheapest media these days are actually (system-on-chip+media) modules. The storage industry has been moving in that direction, with first Seagate and then Western Digital announcing disk drives connected via Ethernet. Storage is part of the Internet of Things, and recent events such as the Doomed Printer show the need for a security-first approach.

What protocols are we going to talk over IP to the media? WD ducks that question; their drives run Linux. This provides ultimate flexibility, but it also gives the media an unpleasantly large attack surface. Seagate's answer is an object storage protocol, which provides a minimal attack surface but doesn't address the data-mining requirement; it's still a move-the-data-to-the-query architecture. NxGnData's presentation shows they've been thinking about this question:
The challenges:
•  Protection against malicious code
•  Fast runtime environment setup time
•  Freedom in selecting programming language (C, Python, Perl...)
•  Multi-tenant environments
•  Security and
•  Data isolation
Rackspace and others have been thinking along the same lines too, and their ZeroVM virtualization technology may be a viable compromise.
In these architectures there is no separate computer and operating system mediating access to the media. There are no "drivers" hiding device-specific details. The media are talking directly to each other and the outside world. The only need for centralized computing is for monitoring and management, the sysadmin's UI to the storage fabric.

Also, the fabric itself must evolve. It has to be upgraded incrementally; it will contain media from different generations and different vendors. We know that in the life of the fabric both of the current media technologies, disk and flash, will hit technological limits, but we aren't sure of the characteristics of the technologies that will replace them. If the architecture is to succeed, these considerations place some strong requirements on the media protocol:
  • The media protocols must be open, long-lived industry standards; there is no good place to put adaptors between proprietary protocols.
  • The protocols must be at the object storage level; they have to hide the specific details of how the object is represented in the medium so that they can survive technology transitions.
  • The protocols must cover the essential archive functions: access, query, anti-entropy, media migration and storage management.
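To make the requirements above concrete, here is a purely hypothetical sketch of what an object-level media interface covering those five functions might look like. None of the names (`MediaNode`, `InMemoryNode`, `migrate_to` and so on) come from any existing standard or product. Callers see only named objects, so the representation inside the medium can change without affecting them.

```python
import hashlib
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterator, List


class MediaNode(ABC):
    """Hypothetical object-level interface to one medium (illustrative names only)."""

    # Access: objects are addressed by name, never by block or sector.
    @abstractmethod
    def put(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, name: str) -> bytes: ...

    @abstractmethod
    def names(self) -> List[str]: ...

    # Query: runs next to the data and returns only the matching names.
    def query(self, predicate: Callable[[bytes], bool]) -> Iterator[str]:
        return (n for n in self.names() if predicate(self.get(n)))

    # Anti-entropy: a digest that peers can compare to detect damage.
    def digest(self, name: str) -> str:
        return hashlib.sha256(self.get(name)).hexdigest()

    # Media migration: copy every object to a node of any generation or vendor.
    def migrate_to(self, other: "MediaNode") -> None:
        for n in self.names():
            other.put(n, self.get(n))

    # Storage management: basic figures for the sysadmin's UI.
    def usage(self) -> Dict[str, int]:
        names = self.names()
        return {"objects": len(names), "bytes": sum(len(self.get(n)) for n in names)}


class InMemoryNode(MediaNode):
    """Stand-in medium for the sketch; a real node would wrap disk or flash."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, name: str, data: bytes) -> None:
        self._objects[name] = data

    def get(self, name: str) -> bytes:
        return self._objects[name]

    def names(self) -> List[str]:
        return sorted(self._objects)


old, new = InMemoryNode(), InMemoryNode()
old.put("article-1", b"long-term archives")
old.put("article-2", b"cold storage")
old.migrate_to(new)
matches = list(new.query(lambda data: b"archives" in data))
in_sync = all(old.digest(n) == new.digest(n) for n in old.names())
```

Because migration and anti-entropy are written only against the abstract interface, they would work unchanged between nodes built on different media, which is the property the requirements are after.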
In this context, the work that has been ongoing since at least 2006 by Van Jacobson and others to implement what is now called Named Data Networking is something that the storage industry should support. The idea is to replace the TCP/IP infrastructure of the Internet with a new but functionally compatible infrastructure whose basic concept is a named data object.

To sum up, the storage architecture archives need is well within current technological capabilities. Please, storage industry, stop selling us stuff designed to do something completely different, and build it for us.

1 comment:

  1. In the discussion Dave Anderson pointed to proof that Seagate, at least, is thinking along these lines. In a 2013 paper entitled Exploiting Free Silicon for Energy-Efficient Computing Directly in NAND Flash-based Solid-State Storage Systems, Peng Li and Kevin Gomez of Seagate and David Lilja of the University of Minnesota describe their concept of the Storage Processing Unit (SPU). This involves integrating a low-power CPU with the flash memory, which is what DAWN would require, and matches what Bunnie Huang found. They come to the same conclusion that, with very different hardware, FAWN did:

    "Simulation results show that the SPU-based system is at least 100 times more energy-efficient than the conventional system for data-intensive applications."

    The interesting part is why they believe the SPU is made of "free silicon". To understand this, you need to understand the chip design concept of "pad-limited". The pads of a chip are where the wires that transmit power and signals attach to the silicon, and their size is fixed by the process of bonding the wires to the silicon. Normally, the pads are arranged around the periphery of the chip, so the number of pads and their size determine the circumference of the chip. This means that a chip with a given number of connections has to have at least a certain area, whether or not it needs that area to implement the functions of the chip. A chip that needs less area to implement its functions than is enforced by its pads is called pad-limited. The chip area "wasted" can be used to implement other functions and is thus effectively free.
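The pad-limited argument is geometric, so a small sketch may help. It assumes a square die with pads spaced evenly around the whole perimeter. The pad count, pad pitch and logic area below are invented round numbers, not figures from the paper, and the function names are illustrative.

```python
def pad_limited_area_mm2(n_pads: int, pad_pitch_mm: float) -> float:
    """Minimum area of a square die whose perimeter must hold n_pads at the given pitch."""
    perimeter_mm = n_pads * pad_pitch_mm
    side_mm = perimeter_mm / 4
    return side_mm ** 2


def free_area_mm2(n_pads: int, pad_pitch_mm: float, logic_area_mm2: float) -> float:
    """Area left over once the chip's own functions are implemented (zero if none)."""
    return max(0.0, pad_limited_area_mm2(n_pads, pad_pitch_mm) - logic_area_mm2)


# Hypothetical chip: 100 pads at 0.1 mm pitch, needing 4 mm^2 for its own logic.
minimum_area = pad_limited_area_mm2(100, 0.1)
spare_area = free_area_mm2(100, 0.1, 4.0)
print(f"pad-limited area: {minimum_area:.2f} mm^2, free silicon: {spare_area:.2f} mm^2")
```

For these made-up numbers the pads force a die of 6.25 mm², leaving 2.25 mm² of "free" silicon. A chip whose logic needs more area than the pads enforce is not pad-limited, and the function returns zero.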

    It turns out that flash vendors are already using some of this free silicon:

    "For example, both the Micron ClearNAND and the Toshiba embedded multi-media card (eMMC) have integrated the hardware ECC into a die inside the NAND flash package. In addition, since the die area is pad-limited, manufacturers like Micron and Toshiba also have integrated a general purpose processor into the die to implement parts of the FTL [Flash Translation Layer] functions, such as block management, to further increase the SSD performance."

    "In fact, even with the integrated general purpose processor and the hardware ECC, the die still has available area. Thus, we can integrate more logic units without any additional cost."