Tuesday, January 29, 2013

DAWN vs. Twitter

I blogged three weeks ago about the Library of Congress ingesting the Twitter feed, noting that the tweets were ending up on tape. The collection is over 130TB and growing by 190GB/day. The Library is still trying to work out how to provide access to this collection; for example, they cannot afford the infrastructure that would allow readers to perform keyword searches. This leaves the 400-odd researchers who have already expressed a need for access to the collection stymied. The British Library is also running into problems providing access to large collections, although not as large as Twitter. They are reduced to delivering 30TB NAS boxes to researchers, the same approach that Amazon and other services have taken to moving large amounts of data.

I mentioned this problem in passing in my earlier post, but I have come to understand that this observation has major implications for the future of digital preservation. Follow me below the fold as I discuss them.

A few years ago I reviewed the research on the costs of digital preservation and concluded that, broadly speaking, the breakdown was half to ingest, one third to preservation, and one sixth to access. The relatively small proportion devoted to access was in line with studies of access patterns to preserved data, such as that by my co-author Ian Adams at UCSC, which showed that readers accessed it infrequently. The bulk of the accesses were for integrity checks.

It would be interesting to know what each of the 400 researchers wants to do with the Twitter collection. Do they really want to do Google-style keyword searches, each of which will probably return a few million tweets for further analysis? My guess is that what they really want to do is to run a data-mining process over the entire collection. If the only access were via keyword search, data-mining would have to proceed via a keyword search that returns a large sample, followed by a transfer of that sample over the network from the library to the researcher's computer, where the rest of the analysis would run. This would be inefficient, slow and expensive in bandwidth.

In the future, with collections much larger than those whose access patterns have been studied in the past, the access researchers want is likely to be quite different from the patterns we have seen so far. Instead of infrequent access to a small fraction of the collection, we will see infrequent access to much of the collection. These accesses will be more similar to those for integrity checks, which read the entire collection, than to traditional reader access. But instead of happening gradually over a period of many months, they will need to be completed relatively quickly. Thus the proportion of the future cost of preservation due to access will be much larger than one sixth.

This isn't the only implication of the change in access patterns. Each data-mining access will generate a spike in demand for computing resources. Bandwidth considerations will require that these resources be located close to the collection. Provisioning the computing resources at the archive, in the Twitter case at the Library of Congress, will result in them being under-used because there won't be enough demand from researchers to keep them busy 24/7. The Library would have to pay not merely for keeping the Twitter archive on hard drives, which they can't afford, but also for a substantial compute farm, which they can't afford either and which would stand idle much of the time.

This pattern of demand for compute resources, being spiky, is ideal for cloud computing services. It might well be that if they invested in providing keyword search over the Twitter collection, the Library would not merely be failing to provide the access researchers want, but would also be implementing the access they did provide uneconomically. Suppose instead that, in addition to keeping the two archive copies on tape, the Library kept one copy in S3's Reduced Redundancy Storage simply to enable researchers to access it. Right now it would be costing $7692/mo. Each month this would increase by $319. So a year would cost $115,272. Scholars wanting to access the collection would have to pay for their own computing resources at Amazon, and the per-request charges; because the data transfers would be internal to Amazon, there would not be bandwidth charges. The storage charges could be borne by the Library or charged back to the researchers. If they were charged back, the 400 outstanding requests would each need to pay about $300 for a year's access to the collection, not an unreasonable charge. If this idea turned out to be a failure, it could be terminated with no further cost; the collection would still be safe on tape. In the short term, using cloud storage for an access copy of large, popular collections may be a cost-effective approach.
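
As a rough check of the arithmetic above, here is a minimal sketch in Python. The monthly figures and the count of 400 requests come from the text. The way the yearly total is formed (billing every month at the mid-year rate, i.e. the starting bill plus six monthly increases) is an assumption, chosen because it reproduces the $115,272 figure. The function name is hypothetical.

```python
# Figures from the post.
BASE_MONTHLY_USD = 7692       # current monthly storage bill
MONTHLY_INCREASE_USD = 319    # growth in the bill each month
RESEARCHERS = 400             # outstanding access requests


def yearly_cost(base, increase, months=12):
    """Approximate a year's bill by charging every month at the mid-year rate.

    Assumption: the average monthly bill is base + (months / 2) * increase.
    """
    return months * (base + (months // 2) * increase)


total = yearly_cost(BASE_MONTHLY_USD, MONTHLY_INCREASE_USD)
per_researcher = total / RESEARCHERS

print(f"Year's storage: ${total:,}")
print(f"Per researcher: ${per_researcher:,.2f}")
```

Under that assumption the per-researcher share comes out a little under the "about $300" quoted above.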

If I'm right about the likely future pattern of accesses to large preserved collections, it reinforces a message Ian Adams and Ethan Miller of UCSC and I wrote about back in 2011. Fascinating work at Carnegie Mellon demonstrated that by combining very low-power CPUs with modest amounts of flash storage it was possible to build a network of much larger numbers of much smaller nodes that could process key-value searches as fast as existing hard-disk-based networks at two orders of magnitude less power. They called this concept FAWN, for Fast Array of Wimpy Nodes. By analogy, we proposed DAWN, for Durable Array of Wimpy Nodes. Using the same approach of combining low-power CPUs with modest amounts of flash into cheap, durable and self-healing nodes, and using large numbers of them, we showed that it was possible to reduce running costs enough to more than justify the extra cost of storing data in flash as opposed to hard disk.

A point that we did not stress enough is that DAWN provides much higher bandwidth between the storage medium and the compute resource than conventional hard-disk storage architectures such as Amazon's, in which a powerful, many-core CPU is separated from a large number of disk drives by bridge chips, I/O controllers and in many cases networks. Although the CPU in a DAWN node is wimpy, it is directly connected to the flash chips holding the data. As flash is replaced by future solid-state technologies, with RAM-like interfaces, this advantage will increase. Data-mining accesses are typically just the sort of query that can be parallelized effectively in a DAWN network. Thus the DAWN approach would eliminate the need for a separate access copy in cloud storage. It could support the data-mining directly, eliminating a significant cost.
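
To illustrate what "parallelized effectively" might look like, here is a minimal scatter-gather sketch in Python. It is not a description of DAWN itself: the shards, the hashtag-counting task, and the function names are all hypothetical, and threads on one machine merely stand in for wimpy nodes. The point is only that each node scans the data it holds locally and returns a small summary, so the bulk data never crosses the network.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each list stands in for the tweets in one node's flash.
SHARDS = [
    ["#dawn is up", "hello #fawn"],
    ["#dawn again", "no tags here"],
    ["#fawn and #dawn"],
]


def scan_shard(tweets):
    """Runs 'on the node': count hashtags in the local shard only."""
    counts = Counter()
    for tweet in tweets:
        for word in tweet.split():
            if word.startswith("#"):
                counts[word] += 1
    return counts


def mine(shards):
    """Scatter the scan to every shard, then merge the small per-node results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(scan_shard, shards))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


print(mine(SHARDS))
```

Only the per-shard `Counter` objects are gathered and merged, which is why this style of query suits many small nodes that each sit next to their own data.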

One aspect of this model that will bear close attention is the security of allowing researchers' code to access the collection directly. The cloud model provides read-only access to applications not running as the owner of the collection, and in any case these applications are accessing a disposable access-only copy. But in the DAWN case the code would be accessing one replica of the permanent collection. The necessary separation would have to be provided by sandbox software in each node.

1 comment:

Karen Stepanyan said...

You say: "This leaves the 400-odd researchers who have already expressed a need for access to the collection stymied."

Might you know how people could register their interest?