Thursday, January 31, 2013

DNA as a storage medium

I blogged last October about a paper from Harvard in Science describing using DNA as a digital storage medium. In a fascinating keynote at IDCC2013, Ewan Birney of EMBL discussed a paper in Nature with a much more comprehensive look at this technology. It has been getting a lot of press, much of it, as usual, somewhat misleading. Below the fold I delve into the details.

The Harvard team wrote and read about 640KB of data divided into 96-bit blocks with a 19-bit address. The EMBL team used a more sophisticated encoding scheme designed to avoid sequences prone to error in synthesis, and including parity-check error detection in each block. They wrote about 740KB, converted it to a form suitable for long-term storage, shipped it with no special precautions, read a sample of the DNA and stored the rest.

I believe that this work is pointing in an interesting direction for long-term data storage. DNA is extremely stable in shirt-sleeve environments, and the fact that it is very cheap to replicate allows the Lots Of Copies Keep Stuff Safe principle far more play than any competing medium.

But even more interesting from my point of view is that the EMBL team provide current costs for their technique, and attempt to compare its costs with current tape-based dark archival storage. They quote costs of $12.4K/MB for writing and $0.22K/MB for reading. These costs have in recent years decreased much faster than magnetic media costs, so their graphs showing 10- and 100-fold relative improvements are not unreasonable. However, this quote is way too optimistic:

"In 10 years, it's probably going to be about 100 times cheaper," said Dr. Goldman. "At that time, it probably becomes economically viable."

The EMBL team compute the cost of long-term storage in DNA using market prices for synthesis. They compare this with a model they constructed of the cost of long-term storage on tape. But they didn't need their own model of tape cost; long-term storage is available in the market from Amazon's Glacier, which is rumored to be three independent copies on LTO6 tape (I am not convinced). Fewer than three copies would not be reliable enough for long-term storage.

Glacier storage currently costs 1c/GB/mo. Generously, I assume writing costs of $7.5K/MB for DNA (see the Supplementary Material for the source of this number). If, instead of spending $7.5M to write 1GB of data to DNA, the money is invested at 0.1%/yr real interest, it would generate an income of $7.5K/yr for ever. Even assuming that Amazon never drops Glacier's price to match technological improvement, this would be enough to store three copies of 62.5TB for ever. The same money is storing (three copies of) 62,500 times as much data.

Let's assume more realistic interest rates, say 1%, so that the $7.5M invested would generate $75K/yr and pay for 625TB at Glacier. I also assume tape costs dropping at the industry's 20%/yr projection for disk, and a 100-fold improvement in DNA synthesis in 10 years. In 10 years it will cost $75K to write the data to DNA. That $75K would generate $750/yr. Assuming, although they probably won't, that Amazon drops the Glacier price to match tape costs, in 10 years they would be charging 0.1c/GB/mo. The $750/yr would pay to store (three copies of) 62.5TB. The 100-fold improvement in DNA synthesis less the 10-fold improvement in tape leads to a 10-fold gain in competitiveness of DNA in 10 years. But, since $750/yr would still store 62.5TB against the 1GB written to DNA, it still needs at least a further 62,500-fold improvement.
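
The arithmetic in the last two paragraphs can be sketched as a small endowment calculation. This is only a sketch under the assumptions stated above (the cost of writing 1GB to DNA is instead invested for ever, and Glacier is priced per GB per month); the function name and its parameters are mine, not anything from the EMBL paper.

```python
def endowed_tb(write_cost_usd, real_interest, glacier_cents_per_gb_month):
    """TB of Glacier storage payable for ever from the interest on write_cost_usd."""
    income_per_year = write_cost_usd * real_interest
    # Convert cents/GB/month to dollars/TB/year (decimal TB, 1000 GB).
    usd_per_tb_year = glacier_cents_per_gb_month / 100 * 12 * 1000
    return income_per_year / usd_per_tb_year

scenarios = {
    "today, 0.1% interest, 1c/GB/mo": endowed_tb(7.5e6, 0.001, 1.0),
    "today, 1% interest, 1c/GB/mo": endowed_tb(7.5e6, 0.01, 1.0),
    "in 10 years, 1% interest, 0.1c/GB/mo": endowed_tb(7.5e4, 0.01, 0.1),
}
for label, tb in scenarios.items():
    print(f"{label}: {tb:,.1f} TB of Glacier vs 1GB of DNA")
```

Changing the interest rate or the Glacier price shows how sensitive the comparison is to each assumption; the gap is large enough that no plausible tweak closes it.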

I've pointed out that Kryder's Law for disks has slowed; its prospects more than a decade out are dim. Nevertheless, even if tape stopped improving in 2023 and DNA synthesis continued to improve 10-fold each year it wouldn't catch up until around 2648. So, yes, in the long run DNA will probably become cheaper than tape. But in 10 years it will still be far too expensive for practical data storage. It looks like DNA needs around a 1M-fold improvement in synthesis cost in 10 years before it would be competitive with contemporary tape technology.

Why is EMBL's model of tape costs misleading them by around 4 orders of magnitude? Some reasons are:

  • They model tape costs as a series of migration events whose cost has a fixed part and a part that decreases according to a Kryder's Law parameter. Thus the long-term cost is dominated by the fixed cost they assign to a tape migration. They don't use discounted cash flow, which means that they systematically over-estimate the impact of these future fixed costs as compared to the up-front cost of writing the DNA.
  • Their model assumes that the data is written to a single tape each time. But no-one makes tapes that small, so as time goes by they waste a larger and larger proportion of their media. In practice, data in long-term storage is aggregated to use the whole capacity of each generation of tape, so the per-byte cost of tape migration drops with the increase in capacity of the media, because it is amortized across an increasing amount of data at each migration. Aggregation thus means that there is no fixed component of tape cost, removing the dominant part of their cost model.
  • Because their model doesn't use discounted cash flow it ignores not merely the time value of money but also the effect of organizations' limited planning horizons. The comparison for a 1GB archive is between spending $7.5M the first year on DNA (and then nothing) and spending $0.12 the first year (and then $0.12 the second year, and ...). Even with a 1M-fold improvement, the comparison is between spending $7.50 the first year (then nothing) and spending $0.12 each year. Even ignoring discounted cash flow it would be over 60 years before DNA would be the cheaper option. Few organizations have 60-year horizons for return on their investments.
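
The planning-horizon point in the last bullet can be made concrete with a break-even calculation. This is a rough sketch, assuming a one-off DNA write cost against a constant annual Glacier bill for 1GB, with each payment made at the end of its year; the discount rates are illustrative assumptions of mine, not figures from the paper.

```python
def breakeven_years(upfront_usd, annual_usd, discount_rate=0.0, max_years=1000):
    """First year in which the cumulative (discounted) annual payments reach
    the up-front cost, or None if they never do within max_years."""
    total = 0.0
    for year in range(1, max_years + 1):
        # Payment made at the end of `year`, discounted back to today.
        total += annual_usd / (1 + discount_rate) ** year
        if total >= upfront_usd:
            return year
    return None

print(breakeven_years(7.50, 0.12))        # no discounting
print(breakeven_years(7.50, 0.12, 0.01))  # 1% discount rate
print(breakeven_years(7.50, 0.12, 0.02))  # 2% discount rate
```

Without discounting the break-even falls in year 63, which is the "over 60 years" above. With any discounting it recedes further, and at a 2% rate it never arrives: a perpetual stream of $0.12/yr is then worth only $6 today, less than the $7.50 up-front cost.
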

It might be argued that Glacier is priced as a loss leader. LTO6 tapes hold 6.25TB. Thus if Glacier is LTO6, 3 tapes at Amazon generate income of $750/yr but cost perhaps $300. Making our usual assumption, based on SDSC numbers, that media is 1/3 the 3-year cost of ownership, running costs are $200/yr. As I recall the SDSC numbers include routine tape migration, but let us assume they don't. If the tapes were kept for 5 years the income would be $3750, costs would be $1300, leaving $2450, or almost $500/yr for profit and migration costs.
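
The loss-leader estimate above is easy to reproduce. A minimal sketch, using only the figures in the paragraph (6.25TB of data on three tapes costing perhaps $300 in total, media as 1/3 of the 3-year cost of ownership, tapes kept 5 years); the function and parameter names are hypothetical.

```python
def tape_margin(data_tb=6.25, media_usd=300.0, cents_per_gb_month=1.0, years=5):
    """Return (income, costs, margin per year) for one tape's worth of data
    stored as three copies at the Glacier price."""
    income_per_year = data_tb * 1000 * cents_per_gb_month / 100 * 12
    # Media is assumed to be 1/3 of the 3-year cost of ownership, so the
    # remaining 2/3 is running cost spread over those 3 years.
    three_year_cost = 3 * media_usd
    running_per_year = (three_year_cost - media_usd) / 3
    income = income_per_year * years
    costs = media_usd + running_per_year * years
    return income, costs, (income - costs) / years

income, costs, margin = tape_margin()
print(f"income ${income:,.0f}, costs ${costs:,.0f}, margin ${margin:,.0f}/yr")
```

Raising media_usd or shortening the tape lifetime shows how much room there is before the margin disappears.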

In the next decade DNA would need a miracle to be competitive with tape. In that time developments in solid state storage, such as today's paper in Nature from a Cambridge team demonstrating 3D spintronic storage, are likely to be the best prospect for economic archival storage, especially given the likely changes in the patterns of access to archival data. Beyond that it is likely that DNA's inherent advantages of stability and replicability will eventually dominate. But any storage medium with high up-front costs and low running costs, such as DNA or our DAWN proposal, comes with a high economic barrier to adoption.


Turning to this blog's other theme, academic communication, note that the Harvard paper was received for publication 20 June 2012, accepted for publication 7 August 2012, and published on-line 16 August 2012. The EMBL paper was received for publication 15 May 2012, accepted 12 December 2012, and published 23 January 2013. The EMBL authors submitted 5 weeks earlier but Nature's slower reviewing meant their paper appeared 23 weeks later. Of course, in fairness to Nature's reviewers, they had a lot more content to review.