Thursday, May 16, 2019

Review Of Data Storage In DNA

Luis Ceze, Jeff Nivala and Karin Strauss of the University of Washington and Microsoft Research team have published a fascinating review of the history and state of the art in Molecular digital data storage using DNA. The abstract reads:
Molecular data storage is an attractive alternative for dense and durable information storage, which is sorely needed to deal with the growing gap between information production and the ability to store data. DNA is a clear example of effective archival data storage in molecular form. In this Review, we provide an overview of the process, the state of the art in this area and challenges for mainstream adoption. We also survey the field of in vivo molecular memory systems that record and store information within the DNA of living cells, which, together with in vitro DNA data storage, lie at the growing intersection of computer systems and biotechnology.
They include a comprehensive bibliography. Below the fold, some commentary and a few quibbles.

At this stage of the technology's development, having an authoritative review of the field is very useful, especially to push back against the hype that DNA storage always seems to attract. The UW/MSFT team's credentials for writing such a review are unmatched.

Some may have assumed that I was exaggerating the difficulty of getting a DNA storage product into the market when I wrote:
Engineers, your challenge is to increase the speed of synthesis by a factor of a quarter of a trillion, while reducing the cost by a factor of fifty trillion, in less than 10 years while spending no more than $24M/yr.
But one of the things the UW/MSFT team has always been impressively realistic about is the scale of the technological problem they face. They roughly agree with me when they write:
The current overall writing throughput of DNA data storage is likely to be in the order of kilobytes per second. We estimate that a system competitive with mainstream cloud archival storage systems in 10 years will need to offer writing and reading throughput of gigabytes per second. This is a 6 orders-of-magnitude gap for synthesis and approximately 2–3 orders of magnitude for sequencing. On the cost gap, tape storage cost about US$16 per terabyte in 2016 and is going down approximately 10% per year. DNA synthesis costs are generally confidential, but leading industry analyst Robert Carlson estimates the array synthesis cost to be approximately US$0.0001 per base, which amounts to US$800 million per terabyte or 7–8 orders magnitude higher than tape.
Sadly, just getting the write cost competitive with tape isn't enough to displace tape from the market. DNA storage would need to be significantly cheaper.
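The arithmetic behind the quoted gaps can be reproduced in a few lines. This is only a sketch: the figures are the ones quoted above, while the one-bit-per-base encoding is an assumption inferred from the US$800 million per terabyte figure, and the function names are made up for illustration.

```python
import math

# Figures quoted from Ceze et al.
COST_PER_BASE_USD = 0.0001        # Carlson's array synthesis estimate
TAPE_USD_PER_TB_2016 = 16.0       # tape cost in 2016
TAPE_ANNUAL_DECLINE = 0.10        # tape cost drops ~10% per year

# Assumption (not stated in the quote): 1 bit stored per base, which is
# what makes $0.0001/base come out at $800M per terabyte.
BITS_PER_BASE = 1
BITS_PER_TERABYTE = 8 * 10**12    # 1 TB = 10^12 bytes


def dna_write_cost_per_tb():
    """Synthesis cost in USD to write one terabyte."""
    bases_needed = BITS_PER_TERABYTE / BITS_PER_BASE
    return bases_needed * COST_PER_BASE_USD


def tape_cost_per_tb(year, base_year=2016):
    """Tape cost in USD per terabyte, extrapolating the quoted decline."""
    return TAPE_USD_PER_TB_2016 * (1 - TAPE_ANNUAL_DECLINE) ** (year - base_year)


def orders_of_magnitude(ratio):
    return math.log10(ratio)


dna_cost = dna_write_cost_per_tb()
cost_gap_2016 = orders_of_magnitude(dna_cost / tape_cost_per_tb(2016))
cost_gap_2029 = orders_of_magnitude(dna_cost / tape_cost_per_tb(2029))
# Kilobytes per second today vs. gigabytes per second needed for writing.
write_gap = orders_of_magnitude(1e9 / 1e3)

print(f"DNA write cost: ${dna_cost:,.0f} per TB")
print(f"Cost gap vs. tape: {cost_gap_2016:.1f} orders (2016), "
      f"{cost_gap_2029:.1f} orders (2029)")
print(f"Write throughput gap: {write_gap:.0f} orders")
```

Because tape keeps getting cheaper, the gap against a fixed DNA synthesis cost widens every year, which is why the quoted range is 7–8 orders of magnitude rather than a single number.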

A review in an academic journal is not the place for the kind of marketing analysis I undertook in DNA's Niche in the Storage Market, so the following three quibbles are just that, quibbles. First, Ceze et al omit the most important cost factor when they write:
Density, durability and energy cost at rest are primary factors for archival storage, which aims to store vast amounts of data for long-term, future use.
One of the fundamental economic problems that I didn't discuss in Archival Media: Not A Good Business is the barrier caused by the epidemic of short-termism in society. As our work on economic models of long-term storage showed, long-lived media with high capital and write costs but low running costs are at a huge disadvantage compared to short-lived media with low capital but higher running costs (including costs of regular migration to successor media). This barrier is a big part of the reason tape is such a small part of the overall storage market. The authors should have included "system capital cost" in the quote above.
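A toy discounted-cost comparison shows how a high discount rate, the formal face of short-termism, flips the ranking. This is not the economic model referred to above; every number below is hypothetical and chosen only to illustrate the shape of the effect.

```python
def present_value(cash_flows, rate):
    """Discount a list of (year, amount) costs back to year 0."""
    return sum(amount / (1 + rate) ** year for year, amount in cash_flows)


def long_lived_costs(horizon, capital, running):
    """One up-front purchase, then a small running cost every year."""
    flows = [(0, capital)]
    flows += [(year, running) for year in range(horizon)]
    return flows


def short_lived_costs(horizon, capital, running, service_life, price_decline):
    """Re-buy media every `service_life` years at a falling price,
    with a larger running cost (which stands in for migration effort)."""
    flows = []
    for year in range(horizon):
        if year % service_life == 0:
            flows.append((year, capital * (1 - price_decline) ** year))
        flows.append((year, running))
    return flows


# Hypothetical parameters in arbitrary cost units, not measured data.
HORIZON_YEARS = 30
long_lived = long_lived_costs(HORIZON_YEARS, capital=100, running=1)
short_lived = short_lived_costs(HORIZON_YEARS, capital=20, running=4,
                                service_life=5, price_decline=0.20)

for rate in (0.0, 0.05, 0.10):
    pv_long = present_value(long_lived, rate)
    pv_short = present_value(short_lived, rate)
    print(f"discount rate {rate:.0%}: long-lived {pv_long:6.1f}, "
          f"short-lived {pv_short:6.1f}")
```

With these made-up inputs the long-lived medium is cheaper when future costs are not discounted at all, but loses clearly at a 10% discount rate, because its cost is concentrated in year 0 while the short-lived medium's costs are pushed into a future the buyer discounts.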

Second, the cost of the robotic and fluidic write and read hardware for DNA storage is likely to be quite high. As with Facebook's Blu-Ray cold storage, this cost needs to be amortized across a large amount of stored data. Thus the economics of DNA storage are likely best suited to data-center scale systems, making the marketing problems even more difficult because there are only a few potential customers. They are all much bigger than any storage device vendor, and thus able to squeeze the device vendor's margins, as they have been doing in the markets for hard disk and flash.

Third, Ceze et al write:
using DNA for data storage offers density of up to 10¹⁸ bytes per mm³, approximately six orders of magnitude denser than the densest media available today
In DNA's Niche in the Storage Market I compared the density of the DNA medium with the density of the hard disk medium:
State-of-the-art hard disks store 1.75TB of user data per platter in a 20nm-thick magnetic layer on each side. The platter is 95mm in diameter with a 25mm diameter hole in the middle, so the volume of the layer that actually contains the data is π*(95²-25²)*40*10⁻⁶ ≅ 1mm³. This volume contains 1.4*10¹³ usable bits, so each bit occupies about 7*10⁻¹⁴mm³.
The real comparison is thus 1.25*10⁻¹⁹mm³/bit vs. 7*10⁻¹⁴mm³/bit, or a factor of about 5.6*10⁵. Six orders of magnitude is plausible, but misleading in two ways. First, it compares the "up to" theoretical density of the DNA medium with the actual media density of 2018 hard disks in volume production. Second, as I described in the same post, the density of storage devices is far lower than the density of the raw medium. For hard disk:
The overhead of the packaging surrounding the raw bits is about half a million times bigger than the bits themselves. If this overhead could be eliminated, we could store 7 exabytes in a 3.5" disk form factor.

DNA storage systems will impose a similar overhead.
Exactly how big the packaging overhead will be depends on a range of system design issues yet to be addressed, but eventual DNA storage systems are unlikely to be a million times denser than their competition.
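The medium-level density comparison above can be checked with a short calculation. This sketch takes the quoted figures as given, including the roughly 1 mm³ volume of the magnetic layer, and the variable names are only illustrative.

```python
import math

# Figures taken as quoted in the post.
DNA_BYTES_PER_MM3 = 1e18           # "up to" density from Ceze et al.
DISK_BYTES_PER_PLATTER = 1.75e12   # 1.75TB of user data per platter
DISK_LAYER_VOLUME_MM3 = 1.0        # quoted approximate magnetic layer volume

# Volume occupied by one bit in each medium.
dna_mm3_per_bit = 1 / (DNA_BYTES_PER_MM3 * 8)
disk_bits = DISK_BYTES_PER_PLATTER * 8
disk_mm3_per_bit = DISK_LAYER_VOLUME_MM3 / disk_bits

density_ratio = disk_mm3_per_bit / dna_mm3_per_bit

print(f"DNA:  {dna_mm3_per_bit:.3g} mm^3 per bit")
print(f"Disk: {disk_mm3_per_bit:.3g} mm^3 per bit")
print(f"Ratio: {density_ratio:.3g} "
      f"(about {math.log10(density_ratio):.1f} orders of magnitude)")
```

The unrounded ratio comes out slightly above the 5.6*10⁵ quoted in the post, which uses the rounded 7*10⁻¹⁴mm³/bit figure. Either way it is a comparison of raw media, before any packaging overhead is counted.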


Thomas Lindgren said...

Skepticism from a genetics/biochemistry guy:

David. said...

Prof. Sang Yup Lee's DNA Data Storage Is Closer Than You Think in Scientific American demonstrates that, while he may hold more than 680 patents in chemical and biomolecular engineering, he doesn't understand the data storage market, and hasn't read the caveats in the papers on storing data in DNA such as Ceze et al's review.

David. said...

The UW/Microsoft team continue to make progress in the technology for DNA data storage:

"A major challenge in making DNA data storage a reality is that reading DNA back into data using sequencing by synthesis remains a laborious, slow and expensive process. Here, we demonstrate successful decoding of 1.67 megabytes of information stored in short fragments of synthetic DNA using a portable nanopore sequencing platform. We design and validate an assembly strategy for DNA storage that drastically increases the throughput of nanopore sequencing."