Friday, March 10, 2017

Dr. Pangloss and Data in DNA

Last night I gave a 10-minute talk at the Storage Valley Supper Club, an event much beloved of the good Dr. Pangloss. The title was DNA as a Storage Medium; it was a slightly edited section of The Medium-Term Prospects for Long-Term Storage Systems. Below the fold, an edited text with links to the sources.

I'm David Rosenthal, your resident skeptic. You don't need to take notes; the whole text of this talk with links to the sources will be up on my blog tomorrow morning.

Seagate 2008 roadmap
My mantra about storage technologies is "it takes longer than it takes". Disk, a 60-year-old technology, shows how hard it is to predict lead times. Here is a Seagate roadmap slide from 2008 predicting that the then (and still) current technology, perpendicular magnetic recording (PMR), would be replaced in 2009 by heat-assisted magnetic recording (HAMR), which would in turn be replaced in 2013 by bit-patterned media (BPM).

In 2016, the trade press reported that:
Seagate plans to begin shipping HAMR HDDs next year.
ASTC 2016 roadmap
Here is last year's roadmap from ASTC showing HAMR starting in 2017 and BPM in 2021. So in 8 years HAMR went from next year to next year, and BPM went from 5 years out to 5 years out. The reason for this real-time schedule slip is that as technologies get closer and closer to the physical limits, the difficulty and above all cost of getting from lab demonstration to shipping in volume increases exponentially.

Today I'm aiming my skepticism at the idea of storing data in DNA. The basic idea is obvious: since there are four bases (A, C, G and T), in theory you can store 2 bits per base along the DNA molecule. In practice you can't get the full 2 bits, but on the other hand it has recently been shown that you can add artificial bases X and Y. There are strong arguments that DNA would be a great off-line archival medium:
  • It is very dense.
  • It is very stable in shirt-sleeve environments over the long term.
  • It is very easy to make lots of copies, which keep stuff safe.
  • The technologies needed to use it have other major applications.
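To make the "2 bits per base" idea concrete, here is a minimal sketch of the naive mapping. The assignment of bit pairs to bases and the function names are my own assumptions for illustration; this is not the encoding used by any of the teams discussed below, which give up some density to avoid error-prone sequences.

```python
# Naive 2-bits-per-base mapping: 00->A, 01->C, 10->G, 11->T.
# The mapping itself is an illustrative assumption, not any paper's scheme.
BASES = "ACGT"


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )


def dna_to_bytes(strand: str) -> bytes:
    """Invert bytes_to_dna; the strand length must be a multiple of four."""
    if len(strand) % 4:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


strand = bytes_to_dna(b"DNA")
print(strand, dna_to_bytes(strand))
```

Each byte becomes exactly four bases, so a strand of n bases carries at most 2n bits under this mapping.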
The first demonstration of storing data in DNA was in 1988, but the first experiments to store more than a kilobyte were reported in a 1-page paper in Science from Harvard and a longer paper in Nature from the European Molecular Biology Lab. The EMBL paper was submitted first, on 15 May 2012, but the Harvard paper was published first, on 16 August 2012; because it had much less content it could be reviewed more quickly. The world thinks Harvard did DNA storage first, but in reality it was close and EMBL won by a nose.

The Harvard team wrote and read about 640KB of data divided into 96-bit blocks with a 19-bit address. The EMBL team used a more sophisticated encoding scheme designed to avoid sequences prone to error in synthesis, and including parity-check error detection in each block. They wrote about 740KB, converted the DNA to a form suitable for long-term storage, shipped it with no special precautions, read a sample of the DNA and stored the rest.
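As a rough illustration of what "96-bit blocks with a 19-bit address" implies, the sketch below splits a byte string into addressed blocks. Only the two bit widths come from the text; the field order, the zero-padding of the last block and the function name are assumptions of mine, not the Harvard team's actual layout.

```python
DATA_BITS = 96      # payload bits per block, from the text
ADDRESS_BITS = 19   # address bits per block, from the text


def to_addressed_blocks(data: bytes) -> list[str]:
    """Return one bit string per block: a 19-bit address, then 96 data bits."""
    block_bytes = DATA_BITS // 8
    n_blocks = -(-len(data) // block_bytes)  # ceiling division
    if n_blocks > 2 ** ADDRESS_BITS:
        raise ValueError("too many blocks for a 19-bit address")
    blocks = []
    for index in range(n_blocks):
        chunk = data[index * block_bytes:(index + 1) * block_bytes]
        # Zero-padding the final block is an assumption made for this sketch.
        chunk = chunk.ljust(block_bytes, b"\x00")
        address = format(index, f"0{ADDRESS_BITS}b")
        payload = "".join(format(b, "08b") for b in chunk)
        blocks.append(address + payload)
    return blocks


blocks = to_addressed_blocks(bytes(640 * 1024))
print(len(blocks), len(blocks[0]))
```

A 19-bit address can distinguish 2^19 = 524,288 blocks, or about 6.3MB at 96 bits each, so 640KB fits with room to spare.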

So far, so good. Both teams demonstrated that it is possible to write data to, and read it back from, DNA. But neither team could resist hyping their work. Here is Harvard (my emphasis):
costs and times for writing and reading are currently impractical for all but century-scale archives. However, the costs of DNA synthesis and sequencing have been dropping at exponential rates of 5- and 12-fold per year, respectively — much faster than electronic media at 1.6-fold per year.
But the EMBL team easily won the hype competition:
our DNA-based storage scheme could be scaled far beyond current global information volumes and offers a realistic technology for large-scale, long-term and infrequently accessed digital archiving. In fact, current trends in technological advances are reducing DNA synthesis costs at a pace that should make our scheme cost-effective for sub-50-year archiving within a decade.
And the lay press was off and running, with headlines like:
The 'genetic hard drive' that could store the complete works of Shakespeare (and revolutionise the way computers work)
DNA 'perfect for digital storage'
But the serious hype came in EMBL's cost estimates:
"In 10 years, it's probably going to be about 100 times cheaper," said Dr. Goldman. "At that time, it probably becomes economically viable."
As I wrote at the time:
The EMBL team computes the cost of long-term storage in DNA using market prices for synthesis. They compare this with a model they constructed of the cost of long-term storage on tape. But they didn't need their own model of tape cost; long-term storage is available in the market from Amazon's Glacier, which is rumored to be three independent copies on LTO6 tape ... Fewer than three copies would not be reliable enough for long-term storage.

Glacier storage currently costs 1c/GB/mo. Generously, I assume writing costs of $7.5K/MB for DNA ... If, instead of spending $7.5M to write 1GB of data to DNA the money is invested at 0.1%/yr real interest it would generate an income of $7.5K/yr for ever. Even assuming that Amazon never drops Glacier's price to match technological improvement, this would be enough to store three copies of 62.5TB for ever. The same money is storing (three copies of) 62,500 times as much data.
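The arithmetic in that comparison can be checked in a few lines. This is a sketch under the figures quoted above ($7.5K/MB to write, 0.1%/yr real interest, 1c/GB/mo for Glacier); the function name and the convention that 1GB = 1000MB are my assumptions.

```python
def endowment_comparison(
    write_cost_per_mb: float = 7_500.0,        # dollars per MB written to DNA
    data_gb: float = 1.0,                      # amount of data written to DNA
    real_interest: float = 0.001,              # 0.1% per year
    glacier_cents_per_gb_month: float = 1.0,   # Glacier price, held constant
) -> tuple[float, float, float]:
    """Return (DNA write cost, annual income if invested, GB of Glacier funded)."""
    write_cost = write_cost_per_mb * 1_000 * data_gb   # assumes 1GB = 1000MB
    annual_income = write_cost * real_interest
    glacier_per_gb_year = glacier_cents_per_gb_month / 100 * 12
    stored_gb = annual_income / glacier_per_gb_year
    return write_cost, annual_income, stored_gb


write_cost, income, stored_gb = endowment_comparison()
print(f"DNA write cost for 1GB: ${write_cost:,.0f}")
print(f"Income if invested instead: ${income:,.0f}/yr")
print(f"Glacier storage that income funds: {stored_gb / 1_000:,.1f}TB")
```

The ratio of 62,500GB to 1GB is the 62,500-fold gap in the quoted passage.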
"about 100 times cheaper" doesn't even come close.

The hype died down for a while, but it has started up again. Nature recently featured a news article by Andy Extance entitled How DNA could store all the world's data, which claimed:
If information could be packaged as densely as it is in the genes of the bacterium Escherichia coli, the world's storage needs could be met by about a kilogram of DNA.
The article is based on research at Microsoft that involved storing 151KB in DNA. But this paper concludes (my emphasis):
DNA-based storage has the potential to be the ultimate archival storage solution: it is extremely dense and durable. While this is not practical yet due to the current state of DNA synthesis and sequencing, both technologies are improving at an exponential rate with advances in the biotechnology industry[4].
The Microsoft team don't claim that the solution is at hand any time soon. Reference 4 is a two-year-old post to Rob Carlson's blog. A more recent post to the same blog puts the claim that:
both technologies are improving at an exponential rate
in a somewhat less optimistic light. It may be true that DNA sequencing is getting cheaper very rapidly. But already the cost of sequencing (read) was insignificant in the total cost of DNA storage. What matters is the synthesis (write) cost. Extance writes:
A closely related factor is the cost of synthesizing DNA. It accounted for 98% of the expense of the $12,660 EBI experiment. Sequencing accounted for only 2%, thanks to a two-millionfold cost reduction since the completion of the Human Genome Project in 2003.
The rapid decrease in the read cost is irrelevant to the economics of DNA storage; even if reading were free, it would make no difference. Carlson's graph shows that the write cost, the short DNA synthesis cost (red line), is falling more slowly than the gene synthesis cost (yellow line). He notes:
But the price of genes is now falling by 15% every 3-4 years (or only about 5% annually).
A little reference checking reveals that the Microsoft paper's claim that:
both technologies are improving at an exponential rate
while strictly true, is deeply misleading. The relevant technology is currently getting cheaper more slowly than hard disk or flash memory! And since this has been true for around two decades, making the necessary 3-4 fold improvement just to keep up with the competition is going to be hard.

Last week Science published DNA Fountain enables a robust and efficient storage architecture by Yaniv Erlich and Dina Zielinski from Columbia. They describe an improved method for encoding data in DNA that, at 1.83 bits/nucleotide, gets much closer to the Shannon limit of 2 than previous attempts. Their experiment stored 2.2MB at about $3500/MB write cost.

Decades from now, DNA will probably be an important archival medium. But the level of hype around the cost of DNA storage is excessive. Extance's article admits that cost is a big problem, yet it finishes by quoting Goldman, lead author of the 2013 paper in Nature whose cost projections were massively over-optimistic. Goldman's quote is possibly true but again deeply misleading:
"Our estimate is that we need 100,000-fold improvements to make the technology sing, and we think that's very credible," he says. "While past performance is no guarantee, there are new reading technologies coming onstream every year or two. Six orders of magnitude is no big deal in genomics. You just wait a bit."
Yet again the DNA enthusiasts are waving the irrelevant absolute cost decrease in reading to divert attention from the relevant lack of relative cost decrease in writing. They need an improvement in relative write cost of at least 6 orders of magnitude. To do that in a decade means cutting the relative cost roughly fourfold every year (halving it every year would yield only about 3 orders of magnitude), not increasing the relative cost by 10-15% every year.
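As a quick check on the compounding involved, the sketch below computes the yearly improvement factor needed to gain a given number of orders of magnitude, and how many orders a given yearly factor delivers. The function names are mine, and it assumes a constant rate of improvement.

```python
import math


def annual_factor(orders_of_magnitude: float, years: float) -> float:
    """Constant yearly improvement factor needed to gain 10**orders in `years`."""
    return 10 ** (orders_of_magnitude / years)


def orders_gained(factor_per_year: float, years: float) -> float:
    """Orders of magnitude gained by improving `factor_per_year`-fold each year."""
    return years * math.log10(factor_per_year)


print(f"6 orders in 10 years needs {annual_factor(6, 10):.2f}x per year")
print(f"Halving yearly for 10 years gains {orders_gained(2, 10):.2f} orders")
print(f"Halving yearly for 20 years gains {orders_gained(2, 20):.2f} orders")
```

Under these assumptions, six orders of magnitude in a decade needs close to a fourfold improvement every year, while halving every year takes about twenty years to get there.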

Despite my skepticism about time-scales, I believe that in the long term DNA has great potential as an archival storage medium. Just as I believe that what is interesting about Facebook's work on optical storage is the system aspects not the medium, I believe that what is interesting about the Microsoft team's work is the system aspects. For example, they discuss how data might be encoded in DNA to permit random access.

Although this is useful research, the fact remains that DNA data storage requires a reduction in relative synthesis cost of at least 6 orders of magnitude over the next decade to be competitive with conventional media, and that currently the relative write cost is increasing, not decreasing.


David. said...

DNA isn't just capable of storing data, but also of processing data. In Large-scale design of robust genetic circuits with multiple inputs and outputs for mammalian cells Weinberg et al present:

"a robust, general, scalable system, called 'Boolean logic and arithmetic through DNA excision' (BLADE), to engineer genetic circuits with multiple inputs and outputs in mammalian cells with minimal optimization."

David. said...

The Microsoft/UW team continue their work with a second 10M-strand deal with Twist Bioscience.