Friday, March 10, 2017

Dr. Pangloss and Data in DNA

Last night I gave a 10-minute talk at the Storage Valley Supper Club, an event much beloved of the good Dr. Pangloss. The title was DNA as a Storage Medium; it was a slightly edited section of The Medium-Term Prospects for Long-Term Storage Systems. Below the fold, an edited text with links to the sources.

I'm David Rosenthal, your resident skeptic. You don't need to take notes, the whole text of this talk with links to the sources will be up on my blog tomorrow morning.

Seagate 2008 roadmap
My mantra about storage technologies is "it takes longer than it takes". Disk, a 60-year-old technology, shows how hard it is to predict lead times. Here is a Seagate roadmap slide from 2008 predicting that the then (and still) current technology, perpendicular magnetic recording (PMR), would be replaced in 2009 by heat-assisted magnetic recording (HAMR), which would in turn be replaced in 2013 by bit-patterned media (BPM).

In 2016, the trade press reported that:
Seagate plans to begin shipping HAMR HDDs next year.
ASTC 2016 roadmap
Here is last year's roadmap from ASTC showing HAMR starting in 2017 and BPM in 2021. So in 8 years HAMR went from next year to next year, and BPM went from 5 years out to 5 years out. The reason for this real-time schedule slip is that as technologies get closer and closer to the physical limits, the difficulty and above all cost of getting from lab demonstration to shipping in volume increases exponentially.

Today I'm aiming my skepticism at the idea of storing data in DNA. The basic idea is obvious; since there are four bases (A, C, G, T), in theory you can store 2 bits per base along the DNA molecule. On the one hand, in practice you can't get the full 2 bits; on the other, it has recently been shown that you can add artificial bases X and Y. There are strong arguments that DNA would be a great off-line archival medium:
  • It is very dense.
  • It is very stable in shirt-sleeve environments over the long term.
  • It is very easy to make lots of copies, which keep stuff safe.
  • The technologies needed to use it have other major applications.
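The naive 2-bits-per-base idea is easy to sketch. Here is a toy illustration (my own, not any published encoding; real schemes such as EMBL's avoid homopolymer runs and balance GC content, which is why they achieve less than 2 bits per base):

```python
# Toy 2-bits-per-base mapping: each byte becomes four bases.
BASE = {0b00: 'A', 0b01: 'C', 0b10: 'G', 0b11: 'T'}
BITS = {v: k for k, v in BASE.items()}

def encode(data: bytes) -> str:
    """Emit four bases per byte, most significant bit-pair first."""
    return ''.join(BASE[(b >> s) & 0b11] for b in data for s in (6, 4, 2, 0))

def decode(strand: str) -> bytes:
    out = []
    for i in range(0, len(strand), 4):
        b = 0
        for base in strand[i:i + 4]:
            b = (b << 2) | BITS[base]
        out.append(b)
    return bytes(out)

print(encode(b'Hi'))  # 'CAGACGGC'
assert decode(encode(b'Hi')) == b'Hi'
```

Note the run of identical bases even in this tiny example; sequences like that are exactly what practical encodings must avoid, at the cost of some density.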
The first demonstration of storing data in DNA was in 1988, but the first demonstrations to store more than a kilobyte were reported in a 1-page paper in Science from Harvard, and a longer paper in Nature from the European Molecular Biology Lab. The EMBL paper was submitted first, on 15 May 2012, but the Harvard paper was published first, on 16 August 2012; because it had much less content it could be reviewed more quickly. The world thinks Harvard did DNA storage first, but in reality it was close and EMBL won by a nose.

The Harvard team wrote and read about 640KB of data divided into 96-bit blocks with a 19-bit address. The EMBL team used a more sophisticated encoding scheme designed to avoid sequences prone to error in synthesis, and including parity-check error detection in each block. They wrote about 740KB, converted the DNA to a form suitable for long-term storage, shipped it with no special precautions, read a sample of the DNA and stored the rest.
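The general shape of both schemes, addressed blocks with some error detection, can be sketched like this (a toy of my own; the field sizes and the parity scheme are illustrative, not the published formats):

```python
# Toy addressed-block scheme in the spirit of the EMBL design:
# each block carries an address, a payload, and an XOR parity byte.
def make_blocks(data: bytes, payload_len: int = 12):
    blocks = []
    n_blocks = (len(data) + payload_len - 1) // payload_len
    for addr in range(n_blocks):
        payload = data[addr * payload_len:(addr + 1) * payload_len]
        parity = 0
        for b in payload:
            parity ^= b
        blocks.append((addr, payload, parity))
    return blocks

def check_block(addr, payload, parity):
    p = 0
    for b in payload:
        p ^= b
    return p == parity  # detects any single corrupted byte

blocks = make_blocks(b'the quick brown fox jumps over the lazy dog')
# Strands come back from sequencing in arbitrary order; the
# address field is what lets the reader reassemble the data.
recovered = b''.join(p for _, p, _ in sorted(blocks))
```

The point of the address field is worth dwelling on: DNA gives you an unordered soup of strands, so every block must say where it belongs.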

So far, so good. Both teams demonstrated that it is possible to write data to, and read it back from, DNA. But neither team could resist hyping their work. Here is Harvard (my emphasis):
costs and times for writing and reading are currently impractical for all but century-scale archives. However, the costs of DNA synthesis and sequencing have been dropping at exponential rates of 5- and 12-fold per year, respectively — much faster than electronic media at 1.6-fold per year.
But the EMBL team easily won the hype competition:
our DNA-based storage scheme could be scaled far beyond current global information volumes and offers a realistic technology for large-scale, long-term and infrequently accessed digital archiving. In fact, current trends in technological advances are reducing DNA synthesis costs at a pace that should make our scheme cost-effective for sub-50-year archiving within a decade.
And the lay press was off and running, with headlines like:
The 'genetic hard drive' that could store the complete works of Shakespeare (and revolutionise the way computers work)
DNA 'perfect for digital storage'
But the serious hype came in EMBL's cost estimates:
"In 10 years, it's probably going to be about 100 times cheaper," said Dr. Goldman. "At that time, it probably becomes economically viable."
As I wrote at the time:
The EMBL team computes the cost of long-term storage in DNA using market prices for synthesis. They compare this with a model they constructed of the cost of long-term storage on tape. But they didn't need their own model of tape cost; long-term storage is available in the market from Amazon's Glacier, which is rumored to be three independent copies on LTO6 tape ... Fewer than three copies would not be reliable enough for long-term storage.

Glacier storage currently costs 1c/GB/mo. Generously, I assume writing costs of $7.5K/MB for DNA ... If, instead of spending $7.5M to write 1GB of data to DNA the money is invested at 0.1%/yr real interest it would generate an income of $7.5K/yr for ever. Even assuming that Amazon never drops Glacier's price to match technological improvement, this would be enough to store three copies of 62.5TB for ever. The same money is storing (three copies of) 62,500 times as much data.
"about 100 times cheaper" doesn't even come close.

The hype died down for a while, but it has started up again. Nature recently featured a news article by Andy Extance entitled How DNA could store all the world's data, which claimed:
If information could be packaged as densely as it is in the genes of the bacterium Escherichia coli, the world's storage needs could be met by about a kilogram of DNA.
The article is based on research at Microsoft that involved storing 151KB in DNA. But this paper concludes (my emphasis):
DNA-based storage has the potential to be the ultimate archival storage solution: it is extremely dense and durable. While this is not practical yet due to the current state of DNA synthesis and sequencing, both technologies are improving at an exponential rate with advances in the biotechnology industry[4].
The Microsoft team don't claim that the solution is at hand any time soon. Reference 4 is a two-year-old post to Rob Carlson's blog. A more recent post to the same blog puts the claim that:
both technologies are improving at an exponential rate
in a somewhat less optimistic light. It may be true that DNA sequencing is getting cheaper very rapidly, but the cost of sequencing (reading) was already insignificant in the total cost of DNA storage. What matters is the synthesis (write) cost. Extance writes:
A closely related factor is the cost of synthesizing DNA. It accounted for 98% of the expense of the $12,660 EBI experiment. Sequencing accounted for only 2%, thanks to a two-millionfold cost reduction since the completion of the Human Genome Project in 2003.
The rapid decrease in the read cost is irrelevant to the economics of DNA storage; if reading were free it would make no difference. Carlson's graph shows that the write cost, the short-DNA synthesis cost (red line), is falling more slowly than the gene synthesis cost (yellow line). He notes:
But the price of genes is now falling by 15% every 3-4 years (or only about 5% annually).
A little reference checking reveals that the Microsoft paper's claim that:
both technologies are improving at an exponential rate
while strictly true, is deeply misleading. The relevant technology is currently getting cheaper more slowly than hard disk or flash memory! And since this has been true for around two decades, making the necessary 3-4 fold improvement just to keep up with the competition is going to be hard.

Last week Science published DNA Fountain enables a robust and efficient storage architecture by Yaniv Erlich and Dina Zielinski from Columbia. They describe an improved method for encoding data in DNA that, at 1.83 bits/nucleotide, gets much closer to the Shannon limit of 2 than previous attempts. Their experiment stored 2.2MB at about $3500/MB write cost.
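A quick back-of-the-envelope on the numbers in that paragraph (all figures are the ones quoted above, nothing new):

```python
# DNA Fountain's encoding efficiency and implied per-GB write cost.
bits_per_nt = 1.83             # reported by Erlich & Zielinski
shannon_limit = 2.0            # 2 bits/nucleotide with 4 bases
efficiency = bits_per_nt / shannon_limit        # 0.915 of the maximum
write_cost_per_mb = 3_500                       # $ for their experiment
write_cost_per_gb = write_cost_per_mb * 1_000   # $3.5M to write 1GB
print(f"{efficiency:.1%} of Shannon limit, ${write_cost_per_gb:,}/GB to write")
```

So the encoding is impressively close to optimal, but at $3.5M/GB the write cost remains many orders of magnitude above conventional media; better coding cannot close that gap.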

Decades from now, DNA will probably be an important archival medium. But the level of hype around the cost of DNA storage is excessive. Extance's article admits that cost is a big problem, yet it finishes by quoting Goldman, lead author of the 2013 paper in Nature whose cost projections were massively over-optimistic. Goldman's quote is possibly true but again deeply misleading:
"Our estimate is that we need 100,000-fold improvements to make the technology sing, and we think that's very credible," he says. "While past performance is no guarantee, there are new reading technologies coming onstream every year or two. Six orders of magnitude is no big deal in genomics. You just wait a bit."
Yet again the DNA enthusiasts are waving the irrelevant absolute cost decrease in reading to divert attention from the relevant lack of relative cost decrease in writing. They need an improvement in relative write cost of at least 6 orders of magnitude. To do that in a decade means cutting the relative cost by a factor of about four every year, not increasing the relative cost by 10-15% every year.
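The compounding is worth making explicit; a factor of 10^6 in a decade demands a far steeper annual improvement than it might sound:

```python
# Annual cost-reduction factor needed for a 10^6 improvement in a decade.
target, years = 1e6, 10
annual_factor = target ** (1 / years)
print(f"costs must fall {annual_factor:.2f}x every year")  # ~3.98x
# For comparison, merely halving every year for a decade yields
# only 2**10 = 1,024x, three orders of magnitude short.
```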

Despite my skepticism about time-scales, I believe that in the long term DNA has great potential as an archival storage medium. Just as I believe that what is interesting about Facebook's work on optical storage is the system aspects not the medium, I believe that what is interesting about the Microsoft team's work is the system aspects. For example, they discuss how data might be encoded in DNA to permit random access.

Although this is useful research, the fact remains that DNA data storage requires a reduction in relative synthesis cost of at least 6 orders of magnitude over the next decade to be competitive with conventional media, and that currently the relative write cost is increasing, not decreasing.


David. said...

DNA isn't just capable of storing data, but also of processing data. In Large-scale design of robust genetic circuits with multiple inputs and outputs for mammalian cells Weinberg et al present:

"a robust, general, scalable system, called 'Boolean logic and arithmetic through DNA excision' (BLADE), to engineer genetic circuits with multiple inputs and outputs in mammalian cells with minimal optimization."

David. said...

The Microsoft/UW team continue their work with a second 10M-strand deal with Twist Bioscience.

David. said...

Digital-to-biological converter for on-demand production of biologics by Kent S. Boles et al describes a combination of off-the-shelf hardware and software that can synthesize DNA sequences, among other things. The abstract is:

"Manufacturing processes for biological molecules in the research laboratory have failed to keep pace with the rapid advances in automization and parellelization. We report the development of a digital-to-biological converter for fully automated, versatile and demand-based production of functional biologics starting from DNA sequence information. Specifically, DNA templates, RNA molecules, proteins and viral particles were produced in an automated fashion from digitally transmitted DNA sequences without human intervention."

Eliminating human intervention will reduce costs, but not by the 6+ orders of magnitude needed to make DNA a viable storage medium.

David. said...

The University of Washington team write:

"The rapid improvement in DNA sequencing has sparked a big data revolution in genomic sciences, which has in turn led to a proliferation of bioinformatics tools. To date, these tools have encountered little adversarial pressure. This paper evaluates the robustness of such tools if (or when) adversarial attacks manifest. We demonstrate, for the first time, the synthesis of DNA which — when sequenced and processed — gives an attacker arbitrary remote code execution."

David. said...

Chris Mellor at The Register reports on data storage in synthetic monomers:

"The basic storage element is a synthetic monomer, a molecule, which groups with others to form a chain called a polymer. In DNA the monomers are nucleotides and these are natural. The synthesised monomers here were designed to be sequenced (analysed) using moderate resolution mass spectroscopy (MS).

Data is "written" using an automated monomer-polymer production method, assembled, or rather synthesized by automated phosphoramidite chemistry, into readable digital sequences - poly(phosphodiester)s chains."

David. said...

The paper describing data encoded in synthetic monomers is here. The CNRS press release states:

"The team of chemists managed to synthesize polymers that can store up to eight bytes. Thus, they were able to record the word 'Sequence' in ASCII code, which assigns a unique byte to each letter and punctuation mark. By successfully decoding this word using mass spectrometry, they set a new record for the length of a molecule that may be read using this technique. Although manual analysis of the digital data does take a few hours, it should be possible to reduce the time needed to a few milliseconds by developing software to perform this task."

David. said...

Whatever you think about DNA as a storage medium, you have to be impressed with its PR:

"A three-year old Bitcoin mystery came to an end last week after Sander Wuyts, a 26-year old Belgian PhD student at the University of Antwerp, cracked a code that revealed the key to one bitcoin inside a strand of synthetic DNA.

The key—chemicals arranged to represent a string of text—was placed in the DNA as part of the DNA Storage Bitcoin Challenge. The challenge began in 2015 after Nick Goldman, a researcher at the European Bioinformatics Institute, gave a presentation on using DNA to store information at the World Economic Forum’s annual meeting in Davos, Switzerland. During this presentation, Goldman distributed tubes of DNA in which he had encoded the key to a digital wallet containing one bitcoin. "

But note:

"You have to take into account that the process of reading out the data takes a couple of days instead of milliseconds compared to data saved on a hard drive,"