Thursday, September 15, 2016

Nature's DNA storage clickbait

Andy Extance at Nature has a news article that illustrates rather nicely the downside of the claim by Marcia McNutt, editor-in-chief of Science, that one reason to pay the subscription to top journals is that:
Our news reporters are constantly searching the globe for issues and events of interest to the research and nonscience communities.
Follow me below the fold for an analysis of why no-one should be paying Nature to publish this kind of stuff.

Extance's article is entitled How DNA could store all the world's data, and starts with this scary thought:
The latest experiment signals that interest in using DNA as a storage medium is surging far beyond genomics: the whole world is facing a data crunch. Counting everything from astronomical images and journal articles to YouTube videos, the global digital archive will hit an estimated 44 trillion gigabytes (GB) by 2020, a tenfold increase over 2013. By 2040, if everything were stored for instant access in, say, the flash memory chips used in memory sticks, the archive would consume 10–100 times the expected supply of microchip-grade silicon3.
He then claims a solution to this problem is at hand:
If information could be packaged as densely as it is in the genes of the bacterium Escherichia coli, the world's storage needs could be met by about a kilogram of DNA.
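As a sanity check on that claim, here is a rough back-of-envelope calculation. The numbers are mine, not Extance's: 2 bits per nucleotide, an average single-stranded nucleotide mass of about 330 g/mol, and no redundancy or addressing overhead.

```python
# Rough check of "about a kilogram of DNA" for the world's data.
# Assumptions (mine, not Extance's): 2 bits per nucleotide, average
# single-stranded nucleotide mass ~330 g/mol, no overhead or copies.

AVOGADRO = 6.022e23               # molecules per mole
NT_MASS_G = 330 / AVOGADRO        # grams per nucleotide, ~5.5e-22 g

archive_bits = 44e12 * 1e9 * 8    # 44 trillion GB, in bits
nucleotides = archive_bits / 2    # 2 bits per nucleotide
mass_g = nucleotides * NT_MASS_G

print(f"~{mass_g:.0f} g of DNA for an idealized, overhead-free encoding")
```

That comes out around 100 g; real encodings need addressing, error correction and multiple copies, so "about a kilogram" is at least the right order of magnitude. The problem is not density, it is cost.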
The article is based on research at Microsoft that involved storing 151KB in DNA. The research is technically interesting, starting to look at fundamental DNA storage system design issues. But it concludes (my emphasis):
DNA-based storage has the potential to be the ultimate archival storage solution: it is extremely dense and durable. While this is not practical yet due to the current state of DNA synthesis and sequencing, both technologies are improving at an exponential rate with advances in the biotechnology industry[4].
The paper doesn't claim that the solution is at hand any time soon. Reference 4 is a two-year-old post to Rob Carlson's blog. A more recent post to the same blog puts the claim that:
both technologies are improving at an exponential rate
in a somewhat less optimistic light. It is true (or may be; Carlson believes the last two data points are not representative) that DNA sequencing is getting cheaper very rapidly. But the cost of sequencing (read) was already insignificant in the total cost of DNA storage. What matters is the synthesis (write) cost. Further down the article Extance writes:
A closely related factor is the cost of synthesizing DNA. It accounted for 98% of the expense of the $12,660 EBI experiment. Sequencing accounted for only 2%, thanks to a two-millionfold cost reduction since the completion of the Human Genome Project in 2003.
The rapid decrease in the read cost is irrelevant to the economics of DNA storage; if it were free it would make no difference. Carlson's graph shows that the write cost, the short DNA synthesis cost (red line), is falling more slowly than the gene synthesis cost (yellow line). He notes:
But the price of genes is now falling by 15% every 3-4 years (or only about 5% annually).
A little reference checking, that should have been well within the capability of one of Nature's expert news reporters, reveals that the Microsoft paper's claim that:
both technologies are improving at an exponential rate
while strictly true, is deeply misleading. The relevant technology is currently getting cheaper more slowly than hard disk or flash memory! And since this has been true for around two decades, making the necessary 3-4 fold improvement just to keep up with the competition is going to be hard.

I actually believe that, decades from now, DNA will be an important archival medium. But I've been criticizing the level of hype around the cost of DNA storage for years. Extance's article admits that cost is a big problem, yet it finishes by quoting Goldman, lead author of a 2013 paper in Nature whose massively over-optimistic cost projections I debunked here. Goldman's quote may be strictly true, but it is again deeply misleading:
"Our estimate is that we need 100,000-fold improvements to make the technology sing, and we think that's very credible," he says. "While past performance is no guarantee, there are new reading technologies coming onstream every year or two. Six orders of magnitude is no big deal in genomics. You just wait a bit."
Yet again the DNA enthusiasts are waving the irrelevant absolute cost decrease in reading to divert attention from the relevant lack of relative cost decrease in writing. They need an improvement in relative write cost of at least 6 orders of magnitude. To do that in a decade means cutting the relative cost roughly fourfold every year, not increasing the relative cost by 10-15% every year.
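To put numbers on that, here is a minimal sketch of my arithmetic: the annual improvement factor a million-fold reduction in a decade actually requires, set against the roughly 5%/year Carlson observes for gene synthesis.

```python
# How fast must DNA synthesis cost fall to gain six orders of magnitude
# in a decade, versus Carlson's observed ~5%/year for gene synthesis?

needed_annual_factor = 10 ** (6 / 10)  # ~4x cheaper every year, compounded
decade_at_5pct = 1 / (0.95 ** 10)      # total improvement from 5%/yr

print(f"needed: ~{needed_annual_factor:.1f}x cheaper per year")
print(f"a decade at 5%/yr delivers only ~{decade_at_5pct:.1f}x in total")
```

An annual fourfold reduction, sustained for ten years, against an observed trend that delivers less than a factor of two per decade.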

Extance's article doesn't simply regurgitate the hype in the paper he's reporting on by failing to scrutinize its claims; he amplifies it by headlining claims the paper is careful not to make, and by giving it prominence in Nature's news section. This kind of clickbaiting is a classic example of problem #6 in The 7 biggest problems facing science, according to 270 scientists by Julia Belluz, Brad Plumer, and Brian Resnick. I blogged about their article here:
Science journalism is often full of exaggerated, conflicting, or outright misleading claims. If you ever want to see a perfect example of this, check out "Kill or Cure," a site where Paul Battley meticulously documents all the times the Daily Mail reported that various items — from antacids to yogurt — either cause cancer, prevent cancer, or sometimes do both.
My problem with the oligopoly of academic publishers isn't that they are incredibly expensive, but that they are incredibly poor value for money, as demonstrated by the fact that it took me only about an hour to show how misleading Extance's article is.


David. said...

There were some interesting discussions of DNA storage at the Library of Congress Storage Architecture workshop and there was general agreement on two points:

- DNA synthesis needs to get something like 6-9 orders of magnitude cheaper for DNA to be a cost-effective archival medium.

- Although there is scope for improvement, especially as synthesis for storage needs lower accuracy than for current markets, it isn't likely that current synthesis techniques can ever be improved that much. Radically different, more "biological" techniques would be needed.

David. said...

Last week Science published DNA Fountain enables a robust and efficient storage architecture by Yaniv Erlich and Dina Zielinski from Columbia. They describe an improved method for encoding data in DNA that, at 1.83 bits/nucleotide, gets much closer to the Shannon limit of 2 than previous attempts. Their experiment stored 2.2MB at about $3500/MB write cost.
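To put that write cost in perspective, a quick sketch; the hard-disk price below is my rough assumption, not a figure from the paper.

```python
import math

# How far is the DNA Fountain write cost from commodity storage?
# The hard-disk price below is my rough assumption, not from the paper.

dna_write_per_gb = 3500 * 1000    # $3500/MB -> $3.5 million/GB
disk_per_gb = 0.03                # assumed hard-disk price, $/GB

gap = dna_write_per_gb / disk_per_gb
print(f"write-cost gap: ~10^{math.log10(gap):.0f}")
```

That is roughly 8 orders of magnitude, consistent with the 6-9 orders the Library of Congress workshop participants thought were needed.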

The authors admit that write cost is a problem, but:

"we envision that the cost issue of DNA storage could be addressed by two complementary approaches, the first of which is continuous improvements to the DNA synthesis chemistry, which have been estimated to exponentially reduce the cost by one to two orders of magnitude per decade (4)."

Reference 4 is the 2013 EMBL paper, whose cost claims I debunked at the time. Rob Carlson's blog post shows that 1-2 orders of magnitude/decade is a vast over-estimate, even ignoring the fact that magnetic media is getting cheaper faster.
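Annualizing the paper's claimed decade-scale trend makes the over-estimate obvious; this is my arithmetic, not the authors':

```python
# Annualizing the claimed "one to two orders of magnitude per decade"
# synthesis-cost reduction, for comparison with Carlson's ~5%/year.

low = 10 ** (1 / 10)    # 1 order/decade -> ~1.26x cheaper per year
high = 10 ** (2 / 10)   # 2 orders/decade -> ~1.58x cheaper per year

print(f"implied annual reductions: {1 - 1/low:.0%} to {1 - 1/high:.0%}")
```

The claim implies synthesis costs falling 21-37% per year, four to seven times the rate Carlson's data actually shows.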