Wednesday, May 16, 2018

Shorter talk at MSST2018

I was invited to give both a longer and a shorter talk at the 34th International Conference on Massive Storage Systems and Technology at Santa Clara University. Below the fold is the text, with links to the sources, of the shorter talk, which was updated from, and shares its title with, DNA's Niche in the Storage Market.

For the past 2 decades I've been working on keeping data safe for the long term at the Stanford Libraries. The library would love to use DNA storage but I doubt I'll live to see it happen.

DNA stores your Makefile, so the idea of using it to store other data is obvious. It is a linear polymer made of 4 "bases", ACGT, so in theory you get 2 bits/base. Genetics has developed markets for synthesizing (writing) and sequencing (reading) DNA. The idea is that you write data as strings of a few hundred bases, store the resulting liquid in the arrays of little pots used for automated biochemistry, and read the data back by sampling the appropriate pot then feeding the sample into a sequencer.
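To make the 2 bits/base figure concrete, here is a minimal sketch of a naive encoding. The mapping of bit pairs to bases and the function names are my own assumptions for illustration; real schemes add error correction and avoid troublesome sequences such as long runs of one base, which is part of why they fall short of 2 bits/base.

```python
# Hypothetical mapping: the 2-bit value 0..3 indexes into "ACGT".
BASES = "ACGT"


def encode(data: bytes) -> str:
    """Map each byte to four bases, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )


def decode(strand: str) -> bytes:
    """Invert encode(); the strand length must be a multiple of four."""
    if len(strand) % 4:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


strand = encode(b"Makefile")
print(strand, len(strand), decode(strand))
```

Eight bytes become 32 bases, which is the theoretical 2 bits/base with no overhead at all.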

I've been writing enthusiastically about the long-term but skeptically about the medium-term prospects for data in DNA for the last 5 years. I'm about to summarize a long blog post, where you can find the details and links to the sources. The post tries to look at DNA storage as a product rather than as a technology. The history of storage is littered with over-hyped technologies that never made it in the market because they never became a product anyone wanted to buy. I originally planned to talk this afternoon about blockchain storage, another over-hyped technology, but that talk got too long. And this one is more fun.

So let's imagine I've just been hired as the product manager for the DNA division of StoreYourBits Inc. My first task is a presentation to the product team setting out the challenges we will face in taking DNA into the storage market.

This is my list of the attributes of DNA as a storage medium.
  • Quasi-immortal
  • Low Kryder rate, i.e. the rate at which $/byte decreases.
  • Dense
  • Write-once
  • Read Slow, Write Slower
  • Read Cheap, Write Expensive

The popular, and even the technology, press believes that the solution to the problem of keeping data safe for the long term is quasi-immortal media. There are many problems with this idea, as I discussed this morning. But for DNA storage the most important observation is that long media life matters only if the Kryder rate is very low. The Kryder rate is one area where DNA is different from other quasi-immortal media. We already have encoding techniques that get close to the theoretical limit. The Kryder rate of DNA as a storage medium should thus be close to zero, meaning that the incentive to replace DNA storage systems with newer, more effective ones should be much less than for other media.

Alas, that isn't as true as we would like. The explanation is about density.

State-of-the-art hard disks store a Terabyte and three-quarters of user data per platter in a 20nm-thick magnetic layer on each side. The volume of the layer that actually contains the data is about 1mm3. This volume contains 1.4E13 usable bits, so each bit occupies about 7E-14mm3.

But that's not the relevant number. The volume of a 3.5" drive is about 3.8E5mm3. This volume contains 1.12E14 usable bits, thus the size of a usable bit is 3.4E-9mm3. The packaging surrounding the raw bits makes each usable bit about fifty thousand times bigger than the raw bit itself. If this overhead could be eliminated, we could store about 0.7 Exabytes in a 3.5" disk form factor.
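The two per-bit volumes can be checked with a few lines of arithmetic. This is just a sketch using the figures quoted above; the variable names are mine.

```python
# Figures from the text: one platter's magnetic layer vs. a whole 3.5" drive.
LAYER_VOLUME_MM3 = 1.0           # volume of the data-bearing layer, per platter
PLATTER_BITS = 1.75e12 * 8       # 1.75 TB of user data per platter = 1.4E13 bits
DRIVE_VOLUME_MM3 = 3.8e5         # volume of a 3.5" drive
DRIVE_BITS = 1.12e14             # usable bits per drive


def volume_per_bit(volume_mm3: float, bits: float) -> float:
    """Volume in mm3 occupied by each bit."""
    return volume_mm3 / bits


raw_bit_mm3 = volume_per_bit(LAYER_VOLUME_MM3, PLATTER_BITS)
usable_bit_mm3 = volume_per_bit(DRIVE_VOLUME_MM3, DRIVE_BITS)

print(f"raw bit:    {raw_bit_mm3:.1e} mm3")
print(f"usable bit: {usable_bit_mm3:.1e} mm3")
```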

DNA storage systems will impose a similar overhead. How big will it be? The Microsoft/UW team claim an Exabyte in a cubic inch or about 2E-15mm3 per bit, roughly 35 times smaller than a bit in a hard drive. The density advantage for a DNA system comes mostly from less overhead in the packaging and read/write hardware, rather than intrinsically smaller bits.

Although the cells in 3D flash are at least as big, it has a huge density advantage over 2D flash. Because it is a 3D not a 2D medium, there is less packaging overhead per bit. Similarly, DNA is a 3D medium and hard disk is a 2D medium, so DNA has a huge density advantage simply from less packaging per bit.

My guess is that equivalent of the hard disk usable bit at 3.4E-9mm3 would be a DNA system usable bit at about 1E-14mm3, or a factor of about 3.4E5 better than hard disk.
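Here is the same kind of back-of-envelope check for DNA. It is only a sketch: the Exabyte-per-cubic-inch claim and my 1E-14mm3 guess come from the text, and the variable names are assumptions of mine.

```python
MM_PER_INCH = 25.4
CUBIC_INCH_MM3 = MM_PER_INCH ** 3       # about 16,387 mm3
EXABYTE_BITS = 1e18 * 8

# Microsoft/UW claim: an Exabyte in a cubic inch.
dna_raw_bit_mm3 = CUBIC_INCH_MM3 / EXABYTE_BITS

# Figures quoted earlier for hard disk.
HDD_RAW_BIT_MM3 = 7e-14
HDD_USABLE_BIT_MM3 = 3.4e-9

# My guess for a usable bit in a complete DNA storage system.
DNA_USABLE_BIT_MM3 = 1e-14

raw_advantage = HDD_RAW_BIT_MM3 / dna_raw_bit_mm3
system_advantage = HDD_USABLE_BIT_MM3 / DNA_USABLE_BIT_MM3

print(f"DNA raw bit:      {dna_raw_bit_mm3:.1e} mm3")
print(f"raw advantage:    {raw_advantage:.0f}x")
print(f"system advantage: {system_advantage:.1e}x")
```

The raw-bit advantage is a few tens; the system-level advantage is several hundred thousand. Almost all of the gain is in the packaging.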

But, although the encoding of bits into bases is already close to the theoretical optimum, this does not mean that the Kryder rate for DNA storage will be zero. The density of usable bits is determined entirely by the various kinds of overhead, which can be reduced by improving technology.

DNA is write-once, which is good for the archival market but not for anything further up the storage hierarchy.

The Microsoft/UW team estimate that the access latency for data in DNA will be hours. This again restricts DNA storage systems to the very cold archival layer of the storage hierarchy.

The best example of a system designed to operate at this very cold layer is Facebook's Blu-Ray based system, which stores 10,000 100GB Blu-Ray disks in a rack. Although most commentary on it focused on the 50-year life of the optical media, this was irrelevant to the design.

Facebook has a very detailed understanding of the behavior of data in its storage hierarchy. It knows that at the cold, archival layer reads are very rare, and thus unimportant in the design. The keys to the design are minimizing the cost per byte and maximizing the write bandwidth, which they do by carefully scheduling writes in parallel to 12 Blu-Ray drives per rack.

Thus the fact that writing to DNA is slow is problematic in its assigned layer in the storage hierarchy.

Carlson's 2016 graph
The other problematic aspect of writes for DNA as a storage medium is that they are expensive. The graph is from Rob Carlson's blog, and it shows that DNA reads have been getting cheaper rapidly, but DNA writes have not even kept up with hard disk. His latest estimate is that writing currently costs 1E-4 dollars per base, which would be prohibitively expensive even ignoring the following problem.

The cost of writing is paid up-front; the cost of reads, if they ever happen, is paid gradually through the life of the system. Discounted cash flow means that write costs have a much bigger impact on the total cost of ownership of an archival system than read costs. Thus it is unfortunate that DNA write costs are so much higher; they represent the overwhelming majority of the operational costs of a DNA storage system.
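A small sketch of the discounted cash flow point. The dollar amounts and the 20-year life are purely hypothetical, chosen so that the write and read costs have the same nominal total; the 6% rate is the cost of capital I use later in the talk.

```python
def present_value(cashflows, rate):
    """Discount a list of (year, amount) pairs back to year 0."""
    return sum(amount / (1 + rate) ** year for year, amount in cashflows)


RATE = 0.06  # cost of capital

# Hypothetical: 100 units of write cost paid up-front in year 0 ...
write_costs = [(0, 100.0)]
# ... versus 100 units of read cost spread evenly over years 1 through 20.
read_costs = [(year, 5.0) for year in range(1, 21)]

pv_write = present_value(write_costs, RATE)
pv_read = present_value(read_costs, RATE)

print(f"PV of writes: {pv_write:.2f}")
print(f"PV of reads:  {pv_read:.2f}")
```

Even with identical nominal totals, the up-front write cost weighs much more heavily in present-value terms than the deferred read costs.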

After I put all these thoughts together, I could write my presentation. Here it is.

Good Morning, Team!

Welcome to the first weekly meeting of the DNA Storage Product team. I'm your new product marketing guy. I'm sure we're going to have a lot of fun working together as we face the very interesting challenges of getting this amazing technology into the market.

MSFT/UW hierarchy
Let's start with the storage hierarchy. What kind of customer implements a hierarchy like this? That's right, a data center customer. So our product needs to fit neatly into the racks in the data center. The form factor is going to be a 4U box.

The state of the art fits 60 3.5" drives into a 4U box. At 14TB/drive that's 0.84PB. Our technology is more than 340,000 times denser, so we're going to fit 300EB into our 4U box!
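For the skeptics in the room, the capacity arithmetic as a quick sketch (variable names are mine, figures are from the slide):

```python
DRIVES_PER_4U = 60
BYTES_PER_DRIVE = 14e12        # 14 TB
DENSITY_ADVANTAGE = 3.4e5      # my guess at DNA's system-level advantage

disk_box_bytes = DRIVES_PER_4U * BYTES_PER_DRIVE
dna_box_bytes = disk_box_bytes * DENSITY_ADVANTAGE

print(f"disk 4U box: {disk_box_bytes / 1e15:.2f} PB")
print(f"DNA 4U box:  {dna_box_bytes / 1e18:.0f} EB")
```

That comes out a little under 300EB, which marketing rounds up.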

What's the price point for our 4U box? Spectra Logic estimates offline tape in 2017 costs 11 dollars per Terabyte. So 300 Exabytes of offline tape would cost about 3.3 billion dollars. We're going to blow tape away by being one hundred times cheaper, so our 4U box is going to sell for the low, low price of only 33 million dollars!

Of course, data center customers don't like the risk of having their only copy of 300 Exabytes of data in a 4U box that a couple of guys could pull from the rack and walk off with. They want redundancy and geographic dispersion. So we plan to sell them in three-packs with a 10% discount for the low, low price of 90 million dollars.

IBM's estimate of the size of the LTO tape market in 2016 was about two-thirds of a billion dollars. Let's say the total available market for very cold data is 50% more than that; then each three-pack we sell is 9% of the total annual market.

Sales team, your challenge is to own 90% of the market by finding ten customers a year who have 300 Exabytes of cold data they're willing to spend 90 million dollars to keep safe.
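The pricing and market-share numbers, as a sketch anyone can rerun with their own assumptions (the names are mine; the inputs are the figures above):

```python
TAPE_DOLLARS_PER_TB = 11.0       # Spectra Logic, offline tape, 2017
BOX_TB = 300e6                   # 300 EB expressed in TB
CHEAPER_THAN_TAPE = 100          # our target price advantage
THREE_PACK_DISCOUNT = 0.10
LTO_MARKET_2016 = 2 / 3 * 1e9    # IBM's estimate, dollars per year
COLD_MARKET_UPLIFT = 1.5         # assume cold data market is 50% bigger

tape_cost = TAPE_DOLLARS_PER_TB * BOX_TB
box_price = tape_cost / CHEAPER_THAN_TAPE
three_pack_price = 3 * box_price * (1 - THREE_PACK_DISCOUNT)
total_market = LTO_MARKET_2016 * COLD_MARKET_UPLIFT
share_per_sale = three_pack_price / total_market

print(f"tape cost for 300EB: ${tape_cost / 1e9:.1f}B")
print(f"box price:           ${box_price / 1e6:.0f}M")
print(f"three-pack price:    ${three_pack_price / 1e6:.1f}M")
print(f"market share/sale:   {share_per_sale:.1%}")
print(f"share at 10 sales:   {10 * share_per_sale:.0%}")
```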

Once we've shipped them the trio of 4U boxes, the customer needs to fill them with data. How long is this going to take? Let's say it takes a year. So the customer has the equivalent of 45 million dollars worth of kit sitting idle for a year. At 6% cost of capital that costs 2.7 million dollars. We certainly don't want it to take any longer than a year.

To move 300 Exabytes into the box in a year takes 76 Terabits per second, the equivalent of about 760 100G Ethernet links running flat out. Then we need to turn the 2.4E21 bits into 1.2E24 bases, including 3 orders of magnitude overhead. So each box needs to synthesize 3.8E16 bases/sec. Last year Rob Carlson estimated the DNA synthesis industry's output for 2016 was about 5E12 bases, so the 4U box has to synthesize 7,600 times as much every second as the entire industry did in 2016.
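Here is the fill-rate arithmetic as a sketch. I assume a 365-day year and the theoretical 2 bits/base; the three orders of magnitude of overhead and the industry output figure are the ones quoted above, and the variable names are mine.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
BOX_BITS = 300e18 * 8            # 300 EB = 2.4E21 bits
BITS_PER_BASE = 2                # theoretical maximum
OVERHEAD = 1000                  # 3 orders of magnitude
INDUSTRY_BASES_2016 = 5e12       # Rob Carlson's estimate for all of 2016

ingest_bits_per_sec = BOX_BITS / SECONDS_PER_YEAR
links_100g = ingest_bits_per_sec / 100e9

bases_needed = BOX_BITS / BITS_PER_BASE * OVERHEAD
bases_per_sec = bases_needed / SECONDS_PER_YEAR
industry_years_per_sec = bases_per_sec / INDUSTRY_BASES_2016

print(f"ingest rate:      {ingest_bits_per_sec:.2e} bit/s")
print(f"100G links:       {links_100g:.0f}")
print(f"bases to write:   {bases_needed:.1e}")
print(f"synthesis rate:   {bases_per_sec:.1e} base/s")
print(f"2016 outputs/sec: {industry_years_per_sec:.0f}")
```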

Rob Carlson's estimate uses a cost figure of 1E-4 dollars per base, which implies that it would cost 1.2E20 dollars to fill the box. The customer can't afford that, because it is about 1.5 million times the gross world product, or nearly 4 trillion dollars per second. Let's say we need to get the cost under the cost of capital sitting idle as the box fills, say about 2.4 million dollars.
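The cost to fill the box, again as a rough sketch with the figures above (365-day year assumed; names are mine):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
BASES_NEEDED = 1.2e24            # bases to synthesize to fill one box
DOLLARS_PER_BASE = 1e-4          # Rob Carlson's synthesis cost figure

fill_cost = BASES_NEEDED * DOLLARS_PER_BASE
fill_cost_per_sec = fill_cost / SECONDS_PER_YEAR

print(f"cost to fill:    ${fill_cost:.1e}")
print(f"spend per second: ${fill_cost_per_sec:.1e}")
```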

Let's assume that it will take a decade to meet the engineering challenges of increasing the synthesis speed and reducing the cost, and that the product will have a decade in the market before it gets replaced.

NVIDIA, an exceptionally successful R&D-heavy company, spends about 20% of revenue on R&D, and in a good year by doing so makes a 24% profit margin. Assuming our sales team got 90% of the market, 20% of revenue would be 180 million dollars per year, and 24% would be 216 million dollars per year.

At 6% cost of capital, spending a dollar a year in years 1 through 10 has a Net Present Value (NPV) of $7.69. At 16% discount rate the NPV of earning a dollar a year in years 11 through 20 is $0.87. Thus earning 216 million dollars per year for the 10-year product life has an NPV of 188 million dollars. Thus to get a 10% return we can spend no more than an NPV of 188 million dollars, which is the NPV of about 24 million dollars per year for the first 10 years.

Engineers, your challenge is to increase the speed of synthesis by a factor of a quarter of a trillion, while reducing the cost by a factor of fifty trillion, in less than 10 years while spending no more than 24 million dollars a year.
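Where those two factors come from, as a sketch using the numbers from earlier in the presentation (variable names are mine):

```python
BASES_NEEDED_PER_YEAR = 1.2e24    # one box filled in one year
INDUSTRY_BASES_2016 = 5e12        # whole industry's output in 2016
FILL_COST_TODAY = 1.2e20          # dollars, at 1E-4 dollars per base
FILL_COST_TARGET = 2.4e6          # dollars, roughly the cost of idle capital

speed_factor = BASES_NEEDED_PER_YEAR / INDUSTRY_BASES_2016
cost_factor = FILL_COST_TODAY / FILL_COST_TARGET

print(f"speed-up needed:       {speed_factor:.1e}")   # about a quarter of a trillion
print(f"cost reduction needed: {cost_factor:.1e}")    # about fifty trillion
```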

2 comments:

David. said...

The talk after mine was by Brian Bramlett of Twist Bioscience about the roadmap and research efforts in DNA storage that evolved from an IARPA/SRC workshop in 2016. It sounded a lot more optimistic than my talk. As I recall, the 4-year goals were a system capable of reading and writing a Terabyte in 24 hours, and cost parity with tape.

But I don't see that there's a conflict between our two talks. Recall for example that it is 27 years since the first shipment of a flash product yet flash has not impacted the bulk storage market. There is a huge difference between a lab demo and being able to ship hundreds of Exabytes of product at a profit.

David. said...

Katyanna Quach reports that DNA storage can get denser:

"Scientists say they have crafted a semi-synthetic DNA and RNA molecular system that is able to usefully store genetic information.
...
The biologists also had to adapt the reader-transcriber (the polymerase) enzyme to be able to transcribe the DNA to RNA.

Unlike traditional DNA, this new semi-synthetic system is made up of eight key ingredients instead of the usual four: adenine, guanine, thymine and cytosine. The additional four molecules have incredibly long complex chemical names to spell out here, but have similar structures to the classical nucleotides found in traditional DNA."