Tuesday, February 6, 2018

DNA's Niche in the Storage Market

I've been writing about storing data in DNA for the last five years, both enthusiastically about DNA's long-term prospects as a technology for storage, and pessimistically about its medium-term prospects. This time, I'd like to look at DNA storage systems as a product, and ask where their attributes might provide a fit in the storage marketplace.

As far as I know no-one has ever built a storage system using DNA as a medium, let alone sold one. Indeed, the only work I know on what such a system would actually look like is by the team from Microsoft Research and the University of Washington. Everything below the fold is somewhat informed speculation. If I've got something wrong, I hope the experts will correct me.

I've just been hired as the product manager for the DNA division of StoreYourBits Inc. My first task is a presentation to the product team setting out the challenges we will face in taking the amazing technology our researchers have come up with into the market.

To put the presentation together, I need to do three things: identify the attributes of the DNA storage system; figure out the niche in the overall storage system market suitable for a system with these attributes; and estimate the total available market for a product in that niche.

DNA's Attributes

My list of the attributes of DNA as a storage medium would be:
  • Long life
  • Low Kryder rate
  • Dense
  • Write-Once
  • Read Slow, Write Slower
  • Read Cheap, Write Expensive
Then I'd go down the list and look at the implications of each of these attributes for the product. Like all storage media, the initial product would be targeted at the enterprise. So the initial conceptual product would be a rack-mount system.

Long life

The popular press, and even the technology press, believes that the solution to the problem of keeping data safe for the long term is quasi-immortal media. There are many problems with this idea, but perhaps the two most important are:
  • The attraction of quasi-immortal media is "fire and forget", the idea that data can be written and then stored with no further intervention. In the real world, you can't ignore the possibility of media failures. That a data storage medium has a long life does not mean that it is inherently more reliable than a medium with a shorter life; it means that the reliability degrades more slowly.
  • Quasi-immortal media are a bet against technological progress. The analog of Moore's Law for disks is Kryder's Law. For about 3 decades leading up to 2010 disks got 30-40% denser every year. This Kryder rate is now under 20%/yr. But even if the rate were only 10%/yr, after 10 years you could store the same data in well under half the space (see the quick check after this list). Space in the data center is expensive; the drives are going to be replaced after a few years even if they're still working. This is why disks are engineered to have a 5-year life; they could be built for much longer life but the added cost would bring the customer no benefit. The same argument applies to tapes.
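
A quick check of that Kryder-rate arithmetic, as a minimal Python sketch (the 10%/yr rate and 10-year horizon are the assumptions from the bullet above):

    # Back-of-the-envelope check of the Kryder-rate argument above.
    kryder_rate = 0.10                            # assumed density growth per year
    years = 10
    density_gain = (1 + kryder_rate) ** years     # ~2.6x denser after a decade
    print(f"{density_gain:.2f}x denser; same data in "
          f"{1 / density_gain:.0%} of the space")  # ~39%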

Low Kryder rate

This is one area where DNA is different from other quasi-immortal media. We already have encoding techniques that get close to the theoretical limit:
Science published DNA Fountain enables a robust and efficient storage architecture by Yaniv Erlich and Dina Zielinski from Columbia. They describe an improved method for encoding data in DNA that, at 1.83 bits/nucleotide, gets much closer to the Shannon limit of 2 than previous attempts.
The Kryder rate of DNA as a storage medium should thus be close to zero, meaning that the incentive to replace DNA storage systems with newer, more effective ones should be much less than for other media. Thus DNA storage systems should be able to deliver much more of their potential service life than systems using media with higher Kryder rates.

Dense

Alas, that isn't quite as true as we would like. The reason for this takes a bit of explanation.

State-of-the-art hard disks store 1.75TB of user data per platter in a 20nm-thick magnetic layer on each side. The platter is 95mm in diameter with a 25mm diameter hole in the middle, so the volume of the layer that actually contains the data is π*(95²-25²)*40*10⁻⁶ ≅ 1mm³. This volume contains 1.4*10¹³ usable bits, so each bit occupies about 7*10⁻¹⁴mm³.

But that's not the relevant number. The platters are 675μm thick, so the volume of the bits and their carrier is about π*(95²-25²)*675*10⁻³ ≅ 1.8*10⁴mm³, thus the volume of a [bit+carrier] is about 1.3*10⁻⁹mm³, or 1.8*10⁴ times the size of the bit itself.

But even that's not the relevant number. A 3.5" drive is 25.4mm high, 101.6mm wide and 147.3mm long, for a total volume of about 3.8*10⁵mm³. This volume contains 1.12*10¹⁴ usable bits (a 14TB drive), thus the size of a usable bit is 3.4*10⁻⁹mm³. This is 2.6 times the size of the [bit+carrier]. The packaging surrounding the raw bits makes each usable bit about fifty thousand times bigger than the raw bit itself. If this overhead could be eliminated, we could store about 0.7EB in a 3.5" disk form factor.
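
Here is the whole hard-disk arithmetic as a minimal Python sketch; all the dimensions and capacities are the ones quoted above, and the formulas simply reproduce the estimates in the last three paragraphs:

    from math import pi

    # Hard-disk bit volumes, reproducing the estimates above (numbers from the text).
    platter_user_bits = 1.75e12 * 8                    # 1.75TB of user data per platter
    layer_volume = pi * (95**2 - 25**2) * 40e-6        # mm^3: 20nm layer on each side
    raw_bit = layer_volume / platter_user_bits         # ~7e-14 mm^3

    platter_volume = pi * (95**2 - 25**2) * 675e-3     # mm^3: 675um-thick platter
    bit_plus_carrier = platter_volume / platter_user_bits   # ~1.3e-9 mm^3

    drive_volume = 25.4 * 101.6 * 147.3                # mm^3: 3.5" drive
    usable_bit = drive_volume / 1.12e14                # 14TB drive -> ~3.4e-9 mm^3

    print(raw_bit, bit_plus_carrier, usable_bit,
          usable_bit / raw_bit)                        # packaging overhead ~5*10^4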

DNA storage systems will impose a similar overhead. How big will it be? That's hard to predict in the current state of research into DNA storage systems. Here's my attempt at it.

The volume of a base pair is ~1nm³ = 10⁻¹⁸mm³, so a base is half that. In theory this is (a bit less than) 2 bits. But they aren't usable bits. Just as the usable bit in the magnetic coating of a disk platter includes overhead for error correction, servo positioning, block boundaries and so on, the DNA usable bit has to include redundancy, error correction, addressing and other system overhead. Working from the Microsoft/UW team's slides it looks like a usable bit has maybe 3 orders of magnitude overhead of this kind. So a usable DNA bit would be about 2.5*10⁻¹⁶mm³, or about 0.36% of the size of a usable bit in the disk's magnetic layer.

The density the Microsoft/UW team claim is an exabyte in a cubic inch (16*10³mm³ for 8*10¹⁸ bits), which translates to about 2*10⁻¹⁵mm³ per usable bit, roughly a factor of 10 bigger than my estimate above, and roughly 35 times smaller than a usable bit in a hard drive's magnetic layer. This is a rather surprising result; the hype around DNA storage would lead one to believe that DNA bits are many orders of magnitude smaller than hard drive bits. But the density advantage for a DNA system using current technology comes mostly from less overhead in the packaging and read/write hardware, not from intrinsically smaller usable bits.
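
As a rough check, here is the same estimate in Python; the base-pair volume, the 3-orders-of-magnitude overhead and the exabyte-per-cubic-inch figure are the ones quoted above:

    # DNA bit volumes, reproducing the estimates above (overhead factor assumed).
    base_pair_volume = 1e-18                  # mm^3 (~1 nm^3 per base pair)
    raw_dna_bit = (base_pair_volume / 2) / 2  # one base ~ 2 bits -> 2.5e-19 mm^3
    usable_dna_bit = raw_dna_bit * 1000       # ~3 orders of magnitude overhead -> 2.5e-16

    cubic_inch = 16.4e3                       # mm^3
    msft_uw_bit = cubic_inch / 8e18           # exabyte per cubic inch -> ~2e-15 mm^3

    hd_layer_bit = 7e-14                      # usable bit in the disk's magnetic layer
    print(usable_dna_bit, msft_uw_bit,
          hd_layer_bit / msft_uw_bit)         # ratio ~35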

This is why claims such as this, from a Nature news article by Andy Extance entitled How DNA could store all the world's data:
If information could be packaged as densely as it is in the genes of the bacterium Escherichia coli, the world's storage needs could be met by about a kilogram of DNA.
are even more extreme than claiming a 0.7EB hard disk.

Although its cells are at least as big, 3D flash has a huge density advantage over 2D flash: because it is a 3D rather than a 2D medium, there is less packaging overhead per bit. Similarly, DNA is a 3D medium and hard disk is a 2D medium, so DNA has a huge density advantage simply from less packaging per bit.

My guess is that the container for the DNA will add a factor of 2, and I'm going to assume that the read/write hardware adds the same factor of 2.6 as it does for hard disk. So the equivalent of the hard disk usable bit at 3.4*10⁻⁹mm³ would be a DNA system usable bit at about 10⁻¹⁴mm³, or a factor of about 3.4*10⁵ better than hard disk.
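
Continuing in the same vein, a minimal sketch of the system-level estimate; the container and read/write-hardware factors are my guesses from the paragraph above:

    # System-level DNA bit, using the assumed packaging factors above.
    msft_uw_bit = 2e-15                  # mm^3, exabyte per cubic inch
    container_factor = 2                 # guess: the DNA's container
    rw_hardware_factor = 2.6             # guess: same as the hard-disk case
    dna_system_bit = msft_uw_bit * container_factor * rw_hardware_factor  # ~1e-14 mm^3

    hd_system_bit = 3.4e-9               # usable bit in a 3.5" drive
    print(dna_system_bit, hd_system_bit / dna_system_bit)   # advantage ~3*10^5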

But, to return to the argument of the previous section, although the basic encoding of bits into bases is already close to the theoretical optimum, this does not mean that the Kryder rate for DNA storage will be zero. The density of usable bits is determined entirely by the various kinds of overhead, which can be reduced by improving technology.

It is often claimed that the mapping between bases and bits is so stable that DNA as a storage medium would be immune from format obsolescence. But the overhead for redundancy, error correction, addressing, etc. means that in order to decode the data you need to understand the roughly 999 bases of overhead that accompany each base of data, as well as the base-to-bit mapping. How that metadata is formatted is likely to change as technology evolution tries to reduce the overhead, so DNA is not immune from format obsolescence.
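
To make the point concrete, here is a purely hypothetical toy encoding (it is not the Microsoft/UW format, and the header layout is invented for illustration): the base-to-bit mapping is trivial and stable, but a decoder still has to know how the addressing header and payload are laid out, and that layout is exactly what will change as the overhead is engineered down.

    # A toy encoding, invented purely for illustration (not any real DNA format).
    # The 2-bits-per-base mapping is stable, but decoding still depends on knowing
    # how the address header and payload are laid out -- the part that can go obsolete.
    BASE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}

    def encode_block(address: int, payload: bytes, addr_bases: int = 12) -> str:
        """Encode an address header followed by the payload, 2 bits per base."""
        header = "".join(BASE[(address >> s) & 0b11]
                         for s in range(addr_bases * 2 - 2, -1, -2))
        bits = int.from_bytes(payload, "big")
        body = "".join(BASE[(bits >> s) & 0b11]
                       for s in range(len(payload) * 8 - 2, -1, -2))
        return header + body

    print(encode_block(7, b"hi"))   # 12 header bases + 8 payload bases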

Write-Once

DNA is write-once, which is good for the archival market but not for anything further up the storage hierarchy.

Read Slow, Write Slower

The Microsoft/UW team estimate that the access latency for data in DNA will be hours, but I haven't seen an estimate for read bandwidth after you've waited through the latency. This again restricts DNA storage systems to the very cold archival layer of the storage hierarchy.

The best example of a system designed to operate at this very cold layer is Facebook's Blu-Ray based system, which stores 10,000 Blu-Ray disks in a rack. Although most commentary on it focused on the 50-year life of the optical media, this was not the key to the design.

Facebook has a very detailed understanding of the behavior of data in its storage hierarchy. It knows that at the cold, archival layer reads are very rare. Thus the key to the design is maximizing the write bandwidth, which they do by carefully scheduling writes to 12 Blu-Ray drives in parallel.
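
A minimal sketch of that scheduling idea, with invented numbers (this is not Facebook's software, and the per-drive write rate is an assumption): keep all 12 drives busy by always giving the next queued disc's worth of data to whichever drive frees up first.

    import heapq

    # Illustrative only: maximize write bandwidth by keeping 12 drives busy.
    NUM_DRIVES = 12
    WRITE_RATE_MBPS = 18                 # assumed per-drive Blu-Ray write rate

    def makespan(jobs_mb):
        """Greedily give each queued write to the drive that frees up first."""
        drives = [(0.0, d) for d in range(NUM_DRIVES)]    # (time free, drive id)
        heapq.heapify(drives)
        for job in jobs_mb:
            free_at, drive = heapq.heappop(drives)
            heapq.heappush(drives, (free_at + job / WRITE_RATE_MBPS, drive))
        return max(t for t, _ in drives)                  # seconds to drain the queue

    print(makespan([25_000] * 120))      # 120 full 25GB discs across 12 drives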

Thus the fact that writing to DNA is slow is a problem even at the very cold archival layer to which it is assigned in the storage hierarchy.

Read Cheap, Write Expensive

Carlson's 2016 graph
The other problematic aspect of writes for DNA as a storage medium is that they are expensive. The graph is from Rob Carlson's blog, and it shows that DNA reads have been getting cheaper rapidly, but DNA writes (short oligo synthesis) have not. His latest estimate is that writing currently costs $10⁻⁴ per base, which would be prohibitively expensive even ignoring the following problem.

The cost of writing is paid up-front; the cost of reading is paid gradually through the life of the system. Discounted cash flow means that write costs have a much bigger impact on the total cost of ownership of an archival system than read costs. Thus it is unfortunate that DNA write costs are so much higher; they represent the overwhelming majority of the operational costs of a DNA storage system.
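
A minimal discounted-cash-flow illustration of that point, with made-up numbers (the 6% discount rate matches the cost of capital used later in the post; everything else is invented): the same nominal spend weighs far more as an up-front write cost than as read costs spread over the system's life.

    # Illustrative DCF comparison with made-up numbers.
    DISCOUNT = 0.06      # cost of capital
    LIFE_YEARS = 10

    def npv(cashflows):
        """Cashflows indexed by year; year 0 is today."""
        return sum(c / (1 + DISCOUNT) ** t for t, c in enumerate(cashflows))

    write_cost_up_front = npv([100] + [0] * LIFE_YEARS)   # 100 paid at purchase
    read_costs_spread = npv([0] + [10] * LIFE_YEARS)      # 10/yr over the life
    print(write_cost_up_front, read_costs_spread)         # 100.0 vs ~73.6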

After I put all these thoughts together, I could write my presentation. Here it is.

Good Morning, Team!

Welcome to the first weekly meeting of the DNA Storage Product team. I'm your new product marketing guy. I'm sure we're going to have a lot of fun working together as we face the very interesting challenges of getting this amazing technology into the market.

MSFT/UW hierarchy
Let's start with the storage hierarchy. What kind of customer implements a hierarchy like this? That's right, a data center customer. So our product needs to fit neatly into the racks in the data center. The form factor is going to be a 4U box. The state of the art at Backblaze fits 60 3.5" drives into a 4U box. At 14TB/drive that's 0.84PB. Our technology is about 3.4*10⁵ times denser, so we're going to fit 300EB into our 4U box!
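
Here's the back-of-the-envelope behind that capacity number, as a quick sketch (the density factor is the estimate from earlier in the post):

    # Sizing the 4U box (density advantage estimated earlier in the post).
    drives_per_4u = 60
    tb_per_drive = 14
    disk_4u_pb = drives_per_4u * tb_per_drive / 1000        # 0.84 PB of disk
    dna_density_advantage = 3.4e5
    dna_4u_eb = disk_4u_pb * dna_density_advantage / 1000   # ~286 EB, call it 300EB
    print(disk_4u_pb, dna_4u_eb)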

What's the price point for our 4U box? Spectra Logic estimates offline tape in 2017 costs $11/TB. So 300EB of offline tape would cost about $3.3B. We're going to blow tape away by being one hundred times cheaper, so our 4U box is going to sell for the low, low price of only $33M!

Of course, data center customers don't like the risk of having their only copy of 300EB of data in a 4U box that a couple of guys could pull from the rack and walk off with. They want redundancy and geographic dispersion. So we plan to sell them in three-packs with a 10% discount for the low, low price of $90M.

IBM's estimate of the size of the LTO tape market in 2016 was about $0.65B. Let's say the total available market for very cold data is 50% more than that; then each three-pack we sell is 9% of the total available market.

Sales team, your challenge is to own 90% of the market by finding ten customers a year who have 300EB of cold data they're willing to spend $90M to keep safe.
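
For the record, here's the pricing and market arithmetic behind those targets, as a quick sketch (all the inputs are the figures just quoted):

    # Price point and market share arithmetic (inputs quoted above).
    tape_cost_per_tb = 11                       # Spectra Logic, offline tape, 2017
    box_capacity_tb = 300e6                     # 300EB
    tape_equivalent = tape_cost_per_tb * box_capacity_tb    # ~$3.3B
    box_price = tape_equivalent / 100                       # "100x cheaper" -> $33M
    three_pack = 3 * box_price * 0.9                        # 10% discount -> ~$90M

    total_market = 0.65e9 * 1.5                 # LTO market plus 50% -> ~$1B
    print(three_pack / total_market)            # ~9% of the market per three-pack
    print(10 * three_pack / total_market)       # ten customers a year -> ~90%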

Once we've shipped them the trio of 4U boxes, the customer needs to fill them with data. How long is this going to take? Let's say it takes a year. So the customer has the equivalent of $45M worth of kit sitting idle for a year. At 6% cost of capital that costs $2.7M. We certainly don't want it to take any longer than a year.

To move 300EB into the box in a year takes about 76Tb/s of sustained ingest, the equivalent of more than 700 100G Ethernet links. But the network is the easy part. We then need to turn the 2.4*10²¹ bits into 1.2*10²⁴ bases, including the 3 orders of magnitude overhead, so each box needs to synthesize 3.8*10¹⁶ base/sec. Last year Rob Carlson estimated the DNA synthesis industry's output for 2016 was about 5*10¹² bases, so the 4U box has to synthesize 7,600 times as much every second as the entire industry did in 2016.
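
The fill-rate arithmetic, as a quick sketch (numbers as above):

    # Filling one 300EB box in a year (numbers from above).
    seconds_per_year = 3.156e7
    box_bits = 300e18 * 8                         # 2.4*10^21 bits
    ingest_bps = box_bits / seconds_per_year      # ~7.6e13 b/s, i.e. ~76Tb/s

    bases_needed = (box_bits / 2) * 1000          # ~2 bits/base, 1000x overhead -> 1.2e24
    bases_per_second = bases_needed / seconds_per_year      # ~3.8e16 base/s
    industry_2016 = 5e12                          # Carlson's estimate of 2016 output
    print(ingest_bps, bases_per_second, bases_per_second / industry_2016)  # ~7,600x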

Rob Carlson's estimate uses a cost figure of $10⁻⁴ per base, which implies that it would cost $1.2*10²⁰ to fill the box. That's 120 million trillion dollars, which the customer can't afford, because it is about six million times the US GDP. Let's say we need to get the cost under the cost of capital sitting idle as the box fills, or about $2.4*10⁶.
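
And the write-cost arithmetic, again with the figures quoted above:

    # Cost of filling one box at Carlson's synthesis cost estimate.
    bases_needed = 1.2e24
    cost_per_base = 1e-4                          # dollars
    fill_cost = bases_needed * cost_per_base      # $1.2*10^20
    us_gdp_2017 = 1.95e13                         # roughly $19.5 trillion
    target_cost = 2.4e6                           # cost of capital while filling
    print(fill_cost / us_gdp_2017, fill_cost / target_cost)
    # ~6 million times US GDP; cost must fall by ~5*10^13 (fifty trillion)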

Let's assume that it will take a decade to meet the engineering challenges of increasing the synthesis speed and reducing the cost, and that the product will have a decade in the market before it gets replaced.

NVIDIA, an exceptionally successful R&D-heavy company, spends about 20% of revenue on R&D, and in a good year by doing so makes a 24% profit margin. Assuming our sales team got 90% of the market, 20% of revenue would be $180M/yr, and 24% would be $216M/yr.

At 6% cost of capital, spending $1/yr in years 1 through 10 has a Net Present Value (NPV) of $7.69. At a 16% discount rate (the 6% cost of capital plus the 10% return we're after), the NPV of earning $1/yr in years 11 through 20 is $0.87. Thus earning $216M/yr over the 10-year product life has an NPV of $188M, so to get our 10% return we can spend R&D dollars with an NPV of no more than $188M, which works out to about $24M/yr for the first 10 years.
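
The budget arithmetic, taking the annuity factors above as given (a sketch, not a full DCF model):

    # R&D budget arithmetic, using the annuity factors above as given.
    pv_spend_y1_10 = 7.69        # NPV of $1/yr in years 1-10 at 6% (per the text)
    pv_earn_y11_20 = 0.87        # NPV of $1/yr in years 11-20 at 16% (per the text)

    revenue = 0.9e9              # 90% of the ~$1B market
    profit = 0.24 * revenue                          # $216M/yr
    npv_profit = profit * pv_earn_y11_20             # ~$188M
    max_rd_per_year = npv_profit / pv_spend_y1_10    # ~$24M/yr
    print(profit, npv_profit, max_rd_per_year)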

Engineers, your challenge is to increase the speed of synthesis by a factor of a quarter of a trillion, while reducing the cost by a factor of fifty trillion, in less than 10 years while spending no more than $24M/yr.

Finance team, your challenge is to persuade the company to spend $24M a year for the next 10 years for a product that can then earn about $216M a year for 10 years.

At next week's meeting we need to refine these numbers so we can present them to management and get the go-ahead. Let's go get this done!

Conclusion

I think my skepticism about the medium-term prospects for DNA is justified. That isn't to say that researching DNA storage technologies is a waste of resources. Eventually, I believe it will be feasible and economic. But eventually is many decades away. This should not be a surprise; time-scales in the storage industry are long. Disk is a 60-year-old technology, tape is at least 65 years old, CDs are 35 years old, and flash is 30 years old and has yet to impact bulk data storage.

7 comments:

David. said...

Random access in large-scale DNA data storage is a new paper from the UW/Microsoft Research team and others improving their earlier work on random access, and also experimenting with nanopores to read the DNA.

But, as John Timmer reports at Ars Technica:

"Our ability to synthesize DNA has grown at an astonishing pace, but it started from almost nothing a few decades ago, so it's still relatively small. Assuming a DNA-based drive would be able to read a few KB per second, then the researchers calculate that it would only take about two weeks to read every bit of DNA that we could synthesize annually. Put differently, our ability to synthesize DNA has a long way to go before we can practically store much data."

David. said...

The work described in the new paper from the UW/Microsoft Research team reports impressive progress in improving random access and reducing the number of reads needed to recover error-free data. But they should not be writing stuff like:

"Recently, various groups have observed that the biotechnology industry has made substential [sic] progress and DNA data storage is nearing practical use"

without observing that the "various groups" are simply wrong. And their discussion is based on the idea that:

"The first practical ‘DNA drive’ should have a throughput of at least a few kilobytes per second."

Given the critical importance of write bandwidth in enterprise archival systems, no-one is going to buy an expensive, untried, archival-only system that can write "a few kilobytes per second". And they admit that even this un-marketable product represents:

"equivalent of the entire synthetic DNA industry annual production in just 2 weeks. Clearly, synthetic DNA production will have to increase to meet this goal."

The increase in production needed is enormously bigger than the paper suggests. The idea that using arrays can scale production by the 10 or so orders of magnitude needed is not credible.

David. said...

This post should also have linked to its predecessor from just over 5 years ago.

David. said...

David Pescovitz' The case for storing digital data in DNA is, unfortunately, another victim of the DNA storage hype:

"But as researchers continue to make technical strides in the technology, and the price of synthesizing and sequencing DNA has dropped exponentially, systems for backing up to the double helix may actually be closer than you think."

Yet again, the "dropped exponentially" headfake.

Pescovitz bases his post on Exabytes in a Test Tube: The Case for DNA Data Storage by Olgica Milenkovic et al, which is actually a good overview of the technology but, as usual, far too optimistic about the prospects for practical deployment.

David. said...

Not storage but processing with DNA in Katyanna Quach's Boffins build neural networks fashioned out of DNA molecules:

"Scientists have built neural networks from DNA molecules that can recognise handwritten numbers, a common task in deep learning, according to a paper published in Nature on Wednesday."

More information from Caltech; the paper is here.

David. said...

The Economist reports on Catalog, a company that has a somewhat more hopeful approach to storing data in DNA:

"Catalog, a biotechnology firm in Boston, hopes to bring the cost of DNA data-storage below $10 per gigabyte. ... The firm’s system is based on 100 different DNA molecules, each ten base pairs long. The order of these bases does not, however, encode the binary data directly. Instead, the company pastes these short DNA molecules together into longer ones. Crucially, the enzyme system it uses to do this is able to assemble short molecules into long ones in whatever order is desired. The order of the short molecular units within a longer molecule encodes, according to a rule book devised by the company, the data to be stored. ... The cost savings of Catalog’s method come from the limited number of molecules it starts with. Making new DNA molecules one base pair at a time is expensive, but making copies of existing ones is cheap, as is joining such molecules together."

Writing at $10/GB is a lot cheaper than current approaches, but the economics of this system will only work out for the small number of very large archival storage systems, and only if they can increase the write bandwidth enormously:

"Catalog is working with Cambridge Consultants, a British technology-development firm, to make a prototype capable of writing about 125 gigabytes of data to dna every day. If this machine works as hoped (it is supposed to be ready next year), the company intends to produce a more powerful device, able to write 1,000 times faster, within three years."

Their market is exabyte-scale systems. The machine they hope to have in 3 years will take about 22 years to write an exabyte.

Casey said...

"Their market is exabyte-scale systems. The machine they hope to have in 3 years will take about 22 years to write an exabyte."

So with just 88 such machines, they could write an exabyte in 3 months. Sounds pretty good, and far better than existing tape-drive systems.