Tuesday, December 21, 2021

Progress On DNA Storage

In 2018's DNA's Niche in the Storage Market, I addressed a hypothetical DNA storage company's engineers and posed this challenge:
increase the speed of synthesis by a factor of a quarter of a trillion, while reducing the cost by a factor of fifty trillion, in less than 10 years while spending no more than $24M/yr.
[Figure 2C from Nguyen et al]
Now, Scaling DNA data storage with nanoscale electrode wells by Bichlien H. Nguyen et al shows that the collaboration between Microsoft and U. Washington has made significant progress toward this goal. Their abstract reads:
Synthetic DNA is an attractive medium for long-term data storage because of its density, ease of copying, sustainability, and longevity. Recent advances have focused on the development of new encoding algorithms, automation, preservation, and sequencing technologies. Despite progress in these areas, the most challenging hurdle in deployment of DNA data storage remains the write throughput, which limits data storage capacity. We have developed the first nanoscale DNA storage writer, which we expect to scale DNA write density to 25 × 10⁶ sequences per square centimeter, three orders of magnitude improvement over existing DNA synthesis arrays. We show confinement of DNA synthesis to an area under 1 square micrometer, parallelized over millions of nanoelectrode wells and then successfully write and decode a message in DNA. DNA synthesis on this scale will enable write throughputs to reach megabytes per second and is a key enabler to a practical DNA data storage system.
Below the fold I discuss the details.

Clearly, it isn't possible to speed up the chemistry required to add bases to a DNA strand by orders of magnitude. The only viable approach to the required level of speed-up is massive parallelism. Nguyen et al write:
A practical minimum throughput for writing digital data into DNA strands is in the kilobytes per second range, which is not achievable with existing synthesis infrastructure. Achieving the necessary parallel write throughput (not to be confused with latency or nucleotide incorporation time) while maintaining a realistic infrastructure footprint will thus require increasing the synthesis density, the number of different sequences that a single platform can synthesize per unit area.
As with 2D chips, parallelism is controlled by the area of each parallel unit (feature size) and their spacing (pitch). Reducing them increases throughput and reduces cost per unit output:
Smaller feature size and pitch result in higher synthesis density. High-density array synthesis amortizes the fixed costs of reagents and equipment over a larger number of oligonucleotides, which is reflected in the historic decrease in synthesis cost per base with the transition from column-based to array-based oligonucleotide synthesis.
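The relationship between pitch and synthesis density is simple geometry. The sketch below assumes a square grid and a 2 µm pitch, which is my back-of-the-envelope figure chosen to be consistent with the paper's 650 nm electrodes and its quoted density of 25 million oligonucleotides per square centimeter; the 200 nm case is a hypothetical illustrating the authors' "billions of features" projection.

```python
# Back-of-the-envelope: synthesis density as a function of electrode pitch,
# assuming synthesis sites on a square grid. The 2 um pitch is my assumption,
# consistent with the paper's 650 nm electrodes and 25M oligos/cm^2 density.

def sites_per_cm2(pitch_um: float) -> float:
    """Number of synthesis sites per square centimeter at the given pitch."""
    sites_per_cm = 10_000 / pitch_um  # 1 cm = 10,000 um
    return sites_per_cm ** 2

print(f"{sites_per_cm2(2.0):,.0f} sites/cm^2")  # 2 um pitch: 25 million
print(f"{sites_per_cm2(0.2):,.0f} sites/cm^2")  # hypothetical 200 nm pitch
```

Note that a 10x reduction in pitch buys a 100x increase in density, which is why the authors' projection of scaling from a 130 nm process node reaches billions of features per square centimeter.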
The team fabricated a small chip with an electrode array and:
demonstrated independent electrode-specific control of DNA synthesis with electrode sizes and pitches that enable synthesis density of 25 million oligonucleotides/cm², the estimated electrode density required to achieve the minimum target of kilobytes per second of data storage in DNA.
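It is worth checking that 25 million sites per square centimeter really does imply kilobyte-per-second write bandwidth. The sketch below is my own estimate, not the paper's calculation; the oligo length (100 nt), effective information density after coding overhead (1.6 bits/base), and per-cycle synthesis time (60 s) are all assumptions I chose as plausible round numbers.

```python
# Rough write-throughput estimate for a 25M-site array. All parameters below
# except the site count are my assumptions, not figures from the paper.
sites = 25_000_000      # synthesis sites per cm^2 (from the paper)
oligo_len = 100         # nucleotides per oligo (assumed)
bits_per_base = 1.6     # effective bits/base after coding overhead (assumed)
cycle_s = 60            # seconds per base-addition cycle (assumed)

payload_bytes = sites * oligo_len * bits_per_base / 8
write_time_s = oligo_len * cycle_s  # one cycle per base, all sites in parallel
throughput_kBps = payload_bytes / write_time_s / 1000
print(f"~{throughput_kBps:.0f} kB/s")
```

Under these assumptions a full array run writes about 500MB over roughly 100 minutes, or tens of kilobytes per second, consistent with the paper's stated kilobytes-per-second target.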
What does this demonstration show?
Our work pushes the state of the art in electronic-chemical control, outpaces the previously reported densest synthesis of arbitrary DNA sequences by a margin of three orders of magnitude, and provides the first experimental indication that the write bandwidth required for DNA data storage can be achieved.
The major problem in scaling the synthesis down is cross-talk between adjacent electrodes:
The smaller the pitch, the closer the electrodes are, and the easier it is for acid generated at one electrode to diffuse to neighboring electrodes. This may cause unintended deblocking of DNA at neighboring electrodes, resulting in insertion errors in the final sequence.
Their approach to minimizing cross-talk was:
a layout where each working electrode, the anode where the acid formation happens in phosphoramidite synthesis, is sunk in a well and surrounded by four common counter electrodes, cathodes that drive base formation, to confine the acid to the region immediately around the anodes.
Their figures 2E and 2F show that, at least at their scale, it works:
[Figures 2E and 2F from Nguyen et al]
To verify acid confinement to the generating electrode experimentally, we had an array of 650-nm electrodes manufactured and performed a fluorescence assay using fluorophore-labeled phosphoramidites. ... Strict confinement of fluorescence to the activated electrodes indicated that the electro-chemically generated acid was similarly confined. To demonstrate independent control of the anodes, we parallelized the synthesis process to generate two different sequences as envisioned in Fig. 2E: AAA-fluorescein for one anode (green) and AAA-AquaPhluor for a second (red). The resulting array showed the two fluorophores confined to their respective electrodes (Fig. 2F).
Since DNA, like all raw storage media, is noisy, the key consideration is whether the noise level is low enough for error-correcting codes to work. Nguyen et al report that they:
maintained error rates compatible with DNA data storage. The synthesized oligos contained deletion, substitution, and insertion errors at a cumulative rate of 4 to 8% per synthesized base. The observed error rates are well within the acceptable range for modern error correcting codes for data storage in DNA, where average error rates as high as 15% have been shown to be tolerable.
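To see why a 4-8% raw error rate is tolerable, it helps to see redundancy absorb errors. The sketch below is purely illustrative: real DNA storage systems use Reed-Solomon or fountain codes and must handle insertions and deletions, whereas this toy models substitution errors only and uses the crudest possible redundancy, nine reads per position plus majority vote.

```python
# Illustrative only: real DNA storage uses Reed-Solomon/fountain codes and
# handles indels. This toy shows that even crude redundancy (9 copies plus
# per-position majority vote) drives an 8% substitution error rate down by
# orders of magnitude.
import random
from collections import Counter

random.seed(1)
BASES = "ACGT"
N, COPIES, P_ERR = 10_000, 9, 0.08  # positions, reads/position, raw error rate

truth = [random.choice(BASES) for _ in range(N)]

def read(base):
    """Return the base, substituted with probability P_ERR."""
    if random.random() < P_ERR:
        return random.choice([b for b in BASES if b != base])
    return base

reads = [[read(b) for b in truth] for _ in range(COPIES)]
consensus = [Counter(col).most_common(1)[0][0] for col in zip(*reads)]

raw = sum(r != t for r, t in zip(reads[0], truth)) / N
fixed = sum(c != t for c, t in zip(consensus, truth)) / N
print(f"raw error rate {raw:.3f}, after consensus {fixed:.5f}")
```

Proper codes achieve far better efficiency than nine-fold repetition, which is how average error rates as high as 15% can be tolerated at reasonable overhead.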
The authors are optimistic about further scaling:
While the electrode densities used in these experiments were limited by the 130-nm process node used to produce the microelectrode array, we project that the technology will scale further to billions of features per square centimeter, enabling synthesis throughput to reach megabytes-per-second levels in a single write module, competitive with the write throughput of other storage devices. As an additional benefit, since synthesis operations within each module happen in parallel, increases in the synthesis density amortize the cost of reagents across more reaction sites and substantially reduce the cost per DNA sequence.
Is this a solution to the problems I outlined in DNA's Niche in the Storage Market? DNA is only ever going to address the large enterprise archival storage market. As Facebook's designs show, there are two important attributes of storage media in this space: low cost per bit, and high write bandwidth. Facebook implements a two-level archival storage architecture, with hard drives above optical drives. Typical enterprise hard drives have sustained write bandwidths above 250MB/s, and Facebook writes to many of them in parallel. Of course, these DNA writers could work in parallel too, but to get from the projected low single-digit megabytes per second to 250MB/s requires a large number of writers just to match the write bandwidth of a single disk.
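The arithmetic behind that last point is straightforward. The 1MB/s per-module figure below is my reading of the paper's "megabytes per second" projection taken at its low end; the 250MB/s disk figure is from the discussion above.

```python
import math

# How many parallel DNA write modules does it take to match one enterprise
# hard drive? The per-module rate is an assumption: the low end of the
# paper's projected "megabytes per second" at full scale.
disk_MBps = 250        # sustained write bandwidth of one enterprise drive
module_MBps = 1.0      # assumed per-module DNA write throughput

modules_per_disk = math.ceil(disk_MBps / module_MBps)
print(f"{modules_per_disk} write modules per disk-equivalent")
```

And that is to match a single drive, where an archival installation writes to many drives in parallel.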

So, the team has made really impressive progress, but they're still a long way from being competitive in their market niche.


Geoff said...

Promising indeed. You might want to spell-check this for posterity.

David. said...

Thanks for catching the typos, Geoff!