Tuesday, June 22, 2021

DNA Data Storage: A Different Approach

Last month I continued my blogging about storing data in DNA with an update on the work of the University of Washington/Microsoft Molecular Information Systems Lab (MISL). They are not the only group working on this technology. John Timmer's "A DNA-based storage system with files and metadata" discusses "Random access DNA memory using Boolean search in an archival file storage system" by James L. Banal et al. from MIT and the Broad Institute. Their abstract reads:
DNA is an ultrahigh-density storage medium that could meet exponentially growing worldwide demand for archival data storage if DNA synthesis costs declined sufficiently and if random access of files within exabyte-to-yottabyte-scale DNA data pools were feasible. Here, we demonstrate a path to overcome the second barrier by encapsulating data-encoding DNA file sequences within impervious silica capsules that are surface labelled with single-stranded DNA barcodes. Barcodes are chosen to represent file metadata, enabling selection of sets of files with Boolean logic directly, without use of amplification. We demonstrate random access of image files from a prototypical 2-kilobyte image database using fluorescence sorting with selection sensitivity of one in 10^6 files, which thereby enables one in 10^(6N) selection capability using N optical channels. Our strategy thereby offers a scalable concept for random access of archival files in large-scale molecular datasets.
Below the fold, some commentary on this significantly different approach to accessing DNA databases.

The approach used by, among others, MISL of encoding both data and metadata tags in short DNA sequences has two problems the MIT/Broad team are trying to address, as Timmer explains:
First, the amplification step, done using a process called PCR, has limits on the size of the sequence that can be amplified. And each tag takes up some of that limited space, so adding more detailed tags (as might be needed for a complicated file system) cuts into the amount of space for data.

The other limit is that the PCR reaction that amplifies specific pieces of data-containing DNA consumes some of the original library of DNA. In other words, each time you pull out some data, you destroy piles of unrelated data. Access data often enough and you'll end up burning through the entire repository. While there are ways to re-amplify everything, each time this is done, it increases the chance of introducing an error.
The MIT/Broad approach physically separates the DNA used to encode the data from the DNA used to encode the metadata tags, so that they don't compete for space in the DNA strands. And it allows many different tag strands to be applied to a single piece of encoded data, so that the tags do not compete with each other for space either.

In essence, the process that creates each entry in the database is as follows:
  • The data is encoded in DNA strands.
  • The data strands are encapsulated in tiny silica capsules.
  • The metadata is encoded in DNA strands.
  • The metadata strands are applied to the outside of the silica capsules, labelling their content.
  • The labelled capsules are encapsulated a second time to protect the metadata strands.
The labelled, double-encapsulated database entry, consisting of at least a million tiny silica capsules, can then be combined with all the other entries.

In essence, the process that queries the database for entries with appropriate metadata tags is as follows:
  • For each metadata tag in the query, DNA strands that will base-pair with it are synthesized.
  • Each metadata tag in the query is assigned a color, and a molecule that fluoresces in that color is attached to the tag's query strands.
  • A sample of the database is extracted.
  • The outer encapsulation is removed, exposing the metadata tags.
  • The query strands are mixed with the database sample, so that they base-pair with the corresponding tags.
  • Machines that sort cells based on fluorescence colors are used to separate the capsules which have query tags from those which don't.
  • The capsules that lack query tags are re-encapsulated and returned to the database.
  • The capsules with query tags are de-encapsulated, then the data DNA is sequenced and decoded to form the answer to the query.
  • The data is copied, encapsulated, tagged, encapsulated and returned to the database.
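The selection step in the query process above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not the MIT/Broad team's method: the `Capsule` class, the tag names, the colors and the `gate` predicate are all hypothetical, and the physical steps (de-encapsulation, base-pairing, fluorescence sorting) are reduced to set membership tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capsule:
    """Hypothetical model of one silica capsule."""
    payload: str          # stands in for the encapsulated data DNA
    barcodes: frozenset   # metadata tags attached to the capsule surface


def label(capsule, probes):
    """Return the set of colors that would light up on this capsule.

    `probes` maps a metadata tag to the color of its fluorescent query
    strands; a color appears only if the matching barcode is present.
    """
    return {color for tag, color in probes.items() if tag in capsule.barcodes}


def sort_capsules(pool, probes, gate):
    """Split the pool into (selected, returned) using a gate on the colors."""
    selected, returned = [], []
    for capsule in pool:
        target = selected if gate(label(capsule, probes)) else returned
        target.append(capsule)
    return selected, returned


# Illustrative pool; the tags and payload names are made up.
pool = [
    Capsule("img-cat-1", frozenset({"cat", "orange"})),
    Capsule("img-cat-2", frozenset({"cat", "black"})),
    Capsule("img-dog-1", frozenset({"dog", "black"})),
]
probes = {"cat": "red", "black": "green"}

# Boolean query "cat AND NOT black": red present, green absent.
selected, returned = sort_capsules(
    pool, probes, lambda colors: "red" in colors and "green" not in colors
)
print([c.payload for c in selected])  # ['img-cat-1']
```

In this model the capsules in `returned` correspond to those that would be re-encapsulated and put back in the database, while those in `selected` would be opened and sequenced.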
This sounds good but, as Timmer points out, it is slow:
While this research represents a significant leap in complexity for DNA-based storage, it's still just DNA-based storage. That means it's slow on a scale that makes even tape drives seem quick. The researchers calculate that even if they crammed far more data into each glass bead, searches would start topping out at about 1GB of data per second. That would mean searching a petabyte of data would take a bit over two weeks.

And that's just finding the right glass beads. Cracking them open and getting the DNA into bacteria and then doing the sequencing needed to actually determine what is stored in the bead would likely add a couple of days to the process.
Three years ago Organick et al encoded 200MB of data in DNA, but the MIT/Broad team report their experiment used only a tiny amount of data:
As a proof-of-principle of our archival DNA file system, we encapsulated 20 image files, each composed of a ~0.1 kilobyte image file encoded in a 3,000-base-pair plasmid
They write:
our file system may in principle be scaled to considerably larger sets of images, limited primarily by the cost of DNA synthesis and the need to develop strategies for high-throughput silica encapsulation of distinct file sequences and surface-based DNA labelling for barcoding ... Because physical encapsulation separates file sequences from external barcodes that are used to describe the encapsulated information, our file system offers long-term environmental protection of encoded file sequences via silica encapsulation for permanent archival storage, where external barcodes may be renewed periodically, further protected with secondary encapsulation, or data pools may simply be stored using methods implemented in PCR-based random access, such as dehydrating the data pool and immersing the dried molecular database in oil.
The UW/Microsoft team used even less data for their "Demonstration of End-to-End Automation of DNA Data Storage" but, in contrast, it was fully automated and relatively cheap:
It has a bench-top footprint and costs approximately $10 k USD, though careful calibration and elimination of costly sensors and actuators could reduce its cost to approximately $3 k–4 k USD at low volumes.
Our system’s write-to-read latency is approximately 21 h. The majority of this time is taken by synthesis, viz., approximately 305 s per base, or 8.4 h to synthesize a 99-mer payload and 12 h to cleave and deprotect the oligonucleotides at room temperature. After synthesis, preparation takes an additional 30 min, and nanopore reading and online decoding take 6 min.

Using this prototype system, we stored and subsequently retrieved the 5-byte message “HELLO” (01001000 01000101 01001100 01001100 01001111 in bits).
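The figures in this quote can be checked with a little arithmetic. The sketch below uses only the numbers quoted above; the variable names are mine, and it assumes the message is encoded as plain ASCII, which matches the bit pattern given.

```python
# Quoted timings from the UW/Microsoft paper.
SECONDS_PER_BASE = 305   # synthesis time per base
PAYLOAD_BASES = 99       # length of the 99-mer payload

synthesis_h = SECONDS_PER_BASE * PAYLOAD_BASES / 3600  # seconds to hours
cleave_h = 12.0          # cleave and deprotect at room temperature
prep_h = 30 / 60         # preparation after synthesis
read_h = 6 / 60          # nanopore reading and online decoding

total_h = synthesis_h + cleave_h + prep_h + read_h
print(f"synthesis: {synthesis_h:.1f} h, total: {total_h:.1f} h")
# synthesis: 8.4 h, total: 21.0 h

# The 5-byte message as space-separated 8-bit groups (ASCII assumed).
message = "HELLO"
bits = " ".join(f"{byte:08b}" for byte in message.encode("ascii"))
print(bits)
# 01001000 01000101 01001100 01001100 01001111
```

Synthesis plus cleavage accounts for about 20.4 of the 21 hours, which is why the quote singles them out as the majority of the latency.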
Neither of these technologies is anywhere close to ready for prime time. Both are hamstrung by the rate at which DNA sequences can be generated, and the MIT/Broad approach lacks both automation and technology for production-level encapsulation.
