Tuesday, October 11, 2022

The "DNA Typewriter"

It is time to catch up on a few developments in the field of storing data via chemicals, such as DNA. Below the fold I discuss a half-dozen recent reports.

First is DNA typewriter taps out messages inside cells from the Howard Hughes Medical Institute about two papers from researchers they support at the University of Washington, but separate from the U. Washington/Microsoft team that has pioneered so much of the work in DNA storage for computers. This team is working on data storage for biological organisms.
The DNA typewriter isn't the first of its kind, however, it has a combination of attributes that make it particularly promising for illuminating the biology of cells. Namely, it can capture a large number of events while documenting them in chronological order.

"We have accomplished something that's analogous to writing," Shendure says. "We can create thousands of symbols, which we call barcodes, and we can capture them in order."

The researchers have so far adapted the system to accommodate as many as 4,096 barcodes, which are short pieces of DNA. Like the original machine tapping out letters, the DNA typewriter lays down only one barcode at a time, from left to right.
The report gives an example of how the technique can be used:
In another experiment, Shendure's team turned their recorder on the cells themselves. By tagging dividing cells with barcodes, they tracked cell divisions. After 25 days, in which one cell gave rise to about 1.2 million, the researchers analyzed the patterns in the cells' barcodes to reconstruct their family tree.
The first paper describing the technique is A time-resolved, multi-symbol molecular recorder via sequential genome editing by Junhong Choi et al. From their abstract:
For DNA Typewriter, the blank recording medium (‘DNA Tape’) consists of a tandem array of partial CRISPR–Cas9 target sites, with all but the first site truncated at their 5′ ends and therefore inactive. Short insertional edits serve as symbols that record the identity of the prime editing guide RNA mediating the edit while also shifting the position of the ‘type guide’ by one unit along the DNA Tape, that is, sequential genome editing. In this proof of concept of DNA Typewriter, we demonstrate recording and decoding of thousands of symbols, complex event histories and short text messages; evaluate the performance of dozens of orthogonal tapes; and construct ‘long tape’ potentially capable of recording as many as 20 serial events.
The "blank tape" looks like this (vertical bars delineate the sites):
The "spacer" that is the target for the CRISPR-Cas9 insertion looks like this:
The insertion happens between the ACG and the TGA at the end. The tape is prefixed by CGA thus activating the first site (plus indicates the "spacer"):
+++++++++++++++++ +++
The insertion consists of a 2 base-pair "barcode" NN followed by CGA:
                   +++++++++++++++++ +++
Note that the head of the tape no longer matches the target, since it lacks the trailing TGA, and that the first target on the tape is now past the "barcode", activated ready for the next insertion.
                                      +++++++++++++++++ +++
Thus the barcode data consists of 2 base pairs with 17 base pairs of overhead, or 10.5% of the DNA.
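To make the mechanism and the overhead figure concrete, here is a minimal sketch in Python. It models the tape abstractly as a row of sites with exactly one active at a time; it does not model the real sequences or the editing chemistry. The names `DnaTape` and `data_fraction` are mine, not the authors'. Note also that 4,096 equals 4**6, which would be consistent with six-base barcodes, though nothing quoted above says so.

```python
from dataclasses import dataclass, field


@dataclass
class DnaTape:
    """Abstract model (an assumption, not the paper's code): n_sites slots,
    only one of which is active and writable at any time."""
    n_sites: int
    barcodes: list = field(default_factory=list)

    @property
    def active_site(self):
        # Index of the single active site, or None once the tape is full.
        used = len(self.barcodes)
        return used if used < self.n_sites else None

    def write(self, barcode: str) -> bool:
        # Each insertion records a barcode and shifts the active site by one.
        if self.active_site is None:
            return False
        self.barcodes.append(barcode)
        return True


def data_fraction(barcode_bp: int, overhead_bp: int) -> float:
    """Fraction of each tape unit that carries barcode data."""
    return barcode_bp / (barcode_bp + overhead_bp)


tape = DnaTape(n_sites=3)
results = [tape.write(b) for b in ["AC", "GT", "TT", "AA"]]  # fourth write fails
symbols_2bp = 4 ** 2   # distinct 2-base barcodes
symbols_6bp = 4 ** 6   # 4,096
fraction = data_fraction(2, 17)  # the 2 bp data / 17 bp overhead figure above
```

The order of `tape.barcodes` is the order of the writes, which is the property that distinguishes this recorder from ones that only count events.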

The authors tested their technique by using it to track many generations of cell division:
Finally, we leverage DNA Typewriter in conjunction with single-cell RNA-seq to reconstruct a monophyletic lineage of 3,257 cells and find that the Poisson-like accumulation of sequential edits to multicopy DNA tape can be maintained across at least 20 generations and 25 days of in vitro clonal expansion.
The report describes the second paper:
In a study published in Nature, his team describes a strategy for genetically engineering cells so that the genetic code for a molecule that interests scientists carries with it the instructions for a barcode. When the molecule turns on within a cell, it switches on the barcode too. Then the typewriter goes to work, documenting the molecule's activity.
This second paper is Multiplex genomic recording of enhancer and signal transduction activity in mammalian cells by Wei Chen et al. They call the way they apply the "DNA Typewriter" ENGRAM (ENhancer-driven Genomic Recording of transcriptional Activity in Multiplex):
ENGRAM is based on the prime editing-mediated insertion of signal- or enhancer-specific barcodes to a genomically encoded recording unit. We show how this strategy can be used to concurrently genomically record the relative activity of at least hundreds of enhancers with high fidelity, sensitivity and reproducibility. Leveraging synthetic enhancers that are responsive to specific signal transduction pathways, we further demonstrate time- and concentration-dependent genomic recording of Wnt, NF-κB, and Tet-On activity. Finally, by coupling ENGRAM to sequential genome editing, we show how serially occurring molecular events can potentially be ordered. Looking forward, we envision that multiplex, ENGRAM-based recording of the strength, duration and order of enhancer and signal transduction activities has broad potential for application in functional genomics, developmental biology and neuroscience.
In other words, they add to each enhancer the ability to write its specific barcode to the cell's DNA tape, recording not merely that the enhancer was active, but the sequence of enhancer activity along the DNA tape.

On another aspect of DNA data storage, in Why are hard drive companies investing in DNA data storage? John Timmer reports on a new partnership between Seagate and Catalog, the DNA data storage company I first discussed 3.5 years ago in Cost-Reducing Writing DNA Data. Earlier this year I wrote about Catalog in Storage Update: Part 1:
who encode data not in individual bases, but in short strands of pre-synthesized DNA. The idea is to sacrifice ultimate density for write speed. David Turek reported that, by using conventional ink-jet heads to print successive strands on dots on a polymer tape, they have demonstrated writing at 1Mb/s.
This is important because the low write bandwidth is a major constraint on the usefulness of DNA storage. This is especially true in the enterprise archival market; Kestutis Patiejunas stressed the importance of write bandwidth in the talk I reported in More on Facebook's "Cold Storage":
Nevertheless, this design has high performance where it matters to Facebook, in write bandwidth. While a group of disks is spun up, any reads queued up for that group are performed. But almost all the I/O operations to this design are writes. Writes are erasure-coded, and the shards all written to different disks in the same group. In this way, while a group is spun up, all disks in the group are writing simultaneously providing huge write bandwidth. When the group is spun down, the disks in the next group take over, and the high write bandwidth is only briefly interrupted.
Four years ago, in DNA's Niche in the Storage Market, I pointed out that the cost of writing DNA was a major problem for its use in archival storage. Catalog CEO Hyunjun Park agrees:
Citing a DNA synthesis cost of about .03 cents per base, Park said, ".03 cents times two bits per base pair times, say, gigabytes—that's a lot of money. That's millions of dollars."
At $0.0012 per byte, writing a petabyte would cost $1,200B. As I predicted, enterprise archiving in DNA is a tough sell:
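The arithmetic behind these numbers can be checked with a short Python sketch. It takes Park's figures (0.03 cents per base, two bits per base) at face value and assumes decimal units (1 GB = 10**9 bytes, 1 PB = 10**15 bytes); the function name is mine.

```python
CENTS_PER_BASE = 0.03   # Park's quoted synthesis cost
BITS_PER_BASE = 2       # Park's quoted encoding density


def synthesis_cost_usd(n_bytes: int) -> float:
    """Cost in dollars to synthesize enough bases to hold n_bytes."""
    bases = n_bytes * 8 / BITS_PER_BASE       # 4 bases per byte
    return bases * CENTS_PER_BASE / 100       # cents -> dollars


per_byte = synthesis_cost_usd(1)           # about $0.0012
per_gigabyte = synthesis_cost_usd(10**9)   # about $1.2 million
per_petabyte = synthesis_cost_usd(10**15)  # about $1.2 trillion, i.e. $1,200B
```

The gigabyte figure matches Park's "millions of dollars"; the petabyte figure is six orders of magnitude larger.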
But the company found that potential customers were less interested in archiving than Catalog expected. "We've been speaking with companies like Seagate and other companies in the entertainment industry or gas, tech—a lot of very large companies with big data problems and challenges. And we saw that it's not just the cold storage aspect of this that's interesting to them."

Instead, Park found that people were intrigued by the prospect that DNA could allow massively parallel operations on the stored data without the need to convert it back to digital form—Park cited massively parallel database searches and digital signal processing as potential applications. "We want to create a new tier of computational storage, where it supports massive data sizes but is also very much searchable and computable," Park said.
But computation in DNA only makes sense after writing massive amounts of data:
"You need to be able to have the ability to store a lot of information to DNA before DNA base computation makes sense," Park said, because traditional computers will chew through smaller amounts of data without hitting bottlenecks. DNA storage only comes into its own because it can handle massive parallelism better. "[If] you're trying to compute on say, a megabyte of data stored in DNA, the time or resources it would take to do that would be, say, on par with the time it would take to compute on a petabyte of data stored in DNA," he said.
There is competition for the computational storage market, as Dan Robinson reports in Computational storage specs hit v1.0 after 4 years of work:
The Storage Networking Industry Association (SNIA) has at last published version 1.0 of its Computational Storage Architecture and Programming Model, the specs meant to help develop the new performance-boosting tech by providing interoperability between different vendors.

Computational storage covers hardware and software architectures in which compute is more tightly coupled with storage at the system and drive level, partly driven by trends such as the growing volumes of data that organizations find themselves dealing with.

A typical implementation might be an SSD with some embedded compute capability, which might vary from an FPGA designed to offload compression, encryption, and erasure coding from the host CPU, to an Arm CPU capable of running Linux and custom code.
Four years is actually rapid progress in this area. I believe the beginning of computational storage was a 2009 paper entitled FAWN: A Fast Array of Wimpy Nodes, in which David Andersen and his co-authors from CMU showed that a network of large numbers of small CPUs coupled with modest amounts of flash memory could process key-value queries at the same speed as the networks of beefy servers used by, for example, Google, but using 2 orders of magnitude less power.

There is, however, better news on the side of DNA storage that isn't the economic problem: reading. In The era of fast, cheap genome sequencing is here, Emily Mullin reports that:
Illumina unveiled what it calls its fastest, most cost-efficient sequencing machines yet, the NovaSeq X series. The company, which controls around 80 percent of the DNA sequencing market globally, believes its new technology will slash the cost to just $200 per human genome while providing a readout at twice the speed. Francis deSouza, Illumina’s CEO, says the more powerful model will be able to sequence 20,000 genomes per year; its current machines can do about 7,500. Illumina will start selling the new machines today and ship them next year.
DNA is far from the only molecule that can be used to store data. Ink filled with secret molecules can hide encryption key in a letter by Chris Stokel-Walker reports on another technique that is more about steganography than storage, since they only stored 256 bits:
The team generated a 256-character [sic] cipher key to encrypt and decrypt files using the Advanced Encryption Standard (AES), a common cryptography method. The group then encoded the cipher key into eight oligourethanes, a type of polymer.
Each polymer was made up of 10 smaller compounds called monomers. The middle eight monomers hold the details of the key, with one monomer on each side acting as the key’s synthesiser and decoder.

The team mixed the polymers with isopropanol, glycerol and soot to make an ink. Using the ink, some of the researchers wrote a letter and then sent it to another person on the team. The group then extracted a sample of the ink and used the cipher key to unlock an encrypted file: the entire text of the novel The Wonderful Wizard of Oz by L. Frank Baum.
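The numbers in this report imply how much each monomer has to carry. A 256-bit key spread over eight polymers with eight data monomers each works out to four bits per monomer, which would need an alphabet of 16 distinct monomers. The Python sketch below checks that arithmetic and shows one hypothetical way to split a key into eight chunks of eight hexadecimal digits. Treating each monomer as one hex digit is my assumption for illustration, not something the report states.

```python
KEY_BITS = 256
POLYMERS = 8
DATA_MONOMERS_PER_POLYMER = 8   # the "middle eight" of each 10-monomer chain

data_monomers = POLYMERS * DATA_MONOMERS_PER_POLYMER   # 64
bits_per_monomer = KEY_BITS // data_monomers           # 4
alphabet_size = 2 ** bits_per_monomer                  # 16


def split_key(key: bytes) -> list:
    """Split a 32-byte key into 8 chunks of 8 hex digits (one digit per
    monomer under the hypothetical hex mapping)."""
    if len(key) * 8 != KEY_BITS:
        raise ValueError("expected a 256-bit key")
    digits = key.hex()
    step = DATA_MONOMERS_PER_POLYMER
    return [digits[i:i + step] for i in range(0, len(digits), step)]


def join_chunks(chunks: list) -> bytes:
    """Reassemble the key from its chunks, in order."""
    return bytes.fromhex("".join(chunks))


example_key = bytes(range(32))          # a fixed, non-secret demo key
chunks = split_key(example_key)
recovered = join_chunks(chunks)
```

Reassembly depends on knowing the order of the eight chunks, so some per-polymer addressing would be needed in practice.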
The whole Wizard of Oz thing is just a shiny object to distract from the fact they stored just 256 bits. The researchers then go off into the typical irrelevant chemical data storage hype about density:
For Anslyn, the ability to hide information is secondary. His main goal is storing data. The information density within molecules in polymers, which include DNA, is higher than it would be through encoded magnetic spots on a hard disc. “The information density is mind-boggling,” he says.
This focus on density instead of the actual economics of storing data is the bane of discussions of chemical data storage. For example, it is true that Catalog's data storage density is much higher than current magnetic media, but writing it involves a machine the size of a kitchen.


Tardigrade said...

As a person who works in biology I've been uninterested in, and quizzical about, DNA data storage. But the first part of your post, where the writing is used to record physiological processes, is intriguing.

In the vein of the non-DNA polymer I assume researchers are considering expanding the DNA nucleosides used for DNA printing? Each additional pair would double the information content of each bit of the chain. There are also tougher DNA backbones than the phosphodiester backbone, as well as alternative sugars (though deoxyribose is probably pretty tough on its own, at least compared to ribose).

There are biomolecules such as ribosomes and polyketide synthetases that polymerize and (in the latter case) tailor amino acids. This is in addition to chemical polymerization as done by most of the DNA printers. Appropriately folding amino acid chains, while not as structurally similar as DNA chains are to each other, might also be amenable to parallelization. And have the benefit of over 20 commonly used side chains in organisms, along with many other alternatives.

I suppose DNA (and RNA) itself can also be chemically modified (as seen in epigenetic modifications), which would expand the coding or computational power of a strand. I'm not aware of electronic storage devices which can compute (though this could probably be set up in something like flash memory). While molecular biologists have been fiddling with chemically and catalytically active RNA molecules for a while now.

I'm still not too interested in these technologies with respect to storage or data processing. But the technologies themselves are interesting.

Tardigrade said...

When it comes to long-term storage, proteins are pretty unbeatable, with Maillard-reaction protein residues lasting at least half a billion years:


"Although ancient DNA carries the most detailed biological information, it degrades most quickly. Relatively intact proteins can persist for nearly 4 million years and can still distinguish between closely related species. With protein residues, "the information is again reduced," Wiemann says. The polymers don't preserve 3D structure or a complete sequence of amino acids. But the compounds are incredibly stable, preserved "through deep time," Briggs says. Wiemann says they have identified protein residues in 500-million-year-old fossils from Canada's Burgess Shale in British Columbia."

Anonymous said...

Off-topic for this post, but I would be curious about your thoughts on Microsoft's Project Silica, a cold storage tech that they have been working on for years but recently (?) released some updates on: https://www.microsoft.com/en-us/research/project/project-silica/

David. said...

I wrote "This focus on density instead of the actual economics of storing data is the bane of discussions of chemical data storage." The fact is that technology specifically for archival storage is a really bad business, irrespective of the details of the technology:

- No-one wants to spend money on archival storage, so the "Total Available Market" is a minute fraction of the total storage market.
- Worse, in the small archival market a new technology burdened by its R&D costs has to compete with past-generation technology from the mainstream storage market whose R&D costs have already been amortized, so it can easily undercut the new technology in a very cost-sensitive market.

Thus investing in archival-specific technologies may be technically interesting but it is a loser financially.