The DNA typewriter isn't the first of its kind; however, it has a combination of attributes that make it particularly promising for illuminating the biology of cells. Namely, it can capture a large number of events while documenting them in chronological order.
"We have accomplished something that's analogous to writing," Shendure says. "We can create thousands of symbols, which we call barcodes, and we can capture them in order."
The researchers have so far adapted the system to accommodate as many as 4,096 barcodes, which are short pieces of DNA. Like the original machine tapping out letters, the DNA typewriter lays down only one barcode at a time, from left to right.
The report gives an example of how the technique can be used:
In another experiment, Shendure's team turned their recorder on the cells themselves. By tagging dividing cells with barcodes, they tracked cell divisions. After 25 days, in which one cell gave rise to about 1.2 million, the researchers analyzed the patterns in the cells' barcodes to reconstruct their family tree.
The first paper describing the technique is "A time-resolved, multi-symbol molecular recorder via sequential genome editing" by Junhong Choi et al. From their abstract:
For DNA Typewriter, the blank recording medium (‘DNA Tape’) consists of a tandem array of partial CRISPR–Cas9 target sites, with all but the first site truncated at their 5′ ends and therefore inactive. Short insertional edits serve as symbols that record the identity of the prime editing guide RNA mediating the edit while also shifting the position of the ‘type guide’ by one unit along the DNA Tape, that is, sequential genome editing. In this proof of concept of DNA Typewriter, we demonstrate recording and decoding of thousands of symbols, complex event histories and short text messages; evaluate the performance of dozens of orthogonal tapes; and construct ‘long tape’ potentially capable of recording as many as 20 serial events.
The "blank tape" looks like this (vertical bars delineate the sites):
TGATGGTGAGCACG|TGATGGTGAGCACG|TGATGGTGAGCACG|TGATGGTGAGCACG|...
The "spacer" that is the target for the CRISPR-Cas9 insertion looks like this:
CGATGATGGTGAGCACGTGA
The insertion happens between the ACG and the TGA at the end. The tape is prefixed by CGA, thus activating the first site (plus indicates the "spacer"):
CGATGATGGTGAGCACG|TGATGGTGAGCACG|TGATGGTGAGCACG|TGATGGTGAGCACG|...
+++++++++++++++++ +++
The insertion consists of a 2 base-pair "barcode" NN followed by CGA:
CGATGATGGTGAGCACGNNCGATGATGGTGAGCACG|TGATGGTGAGCACG|TGATGGTGAGCACG|...
                   +++++++++++++++++ +++
Note that the head of the tape no longer matches the target, since it lacks the trailing TGA, and that the first target on the tape is now past the "barcode", activated ready for the next insertion. After that next insertion:
CGATGATGGTGAGCACGNNCGATGATGGTGAGCACGNNCGATGATGGTGAGCACG|TGATGGTGAGCACG|...
                                      +++++++++++++++++ +++
Thus each recorded event consists of 2 base pairs of barcode data with 17 base pairs of overhead, so the data is 2/19, or about 10.5%, of the DNA.
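The bookkeeping in the diagrams above can be sketched in a few lines of Python. This is only a string-level toy model under stated assumptions: the tape is a plain string, every edit succeeds, and the names `blank_tape`, `write` and `read` are mine, not the authors'. It says nothing about the actual prime editing chemistry.

```python
# Toy string model of the sequential insertions shown above.
# Assumptions: perfect edits, no errors; function names are hypothetical.

KEY = "CGA"                    # 3-base prefix that completes a target site
UNIT = "TGATGGTGAGCACG"        # 14-base repeated unit of the blank tape
SPACER = KEY + UNIT + "TGA"    # 20-base target: CGATGATGGTGAGCACGTGA
CUT = len(KEY) + len(UNIT)     # insertion point, between ACG and TGA
BARCODE_LEN = 2


def blank_tape(units: int) -> str:
    """A tape of `units` repeats with the first site activated by the CGA prefix."""
    return KEY + UNIT * units


def write(tape: str, barcode: str) -> str:
    """Insert barcode + CGA at the single active site, activating the next one."""
    if len(barcode) != BARCODE_LEN or set(barcode) - set("ACGT"):
        raise ValueError("barcode must be two bases from A, C, G, T")
    site = tape.find(SPACER)
    if site < 0:
        # In this toy model the last unit has no following TGA, so it
        # can never be written.
        raise ValueError("no active site left on the tape")
    cut = site + CUT
    return tape[:cut] + barcode + KEY + tape[cut:]


def read(tape: str) -> list[str]:
    """Recover the barcodes in the order they were written (left to right)."""
    step = CUT + BARCODE_LEN   # 19 bases per recorded event
    barcodes, pos = [], 0
    while (tape[pos:pos + CUT] == KEY + UNIT
           and tape[pos + step:pos + step + len(KEY)] == KEY):
        barcodes.append(tape[pos + CUT:pos + step])
        pos += step
    return barcodes


def data_fraction() -> float:
    """Fraction of each recorded event that is barcode data (2 of 19 bases)."""
    return BARCODE_LEN / (CUT + BARCODE_LEN)


if __name__ == "__main__":
    tape = blank_tape(5)
    for symbol in ["AC", "GT", "TT"]:
        tape = write(tape, symbol)
    print(tape)
    print(read(tape), f"{data_fraction():.1%}")
```

Because a written site loses its trailing TGA, `find` can only ever locate the one active site, which is what forces the left-to-right order.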
The authors tested their technique by using it to track many generations of cell division:
Finally, we leverage DNA Typewriter in conjunction with single-cell RNA-seq to reconstruct a monophyletic lineage of 3,257 cells and find that the Poisson-like accumulation of sequential edits to multicopy DNA tape can be maintained across at least 20 generations and 25 days of in vitro clonal expansion.
The report describes the second paper:
In a study published in Nature, his team describes a strategy for genetically engineering cells so that the genetic code for a molecule that interests scientists carries with it the instructions for a barcode. When the molecule turns on within a cell, it switches on the barcode too. Then the typewriter goes to work, documenting the molecule's activity.
This second paper is "Multiplex genomic recording of enhancer and signal transduction activity in mammalian cells" by Wei Chen et al. They call the way they apply the "DNA Typewriter" ENGRAM (ENhancer-driven Genomic Recording of transcriptional Activity in Multiplex):
ENGRAM is based on the prime editing-mediated insertion of signal- or enhancer-specific barcodes to a genomically encoded recording unit. We show how this strategy can be used to concurrently genomically record the relative activity of at least hundreds of enhancers with high fidelity, sensitivity and reproducibility. Leveraging synthetic enhancers that are responsive to specific signal transduction pathways, we further demonstrate time- and concentration-dependent genomic recording of Wnt, NF-κB, and Tet-On activity. Finally, by coupling ENGRAM to sequential genome editing, we show how serially occurring molecular events can potentially be ordered. Looking forward, we envision that multiplex, ENGRAM-based recording of the strength, duration and order of enhancer and signal transduction activities has broad potential for application in functional genomics, developmental biology and neuroscience.
In other words, they add to each enhancer the ability to write its specific barcode to the cell's DNA tape, recording not merely that the enhancer was active, but the sequence of enhancer activity along the DNA tape.
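As a rough illustration of that idea, here is a toy Python model in which each signal owns a barcode and the tape is just an ordered list. The signal names come from the abstract, but the barcode assignments, the function names and the list-based tape are all invented for this sketch. Note that 4**6 = 4096, so the 4,096-barcode figure quoted earlier would be consistent with six-base barcodes, though the quoted text doesn't say what length was used.

```python
# Toy model of ENGRAM-style recording onto an ordered tape.
# Assumptions: the barcode assignments, function names and list-based
# tape are invented for illustration; only the signal names are quoted.

SIGNAL_BARCODES = {"Wnt": "AC", "NF-kB": "GT", "Tet-On": "TA"}
BARCODE_SIGNALS = {code: name for name, code in SIGNAL_BARCODES.items()}


def barcode_space(length: int) -> int:
    """Number of distinct barcodes of a given length over A, C, G, T."""
    return 4 ** length


def record(tape: list[str], signal: str, capacity: int) -> bool:
    """Append the signal's barcode at the next free position, if any."""
    if len(tape) >= capacity:
        return False           # tape is full; the event goes unrecorded
    tape.append(SIGNAL_BARCODES[signal])
    return True


def decode(tape: list[str]) -> list[str]:
    """Translate barcodes back to signal names, preserving write order."""
    return [BARCODE_SIGNALS[code] for code in tape]


if __name__ == "__main__":
    tape: list[str] = []
    for event in ["Wnt", "Tet-On", "Wnt"]:
        record(tape, event, capacity=20)
    print(decode(tape), barcode_space(2), barcode_space(6))
```

The point of the list is that position carries the ordering for free; an unordered recorder would only yield counts per barcode.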
Turning to another aspect of DNA data storage, in "Why are hard drive companies investing in DNA data storage?" John Timmer reports on a new partnership between Seagate and Catalog, the DNA data storage company I first discussed 3.5 years ago in "Cost-Reducing Writing DNA Data". Earlier this year I wrote about Catalog in "Storage Update: Part 1":
who encode data not in individual bases, but in short strands of pre-synthesized DNA. The idea is to sacrifice ultimate density for write speed. David Turek reported that, by using conventional ink-jet heads to print successive strands on dots on a polymer tape, they have demonstrated writing at 1Mb/s.
This is important because the low write bandwidth is a major constraint on the usefulness of DNA storage. This is especially true in the enterprise archival market; Kestutis Patiejunas stressed the importance of write bandwidth in the talk I reported in "More on Facebook's 'Cold Storage'":
Nevertheless, this design has high performance where it matters to Facebook, in write bandwidth. While a group of disks is spun up, any reads queued up for that group are performed. But almost all the I/O operations to this design are writes. Writes are erasure-coded, and the shards all written to different disks in the same group. In this way, while a group is spun up, all disks in the group are writing simultaneously providing huge write bandwidth. When the group is spun down, the disks in the next group take over, and the high write bandwidth is only briefly interrupted.
Four years ago, in "DNA's Niche in the Storage Market", I pointed out that the cost of writing DNA was a major problem for its use in archival storage. Catalog CEO Hyunjun Park agrees:
Citing a DNA synthesis cost of about .03 cents per base, Park said, ".03 cents times two bits per base pair times, say, gigabytes; that's a lot of money. That's millions of dollars."
At two bits per base a byte takes four bases, or $0.0012, so writing a petabyte would cost $1,200B. As I predicted, enterprise archiving in DNA is a tough sell:
But the company found that potential customers were less interested in archiving than Catalog expected. "We've been speaking with companies like Seagate and other companies in the entertainment industry or gas, tech—a lot of very large companies with big data problems and challenges. And we saw that it's not just the cold storage aspect of this that's interesting to them."
But computation in DNA only makes sense after writing massive amounts of data:
Instead, Park found that people were intrigued by the prospect that DNA could allow massively parallel operations on the stored data without the need to convert it back to digital form—Park cited massively parallel database searches and digital signal processing as potential applications. "We want to create a new tier of computational storage, where it supports massive data sizes but is also very much searchable and computable," Park said.
"You need to be able to have the ability to store a lot of information to DNA before DNA-based computation makes sense," Park said, because traditional computers will chew through smaller amounts of data without hitting bottlenecks. DNA storage only comes into its own because it can handle massive parallelism better. "[If] you're trying to compute on say, a megabyte of data stored in DNA, the time or resources it would take to do that would be, say, on par with the time it would take to compute on a petabyte of data stored in DNA," he said.
There is competition for the computational storage market, as Dan Robinson reports in "Computational storage specs hit v1.0 after 4 years of work":
The Storage Networking Industry Association (SNIA) has at last published version 1.0 of its Computational Storage Architecture and Programming Model, the specs meant to help develop the new performance-boosting tech by providing interoperability between different vendors.
Computational storage covers hardware and software architectures in which compute is more tightly coupled with storage at the system and drive level, partly driven by trends such as the growing volumes of data that organizations find themselves dealing with.
A typical implementation might be an SSD with some embedded compute capability, which might vary from an FPGA designed to offload compression, encryption, and erasure coding from the host CPU, to an Arm CPU capable of running Linux and custom code.
Four years is actually rapid progress in this area. I believe the beginning of computational storage was a 2009 paper entitled "FAWN: A Fast Array of Wimpy Nodes", in which David Andersen and his co-authors from C-MU showed that a network of large numbers of small CPUs coupled with modest amounts of flash memory could process key-value queries at the same speed as the networks of beefy servers used by, for example, Google, but using two orders of magnitude less power.
There is, however, better news on reading, the side of DNA storage that isn't the economic problem. In "The era of fast, cheap genome sequencing is here", Emily Mullin reports that:
Illumina unveiled what it calls its fastest, most cost-efficient sequencing machines yet, the NovaSeq X series. The company, which controls around 80 percent of the DNA sequencing market globally, believes its new technology will slash the cost to just $200 per human genome while providing a readout at twice the speed. Francis deSouza, Illumina’s CEO, says the more powerful model will be able to sequence 20,000 genomes per year; its current machines can do about 7,500. Illumina will start selling the new machines today and ship them next year.
DNA is far from the only molecule that can be used to store data. In "Ink filled with secret molecules can hide encryption key in a letter", Chris Stokel-Walker reports on another technique that is more about steganography than storage, since they only stored 256 bits:
The team generated a 256-character [sic] cipher key to encrypt and decrypt files using the Advanced Encryption Standard (AES), a common cryptography method. The group then encoded the cipher key into eight oligourethanes, a type of polymer.
Each polymer was made up of 10 smaller compounds called monomers. The middle eight monomers hold the details of the key, with one monomer on each side acting as the key’s synthesiser and decoder.
The team mixed the polymers with isopropanol, glycerol and soot to make an ink. Using the ink, some of the researchers wrote a letter and then sent it to another person on the team. The group then extracted a sample of the ink and used the cipher key to unlock an encrypted file: the entire text of the novel The Wonderful Wizard of Oz by L. Frank Baum.
The whole Wizard of Oz thing is just a shiny object to distract from the fact they stored just 256 bits. The researchers then go off into the typical irrelevant chemical data storage hype about density:
For Anslyn, the ability to hide information is secondary. His main goal is storing data. The information density within molecules in polymers, which include DNA, is higher than it would be through encoded magnetic spots on a hard disc. “The information density is mind-boggling,” he says.
This focus on density instead of the actual economics of storing data is the bane of discussions of chemical data storage. For example, it is true that Catalog's data storage density is much higher than current magnetic media, but writing it involves a machine the size of a kitchen.
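To put rough numbers on those economics, here is a back-of-the-envelope sketch in Python using only figures quoted above: Park's synthesis cost of about .03 cents per base at two bits per base, and the 1Mb/s write rate Turek reported. I am assuming "Mb" means megabits and that a petabyte is 10^15 bytes; the function names are mine. It is arithmetic, not a cost model, and it applies the generic synthesis price rather than whatever Catalog's pre-synthesized strands actually cost.

```python
# Back-of-the-envelope DNA write economics from the figures quoted above.
# Assumptions: 1Mb/s means 10**6 bits per second; 1 PB = 10**15 bytes;
# function names are hypothetical.

CENTS_PER_BASE = 0.03              # Park's quoted synthesis cost
BITS_PER_BASE = 2
WRITE_BITS_PER_SECOND = 1_000_000  # demonstrated write rate, read as megabits
PETABYTE = 10 ** 15
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def cost_per_byte_usd() -> float:
    """Dollars to synthesize the bases encoding one byte."""
    bases_per_byte = 8 / BITS_PER_BASE
    return (CENTS_PER_BASE / 100) * bases_per_byte


def write_cost_usd(n_bytes: int) -> float:
    """Dollars to synthesize n_bytes at the quoted per-base price."""
    return cost_per_byte_usd() * n_bytes


def write_time_years(n_bytes: int) -> float:
    """Years to write n_bytes through a single stream at the quoted rate."""
    return n_bytes * 8 / WRITE_BITS_PER_SECOND / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"${cost_per_byte_usd():.4f} per byte")
    print(f"${write_cost_usd(PETABYTE) / 1e9:,.0f}B per petabyte")
    print(f"{write_time_years(PETABYTE):,.0f} years per petabyte at 1Mb/s")
```

Under these assumptions a single 1Mb/s writer would need centuries per petabyte, which is why write bandwidth and write cost, not density, are the numbers that matter.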