Tuesday, May 14, 2019

Storing Data In Oligopeptides

Bryan Cafferty et al. have published a paper entitled Storage of Information Using Small Organic Molecules. There's a press release from Harvard's Wyss Institute at Storage Beyond the Cloud. Below the fold, some commentary on the differences and similarities between this technique and using DNA to store data.

The paper's abstract reads:
Although information is ubiquitous, and its technology arguably among the highest that humankind has produced, its very ubiquity has posed new types of problems. Three that involve storage of information (rather than computation) include its usage of energy, the robustness of stored information over long times, and its ability to resist corruption through tampering. The difficulty in solving these problems using present methods has stimulated interest in the possibilities available through fundamentally different strategies, including storage of information in molecules. Here we show that storage of information in mixtures of readily available, stable, low-molecular-weight molecules offers new approaches to this problem. This procedure uses a common, small set of molecules (here, 32 oligopeptides) to write binary information. It minimizes the time and difficulty of synthesis of new molecules. It also circumvents the challenges of encoding and reading messages in linear macromolecules. We have encoded, written, stored, and read a total of approximately 400 kilobits (both text and images), coded as mixtures of molecules, with greater than 99% recovery of information, written at an average rate of 8 bits/s, and read at a rate of 20 bits/s. This demonstration indicates that organic and analytical chemistry offer many new strategies and capabilities to problems in long-term, zero-energy, robust information storage.
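Taking the abstract's figures at face value gives a sense of scale. A minimal back-of-the-envelope sketch, assuming "approximately 400 kilobits" means 400,000 bits:

```python
# Back-of-the-envelope timing from the abstract's figures.
# Assumption: "approximately 400 kilobits" is taken as 400,000 bits.
total_bits = 400_000
write_rate = 8    # bits/s, average write rate quoted in the abstract
read_rate = 20    # bits/s, read rate quoted in the abstract

print(f"write: {total_bits / write_rate / 3600:.1f} hours")  # ~13.9 hours
print(f"read:  {total_bits / read_rate / 3600:.1f} hours")   # ~5.6 hours
```

So writing the whole demonstration took on the order of half a day, a number worth keeping in mind when the press release compares writing speeds below.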
The press release explains the basic idea:
Oligopeptides also vary in mass, depending on their number and type of amino acids. Mixed together, they are distinguishable from one another, like letters in alphabet soup.

Making words from the letters is a bit complicated: In a microwell—like a miniature version of a whack-a-mole but with 384 mole holes—each well contains oligopeptides with varying masses. Just as ink is absorbed on a page, the oligopeptide mixtures are then assembled on a metal surface where they are stored. If the team wants to read back what they “wrote,” they take a look at one of the wells through a mass spectrometer, which sorts the molecules by mass. This tells them which oligopeptides are present or absent: Their mass gives them away.

Then, to translate the jumble of molecules into letters and words, they borrowed the binary code. An “M,” for example, uses four of eight possible oligopeptides, each with a different mass. The four floating in the well receive a “1,” while the missing four receive a “0.” The molecular-binary code points to a corresponding letter or, if the information is an image, a corresponding pixel.

With this method, a mixture of eight oligopeptides can store one byte of information; 32 can store four bytes; and more could store even more.
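The scheme the press release describes is simple presence/absence coding over a small molecular library, and a short sketch makes it concrete. This is illustrative only: it assumes ASCII text, one byte per well, and hypothetical peptide labels standing in for the real oligopeptide masses, none of which comes from the paper.

```python
# Illustrative sketch of the presence/absence coding described above.
# Assumptions (not from the paper): ASCII text, one byte per well, and
# purely hypothetical peptide labels standing in for the real masses.

LIBRARY = [f"peptide_{i}" for i in range(8)]   # 8 mass-distinguishable oligopeptides

def encode_char(ch):
    """Return the subset of the library to spot into one well for one character."""
    bits = ord(ch)                              # one byte per well
    return {LIBRARY[i] for i in range(8) if (bits >> i) & 1}

def decode_well(present):
    """Recover the byte from the set of peptides the mass spectrometer detects."""
    bits = sum(1 << i for i in range(8) if LIBRARY[i] in present)
    return chr(bits)

well = encode_char("M")     # "M" = 0x4D = 01001101: four peptides present, four absent
print(sorted(well))         # ['peptide_0', 'peptide_2', 'peptide_3', 'peptide_6']
print(decode_well(well))    # M
```

Note that ASCII "M" (0x4D) happens to have exactly four bits set, matching the press release's four-present, four-absent example.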
The idea of encoding data using a "library" of previously synthesized chemical units is similar to the one Catalog uses with fragments of DNA. Catalog claims to be able to write data to DNA at 1.2 Mb/s, which makes the press release's claim that the Wyss technique's:
“writing” speed far outpaces writing with synthetic DNA
misleading to say the least; at 8 bits/s, the Wyss technique writes some five orders of magnitude slower than Catalog's claimed rate. Also misleading is this claim from the press release:
DNA synthesis requires skilled and often repetitive labor. If each message needs to be designed from scratch, macromolecule storage could become long and expensive work.
The team from Microsoft Research and U.W. have a paper and a video describing a fully automated write-store-read pipeline for DNA. As I understand it, Catalog's approach is also fully automated.

Both the Wyss Institute and Catalog approaches can readily expand their "libraries" to increase the raw bit density of the medium, but again the implied density advantage is misleading. As I explained in detail in DNA's Niche in the Storage Market, the data density of an actual storage device is controlled not by the size of a bit on the medium, but by the infrastructure that has to surround the medium in order to write, preserve and read it.

As in all publications about chemical storage, the Wyss technique's prospects are hyped by invoking the "data tsunami", or the "data apocalypse", with the demand for data storage portrayed as insatiable. This merely demonstrates that the writers don't understand the storage business, because they uncritically accept the bogus IDC numbers. The idea that data will only be stored, especially for the long term, if the value to be extracted from it justifies the expense seems not to occur to them, nor does its corollary: that the price/performance of storage devices, rather than the density of storage media, is what determines their market penetration.

The details of the competing chemical storage technologies are actually not very relevant to their commercial prospects. All the approaches are restricted to archival data storage, which, as I explained in Archival Media: Not A Good Business, is a niche market with low margins. There is some demand to lock data away for the long term in low-maintenance, long-access-latency media, but it is a long way from insatiable.
