Why are hard drive companies investing in DNA data storage?

The research community is excited about the potential of DNA to function as long-term archival storage. That’s largely because it’s extremely dense, it’s chemically stable for tens of thousands of years, and it comes in a format we’re unlikely to forget how to read. While there has been some interesting progress, efforts have mostly stayed in the research community because of the high costs and extremely slow read and write speeds. These are problems that need to be solved before DNA-based storage can be practical.

So we were surprised to hear that storage giant Seagate had entered into a collaboration with a DNA-based storage company called Catalog. To find out how close the company’s technology is to being useful, we talked to Catalog’s CEO, Hyunjun Park. Park indicated that Catalog’s approach is counterintuitive on two levels: It doesn’t store data the way you’d expect, and it isn’t focusing on archival storage at all.

A different sort of storage

DNA is a molecule that can be thought of as a linear array of bases, with each base being one of four distinct chemicals: A, T, C, or G. Typically, each base of the DNA molecule is used to hold two bits of information, with the bit values conveyed by the specific base that is present. So A can encode 00, T can encode 01, C can encode 10, and G can encode 11; with this encoding, the molecule AA would store 0000, while AC would store 0010, and so on. We can synthesize DNA molecules hundreds of bases long with high efficiency, and we can add flanking sequences that provide the equivalent of file system information, telling us which part of a chunk of binary data an individual piece of DNA represents.
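To make the two-bits-per-base scheme concrete, here is a minimal Python sketch of the mapping described above. The function names are ours, chosen for illustration, and the sketch ignores the flanking "file system" sequences and any error correction a real system would need.

```python
# Mapping taken from the text: A = 00, T = 01, C = 10, G = 11.
BASE_FOR_BITS = {"00": "A", "01": "T", "10": "C", "11": "G"}
BITS_FOR_BASE = {base: bits for bits, base in BASE_FOR_BITS.items()}


def bits_to_dna(bits: str) -> str:
    """Encode a string of 0s and 1s as a DNA sequence, two bits per base."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be even")
    return "".join(BASE_FOR_BITS[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bits(dna: str) -> str:
    """Decode a DNA sequence back into the bit string it represents."""
    return "".join(BITS_FOR_BASE[base] for base in dna)


print(bits_to_dna("0000"))  # AA
print(bits_to_dna("0010"))  # AC
```

Decoding is simply the reverse lookup, so dna_to_bits(bits_to_dna(x)) returns x for any even-length bit string.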

The problem with this approach is that the more bits you want to store, the more time and money it takes. Robotic hardware performs the synthesis reactions, and each hardware unit can only synthesize a single DNA molecule at a time. The raw materials the hardware uses to perform that synthesis also add a cost for each stored molecule. While this isn’t a concern for small-scale demonstration projects, the costs quickly become prohibitive if you start storing large amounts of data. Citing a DNA synthesis cost of about 0.03 cents per base, Park said, “.03 cents times two bits per base pair times, say, gigabytes—that’s a lot of money. That’s millions of dollars.”
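Park's back-of-the-envelope figure can be checked with a few lines of Python. This is one reading of the quote, assuming 0.03 cents per synthesized base, two bits stored per base, and a decimal gigabyte; it leaves out addressing sequences, redundancy, and any other overhead, so it should be treated as a rough lower bound rather than a real price.

```python
# Assumptions: the per-base cost Park cited and the two-bits-per-base encoding.
COST_PER_BASE_DOLLARS = 0.03 / 100  # 0.03 cents, expressed in dollars
BITS_PER_BASE = 2


def synthesis_cost_dollars(num_bytes: int) -> float:
    """Estimate raw synthesis cost for storing num_bytes of data."""
    bases_needed = num_bytes * 8 / BITS_PER_BASE
    return bases_needed * COST_PER_BASE_DOLLARS


one_gigabyte = 10**9  # decimal gigabyte, an assumption
cost = synthesis_cost_dollars(one_gigabyte)
print(f"${cost:,.0f} per gigabyte")
```

Under these assumptions, a single gigabyte needs four billion bases and costs roughly $1.2 million to synthesize, which is consistent with Park's "millions of dollars" for multiple gigabytes.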

Park told Ars that Catalog started by rethinking the encoding process to get around this bottleneck. The company’s encoding starts with a library of dozens to hundreds of short pieces of DNA called oligos (short for oligonucleotides). Each bit in the data is then assigned a unique combination of oligos. You can think of this as similar to a silicon processor assigning each bit in memory a unique, 64-bit address. If that bit is a 1, a robot can gather small samples of solutions containing each of the oligos needed to represent it and combine them with an enzyme that can link all of the oligos together.

The enzyme merges the oligos into a single, longer DNA molecule that contains the unique signature of the bit. If, in contrast, the bit is a 0, the corresponding DNA for its address isn’t synthesized.

All of the molecules that are produced can then be combined in a single solution (which can be dried out for long-term storage). To read the data, the population of DNA molecules is sequenced, and an algorithm recognizes the unique combination of oligos present in each molecule. The recognized addresses are assigned a 1; the rest, a 0. This restores the data that was encoded to digital form.
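The write and read steps can be sketched in Python. Park did not describe how Catalog maps addresses to oligo combinations, so the layered scheme below (one oligo chosen from each of several positions) is purely our assumption, as are the oligo names and the library size. Sets stand in for the pooled solution and for the sequencing step.

```python
# Hypothetical library: 3 positions ("layers") with 4 oligo choices each.
# That is 12 distinct pre-made oligos covering 4**3 = 64 bit addresses.
LAYERS = 3
CHOICES_PER_LAYER = 4


def address_to_oligos(address: int) -> tuple[str, ...]:
    """Map a bit address to one oligo per layer (its base-4 digits)."""
    if not 0 <= address < CHOICES_PER_LAYER ** LAYERS:
        raise ValueError("address outside the range this library can encode")
    oligos = []
    for layer in range(LAYERS):
        address, choice = divmod(address, CHOICES_PER_LAYER)
        oligos.append(f"L{layer}_O{choice}")  # made-up oligo name
    return tuple(oligos)


def write_pool(bits: str) -> set[tuple[str, ...]]:
    """Assemble a molecule only for addresses whose bit is 1."""
    return {address_to_oligos(i) for i, bit in enumerate(bits) if bit == "1"}


def read_pool(pool: set[tuple[str, ...]], num_bits: int) -> str:
    """'Sequence' the pool: addresses that are present read as 1, the rest as 0."""
    return "".join(
        "1" if address_to_oligos(i) in pool else "0" for i in range(num_bits)
    )


data = "1011001"
pool = write_pool(data)
print(len(pool))                    # one molecule per 1 bit
print(read_pool(pool, len(data)))   # recovers the original bit string
```

Because a 0 is stored as the absence of a molecule, the reader has to know how many bits to expect; here that is passed in as num_bits.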

This system stores far less data per unit of DNA than putting two bits in every base. But the individual molecules remain small enough that it’s still an impressively compact and stable storage medium. And it saves significant time and money due to a fundamental asymmetry: It’s far cheaper to synthesize a lot of one specific DNA sequence than it is to synthesize small amounts of lots of different DNA sequences. So by assembling DNA from small samples drawn from large volumes of pre-made oligos, the cost of synthesis goes down dramatically. The assembly reactions can also be run in parallel; in contrast, synthesizing an individual sequence ties up the machine it’s running on until the synthesis is complete.

Not about archiving

In the latest implementation of this concept, Catalog has built a machine (called Shannon, after the information theorist Claude Shannon) based on inkjet technology, Park said. Each jet can “print” a single oligo into a drop on a continuous sheet of film. “Different oligos land at the same reaction spot and we overprint with the droplet of enzyme, and that film goes into an incubator,” Park told Ars. There, the enzyme assembles the oligos into a DNA molecule. Once the reactions are complete, the drops can be combined into a single solution that contains all the encoded data.

Part of Catalog’s partnership with Seagate involves seeing if some fluid handling hardware that the hard drive company has developed could help shrink and automate the process even further, cutting the energy and resource use. (Park compared the size of Shannon to that of a typical kitchen.)

In any case, the output of Shannon is all set for archiving. But the company found that potential customers were less interested in archiving than Catalog expected. “We’ve been speaking with companies like Seagate and other companies in the entertainment industry or gas, tech—a lot of very large companies with big data problems and challenges. And we saw that it’s not just the cold storage aspect of this that’s interesting to them.”

Instead, Park found that people were intrigued by the prospect that DNA could allow massively parallel operations on the stored data without the need to convert it back to digital form—Park cited massively parallel database searches and digital signal processing as potential applications. “We want to create a new tier of computational storage, where it supports massive data sizes but is also very much searchable and computable,” Park said.

Park said the encoding scheme could provide an advantage for some DNA-based operations, partly because we know something in advance about the structure of the data—something that’s not possible with encoding schemes where the sequence of bases varies based on the data being stored. Similarly, the absence of certain sequences in this encoding scheme could be useful. At this point, however, Park said that Catalog is still in the process of figuring out how to implement some of these ideas. Demonstrations may still be a while off.

An actual computational advantage may be further off still, since any advantage will only come at very large scales. “You need to be able to have the ability to store a lot of information to DNA before DNA base computation makes sense,” Park said, because traditional computers will chew through smaller amounts of data without hitting bottlenecks. DNA storage only comes into its own because it can handle massive parallelism better. “[If] you’re trying to compute on say, a megabyte of data stored in DNA, the time or resources it would take to do that would be, say, on par with the time it would take to compute on a petabyte of data stored in DNA,” he said.

While a startup like Catalog is obviously focused on profitable companies that deal with huge data sets, it’s possible that some of the first applications will come out of the academic community. Park cited the massive amounts of data produced by the Large Hadron Collider as a potential target, saying that Catalog had signed up for the Open Labs technology development framework run by CERN, which hosts the collider. “I think [DNA would] be a perfect way to store massive amounts of data—when there’s a new theory that comes out, you want to be able to search through all of the previous experiments in a very efficient way,” Park said. “And there isn’t a way to do that currently, because they just have rooms full of tape. I think a DNA-based system would be a very good solution for that.”

DNA may enable computation in storage on massive data sets.
