Data storage

Why are hard drive companies investing in DNA data storage?

The research community is excited about the potential of DNA to function as long-term archival storage. That’s largely because it’s extremely dense, chemically stable for tens of thousands of years, and comes in a format we’re unlikely to forget how to read. Although there have been some interesting advances, efforts have mostly remained within the research community due to high costs and extremely slow read and write speeds. These issues must be resolved before DNA-based storage can be practical.

So we were surprised to learn that storage giant Seagate had entered into a collaboration with a DNA-based storage company called Catalog. To find out how close the company’s technology is to being useful, we spoke to Catalog CEO Hyunjun Park. Park said Catalog’s approach is counterintuitive on two levels: it doesn’t store data the way one would expect, and it doesn’t focus on archival storage at all.

A different storage

DNA is a molecule that can be thought of as a linear array of bases, with each base being one of four distinct chemicals: A, T, C, or G. Typically, each base in the DNA molecule is used to hold two bits of information, with the bit values conveyed by the specific base present. Thus A can code for 00, T for 01, C for 10, and G for 11; with this encoding, the molecule AA would store 0000, while AC would store 0010, and so on. We can synthesize DNA molecules hundreds of bases long with great efficiency, and we can add flanking sequences that provide the equivalent of file system information, telling us which part of a block of binary data an individual piece of DNA represents.
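The two-bits-per-base mapping described above is simple enough to sketch in a few lines of Python. This is only an illustration of the encoding as stated in the text; the function names are ours, and it leaves out the flanking "file system" sequences entirely.

```python
# Two bits per base, using the mapping given in the text.
BITS_TO_BASE = {"00": "A", "01": "T", "10": "C", "11": "G"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode_bits(bits: str) -> str:
    """Turn a string of 0s and 1s into a DNA sequence, two bits per base."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode_bases(sequence: str) -> str:
    """Turn a DNA sequence back into the bit string it stores."""
    return "".join(BASE_TO_BITS[base] for base in sequence)


print(encode_bits("0010"))  # AC
print(decode_bases("AA"))   # 0000
```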

The problem with this approach is that the longer the bit string you want to store, the more time and money it takes. Robotic hardware performs the synthesis reactions, and each hardware unit can only synthesize one DNA molecule at a time. The raw materials the hardware consumes during synthesis also add a cost for each molecule stored. While that isn't a problem for small-scale demo projects, the costs quickly become prohibitive once you start storing large amounts of data. Citing a DNA synthesis cost of about 0.03 cents per base, Park said, “0.03 cents times two bits per base pair times, say, gigabytes – that’s a lot of money. That’s millions of dollars.”
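To put a rough number on that, here is a back-of-the-envelope calculation using the figures quoted above: 0.03 cents per base and two bits per base. It assumes a decimal gigabyte (10^9 bytes) and ignores flanking sequences, redundancy, and error correction, all of which would push the cost higher.

```python
COST_PER_BASE_DOLLARS = 0.03 / 100  # 0.03 cents per base, as quoted
BITS_PER_BASE = 2                   # conventional two-bits-per-base encoding


def synthesis_cost_dollars(num_bytes: int) -> float:
    """Estimate the raw synthesis cost of storing num_bytes of data."""
    bases_needed = num_bytes * 8 / BITS_PER_BASE
    return bases_needed * COST_PER_BASE_DOLLARS


ONE_GIGABYTE = 10**9  # assumption: decimal gigabyte
print(f"${synthesis_cost_dollars(ONE_GIGABYTE):,.0f} per gigabyte")
```

Under these assumptions, a single gigabyte needs four billion bases and comes to roughly $1.2 million in synthesis costs alone.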

Park told Ars that Catalog started by redesigning the encoding process to get around this bottleneck. The company’s encoding begins with a library of tens to hundreds of short pieces of DNA called oligos (short for oligonucleotides). Each bit in the data is then assigned a unique combination of oligos – you can think of it a bit like a silicon processor assigning a bit in memory a unique 64-bit address. If the bit is a 1, a robot gathers small samples of the solutions containing each of the oligos needed to represent it and combines them with an enzyme that can link all the oligos together.

The enzyme fuses the oligos into a single, longer DNA molecule that contains the bit’s unique signature. If, on the other hand, the bit is a 0, the DNA corresponding to its address is simply not assembled.

All the molecules produced can then be pooled in a single solution (which can be dried for long-term storage). To read the data, the population of DNA molecules is sequenced, and an algorithm recognizes the unique combination of oligos present in each molecule. Addresses that are recognized are assigned a 1; the rest are assigned a 0. This restores the encoded data to digital form.
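The write and read steps described above can be modeled in a short Python sketch. Catalog has not published the details used here: the library layout (three positions with four oligo choices each), the oligo names, and the way a bit's index is turned into a combination are all assumptions made for illustration, and sequencing is reduced to a simple set lookup.

```python
# Hypothetical library: 3 positions, 4 oligo choices at each position.
# That gives 4**3 = 64 distinct addresses from only 12 pre-made oligos.
LIBRARY = [[f"P{pos}_O{idx}" for idx in range(4)] for pos in range(3)]


def address_for(bit_index: int) -> tuple:
    """Map a bit's index to one oligo per position (mixed-radix digits)."""
    combo = []
    for position in reversed(LIBRARY):
        bit_index, digit = divmod(bit_index, len(position))
        combo.append(position[digit])
    if bit_index != 0:
        raise ValueError("bit index out of range for this library")
    return tuple(reversed(combo))


def write(bits: str) -> set:
    """Assemble one molecule per 1 bit; 0 bits produce nothing."""
    return {address_for(i) for i, bit in enumerate(bits) if bit == "1"}


def read(pool: set, length: int) -> str:
    """Addresses found in the pool read as 1, missing ones as 0."""
    return "".join("1" if address_for(i) in pool else "0" for i in range(length))


data = "1011000001"
pool = write(data)
print(len(pool))                # number of molecules assembled
print(read(pool, len(data)))    # recovered bit string
```

Note that the reader has to know how many bits were stored, since a 0 leaves no molecule behind; a real system would need some way of recording that, which this sketch does not attempt.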

This system stores far less data per unit of DNA than packing two bits into each base. But the individual molecules remain small enough for it to be an incredibly compact and stable storage medium. And it saves a lot of time and money because of a fundamental asymmetry: it is much cheaper to synthesize a large amount of one specific DNA sequence than to synthesize small amounts of many different DNA sequences. Thus, by assembling each molecule from small portions of large volumes of pre-made DNA, the cost of synthesis drops significantly. The assembly reactions can also be run in parallel; in contrast, synthesizing an individual sequence ties up the machine it is running on until the synthesis is complete.
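A quick calculation shows why a small pre-made library can go a long way. It assumes, purely for illustration, that the library is organized into positions with a fixed number of oligo choices at each one; the article does not say how Catalog actually arranges its library.

```python
def library_size(positions: int, choices_per_position: int) -> int:
    """Number of distinct oligos that must be synthesized in bulk."""
    return positions * choices_per_position


def addressable_bits(positions: int, choices_per_position: int) -> int:
    """Number of unique combinations, and so bit addresses, they allow."""
    return choices_per_position ** positions


# Hypothetical layout: 10 positions with 10 choices each.
print(library_size(10, 10))      # oligos to synthesize
print(addressable_bits(10, 10))  # distinct bit addresses
```

Under that assumption, 100 bulk-synthesized oligos would be enough to give ten billion bits their own address, with all the remaining work done by assembly rather than synthesis.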