Data storage

An extended molecular alphabet for DNA data storage

February 28, 2022

(Nanowerk News) DNA data storage systems have the potential to hold orders of magnitude more information than existing systems of comparable size. Compared with existing data storage technologies, DNA storage is potentially cheaper, far more physically compact, more energy efficient and more durable: DNA survives for hundreds of years and is maintenance free. Files stored in DNA can also be copied very easily at negligible cost.

The storage density of DNA is staggering. Consider this: humanity will generate about 33 zettabytes of data by 2025, or 33 followed by 21 zeros when counted in bytes. All of this information would fit in a ping-pong ball, with room to spare. The United States Library of Congress holds about 74 terabytes, or 74 million million bytes, of information; 6,000 such libraries would fit in a DNA archive the size of a poppy seed.
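As a quick check of the figures above, the following sketch converts them to bytes. It assumes decimal SI prefixes (1 zettabyte = 10^21 bytes, 1 terabyte = 10^12 bytes); the article does not say whether decimal or binary units are meant, and the variable names are purely illustrative.

```python
# Decimal SI prefixes (assumption: the article uses decimal, not binary, units).
TERABYTE = 10**12
PETABYTE = 10**15
ZETTABYTE = 10**21

# Projected global data volume by 2025, as quoted in the article.
global_data_bytes = 33 * ZETTABYTE

# Library of Congress holdings and the "6,000 libraries in a poppy seed" claim.
library_of_congress_bytes = 74 * TERABYTE
poppy_seed_archive_bytes = 6000 * library_of_congress_bytes

print(f"Global data by 2025: {global_data_bytes:.1e} bytes")
print(f"Poppy-seed archive:  {poppy_seed_archive_bytes / PETABYTE:.0f} petabytes")
```

Under these assumptions, 33 zettabytes is 3.3 × 10^22 bytes, and 6,000 Library of Congress collections come to 444 petabytes.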

Information stored in DNA can be copied in a massively parallel fashion and selectively retrieved via the polymerase chain reaction (PCR). However, existing DNA storage systems suffer from high latency caused by the inherently sequential write process. Despite recent advances, a typical solid-phase DNA synthesis cycle time is on the order of minutes, which limits the practical applications of this molecular storage platform.

To overcome these challenges, new synthesis methods and new information encoding approaches are needed to speed up the writing of large-volume datasets.

Extending the alphabet of a DNA storage medium with chemically modified nucleotides can increase both storage density and write speed, because more than two bits are recorded during each synthesis cycle.

Advancing research in this direction, scientists have now reported an expanded molecular alphabet for DNA data storage comprising four natural nucleotides and seven chemically modified nucleotides that are readily detected and distinguished using nanopore sequencers.
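To make the density argument concrete, here is a minimal sketch of the information carried per nucleotide position. It assumes every symbol is equally likely and ignores error-correction overhead and sequence constraints, so the numbers are theoretical upper bounds rather than figures reported by the authors. The function name is illustrative.

```python
import math


def bits_per_symbol(alphabet_size: int) -> float:
    """Upper bound on bits stored per position for a uniform alphabet."""
    return math.log2(alphabet_size)


natural_bits = bits_per_symbol(4)        # A, C, G, T
extended_bits = bits_per_symbol(4 + 7)   # four natural + seven modified nucleotides
density_gain = extended_bits / natural_bits

print(f"Natural alphabet:  {natural_bits:.2f} bits per nucleotide")
print(f"Extended alphabet: {extended_bits:.2f} bits per nucleotide")
print(f"Relative gain:     {density_gain:.2f}x")
```

With 11 symbols each synthesis cycle can record about 3.46 bits instead of 2, a gain of roughly 1.7-fold under these idealized assumptions. Because each cycle writes one symbol, the same factor applies to the number of cycles needed to write a fixed amount of data.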

DNA data storage using natural and chemically modified nucleotides. (A) Chemical structures of natural DNA nucleotides (A, C, G, T) and selected chemically modified nucleotides employed in the study (B1-B7). (B) Schematic of the ssDNA oligo used in the MspA nanopore experiments. The oligos are 40 nucleotides (nt) long, with biotin attached at the 5′ end. The homo- or heterotetrameric sequences are located at positions 13-16, flanked by two polyT regions of 12 nt and 24 nt at the 5′ and 3′ ends, respectively. (C) Sequence space for DNA homotetramers or heterotetramers used in MspA nanopore experiments. The notation aX + bY, where a and b take values in {1, 2, 3} so that a + b = 4, indicates that “a” symbols of the same type are combined with “b” symbols of another type and arranged in an arbitrary linear order. A total of 77 distinct tetrameric sequences were synthesized and tested experimentally. (Left) Pie chart showing the 11 homotetramers and 12 tetramers of the form ACT+X, where X is a chemically modified nucleotide from the set {B2, B3, B5}. (Middle) Pie chart showing the 30 tested combinations of tetramer sequences with the total composition 2X + 2Y using chemically modified monomers from the set {B1, B2, B3, B4, B5}, including the sequence patterns XXYY, XYYX and XYXY. (Right) Pie chart showing the remaining 24 combinations of tetramer sequences with the total composition 3X + Y using the set {B2, B3, B5}. Five chemically modified nucleotides form stable base pairs with natural nucleotides via hydrogen bonds (B2-G, B3-A, B5-A, B6-A, B6-C), based on results of molecular dynamics (MD) simulations. (Reproduced with permission from the American Chemical Society)
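The counts in the caption (11 + 12 + 30 + 24 = 77) can be reproduced with a short enumeration. The sketch below is one plausible reading of the caption, not the authors' actual sequence list. It assumes that ACT+X means inserting X at any of the four positions of the fixed order A, C, T, that 2X + 2Y uses unordered pairs of monomers in the three named patterns, and that 3X + Y places the single Y at any of four positions. All names are hypothetical.

```python
from itertools import combinations, permutations

NATURAL = ["A", "C", "G", "T"]
MODIFIED = [f"B{i}" for i in range(1, 8)]  # B1..B7


def insert_at(base, symbol, position):
    """Return a tuple equal to `base` with `symbol` inserted at `position`."""
    seq = list(base)
    seq.insert(position, symbol)
    return tuple(seq)


# 11 homotetramers: one per symbol of the extended alphabet.
homotetramers = [(x,) * 4 for x in NATURAL + MODIFIED]

# ACT+X (assumed reading): X from {B2, B3, B5} inserted into the fixed order A, C, T.
act_plus_x = [insert_at(("A", "C", "T"), x, pos)
              for x in ["B2", "B3", "B5"] for pos in range(4)]

# 2X + 2Y (assumed reading): unordered monomer pairs in the three named patterns.
two_plus_two = [tuple(x if ch == "X" else y for ch in pattern)
                for x, y in combinations(["B1", "B2", "B3", "B4", "B5"], 2)
                for pattern in ["XXYY", "XYYX", "XYXY"]]

# 3X + Y (assumed reading): ordered (X, Y) pairs, single Y at any of four positions.
three_plus_one = [insert_at((x,) * 3, y, pos)
                  for x, y in permutations(["B2", "B3", "B5"], 2)
                  for pos in range(4)]

all_tetramers = homotetramers + act_plus_x + two_plus_two + three_plus_one


def build_oligo(tetramer):
    """40-nt layout from panel (B): 12 T, the tetramer at positions 13-16, 24 T."""
    return ["T"] * 12 + list(tetramer) + ["T"] * 24


print(len(homotetramers), len(act_plus_x), len(two_plus_two), len(three_plus_one))
print(len(set(all_tetramers)), len(build_oligo(all_tetramers[0])))
```

Under these assumptions the four groups contain 11, 12, 30 and 24 sequences, all distinct, and each oligo built around a tetramer is 40 symbols long. The actual orderings synthesized in the study may differ from this enumeration.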

The findings are published in Nano Letters (“Expanding the molecular alphabet of DNA-based data storage systems with neural network nanopore readout processing”).

The authors’ results show that Mycobacterium smegmatis porin A (MspA) nanopores, which are widely used for ssDNA detection and single-molecule chemistry studies, can accurately distinguish 77 combinations and orderings of chemically diverse monomers in homo- and heterotetrameric sequences.

They further demonstrate that highly accurate classification (greater than 60% on average) of natural and chemically modified nucleotide combinatorial patterns is possible using deep learning architectures.

According to the scientists, the extended molecular alphabet could nearly double storage density and potentially reduce recording latency by a similar factor, offering a promising avenue for the development of novel molecular recorders.