Write this in DNA
Sometimes it seems like the “Age of Synthetic Biology” is actually the “Age of Writing Great Works of Literature Into DNA”. While the tools of synthetic biology increase the potential of genetic engineering by letting us finely control the genomes of experimental organisms, many of the demonstrations of how minutely we can write out DNA sequences have more to do with the western canon than protein function.
This article describes work that encoded all of Shakespeare’s sonnets, Watson and Crick’s paper describing the structure of DNA, a color photo, and an excerpt of an audio recording of Martin Luther King’s “I Have a Dream” speech.
So how does that work?
Basically, you can use DNA bases the same way you would use bits in a computer. One group in 2012 used a binary system: instead of A, C, G, and T representing four different signals, they used A and C to mean, essentially, 0, and G and T to mean, essentially, 1. From there it’s easy to get any computer-encoded data into a strand of DNA.
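To make the one-bit-per-base idea concrete, here is a minimal Python sketch. Which pair of bases stands for 0 and which for 1 is only a convention; this sketch assumes A/C for 0 and G/T for 1. The function names are made up for illustration, and the rule for picking between the two bases of a pair (take whichever one differs from the previous base) is my own simplification, not necessarily what the 2012 group did.

```python
# Sketch: one bit per base. The pair assignment is an assumed convention.
BASES_FOR_BIT = {"0": "AC", "1": "GT"}
BIT_FOR_BASE = {base: bit for bit, pair in BASES_FOR_BIT.items() for base in pair}


def bytes_to_dna(data: bytes) -> str:
    """Encode each bit as one base, never writing the same base twice in a row."""
    bits = "".join(f"{byte:08b}" for byte in data)
    strand = []
    for bit in bits:
        first, second = BASES_FOR_BIT[bit]
        # Either base of the pair carries the same bit, so pick the one
        # that differs from the base we just wrote.
        strand.append(second if strand and strand[-1] == first else first)
    return "".join(strand)


def dna_to_bytes(strand: str) -> bytes:
    """Decode by mapping every base back to its bit and regrouping into bytes."""
    bits = "".join(BIT_FOR_BASE[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


print(bytes_to_dna(b"Hi"))
print(dna_to_bytes(bytes_to_dna(b"Hi")))
```

Because two bases are available for every bit, the encoder always has a way to avoid repeating itself. The price is density: each base carries one bit instead of the two it could in principle hold.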
The group featured in the article did something somewhat more complicated: they used a trinary code instead, converting each byte into base three code and then converting that into DNA.
But DNA isn’t trinary: there are four bases, so it’s by default quaternary. Theoretically, you could convert a byte in base 2 to a string of four base-four digits, and put that into DNA. But that scheme would run into sequencing errors, the most common of which happen when you have a run of several copies of the same base, called a “homopolymer”.
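Here is a small Python sketch of that naive quaternary scheme, mostly to show how easily it produces homopolymers. The digit assignment (A=0, C=1, G=2, T=3) and the function names are assumptions for illustration only.

```python
from itertools import groupby

BASES = "ACGT"  # assumed digit order: A=0, C=1, G=2, T=3


def byte_to_quaternary_dna(byte: int) -> str:
    """Write one byte as four base-4 digits, most significant digit first."""
    return "".join(BASES[(byte >> shift) & 0b11] for shift in (6, 4, 2, 0))


def encode(data: bytes) -> str:
    return "".join(byte_to_quaternary_dna(byte) for byte in data)


def longest_homopolymer(strand: str) -> int:
    """Length of the longest run of a single repeated base."""
    return max((len(list(run)) for _, run in groupby(strand)), default=0)


zeros = encode(b"\x00\x00")
print(zeros, longest_homopolymer(zeros))
```

Two zero bytes come out as eight As in a row, and any file with padding, blank pixels, or silence would be full of runs like that.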
A homopolymer messes up sequencing protocols because current sequencing is based on reactions that happen when bases are added to DNA chains. That “adding a base” reaction isn’t 100% perfect: we can’t guarantee that absolutely EVERY molecule with that sequence added a base in each round of the reaction. (Side note: We also don’t really measure single molecules; we measure clusters of identical molecules generated chemically and tethered to a bead or a plate.) If the sequence is a heteropolymer (say, “AGCTCATAG”) then having a subset of the molecules in the cluster be off by one or two bases for a couple of rounds is okay. It won’t seriously mess up the sequence. But what happens when you have a homopolymer (say, “GAAAAAAAAAT”)? It becomes more difficult to say with certainty that the sequence actually has nine As, and not eight or seven. If the string gets long enough, it gets even worse: the strands can slide along each other, which can actually add or subtract bases from the sequence. (So it’s not an error in our detection, or in the yield of the reaction, but a true mutation in the DNA sequence.) That kind of frame-shifting can do a lot of damage to your data.
But we don’t need to use all four bases to encode four different signals. We could, as this group did, use the four bases to encode three different signals. One way to do that would be to say that “G”, for example, is a wild card: it means “whatever the last base was”. So with 0 = A, 1 = T, and 2 = C, 012110021 would be, perhaps, ATCTGAGCT. No homopolymers.
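The wild-card idea fits in a few lines of Python. The digit assignment (0 = A, 1 = T, 2 = C) is inferred from the example string above, and the function names are hypothetical. The sketch writes G only when the digit's own base would repeat the base just written, so a G is never followed by another G.

```python
TRIT_TO_BASE = {"0": "A", "1": "T", "2": "C"}  # inferred from the example
BASE_TO_TRIT = {base: trit for trit, base in TRIT_TO_BASE.items()}
REPEAT = "G"  # wild card: "same digit as the previous one"


def trits_to_dna(trits: str) -> str:
    strand = []
    for trit in trits:
        base = TRIT_TO_BASE[trit]
        # Writing the same base twice would start a homopolymer,
        # so write the wild card instead.
        strand.append(REPEAT if strand and strand[-1] == base else base)
    return "".join(strand)


def dna_to_trits(strand: str) -> str:
    trits = []
    for base in strand:
        if base == REPEAT:
            if not trits:
                raise ValueError("a strand cannot start with the wild card")
            trits.append(trits[-1])  # G repeats the previous digit
        else:
            trits.append(BASE_TO_TRIT[base])
    return "".join(trits)


print(trits_to_dna("012110021"))
print(trits_to_dna("0000"))
```

A long run of the same digit, like 0000, comes out as an alternating AGAG rather than a run of As.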
What the group actually did was determine which three bases to use by the prior base. Basically, something like this table:
(I might have extrapolated that from their figure.)
So you can’t get homopolymers, because by definition the previous base is never one of the three options for the next base.
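Here is one way a previous-base-dependent code could look in Python. Everything specific in it is an assumption for illustration and may not match the authors' figure: the table is generated by rotating through A, C, G, T starting just after the previous base, the first digit is encoded as if the previous base were A, and each byte is written as a fixed six base-three digits (enough, since 3**6 = 729 covers all 256 byte values).

```python
BASES = "ACGT"
TRITS_PER_BYTE = 6  # assumption: fixed width; 3**6 = 729 >= 256


def byte_to_trits(byte):
    """Return the byte as six base-3 digits, most significant first."""
    trits = []
    for _ in range(TRITS_PER_BYTE):
        byte, trit = divmod(byte, 3)
        trits.append(trit)
    return trits[::-1]


def trits_to_byte(trits):
    value = 0
    for trit in trits:
        value = value * 3 + trit
    return value


def trits_to_dna(trits, start="A"):
    """Each digit picks one of the three bases that differ from the previous base."""
    strand, prev = [], start
    for trit in trits:
        prev = BASES[(BASES.index(prev) + 1 + trit) % 4]
        strand.append(prev)
    return "".join(strand)


def dna_to_trits(strand, start="A"):
    trits, prev = [], start
    for base in strand:
        step = (BASES.index(base) - BASES.index(prev)) % 4
        if step == 0:
            raise ValueError("a repeated base is not valid in this code")
        trits.append(step - 1)
        prev = base
    return trits


def encode(data):
    return trits_to_dna([t for byte in data for t in byte_to_trits(byte)])


def decode(strand):
    trits = dna_to_trits(strand)
    return bytes(trits_to_byte(trits[i:i + TRITS_PER_BYTE])
                 for i in range(0, len(trits), TRITS_PER_BYTE))


strand = encode(b"Write this in DNA")
print(strand)
print(decode(strand))
```

A nice side effect of this design is that a repeated base can never be valid output, so the decoder can flag one as an error instead of silently misreading it.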
The point is, there are lots of ways to put data into DNA. DNA is, at base, a molecule which stores data: genetic data. It’s incredibly dense storage for data; far denser by volume than most storage systems we have right now because it’s a thin strand of data that can be coiled on itself so tightly. These guys stored about 757 kilobytes of data in 337 picograms of DNA; roughly 2.2 petabytes per gram of DNA.
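The density figure is just a division, which a few lines of Python can check. This assumes decimal units (1 kilobyte = 1,000 bytes, 1 petabyte = 10^15 bytes); the variable names are mine.

```python
data_bytes = 757e3    # 757 kilobytes, taking 1 kB = 1000 bytes
dna_grams = 337e-12   # 337 picograms

bytes_per_gram = data_bytes / dna_grams
petabytes_per_gram = bytes_per_gram / 1e15

print(f"{petabytes_per_gram:.1f} PB per gram")  # 2.2 PB per gram
```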
Of course, there’s a really obvious reason why we’re not going to see DNA computers any time soon: their experiment of writing that 757 kilobytes of data and reading it back took almost three weeks. Then again, the authors aren’t suggesting that we use this to stream video. For archival storage (something you could take a couple of days to write and wait a week to read out; something that needs to be dense and long lasting) DNA seems ever more likely as a suitable substrate.
I want to end this post with something a little bit more fun: writing poetry for DNA, in DNA’s ‘natural’ code. The authors I’ve been talking about created a new code that leverages the density of information inherent in DNA, one that would actually be functional. But the “DNA code” that most people already know about is a different one: one where three bases encode an amino acid in a protein.
Of course, amino acids also have one-letter abbreviations, and with them we start having most of an alphabet (with, sadly, a dearth of vowels): A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y. With a little bit of fudging, you can add B (either N or D) and Z (either Q or E). Not that that makes it much better. And then, with a codon table, you could write a sentence into DNA: “Write this in DNA” becomes TGGCGTATTACTGAA ACTCATATTTCT ATTAAT GATAATGCT.
Or, possibly, a haiku.
Every living thing
Sits with me, writing idylls
Speaking life’s bases
GAAGTTGAACGTTAT CTTATTGTTATTAATGGT ACTCATATTAATGGT
TCTATTACTTCT TGGATTACTCAT ATGGAA, TGGCGTATTACTATTAATGGT ATTGATTATCTTCTTTCT
TCTCCTGAAGCTAAAATTAATGGT CTTATTTTTGAATCT GATGCTTCTGAATCT
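If you want to write your own, the letter-to-codon step is easy to sketch in Python. The table below picks one codon per amino acid from the standard genetic code, in which glutamate (E) is GAA and glutamine (Q) is CAA. Which synonymous codon to use for each letter is an arbitrary choice, as is sending B to D and Z to E; the function name is hypothetical. Letters with no amino acid (J, O, U, X) raise an error, and punctuation is simply dropped.

```python
# One codon per one-letter amino acid code (standard genetic code).
CODON_FOR_LETTER = {
    "A": "GCT", "C": "TGT", "D": "GAT", "E": "GAA", "F": "TTT",
    "G": "GGT", "H": "CAT", "I": "ATT", "K": "AAA", "L": "CTT",
    "M": "ATG", "N": "AAT", "P": "CCT", "Q": "CAA", "R": "CGT",
    "S": "TCT", "T": "ACT", "V": "GTT", "W": "TGG", "Y": "TAT",
}
# Assumed fudges: B is "N or D" and Z is "Q or E"; pick one of each.
FUDGES = {"B": "D", "Z": "E"}


def text_to_dna(text: str) -> str:
    """Spell each word in codons, keeping the spaces between words."""
    words = []
    for word in text.upper().split():
        codons = []
        for letter in word:
            if not letter.isalpha():
                continue  # drop punctuation such as commas and apostrophes
            letter = FUDGES.get(letter, letter)
            if letter not in CODON_FOR_LETTER:
                raise ValueError(f"no amino acid is abbreviated {letter!r}")
            codons.append(CODON_FOR_LETTER[letter])
        words.append("".join(codons))
    return " ".join(words)


print(text_to_dna("Write this in DNA"))
print(text_to_dna("Speaking life’s bases"))
```

The missing letters are the real constraint on the poetry: any word containing an O or a U is off the table.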
This reminds me of an essay by Brian Hayes: “The Invention of the Genetic Code” (which happily is online at http://www.americanscientist.org/issues/pub/the-invention-of-the-genetic-code — I highly recommend it), about some of the hypotheses people came up with for exactly how the 20 amino acids were encoded by the 4 bases. Several notions were proposed, based more on information theory than biochemistry, some of them rather clever … and none of them particularly close to the truth.