Context, content, and understanding the genome
In my last post, I talked about how, as a biologist, I often feel I am drowning in data, and how I wish we spent more time on the basic, hypothesis-based research that will provide the signposts we need to interpret the deluge. I think this is especially true in the field of epigenetics, which is usually where my (rather myopic at the moment) gaze falls. But a few recent studies have brought the same problem to light in genetics as well: they have highlighted just how complex the genome is in function, even when considering only the sequence.
Context is content in proteins and DNA
In the first study, a search for the cause of a genetic disorder led to an interesting discovery: the causative variant (clearly causative, and elegantly proved so) had been classified as neutral because it is found, with no ill effects, in several other vertebrates. The crux of the story, however, was that each other vertebrate /with/ the disease-causing variant also had a /protective/ variant nearby. In combination, the two did not lead to disease. Alone, however, the mutation was very harmful. It is an elegant demonstration of something fairly basic: that each protein, and in fact the genome, acts as a whole. Bases are not unitary, and the effect of changing any one of them depends largely on its context.
It’s easy to conclude from this that we need more sequencing, and more data. Right now we just don’t have enough samples to cover all the pairs of variants, but if we did, we could figure it out. Even the authors suggest this, saying that in what we’ve already sequenced we can get to 12% of pairs. On the other hand, pairs are the simplest kind of interaction, and there’s no reason why triplets, quadruplets, or combinatorial interactions between any number of factors couldn’t contribute to the function of a protein, protein complex, or pathway. In fact, this is trivially true. At that point, given the size of the human genome, there aren’t enough people in the world (or computing hours) to run the calculation.
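To put rough numbers on that last claim, here is a back-of-the-envelope sketch. It is my own illustration, not anything from the study: it assumes a genome of about 3.2 billion base pairs and a world population of about 8 billion, and it counts every possible combination of positions, which is a deliberately generous upper bound since most positions don’t vary between people. The names `GENOME_SITES`, `WORLD_POPULATION`, and `interaction_counts` are made up for the example.

```python
from math import comb

# Assumed round figures, for illustration only (not taken from the study).
GENOME_SITES = 3_200_000_000      # approximate human genome length in base pairs
WORLD_POPULATION = 8_000_000_000  # approximate number of people alive today


def interaction_counts(n_sites, max_order):
    """Return {k: number of distinct k-site combinations} for k = 2..max_order."""
    return {k: comb(n_sites, k) for k in range(2, max_order + 1)}


# Pairs, triplets, and quadruplets of positions across the whole genome.
counts = interaction_counts(GENOME_SITES, 4)

for order, n in counts.items():
    # Compare each count to the number of genomes we could ever sequence.
    print(f"order {order}: {n:.2e} combinations, "
          f"{n / WORLD_POPULATION:.2e} per living person")
```

Even pairs alone outnumber living humans by many orders of magnitude, and every step up in order multiplies the count by roughly another billion. Restricting the count to known variable sites shrinks the numbers a great deal, but the same explosive growth with interaction order remains.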
So, we need a better rule, more understanding of how the sequence of a protein leads to its structure and how the structure of a protein leads to its function. These are big basic questions, and they are crucial in order for individualized medicine to get out of the starting gate. Sequencing more individuals, while tempting as a brute force approach, simply isn’t going to cut it.
A diverse genome is a changing genome
The other study that particularly piqued my interest was one that showed, in plants as well as animals, that heterozygosity bred mutation. What that means: having variation in the population (which is required for two different alleles to come together in a heterozygote) will result in more mutations, and hence more variation, nearby. In this case, the sequence itself doesn’t even matter so much as the fact that the two alleles in the nucleus are different from each other. The authors suggest that this could be due to problems during meiosis, when the two copies of each chromosome must “pair”: if they’re not actually the same, pairing is harder, and the genome gets damaged.
I have a whole other post I want to write about that idea: what it means for modern agriculture, and how it is reflected in our societies. But scientifically, it points out just how much each base pair in the genome is reliant on the others. The genome, in this picture, is not a static thing that is easy to copy. Copying it requires it to align and contort itself out of its natural fold, and so changes in the sequence change how faithfully it can be copied.
It’d be like copying a book, if every typo made your eyes blur over for a bit. And the book really, really wanted to go to another page.
But the glorious thing about this study is that we can use the result. We’ve been trying to understand and predict evolutionary “hot spots” for ages. We’ve observed that mutations tend to happen in clusters, and we haven’t been able to tell why. This study puts us a bit closer to seeing why, and to being able to predict which parts of our genomes are more fluid — and as a consequence, which details are more specifically important.
We’re in an exciting moment for biological sciences, but also one that is fraught with difficulties. There are so many projects that need doing, and so few that are actually supported. I think it’s tempting right now to take the brute-force approach, especially since making a tool (like coming up with a new way to sequence a slightly different portion of the genome) is a surer way to get something publishable than is finding a new biological insight. And when funding is scarce, everyone plays it safe. Everyone does the thing they know will get them a result. Some people are realizing that we just don’t have enough understanding to filter through the noise right now, and I hope that we see a shift back towards careful analysis, and real biology, in the future. If we can come back to these massive data sets with a bit more basic knowledge, we’ll be better able to see the signal through the noise.
Featured image is taken from “DNA alignment written in paper” by Ben Casey – Licensed under CC BY-SA 3.0 via Wikimedia Commons