I might have had the past month of my life swallowed by writing and editing a paper analyzing a massive RNA sequencing data set. And that means that what is on my mind right now is how we visualize massive biological data sets. In particular, RNA sequencing data sets. In particular, relative RNA expression levels. And, especially in particular, the obvious problems with the most popular method of doing so.
The short story: biologists use a red/green color scale to show the difference between low and high transcription levels in large data sets. High-throughput studies are full of images like the featured image. They’re pretty terrible, especially for people with red/green color-blindness. You would think biologists would know better, what with some of them researching color-blindness.
But, much as I love my fellow biologists, we are not graphic designers. And just like any other field, we end up with strange conventions that make sense to fellow biologists, even if they don’t seem logical at first.
So, why use a red/green gradient at all? We knew about color blindness before we were measuring the relative levels of specific RNA molecules, so how did this even start? Is there some cultural preference for green being “on” and red being “off”? Possibly. But in all likelihood, it’s more about the practicalities of the very first high-throughput data sets. So I’m going to start by talking about how we used to measure RNA levels. I’m not even going to just go one step back, I’m going to go two steps back. We’re going to talk northern blots.
A northern blot is, quite simply, a way to detect a specific RNA sequence in a sample. You use gel electrophoresis to separate individual RNA molecules by size, then transfer those molecules to a membrane and crosslink them in place using UV light. You then hybridize with a complementary probe (usually DNA, because it’s more stable) that has been made detectable, either by incorporating radioactive phosphorus or by covalently attaching fluorescent molecules. The probe sticks to the sequence of interest and lights it up. You get a band on a gel: totally old school.
By old school, I mean that we’ve been developing ways to do thousands of northern blots at the same time since the 1980s. You can tell a biologist is getting bored with a technique when (1) he develops a way to do it thousands of times at once, or (2) she breaks down and builds a robot to do it for her. A microarray is, basically, thousands of northern blots done at once. Only, well, backwards.
In a northern blot, you tether your sample to a membrane, and you float over labeled probe. In a microarray, you tether your probe to a slide, and you float over labeled sample. Based on where your label shows up, you can tell which molecule it was attached to. So you can do your northern blot, for every gene in a sample, and see which of thousands of genes are being transcribed. Cool. But since this is biology we’re talking about, you still probably have to do two, a test and a control.
Only, wouldn’t it be even better if you could just do them both at once? That way, any fluctuations due to the time you let them hybridize or the temperature of the slide or the humidity or whatever would be canceled out. And, hey, if we use fluorescence, we can measure two colors!
You can see where this is going, right? The two fluorescent dyes that two-color microarrays settled on were Cy3 – which reads out green – and Cy5 – which reads out red. (Related note: fluorescence is awesome enough that the discoverers of the green fluorescent protein GFP – Shimomura, Chalfie, and Tsien – won the Nobel Prize in Chemistry in 2008.) So scientists were able to tell relative levels of test and control by labeling their control samples with the red dye and their test samples with the green dye and measuring red versus green. Red meant turned off in the test, yellow meant no change, and green meant turned on in the test.
Which meant, in short, that there was a very natural way to present that data: red is “off”, green is “on”. Having the scale pass through black rather than yellow (equal red and green signal, overlaid, reads as yellow) was essentially just a matter of normalizing for intensity, and it helped draw attention to the important parts of the graph: the genes that behaved differently in the test than in the control.
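To make that convention concrete, here’s a minimal sketch in Python. The function names and the saturation cutoff are my own inventions for illustration, not from any microarray package: one function turns the two channel intensities into a log2 ratio, and the other maps that ratio onto the red/black/green scale, following the red-is-“off”, green-is-“on” reading described above.

```python
from math import log2

def expression_log2_ratio(test_signal, control_signal):
    """Relative expression: negative = lower in the test, positive = higher."""
    return log2(test_signal / control_signal)

def redgreen_rgb(ratio, scale=2.0):
    """Map a log2 ratio onto the classic red/black/green scale.

    Zero (no change) is black; the color saturates at +/- `scale`.
    The `scale` cutoff is an arbitrary choice for this sketch.
    """
    x = max(-1.0, min(1.0, ratio / scale))
    if x < 0:
        return (-x, 0.0, 0.0)   # shades of red for "off" in the test
    return (0.0, x, 0.0)        # shades of green for "on" in the test

# A gene with 4x the control signal in the test sample:
# log2 ratio of 2.0, fully saturated green.
rgb = redgreen_rgb(expression_log2_ratio(400.0, 100.0))
```

The point of passing the ratio, rather than the raw intensities, through the colormap is exactly the normalization mentioned above: it discards overall brightness and keeps only the test-versus-control difference.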
The thing is, as sequencing becomes cheaper and cheaper, more biologists are moving away from microarrays. Sequencing data gives you everything a microarray does, and more. But everyone is so used to the convention of red is “off” and green is “on” that we look askance at a new one. There are some exceptions. Epigenetic marks, especially DNA methylation or histone modifications, can be shown in the same clustergram and often get a different color scheme. Which just goes to show that if you’re not measuring RNA levels, you don’t necessarily need to use the same convention. (Of course, many people still do.)

And people know this is a problem. The documentation for R color schemes has this to say about red/green: “The redgreen color map ranges from pure green at the low end, through black in the middle, to pure red at the high end. Although this is the most common color map used in the microarray literature, it will prove problematic for individuals with red-green color-blindness.”
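And the fix is cheap, which makes the inertia all the more frustrating: swap the endpoints for a hue pair that red/green color-blind readers can actually distinguish. Blue/yellow is one commonly suggested choice. A sketch (the function name and saturation cutoff are mine, not from any package), taking a gene’s log2(test/control) ratio as input:

```python
def blueyellow_rgb(ratio, scale=2.0):
    """Map a log2 ratio onto a blue/black/yellow scale, a commonly
    suggested colorblind-friendlier alternative to red/black/green.

    Zero (no change) is still black; color saturates at +/- `scale`.
    """
    x = max(-1.0, min(1.0, ratio / scale))
    if x < 0:
        return (0.0, 0.0, -x)   # shades of blue for "down"
    return (x, x, 0.0)          # shades of yellow for "up"
```

Note that the data and the black midpoint are untouched; only the hue assignment changes, so nothing about the underlying analysis has to change to accommodate it.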
The fact of the matter is that there are lots of conventions in biology – I suspect in any science – that exist because, at one point, they made sense. And even though this one has become divorced from its source (the initial microarray images, where expression differences between test and control actually read out as green versus red fluorescence), any individual researcher who tries to change it now will mostly just confuse people.