Science post today. Something soothing and relaxing, possibly to the point of lulling you to sleep. I'm going to share why the breakthrough technology of RNA Sequencing has basically ruined my life.
We're firmly in the age of big data. Applied to biological problems, this is essentially a brute force approach where we read every scrap of information in a biological system and then dig into the entire meal. This is in stark contrast to previous eras, where hypothesis-driven science was our only option.
Let's put a face on the project. Think of Alzheimer's. No one wants to admit it, but we have no consensus idea of what causes the disease. We don't see it coming, we don't know if it's reversible with drugs or whether you're already fucked with the first symptoms (this matters a lot, as the latter case shifts the main efforts to super early disease detection) and, most importantly, we don't know what cellular pathways to target.
Now that's a big problem, and there are two main ways to tackle it. Option one is educated guesses: Twenty years ago, I worked for a lab that did just that. They put two observations together. The first was that Americans had high levels of Alzheimer's compared to worldwide levels. The second was that a lipid processing protein was linked to the disease - people who had poor lipid processing were much more likely to get AD. The lab put these two things together and hypothesized that genes (specifically the lipid processing gene) and diet played a synergistic role in developing AD. They built mice with defects in the lipid processing gene, put them on a typical American diet high in fat and sugar, and waited to see if the typical plaques and tangles that occur in AD appeared. (Spoiler alert: Not so much)
This type of science used to be all we had. But now there's option two: We use "big data" to take a snapshot of many sick people and compare them to snapshots of similar-but-not-sick individuals. Eventually, the thinking goes, our analysis will become powerful enough that patterns will emerge. I'll dig into what that really means in a second, so bear with me.
The technological side of the big data approach has gone quite well. We're able to do more sensitive analysis for less money than ever before. Whereas it took more than a decade and roughly three billion dollars to sequence the first human genome in the 90s and early 2000s, we're now able to read the transcriptome of a person for a tiny, tiny fraction of this amount. Reading the transcriptome is, in many ways, more useful than reading the genome itself. While examining the genome reveals potential causative mutations leading to diseases, there's a lot of junk in the dataset; most of our DNA is garbage sequence with no real use. Moreover, even the stuff that encodes a gene or something else useful isn't always relevant when you're looking at a potential disease. The genomes of most cells are fairly uniform diploids (fun brain teaser: can you think of cells or biological situations where this isn't the case?), and cells in the brain will contain genes used for, say, muscle replacement that are functionally inactive.
The transcriptome, on the other hand, can be far more illustrative. Remember that DNA is made into RNA which is (usually) made into proteins. Whereas genome sequencing reads the DNA content of cells, transcriptome profiling reads the RNA. Using the same example, a genomic reading of the brain and muscle would be essentially identical, whereas a transcriptome reading would be completely different - the patterns and levels of gene expression would be totally different. Also, where DNA is static and built to last, RNA is inherently dynamic - it's constantly being produced and broken down. Expression levels can change in many ways (with injury, with age, or over the course of a day), and the changes can be both fast and large. I think most people would agree that transcriptome analysis gives a much better picture of what's going on in the cell.
Applying this technique to a disease, particularly a disease whose cause is not yet known (Alzheimer's for example), you'd think we'd crack that nut pretty quickly using this techie approach. For example, take (postmortem) brain samples from a hundred Alzheimer's disease patients and compare them to a matched set of brains from old people without dementia. Patterns will emerge, right?
Well... yes and no. We still don't know what causes Alzheimer's, and answering the question of 'why' reveals that, while we've solved one problem with RNASeq, we've just peeled back a layer to reveal another layer of different, more difficult problems.
So why doesn't RNASeq solve everything?
(1) Rot. People die in all kinds of places. Until you preserve their brain tissue, it begins to degrade almost immediately. RNA, as I mentioned earlier, is as delicate as a celebrity guest on America's Got Talent and it falls apart fast. I don't want to get into the guts of how RNA is read, but even a few stinkers (and I mean that literally) skew the results of a large study by introducing error. You can work around it with algorithms that diminish the noise, but you also tend to mask real biological differences when you do this.
(2) Natural human heterogeneity. People are different, biologically. This includes at the RNA level. Most experiments on mice are done on strains that are back-crossed (i.e., bred incestuously) to remove genetic diversity. While this does tend to give cleaner results, this level of inbreeding is not the case with most of us. Data points for the same gene will splay out over a broad range, even in healthy people. This usually means more error or deviation from the mean, which leads study authors to increase the total number of subjects, which leads to:
(3) False positive differences. Using a high enough number of people will tend to tighten the estimates so much that statistical significance (e.g., "this gene is underexpressed in Alzheimer's patients!") becomes almost inevitable, even when the actual difference is too small to matter biologically. Put a different way, going from 0.01 to 0.1 on the 1-10 attractiveness scale may be a huge jump for the person being rated, but it doesn't matter because you're still fugly.
(4) Translation differences. Transcription is the conversion of DNA to RNA and translation is the conversion of RNA to proteins, the little guys that actually do what the gene codes for. Just because an RNA is there doesn't mean it's translated into a protein, and just because there aren't a lot of RNA transcripts for a gene doesn't mean your cells aren't producing it by making multiple protein copies from a single RNA transcript. Translation is regulated in many ways and it's dynamic. There are feedback loops, for example, where you make too much of a protein, and that actually inhibits the further production of that protein, sort of a molecular biology version of a teeter-totter*. Long story short, RNA level =/= Protein level.
(5) Case control differences. This is the real bastard, and there's not a great way around it. Because we're so different, we all tend to have different experiences. We drink or we don't. We smoke or we don't. We have cancer and take meds for it or we don't or we never get cancer but do get congestive heart failure and take meds for that. You get the idea. Everything we do has the capability to impact our genetic transcriptional profile to the point where comparing relatively subtle changes is akin to trying to control for total chaos.
This last one is particularly bad for psychiatric diseases like the ones I study. People take drugs for decades, they die, then they donate their brains to science. I have to sort out whether the unusual occurrence of gene X is a result of whatever causes schizophrenia or the antipsychotic meds they took for the last umpteen years.
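To put rough numbers on point (3), here's a minimal sketch. It assumes a plain two-sample z-test with equal group sizes and a known, shared standard deviation, which is my simplification; real RNASeq pipelines use fancier statistics. The function name two_sided_p and the expression difference of 0.02 standard deviations are made up for illustration. The gap between patients and controls never changes in this example; only the head count does.

```python
import math


def two_sided_p(mean_diff, sd, n_per_group):
    """Two-sided p-value for a two-sample z-test.

    Assumes equal group sizes and a known standard deviation shared
    by both groups (a simplification chosen for illustration).
    """
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = abs(mean_diff) / standard_error
    # For a standard normal Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


# Hypothetical numbers: a difference in expression of 0.02 standard
# deviations, which is far too small to matter biologically.
MEAN_DIFF = 0.02
SD = 1.0

for n in (100, 10_000, 100_000):
    p = two_sided_p(MEAN_DIFF, SD, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group = {n:>7,}  p = {p:.2e}  ({verdict} at 0.05)")
```

Under these assumptions, the same trivial difference is nowhere near significant at 100 people per group and comfortably below p = 0.05 at 100,000. The p-value tells you how sure you are that a difference exists, not whether it's big enough to care about.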
If you want to know why new drugs are so expensive, there's one answer for you. It's also the reason that a lot of drugs are found either by serendipity or by hypothesis-based approaches, which are inherently inefficient but don't suffer from (most of) the problems above.
I know you're all asking what you can do to help me out in making big data work. So I've made a list:
(1) Honestly, it would be VERY helpful if people lived homogeneously in identical housing, on identical diets. No multiple races, no ethnic groups. No variation whatsoever. I'm not a huge fan of the two-gender system; that one really wreaks havoc on our transcriptome. Try reproducing asexually.
(2) Don't take any medicine for your disease, unless you're fine with another one never being developed.
(3) Regular tissue donations. If you'd let me remove small pieces of your brain while you're alive, that'd be great.
(4) If you decide to kill yourself, don't shoot yourself in the head and don't suffocate yourself. Both those things ruin the brain (although if I were a hypoxia researcher, suffocation might not be the worst thing). Use pills or simply immerse yourself in a large bucket of paraformaldehyde or deep freeze your whole body. Also, call someone before you do it, so I don't have to deal with a degraded sample.
(5) Be patient with scientists. We're working on it.
*Why don't we just read all the proteins then? Well, we do. But the tech isn't as good. Proteins are even more fragile than RNA in some cases, and there are modifications to proteins that are highly diverse. It's a rabbit hole, this biology thing.
Noah's Inner Monologue
Scribblings of a man who can barely operate an idiotproof website.