There is an idea that is central to a lot of biology and bioinformatics: orthology. In fact, it is fairly central to my own work. I've been struggling with it, and I've finally figured out what is wrong and why, so now I can explain it.
First, let's start off with homology. Homologous genes share an evolutionary history. Two homologous genes were, at some point, a single gene. As time passed, there was some kind of duplication and there came to be copies of the gene, each now on its own evolutionary journey. Duplications can happen within a genome or through speciation. Within a genome, an event like replication slippage, homologous recombination, or transposon activity can duplicate a chunk of DNA, providing the organism with two copies. The assumption is that one gene will continue to do the original function while the other can accumulate mutations that allow it to do a related novel function. For instance, hæmoglobin, an oxygen transporter, was a duplication of myoglobin, an oxygen storage gene. Myoglobin is definitely the original since it is “more basal”, that is, it is found in organisms that diverged much earlier. These two genes are termed paralogs. When a species splits into two species, every gene in the original genome has been “duplicated” since each species is now going down a different evolutionary path, taking its genes along for the ride. These duplicates are called orthologs. A sort of reasonable way to think of orthologs versus paralogs is a duplication in space versus time.
Next, we need a way to find out if genes are homologous. This is generally done by some comparison of their sequences. If the sequences are similar enough, then the two genes must have some history together because the probability of two sequences coming into being independently with such similarity is very low. Many people will use the score of the sequence comparison to mean something. This is utterly wrong. Homology is a binary question; either two genes share an evolutionary history or they don't. Moving along... Say that we find a gene g in species A and a gene h in species B where the sequence alignment score is orders of magnitude higher than any other sequence pairing involving g or h. We would conclude that these two genes are orthologs. Let's start making bad assumptions. First, we are going to assume that these two genes have the same function in both organisms. This can be tested in the lab by functional complementation: g is replaced by h in organism A and we check that everything still works properly. Sequences for which functional complementation works have very high sequence identity. Unfortunately, there is no good measure of exactly how much sequence identity is needed. Some parts of the sequence are more involved in the chemical activity of the gene than others (NB: true of all genes, be they proteins, ribozymes, binding sequences, microRNAs, rRNAs, or tRNAs); changes in some parts of the gene may have no noticeable effect. The bad assumption we implicitly make is that this is a bidirectional relationship; we assume that if two genes have high sequence similarity, they must have identical function. There is no ab initio way of knowing which part of the gene is important. Given a pile of genes with equivalent biochemical activities, we can create a model from their sequences and learn which parts of the gene are important and which are not. The next bad assumption is that this technique works for genes whose biochemical properties we don't know.
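To make the "learn which parts of the gene are important" step concrete, here is a minimal sketch, not how any particular database does it. It assumes the sequences are already aligned to the same length and uses per-column Shannon entropy as a crude stand-in for a real sequence model; the function name `column_entropy` and the toy sequences are made up for illustration.

```python
import math
from collections import Counter

def column_entropy(aligned):
    """Shannon entropy (in bits) of each column of pre-aligned sequences.

    Low entropy means the column is conserved across the input genes,
    which is only a hint that the position matters, not proof.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    entropies = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned)
        total = sum(counts.values())
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies

# Hypothetical toy alignment: the first two columns never vary, the third does.
toy_alignment = ["ACG", "ACT", "ACG", "ACT"]
toy_entropy = column_entropy(toy_alignment)
```

The catch from the paragraph above still applies: the columns only look conserved relative to the genes you chose to put in, so the model inherits whatever assumption of equivalence went into picking them.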
Sequences which are “pretty darn similar” are used to build models and then harvest more sequences that should be “equivalent”. This method is used to make the COG database and many of the protein family databases. The typical method for finding orthologs is, for every gene g in genome A, to search genome B for matches to g, then do the complementary search and select any genes which mutually find each other. This has a big problem: these relationships are not transitive. Once you deal with three or more organisms, everything begins to fall apart. You get many wild conflicts and general weirdness. It's also a problem that genes are not really the functional units of life. Many genes are made of functional units, called domains, and fusions are possible, producing genes that borrow bits and pieces of the functions of their ancestors. We can't define homology explicitly for these genes; only parts of them can have homologies. A lot of the basic operating bits of genes are reused many, many times. Searching for histidine kinase or ATPase activity will find oodles of unrelated proteins.
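The mutual search described above is usually called reciprocal best hits. Here is a rough sketch of it, along with a toy case where it stops being transitive. Everything in it is an assumption for illustration: the `scores` dictionary stands in for the output of whatever search tool you use, keyed by (query, target), and the gene names and numbers are invented.

```python
def best_hits(scores, queries, targets):
    """Map each query gene to its highest-scoring target gene, if it has any hit."""
    best = {}
    for q in queries:
        candidates = [(scores[(q, t)], t) for t in targets if (q, t) in scores]
        if candidates:
            best[q] = max(candidates)[1]
    return best

def reciprocal_best_hits(scores, genome_a, genome_b):
    """Pairs (a, b) where a's best hit in B is b and b's best hit in A is a."""
    a_to_b = best_hits(scores, genome_a, genome_b)
    b_to_a = best_hits(scores, genome_b, genome_a)
    return {(a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a}

# Invented scores, made symmetric for simplicity (real searches are not).
pair_scores = {
    ("a1", "b1"): 900,
    ("b1", "c1"): 850,
    ("b1", "c2"): 300,
    ("a1", "c1"): 400,
    ("a1", "c2"): 700,
}
scores = {}
for (x, y), s in pair_scores.items():
    scores[(x, y)] = s
    scores[(y, x)] = s

genome_a, genome_b, genome_c = ["a1"], ["b1"], ["c1", "c2"]
ab = reciprocal_best_hits(scores, genome_a, genome_b)
bc = reciprocal_best_hits(scores, genome_b, genome_c)
ac = reciprocal_best_hits(scores, genome_a, genome_c)
```

In this toy, a1 pairs with b1 and b1 pairs with c1, but a1 pairs with c2 rather than c1. Each pairwise answer looks fine on its own; it is only when you put the three genomes together that the conflict shows up.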
If you're like me, then you might think that the problem is with the method. Perhaps there is a better method. Indeed, many methods have been proposed. The underlying problem is the idea of orthology itself, not the method. Orthology has no naturalistic basis. Orthology, as currently defined, asks for two things: homology and functional equivalence. It is tempting to assume that one of these things might imply the other, but it doesn't. If you take a group of highly-related organisms and find the genes that are present in all of them, you find that genes required for the organism to survive are missing. Yes, the intersection of all the genes in the various E. coli strains would not be a viable organism. Clearly, some of the genes must have functional equivalents that do not have a common evolutionary ancestry. But how did this come to be? Surely their common ancestor had a single gene with that function. How could a replacement gene be introduced? The assumption that the ancestor had two functionally-equivalent genes and its descendants lost one or the other is very unlikely. Moreover, what if genes are homologous and not functionally equivalent? More importantly, what if there is systemic divergence? Suppose you have an organism that creates a particular kind of bioplastic; it is possible to find another organism that has homologous genes to all the ones in the first organism, but creates a different kind of plastic. Are the genes orthologous? It's probably reasonable to say so. What if the organism has two copies of the bioplastic pathway and one produces the original bioplastic and the other produces a different one? What if it has two copies but they both produce different bioplastics? Our definitions are the problem. We're collecting corner cases, not developing some understanding of the underlying biochemical and evolutionary processes.
The non-transitive nature of orthology becomes more and more of a problem when you begin to look at multiple organisms simultaneously. Ideally, each group of orthologs should form a maximal clique, where each gene partners with every other gene in the group when searching the genomes. That doesn't happen. If the graphs were almost maximal cliques, things would be simple enough to approximate. Things are much more complicated. Many genes form long ladder-like graphs or sort-of-connected blobs. Selfish DNA, like transposons, which replicates at will inside the genome, creates huge poorly-defined webs of meaningless relationships; it is almost always worthwhile to collapse all the selfish DNA together into a single super-group.
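One way to see how far a set of pairwise ortholog calls is from the ideal is to group genes by connectivity and then ask whether each group is actually a clique. This is only a sketch under my own assumptions: the edges are undirected pairs of made-up gene names, and the helper names are hypothetical rather than taken from any ortholog tool.

```python
from itertools import combinations

def connected_components(edges):
    """Group genes that are linked, directly or indirectly, by ortholog calls."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components

def is_clique(genes, edges):
    """True if every pair of genes in the group has a direct ortholog call."""
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset(pair) in edge_set for pair in combinations(genes, 2))

# Invented calls: a tidy triangle, and a chain where the ends never pair up.
triangle = [("x1", "y1"), ("y1", "z1"), ("x1", "z1")]
chain = [("a1", "b1"), ("b1", "c1"), ("a1", "c2")]
all_edges = triangle + chain
groups = connected_components(all_edges)
group_is_clique = {frozenset(g): is_clique(g, all_edges) for g in groups}
```

The triangle comes out as a clique; the chain comes out as one four-gene group that is not. A report like this only tells you where the graph misbehaves. It does not tell you which of the calls, if any, is the right one.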
I think, in the end, whatever definition of orthology you choose will fail to make evolutionary sense. Any method that looks for orthologs becomes a garbage-in, garbage-out scenario. Sequence data can be used to find homologous genes and to make educated guesses about the equivalence of their functionality, but it should not be treated as establishing that equivalence. Moreover, sometimes we want to compare genes based on evolutionary history, sometimes by biochemical activity, and possibly even by broader activities, but it is a trap to try both at the same time.
2010-05-16T19:56:55-04:00
http://www.masella.name/badortho