Abstract
Homology is a fundamental concept in comparative biology and a crucial tool for the analysis of character distribution. Introduced by Owen in 1843 (Lectures on comparative anatomy and physiology of the invertebrate animals, Longman, Brown, Green and Longman, London) in a morphological context, homology can similarly be applied to protein-coding genes. However, in molecular biology the proper distinction between orthology and paralogy was long limited by the absence of whole-genome sequencing data. By now, genome-wide sequencing allows comprehensive analyses of the homology of genes and gene families at the level of an entire phylum. Here, we analyze a manually curated dataset of more than 2,000 proteins from the genomes of 11 nematode species of seven different genera, including free-living and animal and plant parasites to study the principles of homology assignments in gene families. Using all sequenced species as an extensive outgroup, we specifically focus on the two model species Caenorhabditis elegans and Pristionchus pacificus and compare enzymes involved in detoxification of xenobiotics and synthesis of fatty acids. We find that only a small proportion of genes in these families are one-to-one orthologs and that their history is shaped by massive duplication events. Of a total of 349 and 528 genes from C. elegans and P. pacificus, respectively, only 39 are one-to-one orthologs. Thus, frequent amplifications and losses are a widespread phenomenon in nematode lineages. We also report variation in birth and death rates depending on gene families and nematode lineages. Finally, we discuss the consequence of the near absence of one-to-one orthology in related organisms for the application of the homology concept to protein-coding genes in the era of whole-genome sequencing data.
Introduction
Homology is a fundamental unit of comparative biology and has been successfully applied at many different levels of biological organization (Hall 1994; Wagner 2007). Introduced by Richard Owen (1843), the homology concept predates Darwin and is, in part, independent of evolutionary theory (Rieppel 1988). Since the 1960s, the homology concept has also been applied in molecular biology with the important distinction between orthology and paralogy as introduced by Fitch (1970). Specifically, orthologous genes or proteins are defined as homologs in different species that evolved from common ancestry through speciation. In contrast, paralogous genes and proteins result from duplications within a genome. While the application of the homology concept to protein-coding genes and the development of the field of bioinformatics have been most fruitful (Dessimoz et al. 2012), one important difference from the use of homology in anatomy and morphology was, for a long time, the context in which homology was assessed. In anatomy and morphology, homology of structures was always considered in the context of the whole organism. In contrast, homology of genes was for more than three decades always assigned in the absence of whole-genome sequencing data. Therefore, the proper distinction between orthology and paralogy was in part limited by the absence of whole-genome data sets (Thornton and DeSalle 2000) and the lack of phylogenetic resolution among sequenced species (Koonin 2005).
This situation has changed during the last decade with the availability of whole-genome sequencing data, often in a comprehensive phylogenetic context. Such studies can focus on closely related species and help to determine recent patterns and processes of the evolution of genes and gene families. For example, the comparison of the genomes of 12 closely related Drosophila species revealed frequent changes in the size of gene families (Hahn et al. 2007). Alternatively, genome comparisons can try to cover more distantly related species of a given phylum to determine more ancient evolutionary events. Such studies are still in their infancy given that few animal groups have broad whole-genome sequencing coverage (for example, Moore et al. 2013). However, one group of animals for which advanced whole-genome sequencing data are available in many species is the nematodes (roundworms). Building on the genome sequencing project of the model organism Caenorhabditis elegans in 1998 (C. elegans sequencing consortium 1998), soon followed by another member of this genus, C. briggsae (Stein et al. 2003), species of seven additional genera had their genomes sequenced and published with protein prediction sequences available for search in public databases. These include animal parasites (Brugia malayi, Loa loa, Ascaris suum, and Trichinella spiralis) (Ghedin et al. 2007; Desjardins et al. 2013; Jex et al. 2011; Mitreva et al. 2011), plant parasites (Meloidogyne hapla and Bursaphelenchus xylophilus) (Opperman et al. 2008; Kikuchi et al. 2011), as well as an additional free-living nematode Pristionchus pacificus (Dieterich et al. 2008). Based on molecular phylogeny, the nematodes are divided into five major clades, numbered from I to V (Blaxter et al. 1998), and representatives of four of them are now sequenced and available for genomic searches (Fig. 1a).
Two species, Caenorhabditis remanei and Strongyloides ratti, for which a genome sequence is available in public databases but not formally published were also covered in this study, whereas species with a published genome for which protein predictions are not yet available for search, such as Dirofilaria immitis, Meloidogyne incognita, and Panagrellus redivivus (Godel et al. 2012; Abad et al. 2008; Srinivasan et al. 2013), were left aside.
Here, we investigate homology relationships of multigene families using a manually curated dataset from several gene families that are potentially involved in the detoxification of xenobiotics (Fig. 1b) and in the synthesis of fatty acids (Fig. 1c). We chose these two pathways because they contain enzymes involved in the metabolism of two different kinds of substrates. Xenobiotic-metabolizing enzymes process extracellular substrates that might be highly species-specific depending on diet and exposure to ingested pathogens or environmental pollutants and nematicides (Lindblom and Dodd 2006). In contrast, polyunsaturated and branched-chain fatty acid synthesis is considered to be an evolutionarily conserved process, and unusually detailed knowledge of the functional specificity of enzymes is available (Watts 2009). Specifically, we compare cytochrome P450 (CYP), short-chain dehydrogenases/reductases (SDR), glutathione-S-transferases (GST), UDP-glucuronosyltransferases (UGT), ABC transporters, fatty acid desaturases (FAT), and fatty acid elongases (ELO). Our study has a special focus on the two genetic nematode models C. elegans and P. pacificus, which represent distant relatives of the same clade (clade V). C. elegans and P. pacificus are members of different families, the Rhabditidae and Diplogastridae, respectively, and molecular sequence data suggest that the last common ancestor of both lineages lived more than 200 million years ago (Dieterich et al. 2008). The two species differ in their ecology, with C. elegans being often found on rotting fruits (Kiontke et al. 2011), whereas P. pacificus lives in a tight association with scarab beetles (Herrmann et al. 2007, 2010). We find the near absence of one-to-one orthology relationships among nematode gene families. This finding is due to substantial lineage-specific expansions and losses, which can only be fully revealed in the context of manual curation of whole-genome sequencing data.
Materials and Methods
Data Collection
Protein sequences for C. elegans, C. briggsae, C. remanei, B. malayi, L. loa, A. suum, and T. spiralis were collected from BLAST searches in GenBank. Sequences for B. xylophilus, M. hapla, and S. ratti were collected from BLAST searches in Wormbase version WS232 (Yook et al. 2012). Sequences of P. pacificus are from the HYBRID1 proteomics gene models dataset that is available on the website http://pristionchus.org, and were refined when necessary using the new assembly dataset and with the help of sequences from the sister species Pristionchus exspectatus. Some individual experimental sequences from various nematodes without a completed genome were also incorporated when they helped to increase the phylogenetic resolution. Accession numbers for all sequences used are provided in Table S1, and about a hundred manually corrected sequences are given in Dataset S1. Species names are abbreviated by the following prefixes in the figures: Aca, Ancylostoma caninum; Asu, A. suum; Ath, Arabidopsis thaliana; Bma, B. malayi; Bxy, B. xylophilus; Cbr, Caenorhabditis briggsae; Cel, C. elegans; Cre, Caenorhabditis remanei; Cte, Capitella teleta; Dim, D. immitis; Hco, Haemonchus contortus; Hgl, Heterodera glycines; Hsa, Homo sapiens; Hpo, Heligmosomoides polygyrus; Llo, L. loa; Mha, M. hapla; Min, M. incognita; Nam, Necator americanus; Ode, Oesophagostomum dentatum; Ovu, Onchocerca volvulus; Ppa, Pristionchus pacificus; Sce, Saccharomyces cerevisiae; Sra, Strongyloides ratti; Sst, Strongyloides stercoralis; Tbr, Trichinella britovi; Tps, Trichinella pseudospiralis; Tsp, T. spiralis; Wba, Wuchereria bancrofti.
Protein Domain Screening
Protein predictions of unexpected size were manually screened using PFAM (Punta et al. 2012) in order to remove other protein domains that are very likely to be the result of artifactual domain fusions, or to detect wrongly split proteins. Partial sequences that could not be unambiguously improved by manual fusion and sequences bearing domains below the significance level were kept in the dataset, considering that this will help in future studies.
Detection of Artifactual Domain Combinations and Removal of Contaminants
Before carrying out a detailed phylogenetic analysis, the collected sequences of abnormal length were individually scanned to check for the presence of PFAM domains other than those characterizing the protein family of interest. Protein domain recombination through exon shuffling is a well-characterized evolutionary process. Therefore, the phylogenetic distribution of such fusion proteins can provide a first way of detecting fusion artifacts. If a fusion appears in many closely related species, it seems reasonable to infer that it is a true novel arrangement. In contrast, isolated domain fusions are more likely due to assembly artifacts and were thus removed. Some protein domains were found to be present as single-gene predictions. In cases where it was reasonable to assume that isolated domains are complementary parts of the same protein, we manually fused them. In other cases where the grouping was not clear, the predicted domains were treated as partial sequences in the alignment. These cases are indicated by asterisks (*) in Table S1, and modified sequences are given in full in Dataset S1. Sequences that appeared highly divergent on preliminary phylogenetic trees and/or that showed a best BLAST hit to bacterial sequences, and thus could be reasonably interpreted as contaminants, were removed from the final dataset.
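The screening logic for domain fusions can be illustrated with a minimal sketch. It is not the procedure used in this study, which relied on manual inspection; it only shows the underlying idea of flagging extra domains that occur in too few species. The function name, the `min_species` threshold, the protein identifiers, and the `EXTRA_A`/`EXTRA_B` domain labels are hypothetical, and the sketch ignores how closely related the species are.

```python
from collections import defaultdict

def flag_isolated_fusions(proteins, family_domains, min_species=2):
    """Flag proteins carrying an extra domain seen in fewer than min_species species.

    proteins: list of (species, protein_id, [domain accessions]).
    family_domains: set of accessions that define the family of interest.
    Simplification (assumption): species relatedness is not considered.
    """
    # Record in which species each non-family domain co-occurs with the family.
    species_by_extra = defaultdict(set)
    for species, _pid, domains in proteins:
        for dom in set(domains) - family_domains:
            species_by_extra[dom].add(species)

    flagged = []
    for _species, pid, domains in proteins:
        extras = set(domains) - family_domains
        if any(len(species_by_extra[dom]) < min_species for dom in extras):
            flagged.append(pid)
    return flagged

# Hypothetical example: GST_N (PF02798) and GST_C (PF00043) define the family.
example = [
    ("Cel", "p1", ["PF02798", "PF00043"]),
    ("Cel", "p2", ["PF02798", "EXTRA_A"]),
    ("Cbr", "p3", ["PF02798", "EXTRA_A"]),
    ("Ppa", "p4", ["PF02798", "EXTRA_B"]),
]
candidates = flag_isolated_fusions(example, {"PF02798", "PF00043"})
```

Proteins returned by such a filter would be candidates for manual inspection rather than automatic removal.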
Phylogenetic Analyses
Collected sequences were aligned using Muscle (Edgar 2004) and MAFFT (Katoh et al. 2005). The quality of both alignments was checked and refined using GUIDANCE (Penn et al. 2010), and the alignment with the best score was selected for further analysis. Best-fit amino acid substitution models were then evaluated using ProtTest (Abascal et al. 2005), and sorted under the Akaike Information Criterion (Akaike 1974). Except for the elongases, for which the VT model showed the best results (Müller and Vingron 2000), the best-fit model was always obtained using the LG substitution matrix (Le et al. 2008). Because the LG substitution matrix is not yet implemented in Bayesian tree reconstruction programs, phylogenetic analyses were carried out only in the maximum-likelihood framework, using PHYML (Guindon and Gascuel 2003) as implemented in Seaview (Gouy et al. 2010). Alternatively, the web interface (http://www.atgc-montpellier.fr/phyml/) was used if empirical estimates of equilibrium frequencies were required. Reliability of nodes was assessed by likelihood-ratio test (Anisimova and Gascuel 2006).
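The ranking of substitution models under the Akaike Information Criterion can be sketched as follows. This is only an illustration of the criterion, AIC = 2k - 2 lnL, and not a reimplementation of ProtTest; the log-likelihoods and parameter counts in the example are invented for demonstration.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood

def rank_by_aic(fits):
    """Sort models by AIC.

    fits: dict mapping model name -> (log-likelihood, number of free parameters).
    Returns a list of (model name, AIC) from best to worst.
    """
    scored = [(name, aic(lnl, k)) for name, (lnl, k) in fits.items()]
    return sorted(scored, key=lambda item: item[1])

# Hypothetical values, not taken from the analyses reported here.
example_fits = {
    "LG": (-12000.0, 50),
    "VT": (-12050.0, 50),
    "LG+G": (-11900.0, 51),
}
ranking = rank_by_aic(example_fits)
best_model = ranking[0][0]
```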
Screen for Selective Constraints
For the complete sequence set presented in Fig. 3, as well as for the four subgroups indicated in the same figure, cDNA alignments were made using Seaview (Gouy et al. 2010), removing all gap-containing sites and ambiguous sites, detected with the help of GUIDANCE (Penn et al. 2010). Three sequences (GST-17, GST-18, and CBG06381) were removed from the dataset, based on their incompleteness (Dataset S1). A maximum-likelihood tree was estimated using PHYML (Guindon and Gascuel 2003) and rooted according to the position of the sequence groups in the whole GST tree. Subtrees corresponding to the four groups were directly pruned from that tree. The likelihoods of two pairs of models were compared, using the codeml program as implemented in the PAML 4.4 package (Yang 2007). The null model M1a (nearly neutral) assumes two classes of sites, with either ω0 ≤ 1 (negative or purifying selection) or ω1 = 1 (neutral). The alternative model M2a (selection) adds a third class of sites with ω2 ≥ 1 (positive selection). The null model M7 assumes that ω ratios vary among codons, following a β distribution in the interval 0 < ω < 1. The alternative model M8 adds an extra class of sites under neutral evolution or positive selection with ωS ≥ 1 (Yang and Bielawski 2000). Differences in log-likelihoods (2ΔlnL) were compared to a χ² distribution with 1 degree of freedom and corrected for multiple testing (Anisimova and Yang 2007).
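The comparison of nested models can be sketched in a few lines. The sketch computes the statistic 2ΔlnL and its p value under a χ² distribution, using closed-form survival functions for 1 and 2 degrees of freedom. The Bonferroni step is an assumption added for illustration, since the correction procedure is not detailed here, and the log-likelihood values are invented.

```python
import math

def lrt_statistic(lnl_null, lnl_alt):
    """Likelihood-ratio statistic 2*(lnL_alt - lnL_null) for nested models."""
    return 2.0 * (lnl_alt - lnl_null)

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution (df = 1 or 2 only)."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("closed form implemented for df = 1 or 2 only")

def bonferroni(p_value, n_tests):
    """Bonferroni adjustment (an assumed correction, used here for illustration)."""
    return min(1.0, p_value * n_tests)

# Hypothetical log-likelihoods for a null (e.g. M7) and alternative (e.g. M8) fit.
stat = lrt_statistic(-2500.0, -2498.5)
p_raw = chi2_sf(stat, df=1)
p_adj = bonferroni(p_raw, n_tests=5)
```

With these invented values the statistic is 3.0, which does not reach the 5% critical value of 3.84 for 1 degree of freedom, so the null model would not be rejected.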
Birth and Death Estimates and Automated Orthology Assignments
Estimates of gene birth and death rates were obtained using BadiRate (Librado et al. 2012). A table of gene counts from the manually curated dataset and an ultrametric tree corresponding to the current consensus on nematode phylogenetic relationships (Van Megen et al. 2009) were used as input. Turnover rates were estimated under the BDI model (Csurös and Miklós 2009), allowing different rates of turnover across the branches (FR choice in the –b model option), which seemed to be the biologically most relevant scenario after visual inspection of the phylogenetic trees. Ancestral states were calculated under the maximum-likelihood (ML) criterion. This combination of parameters defines the so-called BDI-FR-ML method. For comparative purposes, estimates were also carried out using two parsimony-based methods (BDI-FR-CSP and BDI-FR-CWP). Clusters of orthologs were calculated using version 2.0 of OrthoMCL (Li et al. 2003), using the HYBRID1 protein predictions available at http://pristionchus.org for P. pacificus and the protein predictions available in Wormbase WS241 for the eight other investigated species for which the genome sequence was already published. Predictions from C. remanei and S. ratti were not included, to avoid biases in gene and cluster counts due to the high number of contaminants or heterozygous sequences that were noticed during the manual analysis. Protein sequences were automatically retrieved by searching for Pfam domains using HMMER (Eddy 2011) and the pfam_scan script provided on the PFAM website (Punta et al. 2012), with an E value cutoff of 0.001. Clustering was done using the default values suggested in the OrthoMCL documentation for BLAST E value (1e-05) and percentage identity cutoff (50%). In total, 25,303 clusters with more than one gene were identified from the OrthoMCL output.
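Counting clusters with more than one gene from a clustering output can be sketched as below. The sketch assumes a groups file in which each line has the form `cluster_id: species|gene species|gene ...`; this layout, the cluster names, and the gene identifiers are assumptions made for illustration and were not taken from the files analyzed here.

```python
def parse_groups(lines):
    """Parse lines of the assumed form 'cluster_id: sp|gene sp|gene ...'.

    Returns a dict mapping cluster id -> list of (species, gene) tuples.
    """
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, _, members = line.partition(":")
        groups[name.strip()] = [tuple(m.split("|", 1)) for m in members.split()]
    return groups

def multi_gene_clusters(groups):
    """Keep only clusters that contain more than one gene."""
    return {name: genes for name, genes in groups.items() if len(genes) > 1}

# Hypothetical cluster lines with placeholder gene identifiers.
example_lines = [
    "grp1: Cel|geneA Ppa|geneB",
    "grp2: Cel|geneC",
    "grp3: Cel|geneD Cel|geneE Bxy|geneF",
    "",
]
clusters = multi_gene_clusters(parse_groups(example_lines))
n_clusters = len(clusters)
```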
Results
In the following, we describe the topology of seven different gene families in nematodes according to maximum-likelihood phylogenies. Our analysis is based on a manually curated protein set of nearly 2,000 proteins compiled from public databases containing whole-genome sequencing drafts of 11 species from seven genera in four distinct nematode clades (Fig. 1a). The database deadline for our analysis was February 2013. Data collection, protein domain screening, and phylogenetic analyses are described in detail in the Methods section, along with screens for positive selection and estimations of birth and death rates. This study compares the following families: GST (Figs. 2, 3, Fig. S1; Table 1), CYP, SDR and UGT (Fig. S2), ABC (Fig. 4, Fig. S3), FAT (Fig. 6), and ELO (Fig. S4), using citrate cycle enzyme genes as an external baseline in an automated control analysis, where all the manually analyzed families were also re-analyzed. A summary of all results is provided in Tables 2, S2, S3 and Figs. 7 and 8, indicating the general conservation level of gene families.
The GST Family Shows Eighteen Lineage-Specific Gene Expansions
We first analyzed the GST family as a detailed example for investigating the principles of homology assignments in multigene families because (i) GST proteins have a conserved structure that is quite well understood with crystallographic data for C. elegans, and also for some parasitic nematode species (van Rossum et al. 2004; Perbandt et al. 2005; Asojo et al. 2007); (ii) it has a reasonable size of approximately 50 members in C. elegans, making it manageable to investigate the relationships between family members; and (iii) it has already been the subject of evolutionary studies that provided a basic topological and nomenclatural framework. Moreover, a previous study provided an explicit hypothesis about the origin of the family (Sheehan et al. 2001).
GSTs have a small size of about 200 amino acids and the family can be divided into several subclasses (Fig. 2). Classical GSTs are formed by a thioredoxin-like N-terminus consisting of two motifs (βαβ motif and ββα motif) separated by an α helix, and by a C-terminus consisting of a variable number of α helices (Ladner et al. 2004). One important exception is the GST kappa class, in which helices forming a domain functionally equivalent to the classical C-terminus are inserted between the βαβ and ββα motifs (Ladner et al. 2004). GST kappa and other GSTs have different bioinformatic signatures: GST kappa members correspond to the PFAM domain DSBA (PF01323), whereas other GSTs are identified by the PFAM domains GST_N (PF02798) or GST_N_3 (PF13417) for the N-terminus and GST_C (PF00043) for the C-terminus. We summarize the phylogeny of nematode GST genes in a simplified tree (Fig. 2) and provide high resolution data for the GST kappa members separately (Fig. S1). We excluded from the analysis the proteins bearing a GST_N_3 (PF13417) domain followed by a GST_C_2 (PF13410) domain because they are currently classified in a different family in nematodes based on crystallographic studies (Harrop et al. 2001; Dong et al. 2005; Littler et al. 2008).
The total number of GST genes varies widely, from only two in T. spiralis to 59 in C. elegans and P. pacificus. The slight discrepancies in gene numbers with previously published estimates (Kikuchi et al. 2011; Dieterich et al. 2008; Borchert et al. 2010) may be due to assembly and annotation errors as well as to different cutoffs in protein homology searches. According to current hypotheses about the evolution of GST genes, GST kappa is supposed to branch at the base of the family, and therefore, we rooted our summary tree accordingly (Fig. 2) (Sheehan et al. 2001). Moreover, the overall distribution of the nematode GST sequences fits with the previously published phylogenies of C. elegans GSTs (Perally et al. 2008). The two classes diverging first are GST omega and GST zeta, with GST pi representing another well-defined class. These three classes are well supported, with likelihood-ratio test values of 1.00 for the zeta class, 0.99 for the pi class, and 0.94 for the omega class.
The majority of the remaining sequences cluster in a less defined group that we designate as sigma/nu. While the sigma class is present in all metazoans (Sheehan et al. 2001), the nu class was defined specifically for nematode sequences starting from a sequence from the parasitic nematode H. contortus (van Rossum et al. 2004). This group was later extended to some other nematode sequences (Perally et al. 2008). However, the distinction between the two classes is not clear and the nu class might partially be a subpart of the sigma class. Therefore, we here refer to the whole group as sigma/nu.
We highlighted as expansions all groups of sequences with three or more paralogs in one genus relative to a group of orthologous sequences or to a paralogous group in another genus (Fig. 2, Fig. S1). Clade III and clade IV nematodes, as well as P. pacificus and rhabditids, all show at least one group of paralogs that cluster with a strong likelihood-ratio support (Fig. 2). In total, we observe 18 expansions, mainly in clade IV and clade V nematodes, but also one expansion in A. suum. Most importantly, the expansion patterns vary between the different classes of GSTs, as well as within and between organisms. For example, for clade IV nematodes, there is one small expansion in the zeta class with three additional sequences. In the GST sigma/nu group, there are four clade IV-specific expansions with 3–33 members each. In contrast, clade IV sequences are totally absent from the omega and pi classes. Similarly, P. pacificus GSTs expand at a similar basal rate (3 paralogs) in the omega and zeta classes, whereas there is no duplication in the pi class. In rhabditid nematodes, there are expansions in all classes with two cases resulting in 27 and 53 members, respectively.
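The criterion of three or more paralogs in one genus can be approximated on a tree as follows. The sketch reports maximal clades whose tips all belong to a single genus and that contain at least three tips. This is a simplification of the criterion above: it would also count single-copy orthologs from several congeneric species as an expansion, and it ignores branch support. The nested-tuple tree encoding, the tip labels, and the function names are hypothetical.

```python
def leaves(node):
    """Return the tip labels below a node; a tip is a string, a clade is a tuple."""
    if isinstance(node, str):
        return [node]
    tips = []
    for child in node:
        tips.extend(leaves(child))
    return tips

def genus_of(tip, genus_by_prefix):
    """Map a tip label such as 'Cel_x' to its genus via the species prefix."""
    return genus_by_prefix[tip.split("_", 1)[0]]

def find_expansions(node, genus_by_prefix, min_paralogs=3):
    """List maximal single-genus clades with at least min_paralogs tips."""
    if isinstance(node, str):
        return []
    tips = leaves(node)
    genera = {genus_of(tip, genus_by_prefix) for tip in tips}
    if len(genera) == 1:
        # Single-genus clade: report it if large enough, never descend further.
        return [(genera.pop(), tips)] if len(tips) >= min_paralogs else []
    found = []
    for child in node:
        found.extend(find_expansions(child, genus_by_prefix, min_paralogs))
    return found

# Hypothetical tree with placeholder gene names.
genus_map = {"Cel": "Caenorhabditis", "Cbr": "Caenorhabditis",
             "Ppa": "Pristionchus", "Bxy": "Bursaphelenchus"}
tree = ((("Cel_a", "Cel_b"), ("Cel_c", "Cbr_a")),
        ("Ppa_a", ("Ppa_b", "Ppa_c")),
        "Bxy_a")
expansions = find_expansions(tree, genus_map)
```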
To indicate the tempo and mode of gene expansions and losses with a higher phylogenetic resolution, we studied the representation of GST genes in the genus Caenorhabditis, which has the highest coverage of whole-genome sequencing data. Figure 3 shows that the rhabditid-specific expansion in the GST sigma/nu class is still highly variable within the Caenorhabditis genus. Four proteins (GST-3, GST-4, GST-15, and GST-24) are encoded by unambiguously orthologous genes in C. elegans, C. briggsae, and C. remanei. However, in one case a gene present as a single ortholog in C. briggsae and C. remanei has four counterparts in C. elegans (GST-21, GST-22, GST-34, and GST-35). In another case, we find a group of six C. elegans-specific paralogs (GST-12, GST-14, GST-16, GST-17, GST-18, and GST-19) with a single ortholog in C. briggsae and C. remanei. Finally, a third group contains seven paralogs from C. elegans (GST-26, GST-27, GST-28, GST-29, GST-31, GST-32, and GST-37), two paralogs in C. briggsae, and one additional gene that could be a one-to-one ortholog among the three species, even if not supported by a high likelihood-ratio test value (GST-39). We searched for signatures of positive selection either at the branch or site level in the whole group and in four subgroups corresponding to more recently diverged sequences (groups A to D in Fig. 3), but found no significant differences between the likelihoods of models allowing for positive selection relative to those assuming neutral evolution or purifying selection (Table 1). Screens for positive selection were carried out extensively in the GST family among 12 species from the Drosophila genus, where lineage-specific duplications also occur and GST numbers vary from 30 to 45 paralogs (Low et al. 2007). These screens identified a single gene showing some signal for positive selection, suggesting that the vast majority of gene duplications are neutral events.
Taken together, GST expansions occur even in closely related species of the same genus, a pattern similar to observations made in Drosophila gene families (Hahn et al. 2007). These expansions can occur to the extent that few one-to-one orthology relationships can be made. An extreme case is found between P. pacificus and C. elegans, which both have 59 GST genes in their genomes. While the total number of genes is identical in both species, only GST-11 is a one-to-one ortholog among all GST genes of both organisms.
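The distinction between one-to-one and other orthology relationships can be made explicit with a small sketch that classifies a group of orthologous genes by its copy number in two species. The C. elegans gene names follow the examples given for Fig. 3; the C. briggsae identifiers, the group labels, and the function name are placeholders introduced for illustration.

```python
def classify_relationship(group, species_a, species_b):
    """Classify an orthologous group by gene copy number in two species.

    group: dict mapping species prefix -> list of gene names in that group.
    """
    n_a = len(group.get(species_a, []))
    n_b = len(group.get(species_b, []))
    if n_a == 0 or n_b == 0:
        return "absent in one species"
    if n_a == 1 and n_b == 1:
        return "one-to-one"
    if n_a == 1 or n_b == 1:
        return "one-to-many"
    return "many-to-many"

# C. briggsae identifiers below are placeholders, not real gene names.
groups = {
    "single_copy": {"Cel": ["GST-3"], "Cbr": ["Cbr_gene_1"]},
    "cel_expanded": {"Cel": ["GST-21", "GST-22", "GST-34", "GST-35"],
                     "Cbr": ["Cbr_gene_2"]},
    "cel_only": {"Cel": ["GST-x"]},
}
labels = {name: classify_relationship(grp, "Cel", "Cbr")
          for name, grp in groups.items()}
n_one_to_one = sum(1 for label in labels.values() if label == "one-to-one")
```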
Tens of Lineage-Specific Gene Expansions in the CYP, SDR, and UGT Families
The CYP family consists of well-conserved proteins, some of which are involved in phase I xenobiotic metabolism and in the synthesis of various endogenous hormones (Fig. 1b) (Brown et al. 2008). They correspond to the single PFAM domain PF00067. It is already known that lineage-specific expansions have occurred in C. elegans (Nelson 1998), in vertebrates (Thomas 2007), and in various other metazoan groups (Markov et al. 2009). Based on previous knowledge, we rooted the nematode CYP tree using CYP51, which is conserved in almost all living organisms except for the Ecdysozoa, and CYP39A, a vertebrate paralog that branched basally in previous analyses and is also conserved in annelids (Markov et al. 2009). We show a total of 20 independent lineage-specific expansions in nematodes from clades IV and V (Fig. S2). The number of genes involved in such expansions is highly variable, ranging from only three paralogs for a Meloidogyne-specific group to 52 in the Caenorhabditis CYP34-CYP35 group (Fig. S2). Thirteen out of those 20 expansions are unambiguously supported by a likelihood-ratio test value higher than 0.97 at their base, and six of the remaining seven expansions encompass a subclade of at least three sequences also supported by a likelihood-ratio test value higher than 0.97, suggesting that these findings are overall robust. Besides these major amplifications, other nematode CYPs are not necessarily one-to-one orthologs because of many single duplications and losses.
The SDRs constitute a large family of NAD(P)(H)-dependent oxidoreductases with crucial functions in lipid, amino acid, carbohydrate, cofactor, hormone and phase I xenobiotic metabolism, as well as in redox sensing mechanisms (Kavanagh et al. 2008). Sequence identities are low within the SDR family with many undefined boundaries. The current family classification is not based upon phylogeny, but rather relies on motif recognition (Persson et al. 2009). As an exhaustive study on SDRs from C. elegans is still lacking, we deciphered as a starting point the phylogenetic relationships among the classical nematode SDRs. Specifically, we took into account all sequences identified by the PFAM domains adh_short (PF00106) and adh_short_C2 (PF13561), because all genes that are proposed to be involved in xenobiotic response are from these groups (Menzel et al. 2007; Kisiela et al. 2011; Son et al. 2011). The family history is shaped by four expansion events within the Caenorhabditis genus, 11 expansions in Pristionchus, 17 expansions in clade IV, two in clade III, and even one small expansion in T. spiralis (Fig. S2). Of these 35 expansions, only 14 are supported by an LRT value higher than 0.97. However, this number increases to 23 when incorporating subparts of the weakly supported amplifications with internal branches that cover more than three sequences from the same species. This lower support reflects the limited level of sequence identity in the SDR family, which makes phylogenetic reconstructions difficult. In total, there are nine one-to-one orthologs between C. elegans and P. pacificus with one orthologous pair showing differential domain composition.
Enzymes of the UGT family are involved in the addition of glucuronyl residues during phase II of xenobiotic metabolism (Fig. 1b). They are known to be highly diverse at the intra-phylum level. For example, out of a dataset of 310 insect proteins, only one was found to be conserved among holometabolous insects (Hahn et al. 2012). Our phylogenetic analysis identifies 19 amplification events within nematode members of this family (Fig. S2). Four small amplifications are found in the clade III nematode A. suum, seven in clade IV nematodes, and five in P. pacificus, including two major amplifications encompassing 34 and 54 sequences, respectively. Another three amplifications are observed in the Caenorhabditis genus, the biggest encompassing 154 sequences. Only one gene, coding for UGT-60, is a one-to-one ortholog between C. elegans and P. pacificus (Fig. S2). Taken together, the extensive lineage-specific gene expansions observed for CYP, SDR, and UGT families indicate that the observation originally made for GST genes is not an exception but more likely the rule.
Nine Small Expansions and Some Domain Losses in the ABC Transporter Family
ABC transporters are a highly modular family of proteins that are involved in pumping metabolites from the cytoplasm of the cell to other compartments or outside of the cell (Sheps et al. 2004). The domain number and order can vary among subfamilies (Sheps et al. 2004; Fig. 4a). We have therefore analyzed separately the four types of domain combinations that exist in nematodes (Fig. S3).
The first type of domain architecture of ABC transporters has a tandem repeat of an ATP-binding-cassette domain followed by a transmembrane domain, which is characteristic of transporters of the subfamilies ABCA, ABCB, and ABCC (Fig. 4a). The full-length ABCB, also called P-glycoproteins (PGP), and the ABCC, also known as multidrug-resistant proteins (MRP), are supposed to be involved in drug excretion in C. elegans (Sheps et al. 2004). A more exhaustive study among the Caenorhabditis genus showed that these proteins are roughly conserved as one-to-one orthologs between C. elegans, C. briggsae, and C. remanei (Zhao et al. 2007). Consistently, we identified only a small number of moderate expansions relative to the previously analyzed families. We notice seven expansions, four of them involving three to four paralogs in the Caenorhabditis genus, in P. pacificus or in B. xylophilus (Fig. S3).
The second type of domain architecture of ABC transporters has a single ATP-binding cassette at the N-terminus followed by a transmembrane domain at the C-terminus. Such proteins need to dimerize to be fully functional, and such structures are characteristic of the half-length (HAF) members of the ABCB family (Sheps et al. 2004) and of the ABCD members, also called peroxisomal membrane proteins (PMP) in nematodes (Zhao et al. 2007). The half-length ABCB includes some mitochondrial transporters, whereas the ABCD subfamily contains proteins involved in fatty acid transport into peroxisomes and vitamin B12 metabolism in lysosomes in vertebrates (Tarling et al. 2013). We find a level of conservation that is even higher than in the previous group, with seven of the 15 C. elegans proteins having a one-to-one ortholog in P. pacificus (Fig. S3). Only two expansions of six to eight paralogs occur specifically in B. xylophilus. Strikingly, the largest of these expansions occurs in the HAF-5 subfamily, which is otherwise conserved as a one-to-one ortholog in mammals, fruit fly (Zhao et al. 2007), and other nematodes (Fig. S3).
ABCE, ABCF-1, ABCF-2, and ABCF-3 are strictly conserved as one-to-one orthologs in all analyzed nematodes. The only variation is a partial ABCF-1 ortholog in T. spiralis, which is probably an incomplete sequence (Fig. S3). Although classified in the ABC transporter family due to the presence of two ATP-binding cassettes, ABCF proteins are not transmembrane transporters; rather, they are involved in protein synthesis processes that are conserved throughout metazoans (Dean and Annilo 2006).
The ABCG family, also known as white-related (WHT), and the ABCH family consist of those ABC transporters with the ATP-binding cassette at the N-terminus, whereas the transmembrane domain is in the C-terminal part of the protein (Fig. 4a). In general, the ABCG family shows high conservation. All sequences that are unambiguously placed are one-to-one orthologs of WHT-1, WHT-4, WHT-7, and WHT-8 (Fig. S3). In contrast, there are small amplifications of genes encoding WHT-2, WHT-3, and WHT-6 proteins with two P. pacificus sequences, two B. xylophilus sequences and one sequence each from M. hapla, S. ratti, and A. suum (Fig. S3). For the ABCH family, we found orthologs only in P. pacificus. Given the high level of sequence divergence in this group, it is possible that other nematode orthologs were missed in our database searches.
In summary, the level of conservation of the ABC transporters is much higher than in the gene families discussed above. However, the exact conservation patterns vary from one subfamily to another. An additional factor of diversification of ABC transporters is the presence of partial sequences, which could be interpreted as incomplete protein predictions or real expressed pseudogenes, as already documented in vertebrates (Dean and Annilo 2006). In P. pacificus, one likely case of partial sequences is the pair of tandem proteins encoded on Contig4-snap.264 and Contig4-snap.265, two close paralogs of the C. elegans MRP-1 and MRP-2 (Fig. 4b, Fig. S3).
Pristionchus Expansions of Desaturases and Elongases
Desaturases (FAT) and elongases (ELO) are two gene families that are involved in the production of polyunsaturated C:20 fatty acids (PUFA) from a saturated C:16 precursor in C. elegans (Fig. 1c) (Watts 2009). Elongases additionally elongate branched-chain fatty acids (Fig. 1c) (Watts 2009). Desaturase activity is carried out by a single protein domain (PF00487), whose three-dimensional structure has not yet been resolved in animals because it is very recalcitrant to purification (Napier et al. 2003). The desaturase domain is sometimes combined in the same protein with a Cytochrome b5-like Heme/Steroid binding domain (PF00173), which contributes to the electron transfer during the formation of the double bond on the fatty acid chain (Dean and Annilo 2006), or with a sphingolipid delta(4)-desaturase domain (PF08557), which is associated with a sphingolipid delta(4)-desaturase activity in various eukaryotes (Ternes et al. 2002). We find that in nematodes the desaturase family divides into four principal clades, all of which are well supported (Fig. 5). One clade contains FAT-1 and FAT-2, which are specifically duplicated in the Caenorhabditis genus. The ancestral gene leading to FAT-1 and FAT-2 has also been independently duplicated in P. pacificus and in B. xylophilus. The second clade contains the proteins bearing the additional sphingolipid delta(4)-desaturase domain, including the functionally uncharacterized C. elegans proteins F33D4.4 and Y54E5A.1, which are both conserved as one-to-one orthologs in P. pacificus. This is also the only clade to contain a sequence from T. spiralis. The third clade, containing the C. elegans FAT-3 and FAT-4, consists of all proteins with an extra Cytochrome b5-like Heme/Steroid binding domain in their N-terminus. Both genes are well conserved among nematodes, with a few exceptions: a single duplication of FAT-4 in P. pacificus, a loss of both genes in A. suum, and a loss of FAT-4 in B. malayi.
The fourth clade comprises the C. elegans FAT-5, FAT-6, and FAT-7 genes and their close homologs. FAT-5 is unambiguously conserved in the three members of the Caenorhabditis genus. In contrast, FAT-6 and FAT-7 are specific to C. elegans and have one single paralog in C. remanei and two paralogs in C. briggsae. A single duplication occurs specifically in A. suum, whereas a lineage-specific amplification, leading to six paralogs, occurred in P. pacificus (Fig. 5).
The elongase family, corresponding to proteins with the PFAM domain PF01151, shows a similar general pattern with a moderate number of duplications and losses in various nematodes and two more important amplifications in P. pacificus (Fig. S4). The nine C. elegans elongases are unambiguously conserved across the Caenorhabditis genus, and five of them (ELO-2, ELO-3, ELO-5, ELO-6, and ELO-8) have one-to-one orthologs in at least one nematode from clade IV and one nematode from clade III. For ELO-4, an ambiguous pattern of multiple sequences for all nematodes outside of the Caenorhabditis genus is found. Similarly, for ELO-1 and ELO-7, the relationships among the sequences from different clades are unresolved. P. pacificus underwent an expansion to nine paralogs in the ELO-9 group, unambiguously resulting from a single expansion event. In addition, P. pacificus has 14 paralogs in the ELO-1/ELO-7 group, resulting from one or two distinct amplification events. This high number of duplicates results in very divergent sequences and a partially unresolved phylogeny.
Together, the amplitude of gene duplications is smaller for genes of the PUFA synthesis pathway than for genes of the xenobiotic metabolism pathway. However, even this moderate level of variation is sufficient to leave only one of those genes conserved as a one-to-one ortholog between the model nematodes C. elegans and P. pacificus (Fig. 6).
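The distinction between one-to-one orthologs and lineage-specific duplicates used in this section can be illustrated with a minimal Python sketch. It classifies one clade of a gene tree by counting its members per species. The function name, the input format, and the P. pacificus gene identifiers are hypothetical placeholders; only the two patterns themselves (F33D4.4 conserved as a one-to-one ortholog, and a single duplication of FAT-4 in P. pacificus) are taken from the text above. Counting is only meaningful within a clade that a phylogenetic analysis has already resolved, so this is an illustration of the vocabulary, not of the analysis itself.

```python
from collections import Counter


def classify_clade(members, species_a="C. elegans", species_b="P. pacificus"):
    """Classify the relationship between two species within one gene-tree clade.

    `members` is a list of (species, gene_id) pairs that a phylogenetic
    analysis has already placed in the same well-supported clade.
    """
    counts = Counter(species for species, _ in members)
    n_a, n_b = counts[species_a], counts[species_b]  # missing species count as 0
    if n_a == 0 or n_b == 0:
        return "absent in one species"
    if n_a == 1 and n_b == 1:
        return "one-to-one"
    if n_a == 1 or n_b == 1:
        return "one-to-many"
    return "many-to-many"


# P. pacificus identifiers below are hypothetical placeholders.
sphingolipid_clade = [("C. elegans", "F33D4.4"), ("P. pacificus", "Ppa-gene-A")]
fat4_clade = [
    ("C. elegans", "FAT-4"),
    ("P. pacificus", "Ppa-gene-B"),
    ("P. pacificus", "Ppa-gene-C"),
]

print(classify_clade(sphingolipid_clade))
print(classify_clade(fat4_clade))
```

Under these assumptions the first clade is reported as one-to-one and the second as one-to-many, which mirrors the wording used for the desaturase clades.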
High Variation Both in Absolute Gene Numbers and in Relative Proportions in Seven Families
To obtain a more complete overview of the general dynamics of gene birth and death rates, we reconstructed the ancestral gene set profiles at all nodes of the nematode tree among the eleven fully sequenced species (Fig. 7; Table S3). Variation occurs among nematode genomes both in terms of total gene numbers and in terms of relative proportions among the different gene families. Specifically, there is a global gene increase in all clade V nematodes, up to the intragenus level in Caenorhabditis, and with a stronger overall increase in the branch leading to P. pacificus. In contrast, there is a global gene decrease in branches leading to clade III and clade IV nematodes, and also in the branch leading to the clade I T. spiralis. The global birth rates (β) were estimated to vary from almost 0 to 1.65 births per gene per million years, the maximum being in the branch between B. xylophilus and its last common ancestor with S. ratti, whereas the global death rates (δ) were estimated to vary between 0 and 0.32 deaths per gene per million years. Those values are much higher than the rates that are reported in the literature for eukaryotes, ranging from 0.0006 to 0.0193 (Demuth and Hahn 2009). Maximum-likelihood estimations were recently shown to be prone to overestimation of turnover rates (Almeida et al. 2014). Therefore, we also estimated those rates under the two parsimony methods implemented in BadiRate (BDI-FR-CSP and BDI-FR-CWP, Librado et al. 2012) and found that the rates are even higher, ranging up to 2.2 births under the CWP method and up to 4.24 under the CSP method. We hypothesize that those extremely high estimates reflect both the biological and technical peculiarities of our dataset. Indeed, all species from the two most basal clades (clade I and clade III) are parasites that are known to have lost many genes, and the same holds at the base of clade IV. In addition, except for T. spiralis, which has a high-quality assembly (N50 = 6 Mb), the other parasites also have genome assemblies of lower quality than those of free-living nematodes and B. xylophilus (Rödelsperger et al. 2013). Thus, the different quality of genomes may contribute to an artefactual increase of the gene turnover rate estimates, and we therefore believe that further careful calibration studies are necessary before the reported rates can be compared with those outside nematodes.
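To give a sense of scale for the rates quoted above, the following Python sketch compares the maximum birth rate estimated here with the upper bound reported in the literature. It assumes that both are expressed in the same units (events per gene per million years). The expected family size uses a simple linear birth-death process, E[n(t)] = n0 · exp((β − δ) t), which is an illustrative simplification and not the model implemented in BadiRate; the function names and the starting family size of 10 genes are hypothetical.

```python
import math

# Values quoted in the text (assumed to share the same units:
# births per gene per million years).
MAX_BIRTH_RATE_ML = 1.65      # maximum-likelihood maximum in this study
LITERATURE_MAX_RATE = 0.0193  # upper bound reported for eukaryotes


def fold_difference(rate, reference):
    """Return how many times larger `rate` is than `reference`."""
    return rate / reference


def expected_family_size(n0, birth, death, time_myr):
    """Expected size under a simple linear birth-death process (assumption).

    E[n(t)] = n0 * exp((birth - death) * t). This ignores innovation and
    branch-specific rates used by dedicated turnover-rate software.
    """
    return n0 * math.exp((birth - death) * time_myr)


fold = fold_difference(MAX_BIRTH_RATE_ML, LITERATURE_MAX_RATE)
print(f"Fold difference to the literature maximum: {fold:.1f}")

# Hypothetical family of 10 genes followed for 1 million years, no deaths.
for label, rate in [("literature maximum", LITERATURE_MAX_RATE),
                    ("this study, maximum", MAX_BIRTH_RATE_ML)]:
    size = expected_family_size(10, rate, 0.0, 1.0)
    print(f"{label}: expected size {size:.1f}")
```

The gap of nearly two orders of magnitude is consistent with the caution expressed above about comparing these estimates with rates from other taxa.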
In addition to the limitations discussed above for the absolute turnover rates, the relative proportions of the seven analyzed gene families also vary. Starting from a common ancestor where all proportions are almost equal, the families enriched in xenobiotic-metabolizing genes increase faster than the PUFA-metabolizing genes. PUFA-metabolizing genes indeed represent 26 % of the estimated gene set in the common ancestor of the 11 nematodes, but drop to between 4.8 and 19.4 % of the genes counted in the genomes of the 11 investigated species. For the xenobiotic-metabolizing genes, the trends are different. For example, CYPs represent 27 % of the total gene number in P. pacificus, whereas they form only 4 % in T. spiralis and between 3 and 8 % in clade III nematodes. Even more strikingly, the proportion of UGT genes varies from 2 % in B. malayi up to 40 % of genes in M. hapla. Thus, there is no sign of obvious correlation in the detailed pattern of gene expansion even within families that share functional similarities in terms of biochemical activity.
Higher Variation in Gene Numbers for Xenobiotic-Metabolizing Families Relative to PUFA Synthesis or Citrate Cycle Enzymes
To place our findings in a broader context and to gauge the expected baseline of gene conservation, we compared the variation in genes and clusters of orthologs for the seven studied families to the variation levels of those families containing genes coding for the enzymes of the citrate cycle (Fig. 8; Table S4). We chose this pathway because it is a central metabolic pathway that is considered to be universal, and also because previous data already suggested that it is strictly conserved in terms of one-to-one orthologs between the models C. elegans and P. pacificus (Sinha et al. 2012). For this, we made an automated estimate of gene number both for our families of interest and for the enzymes from the citrate cycle, using a clustering approach with OrthoMCL (see “Materials and Methods” section for details). We measured variation as the difference between the maximal and minimal number of genes or clusters in a given family. The observed variation for the ABCE domains, which we know to be strictly conserved from our manual analysis, indicates that the estimates given by OrthoMCL contain a certain amount of errors. Similarly, for the CYP, UGT, and ELO families, the differences relative to the maximum PFAM domain counts are much higher, mainly because of artefactual domain fusions (listed in additional dataset S1) that cause the program to aggregate genes from other families into the clusters. For citrate cycle and PUFA synthesis genes, the variation ranges between 0 and 26 genes or clusters, whereas the numbers rise up to 38 clusters or 160 genes in some of the xenobiotic-metabolizing families, excluding the artefactual overestimates discussed above. Note that the separation between both categories is not straightforward, because some genes from the ABCD and ABCE families are not involved in xenobiotic metabolism. Taken together, the gene numbers from the xenobiotic metabolism pathway are more variable than those from the citrate cycle or PUFA synthesis pathway.
But even the comparatively lower level of variation observed in the gene families coding for PUFA synthesis enzymes is sufficient to give an almost complete lack of one-to-one orthologs between the two model nematodes C. elegans and P. pacificus (Fig. 6).
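The variation measure used above (the difference between the maximal and minimal number of genes or clusters in a given family) can be written as a short Python sketch. The family names, species labels, and counts below are hypothetical placeholders chosen only to show the calculation; they are not values from Table S4.

```python
def variation(counts_per_species):
    """Difference between the maximal and minimal count across species."""
    values = list(counts_per_species.values())
    return max(values) - min(values)


# Hypothetical gene counts per species for two illustrative families.
hypothetical_counts = {
    "conserved_family": {"species_1": 3, "species_2": 3, "species_3": 4},
    "expanding_family": {"species_1": 12, "species_2": 87, "species_3": 40},
}

for family, counts in hypothetical_counts.items():
    print(family, variation(counts))
```

Because the measure depends only on the two extreme values, a single inflated count (for example from an artefactual domain fusion) is enough to raise it, which is why the overestimates discussed above are excluded.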
Discussion
In this study, we investigated the homology relationships in seven gene families using some of the main enzymes involved in the detoxification of xenobiotics and in the synthesis of polyunsaturated and branched-chain fatty acids in nematodes. The analysis of the genomes of 11 species of seven nematode genera revealed that only a small proportion of the genes involved in these pathways are one-to-one orthologs. For example, the phylogenetic relationships of the 349 C. elegans and 528 P. pacificus genes analyzed in this study showed that only 11.17 % of the C. elegans genes and 7.39 % of the P. pacificus genes are conserved one-to-one orthologs. The analysis of the genes encoding GST enzymes is a case in point. The apparently constant number of 59 genes in C. elegans and P. pacificus first pointed toward a high level of conservation and orthology. However, the presence of only one of these genes as a true one-to-one ortholog between both species indicates that orthology assignments require detailed phylogenetic analyses relying on manually curated genome-wide datasets.
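The two percentages quoted above follow directly from the gene totals and from the 39 one-to-one orthologs stated in the abstract. A minimal Python check, in which the function and variable names are illustrative only:

```python
ONE_TO_ONE_ORTHOLOGS = 39  # number stated in the abstract
GENES_ANALYZED = {"C. elegans": 349, "P. pacificus": 528}


def percent_one_to_one(n_orthologs, n_genes):
    """Share of analyzed genes that are one-to-one orthologs, in percent."""
    return 100.0 * n_orthologs / n_genes


for species, total in GENES_ANALYZED.items():
    share = percent_one_to_one(ONE_TO_ONE_ORTHOLOGS, total)
    print(f"{species}: {share:.2f} %")
```

The same ortholog count yields a lower percentage for P. pacificus simply because more P. pacificus genes were analyzed.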
While every single data point (gene or gene family) is of course subject to change because of the potential incompleteness of the genome sequencing data, the overwhelming number of lineage-specific expansions observed in nematodes makes it likely that the overall phenomenon is robust. Our sampling dataset is enriched in genes that duplicate at a higher rate than the average. Indeed, previous comparisons based on whole-genome sequence data of C. elegans and P. pacificus indicated that 23.5 % of the approximately 20,000 C. elegans genes and 16.1 % of the originally predicted 29,201 P. pacificus genes are one-to-one orthologs (Dieterich et al. 2008) and, consistent with this, we show that xenobiotic metabolic genes vary in numbers at a higher rate than genes encoding core metabolic enzymes of the citrate cycle (Fig. 8). However, these genome-wide comparisons also point toward a limited to moderate orthology relationship of genes in nematodes. It should be stated that a complete analysis of the orthology relationships of nematode genes in a genome-wide context awaits future analysis. Such studies are only possible using a systematic approach involving more species, as indicated here for the detoxification and fatty acid synthesis enzymes. One should also note that we did not check specifically for gene conversion events. If such events occurred, as already described for some CYPs in Helicoverpa moths (Li et al. 2002), this would further substantiate our point that nematode genomes have quite distinct evolutionary histories, because for those genes, orthology would not even be a valid possibility.
Lineage-Specific Divergence and Functional Specificity
The most important finding and overall conclusion of our study is the existence of massive duplication events, lineage-specific amplifications, and gene losses in nematodes, with variation in birth and death rates from one family to another even among enzymes belonging to the same pathway. These results have two major evolutionary implications. First, a major evolutionary challenge resulting from our observations is the relationship between lineage-specific amplifications and functional specificity. It is generally assumed that orthologous genes are functionally more conserved than paralogous genes (Tatusov et al. 1997), although this is still a matter of intense debate (Dessimoz et al. 2012; Gabaldón and Koonin 2013). Orthology of genes encoding enzymes in a given pathway is generally considered a reasonable basis to assess functional conservation of molecular mechanisms between species. In contrast to this assumption, the massive lineage-specific gene amplifications and the resulting limitations of one-to-one orthology relationships strongly suggest that the functional specificity of many enzyme-encoding genes might have changed during the course of nematode evolution.
While functional studies are generally sparse, some case studies suggest that genes encoding enzymes with metabolic activity have an unusual amount of convergence, promiscuity, and redundancy. For example, the short-chain reductase LET-767 does not belong to the elongase family but has elongase activity, elongating fatty acids with a linear carbon chain (Watts 2009) (Fig. 6). Similarly, redundancy among enzymes seems to represent a widespread phenomenon; e.g., the distantly related ELO-1 and ELO-2 are both able to elongate a 20:3(n-6) fatty acid from an 18:3(n-6) fatty acid and a 20:4(n-3) fatty acid from an 18:4(n-3) fatty acid (Watts 2009). Another example is GST-10 from the GST pi group and its distantly related paralogs GST-6 and GST-8 from the sigma group, which are all able to conjugate 4-HNE, a product of lipid peroxidation (Ayyadevara et al. 2007). Finally, it is well known that some enzymes are promiscuous, being able to process different substrates. In PUFA metabolism, ELO-1 is known to elongate 16:0, 18:3(n-6) and 18:4(n-3) fatty acids, and FAT-1 performs an omega-3 desaturation on 18:2(n-6), 20:3(n-6), and 20:4(n-6) fatty acids (Watts 2009). Xenobiotic-metabolizing enzymes are even more versatile, with C. elegans F54F3.4 (SDR25C22) metabolizing isatin and 4-oxonon-2-enal (Kisiela et al. 2011), and DHS-21 converting L-xylulose into xylitol and reducing various dicarbonyl compounds (Son et al. 2011). In the filarial worm O. volvulus, the conserved GST-11 conjugates various xenobiotics and additionally acts as prostaglandin D2 synthase, thus producing a molecule that fools the immune system of its human host (Sommer et al. 2003). Taken together, functional convergence, promiscuity, and redundancy might relax the constraints on the evolution of genes in these multigene families, resulting in the pattern of massive lineage-specific amplifications observed in this study.
The release of functional constraints might differ between gene families and could therefore account for the observed differences in gene amplifications found in distinct gene families. The absence of signatures for positive selection in GSTs duplicated specifically within the Caenorhabditis genus, together with similar results on xenobiotic-metabolizing ABCs (Zhao et al. 2007), suggests that variation in the number of genes coding for catalytically redundant enzymes could also contribute to the modulation of metabolic phenotypes.
A related implication of the absence of one-to-one orthology relationships is the resulting uncertainty of assigning the existing enzymes to specific metabolic activities. For example, genes encoding candidate enzymes for the synthesis of polyunsaturated fatty acids are currently uncharacterized biochemically in P. pacificus. However, there is indirect evidence for their activity based on the identification of ascaroside pheromones (Bose et al. 2012), which are thought to derive from polyunsaturated fatty acids (von Reuss et al. 2012). Thus, the almost complete lack of one-to-one orthologs in P. pacificus for the enzymes involved in the biosynthesis pathway from C:16 to C:20 PUFAs in C. elegans suggests that, in P. pacificus, those molecules are synthesized by enzymes encoded by genes that are paralogous to the C. elegans ones.
Lineage-Specific Gene Expansions and Homology in Metabolic Pathways
The second major implication of our survey is in the context of the homology concept. In the pre-genome era, a proper distinction between orthology and paralogy of individual homologous genes was nearly impossible. Genome-wide studies, such as the one described here, clearly indicate that in certain gene families one-to-one orthology is an exception rather than the rule. Studies in insects come to the same overall conclusion, indicating that at least in these species-rich phyla massive gene amplifications are common (Sánchez-Gracia et al. 2009; Fang et al. 2009; Bass and Field 2011). Both nematodes and insects are special among sequenced eukaryotes because there is no documented whole-genome duplication event, such as those found in vertebrates, land plants, and ciliates, that are proposed to be major drivers for evolutionary diversification (Van de Peer et al. 2009). Previous genomic data in nematodes already led us to propose that in those organisms, horizontal gene transfers and gene amplification could produce similar levels of genomic diversity in the absence of whole-genome duplication (Markov and Sommer 2012). However, it remained unknown whether those amplifications occurred before the radiation of some nematode species, as could be suggested by the similar numbers of GSTs in C. elegans and P. pacificus, or whether they occurred preferentially in a species-specific way. Our dataset indicates that species-specific or at least genus-specific amplifications are an important component of the duplication pattern. More generally, there is a great variety in duplication patterns. Specifically, even the comparatively smaller scale amplification events, such as those we observed in PUFA-synthesizing enzymes, are important to consider when addressing the question of functional conservation of metabolic pathways.
Some authors have argued that there is no reason to distinguish between one-to-one orthologs and recently diverged paralogs as far as gene/protein function is concerned (Koonin 2005). This is also the main principle behind cluster searches in OrthoMCL (Li et al. 2003). However, this assumption relies mainly on data in bacteria, where genomes experience evolutionary forces that differ strongly from those of eukaryotes (Lynch et al. 2011). Moreover, such concepts also depend on the level of resolution in the definition of gene or protein “function.” For example, in the PUFA synthesis pathway, all desaturases and elongases could be annotated as members of this pathway, and indeed such an assumption would be a reasonable inference to start with. However, when searching for candidates for a particular enzymatic reaction in nematodes, the available knowledge of enzymatic activities in C. elegans points to the lack of important one-to-one orthologs, resulting in the absence of candidate genes for particular reactions.
The evolutionary patterns observed in this study are in strong discrepancy with the observed evolution of many other genes and gene families, e.g., developmental control genes. During the last two decades, the comparative analysis of developmental processes revealed a surprising and unexpected conservation of genes controlling development. Throughout the animal kingdom, transcription factors and signaling pathways are conserved (Carroll 2005; Gerhard and Kirschner 1997; Wilkins 2002; Pires-daSilva and Sommer 2003). Interestingly, the comparison of species within one phylum often reveals that genes encoding key factors in signaling pathways show one-to-one orthology relationships, e.g., genes encoding Hedgehog, Wnt, EDA, and other ligands, but also their receptors and downstream cytoplasmic adaptors and kinases (Keys et al. 1999; Zheng et al. 2005; Pantalacci et al. 2008).
This discrepancy strongly suggests that different evolutionary forces shape the evolution of genes and genomes. The widespread existence of one-to-one orthologs in signaling pathways and developmental control genes can best be explained by purifying selection that constrains duplication and amplification events in these genes, and the same could be true for core essential metabolic pathways such as the citrate cycle. In contrast, such constraints seem to be released for members of other gene families. At least in part, this release of constraints might result from the promiscuity and redundancy discussed above. In addition, it might result from the influence of the environment on organisms and their genomes. We hypothesize that duplications of enzyme-encoding genes might be initially tolerated because gene and protein activity are not crucial for the survival of the individual. Depending on changes of the environment, the resulting variance in metabolic activity might become adaptive, resulting in rapid changes in the genomic composition of the organism. While such hypotheses await experimental validation, they can explain the different patterns of evolution seen in different gene classes.
In conclusion, our systematic approach to homology assignments in gene families strongly indicates that in the era of whole-genome sequencing data the comparison of multiple related species in a defined phylogenetic context allows precise evaluations of the history of individual genes. Such studies can be of tremendous value for experimentalists. Given the still growing rate of whole-genome sequencing projects, the refinement of rough bioinformatic studies by detailed phylogenetic analyses on manually curated protein family datasets will be of great importance in the near future, and such analyses are likely to provide many surprising findings and re-assignments of homology relationships.
References
Abad P, Gouzy J, Aury JM et al (2008) Genome sequence of the metazoan plant-parasitic nematode Meloidogyne incognita. Nat Biotechnol 26:909–915. doi:10.1038/nbt.1482
Abascal F, Zardoya R, Posada D (2005) ProtTest: selection of best-fit models of protein evolution. Bioinformatics 21:2104–2105. doi:10.1093/bioinformatics/bti263
Ahn SJ, Vogel H, Heckel DG (2012) Comparative analysis of the UDP-glycosyltransferase multigene family in insects. Insect Biochem Mol Biol 42:133–147. doi:10.1016/j.ibmb.2011.11.006
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19:716–723. doi:10.1109/TAC.1974.1100705
Almeida FC, Sánchez-Gracia A, Campos JL, Rozas J (2014) Family size evolution in Drosophila chemosensory gene families: a comparative analysis with a critical appraisal of methods. Genome Biol Evol 6:1669–1682. doi:10.1093/gbe/evu130
Anisimova M, Gascuel O (2006) Approximate likelihood-ratio test for branches: a fast, accurate, and powerful alternative. Syst Biol 55:539–552. doi:10.1080/10635150600755453
Anisimova M, Yang Z (2007) Multiple hypothesis testing to detect lineages under positive selection that affects only a few sites. Mol Biol Evol 24:1219–1228. doi:10.1093/molbev/msm042
Asojo OA, Homma K, Sedlacek M, Ngamelue M, Goud GN, Zhan B, Deumic V, Asojo O, Hotez PJ (2007) X-ray structures of Na-GST-1 and Na-GST-2 two glutathione S-transferase from the human hookworm Necator americanus. BMC Struct Biol 7:42. doi:10.1186/1472-6807-7-42
Ayyadevara S, Dandapat A, Singh SP, Siegel ER, Reis RJS, Zimniak L, Zimniak P (2007) Life span and stress resistance of Caenorhabditis elegans are differentially affected by glutathione transferases metabolizing 4-hydroxynon-2-enal. Mech Ageing Dev 128:196–205. doi:10.1016/j.mad.2006.11.025
Bass C, Field LM (2011) Gene amplification and insecticide resistance. Pest Manag Sci 67:886–890. doi:10.1002/ps.2189
Blaxter M (2011) Nematodes: the worm and its relatives. PLoS Biol 9:e1001050. doi:10.1371/journal.pbio.1001050
Blaxter ML, Ley PD, Garey JR, Liu LX, Scheldeman P, Vierstraete A, Vanfleteren JR, Mackey LY, Dorris M, Frisse LM, Vida JT, Thomas W (1998) A molecular evolutionary framework for the phylum Nematoda. Nature 392:71–75. doi:10.1038/32160
Borchert N, Dieterich C, Krug K, Schütz W, Jung S, Nordheim A, Sommer RJ, Macek B (2010) Proteogenomics of Pristionchus pacificus reveals distinct proteome structure of nematode models. Genome Res 20:837–846. doi:10.1101/gr.103119.109
Bose N, Ogawa A, von Reuss SH, Yim JJ, Ragsdale EJ, Sommer RJ, Schroeder FC (2012) Complex small-molecule architectures regulate phenotypic plasticity in a nematode. Angew Chem Int Ed Engl 51:12438–12443. doi:10.1002/anie.201206797
Brown CM, Reisfeld B, Mayeno AN (2008) Cytochromes P450: a structure-based summary of biotransformations using representative substrates. Drug Metab Rev 40:1–100
C. elegans Sequencing Consortium (1998) Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282:2012–2018. doi:10.1126/science.282.5396.2012
Carroll SB (2005) Endless forms most beautiful. Norton & Comp, New York
Csurös M, Miklós I (2009) Streamlining and large ancestral genomes in Archaea inferred with a phylogenetic birth-and-death model. Mol Biol Evol 26:2087–2095. doi:10.1093/molbev/msp123
Dean M, Annilo T (2006) Evolution of the ATP-binding cassette (ABC) transporter superfamily in vertebrates. Annu Rev Genomics Hum Genet 6:123–142. doi:10.1146/annurev.genom.6.080604.162122
Demuth JP, Hahn MW (2009) The life and death of gene families. Bioessays 31:29–39. doi:10.1002/bies.080085
Desjardins CA, Cerqueira GC, Goldberg JM, Hotopp JCD, Haas BJ, Zucker J, Ribeiro JMC, Saif S, Levin JZ, Fan L, Zeng Q, Russ C, Wortman JR, Fink DL, Birren BW, Nutman TB (2013) Genomics of Loa loa a Wolbachia-free filarial parasite of humans. Nat Genet 45:495–500. doi:10.1038/ng.2585
Dessimoz C, Gabaldón T, Roos DS, Sonnhammer ELL, Herrero J, Quest for Orthologs Consortium (2012) Toward community standards in the quest for orthologs. Bioinformatics 28:900–904. doi:10.1093/bioinformatics/bts050
Dieterich C, Clifton SW, Schuster LN, Chinwalla A, Delehaunty K, Dinkelacker I, Fulton L, Fulton R, Godfrey J, Minx P, Mitreva M, Roeseler W, Tian H, Witte H, Yang SP, Wilson RK, Sommer RJ (2008) The Pristionchus pacificus genome provides a unique perspective on nematode lifestyle and parasitism. Nat Genet 40:1193–1198. doi:10.1038/ng.227
Dong J, Song MO, Freedman JH (2005) Identification and characterization of a family of Caenorhabditis elegans genes that is homologous to the cadmium-responsive gene cdr-1. Biochim Biophys Acta 1727:16–26. doi:10.1016/j.bbaexp.2004.11.007
Eddy SR (2011) Accelerated profile HMM searches. PLoS Comput Biol 7:e1002195. doi:10.1371/journal.pcbi.1002195
Edgar RC (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinform 5:113. doi:10.1186/1471-2105-5-113
Fang S, Ting CT, Lee CR, Chu KH, Wang CC, Tsaur SC (2009) Molecular evolution and functional diversification of fatty acid desaturases after recurrent gene duplication in Drosophila. Mol Biol Evol 26:1447–1456. doi:10.1093/molbev/msp057
Fitch WM (1970) Distinguishing homologous from analogous proteins. Syst Zool 19:99–113. doi:10.2307/2412448
Gabaldón T, Koonin EV (2013) Functional and evolutionary implications of gene orthology. Nat Rev Genet 14:360–366. doi:10.1038/nrg3456
Gerhard J, Kirschner M (1997) Cells, embryos and evolution. Blackwell Science, Oxford
Ghedin E, Wang S, Spiro D et al (2007) Draft genome of the filarial nematode parasite Brugia malayi. Science 317:1756–1760. doi:10.1126/science.1145406
Godel C, Kumar S, Koutsovoulos G, Ludin P, Nilsson D, Comandatore F, Wrobel N, Thompson M, Schmid CD, Goto S, Bringaud F, Wolstenholme A, Bandi C, Epe C, Kaminsky R, Blaxter M, Mäser P (2012) The genome of the heartworm Dirofilaria immitis reveals drug and vaccine targets. FASEB J 26:4650–4661. doi:10.1096/fj.12-205096
Gouy M, Guindon S, Gascuel O (2010) SeaView version 4: a multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Mol Biol Evol 27:221–224. doi:10.1093/molbev/msp259
Guindon S, Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52:696–704. doi:10.1080/10635150390235520
Hahn MW, Han MV, Han SG (2007) Gene family evolution across 12 Drosophila genomes. PLoS Genet 3:e197. doi:10.1371/journal.pgen.0030197
Hall B (1994) Homology. Academic Press, San Diego
Harrop SJ, DeMaere MZ, Fairlie WD, Reztsova T, Valenzuela SM, Mazzanti M, Tonini R, Qiu MR, Jankova L, Warton K, Bauskin AR, Wu WM, Pankhurst S, Campbell TJ, Breit SN, Curmi P (2001) Crystal structure of a soluble form of the intracellular chloride ion channel CLIC1 (NCC27) at 1.4-A resolution. J Biol Chem 276:44993–45000. doi:10.1074/jbc.M107804200
Herrmann M, Mayer WE, Hong RL, Kienle S, Minasaki R, Sommer RJ (2007) The nematode Pristionchus pacificus (Nematoda: Diplogastridae) is associated with the oriental beetle Exomala orientalis (Coleoptera: Scarabaeidae) in Japan. Zoolog Sci 24:883–889. doi:10.2108/zsj.24.883
Herrmann M, Kienle S, Rochat J, Mayer W, Sommer RJ (2010) Haplotype diversity of the nematode Pristionchus pacificus on Réunion in the Indian Ocean suggests multiple independent invasions. Biol J Linn Soc 100:170–179. doi:10.1111/j.1095-8312.2010.01410.x
Jex AR, Liu S, Li B, Young ND, Hall RS, Li Y, Yang L, Zeng N, Xu X, Xiong Z, Chen F, Wu X, Zhang G, Fang X, Kang Y, Anderson GA, Harris TW, Campbell BE, Vlaminck J, Wang T, Cantacessi C, Schwarz EM, Ranganathan S, Geldhof P, Nejsum P, Sternberg PW, Yang H, Wang J, Wang J, Gasser RB (2011) Ascaris suum draft genome. Nature 479:529–533. doi:10.1038/nature10553
Katoh K, Kuma K, Toh H, Miyata T (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33:511–518. doi:10.1093/nar/gki198
Kavanagh KL, Jörnvall H, Persson B, Oppermann U (2008) Medium- and short-chain dehydrogenase/reductase gene and protein families: the SDR superfamily: functional and structural diversity within a family of metabolic and regulatory enzymes. Cell Mol Life Sci 65:3895–3906. doi:10.1007/s00018-008-8588-y
Keys DN, Lewis D, Selegue JE, Pearson BJ, Goodrich LV, Johnson RL, Gates J, Scott MP, Carroll SB (1999) Recruitment of a hedgehog regulatory circuit in butterfly eyespot evolution. Science 283:532–534. doi:10.1126/science.283.5401.532
Kikuchi T, Cotton JA, Dalzell JJ, Hasegawa K, Kanzaki N, McVeigh P, Takanashi T, Tsai IJ, Assefa SA, Cock PJA, Otto TD, Hunt M, Reid AJ, Sanchez-Flores A, Tsuchihara K, Yokoi T, Larsson MC, Miwa J, Maule AG, Sahashi N, Jones JT, Berriman M (2011) Genomic insights into the origin of parasitism in the emerging plant pathogen Bursaphelenchus xylophilus. PLoS Pathog 7:e1002219. doi:10.1371/journal.ppat.1002219
Kiontke KC, Félix MA, Ailion M, Rockman MV, Braendle C, Pénigault JB, Fitch DHA (2011) A phylogeny and molecular barcodes for Caenorhabditis, with numerous new species from rotting fruits. BMC Evol Biol 11:339. doi:10.1186/1471-2148-11-339
Kisiela M, El-Hawari Y, Martin HJ, Maser E (2011) Bioinformatic and biochemical characterization of DCXR and DHRS2/4 from Caenorhabditis elegans. Chem Biol Interact 191:75–82. doi:10.1016/j.cbi.2011.01.034
Koonin EV (2005) Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet 39:309–338. doi:10.1146/annurev.genet.39.073003.114725
Ladner JE, Parsons JF, Rife CL, Gilliland GL, Armstrong RN (2004) Parallel evolutionary pathways for glutathione transferases: structure and mechanism of the mitochondrial class kappa enzyme rGSTK1-1. Biochemistry 43:352–361. doi:10.1021/bi035832z
Le SQ, Lartillot N, Gascuel O (2008) Phylogenetic mixture models for proteins. Philos Trans R Soc Lond B 363:3965–3976. doi:10.1098/rstb.2008.0180
Li X, Berenbaum MR, Schuler MA (2002) Cytochrome P450 and actin genes expressed in Helicoverpa zea and Helicoverpa armigera: paralogy/orthology identification, gene conversion and evolution. Insect Biochem Mol Biol 32:311–320. doi:10.1016/S0965-1748(01)00092-3
Li L, Stoeckert CJ, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13:2178–2189. doi:10.1101/gr.1224503
Librado P, Vieira FG, Rozas J (2012) BadiRate: estimating family turnover rates by likelihood-based methods. Bioinformatics 28:279–281. doi:10.1093/bioinformatics/btr623
Lindblom TH, Dodd AK (2006) Xenobiotic detoxification in the nematode Caenorhabditis elegans. J Exp Zool A 305:720–730. doi:10.1002/jez.a.324
Littler DR, Harrop SJ, Brown LJ, Pankhurst GJ, Mynott AV, Luciani P, Mandyam RA, Mazzanti M, Tanda S, Berryman MA, Breit SN, Curmi PMG (2008) Comparison of vertebrate and invertebrate CLIC proteins: the crystal structures of Caenorhabditis elegans EXC-4 and Drosophila melanogaster DmCLIC. Proteins 71:364–378. doi:10.1002/prot.21704
Low WY, Ng HL, Morton CJ, Parker MW, Batterham P, Robin C (2007) Molecular evolution of glutathione S-transferases in the genus Drosophila. Genetics 177:1363–1375. doi:10.1534/genetics.107.075838
Lynch M, Bobay LM, Catania F, Gout JF, Rho M (2011) The repatterning of eukaryotic genomes by random genetic drift. Annu Rev Genomics Hum Genet 12:347–366. doi:10.1146/annurev-genom-082410-101412
Markov GV, Sommer RJ (2012) The evolution of novelty in conserved gene families. Int J Evol Biol 2012:490894. doi:10.1155/2012/490894
Markov GV, Tavares R, Dauphin-Villemant C, Demeneix BA, Baker ME, Laudet V (2009) Independent elaboration of steroid hormone signaling pathways in metazoans. Proc Natl Acad Sci USA 106:11913–11918. doi:10.1073/pnas.0812138106
Menzel R, Yeo HL, Rienau S, Li S, Steinberg CEW, Stürzenbaum SR (2007) Cytochrome P450s and short-chain dehydrogenases mediate the toxicogenomic response of PCB52 in the nematode Caenorhabditis elegans. J Mol Biol 370:1–13. doi:10.1016/j.jmb.2007.04.058
Mitreva M, Jasmer DP, Zarlenga DS, Wang Z, Abubucker S, Martin J, Taylor CM, Yin Y, Fulton L, Minx P, Yang SP, Warren WC, Fulton RS, Bhonagiri V, Zhang X, Hallsworth-Pepin K, Clifton SW, McCarter JP, Appleton J, Mardis ER, Wilson RK (2011) The draft genome of the parasitic nematode Trichinella spiralis. Nat Genet 43:228–235. doi:10.1038/ng.769
Moore AD, Grath S, Schüler A, Huylmans AK, Bornberg-Bauer E (2013) Quantification and functional analysis of modular protein evolution in a dense phylogenetic tree. Biochim Biophys Acta 1834:898–907. doi:10.1016/j.bbapap.2013.01.007
Müller T, Vingron M (2000) Modeling amino acid replacement. J Comput Biol 7:761–776. doi:10.1089/10665270050514918
Napier JA, Michaelson LV, Sayanova O (2003) The role of cytochrome b5 fusion desaturases in the synthesis of polyunsaturated fatty acids. Prostaglandins Leukot Essent Fatty Acids 68:135–143. doi:10.1016/S0952-3278(02)00263-6
Nelson DR (1998) Metazoan cytochrome P450 evolution. Comp Biochem Physiol C 121:15–22. doi:10.1016/S0742-8413(98)10027-0
Opperman CH, Bird DM, Williamson VM, Rokhsar DS, Burke M, Cohn J, Cromer J, Diener S, Gajan J, Graham S, Houfek TD, Liu Q, Mitros T, Schaff J, Schaffer R, Scholl E, Sosinski BR, Thomas VP, Windham E (2008) Sequence and genetic map of Meloidogyne hapla: a compact nematode genome for plant parasitism. Proc Natl Acad Sci USA 105:14802–14807. doi:10.1073/pnas.0805946105
Owen R (1843) Lectures on comparative anatomy and physiology of the invertebrate animals. Delivered at the Royal College of Surgeons in 1843. Longman, Brown, Green and Longman, London
Pantalacci S, Chaumot A, Benoît G, Sadier A, Delsuc F, Douzery EJ, Laudet V (2008) Conserved features and evolutionary shifts of the EDA signaling pathway involved in vertebrate skin appendage development. Mol Biol Evol 25:912–928. doi:10.1093/molbev/msn038
Penn O, Privman E, Ashkenazy H, Landan G, Graur D, Pupko T (2010) GUIDANCE: a web server for assessing alignment confidence scores. Nucleic Acids Res 38:W23–W28. doi:10.1093/nar/gkq443
Perally S, Lacourse EJ, Campbell AM, Brophy PM (2008) Heme transport and detoxification in nematodes: subproteomics evidence of differential role of glutathione transferases. J Proteome Res 7:4557–4565. doi:10.1021/pr800395x
Perbandt M, Höppner J, Betzel C, Walter RD, Liebau E (2005) Structure of the major cytosolic glutathione S-transferase from the parasitic nematode Onchocerca volvulus. J Biol Chem 280:12630–12636. doi:10.1074/jbc.M413551200
Persson B, Kallberg Y, Bray JE, Bruford E, Dellaporta SL, Favia AD, Duarte RG, Jörnvall H, Kavanagh KL, Kedishvili N, Kisiela M, Maser E, Mindnich R, Orchard S, Penning TM, Thornton JM, Adamski J, Oppermann U (2009) The SDR (short-chain dehydrogenase/reductase and related enzymes) nomenclature initiative. Chem Biol Interact 178:94–98. doi:10.1016/j.cbi.2008.10.040
Pires-daSilva A, Sommer RJ (2003) The evolution of signalling pathways in animal development. Nat Rev Genet 4:39–49. doi:10.1038/nrg977
Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, Sonnhammer ELL, Eddy SR, Bateman A, Finn RD (2012) The Pfam protein families database. Nucleic Acids Res 40:D290–D301. doi:10.1093/nar/gkr1065
Rieppel O (1988) Fundamentals of comparative biology. Birkhäuser Verlag, Basel
Rödelsperger C, Streit A, Sommer RJ (2013) Structure, function and evolution of the nematode genome. Wiley, Chichester. doi:10.1002/9780470015902.a0024603
Sánchez-Gracia A, Vieira FG, Rozas J (2009) Molecular evolution of the major chemosensory gene families in insects. Heredity 103:208–216. doi:10.1038/hdy.2009.55
Sheehan D, Meade G, Foley VM, Dowd CA (2001) Structure, function and evolution of glutathione transferases: implications for classification of non-mammalian members of an ancient enzyme superfamily. Biochem J 360:1–16. doi:10.1042/0264-6021:3600001
Sheps JA, Ralph S, Zhao Z, Baillie DL, Ling V (2004) The ABC transporter gene family of Caenorhabditis elegans has implications for the evolutionary dynamics of multidrug resistance in eukaryotes. Genome Biol 5:R15. doi:10.1186/gb-2004-5-3-r15
Sinha A, Sommer RJ, Dieterich C (2012) Divergent gene expression in the conserved dauer stage of the nematodes Pristionchus pacificus and Caenorhabditis elegans. BMC Genom 13:254. doi:10.1186/1471-2164-13-254
Sommer A, Rickert R, Fischer P, Steinhart H, Walter RD, Liebau E (2003) A dominant role for extracellular glutathione S-transferase from Onchocerca volvulus is the production of prostaglandin D2. Infect Immun 71:3603–3606. doi:10.1128/IAI.71.6.3603-3606.2003
Son LT, Ko KM, Cho JH, Singaravelu G, Chatterjee I, Choi TW, Song HO, Yu JR, Park BJ, Lee SK, Ahnn J (2011) DHS-21, a dicarbonyl/L-xylulose reductase (DCXR) ortholog, regulates longevity and reproduction in Caenorhabditis elegans. FEBS Lett 585:1310–1316. doi:10.1016/j.febslet.2011.03.062
Srinivasan J, Dillman AR, Macchietto MG, Heikkinen L, Lakso M, Fracchia KM, Antoshechkin I, Mortazavi A, Wong G, Sternberg PW (2013) The draft genome and transcriptome of Panagrellus redivivus are shaped by the harsh demands of a free-living lifestyle. Genetics 193:1279–1295. doi:10.1534/genetics.112.148809
Stein LD, Bao Z, Blasiar D et al (2003) The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS Biol 1:E45. doi:10.1371/journal.pbio.0000045
Tarling EJ, de Aguiar Vallim TQ, Edwards PA (2013) Role of ABC transporters in lipid transport and human disease. Trends Endocrinol Metab 24:342–350. doi:10.1016/j.tem.2013.01.006
Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science 278:631–637. doi:10.1126/science.278.5338.631
Ternes P, Franke S, Zähringer U, Sperling P, Heinz E (2002) Identification and characterization of a sphingolipid delta 4-desaturase family. J Biol Chem 277:25512–25518. doi:10.1074/jbc.M202947200
Thomas JH (2007) Rapid birth-death evolution specific to xenobiotic cytochrome P450 genes in vertebrates. PLoS Genet 3:e67. doi:10.1371/journal.pgen.0030067
Thornton JW, DeSalle R (2000) Gene family evolution and homology: genomics meets phylogenetics. Annu Rev Genomics Hum Genet 1:41–73. doi:10.1146/annurev.genom.1.1.41
Van de Peer Y, Maere S, Meyer A (2009) The evolutionary significance of ancient genome duplications. Nat Rev Genet 10:725–732. doi:10.1038/nrg2600
Van Megen H, Van Den Elsen S, Holterman M, Karssen G, Mooyman P, Bongers T, Holovachov O, Bakker J, Helder J (2009) A phylogenetic tree of nematodes based on about 1200 full-length small subunit ribosomal DNA sequences. Nematology 11:927–950. doi:10.1163/156854109X456862
van Rossum AJ, Jefferies JR, Rijsewijk FAM, LaCourse EJ, Teesdale-Spittle P, Barrett J, Tait A, Brophy PM (2004) Binding of hematin by a new class of glutathione transferase from the blood-feeding parasitic nematode Haemonchus contortus. Infect Immun 72:2780–2790. doi:10.1128/IAI.72.5.2780-2790.2004
von Reuss SH, Bose N, Srinivasan J, Yim JJ, Judkins JC, Sternberg PW, Schroeder FC (2012) Comparative metabolomics reveals biogenesis of ascarosides, a modular library of small-molecule signals in C. elegans. J Am Chem Soc 134:1817–1824. doi:10.1021/ja210202y
Wagner GP (2007) The developmental genetics of homology. Nat Rev Genet 8:473–479. doi:10.1038/nrg2099
Watts JL (2009) Fat synthesis and adiposity regulation in Caenorhabditis elegans. Trends Endocrinol Metab 20:58–65. doi:10.1016/j.tem.2008.11.002
Wilkins AS (2002) The evolution of developmental pathways. Sinauer Associates, Sunderland
Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24:1586–1591. doi:10.1093/molbev/msm088
Yang Z, Bielawski JP (2000) Statistical methods for detecting molecular adaptation. Trends Ecol Evol 15:496–503. doi:10.1016/S0169-5347(00)01994-7
Yook K, Harris TW, Bieri T et al (2012) WormBase 2012: more genomes, more data, new website. Nucleic Acids Res 40:D735–D741. doi:10.1093/nar/gkr954
Zhao Z, Thomas JH, Chen N, Sheps JA, Baillie DL (2007) Comparative genomics and adaptive selection of the ATP-binding-cassette gene family in Caenorhabditis species. Genetics 175:1407–1418. doi:10.1534/genetics.106.066720
Zheng M, Messerschmidt D, Jungblut B, Sommer RJ (2005) Conservation and diversification of Wnt signaling function during the evolution of nematode vulva development. Nat Genet 37:300–304. doi:10.1038/ng1512
Acknowledgments
We thank Drs. Amit Sinha, Akira Ogawa, Erik Ragsdale, and Adrian Streit for useful discussions, as well as Drs. Christian Rödelsperger and James Lightfoot for critical reading of the manuscript. This work was supported by the Max Planck Society.
Markov, G.V., Baskaran, P. & Sommer, R.J. The Same or Not the Same: Lineage-Specific Gene Expansions and Homology Relationships in Multigene Families in Nematodes. J Mol Evol 80, 18–36 (2015). https://doi.org/10.1007/s00239-014-9651-y
DOI: https://doi.org/10.1007/s00239-014-9651-y