Introduction

Homology is a fundamental unit of comparative biology and has been successfully applied at many different levels of biological organization (Hall 1994; Wagner 2007). Introduced by Richard Owen (1843), the homology concept predates Darwin and is, in part, independent of evolutionary theory (Rieppel 1988). Since the 1960s, the homology concept has also been applied in molecular biology with the important distinction between orthology and paralogy as introduced by Fitch (1970). Specifically, orthologous genes or proteins are defined as homologs in different species that evolved from common ancestry through speciation. In contrast, paralogous genes and proteins result from duplications within a genome. While the application of the homology concept to protein-coding genes and the development of the field of bioinformatics have been most fruitful (Dessimoz et al. 2012), one important difference from the use of homology in anatomy and morphology was, for a long time, the degree of context dependence. In anatomy and morphology, homology of structures was always considered in the context of the whole organism. In contrast, homology of genes was for more than three decades always assigned in the absence of whole-genome sequencing data. Therefore, the proper distinction between orthology and paralogy was in part limited by the absence of whole-genome data sets (Thornton and DeSalle 2000) and the lack of phylogenetic resolution among sequenced species (Koonin 2005).

This situation has changed during the last decade with the availability of whole-genome sequencing data, often in a comprehensive phylogenetic context. Such studies can focus on closely related species and help to determine recent patterns and processes of the evolution of genes and gene families. For example, the comparison of the genomes of 12 closely related Drosophila species revealed frequent changes in the size of gene families (Hahn et al. 2007). Alternatively, genome comparisons can try to cover more distantly related species of a given phylum to determine more ancient evolutionary events. Such studies are still in their infancy, given that broad coverage of whole-genome sequencing data is available for only a few animal groups (for example, Moore et al. 2013). However, one group of animals for which whole-genome sequencing data are available in many species is the nematodes (roundworms). Building on the genome sequencing project of the model organism Caenorhabditis elegans in 1998 (C. elegans sequencing consortium 1998), soon followed by another member of this genus, C. briggsae (Stein et al. 2003), species of seven additional genera had their genomes sequenced and published with protein prediction sequences available for search in public databases. These include animal parasites (Brugia malayi, Loa loa, Ascaris suum, and Trichinella spiralis) (Ghedin et al. 2007; Desjardins et al. 2013; Jex et al. 2011; Mitreva et al. 2011), plant parasites (Meloidogyne hapla and Bursaphelenchus xylophilus) (Opperman et al. 2008; Kikuchi et al. 2011), as well as an additional free-living nematode Pristionchus pacificus (Dieterich et al. 2008). Based on molecular phylogeny, the nematodes are divided into five major clades, numbered from I to V (Blaxter et al. 1998), and representatives of four of them are now sequenced and available for genomic searches (Fig. 1a).
Two species, Caenorhabditis remanei and Strongyloides ratti, for which a genome sequence is available in public databases but not formally published were also covered in this study, whereas species with a published genome for which protein predictions are not yet available for search, such as Dirofilaria immitis, Meloidogyne incognita, and Panagrellus redivivus (Godel et al. 2012; Abad et al. 2008; Srinivasan et al. 2013), were left aside.

Fig. 1

Phylogenetic relationships among nematode genomic models, and structure of the biochemical pathways that were investigated. a Relationships between nematodes for which a genome sequence is available for genomic searches in public databases are plotted on a phylogenetic tree, with a topology adapted from van Megen et al. (2009) and Blaxter (2011). Species for which genome data are available without a published genome paper are indicated in orange. For these species, the dataset should be considered as possibly more incomplete or less reliable. The colors used to group the taxa are the same as those used in the later figures to describe the gene expansions. b Xenobiotic-metabolizing pathway, redrawn after Lindblom and Dodd (2006). Protein families are highlighted in purple. c Polyunsaturated and monomethyl branched-chain fatty acid synthesis, redrawn from Watts (2009). Fatty acid nomenclature: X:Yn-Z, fatty acid chain of X carbon atoms and Y methylene-interrupted cis double bonds; Z indicates the position of the terminal double bond relative to the methyl end of the molecule. Enzyme families that were analyzed are highlighted in purple. The conservation of ACC and FAS enzymes was not investigated here. LET-767 is a member of the SDR family (Fig. S2)

Here, we investigate homology relationships of multigene families using a manually curated dataset from several gene families that are potentially involved in the detoxification of xenobiotics (Fig. 1b) and in the synthesis of fatty acids (Fig. 1c). We chose these two pathways because they contain enzymes involved in the metabolism of two different kinds of substrates. Xenobiotic-metabolizing enzymes process extracellular substrates that might be highly species-specific depending on diet and exposure to ingested pathogens or environmental pollutants and nematicides (Lindblom and Dodd 2006). In contrast, polyunsaturated and branched-chain fatty acid synthesis is considered to be an evolutionarily conserved process, and unusually detailed knowledge of the functional specificity of enzymes is available (Watts 2009). Specifically, we compare cytochrome P450 (CYP), short-chain dehydrogenases/reductases (SDR), glutathione-S-transferases (GST), UDP-glucuronosyltransferases (UGT), ABC transporters, fatty acid desaturases (FAT), and fatty acid elongases (ELO). Our study has a special focus on the two genetic nematode models C. elegans and P. pacificus, which represent distant relatives of the same clade (clade V). C. elegans and P. pacificus are members of different families, the Rhabditidae and Diplogastridae, respectively, and molecular sequence data suggest that the two lineages diverged from their last common ancestor more than 200 million years ago (Dieterich et al. 2008). The two species differ in their ecology, with C. elegans being often found on rotting fruits (Kiontke et al. 2011), whereas P. pacificus lives in a tight association with scarab beetles (Herrmann et al. 2007, 2010). We find a near absence of one-to-one orthology relationships among nematode gene families. This finding is due to substantial lineage-specific expansions and losses, which can only be fully revealed in the context of manual curation of whole-genome sequencing data.

Materials and Methods

Data Collection

Protein sequences for C. elegans, C. briggsae, C. remanei, B. malayi, L. loa, A. suum, and T. spiralis were collected from BLAST searches in GenBank. Sequences for B. xylophilus, M. hapla, and S. ratti were collected from BLAST searches in Wormbase version WS232 (Yook et al. 2012). Sequences of P. pacificus are from the HYBRID1 proteomics gene models dataset that is available on the website http://pristionchus.org, and were refined when necessary using the new assembly dataset and sequences from the sister species Pristionchus exspectatus. Some individual experimental sequences from various nematodes without a completed genome were also incorporated when this helped to increase the phylogenetic resolution. Accession numbers for all sequences used are provided in Table S1, and a hundred manually corrected sequences are given in Dataset S1. Species names are abbreviated by the following suffixes in the figures: Aca, Ancylostoma caninum; Asu, A. suum; Ath, Arabidopsis thaliana; Bma, B. malayi; Bxy, B. xylophilus; Cbr, Caenorhabditis briggsae; Cel, C. elegans; Cre, Caenorhabditis remanei; Cte, Capitella teleta; Dim, D. immitis; Hco, Haemonchus contortus; Hgl, Heterodera glycines; Hsa, Homo sapiens; Hpo, Heligmosomoides polygyrus; Llo, L. loa; Mha, M. hapla; Min, M. incognita; Nam, Necator americanus; Ode, Oesophagostomum dentatum; Ovu, Onchocerca volvulus; Ppa, Pristionchus pacificus; Sce, Saccharomyces cerevisiae; Sra, Strongyloides ratti; Sst, Strongyloides stercoralis; Tbr, Trichinella britovi; Tps, Trichinella pseudospiralis; Tsp, T. spiralis; Wba, Wuchereria bancrofti.

Protein Domain Screening

Protein predictions of unexpected size were manually screened using PFAM (Punta et al. 2012) in order to remove other protein domains that are very likely to be the result of artifactual domain fusions, or to detect wrongly split proteins. Partial sequences that could not be unambiguously improved by manual fusion and sequences bearing domains below the significance level were kept in the dataset, considering that this will help in future studies.

Detection of Artifactual Domain Combinations and Removal of Contaminants

Before carrying out a detailed phylogenetic analysis, the collected sequences of abnormal length were individually scanned to check for the presence of PFAM domains other than those characterizing the protein family of interest. Protein domain recombination through exon shuffling is a well-characterized evolutionary process. Therefore, the phylogenetic distribution of such fusion proteins can provide a first way of detecting fusion artifacts. If a fusion appears in many closely related species, it seems reasonable to infer that it is a true novel arrangement. In contrast, isolated fusion domains are more likely due to assembly artifacts and were thus removed. Some protein domains were found to be present as single-gene predictions. In cases where it was reasonable to assume that isolated domains are complementary parts of the same protein, we manually fused them. In other cases where the grouping was not clear, the predicted domains were treated as partial sequences in the alignment. These cases are indicated by asterisks (*) in Table S1, and modified sequences are given in full in Dataset S1. Sequences that appeared highly divergent on preliminary phylogenetic trees and/or that showed a best BLAST hit to bacterial sequences, and thus could be reasonably interpreted as contaminants, were removed from the final dataset.
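The phylogenetic-distribution heuristic for fusion artifacts can be illustrated with a minimal sketch. It assumes that each unexpected extra domain has already been recorded as a (species, domain) pair; the function name, the domain labels, and the threshold of two species are hypothetical, and in the study itself this decision was made by manual inspection rather than by a script.

```python
from collections import defaultdict

def flag_isolated_fusions(observations, min_species=2):
    """Return extra domains seen in fewer than `min_species` species.

    observations: iterable of (species_code, extra_domain) pairs.
    min_species is an illustrative threshold, not a value from the study.
    """
    species_by_domain = defaultdict(set)
    for species, domain in observations:
        species_by_domain[domain].add(species)
    # Domains restricted to a single species are candidate assembly artifacts.
    return sorted(d for d, sp in species_by_domain.items() if len(sp) < min_species)

# Hypothetical observations: domain_X is fused in two Caenorhabditis species,
# domain_Y only in one P. pacificus prediction.
observations = [("Cel", "domain_X"), ("Cbr", "domain_X"), ("Ppa", "domain_Y")]
candidates = flag_isolated_fusions(observations)
```

Flagged domains would still need to be checked by eye, since a real lineage-specific fusion can also be restricted to a single sequenced species.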

Phylogenetic Analyses

Collected sequences were aligned using MUSCLE (Edgar 2004) and MAFFT (Katoh et al. 2005). The quality of both alignments was checked and refined using GUIDANCE (Penn et al. 2010), and the alignment with the best score was selected for further analysis. Best-fit amino acid substitution models were then evaluated using ProtTest (Abascal et al. 2005), and sorted under the Akaike Information Criterion (Akaike 1974). Except for the elongases, for which the VT model showed the best results (Müller and Vingron 2000), the best-fit model was always obtained using the LG substitution matrix (Le et al. 2008). Because the LG substitution matrix is not yet implemented in Bayesian tree reconstruction programs, phylogenetic analyses were carried out only in the maximum-likelihood framework, using PHYML (Guindon and Gascuel 2003) as implemented in Seaview (Gouy et al. 2010). Alternatively, the web interface (http://www.atgc-montpellier.fr/phyml/) was used if empirical estimates of equilibrium frequencies were required. Reliability of nodes was assessed by the likelihood-ratio test (Anisimova and Gascuel 2006).
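As an illustration of the model-ranking step, the Akaike Information Criterion is AIC = 2k − 2 lnL, where k is the number of free parameters and lnL is the maximized log-likelihood; the model with the lowest AIC is preferred. The sketch below is not the ProtTest implementation. LG and VT are the matrices named in the text, WAG is added only as a third example, and all log-likelihoods and parameter counts are invented placeholder values.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower values indicate a better fit."""
    return 2 * n_params - 2 * log_likelihood

def rank_models(fits):
    """Sort model names by increasing AIC.

    fits: dict mapping model name -> (log_likelihood, n_params).
    """
    return sorted(fits, key=lambda name: aic(*fits[name]))

# Placeholder values for illustration only (not results from the study).
example_fits = {
    "LG": (-10100.0, 120),
    "VT": (-10250.0, 120),
    "WAG": (-10310.0, 120),
}
ranking = rank_models(example_fits)
```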

Screen for Selective Constraints

For the complete sequence set presented in Fig. 3, as well as for the four subgroups indicated in the same figure, cDNA alignments were made using Seaview (Gouy et al. 2010), removing all gap-containing sites and ambiguous sites, detected with the help of GUIDANCE (Penn et al. 2010). Three sequences (GST-17, GST-18, and CBG06381) were removed from the dataset, based on their incompleteness (Dataset S1). A maximum-likelihood tree was estimated using PHYML (Guindon and Gascuel 2003) and rooted according to the position of the sequence groups in the whole GST tree. Subtrees corresponding to the four groups were directly pruned from that tree. The likelihoods of two pairs of models were compared, using the codeml program as implemented in the PAML4.4 package (Yang 2007). The null model M1a (nearly neutral) assumes two classes of sites, with either ω0 ≤ 1 (negative or purifying selection) or ω1 = 1 (neutral). The alternative model M2a (selection) adds a third class of sites with ω2 ≥ 1 (positive selection). The null model M7 assumes that ω ratios vary among codons, following a β distribution in the interval 0 < ω < 1. The alternative model M8 adds an extra class of sites under neutral evolution or positive selection with ωS ≥ 1 (Yang and Bielawski 2000). Differences in log-likelihoods (2ΔlnL) were compared to a χ² distribution with 1 degree of freedom and corrected for multiple testing (Anisimova and Yang 2007).
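The model comparison described above can be sketched as follows. The statistic 2ΔlnL = 2(lnL_alt − lnL_null) is compared to a χ² distribution with 1 degree of freedom, whose upper tail probability equals erfc(√(x/2)). The log-likelihood values would come from codeml output; none are given here. The Bonferroni correction in the sketch is an assumption used as a simple stand-in, because the text only states that the correction follows Anisimova and Yang (2007).

```python
import math

def lrt_statistic(lnl_null, lnl_alt):
    """Likelihood-ratio statistic 2*DeltalnL for nested models."""
    return 2.0 * (lnl_alt - lnl_null)

def chi2_sf_df1(x):
    """Upper tail probability of a chi-square distribution with 1 df."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))

def lrt_pvalue(lnl_null, lnl_alt):
    """P value of the test, e.g. M1a (null) against M2a (alternative)."""
    return chi2_sf_df1(lrt_statistic(lnl_null, lnl_alt))

def bonferroni(pvalues):
    """Assumed stand-in correction: multiply each P value by the number of tests."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]
```

For example, `lrt_pvalue(lnl_m1a, lnl_m2a)` would be called once per sequence group and model pair, and the resulting list passed to `bonferroni`; `lnl_m1a` and `lnl_m2a` are hypothetical variable names for the codeml log-likelihoods.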

Birth and Death Estimates and Automated Orthology Assignments

Estimates of birth and death were performed using BadiRate (Librado et al. 2012). A table of gene counts from the manually curated dataset, and an ultrametric tree corresponding to the current consensus on nematode phylogenetic relationships (Van Megen et al. 2009) were used as input. Turnover rates were estimated under the BDI model (Csurös and Miklós 2009), allowing different rates of turnover across the branches (FR choice in the –b model option), which seemed to be the biologically most relevant scenario after visual inspection of the phylogenetic trees. Ancestral states were calculated under the maximum-likelihood (ML) criterion. This combination of parameters defines the so-called BDI-FR-ML method. For comparative purposes, estimates were also carried out using two parsimony-based methods (BDI-FR-CSP and BDI-FR-CWP). Clusters of orthologs were calculated using version 2.0 of OrthoMCL (Li et al. 2003), using the HYBRID1 protein predictions available at http://pristionchus.org for P. pacificus and the protein predictions available in Wormbase WS241 for the eight other investigated species for which the genome sequence was already published. Predictions from C. remanei and S. ratti were not included, to avoid biases in gene and cluster counts due to the high number of contaminants or heterozygous sequences that were noticed during the manual analysis. Protein sequences were automatically retrieved by searching for Pfam domains using HMMER (Eddy 2011) and the pfam_scan script provided on the PFAM website (Punta et al. 2012), with an E value cutoff of 0.001. Clustering was done using the default values suggested in the OrthoMCL documentation for the BLAST E value (1e-05) and percentage identity cutoff (50 %). A total of 25,303 clusters with more than one gene were identified from the OrthoMCL output.
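Counting clusters with more than one gene can be sketched as below. The sketch assumes a groups file with one cluster per line in the form `group_id: taxon|gene taxon|gene ...`; the group identifiers, taxon codes, and gene names in the example are hypothetical, and the actual file produced in the study may differ in detail.

```python
def parse_groups(text):
    """Parse cluster lines of the assumed form 'group_id: taxon|gene ...'."""
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        group_id, members = line.split(":", 1)
        # Each member becomes a (taxon, gene) tuple.
        groups[group_id.strip()] = [tuple(m.split("|", 1)) for m in members.split()]
    return groups

def multi_gene_clusters(groups):
    """Keep only clusters that contain more than one gene."""
    return {gid: members for gid, members in groups.items() if len(members) > 1}

# Hypothetical toy input, not data from the study.
EXAMPLE = """
grp1: cel|geneA ppa|geneB
grp2: cel|geneC cel|geneD ppa|geneE
grp3: bxy|geneF
"""
groups = parse_groups(EXAMPLE)
n_multi = len(multi_gene_clusters(groups))
```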

Results

In the following, we describe the topology of seven different gene families in nematodes according to maximum-likelihood phylogenies. Our analysis is based on a manually curated protein set of nearly 2,000 proteins compiled from public databases containing whole-genome sequencing drafts of 11 species from seven genera in four distinct nematode clades (Fig. 1a). The database deadline for our analysis was February 2013. Data collection, protein domain screening, and phylogenetic analyses are described in detail in the Methods section, along with screens for positive selection and estimations of birth and death rates. This study compares the following families: GST (Figs. 2, 3, Fig. S1; Table 1), CYP, SDR and UGT (Fig. S2), ABC (Fig. 4, Fig. S3), FAT (Fig. 6), and ELO (Fig. S4), using citrate cycle enzyme genes as an external baseline in an automated control analysis, where all the manually analyzed families were also re-analyzed. A summary of all results is provided in Tables 2, S2, S3 and Figs. 7 and 8, indicating the general conservation level of gene families.

Fig. 2

Simplified view of the phylogenetic relationships among nematode GST genes. A compiled view of the general pattern of nematode GSTs, simplified from the maximum-likelihood trees that are given in full in Fig. S1. Lineage-specific expansions are colored according to the phylogenetic position of the nematode bearing the amplified sequences. For nodes that are discussed in the text, support values (likelihood-ratio tests) are indicated on the branches. Values above 0.97 are considered fully reliable

Fig. 3

Lineage-specific expansions within the genus level. An excerpt of the tree shown in Fig. S1 is presented to highlight the differences in duplication patterns among parts of the subtree between three species of the same genus. The proteins from C. elegans, C. briggsae, and C. remanei are each highlighted in their respective color. Likelihood-ratio test support values are plotted on nodes that are discussed in the text. Groups that were defined to screen for positive selection are also boxed (groups A to D), with GST-17, GST-18, and CBG06381 being removed because of their incompleteness. Likelihood values for the tested models are given in Table 1

Table 1 Summary of screen for positive selection constraints in the GST family

Fig. 4

Domain variation within the ABC family. a Classical subdivisions among ABC transporter families are presented, indicating the supplementary figures in which the phylogenetic relationships among the various structural subgroups were analyzed. Trivial names used for nematode subfamilies are indicated in brackets. b One example of domain variation among close relatives in the ABCC (MRP) subfamily

Table 2 Comparisons of xenobiotic- and PUFA-metabolizing genes between C. elegans and P. pacificus

The GST Family Shows Eighteen Lineage-Specific Gene Expansions

We first analyzed the GST family as a detailed example for investigating the principles of homology assignments in multigene families because (i) GST proteins have a conserved structure that is quite well understood, with crystallographic data for C. elegans and also for some parasitic nematode species (van Rossum et al. 2004; Perbandt et al. 2005; Asojo et al. 2007); (ii) it has a reasonable size of approximately 50 members in C. elegans, making it manageable to investigate the relationships between family members; and (iii) it has already been the subject of evolutionary studies that provided a basic topological and nomenclatural framework. Moreover, a previous study provided an explicit hypothesis about the origin of the family (Sheehan et al. 2001).

GSTs have a small size of about 200 amino acids and the family can be divided into several subclasses (Fig. 2). Classical GSTs are formed by a thioredoxin-like N-terminus consisting of two motifs (βαβ motif and ββα motif) separated by an α helix, and by a C-terminus consisting of a variable number of α helices (Ladner et al. 2004). One important exception is the GST kappa class, in which helices forming a domain functionally equivalent to the classical C-terminus are inserted between the βαβ and ββα motifs (Ladner et al. 2004). GST kappa and other GSTs have different bioinformatic signatures: GST kappa members correspond to the PFAM domain DSBA (PF01323), whereas other GSTs are identified by the PFAM domains GST_N (PF02798) or GST_N_3 (PF13417) for the N-terminus and GST_C (PF00043) for the C-terminus. We group the phylogeny of nematode GST genes in a simplified tree (Fig. 2) and provide high resolution data for the GST kappa members separately (Fig. S1). We excluded from the analysis the proteins bearing a GST_N_3 (PF13417) domain followed by a GST_C_2 (PF13410) domain because they are currently classified in a different family in nematodes based on crystallographic studies (Harrop et al. 2001; Dong et al. 2005; Littler et al. 2008).
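The domain-based sorting described in this paragraph can be summarized in a short sketch. It assumes that each protein is represented by its PFAM accessions listed from N- to C-terminus; the function name and category labels are hypothetical, and treating a protein with only one of the classical domains as "classical" is an assumption that mirrors the decision to keep partial sequences.

```python
KAPPA = {"PF01323"}                   # DSBA
N_TERMINAL = {"PF02798", "PF13417"}   # GST_N, GST_N_3
C_TERMINAL = {"PF00043"}              # GST_C
EXCLUDED_PAIR = ("PF13417", "PF13410")  # GST_N_3 followed by GST_C_2

def classify_gst(domains):
    """Classify a protein from its ordered list of PFAM accessions."""
    # Proteins with GST_N_3 directly followed by GST_C_2 are left out.
    for first, second in zip(domains, domains[1:]):
        if (first, second) == EXCLUDED_PAIR:
            return "excluded"
    found = set(domains)
    if found & KAPPA:
        return "kappa"
    if found & (N_TERMINAL | C_TERMINAL):
        return "classical"
    return "not_gst"
```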

The total number of GST genes varies widely, from only two in T. spiralis to 59 in C. elegans and P. pacificus. The slight discrepancies in gene numbers with previously published estimates (Kikuchi et al. 2011; Dieterich et al. 2008; Borchert et al. 2010) may be due to assembly and annotation errors as well as to different cutoffs in protein homology searches. According to current hypotheses about the evolution of GST genes, GST kappa is supposed to branch at the base of the family, and therefore, we rooted our summary tree accordingly (Fig. 2) (Sheehan et al. 2001). Moreover, the overall distribution of the nematode GST sequences fits with the previously published phylogenies of C. elegans GSTs (Perally et al. 2008). The two classes diverging first are GST omega and GST zeta, with GST pi representing another well-defined family. These three families are well supported, with likelihood-ratio test values of 1.00 for the zeta class, 0.99 for the pi class, and 0.94 for the omega class.

The majority of the remaining sequences cluster in a less defined group that we designate as sigma/nu. While the sigma class is present in all metazoans (Sheehan et al. 2001), the nu class was defined specifically for nematode sequences starting from a sequence from the parasitic nematode H. contortus (van Rossum et al. 2004). This group was later extended to some other nematode sequences (Perally et al. 2008). However, the distinction between the two classes is not clear and the nu class might partially be a subpart of the sigma class. Therefore, we here refer to the whole group as sigma/nu.

We highlighted as expansions all groups of sequences with three or more paralogs in one genus relative to a group of orthologous sequences or to a paralogous group in another genus (Fig. 2, Fig. S1). Clade III and Clade IV nematodes, as well as P. pacificus and rhabditids, all show at least one group of paralogs that cluster with a strong likelihood-ratio support (Fig. 2). In total, we observe 18 expansions, mainly in clade IV and clade V nematodes, but also one expansion in A. suum. Most importantly, the expansion patterns vary between the different classes of GSTs, as well as within and between organisms. For example, for clade IV nematodes, there is one small expansion in the zeta clade with three additional sequences. In the GST sigma/nu group, there are four clade IV-specific expansions with 3–33 members each. In contrast, clade IV sequences are totally absent from the omega and pi classes. Similarly, P. pacificus GSTs expand at a similar basal rate (3 paralogs) in the omega and zeta classes, whereas there is no duplication in the pi class. In rhabditid nematodes, there are expansions in all classes with two cases resulting in 27 and 53 members, respectively.
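The working definition of an expansion can be illustrated with a minimal sketch. It assumes that a candidate clade has already been delimited on the tree relative to its orthologous or paralogous counterpart in another genus, and that leaf labels end with an underscore and the species suffix (a hypothetical label format). The species-to-genus table uses the abbreviations listed in the Methods; the function names are illustrative only.

```python
GENUS = {
    "Cel": "Caenorhabditis", "Cbr": "Caenorhabditis", "Cre": "Caenorhabditis",
    "Ppa": "Pristionchus", "Bxy": "Bursaphelenchus", "Mha": "Meloidogyne",
    "Sra": "Strongyloides", "Asu": "Ascaris", "Bma": "Brugia",
    "Llo": "Loa", "Tsp": "Trichinella",
}

def genus_of(label):
    """Genus for a leaf label of the assumed form 'NAME_Suffix'."""
    return GENUS[label.rsplit("_", 1)[1]]

def expansion_genus(clade_labels, min_paralogs=3):
    """Return the genus if the clade is a single-genus group of >= 3 paralogs."""
    genera = {genus_of(label) for label in clade_labels}
    if len(genera) == 1 and len(clade_labels) >= min_paralogs:
        return next(iter(genera))
    return None
```

With this definition, a clade such as `["GST-21_Cel", "GST-22_Cel", "GST-34_Cel", "GST-35_Cel"]` would be reported as a Caenorhabditis expansion, whereas a clade mixing two genera would not.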

To indicate the tempo and mode of gene expansions and losses with a higher phylogenetic resolution, we studied the representation of GST genes in the genus Caenorhabditis, which has the highest coverage of whole-genome sequencing data. Figure 3 shows that the rhabditid-specific expansion in the GST sigma/nu class is still highly variable within the Caenorhabditis genus. Four proteins (GST-3, GST-4, GST-15, and GST-24) are encoded by unambiguously orthologous genes in C. elegans, C. briggsae, and C. remanei. However, in one case a gene that is present as a single ortholog in C. briggsae and C. remanei has four counterparts in C. elegans (GST-21, GST-22, GST-34, and GST-35). In another case, we find a group of six C. elegans-specific paralogs (GST-12, GST-14, GST-16, GST-17, GST-18, and GST-19) with a single ortholog in C. briggsae and C. remanei. Finally, a third group contains seven paralogs from C. elegans (GST-26, GST-27, GST-28, GST-29, GST-31, GST-32, and GST-37), two paralogs in C. briggsae, and one additional gene that could be a one-to-one ortholog between the three species, even if not supported by a high likelihood-ratio test value (GST-39). We searched for signatures of positive selection either at the branch or site level in the whole group and in four subgroups corresponding to more recently diverged sequences (groups A to D on Fig. 3), but found no significant differences between the likelihoods of models allowing for positive selection relative to those assuming neutral evolution or purifying selection (Table 1). Screens for positive selection were also carried out extensively in the GST family among 12 species from the Drosophila genus, where lineage-specific duplications also occur and GST numbers vary from 30 to 45 paralogs (Low et al. 2007). This screen identified a single gene showing some signal for positive selection, suggesting that the vast majority of gene duplications are neutral events.

Taken together, GST expansions occur even in closely related species of the same genus, a pattern similar to observations made in Drosophila gene families (Hahn et al. 2007). These expansions can occur to the extent that few one-to-one orthology relationships can be made. An extreme case is found between P. pacificus and C. elegans, which both have 59 GST genes in their genomes. While the total number of genes is identical in both species, only GST-11 is a one-to-one ortholog among all GST genes of both organisms.
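The notion of a one-to-one ortholog used here can be made explicit with a small sketch: an orthologous group counts as one-to-one between two species only if it contains exactly one gene from each of them. The group structure and the gene names below are hypothetical toy data; in the study, groups were delimited by manual inspection of the phylogenetic trees.

```python
def one_to_one_pairs(groups, species_a, species_b):
    """Return (gene_a, gene_b) pairs from groups with exactly one gene per species.

    groups: iterable of lists of (species_code, gene_name) tuples.
    """
    pairs = []
    for members in groups:
        genes_a = [gene for species, gene in members if species == species_a]
        genes_b = [gene for species, gene in members if species == species_b]
        if len(genes_a) == 1 and len(genes_b) == 1:
            pairs.append((genes_a[0], genes_b[0]))
    return pairs

# Hypothetical toy groups: only the first is one-to-one between Cel and Ppa.
toy_groups = [
    [("Cel", "geneA"), ("Ppa", "geneA_like")],
    [("Cel", "geneB1"), ("Cel", "geneB2"), ("Ppa", "geneB_like")],
    [("Cel", "geneC")],
]
pairs = one_to_one_pairs(toy_groups, "Cel", "Ppa")
```

Two species can therefore have identical total gene counts and still share very few one-to-one pairs, which is the situation described for the GST family.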

Tens of Lineage-Specific Gene Expansions in the CYP, SDR, and UGT Families

The CYP family consists of well-conserved proteins, some of which are involved in phase I xenobiotic metabolism and in the synthesis of various endogenous hormones (Fig. 1b) (Brown et al. 2008). They correspond to the single PFAM domain PF00067. It is already known that lineage-specific expansions have occurred in C. elegans (Nelson 1998), in vertebrates (Thomas 2007), and in various other metazoan groups (Markov et al. 2009). Based on previous knowledge, we rooted the nematode CYP tree using CYP51, which is conserved in almost all living organisms except for the Ecdysozoa, and CYP39A, a vertebrate paralog that branched basally in previous analyses and is also conserved in annelids (Markov et al. 2009). We show a total of 20 independent lineage-specific expansions in nematodes from clades IV and V (Fig. S2). The number of genes involved in such expansions is highly variable, ranging from only three paralogs for a Meloidogyne-specific group to 52 in the Caenorhabditis CYP34-CYP35 group (Fig. S2). Thirteen out of those 20 expansions are unambiguously supported by a likelihood-ratio test value higher than 0.97 at their base, and six of the remaining seven expansions encompass a subclade of at least three sequences also supported by a likelihood-ratio test value higher than 0.97, suggesting that these findings are overall robust. Besides these major amplifications, other nematode CYPs are not necessarily one-to-one orthologs because of many single duplications and losses.

The SDRs constitute a large family of NAD(P)(H)-dependent oxidoreductases with crucial functions in lipid, amino acid, carbohydrate, cofactor, hormone and phase I xenobiotic metabolism, as well as in redox sensing mechanisms (Kavanagh et al. 2008). Sequence identities are low within the SDR family, with many undefined boundaries. The current family classification is not based upon phylogeny, but rather relies on motif recognition (Persson et al. 2009). As an exhaustive study on SDRs from C. elegans is still lacking, we deciphered as a starting point the phylogenetic relationships among the classical nematode SDRs. Specifically, we took into account all sequences identified by the PFAM domains adh_short (PF00106) and adh_short_C2 (PF13561), because all genes that are proposed to be involved in xenobiotic response are from these groups (Menzel et al. 2007; Kisiela et al. 2011; Son et al. 2011). The family history is shaped by four expansion events within the Caenorhabditis genus, 11 expansions in Pristionchus, 17 expansions in clade IV, two in clade III, and even one small expansion in T. spiralis (Fig. S2). Of these 35 expansions, only 14 are supported by an LRT value higher than 0.97. However, this number increases to 23 when incorporating subparts of the weakly supported amplifications with internal branches that cover more than three sequences from the same species. This lower support reflects the limited level of sequence identity in the SDR family, which makes phylogenetic reconstructions difficult. In total, there are nine one-to-one orthologs between C. elegans and P. pacificus, with one orthologous pair showing differential domain composition.

Enzymes of the UGT family are involved in the addition of glucuronyl residues during phase II of xenobiotic metabolism (Fig. 1b). They are known to be highly diverse at the intra-phylum level. For example, out of a dataset of 310 insect proteins, only one was found to be conserved among holometabolous insects (Hahn et al. 2012). Our phylogenetic analysis identifies 19 amplification events within nematode members of this family (Fig. S2). Four small amplifications are found in the clade III nematode A. suum, seven in clade IV nematodes, and five in P. pacificus, including two major amplifications encompassing 34 and 54 sequences, respectively. Another three amplifications are observed in the Caenorhabditis genus, the biggest encompassing 154 sequences. Only one gene, coding for UGT-60, is a one-to-one ortholog between C. elegans and P. pacificus (Fig. S2). Taken together, the extensive lineage-specific gene expansions observed for CYP, SDR, and UGT families indicate that the observation originally made for GST genes is not an exception but more likely the rule.

Nine Small Expansions and Some Domain Losses in the ABC Transporter Family

ABC transporters are a highly modular family of proteins that are involved in pumping metabolites from the cytoplasm of the cell to other compartments or outside of the cell (Sheps et al. 2004). The domain number and order can vary among subfamilies (Sheps et al. 2004; Fig. 4a). We have therefore analyzed separately the four types of domain combinations that exist in nematodes (Fig. S3).

The first type of domain architecture of ABC transporters has a tandem repeat of an ATP-binding-cassette domain followed by a transmembrane domain, which is characteristic of transporters of the subfamilies ABCA, ABCB, and ABCC (Fig. 4a). The full-length ABCB, also called P-glycoproteins (PGP), and the ABCC, also known as multidrug-resistant proteins (MRP), are thought to be involved in drug excretion in C. elegans (Sheps et al. 2004). A more exhaustive study within the Caenorhabditis genus showed that these proteins are roughly conserved as one-to-one orthologs between C. elegans, C. briggsae, and C. remanei (Zhao et al. 2007). Consistently, we identified only a small number of moderate expansions relative to the previously analyzed families. We notice seven expansions, four of them involving three to four paralogs in the Caenorhabditis genus, in P. pacificus or in B. xylophilus (Fig. S3).

The second type of domain architecture of ABC transporters has a single ATP-binding cassette at the N-terminus followed by a transmembrane domain at the C-terminus. Such proteins need to dimerize to be fully functional, and this architecture is characteristic of the half-length (HAF) members of the ABCB family (Sheps et al. 2004) and of the ABCD members, also called peroxisomal membrane proteins (PMP) in nematodes (Zhao et al. 2007). The half-length ABCB includes some mitochondrial transporters, whereas the ABCD subfamily contains proteins involved in fatty acid transport into peroxisomes and vitamin B12 metabolism in lysosomes in vertebrates (Tarling et al. 2013). We find a level of conservation that is even higher than in the previous group, with seven of the 15 C. elegans proteins having a one-to-one ortholog in P. pacificus (Fig. S3). Only two expansions of six to eight paralogs occur specifically in B. xylophilus. Strikingly, the largest of these expansions occurs in the HAF-5 subfamily, which is otherwise conserved as a one-to-one ortholog in mammals, fruit fly (Zhao et al. 2007), and other nematodes (Fig. S3).

ABCE, ABCF-1, ABCF-2, and ABCF-3 are strictly conserved as one-to-one orthologs in all analyzed nematodes. The only variation is a partial ABCF-1 ortholog in T. spiralis, which is probably an incomplete sequence (Fig. S3). Although classified in the ABC transporter family due to the presence of two ATP-binding cassettes, ABCF proteins are not transmembrane transporters; rather, they are involved in protein synthesis processes that are conserved throughout metazoans (Dean and Annilo 2006).

The ABCG family, also known as white-related (WHT), and the ABCH family consist of those ABC transporters with the ATP-binding cassette at the N-terminus and the transmembrane domain in the C-terminal part of the protein (Fig. 4a). In general, the ABCG family shows high conservation. All sequences that are unambiguously placed are one-to-one orthologs of WHT-1, WHT-4, WHT-7, and WHT-8 (Fig. S3). In contrast, there are small amplifications of genes encoding WHT-2, WHT-3, and WHT-6 proteins, with two P. pacificus sequences, two B. xylophilus sequences, and one sequence each from M. hapla, S. ratti, and A. suum (Fig. S3). For the ABCH family, we found orthologs only in P. pacificus. Given the high level of sequence divergence in this group, it is possible that other nematode orthologs were missed in our database searches because of their low sequence similarity.

In summary, the level of conservation of the ABC transporters is much higher than in the gene families discussed above. However, the exact conservation patterns vary from one subfamily to another. An additional factor of diversification of ABC transporters is the presence of partial sequences, which could be interpreted either as incomplete protein predictions or as genuinely expressed pseudogenes, as already documented in vertebrates (Dean and Annilo 2006). In P. pacificus, one likely case of partial sequences is the pair of tandem proteins encoded by Contig4-snap.264 and Contig4-snap.265, two close paralogs of the C. elegans MRP-1 and MRP-2 (Fig. 4b, Fig. S3).

Pristionchus Expansions of Desaturases and Elongases

Desaturases (FAT) and elongases (ELO) are two gene families that are involved in the production of polyunsaturated C:20 fatty acids (PUFA) from a saturated C:16 precursor in C. elegans (Fig. 1c) (Watts 2009). Additionally, elongases also elongate branched-chain fatty acids (Fig. 1c) (Watts 2009). Desaturase activity is carried out by a single protein domain (PF00487), whose three-dimensional structure has not yet been resolved in animals because it is very recalcitrant to purification (Napier et al. 2003). The desaturase domain is sometimes combined in the same protein with a Cytochrome b5-like Heme/Steroid binding domain (PF00173), which contributes to the electron transfer during the formation of the double bond on the fatty acid chain (Dean and Annilo 2006), or with a sphingolipid delta(4)-desaturase domain (PF08557), which is associated with a sphingolipid delta(4)-desaturase activity in various eukaryotes (Ternes et al. 2002). We find that in nematodes the desaturase family divides into four principal clades, all of them well supported (Fig. 5). One clade contains FAT-1 and FAT-2, which are specifically duplicated in the Caenorhabditis genus. The ancestral gene leading to FAT-1 and FAT-2 has also been independently duplicated in P. pacificus and in B. xylophilus. The second clade contains the proteins bearing the additional sphingolipid delta(4)-desaturase domain, including the functionally uncharacterized C. elegans proteins F33D4.4 and Y54E5A.1, which are both conserved as one-to-one orthologs in P. pacificus. This is also the only clade to contain a sequence from T. spiralis. The third clade, containing the C. elegans FAT-3 and FAT-4, consists of all proteins with an extra Cytochrome b5-like Heme/Steroid binding domain at their N-terminus. Both genes are well conserved among nematodes, with a few exceptions in the form of a single duplication of FAT-4 in P. pacificus, a loss of both genes in A. suum, and a loss of FAT-4 in B. malayi. 
The fourth clade comprises the C. elegans FAT-5, FAT-6, and FAT-7 genes and their closer homologs. FAT-5 is unambiguously conserved within the three members of the Caenorhabditis genus. In contrast, FAT-6 and FAT-7 are specific to C. elegans and have one single paralog in C. remanei and two paralogs in C. briggsae. A single duplication occurs specifically in A. suum, whereas a lineage-specific amplification, leading to six paralogs, occurred in P. pacificus (Fig. 5).

Fig. 5

Expansion of the desaturase family in Pristionchus pacificus. Maximum-likelihood tree of the nematode desaturases, calculated under the LG matrix with an estimated number of invariant sites and a gamma law with four substitution categories, with empirically estimated substitution frequencies. Sequence names are colored according to the phylogenetic distribution of the species, as described in Fig. 1. Pristionchus-specific expansions that contain proteins encoded by at least three paralogous genes are highlighted in red

The elongase family, corresponding to proteins with the PFAM domain PF01151, shows a similar general pattern, with a moderate number of duplications and losses in various nematodes and two larger amplifications in P. pacificus (Fig. S4). The nine C. elegans elongases are unambiguously conserved across the Caenorhabditis genus, and five of them (ELO-2, ELO-3, ELO-5, ELO-6, and ELO-8) have one-to-one orthologs in at least one nematode from clade IV and one nematode from clade III. For ELO-4, an ambiguous pattern of multiple sequences is found for all nematodes outside of the Caenorhabditis genus. Similarly, for ELO-1 and ELO-7, the relationships among the sequences from different clades are unresolved. P. pacificus underwent an expansion to nine paralogs in the ELO-9 group, unambiguously from a single expansion event. In addition, P. pacificus has 14 paralogs in the ELO-1/ELO-7 group, resulting from one or two distinct amplification events. This high number of duplicates results in very divergent sequences and a partially unresolved phylogeny.

Together, the amplitude of gene duplications is smaller among genes from the PUFA synthesis pathway than among genes from the xenobiotic metabolism pathway. However, even this moderate level of variation means that only one of those genes is conserved as a one-to-one ortholog between the model nematodes C. elegans and P. pacificus (Fig. 6).
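To make the one-to-one criterion used throughout this section concrete, the following minimal Python sketch classifies the relationship between two species within a single well-supported clade of a gene tree, based only on how many genes each species contributes to that clade. It is an illustration and not the procedure used in this study: the function name `classify_clade`, the species labels `Cel` and `Ppa`, and the example clade are hypothetical.

```python
from collections import Counter


def classify_clade(leaf_species, sp_a, sp_b):
    """Classify the relationship between two species within one clade.

    leaf_species: iterable of species labels, one entry per gene in the clade.
    Returns one of: 'absent in one species', 'one-to-one',
    'one-to-many', 'many-to-many'.
    """
    counts = Counter(leaf_species)
    n_a, n_b = counts[sp_a], counts[sp_b]
    if n_a == 0 or n_b == 0:
        return "absent in one species"
    if n_a == 1 and n_b == 1:
        return "one-to-one"
    if n_a == 1 or n_b == 1:
        return "one-to-many"
    return "many-to-many"


# Hypothetical clade: one C. elegans gene and six P. pacificus paralogs,
# the kind of pattern produced by a lineage-specific amplification.
example_clade = ["Cel"] + ["Ppa"] * 6
print(classify_clade(example_clade, "Cel", "Ppa"))
```

A count-based rule of this kind is only meaningful once the clade itself is well resolved; where the phylogeny is unresolved, as for the ELO-1/ELO-7 group, no such assignment can be made.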

Fig. 6

Lack of conservation of PUFA-synthesizing enzymes between C. elegans and P. pacificus. Synthesis of polyunsaturated and branched-chain fatty acids in C. elegans, after Watts 2009. Enzymes in red are those coded by genes for which there is no one-to-one ortholog in P. pacificus. The only conserved enzyme in both model nematodes, FAT-3, is indicated in green. The conservation of ACC and FAS was not investigated here

High Variation Both in Absolute Gene Numbers and in Relative Proportions in Seven Families

To obtain a more complete overview of the general dynamics of gene birth and death rates, we reconstructed the ancestral gene set profiles at all nodes of the nematode tree among the eleven fully sequenced species (Fig. 7; Table S3). Variation occurs among nematode genomes both in terms of total gene numbers and in terms of relative proportions among the different gene families. Specifically, there is a global gene increase in all clade V nematodes, up to the intragenus level in Caenorhabditis, with a stronger overall increase in the branch leading to P. pacificus. In contrast, there is a global gene decrease in the branches leading to clade III and clade IV nematodes, and also in the branch leading to the clade I nematode T. spiralis. The global birth rates (β) were estimated to vary from almost 0 up to 1.65 births per gene per million years, the maximum being in the branch between B. xylophilus and its last common ancestor with S. ratti, whereas the global death rates (δ) were estimated to vary between 0 and 0.32 deaths per gene per million years. These values are much higher than the rates reported in the literature for eukaryotes, which range from 0.0006 to 0.0193 (Demuth and Hahn 2009). Maximum-likelihood estimations were recently shown to be prone to overestimation of turnover rates (Almeida et al. 2014). Therefore, we also estimated those rates under the two parsimony methods implemented in BadiRate (BDI-FR-CSP and BDI-FR-CWP, Librado et al. 2012) and found that the rates are even higher, ranging up to 2.2 births in the CWP method and up to 4.24 in the CSP method. We hypothesize that these extremely high estimates reflect both the biological and technical peculiarities of our dataset. Indeed, all species from the two most basal clades (clade I and clade III) are parasites that are known to have lost many genes, and the same holds at the base of clade IV. In addition, except for T. spiralis, which has a high-quality assembly (N50 = 6 Mb), the other parasites also have genome assemblies of lower quality than those of the free-living nematodes and B. xylophilus (Rödelsperger et al. 2013). Thus, the differing quality of the genomes may contribute to an artefactual increase of the gene turnover rate estimates, and we therefore believe that further careful calibration studies are necessary before the reported rates can be compared with those outside nematodes.
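To give a sense of scale for these rates, the following minimal Python sketch uses a simple linear birth-death model, in which the expected family size after time t is N0·exp((β − δ)·t). This is a textbook simplification introduced here for illustration only: it ignores the innovation term of the BDI models and is not a reimplementation of BadiRate. The function names and the starting family size are hypothetical; the rates are the extreme values quoted above.

```python
import math


def expected_family_size(n0, birth, death, t_myr):
    """Expected gene count after t_myr million years under a linear
    birth-death process with per-gene rates `birth` and `death`
    (events per gene per million years)."""
    return n0 * math.exp((birth - death) * t_myr)


def doubling_time(birth, death):
    """Time (million years) for the expected gene count to double.
    Only defined when the net rate (birth - death) is positive."""
    net = birth - death
    if net <= 0:
        raise ValueError("expected family size does not grow when birth <= death")
    return math.log(2) / net


# Hypothetical family of 10 genes followed for one million years.
for label, birth in [("upper literature rate", 0.0193), ("highest ML estimate", 1.65)]:
    size = expected_family_size(10, birth, 0.0, 1.0)
    print(f"{label}: beta = {birth}, expected size after 1 Myr = {size:.2f}")
```

Under this simplified model, a net rate of 1.65 per gene per million years would double a gene family in about 0.42 million years, which illustrates why such estimates call for calibration before being compared with rates from other taxa.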

Fig. 7

Evolutionary dynamics of gene families in nematode genomes. Estimation of ancestral proportions of the investigated gene families, with estimates of global birth (+) and death (−) rates among all branches under the BDI-FR-ML method. Pie-chart sizes are proportional to the overall gene number in the investigated families. The input gene number table is given as Table S2

In addition to the limitations discussed above concerning the absolute turnover rates, the relative proportions of the seven analyzed gene families also vary. Starting from a common ancestor in which all proportions are almost equal, the families enriched in xenobiotic-metabolizing genes increase faster than the PUFA-metabolizing genes. PUFA-metabolizing genes represent 26 % of the estimated gene set in the common ancestor of the 11 nematodes, but drop to between 4.8 and 19.4 % of the gene numbers counted in the genomes of the 11 investigated species. For the xenobiotic-metabolizing genes, the trends are different. For example, CYPs represent 27 % of the total gene number in P. pacificus, but only 4 % in T. spiralis and between 3 and 8 % in clade III nematodes. Even more strikingly, the proportion of UGT genes varies from 2 % in B. malayi up to 40 % of genes in M. hapla. Thus, there is no sign of an obvious correlation in the detailed pattern of gene expansion, even among families that share functional similarities in terms of biochemical activity.

Higher Variation in Gene Numbers for Xenobiotic-Metabolizing Families Relative to PUFA Synthesis or Citrate Cycle Enzymes

To place our findings in a broader context and to gauge the expected baseline of gene conservation, we compared the variation in genes and clusters of orthologs for the seven studied families to the variation levels of those families containing genes coding for the enzymes of the citrate cycle (Fig. 8; Table S4). We chose this pathway because it is a central metabolic pathway that is considered to be universal, and also because previous data already suggested that it is strictly conserved in terms of one-to-one orthologs between the models C. elegans and P. pacificus (Sinha et al. 2012). For this, we made an automated estimate of gene number both for our families of interest and for the enzymes from the citrate cycle, using a clustering approach by OrthoMCL (see “Materials and Methods” section for details). We measured variation as the difference between the maximal and minimal number of genes or clusters in a given family. The observed variation for the ABCE domains, which we know to be strictly conserved from our manual analysis, indicates that the estimates given by OrthoMCL contain a certain amount of error. Similarly, the differences relative to the maximum PFAM domain counts are much higher for the CYP, UGT, and ELO families; these are mainly due to artefactual domain fusions (listed in additional dataset S1) that cause the program to aggregate genes from other families into the clusters. For citrate cycle and PUFA synthesis genes, the variation ranges between 0 and 26 genes or clusters, whereas the numbers rise up to 38 clusters or 160 genes in some of the xenobiotic-metabolizing families, excluding the artefactual overestimates discussed above. Note that the separation between both categories is not straightforward, because some genes from the ABCD and ABCE families are not involved in xenobiotic metabolism. Taken together, the gene numbers from the xenobiotic metabolism pathway are more variable than those from the citrate cycle or PUFA synthesis pathway. 
But even the comparatively lower level of variation observed in the gene families coding for PUFA synthesis enzymes is sufficient to give an almost complete lack of one-to-one orthologs between the two model nematodes C. elegans and P. pacificus (Fig. 6).
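The variation measure defined above, the difference between the maximal and minimal number of genes or clusters in a given family, can be sketched in a few lines of Python. The function name `variation` and the per-species counts are hypothetical placeholders chosen for illustration; they are not values from Table S4.

```python
def variation(counts_by_species):
    """Variation of one family: maximal minus minimal count
    (genes or clusters) across the compared species."""
    values = list(counts_by_species.values())
    return max(values) - min(values)


# Hypothetical per-species gene counts for two families.
strictly_conserved = {"sp1": 1, "sp2": 1, "sp3": 1}
expanding_family = {"sp1": 12, "sp2": 40, "sp3": 172}

print(variation(strictly_conserved))  # 0: same count in every species
print(variation(expanding_family))    # 160: 172 - 12
```

Because the measure depends only on the two extreme species, a single inflated count (for example from an artefactual domain fusion) is enough to raise it, which is why the overestimates discussed above are excluded from the comparison.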

Fig. 8

Variation in xenobiotic and PUFA-metabolizing machinery in comparison to citrate cycle. Variation in clusters of orthologs (dark colors, lower bars) or gene counts (intermediate colors, middle bars) calculated by OrthoMCL or direct counts for PFAM domains (light colors, upper bars) corresponding to the investigated pathways. Detailed counts for each domain are provided in Table S3

Discussion

In this study, we investigated the homology relationships in seven gene families comprising some of the main enzymes involved in the detoxification of xenobiotics and in the synthesis of polyunsaturated and branched-chain fatty acids in nematodes. The analysis of the genomes of 11 species of seven nematode genera revealed that only a small proportion of the genes involved in these pathways are one-to-one orthologs. For example, the phylogenetic relationships of the 349 C. elegans and 528 P. pacificus genes analyzed in this study showed that only 11.17 % of the C. elegans genes and 7.39 % of the P. pacificus genes are conserved one-to-one orthologs. The analysis of the genes encoding GST enzymes is a particular case in point. The apparently constant number of 59 genes in C. elegans and P. pacificus first pointed toward a high level of conservation and orthology. However, the presence of only one of these genes as a true one-to-one ortholog between both species indicates that orthology assignments require detailed phylogenetic analyses relying on manually curated genome-wide datasets.
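The two percentages quoted above can be checked with a short calculation. The Python sketch below assumes a count of 39 one-to-one ortholog pairs; this number is not stated in the text but is the only integer consistent with both percentages, so it should be read as an inference. The function name `percent_one_to_one` is hypothetical.

```python
def percent_one_to_one(n_one_to_one, n_genes):
    """Share of a species' analyzed genes that have a one-to-one
    ortholog in the other species, in percent (two decimals)."""
    return round(100 * n_one_to_one / n_genes, 2)


# Assumption: 39 one-to-one pairs, inferred from the reported percentages.
N_PAIRS = 39

print(percent_one_to_one(N_PAIRS, 349))  # C. elegans: 349 genes analyzed
print(percent_one_to_one(N_PAIRS, 528))  # P. pacificus: 528 genes analyzed
```

The same set of ortholog pairs gives a lower percentage in P. pacificus simply because its analyzed gene set is larger, so the two figures describe one observation from two denominators.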

While every single data point (gene or gene family) is of course subject to change because of the potential incompleteness of the genome sequencing data, the overwhelming number of lineage-specific expansions observed in nematodes makes it likely that the overall phenomenon is robust. Our sampling dataset is enriched in genes that duplicate at a higher rate than average. Indeed, previous comparisons based on whole-genome sequence data of C. elegans and P. pacificus indicated that 23.5 % of the approximately 20,000 C. elegans genes and 16.1 % of the originally predicted 29,201 P. pacificus genes are one-to-one orthologs (Dieterich et al. 2008) and, consistent with this, we show that xenobiotic-metabolizing genes vary in number at a higher rate than genes encoding core metabolic enzymes of the citrate cycle (Fig. 8). However, these genome-wide comparisons also point toward a limited to moderate orthology relationship of genes in nematodes. It should be stated that a complete analysis of the orthology relationships of nematode genes in a genome-wide context awaits future analysis. Such studies are only possible using a systematic approach involving more species, as indicated here for the detoxification and fatty acid synthesis enzymes. One should also note that we did not check specifically for gene conversion events. If such events occurred, as already described for some CYPs in Helicoverpa moths (Li et al. 2002), this would further substantiate our point that nematode genomes have quite distinct evolutionary histories, because for those genes, orthology would not even be a valid possibility.

Lineage-Specific Divergence and Functional Specificity

The most important finding and overall conclusion of our study is the existence of massive duplication events, lineage-specific amplifications, and gene losses in nematodes, with variation in birth and death rates from one family to another even among enzymes belonging to the same pathway. These results have two major evolutionary implications. First, a major evolutionary challenge resulting from our observations is the relationship between lineage-specific amplifications and functional specificity. It is generally assumed that orthologous genes are functionally more conserved than paralogous genes (Tatusov et al. 1997), although this is still a matter of intense debate (Dessimoz et al. 2012; Gabaldón and Koonin 2013). Orthology of genes encoding enzymes in a given pathway is generally considered a reasonable basis to assess functional conservation of molecular mechanisms between species. In contrast to this assumption, the massive lineage-specific gene amplifications and the resulting limitations of one-to-one orthology relationships strongly suggest that the functional specificity of many enzyme-encoding genes might have changed during the course of nematode evolution.

While functional studies are generally sparse, some case studies suggest that genes encoding enzymes with metabolic activity have an unusual amount of convergence, promiscuity, and redundancy. For example, the short-chain reductase LET-767 does not belong to the elongase family but has elongase activity, elongating fatty acids with a linear carbon chain (Watts 2009) (Fig. 6). Similarly, redundancy among enzymes seems to represent a widespread phenomenon; for example, the distantly related ELO-1 and ELO-2 are both able to elongate an 18:3(n-6) fatty acid into a 20:3(n-6) fatty acid and an 18:4(n-3) fatty acid into a 20:4(n-3) fatty acid (Watts 2009). Another example is GST-10 from the GST pi group and its distantly related paralogs GST-6 and GST-8 from the sigma group, which are all able to conjugate 4-HNE, a product of lipid peroxidation (Ayyadevara et al. 2007). Finally, it is well known that some enzymes are promiscuous, being able to process different substrates. In PUFA metabolism, ELO-1 is known to elongate 16:0, 18:3(n-6), and 18:4(n-3) fatty acids, and FAT-1 performs an omega-3 desaturation on 18:2(n-6), 20:3(n-6), and 20:4(n-6) fatty acids (Watts 2009). Xenobiotic-metabolizing enzymes are even more versatile, with C. elegans F54F3.4 (SDR25C22) metabolizing isatin and 4-oxonon-2-enal (Kisiela et al. 2011), and DHS-21 converting L-xylulose into xylitol and reducing various dicarbonyl compounds (Son et al. 2011). In the filarial worm O. volvulus, the conserved GST-11 conjugates various xenobiotics and additionally acts as a prostaglandin D2 synthase, thus producing a molecule that fools the immune system of its human host (Sommer et al. 2003). Taken together, functional convergence, promiscuity, and redundancy might relax the constraints on the evolution of genes in these multigene families, resulting in the pattern of massive lineage-specific amplifications observed in this study. 
The release of functional constraints might differ between gene families and could therefore account for the observed differences in gene amplifications found in distinct gene families. The absence of signatures for positive selection in GSTs duplicated specifically within the Caenorhabditis genus, together with similar results on xenobiotic-metabolizing ABCs (Zhao et al. 2007), suggests that variation in the number of genes coding for catalytically redundant enzymes could also contribute to the modulation of metabolic phenotypes.

A related implication of the absence of one-to-one orthology relationships is the resulting uncertainty in assigning the existing enzymes to specific metabolic activities. For example, genes encoding candidate enzymes for the synthesis of polyunsaturated fatty acids are currently uncharacterized biochemically in P. pacificus. However, there is indirect evidence for their activity based on the identification of ascaroside pheromones (Bose et al. 2012), which are thought to derive from polyunsaturated fatty acids (von Reuss et al. 2012). Thus, the almost complete lack of one-to-one orthologs in P. pacificus for the enzymes involved in the biosynthesis pathway from C:16 to C:20 PUFAs in C. elegans suggests that, in P. pacificus, those molecules are synthesized by enzymes encoded by genes that are paralogous to the C. elegans ones.

Lineage-Specific Gene Expansions and Homology in Metabolic Pathways

The second major implication of our survey is in the context of the homology concept. In the pre-genome era, a proper distinction between orthology and paralogy of individual homologous genes was nearly impossible. Genome-wide studies, such as the one described here, clearly indicate that in certain gene families one-to-one orthology is the exception rather than the rule. Studies in insects come to the same overall conclusion, indicating that at least in these species-rich phyla massive gene amplifications are common (Sánchez-Gracia et al. 2009, Fang et al. 2009, Bass and Field 2011). Both nematodes and insects are special among sequenced eukaryotes because there is no documented whole-genome duplication event, such as those found in vertebrates, land plants, and ciliates, which are proposed to be major drivers of evolutionary diversification (Van de Peer et al. 2009). Previous genomic data in nematodes already led us to propose that in those organisms, horizontal gene transfers and gene amplification could produce similar levels of genomic diversity in the absence of whole-genome duplication (Markov and Sommer 2012). However, it remained unknown whether those amplifications occurred before the radiation of some nematode species, as could be suggested by the similar numbers of GSTs in C. elegans and P. pacificus, or whether they occurred preferentially in a species-specific way. Our dataset indicates that species-specific, or at least genus-specific, amplifications are an important component of the duplication pattern. More generally, there is a great variety in duplication patterns. Specifically, even the comparatively smaller-scale amplification events, such as those we observed in PUFA-synthesizing enzymes, are important to consider when addressing the question of functional conservation of metabolic pathways. 
Some authors have argued that there is no reason to distinguish between one-to-one orthologs and recently diverged paralogs as far as gene/protein function is concerned (Koonin 2005). This is also the main principle behind cluster searches in OrthoMCL (Li et al. 2003). However, this assumption relies mainly on data from bacteria, where genomes experience evolutionary forces that differ strongly from those of eukaryotes (Lynch et al. 2011). Moreover, such concepts also depend on the level of resolution in the definition of gene or protein “function.” For example, in the PUFA synthesis pathway, all desaturases and elongases could be annotated as members of this pathway, and indeed such an assumption would be a reasonable inference to start with. However, when searching for candidates for a particular enzymatic reaction in nematodes, the available knowledge of enzymatic activities in C. elegans reveals the lack of important one-to-one orthologs, resulting in the absence of candidate genes for particular reactions.

The evolutionary patterns observed in this study contrast strongly with the observed evolution of many other genes and gene families, for example, developmental control genes. During the last two decades, the comparative analysis of developmental processes revealed a surprising and unexpected conservation of genes controlling development. Throughout the animal kingdom, transcription factors and signaling pathways are conserved (Carroll 2005; Gerhard and Kirschner 1997; Wilkins 2002; Pires daSilva and Sommer 2003). Interestingly, the comparison of species within one phylum often reveals that genes encoding key factors in signaling pathways show one-to-one orthology relationships, for example, genes encoding Hedgehog, Wnt, EDA, and other ligands, but also their receptors and downstream cytoplasmic adaptors and kinases (Keys et al. 1999; Zheng et al. 2005; Pantalacci et al. 2008).

This discrepancy strongly suggests that different evolutionary forces shape the evolution of genes and genomes. The widespread existence of one-to-one orthologs in signaling pathways and developmental control genes can best be explained by purifying selection that constrains duplication and amplification events in these genes, and the same could be true for core essential metabolic pathways such as the citrate cycle. In contrast, such constraints seem to be released for members of other gene families. At least in part, this release of constraints might result from the promiscuity and redundancy discussed above. In addition, it might result from the influence of the environment on organisms and their genomes. We hypothesize that duplications of enzyme-encoding genes might be initially tolerated because the activity of the individual gene and protein is not crucial for the survival of the individual. Depending on changes of the environment, the resulting variance in metabolic activity might become adaptive, resulting in rapid changes in the genomic composition of the organism. While such hypotheses await experimental validation, they can explain the different patterns of evolution seen in different gene classes.

In conclusion, our systematic approach to homology assignments in gene families strongly indicates that, in the era of whole-genome sequencing data, the comparison of multiple related species in a defined phylogenetic context allows precise evaluations of the history of individual genes. Such studies can be of tremendous value for experimentalists. Given the still growing rate of whole-genome sequencing projects, the refinement of rough bioinformatic studies by detailed phylogenetic analyses on manually curated protein family datasets will be of great importance in the near future and is likely to provide many surprising findings and re-assignments of homology relationships.