Introduction

Metabolism is one of the primary biological processes that underlie the survival of an organism within a given environment, owing to its fundamental role in biomass synthesis and energy generation. Even though individual metabolic enzymes are highly conserved across species, adaptation to diverse environments brings about innovations in metabolic pathway function (Szappanos et al. 2016). Numerous features, such as horizontal gene transfer, gene expression, gene dispensability, gene duplications, and metabolic network structure, are responsible for changes in metabolic function (Yamada and Bork 2009; Papp et al. 2011). The dominance of one feature over another largely depends on the nature of the environment, the variations within it, and the effective contribution of that feature towards successful adaptation to that particular environment. In general, a change in metabolic function caused by a change in a feature can either be selected in a population for its usefulness in adaptation or be purged if it is deleterious. This change in function is reflected within the coding sequence of a gene and is conventionally measured as the number of non-synonymous substitutions per non-synonymous site relative to the number of synonymous substitutions per synonymous site, commonly referred to as the evolutionary rate (Yang 1998). However, the potential determinants of the evolutionary rates of metabolic enzymes within an organism remain an open problem.

Members of the Leishmania genus cause the widespread neglected tropical disease leishmaniasis in humans. Biologically, the Leishmania parasite exhibits a digenetic lifecycle, where the promastigote stages thrive within the midgut of the sandfly vector, and the amastigotes persist in the macrophage phagolysosome of the human host; the environments being largely antagonistic with respect to pH, temperature, and availability of carbon sources (Zilberstein and Shapira 1994; McConville and Naderer 2011). To ensure maximal survival, the parasites need to selectively adapt to these dual environmental constraints. This controlled biological setup provides us with a unique platform for investigating the contributory role of different genotypic and phenotypic factors in metabolic enzyme evolution.

Numerous genotype and phenotype factors are known to contribute to evolutionary rate variation in eukaryotes. The factors that are known to have a probable effect on protein evolution largely fall into two categories, namely, translation selection and functional constraint. Translation selection refers to the evolutionary selection of features that can increase the efficiency of translation, whereas the functional constraint of an enzyme refers to the degree to which random mutations are removed from the population by natural selection so as to avoid their deleterious effect on protein function (Zhang and Yang 2015). With respect to features explaining translation selection, gene expression, mRNA transcript length (or length of a coding sequence), and codon usage were demonstrated to be important factors that explain the evolution of protein-coding genes in yeast and Arabidopsis (Kawaguchi and Bailey-Serres 2005; Drummond et al. 2006; Zhang and Yang 2015). With respect to features explaining functional constraint, pleiotropy of a gene due to multiple functional domains, involvement of enzymes in multiple biological processes, and multiple gene duplications can contribute to enzyme evolution, thereby lending dynamism to the metabolic network structure (Salathé et al. 2005; Warringer and Blomberg 2006; Chu et al. 2014; Chesmore et al. 2016). Another, less studied functional constraint that affects the evolution of a metabolic enzyme is the role of that enzyme in the context of other enzymes within a metabolic network. Metabolic function results from the stepwise transformation and utilization of different environmental metabolites through multiple pathways and is therefore not the effect of a single enzyme. Hence, more central proteins within a metabolic network are also resistant to functional change (Vitkup et al. 2006). Previous studies in yeast and human erythrocytes have also demonstrated that enzymes bearing higher metabolic flux tend to evolve slowly (Vitkup et al. 2006; Colombo et al. 2014). It was also demonstrated that co-regulation of metabolic genes is largely explained by flux-coupling within a metabolic network (Notebaart et al. 2008), suggesting that flux-coupling is an important factor constraining metabolic function and hence enzyme evolution.

Similar to other organisms, a few studies in Leishmania species also provide indirect hints towards the roles of translation selection and functional constraints on metabolic enzyme evolution. Stage-specific transcriptomics and proteomics studies identify variations in transcriptome and proteome abundances of metabolic genes across stages and species in Leishmania (Lahav et al. 2011; Nirujogi et al. 2014). Also, mutation pressure and translation selection are shown to preserve codons within genes which possess a high GC bias at the synonymous position and avoid the formation of mRNA secondary structures at the 5′ end of the mRNA; thereby indicating probable modes of translation regulation within genes (Subramanian and Sarkar 2015). Chromosomal aneuploidy is another well-known mechanism that causes variations in gene copy numbers across Leishmania species (Mannaert et al. 2012). Recent computational predictions of metabolic flux for different input metabolites and targeted 13C-based metabolomics studies have identified that the Leishmania metabolome adapts to changing host environments through common metabolic routes, which are largely constrained by the inherent metabolic organization (Saunders et al. 2014; Subramanian and Sarkar 2017). The inherent metabolic organization also constrains enzyme evolution in L. major metabolism (Subramanian and Sarkar 2016).

The aforementioned studies in Leishmania have largely explored the genotype and phenotype complements of metabolism independently. The combined effects of these features on the disparate forces of conservation and divergence in enzyme evolution are yet to be tested. To establish their effects on evolutionary rates among metabolic enzymes, a comprehensive comparative strategy that can examine the relative effects of the different genotype and phenotype features simultaneously is required. In this study, we estimate the rate of non-synonymous substitutions per non-synonymous site (dN), rate of synonymous substitutions per synonymous site (dS), and their ratio (ω = dN/dS) and for the first time, identify the potential determinants of dN, dS, and ω among orthologous singleton metabolic genes in three Leishmania species (Leishmania major, Leishmania donovani, and Leishmania infantum) using a principal component regression (PCR)-based analysis (Drummond et al. 2006; Jovelin and Phillips 2009; Alvarez-Ponce and Fares 2012; Alvarez-Ponce et al. 2017). Although it is possible to use these features for assignment of genes to the three Leishmania species using Bayesian classifiers and other techniques (Wang et al. 2007), the above regression-based analysis appropriately suits our objective of discerning the relationships of the genotype and phenotype features to evolutionary rates of metabolic genes and their comparisons across the three Leishmania species. We introduce the flux-coupling potential of an enzyme within a metabolic network (Subramanian and Sarkar 2016), as a potential feature for regression along with other available features for Leishmania metabolism. 
Despite the unavailability of a broad range of confounding cellular factors that influence both codon usage and protein evolutionary rates (for example, UTR length, recombination rate, gene essentiality, and protein–protein interaction features) for Leishmania species, the results provided in this article highlight the significant contribution of codon usage, multi-functionality, gene duplications, and flux-coupling constraints as novel mechanisms underlying evolutionary divergence and conservation in Leishmania metabolic genes. Comparisons of gene clusters across the three species demonstrate that the same gene can be constrained by different features and hence that a unique set of species-specific genes governed by multiple features can occur across species. The targetable mechanisms and genes identified in this study can be further pursued for designing novel strategies against parasite persistence.

Materials and Methods

Potential Determinants of Metabolic Enzyme Evolutionary Rates

In this study, a total of eight features representing the genotype and phenotype characteristics of the Leishmania parasite were computed. Leishmania species with known, curated metabolic reconstructions, namely, the L. major strain Friedlin reconstruction comprising 560 metabolic genes, the L. donovani BPK282A1 reconstruction comprising 604 metabolic genes, and the L. infantum JPCM5 reconstruction comprising 556 genes, were used for multivariate analysis (Chavali et al. 2008; Sharma et al. 2017; Subramanian and Sarkar 2017).

Genomic Features

The coding nucleotide sequences (CDS) of the metabolic genes curated within each metabolic reconstruction, obtained from the TriTrypDB database, v.8.1, release 32 (Aslett et al. 2010) were used for calculation of codon adaptation index (CAI), GC content, and the gene length. CAI values for each gene were computed using the EMBOSS package (Rice et al. 2000), with respect to a reference set of ribosomal protein-coding genes in each species (Subramanian and Sarkar 2015). The length and GC content for each CDS were computed using an in-house PERL script (Sect. 8A of Supplementary Text S1).
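
The genomic features described above can be illustrated with a minimal Python sketch. It is not the in-house PERL script or the EMBOSS implementation; the function names are hypothetical, the standard genetic code is assumed, and codons absent from the reference set are simply skipped, which is a simplification.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (assumption).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(cds):
    """Split a coding sequence into complete codons."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def gc_content(cds):
    """Fraction of G and C bases in the sequence."""
    cds = cds.upper()
    return (cds.count("G") + cds.count("C")) / len(cds)


def relative_adaptiveness(reference_cds):
    """Weight of each codon relative to the most frequent synonymous codon
    in a reference set (for example, ribosomal protein-coding genes)."""
    counts = Counter(c for cds in reference_cds for c in codons(cds))
    weights = {}
    for aa in set(CODON_TABLE.values()):
        synonymous = [c for c, a in CODON_TABLE.items() if a == aa]
        top = max(counts[c] for c in synonymous)
        for c in synonymous:
            if top > 0 and counts[c] > 0:
                weights[c] = counts[c] / top
    return weights


def cai(cds, weights):
    """Geometric mean of codon weights; Met, Trp, and stop codons are
    excluded, and codons without a reference weight are skipped."""
    skip = {"ATG", "TGG", "TAA", "TAG", "TGA"}
    logs = [math.log(weights[c]) for c in codons(cds)
            if c not in skip and c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")
```

Gene length in this sketch is simply `len(cds)` in nucleotides or `len(codons(cds))` in codons.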

Gene Expression

Pre-calculated fragments per kilobase of transcript per million mapped reads (FPKM) values were obtained for L. major promastigotes from an independent RNA sequencing study (Rastrojo et al. 2013). To maintain consistency, the total number of reads mapped onto each gene, reported in the Gene Expression Omnibus database for L. donovani (GEO ID: GSE48475) and L. infantum (GSE48394), was used to calculate reads per kilobase of transcript per million mapped reads (RPKM) values for each gene (Martin et al. 2014; Zhang et al. 2014). FPKM and RPKM are treated as synonymous in this article. Further details are provided in Sect. 7A of Supplementary Text S1.
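
As a hedged illustration of how RPKM can be derived from per-gene read counts, the sketch below applies the usual definition (reads mapped to a gene, scaled by gene length in kilobases and by millions of mapped reads). The function name and the choice to default the library size to the sum of the supplied counts are assumptions, not the procedure documented in the Supplement.

```python
def rpkm(read_counts, gene_lengths_bp, total_mapped_reads=None):
    """Reads per kilobase of transcript per million mapped reads.

    read_counts: dict mapping gene ID to number of mapped reads.
    gene_lengths_bp: dict mapping gene ID to CDS length in base pairs.
    total_mapped_reads: library size; if omitted, the sum of the supplied
    counts is used (an assumption made for this sketch).
    """
    if total_mapped_reads is None:
        total_mapped_reads = sum(read_counts.values())
    return {
        gene: count * 1e9 / (total_mapped_reads * gene_lengths_bp[gene])
        for gene, count in read_counts.items()
    }
```

Because the value is normalized by length, two genes with the same read density per base receive the same RPKM regardless of their lengths.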

Functional Constraint

Number of Processes and Functions

As curated GO process and function annotations remain unavailable for all genes in the three Leishmania species, computed Gene Ontology (GO) processes and functions associated with each gene were extracted from the TriTrypDB database (Aslett et al. 2010). The numbers of predicted processes (NumProcs) and functions (NumFuncs) were calculated from this information using an in-house PERL code (Sect. 8B of Supplementary Text S1). Further details are provided in Sect. 7E of Supplementary Text S1.
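
Counting processes and functions per gene amounts to tallying distinct GO identifiers by ontology. The Python sketch below shows one way to do this; the input layout (gene ID, ontology name, GO ID) and the ontology labels are hypothetical and do not reflect the actual TriTrypDB export format or the in-house PERL code.

```python
from collections import defaultdict


def count_go_terms(annotations):
    """Return {gene: (NumProcs, NumFuncs)} from (gene, ontology, go_id) rows.

    Distinct GO IDs are counted once per gene, so repeated rows do not
    inflate the totals.
    """
    processes = defaultdict(set)
    functions = defaultdict(set)
    for gene, ontology, go_id in annotations:
        if ontology == "Biological Process":
            processes[gene].add(go_id)
        elif ontology == "Molecular Function":
            functions[gene].add(go_id)
    genes = set(processes) | set(functions)
    return {g: (len(processes[g]), len(functions[g])) for g in genes}
```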

Flux-Coupling Potential of an Enzyme

In this study, we introduce the flux-coupling potential of an enzyme as a proxy for quantifying the flux-based functional constraint imposed on a metabolic enzyme. The flux-coupling potential is quantified by the centrality of an enzyme (degree or number of flux-couplings, NCoup) and the tendency of an enzyme to cluster together with other enzymes with a similar number of flux-couplings (local clustering coefficient, CCoFCA), within a flux-coupled subgraph of the metabolic network. Further details are provided in Sect. 7B of Supplementary Text S1.
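
For readers unfamiliar with the two graph measures, the following Python sketch computes the degree and the standard local clustering coefficient of each node in an undirected graph built from coupled pairs. It assumes the textbook definition of the clustering coefficient and a simple edge-list input; the exact construction of the flux-coupled subgraph is described in the Supplement and is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def coupling_features(coupled_pairs):
    """Return {node: (degree, local_clustering_coefficient)}.

    coupled_pairs: iterable of (a, b) pairs treated as undirected edges.
    The clustering coefficient is 2 * links / (k * (k - 1)), where links is
    the number of edges among the k neighbours; it is set to 0 when k < 2.
    """
    neighbours = defaultdict(set)
    for a, b in coupled_pairs:
        if a != b:  # ignore self-couplings
            neighbours[a].add(b)
            neighbours[b].add(a)
    features = {}
    for node, adjacent in neighbours.items():
        k = len(adjacent)
        links = sum(1 for u, v in combinations(adjacent, 2)
                    if v in neighbours[u])
        clustering = 2 * links / (k * (k - 1)) if k > 1 else 0.0
        features[node] = (k, clustering)
    return features
```

In the terminology of this section, the first value plays the role of NCoup and the second the role of CCoFCA.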

Sequence-Based Evolutionary Rates

For the estimation of the evolutionary rates, multiple sequence alignment of each gene in all the three species was performed with its orthologous sequences across five genomes, namely, L. major, L. infantum, L. donovani, L. mexicana, and L. braziliensis species. This captures the degree of sequence divergence across closely related species within the Leishmania lineage. These five Leishmania species were chosen, as their genomes are completely sequenced and assembled. The orthology information was available within the TriTrypDB database, v.8.1, release 32 (Aslett et al. 2010). The alignment was processed to remove sequence positions with gaps using a standalone version of the PAL2NAL program (Subramanian and Sarkar 2016). dN, dS, and ω (dN/dS) were estimated using the one-ratio M0 branch model implemented in the ‘codeml’ subroutine of the PAML package version 4.8a (Yang 1998, 2007).
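
The gap-removal step can be pictured as dropping every codon column in which at least one sequence carries a gap. The Python sketch below illustrates that idea only; it is not the PAL2NAL implementation, and the function name and dictionary input are assumptions.

```python
def strip_gapped_codons(aligned):
    """Remove codon columns that contain a gap in any sequence.

    aligned: dict mapping sequence name to an aligned codon sequence; all
    sequences must have the same length, a multiple of three.
    """
    sequences = list(aligned.values())
    length = len(sequences[0])
    if length % 3 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be codon-aligned and equally long")
    keep = [i for i in range(0, length, 3)
            if all("-" not in s[i:i + 3] for s in sequences)]
    return {name: "".join(s[i:i + 3] for i in keep)
            for name, s in aligned.items()}
```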

Pre-processing the Datasets for Multivariate Analysis

For each species, the dataset of metabolic genes was pre-processed to remove (a) genes with obsolete sequences, fewer than 200 codons, or dS > 0.3, (b) duplicates, and (c) genes for which any of the targeted genomic, expression, or metabolic network-based features was unavailable. Finally, only the 233 singletons common to the three Leishmania species were considered for multivariate analyses. Details of the extraction of singleton genes are provided in Sect. 7C of Supplementary Text S1.
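
The filtering criteria listed above can be expressed as a short Python sketch. The record fields (`ortholog`, `codons`, `dS`, `obsolete`, `duplicated`) are hypothetical names chosen for illustration; the actual singleton extraction is described in the Supplement.

```python
# The eight predictors used in this study; a gene is dropped if any is missing.
REQUIRED = ("CAI", "GC", "length", "RPKM",
            "NumProcs", "NumFuncs", "NCoup", "CCoFCA")


def keep_gene(record):
    """Apply criteria (a) to (c) to one gene record (a dict)."""
    return (not record.get("obsolete", False)
            and record["codons"] >= 200
            and record["dS"] <= 0.3
            and not record.get("duplicated", False)
            and all(record.get(f) is not None for f in REQUIRED))


def common_singletons(datasets):
    """Ortholog groups that pass the filter in every species.

    datasets: dict mapping species name to a list of gene records.
    """
    kept = [{r["ortholog"] for r in records if keep_gene(r)}
            for records in datasets.values()]
    return set.intersection(*kept) if kept else set()
```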

Multivariate Analysis and Clustering

Principal Component Regression

Principal Component Regression (PCR) analysis on metabolic networks of the three Leishmania species was used to identify the potential contribution of the genomic, gene expression, and function-based features to the total variance in evolutionary rates among metabolic genes. The ‘pls’ package version 2.6 implemented in R was used to perform PCR with dN and dS as the response variables and the aforementioned eight features as the predictor variables. Predictor variables with loadings of 0.45 or more on a principal component were used to interpret that component (Tabachnick and Fidell 2007). Further details are provided in Sect. 7D of Supplementary Text S1.
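
The logic of PCR can be sketched in a few lines of Python with NumPy. This is an illustrative reimplementation under simple assumptions (standardized predictors, a centered response, components obtained by singular value decomposition), not the code path of the R ‘pls’ package; the function names are hypothetical.

```python
import numpy as np


def principal_component_regression(X, y):
    """Regress y on the principal components of the standardized predictors.

    Returns (loadings, scores, coefficients, r2_per_component). Because the
    component scores are mutually orthogonal, the share of variance in y
    explained by each component can be computed independently.
    Assumes no predictor column is constant.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yc = y - y.mean()
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * s                       # equals Z @ Vt.T
    coefficients = (scores.T @ yc) / s ** 2
    r2 = coefficients ** 2 * s ** 2 / (yc @ yc)
    return Vt.T, scores, coefficients, r2


def dominant_features(loadings, names, cutoff=0.45):
    """Features whose absolute loading on each component meets the cutoff."""
    return [[n for n, w in zip(names, loadings[:, j]) if abs(w) >= cutoff]
            for j in range(loadings.shape[1])]
```

Components are ordered by the variance they capture in the predictors, so the per-component share of variance in the response returned here need not decrease with component number.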

Selection of Minimum Principal Components for Regression

A randomization test approach (van der Voet 1994) was used to check whether the squared prediction errors of regression models with fewer components are significantly (P < 0.01) larger than those of the reference model, that is, the model with the absolute minimum prediction error. For this comparison, a distribution of prediction errors was generated for each model using 1000 random permutations. Among the models that were not significantly worse than the reference model, the model with the fewest principal components was chosen as the best model to predict dN and dS in all three species. The randomization test approach is implemented within the ‘pls’ package.

K-Means Clustering

K-means clustering of genes was performed in an n-dimensional space, where n represents the selected number of principal components. Clustering was performed to identify groups of genes governed by a particular set of principal components and thereby by a subset of predictors. The number of clusters in each dataset was determined by computing Akaike’s Information Criterion (AIC; Manning et al. 2008) for each number of clusters K, where K = 1–100. The K corresponding to the lowest AIC was taken as representative of each dataset.
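
A minimal Python sketch of choosing K by AIC is given below. It assumes the K-means form of the criterion given by Manning et al. (2008), AIC(K) = RSS(K) + 2MK, where RSS is the residual sum of squares and M is the dimensionality of the space; the plain Lloyd-style K-means, the number of restarts, and the function names are illustrative choices, not the procedure actually used in this study.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_rss(points, k, seed=0, n_iter=100):
    """Run K-means from one random start and return the final RSS."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Empty clusters keep their previous centroid.
        updated = [tuple(sum(col) / len(c) for col in zip(*c)) if c
                   else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def select_k_by_aic(points, k_values, restarts=10):
    """Return (best_k, {k: AIC}) with AIC = min RSS over restarts + 2 * M * k."""
    pts = [tuple(p) for p in points]
    m = len(pts[0])
    aic = {}
    for k in k_values:
        rss = min(kmeans_rss(pts, k, seed=s) for s in range(restarts))
        aic[k] = rss + 2 * m * k
    return min(aic, key=aic.get), aic
```

The penalty term grows with both K and the number of principal components retained, so splitting a cluster is accepted only when it reduces the RSS by more than 2M.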

Results

Features Associated with Evolutionary Rates are Also Inter-correlated in Leishmania Species

A pairwise correlation analysis of the orthologous metabolic genes in Leishmania major Friedlin, Leishmania donovani BPK282A1, and Leishmania infantum JPCM5 revealed no significant correlation between dN and dS, whereas ω, as expected from its definition, is correlated with both dN and dS (Fig. 1, Sect. 1 of Supplementary Text S1). A significant correlation is observed between the codon adaptation index (CAI) and evolutionary rates in all species, suggesting an association between translation selection and enzyme evolution (Fig. 1, Sect. 1 of Supplementary Text S1). In comparison, features representing functional constraints demonstrate relatively weak, species-specific associations with dN, dS, and ω. In L. major (Fig. 1a), dN and ω are negatively correlated with the number of processes in which a gene is involved (NumProcs), indicative of a weak functional constraint (dN: r = − 0.161, P = 0.014; ω: r = − 0.172, P = 0.008). Similarly, the number of flux-couplings per reaction associated with a gene (NCoup) is significantly associated with ω (r = − 0.152; P = 0.02). In L. donovani (Fig. 1b), dN correlates weakly with NumProcs (r = − 0.16592; P = 0.011). In L. infantum (Fig. 1c), ω demonstrates a weak positive association with a gene’s tendency to occur in a flux-coupled module (CCoFCA) (r = 0.165; P = 0.011). The pairwise correlation-based analysis thus indicates that, with the exception of CAI and GC content, each of the aforementioned features is only weakly correlated with evolutionary rates across the three Leishmania species.
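
To make the link between a correlation coefficient and its significance concrete, the Python sketch below computes a Pearson coefficient and the associated t statistic, t = r√((n − 2)/(1 − r²)). Whether the reported values are Pearson or rank-based coefficients is specified in the Supplement, so the use of Pearson here is an assumption, as are the function names.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic (n - 2 degrees of freedom) for testing r against zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))
```

Assuming n = 233 genes, a coefficient of − 0.161 gives a t statistic of about − 2.48 on 231 degrees of freedom, which illustrates how a weak correlation can still be statistically significant at this sample size.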

Fig. 1
Correlation dot plot demonstrating inter-correlations between the eight predictors and evolutionary rates for a Leishmania major, b Leishmania donovani, and c Leishmania infantum. This plot displays correlated pairs of features having significant correlation at P < 0.05. Dots represent significant positive or negative correlations. Colors represent both the nature and degree of the association between any two features. The size of the dots represents the degree of the association between any two features. Pairwise correlation values are given in Sect. 1 of Supplementary Text S1. (Color figure online)

Apart from associations of the predictors with evolutionary rates, inter-correlations between predictors were also observed. As observed in a previous study (Subramanian and Sarkar 2015), CAI correlates positively with GC content, with varying strengths of association in each species. The GC content of a gene increases with gene length, as indicated by their significant association across species (Fig. 1, Sect. 1 of Supplementary Text S1). In L. major and L. donovani (Fig. 1a, b), the CAI of a gene is positively associated with NumFuncs (L. major: r = 0.145, P = 0.026; L. donovani: r = 0.173, P = 0.008), suggesting that multifunctional genes contain more frequent codons. As is well known, CAI correlates with mRNA abundance (measured as reads per kilobase of transcript per million mapped reads, RPKM) in L. major and L. infantum (Fig. 1a, c). In L. donovani and L. infantum, gene length and RPKM are negatively correlated, suggesting that expression of metabolic genes is probably limited by gene length in these species (Fig. 1b, c). Specifically in L. infantum, the number of functions associated with a gene (NumFuncs) demonstrates a weak negative association (r = − 0.20133; P = 2 × 10^−3) with the tendency of a gene to cluster with genes demonstrating similar physiological fluxes (CCoFCA), hinting at a role for multifunctional genes in routing fluxes within functional flux modules. The values of features for the selected genes in all three species are given in Supplementary File S1. This analysis further demonstrates that it is inappropriate to use these features directly to predict evolutionary rates of genes in Leishmania, as they are not independent of each other.

Contribution of Features to the Variation Observed in Enzyme Evolutionary Rates

As indicated in Fig. 1, although many features are individually correlated with the evolutionary rates, some of them are also inter-correlated. Hence, it is difficult to identify the potential contribution of each individual feature to evolutionary rates. For this purpose, PCR was performed to identify independent principal components, each of which represents a linear combination of features, with the coefficients representing the weight of a particular feature in explaining the variation in dN, dS, or ω (Drummond et al. 2006). The distribution statistics of evolutionary rates for the selected datasets are given in Sect. 2 of Supplementary Text S1. The identified principal components for the response dN and dS rates in the three Leishmania species are given in Supplementary File S2. PCR analysis with dN and dS in all three species indicates that the amount of variation in the response variables (dN and dS) explained by the principal components need not follow the descending order of the principal components (Jolliffe 1982). Additionally, in most cases, 90% of the variation in dN and dS cannot be explained by considering only the first few components, suggesting that no single factor dominates enzyme evolutionary rates. The pairwise correlation-based analysis fails to reveal this, as it highlights only the effects of the strongest pairwise associations. Furthermore, as there are inter-correlations among predictors, a combination of other related predictors probably outweighs the contribution of the apparently strongly associated codon usage/GC content features. Another important observation is that, although the flux topological features explain little of the variance in dN, their occurrence within the 1st principal component suggests that these features account for a majority of the variation observed in the predictor data for metabolic genes in all three species.

With respect to dN, the two components explaining the most variance (principal components 2, 3 of L. major, 2, 3 of L. donovani, and 3, 7 of L. infantum), which cumulatively represent around 28.01% variance in L. major (Fig. 2a), 21.18% variance in L. donovani (Fig. 2d), and 23.41% variance in L. infantum (Fig. 2g), are dominated by genomic and gene expression features such as CAI, GC, RPKM, and gene length. In all cases (Fig. 2a, d, g), the components dominated by flux-coupling potential and functional constraints explain a relatively small amount of variance in evolutionary rates. In L. infantum (Fig. 2g), a comparable amount of variation (7.92%) in dN is explained by the principal components (1 and 8) that are dominated by the metabolic flux-coupling potential of an enzyme, where the total variance explained in dN by all the principal components is 38.21%.

Fig. 2
Principal components regression on dN (a, d, g), dS (b, e, h), and ω (c, f, i) rates of 233 singleton orthologous metabolic genes in L. major, L. donovani, and L. infantum using eight different features. Each principal component represents a linear combination of the eight predictors, dominated by components that demonstrate a large variation in dN and dS. The colors correspond to the percentage variance explained by a particular feature, with respect to that principal component. (Color figure online)

With respect to dS, it can be observed that the first two components (principal components 1, 8 of L. major, 2, 3 of L. donovani and 3, 8 of L. infantum), which cumulatively represent around 13.96% variance in L. major (Fig. 2b), 14.24% variance in L. donovani (Fig. 2e), and 21.37% variance in L. infantum (Fig. 2h) are dominated by genomic and gene expression features like CAI, GC, RPKM, and gene length. A relatively large amount of variance (7.2%) is also explained in dS rate of enzymes in L. major by two principal components governed by the flux-coupling potential, where the total variance explained in dS by all the principal components is 25.6% (Fig. 2b).

As observed for the dN and dS rates, the largest percentage of the total variation in ω is explained by features related to translation selection (CAI, GC content), as indicated by the 6th or 7th principal components in all three species (variation explained: 35.5% in L. major, 21.88% in L. donovani, and 44.33% in L. infantum; Fig. 2c, f, i). However, as observed for L. major and L. donovani, multi-functionality and flux topology features explain larger variations in ω than in the individual dN and dS rates (the heights of the orange/black bars representing flux topology and the purple/yellow bars representing multi-functionality are greater in ω than in dN and dS in all three species). An almost equal variation in ω is explained by flux-coupled features (NCoup and CCoFCA) in L. donovani (8th principal component: 21.86%, Fig. 2f). The second largest percentage of variance is explained by the variable related to multi-functionality in L. major (24.19%, Fig. 2c) and L. infantum (24.51%, Fig. 2i). Similar to the dN and dS rates, no single component alone explains more than 90% of the variation in ω.

Selection of Components for Predicting Enzyme Evolutionary Rates

A set of principal components was shortlisted for predicting evolutionary rates using a randomization test approach (see “Materials and Methods”). The principal components selected for regression are given in Sect. 3 of Supplementary Text S1. Features with loadings greater than 0.45 were considered for interpreting a principal component (Table 1). Most of the principal components explaining any variation in dN or dS can be interpreted on the basis of three distinct classes of features: (a) codon usage (CAI) and GC content, (b) multi-functionality (NumProcs, NumFuncs), and (c) flux phenotypic features (NCoup, CCoFCA). Most importantly, in all species except L. infantum, the effect of CAI and GC content of a gene on evolutionary rates can be interpreted by the same principal component, suggesting their combinatorial effect in constraining dN and dS. To explain the dN rate of a gene in L. infantum, two principal components (2 and 7) involving CAI and GC content as principal features can be observed, where GC content contributes negatively to dN in the 2nd principal component and positively in the 7th principal component. Additionally, the 7th component has a relatively large role in explaining dN as compared to the 2nd component. In all species, CAI relates negatively to dN and positively to dS. In all species, the number of processes associated with a gene (NumProcs) contributes negatively to dN. Further, no principal component explaining either dN or dS can be interpreted solely on the basis of gene length.

Table 1 Contribution of the eight predictors to the selected principal components (loading cut-off > 0.45) and hence, the log10(dN) and log10(dS) rates in L. major, L. donovani and L. infantum

Gene expression (RPKM) contributes positively to the dS rate in L. donovani and L. infantum and negatively to the dN rate in L. major and L. infantum. In the case of L. major, distinct principal components (2 and 3) can be interpreted using CAI and RPKM, respectively, suggesting that the two features are only weakly associated with each other and are independently associated with dN (Table 1). Most of these relationships are consistent with the pairwise correlation-based analysis performed above (Fig. 1).

To explain dS in L. donovani, it can be seen that distinct principal components (3 and 2) can be interpreted using CAI and RPKM, respectively, suggesting their independent relationships with dS and no association with each other (Table 1). On the contrary, in L. infantum, principal component 3 can be interpreted by both CAI and RPKM suggesting their inter-relatedness. Interestingly, an important observation points out that synonymous substitution rates are not constrained by the multifunctional potential of a gene (NumFuncs, NumProcs). Flux topological features significantly contribute to dN rates of genes in L. infantum and dS rates of genes in L. major. Patterns common to both dN and dS are observed with respect to the ω rate across the three Leishmania species (Sect. 4 of Supplementary Text S1). Features related to translation selection (CAI, GC, RPKM, gene length) demonstrate a significant association with ω in all the three species. In L. major, translation selection is the only factor affecting ω. Multi-functionality (NumFuncs, NumProcs) is significantly associated negatively with ω in L. donovani and L. infantum. Further, in L. infantum, the flux topological features (NCoup, CCoFCA) are also significantly associated with the ω rate.

Relationship Between Physiological Flux Coupling and Enzyme Evolutionary Rates

The pairwise correlation analysis indicated a weak correlation between flux-coupling features and evolutionary rates in L. major and L. infantum (Fig. 1). But, in the above analysis, it was found that across Leishmania species, physiological flux coupling potential seems to be a poor predictor of evolutionary rates (Table 1). This relationship between evolutionary rates and flux-coupling potential can be affected because certain enzymes demonstrate no flux coupling with other reactions within the network. Apart from explaining variations, PCR analysis also allows us to classify genes into two clusters, with respect to the contribution of the predictor features of the genes (interpreted through a principal component) to a response. It was observed that the potential of an enzyme to be physiologically coupled to other enzymes within metabolism or not can be classified only using scores of enzymes loaded on the first principal component (PC1) associated with the three evolutionary rates in all the species (Insets, Fig. 3a–i).

Fig. 3
Association between rates of protein evolution and number of couplings (NCoup) is affected by gene duplications. Relationship between dN rates and NCoup of flux-coupled set of enzymes is given for a L. major; d L. donovani; and g L. infantum. Relationship between dS rates and NCoup of flux-coupled set of enzymes is given for b L. major; e L. donovani; and h L. infantum. Relationship between ω and NCoup is given for c L. major; f L. donovani; and i L. infantum. j Violin plot demonstrating the differences in the variance of number of couplings associated with duplicated genes between L. major (median = 3), L. donovani (median = 1.03), and L. infantum (median = 2); k Violin plot demonstrating the differences in variance of singleton genes between L. major (median = 9), L. donovani (median = 7.55), and L. infantum (median = 6.64). Insets represent the two clusters of metabolic enzymes that are flux-coupled (1) and uncoupled (2)

With respect to this coupled set of enzymes (cluster 1 in insets, Fig. 3a–i), a negative relationship of varying strength is observed between dN or ω and the number of couplings associated with an enzyme (Fig. 3). In all three species, no association was observed between dS and the number of couplings (Fig. 3b, e, h). The strength of the association between dN or ω and NCoup decreases in the order L. major > L. donovani > L. infantum. In L. major (Fig. 3a, c), the association, although weak, is statistically significant at P < 0.01 (dN: r = − 0.252, P = 0.007; ω: r = − 0.291, P = 0.002). In L. donovani (Fig. 3d, f), the association is weaker than in L. major (dN: r = − 0.159, P = 0.094; ω: r = − 0.198, P = 0.036). In L. infantum (Fig. 3g, i), the association is the weakest and is not distinguishable from chance (dN: r = − 0.019, P = 0.83; ω: r = − 0.094, P = 0.29). The associations become weaker from L. major to L. infantum due to the gain or loss of flux-couplings by enzymes across species. This gain or loss is affected by the coupling between duplicated and singleton genes in unique subcellular locations across species (Supplementary File S3). Furthermore, the number of flux-couplings observed for duplicated genes is much higher as compared to singletons (Supplementary File S3).

Hence, we asked whether gene duplications affect the relationship between dN or ω and the number of couplings associated with an enzyme. Comparing the distributions of the number of couplings associated with duplicated enzymes in the three species revealed that most of the duplicated enzymes are coupled to few other enzymes within the metabolic network of Leishmania species (Fig. 3j). However, the variance in the number of couplings of duplicated enzymes is notably higher in L. major, with some duplicated enzymes displaying a large number of couplings. In contrast, the variance is drastically lower in L. donovani and L. infantum than in L. major. Similar to duplicated enzymes, comparing the distributions of the number of couplings associated with the coupled set of singleton enzymes in the three species also revealed that most of the singleton enzymes are coupled to few other enzymes within the metabolic network of Leishmania species (Fig. 3k), with decreasing variance from L. major to L. infantum. This decreasing variance parallels the decreasing association of flux-coupling potential with evolutionary rates of metabolic genes from L. major to L. donovani to L. infantum (Fig. 3a–i).

Comparing the variance in the number of flux-couplings across species for both duplicated and singleton enzymes using Levene’s test of homogeneity of variances (Martin and Bridgmon 2012) indicated that the variance in the number of couplings differs significantly between species at P < 0.001 (duplicated: F = 10.968, P = 3.25 × 10⁻⁵; singletons: F = 8.54, P = 2.6 × 10⁻⁴). The similarity in the distributions of the number of couplings between duplicated enzymes and singletons indicates that more gene duplications might indirectly create new flux-coupling associations with singletons, under stoichiometry, reversibility, and environmental constraints, thereby promoting the association of the evolutionary rate with the number of couplings associated with singleton genes. Furthermore, the variance in the number of couplings from L. major to L. infantum decreases at a slower rate in singletons than in duplicated enzymes, indicating that the association between evolutionary rates and number of couplings in singletons is not promoted equally by all gene duplication events across species.
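The statistic behind such a comparison can be sketched as follows. This is a minimal illustration of the mean-centred form of Levene's test; whether the study used the mean-centred or the median-centred (Brown–Forsythe) variant is not stated, so the choice here is an assumption. The function name `levene_w` and the example values are hypothetical.

```python
def levene_w(*groups):
    """Mean-centred Levene statistic W for k groups of observations."""
    k = len(groups)
    if k < 2 or any(len(g) < 2 for g in groups):
        raise ValueError("need at least two groups with two or more observations each")
    # Absolute deviation of every observation from its own group mean.
    z = []
    for g in groups:
        m = sum(g) / len(g)
        z.append([abs(v - m) for v in g])
    n_total = sum(len(zi) for zi in z)
    z_means = [sum(zi) / len(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    # One-way ANOVA F ratio computed on the absolute deviations.
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    if within == 0:
        raise ValueError("statistic is undefined when deviations do not vary within groups")
    return (n_total - k) / (k - 1) * between / within


# Made-up coupling counts for three species (illustration only).
w = levene_w([1, 2, 2, 9, 15], [1, 2, 3, 4, 6], [1, 1, 2, 2, 3])
print(f"W = {w:.3f}")
```

Under the null hypothesis of equal variances, W is compared with an F distribution with k − 1 and N − k degrees of freedom, where k is the number of species and N the total number of enzymes.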

Identification of Metabolic Genes Constrained by Translation Selection, Multi-functionality, and Flux Topology

From Table 1, it is possible to identify principal components that can be interpreted through the independent features CAI, number of processes (NumProcs), and number of flux-couplings (NCoup) associated with a gene, together with the nature of their contributions to the evolutionary rates. These features represent the roles of translation selection, multi-functionality, and flux topology, respectively, in the evolutionary rates of metabolic genes. By observing the centroids of the clusters (Sect. 5 of Supplementary Text S1, Supplementary File S4), the gene clusters associated with contributions of such principal components can be identified. Accordingly, the dN rate of genes in clusters 4 and 14 in L. major, clusters 9, 12, 13, and 17 in L. donovani, and clusters 8 and 18 in L. infantum is dominated by non-zero values of NCoup (positive scores on the respective principal component in L. major and L. donovani, and negative scores in L. infantum) and by low values of NumProcs (positive scores on the principal component in L. major and L. donovani, and negative scores in L. infantum) and CAI (negative scores on the respective principal component in L. major and L. donovani, and positive scores in L. infantum). Multi-functionality, which is represented by NumProcs or NumFuncs, does not appear to be a dominant predictor of the dS rate and hence does not occur as a major contributor in any of the selected components (Table 1). For dS, therefore, the gene clusters whose evolutionary rates can be interpreted by CAI and flux topology alone were identified. The dS rate of genes in clusters 2, 3, 4, 7, 8, and 11 in L. major, clusters 1, 2, 3, 10, 15, and 18 in L. donovani, and clusters 2, 3, 10, and 18 in L. infantum is associated with high values of CAI (positive scores on the respective principal component in all three species) and NCoup (positive scores on the respective principal component in L. major and L. infantum, and negative scores in L. donovani). Comparing the selected genes across species identifies five genes, present in all species, whose evolutionary rates are dominated by all three factors (translation selection, multi-functionality, and flux topology), and 13 genes whose evolutionary rates are governed by translation selection and flux topology (Fig. 4).

Fig. 4 Comparison of genes demonstrating high values of the independent dominant factors, namely codon adaptation, number of biological processes, and number of flux-coupling associations, between species with respect to (a) dN and (b) dS

The overlap of genes between L. major and L. donovani is larger with respect to dN than with respect to dS. In contrast, the overlap between L. donovani and L. infantum is smaller with respect to dN than with respect to dS. In every species, there is also a unique set of genes whose evolutionary rates are specifically explained by the identified independent features (Fig. 4, Sect. 6 of Supplementary Text S1).
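The kind of comparison summarized in Fig. 4 can be sketched with plain set operations. The sketch below assumes that the genes selected in each species have already been mapped to shared ortholog identifiers; the function name `partition_overlap`, the species keys, and the identifiers are all hypothetical.

```python
def partition_overlap(sets_by_species):
    """Split gene sets into shared-by-all, shared-by-one-pair-only, and species-specific.

    With three species, each pairwise entry holds genes found in exactly
    those two species, as in a three-way Venn diagram.
    """
    species = list(sets_by_species)
    shared_all = set.intersection(*(set(sets_by_species[s]) for s in species))
    pairwise = {}
    for i, a in enumerate(species):
        for b in species[i + 1:]:
            both = set(sets_by_species[a]) & set(sets_by_species[b])
            pairwise[(a, b)] = both - shared_all
    unique = {}
    for s in species:
        others = set().union(*(set(sets_by_species[t]) for t in species if t != s))
        unique[s] = set(sets_by_species[s]) - others
    return shared_all, pairwise, unique


# Hypothetical ortholog identifiers (illustration only).
selected = {
    "L_major": {"og1", "og2", "og3"},
    "L_donovani": {"og1", "og2", "og4"},
    "L_infantum": {"og1", "og5"},
}
shared_all, pairwise, unique = partition_overlap(selected)
print(sorted(shared_all), {k: sorted(v) for k, v in unique.items()})
```

Running the same partition once on the genes selected for dN and once on those selected for dS would give the two panels of such a comparison.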

Discussion

Owing to their parasitic nature and long-standing evolutionary association with hosts, Leishmania species experience a largely constrained metabolic environment. For efficient adaptation within the host, both translation selection and functional constraint might limit the evolution of enzymes within Leishmania metabolism. To our knowledge, no study to date in Leishmania parasites has compared these heterogeneous potential determinants in predicting non-synonymous (dN) and synonymous (dS) substitution rates in metabolic enzymes simultaneously, on a single platform. The inter-relationships between these factors and their differences across species have also seldom been explored. Following approaches used in other eukaryotes (Drummond et al. 2006; Yang and Gaut 2011; Alvarez-Ponce et al. 2017), the present study integrates the available potential features of metabolic enzymes into a principal component-based regression model to identify the unknown confounding factors that explain the observed variation in evolutionary rates, and compares them across three Leishmania species.

As observed in other eukaryotes (Drummond et al. 2006), codon usage correlates negatively with dN and ω and positively with dS in all species, indicating that translation selection is an important constraint on the evolution of Leishmania metabolic enzymes. This is also evident from the fact that the principal component dominated by CAI explains the highest percentage of variation. Furthermore, GC content is a dominant factor of the same principal component as CAI, indicating that the two are related and supporting previous observations (Subramanian and Sarkar 2015). However, in all three Leishmania species, no single principal component explains a significant proportion of the variation in evolutionary rates, nor does a single set of similar features explain sufficient variation across principal components, indicating that multiple features potentially contribute to enzyme evolution in Leishmania species. Hence, more than one principal component was selected for regression (van der Voet 1994). Except in L. infantum, the results indicate that gene expression (RPKM) does not occur in the same principal component as CAI, suggesting that the two features govern evolutionary rates of enzymes independently. This is contrary to observations in yeast and E. coli, where gene expression complements CAI as a dominant factor governing evolutionary rates (Drummond et al. 2006). It also contrasts with observations in Trypanosoma brucei, an evolutionarily related trypanosomatid, where codon usage has been shown to affect global mRNA levels (Jeacock et al. 2018). This might be due to the weak association observed between mRNA and protein abundances in Leishmania species (Lahav et al. 2011), given that CAI is an important predictor of protein abundance (Subramanian and Sarkar 2015). Similarly, the occurrence of CAI, multi-functionality, and flux-coupling features as dominant features on distinct principal components suggests that these features affect evolutionary rates independently. Further, the multi-functionality of a gene (NumProcs, NumFuncs) contributes only to the non-synonymous substitution rate (dN) and is negatively associated with it. Hence, as observed in yeast (Salathé et al. 2005), genes (enzymes) with multiple processes or functions evolve more slowly than genes associated with a low number of functions in Leishmania species as well.

As the parasite stages live in fixed host environments, the pathways used to metabolize resources across stages remain strikingly similar (Subramanian and Sarkar 2017). Thus, enzymes (reactions) that are more coupled to other enzymes within the metabolic network might be more evolutionarily constrained than enzymes that are less coupled or not coupled at all. Hence, for the first time, we introduce the notion of the flux-coupling potential of an enzyme within its metabolic network and investigate whether it is an important determinant of evolutionary rate in Leishmania species. Although the associations of the flux-coupling features with evolutionary rates are weak, the occurrence of flux topological features in the first principal component and the selection of their associated principal component for regression against evolutionary rates indicate that, unlike multi-functionality, they contribute importantly to variation in both dN and dS rates. Supporting this, a significant amount of variation in the dS rate of enzymes in L. major and in the dN rate of enzymes in L. infantum is also explained by these features. Considering only the flux-coupled set of enzymes in all three species, a weak negative association can be observed between dN or ω and the number of couplings associated with an enzyme (NCoup). Flux-coupled reaction subsets capture the total number of paths of metabolite distribution under defined uptake constraints, and they can explain co-regulation between metabolic genes (Notebaart et al. 2008). A negative association has been observed between ω and metabolic flux through an enzyme in yeast, human RBCs, and L. major (Vitkup et al. 2006; Colombo et al. 2014; Subramanian and Sarkar 2016). This suggests that an enzyme coupled by flux to a large number of other enzymes (a hub) within the flux-coupled network evolves more slowly than enzymes with a low number of couplings. Further, only a few enzymes have a high number of flux-couplings, whereas many have a low number. This indicates that a hierarchical organization of fluxes within Leishmania metabolism is largely constrained during evolution.

Chromosomal aneuploidy in Leishmania gives rise to significant variations in gene copy numbers across species that might increase genomic plasticity and gene dosage and rescue essential functions from deleterious mutations (Mannaert et al. 2012). In addition to these roles, we document, for the first time, an observation indicating a possible species-specific involvement of duplicated metabolic enzymes in increasing the evolutionary constraints on other metabolic enzymes within a network, through re-wiring of physiological flux dependencies within metabolism. This is indicated by a higher variance in the number of couplings associated with singleton and duplicated enzymes, together with relatively stronger associations between the number of couplings associated with singletons and evolutionary rates. As the variance in the number of couplings of duplicated enzymes decreases from L. major → L. donovani → L. infantum, the strength of the associations between number of couplings and evolutionary rates also decreases. A similar re-wiring of fluxes due to cross-compartmentalized metabolism was also hypothesized for glycolysis and isoprenoid biosynthesis in other trypanosomatids (close evolutionary relatives of Leishmania) and other protists (Ginger et al. 2010). Interestingly, not all duplicated genes are highly flux-coupled with other enzymes in the network, suggesting that the species-specific metabolic network structure dynamically constrains the choice of unique gene duplications occurring at multiple subcellular locations for flux re-wiring, thereby imposing evolutionary constraints on the singletons associated with them.

Previously, codon bias, pleiotropy, and centrality within a biomolecular network were implicated in imposing relatively strong evolutionary constraints on enzymes that are important pharmacological targets for a disease (Searls 2003; Pál et al. 2006; Gladki et al. 2013; Lv et al. 2016). As mentioned above, codon adaptation, multi-functionality, and flux topological constraints independently affect evolutionary rates, each of these features being negatively associated with dN. Comparing genes whose dN rate is dominated by these factors identifies both common and species-specific enzymes that are evolutionarily constrained by multiple genotype–phenotype factors, marking them as important enzymes. For example, this analysis identified enzymes such as trypanothione reductase, aspartate carbamoyltransferase, orotidine-5-phosphate decarboxylase, and dihydrolipoamide dehydrogenase as common to all three species. Among these, trypanothione reductase, the sole enzyme in the Leishmania parasite to combat oxidative stress (Tovar et al. 1998), and aspartate carbamoyltransferase and orotidine-5-phosphate decarboxylase, involved in the production of pyrimidines such as UMP and CMP (Mukherjee et al. 1988; Bello et al. 2007), are previously speculated pharmacological targets in Leishmania and other eukaryotes. On the other hand, unique enzymes belonging mainly to Energy metabolism and conservation (C), Carbohydrate transport and metabolism (G), Amino acid transport and metabolism (E), and Nucleotide transport and metabolism (F) were also identified for each species (Sect. 6 of Supplementary Text S1). Among these unique enzymes, known virulence factors such as trypanothione synthetase, phosphomannose isomerase, and GDP-mannose pyrophosphorylase were specifically identified for L. major; dihydrofolate-reductase/thymidylate synthase, pyrroline-5-carboxylate reductase, and phosphomannomutase were identified for L. infantum; and tyrosine aminotransferase was identified for L. donovani (Mukherjee et al. 1988; Titus et al. 1995; Tovar et al. 1998; Garami and Ilg 2001a, b; Scott et al. 2008; Moreno et al. 2014; Mantilla et al. 2015). Their role in virulence probably makes them more resistant to change. This analysis also predicted a few more novel species-specific enzymes (Tables G, H, Sect. 6 of Supplementary Text S1). These may serve as potential drug targets because they are governed by unique evolutionary constraints. Their biological role in virulence, survival, or visceralization of the parasite needs to be experimentally investigated.

Although the results provided here are limited by the unavailability of genome-scale metabolic networks for multiple known species, strains, and isolates of Leishmania (Cantacessi et al. 2015), the use of such comprehensive multivariate analyses in teasing apart the known confounding factors of enzyme evolution provides a broad insight into the organization of Leishmania metabolism and the underlying factors governing its change. This work also provides a multitude of hypotheses that can be tested experimentally in Leishmania. Furthermore, the identification of multiple factors that constrain evolutionary divergence within metabolic enzymes suggests that the survival and adaptation of the parasite within the host is a complex problem. This emphasizes the need for systems-level experiments to measure other features, such as UTR length, recombination rate, gene essentiality, and protein–protein interaction features, which are currently unavailable at an organismal level for Leishmania species, and to analyze their integrated effect. Integrating these diverse features can provide a more complete understanding of the strategies employed by the parasite for survival and virulence, which can help the community to combat this largely neglected tropical parasitic infection.

Conclusion

For the first time, we measure the relative contribution of eight inter-correlated genotypic and phenotypic predictors to the evolutionary rates of singleton metabolic genes and compare them across three Leishmania species. Codon usage, multi-functionality, and the flux-coupling potential of an enzyme independently constrain the evolution of metabolic genes in Leishmania. This seems to be a unique feature of Leishmania metabolic evolution that has not been reported previously. Our observations suggest that the occurrence of duplicated genes in novel subcellular locations can create new species-specific flux routes through certain singleton flux-coupled enzymes, thereby constraining their evolution. This observation supports a role for gene duplications in contributing to evolutionary innovations in Leishmania metabolism. Our results reveal that although Leishmania metabolic genes are very similar in sequence, the systems-level function of metabolic genes can affect metabolic enzyme evolution. The unique and common enzymes identified for the three species in our analysis were previously reported to have important biological roles in Leishmania metabolism and virulence. Moreover, some of these are experimentally reported pharmacological targets in related Leishmania species. Unique enzymes whose evolutionary rates are strongly affected by the dominant factors can help explain species-specificity and the reasons for within-host adaptation. Most importantly, these might be pursued as mechanisms to be targeted for in vivo control or as important causes of parasite visceralization.