Introduction

Glycine plays an essential biological role as a precursor for the synthesis of proteins, nucleic acids, and other metabolites. Most organisms produce glycine via serine hydroxymethyltransferase (SHMT), which uses the cofactor pyridoxal 5′-phosphate (PLP) to cleave serine into glycine and a C1 unit for tetrahydrofolate (THF)-dependent reactions (Fig. 1a, i). In some cases, glycine biosynthesis is supplemented by a second enzyme, threonine aldolase (TA) (Duckers et al. 2010; Franz and Stewart 2014; Liu et al. 2000a). TA is also PLP-dependent and cleaves threonine to produce glycine and acetaldehyde (Fig. 1a, ii). TAs have been isolated from various organisms including bacteria, fungi, and mammals, and were shown to be necessary for yeast glycine auxotrophy (McNeil et al. 1994). Putative TA homologues have also been found in protozoa, insects, and plants (Edgar 2005; Fesko et al. 2008; Jander et al. 2004).

Fig. 1

a Two glycine biosynthetic pathways that are catalyzed by SHMT (i) and TA (ii), respectively. b Structures of the four threonine stereoisomers

TAs are classified into l-TAs and d-TAs according to their specificity at the α-carbon of threonine (Fig. 1b) (Fesko et al. 2008; Liu et al. 2000a). l-TAs are specific for l-threonine and can be further divided into three subgroups: those that prefer l-threonine, those that specifically select l-allo-threonine, and those that have low specificity toward the β-carbon of l-threonine (Fesko et al. 2008; Liu et al. 2000a) (Fig. 1b). All known d-TAs are non-specific for the β-carbon of d-threonine (Fesko et al. 2008; Kataoka et al. 1997a; Liu et al. 1998b; Liu et al. 2000b). The non-specificity of TA catalysis raises an interesting question regarding the physiological roles of these enzymes. Because the reverse reaction of TA involves carbon–carbon bond formation that yields a β-hydroxy-α-amino acid with two adjacent chiral centers, TAs are of high interest in synthetic chemistry (Liu et al. 2000a). Moreover, these enzymes accept a wide variety of acceptor aldehydes and therefore make an important addition to the synthetic tool repertoire (Duckers et al. 2010; Franz and Stewart 2014). Notably, β-phenylserine, a threonine analog that carries a much larger phenyl group at the β position (instead of the methyl group in threonine), is a more active substrate than threonine in all the investigated cases (Kataoka et al. 1997b; Liu et al. 1998a; Liu et al. 1998c), which raises the question of whether threonine is a genuine natural substrate for TAs.

Intrigued by this class of enzymes, we report here a detailed phylogenetic investigation of TAs together with their closely related enzymes, SHMTs and alanine racemases (ARs) (Contestabile et al. 2001; Paiardini et al. 2003). Our results show that l-TAs are derived from two distinct families that share low sequence similarity with each other but likely have the same structural fold, suggesting convergent evolution of these enzymes. One TA family contains enzymes of both prokaryotic and eukaryotic origins, whereas the second TA family contains only prokaryotic enzymes. Phylogenetic analysis suggests that horizontal gene transfer may have occurred frequently during the evolution of both TA families, as the tree topology is highly inconsistent with the taxonomic classification of the host organisms. Our results indicate a complex evolutionary process for TAs and suggest an updated classification scheme for these enzymes.

Methods

Sequence Similarity Network Analysis

The amino acid sequences of TAs, SHMTs, and ARs were obtained from the National Center for Biotechnology Information (NCBI) sequence database (Pruitt et al. 2007) and are listed in Supplementary Table 1. To identify putative TAs, BlastP searches (Gish and States 1993) were performed using the protein sequences of biochemically characterized TAs as queries against the databases of different organisms (archaea, actinobacteria, cyanobacteria, firmicutes, proteobacteria, fungi, and other eukaryotes). Hits with expected values less than 1E-70 and query coverage >80 % were usually selected. Network analysis was performed by BlastP searches comparing each sequence against every other sequence. A VBA script was written to remove all the duplicate comparisons, and the result was imported into the Cytoscape software package (Cline et al. 2007). The nodes were arranged using the yFiles organic layout provided with Cytoscape version 2.8.3. The arrangements were slightly modified in some cases for better illustration.
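The duplicate-removal step can be illustrated with a minimal Python sketch. This is not the VBA script used in this study, which is not reproduced here. It assumes that the all-against-all BlastP results are available as tab-separated lines with the query in column 1, the subject in column 2, and the e-value in column 11; the function name `dedupe_hits`, the strict cutoff, and the choice to keep the smallest e-value for each pair are likewise assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple


def dedupe_hits(lines: Iterable[str],
                max_evalue: float = 1e-10) -> Dict[Tuple[str, str], float]:
    """Collapse all-against-all hits into one e-value per unordered pair.

    Assumes tab-separated rows: query in column 1, subject in column 2,
    e-value in column 11 (a hypothetical input layout for this sketch).
    """
    best: Dict[Tuple[str, str], float] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query == subject:
            continue  # a sequence compared with itself is not an edge
        if evalue >= max_evalue:
            continue  # keep only hits more stringent than the cutoff
        # A-vs-B and B-vs-A describe the same edge, so sort the pair.
        pair = tuple(sorted((query, subject)))
        # Assumption: keep the more significant of the reciprocal hits.
        if pair not in best or evalue < best[pair]:
            best[pair] = evalue
    return best
```

The resulting dictionary holds one entry per sequence pair and can be written out as an edge table for import into a network viewer.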

Phylogenetic Analysis

The same sequences of TAs and SHMTs (outgroup) used in the network analysis were aligned using ClustalX (Thompson et al. 1997) with iteration at each alignment step, and the alignment was manually fine-tuned afterward to minimize hypothetical insertion/deletion events. Bayesian Markov Chain Monte Carlo (MCMC) inference analyses were performed using the program MrBayes (version 3.2) (Ronquist et al. 2012). Final analyses consisted of two sets of eight chains each (one cold and seven heated), run for about 2 million generations with trees saved and parameters sampled every 100 generations. Analyses were run until convergence, defined as a standard deviation of split frequencies <0.01. Posterior probabilities were averaged over the final 75 % of trees (25 % burn-in). The analysis used a mixed amino acid model with a proportion of sites designated invariant (+I) and rate variation among sites modeled by a gamma distribution (+G) divided into eight categories, with all variable parameters estimated by the program from random starting trees. The figure of the Bayesian phylogram was prepared using MEGA5 (Tamura et al. 2011).
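To make the burn-in step concrete: sampling every 100 generations over about 2 million generations gives roughly 20,000 saved trees per run, of which the first 25 % are discarded. The following Python sketch shows how a posterior probability for a clade can be obtained as its frequency among the remaining trees. MrBayes summarizes these values itself; the sketch only illustrates the idea. Representing each sampled tree as a set of clades (each clade a frozenset of taxon names) and the function name `clade_support` are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Set

Clade = FrozenSet[str]


def clade_support(samples: List[Set[Clade]],
                  burnin_fraction: float = 0.25) -> Dict[Clade, float]:
    """Fraction of post-burn-in sampled trees that contain each clade."""
    # Discard the first part of the chain as burn-in (25 % by default).
    start = int(len(samples) * burnin_fraction)
    kept = samples[start:]
    if not kept:
        raise ValueError("no samples left after burn-in")
    # Count in how many retained trees each clade appears.
    counts = Counter(clade for tree in kept for clade in tree)
    return {clade: n / len(kept) for clade, n in counts.items()}
```

A clade that appears only in the discarded burn-in samples receives no entry, and a clade present in every retained tree receives a support of 1.0.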

Maximum likelihood analysis was performed using the program PhyML (Guindon and Gascuel 2003) with the WAG + I + G + F (Whelan and Goldman 2001) model. Gamma distribution was divided into eight categories and the tree topologies were estimated by SPR + NNI branch swapping, with 20 random starting trees. Branch support was determined by SH-like approximate likelihood-ratio test (aLRT) statistics (Anisimova and Gascuel 2006; Guindon and Gascuel 2003).

Results and Discussion

Sequence Similarity Network of TAs

Sequence similarity network analysis is a powerful and computationally economical method to depict the relationships among protein sequences (Atkinson et al. 2009; Lukk et al. 2012; Zhao et al. 2014). In a network, each node represents a protein sequence, and each edge (line) connects a pair of nodes (protein sequences) whose BlastP e-value is more stringent than a chosen cutoff value. To study the evolution of TAs and their relationship with other PLP-dependent enzymes, we constructed a sequence similarity network containing TAs, SHMTs, and ARs from different organisms (Fig. 2). In this analysis, l-TAs and d-TAs separate completely from each other even at a very relaxed cutoff value of 1E-10 (Fig. 2a), consistent with previous studies suggesting that l-TAs and d-TAs are structurally and evolutionarily different (Paiardini et al. 2003). Under the same cutoff value, l-TAs also separate from SHMTs and bacterial ARs, suggesting that, although l-TAs, SHMTs, and ARs are structurally and mechanistically closely related (Eliot and Kirsch 2004; Hayashi 1995), these enzymes have evolved from different ancestors.
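The way clusters emerge from such a network can be sketched in a few lines of Python. The example assumes that pairwise e-values are available as a dictionary keyed by sequence pairs and that every sequence in a pair is also listed in `nodes`; the function name `clusters_at_cutoff` is hypothetical. Clusters are taken to be the connected components of the network, found here with a simple union-find structure rather than with Cytoscape.

```python
from typing import Dict, Iterable, List, Set, Tuple


def clusters_at_cutoff(nodes: Iterable[str],
                       pair_evalues: Dict[Tuple[str, str], float],
                       cutoff: float) -> List[Set[str]]:
    """Group sequences into connected components at a given e-value cutoff."""
    parent = {n: n for n in nodes}

    def find(x: str) -> str:
        # Follow parent links to the representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), evalue in pair_evalues.items():
        # Draw an edge only if the hit is more stringent than the cutoff.
        if evalue < cutoff:
            parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = {}
    for n in parent:
        groups.setdefault(find(n), set()).add(n)
    # Sort for a reproducible order of clusters.
    return sorted(groups.values(), key=lambda g: sorted(g))
```

Calling the function repeatedly with progressively smaller cutoffs shows at which stringency a group of sequences splits into separate clusters, which is the comparison made across the panels of a network figure.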

Fig. 2

Sequence similarity network analysis of TAs, SHMTs, and ARs from different organisms with the cutoff e-values of 1E-10 (a), 1E-22 (b), 1E-70 (c), and 1E-95 (d), respectively. Nodes corresponding to SHMTs, ARs, and d-TAs are represented as SHMT, AR, and DTA, respectively; other nodes correspond to l-TAs

Intriguingly, l-TAs separate completely into two different clusters at a more stringent cutoff value of 1E-22 (Fig. 2b), suggesting that l-TAs are derived from two different origins. Detailed examination of the network indicated that sequence identities are normally above 40 % for proteins within a cluster and below 20 % for proteins from different clusters. The first cluster (cluster A) consists of diverse enzymes, including fungal ARs and l-TAs of both prokaryotic and eukaryotic origins (Fig. 2b). Biochemically characterized enzymes in this cluster include the l-TA from Aeromonas jandaei (l-TAaj), which is specific for l-allo-threonine (Kataoka et al. 1997b; Liu et al. 1997a; Qin et al. 2014), and low-specificity l-TAs from Escherichia coli (l-TAe) (di Salvo et al. 2014; Liu et al. 1998a), Saccharomyces cerevisiae (l-TAsc) (Liu et al. 1997b), and Thermotoga maritima (l-TAtm) (Kielkopf and Burley 2002). l-TAaj, l-TAe, and l-TAtm have also been structurally characterized (di Salvo et al. 2014; Kielkopf and Burley 2002; Qin et al. 2014). The second cluster (cluster B) contains only prokaryotic l-TAs (Fig. 2b), including a low-specificity l-TA from Pseudomonas sp. NCIMB 10558 (l-TAps) (Liu et al. 1998c) and two enzymes from Pseudomonas aeruginosa (l-TApa) and Pseudomonas putida (l-TApp) that prefer l-threonine (Fesko et al. 2008). So far, no structure has been reported for a cluster B enzyme. Further decreasing the cutoff value led to the separation of fungal ARs and fungal l-TAs from cluster A, but cluster B did not separate further, even when the cutoff was decreased to a very stringent value of 1E-95 (Fig. 2c, d). Multiple sequence alignment of selected l-TAs shows that, although enzymes from one cluster are clearly different from those of the other cluster, both clusters share many conserved residues (including those constituting the active site) and likely possess the same structural fold (Supplementary Fig. 1). These results suggest that l-TAs have evolved convergently from two ancestral families.
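The within-cluster and between-cluster identity values can be computed from a pairwise or multiple alignment. The Python sketch below is one possible definition and not necessarily the one used in this study: it assumes that the two sequences are already aligned to equal length, that gaps are written as "-", and that identity is counted only over columns in which neither sequence has a gap. The function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # assumption: gapped columns are excluded
        compared += 1
        if x == y:
            matches += 1
    # Return 0.0 when no column can be compared, to avoid division by zero.
    return 100.0 * matches / compared if compared else 0.0
```

Other conventions, such as dividing by the length of the shorter sequence or by the full alignment length, give different numbers, so the chosen definition should be stated whenever identity thresholds are compared.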

Catalytic specificity is highly diverse among cluster A enzymes, ranging from very stringent specificity for l-allo-threonine (l-TAaj) to high tolerance with regard to the β-position (l-TAe and l-TAsc). We note that all the above-mentioned enzymes prefer l-allo-threonine over l-threonine to a certain extent (Kataoka et al. 1997b; Liu et al. 1998a; Qin et al. 2014). In contrast, all three biochemically characterized cluster B enzymes (l-TAps, l-TApa, and l-TApp) prefer l-threonine, although the specificity was reported to be very low for l-TAps (Liu et al. 1998c). It remains to be investigated whether the two clusters of l-TAs have different substrate specificities toward the β-position (i.e., cluster A enzymes may generally prefer l-allo-threonine, whereas cluster B enzymes may generally prefer l-threonine).

Phylogenetic Analysis

To test the proposal that l-TAs have evolved from two ancestral families, we performed phylogenetic analysis of l-TAs using the Bayesian MCMC method (Mau et al. 1999). The analysis includes all the enzymes from cluster A and cluster B, with SHMTs serving as the outgroup (because of the very low sequence similarities of d-TAs and bacterial ARs with l-TAs, these two classes of enzymes were not included in the phylogenetic analysis). Indeed, l-TAs separate into two clusters with good statistical support (Fig. 3). Detailed examination of each enzyme in the Bayesian MCMC tree showed that the two clusters correspond well with clusters A and B in our sequence similarity network (Fig. 3 and Supplementary Table 1), further supporting the view that l-TAs are derived from two different origins. We also constructed a phylogenetic tree using a maximum likelihood (ML)-based method; this tree is very similar to the Bayesian MCMC tree and therefore confirms the robustness of our analysis (Supplementary Fig. 2). We noted that in many cases, enzyme phylogeny does not correlate with host taxonomy (e.g., enzymes from cyanobacteria and firmicutes are found in many branches of the tree and do not form separate clusters) (Fig. 3 and Supplementary Fig. 2). This observation suggests that horizontal gene transfer might have occurred frequently during TA evolution. Another interesting finding is that no significant divergence is observed between high- and low-specificity enzymes. For example, l-TApp is believed to be l-threonine specific (Fesko et al. 2008; Liu et al. 1998c) but is phylogenetically closely related to the low-specificity enzyme l-TAps (Fig. 3 and Supplementary Fig. 2). In addition, l-TAaj, which is specific for l-allo-threonine (Kataoka et al. 1997b), is not far from the low-specificity enzyme l-TAe (Liu et al. 1998a) (Fig. 3 and Supplementary Fig. 2). These observations suggest the possibility that many TAs may be engineered, using techniques such as random PCR mutagenesis, DNA shuffling, or chemical modification, to significantly alter their catalytic specificity.

Fig. 3

Bayesian MCMC tree of l-TAs with SHMTs serving as the outgroup. l-TAs are shown as filled circles colored according to the taxonomy of their host; SHMTs are shown as unfilled circles. Bayesian posterior probabilities are shown only for the major branches. The tree topology was supported by a parallel ML-based analysis shown in Supplementary Fig. 2

Conclusions

We performed sequence similarity network and phylogenetic analyses on TAs and their closely related PLP-dependent enzymes. We show that SHMTs, bacterial ARs, d-TAs, and l-TAs are derived from different origins within the same cluster, and that l-TAs are further grouped into two evolutionarily distinct families. Our results suggest a convergent evolutionary process for l-TAs and underscore the biological importance of these enzymes. Because horizontal gene transfer appears to have occurred frequently during TA evolution, and because the distribution of these enzymes is highly diverse among organisms (e.g., many organisms do not have a TA-encoding gene in the genome, whereas many others contain more than one TA gene), it is likely that TAs are involved in some secondary metabolic pathways besides glycine biosynthesis. The fact that several l-threonine analogs are more active substrates of l-TAs than l-threonine is consistent with this proposal, which needs to be tested further. Our analysis also suggests the potential of engineering the catalytic specificity of TAs and of screening for novel TAs with desired activity by sampling sequence space that has not yet been tapped (e.g., enzymes from archaea). These studies are of particular interest because the low diastereospecificity of TAs is currently the main hurdle in using these enzymes for synthetic applications.