Introduction

Mitogen-activated protein kinases (MAPKs) play pivotal roles in signal transduction pathways and have been identified in all eukaryotes examined thus far. MAPKs are part of the eukaryotic protein kinase (ePK) superfamily, which consists of seven major groups: (1) the cyclic nucleotide and calcium/phospholipid-dependent kinases (the AGC group); (2) the CMGC group, comprising the (2a) cyclin-dependent kinases (CDK), (2b) MAPKs, (2c) glycogen-synthase kinases (GSK), and (2d) CDK-like kinases; (3) calmodulin-dependent kinases (CaMK); (4) tyrosine kinases (TyrK); (5) the casein kinase 1 (CK1) group; (6) the STE group, which includes many signaling molecules functioning in MAPK pathways; and (7) the tyrosine kinase-like (TKL) group, which includes enzymes that are related to those in the TyrK group although they are serine-threonine protein kinases (Hanks 2003; Hanks and Quinn 1991).

All MAPKs are proline-directed serine/threonine kinases that phosphorylate substrates with a proline residue in the P + 1 site of the substrate recognition consensus motif ψX[S,T]P, where ψ represents either proline or an aliphatic amino acid (Clark-Lewis et al. 1991). All MAPKs must possess the [LIVM][TS]XX[LIVM]XT[RK][WY]YRXPX[LIVM][LIVM] signature sequence, where X represents any amino acid (Kültz 1998). Historically, mammalian MAPKs have been classified as extracellular signal-regulated kinases (ERKs), c-Jun amino-terminal kinases (JNKs), or p38 stress-response MAPKs, based on whether the central amino acid in the TXY dual phosphorylation loop (Roux and Blenis 2004) is a glutamic acid, proline, or glycine, respectively (Martin-Blanco 2000).
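The two sequence criteria described above can be illustrated with a minimal sketch. It assumes the signature is read as a regular expression with X matching any residue; the function names and the toy test sequence are hypothetical and are not taken from the original analysis.

```python
import re

# Signature sequence from the text; X (any amino acid) is written as "." here.
MAPK_SIGNATURE = re.compile(r"[LIVM][TS]..[LIVM].T[RK][WY]YR.P.[LIVM][LIVM]")

# Central residue of the TXY loop mapped to the traditional mammalian class.
TXY_CLASS = {"E": "ERK", "P": "JNK", "G": "p38"}


def has_mapk_signature(protein: str) -> bool:
    """Return True if the protein contains the MAPK signature sequence."""
    return MAPK_SIGNATURE.search(protein.upper()) is not None


def classify_by_txy(motif: str) -> str:
    """Classify a three-residue activation-loop motif by its central residue."""
    motif = motif.upper()
    if len(motif) != 3 or motif[0] != "T" or motif[2] != "Y":
        return "non-TXY"
    # TXY motifs with any other central residue (e.g., TDY) fall outside
    # the three traditional mammalian classes.
    return TXY_CLASS.get(motif[1], "unclassified TXY")
```

Under this reading, a TDY motif is reported as an unclassified TXY motif, whereas a TGH motif is reported as non-TXY.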

In 1998, Kültz produced a comprehensive phylogenetic and functional analysis of all MAPK sequences that had been identified at that time. This analysis included 93 sequences, primarily representing animals, plants, and fungi. Based on the early divergence of organisms within these three kingdoms, the MAPK family of proteins is thought to have evolved 1.0 to 1.5 billion years ago. Kültz’s analysis determined that the MAPK family could be subdivided into three major subgroups: the stress-activated protein kinases (SAPKs), the extracellular signal-regulated kinases (ERKs), and a third group, which he called the MAPK3 subgroup. The first two groups reproduced the traditional dichotomy, with members of the ERK subgroup sharing the TEY dual phosphorylation motif and members of the SAPK subgroup represented by either the TGY or TPY motifs. The final group largely consisted of “oddball” sequences that did not share a high degree of similarity with either of the other two MAPK subgroups. Among the sequences in the MAPK3 group were five protozoan sequences: two from Trypanosoma brucei, one from Leishmania mexicana, and two from Plasmodium falciparum. These were intermingled with unusual human and rat MAPKs, as well as sequences from Caenorhabditis elegans and Dictyostelium discoideum. Kültz’s phylogenetic results for the protozoa indicated little similarity among most of these sequences, and there was no evidence to suggest the existence of one or more distinct protozoan subfamilies.

As increasing numbers of MAPKs have been discovered in organisms of diverse evolutionary origin, it has become clear that a more robust classification system is needed. Since the time of Kültz’s study, there has been considerable progress in the discovery and characterization of protozoan MAPKs, particularly with respect to the Apicomplexa. Additional MAPK sequences have been identified in P. falciparum (Carlton et al. 2002), Cryptosporidium hominis (Xu et al. 2004), Cryptosporidium parvum (Abrahamsen et al. 2004), and Theileria annulata (Pain et al. 2005), and we have recently characterized a MAPK in Toxoplasma gondii (Brumlik et al. 2004). Motivated by our identification of a second MAPK in T. gondii, we have analyzed a collection of Apicomplexan MAPK sequences along with a diverse sample of other protozoan and non-protozoan MAPKs. The phylogenetic results presented in this report offer a reevaluation of Kültz’s overly broad “MAPK3” subgroup, with our findings strongly supporting the existence of at least two distinct protozoan subfamilies.

Materials and Methods

Toxoplasma gondii Sequence Identification and Selection

We translated the T. gondii genomic database (Release 3.0; http://www.toxodb.org), which affords 10-fold coverage of the avirulent Me49 genome, in all possible open reading frames (ORFs) that were greater than 50 codons in length. In our previous study, we used the highly conserved MAPK motif HRDLKP to screen for tgMAPK1, which was the first MAPK gene to be identified and characterized in T. gondii (Brumlik et al. 2004). In the current study we wished to identify any possible homologues of tgMAPK1 using a slightly less stringent screen, so we rescreened the T. gondii genomic database for genes encoding an HRD[LIVM]KP motif instead. To ensure that the presence of introns did not obscure our screen, we additionally screened the genome using other highly conserved MAPK motifs (GXGXXG, DFGLXR, RWYRXP), providing complete coverage of all possible ORFs in the T. gondii genome. This screening strategy identified a second MAPK gene candidate (designated tgMAPK2), which possessed the classical TXY dual phosphorylation motif (Hanks and Quinn 1991; Roux and Blenis 2004). We subsequently cloned this gene by reverse-transcriptase polymerase chain reaction (RT-PCR), using total RNA that was isolated from both the avirulent Me49 strain and the virulent RH strain. The primers used for PCR are listed in Table 1, and RT-PCR followed the methodology we described previously (Brumlik et al. 2004). Following gel purification, amplicons were either directly sequenced using an ABI 377 Automated Sequencer (Applied Biosystems, Foster City, CA) or first cloned into a pCR2.1-TOPO vector (Invitrogen, Carlsbad, CA) prior to sequencing. Primers used for RT-PCR were also utilized for sequencing (Table 1). We additionally used the forward and reverse sequencing primers that were provided with the pCR2.1-TOPO vector kit (Invitrogen).
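The motif screen described above can be sketched as follows. This is an illustration only, not the software used in the study: it assumes that an ORF is a stop-to-stop stretch in one of the six reading frames, that the standard genetic code applies, and that only the HRD[LIVM]KP motif is screened. All function names are hypothetical.

```python
import re
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")
SCREEN_MOTIF = re.compile(r"HRD[LIVM]KP")  # relaxed motif from the text
MIN_CODONS = 50                            # ORFs must exceed this length


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate DNA in frame 0; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_orfs(dna: str, min_codons: int = MIN_CODONS):
    """Yield stop-to-stop ORFs longer than min_codons from all six frames."""
    dna = dna.upper()
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            for orf in translate(strand[frame:]).split("*"):
                if len(orf) > min_codons:
                    yield orf


def screen(dna: str) -> list:
    """Return the translated ORFs that contain the screening motif."""
    return [orf for orf in six_frame_orfs(dna) if SCREEN_MOTIF.search(orf)]
```

Because introns can split a motif across exons, a screen of this kind on genomic DNA can miss genes, which is the reason given above for repeating the search with several independent conserved motifs.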

Table 1 Primers used in this study

Sequence Selection

Representative sequences from plants, animals, and fungi were chosen from major subfamilies in the ERK and SAPK families to reflect the diversity within each of these taxonomic groups, identified by previously reported accession numbers (Kültz 1998). To incorporate recently identified MAPKs, we used the predicted amino acid sequences encoded by the tgMAPK1 and tgMAPK2 genes to perform BLASTp analyses against all sequence databases available at the National Center for Biotechnology Information (Altschul et al. 1997). We also performed BLASTp searches of the Apicomplexan Database Resources (http://www.apidb.org), an NIH/NIAID-funded Bioinformatics Resource Center, to search for additional MAPKs not yet included in the NCBI databases. Our search of these databases detected a third putative T. gondii MAPK, which has recently been annotated (T. gondii genomic database; release 3.0; http://www.toxodb.org). This sequence has a TGH motif and can therefore be activated only by a single phosphorylation event at its threonine. We have designated this MAPK TgMAPK3 for the purposes of this analysis.

After filtering out those sequences that did not possess all required residues within the MAPK signature sequence and accounting for duplication in the databases, we were initially left with a very large assortment of over one hundred putative protozoan MAPK sequences. To avoid the inclusion of nearly identical sequences, we restricted this set to include only a single representative species for each genus (P. falciparum for Plasmodium, L. major for Leishmania, C. hominis for Cryptosporidium, etc.), resulting in a collection of 35 protozoan sequences. Since our primary focus was the analysis of the three T. gondii MAPKs and their Apicomplexan MAPK homologues, we selected sequences from other well-studied protozoa based on their similarity to TgMAPK1, TgMAPK2, or TgMAPK3. An analysis based on pairwise sequence similarity identified two sequences in L. major, two sequences in T. brucei, and two sequences in Tetrahymena thermophila as good candidates, and we included these six sequences in our analysis as representatives from non-Apicomplexan protozoan species. The final selection of 32 protozoan and nonprotozoan MAPK sequences is provided in Table 2.

Table 2 MAPK sequences analyzed

Alignment

Sequences were aligned using ClustalW version 1.83 (Thompson et al. 1994). Due to the considerable variation in sequence length among the 32 sequences aligned (see Table 2) and the sensitivity of our alignment to slight changes in the ClustalW gap and mismatch penalty parameters within certain nonconserved regions, blocks of unambiguously aligned positions were selected using Gblocks (Castresana 2000) and the alignment was manually adjusted using BioEdit 7.0.5.2 (Hall 1999). Allowed substitution groups for conserved positions were [LIVM], [DN], [EQ], [ST], [KR], and [FYW], as previously described (Kültz 1998).
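As a minimal sketch of how conserved positions can be scored under the allowed substitution groups listed above, the following function tests each alignment column. Treating any column that contains a gap as not conserved is an assumption made here for illustration, and the function names are hypothetical.

```python
# Allowed substitution groups for conserved positions, as listed in the text.
ALLOWED_GROUPS = [set("LIVM"), set("DN"), set("EQ"),
                  set("ST"), set("KR"), set("FYW")]


def column_is_conserved(column) -> bool:
    """True if the column is invariant or falls within one allowed group."""
    residues = set(column)
    if "-" in residues:  # assumption: a gapped column is not conserved
        return False
    if len(residues) == 1:
        return True
    return any(residues <= group for group in ALLOWED_GROUPS)


def conserved_positions(alignment) -> list:
    """Return the 0-based indices of conserved columns.

    `alignment` is a list of equal-length aligned sequences (strings).
    """
    return [i for i, column in enumerate(zip(*alignment))
            if column_is_conserved(column)]
```

Dividing the number of indices returned by the alignment length gives the fraction of conserved positions for any chosen subset of sequences.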

Phylogenetic Analysis

Two sets of phylogenetic trees, the first including 10 Apicomplexan sequences (A-Seqs) and the second including 28 selected protozoan, plant, yeast, and metazoan sequences (B-Seqs), were constructed by the methods of maximum likelihood (ML), neighbor-joining (NJ), and Bayesian inference (BI) using PHYLIP version 3.63 (Felsenstein 1989) and MrBayes version 3.1.2 (Ronquist and Huelsenbeck 2003). Phylogenetic results were visualized using the TreeView package (Page 1996).

For the ML analysis, the PROML algorithm was used to estimate the topology and relative branch lengths. The Jones-Taylor-Thornton (JTT) transition matrix (Jones et al. 1992) was selected as the amino acid substitution model, with rate variation across sites modelled according to the gamma distribution along with an invariant class (gamma+I). Five categories were selected to approximate the gamma distribution for rate variation, with one reserved for the invariant sites. The gamma rate α and the proportion of invariant sites p were estimated by ProtTest version 1.2.6 (Abascal et al. 2005), with α = 1.52 and p = 0.129 for the A-Seqs and α = 1.41 and p = 0.135 for the B-Seqs. In each case, we ran the PROML algorithm specifying a slow and thorough analysis, global rearrangements, and random jumbling of the sequences three times. Under the same modelling assumptions, we then ran the PROML algorithm on 100 bootstrap replicates that were obtained using SEQBOOT (Felsenstein 1985) and constructed the majority consensus tree (Margush and McMorris 1981) using CONSENSE. Global rearrangements and multiple sequence jumbling were not selected for the bootstrap datasets because of the considerable computational time that was required for each run.

For the NJ analysis, distances were first estimated using PROTDIST according to the JTT distance matrix with a gamma correction for rate variation. Parameter estimates for the gamma rate variation model (without invariant sites), obtained using ProtTest, were α = 0.935 for the A-Seqs and α = 0.793 for the B-Seqs. Phylogenies were constructed using the NJ algorithm (Saitou and Nei 1987) as implemented in NEIGHBOR. Five hundred bootstrap replicates were obtained using SEQBOOT and analyzed using the same modeling assumptions, and the majority consensus tree was produced using CONSENSE.
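Both the ML and NJ analyses rely on bootstrap replicates of the alignment. The resampling step can be sketched as follows: columns are drawn with replacement until the replicate has as many columns as the original alignment. This is an illustration of the general nonparametric bootstrap, not the SEQBOOT implementation, and the function names and seed handling are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one replicate by sampling alignment columns with replacement."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Return n_replicates resampled alignments, reproducible for a given seed."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]
```

A tree is then inferred from each replicate, and the bootstrap support for a clade is the percentage of replicate trees in which that clade appears.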

The Bayesian analysis was run using MrBayes with default priors. The JTT model with gamma+I rate variation was specified a priori, and four gamma rate categories were used to approximate the distribution. For the A-Seqs, 2 runs of 4 chains each (3 heated, 1 cold) were run for 200,000 generations, with sampling every 100th generation, producing 2000 samples from the posterior distribution, of which the first 500 were discarded as burn-in. The final standard deviation of the split frequencies between the two runs was 0.0036, and mean parameter estimates for the gamma+I model were α = 1.669 and p = 0.139. For the B-Seqs, 2 runs of 4 chains each (3 heated, 1 cold) were run for 500,000 generations, with sampling every 100th generation, and the first 1000 of the resulting 5000 samples were discarded as burn-in. The final standard deviation of the split frequencies was 0.005 with mean parameter estimates of α = 1.403 and p = 0.157.
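The sample counts quoted above follow from simple bookkeeping, sketched here for a single run. The function name is hypothetical, and the calculation assumes one sample per sampling interval, as in the counts reported in the text.

```python
def retained_samples(generations: int, sample_every: int, burn_in: int):
    """Return (total samples, samples kept, burn-in fraction) for one run."""
    total = generations // sample_every
    if burn_in > total:
        raise ValueError("burn-in exceeds the number of samples")
    return total, total - burn_in, burn_in / total


# A-Seqs: 200,000 generations sampled every 100th, 500 discarded as burn-in.
a_total, a_kept, a_fraction = retained_samples(200_000, 100, 500)

# B-Seqs: 500,000 generations sampled every 100th, 1000 discarded as burn-in.
b_total, b_kept, b_fraction = retained_samples(500_000, 100, 1000)
```

With these settings, 1500 of 2000 samples are retained for the A-Seqs (25% burn-in) and 4000 of 5000 for the B-Seqs (20% burn-in).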

Results

Identification of a Second T. gondii MAPK Gene, Designated tgMAPK2

We have previously described the cloning of tgMAPK1 from T. gondii and the preliminary characterization of its gene product, a 58-kDa stress-response MAPK (Brumlik et al. 2004). In this report, we have identified a 2.0-kilobase pair (kb) gene, designated tgMAPK2, predicted to encode a 683-amino acid, 73-kDa protein in the virulent RH T. gondii strain (Fig. 1). A stop codon is situated 219 base pairs (bp) upstream from a plausible initiation (ATG) codon and accompanying T. gondii translational initiation site (Seeber 1997), with no alternative initiation codon residing in the intervening region. Moreover, a region predicted to encode a conserved MAPK nucleotide-binding domain closely follows the initiation codon (starting at G20XG22XXG25), consistent with our assignment of the tgMAPK2 initiation codon. The deduced amino acid sequence encoded by the first 305 codons of the tgMAPK2 gene shares 50% amino acid sequence identity with T. gondii TgMAPK1 (Brumlik et al. 2004) throughout all 12 of the conserved domains that are characteristic of MAPKs (Kültz 1998). Notably, the deduced amino acid sequence of tgMAPK2 shares 64% amino acid sequence identity with human ERK8 (Fig. 2). The TDY(178–180) motif is typical of other protozoan MAPKs (with an acidic amino acid between the threonine and the tyrosine residues). Interestingly, the tgMAPK2 gene from the virulent RH strain encodes 11 copies of a VSSSHHG repeat in the C-terminal coding region extension, versus nine copies in tgMAPK2 from the avirulent Me49 strain. In addition, three nonneutral amino acid substitutions are predicted to be encoded by the virulent (RH) T. gondii tgMAPK2 sequence compared to the avirulent (Me49) T. gondii tgMAPK2 sequence. The large number of strain-specific nucleotide polymorphisms in T. gondii is unusual. A total of 46 nucleotides differ between tgMAPK2 in RH and Me49, which conceivably could lead to measurable differences in the biological activities of their encoded TgMAPK2 variants (Fig. 1).

Fig. 1

Nucleotide sequence of the Toxoplasma gondii RH tgMAPK2 (GenBank accession no. DQ115400). Nucleotide sequencing of both strands of the 2.3-kb region that is shown was performed using an ABI 377 Automated Sequencer (Applied Biosystems, Foster City, CA). Primers that were used in this study are described in Table 1. Their positions are indicated with half-arrows, and numbers in parentheses identify each primer. Lowercase letters delineate the 5′ and 3′ boundaries of the coding region. Uppercase letters denote the coding region except where individual nucleotide substitutions within this region were identified in the RH sequence relative to the corresponding Me49 sequence. Nucleotide sequence differences between the RH and the Me49 strain tgMAPK2 genes, which were predicted to encode altered codons, are highlighted in gray. The predicted amino acid sequence encoded by the RH tgMAPK2 gene is shown beneath each codon using the one-letter amino acid code, with the TDY(178–180) motif and the G20XG22XXG25 nucleotide-binding motif highlighted in black. Pertinent stop codons are indicated by asterisks. Perfect and imperfect VSSSHHG repeats are underlined using solid and dashed lines, respectively. Nucleotide (nt) and amino acid (aa) position numbers are indicated at the right.

Fig. 2

Alignment of ten Apicomplexan MAPK sequences. The subset of 10 Apicomplexan sequences extracted from the full alignment of 32 sequences is shown, along with the Homo sapiens ERK8 sequence as a reference. The amino acid position numbers are listed at the right. The MAPKs shown have highly divergent sequences at their C-termini, making it necessary to truncate all sequences following the last conserved residue (Y-305) based on the TgMAPK2 amino acid sequence. Filled circles (•) indicate conserved positions across all 32 sequences. Invariant positions for protozoan MAPKs relative to metazoan MAPKs (within allowed substitution groups) are highlighted in white type over a black background. Positions shaded in grey are conserved within an Apicomplexan subfamily. Three distinctly different subfamilies emerge from this alignment and are indicated by the subdivision in each row. Deleted regions are indicated in boxes, with numbers indicating the corresponding deletion start and end positions. Open circles (○) indicate positions lacking any deletion for a particular sequence.

Sequence Alignment

The final alignment of the 32 sequences, which consisted only of regions for which the majority of the sequences could be unambiguously aligned as selected by Gblocks, included 255 positions. The subset of the alignment including the 10 Apicomplexan sequences of central interest to this study is shown in Fig. 2, with the human ERK8 sequence included for reference. Seventy-three positions are invariant or conserved within allowed substitution groups for all 32 sequences included in the alignment. We identified 34 additional positions that are conserved in all 10 of the Apicomplexan sequences, together accounting for 42.1% of the aligned positions. Between the sequences TgMAPK1 and ChMAPK, 77.1% of the positions were conserved, with 70.4% of the positions conserved among the quartet including TgMAPK2, ChMAPK1, PfMAPK1, and TaMAPK and 72.3% of the positions conserved among the non-TXY quartet including TgMAPK3, ChMAPK2, PfMAPK2, and TaMAPK2.

Phylogenetic Analysis

We first focused on the 10 Apicomplexan sequences, constructing unrooted trees using the methods of BI, ML, and NJ as described. The Bayesian consensus topology and the posterior probabilities associated with each node are shown in Fig. 3. The phylogeny groups the Apicomplexan sequences into the three clades suggested by the alignment: ApiMAPK1 (TgMAPK1, ChMAPK), ApiMAPK2 (TgMAPK2, ChMAPK1, PfMAPK1, TaMAPK), and ApiMAPK3 (TgMAPK3, ChMAPK2, PfMAPK2, TaMAPK2). The ML and NJ methods both reproduced this trichotomy, with bootstrap support values for (ApiMAPK1, ApiMAPK2, ApiMAPK3) of (76%, 100%, 100%) and (95%, 100%, 100%) respectively. The topology of the ApiMAPK3 clade was also consistently reconstructed by all three methods. There was, however, some discrepancy among the methods regarding the arrangement of the four sequences within ApiMAPK2, with the NJ analysis suggesting an alternative topology. The ML phylogeny was in agreement with the BI topology for ApiMAPK2, but the bootstrap support was low (values for the two internal nodes were 57% and 62%) and removing the two internal edges only decreased the log-likelihood from −3565.1 to −3570.0, an insignificant difference according to the Kishino-Hasegawa-Templeton test. We have therefore dashed the internal edges within this clade to indicate that the consensus among the three methods is represented by a tetrachotomy at the root of ApiMAPK2.

Fig. 3

Bayesian consensus tree for Apicomplexan MAPK sequences. Branch lengths indicate relative distances between sequences, estimated using MrBayes under the JTT model with the gamma+I model for rate variation. Posterior probabilities are included at each node. Solid edges indicate agreement between the BI, NJ, and ML analyses, while discrepancies in topology among the three methods are indicated by thin dashed lines.

After confirming the three Apicomplexan clades as suggested by the alignment, we selected the pairs (TgMAPK1, ChMAPK), (TgMAPK2, ChMAPK1), and (TgMAPK3, ChMAPK2) as representatives of ApiMAPK1, ApiMAPK2, and ApiMAPK3, respectively. These sequences were analyzed with 6 other protozoan sequences and 16 sequences from mammals, plants, yeast, and other metazoans (see Table 2). The consensus tree from the Bayesian analysis is shown in Fig. 4. This tree has a central tetrachotomy and reproduces the (YSAPK, SAPK) and (YERK, ERK) clades identified by the Kültz analysis. The third major clade joins metazoan ERK8-like sequences (HsERK8, RnERK7, CeERK8, DmERK8) to a monophyletic group including the representative Apicomplexan sequences from ApiMAPK2 (TgMAPK2 and ChMAPK1) and the protozoan pair (TbMAPK2, LmMAPK2). We call this protozoan clade pzMAPK2. The posterior probability of the node creating the pzMAPK2/ERK8 clade is 1.00, and this clade is separated from the central tetrachotomy by a long internal edge. The fourth clade includes the plant MAPK sequences (PERK) and the remaining protozoan sequences, joining these clades with posterior probability 0.99. Within the protozoa, the two representative sequences from the ApiMAPK3 are joined with TtMAPK3 (another sequence with the TGH motif) with posterior probability 1.00, creating a clade that we call pzMAPK3. The pair (LmMAPK4, TbMAPK4) is then joined with this clade with posterior probability 0.78, while TtMAPK1 is joined with the ApiMAPK1 sequences with posterior probability 0.88. We call the collection of five non-TXH protozoan sequences pzMAPK1.

Fig. 4

Bayesian consensus tree for protozoan and metazoan MAPK sequences. Branch lengths indicate relative distances between sequences and posterior probabilities are included at each node. Solid edges indicate agreement in topology between the BI, NJ, and ML analyses. Thin dashed lines indicate discrepancies in topology among the methods, while heavy dashed lines indicate agreement in topology across the methods but with bootstrap support below 60% for both the maximum likelihood (ML) and neighbor-joining (NJ) methods.

The ML and NJ analyses produced trees consistent with the BI consensus tree shown in Fig. 4, with the exception that both methods suggested a different topology for the four sequences within the PERK group (a bifurcating root node connecting the pairs [AtMPK1, CrMAPK] and [OsMAP1, NtWIPK]). However, none of the three methods strongly supported a specific PERK topology (the BI posterior probability was 0.58, while bootstrap support values for both the NJ and the ML nodes were under 60%), suggesting that the PERK topology is best represented by a tetrachotomy. We have indicated this in Fig. 4 by dashing the internal edges within the PERK clade.

While the three methods all agreed on the topology of the tree outside of the PERK clade, the bootstrap support values from the ML and NJ analyses were far lower than the BI posterior probability values for certain nodes. Such discrepancies between bootstrap support values and Bayesian posterior probabilities are commonly observed in practice (Huelsenbeck et al. 2002). Simulations have determined that while bootstrap values may be slightly conservative in their support for true interior nodes, Bayesian methods are more likely to assign high posterior probabilities to incorrect nodes (Alfaro et al. 2003; Douady et al. 2003). Furthermore, Bayesian methods may assign high posterior probabilities to arbitrary resolutions of polytomies (Lewis et al. 2005; Suzuki et al. 2002). These findings suggest that high posterior probabilities that are unsupported by bootstrap values should not be overly emphasized in practice.

Concerning our analysis, respective ML and NJ bootstrap support values for the node joining ApiMAPK1 to TtMAPK1 were 37% and 44%, and for the node joining the pzMAPK3 clade to the (LmMAPK4, TbMAPK4) pair the respective support values were 52% and 35%. The decrease in log-likelihood between the BI consensus tree shown in Fig. 4 (which has a log-likelihood of −8496.8) and a tree in which the internal edges in the protozoan clade were shortened to length 0 to create a tetrachotomy joining TtMAPK1, ApiMAPK1, the pair (LmMAPK4, TbMAPK4), and the pzMAPK3 clade was only 6.0, with a p-value of 0.456 according to the Shimodaira-Hasegawa test as implemented in PROML. For this reason, we consider the topology of the pzMAPK1 clade to be somewhat ambiguous and do not attach much significance to the association of (LmMAPK4, TbMAPK4) with the non-TXY sequences in pzMAPK3. The bootstrap support for the node separating the pzMAPK1/pzMAPK3 clade from PERK was also weak, with support values of 33% for both the ML and NJ analyses. Finally, the methods did not agree in the magnitude of their support for the joint PERK-protozoan clade. While the BI posterior probability for this node was 0.99, the ML and NJ bootstrap support values were only 50% and 25%, respectively. These results suggest that the phylogenetic connection between plant MAPKs and the pzMAPK1/pzMAPK3 protozoan MAPKs may be somewhat tenuous and should be further investigated.

To reflect the differences in statistical support among the methods, the edges joining the nodes that were consistently reproduced by all three methods but with less than 60% bootstrap support in both the ML and the NJ analyses are indicated by heavily dashed lines in Fig. 4. While these dashed edges demonstrate our lower level of confidence for the topology in this region of the tree, we do not mean to suggest that the simultaneous removal of all of these edges would have a negligible impact on our results. In fact, doing so decreases the log-likelihood by 43.1, a significant difference (p = 0.002).

Discussion

Traditionally, MAPK sequences have been categorized by their TXY activation loop signature sequence (Roux and Blenis 2004). Kültz (1998) modified this definition to include several distantly related proteins, which instead possess diverse [T/S]XX motifs, including THE and SEG, that can obviously only be phosphorylated at the T/S position. The inclusion of non-TXY sequences resulted in the formation of the Kültz MAPK3 subgroup, which included animal MAPKs with the SEG motif, a C. elegans MAPK with the THE motif, and protozoan sequences with TDY and TEY motifs. Within this subgroup, the three animal SEG sequences (two human and one rat) formed the “MAPK3 subfamily” (Kültz 1998). These three sequences were highly similar, with 86% identity among the positions shown in Kültz’s alignment figure, but had little similarity to the protozoan sequences: only 32% of the positions were conserved across the larger MAPK3 subgroup, a figure which includes the 25% of positions that were invariant in all 93 MAPKs considered by Kültz. Furthermore, the phylogenetic evidence for the existence of the larger MAPK3 subgroup was weak, with only 19% bootstrap support. We believe that Kültz’s inclusion of non-TXY sequences distorted his results, and that the classification of the protozoan sequences within the MAPK3 subgroup was simply an artifact of long-branch attraction.

The broadening of the definition of MAPKs to include non-TXY sequences has also resulted in the addition of protozoan sequences to the MAPK family. Sequences such as Pfmap-2 from P. falciparum (with a TSH motif instead) have been classified as “atypical MAPKs” (Dorin et al. 1999), and in a recent study of the Plasmodium berghei Pb-map2 MAPK, Tewari and coworkers (2005) analyzed a collection of non-TXY protozoan MAPKs. This analysis phylogenetically distinguished these sequences from other protozoan and metazoan MAPKs and noted that these “atypical MAPKs” seem to be restricted to the Alveolata. Our present analysis supports these conclusions, clearly demonstrating the existence of a monophyletic non-TXY subfamily of closely related sequences (our pzMAPK3). However, we note that the pzMAPK3 subfamily is no more “atypical” phylogenetically than pzMAPK2, with both of these subfamilies separated from the center of the tree by equally large distances (Fig. 4). Furthermore, amino acid sequence alignment of MAPKs from pzMAPK2 and pzMAPK3 subfamilies does not indicate a lesser degree of homology between these subfamilies (Fig. 2) except with respect to the TDY to T[GS]H conversion of their corresponding activation loops.

In fact, a striking and unexpected consequence of our analyses was the demonstration of an extremely large evolutionary distance between the ERK8-like MAPKs and the remaining MAPK families. While we are not the first to notice significant differences between the mammalian ERK7/ERK8 sequences and other mammalian ERKs (Abe et al. 2002; Bogoyevitch and Court 2004), our identification of ERK8 homologues in C. elegans and D. melanogaster indicates that ERK8-like MAPKs represent a distinct family that is evolutionarily conserved across a wide variety of eukaryotes except plants (we found that while similarity searches yielded several potential ERK8-like plant homologues, all of these were consistently grouped phylogenetically within the PERK family, suggesting a single subfamily for all plant MAPKs).

We also note that an association between mammalian ERK7/ERK8 and protozoan MAPKs has been previously observed, as reported in a recent analysis of a newly identified MAPK homologue in Entamoeba histolytica (Ray et al. 2005). This report placed HsERK8 and RnERK7 within a phylogenetic clade that included certain protozoan MAPKs, a result which led the authors to suggest, “[It] is likely that a common ancestor for MAPKs and ERK 7/8 existed before protozoans and slime molds diverged from the ancestral lineage that gave rise to animals, fungi, and plants” (Ray et al. 2005). Although our results also clearly demonstrate the monophyletic nature of the ERK8-like protozoan and metazoan MAPKs, there is a significant difference in our findings. Ray and coworkers suggested a single protozoan MAPK subgroup with ERK7/ERK8 as members, while we instead have presented evidence of two distinct protozoan subfamilies separated by a large phylogenetic distance.

It is interesting to note that, despite extensive research on the set of all P. falciparum kinases (Ward et al. 2004), no MAPK has yet been identified in this species that is homologous to TgMAPK1 and ChMAPK, and our database searches indicate that the ApiMAPK1 subfamily does not appear to have any other members. We also note that the evolutionary distance between TgMAPK1 and ChMAPK is substantially greater than for the T. gondii and C. hominis sequence pairs in the other two Apicomplexan subfamilies, suggesting that these MAPKs are far less conserved across even closely related species.

The identification of meaningful phylogenetic relationships among the protozoan MAPKs is a challenging problem due to the high level of divergence among many of these sequences. The goal of our analysis has been to obtain a phylogeny with strong statistical support while reflecting the diversity of protozoan MAPKs, and for this reason we found it beneficial to focus on sets of protozoan MAPKs with high levels of similarity. We acknowledge that the results presented here are “Apicomplexan-centric” and do not represent the majority of the protozoan MAPKs currently included in the NCBI databases. However, our opinion is that a smaller, more resolved tree is of greater benefit than a larger star-like tree with poor statistical support, and we therefore have included only a small subset of the many protozoan MAPKs that we catalogued in our searches.

Furthermore, because our study involved a collection of 32 highly diverse sequences, many positions could not be unambiguously aligned and were discarded from the final analysis. This led to a considerable loss of phylogenetic signal, and for this reason we did not attempt to precisely reconstruct phylogenetic relationships within closely related subfamilies. Instead, we selected representative sequences to broadly outline the major relationships among these groups. More focused analyses of specific protozoan and metazoan subfamilies would increase the number of well-aligned positions, allowing for the resolution of phylogenetic relationships among groups of closely related MAPKs.

In addition to suggesting further phylogenetic analyses, the existence of multiple distinct protozoan subfamilies also poses compelling biological questions. Our results may prove useful for identifying common structural features of MAPKs, which would be of great value for the development of novel MAPK inhibitor drugs. However, it remains to be determined whether the protozoan MAPK subfamilies are functionally distinct. The vast majority of the protozoan MAPK sequences that we have selected for phylogenetic analysis have not been characterized, making every newly cloned protozoan MAPK gene such as tgMAPK2 of great potential significance. While P. falciparum MAPK2 seems to play a critical role in stage differentiation (Dorin et al. 1999) and T. thermophila MAPK3 and T. gondii MAPK1 are both associated with stress response (Nakashima and Nozawa 1999; Brumlik et al. 2004), little biologic information is available regarding the remaining protozoan MAPKs described in Table 2. Most MAPKs possess C-terminal extensions beyond the region shown in our ClustalW alignment (Fig. 2), ranging in size from 88 amino acids in the case of human p38α (41 kDa) to 449 amino acids in the case of the P. falciparum MAPK1 (90 kDa). These extensions have been found to be responsible for interactions with upstream/downstream signaling molecules (Zhang and Liu 2001). Therefore, it does not currently seem possible to obtain functional information from MAPK primary sequence data alone, and experimental results will be necessary to determine the connection, if any, between the phylogenetic and functional classification of the protozoan MAPKs.