Introduction

Transcription factors constitute major components of the genetic basis for phenotypic evolution (Wray et al. 2003). In plants, they belong to multigene families which have a much higher expansion rate than in animals (Shiu et al. 2005). In many plant lineages, the genome was duplicated several times during evolution, which explains some of these expansions. Important evolutionary transitions result from new gene functionalities which were acquired through these large-scale duplication events in plants (Maere et al. 2005). In addition, some transcription factor gene duplications in Arabidopsis thaliana arose through segmental duplication events (Remington et al. 2004).

TCP proteins, named after TEOSINTE BRANCHED 1 (TB1) in maize, CYCLOIDEA (CYC) in Antirrhinum majus, and PCF in rice (Cubas et al. 1999), are a plant-specific family of transcription factors involved in multiple developmental control pathways (Cubas 2002). TCP proteins, described up to now only in angiosperms, can be classified into two subfamilies based on the primary structure of their DNA binding domain. In the present study, the CYC/TB1 subfamily and PCF1/PCF2 subfamily described by Cubas in 2002 will be referred to as TCP-C and TCP-P, respectively. In Arabidopsis thaliana, they constitute a small gene family of 24 members that map on all five chromosomes (Cubas 2002). Most genetic and molecular studies on TCP proteins have focused on the TCP-C subfamily and have shown their involvement in flower and leaf shapes or in shoot branching (Cubas 2004; Doebley et al. 1995; Nath et al. 2003). Although the function of TCP-P proteins is far less studied, it is known that they participate in organ border delimitation (Weir et al. 2004) and influence cell growth and proliferation (Kosugi and Ohashi 1997, 2002).

The appearance of TCPs during plant evolution and their subsequent copy number changes are described here. New members of the TCP protein family were identified from databases, not only in angiosperms, but also in early-diverged groups. Our data, based on DNA amplification analysis using degenerate primers, demonstrate that TCP proteins already existed in Charophycean algae, which comprise the sister lineage of land plants. From these data, the evolutionary pattern of this gene family in Viridiplantae was investigated. The function of TCP proteins in the context of plant development evolution is discussed.

Materials and Methods

Identification of TCP Proteins

Amino acid sequences of known TCP proteins (Supplementary Table 1) were aligned using ClustalX 1.83 (Thompson et al. 1997) with the default parameters. Two consensus sequences were obtained using BoxShade (web links to tools are available in Supplementary Table 2) and were refined manually. tBlastn analysis (Altschul et al. 1997) was performed at the National Center for Biotechnology Information (NCBI) against nonredundant and EST databases using consensus protein sequences corresponding to the TCP-P or TCP-C subfamilies. ESTs corresponding to non-angiosperm TCPs (Supplementary Table 3) were assembled into contigs with CAP3 (Huang and Madan 1999).

TBlastn was performed against 470 complete genomes of Eubacteria, 28 of Archeonta, and 96 nonplant Eukaryota or against specific databases of the very early-diverged eukaryote Phytophthora sojae (Heterokontae, Oomycota), the red alga Cyanidioschyzon merolae (Rhodophyta), and two green algae (Chlorophyta): Ostreococcus tauri (Prasinophyceae) and Chlamydomonas reinhardtii (Chlorophyceae). For land plants, the Populus trichocarpa and Oryza sativa ssp. japonica cv. Nipponbare genomes were analyzed. No restriction was imposed on the e-values, and duplicates or non-TCP proteins were discarded manually. When possible, TCP genes were assigned their TIGR nomenclature (Supplementary Table 3). Analysis of the Physcomitrella patens (Bryophytes) haploid genome was performed at Physcobase using tBlastn against JGI raw sequences. Analysis of the Selaginella moellendorffii (Lycophytes) genome was performed with discontinuous megaBlast at NCBI using TCP nucleic acid sequences of Physcomitrella against the WGS database. The Physcomitrella and Selaginella nucleic acid sequences obtained were assembled into contigs using CAP3. Gene and protein sequences of Physcomitrella and Selaginella, and some from rice, were predicted using FgenesH (Salamov and Solovyev 2000), Eukaryotic GeneMark HMM (Lukashin and Borodovsky 1998), and Augustus (Stanke and Waack 2003). When prediction programs failed to predict genes, ORFs were searched for in the six translation phases using Traduc at Infobiogen.
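The six-phase ORF search used as a fallback can be illustrated with a minimal Python sketch. This is not the Traduc program itself: the function names, the requirement that an ORF start at a methionine, and the `min_len` threshold are assumptions made for illustration only, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed here).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from position 0; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna, min_len=30):
    """Yield (frame, peptide) for stop-free stretches that begin with Met.

    Frames are numbered +1..+3 on the given strand and -1..-3 on the
    reverse complement. A stretch running to the end of the sequence
    without a stop codon is also reported (hypothetical design choice).
    """
    dna = dna.upper()
    for strand, seq in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            protein = translate(seq[offset:])
            for segment in protein.split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start >= min_len:
                    yield strand * (offset + 1), segment[start:]
```

Each reported peptide could then be compared with the TCP consensus sequences; the length threshold would need to be tuned to the expected size of the TCP domain.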

Phylogenetic Analysis of Sequences

Protein sequences of Arabidopsis thaliana (Cubas 2002), Populus trichocarpa, Oryza sativa ssp. japonica cv. Nipponbare, Selaginella moellendorffii, Physcomitrella patens, ESTs found in our study, and all TCP sequences listed in Supplementary Table 1 were aligned using ClustalX 1.83 (Thompson et al. 1997) with a gap open penalty (GOP) of 4.0 and a gap extension penalty (GEP) of 0.10, and using the Gonnet 250 matrix for pairwise alignment parameters. A GOP of 9.0, a GEP of 0.20, and the Gonnet series were used for multiple alignment parameters. Alignments were corrected manually with Seaview (Galtier et al. 1996). Amino acid sequence alignments were performed with the whole protein sequence. Phylogenetic analysis was carried out by the BioNJ method (Gascuel 1997), using the Phylo_Win 2.0 software (Galtier et al. 1996) with the observed divergence for the distance parameter, pairwise gap removal option, and 1000 bootstrap replicates. The resulting tree was edited using Mega3.1 (Kumar et al. 2004). Maximum parsimony (MP) analysis was carried out with Mega3.1 using all sites, with a close neighbor interchange search level of 3, 10 random additions, and 100 bootstrap replicates. Appropriate protein models for maximum likelihood (ML) analysis were selected using ModelGenerator v0.6 (Keane 2004). ModelGenerator estimates the best-fit substitution model from a total of 80 possible amino acid models using the PAL library (Drummond and Strimmer 2001). ML analysis was performed using PHYML 2.4.4 (Guindon and Gascuel 2003), using a de novo BioNJ tree with a JTT matrix (Jones et al. 1992) substitution model, with the proportion of invariable sites set to 0 and four substitution rate categories with an estimated γ distribution parameter. One hundred nonparametric bootstrap replicates were applied.
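The distance setting used for BioNJ (observed divergence with pairwise gap removal) can be sketched as follows. This is an illustrative Python reimplementation under stated assumptions, not the Phylo_Win code: the function names and the "-" gap symbol are hypothetical choices, and the input is assumed to be a set of already aligned sequences of equal length.

```python
def observed_divergence(seq_a, seq_b, gap="-"):
    """Fraction of differing residues over columns ungapped in both sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # pairwise gap removal: skip this column for this pair only
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped column shared by the two sequences")
    return differing / compared


def distance_matrix(alignment):
    """Return {(name_i, name_j): distance} for every unordered pair."""
    names = list(alignment)
    return {
        (x, y): observed_divergence(alignment[x], alignment[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }
```

With pairwise removal, a column gapped in one sequence is ignored only in comparisons involving that sequence, so more of the alignment contributes to each distance than with complete deletion of gapped columns.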

Search for Conserved Motifs

Analysis of conserved motifs within TCP groups was performed using MEME-MAST (Bailey and Elkan 1994), and results were checked manually. Simple sequences were found using SIMPLE v3.0 (Alba et al. 2002). Motifs were compared using tBlastn against a nonredundant database on the NCBI server and submitted to Motif Scan (Pagni et al. 2004) and CD-Search (Marchler-Bauer and Bryant 2004) to look for known domains or motifs.

Search for Targeting Sequences

Putative bipartite nuclear localization signals (NLS) were identified following Dingwall’s rule (Dingwall and Laskey 1991). Proteins were submitted to TargetP (Emanuelsson et al. 2000) to predict potential chloroplastic or mitochondrial targeting.
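A bipartite NLS scan of this kind can be sketched in Python. The rule is assumed here in the form in which it is commonly summarized: two adjacent basic residues (Lys or Arg), a spacer of any 10 residues, and at least three basic residues among the following five positions. The function name and the return format are hypothetical.

```python
import re

# Lookahead so that overlapping candidate sites are all examined.
# Group 1: whole 17-residue candidate; group 2: the five positions after the spacer.
BIPARTITE_CANDIDATE = re.compile(r"(?=([KR]{2}[A-Z]{10}([A-Z]{5})))")


def find_bipartite_nls(protein):
    """Return a list of (start_index, matched_segment) for putative bipartite NLS."""
    hits = []
    for match in BIPARTITE_CANDIDATE.finditer(protein.upper()):
        tail = match.group(2)
        basic_in_tail = tail.count("K") + tail.count("R")
        if basic_in_tail >= 3:  # assumed threshold: 3 of the 5 tail residues
            hits.append((match.start(), match.group(1)))
    return hits


# Illustrative peptide, not taken from the TCP dataset.
print(find_bipartite_nls("AAKRPAATKKAGQAKKKKLDAA"))
```

Such a pattern match only flags candidates; hits would still need to be checked manually against the alignment.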

Estimation of Synonymous (Ks) and Nonsynonymous (Ka) Substitutions

Paralogous sequences (duplicated genes) were aligned pairwise using RevTrans 1.3 (Wernersson and Pedersen 2003) with protein alignments as guides (using ClustalW 1.83 as the protein alignment method). Synonymous (non-amino acid-changing; Ks) and nonsynonymous (Ka) substitutions per site were estimated for each pair by the Nei and Gojobori method (1986) with PAML 3.15 (Yang 1997). Synonymous substitutions do not result in amino acid replacements and are, in general, not under selection. Consequently, the rate of fixation of these substitutions is expected to be relatively constant in different protein coding genes and, therefore, to reflect the overall mutation rate. As a result, the number of synonymous substitutions per synonymous site (Ks) is used to estimate the time since duplication of two sequences. The time since duplication was calculated as T = Ks/(2λ), with λ, the rate of synonymous substitutions, estimated as 6 × 10⁻⁹ per site and per year (Muse 2000).
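The dating formula can be made concrete with a short Python sketch. The factor of 2 reflects the fact that both copies accumulate substitutions independently after the duplication. The function names and the Ks values in the loop are illustrative assumptions, not measurements from this study.

```python
# Rate used in the text: 6e-9 synonymous substitutions per site per year.
SYNONYMOUS_RATE = 6e-9


def duplication_age_years(ks, rate=SYNONYMOUS_RATE):
    """Return T = Ks / (2 * rate), in years."""
    if ks < 0:
        raise ValueError("Ks must be non-negative")
    return ks / (2.0 * rate)


def duplication_age_mya(ks, rate=SYNONYMOUS_RATE):
    """Same estimate expressed in millions of years (Mya)."""
    return duplication_age_years(ks, rate) / 1e6


# Illustrative Ks values only.
for ks in (0.06, 0.48, 0.60):
    print(f"Ks = {ks:.2f} -> about {duplication_age_mya(ks):.0f} Mya")
```

Under this rate, a Ks of 0.48 corresponds to about 40 Mya. The estimate scales inversely with λ, so a different calibration of the synonymous rate shifts all dates proportionally.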

PCR Techniques

The COnsensus-DEgenerate Hybrid Oligonucleotide Primers (CODEHOP) strategy (Morant et al. 2002; Rose et al. 1998) was used, based on the output of the BlockMaker server. For each species analyzed, the most phylogenetically related sequences were used as input. Two forward (F1 and F2) and two reverse (R1 and R2) primers were selected within the most conserved region of each TCP subfamily and used after optimization of primer codon usage. Primer sequences are given in Supplementary Table 4. Genomic DNA was prepared from 100 mg of fresh tissues of Pinus pinaster (Gymnospermaphyta), Equisetum arvense (Pteridophyta), Selaginella martensii (Lycophyta), and Chara hispida and C. vulgaris (Charophyta), following a protocol from Dempster et al. (1999). Analysis was also performed on Cosmarium sp. (Zygnemophyta), Klebsormidium flaccidum (Klebsormidiophyta), Chlorokybus atmophyticus (Chlorokybophyta), Mesostigma viride (Mesostigmatophyceae), and Scenedesmus subspicatus (Chlorophyceae). PCR was then performed on 100 ng of DNA template using 1 U of Taq polymerase (Invitrogen), 2.5 mM MgCl2, 0.2 mM dNTP, and a 0.5 μM concentration of primer with the polymerase manufacturer’s buffer in 25-μl final volume. The PCR program was designed according to the CODEHOP server’s instructions as follows: 3 min of initial denaturation at 94°C, followed by a manual hot start, then a touchdown: 15 cycles of 30 sec at 94°C, 30 sec at 70°C (–1°C/cycle), and 1 min at 72°C, then a classical PCR with 25 cycles of 30 sec at 94°C, 30 sec at 55°C, and 30 sec at 72°C, and a final 2-min extension. When amplification failed, the touchdown starting temperature was decreased (65°C).
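The cycling program can be written out explicitly with a small Python sketch. It assumes that the first touchdown cycle anneals at the starting temperature and that each subsequent cycle is 1°C lower; the function name and the tuple layout are hypothetical.

```python
def touchdown_program(start_anneal=70.0, touchdown_cycles=15, step=1.0,
                      classic_cycles=25, classic_anneal=55.0):
    """Return a list of (phase, cycle_number, steps).

    Each entry in steps is (temperature in degrees C, duration in seconds).
    """
    program = []
    for i in range(touchdown_cycles):
        anneal = start_anneal - step * i  # assumed: cycle 1 at the start temperature
        program.append(
            ("touchdown", i + 1, [(94.0, 30), (anneal, 30), (72.0, 60)])
        )
    for i in range(classic_cycles):
        program.append(
            ("classic", i + 1, [(94.0, 30), (classic_anneal, 30), (72.0, 30)])
        )
    return program


program = touchdown_program()
anneal_temps = [steps[1][0] for phase, _, steps in program if phase == "touchdown"]
cycling_seconds = sum(sec for _, _, steps in program for _, sec in steps)
print(anneal_temps[0], anneal_temps[-1], cycling_seconds)
```

Under these assumptions the touchdown phase anneals from 70°C down to 56°C before the 25 cycles at 55°C. Passing `start_anneal=65.0` gives the fallback schedule used when amplification failed.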

Analysis of PCR Products

PCR products were analyzed on 2% agarose gels, and fragments of the expected size were cloned into the pGEM-T vector (Promega). PCR products were sequenced and the sequences were analyzed by tBlastx against the nonredundant database on the NCBI server.

Results

TCP genes are considered to be specific to the plant kingdom (Riechmann et al. 2000). Within plants, TCP proteins have been identified and studied so far only in angiosperms. Two consensus sequences specific for TCP-C and TCP-P subfamilies, differing both in length and in sequence, were obtained by alignment of known rice, maize, Antirrhinum majus, and A. thaliana TCP proteins (Fig. 1). Databases were then searched for the existence of TCP genes in order to estimate the date of emergence of TCP genes and also to understand how these protein families evolved.

Fig. 1.

TCP domain consensus obtained for each TCP-C (A) and TCP-P (B) protein subfamily. Am, Antirrhinum majus; At, Arabidopsis thaliana; Gh, Gossypium herbaceum; Lv, Linaria vulgaris; Os, Oryza sativa ssp. japonica cv. Nipponbare; PCF, rice proliferating cell factor; Zm, Zea mays.

Identification of New TCP Genes

EST database information

EST database mining identified TCP sequences in a large range of nonangiosperm plant species. TCP-P and TCP-C consensus amino acid sequences were found to match coding sequences perfectly in groups outside the angiosperms (Supplementary Table 3). Twenty-nine ESTs were found in Gymnospermaphyta, corresponding after contig assembly to three TCP-P and two TCP-C genes from Pinus taeda, one TCP-P from Pinus pinaster, one TCP-P and one TCP-C from Picea sitchensis, one TCP-C from Picea glauca, one TCP-C from Gnetum gnemon, two TCP-P from Welwitschia mirabilis, and one TCP-P from Cycas rumphii. One EST in Pteridophyta, from Ceratopteris richardii, encoded a TCP-P gene. This did not, however, allow us to determine with confidence whether TCP genes are present or absent in groups of Viridiplantae for which EST databases were not available. In addition, since TCPs are not highly expressed in A. thaliana, their representation in the EST databases is probably low. We therefore used a PCR approach to obtain additional data.

PCR amplification of TCP sequences in basal plant genomes

We used the CODEHOP PCR technique, developed to reveal gene families. DNA samples were chosen from several species, targeting major branches of the Viridiplantae (Fig. 2). TCP sequences were amplified from genomes of Pinus pinaster (Coniferales) (positive control), Equisetum arvense (Equisetophyta), Selaginella martensii (Lycophyta), and Physcomitrella patens (Bryophyta). TCP sequences were also found in the early-diverged streptophytes Chara hispida, Chara vulgaris (Charophyta), and Cosmarium sp. (Zygnemophyta) but were not detected in Klebsormidium flaccidum (Klebsormidiophyta), Chlorokybus atmophyticus (Chlorokybophyta), or Mesostigma viride (Mesostigmatophyceae), which is a likely representative of the most early-diverged streptophyte lineage (Lewis and McCourt 2004). No TCP genes were found in Scenedesmus subspicatus (Chlorophyta). The 61 sequences which were obtained aligned well throughout the different groups of Streptophyta, with several variants of each of the two subfamilies in all species studied (except for Cosmarium TCP-P, where only three sequences were obtained) (Fig. 3).

Fig. 2.

Synopsis tree of plant relationships, highlighting the presence of TCP (grey background). The synopsis tree was modified from Karol et al. (2001) and Pennisi (2003). Uncertain placements are represented by a dotted line. The tree includes species for which whole-genome data (red), ESTs (black), or CODEHOP results (green) are available. The origin of the TCP genes could be deduced from this study and is indicated by a black arrow. Approximated ages of divergence, in millions of years (Hedges et al. 2004; Sanderson et al. 2004; Yoon et al. 2004), are indicated on the nodes.

Fig. 3.

Nucleotide sequences of CODEHOP amplified fragments. All the sequence variants are shown for each plant species and are compared to the consensus sequence of the corresponding domain in angiosperms (above each set). The numbers of sequences obtained for each variant are indicated in parentheses.

Whole-Genome Database Searches

To identify complete TCP gene families, Blast homology searches were performed against complete genome sequences of several species of Streptophyta. Previously, 24 TCPs were found in Arabidopsis (Cubas 2002). In addition to three angiosperm genomes (A. thaliana, P. trichocarpa, O. sativa), two genomes of basal embryophytes have recently been sequenced (club moss, S. moellendorffii, and moss, P. patens). These searches identified numerous new putative TCP genes and indicated that many poplar and rice TCP sequences had been partially misannotated by the automated annotation processes. Therefore, all the sequences were checked manually. The gene structure predictions were improved by using additional information such as partial or complete cDNA sequences and by analyzing the ORFs deduced from the genomic sequences (Supplementary Table 3). Prediction of TCP proteins was especially difficult in rice since few EST sequences corresponding to TCP domain proteins were available in the databases (Kikuchi et al. 2003; Osato et al. 2003; Yazaki et al. 2004) (Supplementary Table 3). Incorrect start codon predictions, splicing errors, and missing or additional exons were detected. Most of the predicted introns were uncertain and the TCP genes probably contained no or few introns, as in A. thaliana (Supplementary Table 3). The presence of allelic variants in Selaginella cannot be excluded because no physical map was available. Two Selaginella sequences with only 2% divergence were considered to be paralogues following previous work on Antirrhineae (Hileman and Baum 2003). In poplar, nine genes were not anchored to the physical map, among which seven are probably close paralogues, with protein divergence always higher than 10%, which excludes the possibility that they are allelic forms of the same gene. Thus, four new complete families of TCP genes were identified: 34 genes in the poplar, 29 in rice, 10 in club moss, and 6 in moss genomes. There was no correlation between the size of TCP families and the size of genomes (Supplementary Table 3). In A. thaliana, poplar, and rice, the TCP genes were dispersed throughout the genomes, without any clustering.

These analyses were consistent with the CODEHOP results since data mining identified several TCP genes in Selaginella and Physcomitrella, but not in two Chlorophyta green algae, Ostreococcus tauri (Prasinophyceae) and Chlamydomonas reinhardtii (Chlorophyceae), corroborating the absence of TCP genes in our PCR analysis of Scenedesmus subspicatus.

Evolution of the TCP Gene Family

Phylogenetic analysis in land plants

To evaluate evolutionary relationships within the TCP gene family, we performed phylogenetic analyses including all the sequences identified as well as other angiosperm sequences for which functional data were available. The construction of a reliable phylogenetic tree of TCP proteins is problematic due to the small size (62 amino acids maximum) of the conserved TCP domain sequence. Trees constructed based on such a low number of residues are often poorly supported by statistical analysis (Brocchieri 2001). We therefore aligned the maximum number of amino acids for each protein (Supplementary Table 1) (Remington et al. 2004; Tian et al. 2004); sequences obtained from PCR amplification in our analysis (shorter than the TCP domain length) were thus not included. An ML tree is presented in Fig. 4. The neighbor-joining (NJ) and maximum parsimony (MP) trees were similar and are presented in Supplementary Fig. 1. All TCP members could be classified into one of the two subfamilies, TCP-C or TCP-P, and these subfamilies were then divided into several subclades. All trees (BioNJ, ML, and MP) showed better support for the TCP-C subclade topology than for TCP-P. The latter group consisted of small classes whose relationships were difficult to infer with confidence since internal branches were poorly supported. However, some groups have bootstrap values >80, suggesting some sublineages. Angiosperm- and eudicot-specific groups are highlighted in Fig. 4. In contrast, the TCP-C subfamily is divided into two major clades supported by high bootstrap values (88 and 68). The largest clade (I) contains members belonging to all phylogenetic groups from moss to angiosperms. This clade included one monocot- and one eudicot-specific terminal subgroup, highlighted in Fig. 4. Within the second clade (II), only sequences from angiosperm species were found. Numerous pairs of poplar and rice TCP and the majority of moss and club moss TCP (apart from SmoTCP2 and SmoTCP6) could be recently duplicated paralogues. Conversely, such paralogues were not clearly detected in A. thaliana, except for AtTCP3/AtTCP4 and AtTCP17/AtTCP5. Two or more TCP sequences from poplar were frequently associated with a single A. thaliana sequence (e.g., AtTCP19/PtrTCP11/PtrTCP13), suggesting a recent duplication specific to poplar. No strict orthologues between species were revealed by these trees, apart from OsTB1/ZmTB1 and AmCYC/LvCYC in the TCP-C subgroup. The absence of identifiable orthologues might be explained either by the great distance between the rice, poplar, and Arabidopsis lineage-specific duplications or by numerous gene losses following amplification.

Fig. 4.

Unrooted ML tree of 126 TCP proteins from embryophytes. Bootstrap values >50% are indicated as percentages. Names in boldface correspond to proteins described in the literature, whereas ESTs are in italics. The gene predicted to be a target for micro-RNA regulation (Palatnik et al. 2003) is underlined. The presence of an R domain (Cubas et al. 1999) is indicated by an asterisk. Shaded groups designate clades discussed in the text: light gray is used for angiosperm-specific groups, medium gray for eudicot-specific groups, and black for monocot groups. A diamond (♦) is placed at the root of the group which contains the NLS motif in class I of the TCP-C family. Am, Antirrhinum majus; At, Arabidopsis thaliana; Cri, Ceratopteris richardii; Cru, Cycas rumphii; Gg, Gnetum gnemon; Gh, Gossypium herbaceum; La, Lupinus albus; Lv, Linaria vulgaris; Os, Oryza sativa ssp. japonica cv. Nipponbare; Pg, Picea glauca; Ppi, Pinus pinaster; Pp, Physcomitrella patens; Ps, Picea sitchensis; Pt, Pinus taeda; Ptr, Populus trichocarpa; Smo, Selaginella moellendorffii; St, Solanum tuberosum; Wm, Welwitschia mirabilis; Zm, Zea mays.

In conclusion, the precise relationships of the TCP homologues between species and sometimes within species were difficult to determine, probably because of a complex history of duplication and loss. This conclusion supports and extends the observations made by Citerne and Reeves for the TCP-C subfamily in angiosperms (Citerne et al. 2003; Reeves and Olmstead 2003).

Evidence for duplication events

To identify duplication events within the TCP families in the five complete genomes, we built individual trees for each species (Supplementary Fig. 1). Ages of paralogous gene duplications were inferred using the number of synonymous substitutions per synonymous site (Ks), which is assumed to be correlated with the time of divergence. Figure 5 presents duplicated loci with their Ks values. Four A. thaliana TCP loci were correlated with a recent whole-genome duplication event (see the α event in Fig. 5A and associated references), and four others to an earlier duplication (named “old” or β by different authors in Fig. 5A). No paralogues arising from the γ event were detected (Fig. 5A). In rice, paralogues corresponded to seven blocks duplicated in a recent whole-genome duplication event, which predates the divergence of cereals 50–70 Mya (PPP1; Fig. 5A). The estimated time of the duplication was, however, closer to that of Goff et al. (2002) and Yu et al. (2005), about 50 Mya. To date, no rice TCP has been shown to belong to the older duplication event (PPP2). In poplar, the low Ks values suggest a recent duplication with many extant duplicated genes (Fig. 5B). The high conservation between Physcomitrella paralogue sequences and their low Ks values suggest a recent duplication event, which can be estimated at about 40 Mya (Fig. 5B). This could correspond to a genome duplication previously suspected by others (Cove et al. 1997; Markmann-Mulisch et al. 2002). A duplication in the Selaginella lineage in the last 5 Mya is also suggested (Fig. 5B). All the duplication events by which the TCP family probably expanded are shown in Fig. 5C.

Fig. 5.

Duplication history based on Ks values. A Duplicated loci and their Ks values in A. thaliana and rice. TCP duplications are inferred from our study and from previous genomic analyses. In the latter cases, corresponding references are indicated: 1 (AGI 2000; Blanc et al. 2000; Paterson et al. 2000); 2 (Simillion et al. 2002); 3 (Blanc et al. 2003); 4 (Bowers et al. 2003); 5 (Guyot and Keller 2004); 6 (Paterson et al. 2004); 7 (Blanc and Wolfe 2004b). An asterisk in the A. thaliana table represents a pair of genes for which an ancestral locus was not inferred. B Duplicated loci and their Ks values in poplar, Selaginella, and moss. C The timing of the major genome duplication events is shown on a phylogenetic tree using previous references: for Arabidopsis thaliana (α, ∼50 Mya; β, ∼200 Mya) (AGI 2000; Blanc et al. 2000, 2003; Blanc and Wolfe 2004a; Bowers et al. 2003; Paterson et al. 2000; Raes et al. 2003; Simillion et al. 2002; Vandepoele et al. 2002; Ziolkowski et al. 2003); for rice (50–70 Mya) (Blanc and Wolfe 2004b; Guo et al. 2005; Guyot and Keller 2004; Paterson et al. 2004; Tian et al. 2005; Vandepoele et al. 2003; Wang et al. 2005; Zhang et al. 2005); and for Populus (∼10–20 Mya) (Sterck et al. 2005). The timing of Physcomitrella and Selaginella genome duplication events was obtained from these data, about 40 and 5 Mya, respectively. The scale below C represents approximate ages (Mya).

Searches for motifs in TCP proteins

Comparative analysis of TCP protein amino acid sequences using MEME software (Bailey and Elkan 1994) detected common motifs (Supplementary Fig. 2 and Supplementary Table 5) and confirmed the presence of the previously described CC domain (Cubas et al. 1999; Kosugi and Ohashi 1997), R domain (Cubas et al. 1999), and SP domain (Lukens and Doebley 2001). In addition, the target site of the miR-Jaw miRNA (Palatnik et al. 2003) could be identified in nucleic acid sequences encoding the motif 8 presented in Supplementary Fig. 2 and Supplementary Table 5. Outside these domains, we identified 23 distinct motifs with no significant similarity to known motifs or domains. Interestingly, only one was found in both subfamilies (motif 22 in Supplementary Fig. 2 and Supplementary Table 5). Most TCPs are thought to be targeted to the nucleus. The putative bipartite NLS was included in the MEME motif TCP. However, no NLS was predicted for most members of the TCP-C subgroup I, except in the group indicated in Fig. 4. Some proteins without NLS were predicted by TargetP to be targeted to the chloroplast (AtTCP17, OsTCP18, PgTCP1, PtrTCP23, and PtrTCP34). It is noteworthy that AtTCP13 (also named PTF1, TFPD, or TCP10) was shown to be targeted either to the chloroplast (Baba et al. 2001) or to the nucleus (Suzuki et al. 2001). It is therefore possible that some TCPs are targeted to both the nuclear and the organellar genomes.

The SIMPLE software (Alba et al. 2002) identified numerous sequences enriched for single amino acids (average of 3.44 simple sequences per TCP protein versus an average of 1.88 simple sequences per protein for the whole A. thaliana proteome [Sim and Creamer 2002]) (Supplementary Fig. 2). The most common simple sequences in our dataset are glycine (comprising 22.5% of the simple sequences of our dataset), serine (20.3%), alanine (16.9%), and glutamine (13.4%). Two of them (Ser and Ala) are overrepresented in transcription-related proteins in Arabidopsis (Sim and Creamer 2002).

Discussion

Our work is the first demonstration of the presence of TCP genes outside angiosperms and shows that TCP transcription factors are ancient proteins. TCP genes probably appeared in the Streptophyta lineage before the divergence of the Zygnemophyta, likely between 650 and 800 Mya (Yoon et al. 2004), since several TCP coding sequences were detected in Cosmarium but not in Klebsormidium, Chlorokybus, or Mesostigma. The division of the TCP family into two subfamilies, C and P, results from an ancient duplication event before the divergence of the Zygnemophyta. The emergence of the Phragmoplastophyta correlates with the appearance of the TCP family. This taxonomic group has been defined (Fig. 2) based on the evolution of a novel mechanism of cell wall formation during cytokinesis, nearly identical to the cytokinetic phragmoplast (Lecointre et al. 2001). A unique gene duplication event before the emergence of Cosmarium probably gave rise to the two TCP subfamilies. This event, coupled with subsequent sequence evolution of the two duplicates, could be the starting point of the functional divergence in each of the two subfamilies. Indeed, several molecular studies propose that TCP-P and TCP-C are activators and repressors of growth, respectively (Li et al. 2005). Moreover, proteins from one subfamily interact preferentially with members of their own subfamily rather than with members of the other (Kosugi and Ohashi 2002). However, the common ancestor of TCP genes was not identified in this study, although extensive searches within prokaryotic and eukaryotic genomes were performed. Very few amino acids are shared by both subfamilies, which makes it difficult to identify an ancestral sequence from which TCP proteins arose. Although this ancestor gene may have been inherited by vertical transmission to Cosmarium, the possibility that the TCP domain appeared in plants by horizontal transfer cannot be ruled out. This is, for example, the case of the AP2 DNA binding domain, which was considered plant specific until the demonstration of its lateral transfer from prokaryotic endonuclease sequences (Magnani et al. 2004). The common ancestral TCP gene might also have been present in lineages that have since disappeared.

It is clear from the comparison of early- and more recently diverged plant TCP genes that the two TCP subfamilies experienced a continuous expansion during the development and diversification of streptophytes. Like most plant transcription factor gene families, the TCP family expanded through whole-genome duplication events, and its members were preferentially retained compared to other plant genes (Shiu et al. 2005). During evolution, the general organization of the TCP family remained well conserved, with significantly more members in the TCP-P subfamily than in the TCP-C subfamily (between 1.2- and 2-fold more). This finding suggests a functional connection between the two subfamilies that would necessitate an appropriate gene number in each subfamily. It is noteworthy that the TCP-P subfamily was also the most constrained (lower Ka/Ks; data not shown), which is in favor of better TCP-P gene retention. A different evolutionary process is described for the MADS-box gene family, in which the largest type I MADS-box gene subfamily was also under a weaker selection (Nam et al. 2004).

Lineage-specific expansions of the TCP family were clearly observed in species-specific phylogenetic trees which were well resolved (Supplementary Fig. 1), particularly in rice and poplar. It will be interesting to explore the functionality of lineage-specific genes. These might permit the adaptation of responses to particular environments by generating specialized functions or morphological traits (Shiu et al. 2005; Xiong et al. 2005).

Globally, the tree structure was supported by the presence of common protein motifs outside the conserved TCP domain, even though, in these regions, TCP proteins exhibit high divergence. The MEME analysis detected several conserved motifs and also numerous insertions and deletions in these coding sequences. Low-complexity regions enriched for single amino acids were also detected outside the TCP domain, some being conserved (Supplementary Fig. 2), indicating that they may have a functional importance. Simple repeats are known to be related to molecular functions like transcriptional activation and repression or protein-protein interactions (Romero et al. 2001). TCP genes might have gained new functions after duplication, as previously proposed to explain the differential evolution of specific gene regions outside the TCP and R domains of Cycloidea-like proteins in Antirrhineae (Gubitz et al. 2003; Hileman and Baum 2003).

Functions described for the TCP-C proteins in higher plants are related to the control of sophisticated morphological traits such as flower and leaf shape or shoot outgrowth (Cubas 2004; Doebley et al. 1995; Nath et al. 2003). Our data and others from the literature suggest that TCP factors control coordination of growth and cell cycle in plants (Gaudin et al. 2000; Li et al. 2005; Tremousaygue et al. 2003). Welchen and Gonzalez (2006) suggested that these factors constitute a link between biogenesis of the plant mitochondrial respiratory chain and cell proliferation. Moreover, TCP binding sites have been detected in promoters of numerous A. thaliana genes involved in ubiquitous processes such as transcription, splicing, translation, proteolysis, and cell organization control (Li et al. 2005; Tatematsu et al. 2005; Tremousaygue et al. 2003; Welchen and Gonzalez 2006). It has been proposed that TCP-P genes positively regulate gene expression, whereas the TCP-C group may exert a negative regulation of proliferation (Kosugi and Ohashi 2002; Li et al. 2005). In plants exhibiting simple morphologies with no organs or meristems, the function of TCP proteins is unknown.