Introduction

Availability of the Physcomitrella patens genome sequence (Rensing et al. 2008) has afforded an opportunity to examine the full complement of MADS-box genes in a member of the bryophytes, which diverged from the rest of the green plant lineage (after separation of the charophytes) at least 420 MYA (reviewed in Sanderson et al. 2004; Zimmer et al. 2007) and thus hold a phylogenetically informative position.

MADS-box genes comprise a large family of genes characterised by a well-conserved MADS-box of approximately 180 bp. They are found in plants, animals and fungi and encode transcription factors (for a recent review of plant MADS-box genes, see Gramzow and Theißen 2010). In plants, MADS-box genes are best known for specifying floral organ identities, and it has been argued that rapid expansion and diversification of this gene family were critical factors for evolution of angiosperms and the organs that define them (Theißen et al. 2000).

Expansion of the seed plant MADS-box family has involved tandem duplication (copying of a gene in proximity to the original gene through unequal crossing over) and segmental duplication (copying and translocation of a lengthy DNA section or duplication of an entire genome [polyploidization]) (Pařenicová et al. 2003; Veron et al. 2007; Lee and Irish 2011). In addition, transposable elements are thought to have distributed copies of an AG-like MADS-box throughout the maize genome (Fischer et al. 1995; Montag et al. 1996).

Type I and II genes were originally classified on the basis of their proposed monophyletic relationships to animal SRF- (SERUM RESPONSE FACTOR-) like and MEF2- (MYOCYTE ENHANCER FACTOR 2-) like MADS-box genes, respectively (Alvarez-Buylla et al. 2000). However, more recent genome-wide analyses of MADS-box genes in Arabidopsis have not supported these relationships (De Bodt et al. 2003b; de Folter et al. 2004; Kofuji et al. 2003; Pařenicová et al. 2003). Nevertheless, an artificial, polyphyletic type I grouping (De Bodt et al. 2003b; Kofuji et al. 2003) and a monophyletic type II grouping (Kofuji et al. 2003) were distinguished by the absence or presence, respectively, of a conserved keratin-like (K-) box and by differences in exon–intron architecture. Type I genes in angiosperms usually contain one or two exons (De Bodt et al. 2003a; Pařenicová et al. 2003) and have been classified as Mα, Mβ or Mγ based on the phylogenetic analysis (Pařenicová et al. 2003).

A few type I MADS-box genes have been characterised functionally [Mα genes: AGL62 (Kang et al. 2008), DIANA (AGL61) (Bemer et al. 2008; Steffen et al. 2008), AGL23 (Colombo et al. 2008); Mγ genes: PHERES1 (PHE1) (Köhler et al. 2003), AGL80 (FEM111) (Portereiko et al. 2006)]. These genes play roles in the ontogeny of female gametophytes, embryos and seeds. Furthermore, all of the 38 Arabidopsis type I genes for which expression has been detected, using transgenic plants harbouring GUS-GFP reporter constructs, are active in female gametophytes and seeds (Bemer et al. 2010).

Type II genes encode proteins with the canonical MIKC structure consisting of the DNA-binding MADS domain, a weakly conserved intervening (I-) domain, the K-domain, which is predicted to form a coiled-coil structure, and a variable C-terminal domain (Ma et al. 1991; Theißen et al. 2000). An N-terminal domain may precede the MADS domain.

Type II genes are subdivided into MIKCC genes and MIKC* genes on the basis of the expanded I region and less well-conserved K-box in MIKC* genes (Svensson et al. 2000; Henschel et al. 2002). The MIKCC subtype includes many of the genes that control floral organogenesis. Forward genetics studies of angiosperms displaying homeotic floral phenotypes led to the well-known ABC model of floral morphogenesis (Coen and Meyerowitz 1991). Further investigation resulted in extension of this model to the ABCD model (Colombo et al. 1995), followed by the ABCDE and related protein-based, floral quartet models (Theißen 2001; Theißen and Saedler 2001). MIKCC genes are also involved in floral meristem development, floral transition, senescence and abscission of flowers, embryonic development (Fernandez et al. 2000), leaf and root morphogenesis (Tapia-López et al. 2008), nodulation (Heard and Dunn 1995; Heard et al. 1997; Zucchero et al. 2001), and fruit development and dehiscence in flowering plants (for a summary of MIKCC functions, see Rijpkema et al. 2007), as well as in development of reproductive structures in non-flowering spermatophytes (reviewed in Theißen et al. 2000). The functions of MIKCC genes in cryptogams remain elusive. Relatively ubiquitous expression patterns have been observed in ferns (Hasebe et al. 1998; Münster et al. 1997, 2002), clubmoss (Svensson and Engström 2002) and moss (Singer et al. 2007; Quodt et al. 2007), suggesting that MIKCC gene functions are less organ-specific in non-seed plants than in seed plants. MIKCC gene knockouts show that functional redundancy characterises some members of this gene group in Physcomitrella, while gene knock-downs display a multifaceted mutant phenotype affecting both the gametophyte and sporophyte and implicating at least some MIKCC genes in reproductive functions (Singer et al. 2007).

Five of the MIKC* genes in A. thaliana are expressed in pollen (Kofuji et al. 2003; Pina et al. 2005; Verelst et al. 2007a, b; Adamczyk and Fernandez 2009). AGL66 and AGL104 function redundantly in pollen germination and their protein products form heterodimers with those of the remaining three genes. The sixth MIKC* gene is expressed in siliques (de Folter et al. 2004). Little is known about functions of MIKC* genes in other tissues. The expression of all six of the Arabidopsis MIKC* genes in embryos, of five MIKC* genes (all but AGL67) in inflorescences, four MIKC* genes (excepting AGL66 and AGL104) in seedlings (Lehti-Shiu et al. 2005) and AGL67 in siliques (de Folter et al. 2004) suggests that MIKC* functions are diverse in Arabidopsis and not restricted to male gametophytic tissue. In ferns, MIKC* genes are expressed in both the gametophyte and the sporophyte generations (Kwantes et al. 2011). In the lycophyte, Selaginella moellendorffii, two of the MIKC* genes are expressed exclusively in microsporangia while the third is also expressed in vegetative tissues. However, in another lycophyte, Lycopodium annotinum, the MIKC* gene, LAMB1, is expressed exclusively in strobili, the reproductive structures of the sporophyte generation (Svensson et al. 2000). In the moss, Funaria hygrometrica, MIKC* genes are expressed primarily in gametophytes, particularly protonemata (Zobell et al. 2010).

Here, we describe the sequences and architectures of the 26 genes that comprise the complete complement of MADS-box genes in Physcomitrella. We draw attention to the unusually high degree of conservation within and between the MIKCC and MIKC* subtypes and provide evidence that gene conversion has not played a significant role in maintaining sequence similarity. Using the tools of phylogenetic analysis, we have attempted to discern the evolutionary relationships among these genes as well as their relationships to MADS-box genes in other plant taxa. From our investigation of the scaffold locations of closely related MADS-box genes and neighbouring genes, we provide evidence of the gene duplications responsible for expansion of the MADS-box family in the bryophyte lineage leading to Physcomitrella patens.

Materials and methods

Identification and annotation of genes

MADS-box genes in the Physcomitrella patens genome were identified using the keyword “MADS” and the Advanced Search tool in the JGI (US Department of Energy’s Joint Genome Institute) database. In addition, tblastn (Altschul et al. 1990) searches of JGI’s database were performed using the default settings and, as queries, the amino acid sequences of each known MADS-box gene in Physcomitrella and of each novel gene as its sequence was discovered. Similar searches were performed to identify MADS-box genes in JGI’s genome databases for Ostreococcus lucimarinus and O. tauri and the existence of one MADS-box gene in each species has been verified by Palenik et al. (2007). EST evidence for each gene was sought in Unigene and the Cosmoss Physcomitrella genome database.

Coding sequences of MADS-box genes were derived by virtual translation of the 4-kb nucleotide sequence downstream from the 5′ end of the MADS-box using the ExPASy translation tool (Gasteiger et al. 2003) and meticulous comparison of these DNA and amino acid sequences with JGI’s predicted gene models, ESTs (expressed sequence tags) representing Physcomitrella MADS-box genes and also genomic sequences of already identified MADS-box genes of Physcomitrella and other plants. Following release of the Cosmoss v1.6 gene models, conserved N-terminal sequences were added to our MIKCC sequences. In addition, motif searches were performed using the Physcomitrella sequences and various sets of representative MADS domain protein sequences from green algae and vascular plants as input for MEME, version 3.5.4 (Bailey and Elkan 1994). Exon–intron boundaries were determined by identifying splice sites in the genomic DNA sequences that conformed to the Physcomitrella consensus splice sites (Rensing et al. 2005) and that resulted in coding sequences that matched EST evidence and conserved motifs.
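The virtual translation step can be illustrated with a minimal sketch. The code below translates the three forward reading frames of a nucleotide sequence using the standard genetic code. It is not the ExPASy tool itself, and the function names and the toy input sequence are hypothetical, chosen only for illustration.

```python
from itertools import product

# Standard genetic code, laid out in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate one forward reading frame.

    '*' marks a stop codon and 'X' marks a codon containing a character
    other than A, C, G or T. Trailing incomplete codons are ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def forward_frames(dna):
    """Return the translations of frames 0, 1 and 2 as a dictionary."""
    return {frame: translate(dna, frame) for frame in range(3)}


# Hypothetical toy sequence, not taken from any Physcomitrella gene.
frames = forward_frames("ATGGGGCGTTAA")
```

In-frame stop codons appear as '*' in the output, which is how interruptions of the kind used to recognise pseudogenes would show up in such a translation.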

All 26 genes were annotated manually in the JGI database.

Sequence alignment and phylogenetic analysis

Sequences were aligned in Clustal W (Thompson et al. 1994) and adjusted by eye in MacClade (Maddison and Maddison 2001) where necessary. Weighted maximum parsimony (WMP) and maximum likelihood (ML) trees were constructed using PAUP* (Swofford 1998) and Bayesian analyses were performed using MrBayes (Huelsenbeck and Ronquist 2001; Ronquist and Huelsenbeck 2003). Model testing for Bayesian and ML analyses of DNA sequences was performed using Modeltest 3.7 (Posada and Crandall 1998) or jModelTest (Guindon and Gascuel 2003; Posada 2008) and the best models were selected according to the Akaike Information Criterion. Burn-in for all Bayesian trees was 25 % of the samples. For all phylogenetic trees, gaps were treated as missing data. Bootstrap support and posterior probabilities are reported as follows: high ≥85 %, moderate 70–84 % or low 50–69 %. Branches with <50 % support were collapsed into polytomies.
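The reporting scheme for branch support can be written as a small lookup. The sketch below assumes that support values are expressed as percentages and that non-integer values falling between the stated bands (for example 84.5 %) belong to the lower band; the function name is hypothetical.

```python
def support_class(value):
    """Classify a bootstrap value or posterior probability given in percent."""
    if not 0 <= value <= 100:
        raise ValueError("support must be a percentage between 0 and 100")
    if value >= 85:
        return "high"
    if value >= 70:
        return "moderate"
    if value >= 50:
        return "low"
    # Branches below 50 % are collapsed into polytomies.
    return "collapsed"
```

For example, a branch with 72 % bootstrap support would be reported as moderately supported, whereas a branch with 49 % support would not appear in the tree.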

Unrooted Bayesian and WMP trees were constructed using the 60 amino acid MADS domain sequences of 143 genes from representative, phylogenetically informative plant taxa including the full complement of MADS-box genes from Physcomitrella (except for the pseudogenes, PPMA5 and PPTIM6), 99 genes from Arabidopsis, eight genes from the ferns, Ceratopteris richardii and Ceratopteris pteroides, five genes from the clubmoss, Lycopodium annotinum, one gene each from the spikemosses, Selaginella remotifolia and S. moellendorffii, the charophycean green algae, Chara globularis, Coleochaete scutata, and Closterium peracerosum-strigosum-littorale, and the chlorophyte green algae, Ostreococcus lucimarinus and O. tauri. For the Bayesian tree, the mixed model and the default settings except for nchains = 8 were used and three million generations were performed. For the WMP analysis, MaxTrees was set at 600 and support for the inferred tree was measured using 500 bootstrap replicates.

Sequences used for the Physcomitrella and Arabidopsis genes are available from JGI’s database (See Electronic Supplementary Material S1 for Protein ID numbers) and The Arabidopsis Information Resource (TAIR) (A hyperlinked list of Arabidopsis MADS-box genes is available at http://www.arabidopsis.org/browse/genefamily/MADSlike.jsp), respectively. GenBank accession numbers for ferns, lycophytes and charophytes are as follows: CRM1 (Y08014), CRM3 (Y08239), CMADS1 (U91415), CMADS2 (U91416), CMADS4 (U95609), CerMADS1 (D89670), CerMADS2 (D89671), CerMADS3 (D89672), LAMB1 (AF232927), LAMB2 (AF425598), LAMB3 (AF425599), LAMB4 (AF425600), LAMB6 (AF425602), SrMADS1 (AB086021), CgMADS1 (AB035567), CpMADS1 (AB091476), CsMADS1 (AB035568). A Mα gene (BXZC46793; ti/1415749262), from S. moellendorffii, was found by a blastn search of the whole genome sequence collection in the GenBank trace archive, using the MADS-box sequence of PPTIM2 as query, and has since been named MADS15 (Banks et al. 2011; Gramzow et al. 2012). Sequences for OlTIM1 from O. lucimarinus (Protein ID 120540) and OtTIM1 from O. tauri (Protein ID 38053) are from the respective JGI databases (http://genome.jgi.doe.gov/Ost9901_3/Ost9901_3.home.html and http://genome.jgi.doe.gov/Ostta4/Ostta4.home.html).

Because the relationships among the type II genes of Physcomitrella were not fully resolved in this comprehensive tree, separate rooted trees of MIKCC and MIKC* genes of Physcomitrella were constructed by WMP, Bayesian and ML methods. DNA sequences of MIKCC and MIKC* genes used for the respective trees included the complete coding sequences except for small portions of sequence that could not be unambiguously aligned. We used Physcomitrella type II genes in preference to genes from other taxa to root the trees since, in addition to the MADS box, a large portion of the I region and the extended K-box (Krogan and Ashton 2000) could be aligned unambiguously. This maximised the resolution and robustness of the trees. We chose two genes, one from each major clade of MIKC* genes (PpMADS2 and PPM6), and two genes, one from each major clade of MIKCC genes (PPM1 and PPMC6), to root the MIKCC and MIKC* trees, respectively.

For completeness and clarity of presentation, trees of Physcomitrella type I genes were similarly constructed. Type I trees were rooted with the sole MADS-box gene present in each of O. lucimarinus and O. tauri. All of the sequences used for the type I trees consisted of the most conserved middle portion of the MADS box, comprising 150 nucleotides.

Alignments are available upon request. In the Bayesian analyses, 500,000 generations, 800,000 generations and 100,000 generations were performed for the MIKCC, MIKC* and type I trees, respectively. The robustness of each Physcomitrella gene tree was measured using 1,000 replicates for both WMP and ML. In the Discussion, we have substituted “closely related” for “close phylogenetic relatedness is inferred” to avoid repetitive unwieldy phrasing.

Detection of putative gene conversion events

Putative gene conversion tracts were sought with RDP3 software (Heath et al. 2006), which uses several methods to detect recombination: RDP (Martin and Rybicki 2000), BOOTSCAN (Martin et al. 2005), GENECONV (Sawyer 1989; Padidam et al. 1999), Maximum Chi-square (Smith 1992; Posada and Crandall 2001), CHIMAERA (Posada and Crandall 2001), sister scanning (Gibbs et al. 2000), and 3SEQ (Boni et al. 2007). Default settings were used.

Analysis of presumptive duplications

Physical separations and relative orientations of pairs of MADS-box genes were investigated to evaluate the significance of tandem and segmental duplications in P. patens. All genes located between linked MADS-box genes or within 50 kb segments flanking each MADS-box gene were identified, using JGI’s Genome Browser page and linked Protein pages.
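The flanking-window search was carried out in the JGI Genome Browser, but its logic can be sketched in a few lines. The example below assumes that all genes lie on the same scaffold and that coordinates are given in base pairs; the `Gene` type, the function name and the example coordinates are hypothetical.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int  # leftmost coordinate on the scaffold, in bp
    end: int    # rightmost coordinate on the scaffold, in bp


def genes_in_flanks(target, genes, flank=50_000):
    """Return names of genes overlapping the target or its flanking windows.

    A gene is reported if any part of it lies between (target.start - flank)
    and (target.end + flank). The target itself is excluded.
    """
    window_start = target.start - flank
    window_end = target.end + flank
    return [
        gene.name
        for gene in genes
        if gene.name != target.name
        and gene.end >= window_start
        and gene.start <= window_end
    ]


# Hypothetical coordinates for illustration only.
mads = Gene("madsA", 100_000, 104_000)
scaffold = [
    Gene("g1", 40_000, 49_000),
    Gene("g2", 49_500, 51_000),
    mads,
    Gene("g3", 150_000, 156_000),
    Gene("g4", 160_000, 161_000),
]
neighbours = genes_in_flanks(mads, scaffold)
```

Comparing the neighbour lists obtained for two related MADS-box genes is one simple way to look for the shared gene order that suggests segmental duplication.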

To investigate whether MADS-box genes in Physcomitrella may have been duplicated by transposition, the JGI Genome Browser was used to search for transposons near MADS-box genes. In addition, 8 kb of DNA flanking each MADS-box gene was searched manually for evidence of polyadenylate sequences that might indicate the involvement of non-viral retrotransposons or some form of reverse transcription.
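The search for polyadenylate sequences was performed manually. A scripted equivalent might look like the sketch below, which reports runs of A (or of T, for a tract on the opposite strand) in a flanking sequence. The minimum run length of 10 is an assumption made for illustration, since no threshold is specified in the text, and the function name is hypothetical.

```python
import re


def find_poly_a(seq, min_run=10):
    """Return (start, end) positions of A or T runs of at least min_run bases.

    Positions are 0-based and end-exclusive. The default min_run is an
    illustrative assumption, not a value taken from the Methods.
    """
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_run, min_run))
    return [(m.start(), m.end()) for m in pattern.finditer(seq.upper())]


# Hypothetical flanking sequence: a 12-base A run and a 9-base T run.
flank = "GC" * 5 + "A" * 12 + "GCGC" + "T" * 9
hits = find_poly_a(flank)
```

Any hit would still need to be examined by eye, for example for an adjacent putative reverse transcriptase gene, before being taken as evidence of retrotransposition.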

The occurrence of a paleopolyploidisation event in Physcomitrella between 30 and 60 MYA was postulated by Rensing et al. (2007) on the basis of a clear peak in rates of synonymous substitution (Ks) in ESTs representing gene paralogues. The peak in Ks values was confirmed in a similar study of genomic sequences of gene paralogues (Rensing et al. 2008). To identify MADS-box gene paralogues that may have been generated during the proposed polyploidisation period, Ks values were calculated for all pair-wise alignments of MIKCC genes and MIKC* genes using the method described by Rensing et al. (2007). Cutoff values of 0.5 < Ks < 1.1 were chosen to encompass the ranges of Ks values that defined the peaks for the ESTs (0.6 < Ks < 1.1) and the genomic sequences (0.5 < Ks < 0.9). Because some pairs of type I genes could not be aligned unambiguously, Ks values were calculated only for pairwise alignments of genes that clustered closely together at the extremities of the phylogenetic tree.
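Applying the cutoff to a table of pairwise Ks values is straightforward. The sketch below assumes that the Ks values have already been calculated and stored in a dictionary keyed by gene pair; the function name, gene names and Ks values shown are hypothetical and are not results from this study.

```python
def pairs_in_ks_window(ks_by_pair, lower=0.5, upper=1.1):
    """Return gene pairs whose Ks lies strictly inside the window.

    The default bounds follow the cutoff 0.5 < Ks < 1.1.
    """
    return sorted(
        pair for pair, ks in ks_by_pair.items() if lower < ks < upper
    )


# Hypothetical Ks values for illustration only.
example_ks = {
    ("geneA", "geneB"): 0.72,
    ("geneA", "geneC"): 1.35,
    ("geneB", "geneC"): 0.50,
}
candidates = pairs_in_ks_window(example_ks)
```

Because the inequalities are strict, a pair with Ks exactly equal to 0.5 or 1.1 is excluded.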

Results

A search of JGI’s database for Physcomitrella revealed twenty-six MADS-box genes of which one type I gene, five MIKCC genes and six MIKC* genes were known before sequencing of the genome (Electronic Supplementary Material S1).

MADS-box hits not accompanied by K-box hits were classified as type I genes. The type I genes were named P. patens Type I MADS1-8 (PPTIM1-8). PPTIM2 and PPTIM3 were classified as Mα genes because they encode the motif FSFGHPSIDYV, which closely resembles the consensus sequence YSFGHP(F)DAV characteristic of Mα proteins in Arabidopsis and rice (De Bodt et al. 2003a; Pařenicová et al. 2003).

Novel MIKC MADS-box genes revealed by our searches were categorised as MIKCC or MIKC* by comparing their sequences and architectures with those of previously evaluated and classified Physcomitrella type II genes (Krogan and Ashton 2000; Henschel et al. 2002; Hohe et al. 2002; Riese et al. 2005; Singer et al. 2007). A previously unnamed MIKCC gene and a novel gene were named Physcomitrella patens MIKCC 5 (PPMC5) and PPMC6, respectively. The novel MIKC* genes were termed P. patens MIKC Asterisk5 (PPMA5), PPMA8, PPMA9, PPMA10, PPMA11 and PPMA12. EST data are available for all the novel type II genes and three of the novel type I genes.

Virtual sequences comprising 138 amino acid residues beginning with the MADS domains of PPTIM6, PPTIM7 and PPTIM8 are 50 % identical although the corresponding DNA coding sequence of PPTIM6 is interrupted by five in-frame nonsense codons. In addition, the MADS-box of PPMA5 contains two putative insertions that perturb the translational reading frame. Scrutiny of the genomic sequences of PPTIM6 and PPMA5 failed to reveal potential splice sites that would allow the joining of conserved sense sequences. Thus, both genes were classified as pseudogenes. PPTIM6 was included only in the duplication analysis and PPMA5 was excluded from further analyses except where noted.

The Advanced Search tool in the JGI database yielded a 27th putative MADS-box gene (Protein ID 121924). Its amino acid sequence was only 17 % identical to the sequence of PPM1 when aligned in Clustal W (Thompson et al. 1994) and, when used as a query sequence in a tblastn search of the NCBI database, yielded no MADS-box gene hits. In addition, MEME did not detect a MADS domain in the sequence. Therefore, in our opinion, this gene is not a MADS-box gene and we did not investigate it further.
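Percentage identity figures such as those quoted above depend on how the aligned positions are counted. The sketch below shows one common convention, in which positions with a gap in either sequence are ignored; this convention is an assumption for illustration and is not necessarily the one used to obtain the values reported here. The function name and toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two aligned sequences of equal length.

    Columns containing a gap in either sequence are skipped, so the
    denominator is the number of gap-free columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Hypothetical toy alignment, not taken from any Physcomitrella protein.
identity = percent_identity("MGRGKIEI", "MGRKKIQI")
```

Other conventions, such as dividing by the full alignment length or by the length of the shorter sequence, would give lower values for gapped alignments.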

Conservation of type II MADS-box gene sequences and architectures in Physcomitrella

Amino acid residues in the MADS domains of type II proteins are identical at 35 of 60 positions (Fig. 1a). Conserved and semi-conserved substitutions (conservation of amino acid groups with strongly or weakly similar properties, respectively, as defined by Clustal W) exist at another 20 positions. The amino acid residues at 11 positions are perfectly conserved within the MIKCC and MIKC* subtypes but differ between the two. In addition, the two subtypes may be distinguished by their different motifs at the C-terminal end of the MADS domain.

Fig. 1

Multiple sequence alignment of type II MADS domains (a) and extended K domains (b) in P. patens. The pseudogene PPMA5 has been omitted. Highlighting indicates the first eight conserved amino acid residues that differ between MIKCC proteins (pink) and MIKC* proteins (yellow). A six-amino acid motif at the C-terminal end of the MADS domain makes distinguishing the sequences of MIKCC (blue) and MIKC* (green) proteins facile. Another six amino acid motif (teal) at the C-terminal end of the K domain is almost perfectly conserved in all type II proteins. Amino acid residues occupying positions corresponding to the a and d positions of heptad repeats in Arabidopsis MIKC proteins or identified as important for protein interactions (Kaufmann et al. 2005) are underlined in the sequence of PPMA12

We have used a traditional definition of the extended K domain, although Kwantes et al. (2011) have provided some evidence that a large section of the I domain may have resulted from duplication of a portion of the K domain. The K domain sequences of MIKCC and MIKC* proteins are identical at 14 of 89 positions and display conserved or semi-conserved substitutions at an additional 23 positions (Fig. 1b). The motif RVRARK in the K domain is identical in 15 of the 17 type II proteins and differs at only one position in the other two. The amino acid residues at 10 positions are identical within the MIKCC and MIKC* subtypes but differ between the two groups.

The positions of hydrophobic amino acid residues in the heptad repeats of K1, K2 and K3 and of two hydrophobic amino acid residues, which lie outside the heptad repeats but are important for protein interactions in SEP3 (Kaufmann et al. 2005), are identical in the six Physcomitrella MIKCC sequences (Fig. 1b). The pattern of hydrophobic amino acid residues in K1 and K2 of PI (Kaufmann et al. 2005) is identical to that in the Physcomitrella MIKC* sequences. However, the positions of hydrophobic amino acid residues in K3 and of the other two hydrophobic amino acid residues mentioned above are not conserved in the Physcomitrella MIKC* proteins.

Proteins encoded by PpMADS-S, PPMC5 and PPMC6 lack the motif, NRLHANIS/LPSVRI, corresponding to DNA at or very near the 3′ end of the coding sequence of the other three MIKC C genes. Interestingly, however, vestiges (underlined in the sequences below) of this motif may be discernible by virtually translating portions of sequence sequestered in the flanking 3′ untranslated regions (UTRs) of these genes. Thus, ignoring the last nucleotide before the stop codons in PpMADS-S, PPMC5 and PPMC6 and continuing translation to the next available stop codon in each 3′ UTR generates the motifs, NR V HAN FP, NRLHA IFP P QGKQYKLHCSFGE and NRLHA TFQ P RGK, respectively.

Exon–intron architecture is highly conserved in the Physcomitrella MADS-box genes. The MIKCC genes contain 9 exons except for PPMC6, which lacks Intron 5 (Fig. 2). Furthermore, intron phases are identical in all six MIKCC genes with the exception of Intron 8 and the missing intron in PPMC6 (Electronic Supplementary Material S2).

Fig. 2

Architecture of the type II MADS-box genes in P. patens. Coloured rectangles (N-terminal region, pink; MADS-box, red; I-region, yellow; K-box, blue; C-terminal region, purple) and black lines represent exons and introns, respectively. Exons and introns are numbered above and below, respectively, PPM1 and PpMADS2. The position of Intron 9 and the approximate position of Intron 11, both absent in PpMADS2, are indicated by arrows

In the MIKC* genes, one I-region exon is absent in PPM3 and PPM4 (Fig. 2) as noted by Henschel et al. (2002). Intron 7 is missing in PPM6, PPM7 (Riese et al. 2005), PPMA9, PPMA10 and PPMA11. The first exonic sequence in the C-terminal region is absent in PPM3 and PPM4, fused to the K-box in PpMADS2, PpMADS3, PPM6, PPMA8, PPMA9 and PPMA12, and continuous with the next downstream exonic sequence in PPM7. In PpMADS2 and PPMA12, one continuous exonic sequence corresponds to the two long C-terminal exons in the remaining MIKC* genes. Excluding the fact that certain introns are absent in some of the MIKC* genes, all intron phases are conserved in MIKC* genes and positions of introns are conserved except that, when compared with the other MIKC* genes, the position of Intron 11 differs by a single codon in PPM6 and PPMA9. Similarly, the position of Intron 12 differs by one codon in PpMADS2, PpMADS3, PPMA8 and PPMA12 (Electronic Supplementary Material S3).

Phylogenetic analyses

Our comprehensive (multi-taxon) Bayesian tree (Electronic Supplementary Material S4) was consistent overall with trees published by others (see, for example, Becker and Theißen 2003; Kofuji et al. 2003; Pařenicová et al. 2003; Gramzow et al. 2012). The MIKCC genes from Physcomitrella formed a single cluster supported by a high posterior probability. The Physcomitrella MIKC* genes clustered together in a highly supported group within a moderately supported larger cluster that included the MIKC* genes from Arabidopsis. LAMB1 appeared separately from the other genes.

The Physcomitrella type I genes were grouped into three separate clusters. PPTIM1 with PPTIM4 and PPTIM5 formed a cluster supported by a high posterior probability. PPTIM2 and PPTIM3 formed a group that clustered, with a high posterior probability, with the S. moellendorffii gene MADS15 as a sister, within a cluster that included the majority of the Mα genes from Arabidopsis. The posterior probability for the larger cluster was low, however, and the remaining Mα genes from Arabidopsis formed a separate group. PPTIM7 and PPTIM8 clustered with all but two of the Mβ genes from Arabidopsis in a cluster within a larger cluster that included the remaining Mβ genes and all of the Mγ genes from Arabidopsis as well as the unclassified type I gene, AGL33. The posterior probability for this cluster was low.

The comprehensive WMP tree was generally consistent with the Bayesian tree although the WMP tree provided less resolution and the majority of the bootstrap values were numerically lower than the Bayesian posterior probabilities. The Physcomitrella MIKC* genes formed a cluster with moderate bootstrap support, separate from the Arabidopsis MIKC* genes. PPTIM2 and PPTIM3 formed a cluster with SmMADS15, with high bootstrap support, but the Mα genes from Arabidopsis formed several separate clusters. PPTIM7 and PPTIM8 formed a cluster within a larger cluster that included, with low bootstrap support, all of the Arabidopsis Mβ and Mγ genes.

The topologies of the WMP, Bayesian and ML trees of Physcomitrella genes were identical for the MIKC* genes but somewhat different for the MIKCC genes and the type I genes. The MIKC* genes (Fig. 3a) were resolved, with high support, into two main clades each of which contained smaller, highly supported clades. Thus, one of the main clades comprised two smaller clusters, (PpMADS2, PPMA12) and (PpMADS3, PPMA8), and the other incorporated two subclades, the first one containing PPM6 and PPMA9 and the second containing two smaller clusters. One of these comprised PPM3 and PPM4 and the other included PPM7 as sister to a clade containing PPMA10 and PPMA11. A second WMP analysis (not shown), including the pseudogene, PPMA5, its sequence having been aligned with the other MIKC* sequences by manually correcting for two presumed indels in the MADS domain, produced a tree with a subclade consisting of PPMA5 and PPM7 and otherwise identical topology.

Fig. 3

Rooted weighted maximum parsimony (WMP) trees of the major groups of P. patens MADS-box genes: a MIKC*, b MIKCC and c type I. MIKC*, MIKCC and type I genes are shown in blue, red and green, respectively. Bootstrap values and posterior probabilities from the WMP, Bayesian and ML trees are shown top to bottom, left to right

In the WMP tree, the MIKCC genes were resolved into two highly supported clusters, (PPM1, PpMADS1, PPM2) and (PpMADS-S, PPMC5, PPMC6) (Fig. 3b). Furthermore, the former clade contained a highly supported subclade consisting of PPM1 and PpMADS1 and the latter clade also included a highly supported subclade comprising PPMC5 and PPMC6. The Bayesian and ML trees were consistent with the WMP tree with respect to the first clade, although support was moderate or low for some branches. However, in the Bayesian tree, the second clade and the PPMC5-PPMC6 subclade within it were moderately supported. In the ML tree, PpMADS-S, PPMC5 and PPMC6 were unresolved.

The Physcomitrella type I gene trees (Fig. 3c) revealed identical relationships to those seen in our comprehensive tree (Electronic Supplementary Material S4) except that, in the WMP tree, the clade containing the Mα genes, PPTIM2 and PPTIM3 was sister to the clade containing PPTIM1, PPTIM4 and PPTIM5.

Duplications

One triplet and four pairs of type II MADS-box genes are located on five DNA scaffolds (Fig. 4a), with a combined length of approximately 10.9 Mb, from a total of 2,106 scaffolds, corresponding to approximately 480 Mb. Pairs comprising linked MIKC* genes are separated by a minimum of 6 kb and a maximum of 24 kb. Two pairs each consist of a MIKCC gene and a MIKC* gene, separated by 83 and 224 kb.

Fig. 4

Evidence for tandem and segmental duplications in P. patens. Light blue rectangles represent segments of scaffolds containing duplicated MADS-box genes and linked genes (pentagons pointing in the 5′ → 3′ direction). Spaces between the black lines represent DNA of the lengths indicated. (Diagrams are not to scale.) a Locations of one triplet and four pairs of type II MADS-box genes on five scaffolds. MIKCC genes are shown in pink and MIKC* genes in yellow. PIP genes and CLASP genes are shown in purple and brown, respectively. b Segment of DNA containing PPTIM4 and PPTIM5, which are oriented in opposite directions with approximately 3 kb separating their MADS-boxes. c PPTIM2 and PPTIM3 (dark blue) and regions (approximately 27 and 46 kb downstream from PPTIM2 and PPTIM3, respectively) containing genes encoding mitochondrial transcription termination factors (green), two similar predicted proteins of unknown family (pink), Trehalose-6-phosphate synthase components TPS1 and related subunits (purple), and the catalytic subunits of a Serine/Threonine protein phosphatase 2A (red). d PPMC5 and PPMC6 (pink) and segments (approximately 37 and 19 kb downstream from PPMC5 and PPMC6, respectively) containing genes encoding DDHC-type zinc-finger proteins (purple), RNA binding proteins (green), a GTP-binding signal recognition particle SRP54 (blue), an HNH endonuclease (brown), an asparaginase (red) and an unknown predicted gene (orange)

The type I genes, PPTIM4 and PPTIM5, are also physically linked in a tail-to-tail arrangement with approximately 3 kb separating their respective MADS-boxes (Fig. 4b).

PPTIM2 and PPTIM3 are located in syntenic arrangements with four other genes encoding a mitochondrial transcription termination factor, an unclassified predicted protein, the trehalose-6-phosphate synthase component TPS1 and related subunits, and the catalytic subunit of serine/threonine protein phosphatase 2A within approximately 27 and 40 kb, respectively (Fig. 4c). A duplicate of the fourth gene is located immediately downstream from the first copy on the scaffold containing PPTIM3. Similarly, duplicate sets of DDHC-type zinc-finger genes and RNA binding protein encoding genes are linked, within 35 kb, to PPMC5 and PPMC6 in the same order and relative orientations (Fig. 4d).

Search for transposable elements located within or near MADS-box genes

No transposable element was found overlapping a MADS-box gene and no putative DNA transposase or helitron was detected on any scaffold containing a MADS-box gene. No polyadenylate sequence was found within 500 nucleotides (the approximate maximum length of a SINE) on either side of a MADS-box gene, and none accompanied by a putative reverse transcriptase gene was found within 8 kb (the approximate maximum length of a LINE).

Discussion

Gene duplication

Gene families expand in number and roles by tandem and segmental duplications with subsequent nonfunctionalisation (pseudogenisation) of one copy of each duplicate pair or retention of both genes (Ohno 1970; reviewed in Zhang 2003). Duplicate copies may be preserved as functionally redundant genes resulting in increased amounts of gene product (Kondrashov and Koonin 2004) or, in the case of large-scale duplications, such as diploidisation, preservation of the stoichiometry of dimerisation or complexing of the gene products (Lynch and Conery 2000). Alternatively, duplicate genes may diversify in sequence and expression with concomitant neofunctionalisation (Ohno 1970) or they may partition the functions of the ancestral gene (subfunctionalisation) (Force et al. 1999).

PPTIM4 and PPTIM5 are tandemly arrayed genes (Fig. 4b) that are closely related (Fig. 3c), suggesting that they are the result of tandem duplication. Conversely, pairs of MADS-box genes, in some cases linked to homologues of other genes, appear to have been copied during whole genome duplication or other large-scale segmental duplication events. The synteny involving PPTIM2 and PPTIM3 and linked homologues of four other genes on scaffolds 81 and 88, respectively, implies that these linkage groups arose by segmental duplication.

Although the subclade comprising PPMC5 and PPMC6 with PpMADS-S as sister is strongly supported only in the WMP tree (Fig. 3b), other evidence corroborates these relationships. The synteny surrounding PPMC5 and PPMC6 indicates that they are a duplicate gene pair (Fig. 4d). In addition, in PpMADS-S, PPMC5 and PPMC6 the first intron is significantly longer (1,005–1,116 bp) than the corresponding intron in PPM1, PPM2 and PpMADS1 (565–652 bp). Finally, the three genes of the former clade all possess nonsense (translation stop) codons at the same upstream position relative to genes of the latter clade (Electronic Supplementary Material S2). Because the only putative gene conversion tract detected by RDP3 consisted of the first 80 amino acids in PpMADS-S and PPMC6, tree construction may have been confounded by this tract.

In most instances, the component genes of physically linked MIKC gene pairs (Fig. 4a) are more closely related to genes within one or several other linked pairs (on different scaffolds) than they are to each other (Fig. 3a, b), suggesting that the genes have been duplicated together during segmental duplication events. For example, the linked genes, PPMA9 and PpMADS3, are closely related phylogenetically to the linked genes, PPM6 and PPMA8, respectively. PPM3, which is linked to the pseudogene PPMA5, is closely related to PPM4, itself linked to PPMA11.

Similarly, PPM2 and PpMADS3 are linked genes which are closely related to the linked genes PPM1 and PpMADS2, respectively. Genes encoding plasma membrane intrinsic protein (PIP) subfamily aquaporins, PpPIP2;4 and PpPIP2;2, are situated within 8 and 22 kb, respectively, of the MIKC C genes, PPM1 and PPM2 (Fig. 4a). Interestingly, PpMADS1, the gene most closely related to PPM1 and PPM2, is also linked to a nearby PIP gene, PpPIP2;3. A second copy of PpPIP2;4 is located approximately 27 kb upstream from the first copy. These are four of the five PIP genes that comprise one of three clades of PIP genes in Physcomitrella (Danielson and Johanson 2008).

PpMADS2 is oriented in the opposite direction from PPM1, PPM2, PPMA9 and PpMADS3 and is separated from PPM1 by approximately 224 kb, whereas the distance between PpMADS3 and PPM2 is only approximately 92 kb. Therefore, PpMADS2 plus a flanking DNA segment was probably inverted during or subsequent to the duplication. To investigate this possibility, genes located in the 100 kb segment immediately 5′ to PpMADS2 were identified and compared with the genes situated between PPM2 and PpMADS3 (data not shown). No similarity was found, indicating that either the initial chromosomal rearrangement was not a simple inversion or subsequent structural reorganisation destroyed or relocated the expected synteny.

Recent peaks of retrotransposon activity have occurred in Physcomitrella (Rensing et al. 2008). However, we found no evidence that duplication of MADS-box genes had occurred by transposition of any kind. Although it remains possible that (retro) transposon-mediated duplication of MADS-box genes in Physcomitrella occurred and can no longer be detected, we suggest that duplication of MADS-box genes by unequal crossing over between repetitive DNA elements, possibly including transposons, within the genome and polyploidisation are more likely explanations.

The Ks values of most pairs of closely related MADS-box genes in Physcomitrella (Electronic Supplementary Material S5) fall within the range of Ks values (0.5 < Ks < 1.1) representing the polyploidisation period proposed by Rensing et al. (2007), suggesting that these sets of gene duplicates may have been generated during this event. Since the Ks values for two of three pairwise comparisons of genes in each of the two major clades of MIKC C genes and in all three pairwise comparisons of genes in the clade comprising PPM7, PPMA10 and PPMA11 are within this range, it is possible that a large-scale segmental duplication occurred just before or very soon after the proposed diploidisation.

Model of MADS-box gene duplication in Physcomitrella

We propose a parsimonious model of two tandem and three segmental duplications, which can account for the expansion of the MADS-box gene family from 4 members to 26 in the P. patens lineage. In our model (Fig. 5), the names of extant genes have been used for the sake of simplicity, but it should be noted that the genes, which were actually duplicated, are the closest ancestors of those named.

Fig. 5

A parsimonious duplication model rationalising expansion of the MADS-box gene family. All MIKC C genes (pink), MIKC* genes (yellow) and type I genes (blue) existing at the end of each of steps 1 through 4 (proposed diploidisation) are shown in the corresponding numbered panels. Proposed gene losses are shown as pentagonal outlines. Horizontal and vertical arrows represent segmental and tandem duplications respectively; the curved arrow represents the proposed inversion of a DNA segment. Losses (-) and gains (+) of individual exons (E) and introns (I), denoted by numbers, are indicated in parentheses after gene names

A plausible sequence of events is that a MIKC C gene, a MIKC* gene, a Mα and a Mβ gene, existing prior to the divergence of the bryophytes and the tracheophytes, passed into the P. patens lineage as PPM2, PpMADS3, PPTIM2 and PPTIM7, respectively. The following steps then occurred in sequence.

Step 1

Tandem duplication of PpMADS3 gave rise to PPMA9 and, sometime later, Intron 7 of PPMA9 was lost.

Step 2

Segmental duplication of PPM2, PPMA9 and PpMADS3 gave rise to PpMADS-S, PPMA11 and PPM4. During or following the duplication, chromosomal rearrangement separated PpMADS-S from PPMA11 and PPM4. PPMA11 gained Intron 9 and PPM4 lost Exon 3 and Exon 10. An indel resulted in introduction of a stop codon and truncation of the coding sequence of PpMADS-S. PPTIM2 and PPTIM7 were not copied during this step or their duplicates were subsequently lost.

Step 3

A second, large-scale segmental duplication resulted in copying of PPM2 and PpMADS3 to give rise to PPM1 and PpMADS2. During this step or subsequently, a lengthy DNA segment containing PpMADS2 was inverted. PpMADS2 lost Intron 11 at some point. PPMA9 either was not duplicated or its duplicate was lost, possibly during this inversion. PpMADS-S was duplicated giving rise to PPMC5. PPMA11 and PPM4 were also copied to produce PPMA5 (which subsequently lost Intron 10) and PPM3. PPTIM2 was duplicated resulting in PPTIM1, which subsequently diverged in sequence such that it is no longer recognisable as an Mα gene. PPTIM7 was duplicated giving rise to PPTIM6, which degenerated, becoming a pseudogene.

Step 4

In a third segmental duplication, possibly the polyploidisation proposed by Rensing et al. (2007), PPMA9 and PpMADS3 were duplicated, giving rise to PPM6 and PPMA8. PPM1 was copied to produce PpMADS1. PpMADS2 was duplicated giving rise to PPMA12. In addition, PPMA11 was duplicated to give rise to PPMA10. PPMC5 was copied, producing PPMC6, which subsequently lost Intron 5. PPMA5 was duplicated to produce PPM7. Later, PPMA5 lost Exon 2 and deteriorated further, becoming a pseudogene through the introduction of frameshifts caused by indels. PPTIM2 was copied to give rise to PPTIM3 and duplication of PPTIM1 produced PPTIM5. PPTIM7 was duplicated giving rise to PPTIM8.

Step 5

PPTIM5 gave rise to PPTIM4 by means of a recent tandem duplication (not shown in Fig. 5).
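The bookkeeping behind Steps 1 to 5 can be checked with a short tally. The sketch below simply encodes each parent-to-copy duplication named in the steps and counts the loci present after each one. The pairing of parents with copies follows the order in which the genes are listed in each step, which is our reading of the text rather than something stated explicitly. The tally counts every copy named, including those described as having become pseudogenes (PPMA5 and PPTIM6). The names `FOUNDERS`, `DUPLICATIONS` and `expand` are introduced here for illustration.

```python
from collections import Counter

# The four genes proposed to have entered the P. patens lineage.
FOUNDERS = ["PPM2", "PpMADS3", "PPTIM2", "PPTIM7"]

# (step, parent, copy) for every duplication named in Steps 1 to 5.
DUPLICATIONS = [
    (1, "PpMADS3", "PPMA9"),
    (2, "PPM2", "PpMADS-S"), (2, "PPMA9", "PPMA11"), (2, "PpMADS3", "PPM4"),
    (3, "PPM2", "PPM1"), (3, "PpMADS3", "PpMADS2"), (3, "PpMADS-S", "PPMC5"),
    (3, "PPMA11", "PPMA5"), (3, "PPM4", "PPM3"),
    (3, "PPTIM2", "PPTIM1"), (3, "PPTIM7", "PPTIM6"),
    (4, "PPMA9", "PPM6"), (4, "PpMADS3", "PPMA8"), (4, "PPM1", "PpMADS1"),
    (4, "PpMADS2", "PPMA12"), (4, "PPMA11", "PPMA10"), (4, "PPMC5", "PPMC6"),
    (4, "PPMA5", "PPM7"), (4, "PPTIM2", "PPTIM3"), (4, "PPTIM1", "PPTIM5"),
    (4, "PPTIM7", "PPTIM8"),
    (5, "PPTIM5", "PPTIM4"),
]


def expand(founders, duplications):
    """Apply the duplications step by step.

    Returns a mapping from each locus to the founder it descends from,
    plus the number of loci present at the start and after each step.
    """
    lineage = {gene: gene for gene in founders}
    counts = [len(lineage)]
    for step in sorted({s for s, _, _ in duplications}):
        for s, parent, copy in duplications:
            if s == step:
                # A copy inherits the founder lineage of its parent.
                lineage[copy] = lineage[parent]
        counts.append(len(lineage))
    return lineage, counts


lineage, counts = expand(FOUNDERS, DUPLICATIONS)
per_founder = Counter(lineage.values())
print(counts)        # loci at the start and after each of the five steps
print(per_founder)   # loci descended from each founder
```

Under this reading the tally runs 4, 5, 8, 15, 25 and 26 loci, with 12 of the 26 descending from PpMADS3 (the MIKC* lineage, including the pseudogene PPMA5).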

Our phylogenetic tree of MIKC* genes does not display a node representing a hypothetical gene that is ancestral to PpMADS3 and PPM4 but not to PPMA9 (Fig. 3a) and, thus, does not support our model with respect to Step 2. In contrast, two other lines of evidence do support Step 2: the close linkages of PPMA9 to PpMADS3 and of PPMA11 to PPM4, and the presence in PPM4 and PPM3 of an intron that is also present in PpMADS3 and its proposed descendants but is lacking in PPMA9 and its putative descendants.

This model of MADS-box gene expansion in the Physcomitrella lineage from 4 to 26 genes is parsimonious, requiring only five steps: a tandem duplication followed by three multigene segmental duplications (the last of which is consistent with polyploidisation) and a recent event in which a single gene was copied. Overall agreement of the evidence from phylogenetic trees, chromosomal linkages and gene architectures indicates that our model is robust. A small discrepancy is that the Ks values for one MIKC C and two paralogous PPTIM gene pairs generated in Step 4, namely PPM1-PpMADS1 (Ks = 0.48), PPTIM7-PPTIM8 (Ks = 0.40) and PPTIM1-PPTIM5 (Ks = 3.0), fall, respectively, very slightly, slightly and significantly outside the range of values corresponding to the polyploidisation period proposed by Rensing et al. (2007). However, all other MIKC C and MIKC* paralogues and the PPTIM2-PPTIM3 gene pair produced in Step 4 have Ks values within the expected range. Moreover, the Ks value for PPTIM4-PPTIM5 is low (0.49), consistent with this paralogous gene pair being produced by a recent duplication as proposed in our model. It should be noted that Ks values >1.0 are generally interpreted cautiously because they are error-prone due to the occurrence of multiple synonymous substitutions at each synonymous site (Blanc and Wolfe 2004).
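The comparison of Ks values with the proposed polyploidisation window can be written as a simple classification. The sketch below uses only the four Ks values quoted in the paragraph above and the window 0.5 < Ks < 1.1 attributed to Rensing et al. (2007). The function name `classify_ks`, the labels it returns, and the treatment of the window bounds as exclusive are illustrative choices, not part of the original analysis.

```python
KS_MIN, KS_MAX = 0.5, 1.1   # polyploidisation window, 0.5 < Ks < 1.1
KS_SATURATION = 1.0         # above this, Ks estimates are treated cautiously

# Ks values quoted in the text for four paralogous gene pairs.
QUOTED_KS = {
    "PPM1-PpMADS1": 0.48,
    "PPTIM7-PPTIM8": 0.40,
    "PPTIM1-PPTIM5": 3.0,
    "PPTIM4-PPTIM5": 0.49,
}


def classify_ks(ks):
    """Place a Ks value relative to the proposed polyploidisation window."""
    if ks <= KS_MIN:
        return "below window"
    if ks >= KS_MAX:
        return "above window"
    return "within window"


for pair, ks in QUOTED_KS.items():
    note = " (possibly saturated)" if ks > KS_SATURATION else ""
    print(f"{pair}: Ks = {ks} -> {classify_ks(ks)}{note}")
```

The remaining pairwise values are in Electronic Supplementary Material S5 and are not reproduced here.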

This model is attractive both for its simplicity and because it implies that Step 3 may represent a second polyploidisation. If it was not a polyploidisation, the ancestral genes duplicated within it must have been linked, or duplicated more or less simultaneously during a burst of transposon activity (which we think is unlikely), in order for Step 3 to be considered a single event. A less attractive option is that Step 3 is a collection of non-simultaneous but sequentially equivalent duplications, which have in common only that they preceded the polyploidisation hypothesised by Rensing et al. (2007).

Based on an estimate of 172 million years for the age of the Funariidae (Newton et al. 2007) and chromosome numbers reported for Funaria (4, 14, 21, 28, 42, 56) and Physcomitrella (14, 27, 28), Rensing et al. (2007) proposed that independent polyploidisation events have occurred in the Funariaceae and that the whole genome duplication in the Physcomitrella lineage probably occurred after speciation among the Funariaceae. However, pairwise orthology between eleven MIKC* genes in Funaria hygrometrica and eleven MIKC* genes in Physcomitrella (Zobell et al. 2010) provides compelling evidence that expansion of the MIKC* gene complement to 11 genes occurred before divergence of Funaria and Physcomitrella. If whole genome duplication had occurred in the Physcomitrella lineage after speciation, 22 MIKC* genes should have resulted. The pseudogene PPMA5 might be the product of subsequent deterioration of one of these genes, still leaving 10 genes unaccounted for. Therefore, we presume that the polyploidisation proposed by Rensing et al. (2007) occurred in a common ancestor of the two moss genera.
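The gene-count argument in the preceding paragraph is a short piece of arithmetic, spelled out below. The variable names are ours; the numbers are those given in the text.

```python
shared_mikc_star = 11      # MIKC* genes with pairwise orthologues in Funaria
extant_mikc_star = 11      # MIKC* genes observed in Physcomitrella
known_pseudogenes = 1      # PPMA5

# A whole genome duplication after speciation would double the shared set.
expected_after_wgd = 2 * shared_mikc_star

# Copies that would have to have been lost without leaving a detectable trace.
unaccounted = expected_after_wgd - extant_mikc_star - known_pseudogenes

print(expected_after_wgd, unaccounted)
```

The ten unexplained losses are what make a post-speciation duplication the less economical scenario.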

Functional significance of gene duplication in Physcomitrella

Sequence conservation

Type II MADS-box genes in Physcomitrella are highly conserved in both sequence and architecture. Clustal W alignments reveal identity of 35 amino acid residues in the MADS domain and 14 in the K domain of Physcomitrella type II proteins (Fig. 1). In sharp contrast, amino acid residues are identical at seven positions in the MADS domain of type II proteins in Arabidopsis and sequence identity is not found at any position outside the MADS domain (not shown). A possible explanation of these observations is that gene conversion has occurred frequently within the MADS-box gene family in Physcomitrella. However, recombination detection software failed to provide evidence that this is the case.
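Counting positions of complete identity in an alignment, as in the comparison above, amounts to finding columns in which every sequence carries the same residue. A minimal sketch follows. The function name `conserved_columns` is ours, and the three short sequences are invented toy data rather than Physcomitrella or Arabidopsis proteins.

```python
def conserved_columns(aligned):
    """Return 0-based indices of columns identical across all sequences.

    `aligned` is a list of equal-length strings from a multiple alignment.
    Columns consisting only of gap characters are not counted.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    identical = []
    for i in range(length):
        residues = {seq[i] for seq in aligned}
        if len(residues) == 1 and "-" not in residues:
            identical.append(i)
    return identical


# Toy alignment (hypothetical sequences, for illustration only).
toy_alignment = ["MGRGKIEIK", "MGRGKVEIK", "MGRGKIELK"]
toy_identical = conserved_columns(toy_alignment)
print(len(toy_identical), toy_identical)
```

Applied separately to the MADS-domain and K-domain columns of an alignment, such a count would give figures comparable to those quoted above.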

Conservation of gene sequences and EST evidence suggest that the majority of MADS-box genes in Physcomitrella are functional. However, the significance of retention of highly similar MADS-box gene homologues is unclear since, in general, duplicate transcription factor genes appear to have been preferentially retained following whole genome duplications in Arabidopsis, but not in Physcomitrella (Rensing et al. 2007 and references within).

According to the gene dosage hypothesis, duplicate genes that are retained in a genome provide an enhanced gene dosage effect that is beneficial to the organism (Kondrashov and Koonin 2004). Alternatively, the gene balance hypothesis predicts that duplicates of genes that encode interacting proteins are preferentially retained after a large-scale duplication event such as polyploidisation to preserve the stoichiometry of interaction (reviewed in Birchler and Veitia 2007). However, the results from gene knockouts, albeit involving a limited number of MIKC genes, suggest that gene dosage and/or gene balance cannot be the only explanations for retention of duplicated MIKC genes in the Physcomitrella genome (Singer et al. 2007; Singer and Ashton 2009). A third possibility is that the retention of duplicated gene copies may contribute to robustness (Gu et al. 2003; Félix and Wagner 2008). Evidence exists for selective retention of the SEPALLATA 1 (SEP1)-SEP2 and SHATTERPROOF 1 (SHP1)-SHP2 duplicate pairs of MADS-box genes in Arabidopsis (Moore et al. 2005). In addition, retention of duplicate genes may allow for the evolution of differential expression and/or an expanded repertoire of protein complexes, thereby contributing to morphological elaboration (Kaufmann et al. 2005; Veron et al. 2007). Future investigation of MADS domain protein interactions in Physcomitrella will be particularly interesting since it holds the prospect of revealing parallels between retention of pairs of duplicated genes and patterns of co-expression and protein dimerisation.

Evolution of the MADS-box gene family and the land plant body plan

The MADS-box complement of 26 genes in P. patens is intermediate (Rensing et al. 2008) between that found in green algae and angiosperms and similar to that found in the relatively simple vascular plant Selaginella (Gramzow et al. 2012) (Fig. 6). This suggests a possible relationship between expansion of the MADS-box gene family and elaboration of both the gametophytic and sporophytic plant body plans. Our analysis provides strong evidence that much of the expansion of the MADS-box gene family in Physcomitrella to its current size occurred within the lineage leading to Physcomitrella after its divergence from the tracheophyte lineage. A striking difference between the MADS-box gene family in Physcomitrella and that in vascular plants is the preponderance of MIKC* genes in Physcomitrella. Therefore, expansion of MIKC* genes in the moss lineage may have been related to elaboration of the gametophytic plant body plan, as has been suggested by Gramzow et al. (2012).

Fig. 6

Histogram of the numbers of type I and type II MADS-box genes in the genomes of Ostreococcus tauri (Ostta), Physcomitrella patens (Phypa), Selaginella moellendorffii (Selmo) and Arabidopsis thaliana (Arath)

Type I MADS-box genes, MIKC C genes and MIKC* genes have all been implicated in seed plant reproductive development, and some MIKC C genes have a reproductive function in Physcomitrella (gametangia formation) (Quodt et al. 2007; Singer et al. 2007) and in charophycean algae (haploid reproductive cell differentiation) (Tanabe et al. 2005). It is plausible, therefore, that an ancestral regulator–target relationship between MADS-domain transcription factors and effector gene regulatory elements has been conserved during land plant evolution while expansion and divergence of the MADS-box family has paralleled elaboration of both gametophytic and sporophytic body plans.

Further progress in understanding the evolution of MADS-box genes in Physcomitrella will require continuing the functional characterisation of MIKC genes and extending it to include type I genes. This task is necessary but daunting. The high level of sequence conservation within each of the three groups of Physcomitrella MADS-box genes raises the prospect that functional redundancy will make single gene knockouts uninformative for determining gene functions, as has already been shown for the three MIKC C genes in the PPM2-like clade (Singer et al. 2007; Singer and Ashton 2009). This study provides information about gene sequences, phylogenetic relationships and chromosomal linkages that can guide the choice of optimal subsets of genes for multiple gene targeting experiments and thereby maximise the likelihood of successfully determining MADS-box gene functions in this bryophyte.