Introduction

Understanding the evolutionary history and relationships of early land plants, particularly the mosses, liverworts, hornworts and lycophytes, is critical because they form the connecting link between aquatic and land plants and represent the pioneering group of green plants that colonized the terrestrial milieu. This colonization was accompanied by the evolution of several vital adaptations, namely alternation of generations between the gametophytic (haploid) and sporophytic (diploid) phases, development of an elaborate gametophyte, specific gametangia, and spore walls with peculiar characteristics of desiccation tolerance (Nickrent et al. 2000; Renzaglia et al. 2000; Mosquna et al. 2009; Okano et al. 2009). A recent study on the model bryophyte, Funaria hygrometrica, revealed that the qualitative and quantitative differences in gene expression between the sporophytic and gametophytic generations are smaller than those observed in Arabidopsis thaliana (Szövényi et al. 2011). In addition, comparative analysis of generation-biased gene expression between Funaria and Arabidopsis showed limited conservation. However, a few genes displayed sporophyte-biased expression in both model species and may belong to pathways associated with adaptation to the terrestrial habitat (Szövényi et al. 2011). Comparative expression analyses among plants from diverse phylogenetic lineages have suggested that conserved regulatory mechanisms are involved in the growth and development of reproductive tissues in both higher plants and early land plants, regardless of their morphological diversity. The comparison between Funaria and Arabidopsis also indicated a drastic change in the function of genes and their products during the course of evolution (Szövényi et al. 2011).

Evolution of gene families is driven by several forces, including whole-genome, segmental and local duplications that give rise to paralogous copies. Genome-level polyploidization events (auto- and allo-polyploidy) generate multiple paralogs which, through genome fractionation, form homeologs and can undergo functional diversification (Cheng et al. 2018; Gu et al. 2002; Zhang 2003). This process varies significantly between species and gene families (Lespinet et al. 2002). It is well established that the evolution of protein function is a consequence of natural selection that leads to sequence diversity (Hughes 2002; Bornberg-Bauer and Albà 2013).

The OVATE FAMILY PROTEINs (OFPs) derive their name from the ovate locus, which was postulated to be associated with fruit shape in tomato, with the recessive condition responsible for pear shape (Hedrick and Booth 1907; Price and Drinkard 1908). Subsequently, the ovate locus was mapped, cloned and shown to be associated with fruit shape in tomato (Ku et al. 1999, 2001; Liu et al. 2002). OFPs are members of a plant-specific multigene family, usually with a conserved OVATE domain; homologs of OFPs have so far been reported in higher plants, mosses and lycophytes (Wang et al. 2016). In spite of the OFP family being reported across a wide taxonomic spectrum, functional analyses of only a few OFPs have been undertaken. For instance, only a small number of AtOFP family members have been experimentally characterized and shown to be involved in the regulation of plant growth and development; information about the rest of the members is unavailable or limited (Wang et al. 2011). Hitherto, a few studies across various plant species, namely Arabidopsis, tomato, pepper, rice and banana, have suggested that OFPs are conserved, act as transcriptional repressors, and regulate several aspects of plant growth and development (Liu et al. 2014; Schmitz et al. 2015; Wang et al. 2016). A recent study in rice has also proposed a role for OFPs in brassinosteroid signaling (Yang et al. 2016). Nevertheless, only fragmentary information is available about OFPs, and there is an obvious lack of knowledge about the genomic status, organization, evolution, and function/s of members of the OFP gene family in early land plants.

We therefore sought to investigate the diversity of OFPs among the early land plants, the bryophytes and lycophytes, in the present study. We identified and examined OFPs in Marchantia polymorpha (Mp), Physcomitrella patens (Pp), Selaginella moellendorffii (Sm) and Sphagnum fallax (Sf). To the best of our knowledge, this is the first exhaustive and detailed report on the genome-wide analysis of OFPs and their diversity in these plants. The results of the present investigation, viewed from an evolutionary perspective, may provide insights into the functional diversity of OFPs and their contribution to the evolution from early land plants to higher plants, and pave the way for functional analysis.

Materials and Methods

Genome-Wide Identification of OFPs in Early Land Plants

OVATE family proteins encoded in the Marchantia polymorpha, Physcomitrella patens, Selaginella moellendorffii and Sphagnum fallax genomes were retrieved from the Phytozome database (https://phytozome.jgi.doe.gov/pz/portal.html). AtOFP sequences, together with a keyword search, were used as queries to identify putative OFP orthologs in early land plants with the Basic Local Alignment Search Tool (BLASTp), using the BLOSUM62 comparison matrix with default parameters. High-scoring pair scores were manually examined and an output list was prepared. Subsequently, redundant entries were manually removed to select unique sequences for further investigation.
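The redundancy-removal step described above was performed manually. As an illustration only, it could be scripted along the following lines. This is a minimal sketch, not the authors' procedure: it assumes BLASTp results exported in the standard 12-column tabular format, and the query and subject identifiers and scores in the example are hypothetical.

```python
def unique_best_hits(tabular_lines):
    """Keep one record per subject ID: the hit with the highest bit score.

    Assumes the standard 12-column BLAST tabular layout, in which column 1
    is the query ID, column 2 the subject ID and column 12 the bit score.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, subject_id = fields[0], fields[1]
        bit_score = float(fields[11])
        if subject_id not in best or bit_score > best[subject_id][1]:
            best[subject_id] = (query_id, bit_score)
    return best


# Hypothetical hits: two different queries recover the same subject (geneA).
example_hits = [
    "AtOFP1\tgeneA\t45.0\t60\t33\t0\t1\t60\t10\t69\t1e-10\t55.1",
    "AtOFP5\tgeneA\t52.0\t60\t29\t0\t1\t60\t10\t69\t1e-15\t80.2",
    "AtOFP1\tgeneB\t40.0\t58\t35\t0\t3\t60\t12\t69\t1e-08\t48.9",
]
non_redundant = unique_best_hits(example_hits)
```

Each subject gene appears once in `non_redundant`, paired with the query that gave its best-scoring alignment; manual inspection of the retained hits would still be needed.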

Identification of Genomic Localization, Conserved Domains and Motifs

The distribution of OFPs in the genome of all the four selected early land plants was mapped using MapChart (Voorrips 2002). Conserved domain sequences in proteins were identified using NCBI SMART (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml) and Pfam (http://pfam.sanger.ac.uk) databases. The conserved motifs in proteins were identified using Multiple EM for Motif Elicitation (MEME) (http://meme-suite.org/tools/meme) keeping parameters of maximum number of ten motifs with the width of 5–50 amino acids (Table 1).

Table 1 Comparison between structural properties of OFP in early land plants

Structural and Physicochemical Characterizations of Putative OFPs

The details of length and location of OFPs were retrieved from Phytozome database. The gene structures of OFPs were determined using NCBI Splign tool (https://www.ncbi.nlm.nih.gov/sutils/splign/splign.cgi). Molecular weight and isoelectric point were calculated by the compute pI/MW tool in ExPASy server (http://web.expasy.org/compute_pi/). RAPID (Regression-based Accurate Prediction of protein Intrinsic Disorder content) software was used to determine the protein disorderness (http://biomine.cs.vcu.edu/servers/RAPID/). GRAVY (grand average of hydropathicity) score was calculated using sequence manipulation suite (http://www.bioinformatics.org/sms2/protein_gravy.html).
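For readers unfamiliar with the GRAVY score, it is the mean Kyte-Doolittle hydropathy value over all residues of a protein; negative values indicate an overall hydrophilic protein. The sketch below shows the calculation under that definition. It is not the code of the Sequence Manipulation Suite, and the test peptides are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Return the grand average of hydropathicity of a protein sequence."""
    residues = sequence.upper()
    if not residues:
        raise ValueError("empty sequence")
    # A KeyError here signals a non-standard residue symbol.
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


# Hypothetical peptides: a charged (hydrophilic) and an aliphatic (hydrophobic) one.
hydrophilic_score = gravy("DEKR")
hydrophobic_score = gravy("AILV")
```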

Subcellular Localization

Localization of proteins was determined using BLAST2GO analysis (http://www.blast2go.com) as well as LocTree3 (http://www.rostlab.org/services/loctree3/). Nuclear export signals were analyzed by NetNES1.1 prediction server (http://www.cbs.dtu.dk/services/NetNES/) (La Cour et al. 2004).

Phylogenetic Analysis of OFPs

Protein sequences of Marchantia polymorpha, Physcomitrella patens, Selaginella moellendorffii and Sphagnum fallax OFPs were aligned using ClustalW as implemented under MEGA 6.06. The alignment was used to estimate the phylogenetic relationship using neighbor-joining method and 1000 bootstrap replicates were performed to measure the stability of the branch nodes (Kumar et al. 2008). Divergence time was studied using the RelTime method in MEGA 6.06 through the maximum likelihood method (Schwartz and Dayhoff 1979; Tamura et al. 2012, 2013). The synonymous (Ks) and non-synonymous (Ka) substitution rates and their ratio (Ka/Ks) in case of all OFPs were calculated using Ka/Ks calculator (http://services.cbu.uib.no/tools/kaks). Aligned coding sequences of all OFPs were used along with the phylogenetic tree to calculate the ratio for each node. Gene duplications were predicted through Plant Genome Duplication Database (PGDD) (http://www.chibba.agtec.uga.edu/duplication/) using Locus search tool. The results were then analyzed manually.

Tertiary Protein Structure Prediction

Protein sequences of the conserved OVATE domain were submitted to the Phyre2 protein-modeling server (http://www.sbg.bio.ic.ac.uk/~phyre2) to predict their tertiary structures. Results obtained in PDB format were analyzed manually. To scrutinize the unique structures of Sm69626 from Selaginella and Sphfalx0027s0004 from Sphagnum, tertiary structures of a representative of each plant, i.e. Marchantia, Physcomitrella, Selaginella and Sphagnum, were aligned based on the presence of a similar domain using the MatchMaker tool in Chimera (Pettersen et al. 2004; https://www.cgl.ucsf.edu/chimera/).

Spatiotemporal Expression of OFP Genes in Physcomitrella and Selaginella

To study the transcriptional activity via spatio-temporal expression pattern of putative PpOFPs, individual sequences were used to identify the IDs from cosmoss genome web portal (http://www.cosmoss.org) and used to search for transcriptome data of PpOFPs (Ortiz-Ramírez et al. 2016). The mean FPKM values of ten different stages were employed in the web tool ClustVis (http://biit.cs.ut.ee/clustvis) (Metsalu and Vilo 2015) to generate an expression heat map. For spatiotemporal expression of SmOFPs, FPKM values of RNA seq data were used and the heat map was generated with ClustVis (Huang and Schiefelbein 2015).

Results

Identification of Ovate Family Proteins

A BLASTp search using the amino acid sequences of 19 AtOFPs was performed, along with a keyword search, in the Phytozome database v12.1 (http://www.phytozome.net) against the genomes of Marchantia, Physcomitrella, Selaginella and Sphagnum to identify members of the OFP family. Four genes predicted to encode OFPs were identified in Marchantia, 19 in Physcomitrella, 6 in Selaginella and 3 in Sphagnum (Fig. 1; Supplementary Table S1). The sizes of the OFP genomic sequences ranged from 2286 to 7184 bp (3.1-fold range) in Marchantia, 3582–4747 bp in Sphagnum (1.3-fold), 309–6812 bp in Physcomitrella (22-fold), and 189–6690 bp in Selaginella (35-fold). A comparative analysis of the range and fold change of transcript length, coding sequence (CDS) length and protein length revealed that the fold changes were small in both Sphagnum (1.3- and 1.6-fold for transcript and CDS, respectively) and Marchantia (2.03- and 1.46-fold for transcript and CDS, respectively). In contrast, the fold changes were large in Physcomitrella (18.5- and 10.8-fold for transcript and CDS, respectively) and Selaginella (27.11- and 27.08-fold for transcript and CDS, respectively). Of interest is the near similarity in length of transcript and CDS in Selaginella, implying the presence of extremely short UTRs. Analysis of exon–intron organization revealed that the majority of genes in Marchantia and Selaginella are intronless, whereas nearly 65% of the genes in both Sphagnum (2 out of 3) and Physcomitrella (12 out of 19) contain introns, albeit few and not exceeding four. Of the four members in Marchantia, only one (Mapoly0001s0515) contains a single intron, whereas Selaginella includes a notable exception, a single gene with 22 introns (Sm437676). Interestingly, in all cases the number of introns was found to be independent of the size of the genome and transcript (Fig. 1; Table 1; Supplementary Table S1).
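The fold ranges quoted for genomic length are simply the ratio of the longest to the shortest OFP gene within each species. The short sketch below reproduces them from the minimum and maximum lengths given in the text; the function and variable names are illustrative only.

```python
def fold_range(shortest, longest):
    """Ratio of the longest to the shortest sequence length."""
    if shortest <= 0:
        raise ValueError("lengths must be positive")
    return longest / shortest


# (shortest, longest) genomic length in bp, as reported in the text.
genomic_length_bp = {
    "Marchantia": (2286, 7184),
    "Sphagnum": (3582, 4747),
    "Physcomitrella": (309, 6812),
    "Selaginella": (189, 6690),
}

genomic_fold = {
    species: round(fold_range(shortest, longest), 1)
    for species, (shortest, longest) in genomic_length_bp.items()
}
```

The same ratio applied to the protein lengths reported later (for example 63 and 1707 amino acids in Selaginella) gives the corresponding protein fold ranges.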

Fig. 1

Genomic structure of OFPs present in the genomes of early land plants. OFPs present in the genomes of early land plants Marchantia, Selaginella, Physcomitrella and Sphagnum. Pink boxes represent the exons and the black lines represent the introns. (Color figure online)

Genomic Location of OFPs

Genomic locations of the members of the OFP family were mapped using MapChart, and most OFPs were found to lie towards the ends of the chromosomes (Fig. 2). In Physcomitrella, the 19 OFPs are distributed on 14 of the 27 chromosomes. Except for chromosome 2 (with three OFPs) and chromosomes 1, 10 and 21 (two OFPs each), all other chromosomes, viz. 3, 5, 6, 7, 11, 14, 17, 18, 23 and 26, carry a single OFP (Fig. 2a). In Marchantia, Sphagnum and Selaginella, the OFPs could only be mapped on scaffolds and not on chromosomes owing to the lack of complete genome information (Fig. 2b–d). In all three genomes, each scaffold harbored a single OFP. The four OFPs in Marchantia are individually located on scaffolds 1, 56, 142, and 170; the three OFPs in Sphagnum are distributed on Super_27, _39 and _60; and the six OFPs from Selaginella are located on scaffolds 0, 7, 9, 14, 35, and 57 (Fig. 2b–d).
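The chromosomal distribution of the PpOFPs can be cross-checked from the gene identifiers alone. The sketch below assumes that the number following "Pp3c" in each identifier denotes the chromosome; this naming convention is an assumption of the example rather than something stated in the text, although the resulting tally agrees with the distribution reported above. The identifier list is compiled from the PpOFPs named elsewhere in this article.

```python
import re
from collections import Counter

# The 19 PpOFP identifiers named in this article.
PP_OFP_IDS = [
    "Pp3c1_110", "Pp3c1_37090", "Pp3c2_26490", "Pp3c2_4871", "Pp3c2_8210",
    "Pp3c3_14980", "Pp3c5_25760", "Pp3c6_3500", "Pp3c7_9900",
    "Pp3c10_19270", "Pp3c10_12900", "Pp3c11_15470", "Pp3c14_21750",
    "Pp3c17_18070", "Pp3c18_15340", "Pp3c21_12190", "Pp3c21_16580",
    "Pp3c23_6720", "Pp3c26_5410",
]


def chromosome_of(gene_id):
    """Extract the chromosome number, assuming IDs of the form Pp3c<chr>_<n>."""
    match = re.fullmatch(r"Pp3c(\d+)_\d+", gene_id)
    if match is None:
        raise ValueError(f"unexpected identifier format: {gene_id}")
    return int(match.group(1))


ofps_per_chromosome = Counter(chromosome_of(gene_id) for gene_id in PP_OFP_IDS)
chromosomes_with_two = sorted(c for c, n in ofps_per_chromosome.items() if n == 2)
chromosomes_with_one = sorted(c for c, n in ofps_per_chromosome.items() if n == 1)
```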

Fig. 2

Chromosomal distribution of OFP genes. Diagrammatic representation of physical location of OFP genes on chromosomes of Physcomitrella patens (a), and on scaffolds of Marchantia (b), Sphagnum (c) and Selaginella (d)

Physicochemical Attributes of OFPs

The OFPs across the four species were analyzed for their molecular mass and pI. The molecular mass of the OFPs ranged between 7.8 and 188.9 kDa, with the majority of proteins (~ 69%) exhibiting a molecular mass between > 50 and 100 kDa. The molecular weight of the MpOFPs ranged between 63.8 and 94.4 kDa, whereas that of the Sphagnum OFPs was in the range of 63.1–99.3 kDa. In Physcomitrella and Selaginella, the molecular weight showed a wide range, varying between 11.6–123.9 kDa and 7.8–188.96 kDa, respectively. In Selaginella, most of the OFPs are low-molecular-weight proteins (< 30 kDa), except Sm437676 (1707 amino acids; ~ 189 kDa; Fig. 3a; Supplementary Table S1); the SmOFPs thus represent the two extremes, with sizes ranging from 7.86 kDa (Sm69626) to 189 kDa (Sm437676). Analysis of the pI of the OFPs revealed that most of the proteins (60%) across the four species have pI > 7 (Fig. 3b; Supplementary Table S1), and the pI of the remaining 40% lies between 4 and 7 (Fig. 3b). The widest pI ranges were detected in Selaginella (4.43–9.03) and Physcomitrella (5.27–9.34), whereas all the OFP members from Marchantia had pI in the narrow range of 8.58–9.94. Approximately 58% of the Physcomitrella OFPs displayed pI > 7, with the remaining ~ 42% in the range of 4–7. OFPs from Sphagnum fall in the pI range of 7–9.34 (Fig. 3b; Supplementary Table S1).

Fig. 3

Physicochemical attributes of OFPs. Physicochemical attributes of OFPs in early land plants were predicted using computational tools and include molecular weights (a), isoelectric points (b), GRAVY scores (c) and percentages of intrinsic disorderness (d)

To investigate other physicochemical attributes of the OFPs, entire protein dataset was analyzed with respect to their hydrophobicity (GRAVY values) and intrinsic disorder. All the OFP members across the early land plants showed GRAVY values in the negative range, suggesting their hydrophilic nature. Among these, Mapoly0170s0026 and Sphfalx0060s0089 are predicted to be the most hydrophilic with GRAVY value of − 1.04 and − 1.7, respectively (Fig. 3c; Supplementary Table S1). Furthermore, to predict the status of intrinsic disorder, all OFPs were analyzed through RAPID. The analysis revealed that Marchantia OFPs were in the range of ~ 31–46% intrinsic disorder content. On the other hand, Physcomitrella, Selaginella and Sphagnum showed their intrinsic disorder content in the ranges of ~ 13–45%, ~ 15–35% and ~ 31–45%, respectively (Fig. 3d; Supplementary Table S1).

Subcellular Localization of OFPs

Annotation and subcellular location of protein/s is vital to predict their function and possible biological significance. Subcellular localization of the OFP members was predicted through Blast2GO analysis along with LocTree3 algorithm. The analysis predicted their localization mostly in the nucleus; few of them were predicted to be localized in other subcellular compartments, namely extracellular matrix, cytoplasm and endoplasmic reticulum. Three out of 4 OFPs in Marchantia, 18 out of 19 in Physcomitrella, and all the 3 OFPs of Sphagnum were predicted to be nuclear localized. In a sharp contrast to these, of the six OFPs in Selaginella, only one was predicted to be localized in the nucleus (Fig. 4a; Supplementary Table S1). While the presence of the majority of OFPs in nucleus supports its role as transcription factors, localization of other OFPs in different subcellular compartments suggested their dynamic properties that are yet unknown or undiscovered, and assorted functions depending upon the subcellular localization.

Fig. 4

Subcellular localization of the OFP members. Subcellular localization of the OFP members of the early land plants was predicted through Blast2GO analysis along with LocTree3 algorithm. Pie chart showing the overall subcellular localization of the OFP members in Marchantia, Physcomitrella, Selaginella and Sphagnum (a). Twelve OFP members across the species showed the presence of nuclear export signal (NES) motifs in protein sequences (b)

We examined the presence of an NES, taken as an indirect evidence of proteins showing localization to nucleus, as well as in case of dual localization to other subcellular compartments via NetNES1.1 prediction tool (La Cour et al. 2004; Schmitz et al. 2015). The analysis revealed the presence of NES signal within and around the highly conserved OVATE domain in 12 OFP members (Fig. 4b; Supplementary Table S2). Most of the OFP members exhibited one or more residues predicted to be part of a NES motif within the OVATE domain or just outside the domain towards the N-terminal (Supplementary Table S2). The consensus sequence LEELLQ/F/CYLSL was found to be present across all the early land plants studied; diversity of nuclear export signal sequence was observed within the PpOFPs, viz., Pp3c21_12190, Pp3c11_15470, Pp3c2_26490 and Pp3c21_16580. Marchantia, and Selaginella displayed the presence of similar NES motif when compared to Physcomitrella, whereas divergence was observed in Sphagnum (Fig. 4b).

Phylogenetic Distribution of OFPs in Early Land Plants

To analyze the evolutionary relationship between OFPs from Marchantia, Physcomitrella, Selaginella and Sphagnum, an unrooted NJ tree was constructed from the alignment of full-length amino acid sequences of OFPs using MEGA 6.06 (Fig. 5). Phylogenetic reconstruction did not lead to organism-specific grouping of all the OFP members. The analysis revealed that the proteins are distributed in two main clades, while two members of PpOFP (Pp3c6_3500 and Pp3c5_25760) formed a separate clade. Clade I includes 12 proteins, whereas clade II comprises 18 proteins. Clade I can be further subdivided into three subclasses and consists mainly of PpOFPs (Pp3c1_110, Pp3c7_9900, Pp3c11_15470, Pp3c2_26490, Pp3c10_19270, Pp3c23_6720, Pp3c3_14980, Pp3c10_12900 and Pp3c26_5410); however, some of these share their immediate ancestry with MpOFPs (Mapoly0001s0515 and Mapoly0056s0035) and an SfOFP (Sphfalx0060s0089). Clade II comprises all SmOFPs (Sm117843, Sm69627, Sm437676, Sm69626, Sm407613 and Sm417707) as well as the remaining MpOFPs (Mapoly0142s0001 and Mapoly0170s0026), besides eight PpOFPs (Pp3c17_18070, Pp3c14_21750, Pp3c1_37090, Pp3c2_4871, Pp3c21_12190, Pp3c18_15340, Pp3c21_16580 and Pp3c2_8210) (Fig. 5).

Fig. 5

Phylogenetic analysis of OFP members in four early land plants. Phylogram of OFP members in Marchantia, Physcomitrella, Selaginella and Sphagnum constructed through an unrooted neighbor joining (NJ) method showed two major clades. The numerical on branches represents bootstrap values. The presence or absence of OVATE domain and conserved motifs is arranged on the right margin against each homolog. The presence of OVATE domain is indicated by filled black box, and the two motifs through filled yellow and green boxes. (Color figure online)

We observed several pairs of PpOFP that form paralogs/clades with higher bootstrap value, and may have evolved through recent duplication events. A closer inspection showed that the protein sequences of Pp3c1_37090 and Pp3c14_21750 share 74% similarity, while sequences of Pp3c2_4871 and Pp3c14_21750 showed 64% similarity; Pp3c2_26490 showed 55% sequence similarity with Pp3c11_15470, whereas sequence of Pp3c3_14980 exhibited higher similarity of 80% with Pp3c10_12900 (Phpat.010G050400). To understand the evolutionary history and possible cause of paralogy of the members of OFP, a comparative analysis of the Physcomitrella genome structure in Plant Genome Duplication Database (PGDD), chromosomal location and phylogenetic clustering among the OFP homologs within P. patens was undertaken which revealed segmental duplication being responsible for origin of the paralogous relationships (Fig. 6; Supplementary Figure S1).
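The text does not state how the pairwise similarity percentages were computed. As one common way to obtain such a figure, the sketch below calculates percent identity over the gap-free columns of a pairwise alignment. This is an assumed measure chosen for illustration (a similarity score that also credits conservative substitutions would give higher values), and the aligned sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # ignore gapped columns
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Hypothetical aligned fragments: 4 of the 5 gap-free columns are identical.
identity = percent_identity("MKV-LT", "MRVALT")
```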

Fig. 6

Chromosomal location of segmentally duplicated PpOFPs. Segmental duplication involving large chromosomal segments was confirmed through analysis of PGDD and the chromosomal location of these was plotted using MapChart showing the origin of the paralogous relationships between the OFPs of Physcomitrella genome

Topology and Characteristics of OFPs

Amino acid sequence analyses were performed using Multiple EM for Motif Elicitation (MEME) and SMART. The SMART analysis indicated the presence of OVATE domains, whereas the MEME analysis revealed two conserved amino acid stretches in OFPs. Members of the OFP family could be divided into two major groups depending on the presence or absence of the OVATE domain (Figs. 6, 7). All OFPs possessing the OVATE domain are clustered in group I, whereas group II contains proteins without the OVATE domain (Supplementary Figure S2). Group I is further divided into two subgroups, one with two conserved motifs (subgroup 1) and the other with a single motif (subgroup 2). Subgroup 1 includes most of the PpOFPs, all SfOFPs, SmOFPs (except Sm69627), and MpOFPs (except Mapoly0056s0035); subgroup 2 comprises only Sm69627, which has a single motif in the OVATE domain. Similarly, the group II members, which lack the OVATE domain, can also be divided into two subgroups. Subgroup 1 of group II includes proteins with a single motif, namely Pp3c10_19270, Pp3c23_6720, Pp3c11_15470, Pp3c7_9900, Pp3c2_26490, Pp3c3_14980, Pp3c26_5410 and Pp3c10_12900. The remaining three OFPs, i.e. Pp3c21_16580, Pp3c2_8210 and Mapoly0056s0035, lack both motifs and form subgroup 2 (Figs. 6, 7).

Fig. 7

Identification of OVATE domain and motifs. Domain and motif predictions by amino acid sequence analyses of the OFP proteins using SMART and MEME show clear demarcation along phylogenetic lines. Red line represents the amino acid sequences. The presence of OVATE domain is marked with black lines and distribution of the conserved motifs in OFPs is shown as yellow and green boxes. A NJ phylogram made from MEGA in the right panel shows their evolutionary relationships. (Color figure online)

In general, the presence of the conserved OVATE domain was a characteristic feature of the OFPs, although several members that are classified as OFPs (Mapoly0056s0035, Pp3c21_16580, Pp3c10_19270, Pp3c23_6720, Pp3c11_15470, Pp3c7_9900, Pp3c2_26490, Pp3c3_14980, Pp3c26_5410, Pp3c2_8210 and Pp3c10_12900) were found to lack this domain. In almost all cases examined, the OVATE domain is present at the C-terminus. Across the four species analyzed, OFPs exhibit extensive length polymorphism, ranging from 63 to 1707 amino acids (Fig. 7; Supplementary Table S1). In Marchantia, the length variation in proteins is 1.46-fold and ranges between 578 and 849 a.a., whereas OFPs from Sphagnum, Physcomitrella, and Selaginella showed 1.62- (557–903 a.a.), 10.90- (102–1112 a.a.) and 27.09-fold (63–1707 a.a.) variations in length, respectively. Interestingly, the Selaginella moellendorffii genome encodes both the smallest (Sm69626; 63 a.a.) and the largest (Sm437676; 1707 a.a.) OFPs among the genomes surveyed. Two of the SmOFPs, namely Sm69626 (63 a.a.) and Sm69627 (67 a.a.), lack the canonical N-terminus. The longest OFP, Sm437676, from Selaginella displayed a non-canonical structure wherein alpha amylase inhibitor, tRNAsynt_1g, Anticodon_1 and zfFPG_IIeRS domains are present at the N-terminus and middle region of the protein, with the OVATE domain located at the C-terminus (Fig. 7).

Divergence of OFPs During Evolution

To understand the phylogenetic relationship among the OFP members within and between the species, it is pertinent to evaluate their divergence along the time. Availability of the sequence data enabled us to study the estimated divergence times of the species. We estimated the divergence times and a linearized time tree was generated using the RelTime method with molecular clock approach. The tree in Fig. 8a has two multi-furcating nodes, each representing members of PpOFPs along with few members of SfOFP, MpOFP and SmOFP. The analysis revealed that PpOFPs (Pp3c17_18070, Pp3c14_21750, Pp3c1_37090, Pp3c2_4871, Pp3c21_12190 and Pp3c18_15340) diverged around the same time followed by Sm437676 and Sm69626. At or around the same time SfOFP Sphfalx0027s0004 also diverged, followed by SfOFP Sphfalx0039s0113 and Sm117843; Sm417707 and MpOFP Mapoly0142s0001 were found to have diverged before Sphfalx0039s0113 and Sm117843. Some PpOFPs diverged at the same time with that of SfOFP and MpOFP (Fig. 8a). Altogether, divergence time of the species was to some extent in accordance with their phylogeny as the OFP members of the same clades showed similar time of divergence irrespective of species (Figs. 5, 8a).

Fig. 8

Divergence time among OFP members of the four early land plants. A linearized time tree was generated to analyze the divergence time among the OFP members across the species of early land plants. The divergence time was estimated using the scale at the bottom (a). Synonymous (Ks), non-synonymous (Ka) substitution rate and Ka/Ks ratio were calculated through Ka/Ks calculator (b). Mean Ka value (c), mean Ks value (d) and the mean Ka/Ks ratios (e) for three groups had been calculated and represented in graph

Synonymous (Ks) and non-synonymous (Ka) substitution rates and the Ka/Ks ratio, a measure of selection pressure, were computed for all OFPs using the Ka/Ks calculator (Fig. 8b). Our study demonstrates that the mean Ks and Ka values are lower in group I than in the other groups (Fig. 8c, d). Furthermore, mean Ka/Ks values were calculated and plotted for all three groups; all were below 1.0, indicating purifying selection (Fig. 8e). Groups I and II include the proteins that possess the OVATE domain along with both motifs, and show Ka/Ks ratios of 0.49 and 0.6, respectively. Group III contains those OFPs that have only one motif and lack the OVATE domain, and exhibits a mean Ka/Ks ratio of 0.62 (Fig. 8e). The Ka/Ks ratio was lowest in group I, which indicates that these genes experienced stronger purifying selection and evolve more slowly than those of group II. Group III, with the highest ratio, is under the weakest selection pressure and in turn evolves at the fastest rate. Collectively, the Ka/Ks ratios suggest that the genes of the three groups evolve at different rates and under different selection pressures (Fig. 8b–e).
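The interpretation of the Ka/Ks ratio used here follows the usual convention: values below 1 indicate purifying selection, values near 1 neutral evolution, and values above 1 positive selection. The sketch below applies that convention to the three group means reported in the text. The tolerance band around 1 is an arbitrary assumption of this example, not a threshold used by the authors.

```python
def selection_regime(ka_ks, tolerance=0.05):
    """Classify a Ka/Ks ratio; `tolerance` is an illustrative band around 1."""
    if ka_ks < 0:
        raise ValueError("Ka/Ks cannot be negative")
    if ka_ks < 1 - tolerance:
        return "purifying"
    if ka_ks > 1 + tolerance:
        return "positive"
    return "neutral"


# Mean Ka/Ks ratios for the three groups, as reported in the text.
mean_ka_ks = {"group I": 0.49, "group II": 0.60, "group III": 0.62}

regimes = {group: selection_regime(ratio) for group, ratio in mean_ka_ks.items()}

# Groups ordered from strongest to weakest purifying selection (lowest ratio first).
by_constraint = sorted(mean_ka_ks, key=mean_ka_ks.get)
```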

Prediction of Protein Architecture

To elucidate the structural features of OFPs and impact of observed length and sequence diversity, the structure of conserved OVATE domain was predicted and analyzed through the Phyre2 protein-modeling server. The analysis revealed considerable structural similarity among the OFPs to the programmed cell death 4 (PDCD4) C-terminal MA-3 domain. Majority of the OVATE domain contains three α-helices joined by coils with hydrophobic amino acids facing the inner surface; the helices and the coils were formed by ~ 10–15, and 3–4 amino acids, respectively (Fig. 9a). Interestingly, the two outer helices contain higher number of turns than those present in the middle helix.

Fig. 9

Protein architecture of OVATE domain. Protein architecture of conserved OVATE domains was analyzed through the Phyre2 protein-modeling server (a). To understand the structural similarities and divergence, one representative member from each group was compared by sequence-based structure superimposition with the help of Chimera using MatchMaker tool. The near superimposable structures of the representative members were compared with that of structural anomaly of Sm69626 and Sphfalx0027s0004 (b, c)

All the OFPs across the four species were found to have nearly similar architecture, with the exception of Sm69626 and Sphfalx0027s0004. These two share the three α-helices found in the rest of the OFPs but have some additional unique features, and thus deviate structurally. In the case of Sm69626, domain and structure prediction revealed the presence of a domain showing similarity to the CO-type nitrile hydratase alpha subunit, along with the PDCD4 C-terminal MA-3 domain found in other OFPs. Therefore, to understand the impact, we selected one representative member from each group: Mapoly0142s0001 from Marchantia, Pp3c17_18070 from Physcomitrella and Sphfalx0060s0089 from Sphagnum, and compared them with Sm69626 by sequence-based structure superimposition using the MatchMaker tool in Chimera (Pettersen et al. 2004; https://www.cgl.ucsf.edu/chimera/). The structures of Mapoly0142s0001 and Pp3c17_18070 showed maximum overlap (Fig. 9b); in contrast, the structure of Sm69626 diverged from those of the representative OFPs (Fig. 9b). The superimposition of the predicted secondary structures, with an RMSD value of 0.424 Å, shows that they exhibit lower similarity owing to the tertiary structure of Sm69626. Superimposition of the predicted structures revealed a protruding α-helix at the C-terminal side in the secondary structure of Sm69626. This is perhaps due to the presence of two consecutive lysine residues (K18 and K19) at the start of the coil that connects this helix with the other two helices, which is supported by sequence alignment of Sm69626 with the other three representative OFPs (Fig. 9b). This is likely due to the charge and steric properties of lysine, which has a long side chain. Prediction of the secondary structure of Sphfalx0027s0004 showed the presence of a MIF4G domain, unlike the PDCD4 C-terminal MA3 domain present in other OFPs (Fig. 9c).
In spite of the presence of MIF4G domain instead of PDCD4 C-terminal MA3 domain, the predicted secondary structures superimposed with the OVATE domain in other representative OFPs showed near-perfect symmetry; except at the coil connecting the helices at C-terminal side forming a kink. This structural anomaly might be due to the presence of glutamate and valine (E39 and V40) residues towards the C-terminal in Sphfalx0027s0004, which is the part of coil connecting the helices. Sequence alignment indicated that in other members of OFP Val 40 is replaced either with alanine or proline. Therefore, perhaps it is the cumulative effect of the charge of Glu 39 and the steric property of the following amino acid (Valine/alanine/proline) which is playing a decisive role pertaining to their structural dissimilarity (Fig. 9c).
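The RMSD value used to compare superimposed structures is the root-mean-square distance between paired atoms. The sketch below computes it for coordinates that are already superimposed and already paired; it does not perform the alignment or fitting that MatchMaker carries out, and the coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points.

    Assumes the points are already paired and the structures already superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_sum = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        squared_sum += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Hypothetical paired atoms (coordinates in Å).
identical_rmsd = rmsd([(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
                      [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)])
shifted_rmsd = rmsd([(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)])
```

A low RMSD over the paired atoms indicates near-superimposable backbones; local deviations such as a protruding helix or a kink in a coil may still be visible only on inspection of the superimposed models.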

Spatiotemporal Expression Pattern of PpOFPs and SmOFPs

The spatiotemporal expression pattern of a gene is a reliable indicator that can be used to predict gene function/s. We extracted the expression data of PpOFPs from the transcriptome dataset of Ortiz-Ramírez et al. (2016), which comprises the global expression profile of ten developmental stages of Physcomitrella patens, namely chloronema and caulonema from the protonemal stage, gametophore, rhizoids, archegonia, spores, and four additional developmental stages of the sporophyte. Five of these stages (chloronema and caulonema from the protonemal stage, gametophore, rhizoids and archegonia) represent the haploid phase, whereas the remaining five (S1–S3, SM and spores) represent the diploid phase of the P. patens life cycle. A heat map generated with the ClustVis software (Metsalu and Vilo 2015) revealed the expression patterns of the different PpOFPs (Fig. 10a). Among the ten life-cycle stages represented, we could identify 1 OFP (Pp3c11_15470) that had its highest expression in chloronemal cells, 11 OFPs in caulonemal cells, 2 in archegonia (Pp3c1_110 and Pp3c3_14980), 1 OFP each with highest expression levels in S2, S3 and spores (Pp3c6_3500, Pp3c5_25760, and Pp3c2_4871, respectively), and 2 OFPs with highest levels in the mature sporophyte (SM; Pp3c2_8210 and Pp3c10_12900).

Fig. 10
Spatiotemporal expression analyses of OFP genes in Physcomitrella and Selaginella. Expression data of PpOFPs from the transcriptome dataset were extracted and global expression profile of ten developmental stages was compared in 19 PpOFPs (a). Graphical representation of expression of each PpOFP was shown in different stages (b). Similarly, Selaginella expression of six SmOFPs was examined in root transcriptome (c)

Eleven of the 19 OFPs from P. patens (Pp3c21_12190, Pp3c7_9900, Pp3c2_26490, Pp3c14_21750, Pp3c10_19270, Pp3c21_16580, Pp3c23_6720, Pp3c17_18070, Pp3c1_37090, Pp3c26_5410 and Pp3c18_15340) showed their highest levels of expression in caulonema. Similarly, Pp3c1_110 and Pp3c3_14980 showed their highest level of expression in archegonia, and Pp3c2_4871 was most highly expressed in spores. Pp3c6_3500 showed a considerably high level of expression in the S1–S3 stages, whereas Pp3c1_110, Pp3c26_5410, and Pp3c2_4871 exhibited stage-specific expression. The remaining OFPs, such as Pp3c5_25760, Pp3c2_8210, and Pp3c10_12900, were expressed at all stages of the life cycle analyzed (Fig. 10b).
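The assignment of each gene to its stage of peak expression, as summarized above, can be illustrated with a minimal Python sketch. This is not the workflow used in the study (the heatmap was produced with ClustVis); the function names and the toy expression values are hypothetical and serve only to show the logic of picking the stage with the maximum value per gene and tallying genes per stage.

```python
from collections import Counter


def peak_stage(expression):
    """Map each gene to the developmental stage with its highest expression.

    `expression` is a dict of {gene: {stage: value}} (hypothetical layout).
    """
    return {gene: max(stages, key=stages.get) for gene, stages in expression.items()}


def count_by_peak_stage(expression):
    """Count how many genes peak in each stage."""
    return Counter(peak_stage(expression).values())


# Toy values invented for illustration only; NOT taken from the transcriptome dataset.
toy = {
    "geneA": {"chloronema": 5.0, "caulonema": 9.0, "archegonia": 2.0},
    "geneB": {"chloronema": 1.0, "caulonema": 3.0, "archegonia": 7.5},
    "geneC": {"chloronema": 4.0, "caulonema": 8.0, "archegonia": 6.0},
}

peaks = peak_stage(toy)
counts = count_by_peak_stage(toy)
print(peaks)
print(counts)
```

A tally of this kind only reports where each gene peaks; it says nothing about how strongly the peak stands out from the other stages, which is why the heatmap and per-gene plots remain necessary.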

The expression profiles of SmOFPs were extracted from the transcriptome of S. moellendorffii roots, which comprises the meristematic zone (MZ) and a combined dataset of the elongation zone (EZ) and differentiation zone (DZ), designated EDZ (Huang and Schiefelbein 2015). Combining EZ and DZ was necessary because of their overlapping characteristics in the Selaginella root. Of the six OFPs identified in S. moellendorffii, Sm69626 was expressed at equally high levels in both samples; Sm11783 showed relatively higher transcript accumulation in the MZ, and Sm417707 showed the reverse pattern; the remaining three SmOFPs were expressed at nearly equal levels in both datasets (Fig. 10c). However, these data are restricted to roots, and complete information on the expression of SmOFPs in other tissues would provide a broader view of their possible functions.

Discussion

OVATE family proteins (OFPs) are plant-specific transcriptional repressors that play significant roles in several biological processes during plant growth and development (Wang et al. 2011). OFPs have been identified in different plant genomes, but only fragmentary information is available about the OFPs present in the genomes of the early land plants. In this study, we investigated various genomic features of OFPs present in Marchantia, Physcomitrella, Selaginella, and Sphagnum. We identified 4 OFPs in Marchantia, 19 in Physcomitrella, 6 in Selaginella and 3 in Sphagnum (Supplementary Table S1). A comparison of genome size with OFP copy number did not reveal a direct correlation; e.g. M. polymorpha, S. fallax and P. patens, with 1C genome sizes of 0.29 pg, 0.44 pg, and 0.53 pg, have 4, 3 and 19 copies, respectively (Schween et al. 2003; Temsch et al. 1998, 2010; http://data.kew.org/cvalues/), and S. moellendorffii, with a genome of 0.09 pg/1C, has six copies of OFP (Little et al. 2007). It is known that P. patens is a paleopolyploid and that an ancient genome duplication is responsible for the expansion and abundance of its genes (Rensing et al. 2007). A comparison of the average length of transcript and protein across the four plants shows a general trend towards reduction from the bryophytes to the pteridophyte (S. moellendorffii). We likewise did not observe any direct correlation between genome size and the size range of genes, transcripts or CDS, or intron number. In S. moellendorffii, the small genome size may correspond with the presence of an intron in only one out of six OFPs. It is also well established that the presence of introns allows a large transcriptome diversity to be generated through alternative splicing. Whether genome size reduction is accompanied by a loss of OFP members, a loss of introns in both OFPs and other coding regions, and a reduction in the transcriptome diversity of OFPs specifically needs further examination. It would also be interesting to examine the role of introns in generating transcript diversity of OFPs through alternative splicing in the four early land plants. The phenomenon of alternative splicing has been previously reported in M. polymorpha (Nagaoka et al. 2017; Nishiyama et al. 1999), P. patens (Fesenko et al. 2017), S. fallax (Jonathan Shaw et al. 2016), and S. moellendorffii (Zhu et al. 2017). It has also been reported that the sizes of the transcript and coding sequence (CDS) are negatively correlated with their expression levels (Caldwell et al. 2015; Smith and Eyre-Walker 2002).

Reconstruction of phylogenetic relationships using the neighbor-joining (NJ) method revealed that the expansion of the OFP family in P. patens is genome specific, as most of the members formed Physcomitrella-specific sub-clades (Fig. 5; labeled Physcomitrella). Our analysis of the P. patens genome confirmed at least seven PpOFPs that are the result of segmental duplication (Fig. 6; Supplementary Figure S1). Large-scale duplication, possibly including whole genome duplication (WGD), has been suggested to cause the expansion of gene families in the P. patens genome (Rensing et al. 2007, 2008). Ks values of paralogs identified from the P. patens genome ranged between 0.5 and 0.9 and provide evidence for WGD, which has been dated to between 30 and 60 MYA based on analysis of ESTs (Rensing et al. 2007, 2008). Since complete genome information is still fragmentary for the other early land plants studied, the time scale of WGD in these species could not be confirmed accurately. Recent phylogenomic studies have established previously undiscovered associations between the early land plants (Kenrick and Crane 1997; Zhong et al. 2015). Our estimation of the Ka/Ks ratio indicates differential rates and selection pressures on the OFPs of the four genomes. Previous studies have established that under relaxed purifying selection, genes can exhibit novel expression patterns and lead to phenotypic diversity (Hunt et al. 2011). However, rates of substitution depend on whether a gene has been duplicated and whether the copies are maintained after the duplication event. A complex interplay of gene structure, duplication status and chromosomal location, together with environmental factors, influences selection pressure, which in turn influences expression state. In contrast to P. patens, evidence of WGD is lacking in M. polymorpha, although tandemly arrayed genes (TAGs) are present, indicating local duplication events (Bowman et al. 2017). The genome of Selaginella is also not known to have experienced any of the shared WGDs that are widespread among the other vascular land plants, and it shows an extremely low level of variation in genome size even among species, given the length of time available for genome size expansion (Baniaga et al. 2016; Banks et al. 2011; Jiao et al. 2011; Li et al. 2015).
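The conventional reading of the Ka/Ks ratio referred to above (values below 1 indicating purifying selection, values near 1 neutral evolution, and values above 1 positive selection) can be written as a short Python sketch. The function names and the `tolerance` band around 1 are illustrative assumptions, not part of the study, and the Ka and Ks inputs are expected to come from a dedicated estimation tool rather than from this code.

```python
def ka_ks_ratio(ka, ks):
    """Return Ka/Ks; Ks must be positive for the ratio to be defined."""
    if ks <= 0:
        raise ValueError("Ks must be greater than zero")
    return ka / ks


def selection_regime(ka, ks, tolerance=0.05):
    """Classify a gene pair by its Ka/Ks ratio.

    `tolerance` is a hypothetical band around 1 treated as effectively neutral.
    """
    omega = ka_ks_ratio(ka, ks)
    if omega < 1 - tolerance:
        return "purifying"
    if omega > 1 + tolerance:
        return "positive"
    return "neutral"


# Illustrative inputs only; these are not values reported in the study.
print(selection_regime(0.1, 0.7))   # ratio well below 1
print(selection_regime(0.5, 0.5))   # ratio equal to 1
print(selection_regime(0.9, 0.5))   # ratio well above 1
```

A gene-wide ratio averages over all sites, so a value below 1 does not exclude positive selection acting on a small number of residues.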

Domains and motifs are major determinants of the functional and structural features of a protein and protein family (Forslund and Sonnhammer 2012; Moore et al. 2008). On account of their overall significance, domains and motifs are presumed to be under a higher degree of negative (purifying) selection, which does not allow the accumulation of mutations that affect the “fitness” of the protein (Camps et al. 2007; Neduva and Russell 2005). Notwithstanding, major changes, including loss and gain of domains and their rearrangement, play a key role in generating protein diversity among homologs, either within a genome or across genomes (Bornberg-Bauer and Albà 2013; Kersting et al. 2012). The OVATE family of proteins is characterized by the presence of a ca. 70 amino acid OVATE domain at the C-terminal end, a bipartite NLS, and two putative Von Willebrand factor type C (VWFC) domains required for protein–protein interaction (Liu et al. 2002; Wang et al. 2016). In the present study, we observed that the conserved OVATE domain was not universally present across the species and was lacking in some of the OFPs (Mapoly0056s0035, Pp3c21_16580, Pp3c10_19270, Pp3c23_6720, Pp3c11_15470, Pp3c7_9900, Pp3c2_26490, Pp3c3_14980, Pp3c26_5410, Pp3c2_8210 and Pp3c10_12900) (Fig. 7; Supplementary Table S1; Supplementary Figure S2). When the distribution of the OVATE domain among the members was analyzed using the phylogenetic framework obtained previously, we observed that eight homologs of P. patens that are part of a single clade lack the OVATE domain. Neither the OVATE domain nor the two motifs were detected in the two OFPs Pp3c21_16580 and Pp3c2_8210, which form a paralogous pair. A loss in their common ancestor prior to expansion through WGD and/or segmental duplication seems to be the most parsimonious explanation for these observations. Several homologs, such as Mapoly0056s0035, Pp3c21_16580 and Pp3c2_8210, lack the OVATE domain as well as the two motifs, motif-1 and motif-2 (Fig. 7; Supplementary Figure S2). To the best of our knowledge, this is the first exhaustive study of the presence of different motifs in OFP protein sequences. Tracing the history of such loss or retention of domains and motifs will require complete genome sequences from several more bryophyte and pteridophyte species, so that accurate estimation of phylogenetic relationships along with ancestral state reconstruction can be performed, as has been reported for other plant and gene families (Rathore et al. 2016; Singh et al. 2018). One of the OFPs from S. moellendorffii, Sm437676, was found to contain additional domains that are not generally associated with OFPs. New domains are known to be acquired by existing genes through several mechanisms and to spread to other members of the gene family (Dangwal et al. 2013; Malik et al. 2012; Marsh and Teichmann 2010).

As a family of plant-specific transcription factors, OFPs are thought to play indispensable roles in different subcellular compartments. Transcription factors act by binding to specific DNA motifs present in the cis-elements of genes, or by interacting with other proteins. Detection of subcellular localization signals revealed that most OFPs are putatively localized in the nucleus, which is in accordance with previous studies (Jian-ping et al. 2012; Wang et al. 2011; Yu et al. 2015). Several transcription factors have been shown to be localized outside the nucleus owing to the roles they play during developmental processes and in response to environmental cues (Mishra et al. 2017; Subba et al. 2013). Transcription factors, therefore, do not necessarily possess only a nuclear localization signal, and may carry signals for localization in other subcellular compartments. A number of reports suggest that transcription factors exhibit dynamic localization patterns or nucleocytoplasmic shuttling (Vandromme et al. 1996; Xu and Massagué 2004). Therefore, the OFPs were also analyzed for the presence of an NES, and some of the OFPs were predicted to contain an NES and to be localized in the cytoplasm, extracellular matrix and endoplasmic reticulum (Fig. 4; Supplementary Table S2). In contrast to the previous suggestion by Schmitz et al. (2015) that the OVATE domain itself harbors the localization signal, we observed that the localization signals are randomly distributed throughout the protein.

Spatiotemporal expression analysis of P. patens OFPs suggested diverse roles for OFPs during various stages of the life cycle. For instance, higher expression of Pp3c5_25760 and Pp3c7_9900, especially in the reproductive tissues, indicates a putative role of OFPs during the reproductive stages of the life cycle; mixed expression of some of the PpOFPs (Pp3c3_14980, Pp3c10_12900, Pp3c21_16580 and Pp3c26_5410) across the developmental stages implies assorted functions; and several OFPs were found to be upregulated in the caulonemal stage in comparison with other stages (Fig. 10a, b). The lack of expression datasets for S. moellendorffii limited our survey to only two tissue stages with limited observations (Fig. 10c). Previous studies have analyzed the role played by OFPs in growth and development in several plants, such as rice (Ma et al. 2017; Wang et al. 2007; Yang et al. 2016), tomato (Huang et al. 2013; Liu et al. 2002), and Arabidopsis (Hackbusch et al. 2005; Wang et al. 2011; Zhang et al. 2016). The roles of various other genes and gene families that regulate development and adaptation in Physcomitrella and Selaginella have also been analyzed (Dangwal et al. 2014; Malik et al. 2012; Weng et al. 2005).

Protein function is dependent on three-dimensional conformation, which in turn is governed by amino acid composition, molecular weight, isoelectric point (pI), hydrophilic/hydrophobic nature and several other factors (Aftabuddin and Kundu 2007; Brown et al. 2010). A comparative analysis of gene families across various genomes allows evolutionary trends, if any, to be identified with respect to protein size across the phylogenetic tree. In the present study, we observed that the molecular weight of the OFPs was distributed between 7.8 and 188.9 kDa, with the majority of proteins (~ 69%) exhibiting a molecular mass of > 50–100 kDa (Fig. 3a; Supplementary Table S1). Furthermore, analysis of the pI of the OFPs revealed that most of the proteins (60%) across the species were alkaline, while the remaining 40% had a pI in the range of 4–7. None of the OFP members showed a pI in the extremely acidic range, i.e. < 4 (Fig. 3b). The interrelationship between protein length and pI, as well as the occurrence of proteins with different pI in proteomes, has been found to be strongly influenced by selection pressure (Kiraga et al. 2007). GRAVY values were determined to analyze the hydrophobic and hydrophilic characteristics of the OFP members. We found that SmOFPs were less hydrophilic than the majority of the others. This strengthens the earlier finding which suggested that all amino acids and hydrophobic networks endure the hierarchical signature (Lipman and Wilbur 1991). Disordered proteins deviate from the well-established, universal paradigm of sequence-to-structure and structure-to-function, as they do not form a fixed or canonical three-dimensional structure (Brown et al. 2010). Several reports establish that these proteins, devoid of any distinctive structure, are implicated in important biological processes such as gene regulation, cell cycle control and signal transduction, to name a few (He et al. 2009; Radivojac et al. 2007; Vucetic et al. 2007). Our study of the OFPs derived from the four early land plants revealed a range of intrinsic disorder which differed across the species. OFP members of Marchantia and Selaginella showed the highest and lowest range of disorder, respectively, whereas those of Physcomitrella and Sphagnum exhibited a moderate range (Fig. 3d; Supplementary Table S1). Indeed, all these physicochemical characteristics play a crucial role in species-specific or genotype-specific diversity, and are perhaps evolutionarily conserved, as has been suggested recently (Shekhar et al. 2016).
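For readers unfamiliar with the GRAVY index mentioned above, it is the mean Kyte-Doolittle hydropathy value over all residues of a sequence; positive values indicate a more hydrophobic protein and negative values a more hydrophilic one. The following Python sketch shows the calculation under that standard definition. It is not the tool used in this study, and the function name and example peptides are hypothetical.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Return the grand average of hydropathy (GRAVY) of a protein sequence."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


# Hypothetical peptides for illustration; not OFP sequences.
hydrophobic_example = gravy("IVLA")   # mostly hydrophobic residues
hydrophilic_example = gravy("RKDE")   # charged residues only
print(hydrophobic_example, hydrophilic_example)
```

Because GRAVY is a whole-sequence average, two proteins with similar values can still differ markedly in the local distribution of hydrophobic stretches.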

Analysis of protein architecture revealed the presence of several non-canonical domains, such as PDCD4 C-terminal MA3 domain-like structures, in most of the OFPs in the early land plants (Fig. 9; Supplementary Table S1). The PDCD4 protein is recognized as an inhibitor of translation and of tumor promoter-induced neoplastic transformation in animals, and comprises two MA3 domains. The MA3 domain shows homology with the eukaryotic translation initiation factor eIF4G, which acts as a scaffold protein that helps different initiation factors assemble the 40S ribosomal subunit on mRNA (Cheng et al. 2013). In higher plants, proteins containing PDCD4 MA3 domains have been associated with ethylene-mediated signaling and abiotic stress responses, and are unique in possessing four MA3 domains (Lei et al. 2011). The emergence of a distinctive form of PDCD4 in plants was found to be concomitant with the evolution of the modules of the ethylene signaling pathway (Cheng et al. 2013). Some of the OFPs showed the presence of CO-type nitrile hydratase alpha subunit and MIF4G domain-like structures. Nitrile hydratase is suggested to have been present in the ancestor of all eukaryotes, has photoactivation properties, and may represent an ancestral state in the development of photosynthetic capabilities in early land plants from non-photosynthetic prokaryotes (Kubiak and Nowak 2008). Nitrile hydratase enzymes are implicated in the production of ammonia and organic acids using nitrile group-containing compounds as substrates, and are examples of exceptional metalloenzymes that use trivalent cobalt or iron as a cofactor. However, they are generally present in prokaryotes; their occurrence has only recently been reported in a few eukaryotes, where it is believed to be the product of lateral gene transfer from prokaryotes (Marron et al. 2012). The MIF4G domain-like structure is known to be involved in translation initiation, and its function is similar to that reported for PDCD4 C-terminal MA3 domain-like structures (Virgili et al. 2013). The presence of several domains in OFPs indicates that the acquisition of novel domains, leading to structural variability, has played a significant role in shaping the functional properties of OFPs during the course of evolution (Hackbusch et al. 2005; Liu et al. 2002; Schmitz et al. 2015).

Conclusion

In the present study, comprehensive in silico analyses of OFPs from four different early land plants demonstrated a dynamic range in copy number, gene structure, presence of introns, protein size, pI, domain composition and protein structure. Sequence and phylogenetic analyses provided evidence for the existence of two major groups based on the presence of motifs within the OVATE domain. Physicochemical attributes also substantiated an intimate association between the development of functional diversity and evolution across the species. The ratio of non-synonymous to synonymous substitutions (Ka/Ks) suggests that the OFPs in all four model plants are under purifying selection. Subcellular localization and expression studies indicated putative roles for OFPs in various growth and developmental stages. This analysis, to the best of our knowledge, provides the first comprehensive overview of the OFP gene family in the early land plants, including hints towards its putative functional role, and will provide an invaluable platform for understanding the developmental transitions and underlying diversification during the evolution of land plants. This study can further serve as a basis for understanding the evolution of this family in higher plants, as well as for future research targeting the functional characterization of multigene family proteins.