Introduction

The non-specific lipid transfer proteins (nsLTPs) constitute a large protein family specific to plants. This protein family evolved when plants colonized land, as they are found in all land plants, but not in green algae (Edstam et al. 2011). In the nsLTPs eight conserved cysteines are localized in a motif with the general form C–Xn–C–Xn–CC–Xn–CXC–Xn–C–Xn–C (José-Estanyol et al. 2004). The cysteines form four disulphide bonds that stabilize the tertiary structure of the proteins, making them very resistant to heat denaturation and proteolytic digestion (Lindorff-Larsen and Winther 2001). The compact structure consists of four to five α-helices with a central hydrophobic cavity that is suitable for binding hydrophobic ligands (Shin et al. 1995; Lascombe et al. 2008). The nsLTPs that have been examined for lipid binding are promiscuous and bind many different hydrophobic or amphiphilic molecules, including alkanes, fatty acids, fatty acyl-coenzyme A and phospholipids (Sodano et al. 1997; Zachowski et al. 1998; Guerbette et al. 1999).
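As an illustration only, the eight-cysteine motif can be written as a regular expression. The sketch below assumes that each Xn stands for one or more residues of any kind; the actual spacings differ between nsLTP types, and the two sequences are synthetic toy strings, not real proteins.

```python
import re

# C-Xn-C-Xn-CC-Xn-CXC-Xn-C-Xn-C written as a regular expression.
# Assumption: every Xn is "one or more residues of any kind" (lazy match).
EIGHT_CYS_MOTIF = re.compile(r"C.+?C.+?CC.+?C.C.+?C.+?C")


def has_eight_cys_motif(protein: str) -> bool:
    """Return True if the sequence contains the eight-cysteine pattern."""
    return EIGHT_CYS_MOTIF.search(protein.upper()) is not None


# Synthetic toy sequences (hypothetical, for illustration only).
toy_with_motif = "AACAAACAAACCAAACACAAACAAACAA"
toy_without_cc = "AACAAACAAACAAACACAAACAAACAA"  # the adjacent CC pair is missing

print(has_eight_cys_motif(toy_with_motif))
print(has_eight_cys_motif(toy_without_cc))
```

A pattern this permissive can only pre-screen candidates; it does not replace the manual inspection of hits.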

Initially nsLTPs were thought to be involved in intracellular lipid trafficking, but this has later been excluded due to the fact that nsLTPs possess an N-terminal signal sequence leading them to the extracellular space. Their exact in vivo functions have not yet been clarified, even though the nsLTPs have been known for almost three decades (Kader et al. 1984). One nsLTP from Arabidopsis has been suggested to be involved in long-distance signaling during pathogen defense and there are also several examples of nsLTPs showing antifungal or antibacterial properties in vitro (Nielsen et al. 1996; Maldonado et al. 2002; Wang et al. 2004; Kirubakaran et al. 2008). Further, there are several papers reporting an involvement in formation of the protective cuticle and also some reports suggesting roles in reproduction, e.g. pollen tube adhesion and pollen wall development (Sterk et al. 1991; Thoma et al. 1994; Park et al. 2000; Cameron et al. 2006; Zhang et al. 2008; DeBono et al. 2009; Lee et al. 2009). Some nsLTPs are also expressed abundantly during seed germination and possibly have a role in lipid recycling (Edqvist and Farbos 2002; Eklund and Edqvist 2003).

We have previously divided the nsLTPs into four major and several minor types according to sequence similarity, intron position and spacing between the cysteine residues (Edstam et al. 2011). In one of the major types, Type G, the transcripts encode a C-terminal signal sequence in addition to the N-terminal one, leading to a posttranslational modification where a glycosylphosphatidylinositol (GPI) anchor is added to the protein. The GPI-anchor attaches the protein to the extracellular side of the plasma membrane. GPI-anchored proteins are found in all eukaryotic organisms and are involved in different functions including cell-to-cell interactions, immune recognition and polarized cell growth (Wang et al. 2002; Ahmad et al. 2003; Ghiran et al. 2003). In plants, proteins with GPI-anchors are involved in many different processes, such as callose deposition and metabolism in bud dormancy release, cell-to-cell communication and polarized pollen tube growth (Lalanne et al. 2004; Simpson et al. 2009; Rinne et al. 2011).

Systematic functional analysis of the nsLTPs has been hampered by the fact that they are encoded by large gene families and that the genes are likely to be functionally redundant. This complicates the use of genetic tools, such as T-DNA insertion mutants. In this work, we decided to build a platform for further research on the biological function of these enigmatic proteins by using microarray data to investigate coexpression patterns. Coexpression of genes may indicate an involvement in the same biological processes. Therefore the identification of coexpression networks may lead to discoveries of gene function. Here, we focused our attention on the Type G nsLTPs (LTPGs). This selection was made to limit the number of genes in the investigation, but also because there are reports that associate a phenotype, less cuticular wax or fewer wax components, with lowered expression of LTPGs in Arabidopsis (DeBono et al. 2009; Lee et al. 2009; Kim et al. 2012). We constructed modules of coexpressed LTPGs in rice and Arabidopsis. From the Arabidopsis modules we built extended networks by searching the whole Arabidopsis transcriptome for genes coexpressed with each LTPG-module. The networks were analyzed for enrichments in Gene Ontology terms in order to obtain clues to the biological function of the LTPG modules. Our data suggest that the LTPGs are involved in the accumulation of cuticular waxes as indicated previously. However, we also show that the LTPGs may be involved in the biosynthesis or deposition of suberin and sporopollenin. We also characterized the splicing pattern of the Arabidopsis and rice LTPG transcripts and show that many undergo alternative splicing, which leads to transcript isoforms with or without the GPI-anchor attachment signal.

Materials and methods

Sequences and Sequence Tools

Previous studies have identified 34 LTPGs from Arabidopsis and 27 from rice (Boutrot et al. 2008; Edstam et al. 2011). These sequences were initially used in this study. The Arabidopsis sequences were retrieved from The Arabidopsis Information Resource (TAIR, version 10) and the rice sequences from Rice Genome Annotation Project (TIGR RGAP, version 6.1) (Rhee et al. 2003; Ouyang et al. 2007). Basic Local Alignment Search Tool (BLAST v 2.2.18) was used locally to search for additional putative LTPGs among downloaded Arabidopsis and rice protein sequences (Altschul et al. 1990). All known sequences from each organism were used as bait and all settings were left as default (Matrix BLOSUM62, gap penalties: Existence 11 and extension 1). The cut off value was set to 0.0001. Results were manually investigated and false hits removed.

BLAST searches were also performed online at The Arabidopsis Information Resource (TAIR, database TAIR10 transcripts) and Rice Genome Annotation Project (RGAP, database Rice full length cDNAs in Genbank). Expressed Sequence Tag (EST) databases were searched using genomic sequences as bait, in order to find introns and reveal alternative splicing. PredGPI (available at http://gpcr2.biocomp.unibo.it/gpipe/pred.htm) was used to predict presence of sites for the post translational addition of a GPI-anchor (Pierleoni et al. 2008). PredGPI was used for all expressed isoforms of each sequence. TargetP 1.1 was used to predict subcellular targeting (Emanuelsson et al. 2007).

A phylogenetic tree was constructed to visualize relations between the LTPGs in Arabidopsis and rice. The tree is based on multiple alignments done with the ClustalW method (Thompson et al. 1994), using the program ClustalW2 v2.0.7 (Larkin et al. 2007). The alignments were done using the slow/accurate method and the protein matrix Gonnet. The gap extension penalty and gap opening penalty were set to 0.01 and 10, respectively. Only the cores of the mature proteins including the conserved Cys residues were used in alignments; the GPI-anchor and the link to the anchor were excluded.

Manually refined alignments were used as input in ProtTest v2.4 (Abascal et al. 2005), which was run with all candidate models, a BIONJ tree and the slow optimization strategy. LG +I +G was predicted as the best model and thus used to construct a Maximum Likelihood phylogenetic tree using the program Phyml v3.0 (Guindon and Gascuel 2003). All other settings were left as default, but with 100 replicates for bootstrapping. Three LTPGs from Physcomitrella patens were used as outgroup (Edstam et al. 2011). Additionally, searches for overrepresented motifs in the promoter regions (0–2,000 bp upstream of the start codon) of all genes in each module were performed. The web tool elefinder at Matt Hudson Lab was used for this purpose (http://stan.cropsci.uiuc.edu/tools.php).

Expression and coexpression networks

The eFP browser for Arabidopsis at The Bio-Array Resource for Plant Biology was used to retrieve expression data from different tissues and developmental stages, and during stress, hormone and chemical treatments (Schmid et al. 2005; Kilian et al. 2007; Winter et al. 2007; Goda et al. 2008). Further, coexpression networks of LTPGs from Arabidopsis and rice were constructed. Pairwise Pearson correlation coefficients between the LTPGs were obtained using the web tool Cornet (https://cornet.psb.ugent.be/main/tool) for Arabidopsis and the coexpression analysis tool from the Rice Oligonucleotide Array Database (http://www.ricearray.org/) for rice (Jung et al. 2008). The correlation coefficients were obtained from six predefined datasets in Arabidopsis (All, Development, Whole Plant, Hormone treatment, Biotic Stress and Abiotic Stress) and three in rice (General, Biotic Stress and Abiotic Stress). A correlation between two genes was considered present when the coefficient was higher than 0.7. A network consisting of three or more LTPG genes with correlation coefficients above 0.7 was considered a module. In Arabidopsis, each module was then used to make genome wide networks of genes coexpressed with the LTPG genes. To expand the modules of LTPGs to genome wide coexpression networks, every gene with a correlation coefficient higher than 0.7 towards any of the LTPGs in the module was connected to that module. This was performed in the dataset All. The software Cytoscape was used for visualization of the resulting networks of coexpressed genes (Shannon et al. 2003).
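The thresholding step can be sketched as follows. This is a minimal illustration and not the Cornet implementation: it assumes that a "network" means a connected component of the graph in which two genes are linked when R > 0.7, and the gene names and expression values are hypothetical toy data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def find_modules(expression, threshold=0.7, min_size=3):
    """Group genes into modules: connected components with >= min_size genes.

    Assumption: two genes are linked when their correlation exceeds the
    threshold, and a module is a connected set of linked genes.
    """
    genes = sorted(expression)
    neighbours = {g: set() for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if pearson(expression[g], expression[h]) > threshold:
                neighbours[g].add(h)
                neighbours[h].add(g)
    modules, seen = [], set()
    for g in genes:
        if g in seen:
            continue
        component, stack = set(), [g]
        while stack:  # depth-first walk over linked genes
            current = stack.pop()
            if current not in component:
                component.add(current)
                stack.extend(neighbours[current] - component)
        seen |= component
        if len(component) >= min_size:
            modules.append(sorted(component))
    return modules


# Hypothetical expression profiles over five samples.
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.0],
    "geneC": [1.0, 2.1, 2.9, 4.2, 5.0],
    "geneD": [5.0, 1.0, 4.0, 2.0, 3.0],
    "geneE": [3.0, 3.0, 1.0, 5.0, 2.0],
}
toy_modules = find_modules(toy_expression)
print(toy_modules)
```

In the toy data only geneA, geneB and geneC are mutually correlated above the threshold, so they form the single module; geneD and geneE remain unplaced.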

Cluster analysis

As an additional method to identify groups of coexpressed genes in the data set we used a fuzzy clustering algorithm (Kaufman and Rousseeuw 2008) as opposed to hard clustering. For our analysis we used the “fanny” function in the R-package “cluster”. The parameter m defines the degree of fuzzification allowed between clusters. As m approaches 1, the fuzzy clusters become hard clusters, where each data point belongs to only one cluster. As m approaches infinity the clusters become completely fuzzy, and each point will belong to each cluster to the same degree (1/K, where K is the number of clusters analyzed) regardless of the data. Usually m = 2 is initially chosen, and this is also the value we use here. However, we evaluated our choice of m using an exhaustive grid-search varying m and K (Futschik and Kasabov 2002). In order to find the number of clusters K that gives the best partitioning of the data set, one usually executes the clustering algorithm with different values of the expected number of clusters K (K_m < K < K_M). Then a quality index Q_K is computed for each value of K tested and the K giving the “best” value of Q_K is chosen. We used the R-package “clValid” to validate the best number of clusters using all three validation measures included in the validation parameter “internal”.
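The effect of m can be illustrated with the membership formula of standard fuzzy c-means. Note that this is a stand-in chosen for illustration: the “fanny” function optimizes its own objective, and the distances below are hypothetical. For a point at distance d_k from cluster centre k, the fuzzy c-means membership is u_k = 1 / Σ_j (d_k / d_j)^(2/(m−1)).

```python
def fcm_memberships(distances, m=2.0):
    """Fuzzy c-means memberships of one point, given its distances to K centres.

    Assumptions: all distances are positive and m > 1.
    """
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_k / d_j) ** exponent for d_j in distances)
        for d_k in distances
    ]


# Hypothetical distances from one gene to three cluster centres.
toy_distances = [1.0, 2.0, 4.0]

default_fuzzy = fcm_memberships(toy_distances, m=2.0)
near_hard = fcm_memberships(toy_distances, m=1.05)     # m close to 1
near_uniform = fcm_memberships(toy_distances, m=100.0)  # very large m

for label, u in [("m=2", default_fuzzy), ("m=1.05", near_hard),
                 ("m=100", near_uniform)]:
    print(label, [round(value, 3) for value in u])
```

With m close to 1 almost all membership goes to the nearest centre, while with a very large m the three memberships approach 1/3 each, in line with the limiting behaviour described above.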

Gene Ontology Enrichments

To search for gene ontology enrichments in the constructed genome wide networks for Arabidopsis, the plugin BiNGO in Cytoscape was used (Maere et al. 2005). The hypergeometric test was used as the statistical test, together with the Benjamini and Hochberg False Discovery Rate (FDR) correction. The significance level was set to 0.05 and the whole Arabidopsis annotation was used as the reference set. Three different sets of ontology terms were used separately: Biological Process, Molecular Function and Cellular Component (Berardini et al. 2004).
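The statistics behind such an enrichment test can be sketched in a few lines. This is a generic illustration of the hypergeometric test and the Benjamini and Hochberg adjustment, not BiNGO's code; the function names and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated genes in a network.

    n = genes in the network, K = genes carrying the term in the reference
    set, N = size of the reference set.
    """
    total = comb(N, n)
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy example: 2 of 3 network genes carry a term held by 4 of 10 genes.
toy_p = hypergeom_pvalue(k=2, n=3, K=4, N=10)
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(toy_p, toy_adjusted)
```

A term would then be reported as enriched when its adjusted p value falls below the chosen significance level of 0.05.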

RNA analysis

Four of the genes that were predicted to be alternatively spliced were further investigated in planta. The Arabidopsis thaliana ecotype Col-0 was used for all experiments. Seeds were sown on agar plates containing ½ strength Murashige and Skoog medium supplemented with 1 % sucrose. After 14 days the seedlings were transferred to a mixture of soil and vermiculite. The plants were grown either under a light cycle of 16 h light and 8 h dark in a growth chamber, or under constant light. RNA from leaves, flowers, siliques and roots was extracted using the RNeasy Plant Minikit (Qiagen, Hilden, Germany) according to the manufacturer’s protocol. The extracted RNA was treated with DNase and then used as template in a cDNA synthesis. For each reaction 1 μg RNA was used. Oligo(dT)18 primer was used to avoid amplification of immature mRNA. RevertAid Reverse Transcriptase (Fermentas, Vilnius, Lithuania) was used for first strand cDNA synthesis according to the manufacturer’s protocol. To ensure that no traces of genomic DNA were contaminating the samples, an additional cDNA synthesis was performed without the Reverse Transcriptase, as a negative control. The synthesized cDNAs were used for PCR with gene specific primers (Online Resource 1). DreamTaq DNA Polymerase (Fermentas) was used for the PCR, according to the manufacturer’s protocol. The PCR was performed as follows: 3 min of initial denaturation at 95 °C, followed by 35 cycles of 30 s denaturation at 95 °C, 30 s annealing at 55 °C and 1 min elongation at 72 °C. After the cycling, there was a final elongation step at 72 °C for 7 min. The PCR products were run on an agarose gel (2 %), the fragments were excised and the DNA was recovered using a QIAquick gel extraction kit (Qiagen). Extracted DNA was sent to Eurofins MWG Operon (Ebersberg, Germany) for sequencing.

Results

LTPGs in Arabidopsis and rice

We initiated this study by identifying the complete set of LTPGs in Arabidopsis and rice (Tables 1, 2). Previously, 34 LTPG genes in Arabidopsis and 25 LTPG genes in rice have been identified in genome-wide analyses (Boutrot et al. 2008). During this study two additional rice genes were identified, giving a current total of 27 GPI-anchored LTPGs in rice. The occurrence of the GPI-anchors in nsLTPs is mainly based on predictions. However, GPI-anchors have been shown experimentally for AtLTPG1, AtLTPG11, AtLTPG12 and AtLTPG31 (Borner et al. 2003; Elortza et al. 2003). When transcriptome databases were searched for transcripts of LTPG genes, three of the Arabidopsis genes and two of the rice genes lacked a corresponding transcript in the databases (Tables 1, 2). The genes lacking matching transcripts were considered putative pseudogenes and were excluded from the remaining investigation of the expression profiles. However, we do not exclude the possibility that transcripts from all or some of these five genes could be identified under conditions not yet investigated. The intracellular localizations of the identified LTPGs were investigated using the subcellular predictor TargetP (Emanuelsson et al. 2007). As expected, most of the proteins are predicted to be secreted. More surprisingly, there are three Arabidopsis proteins and two rice proteins that TargetP assigns to other localizations, such as chloroplast and mitochondria (Tables 1, 2). However, most of these predictions to chloroplast and mitochondria show low reliability scores and the localization of LTPGs to organelles should be experimentally verified.

Table 1 The LTPGs in Arabidopsis
Table 2 The LTPGs in rice

Coexpression of LTPGs

We reasoned that there are probably functional groups of LTPG genes that are involved in related biological processes. Further, if we could identify functional groups it would be helpful for the rational design of experiments aiming at elucidating the biological role of these proteins. LTPG genes involved in the same process are likely to share correlated expression profiles. Therefore, to identify functional groups of LTPGs we turned our attention to Arabidopsis and also rice microarray datasets. For Arabidopsis, we used six different microarray datasets: All, Whole Plant, Development, Hormone, Biotic and Abiotic stresses (De Bodt et al. 2009). Expression data for 26 LTPG genes were available in these datasets. We treated the microarray datasets separately, to learn if identified coexpression patterns would be based on for instance stress responses or developmental programs.

The coexpression between the LTPGs was obtained as Pearson correlation coefficients (R). At first, coexpressed LTPG genes were identified using an arbitrary threshold of R > 0.7. This cut off was selected since R > 0.7 is generally considered a true correlation and is used in various analyses (Lee et al. 2004; Ren et al. 2005; Zheng et al. 2008). In each dataset we could identify 3–5 groups of Arabidopsis LTPG genes which, according to the definition above, were coexpressed (see Online Resources 2–7). After identifying the groups of coexpressed genes in each dataset, we next placed the LTPGs in composite coexpression modules. To be included in a composite coexpression module a gene had to be part of a specific coexpression group in at least four of the six investigated microarray datasets. The composite coexpression modules therefore reflect the stability of the coexpression groups in a larger number of samples. Using this arbitrary threshold approach, 14 of the 26 Arabidopsis LTPG genes could be distributed in three different composite coexpression modules (Table 3). The remaining 12 genes did not show strong enough coexpression with other LTPGs to be placed in any of the composite modules. However, these unplaced genes could show an R > 0.7 to a gene within the modules in some datasets, as described below.
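The rule for composite modules can be written down compactly. The sketch below assumes that the coexpression groups have already been matched across datasets and share a common label; the gene names, the group label "X" and the assignments are hypothetical.

```python
from collections import Counter


def composite_modules(groups_by_dataset, min_datasets=4):
    """Keep a gene in a module if it joins that group in >= min_datasets.

    groups_by_dataset maps dataset name -> {gene: group label}.
    Assumption: group labels are already comparable across datasets.
    """
    counts = Counter()
    for assignment in groups_by_dataset.values():
        for gene, group in assignment.items():
            counts[(gene, group)] += 1
    modules = {}
    for (gene, group), n_datasets in counts.items():
        if n_datasets >= min_datasets:
            modules.setdefault(group, []).append(gene)
    return {group: sorted(genes) for group, genes in modules.items()}


datasets = ["All", "Development", "Whole Plant",
            "Hormone", "Biotic Stress", "Abiotic Stress"]

# Hypothetical assignments: g1 and g2 join group X in all six datasets,
# g3 in four of them and g4 in only two.
toy_groups = {name: {"g1": "X", "g2": "X"} for name in datasets}
for name in datasets[:4]:
    toy_groups[name]["g3"] = "X"
for name in datasets[:2]:
    toy_groups[name]["g4"] = "X"

toy_composite = composite_modules(toy_groups)
print(toy_composite)
```

Here g4 is coexpressed with the group in two datasets only and is therefore left out of the composite module, which mirrors the situation of the unplaced genes.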

Table 3 Composite modules of co-expressed genes in Arabidopsis and rice

According to the arbitrary threshold approach, three genes, AtLTPG1, AtLTPG2 and AtLTPG6, are placed in the composite module AtI (Table 3). These three genes are coexpressed in all six datasets. The only exception is that AtLTPG6 is not reaching the cut off R > 0.7 in the Biotic Stress dataset. However, in this dataset the highest correlation coefficient for AtLTPG6 is 0.68, and thus very close to the cut off. None of the other LTPG genes are found in this module in any of the datasets (see Online Resources 2–7). Module AtII is the largest composite module with seven genes; AtLTPG5, AtLTPG15, AtLTPG16, AtLTPG17, AtLTPG20, AtLTPG22 and AtLTPG30. Five of the genes are found in this module in every dataset (AtLTPG15, AtLTPG16, AtLTPG20, AtLTPG22 and AtLTPG30). AtLTPG17 is below the cut off value in the Whole Plant dataset, with a highest correlation coefficient of 0.58 towards another gene in AtII. AtLTPG5 is missing the cut in both the Whole Plant and the Biotic Stress datasets, but the highest correlation coefficient is not far below in any of the cases (0.67 and 0.65, respectively). Three LTPG-genes (AtLTPG7, AtLTPG11 and AtLTPG33) outside the composite AtII-module are coexpressed with the module in two datasets each. However, in the other datasets the correlation is weaker, although in some cases just below the threshold (see Online Resources 2–7). The composite module AtIII consists of four genes; AtLTPG3, AtLTPG4, AtLTPG23 and AtLTPG26 (Table 3). Only AtLTPG4 is found in this module in all six datasets, the others are below the cut off in one or two datasets each. AtLTPG3 is just below in the Whole Plant and Biotic Stress datasets (highest coexpression coefficients 0.62 and 0.69). AtLTPG23 is excluded when using the dataset All and AtLTPG26 when using Development (highest coexpression coefficients 0.62 and 0.63 respectively). 
Four genes outside the module (AtLTPG9, AtLTPG24, AtLTPG29 and AtLTPG34) show significant coexpression with module AtIII in two or three of the datasets (see Online Resources 2–7). However, AtLTPG9, AtLTPG24 and AtLTPG34 show much weaker correlation to the AtIII-module in the other datasets. The same is true for AtLTPG29 although in the Whole Plant dataset, this gene has a correlation to module AtIII which is only just below the cut off.

In summary, the arbitrary cut off approach results in three modules of coexpressed LTPG genes in Arabidopsis. The module AtI is the most stable over all the tested datasets. In module AtII five of the genes are consistently coexpressed, according to the given definition, over the six datasets, whereas there are a few genes that show a correlated expression to the module only in some datasets. The stability of module AtIII is weaker with only one gene fitting to the module in all datasets.

In rice, 13 of the LTPGs could be placed in either of three composite coexpression modules (Table 3). The rice genes that were placed in expression modules showed a coexpression pattern in at least two of the three investigated microarray datasets: General, Biotic and Abiotic (Jung et al. 2008). Three coexpressed genes (OsLTPG10, OsLTPG12 and OsLTPG22) were grouped in module OsI, eight genes (OsLTPG7, OsLTPG8, OsLTPG9, OsLTPG14, OsLTPG17, OsLTPG18, OsLTPG26 and OsLTPG27) were assigned to module OsII and another three genes (OsLTPG1, OsLTPG2 and OsLTPG24) were placed in module OsIII. Eight of the rice genes did not fit in any module, and thus seem to lack significant coexpression to other genes encoding LTPGs, at least in more than one of the investigated datasets.

A potential problem with using an arbitrary threshold value is that important relationships can be lost if the threshold is set too high. For example, with the approach described above we have identified several genes which are above the threshold in some datasets but below it in the majority of the datasets (see Online Resources 2–7). With the approach and cut off threshold used in this study, these genes could not be assigned to any of the expression modules. On the other hand, setting the threshold too low could result in connections that are very weak or possibly false positives. Therefore, we also used a clustering algorithm to identify groups of coexpressed Arabidopsis LTPG genes in the datasets. Clustering techniques seek to partition a given data set into a set of disjoint groups so that objects within groups are more similar to each other than objects in separate groups (Kaufman and Rousseeuw 2008). The rationale is that many coexpressed genes are co-regulated and important groups can then be revealed by cluster analysis (Domany 2003). We here used a fuzzy clustering algorithm (Kaufman and Rousseeuw 2008) as opposed to hard clustering. In hard clustering the clusters are mutually exclusive. Fuzzy clustering, on the other hand, allows data points to belong to several clusters simultaneously. The partial membership is presented as a probability of a data point i belonging to cluster k. In many data sets fuzzy clustering is more natural than hard clustering (Do and Choi 2008) since data points on the boundaries between several clusters are not forced to belong to one of them, but rather are assigned a partial membership between 0 and 1. For a fixed observation the membership probabilities sum to 1.

The clustering was done on all 26 AtLTPG genes that were available in the microarray datasets. In some datasets (Abiotic Stress, Biotic Stress) the analysis indicated three clusters, while in other datasets (Hormone, All) there was support for two clusters. Further, in the remaining datasets (Development, Whole Plant) the analysis did not indicate any particular number of clusters that best describes the partitioning of the datasets. In general, silhouette width scores over 0.6 are considered significant. In the case of the AtLTPGs the average silhouette width scores were low for k = 2–5 and did not give strong support for the partitioning in any of the datasets, as shown for datasets All and Development in Fig. 1a. Twelve genes were not placed in the coexpression modules with the threshold approach. Probably, these genes with a low expression correlation to other LTPGs reduce the probability of obtaining well-defined clusters. Next, we followed the partitioning into clusters of the genes we previously assigned to expression modules AtI, AtII and AtIII with the arbitrary cutoff approach. When k = 3 was used in the analysis we noted that these genes showed the highest silhouette width scores and were therefore most strongly associated with each of the three clusters, as shown for datasets All and Development in Fig. 1b. This could also be concluded by visualizing the clustering in two dimensions, as shown in Fig. 2 for datasets Development, Whole Plant and Abiotic Stress, where the AtLTPGs previously assigned to modules AtI, AtII and AtIII were found in three separate clusters.
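For readers unfamiliar with the silhouette width, the following sketch computes it for a hard assignment of points to clusters. It uses Euclidean distance and hypothetical two-dimensional toy points; the actual analysis relied on the R packages named in the methods, so this is an illustration of the measure, not of that code.

```python
from math import sqrt


def silhouette_widths(points, labels):
    """Silhouette width s = (b - a) / max(a, b) for every point.

    a = mean distance to the other members of the point's own cluster,
    b = smallest mean distance to the members of any other cluster.
    Points alone in their cluster get a width of 0 by convention.
    """
    def distance(p, q):
        return sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    widths = []
    for i, point in enumerate(points):
        by_cluster = {}
        for j, other in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(distance(point, other))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            widths.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        widths.append((b - a) / max(a, b))
    return widths


# Hypothetical toy data: two tight, well separated pairs of points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_widths = silhouette_widths(toy_points, [0, 0, 1, 1])
bad_widths = silhouette_widths(toy_points, [0, 1, 0, 1])

print([round(w, 2) for w in good_widths])
print([round(w, 2) for w in bad_widths])
```

The sensible partition gives widths close to 1 for every point, whereas the deliberately wrong partition gives negative widths, which is the pattern expected for points assigned to the wrong cluster.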

Fig. 1

Fuzzy C-means clustering of AtLTPGs. The graph in (a) shows the average silhouette widths (y-axis) for 2–5 clusters (x-axis) for microarray datasets All and Development. The graphs in (b) show the silhouette width for each AtLTPG with selection for three clusters. Dataset All is in the left panel and dataset Development is in the right panel. In both panels the numbers to the right indicate the number of genes in each cluster (left of the vertical line) and the average silhouette width for each cluster (right of the vertical line). The fuzzy C-Means plots for three clusters are shown in (c). The plots illustrate the probability (from 0 to 1) with which each AtLTPG belongs to each of the three clusters. Dataset All is in the left panel and dataset Development is in the right panel. The genes in module AtI are green, genes in module AtII are red and genes placed in AtIII are blue

Fig. 2

Fuzzy clustering of AtLTPGs in the datasets Development, Whole Plant and Abiotic Stress. The genes in module AtI are green, genes in module AtII are red and genes placed in AtIII are blue. Clustering was done with k = 3

The cluster partitioning was further examined by evaluation of the fuzzy C-Means plots (Fig. 1c and Online Resource 8). The AtI-genes (AtLTPG1, AtLTPG2 and AtLTPG6) are in all six datasets found in the same cluster. However, in the dataset Hormone, all three genes show an equal probability for membership in another cluster. The AtII-genes (AtLTPG5, AtLTPG15, AtLTPG16, AtLTPG17, AtLTPG20, AtLTPG22 and AtLTPG30) are also found together in the same cluster in all six datasets. In the dataset Whole Plant the AtII-genes AtLTPG5, AtLTPG17, AtLTPG22 and AtLTPG30 also show a lower probability (0.25–0.50) for a second cluster. Further, in the Biotic Stress dataset AtLTPG5 and AtLTPG20 have an equal probability for membership in three or two different clusters, respectively (Online Resource 8). As described previously, AtLTPG7, AtLTPG11 and AtLTPG33 showed an R > 0.7 to genes in the AtII-module in some datasets. In the fuzzy clustering, these genes and also AtLTPG14 are found in the same clusters as the AtII-genes in all datasets. The fuzzy clustering located the AtIII-genes (AtLTPG3, AtLTPG4, AtLTPG23, AtLTPG26) in all six datasets to a third separate cluster. In dataset All AtLTPG3, AtLTPG4, AtLTPG26 have a lower probability for memberships in another cluster, whereas in Hormone all four genes show this tendency (Fig. 1c and Online Resource 8). AtLTPG24 and AtLTPG34 showed R > 0.7 to the genes in module AtIII in several datasets. In the fuzzy clustering, these genes were found in the same cluster as the AtIII-genes in all datasets. Moreover, they showed a similar promiscuity in the All and Hormone datasets.

In summary, the fuzzy clustering approach confirmed the coexpression modules identified with the arbitrary threshold approach. Moreover, the clustering revealed tendencies for larger networks consisting of more LTPG genes than could be identified with the chosen arbitrary threshold R > 0.7. It is also clear from both the fuzzy clustering and the arbitrary threshold approach that there are a number of LTPGs that do not correlate strongly in terms of expression profiles with other LTPG genes. For instance, according to the fuzzy clustering AtLTPG9, AtLTPG12, AtLTPG21, AtLTPG29, AtLTPG31 and AtLTPG32 show a probability of at least 0.25 to associate with two or more clusters in most datasets (Online Resource 8).

Expression patterns of the modules

We continued this investigation by screening the microarray datasets for the detailed expression pattern of the Arabidopsis expression modules AtI, AtII and AtIII (Table 3). We focused on these genes since both the arbitrary cut-off method and the fuzzy clustering showed a connection between the genes within each module. In comparison with the other modules, AtI has a high expression baseline. In the Developmental dataset all three genes in AtI have their highest expression levels in flower and seed, and the lowest in root, cauline leaf, senescing leaf and mature pollen (Fig. 3). In the Abiotic Stress dataset downregulation was observed for drought, heat and UV-B (Online Resource 9). Wounding causes an upregulation after 1 and 3 h, and then a downregulation after 6 and 12 h. In the Biotic Stress dataset there are no significant upregulations, but several downregulations (Online Resource 10). Interestingly, treatment with the photosynthesis inhibitor N-octyl-3-nitro-2,4,6-trihydroxybenzamide (PNO8) causes a large decrease in the expression of the genes in module AtI (Online Resource 11).

Fig. 3

The expression pattern of members in modules AtI (A), AtII (B) and AtIII (C) during different developmental stages of Arabidopsis. The genes are indicated in the top right corner of each panel. The Y-axis shows the expression levels of LTPG transcripts. Standard deviation is shown as error bars

In the Developmental dataset all genes in AtII have an expression peak in the roots of both adult plants and seedlings (Fig. 3). The expression of most of the genes in AtII also peaks in hypocotyls and seeds, and a few of the genes are also upregulated in flowers. In the abiotic and biotic stress datasets (Online Resources 12–13) there is only one condition that gives a significant change in all AtII-genes: 1 h of drought leads to a downregulation of gene expression. Moreover, all of them show an increased expression 3 h after addition of abscisic acid (ABA), although to different degrees and not in all cases significantly (Online Resource 14). Module AtIII is highly expressed in flowers and mature pollen, but at much lower levels in other tissues (Fig. 3). In the Hormone dataset there are no big differences in the expression, since the experiments were conducted on seedlings and not flowers. A similar situation is found in the other datasets, resulting in no significant changes in the Chemical dataset or stress datasets (Online Resources 15–18). In summary, the most important points from the characterization of the expression patterns are that AtI transcripts are present in most aerial parts, AtII is found in roots, although it is not restricted to underground tissues, and AtIII is restricted to reproductive tissues.

In rice, OsI has an expression peak in mature leaves, which distinguishes OsI from the other composite rice modules (Fig. 4). This module also has peaks during inflorescence stage P5 and seed stage S5. Inflorescence stage P5 corresponds to the vacuolated pollen stage (15–22 cm height), and seed stage S5 corresponds to 21–29 days after pollination (dap), during development of dormancy and desiccation tolerance as previously defined (Itoh et al. 2005). OsII has a clear expression peak in roots (Fig. 4), which discriminates OsII from the other modules. OsII also reaches high levels in inflorescence stage P5 and seed stage S4, corresponding to embryo maturation 11–20 dap. OsIII shows very low levels of expression in both roots and mature leaves. The genes in this module reach their highest levels in inflorescence, where OsLTPG1 and OsLTPG3 peak at P5, while the OsLTPG2 transcript shows higher levels at P2, corresponding to the meiotic stage. OsLTPG2 is also abundantly expressed in seeds at stages S4 and S5. Thus, also in rice there is one module, OsI, with a broad expression pattern in aerial parts, another module, OsII, that is expressed in, but not restricted to, roots, and a third module, OsIII, which is expressed in reproductive tissues.

Fig. 4

The expression pattern of members in modules OsI (A), OsII (B) and OsIII (C) during different developmental stages of rice. The genes are indicated in the top right corner of each panel. The Y-axis shows the expression levels of LTPG transcripts. Standard deviation is shown as error bars

Gene ontology enrichments

The three expression modules from Arabidopsis were used in genome-wide searches for coexpression, leading to greatly expanded gene networks. These networks were then checked for enrichments in gene ontology (GO) terms. Only results from the microarray dataset All for each module are presented here. The 20 terms with the lowest p value for each ontology file are found in Tables 4, 5 and 6. Extended lists restricted to p values < 0.01 are given as supplementary information (Online Resources 19–21). In the Biological Processes ontology, the network for module AtI is most significantly enriched in the parent term photosynthesis with its child terms light harvesting, chlorophyll biosynthetic process, nonphotochemical quenching and several other photosynthesis-related terms (Table 4). The enriched GO terms also include the parent term response to abiotic stimulus with the enriched child terms response to radiation, response to light stimulus and response to cold. Cuticle development, wax biosynthesis and very long-chain fatty acid metabolism are other enriched terms. The most significantly enriched term of the Molecular Function ontology is chlorophyll binding. The enriched terms in the Cellular Component ontology are mostly related to chloroplasts, such as thylakoid, but apoplast and cell wall are also represented.
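
The enrichment step described above can be illustrated with a minimal sketch. The text does not state which statistical test was used, so the one-sided hypergeometric test below is an assumption (it is a common choice for GO enrichment), and the function names, term names and all counts are hypothetical. No multiple-testing correction is applied in this sketch.

```python
from math import comb

def hypergeom_enrichment_p(genome_size, annotated, network_size, hits):
    """One-sided hypergeometric p value: P(X >= hits).

    genome_size  -- number of genes in the background
    annotated    -- background genes carrying the GO term
    network_size -- genes in the coexpression network
    hits         -- network genes carrying the GO term
    """
    upper = min(annotated, network_size)
    total = comb(genome_size, network_size)
    tail = sum(
        comb(annotated, i) * comb(genome_size - annotated, network_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / total

def top_terms(term_counts, genome_size, network_size, cutoff=0.01, limit=20):
    """Rank terms by p value, keep those below the cutoff, return at most `limit`."""
    scored = [
        (term, hypergeom_enrichment_p(genome_size, annotated, network_size, hits))
        for term, (annotated, hits) in term_counts.items()
    ]
    scored = [item for item in scored if item[1] < cutoff]
    return sorted(scored, key=lambda item: item[1])[:limit]

# Hypothetical counts: term -> (annotated genes in background, hits in network)
example_counts = {
    "photosynthesis": (200, 25),
    "response to cold": (300, 8),
    "unrelated term": (500, 3),
}
ranked = top_terms(example_counts, genome_size=20000, network_size=150)
for term, p in ranked:
    print(f"{term}\t{p:.3g}")
```

The `cutoff` and `limit` defaults mirror the p value < 0.01 restriction and the 20-term tables mentioned in the text; everything else is illustrative.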

Table 4 Enriched gene ontology terms for module AtI
Table 5 Enriched gene ontology terms for module AtII
Table 6 Enriched gene ontology terms for module AtIII

In the Biological Processes ontology, the network based on module AtII is most significantly enriched in the terms cell wall organization or biogenesis, secondary metabolic process and response to chemical stimulus. The parent term secondary metabolic process is followed by the enriched child terms phenylpropanoid metabolic process, phenylpropanoid biosynthetic process and suberin biosynthetic process (Table 5). The ancestor term root system development, with the child terms root development and root morphogenesis, is also significantly enriched in the AtII network. In the Molecular Function ontology, some of the most significantly enriched terms in the AtII network are oxidoreductase activity, heme binding, peroxidase activity and tetrapyrrole binding. In the Cellular Component ontology the terms cell wall, external encapsulating structure and extracellular region are enriched in the AtII network. In module AtIII, some of the most significantly enriched Biological Process terms are pollen wall assembly, pollen exine formation and sporopollenin biosynthetic process (Table 6). In the Molecular Function ontology the enriched terms include hydrolase activity, hydrolyzing O-glycosyl compounds, lipase activity and nutrient reservoir activity. To summarize, the GO enrichments indicate that the AtI module could be involved in cuticle development, AtII in suberin biosynthesis and AtIII in pollen exine formation.

Overrepresented promoter motifs

The occurrence of overrepresented motifs in the promoter regions of the Arabidopsis expression modules was examined in order to get further clues about the factors involved in the transcriptional regulation. In module AtI, three of the motifs found are involved in light-regulated gene expression, four in ABA signaling and stress responses and three are related to different developmental stages (Table 7). The occurrence of motifs involved in light regulation and leaf development fits well with the significant enrichment of many photosynthesis-related GO terms in the AtI network. The finding of motifs related to ABA signaling is not surprising either, since there was also significant enrichment for several abiotic stress-related terms, such as response to radiation and response to cold. However, there was no direct evidence for ABA-regulated expression of AtI in the microarray datasets. For module AtII there are two overrepresented motifs related to light-regulated gene expression, three involved in other stresses, one in ABA response and three related to different developmental stages. In addition to these, there is one overrepresented cis-element that is related to transcription of phenylpropanoid biosynthetic genes. This motif is particularly interesting since there was a significant enrichment of the GO terms phenylpropanoid metabolic process, phenylpropanoid biosynthetic process and suberin biosynthetic process in the AtII network. The results for module AtIII include four stress-related motifs, two related to developmental stages and one involved in light-regulated gene expression. Further, there is one motif connected to regulation of histone genes and two CIRCADIAN CLOCK-ASSOCIATED 1 (CCA1) binding motifs. The CCA1 binding motifs are present in the promoters of many day-phased genes (Wang et al. 1997; Michael and McClung 2003). The occurrence of CCA1 binding motifs in the AtIII promoters suggests that the expression of these genes may be under circadian regulation.
Interestingly, the promoters of all three expression modules are enriched for RAV1-A binding site motifs. RAV1 is a transcription factor that is considered to be a positive regulator of leaf senescence in Arabidopsis (Woo et al. 2010). The finding of RAV1-A binding site motifs in the LTPG promoters suggests that Arabidopsis LTPG may play a role in leaf senescence. The LTPGs could have an important role in remobilization of break-down products from lipid-containing cell components. As a part of the degradative process in leaf senescence, hydrolytic enzymes such as proteases are induced. Previously, it has been shown that some nsLTPs have a proteolytic activity. It is possible that this protease activity of the nsLTPs may be involved in leaf senescence.
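
As an illustration of the counting that underlies a motif overrepresentation analysis, the sketch below counts occurrences of a short motif on both strands of a set of promoter sequences. It is not the tool used in this study: the promoter sequences and gene names are hypothetical, and the core sequence GATA is used only as an example motif. A full analysis would also compare the counts with a genome-wide background.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def count_motif(seq, motif):
    """Count motif occurrences on both strands, allowing overlapping matches.

    Note: a palindromic motif would be counted once per strand, i.e. twice.
    """
    pattern = re.compile(f"(?={re.escape(motif.upper())})")
    forward = len(pattern.findall(seq.upper()))
    reverse = len(pattern.findall(reverse_complement(seq)))
    return forward + reverse

# Hypothetical promoter fragments, for illustration only
promoters = {
    "geneA": "TTGATAACCTATCGG",
    "geneB": "ACGTACGTACGT",
}
motif_counts = {gene: count_motif(seq, "GATA") for gene, seq in promoters.items()}
print(motif_counts)
```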

Table 7 Overrepresented promoter motifs in the LTPG genes from Arabidopsis

Alternative splicing in Arabidopsis

When the RNA sequences were aligned to genomic sequences, it was revealed that 28 out of 31 expressed Arabidopsis LTPGs possess one or more introns. The in silico analysis of the transcripts further showed that some of these intron-containing genes have several transcript forms. The differences between the various transcripts were found to be associated with the presence or absence of introns. In fact, the in silico analysis indicated that nine of the Arabidopsis genes are alternatively spliced (Table 1). A similar in silico analysis of the rice transcriptome showed that at least six of the rice LTPG genes undergo alternative splicing (Table 2). As a result of the alternative splicing, five of the genes in Arabidopsis and four of the genes in rice have one transcript form encoding the GPI-anchor signal and another transcript form lacking the signal. To confirm or reject the presence of alternative splicing in planta, the transcripts from AtLTPG1, AtLTPG8, AtLTPG11 and AtLTPG29 were amplified and analyzed. At least two primer combinations were used for each gene (Fig. 5). None of the primer combinations resulted in any amplicons for the negative control, where the reverse transcriptase had been omitted from the cDNA synthesis step. Thus, there was no contamination of genomic DNA in the RNA samples (Online Resource 22).

Fig. 5

Maps showing the gene structures of AtLTPg1 (a), AtLTPg8 (b), AtLTPg11 (c) and AtLTPg29 (d). The primer combinations used for PCR analysis of alternative splicing are shown

For AtLTPG1 the in silico analysis indicated two different isoforms, one with the intron removed and one with the intron retained. During growth in long day conditions only the isoform with the intron removed was found in leaf and root, while both isoforms were found in flower and none in silique (Fig. 6). In plants grown under constant light, both AtLTPG1-transcript forms were detected in flower and leaf. In siliques, only the AtLTPG1-isoform without intron was detectable (Fig. 6). Both isoforms of AtLTPG1 transcripts could be confirmed by sequencing of PCR products extracted from gels. In the isoform with the retained intron there is an in-frame stop codon upstream of the GPI-anchor signal. Due to this stop codon, proteins translated from this isoform would lack the GPI-anchor signal.
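
The effect of a retained intron carrying an in-frame stop codon can be sketched as follows. The exon and intron sequences are invented toy sequences, not AtLTPG1 sequence, and the assumption that the GPI-anchor signal is encoded by the second exon is made only for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def codons_before_stop(transcript):
    """Number of codons read from the first ATG up to the first in-frame stop.

    Returns None if there is no ATG or no in-frame stop codon.
    """
    seq = transcript.upper()
    start = seq.find("ATG")
    if start == -1:
        return None
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return (i - start) // 3
    return None

# Hypothetical toy gene: exon 2 is assumed to encode the C-terminal anchor signal
EXON1 = "ATGGCTTGC"       # ATG GCT TGC
INTRON = "GTATGATTTCAG"   # GTA TGA ... contains an in-frame stop when retained
EXON2 = "CCTGGATCTTAA"    # CCT GGA TCT TAA

spliced = EXON1 + EXON2
retained = EXON1 + INTRON + EXON2

print(codons_before_stop(spliced))   # full-length reading frame
print(codons_before_stop(retained))  # truncated inside the retained intron
```

In this toy case the spliced transcript is read through exon 2, whereas translation of the intron-retaining transcript stops inside the intron, so the exon 2-encoded C-terminal region is never reached.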

Fig. 6

Agarose gels showing PCR products from four genes that are putatively alternatively spliced: AtLTPg1 (a, b), AtLTPg8 (c, d, e), AtLTPg11 (f, g, h) and AtLTPg29 (i, j). The cDNA used for PCR was synthesized from RNA isolated from plants grown either under long day conditions (a, c, d, f, g, i) or constant light (b, e, h, j)

According to the in silico analysis of AtLTPG8 transcripts there are two isoforms present, one with both introns removed and one with intron 1 removed but intron 2 retained. We investigated the splicing patterns of both introns in this gene with three different primer combinations, At8.1, At8.2 and At8.3 (Fig. 5). During long day conditions there was no detectable expression of AtLTPG8 in leaves. In flower there were three isoforms present, one with both introns removed, one with both introns retained and one with only intron 2 retained (Fig. 6). In root, we detected the isoform with both introns retained, as well as the isoform with intron 1 retained. In silique, the isoform with both introns removed as well as the isoform with both introns retained were identified (Fig. 6). In plants grown under constant light, expression of AtLTPG8 was only detected in flower. The three isoforms that were found in long day conditions were also seen in the samples from constant light (Fig. 6). All three isoforms were confirmed by sequencing of gel-extracted PCR products. After translation, both isoforms with retained introns would yield proteins without the GPI-anchor, due to in-frame stop codons in the introns.

In silico analysis of AtLTPG11 transcripts revealed isoforms similar to those of AtLTPG8: one with both introns removed and one with intron 1 removed but intron 2 retained. To investigate the splicing patterns of both introns, three primer combinations were used: At11.1, At11.2 and At11.3 (Fig. 5). During long day conditions AtLTPG11 was found to be expressed in flower and root, but not in leaf or silique (Fig. 6). The amplified At11.1 fragment (149 bp) was slightly larger than expected if both introns had been removed, but smaller than expected for a fragment with a retained intron 1. In plants grown under constant light two products were detected with At11.1: one corresponded well to a fragment with both introns removed (122 bp), whereas the other was similar to the 149 bp fragment detected during long day conditions. Sequencing of the larger At11.1 fragment revealed a partial tandem duplication of 27 bases in exon 2, which accounts for the size difference between the 149 bp and 122 bp fragments. Further investigations are needed to reveal whether this is an artifact or an actual modification of the mRNA.
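
The size reasoning for the At11.1 amplicon can be sketched as a comparison between an observed band and the sizes expected for each intron-retention pattern. Only the 122 bp spliced product, the 149 bp band and the 27-base duplication come from the text. The intron lengths, the 5 bp tolerance and the function names are hypothetical placeholders.

```python
from itertools import combinations

def expected_sizes(spliced_bp, intron_lengths):
    """Expected amplicon size for every combination of retained introns."""
    names = sorted(intron_lengths)
    sizes = {}
    for r in range(len(names) + 1):
        for kept in combinations(names, r):
            sizes[kept] = spliced_bp + sum(intron_lengths[name] for name in kept)
    return sizes

def explain_band(observed_bp, sizes, tolerance_bp=5):
    """Retention patterns whose expected size lies within the tolerance of the band."""
    return [kept for kept, bp in sizes.items() if abs(bp - observed_bp) <= tolerance_bp]

# Intron lengths are hypothetical placeholders, not measured values
HYPOTHETICAL_INTRONS = {"intron1": 90, "intron2": 110}
sizes = expected_sizes(122, HYPOTHETICAL_INTRONS)

print(explain_band(122, sizes))  # matches the fully spliced form
print(explain_band(149, sizes))  # no retention pattern fits this band
print(149 - 122)                 # size difference, equal to the 27-base duplication
```

A band that no retention pattern explains, as with the 149 bp product here, is the kind of result that calls for sequencing rather than interpretation by size alone.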

In AtLTPG29 the in silico analysis predicted an exon skipping event. In this case exon 3, containing a stop codon, is skipped and the alternative exon 4 is reached. Only the isoform without exon 3 encodes the GPI-anchor signal, because the stop codon in exon 3 leads to a much shorter polypeptide. Two primer combinations, At29.1 and At29.2, were used to investigate the splicing pattern in planta. Expression of AtLTPG29 was detected in flowers during long day conditions and in flowers and siliques during constant light (Fig. 6). In all cases where expression was detected both isoforms of AtLTPG29 transcripts were found. Thus, also for AtLTPG29 there are transcripts encoding the GPI-anchor attachment signal, as well as transcripts lacking the in-frame GPI-anchor signal.

To conclude, it was verified in planta that several LTPG transcripts in Arabidopsis are alternatively spliced. The occurrence of alternative splicing in the LTPG genes varies between different tissues, and the splicing patterns sometimes differ between plants grown under long day conditions and in constant light. Interestingly, for three of the four tested genes there is one transcript form that encodes a protein with the GPI attachment signal and another that should not yield a protein with a GPI anchor. Thus, it seems that alternative splicing could play a role in regulating the cellular localization of LTPGs.

Phylogeny of Arabidopsis and rice LTPGs

A phylogenetic analysis of the Arabidopsis and rice sequences is shown in Fig. 7. Members of the same modules are distributed all over the phylogenetic tree. However, within all modules there are some putative paralogs that are clustered together, such as OsLTPG26 and OsLTPG27, and AtLTPG4 and AtLTPG23, suggesting that the modules have expanded through duplications after the separation of monocotyledons and dicotyledons. A more striking finding is that some rice LTPGs and Arabidopsis LTPGs from equivalent modules are putative orthologs found on the same branch of the tree. This is seen for AtI and OsI, for example between AtLTPG1 and OsLTPG22, between AtLTPG6 and OsLTPG10 and between AtLTPG2 and OsLTPG12, and also for AtII and OsII in the case of the cluster OsLTPG8, OsLTPG17, AtLTPG16 and AtLTPG20. The phylogenetic tree therefore indicates that the gene expression patterns, manifested in the expression modules, were established before the separation of rice and Arabidopsis. The genes that undergo alternative splicing are not confined to specific branches of the phylogenetic tree (Fig. 7). However, the alternative splicing of the conserved rice and Arabidopsis genes OsLTPG22 and AtLTPG1 shows that there is at least one example where the evolution of the alternative splicing event possibly pre-dates the separation of monocots and dicots.

Fig. 7

A maximum likelihood phylogenetic tree of the LTPGs from Arabidopsis and rice, with LTPGs from the moss P. patens (PpLTPg2, PpLTPg3 and PpLTPg4) as outgroup. Pseudogenes are not present. Only bootstrap values above 50 are shown. Members of each expression module are shown as uniformly colored protein names, while proteins not considered as members of a module are written in black. Genes with two or more transcript isoforms are indicated by the label AS behind the name. If the GPI-anchor signal is lost in one of the transcript forms, a crossed out anchor is shown

Discussion

The aim of this study was to find groups of LTPG genes that are involved in related biological processes. We reasoned that the identification of such functional groups is important for further systematic investigations into the biological roles of this enigmatic family of proteins. Here, we have identified coexpressed LTPG genes in both rice and Arabidopsis. Among the coexpressed genes we could identify three different expression profiles. The coexpressed genes were therefore placed into three separate groups or modules. The Arabidopsis module AtI is built from the three genes AtLTPG1, AtLTPG2 and AtLTPG6. The GO analysis of the AtI expression network resulted in many significantly enriched terms related to photosynthesis. Further, the search for regulatory elements showed that three promoter motifs associated with light-regulated gene expression, GATA, Ibox and SORLREP3 (Hudson and Quail 2003; Reyes et al. 2004), are overrepresented in the promoters of the AtI module genes. Light is one of the factors that have been demonstrated to increase wax deposition, as revealed by comparisons of light- and dark-grown plants (reviewed in Shepherd and Wynne Griffiths 2006). The light-regulated expression and the coexpression with photosynthesis genes therefore support the view that the genes in the AtI module have their main function in the deposition and biosynthesis of the cuticular waxes or cutin.

Our results are further supported by functional reports on the genes in module AtI (DeBono et al. 2009; Lee et al. 2009; Kim et al. 2012). Decreased AtLTPG1 expression in Arabidopsis resulted in less wax being loaded on the stem surface (DeBono et al. 2009). However, when AtLTPG1 was disrupted in another study, no significant alterations were found in the wax load (Lee et al. 2009). Rather, Lee et al. demonstrated a 10 % reduction of the C29 alkane (nonacosane), which is the major component of cuticular waxes in the stems and siliques. Although the data from these studies show some contradictions, the results indicate that AtLTPG1 is involved in cuticular lipid accumulation. More recently, it was shown that AtLTPG2 is functionally redundant or overlapping with AtLTPG1, since the wax load in stems and siliques was reduced by about 10 % also in an AtLTPG2 insertion mutant (Kim et al. 2012). Our data suggest that AtLTPG6 is functionally overlapping with AtLTPG1 and AtLTPG2. Possibly, the wax load and C29 alkane levels would be further reduced in a triple mutant knocked out for AtLTPG1, AtLTPG2 and AtLTPG6.

Module AtII is the largest expression module with seven genes. The GO analysis of AtII revealed significant enrichment of the term phenylpropanoid biosynthetic process and its child term suberin biosynthetic process. Suberin consists of an aliphatic cutin-like and an aromatic lignin-like domain (Bernards 2002) and is deposited for example in the endodermis and hypodermis of roots, the bundle sheaths of leaves, in seed coats and in the periderm of shoots and roots. Suberin is deposited as a lamella on the inner surface of the cell wall, thus separating the cell wall from the plasma membrane (Pollard et al. 2008; Schreiber 2010). The GO analysis suggests that module AtII may be involved in suberin biosynthesis and deposition in roots. This is also supported by the expression pattern, where several genes of AtII reach their highest transcript levels in roots. Furthermore, the MYB binding site motif that is significantly enriched in the AtII promoters is known to enhance the transcription of phenylpropanoid biosynthetic genes (Sablowski et al. 1994). The phenylpropanoid biosynthetic pathway provides precursors for the synthesis of suberin. These results open the way for directed investigations aimed at elucidating the role of the LTPGs in suberin accumulation. So far there is, to our knowledge, no published experimental evidence that links the function of nsLTPs to suberin deposition (Ranathunge et al. 2011).

Module AtIII is highly expressed in flowers and seeds and shows GO enrichments that suggest a role for this module in sporopollenin biosynthesis or deposition. Sporopollenin is a major component of the exine walls of pollen grains and contributes to the remarkable resistance of the pollen wall to abiotic and biotic stresses, such as dehydration, UV irradiation, and pathogen attack. The chemical composition of sporopollenin is not exactly known, due to its unusual chemical stability. Recent investigations show that sporopollenin is not a homogeneous macromolecule but is instead made up of complex biopolymers derived mainly from saturated precursors such as long-chain fatty acids or long aliphatic chains. It has been suggested recently that nsLTPs may have a role in sporopollenin synthesis (Ariizumi and Toriyama 2011), although our study is, to our knowledge, the first to provide data pointing in this direction. Two CCA1 binding motifs are present in the AtIII promoters, which suggests that the genes are regulated by the circadian clock. The circadian clock is known to regulate the development of reproductive organs, the flower opening required for efficient pollination and the production of volatile compounds giving the signature scent of the plant (Yakir et al. 2007; Troncoso-Ponce and Mas 2012). One would assume that the maturation of pollen coincides with these events and consequently is also controlled by the circadian clock.

To summarize, we suggest that module AtI is involved in light regulated deposition or synthesis of cutin or cuticle waxes, that module AtII may have a role in the synthesis and deposition of suberin in roots and seed coats, while module AtIII could be involved in sporopollenin biosynthesis and deposition in pollen grains. The cuticular waxes, suberin and sporopollenin are all polymers built from long-chain fatty acids or long aliphatic chains. Their synthesis requires at least four steps: (1) the de novo synthesis of polymer precursors (2) secretion from the lipid bilayer to the apoplastic compartment (3) transfer of the precursors through the apoplastic compartment or the cell wall and (4) polymerization (Ariizumi and Toriyama 2011; Ranathunge et al. 2011). Thus in step (3) above, once the hydrophobic lipid polymer compounds are exported, they have to pass through a highly hydrophilic environment, such as the cell wall, on their way to the polymerization site. How this transport is achieved is still unknown, but it is not unlikely that the LTPGs are involved in the delivery of the polymer precursors.

We included rice in our investigation to see if our findings from Arabidopsis could be relevant also in monocots. Interestingly, in both rice and Arabidopsis there is one expression module that is predominant in aerial parts (AtI and OsI), another in roots (AtII and OsII), and a third module with an expression pattern restricted to reproductive tissues (AtIII and OsIII). Further, in both rice and Arabidopsis the root-abundant modules (AtII and OsII) contain the largest number of LTPG genes, with 7 genes in Arabidopsis and 8 genes in rice. In conclusion, according to the expression patterns, the number of members and the distribution in the phylogenetic tree, the modules found in Arabidopsis and rice appear to be functionally equivalent. The identification of equivalent expression modules in dicots and monocots indicates that the LTPG expression profiles were established before the separation of monocots and dicots. This evolutionary conservation lends further support to the usefulness of our approach for deducing the function of LTPGs in flowering plants.

This study of the LTPGs is, to our knowledge, one of the first cases suggesting alternative splicing as a potential regulator of the GPI-anchoring process in plants. However, there are several similar mammalian examples where alternative splicing generates transcript isoforms with the anchoring signal and other isoforms lacking the signal (Patel et al. 2000; Grahnert et al. 2005; Kikuchi et al. 2008). For several LTPG genes the alternative splicing results in one transcript form with the GPI-anchoring signal and another form without it. This indicates that protein isoforms both with and without a GPI-anchor are produced from these genes. These isoforms may have different properties: the isoform without the anchor may not be functional or, perhaps more likely, the isoforms may have different localizations in the cell or the organism. In the case of the LTPGs it is possible that the versions lacking the GPI-anchor are unattached to the plasma membrane and located in the apoplastic space, where they could be involved in the downstream transportation of lipids from GPI-anchored LTPGs to the plant surface. It seems plausible that alternative splicing has evolved as a mechanism to control the activity of at least some of the LTPGs. The observed alternative splicing further provides an evolutionary and functional explanation for the conservation of an intron at a position between the last, most C-terminal, of the conserved cysteines and the GPI-anchor signal (Edstam et al. 2011). If the alternative splicing is a regulatory mechanism, it is likely that each isoform is predominant under certain conditions. It will now be of special interest to obtain knowledge about when and where the different transcript and protein isoforms accumulate. If we succeed in determining the localization of the LTPG isoforms we may get further important clues to the function of these proteins.

We have previously identified that the genes encoding the LTPGs likely evolved in plants soon after the colonization of land, since the genes are present in early diverging land plants, such as liverworts, but not identified in streptophyte algae (Edstam et al. 2011). The first land plants faced numerous challenges that included increased exposure to UV radiation, desiccation, and temperature stress when they adapted to a life on land approximately 470 million years ago. Sporopollenin and cuticular waxes are present in liverworts and mosses as well as in highly diverged plants like Arabidopsis (Neinhuis and Jetter 1995; Cook and Graham 1998). We speculate that the LTPGs may have been selected for during land plant evolution due to the fact that their gene products are involved in the defense against radiation and desiccation.