1 Introduction

Rice (Oryza sativa) is one of the most important crops in the world. The rice genome has been sequenced and found to comprise more than 32,000 genes (Gojobori 2007). The future challenge in this regard lies in identifying the biological functions of these genes. Rice full-length cDNAs have been collected for comprehensive analysis of the functions of rice genes (Kikuchi et al. 2003). The full-length cDNA over-expressing (FOX) gene hunting system, which employs a gain-of-function approach, has been used for investigating gene function (Ichikawa et al. 2006; Kondou et al. 2009). When the FOX hunting system is applied to individual transgenic plants, the dominant phenotype of a mutation is caused by the overexpression of the full-length cDNA introduced. This technique can be applied to the identification and characterization of rice genes using Arabidopsis thaliana as the heterologous host of transgenes. The overexpression of rice genes results in not only visible phenotypes but also invisible phenotypes, including remarkable alterations in the metabolite composition (metabotype) (Hall 2006; Raamsdonk et al. 2001).

Many analytical technologies based on gas chromatography-mass spectrometry (GC-MS), liquid chromatography (LC)-MS, capillary electrophoresis (CE)-MS, nuclear magnetic resonance (NMR) spectroscopy, or Fourier transform-infrared (FT-IR) spectroscopy have been developed for evaluating metabotypes (Allwood et al. 2006; Bauer et al. 2006; Fiehn et al. 2000; Grata et al. 2008; Sato et al. 2004; Ward et al. 2003; Yang and Yen 2002). Since MS-based techniques have high selectivity and sensitivity for the identification and quantification of metabolites, they have been extensively used for metabolite profiling. On the other hand, NMR and FT-IR spectroscopy have low selectivity but can be used to discriminate between biological samples on the basis of differences in their metabolite composition. Therefore, these techniques are often used for metabolite fingerprinting (Ellis et al. 2007; Fiehn 2002). However, these techniques have a limited dynamic range, and sample preparation for analysis is time-consuming.

Recently, Fourier transform-near-infrared (FT-NIR) spectroscopy has been widely used for quality assessment of industrial materials and natural products, owing to its simplicity and speed (Hall and Pollard 1992; Ikeda et al. 2007; Rodriguez Otero et al. 1997). Unlike typical NMR and FT-IR techniques, FT-NIR spectroscopic analysis does not entail destructive preparation steps such as homogenization or extraction with organic solvents, thereby enabling the use of large sets of seed samples and a greater number of assays with the same sample. Absorption in the NIR spectral region (4000–10000 cm−1) allows the detection of overtones and combinations of the fundamental vibrations derived from the stretching and bending of NH, OH, and CH groups (Weyer and Lo 2002; Workman 2000), while absorption in the mid-IR spectral region (400–4000 cm−1) reflects the fundamental vibrations of organic molecules, which enables the classification of differences in macromolecule composition such as cell wall structures (Mouille et al. 2003). FT-NIR can be used to determine metabolite levels such as the amino acid content and fatty acid composition of seed samples (Kovalenko et al. 2006; Sato et al. 1998). Therefore, FT-NIR spectra reflect the metabolite composition and can be used for metabolite fingerprinting (Munck et al. 2001).

Metabolite fingerprint datasets are statistically analyzed by chemometric approaches, including principal component analysis (PCA) and orthogonal projection to latent structures-discriminant analysis (OPLS-DA), for evaluating biological alteration (Pohjanen et al. 2007). In addition, recent studies have applied multivariate regression using orthogonal projections to latent structures (O2PLS) to combinations of “omics” datasets (Bylesjo et al. 2007; Bylesjo et al. 2009; Rantalainen et al. 2006). This methodology can be used to extract the joint variation from different analytical platforms for the interpretation of complex biological processes. In this context, O2PLS can be applied to the integration of FT-NIR datasets and other omics datasets such as those obtained from gas chromatography-time-of-flight/mass spectrometry (GC-TOF/MS) to elucidate the association between FT-NIR spectral data and the metabolite levels observed in GC-TOF/MS analysis.

In this study, we developed a non-destructive analytical method using FT-NIR spectroscopy for screening seed samples of rice-Arabidopsis FOX lines; these samples were obtained from transgenic A. thaliana lines that overexpressed rice full-length cDNA. Next, re-transformants of candidate genes were analyzed using OPLS-DA to assess alterations in their metabolite fingerprints clearly. Moreover, GC-TOF/MS was used to confirm the changes in metabolite profiles. Finally, the predictive metabolites obtained in FT-NIR analysis were studied in greater detail by applying the O2PLS method.

2 Materials and methods

2.1 Plant material

For the evaluation of discrimination abilities, seed samples of five Arabidopsis ecotypes (Col-0, Ws, Ler, Nossen, and C24) were used. Arabidopsis transgenic lines expressing rice full-length cDNA (rice FOX lines) under the control of the CaMV 35S promoter were constructed using the Agrobacterium tumefaciens in planta transformation method (Clough and Bent 1998). We screened seed samples of the T2 generation that did not exhibit any visible phenotypes. A total of 3003 lines (74 batches) were analyzed by FT-NIR spectroscopy. Each batch comprised a maximum of 51 FOX lines harvested during the same growth period with a cultivation container system (Arasystem, Gent, Belgium). To diminish the environmental effect, we analyzed the FOX seeds of each batch separately. Subsequently, genomic DNA of each candidate line exhibiting alteration of metabolite fingerprints was isolated from the corresponding seed sample. Then, the rice cDNA was sequenced for the annotation of gene function. For annotation of the inserted rice cDNA, the sequences of the candidate genes were analyzed on the basis of information available in the Knowledge-based Oryza Molecular biological Encyclopedia (KOME) and the Rice Annotation Project (RAP) databases. The re-transformants (five individual transgenic plants for each candidate) were constructed to confirm the reproducibility of the metabotype for each line. The details of the method have been described by Kondou et al. (2009). Arabidopsis seeds (Col-0), which were harvested at the same time as the re-transformants, were used as the control. A total of 26 re-transformants (127 samples) were analyzed by FT-NIR spectroscopy. Among them, seven lines (34 samples) exhibited changes in their metabolite fingerprints; these were analyzed by GC-TOF/MS.

2.2 FT-NIR analysis

For the screening of FOX lines, 200 seeds were placed in a 0.3 ml glass tube for each line, and the FT-NIR spectra of the seeds of each line were directly measured six times. FT-NIR spectra were measured with a Nicolet 6700 FT-IR equipped with a Smart Near-IR UpDRIFT, CaF2 beamsplitter, and cooled InGaAs detector (Thermo Electron Corporation, Madison, USA). The mirror velocity at a resolution of 4 cm−1 was 1.2659 cm/s. The total number of data points in the range of 4500–7500 cm−1 was 1556 for each spectrum. The diffuse reflectance spectrum was obtained by ratioing the single beam spectrum against the background spectrum using Spectralon (LabSphere, Inc.). Each spectrum was recorded as an average of 32 scans using OMNIC 7.2a (Thermo Electron Corporation, Madison, USA). Sample information and the raw spectral dataset of rice-Arabidopsis FOX seeds are available through the Rice FOX database (http://ricefox.psc.riken.jp/). The obtained FT-NIR spectra were transformed with multiplicative signal correction (MSC) (Geladi et al. 1985) to minimize the variations in sample path length that are caused by the light scatter effect resulting from the differences in individual seed shape. Then, overlapping absorption peaks were resolved with a 25-point polynomial-fit Savitzky-Golay second derivative (Savitzky and Golay 1964).
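The two spectral corrections described above can be illustrated with a minimal NumPy sketch. It is not the processing code used in this study: the function names are hypothetical, the mean spectrum is assumed as the MSC reference, and the polynomial order (2) is an assumption because only the 25-point window is stated in the text.

```python
import numpy as np


def msc(spectra, reference=None):
    """Multiplicative signal correction: regress each spectrum on a reference
    (assumed here to be the mean spectrum) and remove the fitted offset and slope."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, s in enumerate(spectra):
        slope, intercept = np.polyfit(ref, s, 1)  # s ~ intercept + slope * ref
        corrected[i] = (s - intercept) / slope
    return corrected


def savgol_second_derivative(y, window=25, polyorder=2, delta=1.0):
    """Savitzky-Golay second derivative from a local polynomial least-squares fit.
    Only points with a full window are returned (len(y) - window + 1 values).
    polyorder=2 is an assumption; the text specifies only the 25-point window."""
    half = window // 2
    x = np.arange(-half, half + 1)
    design = np.vander(x, polyorder + 1, increasing=True)  # columns x^0, x^1, ...
    # Row 2 of the pseudo-inverse yields the x^2 coefficient; d2y/dx2 = 2 * a2.
    kernel = 2.0 * np.linalg.pinv(design)[2] / delta ** 2
    # Reversing the kernel turns np.convolve into a correlation.
    return np.convolve(np.asarray(y, dtype=float), kernel[::-1], mode="valid")
```

With SciPy available, `scipy.signal.savgol_filter(y, 25, 2, deriv=2)` computes the same kind of derivative and also handles the edge points, which this sketch simply drops.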

2.3 GC-TOF/MS analysis

To assess the metabolite profiles of the candidate lines that showed altered metabolite fingerprints in FT-NIR analysis, 200 seeds of each of the candidate lines were extracted at a concentration of 10 mg/ml, derivatized, and then analyzed by GC-TOF/MS as described in Kusano et al. (2007). A total of 266 metabolite peaks were extracted for each seed sample. Of them, 67 peaks were identified or annotated as known metabolites, 186 peaks were of unknown metabolites, and 13 peaks were annotated as mass spectral tags (MSTs) (Schauer et al. 2005).

2.4 Statistical analysis

Before multivariate analysis, the corrected FT-NIR spectral datasets were mean-centered and the GC-TOF/MS dataset was scaled to unit variance following log10-transformation. The multivariate models were calculated using PCA, OPLS-DA, and O2PLS implemented in SIMCA-P+ version 12 (Umetrics AB, Umeå, Sweden). The ellipse in the PC score plot represents the confidence region of the model based on Hotelling’s T2 statistic (Hotelling 1931; Mason et al. 2001). The significance level of the confidence region was defined at 0.05, and the data that fell outside the ellipse were determined to belong to candidate lines. These models were validated using 7-fold cross-validation or analysis of variance of cross-validated predictive residuals (CV-ANOVA) (Eriksson et al. 2008). Cross-validation is an internal predictive validation method for determining the number of significant components by calculating the total amount of explained X-variance (R2X), Y-variance (R2Y), and cross-validated predictive ability (Q2Y). A component is considered significant when Q2Y is positive. Additionally, the variance related to class separation (R2PX) was calculated by OPLS-DA. CV-ANOVA is based on an ANOVA assessment of the cross-validatory predictive residuals of the models. Welch’s t test was performed, and the false discovery rate (FDR), which has been shown to be reliable for assessing significance in multiple testing (Storey 2002), was calculated using Microsoft Office Excel 2003 software. A q-value for FDR of less than 0.05 was regarded as significant.
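The scaling steps and the cross-validated predictive ability described above can be sketched as follows. This is an illustration under stated assumptions rather than the SIMCA-P+ implementation: the function names are hypothetical, the sample standard deviation (n − 1) is assumed for unit-variance scaling, Q2 is assumed to be 1 − PRESS/SS, and the fold assignment shown is only one simple possibility.

```python
import numpy as np


def mean_center(X):
    """Subtract the column mean (as applied to the corrected FT-NIR spectra)."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0)


def log_unit_variance(X):
    """log10-transform, then center and scale each column to unit variance
    (as applied to GC-TOF/MS peak intensities, which must be positive)."""
    L = np.log10(np.asarray(X, dtype=float))
    return (L - L.mean(axis=0)) / L.std(axis=0, ddof=1)


def q2(y_true, y_cv_pred):
    """Cross-validated predictive ability, assumed as Q2 = 1 - PRESS / SS_total,
    where y_cv_pred holds predictions made while each sample was left out."""
    y_true = np.asarray(y_true, dtype=float)
    press = np.sum((y_true - np.asarray(y_cv_pred, dtype=float)) ** 2)
    ss_total = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - press / ss_total


def seven_fold_groups(n_samples, n_folds=7):
    """Assign sample i to fold i mod n_folds (a simple deterministic scheme)."""
    return np.arange(n_samples) % n_folds
```

Under this definition a model that predicts no better than the mean of the response gives Q2 = 0, which is why a positive Q2Y is taken as the criterion for a significant component.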

3 Results and discussion

3.1 Metabolite fingerprinting of seeds of various Arabidopsis ecotypes by FT-NIR spectroscopy

To evaluate the discrimination abilities of metabolite fingerprints of Arabidopsis seeds by FT-NIR spectroscopy, five Arabidopsis ecotypes were analyzed. The FT-NIR spectral data of the five ecotype seeds showed broad peaks and extensive overlapping of NIR absorption bands derived from complex chemical components in the sample (Fig. 1a). For elimination of baseline shift and enhancement of shoulder peaks, MSC and second-derivative transformation were applied to the spectral dataset (Fig. 1b). After spectral correction, multivariate analysis was performed to evaluate the corrected spectra. PCA was performed to visualize the strongest varying components of the spectra obtained for the Arabidopsis ecotypes, whose metabolite fingerprints differed. In the case of Arabidopsis seeds, the samples were distributed according to ecotype in the PC score scatter plot, showing different metabolite fingerprints (Fig. 1c). Furthermore, the acquisition time per analysis in FT-NIR spectroscopy is short (30 s); therefore, this method is beneficial for large-scale screening.
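As an illustration of how PC scores such as those in Fig. 1c are obtained from corrected spectra, the following is a minimal PCA sketch based on the singular value decomposition. It is not the SIMCA-P+ implementation, and the function name is hypothetical.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """PCA of mean-centered data via SVD. Returns the scores, the loadings, and
    the fraction of total variance explained by each retained component."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # mean-center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return scores, loadings, explained
```

Plotting the first column of `scores` against the second gives a PC1 versus PC2 score plot; samples with similar spectra lie close together.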

Fig. 1 Typical FT-NIR spectra of seed samples of five Arabidopsis ecotypes. Two hundred seeds were placed in a glass tube, and the FT-NIR spectra were measured 20 times. a Raw spectral data of the seed samples. b Spectra corrected by MSC and second-derivative transformation. c Discrimination abilities of metabolite fingerprints of Arabidopsis seeds. The plot of principal component 1 (PC1) versus principal component 2 (PC2) is presented. Each colored symbol represents an ecotype

3.2 Screening of seeds of rice-Arabidopsis FOX lines by FT-NIR spectroscopy

To screen for rice-Arabidopsis FOX lines that show specific metabolite fingerprints (i.e., metabotypes) in seeds, FOX seeds were analyzed using FT-NIR spectroscopy. In this study, FOX seeds of the T2 generation were used. The corrected spectra of these lines were subjected to PCA to filter candidate lines.

The candidate lines showed different distribution patterns when compared with the major distribution patterns of the other lines in the PC score scatter plot (Fig. 2). Using FT-NIR-based metabolite fingerprinting, 3,003 FOX lines were analyzed. From the result of the analysis, 30 lines that showed altered metabolite fingerprints were selected as the candidate lines. Among them, 26 lines—in which the rice full-length cDNA was correctly inserted—were used for further analysis.
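The filtering step, in which lines falling outside the Hotelling's T2 ellipse of the PC1 versus PC2 score plot are treated as candidates, can be sketched as below. This is a hedged illustration: the function name is hypothetical, the critical limit uses one common form, A(n² − 1)/(n(n − A)) · F(α; A, n − A), which is assumed here and not confirmed as the exact expression used by SIMCA-P+, and the closed-form F quantile is valid only for two components.

```python
import numpy as np


def hotelling_t2_outliers(scores, alpha=0.05):
    """Flag samples outside the Hotelling's T2 ellipse of a two-component
    score plot. `scores` must be mean-centered PC scores of shape (n, 2).
    Returns (T2 values, critical limit, boolean outlier mask)."""
    scores = np.asarray(scores, dtype=float)
    n, a = scores.shape
    if a != 2:
        raise ValueError("the closed-form F quantile below assumes 2 components")
    t2 = np.sum(scores ** 2 / scores.var(axis=0, ddof=1), axis=1)
    d2 = n - a
    # Upper-alpha quantile of F(2, d2); closed form for 2 numerator df.
    f_crit = (d2 / 2.0) * (alpha ** (-2.0 / d2) - 1.0)
    limit = a * (n ** 2 - 1) / (n * (n - a)) * f_crit
    return t2, limit, t2 > limit
```

For a batch of 51 lines and α = 0.05 the limit under this assumed form is about 6.6, so a line is flagged when its squared, variance-scaled distance from the center of the score plot exceeds that value.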

Fig. 2 PC score plot of the FT-NIR spectral dataset for the first batch containing 51 lines. The plot of principal component 1 (PC1) versus principal component 2 (PC2) is presented. Each colored symbol represents an individual line. The ellipse represents the confidence region of the model based on Hotelling’s T2 statistic (α = 0.05)

3.3 Assessment of the metabolite fingerprints of re-transformants

We assessed the changes in the metabolite fingerprints of re-transformants harboring rice full-length cDNA; 26 candidate lines fell in this category. Their metabolite fingerprints were compared with those of the wild type by FT-NIR. To assess the effect of the candidate genes clearly, the differences between metabolite fingerprints were confirmed by OPLS-DA. OPLS-DA uses information on the categorical response Y (wild type or re-transformant) to decompose the spectral data into a predictive matrix related to biological alteration and a Y-orthogonal matrix (Pohjanen et al. 2007). This strategy allows for a more realistic interpretation of metabolite fingerprints than well-known methods such as linear discriminant analysis.
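The decomposition into a predictive part and a Y-orthogonal part can be illustrated with a minimal single-response OPLS sketch that removes one Y-orthogonal component. It follows the generic OPLS filtering steps and is not the SIMCA-P+ implementation used here; the function name and the coding of the class variable are assumptions.

```python
import numpy as np


def opls_one_orthogonal(X, y):
    """Single-response OPLS with one Y-orthogonal component.
    X: (n_samples, n_variables), column-centered; y: (n_samples,), centered
    (for example -0.5 for wild type and +0.5 for a re-transformant)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = X.T @ y
    w /= np.linalg.norm(w)             # predictive weight vector
    t = X @ w                          # preliminary predictive scores
    p = X.T @ t / (t @ t)              # loading for those scores
    w_o = p - (w @ p) * w              # part of the loading orthogonal to w
    w_o /= np.linalg.norm(w_o)
    t_o = X @ w_o                      # Y-orthogonal scores
    p_o = X.T @ t_o / (t_o @ t_o)      # Y-orthogonal loading
    X_filtered = X - np.outer(t_o, p_o)
    t_pred = X_filtered @ w            # predictive scores after filtering
    return t_pred, t_o, X_filtered
```

Plotting `t_pred` against `t_o` gives a score plot of the predictive component versus the first orthogonal component; by construction `t_o` is uncorrelated with the class variable, so class separation appears along the predictive axis only.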

Among the 26 candidate lines, the seven lines listed in Table 1 showed differences in their metabolite fingerprints compared with the wild type without overfitting in each OPLS-DA model (Fig. 3). The other 19 lines showed no significant difference with regard to discrimination from the wild type. The loading spectra of the predictive component shown in Fig. 4 indicate the importance of absorption bands for the discrimination of the re-transformants from the wild type. The absorption band from 6950 to 7400 cm−1 has been attributed to a combination of the first overtones of the C–H stretch. The first overtone of the C–H stretch was located from 5600 to 6150 cm−1. The absorption band at around 4850 cm−1 corresponded to combinations of O–H or N–H. The band near 5200 cm−1 has been assigned to the second overtone of the C=O stretch. The combination of the C–H stretch and C–C stretch derived from the benzene moiety was located at around 4675 cm−1. Other minor absorption bands were found to overlap in each spectral region (Weyer and Lo 2002; Workman 2000). The shape of the loading spectra showed a specific pattern for each candidate line. This result suggests that the overexpression of each of the seven rice genes introduced into A. thaliana has a distinct effect on the metabolite fingerprint.

Table 1 Candidate lines exhibiting alteration of metabolite fingerprints

Fig. 3 Discrimination of re-transformants using OPLS-DA based on FT-NIR analysis. OPLS-DA was performed using the FT-NIR spectral dataset for each re-transformant. The plot of the predictive component (tP) versus orthogonal component 1 (tO) is presented. The black symbols represent the wild type, while the red symbols represent re-transformants. Individual transgenic lines are represented by different symbols. Seven re-transformant lines could be clearly discriminated from the wild type

Fig. 4 Importance of absorption bands for discriminating the re-transformants from the wild type in OPLS-DA. Each absorption band indicates overtones and combinations of the fundamental vibrations of organic molecules. Loading of the predictive component in OPLS-DA showed that the absorption bands were important for discriminating the re-transformants from the wild type. The shape of the loading spectra indicated the various effects of the introduced rice genes, which brought about the change in metabolite fingerprints

3.4 Assessment of metabolite profiles by GC-TOF/MS

For the assessment of rice gene functions, it is necessary to further analyze the metabolite profiles obtained. To identify the metabolites whose levels differed among the re-transformant lines, we analyzed the seven candidate lines by GC-TOF/MS. OPLS-DA was carried out to classify and interpret the metabolite profiles. Samples that were misclassified in the OPLS-DA model were excluded for clear assessment of metabolite profiles. Candidates 1–6 could be clearly discriminated from the wild type, but the OPLS-DA model for Candidate 7 was overfitted (supplementary Fig. 1). The loading weights of the OPLS-DA model, fold change, and p-value of the t test for significantly altered metabolites in each candidate line are listed in Table 2. For Candidate 6, the known metabolites did not show any alteration, but 13 unknown compounds were found to be significantly altered (data not shown).
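The per-metabolite statistics reported in Table 2 can be illustrated with a short standard-library sketch of Welch's t statistic and the fold change. The function names are hypothetical, and defining the fold change as the ratio of group means is an assumption, since the exact definition is not given in the text.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two groups with possibly unequal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def fold_change(treated, control):
    """Ratio of group means (re-transformant over wild type); an assumed definition."""
    return mean(treated) / mean(control)
```

The two-sided p-value follows from the t distribution with `df` degrees of freedom; with SciPy available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with the p-value.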

Table 2 Significant changes in the levels of known metabolites, as determined by GC-TOF/MS analysis

The changes in the metabolite profiles were unique for each candidate line. In relation to Candidate 1, it has been reported that T-DNA insertional Arabidopsis mutants of the ETFβ gene showed significant accumulation of several amino acids, isovaleryl CoA, and phytanoyl CoA during dark-induced carbohydrate deprivation (Ishizaki et al. 2006). It is expected that the changes in the metabolite profile of Candidate 1 are influenced by the function of ETF. In Candidate 3, resistance to Pseudomonas syringae DC3000 was confirmed by Mori et al. in another study (Kondou et al. 2009). Further metabolomic analyses of different tissues at different stages of growth and under different stress conditions would enable advanced investigation of rice gene functions.

3.5 Relationships between absorption in the FT-NIR spectra and the metabolite levels determined by GC-TOF/MS

The changes in metabolite fingerprints were specific for each candidate line, and information about individual metabolites may be obtainable from the FT-NIR spectra. Here, we used the O2PLS multivariate regression method to identify the predictive metabolites in FT-NIR analysis. Spectroscopic and chromatographic techniques entail systematic variations such as baseline shift and background noise. With O2PLS, such irrelevant variations can be removed, and joint variation related to biological alteration can be extracted from metabolomics datasets (Bylesjo et al. 2007; Bylesjo et al. 2009; Rantalainen et al. 2006). Joint variation obtained from FT-NIR spectra and GC-TOF/MS datasets of known metabolites is useful for understanding the relationships between absorption in the FT-NIR spectra and metabolite levels. Here, the datasets of Candidates 1–7 and the wild type were used for constructing the model.

The O2PLS model was constructed with three predictive components that account for 81% of the total variation in the FT-NIR dataset and 21.1% of the variation in the GC-TOF/MS dataset. Moreover, we identified one orthogonal component (10.9%) in the FT-NIR dataset that was not present in the GC-TOF/MS dataset and one unrelated component (12.4%) in the GC-TOF/MS dataset that was not available in the FT-NIR dataset. To filter predictive metabolites in the joint variation of the O2PLS model, CV-ANOVA was used as the significance test. In addition, the q-values for FDR were calculated using the p-values of CV-ANOVA. The threshold of q-value for significantly predictive metabolites was defined at 0.05. We found 21 metabolites to be predictive in the O2PLS model (supplementary Table 1). The joint variation of predictive metabolites explained the chemical relationship between FT-NIR spectroscopy and GC-TOF/MS. The O2PLS loading plot shown in Fig. 5 indicates how the absorption bands of the FT-NIR spectra relate to the predictive metabolites. The relative loading weights indicate the strength of relatedness. Predictive metabolites were clustered on the basis of similarities in their chemical structure; the clustering revealed that compounds with similar structure were predicted by similar absorption patterns in their FT-NIR spectra.
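The multiple-testing step can be illustrated with the Benjamini-Hochberg adjustment, a simpler stand-in for the q-value calculation described here. It corresponds to a conservative q-value estimate in which the proportion of true null hypotheses is fixed at 1, and it is not Storey's estimator itself; the function name is hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A metabolite would then be called predictive when its adjusted value falls below the chosen threshold, 0.05 in this study.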

Fig. 5 Overview of the relationships between the absorption bands of FT-NIR spectra and predictive metabolites. The O2PLS loading plot obtained by integration of the FT-NIR and GC-TOF/MS datasets for Candidates 1–7 and the wild type is shown here. The loadings of the FT-NIR and GC-TOF/MS datasets were concatenated into one vector (joint loading vector). The red symbols represent predictive metabolites, and the gray symbols represent wave numbers in the FT-NIR spectra. Variables that are near each other are positively correlated, and those situated opposite each other are negatively correlated

The variation related to specific metabolite changes, which can be considered to be caused by rice gene function, was not extracted in the joint variation; however, it can be explained by unique variation in the GC-TOF/MS dataset (supplementary Fig. 1). On the other hand, it is also expected that the variation in the orthogonal component of the FT-NIR spectra can explain other aspects of the metabotypes in each candidate line. These features enable the application of FT-NIR spectroscopy to the screening of a variety of metabolites.

4 Concluding remarks

We have developed a non-destructive screening method using FT-NIR spectroscopy for the analysis of seeds of rice-Arabidopsis FOX lines. This method is time-saving in that it can be used to detect the metabolite fingerprints of seed material without pretreatment. A simple and rapid method is required for the screening of rice-Arabidopsis FOX lines; thus, FT-NIR spectroscopy was suitable for this research. Moreover, OPLS-DA based on the GC-TOF/MS dataset revealed the changes in metabolite profiles in greater detail. In addition, the O2PLS methodology provided additional information about predictive metabolites in the FT-NIR analysis. The advantage of FT-NIR spectroscopy is that it can be used to detect the composition of a variety of metabolites. FT-NIR spectroscopy combined with chemometrics can be used for large-scale screening of gain-of-function mutant resources. Moreover, this technology is expected to be effective in the analysis of useful plant gene functions using loss-of-function mutants such as knock-out (TILLING or insertion mutation) or knock-down (RNAi) resources.