1 Introduction

Tomato (Solanum lycopersicum L.) is an important horticultural crop world-wide and the second most consumed non-cereal vegetable after potato (http://faostat3.fao.org/). Its fruits are a source of essential minerals and nutrients for the human diet, such as vitamins A and C and other antioxidants (Willcox et al. 2003; Fitzpatrick et al. 2012).

Cherry tomato [S. lycopersicum var cerasiforme (Alef.) Voss.] was probably domesticated from S. pimpinellifolium L. and is likely the ancestor of cultivated big-fruited tomato. Domestication was initiated by indigenous people of the Andes who kept and propagated seeds from wild plants with bigger and tastier fruits. During this process 186 sweeps were selected representing 8.3 % of the genome (Lin et al. 2014).

Tomato breeding, on the other hand, began in Europe when improved cultivars were generated to meet several needs including fresh market and processing industries (Foolad 2007). At the beginning of the 20th century, public institutions in the USA and new private companies became interested in this practice. Since then, vast research and economic efforts have been invested in tomato improvement. It was recently reported that selection during the improvement process affected 4,807 genes, representing 7 % of the genome, of which 1 % has also undergone double selection because of the previous domestication process (Lin et al. 2014).

Over the history of tomato breeding, four clear periods can be defined according to the major target of improvement (Bai and Lindhout 2007). Initially, efforts were focused on yield (i), which increased seven-fold in processing tomato between the 20s and 30s. This was due to the incorporation of disease resistances, a good response of the selected F1 hybrids to fertilization, the use of pesticides (Warren 1998), enhanced tolerance to abiotic stresses and an increase in fruit sugar content (Foolad 2007). In fresh tomato, yield enhancement was achieved by crosses with wild relative species (Swamy and Sarla 2008). In this regard, many quantitative trait loci (QTL) for yield and related traits such as fruit weight, total soluble solids or lycopene content were found to be non-randomly distributed in the genome (Fulton et al. 1997; 2000).

The second period began in the 1980s, when shelf life (ii) became the most important issue for fresh market tomatoes. At that time, the effort was focused on elucidating the mechanism of ripening, since this process directly affects shelf life. Many studies concentrated on identifying the principal components involved in fruit ripening and softening, including the role of ethylene (Alba et al. 2005) and the enzyme polygalacturonase (Giovannoni 2001). As a result, several ripening-related genes and QTLs were characterized and mapped. One of them, rin (ripening inhibitor), has been used in marker assisted selection (MAS) programs with promising results (Foolad 2007). In the 1990s, flavor (iii) became the new target for breeding programs. This is a very complex trait, mainly because it is determined by many genetic and non-genetic factors, not all of which are identified or well characterized (Causse et al. 2002; Klee 2013). It has been postulated that the ratio between sugars and organic acids is the main determinant of tomato flavor (Bucheli et al. 1999), but several aromatic volatile compounds also have major influences on this trait (Goff and Klee 2006; Tieman et al. 2012; Rambla et al. 2014). Despite the great effort to improve fruit flavor, only increases in soluble solids and decreases in acidity were obtained in a few cases (Foolad 2007). Finally, in the last and current stage, the breeding goal has been reoriented toward fruit nutritional quality (iv) (Bai and Lindhout 2007).

Over the history of tomato breeding, diverse techniques have been applied. Initially they were based on phenotypic selection and progeny testing. Later, with the advent of molecular markers and linkage maps, MAS improved the efficiency and reduced the time of the traditional programs (Yang et al. 2004; Foolad 2007), and QTL analysis allowed complex traits to be resolved in F2 and backcross populations (Paterson et al. 1988). However, these materials proved disadvantageous for genetic mapping (Foolad 2007). This problem was overcome a few years ago with the development of recombinant inbred lines (RILs) and introgression lines (ILs) that harbor specific genomic regions of wild relative species. Nowadays, with the full tomato genome sequence available (Sato et al. 2012), candidate genes for important traits are being rapidly identified (Causse et al. 2004; Price 2006; Bermúdez et al. 2008; Kamenetzky et al. 2010; Goulet et al. 2012; Lee et al. 2012; Sauvage et al. 2014) and breeders may obtain novel and improved varieties by introgressing useful alleles from wild relative species into cultivated varieties. Since these species represent a source of natural genetic variation, of which only 5 % is harbored by cultivated tomato (Miller and Tanksley 1990; Bretó et al. 1993), introgression lines represent powerful tools for crop improvement, maximizing the potential of wild germplasm. Moreover, RILs constitute an original germplasm source for exploiting favorable new genetic combinations involved in fruit quality through the generation of Second Cycle Hybrids (SCHs) (Liberatti et al. 2013). Because S. pimpinellifolium displays high sweetness and vitamin C contents, disease resistance and improved stress tolerance (Foolad et al. 1998; Foolad 2007; Tigchelaar 1986), and its genome has recently been sequenced (Sato et al. 2012), the comprehensive metabolic profiling of RILs derived from this species is of major interest to assess fruit nutritional quality.

Metabolite profiling, alone or combined with other approaches, has been used to identify key compounds involved in development and stress tolerance, as well as nutritional metabolites, in many agriculturally important plants (Hu et al. 2014). Those metabolites are beneficial both for plants and for human health and diets (Hall et al. 2008; Gechev et al. 2014). This approach was likewise applied to explore natural variability in wild related species in order to find valuable sources for the improvement of agriculturally important traits (Schauer et al. 2005; Rambla et al. 2014). It was also used to discover enzyme function, reconstruct important pathways and define their regulation (Bermúdez et al. 2014; Araújo et al. 2012). Additionally, metabolic profiling coupled with GWA studies allowed the identification of 44 SNP loci associated with 19 metabolic traits and provided 5 candidate genes involved in the genetic architecture of fruit metabolic traits (Sauvage et al. 2014).

Molecular breeding, on the other hand, has been adopted in most biotechnological strategies to develop new crops (Moose and Mumm 2008; Saito and Matsuda 2010). For functional understanding of phenotypes it is essential to integrate genomic information to characterize gene-to-metabolite associations (Carreno-Quintero et al. 2013). Metabolic analysis has been increasingly and successfully used to assist elite germplasm selection (Fernie and Schauer 2009; Rao et al. 2014). In tomato, the first study using metabolite profiling showed that metabolic traits correlated with phenotypic traits such as yield or harvest index, exposing the challenge to use metabolites as biomarkers (Hermann and Schauer 2013). In this regard, the major aim of this work was to characterize metabolic based (Wahyuni et al. 2012) and related agronomic traits of 18 select tomato RILs (S. lycopersicum × S. pimpinellifolium) to provide valuable information to improve fruit quality and metabolic-based traits. To achieve this goal we firstly applied a well-established gas chromatography coupled to mass spectrometry (GC–MS) platform, examining polar extracts of tomato fruit pericarp (Roessner-Tunali et al. 2003) complemented with proton nuclear magnetic resonance (1H NMR) metabolite profiling method (Mattoo et al. 2006; Sorrequieta et al. 2013). Secondly, we evaluated trait relations using clustering methods and network analyses (Guimerà and Nunes Amaral 2005).

2 Materials and methods

2.1 Plant material

Plant material consisted of S. lycopersicum (cv Caimanta), the red-fruited wild species S. pimpinellifolium (LA722), F1 hybrids between both species and eighteen tomato RILs. RILs were obtained after seven generations of selfing and five cycles of antagonistic and divergent selection for fruit shelf life and weight (Zorzoli et al. 2000; Rodríguez et al. 2006); they represent a novel source of public germplasm.

Seeds of the 18 RILs, their parents and F1 were germinated in seedling trays at the end of June and transplanted to the greenhouse after a month according to a completely randomized design. Plants were grown at the Experimental Station “José F. Villarino” (33° SL and 61° WL), Argentina. Plant density in the field was approximately 4 plants per square meter (70 cm between grooves, 40 cm between plants within each groove).

2.2 Fruit phenotype analyses

Twelve agronomic traits, both morphological and biochemical, were measured according to Gallo et al. (2011). Six plants and ten fruits per plant were evaluated for morphological fruit traits including: diameter, weight, height, shape (height/diameter ratio), pericarp thickness, locule number, firmness, shelf life, color index (a/b) and reflectance percentage. Biochemical traits (pH, titratable acidity, soluble solids and soluble solids/acidity ratio) were measured in three independent fruit pools per line harvested from six plants. All characters were measured at ripe stage (when fruit has fully acquired its final color) except for shelf life, which was evaluated at breaker stage (when 10 % of the fruit surface has acquired its red, or final, color).

2.3 Metabolite profile analyses

Metabolite profiles were obtained by both GC-time of flight-(tof)-MS and 1H NMR analyses (as described below) from pericarps of six fruits harvested from six independent plants at ripe stage. Fruits were collected from the 2nd and 3rd floors.

2.3.1 Sample preparation, extraction and GC–MS analyses

The relative levels of metabolites were determined from frozen pericarp samples following the protocol established by Roessner-Tunali et al. (2003) for tomato tissue. Fresh tomato pericarps were harvested at ripe stage, rapidly frozen in liquid nitrogen and stored at −80 °C until analysis. For sample extraction, ~250 mg of frozen pericarp was manually ground in liquid nitrogen. All the powder obtained was extracted in 3000 μl of cold methanol and 120 μl of internal standard (0.2 mg/ml ribitol in water) was added for quantification. The mixture was incubated for 15 min at 70 °C, mixed vigorously with 1500 μl of water and centrifuged at 2200 g. The methanol/water supernatant was reduced to dryness under vacuum. Samples were stored at −80 °C until GC–MS analysis. The dried extract was re-dissolved and derivatised for 120 min at 37 °C (in 60 μl of 30 mg/ml methoxyamine hydrochloride in pyridine) followed by a 30 min treatment at 37 °C with a mixture of 100 μl of N-methyl-N-[trimethylsilyl] trifluoroacetamide and 20 μl of a retention time standard mixture composed of 0.4 ml ml−1 of the same 13 fatty acid methyl esters (FAMEs) used by Lisec et al. (2006). Sample volumes of 1 μl were then injected into the GC–MS using a splitless mode and a hot needle technique.

The GC-tof–MS system was composed of an AS 2000 autosampler, a GC 6890 N gas chromatograph (Agilent Technologies, Santa Clara, CA, USA), and a Pegasus III time-of-flight mass spectrometer (LECO Instruments, St. Joseph, MI, USA), equipped with an electron impact ionization source. GC was performed on an MDN-35 capillary column (30 m in length, 0.32 mm inner diameter, 0.25 μm film thickness; Macherey–Nagel). The injection temperature was set at 230 °C, the interface at 250 °C, and the ion source adjusted to 200 °C. Helium 5.0 was used as the carrier gas at a flow rate of 2 ml/min. The analysis was performed under the following temperature program: 2 min of isothermal heating at 80 °C, followed by a 15 °C per min ramp to 330 °C, and holding at this temperature for 6 min. Mass spectra were recorded at 20 scans per sec with a scanning range of 70 to 600 m/z. The experimental set was composed of 126 samples (18 RILs, both parents and the F1 hybrid, with six replicates each). They were separated into three runs of 54 samples each, with every run including 6 RILs plus the parents and the F1 hybrid for comparison. To homogenize variation through a single run, samples were injected in the following order: replicate #1, #2…#6 of all lines. A sample of Arabidopsis leaves was included in each run, randomly distributed, to check machine sensitivity, one of the most important GC–MS quality controls (see Lisec et al. 2006). Both chromatograms and mass spectra were evaluated using ChromaTOF software, version 3.00 (LECO Instruments, St. Joseph, MI, USA); .peg files were exported to .cdf using a baseline offset of 1 (‘just above the noise’), an average of 5 points for smoothing, a peak width of 10 and a signal to noise ratio of 10. Identification and semi-quantitation of the compounds detected in the GC-TOF–MS metabolite profiling experiment were performed with TagFinder 4.0 software (Luedemann et al. 2008).
Then, .cdf files were converted to .txt with the Pick Apex finding tool of TagFinder, considering a smooth width apex finder of 10 and an intensity threshold of 50. Retention time indexes (RI) were calculated using data from the included standards. The data matrix was obtained using a time scan width of 400 RI based on FAMEs (Kind et al. 2009) and a Max-Intensity aggregation of peaks. Mass pairs were automatically extracted with the pBuilder.MassPairFinder tool of TagFinder. Finally, metabolites were identified by comparison with spectral data from the public library GMD@CSB.DB (The Golm Metabolome Database; Kopka et al. 2005); this library includes RIs, molecular weight (m/z) and the associated MS spectra (see online resources Table S1 for metabolites annotation data).

The standardization of the complete metabolite profiling experiment was made in accordance with Lisec et al. (2006). All samples were measured in three independent GC–MS runs, since datasets measured at different times are not directly comparable because of varying tuning parameters of the GC–MS machine over time; we therefore normalized the data by using the S. lycopersicum parental species of each measured batch as a reference (Roessner et al. 2001).
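The batch-wise normalization described above can be illustrated with a minimal sketch. It assumes that each run is held as a nested dictionary of intensities and that the reference is the mean of the S. lycopersicum replicates measured in the same run; the function name, data layout and numbers are hypothetical and are not taken from the original analysis.

```python
from statistics import mean


def normalize_to_reference(batch, reference_line):
    """Express every intensity in one GC-MS run relative to the mean of
    the reference line measured in that same run."""
    # Mean reference intensity per metabolite, within this batch only.
    ref = {m: mean(values) for m, values in batch[reference_line].items()}
    return {
        line: {m: [v / ref[m] for v in values] for m, values in mets.items()}
        for line, mets in batch.items()
    }


# Hypothetical intensities for one run (three replicates per line).
batch_1 = {
    "Caimanta": {"malate": [10.0, 12.0, 11.0]},
    "RIL_1": {"malate": [22.0, 20.0, 24.0]},
}
normalized = normalize_to_reference(batch_1, "Caimanta")
print(mean(normalized["RIL_1"]["malate"]))
```

Because each run is divided by its own reference, values from the three runs end up on a common relative scale in which the cultivated parent averages 1.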

2.3.2 Sample preparation and 1H NMR analyses

Absolute levels of metabolites were determined from frozen pericarp samples following the protocol published by Sorrequieta et al. (2013). One gram (fresh weight) of peeled pericarp frozen in liquid nitrogen was extracted in 0.3 ml of 1 M phosphate buffer (pH 7.4) prepared in D2O. The solution was centrifuged at 13,500 g for 15 min at 4 °C and the supernatant was filtered to remove any insoluble material. One mM of internal standard (TSP: 3-(trimethylsilyl) propionic-2,2,3,3-d4 acid sodium salt) was added to the resulting transparent soluble fraction and the solution was subjected to spectral analysis at 600.13 MHz on a Bruker Avance II spectrometer. Proton spectra were acquired at 298 K by adding 512 transients of 32 K data points with a relaxation delay of 5 s. A 1D-NOESY pulse sequence was used to remove the water signal. The 90° flip angle pulse was always ~10 μs. TSP was used for both chemical shift calibration and quantitation; that is, proton spectra were referenced to the TSP signal (δ = 0 ppm) and their intensities were scaled to that of TSP. Spectral assignment and identification of specific metabolites were established by fitting the reference 1H NMR spectra of several compounds using the software Mixtures, developed ad hoc as an alternative to commercial programs (Abriata 2012). Briefly, the program allows easy visualization and basic editing of spectra. It also provides a wizard that aids in fitting spectra from a database of known compounds to the signals in the spectrum. Fitted signals are integrated and the integrals are exported to a standard spreadsheet file for further analysis (see online resources Table S1 and Fig. S1 for metabolites annotation data). Further confirmation of the assignments for some metabolites was obtained by acquiring new spectra after addition of authentic standards. Analysis of the 1H NMR data of the pericarp of different mature fruits was performed as previously described (Sorrequieta et al. 2013).

2.4 1H NMR and GC–MS methods comparison

A comparative analysis between 1H NMR and GC–MS data was performed in order to evaluate the agreement between both technologies. The normalized values for the 16 compounds quantified in common by both methods were correlated by applying Pearson’s coefficient (Table 1). Although the two methods use different extraction solvents, the overall correlation was moderate and highly significant (r = 0.60, p < 0.0001). A few metabolites, namely fructose, glucose, glutamine, sucrose and tryptophan, displayed non-significant correlations between the two technologies. In particular, tryptophan, sucrose and glutamine show low 1H NMR signals (see online resources Fig. S1), which makes it difficult to calculate the contents of these metabolites by 1H NMR. Additionally, the use of phosphate buffer in the 1H NMR protocol may result in residual neutral invertase activity, modifying sucrose and fructose/glucose levels. This could explain the lack of correlation between GC–MS and 1H NMR measurements for these sugars. We therefore decided to keep data from 1H NMR and GC–MS separate and to evaluate the association between those metabolites with the *omeSOM tool (see below).
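For illustration, the per-metabolite comparison between platforms reduces to a Pearson coefficient computed over paired, normalized values. The sketch below is a plain-Python version under that assumption; the two vectors are invented numbers, not data from Table 1, and the significance test is not reproduced here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical normalized levels of one metabolite across five lines.
gcms_values = [1.0, 1.5, 2.0, 2.5, 3.0]
nmr_values = [0.9, 1.6, 1.9, 2.7, 3.1]
r_platforms = pearson_r(gcms_values, nmr_values)
print(round(r_platforms, 3))
```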

Table 1 Comparative analysis of metabolite profiles from tomato pericarps obtained by GC–MS—polar extracts—and 1H NMR methods

2.5 Statistical analyses

In order to compare agronomic and metabolic trait differences between RILs and the parental line S. lycopersicum (cv. Caimanta), collected data were analyzed using the t test algorithm embedded into Microsoft Excel software (Microsoft Corporation, Redmond, WA, USA) considering a p value <0.01 as significant.

2.6 Data integration

Agronomic and metabolic data were first analyzed by hierarchical clustering (HC) and visualized using MeV software (Saeed et al. 2006). Second, they were integrated into a self-organizing map for *omic data (*omeSOM) developed by Milone et al. (2010). Self-organizing maps (SOM) represent a special class of neural networks that use competitive learning, which is based on the idea of units (neurons) that compete to respond to a given subset of inputs. Each neuron corresponds to a cluster and is associated with a prototype or weight vector. Given an input pattern, its distance to the weight vectors is computed and only the neuron closest to the input becomes activated. Results were visualized using the software's graphic interface (*omeSOM, available at: http://sourcesinc.sourceforge.net/omesom/). This tool trains a two-dimensional SOM for clustering, which allows complex high-dimensional input patterns to be represented in the form of a simple low-dimensional discrete map. Therefore, SOMs can be appropriate for cluster analysis when looking for underlying or so-called hidden patterns in data. Once trained, the software allows the visualization of coordinated variations of all integrated elements in order to easily reveal relations among the different kinds of included data. A visualization neighbourhood (Vn) can be set, which defines the radius of adjacent neurons that will be considered as a single group. The data matrix was constructed with all agronomic and metabolic data. Each data point, the ith agronomic or metabolic trait for the jth RIL, was normalized by mean (µ) and standard deviation (σ) as follows:

$$\tilde{x}_{ij} = \frac{{x_{ij} - \mu_{i} }}{{\sigma_{i} }}$$
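As a small worked example of this normalization, the sketch below standardizes each trait (row i) across lines (columns j). It assumes the sample standard deviation, since the text does not state which estimator was used, and the trait names and values are hypothetical.

```python
from statistics import mean, stdev


def zscore_traits(matrix):
    """Standardize each trait across lines: (x_ij - mu_i) / sigma_i.

    `matrix` maps a trait name to its list of values, one per line.
    The sample standard deviation is an assumption of this sketch.
    """
    scaled = {}
    for trait, values in matrix.items():
        mu, sigma = mean(values), stdev(values)
        scaled[trait] = [(v - mu) / sigma for v in values]
    return scaled


# Hypothetical values for two traits measured in three lines.
raw = {"proline": [1.0, 2.0, 3.0], "fruit_weight": [40.0, 10.0, 25.0]}
scaled = zscore_traits(raw)
print(scaled["proline"])
```

After this step every trait has mean 0 and unit standard deviation, so traits recorded in different units contribute comparably to the distances used for clustering.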

Metabolite data obtained by both 1H NMR and GC–MS are expressed relative to the values measured in the parental S. lycopersicum (cv Caimanta). In order to reveal all possible associations, directed and inverted patterns were used to train the *omeSOM model as described by Milone et al. (2010).
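The competitive-learning idea and the use of inverted patterns can be sketched with a generic, minimal SOM. This is not the *omeSOM implementation: the initialization, the Gaussian neighbourhood, the linear decay schedules, all parameter values and the assumption that an inverted pattern is the sign-flipped standardized profile are illustrative choices of this sketch, and the trait profiles are invented.

```python
import math
import random


def best_matching_unit(weights, pattern):
    """Grid position of the neuron whose weight vector is closest to the input."""
    return min(weights, key=lambda pos: math.dist(weights[pos], pattern))


def train_som(patterns, rows, cols, epochs=200, lr0=0.5, seed=0):
    """Train a small rectangular SOM with a Gaussian neighbourhood."""
    rng = random.Random(seed)
    dim = len(patterns[0])
    weights = {(r, c): [rng.uniform(-1, 1) for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # decaying learning rate
        radius = max(radius0 * (1 - frac), 0.5)  # shrinking neighbourhood
        for p in rng.sample(patterns, len(patterns)):
            br, bc = best_matching_unit(weights, p)
            for (r, c), w in weights.items():
                grid_d = math.hypot(r - br, c - bc)
                h = math.exp(-(grid_d ** 2) / (2 * radius ** 2))
                for k in range(dim):
                    w[k] += lr * h * (p[k] - w[k])
    return weights


def with_inverted(named_patterns):
    """Add a sign-flipped copy of every standardized pattern."""
    out = dict(named_patterns)
    for name, vec in named_patterns.items():
        out[name + " (inv)"] = [-v for v in vec]
    return out


# Hypothetical standardized profiles of three traits across three lines.
traits = {"a": [-1.0, 0.0, 1.0], "b": [-1.1, 0.1, 1.0], "c": [1.0, 0.0, -1.0]}
all_patterns = with_inverted(traits)
som = train_som(list(all_patterns.values()), rows=3, cols=3)
print(best_matching_unit(som, all_patterns["a"]),
      best_matching_unit(som, all_patterns["c (inv)"]))
```

In this toy case trait c mirrors trait a exactly, so the inverted copy of c has the same profile as a and is assigned to the same neuron; this is how inverse associations become visible on the map.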

In order to define the Vn value to be considered informative, to obtain a better visualization of related neurons, and to compare the SOM method (innovatively applied to breeding here) with the traditionally used correlation analysis, a network reconstruction was performed using NetDraw (Analytic Technologies, Lexington, KY). The same data matrix used for SOM analysis was employed to compute the correlations. Each trait (agronomic, diamonds; metabolic, circles) was represented as a node, and correlations (Pearson with p < 0.001) between traits were represented by connector lines (edges). Correlation coefficients and their significance were calculated with InfoStat software (Di Rienzo et al. 2011). Positive and negative correlations and coefficient values are indicated by the color and thickness of the lines, respectively.
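A correlation network of this kind can be represented as a simple edge list. The following sketch keeps an edge whenever the absolute Pearson coefficient reaches a cutoff; using a cutoff on |r| in place of the p < 0.001 criterion is a simplification made here, and the trait profiles and the cutoff value are hypothetical.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))


def correlation_edges(traits, min_abs_r):
    """List (trait_a, trait_b, sign, r) for every pair with |r| >= min_abs_r."""
    edges = []
    for a, b in combinations(sorted(traits), 2):
        r = pearson_r(traits[a], traits[b])
        if abs(r) >= min_abs_r:
            sign = "positive" if r > 0 else "negative"
            edges.append((a, b, sign, round(r, 3)))
    return edges


# Hypothetical standardized profiles across five lines.
profiles = {
    "firmness": [-1.2, -0.3, 0.1, 0.4, 1.0],
    "malate": [-1.0, -0.5, 0.2, 0.3, 1.0],
    "fructose": [1.1, 0.4, -0.1, -0.3, -1.1],
}
edges = correlation_edges(profiles, min_abs_r=0.9)
for edge in edges:
    print(edge)
```

The sign field corresponds to line color and the coefficient to line thickness in the drawn network.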

2.7 Mode of inheritance assessments

For both metabolic and agronomic traits measured in the parents and in their offspring (F1 hybrid), differences in mean values were analyzed by ANOVA and Tukey test. Those traits showing significant differences (p < 0.05) were classified into the following mode-of-inheritance categories and classes defined by Lisec et al. (2011), in which the effect of the S. pimpinellifolium allele is compared with the S. lycopersicum allele: recessive (only S. pimpinellifolium is significantly different from S. lycopersicum whereas the offspring is similar to S. lycopersicum), additive (the F1 is between the parents, which are significantly different from each other), dominant (both the homozygous S. pimpinellifolium and the hybrid showed similar values but differed significantly from S. lycopersicum), or overdominant (the F1 is significantly higher or lower than both parents).
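These categories can be written as a small decision rule. The sketch below is one possible reading of the definitions above: it takes the three group means and the outcome of the pairwise comparisons as booleans, requires the F1 to differ from both parents for the additive class, and returns "unclassified" for any combination the definitions do not cover. The function and argument names are hypothetical.

```python
def mode_of_inheritance(lyc, pim, f1, pim_vs_lyc, f1_vs_lyc, f1_vs_pim):
    """Classify a trait from group means and pairwise significance flags.

    Each *_vs_* argument is True when that pair differs significantly.
    The tie-breaking choices are assumptions of this sketch.
    """
    low, high = min(lyc, pim), max(lyc, pim)
    if f1_vs_lyc and f1_vs_pim and (f1 > high or f1 < low):
        return "overdominant"
    if pim_vs_lyc and f1_vs_lyc and f1_vs_pim and low < f1 < high:
        return "additive"
    if pim_vs_lyc and not f1_vs_lyc and f1_vs_pim:
        return "recessive"   # F1 behaves like S. lycopersicum
    if pim_vs_lyc and f1_vs_lyc and not f1_vs_pim:
        return "dominant"    # F1 behaves like S. pimpinellifolium
    return "unclassified"


# Hypothetical means: cultivated parent, wild parent, F1.
print(mode_of_inheritance(1.0, 2.0, 1.5, True, True, True))
```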

3 Results and discussion

3.1 Variation in the metabolic fruit composition in parental species, F1 and in the RIL population

The RIL population used in this study was obtained by divergent-antagonistic selection for fruit weight and shelf life traits (Zorzoli et al. 2000; Rodríguez et al. 2006). It was previously characterized with molecular markers (Pratta et al. 2011b) and evaluated for different traits related to tomato crop productivity (Gallo et al. 2011; Pratta et al. 2011a, b). Since many efforts have been invested in recording agronomically valuable information, it constitutes an excellent source of public germplasm for breeding programs. Here, we focus the analysis on variations of the metabolite content in mature fruits by applying two standardized metabolite profiling methods (1H NMR and GC–MS). A total of 60 different metabolic traits (16 with 1H NMR and 59 with GC–MS) were quantified in ripe fruits harvested from the two parental lines, their interspecific F1 hybrid and 18 selected RILs (see online resource Table S1). These metabolites correspond to amino (23) and organic acids (9), TCA cycle intermediates (5), soluble sugars (6), sugar alcohols (4), phosphorylated intermediates (3), a few fatty acids (4), alkaloids (2), nucleotides (1), amides (1) and others (2) (for details see online resource Table S2).

A high level of divergence in terms of primary metabolism was evident when comparing metabolite contents between the parental lines. Most of the measured amino acids (9), all tricarboxylic acid (TCA) cycle intermediates (with the exception of succinate), dehydroascorbate, glucarate, nicotinate, turanose, xylose, erythritol, galactinol, inositol, glycerol-3P and the two measured alkaloids (calystegine A3 and calystegine B2) were significantly more abundant in S. pimpinellifolium than in cultivated tomato (see online resource Table S2, p < 0.01). By contrast, only the levels of the amino acids methionine, serine, threonine and tyrosine, and of galacturonate, were significantly lower in the fruits of S. pimpinellifolium than in the cultivated species (see online resource Table S2, p < 0.01). It has been reported that these two tomato species have similar fruit water content (Schauer et al. 2005). Thus, the variations observed in the metabolite levels would indeed reflect differences in fruit composition. Our results are in good agreement with a previous report on the free amino acid composition of the same parental accessions under different environmental conditions (Pratta et al. 2011a). This observation suggests strong heritability of these traits, as was reported for many of the metabolites measured here by Schauer et al. (2008). Since the existence of variation between the parental species of a RIL population is a key prerequisite for any breeding program, our results suggest that the chosen material and approach are valid for improving nutritional quality.

A subset of metabolites displayed highly variable levels in the RILs with respect to their contents in the parental S. lycopersicum fruits (Fig. 1a, online resource Table S2). Among them, the amino acids β-alanine, GABA, glutamate, proline and 5-oxoproline showed the highest increases in a considerable number of RILs (Fig. 1b, online resource Table S2). The organic acids glucarate and quinate together with the TCA cycle intermediates malate, pyruvate and 2-oxoglutarate, displayed variable levels in the RIL population. In contrast, only two soluble sugars showed significant variations namely sucrose and, to a lesser extent, xylose. Besides, the contents of glycerol-3P, galactinol and erythritol also varied significantly among the RILs. Remarkably, the levels of calystegine B2, an alkaloid present in a wide range of Solanaceae species (Asano et al. 1997; Bekkouche et al. 2001), displayed a variation between four and 13-fold. Evaluating the proportion of every metabolite category varying in the RIL population (and also in the F1 hybrid) with respect to the cultivated parent (Fig. 1a, online resource Table S2), we found that five RILs (6, 7, 8, 9 and 10) displayed significant differences in all metabolite categories. These genotypes, together with the RILs 4, 17 and 18, showed the highest number of significant changes. The remaining lines also revealed significant variations; however, the proportion of altered metabolites in each category was considerably lower. Zooming into each category, the amino acid proline and the alkaloid calystegine B2 were the metabolites displaying the highest values of relative changes in all analysed lines, followed by the organic acid quinate and the sugar sucrose. In contrast, all fatty acids showed very similar values to those measured in the cultivated parent (Fig. 1b).

Fig. 1

Metabolic variations in polar extracts of ripe tomato pericarps from a RIL population derived from a cross of S. lycopersicum (cv. Caimanta) × S. pimpinellifolium (LA722). a Proportion of significantly altered metabolites (increase or decrease relative to the cultivated species) for each line (and the wild parent and F1 progeny) for each chemical category. b Hierarchical cluster analysis (represented by a heat map) of the quantitative variation detected in all metabolic and agronomic traits measured in the RIL population. Amino acids are depicted in the three-letter code. DHA dehydroascorbate, PT pericarp thickness, SSC soluble solid contents, TA titratable acidity

3.2 Data integration and clustering using the *omeSOM model (SOM) to expose associations between traits

Having established the metabolic and agronomic variation in the studied material (online resource Table S2), we next focused our attention on integrating all data, aiming to detect connections between yield-associated and metabolic traits. For this purpose the *omeSOM model was used (Stegmayer et al. 2009; Milone et al. 2010). The first step for clustering with *omeSOM is the definition of the map size. In order to obtain the best value, we assayed a range of map sizes and evaluated the relative distance between the same metabolites measured by GC–MS and 1H NMR methods (Table 2). The relative distance is defined as the number of neurons between the position of a metabolite in the map measured with GC–MS and the same compound measured with 1H NMR, divided by the total number of neurons in the map. All map sizes evaluated (from 7 × 7 to 11 × 11) displayed similar relative distance values (Table 2, online resource Fig. S2). On the other hand, we explored the neuron components looking for well-established associations to find a biologically meaningful map size. The 9 × 9 map displayed neurons grouping all characters associated with fruit shape. This result is in agreement with previously reported analyses performed with the same RIL population (Pratta et al. 2011b). Moreover, malate content, fruit firmness and shelf life traits fell into the same neuron, showing a relationship that is in accordance with the findings reported by Centeno et al. (2011). In addition, the 9 × 9 map also showed a high cohesion of the elements, meaning that the pattern of variation of the components within the same neuron was very similar. Consequently, the 9 × 9 map was chosen for further analyses.
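The relative distance criterion can be made concrete with a short sketch. It assumes that the "number of neurons between" two positions is counted as the number of grid steps (Chebyshev distance, by analogy with a neighbourhood radius); the text does not specify the grid metric, so this is only one plausible reading, and the neuron coordinates are invented.

```python
from statistics import mean


def relative_distance(pos_gcms, pos_nmr, rows, cols):
    """Grid steps between two neurons divided by the number of neurons.

    Counting steps as Chebyshev distance is an assumption of this sketch.
    """
    steps = max(abs(pos_gcms[0] - pos_nmr[0]), abs(pos_gcms[1] - pos_nmr[1]))
    return steps / (rows * cols)


def mean_relative_distance(placements, rows, cols):
    """Average relative distance over metabolites measured by both methods."""
    return mean(relative_distance(a, b, rows, cols)
                for a, b in placements.values())


# Hypothetical (row, col) positions on a 9 x 9 map: (GC-MS, 1H NMR).
placements_9x9 = {
    "malate": ((0, 6), (0, 6)),     # same neuron
    "glutamate": ((0, 0), (2, 1)),  # two steps apart
}
print(mean_relative_distance(placements_9x9, rows=9, cols=9))
```

Repeating the calculation for each candidate map size gives one summary value per map, which can then be compared across sizes.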

Table 2 Relative distance (Vnr) between the same metabolite measured by GC–MS—polar extracts—and 1H NMR and analyzed by *omeSOM with different map sizes

The obtained map (Fig. 2, online resource Table S3) contained a total of 81 neurons grouping the 14 agronomic and 60 metabolic traits. We first evaluated whether metabolites measured by both methods (1H NMR and GC–MS) clustered together or in close proximity. For those located in the same neuron (or in close proximity, Vn ≤ 2; see online resource Table S3) and showing significant Pearson correlation coefficients (≥0.65, Table 1), both measurements were considered equally reliable and thus kept separately for the rest of the study. For the other compounds we chose the methodology showing the lowest standard deviation values among replicates of the same lines (see online resource Table S2).

Fig. 2

*omeSOM model of 81 neurons grouping 60 different metabolic and 14 agronomic traits from the 18 analyzed RILs, the parent S. pimpinellifolium (LA722) and the F1 hybrid. Metabolites were measured in polar extracts from tomato pericarps. Directed and inverted relations are shown in the left and right quadrants respectively. Black neurons group metabolic and agronomic traits, blue and red neurons group only metabolic and only agronomic traits, respectively. Histograms showing components variation along the lines analyzed are presented for those neurons grouping at least one agronomic trait. References: Neuron 1: fruit diameter (inv, blue line), height (inv, green line), weight (inv, red line), pericarp thickness (inv, light blue line), locule number (inv, violet line), calystegine A3 (blue dotted line), calystegine B2 (green dotted line), dodecanoate (red dotted line), erythritol (light blue dotted line) and glutamate (1H NMR) (violet dotted line). Neuron 3: acidity (blue line), arginine (blue dotted line) and ethanol (green dotted line) (1H NMR). Neuron 4: pH (inv, blue line), soluble solids (SSC)/acidity (inv, green line), phenylalanine (blue dotted line), 5-oxoproline (green dotted line) and succinate (inv, red dotted line). Neuron 7: Fruit shelf life (blue line), Fruit firmness (green line), fructose (inv, blue dotted line), glycerate (green dotted line), malate (red dotted line) and malate (1H NMR) (light blue dotted line). Neuron 8: reflectance (blue line), color index (inv, green line), asparagine (blue dotted line) and threonine (green dotted line). Neuron 12: Soluble solids (blue line) and proline (blue dotted line). Neuron 31: Fruit shape (blue line) (Color figure online)

Analyzing the upper-left region of the map, 32 neurons integrated all measured characters. Within these, six were integrative neurons grouping metabolic with agronomic traits (Fig. 2, black squares); 25 neurons grouped only metabolic characters (Fig. 2, blue squares) and one neuron contained only a single agronomic trait (shape index) (Fig. 2, red squares). In order to identify putative metabolite regulators of the complex agronomical traits evaluated, we next concentrated our further analyses on integrative neurons and also on groups of neighboring neurons (Vn = 1 and 2), which also harbour informative associations between characters (Fig. 2; neurons 3, 4 and 12; neurons 7 and 8).

3.3 Metabolic and agronomic traits relations

The evaluation of integrative neurons (Fig. 2) suggests the existence of different associations between metabolic and agronomic traits. Although these links must be further investigated and experimentally validated, they represent a first step toward new biomarkers as selection tools in breeding programs. Neuron 1 exposes inverse associations between fruit morphology traits (diameter, height, weight, pericarp thickness and locule number) and the primary and secondary metabolites erythritol, glutamate (1H NMR), dodecanoate, calystegine A3 and calystegine B2. The group of neighboring neurons 3, 4 and 12 associated acidity, arginine and ethanol contents (neuron 3) with phenylalanine, 5-oxoproline, succinate, juice pH and soluble solids/acidity ratio, the last three characteristics being inversely associated (neuron 4). The last neuron of this group (neuron 12) included soluble solids and proline contents. Another group, composed of neurons 7 and 8, showed a strong association between fruit firmness, shelf life, glycerate and malate contents (measured by both GC–MS and 1H NMR), and an inverse association with fructose content (neuron 7). Finally, color index was inversely associated with reflectance, asparagine and threonine contents (neuron 8) (Fig. 2). This integrative analysis of the data suggests novel relations between the components, exposing traits that can be considered when designing breeding programs. The SOM model appears to be a valuable tool for representing complex high-dimensional input patterns in a simpler low-dimensional discrete map, easing the interpretation of results. Since this is the first time that this method has been used in place of the correlation analysis commonly performed to evaluate trait relations, we compared the results of SOM and a network correlation analysis to evaluate the consistency between them.
Additionally, this comparison gives us a statistical frame to establish an informative neighborhood value (i.e. how many neighboring neurons containing significant associations must be considered) and also simplifies the visualization of the clusters. Correspondence between the Pearson and Spearman correlation matrices was assessed with a Mantel test (999 permutations, Rxy = 0.91, p < 0.001). Since the two matrices were highly correlated, we selected the more widely used Pearson correlation coefficient for the network reconstruction. Figure 3 shows the resulting unrooted network, where nodes indicate metabolic (circles) and agronomic (diamonds) traits, the number inside each node designates the *omeSOM neuron in which the trait was located, and connector lines depict significant correlations between characters (p < 0.001). Node colors denote metabolic pathways according to KEGG categories (http://www.genome.jp/kegg/) and classes of agronomic traits according to Guimerà and Nunes Amaral (2005).
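The comparison of the two correlation matrices can be illustrated with a minimal sketch of a permutation Mantel test. This is not the software used in the study: the data are synthetic, the function names (`spearman_matrix`, `mantel_test`) are hypothetical, and the one-sided p value and the ordinal ranking (which assumes no ties) are assumptions made for the illustration.

```python
import numpy as np


def spearman_matrix(data):
    """Spearman matrix as the Pearson correlation of column-wise ranks.

    Ordinal ranks are used, which assumes no tied values within a trait.
    """
    ranks = np.argsort(np.argsort(data, axis=0), axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)


def mantel_test(a, b, permutations=999, seed=0):
    """Mantel correlation between two symmetric matrices.

    Rows and columns of `b` are permuted jointly; the p value is the
    fraction of permutations (plus the observed one) with a correlation
    at least as large as the observed value (one-sided, assumed here).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    upper = np.triu_indices(a.shape[0], k=1)  # off-diagonal upper triangle
    r_obs = np.corrcoef(a[upper], b[upper])[0, 1]
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(permutations):
        order = rng.permutation(a.shape[0])
        b_perm = b[np.ix_(order, order)]
        if np.corrcoef(a[upper], b_perm[upper])[0, 1] >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


# Synthetic stand-in for a lines-by-traits table (20 lines, 10 traits).
rng = np.random.default_rng(1)
latent = rng.normal(size=(20, 3))
data = latent @ rng.normal(size=(3, 10)) + 0.5 * rng.normal(size=(20, 10))

pearson = np.corrcoef(data, rowvar=False)
spearman = spearman_matrix(data)
r_xy, p_value = mantel_test(pearson, spearman)
print(f"Mantel r = {r_xy:.2f}, p = {p_value:.3f}")
```

With 999 permutations the smallest p value this estimator can return is 1/1000, so the resolution of the test is set by the number of permutations.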

Fig. 3

Network analysis of metabolic and agronomic traits from the 18 analyzed RILs, the parent S. pimpinellifolium (LA722) and the F1 hybrid. Metabolites were measured in polar extracts from tomato pericarps. Circles and diamonds indicate metabolic and agronomic traits, respectively. Connections represent significant Pearson correlations (p < 0.001) between nodes; blue and black lines depict negative and positive correlations, respectively. Line thickness is proportional to the correlation coefficient values (ranging between −0.94 and 0.97). Numbers inside the nodes indicate the *omeSOM neuron number, and Vn values indicate the neighborhood between neurons calculated as described in Milone et al. (2010). Metabolic nodes are coloured according to KEGG pathway categories (www.genome.jp/kegg/) (Color figure online)
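The edge rule described for this network (one connection per pair of traits whose Pearson correlation is significant at p < 0.001, keeping the sign and magnitude of the coefficient for drawing) can be sketched as follows. The trait names and values are synthetic placeholders, `correlation_edges` is a hypothetical helper, and no multiple-testing correction is applied because none is stated in the text.

```python
from itertools import combinations

import numpy as np
from scipy.stats import pearsonr


def correlation_edges(traits, alpha=0.001):
    """Return one edge per trait pair with a significant Pearson correlation.

    `traits` maps a trait name to a 1-D array of per-line values, with all
    arrays in the same line order.
    """
    edges = []
    for a, b in combinations(sorted(traits), 2):
        r, p = pearsonr(traits[a], traits[b])
        if p < alpha:
            edges.append({
                "source": a,
                "target": b,
                "r": float(r),  # could be mapped to line thickness
                "sign": "positive" if r > 0 else "negative",
            })
    return edges


# Synthetic example with 20 lines: trait_b follows trait_a, trait_c opposes
# it, and trait_d is unrelated noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
traits = {
    "trait_a": x,
    "trait_b": 2.0 * x + 0.05 * rng.normal(size=20),
    "trait_c": -x + 0.05 * rng.normal(size=20),
    "trait_d": rng.normal(size=20),
}
edges = correlation_edges(traits)
for edge in edges:
    print(edge["source"], edge["target"], edge["sign"], round(edge["r"], 2))
```

The resulting edge list can be passed to any graph library for layout; that step is omitted here.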

Both SOM and network analysis (Fig. 3) display highly similar results, since correlations among elements of the same and of closely neighboring neurons were significant. This indicates that *omeSOM is an accurate tool to reveal trait relations and also that, in cases where Pearson correlation significances are not high, the SOM method can expose novel links between variables that are easy to visualize.

Interestingly, the topology of the constructed network agrees with the SOM map analyzed. However, it reveals that integrative neurons are not the only informative ones; they must be considered as clusters together with their neighbors. The network shows that neuron 1, rather than being a separate neuron, is highly correlated with the elements of neighboring neurons 2 and 10, indicating that Vn = 1 is a very informative neighborhood value. Together they form one of the major clusters, named C1-10-2. On the other hand, neurons 7, 8 and 9 should be integrated with their neighbors 18 and 27 (Vn = 2 and 3, respectively), since the elements of all of them are highly correlated. They compose the other major cluster, named C7-8-9-18-27. In contrast, elements in neurons 3, 4 and 12 displayed dissimilar results in the SOM and correlation analyses. Although they are in close proximity in the SOM map, they do not form a cluster in the network, indicating that this group is not supported by both analytical methods.

This result indicates that the informative Vn value varies along the map, but that 3 is the highest value that should be considered.

The most cohesive cluster, C1-10-2, represents a core group with the strongest and most numerous relations between its components. It comprises the traits defining fruit morphology (diameter, weight, height, locule number and pericarp thickness), which are strongly and negatively correlated with the amino acids glutamate and aspartate, with the TCA cycle intermediate 2-oxoglutarate, with the fatty acid dodecanoate and with the contents of the two measured alkaloids, calystegine A3 and B2; all of these are putative regulators of morphology traits. An influence of glutamate and 2-oxoglutarate on fruit development via the GABA shunt has been proposed by Kisaka et al. (2006). Additionally, downregulation of 2-oxoglutarate dehydrogenase has major consequences on this process, which extend to the end of ripening (Araújo et al. 2012). On the other hand, calystegines exhibit selective inhibition of glycosidases, which are universally required for normal cell function (Kvasnicka et al. 2008). Based on their structural similarities to sugars, it has been suggested that in mammalian cells calystegines may interact with enzymes of carbohydrate metabolism and could be functional in diets, preventing a steep increase in blood glucose after a carbohydrate-rich meal (Jocković et al. 2013). These hints toward mechanistic links therefore suggest that the relations between these traits found in the current study will likely provide a valuable tool for tomato breeding programs.

The other central cluster, C7-8-9-18-27, includes agronomic characters related to fruit attributes such as color, firmness and shelf life, traits for which this RIL population was initially selected. They showed strong connections with each other (Fig. 3), with the vast majority of amino acids and, notably, with malate and glycerate contents. In this regard, Centeno et al. (2011) previously reported that changes in malate metabolism result in an early water-loss phenotype of tomato fruits, with a consequent effect on post-harvest shelf life. Transgenic lines displaying reduced levels of malate dehydrogenase and fumarase activities presented elevated levels of soluble sugars, suggesting that osmotic potential may be a contributing factor to the water-loss phenotype. Our finding, although likely operating by a different mechanism, adds to the link between malate metabolism and shelf life of tomato fruits. Additionally, an association between asparagine and threonine (neuron 18) was also observed by Sauvage et al. (2014), who found that variations in the contents of these two amino acids were linked to the same SNP locus, annotated as a Copine-like protein. They proposed that this locus is close to one or several gene(s) with pleiotropic effects directly involved in the biosynthetic pathway of these amino acids.

Somewhat unexpectedly, fruit biochemical characters (juice pH, acidity and soluble solids) showed few connections with the metabolic complement of the fruits. However, the association of proline with soluble solids was particularly interesting because nitrogenous compounds (in the form of amino acids, peptides and proteins), minerals and pectic substances are also considered part of the soluble solids. Additionally, in grape, this amino acid has been proposed as an indicator of berry ripeness (Carnevillier et al. 1999; Mulas et al. 2011); proline could therefore also be considered as an indicator of fruit ripeness in tomato.

3.4 Assessment of the mode of inheritance of metabolic and agronomic traits

Given that a number of metabolic and agronomic traits differed significantly either between the parents or with respect to the F1 offspring used in this study, we analyzed these differences to assess the mode of inheritance of the measured traits. Based on these comparisons, we classified each trait into one of the following categories: recessive, additive, dominant or overdominant (see “Materials and methods” section). Of the 12 possible classes proposed by Lisec et al. (2011), nine were identified in our data set (see online resource Fig. S3). Sixty traits could be classified within these classes: 47 of the metabolic and 13 of the agronomic traits measured (online resource Fig. S3). A minor proportion of the traits could not be classified because they showed no significant differences among the two parents and the F1 hybrid. The mode of inheritance of all classified traits can be seen in online resource Table S2. A high number of metabolic traits appeared to be recessive (29), where the F1 mean value is equal to the mean value of the S. lycopersicum parent. This class includes a large number of amino acids and TCA cycle intermediates. By contrast, the vast majority of the agronomic traits exhibited either a dominant or an additive mode of inheritance (see online resource Table S2). No agronomic traits were found to exhibit overdominance, and only five metabolic traits showed this mode of inheritance: the amino acids alanine, aspartate, GABA (1H NMR) and glycine and the monosaccharide xylose (online resource Table S2).
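The logic behind the four broad categories can be illustrated with a deliberately simplified sketch. It is not the 12-class scheme of Lisec et al. (2011) applied in this study: it reduces the decision to three pairwise Welch t tests, and the function name, the significance threshold (`alpha=0.05`) and the replicate values are all assumptions made for the example. Following the text, "recessive" means the F1 is indistinguishable from the S. lycopersicum parent, and "dominant" that it is indistinguishable from the S. pimpinellifolium parent.

```python
import numpy as np
from scipy.stats import ttest_ind


def classify_inheritance(cultivated, wild, f1, alpha=0.05):
    """Assign a coarse mode of inheritance from replicate measurements."""

    def differs(x, y):
        # Welch's t test (unequal variances), assumed for this sketch.
        return ttest_ind(x, y, equal_var=False).pvalue < alpha

    f1_vs_cultivated = differs(f1, cultivated)
    f1_vs_wild = differs(f1, wild)
    parents_differ = differs(cultivated, wild)

    mean_f1 = np.mean(f1)
    low, high = sorted((np.mean(cultivated), np.mean(wild)))

    if f1_vs_cultivated and f1_vs_wild:
        if mean_f1 > high or mean_f1 < low:
            return "overdominant"  # F1 outside the parental range
        if parents_differ:
            return "additive"  # F1 intermediate between the parents
    elif parents_differ and f1_vs_wild:
        return "recessive"  # F1 resembles the cultivated parent
    elif parents_differ and f1_vs_cultivated:
        return "dominant"  # F1 resembles the wild parent
    return "unclassified"


# Hypothetical replicate values for one trait.
cultivated = [10.0, 10.2, 9.8, 10.1]
wild = [20.0, 20.3, 19.7, 20.1]
for label, f1 in {
    "like cultivated": [10.1, 9.9, 10.0, 10.2],
    "like wild": [20.1, 19.9, 20.0, 20.2],
    "intermediate": [15.0, 15.2, 14.8, 15.1],
    "above both": [25.0, 25.2, 24.8, 25.1],
}.items():
    print(label, "->", classify_inheritance(cultivated, wild, f1))
```

Any F1 value that lies between the parents and differs from both is labelled additive here, so partial dominance is not distinguished; a finer scheme would be needed to reproduce the classes used in the paper.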

The mode of inheritance of yield-associated traits in tomato has been analyzed in detail by Semel et al. (2006). Using S. pennellii near isogenic lines (NIL) in heterozygosity, these authors showed that four traits exhibited heterosis (overdominance): seed number per plant, fruit number, total yield, and biomass. Other phenotypes, however, such as fruit weight, plant weight, soluble solids content and seed morphology, showed no heterotic effects. Our results agree with these findings, even though we used an interspecific F1 hybrid to evaluate the inheritance mode, which can expose more epistatic effects than those detected in NILs. Regarding the mode of inheritance of the metabolic traits, evidence for changes in metabolites in hybrids has been documented in Arabidopsis (Gärtner et al. 2009; Lisec et al. 2009), maize (Lisec et al. 2009; Riedelsheimer et al. 2012) and also in tomato fruits from the above-mentioned NIL heterozygotes (Schauer et al. 2008). More than 60 % of the metabolic traits analyzed here shared the same mode of inheritance reported by Schauer et al. (2008), but notably none of them presented overdominance. Since our analysis involves a two-parental-hybrid system, the identification of overdominant traits could result from epistatic allelic interactions in the S. lycopersicum × S. pimpinellifolium hybrid, even considering that these two genomes are evolutionarily closer than the donor parent (S. pennellii) of the NIL system used by Schauer et al. (2008) (Kamenetzky et al. 2010; Sato et al. 2012). On the other hand, the potential of tomato MAGIC populations as a source of genetic diversity was recently reported (Pascual et al. 2014). Another study, using a genome-wide association approach, identified putative candidate genes involved in the genetic architecture of fruit metabolic traits (Sauvage et al. 2014).

Notably, the extensive metabolic characterization conducted here, together with the links between traits and the heritability evidenced in this paper, allows several candidate RILs to be proposed for the generation of SCH. In this regard, the heat map presented in Fig. 4 displays a summary of the changes for each RIL relative to the cultivated tomato. Considering the analyzed traits displaying positive effects, RILs 1, 2, 3, 4, 7, 9, 10, 16, 17 and 18 can be regarded as candidate lines. All of them displayed a higher number of traits with positive effects than the S. lycopersicum × S. pimpinellifolium hybrid. In agreement with this selection, a recent work (Liberatti et al. 2013) showed that the SCH produced with three of these lines (RILs 1, 9 and 18) display the longest shelf life and higher soluble solids values, which are among the few traits in common between the two reports.

Fig. 4

Differences in metabolic or agronomic traits relative to the domesticated parental line (S. lycopersicum cv. Caimanta). Metabolites were measured in polar extracts from tomato pericarps. Red and blue squares represent significant (p < 0.01 by t test) increase or decrease of each trait relative to S. lycopersicum values. Black squares indicate missing data. Values below the graph indicate the number of total traits presenting changes and those with positive effects (increments) (Color figure online)
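The calls summarized in this heat map (a significant increase, a significant decrease, no change, or missing data for each trait relative to the cultivated parent, plus per-line totals) can be sketched as below. The trait names, line names and replicate values are invented for the illustration, `change_calls` and `summarize` are hypothetical helpers, and the use of Welch's variant of the t test is an assumption, since the variant is not specified in the caption.

```python
import numpy as np
from scipy.stats import ttest_ind


def change_calls(reference, line, alpha=0.01):
    """Compare one line with the reference parent, trait by trait.

    Returns +1 (significant increase), -1 (significant decrease),
    0 (no significant change) or None (trait not measured in this line).
    """
    calls = {}
    for trait, ref_values in reference.items():
        values = line.get(trait)
        if values is None:
            calls[trait] = None  # missing data (black square)
            continue
        p = ttest_ind(values, ref_values, equal_var=False).pvalue
        if p >= alpha:
            calls[trait] = 0
        else:
            calls[trait] = 1 if np.mean(values) > np.mean(ref_values) else -1
    return calls


def summarize(calls):
    """Number of traits with any change and number with an increase."""
    changed = sum(1 for c in calls.values() if c in (1, -1))
    increased = sum(1 for c in calls.values() if c == 1)
    return changed, increased


# Hypothetical replicate values for the cultivated parent and two lines.
reference = {"malate": [1.0, 1.1, 0.9, 1.0], "fructose": [5.0, 5.2, 4.8, 5.1]}
lines = {
    "line_x": {"malate": [2.0, 2.1, 1.9, 2.0], "fructose": [5.1, 4.9, 5.0, 5.2]},
    "line_y": {"malate": [0.5, 0.6, 0.4, 0.5]},  # fructose not measured
}
for name, values in lines.items():
    calls = change_calls(reference, values)
    print(name, calls, summarize(calls))
```

Counting increases separately from total changes mirrors the two values reported below the graph.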

On the other hand, RIL 4 displayed increased levels of chlorogenate. This antioxidant is considered a very important compound for human health since it can limit low-density lipoprotein (LDL) oxidation, protecting against degenerative age-related diseases; it also removes particularly toxic reactive species by scavenging alkylperoxyl radicals and may prevent carcinogenesis by reducing the DNA damage they cause (Niggeweg et al. 2004).

pH and titratable acidity contribute to tomato flavor intensity, being particularly responsible for the sourness of the fruit. Both measures are, however, very complex and uncertain since the profile and concentration of the compounds defining them vary greatly among accessions and varieties (Fernandez-Ruiz et al. 2004). Although the optimal pH value for fresh tomato consumers has not been determined, commercial lines have a pH range from 4.0 to 4.6. In contrast, pH values below 4.5 are required for processing tomatoes, since they reduce pathogen growth and preserve product storability (Bucheli et al. 1999; Foolad 2007). The majority of lines showed pH values lower than 4.5 (RILs 1, 4, 5, 7, 9, 10, 11, 13, 16, 17). Thus, any of them would be appropriate for the processing industry.

Overall, the assessments presented in this paper demonstrate that wild tomato species offer breeders great potential for exploiting genetic variability and enhancing fruit weight, shelf life and chemical composition of the fruits. Moreover, the *omeSOM analyses and network construction illustrate the power of these tools when applied in combination in tomato breeding programs. The approach used here opens new routes to understanding nutritional traits, which have been undervalued in classical breeding.

4 Concluding remarks

Here, integration of data from GC–MS and 1H NMR technologies permitted a “double check” of a limited number of metabolites, adding to the accuracy of the profiles. Based on the combination of these technologies, we provide metabolic information from a group of selected RILs, evidence links among metabolites and agronomic traits, and establish the inheritance of the traits evaluated. We have highlighted the potential of metabolic analyses to assess agronomically important traits in tomato. Additionally, this study revealed that S. pimpinellifolium, a wild tomato species, introduced broad metabolic variability, which can be of importance for tomato breeding programs. All this information, assessed with modern analytical tools, facilitated the identification of candidate lines for breeding programs aimed at improving tomato nutritional quality.