Introduction

Essential oils are composite mixtures of varying composition (20–60 components) and abundance [ranging from traces (ng/g) to fairly high concentrations (g/100 g)] with the major components belonging to the phenolic mono- and sesqui-terpenoid chemical groups. Essential oils have increasingly found applications in a diverse range of industries including: food processing, perfumery, cosmetics, pharmaceutical production, winery, etc., particularly due to their antioxidant and antimicrobial activity, attributed to the phenyl moieties in their structures (Bajpai et al. 2009). The evaluation of the essential oils’ composition profiles is crucial in determining the components responsible for their chemical/biological activity.

The gas chromatography (GC) technique is one of the most powerful tools in analytical chemistry and has been widely used (almost irreplaceably) in the analysis of essential oils (Adams 2001; Olivero et al. 1997; Zhao et al. 2009). The GC method produces a single parameter (the retention index), which may be used for the identification of virtually any compound under well-defined conditions (Acevedo-Martínez et al. 2006). Chromatographic retention is a complex phenomenon involving various types of intermolecular forces: dipole–dipole (or Keesom) forces, dipole-induced dipole forces, London dispersion forces, electron donor–acceptor complexes, and hydrogen bonds. These forces collectively determine the partition of the solute between the mobile and stationary phases (Fritz et al. 1979; Ong and Hites 1991; Peng et al. 1988; Yancey 1994). The chromatographic retention profile of a molecule can be measured using different parameters, which include: retention time, linear-temperature programmed retention index, Lee retention index, boiling point correlation, equivalent chain length, Kováts retention distance, and, most popular of all, the Kováts retention index (RI) (Babushok 2015; Kováts 1958; von Mühlen and Marriott 2011). The RI is a relative retention parameter normalized with respect to the n-alkane series as a standard. It thus has the advantage of being independent of individual chromatographic system characteristics, which explains its wide application in quantitative structure–retention relationship (QSRR) studies (Rohrschneider 1965).
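As an illustration of how the RI is normalized against the n-alkane series, the following minimal sketch evaluates the classical isothermal Kováts formula, I = 100 [n + (log t'x − log t'n) / (log t'n+1 − log t'n)], where t' denotes adjusted retention times and the analyte elutes between the n-alkanes with n and n + 1 carbon atoms. The function name and the example retention times are hypothetical and are not taken from the data set of this study.

```python
import math


def kovats_index(t_x, t_n, t_n1, n):
    """Isothermal Kovats retention index from adjusted retention times.

    t_x  : adjusted retention time of the analyte
    t_n  : adjusted retention time of the n-alkane with n carbons
    t_n1 : adjusted retention time of the n-alkane with n + 1 carbons
    The analyte is assumed to elute between the two bracketing n-alkanes.
    """
    if not (0 < t_n < t_n1) or not (t_n <= t_x <= t_n1):
        raise ValueError("analyte must elute between the bracketing n-alkanes")
    frac = (math.log10(t_x) - math.log10(t_n)) / (math.log10(t_n1) - math.log10(t_n))
    return 100.0 * (n + frac)


# Hypothetical adjusted retention times (min) for decane, the analyte, and undecane.
ri = kovats_index(t_x=6.0, t_n=5.0, t_n1=8.0, n=10)
```

By construction, an analyte co-eluting with the n-alkane of n carbons receives an index of exactly 100n, which is what makes the scale transferable between instruments.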

Nowadays, gas chromatography systems hyphenated with mass spectrometry (GC–MS) are considered standard analytical platforms, with the latter providing complementary information for structural identification. However, GC–MS data (retention times and mass spectra) do not always provide sufficient evidence for structural profiling, and thus prediction models may be useful for ultimate structural verification. In addition, the identification of compounds is often performed by matching the GC peaks with a standard of the suspected chemical. The drawback of this approach is that standard samples with the required degree of purity are sometimes unavailable, and in such cases, a theoretical model for estimating the RIs may be a useful alternative (Hodjmohammadi et al. 2004). The identification of essential oil components is particularly challenging, as many constituent terpenes give nearly identical mass spectra because they yield similar fragments upon ionization. Moreover, compounds not registered in the existing MS libraries are rather difficult to identify and often lead to erroneous assignments (false positives). In this sense, the integration of chemical information from theoretical retention indices supports the interpretation of MS data and, consequently, a more accurate peak assignment of essential oil components. Indeed, MS libraries are currently incorporating retention index data for all registered compounds (Mondello 2015; NIST 2017).

The correlation between gas chromatographic retention indices and molecular parameters provides significant information on the effect of molecular structural characteristics on the retention time and on the possible mechanisms for absorption and elution (Körtvélyesi et al. 2001). Good correlations have been obtained between RIs and theoretically calculated data for molecules with different functional groups: alkanes (Görgényi et al. 1989), alkylbenzenes and naphthalenes (Dimov et al. 1994), dialkylhydrazones (Kiraly et al. 1996), phenol derivatives (Kaliszan and Höltje 1982), azo compounds (Kortvelyesi et al. 1995), primary, secondary, and tertiary amines (Osmialowski et al. 1985), polycyclic aromatic hydrocarbons (Rohrbaugh and Jurs 1986), various aromatic compounds (Gautzsch and Zinn 1996), alcohols, esters, ketones, monoterpenes, di- and tricyclicmethyl esters, and monocyclic ketones (Duvenbeck and Zinn 1993), odor-active aliphatic compounds with oxygen-containing functional groups (Anker et al. 1990), stimulants and narcotics (Georgakopoulos et al. 1991a), anabolic steroids (Georgakopoulos et al. 1991b), etc., using several models ranging from linear ones (e.g., multiple linear regression, partial least squares, and principal component regression) to non-linear ones (e.g., artificial neural network, support vector machine, etc.) (Albaugh et al. 2009; Qin et al. 2009, 2013a, b; Skrbic and Onjia 2006).

Nonetheless, it is important to bear in mind that the ultimate objectives of constructing QSRR models are to predict the properties of newly identified compounds whose chemical profiles are not known and to provide greater knowledge of the general mechanisms involved in solute–solute, solute–mobile phase, and solute–stationary phase interactions. To achieve these goals, it is imperative that the models are constructed over a wide chemical space and have good generalizability. Unfortunately, the majority of QSRR studies have been performed on small and/or congeneric data sets (Azar et al. 2011; Jalali-Heravi and Ebrahimi-Najafabadi 2011; Noorizadeh and Farmany 2010; Qin et al. 2009, 2013b; Yan et al. 2015), with a few exceptions in the literature (Dossin et al. 2016; Garkani-Nejad et al. 2004; Zhang et al. 2017). In the particular case of essential oils, despite the vast literature on the identification and characterization of novel constituent compounds, QSRR models have generally been built on data sets ranging in size from 25 to 169 compounds, with most of these belonging to a single chemical series. Such models have reduced utility, as they do not cover an extensive structural space of known essential oil components. It is thus important that wide and diverse data sets for essential oils be constructed and that QSRR models be built from them to guarantee a wider applicability domain (AD) and thus better generalization ability.

The aim of the present study is to develop a comprehensive data set of constituent components of essential oils and subsequently to develop statistical and artificial intelligence models that relate the retention behavior of these components over the apolar stationary phase DB-5 [(5% phenyl)-methylpolysiloxane] to DRAGON’s molecular structure characterizing parameters.

Experimental

Construction of the essential oils data set

An extensive literature review of theoretical and experimental studies on essential oils was performed, and a data set comprising 791 chemical structures with their corresponding average RI values was built. To guarantee the homogeneity and thus comparability of the RI values for the constituent components, only Kováts RI values obtained under similar experimental conditions on a standard non-polar 5% phenyl methyl polysiloxane stationary phase (DB-5 or HP-5) of a GC–MS system were considered. This is a critical step, as it minimizes the possibility of experimental outliers. The distribution of retention indices is shown in Fig. 1, which demonstrates the diversity of the constructed data set.

Fig. 1 Distribution of retention index values for the data set

The compounds and their corresponding retention indices are given as Supplementary Information, SI1 and SI2. Where different RI values obtained under homogeneous conditions were reported for the same compound, the average value was used. In general, the measurement errors of the GC retention indices are in the range of ±2 standard deviations (s) of the RI values. Compounds whose RI values presented high standard deviations were not included during the data set compilation. The molecular structures of the data set were sketched using the ChemDraw Ultra module of the ChemOffice software (Jaworska 2005). The sketched structures were exported to the Chem3D module to create their 3D structures. The geometry optimization was done using the semiempirical AM1 (Austin Model 1) Hamiltonian method and a closed-shell restricted wave function available in the MOPAC module.

Descriptor generation

A total of 3224 molecular descriptors (MDs) were computed for the constructed data set using the DRAGON software (Todeschini et al. 2007). Given the high dimensionality of the obtained data matrix (791 × 3224), we applied a simple variable selection procedure based on Spearman’s correlation coefficient (R), in which, for each pair of descriptors with R ≥ 0.95, only one is retained. Consequently, a lower dimensionality data matrix comprising 1476 MDs was obtained.
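The pairwise filter described above can be sketched as follows. This is a minimal illustration and not the procedure as implemented by the authors: the greedy keep-first ordering, the use of the absolute value of the coefficient, and the function names are assumptions, and descriptor columns are assumed to be non-constant.

```python
import numpy as np


def average_ranks(x):
    """Ranks of a 1-D array, with tied values sharing their average rank."""
    order = np.argsort(x, kind="mergesort")
    sorted_x = x[order]
    ranks = np.empty(len(x), dtype=float)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and sorted_x[j + 1] == sorted_x[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    return ranks


def spearman_filter(X, threshold=0.95):
    """Return the indices of retained descriptor columns.

    Spearman's coefficient is computed as the Pearson correlation of ranks.
    A column is kept only if its |R| with every previously kept column is
    below the threshold (greedy, keep-first rule: an assumption).
    """
    ranked = np.apply_along_axis(average_ranks, 0, np.asarray(X, dtype=float))
    corr = np.atleast_2d(np.corrcoef(ranked, rowvar=False))
    kept = []
    for j in range(ranked.shape[1]):
        if all(abs(corr[j, k]) < threshold for k in kept):
            kept.append(j)
    return kept
```

Because the rule is greedy, the retained subset depends on the column order; any tie-breaking preference (for example, keeping the simpler descriptor of a pair) would have to be encoded by reordering the columns beforehand.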

Data set splitting and statistical analysis

The true predictive power of any model can only be assessed over a set of compounds not used in model training, known as the test set. Accordingly, the essential oils data set was split into training and test sets using the cluster analysis technique implemented in the STATISTICA 8.0 software (Statsoft 2001). First, hierarchical clustering was performed using the Euclidean distance as the distance measure and complete linkage as the linkage rule. The output dendrogram is a hierarchical cascade in which the base level represents each compound as a separate cluster and, at each successive level, the compounds (or clusters) separated by the minimum distance are grouped together. To determine the optimum number of clusters, the distance corresponding to the steepest ascent in the amalgamation schedule was used as the cutoff. Subsequently, k-means cluster analysis was performed, with k being the number of clusters determined from the hierarchical cluster analysis. Finally, the test set (approximately 25% of the data set) was selected by rank ordering the compounds in each cluster according to their experimental RI values and sampling them to span the entire property space. This splitting procedure guarantees that the structural and property spaces are broadly represented in the test set for external prediction.
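The final sampling step can be illustrated with the short sketch below, which takes cluster labels (assumed to come from a prior k-means run) and experimental RI values, ranks the compounds within each cluster by RI, and sends every fourth ranked compound to the test set. The exact sampling rule is not stated in the text, so the "every fourth compound, offset by two" rule and the function name are assumptions chosen to give roughly 25% test compounds while leaving the extremes of each cluster in the training set.

```python
from collections import defaultdict


def split_by_cluster(labels, ri_values, step=4):
    """Split compound indices into training and test sets.

    labels    : cluster label of each compound (e.g., from k-means)
    ri_values : experimental retention index of each compound
    Within each cluster, compounds are ranked by RI and every `step`-th
    one (starting at rank position step // 2) is assigned to the test set.
    """
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    train, test = [], []
    for label in sorted(clusters):
        ranked = sorted(clusters[label], key=lambda i: ri_values[i])
        for pos, idx in enumerate(ranked):
            if pos % step == step // 2:
                test.append(idx)
            else:
                train.append(idx)
    return sorted(train), sorted(test)
```

Clusters with fewer than three members contribute no test compounds under this rule, which keeps structurally rare compounds in the training set.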

Multiple linear regression

To obtain the linear QSRR models for the gas chromatography Kováts retention index (GC–RI) of essential oil components, multiple linear regression coupled with the Genetic Algorithm (MLR–GA) (Devillers 1996; Kubinyi 1994; Leardi 1994) was used, with MLR as the fitting method for the RI and the GA as the variable selection strategy. The MLR statistical technique was chosen for its simplicity, while the key advantage of the GA as a search strategy is that a set of optimum models is obtained with less computational effort, in the sense that a global maximum is achieved without exploring all combinations of variables in the data matrix space. The leave-one-out cross-validation parameter (\(Q_{\text{loo}}^{2}\)) was used as the objective function. For this study, the MOBY-DIGS program (Todeschini et al. 2004) was employed and the following configurations for the GA were considered: population size, 100; generations, 10 000; probability of mutation, 0.5; number of crossover, 5000.
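For an ordinary least-squares model, the leave-one-out objective does not require refitting the model n times: the deleted residual of compound i equals its ordinary residual divided by (1 − h_ii), where h_ii is the i-th diagonal element of the hat matrix. The sketch below uses this identity to compute \(Q_{\text{loo}}^{2}\) for a given descriptor subset. It is an independent illustration and not the MOBY-DIGS implementation; referencing the total sum of squares to the overall training mean is an assumption.

```python
import numpy as np


def q2_loo(X, y):
    """Leave-one-out Q^2 of an MLR model with intercept.

    Uses the identity e_(i) = e_i / (1 - h_ii), so a single fit suffices.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X1 = np.column_stack([np.ones(len(y)), X])   # add intercept column
    H = X1 @ np.linalg.pinv(X1)                  # hat (projection) matrix
    residuals = y - H @ y                        # ordinary residuals
    loo_residuals = residuals / (1.0 - np.diag(H))
    press = np.sum(loo_residuals ** 2)           # predictive residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)            # total sum of squares
    return 1.0 - press / tss
```

Within a GA, a function of this kind would be evaluated once per candidate descriptor subset, which is why a closed-form shortcut matters when thousands of subsets are scored.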

Model validation techniques

The obtained models were tested for robustness using the bootstrapping validation (\(Q_{\text{boot}}^{2}\)) procedure. Other statistical parameters considered include the Fisher score (F), the standard deviation error of prediction (SDEP), and the standard deviation error in calculation (SDEC). A multi-criteria approach was therefore used to select the best model from the set of models obtained with the MLR–GA method. Subsequently, the best models were assessed for their predictive power using the external validation (\(Q_{\text{ext}}^{2}\)) procedure on the external set compounds. In addition, the Y-randomization test was carried out to check for fortuitous correlation; low intercept values [i.e., \(a(R^{2})\) and \(a(Q^{2})\)] are indicative of stability to this phenomenon.
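The error-based parameters can be made concrete with the sketch below. Following the definition given later in the text, SDEC and SDEP are taken as root-mean-squared errors on the training and test sets, respectively, so a single helper serves both. Several definitions of \(Q_{\text{ext}}^{2}\) exist in the literature; the variant shown, which references the test-set deviations to the training-set mean, is an assumption and may differ from the one used by the authors. The function names are hypothetical.

```python
import math


def rmse(y_obs, y_pred):
    """Root-mean-squared error; gives SDEC on training data, SDEP on test data."""
    squared = [(o - p) ** 2 for o, p in zip(y_obs, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def q2_ext(y_test, y_pred, y_train_mean):
    """External Q^2 referenced to the training-set mean (one common variant)."""
    press = sum((o - p) ** 2 for o, p in zip(y_test, y_pred))
    ss = sum((o - y_train_mean) ** 2 for o in y_test)
    return 1.0 - press / ss
```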

Model applicability domain (AD)

The AD is a theoretical region in the chemical space defined by the model’s independent and response variables, and thus by the nature of the chemicals in the training set as represented by specific MDs (Gramatica 2007). A large and diverse training set contributes to a wide AD, although it is equally important to employ an inclusive structure description method that characterizes (explicitly or implicitly) all relevant structure features. If the structural characteristics of novel compounds are represented in the training data, and also adequately encoded by the model descriptors, it is reasonable to expect that there will be an increased probability of accurate property predictions for these compounds. In fact, only the predictions for chemicals that lie within the AD for a given model can be considered as reliable (Jaworska 2005).

Several approaches have been reported in the literature for determining a model’s AD, the most common being the leverage approach (Atkinson 1985). This approach is based on a “distance” metric (the leverage, denoted by h) that measures the separation of a compound from the model’s experimental space (the structural centroid of the training set). The leverage also reflects the influence of a chemical structure on the model, in the sense that chemicals close to the centroid are less influential in model building than extreme points. Predictions should be considered unreliable for compounds with h values greater than the critical value h*, as these lie outside the AD of the model, i.e., are structurally distant from the training chemicals (h* = 3p′/n, where p′ is the number of model variables plus one and n is the number of objects used to calculate the model). In addition, the model’s AD should be examined for possible outlying compounds, i.e., poorly fitted data points that cause the model to deviate from the actual line of best fit. The criterion for flagging a compound as an outlier involves the computation of the standardized (or studentized) residuals. Compounds whose standardized residuals exceed 3 in absolute value (or 2.5 in the case of studentized residuals) should be analyzed for possible outlying behavior (Alvarez 1995).
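A minimal sketch of the leverage calculation is given below. The leverage of a compound with descriptor vector x is h = xᵀ(XᵀX)⁻¹x, where X is the training descriptor matrix. Augmenting X with an intercept column, so that the training leverages sum to p′, is an assumption consistent with the definition of p′ above; the function names are hypothetical.

```python
import numpy as np


def leverages(X_train, X_query):
    """Leverage h of each query compound with respect to the training set."""
    A = np.column_stack([np.ones(len(X_train)), np.asarray(X_train, dtype=float)])
    Q = np.column_stack([np.ones(len(X_query)), np.asarray(X_query, dtype=float)])
    G = np.linalg.pinv(A.T @ A)                  # (X'X)^-1, pseudo-inverse for stability
    return np.einsum("ij,jk,ik->i", Q, G, Q)     # q_i' G q_i for every query row


def critical_leverage(n_train, n_descriptors):
    """Warning leverage h* = 3p'/n, with p' = number of descriptors + 1."""
    return 3.0 * (n_descriptors + 1) / n_train


# For an 11-descriptor model trained on 609 compounds, h* = 36/609 (about 0.059).
h_star = critical_leverage(609, 11)
```

A query compound with h > h* would be flagged as lying outside the AD, and its predicted RI treated as an extrapolation.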

Artificial neural networks

Non-linear methods for multiple regression, such as artificial neural networks, support vector machines, and random forests, have increasingly found utility in QSRR studies following the understanding that the relationship between molecular parameters and the corresponding properties may not necessarily be linear. In fact, it has been demonstrated that non-linear models are capable of providing improved QSRR predictions. Therefore, to examine the possible non-linear relationship between the RIs and the MDs employed in the present report, artificial neural network (ANN) models were trained using the feedforward backpropagation algorithm. For these models, the same variables contained in the final MLR models were used as inputs, and the following parameters were optimized: initial weights, number of nodes in the hidden layer, learning rate, momentum, and number of iteration cycles. The values corresponding to the minimum prediction error were selected as the optimal parameters. In this study, a three-layered ANN was employed, i.e., one comprising a single hidden layer in addition to the input and output layers. The early stopping criterion was used to avoid over-fitting. The training and test errors were reported every 500 cycles, and these values were used to construct a learning curve with which the network’s training was monitored to guard against overtraining and the subsequent loss of generalizability of the models. An ascent in the learning curve (corresponding to an increase in the prediction error) was used as the flagging point for stopping the learning process.
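The stopping rule can be illustrated independently of any particular network implementation. The sketch below scans a learning curve of prediction errors recorded every 500 cycles and returns the cycle count at which the error was lowest before it began to rise. The `patience` parameter (the number of consecutive non-improving reports tolerated before stopping) is an assumption introduced for illustration; the text only states that an ascent in the curve triggers the stop.

```python
def early_stopping_cycle(errors, cycles_per_report=500, patience=1):
    """Return (cycle, error) at the minimum reached before the curve ascends.

    errors : prediction errors recorded every `cycles_per_report` cycles
    """
    best_error = float("inf")
    best_index = 0
    non_improving = 0
    for index, error in enumerate(errors):
        if error < best_error:
            best_error, best_index, non_improving = error, index, 0
        else:
            non_improving += 1
            if non_improving >= patience:
                break                      # ascent detected: stop learning
    return (best_index + 1) * cycles_per_report, best_error
```

With `patience=1` the first increase in error stops training; a larger value tolerates short fluctuations in the curve at the cost of additional cycles.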

Results and discussion

Design of the training and validation sets using cluster analysis

Figure 2 shows the output dendrogram for the hierarchical cluster analysis. From the amalgamation schedule, 20 clusters were determined, and this number (k value) was subsequently used to perform the k-means cluster analysis.

Fig. 2 Dendrogram of the hierarchical analysis k-MCA

The ensuing clusters were then employed to split the data into training and validation sets. Note that 13 structurally atypical compounds were identified with cluster analysis and excluded, as they were indicative of outlier behavior; thus, 778 chemical structures were retained. The resulting data set was split into training and test sets of 650 and 128 compounds, respectively.

Construction of MLR models and determination of optimum dimension

The MLR-based GC–RI models of size 4–15 variables were built over the training set, and the best regression (according to the optimization function) for each model size was selected for subsequent validation. Figure 3 shows the statistical parameters for the best 4–15 variable models obtained in the present study. As can be observed, the obtained models generally possess good statistical behavior, with the internal cross-validation coefficients \(Q_{\text{loo}}^{2}\) and \(Q_{\text{boot}}^{2}\) above 87% and the external validation coefficients \(Q_{\text{ext}}^{2}\) above 84% for all model sizes. It can thus be inferred that the obtained models generally possess high predictive power and generalizability.

Fig. 3 Statistical parameters for MLR models of different sizes

To determine the optimum model size, the statistical parameters \(R^{2}\), \(Q_{\text{loo}}^{2}\), \(Q_{\text{boot}}^{2}\), and \(Q_{\text{ext}}^{2}\) for the different model sizes were compared (see Fig. 3). Although the 14-variable model yields superior statistical parameters, the 11-variable model was empirically chosen as the optimal model size in accordance with the principle of parsimony (Occam’s razor). The correlation matrix of the 11 MDs contained in the selected model as well as the corresponding Pareto diagram are provided as Supplementary Information, SI3 and SI4, respectively.

Subsequently, the selected model’s AD was examined for possible outlying compounds. For the best model, 29 statistical outliers were identified, and when these compounds were excluded from the training set, the model’s descriptive and predictive ability improved significantly, justifying their ultimate exclusion. The structures of these outlying compounds are provided as Supplementary Information SI5. Therefore, the final data set comprised 762 compounds, with training and test sets of 609 and 153 compounds, respectively. In parallel, the DRAGON MDs were stratified into three groups, i.e., 0–1D, 2D, and 3D, and models were built for each group and their performances compared with that of the model built from the entire set of MDs. Tables 1 and 2 show the equations for the best MLR models and the corresponding statistical parameters, respectively, for each of the MD sets (all DRAGON MDs, 3D, 2D, and 0–1D). As can be observed, the results obtained are satisfactory; all the correlation coefficients, \(Q_{\text{loo}}^{2}\) and \(Q_{\text{boot}}^{2}\), are greater than 94%, while the \(Q_{\text{ext}}^{2}\) values are above 98%. In addition, the \(a(Q^{2})\) and \(a(R^{2})\) parameters are low, and thus the models are not prone to chance correlation.

Table 1 QSRR models for the prediction of the RIs for essential oils components using MLR and families of DRAGON MDs based on the apolar stationary phase DB-5
Table 2 Principal statistical parameters for the four optimal MLR models

Other parameters considered in the analysis of the quality of the obtained models include: the Fisher score (F), root-mean-squared errors calculated on the training and test sets, denoted by SDEC and SDEP, respectively. As can be observed from Table 2, the models’ parameters are satisfactory. It can, therefore, be deduced that the built models are robust and possess high predictive power.

Figure 4 shows the relationship between the experimental and predicted results of the training and prediction sets for the four models (see Supplementary Information SI6 for the experimental and predicted values for the models). As can be observed, there is a close linear association between the experimental and predicted GC–RI endpoints for both the training and test sets.

Fig. 4 Experimental versus predicted RIs for the MLR models. Training set in green and prediction set in blue

Model applicability domain (AD)

Figure 5 shows the Williams plots for the obtained models. As can be observed, almost all chemicals lie within the AD, demonstrating the reliability of the models. Some chemicals slightly exceed the critical HAT value (vertical line), but these belong to the training set. Moreover, the removal of these compounds did not significantly alter the corresponding statistical parameters, and thus their removal is not justified. On the other hand, a few chemicals are wrongly predicted (>3 s) by each model, but in all cases they belong to the models’ AD, as their HAT values are lower than the cutoffs. These erroneous predictions could probably be attributed to wrong experimental data rather than to significant structural differences with respect to compounds within the AD. We presume that the measured GC–RI values for these compounds are not appropriate and need additional verification. The identities of these compounds as well as their corresponding sources (references) are provided as Supplementary Information SI1–2 and SI6.

Fig. 5 Williams plot for MLR models on the training set

The validation set was designed to include examples spanning all structural moieties in the data set. Therefore, the satisfactory prediction of the validation structures suggests that the obtained models could be successfully applied in the prediction of GC–RIs of a diverse set of compounds, provided that they lie within the models’ AD (Albaugh et al. 2009). For the test set Williams plots for the MLR models, see Supplementary Information, SI7. No significant differences were found between the statistical parameters of the four models in either the training or the validation set, although the first model provides the best description of the GC–RIs on the independent test set. Nonetheless, given that each model has a distinct AD, it is desirable that the four models be used jointly for predicting the GC–RI to enhance the reliability of the modeling procedure. This is known as consensus modeling or ensemble averaging. In fact, it is observed that ensemble averaging of the obtained models provides a closer approximation to the experimental GC–RI values. For example, the predicted values for trans-α-bergamotene are 1478.28, 1412.29, 1456.17, and 1462.47 i.u. according to models 1, 2, 3, and 4, respectively, yielding an average value of 1452.30 ± 28.25 i.u., while the experimental GC–RI is 1436.33 ± 4.50 i.u.; the two averages thus agree to within their overlapping standard deviations. The utility of consensus modeling in enhancing the performance of QSRR models, particularly in the identification of compounds without commercially available references, has been validated in other reports in the literature (Dossin et al. 2016).
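The consensus value quoted for trans-α-bergamotene can be reproduced with the short sketch below, which averages the four model predictions and takes the sample standard deviation as the spread. The overlap check, which compares the distance between the two means with the sum of their standard deviations, is one simple way to express "overlapping standard deviations" and is an assumption about how the comparison was made; the function names are hypothetical.

```python
import statistics


def consensus(predictions):
    """Mean and sample standard deviation of the individual model predictions."""
    return statistics.mean(predictions), statistics.stdev(predictions)


def intervals_overlap(mean_a, sd_a, mean_b, sd_b):
    """True if the intervals mean +/- sd of the two values overlap."""
    return abs(mean_a - mean_b) <= sd_a + sd_b


# Predictions of models 1-4 for trans-alpha-bergamotene (values from the text).
predicted = [1478.28, 1412.29, 1456.17, 1462.47]
mean_ri, sd_ri = consensus(predicted)

# Experimental value from the text: 1436.33 +/- 4.50 i.u.
agrees = intervals_overlap(mean_ri, sd_ri, 1436.33, 4.50)
```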

Artificial neural network-based regression models

For the ANN models, the same variables (MDs) and set of training and test compounds used to build the final MLR models (i.e., 609 and 153 compounds, respectively) were employed. The input layer was comprised of 11 neurons corresponding to the models’ independent variables/MDs, while the output layer contained a single neuron corresponding to the dependent variable (RI values). Table 3 shows the optimum parameters for the four ANN models obtained for 30,000 cycles (Eqs. 1a–4a).

Table 3 Optimization parameters for the four ANN models

With these parameters, four ANN models were trained and subsequently validated over the test set. Table 4 shows the correlation parameters obtained with the ANN models. In addition, the results obtained with the ANN regressions are compared with those of the MLR models to evaluate the contribution of non-linear relationships in modeling the GC–RI (see Table 4). As can be observed, all models yield minor improvements in the correlations with the RI values for both the training and test sets when using the ANN rather than the MLR technique. Although these improvements may not be considered statistically significant, the incorporation of the ANN in consensus modeling should contribute to more robust and reliable predictions.

Table 4 Statistical parameters obtained with the MLR (Eqs. 1–4) and ANN (Eqs. 1a–4a) models

On the other hand, note that the SDEP values are higher for the ANN than for the MLR models because of compound 433, which has an extremely high relative error (see Supplementary Information, SI8). It can thus be inferred that this compound is predicted adequately by the linear models but not by the non-linear ones. Nonetheless, this compound is a member of the prediction set, and, therefore, the AD and validity of the ANN models are not affected. Residual analysis of the ANN models was performed to check for possible systematic errors, and it was observed that the residuals are randomly distributed on both sides of the zero-residual axis; therefore, the regressions were correctly computed (see Supplementary Information SI9 for residual plots).

Moreover, the sensitivity of the variables (MDs) in the ANN model was assessed to determine their relative importance. This parameter is measured as the difference between SDE values when all MDs are considered as inputs [SDE (n)] and when the ith MD is excluded [SDE (n−1)], with both values computed over the same data set. Greater differences are associated with higher relevance for the excluded MD. Table 5 shows the sensitivity of the MDs in each of the models.
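The leave-one-descriptor-out logic of this sensitivity measure can be sketched as follows. To keep the example short and self-contained, an ordinary least-squares fit stands in for the ANN, and the model is refitted after each exclusion; both choices are assumptions made for illustration and do not reproduce the authors' network or software. The function names are hypothetical.

```python
import numpy as np


def sde(X, y):
    """Standard deviation error (RMSE) of a least-squares fit with intercept.

    A linear fit is used here only as a stand-in for the trained ANN.
    """
    A = np.column_stack([np.ones(len(y)), X])
    coef = np.linalg.lstsq(A, y, rcond=None)[0]
    return float(np.sqrt(np.mean((y - A @ coef) ** 2)))


def sensitivities(X, y):
    """SDE(n-1) - SDE(n) for each descriptor; larger means more relevant."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    full_error = sde(X, y)                          # SDE(n): all descriptors
    return [sde(np.delete(X, i, axis=1), y) - full_error
            for i in range(X.shape[1])]             # SDE(n-1): i-th one excluded
```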

Table 5 Sensitivity analysis of the MDs for each ANN model

As can be observed, the most relevant MDs are the molecular weight (MW), the mass-weighted total autocorrelation MD on leverage matrix/H total index (HTm), the Pogliani index (Dz), and the Sum of Conventional Bond Orders (SCBO) for models 1, 2, 3, and 4, respectively. The MW and HTm indices are related to the structural bulk of chemicals, which, in turn, is closely related to the dispersion forces in chromatographic retention. On the other hand, the Dz distinguishes heteroatoms in a compound, while the SCBO characterizes bond types. To understand the relevance of these MDs, an inferential evaluation of the information they encode is performed. First, it is known that heteroatomic compounds have (permanent) dipoles, while compounds with unsaturated bond systems (e.g., aromatic systems) are polarizable. Therefore, when the latter interact with the former, the electric field from the permanent dipole induces an opposing dipole in the polarizable system, yielding dipole-induced dipole interactions. It can, therefore, be concluded that the Dz and SCBO are related to chromatographic induction forces.

Comparison with other approaches reported in the literature

Finally, a quantitative comparison of the performance of the models obtained in the present study with those reported in the literature is performed with the aim of assessing the practical contribution of the obtained models in the prediction of the GC–RI of essential oils (see Table 6). As can be observed, the studies reported in the literature are based on much smaller sized data sets relative to the data set in the present study and in most cases congeneric in nature. Even then, similar results are obtained.

Table 6 Comparison of obtained results with those reported in the literature

It can, therefore, be inferred that for the first time, QSRR models for predicting the RI of essential oil components with a wide AD, good statistical quality, robustness, and high predictive power are obtained. In addition, these models provide knowledge on the factors that influence the chromatographic retention of essential oils components over the DB-5 stationary phase.

Conclusions

Retention indices have gained an increasingly relevant role in analytical chemistry given their utility in reducing false-positive (or false-negative) compound identification rates. Indeed, MS database repositories, e.g., NIST, Wiley, and FFNSC, currently include associated RIs to ensure more accurate identification of confounding molecular structures. In this report, QSRR models were built for Kováts retention indices based on a large and structurally diverse database of 791 essential oil components for the non-polar GC DB-5 column. These models were rigorously validated using both internal and external validation techniques on the training and test sets, respectively, and the corresponding statistical parameters were satisfactory, demonstrating the predictive ability of these models. The descriptors included in the prediction models provide information on the different molecular properties and/or interaction forces that influence the chromatographic retention/elution of essential oil components on the DB-5 stationary phase. Altogether, the obtained models provide valuable tools for predicting the RIs of new essential oil components that lie within the models’ ADs and whose experimental data are undetermined.