Introduction

Proteins are among the most significant biologically active substances. Acting as hormones, neurotransmitters, immunomodulators, coenzymes, enzyme substrates and inhibitors, receptor ligands, drugs, toxins, and antibiotics, they play an important role in controlling and regulating many critically essential processes in living organisms. Moreover, understanding the functioning of a living cell requires a comprehensive exploration of its entire protein complement [1, 2]. As a result, the degradation of proteins to peptides and the separation and analysis of the resulting peptides are becoming progressively more important in proteomics.

High-performance liquid chromatography (HPLC) is now firmly established as the foremost technique for the analysis and separation of an extensive range of molecules. In particular, HPLC in its different modes has become the fundamental technique for the characterization and separation of peptides and proteins and has, consequently, played a key role in the rapid advances in the biological and biomedical sciences over recent years [3].

Reversed-phase liquid chromatography (RPLC) is probably the most frequently used mode of separation for peptides, although ion-exchange (IEC) and size-exclusion chromatography (SEC) also find applications. The three-dimensional structure of proteins can be sensitive to the often harsh conditions employed in RPLC, and consequently, RPLC is used less for the isolation of proteins when it is important to recover the protein in a biologically active form [4]. RPLC is a very powerful method for the analysis of peptides and proteins for a number of reasons, including: (1) the excellent resolution that can be attained under a wide range of chromatographic conditions for very closely related as well as structurally quite distinct molecules; (2) the experimental simplicity with which chromatographic selectivity can be manipulated through changes in mobile-phase characteristics; (3) the generally high recoveries and, hence, high productivity; and (4) the excellent reproducibility of repetitive separations carried out over a long period of time, which is due in part to the stability of the sorbent materials under a wide range of mobile-phase conditions [5].

RPLC separates peptides on the basis of hydrophobicity. The separation depends on hydrophobic binding of the solute molecule from the mobile phase to the immobilized hydrophobic ligands attached to the stationary phase. The mobile-phase composition and the pressure are two essential factors that influence the separation of peptides; the retention pattern of a peptide changes as the mobile-phase composition and the column pressure change. The retention of peptides (log K) does not vary linearly with the mobile-phase composition but follows a quadratic relationship [6].

Despite the ever-increasing use of HPLC for the separation and analysis of peptides and proteins, the chromatographic conditions are still selected by time-consuming trial-and-error methods. A priori knowledge of the retention time of a given peptide on a given chromatographic system would help in the selection of proper chromatographic conditions. Currently, prediction of the retention behavior of peptides is mainly based on the amino acid composition [7–10]. However, with this technique, experiments on standard samples must still be performed to derive the group retention coefficients of the amino acids under the given conditions, which remains time-consuming, and the calculated results are difficult to generalize.

Quantitative Structure–Property Relationship (QSPR) studies, which relate descriptors of the molecular structure to properties of chemical compounds, have proved successful in predicting retention times of peptides [11]. The advantage of this technique over other predictive methods lies in the fact that the descriptors used can be computed exclusively from structural considerations and do not rely on experimental properties as input parameters. Once the structure of a given compound is known, one can compute a large number of different molecular and geometric descriptors. Therefore, once a reliable model is derived, one can use it to estimate the property of a compound, whether or not that compound has already been synthesized [12]. In the present study, closeness between predicted and experimental retention times will help in the future identification of peptides.

Although QSPR methods have been used effectively to predict many physicochemical properties, only a small number of research groups have investigated the quantitative correlation between structural parameters and the chromatographic retention of peptides. This might be due to the optimization of the peptide structures, which is very time-consuming because, in most cases, the peptides are rather large. Liu et al. [13] developed a QSPR model for the prediction of the capacity factors of 75 peptides based on Support Vector Machines and the Heuristic Method. Petritis and co-workers [14] used Genetic Algorithm and Artificial Neural Network (ANN) techniques for the prediction of peptide liquid chromatography elution times in proteome analyses. Ma et al. [15] predicted electrophoretic mobilities of peptides in capillary zone electrophoresis (CZE) using the linear Heuristic Method (HM) and a nonlinear Radial Basis Function Neural Network (RBFNN). Shinoda et al. [16] developed a computational method to predict the retention times of peptides in HPLC using Multiple Linear Regression (MLR) and ANN. Du et al. [17] generated Quantitative Structure–Retention Relationship (QSRR) models correlating the retention times of peptides in reversed-phase liquid chromatography to their structures based on linear and non-linear modeling methods; they used MLR for the linear QSRR model and Radial Basis Function Neural Networks (RBFNN) and Projection Pursuit Regression (PPR) for the nonlinear modeling. Put et al. [18] estimated the retention times of a set of peptides based on PLS regression and Uninformative Variable Elimination PLS (UVE-PLS) models.

Vapnik and Cortes developed a computational classification method called the Support Vector Machine (SVM) [19, 20]. SVM has since been extended to regression problems and has shown great performance in QSPR studies owing to its remarkable ability to capture nonlinear relationships between molecular-structure descriptors and properties [21–26].

In this work, SVM was used to model and predict the retention times of various peptides using different kinds of molecular descriptors. The main goal was to generate a QSPR model that could be employed for the prediction of t R of a diverse set of peptides from their molecular structures, to demonstrate the flexible modeling ability of SVM and, at the same time, to identify the important structural features related to the retention times of peptides. PLS and ANN methods were also employed to generate quantitative linear and nonlinear models for comparison with those obtained by SVM.

The SVM feature-mapping technique was used for the prediction of the retention times of a large set of peptides with different molecular structures. This is a simple, sensitive and inexpensive method that can accurately predict a chemical property such as retention time. The proposed model could also identify, and provide some insight into, which calculated descriptors are related to retention time. SVM-based modeling methods can produce more accurate QSPR models than linear regression methods, since they are able to handle possible nonlinear relationships during the training process.

Materials and methods

Data Set

The data set of retention times of 93 peptides with known amino acid composition was extracted from the values reported by Put and Vander Heyden [11]. The retention times of the peptides were measured on seven RP chromatographic systems (CS1–CS7) [27]. The following columns were selected: CS1, XTerra MS C18 (Waters, Milford, MA, USA; 15.0 × 0.46 cm id); CS2, LiChrospher RP-18 (Merck, Darmstadt, Germany; 25.0 × 0.46 cm id); CS3, LiChrospher CN (Merck; 10.0 × 0.46 cm id); CS4, Discovery HS F5-3 (Supelco, Bellefonte, PA, USA; 15.0 × 0.46 cm id) with a silica-based pentafluorophenylpropylsilane stationary phase; CS5, Discovery RP-Amide C16 (Supelco; 15.0 × 0.46 cm id); CS6, Chromolith (Merck; 10.0 × 0.46 cm id), a monolithic silica column; CS7, PLRP-S (Polymer Laboratories, Amherst, MA, USA; 15.0 × 0.41 cm id) with a crosslinked polystyrene/divinylbenzene stationary phase. The retention times of all molecules included in the data set were obtained under the same conditions. For all systems, the operating temperature was kept constant at 40 °C, the flow rate was 1 mL min−1 and the detection wavelength was 223 nm. The data set was randomly divided into three separate sections, the training, test and validation sets, consisting of 55, 19, and 19 members, respectively. The training set was used to adjust the parameters of the models, the test set to prevent the models from over-fitting, and the external validation set to evaluate the prediction abilities of the constructed models.
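For illustration, the random three-way split described above could be reproduced along the following lines (a minimal Python sketch; the file name and the fixed random seed are assumptions, not details from the original work):

```python
import numpy as np
import pandas as pd

# Hypothetical file holding the 93 peptides and their retention times on CS1-CS7.
data = pd.read_csv("peptide_retention_times.csv")

rng = np.random.default_rng(seed=42)        # fixed seed only to make the split reproducible
shuffled = rng.permutation(data.index.to_numpy())

train_idx = shuffled[:55]                   # 55 peptides for fitting the models
test_idx = shuffled[55:74]                  # 19 peptides to monitor over-fitting
valid_idx = shuffled[74:]                   # 19 peptides for external validation

train, test, valid = data.loc[train_idx], data.loc[test_idx], data.loc[valid_idx]
```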

Descriptor Calculation and Reduction

The primary step in obtaining a QSPR model is to encode the structural features of the molecules as so-called molecular descriptors. The performance of the QSPR model and the accuracy of the results strongly depend on the way this structural representation is carried out. In the first step, all structures were drawn with the HyperChem (Ver. 7.0) program [28] and then pre-optimized using the MM+ molecular mechanics force field. A more precise optimization was then performed with the semiempirical PM6 method in Mopac (2009) [29]. All calculations were carried out at the restricted Hartree–Fock level with no configuration interaction. The molecular structures were optimized using the Polak–Ribiere algorithm until the root-mean-square gradient reached 0.001. In the next step, the HyperChem and Mopac output files were used by the Dragon package (Version 3) to calculate the molecular descriptors [30]. Overall, more than 1,400 theoretical descriptors were calculated for each molecule with this software. These descriptors can be categorized into several groups: 0D, constitutional descriptors; 1D, functional groups, atom-centered fragments, empirical descriptors and molecular properties; 2D, topological descriptors, molecular walk counts, BCUT descriptors, Galvez topological charge indices and 2D autocorrelations; 3D, aromaticity indices, Randic molecular profiles from the geometry matrix, geometrical, RDF, 3D-MoRSE, WHIM and GETAWAY descriptors.

The calculated descriptors were first analyzed for the existence of constant or near-constant variables, and the detected ones were removed. In addition, to decrease the redundancy in the descriptor data matrix, the correlation of the descriptors with each other and with the property of the molecules was examined and the collinear descriptors (i.e. R > 0.9) were detected. Among the collinear descriptors, the one showing the highest correlation with the property was retained and the others were removed from the data matrix. A total of 158 of the remaining 521 descriptors showed high collinearity and were removed from further consideration. Subsequently, the genetic algorithm–partial least squares (GA-PLS) variable subset selection method was used to select the important descriptors.
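A minimal sketch of this two-stage reduction (removal of near-constant columns, followed by pruning of collinear descriptors while keeping, from each collinear group, the one best correlated with retention time) might look as follows; the function name and the variance threshold are illustrative assumptions, only the R > 0.9 cut-off comes from the text:

```python
import numpy as np
import pandas as pd

def reduce_descriptors(X: pd.DataFrame, y: pd.Series, r_cut: float = 0.9) -> pd.DataFrame:
    """Drop (near-)constant descriptors, then prune collinear ones (|R| > r_cut),
    keeping from each collinear group the descriptor best correlated with y."""
    X = X.loc[:, X.std() > 1e-8]                     # remove constant / near-constant columns
    relevance = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
    kept = []
    for name in relevance.sort_values(ascending=False).index:
        # keep a descriptor only if it is not collinear with anything already retained
        if all(abs(np.corrcoef(X[name], X[k])[0, 1]) <= r_cut for k in kept):
            kept.append(name)
    return X[kept]
```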

GA-PLS Variable Selection

One of the problems in selecting a set of molecular descriptors is their collinearity, even when the most collinear ones have already been removed. In addition, models based on a reduced number of descriptors are simpler and more robust. To overcome these problems, some approaches combine Genetic Algorithms as a feature-selection technique with Partial Least Squares [31–33]. GA-PLS consists of three basic steps: (1) construction of a preliminary population of chromosomes, where each chromosome is a binary bit string in which each bit indicates the presence or absence of a variable; (2) evaluation of the fitness of each chromosome in the population by the internal predictivity of PLS; and (3) reproduction of the population of chromosomes for the next generation, in which the operations of selection, cross-over and mutation of chromosomes are performed. Steps 2 and 3 are repeated until the designated number of generations is reached.

In this paper, we use Leardi’s GA-PLS method [34]. The values of the empirical parameters affecting the performance of GA-PLS are given in Table 1. To obtain more reliable results, the GA process was repeated at least ten times. If some variables are present in only one model, it can be concluded that they were selected just by chance and, consequently, they can be disregarded in the final model.
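The following sketch illustrates the general GA-PLS scheme described above: binary chromosomes, cross-validated PLS predictivity as the fitness, selection/cross-over/mutation, and repeated runs whose selection frequencies feed the frequency criterion of Eq. (14) in the Results section. It is a simplified Python illustration, not Leardi's actual program; the population size, mutation rate and other settings are arbitrary placeholders rather than the values of Table 1.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def cv_explained_variance(X, y, mask, max_lv=3, cv=5):
    """Fitness of a chromosome: cross-validated explained variance of a PLS model
    built on the descriptors switched on in the binary mask."""
    if mask.sum() == 0:
        return -np.inf
    Xs = X[:, mask.astype(bool)]
    n_lv = min(max_lv, Xs.shape[1])
    y_cv = cross_val_predict(PLSRegression(n_components=n_lv), Xs, y, cv=cv).ravel()
    return 1.0 - np.sum((y - y_cv) ** 2) / np.sum((y - y.mean()) ** 2)

def ga_pls(X, y, pop_size=30, n_generations=50, p_mutation=0.01, rng=None):
    """Very small GA: rank-based survival, one-point crossover, bit-flip mutation."""
    rng = rng or np.random.default_rng()
    n_var = X.shape[1]
    pop = (rng.random((pop_size, n_var)) < 0.1).astype(int)   # sparse initial chromosomes
    for _ in range(n_generations):
        fitness = np.array([cv_explained_variance(X, y, chrom) for chrom in pop])
        parents = pop[np.argsort(fitness)[::-1][: pop_size // 2]]   # keep the better half
        children = []
        while len(children) < pop_size - len(parents):
            i, j = rng.integers(0, len(parents), size=2)
            cut = rng.integers(1, n_var)                        # one-point crossover
            child = np.concatenate([parents[i][:cut], parents[j][cut:]])
            flip = rng.random(n_var) < p_mutation               # bit-flip mutation
            child[flip] = 1 - child[flip]
            children.append(child)
        pop = np.vstack([parents, children])
    fitness = np.array([cv_explained_variance(X, y, chrom) for chrom in pop])
    return pop[np.argmax(fitness)]                              # best chromosome of this run

# Repeat the GA several times and count how often each descriptor is selected (cf. Eq. 14):
# selection_frequency = np.mean([ga_pls(X, y) for _ in range(10)], axis=0)
```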

Table 1 Parameters of the GA-PLS approach

Partial Least Squares Regression

The general principle of a linear regression method is to quantify the relationship between several independent or predictive variables and a dependent variable. Independent or predictive variables could be diverse physicochemical descriptors of the molecules, their principal components or other latent variables. The Partial Least Squares method is used to establish relationships between the dependent variable of the y vector and the descriptors of the X matrix [35]. PLS can analyze data with collinear, noisy, and numerous variables in both X and y [36]. PLS decreases the dimension of the predictor variables by extracting factors or latent variables that are correlated with y while capturing a large amount of the variation in X. This means that PLS maximizes the covariance between X and y. In PLS, the scaled matrices X and y are decomposed into score vectors (t and u), loading vectors (p and q), and residual error matrices (E and F):

$$ X = tp^{T} + E $$
(1)

and

$$ y = uq^{T} + F $$
(2)

The PLS algorithm used in this study was the singular value decomposition (SVD)-based PLS. This algorithm was proposed by Lorber et al. in 1987 [37], and a discussion of the SVD-based PLS algorithm can be found in the literature [38–40]. The SVD-based PLS modeling program was written in-house in MATLAB 7 [41].
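For illustration only, a PLS model with three latent variables (the optimum number reported in the Results section) can be fitted with scikit-learn's NIPALS-based implementation; this is a stand-in for the in-house SVD-based MATLAB code, and the data generated below are synthetic:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Synthetic stand-in data: 55 training peptides, 4 GA-PLS-selected descriptors.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(55, 4))
y_train = X_train @ np.array([1.5, -0.8, 0.6, 0.3]) + rng.normal(scale=0.2, size=55)

pls = PLSRegression(n_components=3, scale=True)      # three latent variables
pls.fit(X_train, y_train)

y_fit = pls.predict(X_train).ravel()
rmsec = np.sqrt(np.mean((y_fit - y_train) ** 2))     # calibration error of the fit
print("PLS coefficients:", pls.coef_.ravel(), "  RMSEC:", round(rmsec, 3))
```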

Artificial Neural Networks

ANNs can be defined as structures composed of densely interconnected, adaptive, simple processing elements (units) that are capable of performing massively parallel computations for data processing and knowledge representation. A detailed description of the theory behind neural networks has been given elsewhere [42–44]; therefore, only the points relevant to this work are described here. An ANN consists of interconnected neurons that process information. A network consists of one input layer, one output layer and may also contain one or more hidden layers. Each layer contains neurons connected to neurons in the previous and/or next layers. A neuron has an input, an output and a transfer function. The sigmoidal transfer function, f(x), is one of the commonly used functions and is expressed by the following equation:

$$ f(x) = \frac{1}{{(1 + e^{ - x} )}} $$
(3)

The output of node j, O j , is given by Eq. (4):

$$ O_{j} = f\left[ {\sum {w_{ij} O_{i} } + b_{j} } \right] $$
(4)

where O i is the output of the ith neuron of the previous layer, w ij is the weight applied to the connection between neurons i and j, and b j is a bias term.

A feed-forward neural network consists of three layers. The first layer (input layer) acts as an input buffer for the data. Signals introduced to the network, with one node per element of the sample data vector, pass through the input layer to the hidden layer. Each node in the hidden layer sums the weighted inputs and passes the result through a transfer function to the output layer. In the output layer, the summing and transfer processes are repeated, and the output of this layer represents the value calculated by the network.

An ANN is an adaptive network that changes its structure based on external or internal information that flows through the network during the learning (training) phase. Training is performed by repeatedly presenting the network with known inputs and outputs, and adjusting the connection weights and biases between the individual nodes. This process is repeated until the output nodes of the network match the desired outputs to a stated degree of accuracy. Training can, for instance, be done using the back-propagation algorithm, in which the differences between the ANN output and its desired value are calculated after each iteration.

In the present work, an in-house ANN program was written in MATLAB 7. The network was a fully connected feed-forward network with three layers and sigmoidal transfer functions. The descriptors selected by the GA-PLS method were used as inputs to the network and the output signal represents the retention times of the peptides; thus, the network has four nodes in the input layer and one node in the output layer. The output of the sigmoid function lies between 0 and 1 (its dynamic range). Therefore, each input (descriptor) value was divided by the mean value of that descriptor to bring the inputs into the dynamic range of the sigmoidal transfer function of the network. The initial values of the weights were randomly selected from a uniform distribution between −0.3 and +0.3, and the initial values of the biases were set to 1. During training, the network parameters are optimized; these parameters are the number of nodes in the hidden layer, the learning rates of the weights and biases, and the momentum. To evaluate the performance of the ANN, the standard error of training (SET) and the standard error of prediction (SEP) were used. The network was then trained on the training set using the back-propagation strategy to optimize the weight and bias values. It should be noted that it is common to plot the SET versus the number of iterations for the optimization of the ANN parameters.
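A rough Python equivalent of such a network (one sigmoidal hidden layer trained by gradient descent with momentum) is sketched below using scikit-learn's MLPRegressor; it only approximates the in-house MATLAB implementation, and the data, scaling and hyperparameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the four GA-PLS descriptors and the retention times.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(55, 4))
y_train = np.tanh(X_train @ np.array([1.0, -0.5, 0.8, 0.3])) * 20 + 30

# Scale the inputs so they fall within the dynamic range of the sigmoid units,
# loosely mirroring the division by the mean descriptor value described above.
X_scaled = X_train / np.abs(X_train).mean(axis=0)

ann = MLPRegressor(hidden_layer_sizes=(5,),       # one hidden layer; size would be optimized
                   activation="logistic",         # sigmoidal transfer function, Eq. (3)
                   solver="sgd", momentum=0.9,    # back-propagation with momentum
                   learning_rate_init=0.01,
                   max_iter=5000, random_state=0)
ann.fit(X_scaled, y_train)

set_error = np.sqrt(np.mean((ann.predict(X_scaled) - y_train) ** 2))  # analogous to SET
print("standard error of training (approx.):", round(set_error, 3))
```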

Support Vector Machine

The Support Vector Machine is an algorithm developed by the machine learning community. Owing to its remarkable generalization performance, the SVM has attracted attention and found a broad range of applications, such as pattern recognition [45, 46], drug design [47], Quantitative Structure–Activity Relationship studies [48], and QSPR analysis [49].

Support Vector Machines were developed by Vapnik, and the method is becoming more widely known because of its many attractive features and promising empirical performance [18, 19, 50, 51]. The methodology implements the Structural Risk Minimization (SRM) principle [52, 53], which has been shown to be superior to the conventional Empirical Risk Minimization (ERM) principle employed by conventional neural networks.

A training set of m compounds with known properties or activities y i and structurally derived descriptors x i is represented as \( \left\{ {\left( {x_{i} ,y_{i} } \right)} \right\}_{i = 1}^{m} \), where the correlation between structure and property or activity is defined by y i  = f(x i ). The function f(x i ) can be characterized by a linear function of the form:

$$ f(x_{i} ) = \left\langle {w_{i} , x_{i} } \right\rangle + b $$
(5)

where w is the weight vector of the linear function and b is the threshold coefficient. SVM approximates the data with a linear function formulated in the high-dimensional feature space:

$$ y = \mathop \sum \limits_{i = 1}^{m} w_{i} \phi \left( {x_{i} } \right) + b $$
(6)

where \( \left\{ {\phi (x_{i} )} \right\}_{i = 1}^{m} \) represents the features of the input variables after the kernel transformation, while \( \left\{ {w_{i} } \right\}_{i = 1}^{m} \) and b are coefficients.

SVM is essentially a linear learning approach that was initially proposed for classification problems. However, it can also be applied to regression problems through the use of the ε-insensitive loss function. SVM can handle data with non-linear relationships by means of the so-called kernel trick. The kernel transformation is essentially a projection of the descriptor matrix from the input space into a higher-dimensional feature space, achieved through the following equation:

$$ k(x,y) = \left\langle {\phi (x),\;\phi (y)} \right\rangle $$
(7)

where k is a kernel function and ϕ is a mapping from the input space (x ∈ X) to the feature space F. A number of kernel functions are available for the non-linear transformation of the input space. Popular kernel functions used in SVM include the variance–covariance-based linear and polynomial kernels, and the Euclidean distance-based radial basis function kernel.

A radial basis function kernel, given by the following equation, was employed to perform the non-linear mapping:

$$ K(x,y) = \exp ( - \gamma \left\| {x - y} \right\|^{2} ) $$
(8)

After the kernel transformation, the new feature space allows the data to be separated linearly by hyperplanes; the hyperplane that maximizes the distance between the data samples is selected by the algorithm as the maximal-margin hyperplane.

Minimization of the regularized risk function (Eq. 9), through estimation of the coefficients w and b, realizes two important properties of SVMs: (i) it performs regression estimation by risk minimization with respect to the ε-insensitive loss function, and (ii) it performs risk minimization according to the SRM principle, in which the elements of the structure are defined by the inequality \( \left\| w \right\|^{2} \le \) constant.

The regularized risk function is defined as:

$$ R\left( C \right) = C\frac{1}{N}\mathop \sum \limits_{i = 1}^{N} L_{\varepsilon } (y,f(x_{i} ,w)) + \frac{1}{2}\parallel w\parallel^{2} $$
(9)

where \( C\frac{1}{N}\sum\nolimits_{i = 1}^{N} {L_{\varepsilon } } (y,f(x_{i} ,w)) \) is the empirical error (risk) and \( \frac{1}{2}\left\| w \right\|^{2} \) is a measure of the flatness of the function. The empirical error is measured by the ε-insensitive loss function \( L_{\varepsilon } (y,f(x_{i} ,w)) \), in which errors below ε are not penalized. The penalty parameter C is a regularization constant that determines the trade-off between the empirical error and the model complexity.

The estimation performance of SVM regression models is determined by the ε-insensitive loss function as follows:

$$ L_{\varepsilon } \left( {y,f(x,w)} \right) = \begin{cases} {\left| {y - f(x,w)} \right| - \varepsilon } & {\text{for }}\left| {y - f(x,w)} \right| \ge \varepsilon \\ 0 & {\text{otherwise}} \end{cases} $$
(10)

The parameter ε is referred to as the tube size and defines the approximation accuracy required of the training data points. Basically, the purpose of support vector regression is to find a function f(x) that deviates by at most ε from the actual values y i for all training data while being as flat as possible. In other words, the loss function ignores errors as long as they are smaller than ε but does not accept considerable deviations beyond it.
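Equation (10) translates directly into a few lines of code; the sketch below is only an illustration of how the ε-insensitive loss treats small and large residuals, and the numerical example is hypothetical:

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Eq. (10): deviations smaller than epsilon are not penalized;
    larger deviations are penalized linearly by their excess over epsilon."""
    residual = np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))
    return np.where(residual >= epsilon, residual - epsilon, 0.0)

# Example: a 0.05-unit error falls inside the epsilon tube and costs nothing,
# a 0.30-unit error is penalized by 0.20.
print(epsilon_insensitive_loss([10.0, 10.0], [10.05, 10.30]))
```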

Estimation of the Predictive Ability of a QSPR Model

For the estimation of the fitting and predictive abilities of a QSPR model, Fisher's test (F), the correlation coefficient of the experimental versus fitted/predicted properties (R), the root mean squared error of calibration (RMSEC), the root mean squared error of prediction (RMSEP) and the root mean squared error of cross-validation (RMSECV) are often used. The latter three are calculated using the following equations.

$$ {\text{RMSEC}} = \sqrt {\sum\limits_{i = 1}^{c} {\frac{{(y_{{{\text{pred}}_{\text{i}} }} - y_{{{\text{obs}}_{\text{i}} }} )^{2} }}{c}} } $$
(11)

where \( y_{{{\text{obs}}_{\text{i}} }} \) is the observed property (retention time) of a calibration (training) set object, \( y_{{{\text{pred}}_{\text{i}} }} \) is the predicted property of a calibration set object, and c is the number of samples in the calibration set.

$$ {\text{RMSEP}} = \sqrt {\sum\limits_{i = 1}^{t} {\frac{{(y_{{{\text{pred}}_{\text{i}} }} - y_{{{\text{obs}}_{\text{i}} }} )^{2} }}{t}} } $$
(12)

where \( y_{{{\text{obs}}_{\text{i}} }} \) is the observed property of a test set object, \( y_{{{\text{pred}}_{\text{i}} }} \) is the predicted property of a test set object, and t is the number of samples in the test set.

The root mean squared error of cross validation (RMSECV) is defined as in Eq. (13),

$$ {\text{RMSECV}} = \sqrt {\sum\limits_{i = 1}^{v} {\frac{{\left( {y_{{{\text{valid}}_{\text{i}} }} - y_{{{\text{obs}}_{\text{i}} }} } \right)^{2} }}{v}} } $$
(13)

where \( y_{{{\text{obs}}_{\text{i}} }} \) is the observed property of a validation set object, \( y_{{{\text{valid}}_{\text{i}} }} \) is the predicted property of a validation set object and v is the number of samples in the validation set.

The RMSECV records the predictive ability of the calibration models for samples that were not used to generate the calibration equation and is regarded as an indication of the accuracy of the calibration models when sufficient validation samples are available.
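All three statistics share the same root-mean-squared-error form and differ only in which subset of samples is used; a small helper such as the following (an illustrative sketch with hypothetical variable names) covers Eqs. (11)–(13):

```python
import numpy as np

def rmse(y_obs, y_pred):
    """Root mean squared error used for RMSEC, RMSEP and RMSECV (Eqs. 11-13);
    only the subset of samples passed in differs (training, test or validation set)."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_pred - y_obs) ** 2))

# rmsec  = rmse(y_train, model.predict(X_train))   # calibration (training) set
# rmsep  = rmse(y_test,  model.predict(X_test))    # test set
# rmsecv = rmse(y_valid, model.predict(X_valid))   # validation set, as defined in the text
```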

Results and Discussion

Descriptor Selection with GA-PLS

The GA-PLS procedure was performed on the data set to choose the most favorable set of descriptors. Because the GA is essentially a stochastic algorithm, the results of different GA runs will differ slightly. To obtain more consistent results (more reliable models), the GA process was repeated several times. In the present study, the GA process was repeated 100 times for each data set, and the variables were selected based on their frequency of occurrence in the models with the maximal cross-validated explained variance (CV %) attained in each run. In this procedure, a chromosome and its fitness correspond to a set of variables and the internal predictivity of the derived PLS model, respectively. Useful variables were selected based on their frequency of occurrence in the best models obtained in each run. The frequency was calculated by the following equation:

$$ {\text{Frequency}}(i) = \frac{{\text{number of times descriptor }}i{\text{ was selected by GA-PLS}}}{{\text{number of GA-PLS runs}}} $$
(14)

where i denotes the ith descriptor. The fitness of the individuals indicates the prediction power of the selected descriptors. The final model was selected via stepwise regression, with the variables chosen according to their frequency. Descriptors with a high frequency were considered more essential in describing the molecular structural features that contribute most to the overall retention times. Descriptors with a frequency above 90 % in 10 runs were selected. The parameters of the Genetic Algorithm used for GA-PLS are shown in Table 1. With this approach, a set of four descriptors (see Table 2) was chosen for each data set and used to build the PLS, ANN and SVM models. For all seven data sets, the same descriptors were selected by GA-PLS; the models differed only in their regression coefficients. The specification of each model is given in Table 2.

Table 2 The partial least squares regression coefficients

PLS Modeling

Table S1 gives the retention times on all seven CSs for all molecules. The PLS-predicted retention times for all peptides are shown in Table S2. Table 2 shows the regression coefficients of the four descriptors for the best PLS models. The optimum number of latent variables to be included in the model was three. The four descriptors in the model are: the structural information content (neighborhood symmetry of 3-order) (SIC3), the Geary autocorrelation of lag 2 weighted by atomic polarizabilities (GATS2P), the lowest eigenvalue n. 1 of the Burden matrix weighted by atomic masses (BELM1), and the total number of primary carbons (NCP). Table 3 shows the correlation matrix for these descriptors; as can be seen, there is no significant correlation between the selected descriptors.

Table 3 Correlation matrix for the four descriptors

For evaluation of the relative significance and contribution of each descriptor in the models, the mean effect (ME) value was calculated for each descriptor by the following equation:

$$ {\text{ME}}_{j} = \frac{{\beta_{j} \sum\nolimits_{i = 1}^{n} {d_{ij} } }}{{\sum\nolimits_{j = 1}^{m} {\beta_{j} \sum\nolimits_{i = 1}^{n} {d_{ij} } } }} $$
(15)

where ME j is the mean effect of the considered descriptor j, β j is the coefficient of descriptor j, d ij is the value of descriptor j for molecule i, n is the number of molecules, and m is the number of descriptors in the model. The calculated ME values are plotted in Fig. 1.
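Equation (15) can be evaluated directly from the PLS coefficients and the descriptor matrix; the short sketch below is an illustration, and the variable names are hypothetical:

```python
import numpy as np

def mean_effects(coefs, X):
    """Eq. (15): relative contribution of descriptor j, i.e. beta_j * sum_i d_ij
    normalised by the sum of these products over all descriptors."""
    coefs, X = np.asarray(coefs, float), np.asarray(X, float)
    numerators = coefs * X.sum(axis=0)       # beta_j * sum_i d_ij for each descriptor j
    return numerators / numerators.sum()

# Hypothetical usage: PLS coefficients and a matrix with one column per descriptor
# (SIC3, GATS2P, BELM1, NCP) and one row per peptide.
# me = mean_effects(pls_coefficients, descriptor_matrix)
```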

Fig. 1 Plot of the descriptors' mean effects

The Influence of Each Descriptor on Retention Time

It is well known that the chromatographic retention time can be considered a chemical structure-dependent parameter, which is constant for a given peptide under defined separation conditions, including mobile-phase composition, stationary phase, pH and temperature. Under constant separation conditions, the amino acid composition, peptide chain length and sequence (in general, the structure of the peptide) play an essential role in determining the retention time. Therefore, we focus on the descriptors that encode the structural features of the peptides.

The first descriptor, according to its mean effect, is the lowest eigenvalue n. 1 of the Burden matrix weighted by atomic masses (BELM1). This BCUT descriptor is an extension of parameters initially developed by Burden [54]. The Burden parameters are derived from a combination of the atomic number of each atom and a description of the nominal bond type for adjacent and non-adjacent atoms. They may include connectivity information and atomic properties that are relevant to intermolecular interactions. The BCUT descriptors expand the number and types of atomic features that can be considered and supply a diversity of proximity measures and weighting schemes. Depending on the choice of connectivity and atomic information, and on the scaling factors controlling the relative balance of these two kinds of information, these descriptors can capture sufficient structural features of molecules to yield a useful measure of molecular diversity; they are designed to encode atomic properties that govern intermolecular interactions. The positive coefficient of BELM1 in the PLS model indicates that an increase in the value of this descriptor leads to an increase in the retention time: an increase in this descriptor reduces the solute–solvent interactions, thereby increasing the dispersive interactions with the stationary phase and consequently increasing t R.

The second descriptor is the structural information content (neighborhood symmetry of 3-order) (SIC3). This topological descriptor represents a measure of the graph complexity and is calculated as follows [55]:

$$ {\text{SIC}}_{r} = \frac{{{\text{IC}}_{r} }}{{\log_{2} A}} $$
(16)

where A is the number of atoms and IC r is the information content index of order r, which is calculated as follows:

$$ {\text{IC}}_{r} = - \sum\nolimits_{g = 1}^{G} {\frac{{A_{g} }}{A}} \; \times \, \log_{2} \frac{{A_{g} }}{A} = - \sum\nolimits_{g = 1}^{G} {P_{g} \times \;\log_{2} P_{g} } $$
(17)

where g runs over the G equivalence classes, G is the number of equivalence classes (i.e. the number of different amino acid residues), A g is the cardinality of the gth equivalence class, A is the total number of atoms, and p g is the probability of randomly selecting a vertex of the gth class. This descriptor gives information on how many atoms with a similar connectivity pattern are present in the molecule. The descriptor depends on the number of atoms in the molecule, and it orders the molecules by increasing chain length and number of substituents of the peptides. It comprehensively describes the differences in hydrophobicity and steric properties of the solutes. As the hydrophobic and steric interactions are the main interactions between the solute and the stationary phase, this descriptor plays an important role in the elution process and correlates positively with t R.

The next descriptor is the Geary autocorrelation of lag 2 weighted by atomic polarizabilities (GATS2P). This 2D autocorrelation descriptor describes how the considered property is distributed along the topological structure and is defined as:

$$ {\text{GATS2P}}(d) = \frac{{\frac{1}{2\varDelta } \times \sum\nolimits_{i = 1}^{A} {\sum\nolimits_{j = 1}^{A} {\delta_{ij} \times (w_{i} - w_{j} )^{2} } } }}{{\frac{1}{(A - 1)} \times \sum\nolimits_{i = 1}^{A} {(w_{i} - \bar{w})^{2} } }} $$
(18)

where w i is an atomic property and \( \bar{w} \) its average value over the molecule, A is the number of atoms, d the considered topological distance (i.e. the lag of the autocorrelation), δ ij a Kronecker delta (δ ij  = 1 if d ij  = d, zero otherwise) and Δ the sum of the Kronecker deltas, i.e. the number of vertex pairs at distance d [56]. Autocorrelation descriptors calculated for the molecular geometry are based on the interatomic distances collected in the geometry matrix, with the property function defined by a set of atomic properties. The 2D autocorrelation descriptors in general express how the considered property is distributed along the topological structure. An increase in this descriptor enhances the polarizability and the interaction of unsaturated molecules with the mobile phase and therefore favors the elution process. Furthermore, GATS2P encodes the hydrophobicity of the compound; thus, an increase in this descriptor strengthens the hydrophobicity of the molecule, enhances the interaction between the solute and the stationary phase, and thus disfavors the elution process. Together, these interactions can lead to an overall decrease in t R.
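For concreteness, the Geary coefficient of Eq. (18) can be computed as sketched below; this is an illustrative implementation, not the Dragon code, and the toy distance matrix and polarizability values are invented for the example:

```python
import numpy as np

def geary_autocorrelation(w, topo_dist, d=2):
    """Eq. (18): Geary autocorrelation of atomic property w at topological lag d.
    topo_dist is the matrix of topological (bond-count) distances between atoms."""
    w = np.asarray(w, float)
    A = len(w)
    mask = (np.asarray(topo_dist) == d)           # Kronecker deltas: pairs at distance d
    delta = mask.sum()                            # sum of the deltas over the double sum
    numerator = ((w[:, None] - w[None, :]) ** 2 * mask).sum() / (2 * delta)
    denominator = ((w - w.mean()) ** 2).sum() / (A - 1)
    return numerator / denominator

# Toy example: four atoms in a chain (topological distances 1, 2, 3 along the chain),
# weighted by hypothetical atomic polarizabilities.
dist = np.array([[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]])
print(geary_autocorrelation([1.1, 0.7, 0.9, 1.3], dist, d=2))
```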

The last descriptor is the total number of primary carbons (NCP). This constitutional descriptor depends only on the atomic constitution of the molecule; it is insensitive to any conformational change and does not distinguish among isomers. It encodes the size, shape, and degree of branching of the compound and relates to the dispersion interactions among molecules: the larger the molecular size, the stronger the dispersion interaction. Thus, in some sense, it correlates with the hydrodynamic friction. Consequently, the larger the value of NCP, the longer the retention time of the molecule.

From the above discussion, it can be concluded that all descriptors involved in the QSPR model have some physical meaning and account for structural features influencing the retention times of the molecules. We can conclude that the retention mechanism in RPLC mainly correlates with the factors mentioned above: dispersive and steric interactions between the solute and the stationary phase, and the hydrodynamic friction among the peptides, the stationary phase and the mobile phase.

ANN Modeling

The next step in our study was the generation, optimization and training of the ANN. Table 4 shows the architecture and specifications of the optimized ANN parameters. After the optimization of the ANN parameters, the network was trained using the training set to adjust the weight and bias values. It is known that neural networks can become over-trained: an over-trained network has effectively memorized the stimulus patterns it has seen but cannot give accurate predictions for unseen stimuli and is no longer able to generalize, i.e. the network has also modeled the experimental error in the training set. There are numerous methods for overcoming this problem. One method is to use a test set to assess the prediction power of the network throughout its training. In this method, after every 1,000 training iterations, the network was used to calculate t R of the molecules in the test set. To keep the predictive power of the network at a desirable level, training was stopped when the root mean squared error of the test set started to increase. The results showed that over-training began after 42,000 iterations. Since the test-set error is not a good estimate of the generalization error, the prediction potential of the model was evaluated on a third set of data, the validation set. Compounds in the validation set were not used during the training process and were reserved only to evaluate the predictive power of the generated ANN.

Table 4 Architecture and specifications of an optimized ANN model

Table S3 lists the ANN-estimated retention times on all seven CSs for the training, test and validation sets. The statistical parameters obtained with the ANN model for these sets are summarized in Table 5. Comparison of the statistical parameters in Table 5 reveals the superiority of the ANN model over the PLS model: the R values are systematically higher and the errors smaller. The key strength of neural networks, unlike regression analysis, is their ability to map the selected features flexibly by modeling their functional dependence implicitly.

Table 5 Statistical parameters obtained using the PLS, ANN and SVM models

SVM Modeling

SVM was used to generate another non-linear model based on the same subset of descriptors. The performance of SVM regression depends on the combination of several parameters: the capacity parameter C, the ε of the ε-insensitive loss function, and γ, which controls the amplitude of the Gaussian kernel. C is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the training error. If C is too small, insufficient weight is placed on fitting the training data; if C is too large, the algorithm will over-fit the training data. To make the learning process stable, a sufficiently large value should be set for C. The kernel type is another significant factor. For regression tasks, the Gaussian RBF kernel is generally used; it is represented as follows:

$$ \exp ( - \gamma \left\| {u - v} \right\|^{2} ) $$

where γ is a constant parameter of the kernel, and u and v are two independent variables. γ controls the amplitude of the Gaussian RBF function and, consequently, the generalization ability of the SVM. The best value for ε depends on the type of noise present in the data, which is usually unknown. Even if sufficient knowledge of the noise were available to choose an optimal value for ε, there is the practical consideration of the number of resulting support vectors: the ε-insensitivity prevents the whole training set from meeting the boundary conditions and thus allows for sparseness in the solution of the dual formulation. Therefore, selecting an appropriate value of ε is mandatory, and these parameters should be optimized to obtain the best results. To select proper values for these parameters, diverse values were tried, and the set of values with the best leave-one-out cross-validation performance was selected as optimal. From this process, γ, ε and C were fixed at 5, 0.04 and 300, respectively, with a support vector number of 45. The results predicted by the optimal SVM are given in Table 5 for all seven CSs. The SVM model has higher correlation coefficients (R) and Fisher values (F) and lower RMSEs for all three sets compared to the PLS and ANN models. The statistical parameters tabulated in Table 5 reveal the high accuracy and predictive ability of the model. Figure 2 shows the plot of the SVM-predicted versus experimental retention times for all molecules in the data set (divided over the training, test and validation sets).
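The parameter search described above can be illustrated with scikit-learn's SVR and a leave-one-out grid search; this is only a sketch (the data are synthetic and the grid is an assumption centred on the reported optima γ = 5, ε = 0.04 and C = 300), not the software actually used in this work:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVR

# Synthetic stand-in for the 55-peptide training set with 4 selected descriptors.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(55, 4))
y_train = np.sin(X_train[:, 0]) + 0.5 * X_train[:, 1] + rng.normal(scale=0.1, size=55)

# Grid centred on the values reported as optimal in the text; leave-one-out
# cross-validation picks the best combination, as described above.
grid = {"C": [100, 300, 1000], "gamma": [1, 5, 10], "epsilon": [0.01, 0.04, 0.1]}
search = GridSearchCV(SVR(kernel="rbf"), grid,
                      cv=LeaveOneOut(), scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

best = search.best_estimator_
print("best parameters:", search.best_params_,
      "  number of support vectors:", best.support_.size)
```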

Fig. 2 Plot of SVM-estimated versus experimental retention times for (a) CS1, (b) CS2, (c) CS3, (d) CS4, (e) CS5, (f) CS6 and (g) CS7

Comparison of the Results Obtained by Different QSPR Approaches

From the results of the QSPR models for the retention time (Table 5), it can be seen that the results obtained using SVM are comparable or superior to those obtained by ANN and PLS. As a universal machine learning method, SVM is rooted in the structural risk minimization principle, which minimizes an upper bound of the generalization error rather than the training error; SVM therefore has a better generalization performance than PLS and ANN. Moreover, compared to ANN, once the corresponding parameters are specified, the SVM solution is unique and reproducible, which is a clear advantage.

By performing model validation, it can be concluded that the presented model is valid and can be successfully employed to predict the t R of peptides with an accuracy within the confidence limits of the experimental t R determination. It can reasonably be expected that the proposed model will properly predict t R for new peptides. In addition, the presented method could also identify, and provide some insight into, which structural features are related to the t R of peptides.

Comparison with Other QSPR Models

Put and Vander Heyden [11] developed QSRR models based on multivariate adaptive regression splines (MARS), two-step MARS (TMARS), PLS, uninformative variable elimination partial least squares (UVE-PLS) and MLR for the prediction of the retention times of a set of 98 peptides on the seven chromatographic systems. The comparison of the statistics of our SVM model for each CS with those of the other QSPR models is shown in Table 6. Comparison of the RMSEs of the present study with those of the previous work shows the superiority of our SVM model.

Table 6 Comparison of the obtained SVM model with MARS, TMARS, MLR, PLS and UVE-PLS

Revelli, Mutelet and Jaubert [57] developed linear solvation energy relationships (LSERs) for predicting the gas-to-ionic liquid partition coefficient (log K L) and the water-to-ionic liquid partition coefficient (log P) of various organic compounds. The solute descriptors used in their LSER models were: the excess molar refraction E; the dipolarity/polarizability S; the hydrogen-bond acidity and basicity, A and B, respectively; the gas–liquid partition coefficient on n-hexane at 298 K, L; and the McGowan volume V. The squared correlation coefficients (R 2) of the models for the prediction set were 0.997 and 0.996 for log K L and log P, respectively, which are comparable with our results in Table 6.

Conclusions

In the present work, QSPR models based on the Support Vector Machine have been developed for predicting the t R of a set of peptides from some of their molecular descriptor values. The outcome of our computations indicates that, while the GA-PLS method allows proper selection of the important descriptors, the introduction of SVM gives a substantial improvement in prediction quality. The calculated statistical parameters of these models show the superiority of the SVM over the PLS and ANN models. SVM performs better because it applies the structural risk minimization principle, which has been shown to be superior to the conventional empirical risk minimization principle employed by the usual machine learning techniques. SVM also has the advantage over the other techniques of converging to the global optimum, rather than to a local optimum that depends on the initialization and on parameters affecting the rate of convergence.