1 Introduction

Lysine glycation is one of the most common and important post-translational modifications (PTMs), and it can potentially affect various protein properties, such as conformation, efficacy, and immunogenicity [1, 2]. Moreover, lysine is one of the essential amino acids in the human body: it promotes development, enhances immune function, and improves the functioning of the central nervous system. Because of its low content in cereals and its susceptibility to damage during processing, lysine is known as the first limiting amino acid.

Lysine glycation is a complex multi-step process, beginning with the attachment of reducing sugars to amino groups in cellular proteins, which leads to the formation of Schiff's base as an early glycation product [3,4,5]. The advanced glycation end-products are known to facilitate age-related chronic diseases, e.g., atherosclerosis [6], by changing vascular elasticity and thickening vascular walls [7]. Glycation is also observed to promote abnormal amyloid aggregation in aging-related neurodegenerative disorders, such as Alzheimer's [8] and Parkinson's [9] diseases. In spite of its essential role, the detection of glycated residues still relies solely on the tedious and time-consuming mass spectrometry technique, which measures the monosaccharide modification-induced mass increase in the investigated peptide [10].

Several methods for predicting glycation sites based on protein sequence information have been reported. The neural network predictor NetGlycate was built by Johansen et al. and trained on 89 glycated and 126 non-glycated lysine sites derived from 20 proteins. Later, Liu et al. constructed the model PreGly to predict glycation sites by extracting the amino acid composition, 4-interval amino acid pairs, and five amino acid physicochemical properties, and then selecting effective features through the maximum relevance minimum redundancy (mRMR) algorithm. Xu et al. developed the predictor Gly-PseAAC by combining position-specific amino acid propensity with the support vector machine (SVM) algorithm. Recently, Ju et al. used bi-profile Bayes (BPB) feature extraction combined with the SVM algorithm to construct a new predictor, BPB_GlySite, for glycation sites. Although BPB_GlySite offers some improvement over the previous predictors, its performance on the Xu training set is not satisfactory: it obtains a Matthews correlation coefficient of only 0.3499 and thus requires significant improvement.

In this study, we propose a novel predictor, MDS_GlySitePred, to improve the prediction performance for glycation sites. To overcome the defective non-uniform distribution of training and test samples, we employed multidimensional scaling (MDS) to cluster the samples [11]. According to different distance radii, the negative samples were divided into three categories, while the positive samples remained unchanged. Nine features, namely parallel correlation pseudo amino acid composition (PC-PseAAC), general parallel correlation pseudo amino acid composition (PC-PseAAC-General), adapted normal distribution bi-profile Bayes (ANBPB), double bi-profile Bayes (DBPB), bi-profile Bayes (BPB), Top-n-gram, amino acid composition (AAC), position-specific di-amino acid propensity (PSDAAP), and position-specific tri-amino acid propensity (PSTAAP), were extracted from the sequence information. By combining the MDS method with the SVM algorithm, and through a tenfold cross-validation test, the MDS method was shown to be superior to the existing predictors in predicting lysine glycation sites. Finally, based on the feature combinations PC-PseAAC-General + ANBPB + DBPB + Top-n-gram + AAC, ANBPB + PSDAAP, and PC-PseAAC-General + PC-PseAAC + BPB + DBPB + PSTAAP, the importance of the positions around the glycation sites was analyzed. The feature analysis shows that the residues around the glycation sites may play the most important role in the prediction of glycation sites. These results may provide useful clues for studying the lysine glycation mechanisms and may facilitate relevant experimental verifications.

2 Materials and methods

The proposed method comprises four major steps: (1) collecting and processing the data, (2) clustering the training datasets with MDS, (3) extracting sequence features, and (4) constructing and evaluating the models. The conceptual diagram of constructing the prediction model is given in Fig. 1.

Fig. 1 The conceptual diagram of constructing the prediction model

2.1 Data collection and pre-processing

The training datasets most recently constructed by Xu et al. [2, 12] and Johansen et al. [13, 14] were used in the present study to provide a comprehensive and unbiased comparison of our method with existing methods. For convenience, the datasets were named the Xu dataset and the Johansen dataset, respectively. The proteins in Xu's training set were retrieved from the protein lysine modifications database CPLM [15], and the set consisted of 223 experimentally annotated glycation lysine sites and 446 non-glycation lysine sites from 72 proteins. In this study, we retrieved the proteins used in the Xu dataset from NCBI to obtain all negative samples. Pseudo amino acids were not considered in this work, i.e., windows extending beyond the protein termini were not padded. Following Xu [12] and Ju [2], the window size was set to 15. Thus, every training sample was represented as a peptide segment of length 15, with 7 residues upstream and 7 residues downstream of the central lysine residue K (a minimal sketch of this extraction is given below). The new training dataset thus contained 215 lysine glycation sites and 1781 lysine non-glycation sites. Johansen's benchmark dataset was processed in the same way, finally yielding 81 positive samples and 244 negative samples. Finally, amino acid composition (AAC) feature extraction was performed on the negative training set. To avoid linear dependence (the 20 frequencies sum to 1), we removed the last column from the 20-dimensional feature, leaving 19 columns of feature vectors for later use in the MDS method.
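As a rough illustration of this windowing step, the following sketch (our own, not the authors' code; the sequence and helper name are hypothetical) extracts 15-residue peptides centered on lysines and discards windows that cross a terminus:

```python
from typing import Optional

def extract_window(sequence: str, k_index: int, flank: int = 7) -> Optional[str]:
    """Return the (2*flank + 1)-residue peptide centered on the lysine at
    0-based position k_index, or None when the window crosses a terminus
    (such samples are discarded rather than padded with pseudo residues)."""
    if sequence[k_index] != "K":
        raise ValueError("site is not a lysine")
    start, end = k_index - flank, k_index + flank + 1
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]

# Hypothetical usage on a made-up sequence:
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEK"
peptides = []
for i, residue in enumerate(seq):
    if residue == "K":
        window = extract_window(seq, i)
        if window is not None:
            peptides.append(window)
print(peptides)  # each entry is a 15-residue training peptide
```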

2.2 Feature extraction and encoding

2.2.1 Amino acid composition (AAC)

The amino acid composition [16, 17] simply represents the frequency of the 20 common amino acids in the protein sequence, reflects the global characteristics of the sequence, and is a basic protein sequence feature extraction algorithm. The AAC maps a protein sequence to a point in the 20-dimensional Euclidean space and can be defined as a 20-dimensional vector:

$$ P = \left[ {x_{1} ,x_{2} , \ldots ,x_{i} , \ldots ,x_{20} } \right]^{T} $$
(1)

where \( x_{i} = f_{i} /\sum\nolimits_{j = 1}^{20} {f_{j} } \) and fi is the number of times the ith type of amino acid appears in the protein sequence. Obviously, \( \sum\nolimits_{i = 1}^{20} {x_{i} = 1} \). The amino acid composition is convenient to calculate and is among the most commonly used sequence feature extraction algorithms in protein classification studies.
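To make Eq. 1 concrete, here is a minimal sketch (assuming the conventional alphabetical ordering of the 20 amino acids, which the text does not specify):

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, alphabetical

def aac(peptide: str) -> np.ndarray:
    """20-dimensional AAC vector of Eq. 1: x_i = f_i / sum_j f_j."""
    counts = np.array([peptide.count(a) for a in AA], dtype=float)
    return counts / counts.sum()

x = aac("MKTAYIAKQRQISFVKSHFSR")
print(round(x.sum(), 10))  # 1.0: the components sum to one
print(x[:-1].shape)        # (19,): the last column is dropped before MDS
```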

2.2.2 Bi-profile Bayes (BPB)

The bi-profile Bayes feature extraction algorithm proposed by Shao et al. [18, 19] has been widely used to predict various post-translational modification sites [20,21,22]. BPB comprehensively considers the information contained in both the positive and the negative samples. Let \( S = s_{1} s_{2} \ldots s_{n} \) denote a lysine glycation sample, where each \( s_{j} \left( {j = 1,2, \ldots ,n} \right) \) is one of the 20 natural amino acids {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}, and n is the length of the peptide fragment after the amino acid K in the middle position is omitted (i.e., n = 14). Given a protein sequence P (Eq. 1), the BPB feature vector of P is defined as:

$$ P = \left[ {x_{1} ,x_{2} , \ldots ,x_{n} ,x_{n + 1} , \ldots ,x_{2n} } \right]^{T} $$
(2)

where P is the posterior probability vector, \( x_{1} ,x_{2} , \ldots ,x_{n} \) represent the posterior probability of each amino acid at each position in the positive peptide sequence dataset, and \( x_{n + 1} , \ldots ,x_{2n} \) represent the posterior probability of each amino acid at each position in the negative peptide sequence dataset. Two position-specific profiles for final model training, a positive and a negative position-specific profile, were generated by calculating the frequency of each amino acid at each position in the positive and negative datasets, respectively.
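The profile construction and encoding can be sketched as follows (a minimal illustration under the definitions above, not the authors' implementation; the peptide lists are assumed to hold the 14-residue fragments with the central K removed):

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
IDX = {a: i for i, a in enumerate(AA)}

def position_profile(peptides):
    """Frequency of each amino acid at each position; shape (n, 20)."""
    n = len(peptides[0])
    counts = np.zeros((n, 20))
    for p in peptides:
        for j, a in enumerate(p):
            counts[j, IDX[a]] += 1
    return counts / len(peptides)

def bpb(peptide, pos_profile, neg_profile):
    """2n-dimensional BPB vector of Eq. 2 for one peptide."""
    pos = [pos_profile[j, IDX[a]] for j, a in enumerate(peptide)]
    neg = [neg_profile[j, IDX[a]] for j, a in enumerate(peptide)]
    return np.array(pos + neg)  # 28-dimensional for n = 14
```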

2.2.3 Double bi-profile Bayes (DBPB)

DBPB is an improvement over BPB [23]. BPB uses the posterior probability of each single amino acid at each position in the positive and negative datasets, while DBPB uses the posterior probability of every di-amino acid (adjacent amino acid pair) at each position. Given a protein sequence P (Eq. 1), the DBPB feature vector of P is defined as follows:

$$ P = \left[ {x_{1} ,x_{2} , \ldots ,x_{n - 1} ,x_{n} , \ldots ,x_{{2\left( {n - 1} \right)}} } \right]^{T} $$
(3)

where P is the posterior probability vector, \( x_{1} ,x_{2} , \ldots ,x_{n - 1} \) represent the posterior probability of each amino acid pair at each position in the positive peptide sequence dataset, and \( x_{{\left( {n - 1} \right) + 1}} , \ldots ,x_{{2\left( {n - 1} \right)}} \) represent the posterior probability of each amino acid pair at each position in the negative peptide sequence dataset. Two position-specific profiles for final model training, a positive and a negative position-specific profile, were generated by calculating the frequency of each amino acid pair at each position in the positive and negative datasets, respectively.

2.2.4 Adapted normal distribution bi-profile Bayes (ANBPB)

ANBPB [20, 24] improves BPB in another respect. Given a protein sequence P (Eq. 1), the ANBPB feature vector of P is defined as:

$$ P = \left[ {p_{1} ,p_{2} , \ldots ,p_{n} ,p_{n + 1} , \ldots ,p_{2n} } \right]^{T} $$
(4)

where \( p_{1} ,p_{2} , \ldots ,p_{n} \) are the posterior probabilities of each amino acid at each position in the positive peptide sequence dataset, and \( p_{n + 1} , \ldots ,p_{2n} \) are defined based on the posterior probabilities of each amino acid at each position in the negative peptide sequence dataset. The posterior probabilities \( p_{1} ,p_{2} , \ldots ,p_{2n} \) are coded by the adapted normal distribution as follows:

$$ \varphi \left( x \right) = \frac{1}{{\sqrt {2\pi } }}\mathop \int \limits_{ - \infty }^{x} e^{{ - \frac{{t^{2} }}{2}}} dt $$
(5)

where \( \varphi \left( x \right) \) is the standard normal distribution function; a detailed description of the formula is given in [20, 24].

2.2.5 Position-specific di-amino acid propensity (PSDAAP)

For every two nearest (adjacent) amino acids at each position, the posterior probability in the negative peptide sequence dataset is subtracted from that in the positive peptide sequence dataset [25, 26]. Given a protein sequence P (Eq. 1), the PSDAAP feature vector of P is defined as follows:

$$ P = \left[ {p_{1} ,p_{2} , \ldots ,p_{n - 1} } \right]^{T} $$
(6)
$$ + P = \left[ { + p_{1} , + p_{2} , \ldots , + p_{n - 1} } \right]^{T} $$
(7)
$$ - P = \left[ { - p_{1} , - p_{2} , \ldots , - p_{n - 1} } \right]^{T} $$
(8)

where P is the feature vector, \( + p_{i} \) represents the posterior probability of the observed amino acid pair at position i in the positive peptide sequence dataset, and \( - p_{i} \) represents the posterior probability of that pair at position i in the negative peptide sequence dataset; the feature components are \( p_{i} = \left( { + p_{i} } \right) - \left( { - p_{i} } \right) \).
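A minimal sketch of this encoding, under the definitions above (the helper names are ours; the pair frequencies are estimated per position from the training peptides):

```python
from collections import Counter
import numpy as np

def pair_profiles(peptides):
    """Per pair-position frequencies of adjacent amino acid pairs."""
    n = len(peptides[0])
    profiles = []
    for i in range(n - 1):
        c = Counter(p[i:i + 2] for p in peptides)
        profiles.append({pair: cnt / len(peptides) for pair, cnt in c.items()})
    return profiles

def psdaap(peptide, pos_prof, neg_prof):
    """(n - 1)-dimensional PSDAAP vector: p_i = (+p_i) - (-p_i)."""
    return np.array([pos_prof[i].get(peptide[i:i + 2], 0.0)
                     - neg_prof[i].get(peptide[i:i + 2], 0.0)
                     for i in range(len(peptide) - 1)])  # 13 values for n = 14
```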

2.2.6 Position-specific tri-amino acid propensity (PSTAAP)

Similar to PSDAAP, for every three nearest (consecutive) amino acids at each position, the posterior probability in the negative peptide sequence dataset is subtracted from that in the positive peptide sequence dataset [25, 26]. Given a protein sequence P (Eq. 1), the PSTAAP feature vector of P is defined as follows:

$$ P = \left[ {p_{1} ,p_{2} , \ldots ,p_{n - 2} } \right]^{T} $$
(9)
$$ + P = \left[ { + p_{1} , + p_{2} , \ldots , + p_{n - 2} } \right]^{T} $$
(10)
$$ - P = \left[ { - p_{1} , - p_{2} , \ldots , - p_{n - 2} } \right]^{T} $$
(11)

where P is the feature vector, \( + p_{i} \) represents the posterior probability of the observed tri-amino acid at position i in the positive peptide sequence dataset, and \( - p_{i} \) represents the posterior probability of that tri-amino acid at position i in the negative peptide sequence dataset; the feature components are \( p_{i} = \left( { + p_{i} } \right) - \left( { - p_{i} } \right) \).

2.2.7 Parallel correlation pseudo amino acid composition (PC-PseAAC)

PC-PseAAC [27] is an approach merging the global sequence-order information and the contiguous local sequence-order information into the feature vector of the protein sequence. Given a protein sequence P (Eq. 1), the PC-PseAAC feature vector of P is defined as follows:

$$ P = \left[ {x_{1} ,x_{2} , \ldots ,x_{20} ,x_{21} , \ldots ,x_{20 + \lambda } } \right]^{T} $$
(12)

where,

$$ x_{u} = \left\{ {\begin{array}{*{20}l} {\frac{{f_{u} }}{{\mathop \sum \nolimits_{i = 1}^{20} f_{i} + w\mathop \sum \nolimits_{j = 1}^{\lambda } \varTheta_{j} }}\left( {1 \le u \le 20} \right) } \\ { \frac{{w\varTheta_{u - 20} }}{{\mathop \sum \nolimits_{i = 1}^{20} f_{i} + w\mathop \sum \nolimits_{j = 1}^{\lambda } \varTheta_{j} }}\left( {20 + 1 \le u \le 20 + \lambda } \right)} \\ \end{array} } \right. $$
(13)

where w is the weight factor ranging from 0 to 1, the parameter λ is an integer representing the highest counted rank (or tier) of the correlation along a protein sequence, fi (i = 1,2,…,20) is the normalized occurrence frequency of the 20 amino acids in the protein P, and Θj (j = 1,2,…,λ) is called the j-tier correlation factor, reflecting the sequence-order correlation between all jth most contiguous residues along the protein chain, which is defined as follows:

$$ \varTheta_{j} = \frac{1}{L - j}\mathop \sum \limits_{i = 1}^{L - j} \varTheta (R_{i} ,R_{i + j} ),\quad j = 1,2, \ldots ,\lambda ;\;\lambda < L $$
(14)

where the correlation function is given by

$$ \varTheta (R_{i} ,R_{j} ) = \frac{1}{3}\left\{ {\left[ {H_{1} \left( {R_{j} } \right) - H_{1} \left( {R_{i} } \right)} \right]^{2} + \left[ {H_{2} \left( {R_{j} } \right) - H_{2} \left( {R_{i} } \right)} \right]^{2} + \left[ {M\left( {R_{j} } \right) - M\left( {R_{i} } \right)} \right]^{2} } \right\} $$
(15)

where \( H_{1} \left( {R_{i} } \right) \) is the hydrophobicity value, \( H_{2} \left( {R_{i} } \right) \) is the hydrophilicity value, and \( M\left( {R_{i} } \right) \) is the side-chain mass of the amino acid Ri. Note that before substituting the values of hydrophobicity, hydrophilicity, and side-chain mass into Eq. 15, they are all subjected to a standard conversion as described by the following equations:

$$ H_{1} \left( i \right) = \frac{{H_{1}^{0} \left( i \right) - \mathop \sum \nolimits_{i = 1}^{20} \frac{{H_{1}^{0} \left( i \right)}}{20}}}{{\sqrt {\frac{{\mathop \sum \nolimits_{i = 1}^{20} \left[ {H_{1}^{0} \left( i \right) - \mathop \sum \nolimits_{i = 1}^{20} \frac{{H_{1}^{0} \left( i \right)}}{20}} \right]^{2} }}{20}} }} $$
(16)
$$ H_{2} \left( i \right) = \frac{{H_{2}^{0} \left( i \right) - \mathop \sum \nolimits_{i = 1}^{20} \frac{{H_{2}^{0} \left( i \right)}}{20}}}{{\sqrt {\frac{{\mathop \sum \nolimits_{i = 1}^{20} \left[ {H_{2}^{0} \left( i \right) - \mathop \sum \nolimits_{i = 1}^{20} \frac{{H_{2}^{0} \left( i \right)}}{20}} \right]^{2} }}{20}} }} $$
(17)
$$ M\left( i \right) = \frac{{M^{0} \left( i \right) - \mathop \sum \nolimits_{i = 1}^{20} \frac{{M^{0} \left( i \right)}}{20}}}{{\sqrt {\frac{{\mathop \sum \nolimits_{i = 1}^{20} \left[ {M^{0} \left( i \right) - \mathop \sum \nolimits_{i = 1}^{20} \frac{{M^{0} \left( i \right)}}{20}} \right]^{2} }}{20}} }} $$
(18)

where \( H_{1}^{0} \left( i \right) \), \( H_{2}^{0} \left( i \right) \), and \( M^{0} \left( i \right) \) are the original hydrophobicity value, the original hydrophilicity value, and the side-chain mass of the ith amino acid, respectively. With the wide application of PC-PseAAC, Liu et al. [28] developed a web server, "Pse-in-One", that can generate PC-PseAAC. For detailed information on Pse-in-One and its updated version, please refer to [29].
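The following sketch assembles Eqs. 12-18 (the three property tables are random placeholders rather than the published hydrophobicity, hydrophilicity, and side-chain mass scales, and λ = 3, w = 0.05 are illustrative choices; in practice the vectors can be generated with Pse-in-One):

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
PIDX = {a: i for i, a in enumerate(AA)}

# Placeholder H1^0, H2^0, M^0 values (random, NOT the published scales).
rng = np.random.default_rng(0)
raw_props = rng.normal(size=(3, 20))

def standardize(p):
    """Eqs. 16-18: zero mean, unit (population) standard deviation."""
    return (p - p.mean()) / p.std()

props = np.array([standardize(p) for p in raw_props])

def pc_pseaac(seq, lam=3, w=0.05):
    L = len(seq)
    f = np.array([seq.count(a) for a in AA]) / L   # normalized frequencies
    thetas = []
    for j in range(1, lam + 1):                    # Eq. 14
        corr = [np.mean((props[:, PIDX[seq[i + j]]]
                         - props[:, PIDX[seq[i]]]) ** 2)  # Eq. 15
                for i in range(L - j)]
        thetas.append(np.mean(corr))
    thetas = np.array(thetas)
    denom = f.sum() + w * thetas.sum()             # Eq. 13
    return np.concatenate([f / denom, w * thetas / denom])

print(pc_pseaac("MKTAYIAKQRQISFVKSHFSR").shape)    # (20 + lambda,) = (23,)
```

Note that the 30-dimensional PC-PseAAC quoted in Sect. 3.2 corresponds to λ = 10.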

2.2.8 General parallel correlation pseudo amino acid composition (PC-PseAAC-General)

The PC-PseAAC-General approach [30] not only allows users to upload their own indices to generate PC-PseAAC-General feature vectors, but also incorporates comprehensive built-in indices extracted from AAindex [31]. Given a protein sequence P (Eq. 1), the PC-PseAAC-General feature vector of P is defined as follows:

$$ P = \left[ {x_{1} ,x_{2} , \ldots ,x_{20} ,x_{21} , \ldots ,x_{20 + \lambda } } \right]^{T} $$
(19)

where

$$ x_{u} = \left\{ {\begin{array}{*{20}l} {\frac{{f_{u} }}{{\mathop \sum \nolimits_{i = 1}^{20} f_{i} + w\mathop \sum \nolimits_{j = 1}^{\lambda } \varTheta_{j} }}\left( {1 \le u \le 20} \right) } \\ {\frac{{w\varTheta_{u - 20} }}{{\mathop \sum \nolimits_{i = 1}^{20} f_{i} + w\mathop \sum \nolimits_{j = 1}^{\lambda } \varTheta_{j} }}\left( {20 + 1 \le u \le 20 + \lambda } \right)} \\ \end{array} } \right. $$
(20)

where w is the weight factor ranging from 0 to 1, the parameter λ is an integer representing the highest counted rank (or tier) of the correlation along a protein sequence, fi (i = 1,2,…,20) is the normalized occurrence frequency of the 20 amino acids in the protein P, and Θj (j = 1,2,…,λ) is called the j-tier correlation factor, reflecting the sequence-order correlation between all jth most contiguous residues along the protein chain, which is defined as follows:

$$ \varTheta_{j} = \frac{1}{L - j}\mathop \sum \limits_{i = 1}^{L - j} \varTheta (R_{i} ,R_{i + j} ),\quad j = 1,2, \ldots ,\lambda ;\;\lambda < L $$
(21)

where the correlation function is given as follows:

$$ \varTheta (R_{i} ,R_{j} ) = \frac{1}{\mu }\mathop \sum \limits_{u = 1}^{\mu } \left[ {H_{u} \left( {R_{i} } \right) - H_{u} \left( {R_{j} } \right)} \right]^{2} $$
(22)

where µ is the number of physicochemical indices considered, and \( H_{u} \left( {R_{i} } \right) \) and \( H_{u} \left( {R_{j} } \right) \) are the uth physicochemical index values of the amino acids Ri and Rj, respectively. Note that before substituting the physicochemical index values into Eq. 22, they are all subjected to a standard conversion as described by the following equation:

$$ H_{u} \left( i \right) = \frac{{H_{u}^{0} \left( i \right) - \mathop \sum \nolimits_{i = 1}^{20} \frac{{H_{u}^{0} \left( i \right)}}{20}}}{{\sqrt {\frac{{\mathop \sum \nolimits_{i = 1}^{20} \left[ {H_{u}^{0} \left( i \right) - \mathop \sum \nolimits_{i = 1}^{20} \frac{{H_{u}^{0} \left( i \right)}}{20}} \right]^{2} }}{20}} }} $$
(23)

where \( H_{u}^{0} \left( i \right) \) is the uth original physicochemical value of the ith amino acid.

2.2.9 Top-n-gram

Top-n-gram [14] can be viewed as a novel profile-based building block of proteins, containing the evolutionary information extracted from frequency profiles. The frequency profiles calculated from the multiple sequence alignments output by PSI-BLAST [32] are converted into Top-n-grams by combining the n most frequent amino acids in each position's frequency profile. The protein sequences are then transformed into fixed-dimension feature vectors by counting the occurrences of each Top-n-gram. For more information about this approach, please refer to [14].
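A loose sketch of the Top-1-gram idea follows (the profile matrix below is random; a real application would use PSI-BLAST frequency profiles, and the exact encoding details follow [14]):

```python
from collections import Counter
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"

def top_n_grams(profile: np.ndarray, n: int = 1) -> Counter:
    """profile: (L, 20) frequency matrix, one row per sequence position.
    Each row is reduced to its n most frequent amino acids, and the
    resulting grams are counted."""
    grams = ["".join(AA[j] for j in np.argsort(row)[::-1][:n])
             for row in profile]
    # Indexing all possible grams yields a fixed-dimension vector (20 for n = 1).
    return Counter(grams)

prof = np.random.default_rng(5).random((15, 20))  # stand-in profile
print(top_n_grams(prof, n=1).most_common(3))
```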

2.3 Multidimensional scaling (MDS) method

MDS is a multivariate data analysis technique that displays "distance or similarity" data structures in a low-dimensional space, and it has been widely used in applications such as data visualization [33], object retrieval [34], data clustering [35], and localization [36].

MDS addresses the following problem: given the pairwise similarities (or distances) among n objects, determine a representation of these objects in a low-dimensional space (a perceptual map) that matches the original similarities (or distances) as closely as possible, thereby minimizing the distortion caused by dimensionality reduction. Each point in the space represents an object, so the distance between two points closely reflects the similarity between the corresponding objects: two similar objects are represented by two nearby points, and two dissimilar objects by two points far apart. Here, we use the dimensionality-reduction clustering function of MDS. The relationships among the amino acid sequences of the peptides are converted into a distance matrix, and each sequence is regarded as a point in a multidimensional space. Through MDS dimensionality reduction and clustering, the evolutionary relationship among these sequences can be displayed in a low-dimensional space [11].

Generally, the classical MDS [37] is a three-step algorithm, including distance matrix construction, inner product matrix computation, and low dimensional representation calculation. Details of these steps are presented as follows:

  1. Distance matrix construction. For each vector xi (1 ≤ i ≤ N), the squared Euclidean distance di,j between xi and xj (1 ≤ j ≤ N) was calculated, and thus the distance matrix D = (di,j)N×N was obtained as follows:

    $$ D = \left[ {\begin{array}{*{20}c} {d_{1,1} } & {d_{1,2} } & \ldots & {d_{1,N} } \\ {d_{2,1} } & {d_{2,2} } & \ldots & {d_{2,N} } \\ \ldots & \ldots & \ldots & \ldots \\ {d_{N,1} } & {d_{N,2} } & \ldots & {d_{N,N} } \\ \end{array} } \right] $$
    (24)
    $$ d_{i,j} = \sum\nolimits_{l = 1}^{K} {\left[ {x_{i} (l) - x_{j} (l)} \right]^{2} } $$
    (25)

    where xi(l) is the lth element of xi and xj(l) is the lth element of xj. The distance matrix D is a real symmetric matrix whose diagonal elements are all 0.

  2. Inner product matrix computation. With the distance matrix D, the inner product matrix B can be determined by

    $$ B = - \frac{1}{2}JDJ $$
    (26)
    $$ J = E - \frac{1}{N}ee^{T} $$
    (27)

    where J is the centering matrix given by Eq. 27, E is the N × N identity matrix, and e is the N × 1 vector of ones; consequently, Je = 0 and JT = J.

  3. Low dimensional representation calculation. As B is symmetric and positive semi-definite, it can be decomposed as:

    $$ B = SVS^{T} $$
    (28)
    $$ Z = SV^{1/2} $$
    (29)

    where V is a diagonal matrix of the eigenvalues of B, and S is the matrix of the corresponding eigenvectors. Consequently, a low dimensional representation G can be generated by taking the first d columns of Z.

Therefore, G is a matrix of size N × d (d < K). Assume that Vd is the diagonal matrix composed of the d largest eigenvalues and Ud is the N × d matrix composed of the corresponding d orthonormal eigenvectors. If \( U_{d} = (\overrightarrow {{v_{1} }} ,\overrightarrow {{v_{2} }} , \ldots ,\overrightarrow {{v_{d} }} ) \) and \( V_{\text{d}} = {\text{diag}}(\lambda_{1} ,\lambda_{2} , \ldots ,\lambda_{d} ) \), then the coordinate matrix in the d-dimensional space (i.e., G) is:

$$ X_{\text{d}} = (\sqrt {\lambda_{1} } \cdot \overrightarrow {{v_{1} }} ,\sqrt {\lambda_{2} } \cdot \overrightarrow {{v_{2} }} , \ldots ,\sqrt {\lambda_{d} } \cdot \overrightarrow {{v_{d} }} ) = U_{d} \sqrt {V_{d} } $$
(30)
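The three steps can be summarized in a short numpy sketch (ours, written directly from Eqs. 24-30):

```python
import numpy as np

def classical_mds(X: np.ndarray, d: int = 2) -> np.ndarray:
    """X: (N, K) data matrix; returns (N, d) coordinates."""
    N = X.shape[0]
    # Step 1: squared-Euclidean distance matrix D (Eq. 25).
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * X @ X.T
    # Step 2: double centering gives the inner-product matrix B (Eqs. 26-27).
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ D @ J
    # Step 3: eigendecomposition; keep the d largest eigenvalues (Eq. 30).
    vals, vecs = np.linalg.eigh(B)                  # ascending order
    idx = np.argsort(vals)[::-1][:d]
    vals_d = np.clip(vals[idx], 0.0, None)          # guard tiny negatives
    return vecs[:, idx] * np.sqrt(vals_d)           # X_d = U_d * sqrt(V_d)

# Hypothetical usage on 19-dimensional AAC vectors of negative samples:
coords = classical_mds(np.random.default_rng(1).random((50, 19)), d=2)
print(coords.shape)  # (50, 2), ready for the perceptual map
```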

2.4 Model construction and evaluation

A support vector machine (SVM) is a supervised learning method used for classification and regression, based on statistical learning theory. An SVM looks for a rule that best maps each member of the training dataset into the correct class [38, 39], and it has proved to be a powerful tool in many bioinformatics applications [18, 40,41,42,43]. In this study, the LIBSVM package [44] was applied to build and train the prediction model. The radial basis function (RBF) \( K\left( {S_{i} ,S_{j} } \right) = e^{{ - \gamma \left\| {S_{i} - S_{j} } \right\|^{2} }} \) was used as the kernel function. A grid search was used to find the optimal SVM parameters: the penalty parameter c was selected from \( \{ 2^{0} ,2^{1} , \ldots ,2^{13} \} \), and the kernel parameter g was selected from \( \{ 2^{ - 13} ,2^{ - 12} , \ldots ,2^{0} \} \).
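A minimal sketch of this training setup, using scikit-learn's SVC (which wraps LIBSVM) rather than the LIBSVM package itself; the feature matrix and labels below are random stand-ins for the encoded peptides:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [2.0 ** k for k in range(0, 14)],        # 2^0 .. 2^13
    "gamma": [2.0 ** k for k in range(-13, 1)],   # 2^-13 .. 2^0
}

# Random stand-ins for the encoded peptides and their labels:
rng = np.random.default_rng(2)
X, y = rng.random((200, 28)), rng.integers(0, 2, 200)

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
```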

To evaluate the predictive performance of the model, the sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthews correlation coefficient (MCC) were employed, defined as follows:

$$ Sn = 1 - \frac{{N_{ - }^{ + } }}{{N^{ + } }}{\kern 1pt} {\kern 1pt} {\kern 1pt} $$
(31)
$$ Sp = 1 - \frac{{N_{ + }^{ - } }}{{N^{ - } }}{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} $$
(32)
$$ Acc = 1 - \frac{{N_{ - }^{ + } + N_{ + }^{ - } }}{{N^{ + } + N^{ - } }} $$
(33)
$$ MCC = \frac{{1 - \frac{{N_{ - }^{ + } + N_{ + }^{ - } }}{{N^{ + } + N^{ - } }}}}{{\sqrt {\left( {1 + \frac{{N_{ + }^{ - } - N_{ - }^{ + } }}{{N^{ + } }}} \right)\left( {1 + \frac{{N_{ - }^{ + } - N_{ + }^{ - } }}{{N^{ - } }}} \right)} }}{\kern 1pt} $$
(34)

where \( N^{ + } \) is the total number of glycation sites investigated, \( N_{ - }^{ + } \) is the number of glycation sites incorrectly predicted as non-glycation sites, \( N^{ - } \) is the total number of non-glycation sites investigated, and \( N_{ + }^{ - } \) is the number of non-glycation sites incorrectly predicted as glycation sites.
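For concreteness, the four measures can be computed directly from these counts (the count values in the example are hypothetical; the MCC line uses the standard TP/TN/FP/FN form, which is algebraically equivalent to Eq. 34):

```python
import math

def metrics(n_pos, n_neg, fn, fp):
    """n_pos, n_neg: totals N^+ and N^-; fn: glycation sites predicted
    negative (N^+_-); fp: non-glycation sites predicted positive (N^-_+)."""
    tp, tn = n_pos - fn, n_neg - fp
    sn = tp / n_pos                     # Eq. 31
    sp = tn / n_neg                     # Eq. 32
    acc = (tp + tn) / (n_pos + n_neg)   # Eq. 33
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))  # equivalent to Eq. 34
    return sn, sp, acc, mcc

print(metrics(n_pos=215, n_neg=395, fn=10, fp=9))  # fn, fp are made up
```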

In statistical prediction, the following three cross-validation methods are often used to examine a predictor's effectiveness in practical application: the independent dataset test, the subsampling or K-fold cross-validation test, and the jackknife test [45]. The jackknife test is the most credible of the three [46], since the outcome it produces is always unique for a given benchmark dataset. Accordingly, the jackknife test has been widely recognized and increasingly used by investigators to examine the quality of various predictors [47,48,49,50,51,52]. However, in this work, we used the tenfold cross-validation test instead of the jackknife test because the prediction results in previous works [2, 12] were obtained with tenfold cross-validation. Normally, this procedure is repeated 10 times and the final prediction result is the average over the 10 testing subsets. To obtain a reliable estimate in this study, the tenfold cross-validation was repeated 50 times.

3 Analysis of results

3.1 Experimental data processing

The MDS method was applied to the negative training set, and a perceptual map was obtained (Fig. 2). From the perceptual map, the training samples were observed to be roughly concentrated in three regions. Therefore, the negative training set was divided into three categories based on these regions: the points (0.26, −0.01), (−0.03, −0.28), and (−0.04, 0.06) were selected as centers, with 0.2, 0.12, and 0.1 as the respective radii. Thus, three groups of non-glycated lysine samples were clustered, containing 218, 212, and 1063 negative samples, named Xu dataset1, Xu dataset2, and Xu dataset3, respectively. To avoid overestimating the prediction performance of the model due to redundancy and sequence homology, CD-HIT [53, 54] was used to remove redundancy from the Xu dataset3 negative training samples: for any two samples with sequence similarity ≥ 40%, only one was retained, leaving 395 negative samples with pairwise sequence similarity < 40%. For Johansen's benchmark dataset, we used the same method to obtain the perceptual map (Fig. 3) and likewise divided the negative samples into three groups, denoted Johansen dataset1, Johansen dataset2, and Johansen dataset3. The points (−0.07261, 0.07162), (0.1198, 0.07272), and (0.01023, −0.1557) were assigned as centers, with 0.095, 0.085, and 0.145 as the respective radii, finally yielding three groups of 81, 56, and 51 sites, respectively. Because the Johansen dataset is smaller than the Xu dataset, we specified the center coordinates and radii more precisely. A minimal sketch of this grouping is given after Fig. 3.

Fig. 2 Perceptual map obtained with the Xu dataset

Fig. 3 Perceptual map obtained with the Johansen dataset
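A minimal sketch of the radius-based grouping described above (the centers and radii are those quoted for the Xu dataset; the 2-D coordinates are random stand-ins for the MDS output, and assigning a point that falls within two circles to the nearest center is our assumption):

```python
import numpy as np

# Centers and radii quoted above for the Xu dataset:
centers = np.array([[0.26, -0.01], [-0.03, -0.28], [-0.04, 0.06]])
radii = np.array([0.20, 0.12, 0.10])

def assign_groups(coords):
    groups = [[] for _ in centers]
    for i, point in enumerate(coords):
        dist = np.linalg.norm(centers - point, axis=1)
        hits = np.where(dist <= radii)[0]
        if hits.size:  # nearest qualifying center wins (our assumption)
            groups[hits[np.argmin(dist[hits])]].append(i)
    return groups

# Random stand-ins for the 2-D MDS coordinates of negative samples:
coords = np.random.default_rng(3).uniform(-0.4, 0.4, size=(100, 2))
print([len(g) for g in assign_groups(coords)])
```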

3.2 Combining features

To find the feature combinations most conducive to the identification of lysine glycation, nine feature extraction strategies were used: PC-PseAAC (30), PC-PseAAC-General (30), BPB (28), ANBPB (28), DBPB (26), Top-n-gram (20), AAC (20), PSDAAP (13), and PSTAAP (12), where the numbers in parentheses give the dimension of each feature. Combinations of features were used because combining multiple features enhances the training of the model. The feature combinations were built by considering the nine features in order of decreasing dimension. The performance of the combined feature sets in separating lysine glycation sites from non-glycation sites was examined by tenfold cross-validation. Take, for example, the combined features on Xu dataset1. When only PC-PseAAC-General was used, the prediction accuracy was 91%. When PC-PseAAC was added, the accuracy decreased to 90.98%, so PC-PseAAC was rejected; when ANBPB was added instead, the accuracy increased markedly to 94.68%. Continuing these trials, the most suitable combination, PC-PseAAC-General + ANBPB + DBPB + Top-n-gram + AAC, was obtained, with a sensitivity of 96.28%, a specificity of 99.56%, and an accuracy of 97.92%. The other training groups were processed in the same way. The results are presented in Table 1, and a sketch of this greedy procedure is given after Table 1.

Table 1 Predictive performance of the combination feature with different sequence encoding schemes based on the tenfold cross-validation
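The greedy forward procedure described above can be sketched as follows (the feature blocks are random stand-ins with the dimensions quoted in the text, tried in one plausible order; the real procedure used the actual encodings):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(4)
y = rng.integers(0, 2, 150)
# Random stand-ins for the nine encodings:
blocks = {name: rng.random((150, dim)) for name, dim in [
    ("PC-PseAAC-General", 30), ("PC-PseAAC", 30), ("BPB", 28), ("ANBPB", 28),
    ("DBPB", 26), ("Top-n-gram", 20), ("AAC", 20),
    ("PSDAAP", 13), ("PSTAAP", 12)]}

selected, best_acc, X = [], 0.0, None
for name, feats in blocks.items():
    candidate = feats if X is None else np.hstack([X, feats])
    acc = cross_val_score(SVC(kernel="rbf"), candidate, y, cv=10).mean()
    if acc > best_acc:                 # keep the block only if accuracy improves
        selected, best_acc, X = selected + [name], acc, candidate
print(selected, round(best_acc, 4))
```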

The results show that the prediction performance was enhanced by increasing the number of features step by step. As shown in Table 1, on Xu's datasets the Acc reached 97.92%, 99.77%, and 99.02%, respectively, and on Johansen's datasets the Acc reached 97.79%, 96.22%, and 100%, respectively.

3.3 Performance of MDS_GlySitePred

MDS_GlySitePred was constructed on the Xu dataset because the previous predictors Gly-PseAAC and BPB_GlySite were both trained on the same dataset. To show the three groups of results intuitively, the results of the tenfold cross-validation repeated 50 times are listed in Table 2. As can be seen, the predictor MDS_GlySitePred achieved the best prediction performance, with an Sn of 95.08%, Sp of 97.65%, Acc of 96.58%, and MCC of 0.93. The performance of MDS_GlySitePred is clearly superior to that of the second-best model, BPB_GlySite, with, in particular, an Sn higher by 31.40%. These results indicate that MDS_GlySitePred is more effective and more reliable than BPB_GlySite and Gly-PseAAC in identifying lysine glycation sites in query proteins. Since the classification algorithms used in MDS_GlySitePred, BPB_GlySite, and Gly-PseAAC are all SVMs, the better performance of MDS_GlySitePred indicates that the MDS method can be used to cluster samples drawn from different probability distributions.

Table 2 The comparison of MDS_GlySitePred with BPB_GlySite and Gly-PseAAC on Xu’s dataset by tenfold cross-validation running 50 times

3.4 Comparison of MDS_GlySitePred with existing prediction methods on Johansen's dataset

To further assess the effectiveness of MDS_GlySitePred, we compared it with the existing prediction methods NetGlycate, PreGly, Gly-PseAAC, and BPB_GlySite [2, 12,13,14]. All these predictors were trained on the same Johansen benchmark dataset [2, 12,13,14]; therefore, the same tenfold cross-validation test could be applied. The comparison results are presented in Table 3. Here too, MDS_GlySitePred achieved the best results, with an Sn of 94.44%, Sp of 96.15%, Acc of 95.45%, and MCC of 0.91, significantly outperforming the existing glycation site predictors on Johansen's benchmark dataset.

Table 3 Comparison of existing predictors on Johansen’s benchmark dataset by tenfold cross-validation test

4 Conclusions

In this work, we built a prediction model, MDS_GlySitePred, for identifying protein glycation sites based on multidimensional scaling (MDS) clustering of negative samples. To the best of our knowledge, this is the first time MDS has been applied to the prediction of glycation sites. The experimental results show that MDS is efficient in dealing with samples obeying different probability distributions. We hope that this model will further facilitate protein glycation studies. As demonstrated in a series of recent publications [47, 55,56,57] on developing new prediction methods, user-friendly and publicly accessible web servers significantly enhance their impact [48, 58,59,60,61,62,63,64,65,66,67,68,69,70]. Hence, our future course of action will be to provide a web server for the prediction method presented in this paper.