1 Introduction

The discovery of neuropeptides (NPs) stems from the groundbreaking progress in physiology, endocrinology, and biochemistry during the last century. NPs are widely distributed in both the peripheral and central nervous systems [1]. The functions that NPs mediate cover not only neural activity but also various aspects of non-neuronal cells, including food intake, energy consumption, and social behavior [2, 3]. Mature NPs are stored in dense-cored vesicles and released in a controlled manner upon a stimulus [4]. They commonly activate a signaling cascade by binding G protein-coupled receptors [5].

In general, short bioactive NPs are generated from a larger neuropeptide precursor (NPP) through a series of cleavages by proteolytic enzymes, followed by maturation events such as C-terminal amidation and other post-translational modifications. Notably, the cleavages mostly occur at basic residue (Lys and Arg) motifs that flank the NPs, often preceded by a Gly that signals C-terminal amidation [6, 7]. Meanwhile, the N-terminal signal peptide is an important region that directs the NPP to the secretory pathway [8]. It is cleaved off during translocation of the NPP through the endoplasmic reticulum membrane. The common feature of signal peptides is an enrichment of hydrophobic residues.

The characterization of NPs has relied on mass spectrometry, which provides high-quality data but is time-consuming and labor intensive [9,10,11]. As the complete genome sequences of many animals have become available, more effective and faster methods are required to identify all potential NPs and their precursors. Several bioinformatics methods have been developed to identify NPPs based on sequence conservation [12, 13]. In most cases, however, because the function of a particular peptide depends only on a short conserved motif, precursor sequences may show no significant overall sequence similarity [14]. In this study, we assume that specific sequence compositions capturing monobasic, dibasic, or tribasic cleavage sites, signal peptides, and other motifs will help recognize NPPs. Based on this hypothesis, we aim to construct a predictor that identifies NPPs from sequence compositions and to provide a web server that makes it easy to use.

2 Materials and Methods

2.1 Data Sets

The NPP data set and a data set of proteins unrelated to NPPs (UnNPP) were required for training the model. The NPP data set was collected from SwissProt [15] and NeuroPedia [16]. We searched SwissProt with the keyword “Neuropeptide (KW-0527)” while excluding “Receptor (KW-0675)” and collected all the results. We also downloaded all 270 human neuropeptide sequences from the NeuroPedia database. The data from the two sources were then merged into one data set. Proteins whose sequence status is “fragment” were removed, because they are likely to be mature neuropeptides rather than full-length precursors.

To prepare a high-quality data set, the following procedure was executed: (1) protein sequences containing ambiguous residues (“B”, “J”, “X”, etc.) were removed; (2) the CD-Hit software [17] was applied to keep the pairwise sequence similarity of the NPP sequences below 90%; (3) each protein was required to have a clear gene source, i.e., the protein entry contains “GN” information.

We constructed a candidate UnNPP pool by extracting from SwissProt all sequences unrelated to neuropeptides. After excluding peptides containing ambiguous residues, CD-Hit was again run with an identity threshold of 0.9. We then randomly selected from the candidate pool the same number of UnNPPs as NPPs to form the UnNPP data set, ensuring during the selection that the UnNPP data set has the same length distribution as the NPP data set.

2.2 Quantitative Features

Extracting a set of representative features is a crucial step in pattern recognition. Single amino acid composition (AAC) [18], dipeptide composition (DPC) [19], and tripeptide composition (TPC) [20] have achieved excellent performance in classification tasks. To establish the best model, each peptide sequence in the data sets was characterized by these three types of quantitative features. AAC, DPC, and TPC are defined by the following equations:

$${\text{AAC}}\left( i \right)=\frac{x(i)}{\sum\nolimits_{i=1}^{20} x(i)}$$
$${\text{DPC}}\left( j \right)=\frac{y(j)}{\sum\nolimits_{j=1}^{400} y(j)}$$
$${\text{TPC}}\left( n \right)=\frac{p(n)}{\sum\nolimits_{n=1}^{8000} p(n)}$$

where i denotes one of the 20 amino acids, j any of the 400 dipeptides, and n one of the 8000 tripeptides; x(i), y(j), and p(n) are their respective counts in each sequence. Thus, each sequence in the data set is quantized by three feature encoding schemes: AAC, DPC, and TPC. We also constructed a combined peptide composition (CPC) comprising all 8420 features from AAC, DPC, and TPC. The selection of an optimal combined peptide composition (OCPC) was accomplished as described in the next section.
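The three composition encodings above can be computed with a few lines of standard Python. The sketch below is illustrative (the function names `composition` and `cpc` are ours, not from the paper); it normalizes each k-mer count by the total number of k-mers in the sequence, matching the equations:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq, k):
    """Normalized k-mer composition: AAC (k=1), DPC (k=2), TPC (k=3)."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = {m: 0 for m in kmers}
    total = len(seq) - k + 1          # number of overlapping k-mers
    for i in range(total):
        kmer = seq[i:i + k]
        if kmer in counts:            # skip k-mers containing ambiguous residues
            counts[kmer] += 1
    return [counts[m] / total for m in kmers]

def cpc(seq):
    """Combined peptide composition: AAC + DPC + TPC (20 + 400 + 8000 = 8420)."""
    return composition(seq, 1) + composition(seq, 2) + composition(seq, 3)
```

Each feature block sums to 1 for sequences containing only the 20 standard residues, so the resulting 8420-dimensional vectors are directly comparable across sequences of different lengths.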

2.3 Selecting the Optimal Feature Set

In model building, irrelevant and noisy features can degrade performance and increase computational complexity. To select an optimal reduced subset, a feature selection technique based on analysis of variance (ANOVA) [21] was applied. The following feature selection steps [22] were conducted to derive the OCPC from the CPC: (1) sort the features by their ANOVA F-scores in descending order; (2) add features to the candidate set one by one in that order; (3) compute the accuracy of each candidate set via five-fold cross validation; and (4) select the candidate set with the highest accuracy as the OCPC subset.
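The incremental procedure above can be sketched with scikit-learn (an assumption on our part; the paper used LibSVM directly). For brevity this version grows the feature set in chunks of `step` features rather than strictly one at a time, and the function name `select_ocpc` is hypothetical:

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def select_ocpc(X, y, step=50):
    """Rank features by ANOVA F-score, grow the feature set greedily,
    and keep the subset with the best five-fold CV accuracy."""
    f_scores, _ = f_classif(X, y)                 # one F-score per feature
    order = np.argsort(f_scores)[::-1]            # descending relevance
    best_acc, best_k = 0.0, 0
    for k in range(step, X.shape[1] + 1, step):   # candidate subset sizes
        subset = X[:, order[:k]]
        acc = cross_val_score(SVC(kernel="rbf"), subset, y, cv=5).mean()
        if acc > best_acc:
            best_acc, best_k = acc, k
    return order[:best_k], best_acc
```

On the 8420-dimensional CPC vectors this kind of greedy search is the dominant cost, which is why ranking once by F-score (rather than re-ranking after each addition) keeps it tractable.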

Based on ANOVA theory, the discriminative power of each sequence composition can be quantified by its F-score [23], expressed as

$$F(\mu )=\frac{{S_{B}^{2}(\mu )}}{{S_{W}^{2}(\mu )}}$$

where \(S_{B}^{2}\) and \(S_{W}^{2}\) denote the inter-class and intra-class variances, respectively. They are defined as

$$S_{B}^{2}(\mu )=\frac{1}{df_B}\sum\limits_{i=1}^{K} m_i \left( \frac{\sum\nolimits_{j=1}^{m_i} f_\mu(i,j)}{m_i} - \frac{\sum\nolimits_{i=1}^{K}\sum\nolimits_{j=1}^{m_i} f_\mu(i,j)}{\sum\nolimits_{i=1}^{K} m_i} \right)^{2}$$
$$S_{W}^{2}(\mu )=\frac{1}{df_W}\sum\limits_{i=1}^{K}\sum\limits_{j=1}^{m_i} \left( f_\mu(i,j) - \frac{\sum\nolimits_{j=1}^{m_i} f_\mu(i,j)}{m_i} \right)^{2}$$

where \(df_B\) and \(df_W\) are the degrees of freedom between and within groups, defined as K − 1 and M − K, respectively; K and M are the numbers of groups and of all samples; \(m_i\) is the number of samples in the ith group; \(f_\mu(i,j)\) is the sequence composition frequency of the μth feature for the jth sample in the ith group; and μ ranges from 1 to 8420 for the CPC. In our case, K and M equal 2 and 600, and both \(m_1\) and \(m_2\) are 300. The value of \(F(\mu)\) reflects the relevance of the μth feature to the class label: the greater \(F(\mu)\), the more important the feature is for separating the groups.
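The F-score definition above translates directly into code. The helper below (our own name, `f_score`) takes the per-group value lists of a single feature and computes \(S_B^2/S_W^2\) with \(df_B = K-1\) and \(df_W = M-K\):

```python
def f_score(groups):
    """ANOVA F-score for one feature, given per-group lists of its values.

    groups: list of K lists, the i-th holding the feature values of the
    m_i samples in group i. Returns S_B^2 / S_W^2.
    """
    K = len(groups)
    M = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / M
    # Between-group variance: weighted squared deviation of group means.
    s_b = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups) / (K - 1)
    # Within-group variance: squared deviation of each value from its group mean.
    s_w = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g) / (M - K)
    return s_b / s_w
```

With K = 2 groups of 300 sequences each, applying this to every one of the 8420 CPC features yields the ranking used for feature selection.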

2.4 Constructing Support Vector Machine Models

The support vector machine (SVM) is an effective machine-learning method for supervised pattern recognition. Grounded in statistical learning theory, it has been widely used in bioinformatics [24,25,26,27,28,29,30,31,32]. The basic idea of the SVM is to map the data into a high-dimensional feature space through a kernel function and then find the hyperplane with the largest separating margin between the two groups. Four kernel functions are commonly considered: the radial basis function, polynomial, sigmoid, and linear kernels. Owing to the excellent effectiveness of the radial basis function, we used it as the kernel in the current work. Its two parameters, the kernel parameter γ and the penalty parameter C, were determined via a grid search. The SVM models were implemented with the LibSVM software [33]. To obtain the best prediction, four models were trained with AAC, DPC, TPC, and OCPC, respectively.
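A grid search over C and γ of the kind described above can be sketched with scikit-learn's LibSVM-backed `SVC` (the paper used LibSVM directly, and it does not report its search ranges, so the powers-of-two grid below is only an assumed, conventional choice):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical powers-of-two grid; the paper does not report its exact ranges.
param_grid = {
    "C":     [2 ** e for e in range(-5, 16, 2)],   # 2^-5 ... 2^15
    "gamma": [2 ** e for e in range(-15, 4, 2)],   # 2^-15 ... 2^3
}
search = GridSearchCV(
    SVC(kernel="rbf"),   # radial basis function kernel, as in the paper
    param_grid,
    cv=5,                # five-fold cross validation
    scoring="accuracy",
)
# Usage: search.fit(X_train, y_train); best pair in search.best_params_
```

Each (C, γ) pair is scored by five-fold cross-validation accuracy, and the best pair is used to train the final model on the full training set.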

2.5 Evaluating Performance

To evaluate model performance, four common metrics, namely sensitivity (Sn), specificity (Sp), accuracy (Acc), and the Matthews correlation coefficient (MCC), were calculated as follows:

$${\text{Sn}}=\frac{{{\text{TP}}}}{{{\text{TP}}+{\text{FN}}}}$$
$${\text{Sp}}=\frac{{{\text{TN}}}}{{{\text{TN}}+{\text{FP}}}}$$
$${\text{Acc}}=\frac{{{\text{TP}}+{\text{TN}}}}{{{\text{TP}}+{\text{FN}}+{\text{TN}}+{\text{FP}}}}$$
$${\text{MCC}}=\frac{{{\text{TP}} \times {\text{TN}} - {\text{FP}} \times {\text{FN}}}}{{\sqrt {({\text{TP}}+{\text{FP}})({\text{TP}}+{\text{FN}})({\text{TN}}+{\text{FP}})({\text{TN}}+{\text{FN}})} }}.$$

Here TP, FP, TN, and FN denote true positive, false positive, true negative, and false negative, respectively.

The MCC is a correlation coefficient between the observed and predicted classifications. Its value ranges from −1 to +1: −1 represents a completely opposite prediction, +1 a perfect prediction, and 0 a prediction no better than random.
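The four metrics follow directly from the confusion-matrix counts; a minimal helper (our own `metrics` function, not from the paper) makes the definitions concrete:

```python
import math

def metrics(tp, fp, tn, fn):
    """Sn, Sp, Acc, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return sn, sp, acc, mcc
```

For example, a balanced test set with TP = 90, FP = 10, TN = 95, FN = 5 gives Acc = 0.925 and MCC ≈ 0.85; a perfect prediction (FP = FN = 0) gives MCC = 1.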

The area under the receiver operating characteristic (ROC) curve, known as the AUC, was also used to measure the quality of the binary classification. The ROC curve is a graphical plot of sensitivity against 1 − specificity as the probability threshold of a two-class classifier is varied. Machine-learning researchers often use the AUC for model comparison because it does not depend on the choice of discrimination threshold. A well-performing model has an AUC close to 1, whereas a value of 0.5 indicates random guessing.

3 Results and Discussion

3.1 Collection of Data Set

The NPP and UnNPP data sets were constructed from SwissProt and NeuroPedia (Sect. 2.1). We removed duplicate sequences to build the NPP union. After a series of data-cleaning steps, 407 NPPs were retained. The length distribution showed that the majority of NPPs, about 96%, were shorter than 500 aa (Fig. 1). To reduce differences in sequence length, we removed the two NPPs longer than 1000 aa. As described in the “Materials and Methods” section, 405 UnNPPs shorter than 1000 aa were then randomly selected to form the UnNPP data set.

Fig. 1
figure 1

Length distribution of the NPP set. The NPP set contains 407 NPPs. The majority of NPPs (95.59%) were shorter than 500 aa; only two NPPs were longer than 1000 aa

Finally, the NPP and UnNPP data sets each contained 405 sequences. For a comprehensive assessment, we divided all data into a training data set and an independent testing data set: 300 NPPs and 300 UnNPPs were randomly selected to construct the training data set, and the testing data set consisted of the remaining 105 NPPs and 105 UnNPPs.

3.2 Optimization of Feature Set

Three feature encoding schemes were used in the current approach: AAC, DPC, and TPC, under which each protein corresponds to a 20-, 400-, and 8000-dimensional vector, respectively; the CPC feature set is an 8420-dimensional vector. As shown in Table 1, in five-fold cross validation the SVM model based on AAC reached an accuracy of 88.83%, while those based on DPC and TPC reached 91.83% and 93.66%, showing that TPC-based models are superior to AAC- and DPC-based models for this classification. In addition, the optimized reduced OCPC subset was obtained as described in the “Materials and Methods” section. The model based on the OCPC feature set, which contains 1521 sequence compositions, reached the highest accuracy of 96.67% and outperformed NeuroPID [34]. Evidently, the feature selection technique not only reduces running time but also achieves better predictive performance.

Table 1 Accuracy of SVM-based models trained with different features via five-fold cross validation

3.3 Evaluation of Different Models

We applied the four feature sets described in Table 1 to construct four models. Their performances were rigorously assessed on the independent testing data set; no entry of the testing data set appeared in the training of any model. The results are given in Table 2. The accuracy, MCC, and AUC obtained with OCPC are 88.62%, 0.78, and 0.95, slightly higher than the corresponding values of the other models. Similarly, Fig. 2 shows that the AUC of the red curve, which represents the OCPC model, is higher than that of the other three models. The OCPC feature set derived from the training data was therefore chosen to construct the model for further application.

Table 2 Prediction performances of models with different feature sets for independent testing data set
Fig. 2
figure 2

ROC curves of models with different feature sets on the independent testing data set. The curves in different colors are the ROCs of the four SVM models. The AUC of the red curve, which represents the OCPC model, is 0.9540, higher than that of the other three models

3.4 Analysis of Sequence Composition

We performed a feature analysis of the sequence compositions. Figure 3 shows a histogram of the F-scores of the AACs: the x-axis represents the 20 single amino acids, and the y-axis the F-score of the corresponding AAC. As shown in Fig. 3, the residues I (Ile), V (Val), S (Ser), and R (Arg) show the largest variance between NPPs and UnNPPs. Among them, I and V are hydrophobic residues that are enriched in signal peptides, and R is a basic residue found at cleavage sites [6].

Fig. 3
figure 3

Histogram for F-scores of 20 single amino acid compositions

A heat map analysis was also performed, as given in Fig. 4. The rows of the heat map denote the first amino acid of each dipeptide, and the columns the second. Each square stands for one of the 400 dipeptide compositions and is colored according to its F-score. The features in blue boxes differ between NPPs and UnNPPs, while those in red boxes are similar in the two classes. Most of the F-scores of the dipeptide compositions are near 0 (in the red box), indicating that a large proportion of the features are redundant or irrelevant for NPP prediction. The top four significant DPCs were LL (Leu-Leu), RS (Arg-Ser), IK (Ile-Lys), and KR (Lys-Arg); interestingly, KR is a typical cleavage site. Among the CPC features, the most significant was GKR (Gly-Lys-Arg), the most common known consensus cleavage site [6]. The significant features that are not basic residues may provide new insight into the sequence characteristics of NPPs.

Fig. 4
figure 4

Heat map for F-scores of 400 dipeptide compositions

3.5 Web-Server Guide

For the convenience of other researchers, a publicly accessible web server named NeuroPP has been developed. The web interface of NeuroPP was coded in Perl and is easy to use; its home page is shown in Fig. 5. On the prediction page, users can paste protein sequences in FASTA format directly into the text box or upload a local sequence file to the server. After clicking the predict button, the prediction results are returned as an online table, and the “view more” and “download” options provide further information. Note that the web server is intended to recognize NPPs shorter than 1000 aa. On the results page, users can rank the results by length or probability for a more intuitive view. The web server thus provides a convenient interface for recognizing unknown NPPs.

Fig. 5
figure 5

Screenshot to show the home page of the NeuroPP web server

4 Conclusions

To identify NPPs from poorly annotated proteomes, several tools based on machine-learning methods have been explored. However, the existing models and methods are far from satisfactory, and user-friendly web servers or stand-alone programs are urgently needed. Ofer and Linial [34] developed a predictor called NeuroPID to predict NPPs from metazoan proteomes. NeuroPID achieved 89–94% accuracy in cross validation, yielding quite encouraging results; however, its performance was not evaluated on an independent data set, and no online web server for the predictor is currently available. In this study, we developed the predictor NeuroPP to identify NPPs. In cross validation, NeuroPP achieved a higher accuracy of 96.67%. Moreover, it reached an accuracy of 88.62% and an AUC of 0.9540 on an independent testing data set. The tripeptide composition, which is not considered in NeuroPID, may contribute to the increased accuracy in identifying NPPs. In brief, NeuroPP performs well in recognizing NPPs, saves time and cost for experimental biologists, and can improve the annotation of proteomes.