Introduction

With the development of tumor immunology, several studies have demonstrated that cytotoxic T lymphocytes (CTLs) play an important role in antitumor immunity. Given the tumor-specific killing ability of CTLs, they are used to treat malignant tumors [1]. As an integrity antigen, the peptide vaccine for CTL epitope poses no risk for infection and can be massively synthesized and oriented for antitumor and antivirus activities. Thus, the accurate identification of CTL epitopes has become increasingly important, and many studies are focused on developing advanced methods for CTL epitope screening [24].

The prediction of proteasomal cleavage sites and peptides transported by transporter associated protein (TAP) has not yet been considered. Thus, the current methods for CTL epitope prediction are not perfect, and most of them are constructed based only on the affinity between peptides of CTL epitopes and major histocompatibility complex (MHC) class I (MHC-I) molecules. Accordingly, an effective CTL epitope must undergo the following translocation mechanism in living cells. First, the CTL epitope is produced by proteasomes from the protein antigen and transported by TAP to the endoplasmic reticulum to assemble with the MHC-I complexes. Second, the product is excreted by the Golgi apparatus to the surface of the cell membrane and made available for interaction with T cell receptors. In the CTL response, these intermolecular recognition processes are connected like a chain, and any errors in them invalidate the whole response. Several studies have also found that some peptides with high binding affinities for MHC-I molecules do not elicit a strong CTL response [5]. Therefore, predicting the binding affinity between peptides and MHC-I molecules is insufficient; a prediction method that combines the prediction of proteasomal cleavage sites, TAP transport peptides, and binding affinity of epitope peptides must be established to improve the accuracy of epitope prediction and identification.

Proteasomes are intracellular multi-subunit proteases that catalyze selective proteolytic protein processing within various cell signal-transduction pathways, such as cell cycle control, transcriptional regulation, and antigen presentation [68]. Proteasomes exist in several forms. For example, the proteolytically active core complexes or 20S proteasomes, association with ATP-dependent 19S cap complexes yields the larger 26S proteasomes, which can recognize proteins and are marked by ubiquitin for proteasomal degradation [9, 10]. Eukaryotic eukaryotic 20S proteasomes consist of four stacked rings (overall stoichiometry α7β7β7α7) each consisting of seven different subunits [11]. Each of the two inner β-rings carries three catalytically active sites on its inner surface, and proteolytic specificities of active site have been described as chymotrypsin like [cleaving after large, hydrophobic amino acids (AAs)], trypsin like (cleaving after basic AAs), and peptidyl-glutamyl-peptide hydrolyzing (cleaving after acidic AAs) [12]. Proteasomes cleave ubiquitin-protein conjugates into smaller peptides and generate the correct C-terminus of MHC-I-binding peptides [13, 14].

Several computational approaches for elucidating the cleavage specificity of proteasomes have been presented. These approaches are all based on experimentally verified cleavage sites within protein substrates and the analysis of the flanking region of such sites. EpiJen is an integrated approach for predicting proteasomal cleavage sites, transporting TAP, and binding MHC-I based on quantitative matrices [15]. Pierre et al. [16] used a new matrix-based method to establish a proteasomal cleavage site prediction method based on experimentally verified proteasomal cleavage sites. Holzhutter et al. [17] used a statistical method based on the MAPPP prediction server to analyze cleavage sites found in a set of peptide substrates with lengths ranging from 22 AAs to 30 AAs. PAProC [18, 19] is a method based on the proteasomal degradation of enolase protein. Up to 10 flanking AAs around the verified cleavage sites are used by an evolutionary algorithm to create the network-based model used for prediction. Another method for proteasomal cleavage prediction is NetChop, which uses a neural network for prediction [20]. Two different sets of training data are used by NetChop: verified cleavage sites within proteins and naturally processed MHC ligands. Altuvia and Margalit [21] used MHC-I ligands to study proteasomal cleavage.

This study proposes a new set of parameters to characterize the structure of AAs and uses a backpropagation (BP) neural network (BP-NN) to build a model for accurately predicting proteasomal cleavage. The accuracy of the prediction model reaches 95.454 % for the N-terminus and 95.011 % for the C-terminus, which depends on the cleavage window sizes. The results show that the proteasomal cleavage site identification depends on the sequence next to it and that the predictive performance of the C-terminus is better than that of the N-terminus on average.

Materials and methods

Experimental data

A total of 175 epitopes associated with HLA-A*0201 molecules with 8 AAs to 11 AAs were collected from the AntiJen database and used as the positive samples of proteasomal cleavage. The negative samples were constructed according to literature [20]. Briefly, pseudo-epitopes with a N- or C-termini at the middle of the epitopes are randomly produced, which assume the presence of non-cleavage sites within the epitopes. The cleavage sites that produce the epitopes and pseudo-epitopes are defined as positive (+1) and negative (−1) samples, respectively. These samples are included in the appendix.

Structural descriptors and characterization

Three properties of AAs, including residue volume [22], electron-ion interaction potential values [23], and hydrophobicity [24], are selected to characterize the peptide. The values of the parameters are listed in Table 1, and their correlation matrix is shown in Table 2. Three descriptors are needed for each AA to characterize the peptide structures, and 3 × N parameters are needed to characterize the peptide with N AAs. If a cleavage window contains N AAs, the window can be characterized by a set of 3 × N descriptors. For example, a window constituting 4 AAs should be represented by 12 descriptors.

Table 1 Three descriptor scales for natural amino acids
Table 2 Correlation matrix of parameters

Neural network algorithm

Artificial neural networks have been developed as generalizations of mathematical models of biological nervous systems [25, 26]. The present study uses a BP-NN model with three layers. The nodes of the input layer are equal to the descriptors of AAs, and the node of the output layer is the ideal value, i.e., a negative or positive sample represented by −1 or +1, respectively. The nodes of the hidden layer are calculated by the following formula:

$$ {n_l}=\sqrt{{{n_i}+{n_o}}}+10, $$

where n l, n i, and n o are the nodes of the hidden, input, and output layers, respectively. The weights connected to the input, hidden, and output layers are randomly generated and adjusted using standard BP (a gradient descent type) algorithm with the circular samples presented. Finally, the process is terminated when the minimal error or the maximal iteration is reached. For example, a cleavage window sequence containing 4 AAs should be represented by 12 descriptors, which should all be placed in the input layer. Thus, the BP-NN for the 4 AA cleavage window must have 12 neurons in the input layer.

Optimization of cleavage windows

The cleavage window is formed with several AAs at the cleavage site and defined as “P m …P2P1| P1′P2′…P n ′”, where “|” is the cleavage site and m and n are 1 to 6, respectively. For example, the cleavage window has 5 AAs when m = 2 and n = 3, which means that the previous 2 and the next 3 AAs belong to the cleavage window. For each cleavage window, the network is trained for 100 times with all samples, and the statistical results are the average values. The shortest cleavage window, also called the optimized window, is formed after training for every window.

Evaluation of network performance

The network performance is evaluated using the correlation coefficient (CC) [27] as follows:

$$ CC=\frac{{TP\times TN-FN\times FP}}{{\sqrt{{\left( {TN+FN} \right)\left( {FN+TP} \right)\left( {TP+FP} \right)\left( {FP+TN} \right)}}}}, $$

where TP is the number of true positives (experimentally verified cleavage sites also the predicted cleavage sites), TN is the number of true negatives (experimentally verified non-cleavage sites also the predicted non-cleavage sites), FP is the number of false positives (experimentally verified non-cleavage sites predicted as cleavage sites), and FN is the number of false negatives (experimentally verified cleavage sites predicted as non-cleavage sites). The additional performance measurements used are as follows:

$$ \begin{array}{*{20}c} {SE=\frac{TP }{TP+FN}\times 100} & {SP=\frac{TN }{TN+FP}\times 100} & {AC=\frac{TP+TN }{TP+FP+TN+FN}\times 100} \\ \end{array}. $$

Results and discussion

The proteasomal cleavage window is important for the recognition of cleavage sites. Several studies have selected different windows [18, 20], and the maximum length of window is composed of 12 AAs in the present study. To determine the optimized window size, BP-NN models are built with different cleavage windows. The procedure is similar to window optimization and the final results are the average values of 50 times of training and testing.

To identify the motif recognized by proteasomes, 48 windows are constructed according to the combination of AAs before and/or after at N- and C-termini cleavage sites. The sequence for every window is characterized by the parameters and modeled by BP-NN. The BP-NN predictive model is then evaluated and validated with the following four statistical parameters: sensitivity (SN), specificity (SP), accuracy (AC), and CC. The AC and CC are comprehensive parameters, including SN and SP information.

A total of 96 sets of evaluation results are obtained through calculation of the BP-NN models. These evaluation results are divided into two groups, namely the N- and C-termini. As shown in Table 3, the windows for the N-terminus containing 0 or 1 AA at both sides of the cleavage site have unsatisfactory reliability, whereas the other windows have high predictive ability (AC of about 90 % and CC above 0.80). As shown in Table 4, the windows containing no AA before the cleavage site have low predictive ability, whereas the windows containing one or more AAs before the cleavage site have high reliability (AC of about 95 % and CC of about 0.90). In general, the predictive ability of the models based on the C-terminus is superior to that of the models based on the N-terminus.

Table 3 Statistical results of N-terminus
Table 4 Statistical results of C-terminus

To compare the performance of the predictive models clearly, the AC and CC statistical parameters are shown with histograms for the N-terminus (Fig. 1) and C-terminus (Fig. 2). As shown in Fig. 1, the performance of the model based on the N-terminus is not high, but the AAs after the cleavage site are less than 3, which indicate that the proteasome can at least recognize the cleavage site through the 3 AAs. Figure 1 shows the optimized window consisting of 3 AAs expressed as “|P1′P2′P3′”. The windows show the 3 AAs motif at the N-terminus of epitope. Similarly, Fig. 2 shows the optimized window size consisting of 2 AAs expressed as “P1|P1′”. The BP-NN models show high performance based on one or more AAs before the cleavage site. The windows expressed as “P1|” can be regarded as the most suitable motif for proteasome recognition. The optimized windows show that the epitope can be derived from antigen by proteasomes, which are recognized through AAs at the N-terminus and the C-terminus with sizes of 3 and 1, respectively. The motif located in the terminus is similar to the anchor residues of epitope and recognized by proteasomes instead of MHC.

Fig. 1
figure 1

Predicted results of N-terminus, a accuracy; b coefficient of correlation

Fig. 2
figure 2

Predictive results of C-terminus, a accuracy; b coefficient of correlation

Knowledge of the cleavage patterns is essential for understanding how the proteasome contributes to the generation of peptides presented by MHC class I molecules on the surface of vertebrate cells [27]. Compared with eukaryotic proteasomes, immunosubunits replace the constitutive proteolytically active subunits and significantly improve proteasome-dependent antigen processing. Groll et al. [28] reported that the conformation of MHC-I bound peptides is similar to that of the propeptides of proteasome active subunits, which suggests the co-evolution of MHC molecules and proteasomal ligand-binding sites. Conserved AAs should be present near the cleavage for recognition of proteasomes, which are analogous to the anchor residues of epitopes bound to MHC. In this study, the conserved analysis is performed by weblogo [29], and the sequence logo for the N- and C-termini is shown in Fig. 3. The motif for the N-terminus contains the conservative site at position 2 and the corresponding residues are Leu, Ile, and Met. Similarly, the motif for the C-terminus contains one residue and the conservative AAs are Val, Leu, and Ile. More than 75 % of peptides are nonamers and the conservative sites are consistent with nonamer epitopes, which are located at positions 2 and 9.

Fig. 3
figure 3

Sequence logo of motif for N- and C-terminus

In eukaryotic cells, the ubiquitin-proteasome system plays a critical role in the cleavage of antigen protein in epitope processing and presentation in the MHC-I pathway. Ubiquitination occurs by the covalent attachment of multiple ubiquitin molecules onto the substrate, which is then transported and degraded in the 26S proteasome. Consequently, the weak interactions between the AAs closing the cleavage site and the proteasome depend on the AA properties, such as stericity, electrostatic properties, and hydrophobicity. The proteasome ligand-binding site is similar to MHC; thus, the interaction between the optimized motif and proteasome depends on that between MHC and epitope. The properties, residue volume, electron-ion interaction potential values, and hydrophobicity can reflect the interaction between motif and proteasome, and the BP-NN models show high reliability and predictive ability.

Numerous predictive models have been constructed with the machine learning algorithm [20, 30, 31], including the support vector machine and ANN. Table 5 lists the statistical parameters of the predictive model built for the C-terminus from the present and previous studies. The present predictive model is superior to those in literature. Although the datasets, characterization of sequence, and modeling method differ among the listed models in Table 5, the BP-NN model has high predictive ability because of the properties representing the profile of the interaction between proteasome and motif. Therefore, the modeling based on the properties of AAs can provide a good scheme predicting proteasomal cleavage sites.

Table 5 Comparison with literature (C-terminal)

Conclusions

Sequences at both sides of a cleavage site are selected as windows for proteasome recognition. The optimized windows for the N- and C-termini are determined by BP-NN, wherein the optimization standard is the window with higher reliability and the shortest sequence. The prediction model from the BP-NN has high reliability that is also superior to those of previous models. The predicted CCs of BP-NN are 0.862 for N-terminus and 0.922 for C-terminus for the test dataset. The properties used in this work can build a good predictive model but cannot reflect the comprehensive structural features of cleavage sites. Therefore, more suitable parameters for characterizing cleavage sites are favorable for proteasome recognition at corresponding positions. The predicted BP-NN model can well calculate the cleavage site of the determinate and predict the novel cleavage site of the protein sequence.