1 Introduction

Vacuoles are cellular components that vary in size and shape (Zhang et al. 2014a). In plants, the vacuole is typically a single large structure involved in diverse functions, including growth and development, cellular homeostasis, maintenance of turgor, and accumulation of nutrients, ions, and secondary metabolites (Pereira et al. 2014). In seeds, the vacuole serves as the storage site for proteins and carbohydrates; vacuoles also store the flavonoids responsible for flower and fruit color and are associated with the cellular response to the environment (Grotewold 2006; Marty 1999; Park et al. 2004). Vacuole proteins function as transporters for diverse classes of ions, sugars, amino acids, and other molecules (Zhang et al. 2015). The lytic vacuole plays a significant role in the degradation of cellular waste, defence, and programmed cell death (Ibl and Stoger 2014; Shimada et al. 2018).

Once the whole genome or proteome of a plant is available, the ultimate goal is the fast and accurate functional assignment of its proteins, which depends on knowing their subcellular location. Experimental determination of subcellular localization is tedious and time-consuming; the focus is therefore on developing fast, automated computational tools for accurate prediction. Several multi-class algorithms have been developed for subcellular localization of proteins, such as BaCelLo (Pierleoni et al. 2006), MultiLoc2 (Blum et al. 2009), Plant-mPLoc (Chou and Shen 2010), PProwler 1.2 (Hawkins and Bodén 2006), Predotar v1.03 (Boden and Hawkins 2005), TargetP 1.1 (Emanuelsson et al. 2000), WoLF PSORT (Horton et al. 2007), YLoc (Briesemeister et al. 2010), pLoc-mPlant (Cheng et al. 2017), and Plant-mSubP (Sahu et al. 2019). However, none of them is specifically designed for plant vacuole proteins, and they therefore perform very poorly in predicting them. This emphasizes the need for an accurate computational model specifically trained on plant vacuole proteins. Therefore, in this study, we developed a support vector machine (SVM)-based prediction model for the classification of plant vacuole proteins that outperforms previously developed software on a blind dataset of vacuole proteins.

2 Materials and methods

2.1 Dataset preparation

The dataset used in this study was derived from the publicly available UniProtKB/Swiss-Prot database (release 3 July 2019). We searched the database with the query (taxonomy: viridiplantae, location: SL-0272, length: >50, and reviewed: Yes), removed sequences containing non-standard amino acids, and identified a total of 579 plant vacuole proteins. Developing a supervised machine-learning-based model also requires negative data. Thus, in a similar manner, we created the negative dataset with the query (taxonomy: viridiplantae, NOT location: SL-0272, length: >50, and reviewed: Yes) and identified 36,189 non-vacuole proteins from plants. To make both the vacuole and non-vacuole datasets non-redundant, we used the CD-HIT program at 40% and 60% sequence identity thresholds (Li and Godzik 2006). For the vacuole proteins, this resulted in 200 and 274 sequences at the 40% and 60% identity cut-offs, respectively. Similarly, CD-HIT yielded 9485 protein sequences at the 40% identity cut-off from the non-vacuole protein dataset. To create a balanced dataset, we randomly selected 200 proteins from the negative dataset and used them for developing the prediction model (Wei et al. 2018a). Thus, our final training dataset had 200 vacuole and 200 non-vacuole plant proteins, hereafter called the positive and negative datasets.
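The filtering and balancing steps described above can be illustrated with a minimal Python sketch. The function names, the toy identifiers, and the fixed random seed are assumptions made for illustration; the text does not state how the random selection was implemented, and the CD-HIT redundancy reduction is not reproduced here.

```python
import random

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def keep_sequence(seq, min_length=50):
    """Keep sequences longer than min_length that contain only the 20 standard residues."""
    return len(seq) > min_length and set(seq) <= STANDARD_AA

def balance_negatives(negative_ids, n_positive, seed=0):
    """Randomly draw as many negatives as there are positives (the seed is an assumption)."""
    rng = random.Random(seed)
    return rng.sample(sorted(negative_ids), n_positive)

# Toy usage with hypothetical identifiers standing in for the 9485 non-redundant negatives
negatives = [f"NEG{i:05d}" for i in range(9485)]
selected = balance_negatives(negatives, n_positive=200)
```

Fixing the seed only makes the illustrative draw repeatable; any uniform random draw without replacement matches the description in the text.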

2.2 Independent dataset

Evaluating the performance of any machine-learning-based model requires an independent dataset. To create one, we took the vacuole proteins retained at the 60% cut-off (274) but not at the 40% cut-off (200) and used these 74 proteins as the independent positive dataset. An equal number of proteins from the negative dataset that were not present in the negative training dataset formed the independent negative dataset. Thus, our independent dataset consisted of 74 vacuole and 74 non-vacuole protein sequences.
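As a rough sketch, the construction of the independent dataset amounts to two set differences. The identifiers below are hypothetical, and two points are assumptions rather than statements from the text: that the sequences kept at 40% identity are a subset of those kept at 60% (as the count 274 − 200 = 74 suggests), and that the independent negatives are drawn from the 40% non-redundant negative pool.

```python
import random

def independent_positives(ids_at_60, ids_at_40):
    """IDs retained at the 60% cut-off but absent from the 40% cut-off set."""
    return sorted(set(ids_at_60) - set(ids_at_40))

def independent_negatives(all_negative_ids, training_negative_ids, n, seed=0):
    """Draw n negatives that were never used for training (the seed is an assumption)."""
    unseen = sorted(set(all_negative_ids) - set(training_negative_ids))
    return random.Random(seed).sample(unseen, n)

# Toy identifiers that mimic the reported dataset sizes
ids_40 = [f"VAC{i:03d}" for i in range(200)]
ids_60 = ids_40 + [f"VAC{i:03d}" for i in range(200, 274)]
pos_independent = independent_positives(ids_60, ids_40)

neg_all = [f"NEG{i:05d}" for i in range(9485)]
neg_train = neg_all[:200]
neg_independent = independent_negatives(neg_all, neg_train, n=len(pos_independent))
```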

2.3 Blind dataset

A blind dataset of plant vacuole proteins was created from the cropPAL database (Hooper et al. 2016). This database contains experimentally determined (FP or MS/MS experiments) and predicted (by more than 10 software tools) subcellular locations of proteins from 12 different plants. We extracted 228 vacuole proteins that were experimentally verified by either FP or MS/MS experiments. One protein shorter than 50 amino acids was then removed, giving a final dataset of 227 vacuole proteins.

2.4 Feature calculation

Developing a machine-learning-based predictive model requires that each protein be represented as a fixed-length vector. In the past, different types of protein features, such as composition-based features, physicochemical properties, and position-specific scoring matrix (PSSM) profiles, have been used to develop robust prediction models. These features can be calculated with simple mathematical expressions. In this study, we used 7 types of composition-based features and 21 types of PSSM-based features to develop an efficient and reliable prediction model (supplementary table 1).

Table 1 Performance of composition-based models on training and independent datasets
  1. Composition-based features

    a. Amino acid composition (AAC): This is the most widely used feature in developing protein sequence-based prediction models. Here, any protein sequence is represented by a fixed-length vector of 20 values, one per amino acid. The percentage of each amino acid residue in a protein sequence is calculated as:

      $$ {\text{Percentage of amino acid}}\left( i \right) = \frac{{{\text{Total\;number\;of \;amino\;acid}}\left( {\text{i}} \right)}}{{{\text{Total\;number\;of\;amino\;acids\;in\;protein}}}} \times 100 $$
      (1)

      where i represents one of the 20 standard amino acids.

    b. Dipeptide composition (DPC): In this method, the composition of each pair of consecutive amino acids in a sequence is calculated. This gives a vector of size 400 (20 × 20) that captures partial information on the order of amino acids. It is calculated using the following formula:

      $$ {\text{Dipeptide composition}} \left( i \right) = \frac{{{\text{Total\;number\;of\;Dipeptide }}\left( i \right) }}{{{\text{Total\;number\;of\;all\;possible\;dipeptides }}}} \times 100 $$
      (2)

      where i represents one of the 400 dipeptides.

    c. Tripeptide composition (TPC): Tripeptide composition represents the percentage composition of each of the 8000 possible tripeptides formed by the 20 amino acids and is calculated as:

      $$ {\text{Tripeptide composition}} \left( i \right) = \frac{{{\text{Total\;number\;of \;Tripeptide }}\left( i \right) }}{{{\text{Total\;number\;of\;all\;possible\;Tripeptides }}}} \times 100 $$
      (3)

      where i represents one of the 8000 tripeptides.

    d. C-terminal composition: Previously, it was observed that the C-terminal region of a protein might have significant roles in biological activity, so this region can be used for a separate composition calculation. In this study, we extracted the last 5 and the last 10 amino acid residues from the C-terminal region of each protein and used them to calculate amino acid composition.

    e. Split and rest amino acid composition: Previous studies reported that important sequence motifs might be present in a specific protein region and help to improve prediction accuracy (Srinivasan et al. 2013). In the split method, the whole protein is divided into three equal parts and the composition of each part is calculated separately. In the rest composition method, the amino acid composition is calculated after removing a specified number of N- and C-terminal residues. In our case, we removed 10 residues from each of the N- and C-termini and calculated the composition of the remaining region.

  2. PSSM-based features

    A PSSM profile generated by a PSI-BLAST search captures evolutionary information and is used to identify remote homologs. It has previously been used in developing various machine-learning-based models for sequence annotation (McGuffin et al. 2000; Saha et al. 2006). The POSSUM server was used to produce PSSM profiles from a search of the UniRef50 database (three iterations, e-value 0.001), and 21 different types of PSSM-based features were calculated and used for modelling (Wang et al. 2017). POSSUM divides these feature sets into four major groups: those generated by transformation of rows, of columns, of both rows and columns, and by a combination of these features (supplementary table 1).
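The composition-based features listed under item 1 above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the denominator for di- and tripeptides is assumed to be the number of overlapping k-mers in the sequence, and the handling of sequence lengths not divisible by three in the split method is also an assumption. The PSSM-based features are not sketched because they were calculated with POSSUM.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_composition(seq, k):
    """Percentage composition of every possible k-mer (20**k values).

    Assumption: the denominator is the number of overlapping k-mers
    in the sequence, i.e. len(seq) - k + 1.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    if total <= 0:
        return [0.0] * len(kmers)
    for i in range(total):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    return [100.0 * counts[m] / total for m in kmers]

def aac(seq):
    return kmer_composition(seq, 1)   # 20 values

def dpc(seq):
    return kmer_composition(seq, 2)   # 400 values

def tpc(seq):
    return kmer_composition(seq, 3)   # 8000 values

def c_terminal_aac(seq, n):
    """AAC of the last n residues (n = 5 or 10 in the text)."""
    return aac(seq[-n:])

def split_aac(seq, parts=3):
    """AAC of each consecutive segment, concatenated (60 values for 3 parts)."""
    size = len(seq) // parts
    features = []
    for p in range(parts):
        start = p * size
        # Assumption: the last segment absorbs any remainder
        end = (p + 1) * size if p < parts - 1 else len(seq)
        features.extend(aac(seq[start:end]))
    return features

def rest_aac(seq, trim=10):
    """AAC after removing `trim` residues from each terminus."""
    return aac(seq[trim:len(seq) - trim])
```

Each function returns a fixed-length vector regardless of protein length, which is the property the feature calculation is meant to provide.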

2.5 Support Vector Machine (SVM)

SVM is a powerful machine-learning technique that has been extensively used in various bioinformatics analyses (FY et al. 2019; Boopathi et al. 2019; Manavalan et al. 2018a, 2018b; Manavalan and Lee 2017; Wei et al. 2018b). It is a reliable technique for biological sequence analysis because it can handle noise and high-dimensional feature spaces (Zavaljevski et al. 2002). SVM implementations allow users to tune various parameters for different kernels, such as the linear, polynomial, sigmoid, or radial basis function (RBF) kernels (Ramana and Gupta 2009; Ramana 2015; Mishra et al. 2014). In this study, we used the freely available software SVMlight (http://svmlight.joachims.org) to train SVM classifiers and develop prediction models.
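SVMlight reads training examples from a sparse text file in which each line has the form `<target> <index>:<value> ...`, with feature indices starting at 1 and listed in increasing order. The sketch below shows one way a feature vector could be written in that format; the helper names are hypothetical and this is not taken from the authors' pipeline.

```python
def to_svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...' with 1-based indices.

    Zero-valued features are omitted, which the sparse format permits.
    """
    target = "+1" if label > 0 else "-1"
    pairs = [f"{i}:{v:.6g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([target] + pairs)

def write_svmlight(path, labels, feature_rows):
    """Write one line per example to `path` (hypothetical helper)."""
    with open(path, "w") as handle:
        for label, row in zip(labels, feature_rows):
            handle.write(to_svmlight_line(label, row) + "\n")

example = to_svmlight_line(+1, [5.0, 0.0, 2.5])  # "+1 1:5 3:2.5"
```

A file written this way can be passed to SVMlight's `svm_learn` and `svm_classify` programs. The kernel and parameter settings used in this study are not given in this section, so none are assumed here.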

2.6 Five-fold cross-validation

A five-fold cross-validation technique was used to examine the quality of the developed models. The complete dataset was divided into five equal subsets, of which four were combined and used as the training set and the fifth was used as the test set. The process was repeated five times so that each subset was used as the test set exactly once.
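The splitting scheme can be sketched as follows. The shuffling, the seed, and the function name are assumptions for illustration; the text does not say whether the folds were stratified by class, so this sketch splits by index only.

```python
import random

def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices); each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds subsets of (near-)equal size
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        yield train, test

# With the 400 training proteins, each round trains on 320 and tests on 80
splits = list(five_fold_splits(400))
```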

2.7 Performance evaluations

To evaluate the quality of the developed models, we used metrics derived from the confusion matrix: sensitivity, specificity, accuracy, and Matthews correlation coefficient (MCC), as described previously (Dao et al. 2019). The area under the receiver operating characteristic (ROC) curve was also considered to measure the overall prediction performance.

$$ \text{Sensitivity} = \frac{TP}{TP + FN} \times 100 $$
(4)
$$ \text{Specificity} = \frac{TN}{TN + FP} \times 100 $$
(5)
$$ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \times 100 $$
(6)
$$ \text{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}} $$
(7)

Here TP, FP, TN, and FN are the true positives, false positives, true negatives, and false negatives, respectively.

3 Results

3.1 Composition-based models

We developed several composition-based models using amino acid, dipeptide, C-terminal, split, and rest amino acid compositions for the annotation of plant vacuole proteins. The AAC-based model achieved ~79% accuracy with an MCC of 0.58 on the independent dataset (table 1). The dipeptide composition-based model showed 82.43% sensitivity and 78.38% specificity, with an accuracy of 80.41%, on the independent dataset (table 1). However, the C-terminal-based C5 and C10 models performed poorly, with maximum accuracies of ~49% and 55% on the independent dataset, respectively. The models developed using split and rest amino acid composition achieved 70.95% and 75.68% accuracy on the independent dataset. As observed from table 1, the DPC-based model performed best among the composition-based models.

3.2 PSSM-based models

Based on the PSSM profiles, we developed 21 different models to evaluate the significance and performance of each PSSM-based feature. Among the row-transformed features, AAC-PSSM showed 93.24% sensitivity, 78.38% specificity, and 85.81% accuracy with an MCC of 0.72 on the independent dataset. From table 2, we observed that F-PSSM and Smooth-PSSM performed poorly, with maximum MCC values of 0.30 and 0.46 on the independent dataset, respectively. Conversely, S-PSSM and RPM-PSSM performed well among the row-transformed features, with RPM-PSSM more balanced than S-PSSM in terms of sensitivity and specificity (table 2). Similarly, among the column-transformed features, K-PSSM and TRI-PSSM performed best, with sensitivities of 90.54% and 93.24%, specificities of 82.43% and 82.43%, and accuracies of 86.49% and 87.84%, respectively. However, DPC-PSSM and TPC-PSSM did not perform as well as the others. We observed that models developed on mixed features performed better than those built on individually transformed features (table 2). As evident from table 2, K-PSSM was the best-performing balanced model in terms of accuracy and MCC among all the PSSM-based models.

Table 2 Performance of the PSSM-based models on training and independent datasets

3.3 Validation on blind dataset

The blind dataset of vacuole proteins constructed in this study was used to evaluate the performance of our best models. We considered our two best models, the DPC model and the K-PSSM model, and compared their performance with that of previously developed models. The previously developed models performed very poorly, with accuracy varying from 1.32% to 41.85% (figure 1, table 3). BaCelLo incorporates amino acid composition, sequence profile, and signal information in an SVM-based model. MultiLoc2 is based on a six-layered prediction that uses gene ontology and phylogenetic information along with sequence composition and motif analysis. Plant-mPLoc is specially designed for the classification of plant proteins using protein domain, gene ontology, and evolutionary information along with sequence composition. TargetP is based on N- and C-terminal signal sequences, while WoLF PSORT additionally uses composition and functional motifs. Recently, Plant-mSubP was developed for the classification of single- as well as dual-label plant proteins; it shows its best performance with a hybrid model (PseAAC-NCC-DIPEP) composed of pseudo-amino acid composition, N-terminal signal, and dipeptide composition. Furthermore, most of the well-performing models predicted multiple locations rather than a single location, as evident from the cropPAL database. This suggests that these models have not captured sufficient features of plant vacuole proteins. In contrast, our DPC and K-PSSM models correctly classified 136 and 143 proteins, with accuracies of 59.91% and 62.99%, respectively (figure 1). This higher accuracy relative to the existing tools indicates the applicability of our models.
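Because the blind dataset contains only vacuole proteins (227 positives and no negatives), accuracy on it equals the percentage of proteins correctly assigned to the vacuole. As a worked check of the reported values:

$$ \text{Accuracy}_{\text{DPC}} = \frac{136}{227} \times 100 \approx 59.91, \qquad \text{Accuracy}_{\text{K-PSSM}} = \frac{143}{227} \times 100 \approx 62.996 $$

The second value matches the reported 62.99% when truncated to two decimal places.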

Figure 1 Performance of different software on the blind dataset.

Table 3 Benchmarking of different software on blind datasets

3.4 Software

Based on our study, we developed the GUI-based software ‘VacPred’, which is compatible with different operating systems (figure 2). We incorporated our two best algorithms, the DPC model and the K-PSSM model, for the prediction of plant vacuole proteins. To run the DPC-based prediction, users only need a protein sequence file in FASTA format, with no limit on file size or number of sequences. The K-PSSM-based model relies on features calculated with the POSSUM software; users must therefore first calculate the K-PSSM features with POSSUM and then supply its output file directly as input for the prediction of plant vacuole proteins. VacPred was developed with the Node.js-based Electron framework and uses JavaScript and Perl in the backend. It is accessible at www.deepaklab.com/vacpred.

Figure 2 The homepage and prediction result of the VacPred software.

4 Discussion

Next-generation sequencing has enabled the completion of numerous genome and transcriptome projects, and many more are in progress. Genome annotation, including subcellular localization, is one of the most crucial steps of any genome sequencing project because it sheds light on protein structure and function. Among the various cellular organelles, the plant vacuole is one of the most important components of plant cells and performs diverse functions (Zhang et al. 2014a, b; Pereira et al. 2014). The experimental identification of plant vacuole proteins is time-consuming and costly and requires sophisticated instruments and manpower.

To overcome this, machine-learning-based computational methods have emerged as a highly efficient and less expensive means of sequence annotation. Our analysis showed that none of the previously developed models could predict plant vacuole proteins with high accuracy. We applied machine-learning-based techniques and developed more than 30 different types of models. In the end, we selected the two best-performing models: one dipeptide composition-based model and one PSSM-based model. The two models showed similar performance on the blind dataset, with ~60% accuracy for the DPC-based model and ~63% for the K-PSSM-based model. Based on this analysis, we developed the standalone GUI software ‘VacPred’, which will be useful for predicting plant vacuole proteins in large-scale annotation projects.