Introduction

Glutathione S-Transferases (GSTs) are known as detoxification enzymes, which detoxify reduced form of glutathione toxenobiotic substrates (Dasari et al. 2018). However, these proteins have other cellular defense functions; such as a metabolic roles in the biosynthesis of steroids, prostaglandins (Landi 2000), aromatic amino acid degradation (Kilty et al. 1998), apoptosis regulation, and stress signaling (Tew and Ronai 1999).

GSTs are multifunctional, heterogeneous superfamily of proteins that catalyze the conjugation of GSH by a sulfhydryl group (–SH) to electrophilic centers and can catalyze different reactions with their distinct substrates. GST are divided into eight subgroups based on their amino acid sequences, isoelectric points and immunological properties; including alpha, mu, phi, theta, kappa, zeta, sigma, and omega. Most of GSTs contain homodimers with 23–30 kDa subunits. Each GST has two typical domains; one conserved N-terminal, a glutathione-specific binding domain (G-site), and a variable C-terminal domain which binds to the hydrophobic substrate (C-site) (Schultz and Sylvester 2001). GSTs are available in a variety of livings organisms, from bacteria to homo sapiens (Di Matteo et al. 2019). Plant GSTs are six classes of a multifunctional heterogeneous superfamily that perform a wide range of non-enzymatic and pivotal catalytic functions (Sylvestre-Gonon et al. 2019). The earliest evidence for GSTs in mammalians came from the discovery of GSTs in rat liver, early 1960s. They are classified into three different groups; including membrane-bound microsomal, mitochondrial and cytoplasmic. Mammalian GSTs have various activities; from detoxification of reactive electrophiles to cell signaling (Landi 2000).

Fungal GSTs have been discovered in a number of fungal species. Some GSTs could play critical roles in fungal necrotrophs protection against reactive oxygen species and plant-derived toxic metabolites that aggregate on the host-pathogen interface during infection. In addition, GSTs could bind wood-derived molecules and are able to participate in the intracellular transport and further omission of these compounds, which could be toxic for the cells (Kato et al. 2004). However, the biological role and importance of GTSs are not yet completely understood.

Since the implementation of the human genome project in 2000, and benefited from the next-generation sequencing technology, a huge number of bioinformatics data have been accumulated, which is far beyond the capacity of classical molecular analysis (Roberts et al. 2006). Bioinformatics technology can provide a practical and time-saving approach to answer critical research questions and find a correlation between data. Formulating the sequence of biological samples; such as protein or RNA, effectively with a separate model that can considerably keep its sequence order information, is one of the most difficult challenges in computational biology.

Nowadays, different aspects of proteins are predicted by various approaches. Some of these methods are based on amino acid composition (Lee et al. 2006); sequence (Xia et al. 2010), and template (Chen and Kihara 2011; Haghighi et al. 2019). Pseudo amino acid composition is a typical discrete form that has been widely used for protein function prediction, which presents information based on the protein sequence. The pseudo amino acid (PseAA) composition can provide a lot more information about a protein sequence compared with the conventional amino acid (AA) composition (Chou 2019; Chou et al. 2019; Tian et al. 2019; Yu et al. 2017).

Machine learning is a method for data analysis that automates analytical model making (Chen et al. 2020). This branch of technology allows the systems to learn from data, recognize patterns and make decisions with minimal human intervention (Galetsi et al. 2020). Machine learning algorithms enable the system to learn from data and make a prediction based on those information (Chakrabarti et al. 2006). Support-vector machine (SVM) is a novel kernel-based machine learning which is useful for biological researches (Schölkopf et al. 2002). Random Forest (RF) is another learning method for classification which operates by building a multitude of decision tree at training time and outputting the classification or prediction (Kam 1995). Covariance Discrimination (CD) and optimized evidence-theoretic K-nearest neighbor (OET-KNN) are two other strong learning methods which are used in the classification of biological data (Shen and Chou 2005). In the present investigation, we aimed to predict and classify GSTs by the concept of Chou’s PseAAC. SVM, RF, CD, and OET-KNN that were used as classifiers algorithms to predict GSTs. It has been proposed that five-steps rule of Chou is a notable achievement in the field of molecular biology (Chou 2020). This method has been used in many different studies such as predicting remote homology proteins through merging grey incidence analysis and domain similarity investigation (Lin et al. 2020).

In order to make an appropriate biological prediction, the “Chou’s 5-steps rule” (Chou 2011) could be used which includes (to see the example, please refer to reference (Chou 2019) of this study); first, selection or building up an appropriate benchmark dataset for testing and training of the prediction. The second step is providing the samples with efficient formulation which could mirror their inherent interaction with the objective of prediction. The third level of Chou’s five step is presenting a strong algorithm for conducting the prediction process, and forth one is appropriate performance of the cross-validation assays to measure the prediction’s accuracy. The final step in the mentioned method is establishing an easy to use web-based tool for the predictor with public access. Articles written based on this usually provide an easily understood logic of development, high transparency in work, the results could be easily repeated by the other researchers, they could prompt the rest of approaches for analyzing the sequences, and provide a comfortable approach for a large group of experimental researchers.

Methodology

Datasets Preparation for Analyzing GSTs Pseudo Amino Acids Composition

Three primary datasets consisting 671 fungal protein sequences, 586 animal protein sequences and 703 plant protein sequences of GSTs, as the positive dataset and 638, 829 and 681 sequences of non-GSTs proteins, respectively as the negative fungal, animal and plant datasets were fetched. These sequences were obtained from the National Center for Biotechnology Information (NCBI) database (http://www.ncbi.nlm.nih.gov/protein). To increase the accuracy of prediction, similar and fragmental sequences were not chosen. To avoid intolerance, the sequences with more than 90% similarity were excluded from our datasets, using CD-Hit tool (Li and Godzik 2006). Final positive fungal, animal and plant positive datasets were including 584, 513 and 541 sequences, respectively. Also, negative dataset sequences for fungal, animal and plant non-GSTs proteins were decreased to 523, 576 and 501, respectively.

Generating Chou’s PseAAC for GSTs Protein

In order to operate datasets for data mining approaches, proteins were presented numerically using the PseAAC web server (Shen and Chou 2008). Kou-Chen Chou introduced the concept of PseAAC in 2001 which represents the protein attributes according to the sequence-order information, enabling to provide a comprehensive combination with other properties to perform a reliable classification (Chou and Cai 2003). First 20 numbers reflect the amino acid composition and additional numbers demonstrate sequence-order information. The PseAAC web server has the following six amino acid characters: hydrophilicity, hydrophobicity, mass, pK2, pK1 and pI. Total 63 combinations of six characters were used. For the present study, type 1 PseAAC, which is also called the parallel correlation type was used. In order to provide the most reliable results, the weight factor was set on different figures from 0.1 to 1.0 and ƛ parameter was set on the valid range of numbers from 1 to 10.

Classification of GSTs by Machine Learning Algorithms

Nowadays, a practical way to explore patterns in biological data is to employ machine learning algorithms. Machine learning is a sub-branch of computer science analyzing assessment of algorithms that can be used for prediction and classification of data based on the models learned from sample input (Li et al. 2018). In machine learning algorithms, a training data is needed in order to predict without being programmed to perform the task. There are two main categories in the field of machine learning which are supervised and unsupervised learning. Supervised learning utilizes a dataset whose features are characterized. In unsupervised learning, no labels are assigned to the algorithm and the goal of this task is to explore existing data and identify the similarities. In this study, the data were assigned by known attributes; therefore, supervised learning was applied (Snoek et al. 2012).

There are many algorithms which have satisfying results for biological data. In the present study, Rapid Miner software (version 7.2) was used for validation and classification of data (Snoek et al. 2012). The following algorithms incorporated in the software were selected for GST classification:

Support Vector Machine (SVM) Model

SVMs are machine learning algorithms that analyze data used for regression and classification analysis. The goal of SVM algorithm is to find the most appropriate classification function to differentiate between members of two classes in a training dataset (Suykens and Vandewalle 1999).

Random Forest (RF) Model

RFs were initially introduced by Breiman in 2001 (Breiman 2001). This algorithm adds extra randomness layers to the bagging. RF is an ensemble learning method for regression and classification. RFs operate tasks that construct a multitude of decision trees at training time and output classification or prediction of the individual trees. In comparison with individual classification trees, RFs can highly improve the accuracy of the predictions (Pal 2005).

Covariance Discrimination (CD) Algorithm

Covariance matrix uses a standardized scale of − 1 to + 1 to achieve the strength of a relationship between two variables. Covariance is a measure of how the changes in one variable are connected with changes in the second one. Specifically, covariance calculates the degree to which two variables are linearly associated (Zou et al. 2020).

OET-KNN Algorithm

The optimized evidence-theoretic K-nearest neighbor (OET-KNN) which is a non-parametric learning algorithm for data classification (Ghosh et al. 2020), was used to separate the data related to GST proteins into different classes, in order to predict the classification and regression of a new sample dataset.

Results

In this study, parameters of sensitivity (SEN), specificity (SPC), accuracy (ACC) and Matthew’s correlation coefficient (MCC) were used for the evaluation of classifiers’ performances, computed based on the following equations:

$${\text{ACC}}\,{\text{ = }}\,{\text{(TP}}\,{\text{ + }}\,{\text{TN) / (TP}}\,{\text{ + }}\,{\text{TN}}\,{\text{ + }}\,{\text{FP}}\,{\text{ + }}\,{\text{FN)}}$$
(1)
$${\text{SPC}}\, = \,{\text{TN}}/\left( {{\text{TN}}\, + \,{\text{FP}}} \right)$$
(2)
$${\text{SEN}}\, = \,{\text{TP}}/\left( {{\text{TP}}\, + \,{\text{FN}}} \right)$$
(3)
$${\text{MCC}}\, = \,[\left( {{\text{TP}}*{\text{TN}}} \right)\,{-}\,\left( {{\text{FP}}*{\text{FN}}} \right)/\sqrt {\left( {{\text{TP}}\, + \,{\text{FP}}} \right)\left( {{\text{TP}}\, + \,{\text{FN}}} \right)\left( {{\text{TN}}\, + \,{\text{FP}}} \right)\left( {{\text{TN}}\, + \,{\text{FN}}} \right)}$$
(4)

TP, TN, FP, and FN are the numbers of true positives (GSTs predicted as GSTs), true negatives (non- GSTs predicted as non- GSTs), false positives (non- GSTs predicted as GSTs) and false negatives (GSTs predicted as non- GSTs), respectively. The overall performance of the method was obtained by averaging the above parameters for all classifiers (SEN, SPEC, ACC, and MCC). Accuracy (ACC) is the number of those that have been classified correctly. Specificity (SPC), known as true negative rate, calculates the ability to predict negative data. Therefore, results with high SPC are reliable. High SEN of a classifier indicates negative predicted results that are reliable. MCC or Matthew’s correlation coefficient is used for the evaluation of binary classification and is a number between − 1 and + 1, where + 1 indicates an ideal classification, 0 represents random classification and − 1 means complete dissent between prediction and observation. When values of these four parameters are more than 70–80%, the prediction is believed to be highly accurate. Performance values of classifiers were optimized and best results for each classifier were reported. Along with them, ƛ and W parameters of machine learning were taken into consideration for classification’s efficacy of each dataset.

Every algorithm was trained with the dataset individually. This validation technique divides the dataset into training and testing sets. A variety of rounds of validation are applied and the results are averaged. Using cross-validation, a more accurate evaluation of the performance of model prediction was achieved. Algorithms with the best results were selected for further validation with the independent test dataset.

Optimized results provided by various machine learning algorithm for animal, plant and fungal sources are provided in Tables 1, 2 and 3. The highest values for each parameter is indicated in bold. All four machine learning algorithms had a good total accuracy higher than 0.62, certifying the provided results. According to the results provided at this section, RFs demonstrated the best prediction for animal GSTs with 0.9339 accuracy and SVM showed the best results for fungal and plant GSTs with 0.8982 and 0.9655 accuracies, respectively.

Table 1 The performance of various machine learning algorithms for animal GSTs classification
Table 2 The performance of various machine learning algorithms for plant GSTs prediction
Table 3 The performance of various machine learning algorithms for fungal GSTs prediction

Machine Learning Based Classification for Animal GSTs

Random forest algorithm presented the highest values for the classification of animal cell derived GST enzyme including; ACC, MCC, and SEN. The highest SPC was presented by SVM algorithm, along with second-high ACC, MCC, and SEN. The ƛ was variable among different models, with a value of eight for SVM, and the lowest value of one for the CD algorithm. From the perspective of W parameter, the CD algorithm had the highest value, equal to 0.5, and all the others showed values equal to 0.1.

Among the used algorithms, CD showed the lowest values for ACC (0.79), MCC (0.60), and SPC (0.74). The OET-KNN algorithm demonstrated the lowest SEN of all the algorithms with a value of 0.81. Table 1 provides the detail of all the results provided by each algorithm, which could further interest the readers.

Machine Learning Based Classification for Plant GSTs

For the classification of the GST enzyme of the plants, SVM presented the highest values for ACC (0.96), MCC (0.93), and SPC (0.97). After SVM in the second rank, the Random forest showed the highest ACC, and MCC values. The highest sensitivity was for the CD model (0.9778) and the OET-KNN indicated the second-high specificity for plant GST classification.

The W value was the same for all the algorithms, equal to 0.1. From the ƛ parameter point of view, the Random forest had the highest value (eight) and the CD showed the lowest one with a value equal to one. The lowest ACC, MCC, and SPC value for classification of plant GST was presented by CD algorithm, and similar to the animal GST classification, the KNN showed the lowest SEN value. The detailed results ofant derived GSTs are provided in Table 2.

Machine Learning Based Classification for Fungal GST

The SVM model provided the highest values in accuracy (0.89), specificity (0.92), and Matthew’s correlation coefficient (0.79) for classification of fungal GST enzyme, but the highest sensitivity was provided by the CD model (0.99). The second-high ACC (0.88), and MCC (0.76) and SEN (0.92) values were provided by the RF algorithm. Similar to plant classification, the second-high SPC (0.91) was provided by the OET-KNN model. The w parameter was the same among all the chosen models, and it was equal to 0.1. The ƛ parameter had different values in each model, having the highest (nine) for the SVM, and the lowest one was presented for OET-KEN model (two). Among the used algorithms, the lowest value for ACC (0.62), MCC (0.34), and SPC (0.21) belonged to CD model, and the lowest sensitivity was demonstrated by OET-KNN algorithm (0.64). The detailed results for the performed prediction of the fungal GST are provided in Table 3.

Discussion

Application of data mining methods for in silico studies has been evaluated by earlier investigations and has proven to possess a high potential for resolving the issues of old fashioned methods for categorizing and classification of biological data such as protein function domain inspection, function motif investigation, and protein function inference (Raza 2012). The importance of machine learning algorithms for data classification is beyond any question and many different area of science, including molecular biology have been using these methods for classification of the large datasets for various molecules such as enzymes (Gupta et al. 2019).

GSTs are detoxification enzyme with a wide range of known activities, available in various of living organisms such as fungal, animal and bacteria. Because of the importance of GSTs, it is of critical importance to predict and classify its different subtypes from different organisms; therefore, information about the efficacy of different algorithms of machine learning will be useful for further investigation about this enzyme and might be used for designing a server for such data classification. In the previous reports, machine learning based methods have been used for enzymes classification and clustering (Yadav and Tiwari 2015). At the current investigation, we applied machine learning models including OET-KNN, Random Forests, SVM, and Covariance Discrimination for the PseAAC data analyzing and interpretation.

Due to the large amount of data on protein sequences, investigating the function of these sequences is necessary. Most of the current methods have been developed to predict the function of proteins are based on alignment and similarity of the sequences. These methods are error-prone in the absence of significant similarity and variation in the size of queried and the target protein and most of the existing machine-learning algorithms can only handle vectors as elaborated in a comprehensive review (Chou 2015). One of the most reliable approaches for protein classification is Chou’s PseAAC (Du et al. 2014). To prevent the complete loss of sequence-pattern information for proteins, the pseudo amino acid composition (Chou 2001) or PseAAC (Chou 2005) has been developed. Upon introduction of Chou’s PseAAC, this method has been used in various fields of computational proteomics (please refer provided examples in reports (Behbahani et al. 2019; Chou 2017; Dehzangi et al. 2015)).

Due to a large application demand, four strong and free access tools as BPseAAC^, BPseAAC-Builder^, Bpropy^, and BPseAAC-General^, are provided by the developer (Cao et al. 2013; Du et al. 2014; Du et al. 2012; Shen and Chou 2008). The initial three programs are for creating various modes of Chou’s exclusive PseAAC (Chou 2009), such as “Functional Domain” mode (see Eqs. 9–10 of (Chou 2011)), “Gene Ontology” mode (see Eqs. 11–12 of (Chou 2011)), and “Sequential Evolution” or “PSSM” mode (see Eqs. 13–14 of (Chou 2011)), though the BPseAAC-General^ is useful for Chou’s general PseAAC (Chou 2011).

This method has been used in many protein prediction studies such as prediction of GABA receptor (Mohabatkar et al. 2011), protein cellular attributes (Chou 2001), cyclin proteins (Mohabatkar 2010), risk type of human papillomaviruses (Esmaeili et al. 2010), outer membrane proteins (Lin 2008) and secondary structure content (Chen et al. 2009). Due to the successful achievements of using PseAAC for protein/peptide sequence processing, PseKNC (Pseudo K-tuple Nucleotide Composition) (Chen et al. 2014) has been advanced to create plenty of feature vectors for DNA/RNA sequences (Chen et al. 2015, Liu et al. 2018) which have proven to be very practical. These methods could be applied for providing any desired feature vectors for protein/peptide and DNA/RNA sequences in adjustment with the requirements of studies. Particularly, in 2015 a very powerful web-server called ‘Pse-in-One’ (Liu et al. 2015) and its updated version ‘Pse-in-One2.0’ (Liu et al. 2017) have been established that can be used to generate any desired feature vectors for protein/peptide and DNA/RNA sequences according to the need of users’ studies.

Although GSTs have a crucial role in survival from cancer and other diseases (Allocati et al. 2018), there is few reports on approaches for predicting the GSTs proteins. One of the most highlighted researches in this field was the study by Nitish Kumar Mishra et al. who initially used a SVM based method has been developed using amino acid and dipeptide composition and achieved the maximum accuracy of 91.59% and 95.79%, respectively. In their study, they designed a SVM-based method using tripeptide composition and achieved maximum accuracy 97.66%, which is showed to be higher than the accuracy achieved by HMM based searching (96.26%). Based on the results of their study, they developed a web-server named GSTPred (http://www.imtech.res.in/raghava/gstpred/) (Mishra et al. 2007). The results provided by the current machine learning study also showed that SVM could provide high scores of ACC, SPC and MCC for the classification of fungal and plant GSTs, but the RF algorithm showed highest scores of ACC, MCC, and SEN for the classification of animal GSTs. It is also necessary to mention that Covariance Discrimination model could provide a high sensitivity for classification of fungal and plant GST.

There is no pre-existing server available for predicting different types of GSTs. Here our results demonstrate that using the concept of Chou’s PseAAC and different machine learning algorithms is a successful method in predicting different types of GSTs. Next step of this study could be designing a web server to predict animal, fungal and bacterial GSTs proteins. In the present investigation, a mixture of amino acid composition and four different machine learning algorithms for distinguishing fungal, animal and bacterial GSTs were applied. SVM method shows the highest accuracy for plant and fungal GSTs (96.55% and 89.82%, respectively). Also, RF shows 93.39% accuracy for the animal GSTs.

As Chou et al. stated in their report (Chen et al. 2016; Liu et al. 2016), user-friendly and free access web programs prepare guidance for future bioinformatics investigations (examples provided at (Chouet al. 2019; Xiao et al. 2019)). Indeed, they have greatly increased the impact of in silico methods on medical science (Chou 2015, 2019), which has provided a significant breakthrough in medical sciences (Chou 2017). Using these valuable insights, we wish to use the provided results of the current study to continue this investigation and design an adjustable program for such a data classification by application of the mentioned algorithms.

Conclusion

In conclusion, due to the importance of GSTs, prediction of fungal, plant and animal GSTs was performed using the concept of Chou’s PseAAC. Results indicated high accuracy using different machine learning algorithms for classifying this enzyme by their sequences and shed light on the variation among different models. Such predictions and classifications are critically important before choosing the best source of the enzyme for specific applications.