Introduction

Phenotype is one of the most significant concepts in genetics (Studies et al. 2007; Lopes et al. 2013). Generally, phenotype describes all the observable characteristics of a research object (Studies et al. 2007). Covering both macroscopic and microscopic structures, phenotype is not limited to characteristics that can be "seen" by researchers; it also includes biochemical and physiological features (Wojczynski and Tiwari 2008). In contrast, genotype reflects the inner genetic characteristics of an organism, usually represented by the sequence and modification of DNA (Glatt et al. 2007). Because phenotype is affected by both genotype and environment, it is the more complicated biological concept of the two (Glatt et al. 2007; Wojczynski and Tiwari 2008).

With the development of next-generation sequencing techniques (Davey et al. 2011; Sommer et al. 2013), the genotype of a single organism can be easily detected, sequentially monitored and even predicted according to genetic rules. Phenotype, however, integrates the effects of genotype and environmental factors (Lopes et al. 2013), so phenotypic characteristics are hard to define and unknown phenotypes are even harder to predict. Over the past decades, various bioinformatics methods have been presented, providing a group of potential computational approaches to this problem. Generally, such approaches focus on either the biochemical and biophysical structures (structural features) or the functional network (functional features) of the target protein or large molecule to predict its phenotypes (Glatt et al. 2007; Wojczynski and Tiwari 2008; Lopes et al. 2013). For instance, in 2010, researchers identified the clinical phenotypes of various Fabry disease-associated proteins from their specific mutant structures (Saito et al. 2010). As early as 2007, researchers confirmed the efficacy and prediction accuracy of network-based prediction in Saccharomyces cerevisiae, providing a reliable application of functional network-based prediction (McGary et al. 2007).

According to recent publications (Jiang et al. 2016; Zitnik and Leskovec 2017), both structural and functional feature-based phenotype prediction are effective and reliable in phenotype-associated studies (McGary et al. 2007; Saito et al. 2010; Sommer et al. 2013). However, restricted by current biological techniques, it is still expensive and time-consuming to obtain sufficient structural information for large-scale sets of phenotype-associated genes/proteins. Therefore, functional feature-based phenotype prediction currently remains the most effective and accurate way to comprehensively analyze phenotypes. Different functional feature-based studies describe biological functions differently, depending on their research perspective on phenotypes. Here, we introduce two of the most widely used groups of gene/protein function descriptors, Gene Ontology (GO) (Consortium 2018) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al. 2015) terms, as new candidates for phenotypic functional features. GO describes biological functions generically, regardless of the diversity of molecular levels and the multiplicity of species, and KEGG is utilized for multi-omics biological functions and related bioinformatics research. Therefore, both feature groups can properly supply general functional descriptions for functional prediction of phenotypes. Because GO and KEGG terms have highly complicated inner structures, we also applied node2vec (Grover and Leskovec 2016) to learn new embedding features from a protein–protein interaction (PPI) network; node2vec has been described as an effective algorithmic framework (Grover and Leskovec 2016; Yan et al. 2016) for learning useful feature representations from highly structured networks (e.g. PPI networks) for downstream tasks (e.g. phenotype prediction) (Yang et al. 2019). In addition, we formulate phenotype prediction as a multi-label classification problem (Pan et al. 2019) because a gene/protein may be associated with multiple phenotypes.

In brief, we first extracted functional enrichment features from GO and KEGG, and learned functional embedding features of genes from a gene–gene network with node2vec. These fused feature representations were fed into a multi-step feature selection procedure to determine the optimal features, which were further fed into a multi-label multi-class classification model for final phenotype prediction. According to recent studies, our method has indeed identified many literature-supported genes/proteins and their associated phenotypes, providing a new computational tool for accurate and effective phenotype prediction.

Materials and methods

Datasets

We employed the proteins of the budding yeast Saccharomyces cerevisiae model organism used in one previous study (Chen et al. 2016), which were retrieved from CYGD (ftp://ftpmips.gsf.de/yeast/) (Güldener et al. 2005). After excluding proteins without sequences or phenotypic annotations in the original data, 1462 proteins were retained and investigated in this study. These proteins are assigned one or more of the following types of phenotypic annotation: (I) conditional phenotypes; (II) cell cycle defects; (III) mating and sporulation defects; (IV) auxotrophies, carbon, and nitrogen utilization defects; (V) cell morphology and organelle mutants; (VI) stress response defects; (VII) carbohydrate and lipid biosynthesis; (VIII) nucleic acid metabolism defects; (IX) sensitivity to amino acid analogs and other drugs; (X) sensitivity to antibiotics; (XI) sensitivity to immunosuppressants. The distribution of the 1462 proteins over the 11 types can be found in the previous study (Chen et al. 2016): 853 proteins were assigned exactly one type of phenotypic annotation, 374 were labeled with exactly two types, and the remaining proteins had more than two types. Accordingly, predicting protein phenotypic annotations is a multi-label multi-class classification problem.

Feature representation

GO terms and KEGG pathways are two widely used resources in bioinformatics. For each gene/protein, its relationship to GO terms and KEGG pathways can be encoded into a vector representing the protein. Here, we used enrichment scores (Carmona-Saez et al. 2007) to indicate such relationships. This way of encoding proteins/genes is quite popular (Li et al. 2013, 2019; Chen et al. 2017b, 2019). Compared with one-hot encoding, which is quite sensitive to the relationship to individual GO terms or KEGG pathways, enrichment scores are much more robust because they are continuous values. In addition, we also abstracted a given protein's relationships to other proteins via a network embedding algorithm to further represent the protein.

GO and KEGG enrichment features

Given a protein p, let Gp be the set consisting of p and its interacting proteins in STRING. Its enrichment score on one GO term or KEGG pathway was computed in the following way.

GO enrichment score

The GO enrichment score of p on a GO term GOj was defined as the − log10 of the hypergeometric test P value on Gp and the set GGO containing proteins annotated by GOj. Its calculation formula is as follows:

$$\mathrm{GES}_{j} = -\log_{10}\left(\sum_{k=m}^{n}\frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}\right),$$
(1)

where N was the total number of proteins in yeast, M was the number of proteins in GGO, n was the number of proteins in Gp, and m was the number of proteins in both Gp and GGO. In total, 5523 GO terms yielded 5523 GO enrichment scores for each protein.

KEGG enrichment score

The KEGG enrichment score of p on a KEGG pathway Pj was defined in a similar way. In detail, it was the − log10 of the hypergeometric test P value on Gp and the set Gpathway containing proteins annotated by Pj, which was computed by

$$\mathrm{PES}_{j} = -\log_{10}\left(\sum_{k=m}^{n}\frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}\right),$$
(2)

where N and n were the same as those in Eq. 1, M stood for the number of proteins in Gpathway, and m stood for the number of proteins in both Gp and Gpathway. The 106 KEGG pathways produced 106 KEGG enrichment scores for each protein.

The GO and KEGG enrichment scores were termed functional enrichment features.
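As a concrete illustration of Eqs. 1 and 2, the sketch below (assuming SciPy is available; the variable names, the example counts and the log-of-zero guard are our own additions, not part of the original pipeline) computes a single enrichment score from the four counts defined above.

```python
import numpy as np
from scipy.stats import hypergeom

def enrichment_score(N, M, n, m):
    """-log10 of the hypergeometric tail probability in Eqs. 1 and 2.

    N: total number of yeast proteins
    M: number of proteins annotated by the GO term / KEGG pathway
    n: size of G_p (the protein plus its STRING interaction partners)
    m: overlap between G_p and the annotated protein set
    """
    # SciPy's argument order is (k, population size, successes, draws),
    # so sf(m - 1, N, M, n) equals P(X >= m), the sum in Eqs. 1 and 2.
    p_value = hypergeom.sf(m - 1, N, M, n)
    return -np.log10(max(p_value, 1e-300))  # guard against log10(0)

# Illustrative call: enrichment_score(N=6418, M=120, n=35, m=12) -> one GES_j / PES_j value
```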

Embedding features learned from a protein–protein interaction network

In recent years, network embedding algorithms have been applied to tackle various biological problems (Luo et al. 2017; Zhao et al. 2019; Che et al. 2020; Zhou et al. 2020a; Zhu et al. 2021). These algorithms view a node at the system level and encode its position in the network as a numeric vector. Here, one powerful network embedding algorithm, node2vec (Grover and Leskovec 2016), was employed to encode each investigated protein.

To apply such a network embedding algorithm, a protein network was necessary. This study used the protein–protein interaction (PPI) information reported in STRING (https://string-db.org/, version 10) (von Mering et al. 2003) to construct the protein network. We downloaded the file ‘4932.protein.links.v10.0.txt.gz’, which contained all PPI information for yeast. The constructed network defined 6418 yeast proteins as nodes, and two proteins were adjacent if and only if they interact with each other. The number of edges in this network was 939,998. For convenience, the constructed protein network was denoted as Np.

node2vec (Grover and Leskovec 2016) was applied to Np to obtain the feature vector of each node in Np. It extends the Skip-gram architecture (Mikolov et al. 2013) of word2vec to networks by employing a random walk algorithm. For each node, it generates several sequences of nodes by random walks; each sequence of nodes is treated as a sentence and each node as a word. A feature vector is then produced based on word2vec. For a detailed description of node2vec, please refer to Grover and Leskovec (2016). In this study, the node2vec program was downloaded from https://snap.stanford.edu/node2vec/. Default parameters were adopted, and the dimension of the output vector was set to 500. For convenience, these features were called functional embedding features.
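The following sketch illustrates this step with the community `node2vec` Python package and networkx (an assumption chosen for illustration; the study itself used the reference implementation from the SNAP website with its default parameters). Only the 500-dimensional output is taken from the text above; the walk parameters mirror the usual node2vec defaults and the STRING identifier shown is illustrative.

```python
import gzip
import networkx as nx
from node2vec import Node2Vec  # pip install node2vec; not the SNAP script used in this study

# Build the yeast PPI network Np from the STRING link file
G = nx.Graph()
with gzip.open("4932.protein.links.v10.0.txt.gz", "rt") as handle:
    next(handle)  # skip the header line
    for line in handle:
        protein_a, protein_b, _score = line.split()
        G.add_edge(protein_a, protein_b)

# Random walks + Skip-gram; defaults except the 500-dimensional output used here
n2v = Node2Vec(G, dimensions=500, walk_length=80, num_walks=10, p=1, q=1, workers=4)
model = n2v.fit(window=10, min_count=1)

embedding = model.wv["4932.YAL010C"]  # 500-dimensional functional embedding (illustrative ID)
```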

As a result, each protein was represented by a vector collecting its functional enrichment and embedding features. In total, 6129 (= 5523 + 106 + 500) features constituted the vector for each gene/protein.

Boruta feature filtering

Boruta feature filtering quickly selects all features relevant to the output labels. Boruta is based on the random forest (RF) classifier and consists of the following steps: (1) create copies of the original data and shuffle the feature values of the copied data (called shadow features); the original and shuffled data are combined to train an RF, which measures feature importance; (2) for each feature, calculate the Z score, which is the standardized feature importance score from the RF; (3) select the maximum Z score among the shadow features (MZSF); (4) tag original features whose Z score is greater than MZSF as important and those whose Z score is smaller than MZSF as unimportant; (5) repeat the above steps until all features are tagged.

In this study, the Boruta program retrieved from https://github.com/scikit-learn-contrib/boruta_py is adopted. Default parameters are used for convenience.
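A minimal sketch of this step with boruta_py and scikit-learn is shown below; `X` and `y_single` are placeholders for the fused feature matrix and the single-label class vector described in the Results, and the RF settings are illustrative rather than the exact ones used in this study.

```python
import numpy as np
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

# X: (n_samples, 6129) fused enrichment + embedding features
# y_single: one class label per (duplicated) sample, as described in the Results
rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
selector = BorutaPy(rf, n_estimators="auto", random_state=1)  # default Boruta behaviour
selector.fit(np.asarray(X), np.asarray(y_single))

relevant_mask = selector.support_                 # True for features tagged as important
X_relevant = selector.transform(np.asarray(X))    # keep only the relevant features
```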

mRMR feature selection

The minimum redundancy maximum relevance (mRMR) method (Peng et al. 2005) is a mutual information (MI)-based method for evaluating the importance of each feature. It calculates the MI between each feature and the output labels, and between the features themselves. To indicate the importance of each feature, the mRMR method produces a ranked feature list in which important features receive high ranks. This study uses the mRMR program downloaded from http://penglab.janelia.org/proj/mRMR/, with default parameters.
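For illustration, a simplified greedy mRMR ranking (the mutual-information-difference form) is sketched below using scikit-learn's mutual information estimators; this is an assumption-laden stand-in, not the original C++ program used in this study.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_rank(X, y, n_selected):
    """Greedy mRMR (difference form): at each step pick the feature maximising
    relevance(feature, labels) minus its mean redundancy with the already selected features."""
    relevance = mutual_info_classif(X, y, random_state=0)
    selected, remaining = [], list(range(X.shape[1]))
    redundancy_sum = np.zeros(X.shape[1])
    while remaining and len(selected) < n_selected:
        if not selected:
            best = remaining[int(np.argmax(relevance[remaining]))]
        else:
            last = X[:, selected[-1]]  # update redundancy against the latest pick only
            for f in remaining:
                redundancy_sum[f] += mutual_info_regression(X[:, [f]], last, random_state=0)[0]
            scores = relevance[remaining] - redundancy_sum[remaining] / len(selected)
            best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected  # feature indices, most important first
```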

Incremental feature selection (IFS)

IFS is a feature selection procedure with an integrated supervised classifier (Liu and Setiono 1998). Based on the features ranked by mRMR, a series of feature subsets is constructed with a step interval of 1: the first feature subset contains the top 1 feature, the second contains the top 2 features, and so on. For each feature subset, a classifier is trained on the samples represented by the features in this subset, and its performance is evaluated with tenfold cross-validation (Kohavi 1995). After all generated feature subsets have been evaluated, the subset achieving the highest performance is selected as the optimal feature subset.
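A compact sketch of this loop is given below; it uses a single-label random forest and scikit-learn's tenfold cross-validation as stand-ins for the multi-label RAkEL classifiers evaluated in this study, so the names and classifier choice are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def incremental_feature_selection(X, y, ranked_features):
    """Grow the feature subset one mRMR-ranked feature at a time and keep the
    subset with the best tenfold cross-validation score."""
    best_score, best_k = -np.inf, 0
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        score = cross_val_score(RandomForestClassifier(), X[:, subset], y, cv=10).mean()
        if score > best_score:
            best_score, best_k = score, k
    return ranked_features[:best_k], best_score  # optimal feature subset and its score
```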

Multi-label multi-class classifier RAkEL

In this study, we formulate phenotype prediction as a multi-label multi-class classification problem. RAkEL (Tsoumakas et al. 2011) is a multi-label classification framework based on the label powerset (LP) approach, which it extends by breaking the initial label set into several small subsets. LP treats each combination of labels observed in the training set as a class value for single-label classification and trains one base classifier on the transformed data. However, LP cannot handle data with a large label set, leaves some classes with few training samples, and is time-intensive. RAkEL improves on LP by breaking the original label set into several label subsets and training one LP classifier per subset. To date, several multi-label classification models have been built with this method for different biological problems (Saleema et al. 2012; Weng et al. 2018; Che et al. 2020; Jia et al. 2020a; Zhou et al. 2020a, b; Zhu et al. 2021). In this study, we use the RAkEL implementation in MEKA with parameters m = 10 and k = 10, and three base classifiers are used for multi-class classification, respectively. These classifiers have wide applications in bioinformatics (Pan et al. 2010, 2021; Chen et al. 2017a; Jia et al. 2020b; Liang et al. 2020; Liu et al. 2021; Zhang et al. 2021a, b).
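To make the RAkEL idea concrete, the sketch below implements a stripped-down version of it in Python (random label subsets, each handled by a label powerset classifier, with majority voting at prediction time). It is a didactic approximation under our own simplifications, not the MEKA implementation with m = 10 and k = 10 used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def train_rakel(X, Y, m=10, k=3, base=RandomForestClassifier):
    """Train m label powerset models, each on a random subset of k labels.
    Y is an (n_samples, n_labels) binary indicator matrix."""
    n_labels = Y.shape[1]
    models = []
    for _ in range(m):
        subset = rng.choice(n_labels, size=k, replace=False)
        # Label powerset: encode each observed combination of the k labels as one class
        y_lp = np.array(["".join(map(str, row)) for row in Y[:, subset]])
        models.append((subset, base().fit(X, y_lp)))
    return models

def predict_rakel(models, X, n_labels, threshold=0.5):
    """Average the per-subset label votes and threshold them into a predicted label set."""
    votes, counts = np.zeros((X.shape[0], n_labels)), np.zeros(n_labels)
    for subset, clf in models:
        bits = np.array([[int(c) for c in s] for s in clf.predict(X)])
        votes[:, subset] += bits
        counts[subset] += 1
    return (votes / np.maximum(counts, 1) >= threshold).astype(int)
```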

IBk

IBk is a K-nearest neighbors classifier that automatically selects the value of K by cross-validation. IBk uses only specific instances, with a low storage requirement, and its main output is a concept description consisting of the stored instances and their past performance during training. IBk has three main components: (1) a similarity function, which calculates the similarity between a training instance s and the instances in the concept description; (2) a classification function, which is used to classify instance s given the instances in the concept description; (3) a concept description updater, which updates the classification performance recorded in the concept description and decides which instances should be kept.
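A rough single-label analogue of IBk (a K-nearest neighbours classifier whose K is chosen by cross-validation) can be sketched with scikit-learn as below; this is an approximation for illustration, not the Weka/MEKA IBk used in this study, and `X_train`/`y_train` are placeholders.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Pick K by tenfold cross-validation, mimicking IBk's automatic K selection
ibk_like = GridSearchCV(KNeighborsClassifier(),
                        param_grid={"n_neighbors": list(range(1, 21))},
                        cv=10)
ibk_like.fit(X_train, y_train)
print(ibk_like.best_params_["n_neighbors"])  # the selected K
```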

RF

RF is a meta classifier consisting of multiple decision trees, each grown from a bootstrap sample with a feature subset randomly selected from the original features. RF has been widely used in analyzing biological data and has demonstrated impressive performance in many studies and applications.

Support vector machine (SVM)

SVM tries to find a hyperplane with the maximum margin between two classes and can handle both linearly and non-linearly separable data. For non-linear data, it uses the kernel trick to map the original data from a low-dimensional space, where it is not linearly separable, into a higher-dimensional space where it becomes linearly separable. SVM then finds the support vectors on the margin between the two classes, and these vectors are used for classifying new samples.

SMOTE

In this work, the analyzed data were imbalanced. Thus, SMOTE (Chawla et al. 2002) is applied to produce new samples for each minority class iteratively until its sample number equals that of the majority class; the resulting balanced data help improve the construction of the classification models. We adopt the "SMOTE" tool from Weka in this work.
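The resampling step can be reproduced, for instance, with imbalanced-learn (shown below as an assumption; the study itself used the Weka "SMOTE" filter), where `X` and `y` are placeholders for the training features and class labels.

```python
from imblearn.over_sampling import SMOTE

# Oversample every minority class until it matches the size of the majority class
smote = SMOTE(random_state=0)
X_balanced, y_balanced = smote.fit_resample(X, y)
```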

Performance metrics

In this study, we train multi-label multi-class classifiers to predict the phenotypes of genes, so each gene may be predicted to have multiple phenotypes. We mainly use two metrics to measure prediction performance. One is exact match, which requires the predicted label set to be exactly the same as the true label set. The other is accuracy, which is calculated from the intersection and union of the true and predicted label sets as follows:

$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|y_{i} \cap y_{i}^{*}\right|}{\left|y_{i} \cup y_{i}^{*}\right|},$$
(3)

where yi is the true label set for sample i, \(y_{i}^{*}\) is its predicted label set, and N is the total number of samples. Evidently, the higher the exact match or accuracy, the better the performance of the classifier.

In addition, another measurement, Hamming loss, is also employed; it is computed as

$$\mathrm{Hamming\ loss} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|y_{i}\,\Delta\,y_{i}^{*}\right|}{m},$$
(4)

where m is the number of labels (m = 11 in this study) and \(\Delta\) represents the symmetric difference operation of sets.
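All three measurements can be computed for multi-label predictions with scikit-learn, as sketched below; `Y_true` and `Y_pred` are placeholder binary indicator matrices of shape (n_samples, 11).

```python
from sklearn.metrics import accuracy_score, hamming_loss, jaccard_score

exact_match = accuracy_score(Y_true, Y_pred)                 # all 11 labels must match exactly
accuracy = jaccard_score(Y_true, Y_pred, average="samples")  # Eq. 3: |intersection| / |union| per sample
h_loss = hamming_loss(Y_true, Y_pred)                        # Eq. 4: fraction of mispredicted labels
```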

Results

In this study, the essential features are extracted from GO terms, KEGG pathways and PPI network for each gene. Several advanced computational techniques are adopted to build the multi-label multi-class classification model. The whole pipeline of our analytic method is shown in Fig. 1.

Fig. 1
figure 1

Flowchart of the proposed multi-label multi-class classification models for predicting gene phenotypes

Results of Boruta and mRMR methods

We first extract the enrichment features for each gene from GO and KEGG, and use node2vec to learn the embedding features of each gene (via its coding protein). The two sources of features are combined into the final feature representation. Before applying the feature selection methods, we construct a new dataset in which each sample has only one label; for example, a sample with two labels is treated as two samples with different labels in the new dataset (see the sketch below). This new dataset is fed into Boruta feature selection to extract important features, resulting in 299 features, which are given in Supplementary Material S1. Among these 299 features, embedding features account for the majority, followed by GO enrichment features and KEGG enrichment features (see Fig. 2a). Thus, the embedding features are most relevant to the identification of gene phenotypes. Finally, the selected relevant features are fed into the mRMR method for ranking. The ranked feature list is also given in Supplementary Material S1. The rank distribution of the three feature types is illustrated in Fig. 2b. Evidently, embedding features occupy most of the high ranks, further confirming their importance.
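The single-label expansion described above can be written in a few lines; the sketch below (with illustrative names) duplicates each sample once per assigned label so that Boruta and mRMR, which expect a single-label target, can be applied.

```python
import numpy as np

def expand_multilabel(X, Y):
    """X: (n_samples, 6129) fused features; Y: binary (n_samples, n_labels) indicator matrix.
    Returns one row per (sample, label) assignment together with its single class label."""
    rows, labels = np.nonzero(Y)   # one entry per assigned label
    return X[rows], labels
```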

Fig. 2
figure 2

Analysis of the features selected by Boruta and evaluated by mRMR method on three feature types. a Number of selected features on three feature types; b Rank distribution of three feature types

Results of the IFS method

Based on the feature list obtained in section "Results of Boruta and mRMR methods", we run IFS with RAkEL using three base classifiers (IBk, RF and SVM) to detect the optimal features for distinguishing gene phenotypes. A series of feature subsets is generated for IFS. For each feature subset, we train and evaluate RAkEL on the samples represented by the features in that subset. The evaluation results, measured by accuracy, exact match and hamming loss, are listed in Supplementary Material S2. Accuracy and exact match are selected as the key measurements to assess the performance of each classifier. Accordingly, two curves are plotted for each base classifier, one for accuracy and one for exact match, as shown in Figs. 3 and 4, respectively.

Fig. 3
figure 3

Accuracy of RAkEL with three base classifiers (RF, SVM and IBk) using different number of features. The RAkEL with RF and top 217 features yields the highest accuracy of 0.5290

Fig. 4
figure 4

Exact match of RAkEL with three base classifiers (RF, SVM and IBk) using different number of features. The RAkEL with IBk and top 184 features yields the highest exact match of 0.3646

From Fig. 3, we can see that the highest accuracies for RF, IBk and SVM are 0.5290, 0.5195 and 0.4047, respectively, obtained using the top 217, 184 and 299 features. The hamming loss values of these classifiers are listed in Table 1. Clearly, RAkEL using RF as the base classifier and the top 217 features yields the highest accuracy; this classifier is deemed the optimum classifier based on accuracy. Of the 217 features used in this classifier, 123 are embedding features, 89 are GO enrichment features and five are KEGG enrichment features, as shown in Fig. 5a. Embedding features still account for the majority, followed by GO enrichment features and KEGG enrichment features. Furthermore, the rank distribution of the three feature types among these 217 features was also investigated, as shown in Fig. 5b. Similar to the features selected by Boruta and evaluated by mRMR, embedding features are more important than the other two feature types.

Table 1 Accuracy and hamming loss of RAkEL with different base classifiers
Fig. 5
figure 5

Analysis of the features used in the optimum RAkEL classifier based on accuracy. a Number of selected features on three feature types; b Rank distribution of three feature types

From Fig. 4, it can be observed that the highest exact match values for the three base classifiers are 0.3304, 0.3646 and 0.2647, respectively, obtained using the top 217, 184 and 294 features. The corresponding hamming loss values are listed in Table 2. Clearly, RAkEL using IBk as the base classifier and the top 184 features produces the highest exact match, so this classifier is deemed the optimum classifier based on exact match. Among the features used in this classifier, 110 are embedding features, 71 are GO enrichment features and three are KEGG enrichment features, as shown in Fig. 6a. Likewise, embedding features remain the most numerous. Moreover, we investigated the rank distribution of the three feature types, as shown in Fig. 6b. Again, the ranks of the embedding features are highest.

Table 2 Exact match and hamming loss of RAkEL with different base classifiers
Fig. 6
figure 6

Analysis of the features used in the optimum RAkEL classifier based on exact match. a Number of selected features on three feature types; b Rank distribution of three feature types

With the above arguments, we can build two optimum classifiers: one uses RF as the base classifier and the other adopts IBk. To examine the robustness of these two classifiers, we further evaluated their performance with tenfold cross-validation repeated 100 times. The three obtained measurements (accuracy, exact match and hamming loss) are shown in Figs. 7 and 8. It can be observed that each measurement varies within a small interval, suggesting that these two classifiers are quite stable.

Fig. 7
figure 7

Violin plot to show the performance of the optimum RAkEL classifier based on exact match under tenfold cross-validation 100 times. a Accuracy; b Exact match; c Hamming loss

Fig. 8
figure 8

Violin plot to show the performance of the optimum RAkEL classifier based on accuracy under tenfold cross-validation 100 times. a Accuracy; b Exact match; c Hamming loss

Potential novel phenotypic annotations of some genes

As mentioned above, two classifiers are proposed for predicting the phenotypes of proteins/genes. The tenfold cross-validation results of each classifier were examined in detail.

For RAkEL using IBk as the base classifier, the cross-validation results indicate that the predicted phenotypes of 1047 genes (71.61%) are all members of their true phenotypes. For each of the remaining 415 genes, the incorrectly predicted phenotype with the maximum likelihood was extracted and is provided in Supplementary Material S3. Some of these are discussed in section "IBk-based gene phenotype prediction".

As for the cross-validation results of RAkEL using RF as the base classifier, the predicted phenotypes of 944 genes (64.57%) are all correct. We likewise extracted the incorrectly predicted phenotype with the maximum likelihood for each of the remaining 518 genes, which is also available in Supplementary Material S3. Some of these are analyzed in section "RF-based gene phenotype prediction".

Discussion

As mentioned above, we encode genes with their functional annotations (GO, KEGG and PPI). Using machine learning models, we then identify the functional clustering patterns of genes. In this study, we use two base classifiers, IBk and RF, to build the multi-label classifiers. According to the prediction results, some genes are clustered into seemingly incorrect classes; however, these genes and their cluster re-assignments can be shown to be reasonable at the biological function level, supported by the Saccharomyces Genome Database (SGD) and recent publications. Such findings can help discover novel phenotypes of proteins/genes and can be further confirmed by solid experiments.

IBk-based gene phenotype prediction

When we screened the genes processed by RAkEL using IBk as the base classifier, the predicted phenotypic annotations of most genes (71.61%) were entirely consistent with their true annotations. The remaining genes are not simply clustered into incorrect clusters; rather, they are re-assigned to alternative clusters because of their functional complexity. The most likely re-assigned cluster of each of these genes was extracted, and some of them are listed in Table 3.

Table 3 Latent novel phenotypic annotations of some genes identified by RAkEL with IBk

The first gene is YAL010C, also named MDM10, which participates in the biological regulation of ERMES and the SAM complex (König 2012). According to the existing datasets, this gene would be clustered into classes 1 and 5 (conditional phenotypes; cell morphology and organelle mutants). Previous studies have already confirmed the contribution of YAL010C to conditional phenotypes and cell morphology (Sogo and Yaffe 1994). However, our computational method clustered this gene into class 4 (auxotrophies, carbon, and nitrogen utilization defects). As early as 2003, researchers confirmed that MDM10 regulates amino acid utilization in Aspergillus nidulans, another typical fungus (Koch et al. 2003). Therefore, given its biological complexity, it is quite reasonable for YAL010C to receive a different functional annotation with a new phenotype.

The next re-assigned gene is YAL035W, also named FUN12, which has been widely reported to act as a GTPase promoting Met-tRNAiMet binding (Alone et al. 2008; Kim et al. 2018). Initially, this gene would be clustered into class 1, indicating its specific biological functions in conditional phenotypes (Haruki et al. 2008). However, our method clustered this gene into class 9 (sensitivity to amino acid analogs and other drugs). As early as 2006, researchers identified this gene as a eukaryotic ribosomal complex-associated protein that interacts with certain exogenous amino acid analogs and is associated with related drug sensitivity (Fleischer et al. 2006), in agreement with our prediction. Therefore, the re-assignment of YAL035W to a different functional cluster may reflect the multi-functional capacity of this gene.

The following gene, YAL047C, has also been clustered into a different cluster compared with previous information. Acting as a receptor for the gamma-tubulin small complex, this gene has been widely reported to contribute to microtubule formation and stabilization (Luban et al. 2005), and would initially be clustered into classes 1 and 5, just like YAL010C (Corbacho et al. 2005; Nguyen et al. 2018). By contrast, in our prediction it has been clustered into class 2, which describes cell cycle defects, a newly discovered association in this work.

The following gene, YAL054C, is also known as FUN44 and ACS1 according to SGD, and has been widely reported to participate in histone acetylation-associated biological processes (Yukawa et al. 2009; Li et al. 2010a). Originally, this gene was confirmed to participate in class 5-associated biological processes (cell morphology and organelle mutants) (White 1999). In our prediction list, YAL054C has been functionally clustered into class 10, describing sensitivity to antibiotics. As early as 2003, a systems study (Palsson et al. 2003) on the composition and methods of yeast metabolism suggested that our candidate gene YAL054C may actually participate in antibiotics-associated processes. Apart from this independent study, direct evidence for the relationship between YAL054C and antibiotics-associated biological processes still awaits further validation at different molecular levels.

In addition, we observed a specific gene named YAL058W. According to recent publications, this gene participates in protein folding in the ER membrane and glycoprotein quality control (Li et al. 2010b), and might originally be classified into class 5 (cell morphology and organelle mutants) (Seeley et al. 2002). Meanwhile, according to our computational analysis, this gene has been classified into class 9 (sensitivity to amino acid analogs and other drugs). According to related publications (Caro et al. 1997; Li et al. 2010b), YAL058W has been widely reported to be sensitive to amino acids, supporting the efficacy and accuracy of our prediction.

RF-based gene phenotype prediction

Similar to the genes predicted by RAkEL using IBk as the base classifier, we also predicted accurate functional cluster assignments for various genes with RAkEL using RF as the base classifier. Among all genes, the predicted phenotypes of 944 genes (64.57%) were all members of their true functional annotations (clustering results). The remaining genes may also be re-assigned into different clusters because of the complexity of their biological functions. We likewise selected the most likely predicted cluster for each of these genes. Some of the top candidate genes in the RF re-assignments are the same as those from IBk (such as YAL010C, YAL047C, YAL035W and YAL054C), indicating the robustness of our prediction based on these machine learning models. Here, we discuss two other genes, listed in Table 4.

Table 4 Latent novel phenotypic annotations of some genes identified by RAkEL with RF

In our optimal prediction list from RF, gene YAL002W could initially be clustered into class 1 (conditional phenotypes) (Horazdovsky et al. 1996) and class 5 (cell morphology and organelle mutants) (Zhou et al. 2009). Based on RF, this gene has been re-clustered into another biological group (class 6), describing stress response defects. YAL002W, also named VPS8, has been widely reported to participate in membrane-binding processes of the CORVET complex (Peplowska et al. 2007) and to contribute to the regulation of heat stress responses in multiple species including yeast (Huisinga and Pugh 2004; Le Breton and Mayer 2016). Therefore, considering the complicated biological contribution of YAL002W, it is quite reasonable for it to receive a different phenotype cluster assignment in our prediction results.

Another gene, YAL023C, also received a different cluster assignment under the RF model. Initially, this gene could be clustered into class 5, describing cell morphology and organelle mutants, on the basis of recent publications (Karpova et al. 1998; Mouyna et al. 2010) and the SGD annotation. However, by RF prediction, YAL023C has been re-clustered into a new class (carbohydrate and lipid biosynthesis). Although the direct relationship between this candidate gene and the new phenotype has not been identified, several publications (Lussier et al. 1995; Novotná et al. 2004; Villa-García et al. 2011) confirm that YAL023C is associated with basic membrane functions of yeast.

Although both computational methods work well for predicting gene phenotypes, in this study the IBk-based method may work better and be more suitable for further application in this research field. Overall, both methods group most candidate genes into their respective functional clusters correctly. However, because of the complexity of gene functions, some genes have been re-clustered into other clusters/classes. According to the discussion above, most of these re-assigned genes indeed participate in their newly predicted phenotypes at the biological function level, validating the efficacy and accuracy of our function-based gene phenotype prediction.