Keywords

1 Introduction

Aqueous solubility prediction is one of the challenging tasks among the drug discovery process and other applications. It can be used to determine the drug-likeness of any compound and also important for prediction of ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties [1, 2]. Human blood consists of 80% of water so, the compounds with low water solubility having low absorption rate. Therefore, prediction of aqueous solubility in an early phase of drug discovery and development process can help to re- duce the time and cost factors. It also eliminates the molecules, which have very low aqueous solubility for reducing the risk of failure. Most of these are QSAR (Quantitative-Structure-Activity Relationship) methods where the activity of any molecule generate by mapping and encoding the structure of that molecule. Structures encode by undirected graph representation and mapping to specific length feature vectors. Chemical compounds can also be represented using SMILES notations, which is a standard representation and used to extract other information about that compound from chemical databases [3]. It is imperative to find a suitable regression or classification model to learn the input training samples. There are various computational models have been applied to solve chemoinformatics problems [4]. Literature reflects significant work in the area of multiple properties prediction. Literature reflects significant work in the area of properties prediction like Quantitative Structure-Property Relationships Models (QSPR) [5, 6], Use of Deep learning for a molecular graph representation of drug-like molecules [7], the role of solubility oral absorption prediction using a decision tree [8], Random forest model [9], theoretical models [10, 11], kinetic and intrinsic solubility prediction by potentiometric [12], multi-linear regression model, theoretical physics with machine learning models [13]. In this study, the aqueous solubility of molecules has predicted using six regression models. Here, the main aim to improve the performance of the predictive model. For this purpose, Data samples have clustered before applying to the predictive regression model. All molecules have represented in terms of molecular descriptors, which are the inputs for the predictive model. The descriptions of all methodologies are in Sect. 2. Aqueous solubility has predicted using the proposed model and also without clustering approach. Statistical performance measures have used to compare the simulation results of existing regression models with clustering approach.

2 Materials and Methods

2.1 Dataset

In this study, we have used two datasets; one of them is a small dataset of the only drug-like molecule, and another is a relatively large dataset with mixed data on drugs and non-drugs molecules. The details of each dataset have given in Table 1. We have collected SDF format, and molecular descriptors have extracted using the freely available PaDEL software.

Table 1. Dataset details

2.2 Physicochemical Properties

All the molecules have collected in SDF format from the previous literature and manually from databases. Molecular descriptors can be estimated using various online server or using the software. In this study, we have used freely available PaDEL software for physicochemical properties calculation. Initially, we have taken 1185 features and after applying feature reduction methods the reduced feature set is 19. The reduced features, which have used for this study are XLogP, Hydrogen bond acceptor, Hydrogen bond donor, HybRatio, TopoPSA, AlogP, MlogP, Lipoaffinity index, Hydrogen atom count, Oxygen count, CrippenLogP, bpol, molecular weight, Mcgowan Volume, Kappa Indices, CrippenMR, FracC, Ring count, Rotatable bond count.

2.3 Feature Selection

Chemo-informatics deals with drug discovery process which involved a huge number of compounds with different types of chemical properties. So it is very important to reduce feature space to enhance the prediction performance Feature reduction provides faster and cost-effective predictive algorithms and it also helps to understand the underlying process that generated the data which is a very important step in machine learning process. In order to reduce the no of irrelevant features in this study, relief and Pearson correlation methods have used for feature selection.

Relief:

Relief is a feature selection algorithm which uses a statistical method. It can handle both discrete and continuous type features. It runs in low order polynomial time and independent to heuristic search. It takes linear time and effective when the instances are large in number. It computes weights and ranking of all the features for any input data samples using the target vector. This is effective for classification and also for the regression problem. The feature weight value is ranging from −1 to +1 [16, 17].

Pearson Correlation:

Pearson Correlation is a basic feature reduction technique which is used to measure the linear dependency or correlation between two features [18]. The outcome of this method lies in the range of −1 to +1, both values are inclusive. Where +1 is for the positive correlation, −1 is for negative correlation and 0 is no correlation. The correlation is calculated by the following equation:

$$ \upgamma = \frac{{\sum {XY - \frac{(\sum X )*(\sum Y )}{n}} }}{{\sqrt {(\sum {X^{2} } - \frac{{(\sum {X)^{2} } }}{n})*(\sum {Y^{2} - \frac{{(\sum Y )^{2} }}{n}} )} }} $$
(1)

Where X and Y are the features between which correlation is Calculated and n is the number of samples.

2.4 Clustering

Cluster analysis or clustering approach is based on an unsupervised learning approach in which the grouping of data samples takes place according to their similarity. Data samples in each group are more similar to the data sample of the same group as compare to other groups. Clustering is most important to visualize the data effectively. There are various types of clustering algorithms like connectivity models, centroid models, distributive models, density models etc. according to their application in the literature, K-Means algorithm is one of the popular centroid models clustering techniques.

K-Means Clustering Algorithm:

It is based on an iterative process where the main aim of each iteration is to find local maxima. K-Means algorithm [18, 19] can be described by following steps:

  • Step 1: Specify a desirable number of clusters.

  • Step 2: Assign data points randomly in each cluster.

  • Step 3: Compute centroids of each cluster.

  • Step 4: According to closest cluster centroid reassign each data points.

  • Step 5: Recompute cluster centroids.

  • Step 6: Repeat steps 4 and 5 or terminates if no improvement possible and reach global optima.

2.5 Regression Analysis

A regression analysis widely used statistical method in which relationship among the variables estimated. It is used for prediction or forecasting so it also comes under as machine learning approaches. It helps to understand that what the effect on the dependent variable is when values of the independent variable are changed. In this method, problem is to find a variable or set of variables which are significant predictors of the required criterion variable. It may depend on a single independent variable or combination of a set of independent variables. There are various types of the regression model in the literature but in this study, we have used Gaussian Progress Regression, Multi-Linear Regression, Neural Net Regression, Ensemble Regression, Random Forest Regression, SVM Regression predictive models for comparative study [20,21,22,23]. All the given regression models have been used with and without cluster-based model for comparative study.

2.6 Statistical Performance Measures

In this study, to compare the result and performance Analysis of prediction model we have used Root Mean Square Error (RMSE) as Statistical performance matrices:

$$ RMSE = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^{n} {(Y_{i}^{Actual} - Y_{i}^{Predicted} )^{2} } } $$
(2)

3 Proposed Model

Aqueous solubility or other physiochemical properties have been predicted using various regression models in the previous studies. In this study, the main aim of the proposed model to increase the performance of the prediction model by clustering the data samples. The steps of the proposed model have been described in Fig. 1. So for this, in the first step, we have taken the solubility dataset which has been described in the previous section. All the data samples are represented in terms of molecular descriptors. Few molecular descriptors have been selected from the huge feature space for this study. For the selection of relevant features, there are various techniques like Filter, wrapper or subset selection. For this model, combination of relief and Pearson correlation method has been used as a feature selection method. Then data samples have been clustered according to their similarity. Clustering is important to visualize the data according to similarity and dissimilarity among them. Data samples which are more similar they belong to the same group or cluster. Regression analysis is a set of statistical processes in which the value of the single variable is estimated in terms of other variables which is represented as the equation. Mapping all data in a single equation is very difficult for a large sample set. When the size of the data set is increased then mean absolute error is also increased. In regression analysis error is calculated in term of difference in actual and predicted value. So to reduce the error of predictive model data samples have been clustered before prediction in the proposed model. After clustering the data samples regression model is implemented for each cluster. The aqueous solubility of each new sample has been predicted after assigning to a specific cluster and the target value is calculated according to the cluster equation. This step reduces the overall mean absolute error which improves the performance of the predictive model. It eliminates the limitation of a single regression model. In this model, K-Means have been used as a clustering technique and for prediction, various regression models have been used like Gaussian regression, multi-linear, neural net regression, ensemble regression, random forest regression, SVM regression. Using K-Means clustering data has divided into more than one group. After clustering for each group we have trained the model and finding different regression equation. For validation, the predicted value is calculated for each equation and finds the average and compares the result with actual log S value. In the last step for performance analysis, root mean square error (RMSE) has used as statistical performance matrices and compare the performance of other given regression model with the proposed cluster-based regression predictive model. The outcomes of this model have been described in the next section.

Fig. 1.
figure 1

Proposed cluster-based regression model

4 Result and Discussion

Aqueous solubility dataset 1 and dataset 2 collected from the resources was in SDF form, so features have extracted for this using PaDEL software. There is a huge set of physiochemical properties to represent any compound, so it is also a challenging task to select the properties which give more accurate values of other property. In this study, for feature selection, we have used relief feature selection and after that Pearson Correlation i.e. a combination of both methods have used. Relief is an effective model which is used to select best-ranked features and Pearson correlation has been used to eliminate redundancy of the feature set and also used to select most relevant features which are highly correlated with the target. So the combination of both techniques performs better with given predictive models. For prediction purpose, we have used 6 different regression models and compared the performance of each model with and without a proposed cluster-based model. All predictive models have trained and tested using MATLAB and WEKA software. The training and testing phase results for each dataset has represented in tabular form Table 2, Table 3 respectively.

Table 2. Training and Test phase Phase results for Dataset 1 with and without a cluster-based model
Table 3. Training and Test phase Phase result for Dataset 2 with and without a cluster-based model

Table 2 is the result summary of dataset 1 training and test phase in which, we have got the best result with cluster-based random forest model. For dataset 1 best outcome RMSE, 0.60 is with using two cluster centers. Due to small data samples by increasing number of centers performance also decreases. Figure 2 and Fig. 3 is regression plot of training phase and test phase random forest model for dataset 1 respectively. Due to biased and small dataset of only drug-like molecules the result is not significantly improved using cluster based model.

Fig. 2.
figure 2

Regression plot of cluster-based Random Forest model dataset 1 training phase

Fig. 3.
figure 3

Regression plot of cluster-based Random Forest model dataset 1 test phase

Therefore we have also taken another relatively large dataset of mixed molecules and compared the result. The original work has been done using dataset 2 and dataset 1 has taken only for comparative study for different sample size dataset. In Table 2, we can see that in the training phase error is less as compare to Table 3 but test result has not improved. Except random forest model using other model for dataset 1 performance is low. Due to small sample size if we increase the number of clusters predictive model is not perform effectively. For dataset 2 optimum number of clusters is 4, above this performance decreases relatively in this experiment. In Table 3, training phase error is very low for random forest model in both cases and also gives best result with this model. Figure 4 and Fig. 5 are the representation of regression plots for training and test phase of best model using actual and predicted solubility values.

Fig. 4.
figure 4

Regression plot of cluster-based Random Forest model dataset 2 training phase

Fig. 5.
figure 5

Regression plot of cluster-based Random Forest model dataset 2 test phase

5 Conclusion

Predicting aqueous solubility is a critical task in the area of chemo-informatics. In this study, we have used two types of dataset one is a minimal sample size, and another is a relatively large sample. In both cases, with clustering or without clustering, we got the best result with a random forest regression model. We have got the lowest RMSE of 0.60 and 0.61 for dataset 1 and dataset 2 test phase respectively. For getting more improved unbiased results, a vast dataset should use. In this cluster-based model, we have got a good result with only using two clusters and four clusters for dataset 1 and dataset 2 respectively. For other datasets, the number of groups may vary.