Introduction

Multidrug resistance (MDR) is a form of cellular drug resistance that develops in cancer cells and involves reduced drug accumulation in the intracellular space. The most common cellular response associated with MDR is the overexpression of membrane transporter proteins belonging to the ATP-binding cassette superfamily. Among these transporter proteins, P-glycoprotein (PgP) is overexpressed in many cancer cell-line models [1]. PgP is also known as ATP-binding cassette sub-family B member 1 or multidrug resistance protein 1 (MDR1) [2, 3]. PgP is widely distributed in the hepatocytes of the bile duct, the apical membranes of intestinal mucosal cells, the proximal tubular cells of the kidney, and the capillary endothelial cells of the brain and testis. This transmembrane glycoprotein of 1280 amino acids is encoded by the MDR1 gene and is important for intestinal absorption, drug metabolism, and blood–brain barrier (BBB) penetration [4]. PgP consists of two transmembrane domains, each containing six transmembrane α-helices, which form the drug-binding domains that transport drugs. Two ATP-binding domains located on the cytoplasmic side of the membrane are crucial for the transport of toxins, which is driven by ATP hydrolysis [3].

PgP shows broad ligand specificity and translocates its ligands out of the cell against the concentration gradient using the energy derived from ATP hydrolysis. Overexpression of PgP lowers intracellular drug concentrations to sub-therapeutic levels through increased ATP-dependent efflux, leading to MDR. Progress in the understanding of MDR has made the inhibition of PgP a viable and attractive therapeutic approach to overcoming MDR [5,6,7]. In the past decades, several inhibitors designed to target PgP failed in clinical trials [8]. The known inhibitors of PgP are broadly classified into four generations. The first and second generations of inhibitors showed uncertain pharmacokinetics [9] and interaction with oxidizing enzymes [10, 11], respectively. The third generation of inhibitors improved significantly but was unsuccessful in clinical trials because of toxicity [12, 13]. The fourth generation of inhibitors consists of natural products with lower toxicity and low molecular weight, which can potentially lead to a next generation of PgP inhibitors [14, 15]. The quest for novel PgP inhibitors for the reversal of MDR in cancer patients is of current research interest. This requires the development of new QSAR models with novel descriptors that identify PgP inhibitors with high accuracy and precision.

Several in silico QSAR (quantitative structure–activity relationship) methodologies for identifying drug molecules that inhibit PgP are known in the literature. Most of these methods identify PgP inhibitors using pharmacophore description models with the help of advanced machine learning algorithms. These models are broadly classified as binary classification models [16,17,18,19,20,21,22,23,24,25,26], correlation models [27,28,29,30], and pharmacophore-based models [22, 23, 26, 31,32,33,34,35]. The SAR (structure–activity relationship) based methods confirm that lipophilicity (log P) [36,37,38], molecular weight [15, 17, 39], aromaticity [20, 22, 40], and hydrogen bond acceptors [20, 22, 35, 41] are important molecular properties for the identification of such inhibitors. These studies further support that lower log P as well as molecular weight are crucial physicochemical factors for ideal PgP inhibitors. QSAR modeling studies on a small set of PgP inhibitors suggest that ideal compounds should possess a log P greater than 2.92, a high EHOMO (energy of the highest occupied molecular orbital), and at least one tertiary basic nitrogen atom [36]. Chen et al. developed QSAR classification models using fingerprints and molecular property descriptors for a diverse set of 973 PgP inhibitors with an accuracy of 81% [17]. These authors reported solubility, log D, and molecular weight as important descriptors for distinguishing inhibitors from non-inhibitors. Schyman et al. used the variable nearest neighbor (v-NN) method and predicted PgP inhibitors for a large and diverse set of 2,276 compounds with an accuracy of 87% [25].

The present study focuses on the development of machine learning based binary classification schemes to distinguish PgP inhibitors from non-inhibitors using 3D-RISM-KH based solvation free energy descriptors. This work is intended as a proof of concept that molecular solvation theory can be used successfully to identify PgP inhibitors. We used the 3D-RISM-KH molecular solvation theory to calculate the solvation free energy and solvation free energy based descriptors for PgP inhibitor and non-inhibitor (PgP±) compounds. The 3D-RISM-KH theory is a first-principles, statistical mechanics based solvation model that uses rigorous descriptions of direct correlation functions to calculate thermochemical properties of pure liquids and solutions in the form of excess and total chemical potentials, partial molar volume, solvent distribution functions around the solute, etc. The applicability of these descriptors was tested by developing classification schemes to identify the PgP± compounds. The machine learning methodologies were first used to estimate the importance of the 3D-RISM-KH based descriptors in predicting the PgP± compounds, and the important descriptors were then used to develop models with the classification schemes.

Computational methods

Database preparation

The database of PgP inhibitor and non-inhibitor (PgP±) compounds was taken from the published work of Broccatelli et al. [20]. Their extensive literature search of more than 60 references yielded a large data set of 1274 PgP± compounds. The details of data curation, experimental methods, and the IC50 values used to classify the PgP± compounds are given in the data collection section and supporting material published by Broccatelli et al. [20]. No duplicate PgP± compounds were observed in the data set. The SMILES strings of all PgP± compounds were imported into the Molecular Operating Environment (MOE2018) drug discovery software platform [42] with the help of its database preparation module. Hydrogens were added and 3D Cartesian coordinates were generated for the PgP± compounds in MOE. The neutral form of each species was used in all calculations. The PgP± compounds were subjected to gas-phase geometry optimization at the semi-empirical AM1 level using the Gaussian16 software package [43, 44]. The molecular descriptors of all the molecules were generated using the MOE2018 drug discovery software.

3D-RISM-KH based descriptors generation

The 3D-RISM-KH based excess chemical potential and partial molar volume (used as descriptors in the prediction) were calculated for the PgP± compounds using our in-house 3D-RISM-KH code. A working version of this code (executed as rism1d and rism3d.snglpnt) is implemented in the AMBERTOOLS suite of programs [45]. We used five solvents, viz. chloroform, cyclohexane, n-hexadecane, n-octanol, and water, in the 1D-RISM susceptibility calculations for pure liquids. The parameters for these solvents were validated against experimental solvation free energy datasets, as reported by us previously [46, 47]. We employed UFF [48] parameters with AM1 charges for all the solutes. The 3D-RISM-KH calculations for the solute molecules were performed on a uniform cubic 3D grid of 128 × 128 × 128 points in a box of size 64 × 64 × 64 Å³, which represents a solute with a few solvation layers, with the convergence accuracy set to 10⁻⁵ in the modified direct inversion in the iterative subspace (MDIIS) solver [49]. A detailed workflow chart describing the calculation of the 3D-RISM-KH based descriptors is given in the ESM (Fig. S1).
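
As a quick check on the resolution implied by these settings, and assuming the grid points are spaced evenly along each box edge (as a uniform grid implies), the grid spacing follows from the box edge length L and the number of points N per edge. The symbols L, N, and Δ are introduced here for illustration only:

```latex
\Delta = \frac{L}{N} = \frac{64\ \mathrm{\AA}}{128} = 0.5\ \mathrm{\AA}
```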

Machine learning and statistical modeling

The machine learning predictive models for the PgP± compounds were developed with the descriptors described above. The full list of descriptors for the entire dataset is provided in the ESM. The statistical importance analysis of descriptors, the machine learning calculations, and the calculation of the performance indices of the models were performed using RStudio version 3.4.4 [50]. The R packages used to perform the calculations are briefly described in S1 of the ESM [51,52,53,54,55,56,57]. The definitions of the machine learning methodologies and performance indices are given in the ESM (Tables S4–S7 and S2). The statistical importance analysis of descriptors was performed with the GBM (gradient boosting machines) and RF (random forest) methods to identify the crucial descriptors to use in the predictive models. The database of PgP± compounds was divided into a training set (75% of compounds) and a test set (25% of compounds) by randomly assigning the molecules. The GBM, GLM (generalized linear models), SVM (support vector machines), and weighted-kNN (weighted k-nearest neighbor) machine learning schemes were used to identify PgP± compounds. The performance indices (accuracy, precision, sensitivity, specificity, and F1-score) were calculated in R from the confusion matrix generated for each classification run.
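
The random split and the scoring step described above can be sketched as follows. This is a minimal Python illustration, not the R workflow used in the study: the function names, the fixed random seed, and the toy labels are assumptions introduced here, and the indices follow the usual confusion-matrix definitions with inhibitors coded as 1 and non-inhibitors as 0.

```python
import random


def split_train_test(items, train_fraction=0.75, seed=0):
    """Randomly assign items to a training set and a test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seed is an illustrative assumption
    n_train = int(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def performance_indices(y_true, y_pred):
    """Compute the five indices named in the text from the confusion matrix."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Split 1274 compound indices 75/25, as in the text.
train_ids, test_ids = split_train_test(range(1274))
print(len(train_ids), len(test_ids))

# Toy labels (made up for illustration only).
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(performance_indices(toy_true, toy_pred))
```

Sensitivity and specificity are reported separately because accuracy alone can hide poor performance on one of the two classes when the classes are unevenly represented.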

Results and discussion

The current study aims to develop binary classification models that distinguish PgP inhibitors from non-inhibitors with precision and accuracy using the 3D-RISM-KH molecular solvation theory. The 3D-RISM-KH based solvation parameter, the excess chemical potential in different solvents, was calculated for the PgP± compounds and used as a descriptor. The machine learning based binary classification schemes were developed with 3D-RISM-KH based descriptors along with other descriptors and used to classify the PgP± compounds. To achieve these objectives, we prepared the database of PgP± compounds and generated the descriptors as described in the computational methods. The statistical importance analysis was performed on all descriptors (354 in total) with the GBM (gradient boosting machines) method and identified 23 descriptors as important for preliminary model building. The list of descriptors is given in Fig. 1 and Table S1 in the ESM. This pool consists of ten 3D-RISM-KH based descriptors and thirteen 2D descriptors. Among the 23 descriptors, the 3D-RISM-KH based descriptors contributed a total relative importance of 48.1% (Fig. 1). The top 5 descriptors contributed 62% of the relative importance, and the remaining 18 descriptors contributed 38%.
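
To make the bookkeeping behind these percentages concrete, the sketch below converts raw importance scores into relative importance (in percent) and sums the share held by the top-ranked descriptors. It is an illustration only: the descriptor names and scores are made-up placeholders rather than values from this study, and the helper names are assumptions.

```python
def relative_importance(raw_scores):
    """Scale raw importance scores so that they sum to 100%."""
    total = sum(raw_scores.values())
    if total <= 0:
        raise ValueError("importance scores must sum to a positive value")
    return {name: 100.0 * score / total for name, score in raw_scores.items()}


def top_k_share(rel_importance, k):
    """Sum the relative importance of the k highest-ranked descriptors."""
    ranked = sorted(rel_importance.values(), reverse=True)
    return sum(ranked[:k])


# Placeholder scores for four hypothetical descriptors.
toy_scores = {"desc_a": 4.0, "desc_b": 3.0, "desc_c": 2.0, "desc_d": 1.0}
rel = relative_importance(toy_scores)
print(rel)
print(top_k_share(rel, 2))
```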

Fig. 1

Relative importance of descriptors in the models, obtained from the statistical importance analysis of descriptors with the GBM and RF methods. Upper panel: model-23d was developed with the 23 crucial descriptors obtained by statistical importance analysis of 354 descriptors. Lower panel: model-5d (right) and model-4d (left) were developed with the 5 and 4 crucial descriptors, respectively, obtained by further statistical importance analysis of the 23 crucial descriptors. The statistical importance of each descriptor (in percent) is given on the x-axis.

The 23 descriptors obtained from the initial GBM calculations were subjected to a further analysis of descriptor importance using the GBM and RF methods, with the aim of reducing the number of descriptors in the prediction model while keeping the accuracy intact. The descriptor list is given in Fig. 1b, c and in the ESM (Tables S2, S3). The GBM method identified the excess chemical potential in water and in octanol, the number of aromatic atoms, the sum of atomic polarization, and the topological polar surface area (TopoPSA) as crucial descriptors. The RF method identified the excess chemical potential in water and in octanol, the number of aromatic atoms, and the sum of atomic polarization as crucial descriptors. The top descriptor, the 3D-RISM-KH based excess chemical potential in water, shows the highest relative importance, 38.8% and 41.2% in the GBM and RF methods, respectively. The analysis of the crucial descriptors revealed that four descriptors were common to three classification methods: the excess chemical potential in water and in octanol, the number of aromatic atoms, and the sum of atomic polarization. These findings are in line with the previous literature pointing to lipophilicity and molecular weight as important descriptors for such a classification [15, 31,32,33,34,35,36,37,38,39,40,41].
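
The overlap between the GBM and RF selections can be checked mechanically. The short sketch below uses the descriptor names listed in this paragraph; the variable and function names are assumptions made for illustration.

```python
# Descriptors reported as crucial by each method (names taken from the text).
gbm_selected = [
    "excess chemical potential in water",
    "excess chemical potential in octanol",
    "number of aromatic atoms",
    "sum of atomic polarization",
    "TopoPSA",
]
rf_selected = [
    "excess chemical potential in water",
    "excess chemical potential in octanol",
    "number of aromatic atoms",
    "sum of atomic polarization",
]


def common_descriptors(first, second):
    """Return the descriptors present in both lists, in the order of the first."""
    second_set = set(second)
    return [name for name in first if name in second_set]


common = common_descriptors(gbm_selected, rf_selected)
only_gbm = [name for name in gbm_selected if name not in set(rf_selected)]
print(common)
print(only_gbm)
```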

We developed three descriptor models based on the relative importance of descriptors from the statistical importance analysis: (i) model-23d (maximum descriptor model), (ii) model-5d, and (iii) model-4d (minimum descriptor model). As their names suggest, model-23d, model-5d, and model-4d were developed with 23, 5, and 4 descriptors, respectively. The choice of descriptors for model-4d was guided by the RF method, and for model-23d and model-5d by the GBM method. The models were used to classify PgP± compounds with the machine learning schemes described in the computational methods section. The performance indices of the different classification schemes using the three models for the test set of compounds are given in Fig. 2 and Tables S4–S7 in the ESM.

Fig. 2

Performance indices (Tables S4–S7 and S2 in the ESM) of the different machine learning schemes used for classification of PgP± compounds. a model-23d (average accuracy and precision for runs a to e, Table S4 in the ESM), b model-5d, c model-4d, d best accuracy and precision for the different classification schemes used for model-23d.

The classification schemes identify each test set compound as a PgP inhibitor (yes/1) or PgP non-inhibitor (no/0), based on the model applied. With model-23d, the accuracy of the GBM, GLM, SVM, and weighted-kNN classification methods is in the range of 84.0–86.5%, 48.1–71.4%, 95.6–96.9%, and 85.2–87.4%, respectively. The SVM shows the best accuracy of ~97% among the four classification schemes with model-23d, whereas the GLM method shows low accuracy in identifying the PgP± compounds. The GBM and weighted-kNN methods show better accuracy than the GLM method. Similar trends in accuracy were observed for model-5d and model-4d with the four classification schemes. With model-5d, the accuracy of the GBM, GLM, SVM, and weighted-kNN classification methods is 82.4%, 68.9%, 90.3%, and 86.4%, respectively. With model-4d, it is 81.4%, 48.4%, 87.1%, and 84.6%, respectively. The SVM method again shows the best accuracy and the GLM method the lowest with model-5d and model-4d, and the GBM and weighted-kNN methods performed better than the GLM method with all three descriptor models. Across the three models, the SVM identified the PgP± compounds with the best accuracy, in the range of 87.1–96.9%. The stability of the statistical models was tested by randomly removing data points from the test set (50–100 points) and recalculating the statistical performance indices for the resulting smaller test sets. The best accuracy in identifying the PgP± compounds was achieved with model-23d and the SVM method. This model has a higher number of descriptors than the other two models: it was built with 10 of the 3D-RISM-KH based descriptors and 13 of the 2D descriptors. We compared the performance of the current models with that of models known from the literature, which are summarized with their references in Table 1.
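
The stability test mentioned in this paragraph can be sketched as a simple resampling loop. This is a hypothetical Python illustration rather than the procedure as implemented in the study: the number of repetitions, the random seeds, the test set size of 300, and the synthetic labels are all assumptions, and only accuracy is recomputed here.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def reduced_test_set_accuracies(y_true, y_pred, n_repeats=10,
                                min_removed=50, max_removed=100, seed=0):
    """Recompute accuracy after randomly removing 50-100 test points per repeat."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if len(y_true) <= max_removed:
        raise ValueError("test set must be larger than max_removed")
    rng = random.Random(seed)  # seed is an illustrative assumption
    n_total = len(y_true)
    results = []
    for _ in range(n_repeats):
        n_removed = rng.randint(min_removed, max_removed)
        kept = rng.sample(range(n_total), n_total - n_removed)
        results.append(accuracy([y_true[i] for i in kept],
                                [y_pred[i] for i in kept]))
    return results


# Synthetic test set (made up): every tenth prediction is wrong.
label_rng = random.Random(1)
y_true = [label_rng.randint(0, 1) for _ in range(300)]
y_pred = [1 - t if i % 10 == 0 else t for i, t in enumerate(y_true)]

full_accuracy = accuracy(y_true, y_pred)
reduced = reduced_test_set_accuracies(y_true, y_pred)
print(f"full test set: {full_accuracy:.3f}")
print(f"reduced test sets: min {min(reduced):.3f}, max {max(reduced):.3f}")
```

A narrow spread of accuracies across the reduced test sets suggests that the reported performance does not hinge on a small number of test compounds.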

Table 1 Summary of the previously published predictive models for the PgP± compounds

Conclusions

In conclusion, we have applied our 3D-RISM-KH solvation theory based predictors to construct a PgP inhibitor model, using binary (1/0) values of PgP± compounds. This is the first report providing a proof of concept that 3D-RISM-KH solvation theory-based descriptors can be used successfully to predict the PgP± compounds in a binary fashion. Amongst different models tested here, the maximum descriptor model with the SVM classification scheme showed excellent performance.

In the current study, the 2D descriptors make a significant contribution, along with the 3D-RISM-KH based descriptors, to predicting the PgP± compounds. Previous reports also used lipophilicity (log P) [36,37,38], molecular weight [15, 17, 39], aromaticity [20, 40, 41], and hydrogen bond acceptors [20, 35, 41] as important 2D descriptors to distinguish inhibitors from non-inhibitors. These alone are not sufficient to reach a high accuracy in predicting the PgP± compounds. The current study indicates that including 3D-RISM-KH based descriptors along with 2D descriptors in the models yields a higher accuracy in predicting the PgP± compounds. The presence of the excess chemical potential in water and in octanol in model-23d points to the importance of the lipophilicity of molecules as a feature for such classification. For a molecule to take part in molecular recognition at several cellular levels, it has to pass through a series of solvation–desolvation processes. The 3D-RISM-KH based solvation descriptors capture this physical feature of the process. Octanol and cyclohexane are typical mimics of the non-polar environment that a drug molecule experiences on being absorbed from the plasma. The presence of solvation free energy based descriptors in our top descriptor list is also in agreement with previous literature reports [17, 36,37,38]. The 3D-RISM-KH based solvation free energy descriptors have also been used as crucial descriptors for the prediction of blood–brain barrier (BBB) and skin permeability [46, 58].
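
The link between the excess chemical potentials in water and octanol and lipophilicity can be made explicit with the standard thermodynamic relation between a partition coefficient and the free energy of transfer. The expression below is a sketch and not a quantity computed in this study. It assumes infinite dilution, equates the solvation free energy in each solvent with the corresponding excess chemical potential (written here with the assumed symbols \mu^{ex}_{wat} and \mu^{ex}_{oct}), and ignores the mutual saturation of water and octanol in the experimental system:

```latex
\log P_{\mathrm{oct/wat}} \approx \frac{\mu^{\mathrm{ex}}_{\mathrm{wat}} - \mu^{\mathrm{ex}}_{\mathrm{oct}}}{RT \ln 10}
```

Under these assumptions, a solute that is solvated more favorably in octanol than in water (a lower excess chemical potential in octanol) has a positive log P.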

The SVM-PgP± prediction model shows better accuracy than the predictive models reported in the literature. The maximum descriptor model may identify PgP inhibitor compounds with high accuracy and precision. The models can serve as a tool for identifying PgP± compounds in the early phases of drug discovery. The 3D-RISM-KH based descriptors may also serve as better descriptors for prediction models that classify the inhibitors of other transporter proteins involved in MDR.