Introduction

Human P-glycoprotein (P-gp), encoded by the ABCB1/MDR1 gene, is one of the most important and widely studied members of the ATP-Binding Cassette (ABC) transporter superfamily [1, 2]. Its energy-dependent efflux activity with broad substrate specificity (natural products, anticancer agents, peptides, steroids, lipids, cytokines, dyes, and ionophores) and its widespread distribution in normal tissues (intestine, brain, testis, placenta, liver, and kidney) and tumors support its roles in human physiology and drug therapy [3]. Because it takes part in the physiological defense mechanism against xenobiotics such as toxins and drugs, P-gp should be considered an antitarget, together with hERG (human ether-a-go-go related gene), cytochrome P450s, and PXR (pregnane X-receptor), in the drug discovery and development process [4, 5]. Blocking this antitarget with inhibitors (e.g., ketoconazole) could change the blood concentration of a co-administered drug (e.g., terfenadine) or its metabolite and consequently lead to drug–drug interactions and unwanted side effects (e.g., QT prolongation/Torsades de Pointes) [6]. Because it confers a multidrug resistance (MDR) phenotype on cancer cells through active outward transport of cytotoxic drugs, P-gp has also been regarded as a clinical target in chemotherapy [7]. P-gp inhibition is intended to overcome MDR in cancer patients by increasing the intracellular accumulation of substrate drugs and thus their cytotoxicity [8].

With regard to structure, P-gp is a 170 kDa transmembrane protein [9] formed by two symmetrical halves: N-terminal and C-terminal. Each half of the molecule contains six transmembrane (TM) domains, followed by a nucleotide-binding domain (NBD). The inward-facing conformation (which is stabilized by TMs 4, 5, 10, and 11) forms a large internal cavity (\(\sim \)6,000 Å\(^{3}\)) open to both the cytoplasm and the inner leaflet for multispecific drug binding [10]. To date, ABCB10 is the only human ABC transporter resolved for structure-based approaches, apart from X-ray structures of several other ABC transporters from prokaryotes (e.g., bacteria) and eukaryotic organisms (e.g., mice) [11, 12]. To deal with the unavailability of high-resolution crystallographic 3D structures of P-gp, homology models of the protein have been generated using the relevant resolved structures as templates. For example, by combining homology modeling, molecular docking, site-directed mutagenesis, and cell- and membrane-based assays, the recent work of Ambudkar et al. [13] revealed multiple active binding sites for substrates and modulators, including one primary site in a large flexible pocket in the TM domains and secondary sites.

The concept of using small molecule inhibitors (SMIs) of P-gp to resensitize tumor cells to chemotherapeutic drugs is widely acknowledged [14]. Over the last three decades, three generations of P-gp inhibitors have been developed and examined in preclinical and clinical studies [15]. However, no drugs for P-gp blockade are yet in clinical use, and the failure of candidates may be attributed to their unfavorable properties (e.g., poor solubility, poor specificity, and toxicity) and pharmacokinetic interactions [15–19]. In this context, flavonoids (polyphenolic compounds), which are ubiquitous in nature, have emerged as reversal agents of efflux pump-mediated MDR because of their advantages, including dual effects (P-gp modulation and antitumor activity) and safety, and they form the third-generation, nonpharmaceutical category of P-gp inhibitors [15, 20–24]. Chemical modifications of flavonoids of herbal origin have been carried out for structure–activity relationship (SAR) studies. For example, Ecker et al. [25] have recently reported a series of new synthetic chalcones with good biological evaluation results and indicated the importance of specific groups for P-gp inhibitory activity based on 2D- and 3D-QSAR analyses.

Computer-aided drug design (CADD) methods including QSAR, molecular modeling, and docking are helpful to predict and elucidate ligand–protein interactions in the early stage of the drug discovery process [12, 26]. For this reason, the present study aimed to develop high-performance 2D-QSAR models (individual and ensemble) for predicting P-gp inhibitory activity. In addition, we used docking to identify binding sites and plausible binding modes in a P-gp homology model for the compounds with the best predicted pIC\(_{50}\) values from two libraries: in-house chalcones and DrugBank compounds.

Materials and methods

The research process including machine learning, homology modeling, and molecular docking is summarized as shown in Fig. 1 and described in detail as follows:

Fig. 1 Study flowchart (a) and nodes in Clementine (b)

Data sources

Databases for predicting P-gp inhibitory activity

As shown in the work of Ecker et al. [27], it is not possible to create a large, chemically diverse dataset of P-gp inhibitors when the bioactivity values are derived from different assays. For this reason, three SAR datasets based on the same daunorubicin efflux assay in MDR CCRF vcr1000 cells were compiled and combined: (i) 198 chemical compounds from the publication of Ecker et al. [27]; (ii) 159 tested compounds from the PubChem BioAssay with Assay Identification (AID) number 281137, comprising 141 active compounds and 18 unspecified compounds; and (iii) 303 tested compounds from the PubChem BioAssay with AID number 781331, comprising 274 active compounds and 29 unspecified compounds. All of them are also present in the ChEMBL open source database [28]. In addition, 22 chalcone derivatives whose IC\(_{50}\) values were measured in a daunomycin efflux inhibition assay using the same cell line were collected from another publication of Ecker et al. [25] to broaden the chemical space of the QSAR models. These four datasets were merged using MOE 2008.10 [29]. After inspection, two duplicates (CID: 13504844 and CID: 9976755) in the AID 281137 database, whose structures were identical to those of two other compounds (CID: 73357260 and CID: 73346637) in the AID 781331 database, were removed. As a result of this process, a final dataset of 499 compounds with chemical structures and bioactivity values (IC\(_{50}\)) was obtained for the generation of prediction models (Supplementary Material 2, SM2-DB1). The negative logarithm of IC\(_{50}\) (pIC\(_{50}\)) was used as the dependent variable. From this dataset, we randomly took 99 compounds (20 %) as an external test set to assess the extrapolation capacity of the generated models.
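The conversion to the dependent variable and the random hold-out described above can be sketched as follows. This is a minimal illustration rather than the procedure actually run in MOE: the function names `pic50` and `hold_out` are hypothetical, the IC\(_{50}\) unit (mol/L) is an assumption because the text does not state it, and truncating the 20 % fraction to an integer is one possible way to arrive at 99 of 499 compounds.

```python
import math
import random


def pic50(ic50_molar):
    """Return -log10(IC50); the mol/L unit is an assumption, not stated in the text."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_molar)


def hold_out(ids, fraction=0.2, seed=0):
    """Randomly set aside a fraction of the compounds as an external test set.

    The count is truncated to an integer, so 499 compounds at 20 % give 99.
    Returns (modeling_ids, external_ids).
    """
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_ext = int(len(shuffled) * fraction)
    return shuffled[n_ext:], shuffled[:n_ext]


modeling, external = hold_out(range(499))
print(len(modeling), len(external))
```

With these assumptions the split reproduces the 400/99 sizes quoted in the text; the actual membership of each set depends on the random seed.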

Databases for virtual screening

In this work, we used an in-house library of 87 chalcones for in silico screening to seek novel efflux pump inhibitors that might help to overcome MDR in tumors (Supplementary Material 2, SM2-DB2). Furthermore, another library of 6874 compounds belonging to six groups, namely approved, experimental, investigational, nutraceutical, withdrawn, and illicit, was downloaded from DrugBank [30–33] for the same purpose, in order to exploit existing drugs (Supplementary Material 2, SM2-DB3).

Attribute calculation and selection

The 2D structures were built in ChemBioDrawUltra 12.0 [34] if not available and subsequently energy minimized in MOE [29] before attribute calculation. Chemical descriptors, which convert chemical compounds into descriptor vectors that can be processed by computers, are of great importance in predicting protein–ligand interactions [35]. First, 184 2D molecular descriptors (physicochemical properties, subdivided surface areas, atom and bond counts, Kier and Hall connectivity and Kappa shape indices, adjacency and distance matrices, pharmacophoric features, and partial charges) were computed for the 499 database compounds using MOE [29]. In addition, 1444 1D and 2D descriptors representing 63 different molecular properties were calculated for the same compounds using PaDEL-Descriptor 2.21 [36].

Feature selection aims to remove redundant or irrelevant features for improving model quality and reducing computational cost [37]. First of all, the compounds with no missing attributes were filtered using RapidMiner 5.3.008 [38]. Subsequently, removing useless and correlated attributes and optimizing selection by a genetic algorithm (GA) were also implemented in RapidMiner [38]. Finally, the search method BestFirst in Weka 3.7.9 [39] that searches the space of attribute subsets by greedy hill-climbing augmented with a backtracking facility was used to select attributes with 10-fold cross validation. All parameters in the variable selection process were set to default.
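The first filtering stage (dropping useless and correlated attributes) can be illustrated with a short sketch. It is not the RapidMiner implementation: the function name `filter_descriptors` and both thresholds (`min_sd`, `max_abs_r`) are assumptions chosen for illustration, since the text only states that default parameters were used.

```python
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def filter_descriptors(columns, min_sd=1e-6, max_abs_r=0.95):
    """Keep descriptors that vary and are not highly correlated with one already kept.

    columns maps a descriptor name to its values over all compounds.
    Both thresholds are illustrative assumptions, not the RapidMiner defaults.
    """
    kept = []
    for name, values in columns.items():
        if statistics.pstdev(values) <= min_sd:
            continue  # near-constant descriptor carries no information
        if any(abs(pearson(values, columns[k])) > max_abs_r for k in kept):
            continue  # redundant with a descriptor that was already kept
        kept.append(name)
    return kept


demo = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [5, 5, 5, 5], "d": [1, -1, 1, -1]}
print(filter_descriptors(demo))
```

In this toy table, `b` is a multiple of `a` and `c` is constant, so only `a` and `d` survive. The GA-based optimization and the BestFirst search that follow in the text are wrapper methods and are not covered by this sketch.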

Data partition into training and test sets

For internal validation, the database of 400 remaining compounds was divided into the training set (75 %) and the internal test set (25 %) using two tools in MOE [29], Rand and Diverse Subset. The Rand function assigning a random number between 0 and 1 to each compound was used to split the database randomly. In contrast, the Diverse Subset application assigning a rank to each compound by a ranking process whose methodology can be found in detail in the software was used to determine the farthest ones in the study population for the diverse training set.
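The idea of a diversity-driven split can be sketched with a generic greedy MaxMin selection. This is an assumption-laden stand-in: the text does not describe the ranking used by the MOE Diverse Subset application, so the function `maxmin_subset`, the Euclidean distance, and the choice of the first compound as the seed are all hypothetical.

```python
import math


def maxmin_subset(points, n_select, start=0):
    """Greedy MaxMin selection: repeatedly add the point farthest from those chosen.

    points is a list of descriptor vectors; returns the indices of the selection.
    This is a generic diversity picker, not the MOE Diverse Subset algorithm.
    """
    if n_select > len(points):
        raise ValueError("cannot select more points than are available")
    chosen = [start]
    # distance of every point to its nearest already-chosen point
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < n_select:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


points = [(0, 0), (1, 0), (10, 0), (5, 0)]
print(maxmin_subset(points, 3))
```

Under this scheme the selected indices would form the diverse training set (75 %), and the remaining compounds the internal test set (25 %).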

Applicability domain determination

To predict a new compound reliably using a QSAR model, the applicability domain (AD) must be determined. This term has been interpreted in different ways [40–42], but it can be simply understood as “the response and chemical structure space in which the QSAR model makes predictions with a given reliability” [43]. Available methods for defining the domain of applicability include ranges in the descriptor space, geometrical methods, distance-based methods, probability density distribution, range of the response variable, and miscellaneous methods [40, 42, 44]. Recently, a new simple method has been suggested by Roy et al. [43] to identify the X-outliers in the training set and the compounds outside the AD in the test set. Its principle is based on the standardization approach, which considers mean ±3 standard deviations (SD) as the zone containing most training set compounds (99.7 %) and the remainder as the zone of dissimilar compounds. Following this method, an X-outlier (if in the training set) or a compound outside the AD (if in the test set) can be identified by computing the standardized descriptor \({S}_{\mathrm{i(k)}}\), the maximum \({S}_{\mathrm{i(k)}}\) value \(([{S}_{\mathrm{i}}]_{\mathrm{max(k)}})\), the minimum \({S}_{\mathrm{i(k)}}\) value \(([{S}_\mathrm{i}]_{\mathrm{min(k)}})\) (if necessary), and \({S}_{\mathrm{new(k)}}\) (if necessary), and comparing them with 3 (threshold value). In the present study, we employed the software “Applicability domain using standardization approach”, also developed by Roy et al. [43], to execute this process.
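The decision logic of the standardization approach can be sketched as below. Several details are assumptions that are not stated in this text: the function names are hypothetical, the sample SD is used for standardization, and \({S}_{\mathrm{new(k)}}\) is taken as the mean of the standardized values plus 1.28 times their SD, which reflects one reading of the published method [43] and should be checked against it.

```python
import statistics


def standardize(train_columns, compound):
    """S_i(k) = |x_i(k) - mean_i| / sd_i, with mean and SD taken from the training set.

    train_columns holds one list of training values per descriptor; constant
    descriptors (sd = 0) are assumed to have been removed beforehand.
    """
    s_values = []
    for i, column in enumerate(train_columns):
        mean, sd = statistics.fmean(column), statistics.stdev(column)
        s_values.append(abs(compound[i] - mean) / sd)
    return s_values


def inside_domain(s_values, threshold=3.0):
    """Return True if the compound is inside the AD (or is not an X-outlier)."""
    if max(s_values) <= threshold:
        return True   # every descriptor lies within mean +/- 3 SD
    if min(s_values) > threshold:
        return False  # every descriptor lies outside mean +/- 3 SD
    # mixed case: S_new = mean + 1.28 * SD of the S values (assumed factor)
    s_new = statistics.fmean(s_values) + 1.28 * statistics.stdev(s_values)
    return s_new <= threshold


print(inside_domain([0.5, 1.0, 2.9]), inside_domain([2.0, 3.5, 3.4, 0.1]))
```

The same function serves both purposes named in the text: applied to training compounds it flags X-outliers, and applied to test compounds it flags those outside the AD.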

Machine learning methods

Machine learning provides the technical basis of data mining for a variety of purposes. In drug discovery and development, machine learning tools and techniques are being increasingly applied to predict pharmacodynamic (inhibitor, substrate, antagonist, agonist, blocker, and activator) and pharmacokinetic (ADMET) properties of the compounds [45]. In the present study, we used two nodes, Numeric Predictor and Ensemble in Clementine 12.0 [46], for predicting P-gp inhibitory activity. The Numeric Predictor node can estimate the models for a maximum of six machine learning algorithms, viz Neural Network, C&R Tree, CHAID, Regression, Generalized Linear, and Support Vector Machine (SVM). And Ensemble is the node used for combining predictions from the most accurate models to avoid limitations of individual models and achieve a higher overall accuracy. The principles and the application examples of these methods were described in detail elsewhere [37, 46, 47].

Model evaluation

In this study, statistical parameters and validation criteria used to measure the QSAR model performance for predictivity were presented as follows:

Internal validation

The internal quality of models is judged by the squared correlation coefficient \({R}^{2}\) and the cross-validated squared correlation coefficient \({Q}^{2}\). Leave-one-out (LOO) is a particular case of k-fold cross validation where k equals the number of compounds in the data [48]. In the case of LOO cross validation, the \({Q}^{2}\) metric [49] is calculated according to the following formula:

$$\begin{aligned} {Q}_{\mathrm{LOO}}^2 =1-\frac{\mathop \sum \nolimits _{{i}\,=\,1}^{{n}_{\mathrm{TR}} } \left( {\hat{ {y}}}_{i} -{y}_\mathrm{i} \right) ^{2}}{\mathop \sum \nolimits _{{i}\,=\,1}^{{n}_{\mathrm{TR}} } \left( {{y}_\mathrm{i} -{\bar{{y}}}} \right) ^{2}}. \end{aligned}$$
(1)

In Eq. (1), \({y}_{i} \) and \({\hat{{y}}}_{i} \) are, respectively, the observed and predicted activity values and \({\bar{{y}}} \) is the mean value of \({y}_{i} \) in the training set. For an acceptable model, \({R}^{2} \ge 0.7, {Q}_{\mathrm{LOO}}^2 \ge 0.6\) and \(\left| {{R}^{2}-{Q}_{\mathrm{LOO}}^2 } \right| \le 0.1\) are required [50–53].
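Eq. (1) can be made concrete with a small sketch in which each compound is left out in turn, a model is refitted on the rest, and the left-out value is predicted. For self-containment the model here is a one-descriptor least-squares line, which is an assumption made purely for illustration and not one of the algorithms used in the study; the function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


def q2_loo(x, y):
    """Leave-one-out cross-validated Q^2 following Eq. (1)."""
    n = len(y)
    y_bar = sum(y) / n  # mean of the observed values in the training set
    press = 0.0
    for i in range(n):
        # refit without compound i, then predict compound i
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (slope * x[i] + intercept - y[i]) ** 2
    ss_tot = sum((v - y_bar) ** 2 for v in y)
    return 1 - press / ss_tot


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(round(q2_loo(x, y), 3))
```

Because every prediction comes from a model that never saw the compound, \({Q}_{\mathrm{LOO}}^2\) is at most equal to the fitted \({R}^{2}\) in typical cases, and a large gap between the two signals overfitting.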

For further internal validation, model randomization or y-randomization was implemented on the training set. In this process, the values of the dependent variable (y) were randomly scrambled 10 times and new models were developed from each y-randomized training set. The \({R}_\mathrm{p}^2 \) metric proposed by Roy et al. [49] to ensure the models not to be developed by chance is calculated by the following formula:

$$\begin{aligned} {R}_\mathrm{p}^2 ={R}^{2}\sqrt{{R}^{2}-{R}_\mathrm{r}^2 } \end{aligned}$$
(2)

In Eq. (2), \({R}^{2}\) and \({R}_\mathrm{r}^2 \) are the squared correlation coefficient of the nonrandomized model and the squared mean correlation coefficient of randomized models, respectively. For a predictive QSAR model, the value of \({R}_\mathrm{p}^2 \) should be more than 0.5 [49].
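The y-randomization check and Eq. (2) can be sketched as follows. The helper names are hypothetical, a single descriptor stands in for the full model, and \({R}_\mathrm{r}^2\) is computed here as the mean of the squared correlation coefficients over the scrambled runs, which is one possible reading of the definition above.

```python
import math
import random


def squared_correlation(x, y):
    """Squared Pearson correlation between one descriptor and the response."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def randomized_r2(x, y, n_rounds=10, seed=0):
    """Mean squared correlation after scrambling y n_rounds times."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_rounds):
        y_perm = list(y)
        rng.shuffle(y_perm)
        values.append(squared_correlation(x, y_perm))
    return sum(values) / n_rounds


def rp2(r2, r2_random):
    """Eq. (2): R_p^2 = R^2 * sqrt(R^2 - R_r^2)."""
    diff = r2 - r2_random
    if diff < 0:
        raise ValueError("randomized models outperform the real model")
    return r2 * math.sqrt(diff)


print(round(rp2(0.84, 0.04), 4))
```

A model whose \({R}^{2}\) barely exceeds that of the scrambled runs receives a small \({R}_\mathrm{p}^2\), which is how the metric penalizes chance correlations.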

External validation

The more widely used parameters \({Q}_{\mathrm{F}1}^2 \) [54]; \({Q}_{\mathrm{F}2}^2 \) [55]; \({Q}_{\mathrm{F}3}^2 \) [56, 57]; \({r}_\mathrm{m}^2 ;{\overline{{r}_\mathrm{m}^2}} ; \Delta {r}_\mathrm{m}^2 \) [58–60]; and CCC (concordance correlation coefficient) [61] were applied to externally validate the models on compounds not involved in model development. In addition, the mean absolute error (MAE) was employed to assess the predictability on the external test set [57]. These metrics are calculated using the following formulas:

$$\begin{aligned}&{Q}_{\mathrm{F}1}^2 =1-\frac{\mathop \sum \nolimits _{{i}\,=\,1}^{{n}_{\mathrm{EXT}} } \left( {{y}_{i} -{\hat{{y}}}_{i} } \right) ^{2}}{\mathop \sum \nolimits _{{i}\,=\,1}^{{n}_{\mathrm{EXT}} } \left( {{y}_{i} -{\bar{{y}}}_{\mathrm{TR}} } \right) ^{2}} \end{aligned}$$
(3)
$$\begin{aligned}&{Q}_{\mathrm{F}2}^2 =1-\frac{\mathop \sum \nolimits _{{i}\,=\,1}^{{n}_{\mathrm{EXT}} } \left( {{y}_{i} -{\hat{{y}}}_{i} } \right) ^{2}}{\mathop \sum \nolimits _{{i}\,=\,1}^{{n}_{\mathrm{EXT}} } \left( {{y}_{i} -{\bar{{y}}}_{\mathrm{EXT}} } \right) ^{2}} \end{aligned}$$
(4)
$$\begin{aligned}&{Q}_{\mathrm{F}3}^2 =1-\frac{\left[ {\mathop \sum \nolimits _{{i}\,=\,1}^{{n}_{\mathrm{EXT}} } \left( {{y}_{i} -{\hat{{y}}}_{i} } \right) ^{2}} \right] /{n}_{\mathrm{EXT}} }{\left[ {\mathop \sum \nolimits _{{i}\,=\,1}^{{n}_{\mathrm{TR}} } \left( {{y}_{i} -{\bar{{y}}}_{\mathrm{TR}} } \right) ^{2}} \right] /{n}_{\mathrm{TR}} } \end{aligned}$$
(5)
$$\begin{aligned}&{r}_\mathrm{m}^2 ={r}^{2}\left( {1-\sqrt{{r}^{2}-{r}_0^2 }} \right) \end{aligned}$$
(6)
$$\begin{aligned}&{r}^{\prime 2}_\mathrm{m} ={r}^{\prime 2}\left( {1-\sqrt{{r}^{\prime 2}-{r}^{\prime 2}_0 }} \right) \end{aligned}$$
(7)
$$\begin{aligned}&\overline{{r}_\mathrm{m}^2}=\frac{{r}_\mathrm{m}^2 +{r}^{\prime 2}_\mathrm{m} }{2} \end{aligned}$$
(8)
$$\begin{aligned}&\Delta {r}_\mathrm{m}^2 =\left| {{r}_\mathrm{m}^2 -{r}^{\prime 2}_\mathrm{m} } \right| \end{aligned}$$
(9)
$$\begin{aligned}&\hbox {CCC}=\frac{2\mathop \sum \nolimits _{{i}\,=\,1}^{{n}_{\mathrm{EXT}} } \left( {{y}_{i} -{\bar{{y}}}} \right) \left( {\hat{{y}}}_{i} -{\bar{\hat{{y}}}} \right) }{\mathop \sum \nolimits _{{i}\,=\,1}^{{n}_{\mathrm{EXT}} } \left( {{y}_{i} -{\bar{{y}}}} \right) ^{2}+\mathop \sum \nolimits _{{i}\,=\,1}^{{n}_{\mathrm{EXT}} } \left( {\hat{{y}}}_{i} -{\bar{\hat{{y}}}} \right) ^{2}+{n}_{\mathrm{EXT}} \left( {\bar{ {y}}} -{\bar{\hat{{y}}}} \right) ^{2}}\nonumber \\ \end{aligned}$$
(10)
$$\begin{aligned}&\hbox {MAE}=\frac{1}{{n}_{\mathrm{EXT}}}\times \mathop \sum \nolimits _{{i}\,=\,1}^{{n}_{\mathrm{EXT}} } \left| {y}_{i} - {\hat{{y}}} _{i} \right| \end{aligned}$$
(11)

In Eqs. (3), (4), (5), (10) and (11), \({y}_{i} \) and \({\hat{{y}}} _{i} \) are, respectively, the observed and predicted activity values, while \({\bar{{y}}} \) and \({\bar{\hat{{y}}} }\) are, respectively, the mean values of \({y}_{i} \) and \({\hat{{y}}}_{i} \). In Eq. (6) and (7), \({r}^{2}\) and \({r}_0^2 \) are, respectively, the determination coefficients in the regression function with and without intercept in case of using experimental data on the y-axis and predicted data on the x-axis, while \({r}^{\prime 2}\) and \({r}^{\prime 2}_0 \) are, respectively, the same coefficients in the opposite case. The most stringent validation criteria thresholds including \({Q}_{\mathrm{Fn}}^2 \ge 0.7; {\overline{{{r}_\mathrm{m}^2}}} \ge 0.65\); CCC \(\ge 0.85\) [62, 63]; \({r}_\mathrm{m}^2 \ge 0.5\) [49]; and \(\Delta {r}_\mathrm{m}^2 \le 0.2\) [64] were applied to verify the external predictivity of good models. Using the MAE-based criteria lately proposed by Roy et al. [65], the predictions could be considered good when MAE \(\le \)0.1 \(\times \) training set range and MAE + 3 \(\sigma \le \) 0.2 \(\times \) training set range and bad when MAE > 0.15 \(\times \) training set range or MAE + 3 \(\sigma>\) 0.25 \(\times \) training set range. In this study, we employed the software “XternalValidationPlus” also developed by Roy et al. [65] to execute this process.
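Eqs. (3), (4), (5), (10), and (11) translate directly into code. The sketch below is an illustration and not the XternalValidationPlus implementation; the function name and the dictionary keys are hypothetical, and the \({r}_\mathrm{m}^2\) family is omitted because it depends on how the regressions through the origin are computed.

```python
def external_metrics(y_ext, y_pred, y_train):
    """Q2F1, Q2F2, Q2F3, CCC and MAE for an external (or internal) test set."""
    n_ext, n_tr = len(y_ext), len(y_train)
    mean_tr = sum(y_train) / n_tr
    mean_ext = sum(y_ext) / n_ext
    mean_pred = sum(y_pred) / n_ext
    press = sum((o - p) ** 2 for o, p in zip(y_ext, y_pred))
    ss_ext = sum((o - mean_ext) ** 2 for o in y_ext)
    ss_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    cov = sum((o - mean_ext) * (p - mean_pred) for o, p in zip(y_ext, y_pred))
    return {
        # Eq. (3): test-set deviations measured from the training-set mean
        "Q2F1": 1 - press / sum((o - mean_tr) ** 2 for o in y_ext),
        # Eq. (4): test-set deviations measured from the test-set mean
        "Q2F2": 1 - press / ss_ext,
        # Eq. (5): mean squared error relative to the training-set variance
        "Q2F3": 1 - (press / n_ext) / (sum((o - mean_tr) ** 2 for o in y_train) / n_tr),
        # Eq. (10): concordance correlation coefficient
        "CCC": 2 * cov / (ss_ext + ss_pred + n_ext * (mean_ext - mean_pred) ** 2),
        # Eq. (11): mean absolute error
        "MAE": sum(abs(o - p) for o, p in zip(y_ext, y_pred)) / n_ext,
    }


result = external_metrics([5.0, 6.0, 7.0], [5.5, 6.5, 7.5], [4.0, 5.0, 6.0, 7.0, 8.0])
print({name: round(value, 4) for name, value in result.items()})
```

In this toy case every prediction is shifted by 0.5, so the correlation is perfect while CCC drops below 1, which illustrates why CCC is regarded as the more restrictive measure of agreement.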

Homology modeling

Due to the lack of high-resolution crystal structures of transporter proteins such as P-gp, homology modeling is considered as a feasible solution to obtain their structural information [66]. In the present study, the I-TASSER (Iterative Threading ASSEmbly Refinement) server which is one freely available online system for automated protein 3D structure prediction [67, 68] was employed to generate homology or comparative models of P-gp for docking studies. The I-TASSER methodology based on the state-of-the-art algorithms was described in detail elsewhere [69, 70] and summarized in three stages: (i) multiple threading, (ii) structural assembly, and (iii) model selection and refinement [68]. Before running, the P-gp sequence of 1280 amino acids was submitted in FASTA format [71] without assigning additional restraints and templates. The output including the confidence score (C-score), the template modeling score (TM-score), the root mean square deviation (RMSD), the number of decoys, and the cluster density was provided by the server for quantitative assessments of the predictions (models).

C-score is an estimate of the confidence of structure prediction. Within its typical range of [−5, 2], a higher C-score indicates better quality, and a C-score above \(-\)1.5 indicates a correct topology of the predicted model. RMSD is the average distance of all residue pairs between the predicted structure and the native structure, ranging between 1 and 2 Å for high-resolution models and between 2 and 5 Å for medium-resolution models. Because RMSD can be dominated by a local error, the TM-score, which also measures the structural similarity between two structures, was proposed to solve this problem. A TM-score <0.17 means random predictions and a TM-score >0.5 means correct topology for proteins of all sizes [72]. The cluster density is the number of structure decoys (low-temperature replicas) per unit of space in the SPICKER cluster, and a higher cluster density means a better quality model.

In addition to the above parameters, the stereochemical quality of the final model was checked using PROCHECK [73]. Our structure in PDB format was uploaded to PDBsum to produce its Ramachandran plot of the phi–psi torsion angles for all residues in the structure, except those at the chain termini. Because glycine residues are not restricted to any particular region of the plot, they are separately identified by triangles. Based on an analysis of 118 structures of resolution of at least 2.0 Å and R-factor no greater than 20.0, a good quality model would be expected to have over 90 % in the most favored regions [A,B,L] (also called core regions).

Molecular docking

Ligand and protein preparation

Both ligands and protein were prepared to be used for docking. The 2D structures of the ligands were built in ChemBioDrawUltra [34] if not available and subsequently energy minimized in MOE [29]. The best P-gp homology model in complex with ligand was protonated, tethered, and minimized by the LigX tool in MOE [29] and the ligand was subsequently removed. The binding site of the target protein was also predicted by the I-TASSER server based on the similar binding site of template protein.

Docking

Potential P-gp inhibitors previously identified by the QSAR models from the two virtual screening databases (in-house and DrugBank) were docked into the ligand-binding pocket of the P-gp homology model using the FlexX package in LeadIT 2.0.2 [74] to provide insights into molecular recognition via protein–ligand interactions. In this process, the triangle matching algorithm was chosen for base fragment placement, while the maximum numbers of solutions per iteration and per fragmentation were set to 1000 and 200, respectively. Compounds having both good predicted pIC\(_{50}\) values and good docking scores were considered more likely to be hits for P-gp inhibitory activity.

Results and discussion

2D-QSAR models

A total of 1628 MOE and PaDEL descriptors were initially computed for the whole dataset of 499 compounds. Among them, five descriptors (Kier1, Kier2, Kier3, apol, and bpol) that were identified as duplicates were eliminated. All 400 training and internal test set compounds with no missing attributes were used for feature selection. The feature reduction in RapidMiner, comprising a raw filter by the Remove Useless Attributes and Remove Correlated Attributes operators and GA-based optimization by the Optimize Selection (Evolutionary) operator, resulted in 383 attributes. This number of attributes, which was still too large for model development, was further reduced to 89 attributes with number of folds \(\ge \)10 % by the BestFirst algorithm in Weka. However, only the 34 most relevant attributes with number of folds \(\ge \)80 % (Supplementary Material 1, SM1-Table 1) were chosen to generate the machine learning models. All parameters in the variable selection process were set to the default.

Out of the selected attributes, the ones computed by PaDEL were in the majority compared with the ones computed by MOE (27/7) to represent for the whole dataset. MOE descriptors belonged to three classes: Adjacency and Distance Matrix Descriptors (2), Atom Counts and Bond Counts (1), and Partial Charge Descriptors (4), while PaDEL descriptors belonged to 10 classes: Autocorrelation (4); Barysz matrix (4); Burden modified eigenvalues (9); Atom type electrotopological state (4); Information content (1); Longest aliphatic chain (1); Molecular linear free energy relation (1); Ring count (1); Topological charge (1) and Topological distance matrix (1).

As mentioned above, the Numeric Predictor node with integrated machine learning algorithms was put in a simple stream in Clementine for automated modeling of the relationship between chemical structure (descriptors) and P-gp inhibitory activity (a numeric range target), using 300 training set compounds derived from the diverse and random partitions. Based on the 34 chosen descriptors, all six candidate models as mentioned in the machine learning methods (Neural Network, C&R Tree, CHAID, Regression, Generalized Linear and SVM) were created by default to predict these continuous numeric outcomes for each type of division (Supplementary Material 1, SM1-Table 2). Having the same results in both separation cases, only one (Regression) of the two models (Regression and Generalized Linear) was chosen for further analysis.

The generated models were compared together using the squared correlation coefficient (\({R}^{2})\) which indicates a stronger relationship when being closer to 1. In the diverse partition, the \({R}^{2}\) values in the training set were less than those in the test set in cases of the Regression, SVM, and Neural Network models and vice versa in cases of the CHAID and C&R Tree models. In the random partition, the \({R}^{2}\) value in the training set was less than that in the test set in case of the Neural Network model and vice versa in cases of the CHAID, C&R Tree, Regression, and SVM models. Furthermore, although the \({R}^{2}\) values of all models in the diverse division were approximate to those in the random division in case of the training set, most of these values in the diverse division were considerably greater than those in the random division in case of the test set. The obtained results demonstrated the diverse training set was more appropriate than the random training set to be used for the development of machine learning models.

Based on the \({R}^{2}\) values, the CHAID and C&R Tree models ranked best in both diverse and random training sets (85.80, 85.91, 82.56, and 84.00 %, respectively), while two pairs of models SVM, C&R Tree and Neural Network, C&R Tree performed best in the diverse and random internal tests, respectively (81.21, 81.20, 74.14, and 73.69 %, respectively). The greatest difference in \({R}^{2}\) between the training and internal test sets was found in case of CHAID with the random partition. Conversely, there was no remarkable difference in this metric between the diverse and random training sets as well as the diverse and random internal test sets in case of Neural Network. To avoid such limitations of individual models, the solution of combining predictions from multiple models was put forward in our study.

Apart from Generalized Linear, the remaining models were combined into a single aggregated model using the Ensemble node. To compare each of the individual models with the Ensemble model, the option “Filter out fields generated by ensembled models” was not selected. The ensemble scores for this range target were generated by averaging the scores of the individual models. One frequent benefit of this approach is the ability to yield more accurate predictions than can be gained from any one model [46]. Indeed, the Ensemble model with its aggregated predictions achieved high performance in the diverse training set (\({R}^{2} \approx 0.84, {Q}_{\mathrm{LOO}}^2 \approx 0.70, \left| {{R}^{2}-{Q}_{\mathrm{LOO}}^2 } \right| \approx 0.14, {r}_\mathrm{m}^2 \approx 0.80, {r}_\mathrm{m}^{{\prime }2} \approx 0.64, \overline{{r}_\mathrm{m}^2 }\approx 0.72\), and \(\Delta {r}_\mathrm{m}^2 \approx 0.16\)), while the individual models failed at least one of the mentioned validation criteria, viz \({Q}_{\mathrm{LOO}}^2 \) and \(\left| {{R}^{2}-{Q}_{\mathrm{LOO}}^2 } \right| \) in the CHAID and C&R Tree models; \(\overline{{r}_\mathrm{m}^2 }\) in the Regression, SVM, and Neural Network models; and \(\Delta {r}_\mathrm{m}^2 \) in the Regression model (Table 1). All six models had \(\left| {{R}^{2}-{R}_{\mathrm{Yi}}^2 } \right| \ge 0.2\) and met the criterion \({R}_\mathrm{p}^2 \ge 0.5\) in the y-randomization process (Table 1). Although it was not quite as good as the individual models under some specific conditions, the combination of multiple models performed best overall. The Ensemble model proved to be a robust model that is likely to perform better when applied to other datasets in general terms, without having to dig deeply into the specifics of any one model.
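The averaging rule used for the range target can be sketched in a few lines. This is only an illustration of the arithmetic, not the Clementine Ensemble node itself; the function name and the example predictions are hypothetical.

```python
def ensemble_average(predictions):
    """Average the predicted pIC50 of several models, compound by compound.

    predictions maps a model name to its list of predicted values; every model
    must score the same compounds in the same order.
    """
    per_model = list(predictions.values())
    n_compounds = len(per_model[0])
    if any(len(values) != n_compounds for values in per_model):
        raise ValueError("all models must predict the same number of compounds")
    return [sum(values) / len(per_model) for values in zip(*per_model)]


# hypothetical predictions of three models for two compounds
scores = {"SVM": [6.0, 7.0], "Neural Network": [5.0, 8.0], "CHAID": [7.0, 6.0]}
print(ensemble_average(scores))
```

Averaging lets errors of opposite sign in the individual models partly cancel, which is one reason an ensemble can satisfy criteria that single models miss.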

Table 1 Internal validation results of the QSAR models generated from the diverse training set

The QSAR models were subsequently validated with the internal and external test sets. For the internal test set, the Ensemble model continued to perform best, with \({Q}_{\mathrm{F}1}^2 \approx 0.83, {Q}_{\mathrm{F}2}^2 \approx 0.83, {Q}_{\mathrm{F}3}^2 \approx 0.81, {r}_\mathrm{m}^2 \approx 0.80, {r}_{\mathrm{m}}^{{\prime }2} \approx 0.64, \overline{{r}_\mathrm{m}^2 } \approx 0.72, \Delta {r}_\mathrm{m}^2 \approx 0.16\) and CCC \(\approx \) 0.90 (Table 2). For the external test set, the best model of all was, surprisingly, SVM, whose squared correlation coefficient on the diverse training set (\({R}^{2} \approx 0.74\)) was higher only than that of Neural Network (\({R}^{2} \approx 0.73\)) (Table 1). In this case, the performance of the Ensemble model was comparable to that of the best model, SVM, with \({Q}_{\mathrm{F}1}^2 \approx 0.83, {Q}_{\mathrm{F}2}^2 \approx 0.82, {Q}_{\mathrm{F}3}^2 \approx 0.83, {r}_{\mathrm{m}}^2 \approx 0.82, {r}_\mathrm{m}^{{\prime }2} \approx 0.67, \overline{{r}_\mathrm{m}^2} \approx 0.74, \Delta {r}_{\mathrm{m}}^2 \approx 0.16\), and CCC \(\approx 0.90\) (Table 3). These results demonstrated the benefits of combining multiple models for handling diverse problems in biological data modeling.

Table 2 Validation results of the QSAR models on the internal test set
Table 3 Conventional validation results of the QSAR models on the external test set

Gramatica and Chirico [63] in their quantitative analysis revealed the importance of evaluating the scatter plot of experimental and predicted data to discover unacceptable QSAR models hidden by good statistical values. The plots of regression of predicted \(\mathrm{pIC}_{50}\) values by observed \(\mathrm{pIC}_{50}\) values on P-gp for the Ensemble model are displayed with error bars in Fig. 2 in cases of the training set (a), the internal test set (b), and the external test set (c). The points clustered along the diagonal visually indicated a good correlation between observed and predicted values for all the datasets. In other words, the generated Ensemble model was a good fit for these data.

Fig. 2 Experimental against predicted P-gp pIC\(_{50}\) plots for the Ensemble model: a in the training set; b in the internal test set and c in the external test set

For decades, ligand-based approaches (SAR, 2D-QSAR, 3D-QSAR, and pharmacophore modeling) have been the main means of discovering and designing novel and potential P-gp inhibitors, because the structure of this protein has not yet been resolved [26]. The molecular mechanisms of MDR modulation in general, and of P-gp inhibition in particular, are not yet fully understood, which has made QSAR studies of P-gp inhibitors challenging [75]. Apart from binary classifiers for predicting P-gp inhibition (inhibitor: 1; noninhibitor: 0), many other 2D-QSAR studies [25, 76–86] treating P-gp inhibitory activity as a range target (e.g., \(\mathrm{pIC}_{50}\), \({pK}_{\mathrm{d}}\)) have been published and are summarized in Table 4. Using methods ranging from simple (multiple linear regression) to complex (neural networks, support vector machine, hybrid), these studies showed satisfactory statistical results and helped to determine the structural features necessary for P-gp-mediated MDR reversal. Nevertheless, the applicability of the reported models to other datasets is questionable because of the limited numbers of molecules investigated.

In our study, the AD determination process identified 7 outliers (C89, C288, C381, C411, C422, C433, and C472) among the 300 diverse training set compounds, and 1 of the 100 internal test set compounds (C358) and 2 of the 99 external test set compounds (C40, C407) residing outside the AD. Without excluding these compounds, two individual models (Regression, SVM) and the Ensemble model were categorized as “good” and three individual models (CHAID, C&R Tree, Neural Network) were categorized as “moderate” based on the MAE-based metrics estimated after removing the 5 % of external test set objects with high residual values (Table 5; Supplementary Material 2, SM2-MAE). Classical \({R}^{2}\)-based metrics and CCC are influenced by the range of the response values and by their distribution around the training/test set mean, which may lead to a wrong conclusion on model acceptability [65]. By determining the allowable error limit from the training set response range, the MAE-based criteria suggested by Roy et al. [65] help to examine the prediction error values. The determined AD and the checked MAE-based criteria in this study showed that the internal and external test set compounds lay within both the chemical and response domains of the training set. In other words, the “good” predictability of our QSAR models was asserted with higher confidence by the combination of conventional and new validation methods.

Table 4 A summary of 2D-QSAR models for P-gp inhibitory activity (range target) in previously published works and the present study
Table 5 Other validation results of the QSAR models on the external test set, using MAE-based criteria applied on 95 % data

Data from different sources that were obtained with the same experimental assay, cell line, and cytotoxic agent, and are therefore relatively comparable, were merged to increase the database size and to ensure sufficient structural diversity for model development. Strict validation criteria also helped to confirm the high quality of the developed QSAR models, especially the Ensemble model, which averaged the scores of the remaining models. Based on the obtained results, this aggregated model is expected to be a promising in silico tool that could be applied in high-throughput screening to yield rapid and accurate predictions of the P-gp inhibitory potency of candidate drugs. Additionally, the present work agreed well with previously published works on the molecular physicochemical properties important for P-gp modulation (Table 4), namely the surface area, the lipophilicity, the longest aliphatic chain, and the positive charge, which were represented in our work by PEOE_VSA_FHYD, PEOE_VSA_FPNEG, and PEOE_VSA_FPPOS; LipoaffinityIndex; nAtomLAC; and PEOE_RPC+, respectively. This information may help to guide synthesis toward structural modifications that enhance activity.
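The averaging step behind the Ensemble model can be illustrated with a short sketch. It assumes an unweighted mean of the individual model outputs for each compound; the function name `ensemble_predict` and the numerical values are hypothetical and are not taken from our results.

```python
from statistics import mean


def ensemble_predict(predictions_by_model):
    """Average per-compound predictions across models (unweighted mean)."""
    columns = list(predictions_by_model.values())
    n_compounds = len(columns[0])
    if any(len(col) != n_compounds for col in columns):
        raise ValueError("every model must predict the same compounds")
    # zip(*columns) groups the predictions of all models for each compound.
    return [mean(values) for values in zip(*columns)]


# Hypothetical predicted pIC50 values for three compounds from five models.
toy_predictions = {
    "Regression": [5.2, 6.1, 4.8],
    "SVM": [5.0, 6.3, 4.9],
    "CHAID": [5.4, 5.9, 4.5],
    "C&R Tree": [5.1, 6.0, 4.7],
    "Neural Network": [5.3, 6.2, 5.1],
}
print([round(v, 2) for v in ensemble_predict(toy_predictions)])
```

Averaging tends to damp the error of any single learner, which is consistent with the Ensemble model being rated at least as well as its best members.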

Homology models

Molecular modeling with the automated I-TASSER server generated four P-gp homology models for the subsequent molecular docking step. The server used four proteins with the PDB ID codes 3g61A [10], 4m1mA [87], 4f4cA [88], and 3g5uA [10] as the top threading templates. The quality estimation is shown in Table 6; TM-score and RMSD values were predicted only for the first model (model 1). Because the correlation between the C-score and the quality of the lower-ranked models was much weaker than that for the first model, the absolute quality (TM-score and RMSD) of the lower-ranked models could not be meaningfully estimated. However, their relative quality could be predicted from their relative rank and C-score information [68].

Table 6 Top 4 models with templates and estimated parameters predicted by I-TASSER

Having the best quality according to the I-TASSER indicators (C-score = 0.58, TM-score = 0.79, RMSD = 8.1 Å, number of decoys = 1850, and cluster density = 0.1667), model 1, which was built from chain A of P-gp from Mus musculus (house mouse; 3g61) [10], was subsequently used to generate a Ramachandran plot. In black and white, this plot shows the most favorable regions in the darkest gray and the less favorable regions in progressively lighter tones (Fig. 3). The statistics indicated that among 1150 nonglycine and nonproline residues, 78.7 % (905 residues), 16.4 % (189 residues), 3.0 % (35 residues), and 1.8 % (21 residues) were found in the most favored regions [A,B,L]; the additional allowed regions [a,b,l,p]; the generously allowed regions [\(\sim \)a,\(\sim \)b,\(\sim \)l,\(\sim \)p]; and the disallowed regions [XX], respectively. The fact that fewer than 90 % of the residues lay in the core regions was expected and could be attributed to the resolution (4.35 Å) of the structure on which the homology model was based. For docking in virtual screening, the binding pocket of the model was also predicted using the same template protein, but from chain B with the ligand 0JZ (cyclic-tris-(R)-valineselenazole/QZ59-RRR) (Fig. 4). This result was fully in accordance with the information reported by Chang et al. [10] about the internal cavity of P-gp responsible for polyspecific drug binding.
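The Ramachandran percentages quoted above follow directly from the residue counts, as the short check below shows. The region labels are used only as dictionary keys, and the 90 % figure is the core-region expectation mentioned in the text.

```python
# Residue counts per Ramachandran region, as reported in the text.
counts = {
    "most favored [A,B,L]": 905,
    "additional allowed [a,b,l,p]": 189,
    "generously allowed [~a,~b,~l,~p]": 35,
    "disallowed [XX]": 21,
}

total = sum(counts.values())  # nonglycine, nonproline residues
percentages = {region: round(100 * n / total, 1) for region, n in counts.items()}

for region, pct in percentages.items():
    print(f"{region}: {pct} %")

# Fraction of residues in the core (most favored) regions versus the 90 % mark.
core_below_90 = percentages["most favored [A,B,L]"] < 90.0
print("core regions below 90 %:", core_below_90)
```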

Fig. 3 Ramachandran plot of the best P-gp homology model, in which the most favored regions, the additional allowed regions, the generously allowed regions and the disallowed regions were labeled [A,B,L]; [a,b,l,p]; [\(\sim \)a,\(\sim \)b,\(\sim \)l,\(\sim \)p] and [XX], respectively. The darker area represented the more favorable phi–psi combination

Fig. 4 P-gp homology model with the ligand-binding site predicted by I-TASSER

In the past, many efforts were made to build homology models of human P-gp from the resolved structures of bacterial pumps such as MsbA [89], Sav1866 [90, 91], and BtuCD [92, 93] to enable structure-based drug design [26]. However, to obtain a high-quality homology model, the template protein must be of high resolution and have a high sequence identity with the target protein [26]. This explains why the crystallization and structure determination of mammalian P-gp were so eagerly awaited. The X-ray structure of mouse P-gp in complex with 0JZ used in this study has a resolution of 4.35 Å, 87 % sequence identity to human P-gp, and \(\approx 100\,\%\) identity of the residues in the binding cavity, except for mSer725/hAla729 [10, 94]. Although this structure has a register shift of one amino acid (Tyr303/Ile302), it appears to be a more suitable template for generating homology models of human P-gp for docking than the newer, corrected crystal structures of mouse P-gp (PDB ID: 4KSB, 4KSC, and 4KSD), which have no bound ligands [95], because a reference ligand (0JZ in this case) is needed to predict the binding site in I-TASSER and to specify this region exactly for docking-based virtual screening in LeadIT.

Fig. 5 Docking images of the two most promising ligands in the P-gp homology model

In silico screening

Bioactivity prediction

The generated QSAR models were applied to two libraries, 87 in-house chalcones and 6874 drug-bank compounds, to predict their bioactivity (\(\hbox {IC}_{50}\) or \(\hbox {pIC}_{50}\)) against the multidrug resistance efflux pump P-gp, with the aim of identifying potential inhibitors with predicted \(\hbox {IC}_{50}\) values \(\le 15\, \upmu \mathrm{M}\) (threshold value) [96]. Based on the 34 descriptors provided in Supplementary Material 1, SM1-Table 1, the compounds with no missing values were rapidly screened for anomalies together with the 499 compounds of the whole dataset used for training and validating the QSAR models, using the Anomaly Detection node in Clementine. The percentage of most anomalous compounds was set to 1 %. In addition, a principal component analysis (PCA) of all these compounds was conducted with a maximum of five components.
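Because the models report activity as either \(\hbox {IC}_{50}\) or \(\hbox {pIC}_{50}\), it helps to make the conversion and the screening threshold explicit. The sketch below assumes the usual convention \(\hbox {pIC}_{50} = -\log _{10}(\hbox {IC}_{50}\) in mol/L); the function names are hypothetical.

```python
import math

THRESHOLD_UM = 15.0  # IC50 threshold for a predicted inhibitor, in micromolar


def ic50_um_to_pic50(ic50_um):
    """Convert an IC50 in micromolar to pIC50 (assumes IC50 in mol/L inside the log)."""
    return -math.log10(ic50_um * 1e-6)


def pic50_to_ic50_um(pic50):
    """Convert a pIC50 back to an IC50 in micromolar."""
    return 10 ** (-pic50) * 1e6


def is_predicted_inhibitor(pic50_pred, threshold_um=THRESHOLD_UM):
    """A compound passes when its predicted IC50 is at or below the threshold."""
    return pic50_pred >= ic50_um_to_pic50(threshold_um)


print(round(ic50_um_to_pic50(THRESHOLD_UM), 3))  # threshold expressed as pIC50
print(is_predicted_inhibitor(7.0))  # IC50 of 0.1 micromolar
print(is_predicted_inhibitor(4.0))  # IC50 of 100 micromolar
```

Under this convention the 15 µM threshold corresponds to a \(\hbox {pIC}_{50}\) of about 4.82, so a predicted \(\hbox {pIC}_{50}\) of 7 or more lies well inside the inhibitor range.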

After the preliminary screening, 73 potential anomalies were detected among the 6763 drug-bank compounds with no missing attributes (Supplementary Material 2, SM2-AD). The distribution of the compounds from the different databases along the first two principal components is presented in Supplementary Material 1, SM1-Fig. 1. The first scatter plot (Supplementary Material 1, SM1-Fig. 1a) supported the elimination of the anomalous compounds, which were somewhat separated from the others in the two-dimensional space. The second scatter plot (Supplementary Material 1, SM1-Fig. 1b) showed that the 87 in-house chalcones were dispersed within the chemical space of the 499 compounds used to develop the prediction models, whereas the third scatter plot (Supplementary Material 1, SM1-Fig. 1c) showed many points scattered outside this area. In other words, the QSAR models are likely to succeed in predicting the bioactivity of the in-house chalcones but may predict inaccurately for some compounds in the DrugBank database.

The detailed results of bioactivity prediction on the two databases by all generated QSAR models can be found in Supplementary Material 2 (SM2-BAP). Prediction with the best model (Ensemble) identified 22/87 in-house chalcones and 2374/6690 drug-bank compounds with predicted \(\hbox {IC}_{50}\) values less than or equal to the threshold value of \(\hbox {15}\, \upmu \mathrm{M}\), which were therefore considered P-gp inhibitors. Based on their substituents, these chalcones could be categorized into five groups: 2\(^{\prime }\)-hydroxy, 4\(^{\prime }\)-bromo, methoxy, amino, and heterocyclic ring A (Supplementary Material 1, SM1-Table 3). Among the 44 drug-bank compounds with predicted \(\hbox {pIC}_{50}\) values \(\ge \)7 (Supplementary Material 1, SM1-Table 4), Amiodarone, Tacrolimus, Indinavir, Ritonavir, Cyclosporine, and Saquinavir are well-known inhibitors of P-gp according to the U.S. Food and Drug Administration [97].

All 22 hit chalcones and their predicted activity values were used to formulate structure–activity relationship (SAR) hypotheses for P-gp inhibition: (i) an ortho-hydroxyl group on ring A of the chalcone scaffold decreases the bioactivity (F7 and F29), whereas a para-halogen substituent (bromo) on this ring has the opposite effect (F18 and F14); (ii) replacing the methoxy group with a dimethylamino group at the para position of ring B increases the inhibitory activity (F4 and F18); (iii) a heterocyclic ring A is not necessary for good inhibitors (F18 and F65); (iv) methoxy substituents at the ortho and para positions of ring A (F29, F33, and F36) have a positive influence on inhibitory activity; and (v) the inhibitory potency is enhanced by a halogen substituent (chloro) at the ortho position (F44 and F45) or a methoxy group at the ortho or meta position (F4, F5, and F7) of ring B, with ortho substitution being better (F35, F36, and F37).

Molecular docking

A set of 22 in-house chalcones and 44 drug-bank compounds selected by the in silico prediction was docked to determine their possible binding modes and affinities with the best homology model of P-gp, using FlexX in LeadIT. In addition, the 65 remaining chalcones and Reserpine, Tariquidar, and Elacridar (small-molecule inhibitors of P-gp) [22] were also docked for comparison. With no flips, protonation, or torsions applied, the binding site of the protein was defined by choosing the ligand in the predicted binding site (0JZ) as a reference and including the amino acids within a radius of 6.5 Å. A total of 120 small molecules, comprising 87 chalcone derivatives and 33 drug-bank compounds, were successfully docked into the active site of the receptor. However, the docking algorithm failed on the 11 other drug-bank compounds because of their complex and bulky structures. The detailed docking results can be found in Supplementary Material 2 (SM2-MD).
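The idea of defining a binding site from a reference ligand and a distance cutoff can be sketched as below. This is an illustration only and does not reproduce the internal procedure of LeadIT: it assumes that a residue is selected when any of its atoms lies within the cutoff of any ligand atom, and the function name, residue labels, and coordinates are hypothetical.

```python
import math


def residues_near_ligand(protein_atoms, ligand_coords, radius=6.5):
    """Return sorted residue labels having at least one atom within `radius`
    (in angstroms) of any reference-ligand atom.

    protein_atoms: iterable of (residue_label, (x, y, z)) tuples.
    ligand_coords: list of (x, y, z) tuples for the reference ligand.
    """
    selected = set()
    for residue, coord in protein_atoms:
        if residue in selected:
            continue  # residue already included through another atom
        if any(math.dist(coord, lig) <= radius for lig in ligand_coords):
            selected.add(residue)
    return sorted(selected)


# Hypothetical coordinates, chosen only to exercise the cutoff.
toy_protein = [
    ("PHE194", (0.0, 0.0, 0.0)),
    ("GLN195", (6.0, 0.0, 0.0)),
    ("LEU65", (8.0, 0.0, 0.0)),
    ("ALA1", (20.0, 0.0, 0.0)),
]
toy_ligand = [(1.0, 0.0, 0.0)]
print(residues_near_ligand(toy_protein, toy_ligand))
```

Enlarging the radius admits more residues into the site definition, which widens the search space for docking at the cost of longer run times.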

With docking scores ranging from −23.853 to −10.828 kJ/mol, the in-house chalcones showed good binding to the internal cavity of the P-gp homology model. The consensus of the prediction and docking results suggested F45 as the most promising chalcone for inhibiting P-gp function (Supplementary Material 1, SM1-Table 3). Despite having predicted \(\hbox {IC}_{50}\) values >15 \(\upmu \)M, F57, F53, and F55, with docking scores of −23.853, −21.750, and −21.483 kJ/mol, respectively, were the derivatives with the highest binding affinities for P-gp among all 87 chalcones. The drug-bank compounds generally had higher docking scores, indicating weaker binding than the chalcone derivatives, although they had significantly lower predicted \(\hbox {IC}_{50}\) values (Supplementary Material 1, SM1-Table 4). Apart from the compounds that could not be docked, 7/33 drug-bank compounds had positive docking scores. Like F45, some drug-bank compounds such as DB01578, DB02785, DB02009, DB04378, and DB01232 stood out in both the prediction and the docking results. Compared with the docking scores of Reserpine, Tariquidar, and Elacridar, which were −3.465, −13.262, and −22.858 kJ/mol, respectively, hit compounds from the DrugBank database (e.g., DB01578, DB02785, DB02009, and DB04378) and our private database (e.g., F44 and F45) emerged as potential candidates to modulate P-gp-mediated drug efflux and reverse MDR in cancer cells.
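One simple way to combine predicted activity with docking scores is a rank-sum consensus, sketched below. The text does not state how the consensus was formed, so the rank-sum scheme is an assumption made for illustration; the function name, compound labels, and values are hypothetical.

```python
def consensus_rank(compounds):
    """Order compounds by the sum of their activity rank and docking rank.

    compounds: dict mapping name -> (predicted_pIC50, docking_score_kJ_per_mol).
    A higher pIC50 and a lower (more negative) docking score both rank better.
    """
    by_activity = sorted(compounds, key=lambda n: compounds[n][0], reverse=True)
    by_docking = sorted(compounds, key=lambda n: compounds[n][1])
    rank_sum = {
        name: (by_activity.index(name) + 1) + (by_docking.index(name) + 1)
        for name in compounds
    }
    # Ties in the rank sum are broken alphabetically for reproducibility.
    return sorted(compounds, key=lambda n: (rank_sum[n], n))


# Hypothetical values, not taken from the supplementary tables.
toy = {
    "cpd_A": (7.0, -21.0),
    "cpd_B": (6.5, -5.0),
    "cpd_C": (5.0, -22.0),
    "cpd_D": (4.5, 3.0),
}
print(consensus_rank(toy))
```

In this toy set, cpd_C docks best but is predicted to be weakly active, so cpd_A leads the consensus. The same situation arises for F57, F53, and F55, which dock best among the chalcones yet have predicted \(\hbox {IC}_{50}\) values above the threshold.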

Docking poses are illustrated in Fig. 5 (F45 and DB01578) and Supplementary Material 1, SM1-Fig. 2 (other ligands) for the three in-house chalcones with the highest predicted \(\hbox {pIC}_{50}\) (lowest predicted \(\hbox {IC}_{50}\)) values (F45, F17, and F44); the three in-house chalcones with the lowest docking scores (F57, F53, and F55); the three drug-bank compounds with the highest predicted \(\hbox {pIC}_{50}\) (lowest predicted \(\hbox {IC}_{50}\)) values (DB01118, DB01578, and DB01362); the three drug-bank compounds with the lowest docking scores (DB01232, DB02785, and DB01578); and three P-gp inhibitors (Reserpine, Tariquidar, and Elacridar). These images indicated the key residues responsible for receptor–ligand binding interactions (hydrogen bonding, arene–cation, and hydrophobic): His61, Gln946, Tyr950, Leu65, Phe194, and Gln195 for the chalcones, and Gln347, Phe343, Phe194, Gln195, Ser344, and Gln946 for the drug-bank compounds. The amino acids Phe194 and Gln195 in TM3 and Gln946 and Tyr950 in TM11 reaffirmed that the drug-binding pocket is located at the TM3/TM11 interface, as previously indicated by Chiba et al. [14]. Other methods, such as photoaffinity labeling, site-directed mutagenesis, thiol probes, and co-crystal structural studies of P-gp in the presence of the mentioned ligands, could be used to validate the residues identified by our docking study.

Conclusions

In this work, we applied several machine learning methods to a large and diverse database of compounds with comparable bioactivity values (\(\hbox {IC}_{50}\)) to develop QSAR models for the ligand-based design of P-gp inhibitors. Internal and external validation with strict criteria supported the predictivity of these models, especially the Ensemble model. In addition, high-quality homology models of P-gp were generated for structure-based design. Based on these in silico tools, virtual screening was applied to two databases, one in-house and one from DrugBank, to discover new P-gp-mediated MDR reversal agents. Finally, structural optimization toward more potent and safer drug candidates is required before further experimental investigations.

Supplementary material

The supplementary material associated with this article is presented in one PDF file (additional tables and figures) and one XLSX file (databases, external validation, anomaly detection, bioactivity prediction, and molecular docking results).