Introduction

The mutation or overexpression of EGFR (epidermal growth factor receptor) has been observed in many human tumors, and is under intense investigation as a novel anticancer molecular target [1]. The EGFR pathways can lead to the initiation of cancer proliferation and increased neoangiogenesis and metastatic potential. EGFR inhibition by inhibitors such as erlotinib and gefitinib results in a relatively effective treatment for patients for a while [2]. However, many patients will ultimately manifest disease progression because of drug resistance due to T790M mutation in the EGFR kinase domain especially in squamous head, neck carcinomas, ovarian, lung, and breast cancers [3]. Despite the rapid advances in EGFR oncology therapeutics over the past decade, substantial room for improvement remains. Several challenges exist in the effective inhibition of T790M clinically. Previously designed EGFR inhibitors did not demonstrate distinctly improved efficacy against T790M mutations which is partially attributed to the dosage limitation imposed by the drugs’ toxicity [4]. The design of EGFR-T790M inhibitors with high activity and low dosage has proven to be quite challenging since only few reports exist [5]. Therefore, more studies are needed to find potent, promising, and comparably efficacious inhibitors against EGFR-T790M resistance tumor. Recently, some novel quinazoline derivatives have been shown to selectively inhibit EGFR-T790M with high potency and good binding affinity [6] which prompted us to explore the relationship between the structures of quinazoline derivatives and EGFR-T790M inhibition. The cost-effective computational routines can speed up the procedure of drug design by predicting the biological activity of drug-like molecules and identifying receptor–ligand interactions [7]. Molecular docking and QSAR (quantitative structure activity relationship) techniques can be used to guide lead optimization and in silico evaluation for EGFR-T790M inhibitors. This may help find and design new targets that fall within applicability domain.

To the best of our knowledge, no QSAR study has yet been reported on quinazoline-based EGFR-T790M inhibitors. The main aim of the present contribution was to develop validated QSAR models to predict the inhibition profiles of quinazolines. In silico evaluation was performed keeping the following points in mind. First, molecular docking studies have been used to identify the position and orientation of ligands (inhibitors) in the active pocket of the EGFR-T790M receptor and to facilitate an interpretation of receptor–ligand molecular interactions. Second, various classes of molecular descriptors were calculated from the optimum ligand conformation obtained from docking results. Third, a straightforward non-linear descriptor selection method was used based on a GA-ANN (genetic algorithm-artificial neural networks) strategy to investigate the non-linear dependencies between descriptors and inhibitory activities [8]. Fourth, an interpretable MARS (multivariate adaptive regression spline) algorithm was employed to find sub-region equations and spline functions describing the non-linearity in selected descriptors [9]. Fifth, as a critical step to evaluate the predictive power and reliability of QSAR models, internal and external validations were performed using an external prediction set, leave-one-out (LOO) cross-validation (C.V.), leave-some-out (LSO) procedure, diversity analysis, Y-scrambling, and applicability domain together with other statistical parameters.

Methods and materials

Biological data set and molecular optimization

In the present QSAR analysis, a series of fifty-three quinazoline-based EGFR-T790M inhibitors, together with their inhibitory concentrations, were taken from the literature [6]. The inhibitory concentration \((\hbox {IC}_{50})\) value was converted to its corresponding logarithmic scale \(p\hbox {IC}_{50}\) value which was taken as the dependent parameter for QSAR analysis. The \(\hbox {IC}_{50}\), an inhibition activity parameter, is a measure of a compound’s inhibitory effectiveness and refers to the concentration required for a compound to reduce the level of EGFR activity by 50 %. The structures of EGFR inhibitors and their corresponding \(p\hbox {IC}_{50}\) values are presented in Table 1. The 2D structures of the inhibitors in Table 1 were sketched and their 3D structures were subjected to energy minimization using the molecular mechanics force field (MMFF) method implemented in the Molecular Operating Environment (MOE2014.09, Chemical Computing Group, Inc.) package with a convergence criterion of 0.01 \(\hbox {kcal mol}^{-1}\) and partial atomic charges. The optimized structures were used for molecular docking in the next section.

Table 1 Experimental and calculated inhibitor data using GA-ANN/MARS strategy for quinazoline derivatives together with their structures used in QSAR study

Energy minimization and molecular docking

The coordinates of the EGFR crystal structure were obtained from the Protein Data Bank (www.rcsb.org) (PDB ID: 2JIU). Molecular docking studies were carried out using MOE2014.09 in order to explore the interaction mechanism and probable binding mode of quinazoline-based EGFR inhibitors at the active pocket of the 2JIU protein crystal. The energy minimization algorithm of MOE was used to minimize the energy of the protein using the MMFF94x force field with conjugant gradient method [10]. For molecular docking, the ligand was considered fully flexible, while the protein was considered rigid. All the torsion angles in the inhibitors were also set free to perform flexible docking with a grid-point spacing of 0.3 Å. The docking calculations were performed by setting a \(70\times 70\times 70\) Å\(^{3}\) grid map centered in the binding pocket of EGFR using all the amino acid residues within the grid map. A total of 50 conformations were saved for each ligand using the default parameters in MOE. Simultaneously, all appropriate conformations and orientations were predicted with the best binding free energy. Then, the best docked conformation for each inhibitor was used to calculate molecular descriptors.

Molecular descriptor generation

Different types of molecular descriptors (constitutional, one dimensional, two dimensional, and three dimensional) were used for QSAR analysis. These molecular descriptors encoded significant structural features of molecules. A total of 287 descriptors were calculated using the E-Dragon software [11] based on molecular structures and the quantum theory of atoms in molecules after removing invariable descriptors for all molecules and correlated descriptors (R > 0.9). The list of these descriptors, their meaning, and the calculation procedures are provided in the literature [12]. These descriptors can be categorized into eighteen different descriptor families: Constitutional, Topological, 2D autocorrelations, Walk and path counts, Connectivity indices, Information indices, Edge adjacency indices, Randic molecular profiles, Geometrical descriptors, RDF, 3D-MoRSE, BCUT, Topological charge indices, WHIM, GETAWAY, Charge, Molecular properties, and Eigenvalue-based indices.

Fig. 1
figure 1

Flowchart of our GA-ANN/MARS strategy together with validations used in the present QSAR study

Descriptor selection and modeling tools

Descriptor selection is one of the most crucial steps in the development of QSAR models [13]. For the selection of m descriptors from the descriptor poll (n descriptors), there are \(\frac{n!}{m!(n-m)!}\) possible combinations. Many conventional descriptor selection methods provide one solution which may fall into a local minima and miss the global minima. Despite these methods, GA is an evolutionary stochastic method that can find the population of solutions. In each population, the chromosome of length k (the number of features) consists of zeros and ones indicating selected descriptors. The population in each generation is the result of the genetic manipulation of the chromosome found in the previous population through one-point crossover and mutation based on their fitness scores. This cyclic process was performed to obtain the best fitness score or satisfactory result. In the present study, artificial neural network (ANN) technique is defined as the fitness function in the GA due to its ability to measure a non-linear dependency between the descriptors (in the chromosome) and target variable \((p\hbox {IC}_{50})\). To calculate the fitness score, a multi-objective fitness function was used to account for both the residual errors and the number of selected descriptors according to the following equation:

$$\begin{aligned} \mathrm {Fitness}\,\mathrm {score}=\frac{1}{\mathrm{{RMSE}}_\mathrm{Train}+\mathrm{{RMSE}}_\mathrm{Test} +\sqrt{k}}, \end{aligned}$$
(1)

where \(\hbox {RMSE}_{\mathrm{Train}}\) and \(\hbox {RMSE}_{\mathrm{Test}}\) are the root mean square errors of the train and test sets, respectively, and k is the number of selected descriptors in the corresponding chromosome. Previous reports of GA-ANN strategies [14] were modified for effective descriptor selection as shown in Fig. 1. It should be mentioned that the Levenberg–Marquardt weight update function [15] was used in ANN due to its fast and remarkable performance. For more accurate descriptor selection, the dataset was divided into five subsets (S1, S2, S3, S4, and S5) as shown in Table 1, and a variable selection process was performed for different combinations of these subsets. Four subsets were used for calibration (train and test subsets) in developing the model and one subset was used for predicting and evaluating the generated model. The ANN models were developed using the train set, while the test set was used to prevent over-fitting. Ten possible combinations of train, test, and prediction subsets were used in the present study for descriptor selection. The most frequent selected descriptors were chosen as the most important descriptors describing the variance of the inhibitory activities.

For modeling purpose, MARS was employed as a simple and interpretable technique [16]. The main idea of MARS is dividing the whole space of each independent descriptor into various sub-regions and then defining a different mathematical equation for each region. This makes the MARS regression technique more flexible than the other regression techniques. For each descriptor, some break points (knot points) are defined using right and left side splines. These two left and right spline functions (basis functions) are used for relating the independent descriptors to the dependent variable. Generalized cross-validation (GCV) was employed to determine the importance of each basis function. Moreover, the MARS algorithm can be used to describe the non-linearity with little data preparation. The theory of this algorithm has been described elsewhere [9]. MATLAB Version 11.1 (Mathwork) [17] was used to run the GA-ANN algorithm and MARS toolbox.

Fig. 2
figure 2

a Docking pose of the most potent inhibitor (compound 38) in the EGFR-T790M active site. b 2D interaction of receptor compound 38 and surrounding amino acid residues

Results and discussion

As a necessary step in drug design, molecular docking studies can be used to identify receptor–ligand interactions in a protein’s binding pocket. The inhibitors listed in Table 1 were docked into the EGFR receptor using MOE. The most important interactions of EGFR with the most potent inhibitor (compound 38) from our molecular docking results are shown in Fig. 2a. A graphical receptor–ligand interaction is shown in Fig. 2b. As this figure shows, there are two hydrogen bonds between Glu762 and the inhibitor (compound 38). In addition, the amide group and ethyl ether also form two hydrogen bonds with Glu758. Another important hydrogen bond is between Asp761 and the N atom in the quinazoline ring. The inhibitor approaches residues Ala871 and Val876 from the \(\upalpha \)-face and makes direct non-polar contacts (van der Waals interaction); however, residues Tyr869, Gly873, Gly874, and Glu872 make polar contacts with the inhibitor. The two quinazoline rings can make a \(\uppi -\uppi \) interaction with Phe856. The polar interactions between the halide atoms (F & Cl) and residues Gly863 and Glu866 have been detected in docking results as shown in Fig. 2b. Lastly, the optimized conformation for each inhibitor was obtained from the receptor–ligand docked results. The obtained conformations of the inhibitors were used for molecular descriptor generation and further QSAR studies.

Fig. 3
figure 3

Scatter plot of five subsets (S1–S5) obtained from diversity analysis

In the descriptor selection procedure, a GA-ANN strategy was used with a population of 30 chromosomes for 1000 generations. All parts of this process are illustrated in Fig. 1. As mentioned before, the dataset was randomly split into five subsets (S1, S2, S3, S4, and S5) in such a way that they consisted of all range of \(p\hbox {IC}_{50}\) values considering chemical diversity. A diversity analysis was performed as illustrated in Fig. 3 to show that the inhibitors in each subset (S1–S5) can represent the whole data. In this figure, the mean distances of inhibitors were plotted against \(p\hbox {IC}_{50}\) illustrating the diversity of the inhibitors in the five subsets. The descriptor selection and model generation will be more stable using the diversity analysis. For ten possible combinations of subsets (10 runs), the whole GA-ANN process was carried out using different calibration (train and test sets) and prediction sets. The statistical results of the calibration and prediction sets are shown in Table 2 for ten runs using the GA-ANN algorithm. The average values of \(R_{\mathrm{Cal}}^{2}\), \(R_{\mathrm{p}}^{2}\), \(\hbox {RMSE}_{\mathrm{Cal}}\), and \(\hbox {RMSE}_{\mathrm{p}}\) are 0.793, 0.735, 0.438, and 0.544, respectively, which indicate that the performance of the GA-ANN is acceptable. The selected descriptors in each run and the repetition frequency for each descriptor in the GA-ANN are shown in Table 2. Each descriptor was at least repeated in seven runs from ten runs, which shows that the descriptor selection process is consistent and reliable. Inspection of this table also reveals that the five RDF030p descriptors (RDF descriptor, Radial Distribution Function weighted by polarizability), G2s (WHIM descriptor, 2nd component symmetry directional WHIM index weighted by electrotopological state), Mor04v (3D-MoRSE descriptor, 3D-MoRSE weighted by van der Waals volume), MATS2e (2D autocorrelation descriptor, Moran autocorrelation weighted by electronegativity) and RDF055m (RDF descriptor, Radial Distribution Function weighted by mass) appeared more frequently in 10 runs compared to the other descriptors.

The information about the selected descriptors is easily accessible from the x, y, z coordinates of the inhibitor atoms and other quantities attached to these coordinates, such as molecular volume, electronegativity, polarizability, covalent radii, interatomic distances, and distances from a specified origin. They reflect aspects related to molecular size, shape, charge, and the steric interactions. Hence, they encode different aspects of their inhibition mechanism. RDF030p and RDF055m are the radial distribution functions on a spherical volume of radius of 3.0 and 5.5 angstroms weighted by atomic polarizability and atomic mass, respectively. These descriptors are based on geometrical interatomic distance and provide valuable information about interatomic distances and ring types [18]. G2s is the geometric mean of the directional symmetries and it estimates dispersion and distribution of molecular electronic and topological state properties around the second component symmetry geometric center. This WHIM descriptor is built in such a way as to capture relevant molecular 3D information regarding electrotopological state. It should be mentioned that the electrotopological state is a measure of the electronic accessibility and can be interpreted as a probability of interaction with another molecule [19]. The Mor04v descriptor is among the 3D-MoRSE descriptors. In order to calculate this descriptor, the sum of all atomic van der Waals volumes is calculated using different angular scattering functions. The Mor04v descriptor shows the three-dimensional arrangement of the atoms influencing the size and shape of a molecule [20]. The MATS2e descriptor can be calculated using the Moran algorithm by summing up the products of atomic electronegativity of the terminal atoms in molecular structural graph. This descriptor encodes information of adjacency and distance of atoms and also describes how electronegativity is distributed along the topological structure of molecules [21]. To ensure that the selected descriptors encode different aspects of inhibitors, the correlation between the selected descriptors is shown in Table 3. According to this table, there are no significant correlations between these descriptors except for the RDF030p and RDF055m descriptors (correlation = 0.582).

Table 2 Results of ten runs and the descriptors selected in each run using GA-ANN strategy
Table 3 Correlation matrix of descriptors
Table 4 Coefficients of basis functions for each descriptor in MARS model
Fig. 4
figure 4

GA-ANN/MARS plot of calculated versus experimental \(p\hbox {IC}_{50}\) values for the calibration and prediction sets

Table 5 Statistical results of QSAR models

The five selected descriptors by the GA-ANN algorithm were used as inputs for developing the final MARS model to predict the \(p\hbox {IC}_{50}\) values of the fifty-three quinazoline-based EGFR inhibitors. Each descriptor in the MARS model has one or two basis functions. The coefficients, importance, and P values of these basis functions are given in Table 4. Each basis function in the MARS model is a regression equation with specific coefficients. One can easily calculate the relationship between the selected molecular descriptors and the inhibitory concentration of the quinazoline inhibitors using these basis functions. The importance of basis functions can be evaluated using the GCV parameter as shown in Table 4. The ranking order of the GCV values for the selected descriptors is RDF030p > MATS2e > Mor04v > RDF055m > G2s > G2s-RDF055m. The five selected descriptors, together with G2s-RDF055m interactions, have the highest effects on \(p\hbox {IC}_{50}\) values. As Table 4 shows, all descriptors are statistically significant with P values less than 0.05. The performance of the GA-ANN/MARS strategy was evaluated by plotting the predicted \(p\hbox {IC}_{50}\) versus experimental \(p\hbox {IC}_{50}\) for calibration and prediction sets as illustrated in Fig. 4. The predicted values of \(p\hbox {IC}_{50}\) by the GA-ANN/MARS strategy are listed in Table 1. This table, together with Fig. 4, shows that the predicted \(p\hbox {IC}_{50}\) is in agreement with the experimental \(p\hbox {IC}_{50}\).

Fig. 5
figure 5

Applicability domain plot for the calibration and prediction sets

The statistical performance and validation of the QSAR models were judged by means of (a) the external prediction set; (b) internal validation by LOO cross-validation; (c) complementary cross-validation by LSO procedure; (d) examining if the following conditions are satisfied: (1) \(R^{2}_{\mathrm{p}}> 0.6\), (2) \(Q^{2}_{\mathrm{LOO}}>\) 0.5, (3) \(r^{2}_{\mathrm{m}}> 0.5\), (4) \(r^{2}_{0}\) (i.e., imply regression through the origin) is close to \(R^{2}_{\mathrm{p}}\) such that \([(R^{2}_{\mathrm{p}}-r^{2}_{0})/R^{2}_{\mathrm{p}}]<0.1\); and (e) Y-scrambling [22, 23]. The results of these assessment procedures are given in Table 5 for the GA-ANN/MARS strategy. It should be noted that the external prediction set was not used in any step of the model development. The RMSE (=0.332) and \(R^{2}\) (=0.881) values for the prediction set indicate that 88.1 % of the variance in the inhibitory concentration of the quinazolines can be explained by the QSAR model. The internal validation was conducted by the LOO technique to ensure the reliability of the QSAR model. According to Table 5, it can be noticed that the \(Q^{2}_{\mathrm{LOO}}\) (=0.923) and \(\hbox {RMSE}_{\mathrm{LOO}}\) (=0.253) values are comparable. In the case of LSO, 80 % of the data were used to develop the QSAR model and the remaining 20 % of the data were used to evaluate the generated model. This repetitive process was performed 500 times with random subsets selected. The overall mean for this process \((R^{2}_{\mathrm{LSO}}=0.828\) and \(\hbox {RMSE}_{\mathrm{LSO}}=0.436)\) indicates that the constructed model is stable and valid. Other statistical parameters \(([(R^{2}_\mathrm{p}-r^{2}_{0})/R^{2}_{\mathrm{p}}]=0.017\) and \(r^{2}_{\mathrm{m}}=0.772)\) have acceptable values for the QSAR models. These results suggest that both interpolations and extrapolations of the inhibitory concentrations in the QSAR model are reasonably adequate. Y-scrambling was performed by the analysis of 200 cases of repetitive randomization of the response data \((p\hbox {IC}_{50})\). The low \(R^{2}_\mathrm{S}\) (=0.275) (scrambled results) and high \(\hbox {RMSE}_{\mathrm{S}}\) (=3.629) indicate that the QSAR model is not due to chance correlation. For the sake of comparison, a GA-PLS/MARS technique [24] was also used to develop the QSAR model. The statistical results of the GA-ANN/MARS strategy are better than those of the GA-PLS/MARS technique, especially the results of \(\hbox {R}^{2}_{\mathrm{p}}\) and \(\hbox {Q}^{2}_{\mathrm{LOO}}\) (Table 5). It is clear that the RMSE of the prediction set has been reduced by about 50 % using a GA-ANN/MARS strategy. Furthermore, the statistical parameters of \(R^{2}_{\mathrm{p}}\), F, \([(R^{2}_{\mathrm{p}}-r^{2}_{0})/R^{2}_{\mathrm{p}}]\), and \(r^{2}_\mathrm{m}\) have also been improved using this strategy.

Table 6 Chemical structures of newly designed EGFR-T790M inhibitors based on QSAR models

In this work, we use a leverage approach for the applicability domain definition [25]. The applicability domain is a defined squared area that depends on the descriptor values and the dataset. A Williams plot of standardized residuals vs. leverage values (h) was used for an immediate and simple graphical detection of the response outliers. The applicability domain is a measure to validate the reliability of the GA-ANN/MARS strategy for future predictions. The prediction is valid if \(h<h^{*}\) (critical value: \(h^{*}=3\,\,\hbox {p/n}\), where p is the number of descriptors in the model plus one and n is the number of inhibitors in the calibration set). The five selected descriptors in this work were used to calculate the leverage values. In the Williams graph (Fig. 5), the standardized residuals are plotted against the leverage value (h). Here, no compound has been identified as an outlier in prediction set and just two compounds in the calibration set (compounds 3 and 16) have h values higher than \(h^{*}\)(=0.47). Such compounds are somewhat structurally different from the other inhibitors (with substituted amide group at the 4-position of phenyl group in R3 branch as shown in Table 1). Therefore, they are considered as good leverage points since the information that these two compounds encode makes the QSAR model more precise. Other inhibitors in the calibration and prediction sets fall within the applicability domain.

Finally, several new inhibitors were proposed to show the practical application of the developed models. The \(p\hbox {IC}_{50}\) values of the new design inhibitors were predicted using the established QSAR models. Compounds 14, 20, 26, and 38, with the highest inhibitory concentrations, were considered as reference chemicals to design new inhibitors. The R1, R3, and R2 branches in the dataset were modified to propose new inhibitors. The predicted \(p\hbox {IC}_{50}\) values of eight promising new inhibitors, together with their structures, are shown in Table 6. The leverage values (hi) of the proposed inhibitors are lower than the critical value \((h^{*})\), which shows that the predicted \(p\hbox {IC}_{50}\) values are acceptable. These results reveal that the proposed QSAR strategy is good enough to be considered as an alternative way for evaluating new potent EGFR inhibitors.

Conclusions

In the present study, a non-linear descriptor selection strategy was employed to generate QSAR models to predict the inhibitory concentration of fifty-three newly synthesized quinazoline-based EGFR-T790M inhibitors. Conventional descriptor selection methods are relatively inaccurate and they only propose one solution without considering non-linearity. Therefore, a GA-ANN strategy was implemented as the non-linear descriptor selection technique with different subset combinations. A MARS strategy was used to define sub-region equations for each descriptor. Various methods were also used to validate the QSAR models including internal and external cross-validation methods. The ranking of the five molecular descriptors obtained from GA-ANN/MARS strategy is RDF030p > MATS2e > Mor04v > RDF055m > G2s > G2s-RDF055m. Furthermore, a molecular docking analysis revealed important receptor–ligand interactions for the most active inhibitor. Finally, the molecular information obtained from docking and QSAR analysis was employed to propose new inhibitors. The results of this study provide a guideline about the structural features of EGFR-T790M inhibitors, which can be used to design novel inhibitors and predict the inhibitory activity of non-examined chemicals.