Introduction

Diabetes is ranked as the seventh leading cause of death worldwide [1], and about 438 million people are projected to suffer from this disease by 2030 [2]. Type 2 diabetes is the most common form and accounts for about 80–90% of diabetic cases [3]. Two factors have a major influence on diabetic problems. The first is the hormone insulin, which is released by the pancreas and enables the cells to convert glucose into the energy they require; the second is the enzyme α-glucosidase, which breaks long-chain carbohydrates into small ones such as glucose and fructose. In a diabetic patient, insulin is either not released in sufficient amounts or does not work properly, while α-glucosidase continues its activity. This leads to the accumulation of glucose in the blood (hyperglycemia), which can damage different organs of the body, especially the nerves and blood vessels [4]. Therefore, inhibiting the catalytic activity of α-glucosidase is considered a way to control blood glucose levels, particularly in individuals with type 2 diabetes mellitus [5, 6]. Several glycosidic inhibitors of α-glucosidase have been used, such as miglitol [7], voglibose and acarbose [8]. Although these inhibitors are effective, they have side effects such as flatulence, diarrhea and abdominal discomfort, and they have to be used in combination with other medications to increase efficiency [9]. Thus, great attention is paid to discovering or designing novel and efficient inhibitors. Among the various candidates, heterocyclic compounds are notable options. They can be used to synthesize new drugs because they interact favorably with most molecular targets [10]. Pyridine, a heterocyclic compound and a building block of many natural compounds, has received much attention for use in a new generation of drugs [11].

Pharmacists believe that the chemical properties of any segment in a drug depend on its structure, so structural knowledge is required to anticipate pharmaceutical function. To save time and investment in the design of effective medicines, methods more efficient than trial and error are required, and the quantitative structure–activity relationship (QSAR) approach serves as a beneficial computational tool for this purpose. This method establishes a rational relationship between the structure of compounds and their properties and ultimately predicts the biological activities of compounds yet to be prepared. A QSAR model is a mathematical relationship between biological activity and chemical properties of compounds of the form Yi = Fi (X1, X2,…, Xn), where Yi is the dependent variable (IC50) and each Xi is a molecular descriptor serving as an independent variable [12,13,14,15,16].

Molecular docking, as a complementary tool for QSAR modeling, is an advantageous method to calculate descriptors that contain significant structural information about the compounds. Its main role is to explore different orientations of ligands in the active site of the receptor protein. In this way, molecular docking generates a series of complexes and predicts the best orientation for ligand binding [17].

A survey of recent work on α-glucosidase inhibition reveals that different compounds have been studied with computational tools such as QSAR [1, 4, 7, 18, 19]. However, there has been no investigation of arylated hydrazinyl thiazole derivatives that possess inhibitory properties against this enzyme. Hence, in this study, the concentration required for 50% inhibition (IC50) of α-glucosidase was predicted by QSAR models for 35 arylated hydrazinyl thiazole-based pyridine derivatives. Two different modeling methods, namely multiple linear regression (MLR) and least-squares support vector machine (LS-SVM), were used to predict the inhibitory activity. In addition, molecular docking was used to interpret the binding interactions of the compounds and to calculate docking-based descriptors. The compounds were treated as ligands, and molecular docking describes their different binding positions in the active site of the target protein (α-glucosidase). A particular aim of this work was to build the simplest possible model in terms of the number of descriptors. To this end, the QSAR and molecular docking procedures were followed and the resulting models were validated. After the simplest model was found, its statistical results were compared with those of previous works in this field. Moreover, some new compounds with improved inhibitory activities were designed by combining the QSAR and molecular docking results.

Materials and methods

Data set

The data set, taken from the work of Ali et al. [11], is shown in Table 1. It consists of 39 arylated hydrazinyl thiazole-based pyridine derivatives that were synthesized in a two-step reaction. These heterocyclic compounds exhibit favorable α-glucosidase inhibitory activity. In addition, the compounds share structural features, for example the pyridine ring, the thiazole ring and the hydrazine moiety. Notably, they contain the same amidine moiety as the antidiabetic agent “metformin.”

Table 1 Chemical structures, experimental and predicted inhibitory activity (pIC50) values (µM)

In this work, the 4 compounds containing NO2 (ionic compounds) were removed and the remaining 35 derivatives were studied. The IC50 values range from 1.4 to 168 µM. They were converted to their equivalent pIC50 (− log IC50) values. Figure 1 and Table 1 show the chemical structures and experimental inhibitory activity values of these compounds.
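The conversion from IC50 to pIC50 can be illustrated with a minimal sketch. It assumes the common convention that IC50 is expressed in mol/L before the logarithm is taken; the text only states pIC50 = − log IC50, so if the logarithm was applied to the µM values directly, all results would shift by 6 units. The function name is hypothetical.

```python
import math

def pic50_from_micromolar(ic50_um: float) -> float:
    """Return pIC50 = -log10(IC50 in mol/L) for an IC50 given in micromolar.

    Assumption: molar units are used inside the logarithm.
    """
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_um * 1e-6)

# The two ends of the reported IC50 range (1.4 and 168 µM).
for ic50 in (1.4, 168.0):
    print(f"IC50 = {ic50:6.1f} µM -> pIC50 = {pic50_from_micromolar(ic50):.3f}")
```

Under this assumption, the most potent compound in the set maps to the highest pIC50, which is why larger pIC50 values are read as stronger inhibition.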

Fig. 1

The basic structure of different pyridine derivatives under study based on hydrazinyl

Geometry optimization of compounds

Three-dimensional (3D) structures of the compounds were pre-optimized to minimum-energy molecular geometries with the HyperChem package (Ver. 7.0) [20]. The RM1 method was used for this initial optimization. The HyperChem output files were then passed to the Gaussian software [21], and the compounds were optimized with the semi-empirical PM6 method together with a frequency calculation to find the lowest energy level of every compound (its most stable state).

Molecular descriptors calculation and selection

QSAR modeling needs suitable descriptors to describe the relationship between the chemical structure and the activity of the molecules. Various software packages with different theoretical bases are available for this purpose. Here, Dragon software (Ver 7.0) [22] was used to calculate the descriptors. It contains about 4485 descriptors, which are divided into several categories, including topological, geometrical, ring, 2D autocorrelation, GETAWAY (GEometry, Topology and Atom-Weights AssemblY) and physical-property descriptors, and which span zero-, one-, two- and three-dimensional descriptors. In the first step, about 2100 descriptors were calculated for QSAR analysis. If two descriptors have a correlation coefficient higher than 0.9, one of them has to be excluded [23]. All duplicate and zero-valued descriptors are also useless and have to be removed. This reduced the number of remaining descriptors to about 978. To build the final QSAR model, the number of descriptors chosen from this pool should be proportional to the size of the data set [24, 25].
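The pre-filtering step can be sketched as follows. This is an illustrative NumPy sketch, not the procedure used in Dragon: it assumes that, of two descriptors with |r| > 0.9, the one appearing first in the matrix is kept (the text does not say which one is excluded), and the function name and toy matrix are hypothetical.

```python
import numpy as np

def filter_descriptors(X: np.ndarray, r_max: float = 0.9) -> list[int]:
    """Return indices of columns kept after removing constant, duplicate
    and highly inter-correlated descriptors (|r| > r_max)."""
    # 1. Drop constant columns (this also covers all-zero descriptors).
    candidates = [j for j in range(X.shape[1]) if np.ptp(X[:, j]) > 0]
    kept: list[int] = []
    for j in candidates:
        # 2. Keep column j only if it is not too correlated with any kept column.
        #    Exact duplicates have |r| = 1 and are removed by the same test.
        if all(abs(np.corrcoef(X[:, j], X[:, i])[0, 1]) <= r_max for i in kept):
            kept.append(j)
    return kept

# Toy descriptor matrix (35 compounds, 5 descriptors), not the published data.
a = np.arange(35.0)
b = (np.arange(35) % 2).astype(float)
X = np.column_stack([a, 2 * a + 1, np.zeros(35), b, a])
print(filter_descriptors(X))  # indices of the descriptors that survive
```

In the toy matrix, the second column is a linear function of the first, the third is all zeros and the fifth duplicates the first, so only two columns remain.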

Model construction and evaluation parameters

QSAR models were developed using the genetic algorithm (GA) technique. GA provided the final practical descriptors of the model, and GA-MLR as a linear method and LS-SVM as a nonlinear method were applied to construct the QSAR models. To evaluate the model, the data set was divided into two subsets: a test set and a train set. The model is built on the train set, and its efficiency is analyzed through its performance on the test set. The test set was chosen by the y-scrambling method: all compounds were sorted in descending order of activity, and about 20% of the data (7–10 compounds) were chosen as the test set.
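A minimal sketch of an activity-ranked split is given below. The text does not state how the 20% were picked from the sorted list, so the sketch assumes that every fifth compound in the ranking goes to the test set; the function name, the `step` and `offset` parameters and the placeholder activities are all hypothetical.

```python
import numpy as np

def activity_ranked_split(y, step: int = 5, offset: int = 2):
    """Sort compounds by descending activity and send every `step`-th one
    (starting at position `offset`) to the test set."""
    order = np.argsort(-np.asarray(y, dtype=float))  # indices, most active first
    test = order[offset::step]
    train = np.setdiff1d(order, test)  # remaining compounds, sorted by index
    return train, test

y = np.linspace(3.8, 5.9, 35)  # placeholder pIC50 values, not the published data
train_idx, test_idx = activity_ranked_split(y)
print(len(train_idx), len(test_idx))
```

Picking at regular intervals along the ranking keeps the test set spread over the whole activity range, which is the usual motivation for sorting before splitting.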

The model performance was assessed via the leave-one-out (LOO) cross-validation method, the most popular method to evaluate a QSAR model. Given a sample set of n members, each member is set aside in turn, and the model is built on the other n − 1 members. This process continues until every member has been set aside once. Each time, the R2 parameter is evaluated; values closer to unity correspond to less error in activity prediction [26]. For a more thorough validation, which is an indispensable step of QSAR modeling, the applicability domain and other important parameters such as RMSE and F also have to be studied.
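The LOO procedure can be sketched for a linear model as follows. This is an illustrative NumPy sketch on synthetic data, assuming ordinary least squares with an intercept and the total sum of squares taken about the mean of the whole set; it is not the software used in the study.

```python
import numpy as np

def loo_q2(X: np.ndarray, y: np.ndarray) -> float:
    """Leave-one-out cross-validated Q2 for an ordinary least-squares model."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])  # add intercept column
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i  # leave compound i out
        coef, *_ = np.linalg.lstsq(A[mask], y[mask], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2  # squared error on the held-out compound
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(28, 5))  # synthetic descriptors, not the published data
y = X @ np.array([0.5, -0.2, 0.8, 0.1, -0.4]) + 0.05 * rng.normal(size=28)
print(round(loo_q2(X, y), 3))
```

Because every prediction is made for a compound the model has not seen, Q2 is normally lower than the fitted R2 of the same model.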

As stated earlier, the LOO cross-validation procedure is usually applied to verify a QSAR model. The outcome is represented by the cross-validated correlation coefficient (R2), which is calculated with the following formula:

$$ R^{2} = 1 - \frac{{\sum \left( {y_{i} - \hat{y}_{i} } \right)^{2} }}{{\sum \left( {y_{i} - \bar{y}} \right)^{2} }} $$
(1)

Here \( y_{i} \),\( \hat{y}_{i} \) and \( \bar{y} \) are the actual, estimated and averaged (over the entire data set) activities, respectively [25]. According to the literature, a good model should pass the following conditions [19, 27]:

$$ Q^{2} > 0.5 $$
(2)
$$ R^{2} > 0.6 $$
(3)
$$ \left( {R^{2} - R_{0}^{2} } \right)/R^{2} < 0.1\;{\text{or}}\;\left( {R^{2} - R_{0}^{2\prime } } \right)/R^{2} < 0.1 $$
(4)
$$ 0.85 \le k \le 1.15 $$
(5)

Here Q2 is the coefficient of leave-one-out cross-validation, R2 is the squared correlation coefficient, k is the slope of the regression line through the origin, and \( R_{0}^{2} \) characterizes the regression of the predicted activities against the observed activities.
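The conditions in Eqs. (2)–(5) can be checked programmatically, as in the sketch below. The definitions of k and \( R_{0}^{2} \) follow one common convention for regression through the origin (observed against predicted values); the text does not spell out the exact formulas, so they are assumptions here, and the activity values are placeholders.

```python
import numpy as np

def external_validation(y_obs, y_pred, q2: float) -> dict:
    """Evaluate the acceptance criteria of Eqs. (2)-(5) for one model."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r2 = np.corrcoef(y_obs, y_pred)[0, 1] ** 2
    # Slope of the regression line through the origin (assumed: observed vs predicted).
    k = np.sum(y_obs * y_pred) / np.sum(y_pred ** 2)
    # Coefficient of determination of that through-origin regression.
    r0_2 = 1 - np.sum((y_obs - k * y_pred) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)
    return {
        "Q2 > 0.5": bool(q2 > 0.5),
        "R2 > 0.6": bool(r2 > 0.6),
        "(R2 - R0^2)/R2 < 0.1": bool((r2 - r0_2) / r2 < 0.1),
        "0.85 <= k <= 1.15": bool(0.85 <= k <= 1.15),
    }

# Placeholder activities, not values from Table 1.
y_obs = [4.0, 4.5, 5.0, 5.5, 5.8]
y_pred = [4.1, 4.4, 5.1, 5.4, 5.9]
checks = external_validation(y_obs, y_pred, q2=0.8)
print(checks)
```

A model is accepted only if all four entries are true; a single failed condition is enough to question its external predictivity.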

Other important statistical parameters that are required for a proper comparison between different models are S, the standard error of estimation, and F, the Fisher ratio.

The RMSE values are calculated as follows [27]:

$$ {\text{RMSE}} = \sqrt {\frac{{\sum \left( {y_{i} - y_{0} } \right)^{2} }}{{n_{s} }}} $$
(6)

where \( y_{i} \) is the experimental value of the activity, \( y_{0} \) is the value of inhibitory activity predicted by the model, and \( n_{s} \) is the number of molecules in the data set. Lower values of “S” and “RMSE” together with a higher value of “F” mean that the model can forecast the biological activity with lower error, which reveals the high prediction potential of the QSAR models.

Applicability domain

The applicability domain is the theoretical space in which the predictions of a QSAR model are reliable. There are different approaches to determine the applicability domain; here the most common one, the Williams plot, is used. It plots the standardized residuals against the leverage values. The leverage (hi) of each compound and its threshold are defined in Eqs. (7) and (8), respectively. Compounds with a leverage greater than the warning leverage (h*) usually have a great influence on the model. A point to the right of h* with a standardized residual greater than 3 or less than − 3 is regarded as an over-fitted point.

$$ h_{i} = x_{i}^{\text{T}} \left( {X^{\text{T}} X} \right)^{ - 1} x_{i} $$
(7)
$$ h^{*} = \frac{{3\left( {k + 1} \right)}}{n} $$
(8)

In Eq. (7), xi is the descriptor vector of the query molecule and X is the n × k matrix containing the k descriptor values for the n molecules of the train set.

In Eq. (8), k is the number of descriptors in the selected model, and n is defined as the number of objects in the train set [28].
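Equations (7) and (8) translate directly into a few lines of NumPy, as sketched below on synthetic data. The sketch assumes that X holds the raw descriptor values with one row per train compound and no intercept column; whether the descriptors were centered or scaled beforehand is not stated in the text. The function names are hypothetical.

```python
import numpy as np

def leverages(X_train: np.ndarray, X_query: np.ndarray) -> np.ndarray:
    """h_i = x_i^T (X^T X)^-1 x_i for each row x_i of X_query (Eq. 7)."""
    xtx_inv = np.linalg.pinv(X_train.T @ X_train)
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)

def warning_leverage(n_train: int, k: int) -> float:
    """h* = 3(k + 1)/n (Eq. 8)."""
    return 3 * (k + 1) / n_train

rng = np.random.default_rng(2)
X_train = rng.normal(size=(28, 5))  # synthetic descriptors: 28 compounds, 5 descriptors
h = leverages(X_train, X_train)
h_star = warning_leverage(n_train=28, k=5)
print(f"h* = {h_star:.3f}; compounds above h*: {np.flatnonzero(h > h_star)}")
```

For a Williams plot, these leverages are drawn on the horizontal axis against the standardized residuals, with a vertical line at h* and horizontal lines at ± 3.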

Molecular docking

Molecular docking is an accurate approach to predict the binding affinity and orientation of ligands with respect to the target molecule, which in our study is the enzyme α-glucosidase. Since the 3D structure of the protein was not available in the Protein Data Bank, homology modeling was applied as an alternative. This method predicts the structure of an unknown protein from the structure of similar proteins of the same family [29]. In this study, homology modeling was carried out with the template 3A47 [30].

Molecular docking was then run with the AutoDock 4.2 software [31]. Standard values were used for all docking parameters. Two-dimensional schematic representations of the docking results, including the binding sites and the interactions of the inhibitors (ligands) with the enzyme, were produced with LIGPLOT [32].

QSAR and molecular docking can be applied to the design of new inhibitors. Starting from the basic structure, i.e., the arylated hydrazinyl thiazole-based pyridine scaffold, new inhibitors were designed with the aim of lowering the IC50 value, that is, of improving the inhibitory activity. The QSAR and molecular docking results of the main compounds were carefully investigated to detect the most effective basic structures. The best structures were then modified by replacing some of their branches with various useful components, which produced about 126 new structures.

Results and discussion

MLR and LS-SVM models

To find a statistically sound QSAR model, the number of independent variables has to be determined through a reliable approach. In this study, the final number of model descriptors was set by the “breaking point” method, which is based on the trend of statistical parameters versus the number of descriptors. Figure 2 shows that the slope of the breaking point diagram starts to drop off from the fifth descriptor. Since the smallest suitable number of descriptors is preferred, the breaking point is taken as the optimum number of descriptors [19], which is 5 in this case.

Fig. 2

Breaking point plot of the model to find the best number of descriptors to build the final QSAR model

GA was used to select the most effective descriptors from a large space of different features. The selected descriptors were then assessed for incorporation in the final model. Subsequently, the final models were built on the 5 selected descriptors presented in Table 2.

Table 2 Molecular descriptors of the best model with 5 descriptors

The linear function of the selected variables obtained with the GA-MLR method is:

$$ \begin{aligned} {\text{pIC}}_{50} & = - 2.127 - 0.007\left( {{\text{D}}/{\text{Dtr}}05} \right) + 0.248\left( {\text{DELS}} \right) \\ & \quad + 2.586\left( {\text{GATS4s}} \right) + 11.277\left( {\text{G1p}} \right) - 1.605\left( {\text{H4m}} \right) \\ \end{aligned} $$
(9)

The equation indicates that descriptors G1p and GATS4s have the highest coefficients in the model and that they have a direct relationship with pIC50. To better illustrate the influence of these variables, their correlations with each other and with pIC50 were calculated, and the results are collected in Table 3. The table shows that descriptor DELS has the highest correlation with the inhibitory activity, which makes it a crucial descriptor for building the model. A mono-descriptor model, referred to as the simple model, was therefore built with descriptor DELS.
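Equation (9) can be evaluated for any compound once its five descriptor values are known. The sketch below encodes the published intercept and coefficients; the descriptor values in `example` are purely hypothetical and are not taken from Table 2 or from any compound in the data set.

```python
# Intercept and coefficients of the GA-MLR model, Eq. (9).
INTERCEPT = -2.127
COEFFS = {
    "D/Dtr05": -0.007,
    "DELS": 0.248,
    "GATS4s": 2.586,
    "G1p": 11.277,
    "H4m": -1.605,
}

def predict_pic50(descriptors: dict[str, float]) -> float:
    """Evaluate Eq. (9) for one compound given its five descriptor values."""
    return INTERCEPT + sum(c * descriptors[name] for name, c in COEFFS.items())

# Hypothetical descriptor values, for illustration only.
example = {"D/Dtr05": 100.0, "DELS": 20.0, "GATS4s": 1.0, "G1p": 0.2, "H4m": 0.5}
print(round(predict_pic50(example), 3))
```

Note that the size of a coefficient alone does not rank the descriptors, because the descriptors are on very different numerical scales; this is why the correlation and sensitivity analyses are used for that purpose.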

Table 3 The correlation between 5 descriptors of the best-structured model and the inhibitory activity (pIC50)

Evaluation of the models

The GA-MLR model includes 5 final variables as the most influential descriptors. To assess a possible nonlinear relation between the descriptors and pIC50, a second model was constructed from the same 5 descriptors with the LS-SVM method. The results of this model were significant, so the predictive ability of the descriptor set can be compared across two different methods.
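For readers unfamiliar with LS-SVM regression, a minimal NumPy sketch of its standard formulation is given here: the model is obtained by solving one linear system for the bias b and the dual weights α. The RBF kernel and the values of `gamma` and `sigma` are assumptions for illustration, since the kernel and hyperparameters used in the study are not reported in this section, and the data are synthetic.

```python
import numpy as np

def rbf_kernel(A: np.ndarray, B: np.ndarray, sigma: float) -> np.ndarray:
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvm_fit(X, y, gamma: float = 10.0, sigma: float = 1.0):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(y)
    M = np.zeros((n + 1, n + 1))
    M[0, 1:] = 1.0
    M[1:, 0] = 1.0
    M[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(M, np.concatenate([[0.0], y]))
    return sol[0], sol[1:]  # bias b, dual weights alpha

def lssvm_predict(X_train, b, alpha, X_new, sigma: float = 1.0):
    """f(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(28, 5))     # synthetic descriptors
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2  # synthetic nonlinear response
b, alpha = lssvm_fit(X, y)
fit = lssvm_predict(X, b, alpha, X)
print("train RMSE:", float(np.sqrt(np.mean((y - fit) ** 2))))
```

Unlike a classical SVM, every training compound receives a nonzero weight, and the regularization parameter `gamma` controls how closely the training data are fitted.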

One way to evaluate QSAR models is to compare their statistical parameters. Here, parameters such as Q2, R2, RMSE, F and S were calculated for the MLR and LS-SVM models. The results in Table 4 for the GA-MLR model based on 5 descriptors and 10 test compounds indicate a good prediction capacity. The model has a high multiple correlation coefficient (0.951) and a low prediction error. Figures 3 and 4 illustrate the calculated and experimental values of pIC50 for the train and test data sets, respectively. The maximum prediction error was 5.028%, which is acceptable.

Table 4 The statistical results of GA-MLR and LS-SVM models with 5 descriptors
Fig. 3

Comparison of predicted and experimental values of train data set with their specific error prediction in the MLR model with 5 descriptors

Fig. 4

Comparison of predicted and experimental values of test data set with their specific error prediction in the MLR model with 5 descriptors

Figure 5 shows the regression line comparing predicted and experimental values, and Fig. 6 shows the residual graph of the MLR model with 5 final descriptors. The compounds of both the train and test sets are well distributed, and none of them lies at an unacceptable distance from the fitted line.

Fig. 5

The regression line of the MLR model with 5 descriptors

Fig. 6

The residual graph of the MLR model with 5 descriptors

According to Fig. 7, in the applicability domain analysis, one point (compound 12) with a residual of more than 3 in the Williams plot was predicted with a slightly higher error. This error may be due to an error in the experimental data. All other points stayed within the applicability domain determined by the Williams plot.

Fig. 7

The Williams plot of the MLR model with 5 descriptors

The statistical results of the LS-SVM model with 5 descriptors in Table 4 show that this model predicts appropriately and performs better than the corresponding MLR model. As for the MLR model, the residual graph of the LS-SVM model in Fig. 8 demonstrates a proper distribution of the data set.

Fig. 8

The residual graph of the LS-SVM model with 5 descriptors

As shown in Fig. 9, all compounds lie within the applicability domain in the Williams plot of this model. As a result, both the linear and nonlinear models have an acceptable predictive capacity for inhibitory activity calculation. The values of pIC50 predicted by these models are listed in Table 1.

Fig. 9

The Williams plot of the LS-SVM model with 5 descriptors

To ensure the stability of these models, they were validated with different test groups, and nearly all of them gave good results. Table 5 lists the average statistical values of ten new LS-SVM models. These results indicate that the fit of the model does not depend on the selected test set, since the model still predicts satisfactorily when the test and train sets are varied. The models therefore present favorable statistical results and can be trusted as reliable predictive models.

Table 5 The average statistical results of ten LS-SVM models with various random test groups of compounds based on 5 descriptors in the model

Descriptors analysis to explore a simple model

The effectiveness of each descriptor in the QSAR model is investigated with sensitivity analysis. In this method, a descriptor is eliminated and the difference between RMSE values in this state and the base case (with all descriptors) is observed. A greater difference means that the descriptor had a more profound role in the model [27].
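The leave-one-descriptor-out idea can be sketched as follows. To keep the example short and self-contained, ordinary least squares is used as a stand-in for the LS-SVM model, and the data and the generic descriptor names d1 to d5 are synthetic; the sketch illustrates the procedure, not the published results.

```python
import numpy as np

def rmse(y, y_hat) -> float:
    """Root mean square error between observed and predicted values."""
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)))

def ols_fit_predict(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Fit least squares with an intercept and return the fitted values."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

def sensitivity(X: np.ndarray, y: np.ndarray, names: list[str]) -> dict[str, float]:
    """RMSE increase caused by removing each descriptor in turn."""
    base = rmse(y, ols_fit_predict(X, y))  # base case with all descriptors
    result = {}
    for j, name in enumerate(names):
        X_reduced = np.delete(X, j, axis=1)  # model without descriptor j
        result[name] = rmse(y, ols_fit_predict(X_reduced, y)) - base
    return result

names = ["d1", "d2", "d3", "d4", "d5"]  # generic placeholder names
rng = np.random.default_rng(4)
X = rng.normal(size=(28, 5))
# Synthetic response in which d2 is deliberately given the largest weight.
y = 0.1 * X[:, 0] + 1.0 * X[:, 1] + 0.3 * X[:, 2] + 0.1 * X[:, 3] - 0.4 * X[:, 4]
s = sensitivity(X, y, names)
for name, delta in sorted(s.items(), key=lambda kv: -kv[1]):
    print(f"{name}: RMSE increase = {delta:.3f}")
```

The descriptor whose removal raises the RMSE the most is ranked as the most influential, which is how the ordering of the descriptors is read from a sensitivity plot.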

Figure 10 describes the calculated sensitivity test values to find the most effective descriptors in the model.

Fig. 10

Sensitivity test of model descriptors to find the most effective variable on the LS-SVM model based on 5 descriptors

The descriptors that act as independent variables in the linear equation come from several Dragon categories and thus convey different structural information about the compounds. DELS is a topological descriptor with a positive sign in the MLR equation. Topological indices are 2D descriptors based on graph theory concepts, and they disclose basic information about the size of molecules, the degree of branching, flexibility and the overall shape [19]. Another essential descriptor, GATS4s, is the Geary autocorrelation of lag 4 weighted by I-state; it contains information about the distribution of the intrinsic state over the topological structure [33]. A higher value of this descriptor leads to a higher pIC50. H4m, the H autocorrelation of lag 4 weighted by atomic masses, is a GETAWAY descriptor [34] whose lower values lead to higher pIC50. Descriptor G1p is the 1st component symmetry directional WHIM index weighted by atomic polarizabilities [35]. It has a positive sign in the MLR equation, and thus the pIC50 value increases at higher values of this descriptor (a higher pIC50 corresponds to a lower IC50, i.e., stronger inhibition). The last descriptor, D/Dtr05, is a ring descriptor [36] with a negative sign in the linear equation and a negative effect on pIC50 based on the sensitivity test.

As expected from Table 3, the DELS descriptor had the main role among all descriptors. The results show that H4m, GATS4s, D/Dtr05 and G1p follow in that order.

Table 3 shows that the DELS descriptor has a high correlation (0.796) with pIC50, which is confirmed by the sensitivity analysis. It is therefore worthwhile to compare the base model with the model constructed from this descriptor alone. To evaluate the simple QSAR model built with the LS-SVM method, its statistical results were derived; as reported in Table 6, they show a satisfactory accuracy.

Table 6 The statistical results of the simple model with DELS descriptor based on the LS-SVM method

According to this table, R2 = 0.888, Q2 = 0.872, RMSE = 0.185 and F = 221.459, which means the simple model can be a favorable model to predict pIC50 values of the compounds with a high degree of reliability. Besides, the regression line in Fig. 11 and residual diagram in Fig. 12 show the acceptable dispersion of compounds by the simple model.

Fig. 11

The regression line of the simple model with one descriptor

Fig. 12

The residual diagram of the simple model with DELS descriptor

Good statistical results were obtained not only for the selected model with its specific test group; other test groups were also studied and showed an acceptable ability to predict the inhibitory activity. The final results of the models based on the DELS descriptor with different test groups are summarized in Table 7.

Table 7 The average results of six LS-SVM models with different test groups for the simple model

As the final result of this research, a simple model with only one descriptor (DELS) was obtained with the LS-SVM method to predict pIC50 values of α-glucosidase inhibitors with good statistical features. The other models calculated with the GA-MLR and LS-SVM methods had better statistical results, albeit with 5 variables, and the nonlinear model had the best prediction capability. The best R2 value in previous studies is 0.872, and most of these studies used just a single linear or nonlinear method to build their QSAR models. Hence, it seems necessary to compare different linear and nonlinear models to find the best model for pIC50 prediction. Table 8 presents a summarized survey of various works in this field, and it can be observed that this work has better results than recent studies. Therefore, the presented models can be useful to predict the inhibitory activity of these particular α-glucosidase inhibitors.

Table 8 Comparison between recent QSAR and molecular docking studies on the α-glucosidase enzyme

Homology modeling

Baker’s yeast α-glucosidase was used in the homology modeling approach. A suitable structural template for homology modeling was found in the Protein Data Bank (PDB) through the National Center for Biotechnology Information (NCBI). The amino acid sequence of α-glucosidase was submitted to the BLAST and PSI-BLAST algorithms, and a template with 72.51% sequence identity was retrieved to build the homology model that comprises 584 amino acid residues from the SWISS-PROT protein sequence data bank (http://www.expasy.org/sprot/; Accession No.). Figure 13 shows the sequence alignment between yeast α-glucosidase and the template 3A47, taken from the SWISS-MODEL site. The structure of the modeled protein is depicted in Fig. 14.

Fig. 13

Amino acid alignment in homology modeling of yeast α-glucosidase

Fig. 14

Structure of the simulated protein with homology modeling method to use it in molecular docking study instead of real protein structure

A Ramachandran plot server was used to evaluate the accuracy of the amino acid placements, which was found to be 97% according to Fig. 15. In other words, 97% of the amino acids are located in allowed regions, which indicates the high quality of the predicted structure.

Fig. 15

The precision of amino acid placement in the allowed regions via homology modeling

Molecular docking

Molecular docking was performed on the compounds to calculate useful descriptors and to consider different orientations of the ligands in the α-glucosidase active site. All docking features were obtained with AutoDock tools and BINANA [37]. Different models were established with these descriptors, but none of them gave statistical results as good as those of the Dragon descriptors, so they had no significant effect in QSAR modeling.

The different binding modes of the ligands with the protein were considered, and the notable interactions of the inhibitors with various residues in the active site of the enzyme were identified.

Finding a rational relation between the activity of these compounds and their structures, in order to understand why some compounds are the most active, is often hard work. In this study, three of the most active compounds, 9, 19 and 28, are shown in Fig. 16. The residues common to these compounds comprise Phe (177, 311, 157) and Arg (312, 439). They had an effective role in improving pIC50 values. The figure also shows the hydrophobic interactions between the enzyme and the ligands, the different atoms in the structures and their positions, the residues, hydrogen bonding and the other contacts. The best binding position of the ligands in the active site of the receptor is useful for designing new drugs.

Fig. 16

2D representation of the most active docked structures in molecular docking study: compound 9, 19 and 28

The descriptors of the QSAR model described above indicate that physical and topological properties, geometry, ring structures and atom binding positions have a significant effect on the inhibitory activity. Information from molecular docking can also be used to understand the structure of the compounds in more detail, which helps QSAR explain compounds structurally and find the best candidates for drug development. So, according to the QSAR and docking findings, it is necessary to pay attention to how atoms are arranged to form the complexes.

Analysis of designed compounds

New inhibitors were designed on the arylated hydrazinyl thiazole-based pyridine scaffold with the QSAR and molecular docking approach. A study of the inhibitors reveals that halogen atoms (F and Cl) and OH groups have a key role in increasing the inhibitory activity. The most active designed inhibitors are shown in Table S1 and Figure S1 (as supporting information in the supplementary materials) with their structures and the pIC50 values calculated with the presented MLR model based on 5 descriptors. All pIC50 values are higher than those of the main inhibitors of the study. Of course, these values need to be verified experimentally after the synthesis of the designed compounds.

In the docking process, the correlation between free energy and pIC50 values was calculated for all designed compounds. Although the correlation improved in comparison with the main descriptors, its value (− 0.226) is still not significant. The interactions of the molecules with different amino acids were also investigated. The common residues that recur in almost all inhibitors are Phe (157, 177, 158 and 311). Two structures with high activity had a hydrophobic interaction with His 239, Arg 312 and Asp 349. A 2D representation of the most active new structures A1, A2 and B3 is shown in Fig. 17.

Fig. 17

The representation of the most active designed compounds

Conclusion

In the present study, two different approaches, namely the GA-MLR and LS-SVM methods, were applied to establish linear and nonlinear QSAR models that predict the biological activity of a set of arylated hydrazinyl thiazole-based pyridine derivatives. Among the various descriptors calculated, the 5 most influential descriptors were selected via GA to build the final QSAR model. Among the selected descriptors, DELS had the highest correlation (0.796) with pIC50, and it was able to build a QSAR model with favorable prediction ability on its own. In previous studies on α-glucosidase inhibition, the best reported value for R2 was about 0.872, while in the present study the value of R2 is 0.989 for the nonlinear QSAR model with 5 final descriptors and about 0.888 for the simple model (using descriptor DELS with the LS-SVM method). Thus the presented models, even the simple model, can forecast the inhibitory activity of the compounds with higher accuracy than the previous modeling studies. Branching information and the size of molecules, which are captured by the DELS descriptor, were found to be the most influential factors for the inhibitory activities of the compounds. Three of the best predicted pIC50 values belong to compounds 9, 19 and 28, all of which have an aromatic ring bearing two adjacent Cl substituents; this reveals the fundamental role of halogen atoms in the inhibition of enzyme activity. Finally, the most active designed compound (labeled A1 in this study) had the best pIC50 value, 9.22, compared with the basic data set.