1 Introduction

Human immunodeficiency virus type 1 (HIV-1) is a lentivirus that causes acquired immunodeficiency syndrome (AIDS) in humans (Douek et al. 2009; Weiss 1993). The virus encodes three enzymes that are required for viral replication: reverse transcriptase, protease, and integrase (IN). Although antivirals targeting reverse transcriptase and protease have been reported, the development of resistant strains has limited their effectiveness (Imamichi 2004). There is therefore a pressing need for new antiviral agents directed against alternative sites in the viral life cycle. HIV-1 integrase has emerged as an attractive target for anti-HIV therapy because the enzyme is necessary for stable infection and no functional analog is known in the human host (Imamichi 2004). A wide variety of integrase inhibitors have been reported; however, the Food and Drug Administration (FDA), USA, has approved only a few (Anthony 2004; Johnson et al. 2004). The first FDA-approved integrase inhibitor was raltegravir (Fikkert et al. 2003; Grobler et al. 2002; Murray et al. 2007). Although raltegravir succeeds in suppressing the virus in HIV patients (Murray et al. 2007), HIV phenotypes resistant to raltegravir have evolved very rapidly (Malet et al. 2008). Novel integrase inhibitors with alternative mechanisms of action are therefore needed.

In the present investigation, a set of 47 HIV-1 integrase inhibitors with known experimental IC50 values was retrieved from the NCBI Bioassay database. These 47 diverse compounds were used as the training set to construct quantitative structure–activity relationship (QSAR) models using multiple regression (MR) and partial least squares (PLS). The resulting QSAR models were used to predict the IC50 values of a set of compounds obtained from a chemical similarity search of the training set against the NCBI PubChem database, filtered by the Lipinski rule of five (Lipinski 2000). Molecular docking simulations were carried out for the training set and test set compounds, followed by protein–ligand interaction analysis of the top docking hits. In addition, molecular dynamics simulations were performed for the best docked compounds to check the stability and dynamic behavior of the docked protein–ligand complexes.

2 Materials and methods

2.1 Data set and QSAR modeling

A set of diverse HIV-1 integrase inhibitors (47 compounds) with known experimental IC50 values was retrieved from the NCBI Bioassay database (Wang et al. 2012). These 47 compounds were used as the training set for constructing the QSAR models. The chemical structures, PubChem Compound IDs and IC50 values of these 47 compounds are shown in supplementary material Tables ST1 and ST2. To generate the QSAR models, the two-dimensional structures (SDF format) of the training set compounds were first optimized using the MM2 force field method (Ulrich and Norman 1982) and converted to the corresponding three-dimensional file format (Sybyl mol2) using ChemOffice (ChemOffice 2010). These 3D geometries were then used to construct QSAR models with two different methods, viz., multiple regression (MR) and partial least squares (PLS), using MDL QSAR 2.3 (MDL QSAR 2006). Multiple regression computes the least-squares fit of a set of independent variables ‘Xk’ to a dependent variable ‘Y’. It is commonly used in problems with a small number of independent variables that are not close to linearly dependent (Fisher 1922). For the multiple regression QSAR model, several models were obtained and the best model was chosen after barring 15 outlier molecules; hence, only 25 molecules were considered in the regression model (shown in supplementary material Table ST1). The quality of the regression model was then assessed by the multiple R-squared and cross-validated using the residual sum of squares (RSS). Furthermore, a randomization test was performed to check the validity of the descriptor set. Finally, the cross-validated regression coefficient Q-Squared and the non-cross-validated coefficient R-Squared were used to estimate the model quality.
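The least-squares fit and the two quality measures named above (R-Squared and cross-validated Q-Squared) can be illustrated with a minimal pure-Python sketch. It is not the MDL QSAR implementation: the toy descriptor matrix `X_TOY`, the activities `Y_TOY`, and the choice of leave-one-out cross-validation are assumptions made only for illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_mlr(X, y):
    """Return [intercept, b1, ..., bk] from the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear(xtx, xty)


def predict(coef, x):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


def total_ss(y):
    mean_y = sum(y) / len(y)
    return sum((yi - mean_y) ** 2 for yi in y)


def r_squared(X, y):
    """Non-cross-validated R^2 = 1 - RSS / TSS."""
    coef = fit_mlr(X, y)
    rss = sum((yi - predict(coef, xi)) ** 2 for xi, yi in zip(X, y))
    return 1.0 - rss / total_ss(y)


def q_squared_loo(X, y):
    """Leave-one-out Q^2 = 1 - PRESS / TSS (LOO is an assumption here)."""
    press = 0.0
    for i in range(len(y)):
        coef = fit_mlr(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, X[i])) ** 2
    return 1.0 - press / total_ss(y)


# Hypothetical toy data: two descriptors, eight compounds.
X_TOY = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8], [6, 4], [7, 9], [8, 6]]
Y_TOY = [2.1, 4.4, 4.55, 7.45, 7.1, 10.9, 10.55, 13.95]

r2 = r_squared(X_TOY, Y_TOY)
q2 = q_squared_loo(X_TOY, Y_TOY)
print(f"R2 = {r2:.4f}, Q2 = {q2:.4f}")
```

For an ordinary least-squares fit, the leave-one-out Q-Squared can never exceed R-Squared, which is why a Q-Squared close to R-Squared is read as a sign that the model is not overfitted.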

A second model was constructed using partial least squares (PLS) regression. PLS combines elements of regression and principal components analysis. Instead of forming a single regression equation, it produces a sequence of models with different numbers of variables for predicting the dependent variable (Ildiko and Friedman 1993). For the PLS QSAR model, the best model was chosen after barring 15 outliers; hence, only 25 molecules were considered for constructing the QSAR model (shown in supplementary material Table ST2).
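How PLS builds a sequence of models from successive latent components can be sketched with the textbook single-response NIPALS (PLS1) recursion. This is a generic illustration, not the MDL QSAR routine; the toy data and the function names `pls1_fit` and `pls1_predict` are hypothetical.

```python
def pls1_fit(X, y, n_components):
    """Fit PLS1 by NIPALS; returns (x_mean, y_mean, [(w, p, q), ...])."""
    n, k = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(k)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(k)] for row in X]  # X residual
    f = [yi - y_mean for yi in y]                                # y residual
    comps = []
    for _ in range(n_components):
        # Weight vector: direction in descriptor space most covariant with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(k)]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
        # Scores of every compound on this latent component.
        t = [sum(E[i][j] * w[j] for j in range(k)) for i in range(n)]
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(k)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate: remove what this component explains from X and y.
        E = [[E[i][j] - t[i] * p[j] for j in range(k)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, p, q))
    return x_mean, y_mean, comps


def pls1_predict(model, x):
    """Predict y for one descriptor vector by replaying the deflation."""
    x_mean, y_mean, comps = model
    e = [xj - mj for xj, mj in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in comps:
        t = sum(ej * wj for ej, wj in zip(e, w))
        y_hat += q * t
        e = [ej - t * pj for ej, pj in zip(e, p)]
    return y_hat


def rss(model, X, y):
    return sum((yi - pls1_predict(model, xi)) ** 2 for xi, yi in zip(X, y))


# Hypothetical toy data: two descriptors, eight compounds.
X_TOY = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8], [6, 4], [7, 9], [8, 6]]
Y_TOY = [2.1, 4.4, 4.55, 7.45, 7.1, 10.9, 10.55, 13.95]

rss_one = rss(pls1_fit(X_TOY, Y_TOY, 1), X_TOY, Y_TOY)
rss_two = rss(pls1_fit(X_TOY, Y_TOY, 2), X_TOY, Y_TOY)
print(f"RSS with 1 component: {rss_one:.4f}, with 2 components: {rss_two:.4f}")
```

Each added component lowers the fitted residual; the number of components to keep is normally chosen from the cross-validated error rather than from the fitted error.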

2.2 Test set prediction

To assemble a test set for IC50 prediction, a chemical similarity search was performed for each training set compound against the NCBI PubChem database to find related structures and analogs (Bolton et al. 2008). The search threshold was set at 95% similarity, and the hits were filtered by the Lipinski rule of five (Lipinski 2000). In addition, only compounds whose IC50 values have not been reported in PubChem Bioassay were retrieved (55 compounds). These compounds were used as the test set for predicting their IC50 values. The CIDs and 2D structures of the test set compounds are shown in Tables ST1 and ST2 of the supplementary material.
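The two screening criteria described above can be sketched as follows. The sketch assumes that similarity is a Tanimoto coefficient over fingerprint bit sets and applies the rule of five strictly (no violations allowed); the fingerprints, property values, and candidate names are hypothetical and do not come from PubChem.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def passes_lipinski(mol_weight, logp, h_donors, h_acceptors):
    """Strict rule of five: every criterion must hold."""
    return (mol_weight <= 500 and logp <= 5
            and h_donors <= 5 and h_acceptors <= 10)


def select_test_set(query_fp, candidates, threshold=0.95):
    """Keep candidates that are similar enough to the query and drug-like."""
    kept = []
    for cand in candidates:
        similar = tanimoto(query_fp, cand["fp"]) >= threshold
        drug_like = passes_lipinski(cand["mw"], cand["logp"],
                                    cand["hbd"], cand["hba"])
        if similar and drug_like:
            kept.append(cand["id"])
    return kept


# Hypothetical query fingerprint and candidates, for illustration only.
QUERY_FP = set(range(40))
CANDIDATES = [
    {"id": "cand-A", "fp": set(range(39)), "mw": 430.0, "logp": 2.1, "hbd": 2, "hba": 7},
    {"id": "cand-B", "fp": set(range(30)), "mw": 410.0, "logp": 1.5, "hbd": 1, "hba": 6},
    {"id": "cand-C", "fp": set(range(40)), "mw": 650.0, "logp": 3.0, "hbd": 2, "hba": 8},
]

selected = select_test_set(QUERY_FP, CANDIDATES)
print(selected)
```

Here `cand-B` is rejected for low similarity and `cand-C` for exceeding the molecular weight limit.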

2.3 Molecular docking computation

Molecular docking simulation was carried out using Molegro Virtual Docker (MVD) (Molegro APS 2011). The three-dimensional crystal structure of the HIV-1 integrase catalytic core domain (Kessl et al. 2012) (PDB ID: 4DMN) was retrieved from the Protein Data Bank (http://www.rcsb.org/). The protein was imported into MVD, and the search space was set to the active site residues (Asp64, Asp116 and Glu152). The binding site was restricted to a sphere of radius 13 Å (X: −34.57, Y: −10.85, Z: 5.52), and the MolDock Score (GRID) scoring function was used with a grid resolution of 0.30 Å. The test set and training set compounds, along with the experimentally known inhibitor raltegravir (for validation purposes), were then imported into MVD. The bonds of the compounds were set as flexible, and side-chain flexibility was set for the search space residues (Asp64, Asp116 and Glu152) with a tolerance of 1.10 and a strength of 0.90. The RMSD threshold for multiple cluster poses was set at 2.00 Å. The docking algorithm was run with a maximum of 1500 iterations and a simplex evolution size of 50, and a minimum of 10 runs was performed for each compound. The pose generation energy threshold was set to 100, and the simplex evolution maximum steps were set to 300 with a neighbor distance factor of 1. The best pose of each compound was selected for the subsequent ligand–protein interaction energy analysis.
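The spherical search space can be pictured with a small geometric check. The sketch below uses only the center and radius quoted above; the function name `in_search_sphere` and the test coordinates are hypothetical, and MVD's internal handling of the search space is not reproduced.

```python
import math

# Center (in Å) and radius (in Å) of the search sphere, as quoted in the text.
CENTER = (-34.57, -10.85, 5.52)
RADIUS = 13.0


def in_search_sphere(atom_xyz, center=CENTER, radius=RADIUS):
    """True if an atom position lies inside or on the search sphere."""
    return math.dist(atom_xyz, center) <= radius


# Hypothetical atom positions, for illustration only.
inside = in_search_sphere((-30.0, -8.0, 4.0))
outside = in_search_sphere((-10.0, -10.85, 5.52))
print(inside, outside)
```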

2.4 Molecular dynamics simulation

Molecular dynamics (MD) simulation was carried out using Gromacs 5.0.1 (Van Der Spoel et al. 2005) running on Ubuntu Linux 14.0 LTS. MD simulations were carried out for the enzyme 4DMN and for the top docking complexes 4DMN-CID69851424, 4DMN-CID66861330, and 4DMN-CID54714968. The protein and the protein–ligand complexes were first solvated using the TIP3P water model, and the solvated structures were energy minimized using the steepest descent method, terminating when the maximum force fell below 100 kJ mol−1 nm−1. All simulations were performed in the NPT ensemble at constant temperature (300 K) and pressure (1 bar) with a time step of 2 fs. NVT equilibration was performed for 1 ns, and the minimized structures were equilibrated over a timescale of 20 ns. The molecular dynamics simulation was run for 20 ns for both the protein and the protein–ligand complexes to understand their dynamic behavior and stability.
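GROMACS run lengths are specified as a number of integration steps (`nsteps`) at a given time step (`dt`), so the durations quoted above translate into step counts as in the following sketch. The helper name `n_steps` is hypothetical; only the 2 fs time step and the 1 ns and 20 ns durations come from the text.

```python
FS_PER_NS = 1_000_000  # femtoseconds in one nanosecond


def n_steps(duration_ns, timestep_fs):
    """Number of integration steps needed to cover duration_ns."""
    total_fs = duration_ns * FS_PER_NS
    if total_fs % timestep_fs:
        raise ValueError("duration is not a whole number of time steps")
    return total_fs // timestep_fs


nvt_steps = n_steps(1, 2)          # 1 ns NVT phase at 2 fs
production_steps = n_steps(20, 2)  # 20 ns run at 2 fs
print(nvt_steps, production_steps)
```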

2.5 ADME toxicity studies

Molecular docking may identify a compound with a favorable binding affinity and strong interactions. However, an ideal oral drug should also possess certain properties, such as rapid absorption from the gastrointestinal tract. The potential drug candidate should also be distributed specifically to its target, metabolized in a way that does not instantly eliminate its activity, and removed in an appropriate manner without causing any damage (Ghasemi et al. 2015). Hence, ADME–toxicity (absorption, distribution, metabolism, excretion, and toxicity) analysis, which describes the behavior of a pharmaceutical compound within an organism, is an important step in the drug discovery process. Poor human pharmacokinetics (ADME–toxicity) is also a main reason for the majority of drug failures (Alavijeh et al. 2005). For these purposes, ADME–toxicity predictions were calculated for the top docking hits using ACD/I-Lab 2.0 (Advanced Chemistry Development, Inc). ACD/I-Lab 2.0 is a web-based service that provides instant access to spectral and chemical databases and predicts properties, including physicochemical, ADME, and toxicity characteristics. In addition, comparative analyses were performed for the mouse LD50 (intraperitoneal, oral, intravenous, subcutaneous) and the probability of health effects on the blood, cardiovascular system, gastrointestinal system, kidney, liver, and lung.

3 Results and discussion

3.1 QSAR modeling

QSAR modeling was carried out using MDL QSAR and two optimized models were generated using MLR and PLS. First, the generated regression equation of the MLR is given below:

$$\text{pIC}_{50} = 0.0004949 \times \text{SHBint5}\_\text{Acnt} - 0.002895 \times \text{nelem} + 0.002142 \times \text{xvp6} + 0.0001224 \times \text{SdO} - 0.009271 \times \text{SsF} - 0.0008987 \times \text{SHsOH} + 0.134736$$
(1)

where SHBint5_Acnt is the count of internal hydrogen bonds with five skeletal bonds between donor and acceptor, nelem is the number of chemical elements, xvp6 is the valence 6th order path Chi index, SdO is the sum of all (=O) E-State values in the molecule, SsF is the sum of all (–F) E-State values in the molecule, and SHsOH is the sum of all (–OH) E-State values in the molecule.
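Applying Eq. 1 to a compound amounts to a weighted sum of its six descriptor values plus the intercept, as in the sketch below. The coefficients are those of Eq. 1; the descriptor values in `example` are hypothetical and serve only to show the call.

```python
# Coefficients and intercept of Eq. 1.
MR_COEFFICIENTS = {
    "SHBint5_Acnt": 0.0004949,
    "nelem": -0.002895,
    "xvp6": 0.002142,
    "SdO": 0.0001224,
    "SsF": -0.009271,
    "SHsOH": -0.0008987,
}
MR_INTERCEPT = 0.134736


def predict_mr(descriptors):
    """Evaluate Eq. 1; descriptors not supplied are treated as zero."""
    unknown = set(descriptors) - set(MR_COEFFICIENTS)
    if unknown:
        raise KeyError(f"unknown descriptors: {sorted(unknown)}")
    weighted = sum(MR_COEFFICIENTS[name] * value
                   for name, value in descriptors.items())
    return MR_INTERCEPT + weighted


# Hypothetical descriptor values, for illustration only.
example = {"SHBint5_Acnt": 2, "nelem": 4, "xvp6": 1.5,
           "SdO": 20.0, "SsF": 13.0, "SHsOH": 1.0}
print(predict_mr(example))
```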

The optimized MR model has a multiple R-Squared value of 0.9666 with a standard error of estimation of 0.0003275 (the smaller the better). The equation has an F-statistic value of 86.9 with a P value of 2.657E−012. The multiple Q-Squared value is 0.932 and the RSS value for cross validation is 3.936E−006. R-Squared and Q-Squared values close to unity indicate that the training set is very well described by the regression equation, which is statistically highly significant, and the cross-validation indicates that the constructed model is efficient for predicting the IC50 values. The equation generated for the PLS model is given below:

$$\text{pIC}_{50} = 0.3857 \times \text{tp1} + 1.141 \times \text{tp2} + 0.2949 \times \text{tp3} + 0.3149 \times \text{tp4} + 0.4892 \times \text{tp5}$$
(2)

The optimized PLS model has a multiple R-Squared of 0.9578 and an RSS value of 2.531E−006. The multiple Q-Squared value is 0.8793, and the RSS value for cross validation is 7.242E−006 with a skewness of −0.4975, which indicates that the training set is very well described, and the cross validation indicates that the constructed model is efficient for predicting the IC50 values.
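The reported statistics can be cross-checked against one another with the standard textbook relations R² = 1 − RSS/TSS, Q² = 1 − PRESS/TSS, F = (R²/k) / ((1 − R²)/(n − k − 1)) and SE = sqrt(RSS/(n − k − 1)). The sketch below assumes n = 25 retained compounds and k = 6 descriptors for the MR model, and assumes that the "RSS value for cross validation" is the PRESS; MDL QSAR may define these quantities differently.

```python
import math


def f_statistic(r2, n, k):
    """Overall F statistic for a fit with k descriptors on n compounds."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


def tss_from_cv(q2, press):
    """Total sum of squares implied by Q^2 = 1 - PRESS / TSS."""
    return press / (1.0 - q2)


# MR model: values quoted in the text; N and K are assumptions of this check.
N, K = 25, 6
mr_f = f_statistic(0.9666, N, K)
mr_tss = tss_from_cv(0.932, 3.936e-6)
mr_rss = (1.0 - 0.9666) * mr_tss            # fitted RSS implied by R^2
mr_se = math.sqrt(mr_rss / (N - K - 1))     # standard error of estimation

# PLS model: TSS from R^2 and fitted RSS, then Q^2 from the cross-validated RSS.
pls_tss = 2.531e-6 / (1.0 - 0.9578)
pls_q2 = 1.0 - 7.242e-6 / pls_tss

print(f"MR:  F = {mr_f:.1f}, SE = {mr_se:.3e}")
print(f"PLS: Q2 = {pls_q2:.4f}")
```

Under these assumptions the recomputed values land close to the reported F-statistic (86.9), standard error (0.0003275), and PLS Q-Squared (0.8793), so the quoted figures are mutually consistent with a 25-compound, six-descriptor fit.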

The experimental IC50 values and those predicted using MR and PLS are shown in Table 1, and the regression graphs are shown in Fig. 1 for multiple regression and Fig. 2 for the partial least squares method.

Table 1 IC50 values (experimental) and IC50 values (predicted) generated using multiple regression (MR) and partial least square (PLS)

Fig. 1 a Regression line and b regression model graph of IC50 (predicted) vs IC50 (experimental) generated employing multiple regression (MR)

Fig. 2 a Regression line and b regression model graph of IC50 (predicted) vs IC50 (experimental) generated employing partial least square method (PLS)

3.2 IC50 prediction

Applying both optimized QSAR models, the IC50 values of the 55 test set compounds were predicted; the results are shown in Table 2. Table 2 shows that there is little variation between the IC50 values predicted by the MLR and PLS methods, which supports the reliability of both models for IC50 prediction.

Table 2 Predicted IC50 values of the test set compounds using multiple regression (MR) method and partial least square (PLS) method

3.3 Molecular docking analysis

Molecular docking simulation was carried out for the test set and training set compounds using MVD. MVD is a molecular visualization and docking program based on a differential evolution algorithm; the objective of the algorithm takes into account the sum of the intermolecular interaction energy between the ligand and the protein and the intramolecular interaction energy of the ligand. The docking scoring function is based on a modified piecewise linear potential (PLP) with new hydrogen bonding and electrostatic terms included. A full description of the algorithm and its reliability compared with other common docking algorithms can be found in the literature (Thomsen and Christensen 2006). The docking scores indicate that the test set compounds have more favorable Rerank scores than the training set compounds and than the experimentally known integrase inhibitor raltegravir (Table 3). The test set compounds CID69851424, CID66861330, and CID54714968 have Rerank scores of −74.99, −68.85, and 66.93, respectively, while CID25263165 (training set) has a Rerank score of −66.93, compared to −34.18 for raltegravir. The Rerank score is a weighted combination of the terms used by the MolDock score together with a few additional terms. The Rerank score includes a steric term that uses a Lennard–Jones approximation to the steric energy, whereas the MolDock score uses a piecewise linear potential to approximate the steric energy. The coefficients for the weighted Rerank score are given in the Rerank Weight column, and the weighted terms and their summations are given in the Rerank Score column. The Rerank scoring function is computationally more expensive than the scoring function used during the docking simulation, but it is generally better than the docking score function at determining the best pose among several poses originating from the same ligand (Thomsen and Christensen 2006).
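The idea of a differential evolution search can be illustrated with a generic DE/rand/1/bin loop that minimizes a score over a bounded space. This is not MVD's search algorithm or scoring function: the quadratic `toy_score`, the target position, and all parameter values below are hypothetical stand-ins for a real docking energy and real pose variables.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, f=0.5, cr=0.9,
                           generations=200, seed=0):
    """Minimize objective over box bounds with DE/rand/1/bin."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct individuals other than the current one.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one mutated coordinate
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    value = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    lo, hi = bounds[d]
                    value = min(max(value, lo), hi)  # keep inside the box
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_score = objective(trial)
            if trial_score <= scores[i]:  # greedy one-to-one replacement
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


# Hypothetical stand-in for a docking score: squared distance to a target point.
TARGET = (1.0, -2.0, 0.5)


def toy_score(position):
    return sum((p - t) ** 2 for p, t in zip(position, TARGET))


BOUNDS = [(-5.0, 5.0)] * 3
best_pose, best_score = differential_evolution(toy_score, BOUNDS)
print(best_pose, best_score)
```

In a docking setting the vector being evolved would encode ligand position, orientation, and torsion angles, and the objective would be the docking score rather than a distance.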

Table 3 Molecular docking score of the test set and training set compounds

In addition, CID69851424 and CID54714968 have hydrogen bonding energies of −10.05 kJ mol−1 and −7.49 kJ mol−1, respectively, which contribute to the favorable intermolecular binding affinity and binding interaction. To examine the molecular interactions in more depth, ligand–protein interaction analysis was carried out for CID69851424, CID66861330, and CID54714968. The interaction analysis revealed that the compounds interact with Asp64 (shown in Table 4), which is an active site residue. Snapshots of the ligand–protein interaction analysis are shown in Figs. 3, 4, and 5 for CID69851424, CID66861330, and CID54714968, respectively.

Table 4 Molecular interaction analysis of the top docking hits

Fig. 3 a Binding mode of CID69851424 (yellow) to 4DMN active site region. b Predicted bonded interactions (dashed lines) between CID69851424 and Asp64, Cys65, Thr115, Asp116, and Asp120 residues

Fig. 4 a Binding mode of CID66861330 (yellow) to 4DMN active site region. b Predicted bonded interactions (dashed lines) between CID66861330 and the Asp64 residue

Fig. 5 a Binding mode of CID54714968 (yellow) to 4DMN active site region. b Predicted bonded interactions (dashed lines) between CID54714968 and Asp64, Asp116, and Asn155 residues

3.4 Molecular dynamics simulation

Molecular dynamics simulation was performed for 20 ns using Gromacs 5.0.1 to understand the conformational changes of the protein and of the protein–ligand complexes in a dynamic environment. The RMSD plots for the 4DMN-CID69851424 complex, the 4DMN-CID66861330 complex, the 4DMN-CID54714968 complex, and 4DMN are shown in Fig. 6. The RMSD plots show the variations of the HIV-1 integrase enzyme and of the protein–ligand complexes over the simulation. The average RMSD was ~0.27 Å for the 4DMN-CID66861330 complex, ~0.26 Å for the 4DMN-CID69851424 complex, and ~0.29 Å for the 4DMN-CID54714968 complex, while the HIV-1 integrase enzyme (4DMN) showed an average RMSD of ~0.30 Å, which indicates that the protein–ligand complexes are more stable than the free enzyme under dynamic equilibrium conditions. Furthermore, the simulation indicates that the 4DMN-CID69851424 complex is more stable than the 4DMN-CID66861330 and 4DMN-CID54714968 complexes in the dynamic environment.
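The quantity being averaged above is the root-mean-square deviation of atomic positions from a reference structure. A minimal sketch of that calculation is given below; it omits the least-squares superposition onto the reference that trajectory analysis tools normally perform first, and the coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally sized lists of (x, y, z) positions."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and equally sized")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def mean_rmsd(reference, trajectory):
    """Average RMSD of every frame in a trajectory relative to the reference."""
    values = [rmsd(reference, frame) for frame in trajectory]
    return sum(values) / len(values)


# Hypothetical three-atom reference and two frames, for illustration only.
REFERENCE = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
FRAME_SHIFTED = [(x + 0.3, y, z) for x, y, z in REFERENCE]
TRAJECTORY = [REFERENCE, FRAME_SHIFTED]

print(rmsd(REFERENCE, FRAME_SHIFTED), mean_rmsd(REFERENCE, TRAJECTORY))
```

A lower average RMSD over the trajectory means the structure stays closer to its starting conformation, which is the sense in which the complexes are described as more stable.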

Fig. 6 MD simulation analysis depicting the RMSD graph for 4DMN and 4DMN-ligand complexes

3.5 ADME–toxicity studies

According to the ADME–toxicity analysis calculated using ACD/I-Lab 2.0 (Advanced Chemistry Development, Inc 1994), the three top docking hits from the test set are predicted to be more soluble in water than the marketed drug raltegravir (shown in Table 5). For absorption, CID66861330 and CID69851424 showed 100 and 53% passive absorption, respectively, compared to 7% for raltegravir. Moreover, the apparent volume of distribution of these three compounds is higher than that of raltegravir (Table 5). The comparative plots of mouse LD50 and probability of health effects are shown in Figs. 7 and 8, respectively. The oral mouse LD50 values are broadly similar to that of raltegravir, suggesting that these compounds could be administered orally. Finally, the probability of health effects indicates that CID69851424 is likely to have a low overall impact on health. Overall, the ADME–toxicity analysis suggests that these compounds in general have more favorable predicted pharmacological parameters than the marketed drug raltegravir.

Table 5 Predicted ADME–toxicity parameters for CID69851424, CID66861330, CID54714968, and market approved raltegravir

Fig. 7 Comparative analysis on LD50 mouse (intraperitoneal, oral, intravenous, subcutaneous) for CID69851424, CID66861330, CID54714968, and raltegravir

Fig. 8 Comparative analysis on probability of health effect on blood, cardiovascular system, gastrointestinal system, kidney, liver, and lung for CID69851424, CID66861330, CID54714968, and raltegravir

4 Conclusion

IC50 values were predicted for a set of HIV-1 integrase inhibitors using QSAR models generated by multiple regression and the partial least squares method. The predictions showed little variation between the two methods, and cross validation supports the use of these models to predict IC50 values. Molecular docking analysis against HIV-1 integrase showed that the compounds have more favorable molecular interactions than the marketed drug raltegravir. The docking results were further examined by molecular dynamics simulation for 20 ns, in which the docked ligand–protein complexes showed greater stability than the HIV-1 integrase enzyme alone. Finally, ADME–toxicity prediction indicates that the docked compounds have more favorable pharmacological parameters than raltegravir. Hence, the authors propose the reported compounds, viz., CID69851424, CID66861330, and CID54714968, for clinical trial.