Introduction

Acquired immunodeficiency syndrome (AIDS) is a disease caused by human immunodeficiency virus (HIV). Since it was first reported in 1981, it has caused more than 30 million deaths worldwide (Curran, 1983). Most currently approved drugs target two essential viral enzymes, reverse transcriptase and protease (Johnson et al., 2010). HIV-1 integrase (HIV-1 IN), an enzyme essential for viral replication, has emerged as a most promising target because it has no human homologue and because the former two targets are plagued by the emergence of highly resistant viral strains (Jaskolski et al., 2011). HIV-1 IN comprises a single polypeptide chain of 288 amino acid residues organized in three domains: the N-terminal domain (residues 1–54), the catalytic domain (residues 55–209) and the C-terminal domain (residues 210–288). HIV-1 IN catalyzes the insertion of viral DNA into the host genome in two biochemical steps: (a) 3′-processing, which removes two nucleotides from each 3′-end of the viral cDNA to produce reactive 3′-hydroxyl ends, and (b) strand transfer, which joins the 3′-ends of the viral DNA to the host DNA through a nucleophilic transesterification reaction (Schroder et al., 2002). Several HIV-1 IN inhibitors have been reported, but only two compounds, the strand transfer inhibitors raltegravir and elvitegravir, are approved for clinical use (US Food and Drug Administration, 2014). The major impediments to the development of HIV-1 IN strand transfer inhibitors are cross-resistance arising from the similar mechanism of action of the developed agents, HIV-1 IN polymorphism, which is less studied, and the fact that the agent has to act on the HIV-1 IN–viral DNA complex. Inhibitors of the 3′-processing step are thought to suppress both the 3′-processing and strand transfer steps (Korolev et al., 2011). The report by Zhao et al. (1997) on coumarin derivatives opened the avenue for coumarin-based HIV-1 IN inhibitors.
Many coumarin derivatives were subsequently reported as HIV-1 IN 3′-processing and strand transfer inhibitors (Mao et al., 2002; Chiang et al., 2007; Tewtrakul et al., 2007; Liu et al., 2009; Kostova et al., 2006; Al-Mawsawi et al., 2006; Bailly et al., 2005; Olmedo et al., 2012; Mahajan et al., 2009; Spino et al., 1998). Hansch (Hansch and Fujita, 1964) and Wilson (Free and Wilson, 1964) established QSAR, a tool that establishes a quantitative relationship between the structural, physicochemical and conformational properties of compounds and their biological activity. Since its inception, QSAR has evolved from 2D-QSAR to the more complex 4D quantitative structure–activity relationship (4D-QSAR). QSAR approaches are further classified as receptor-dependent or receptor-independent, depending on whether the descriptors are calculated with or without the receptor. Comparative molecular field analysis (CoMFA; Cramer et al., 1988) and comparative molecular similarity indices analysis (CoMSIA; Klebe et al., 1994) are the most acclaimed 3D-QSAR approaches and have been exploited in the successful design of many therapeutic agents. Hopfinger et al. (1997) proposed 4D-QSAR, which, beyond the requirements of 3D-QSAR, exploits the conformational flexibility of the ligand and the freedom of alignment in three-dimensional space. In this approach, grid cell occupancy descriptors (GCODs) are calculated by aligning the set of molecules in a defined grid box that resembles the active site of the receptor. GCODs are evaluated for interaction pharmacophore elements (IPEs), that is, atom sets such as polar positive, polar negative, aromatic, hydrogen bond acceptor and hydrogen bond donor. Partial least squares (PLS; Dijkstra, 2010) is one of the chemometric methods commonly used in QSAR modeling. In the present work, the 4D-QSAR approach called LQTAgrid-QSAR (LQTA, Laboratório de Quimiometria Teórica e Aplicada), introduced by Martins et al. (2009), was used to build a 4D-QSAR model.
This 4D-QSAR model was developed to identify the structural features required in the coumarin derivatives and to support the subsequent design of more active molecules. In this approach, a conformational ensemble, called the conformational ensemble profile (CEP), was generated for each compound by molecular dynamics (MD) simulations using GROMACS 4.6.3 (Berendsen et al., 1995; Hess et al., 2008; Pronk et al., 2013).

Experimental

Methods

Dataset

Fifty-seven coumarin derivatives, presented in Table 1, were taken from the literature (Zhao et al., 1997; Nunthaboot et al., 2006; Mao et al., 2002; Chiang et al., 2007; Su et al., 2006; Al-Mawsawi et al., 2006), where they were reported as 3′-processing inhibitors. All the selected derivatives share a biscoumarin component in their structure. Based on the substitution pattern of these derivatives, 13 compounds were selected as the test set and the remaining 44 compounds as the training set. The reported 3′-processing IC50 values (in µM) were transformed to pIC50 by taking the negative logarithm of the IC50 values expressed in molar units.
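The IC50 to pIC50 transformation described above can be illustrated with a short sketch. The function name and the example IC50 values are hypothetical and are not taken from Table 1; the only assumption is that the literature values are in µM, as stated.

```python
import math


def pic50_from_micromolar(ic50_um: float) -> float:
    """Convert an IC50 given in micromolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be a positive concentration")
    ic50_molar = ic50_um * 1e-6  # 1 µM = 1e-6 mol/L
    return -math.log10(ic50_molar)


# Illustrative concentrations only (hypothetical, not from the dataset)
for ic50 in (0.5, 1.0, 25.0):
    print(f"IC50 = {ic50:5.1f} µM -> pIC50 = {pic50_from_micromolar(ic50):.3f}")
```

Equivalently, pIC50 = 6 − log10(IC50 in µM), so a lower IC50 gives a higher pIC50.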

Table 1 Structures of coumarin derivatives studied

Computer hardware and software

Computational work was done on the Ubuntu Linux 12.0 and Windows XP operating systems. The software used included AutoDock 4.2 with MGLTools 1.5.6 (Morris et al., 2009), AutoDock Vina (Trott and Olson, 2010), Marvin Sketch, UCSF Chimera 1.8rc (Pettersen et al., 2004), Discovery Studio 3.5 (Accelrys Inc.), ArgusLab 4.0.1, Firefly 801 Quantum Chemistry Package (Granovsky, Firefly version 8.0; Schmidt et al., 1993), MATLAB 7.7.0 (R2008b) (MathWorks, Inc.), PLS_Toolbox version 7.51 (Eigenvector Research, Inc.), PyMOL version 1.3 (Schrodinger, LLC), LQTAgrid (Martins et al., 2009), the ERM algorithm, a MATLAB .m script (Ballabio et al., 2014), and GROMACS 4.6.3.

Geometry optimization

The general 4D-QSAR methodology is depicted in Fig. 1. Structures of all compounds were drawn using Marvin Sketch. Geometry optimization was carried out in ArgusLab 4.0.1 at the semi-empirical quantum mechanical level with the parameterized model number 3 (PM3) Hamiltonian, until the restricted closed-shell Hartree–Fock self-consistent field formalism converged to 10⁻¹⁰ kcal/mol, with a steepest descent geometry search until the gradient converged to 10⁻⁶ kcal/mol. Gasteiger partial atomic charges of the optimized molecules were computed in UCSF Chimera.

Fig. 1

Schematic representation of 4D-QSAR methodology

GROMACS coordinates and topology files

The PRODRG2 server (Schuttelkopf et al., 2004) was used to generate coordinate and topology files for all the compounds in the dataset. As the structures were already optimized, the energy minimization option in PRODRG was not chosen. An appropriate add-hydrogen or hybridization patch was used during the generation of the coordinate and topology files. The Gasteiger charges computed in UCSF Chimera were manually loaded into the PRODRG-generated topology files.

Molecular dynamics simulation

Molecular dynamics simulation was carried out in GROMACS 4.6.3 to obtain the CEP of each compound. The GROMOS96 ffG43a1 force field was used with an explicit water model in a cubic box of 1 Å volume. The MD simulation included heating the system at 50, 100, 200 and 350 K for 20 picoseconds (ps) of simulation time with a 1-femtosecond (fs) step size. The particle mesh Ewald (PME) method was used to compute long-range electrostatics, and the van der Waals (VWD) interaction energies were calculated with a cutoff radius of 1 Å. The compound and the solvent water were coupled separately during the simulation. The pressure and temperature of the system were controlled by Parrinello–Rahman coupling and the velocity-rescaling thermostat (V-rescale), respectively. The system was then cooled down to 300 K. The trajectory was recorded every 2000 simulation steps, which corresponds to 2 ps of simulation time.

Alignment

The conformations of all compounds generated in the 300 K simulation were subjected to alignment. Compound 16, the most active of all compounds, was chosen as the reference compound. The atoms chosen for alignment comprised the common structural component (the biscoumarin part), as shown in Fig. 2. The CEPs of the remaining compounds were aligned against the reference compound. In the alignment step, the trajectory frames generated between 20 and 100 ps, at 2-ps increments, were aligned. The alignments of the conformers of a few of the most active (36, 29, 35 and 34) and least active (3, 33, 7, 4 and 17) compounds from the dataset with the conformers of the most active compound 16 are shown in Fig. 3.
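As a rough check of the bookkeeping, the sketch below derives the frame spacing from the 1 fs step and the 2000-step output interval reported for the MD simulation, then lists the frame times in the 20–100 ps alignment window. Treating both endpoints of the window as included is an assumption, as are all variable and function names.

```python
STEP_FS = 1              # integration step size in femtoseconds (from the text)
SAVE_EVERY_STEPS = 2000  # trajectory output interval in steps (from the text)

# 2000 steps x 1 fs = 2000 fs = 2 ps between saved frames
frame_spacing_ps = STEP_FS * SAVE_EVERY_STEPS / 1000.0


def aligned_frame_times(start_ps=20.0, end_ps=100.0, spacing_ps=frame_spacing_ps):
    """Return frame times in [start_ps, end_ps]; both endpoints are assumed included."""
    n_frames = int(round((end_ps - start_ps) / spacing_ps)) + 1
    return [start_ps + i * spacing_ps for i in range(n_frames)]


times = aligned_frame_times()
print(f"frame spacing: {frame_spacing_ps} ps")
print(f"frames per compound in the alignment window: {len(times)}")
```

Under these assumptions, each CEP contributes 41 conformers to the alignment.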

Fig. 2

Structural component (biscoumarin part) used in alignment

Fig. 3

Alignment of conformers generated during MD simulation (CEP). (a) Aligned CEP of the most active (reference) compound 16; (b) CEP of active compound 36 aligned with compound 16 (compound 16 shown in blue); (c) CEP of active compound 29; (d) CEP of active compound 35; (e) CEP of active compound 34; (f) CEP of least active compound 3; (g) CEP of less active compound 33; (h) CEP of less active compound 7; (i) CEP of less active compound 4; (j) CEP of less active compound 17 (Color figure online)

Descriptors of interaction energy

A grid box of size 18 × 18 × 18, large enough to accommodate the CEPs, was chosen. The LQTAgrid module was used with an –NH3+ probe, representing a hypothetical protein N-terminus, to generate the matrix of interaction energy descriptors. The electrostatic property, in terms of the Coulombic (C) potential function, and the steric 3D property, in terms of the Lennard-Jones (LJ) potential function, were generated as a descriptor matrix of 11,664 descriptors, comprising 5832 LJ and 5832 Coulombic potential-based descriptors.
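The descriptor count follows directly from the grid dimensions: 18³ = 5832 grid points, each carrying one Coulombic and one LJ value for the single probe. The sketch below enumerates such descriptors. The label pattern `x_y_z_probe_field` and the 1-based grid indices are assumptions made for illustration and are not a description of LQTAgrid output.

```python
from itertools import product

GRID_POINTS_PER_AXIS = 18  # 18 x 18 x 18 grid box (from the text)
PROBE = "NH3+"             # probe used in the text
FIELDS = ("C", "LJ")       # Coulombic and Lennard-Jones fields


def descriptor_names(n=GRID_POINTS_PER_AXIS, probe=PROBE, fields=FIELDS):
    """Enumerate one label per grid point and field (hypothetical naming scheme)."""
    names = []
    for field in fields:
        # Assumed 1-based grid indices along x, y and z
        for x, y, z in product(range(1, n + 1), repeat=3):
            names.append(f"{x}_{y}_{z}_{probe}_{field}")
    return names


names = descriptor_names()
per_field = GRID_POINTS_PER_AXIS ** 3
print(f"descriptors per field: {per_field}")
print(f"total descriptors:     {len(names)}")
```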

Variable selection and model development

The descriptor matrix generated by LQTAgrid had dimensions 57 × 11,664. It was refined by eliminating descriptors with a correlation lower than 0.3, leaving 4248 descriptors. A V-WSP variable reduction MATLAB routine, an unsupervised variable reduction method based on the V-WSP algorithm (Ballabio et al., 2014), was subsequently applied to the 4248 descriptors with an absolute correlation threshold of 0.85, which gave the 357 most suitable descriptors. The dataset was split into a training set of 44 compounds and a test set of 13 compounds. Test set compounds were selected so as to include the highest and lowest activities as well as structural diversity. The PLS model was built using PLS_Toolbox (Eigenvector Research, Inc.) in the MATLAB workspace. The model built using absolute values, ten latent variables and Venetian blinds cross-validation with six splits and one sample per split was found to be the best model.
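The first refinement step can be sketched as a simple correlation filter. The text does not state what the correlation is computed against, so this sketch assumes the absolute Pearson correlation between each descriptor column and pIC50. The data and names are hypothetical, and the sketch does not implement the V-WSP algorithm used in the second step.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for zero-variance input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def filter_by_activity_correlation(columns, activity, cutoff=0.3):
    """Keep descriptor columns whose |r| with activity is at least the cutoff."""
    return {
        name: values
        for name, values in columns.items()
        if abs(pearson(values, activity)) >= cutoff
    }


# Toy data (hypothetical): five compounds, four descriptor columns
activity = [4.0, 4.5, 5.0, 5.5, 6.0]
columns = {
    "d_correlated": [0.1, 0.2, 0.3, 0.4, 0.5],
    "d_anticorrelated": [5.0, 4.0, 3.0, 2.0, 1.0],
    "d_constant": [1.0, 1.0, 1.0, 1.0, 1.0],
    "d_uncorrelated": [1.0, -1.0, 0.0, -1.0, 1.0],
}
kept = filter_by_activity_correlation(columns, activity)
print(sorted(kept))
```

Using the absolute value keeps strongly anticorrelated descriptors, which carry as much information for regression as positively correlated ones.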

Docking studies

To date, neither the full-length structure of HIV-1 integrase nor the structure of integrase in complex with its DNA counterpart is available. Owing to the emergence of strains resistant to HIV-1 IN catalytic site inhibitors such as raltegravir, the development of allosteric site inhibitors can be an extremely advantageous approach (Al-Mawsawi et al., 2006). Through structure-based design, a new allosteric region in HIV-1 IN has been identified, and the structure is available at the RCSB Protein Data Bank (PDB code 3NF7). The QSAR methodology proposed in this work is receptor independent, but to understand the structural features important for HIV-1 IN inhibition, docking studies were carried out using AutoDock Vina from the Molecular Graphics Laboratory, Scripps Research Institute. The accuracy of docking was validated by docking the co-crystallized ligand, 5-[(5-chloro-2-oxo-2,3-dihydro-1H-indol-1-yl)methyl]-1,3-benzodioxole-4-carboxylic acid, into the active site. The root-mean-square deviation (RMSD) between the docked pose and the original co-crystallized pose was 0.104 Å (Fig. 4).
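For readers unfamiliar with the redocking check, the RMSD between two poses can be computed as in the minimal sketch below. It assumes both poses list the same atoms in the same order and already share the receptor coordinate frame, so no superposition is applied. The coordinates are hypothetical and are not taken from 3NF7.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists (in Å)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("poses must contain the same, non-zero number of atoms")
    squared_sum = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared_sum / len(coords_a))


# Hypothetical three-atom poses: each docked atom is displaced by 0.1 Å
crystal_pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
docked_pose = [(0.1, 0.0, 0.0), (1.5, 0.1, 0.0), (1.5, 1.5, 0.1)]
print(f"RMSD = {rmsd(crystal_pose, docked_pose):.3f} Å")
```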

Fig. 4

Docked conformer of ligand (red) and original pose of co-crystallized ligand (green) (Color figure online)

Results and discussion

4D-QSAR model

The objective of the current 4D-QSAR study was to build the best 4D-QSAR model with good predictive ability. The strategy of MD simulation and subsequent generation of interaction energy contributions was intended to emulate the interaction of important residues at the binding site of HIV-1 IN with the compounds under investigation. The number of refined descriptors was 357, so multiple linear regression (MLR), a multivariate chemometric tool, could not give a good model: the number of independent variables exceeds the number of samples, and the resulting model would be prone to over-fitting. PLS regression is better suited to this situation and was therefore used in the current investigation. The data were split into 44 training set compounds and 13 test set compounds with the help of the random_select.m script for MATLAB. PLS model development was carried out using PLS_Toolbox from Eigenvector Research, Inc. The raw data were preprocessed by autoscaling. The leave-one-out cross-validation method was considered inappropriate for the current data because the dataset contains more than 20 samples. Thus, the Venetian blinds cross-validation method with six splits and one sample per split was adopted. The model with 10 latent variables was found to be the best among 20 models. This PLS model showed R² calculated = 0.903015, R² cross-validated = 0.599553 and R² predicted = 0.688525. All these coefficients of determination are within acceptable limits. The full model, including the test set and training set, showed R² = 0.853. The other statistical findings include root-mean-square error (RMSE) calculated = 0.21276, RMSE predicted = 0.371579 and prediction bias = −0.15362. The predicted pIC50 values, with their residuals, are shown in Table 2. The plots of predicted activity against measured activity, scores on latent variables and measured activity against residuals are shown in Figs. 5, 6 and 7, respectively.
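The cross-validation layout and the reported statistics can be illustrated with a small sketch. It assumes that Venetian blinds with six splits and one sample per blind assigns sample i to split i mod 6, and that R² is computed as 1 − SS_res/SS_tot. The exact definitions used by PLS_Toolbox may differ, and the measured and predicted values below are hypothetical.

```python
import math


def venetian_blinds(n_samples, n_splits=6):
    """Assign sample i to split i % n_splits (one sample per blind, assumed layout)."""
    splits = [[] for _ in range(n_splits)]
    for i in range(n_samples):
        splits[i % n_splits].append(i)
    return splits


def r_squared(measured, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def rmse(measured, predicted):
    """Root-mean-square error between measured and predicted values."""
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    return math.sqrt(ss_res / len(measured))


# Layout for the 44 training compounds
splits = venetian_blinds(44)
print("left-out samples per split:", [len(s) for s in splits])

# Hypothetical pIC50 values, for illustration only
measured = [4.0, 5.0, 6.0, 7.0]
predicted = [4.5, 5.0, 6.0, 6.5]
print(f"R2 = {r_squared(measured, predicted):.3f}, RMSE = {rmse(measured, predicted):.3f}")
```

Because the assignment is interleaved, the outcome depends on how the samples are ordered in the data matrix.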

Table 2 Measured pIC50 values and predicted pIC50 values for test set and training set compounds along with residuals
Fig. 5

Predicted pIC50 values against measured pIC50 values for training set and test set compounds

Fig. 6

Scores on latent variables

Fig. 7

Predicted pIC50 residuals against measured pIC50 values

Docking studies

During development of the current 4D-QSAR model, an NH3+ probe was used to generate the interaction energy contributions. The interaction energy contribution between the NH3+ field point and the CEP was generated at each grid point of the 18 × 18 × 18 grid box with 1 Å grid spacing. VAL150, LYS188, ARG199, HIS183, ILE151, MET154 and LEU158 are important residues in the binding site of HIV-1 IN. The first three residues are important for hydrogen bonding, and the other residues are important for hydrophobic interactions. Docking studies were carried out using AutoDock Vina on a heteroatom-stripped and optimized model protein of the HIV-1 IN catalytic domain (PDB code 3NF7). During docking, a grid box of size 11 × 11 × 11 with 1 Å spacing and x, y, z center of 9.918, −26.384 and −10.818, respectively, was used. In the 4D-QSAR descriptors, the VWD interactions correspond to the Coulombic contributions, and the electrostatic and hydrophobic interactions correspond to the steric LJ potential contributions. The interactions of the docked conformers of a few of the most active and least active compounds from the dataset with important residues are presented in Table 3 and shown in Fig. 8.
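The docking box described above could be passed to AutoDock Vina through a configuration file along the following lines. The sketch only assembles the text of such a file. The receptor and ligand file names are hypothetical, and treating the 11 × 11 × 11 box as edge lengths in Å is an assumption.

```python
def vina_config(receptor, ligand, center, size):
    """Build the text of an AutoDock Vina configuration file for a search box."""
    center_x, center_y, center_z = center
    size_x, size_y, size_z = size
    lines = [
        f"receptor = {receptor}",
        f"ligand = {ligand}",
        f"center_x = {center_x}",
        f"center_y = {center_y}",
        f"center_z = {center_z}",
        f"size_x = {size_x}",
        f"size_y = {size_y}",
        f"size_z = {size_z}",
    ]
    return "\n".join(lines) + "\n"


config = vina_config(
    receptor="3nf7_receptor.pdbqt",     # hypothetical file name
    ligand="compound_16.pdbqt",         # hypothetical file name
    center=(9.918, -26.384, -10.818),   # box center reported in the text
    size=(11, 11, 11),                  # box size, assumed to be in Å
)
print(config)
```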

Table 3 Important interactions of top five highly active and inactive compounds at binding site of HIV-1 IN
Fig. 8

Interaction of active and less active compounds with important residues

Docking studies revealed that electrostatic and hydrophobic interactions between the ligand and the important residues LYS188, HIS183 and ARG199 are necessary for HIV-1 IN inhibitory activity. The most active compounds were found to make important VWD interactions with VAL79, VAL77, ARG199 and GLU157, whereas the least active compounds could not make such interactions. The water molecule HOH272 at the binding site was found to interact with most of the HIV-1 IN inhibitors. The 4D-QSAR studies resulted in a model in which the descriptors 17_17_9_NH3+_Coulombic (C), 17_17_8_NH3+_C, 18_15_9_NH3+_C, 18_16_18_NH3+_LJ, 8_13_11_NH3+_LJ, 18_15_10_NH3+_C, 17_15_12_NH3+_LJ, 18_17_8_NH3+_C, 18_17_6_NH3+_C and 17_16_10_NH3+_C contributed most to the final model. Of these, seven are Coulombic contribution descriptors and three are LJ steric interaction energy contributions. These descriptors, shown as field points on the structures of the most active compound 16 and the least active compound 3, together with the VWD, electrostatic and hydrophobic surfaces around these molecules, are shown in Fig. 9.
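The split between Coulombic and LJ descriptors can be verified by parsing the descriptor labels listed above. The sketch assumes each label has the form x_y_z_probe_field, with the first label's field written in its abbreviated form (C). The parsing helper is hypothetical.

```python
from collections import Counter

# Labels as listed in the text, with the field abbreviated to C or LJ
TOP_DESCRIPTORS = [
    "17_17_9_NH3+_C", "17_17_8_NH3+_C", "18_15_9_NH3+_C",
    "18_16_18_NH3+_LJ", "8_13_11_NH3+_LJ", "18_15_10_NH3+_C",
    "17_15_12_NH3+_LJ", "18_17_8_NH3+_C", "18_17_6_NH3+_C",
    "17_16_10_NH3+_C",
]


def parse_descriptor(name):
    """Split a label into grid indices (x, y, z), probe and field type."""
    x, y, z, probe, field = name.split("_")
    return (int(x), int(y), int(z)), probe, field


counts = Counter(parse_descriptor(name)[2] for name in TOP_DESCRIPTORS)
print(dict(counts))

# Grid positions of the LJ descriptors
lj_points = [parse_descriptor(n)[0] for n in TOP_DESCRIPTORS if n.endswith("_LJ")]
print(lj_points)
```

Most of the Coulombic descriptors sit at x = 17 or 18, that is, along one face of the grid box.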

Fig. 9

Interpretation of interaction energy descriptors. (a) Interaction energy descriptors around most active compound 16; (b) interaction energy descriptors around least active compound 3; (c) VWD surface; (d) hydrophobic surface around most active compound 16; (e) VWD surface; (f) hydrophobic surface around least active compound 3

Conclusion

A 4D-QSAR model was built using GROMACS-based MD simulation on coumarin derivatives as HIV-1 IN inhibitors. In the model, interaction energy descriptors were constructed on the CEP of each compound. PLS regression was carried out on the selected descriptor matrix of 357 descriptors. The selected model with ten latent variables showed R² calculated = 0.903015, R² cross-validated = 0.599553 and R² predicted = 0.688525. The results of the 4D-QSAR study are in good agreement with the docking studies, suggesting that the VWD interactions are important for the higher activity of these compounds. The 4D-QSAR model generated can be used for the development of HIV-1 IN 3′-processing inhibitors. As part of future work, we are designing coumarin-based molecules with substituents that will mainly contribute to VWD interactions.