Introduction

Malaria is a lethal disease caused by Plasmodium falciparum (Pf), one of the deadliest species of the Plasmodium parasite [1]. According to WHO, around 3.3 billion cases and 438,000 deaths were reported in 2015 [2]. Although malaria transmission has declined over the past 15 years (2000–2015) [3], most countries still face widespread malaria due to the emergence of drug resistance [4,5,6,7,8]. For this reason, WHO regards malaria as the first-priority tropical disease [9]. This scenario has drawn the attention of researchers toward the design of new strategies and drugs to accelerate the eradication of malaria. However, drug discovery and development is a laborious, time-consuming and expensive process. Thus, computer-aided molecular design and computational methods for lead generation and optimization are of enormous significance in reducing the overall cost and time associated with a drug discovery program [10]. In this pursuit, a class of computational techniques known as pharmacophore modeling has been extensively applied [11,12,13]. This technique has proven very effective for lead identification and is fundamentally divided into structure-based (SB) and ligand-based (LB) approaches. LB methods rely solely on the information extracted from a set of ligands, whereas SB methods depend on the molecular recognition within protein–ligand complexes. It is well reported in the literature that the LB method suffers from drawbacks because model generation depends on the ligands only [14]; in particular, the performance of the model is governed by the selection of an appropriate training set [15,16,17,18,19,20,21]. One way to overcome this drawback is to switch to the SB approach, which imposes the constraints vital for activity and selectivity [14, 22, 23].
In general, SB methods exploit the information of either an apo-structure (protein only) or a single protein–ligand complex for hypothesis generation. However, if more than one receptor–ligand complex is available, multicomplex-based pharmacophore modeling is best suited to incorporate all the important interaction patterns simultaneously [24, 25]. Therefore, this strategy was employed to search for representative hypotheses of the selected enzymatic druggable targets of Pf.

In the present study, six protein classes of Pf, viz. Oxidoreductases, Hydrolases, Transferases, Lyases, Isomerases and Ligases, were considered. However, only 16 different enzymes (from five protein classes) were selected on the basis of experimental inhibitory activity. Subsequently, the protein–ligand complexes of the 16 groups were subjected to pharmacophore generation using the multicomplex-based approach [24, 26,27,28]. The generated models were validated using a focussed database made up of experimental actives of the selected targets. The generated pharmacophore models from each group were then pooled on the basis of feature types. A model from each pool was selected and clustered using the Euclidean distance method. The aim was to provide insight into the pharmacophore similarity among the selected enzyme classes and the sharing of features among the inhibitors. The rationale behind the current work was to search for representative pharmacophores and to provide guiding principles for the design of stringent pharmacophores that can be employed for virtual screening (VS). We anticipate that the present study (Scheme 1) will advance the knowledge of the SB pharmacophore modeling approach and provide useful suggestions for its application in VS.

Scheme 1

Diagrammatic representation of the workflow adopted to conduct the current study (PL-complex is protein–ligand complex)

Materials and methods

Selection and preparation of protein–ligand complexes

All the protein–ligand complexes were manually selected from the RCSB Protein Data Bank (Supplement Table 1). Only those complexes that satisfied the following criteria were selected: (1) the crystallized ligands/inhibitors must possess experimentally determined activity in terms of IC50, Ki or Kd; (2) the protein must belong to an enzymatic protein class; (3) the target must be crystallized with more than one inhibitor/ligand; (4) the crystallized inhibitors/ligands must occupy the same condensation site; and (5) if a protein–ligand complex has been solved at different crystallographic resolutions, the structure with the highest resolution is taken. This resulted in the selection of five protein classes for the present study (Fig. 1). The selected complexes were retrieved and grouped on the basis of enzyme commission number. The grouped protein–ligand complexes were then prepared using the Protein Preparation Wizard of Accelrys-Discovery Studio (DS) [29].

Fig. 1

Pictorial representation of the selected sixteen enzymes of five protein classes of Plasmodium falciparum viz. Oxidoreductases, Hydrolases, Transferases, Lyases and Isomerases

Superimposition and pharmacophore generation

The alignment and superimposition of the prepared protein–ligand complexes from each group were carried out using the Align and Superimpose Proteins module of Accelrys-DS [29]. The rationale behind the superimposition was to integrate the essential common interactions in a single coordinate file. For each group, the complex with the highest crystallographic resolution was chosen as the reference for the superimposition (Table 1). After superimposition, all the protein chains except the reference were deleted to avoid repetition of the 3D coordinates of the proteins. The superimposed conformers from each group were then subjected to pharmacophore generation using the HipHop algorithm of the Common Feature Pharmacophore Generation protocol of Accelrys-DS [29]. Because crystal-bound conformers were used, the conformational flexibility of the ligands was disabled prior to pharmacophore construction. In addition, the inter-feature distance was set to 2 Å so that closely spaced chemical features were considered during pharmacophore generation. The pharmacophore features, viz. hydrogen-bond acceptor (A), hydrogen-bond donor (D), hydrophobic (H), hydrophobic aliphatic (Z), hydrophobic aromatic (Y), positive ionizable (P), negative ionizable (N) and ring aromatic (R), were requested for the model construction.
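
The reference choice described above can be illustrated with a minimal sketch. It assumes that "highest resolution" means the smallest resolution value in Å; the function name `pick_reference` and the PDB identifiers and resolution values are hypothetical placeholders, not entries from Table 1.

```python
def pick_reference(complexes):
    """Return the complex with the highest resolution (smallest value in Å)."""
    return min(complexes, key=lambda c: c["resolution"])


# Hypothetical complexes of one enzyme group; IDs and values are placeholders.
group = [
    {"pdb_id": "XXX1", "resolution": 2.10},
    {"pdb_id": "XXX2", "resolution": 1.65},
    {"pdb_id": "XXX3", "resolution": 2.40},
]

reference = pick_reference(group)
print(reference["pdb_id"])
```

The remaining complexes of the group would then be superimposed onto this reference, which is a step performed inside Accelrys-DS and not reproduced here.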

Table 1 List of the selected druggable proteins of five enzymatic classes of Plasmodium falciparum along with their abbreviation, EC number, PDB IDs and crystallographic resolution

Database preparation and pharmacophore validation

A curated database comprising 3705 experimental actives of the 16 selected targets was built to test the performance of the generated pharmacophore models (Supplement Table 2). The molecules pertaining to the 16 selected druggable targets were extracted from the literature and retrieved from the BindingDB [30] and ChEMBL [31] (release June-2017) databases. Because no experimental actives were available for Pf triosephosphate isomerase (TP), inhibitors of the human enzyme were used instead. All the chosen molecules were collated into a single structure data format (SDF) file, which was prepared using the CHARMm force field [32]-based BEST conformational generation method of the Build 3D Database protocol of Accelrys-DS [29]. This method generates up to 255 conformers per molecule within an energy threshold of 20 kcal/mol above the global minimum and eliminates structural duplicates to maintain consistency in the dataset. The quality of all the constructed pharmacophores was evaluated using the prepared database. Each model was mapped at default parameters using the Search 3D Database module of Accelrys-DS [29]. The statistical parameters, viz. Ht (total number of hits retrieved), Ha (total number of actives retrieved), %A (recall of actives), %RA (precision), sensitivity, specificity, area under the curve of the receiver operating characteristic plot (AUC-ROC), enrichment factor (EF) and Güner–Henry score (GH), were calculated. To compute these parameters for the hypotheses of a particular target, all the molecules in the database except the inhibitors of that target were presumed to be inactive.
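
As an illustration of how these parameters relate to one another, the sketch below computes them from four counts. It assumes the commonly used definitions of EF and GH, which may differ in detail from the implementation used for Supplement Table 6; the function name `screening_metrics` and the example counts (other than the database size of 3705) are hypothetical.

```python
def screening_metrics(total, actives, hits, active_hits):
    """Compute screening metrics from four counts.

    total       (D):  molecules in the database
    actives     (A):  inhibitors of the target under study
    hits        (Ht): molecules retrieved by the pharmacophore
    active_hits (Ha): retrieved molecules that are actives
    """
    inactives = total - actives            # presumed inactives
    false_hits = hits - active_hits        # presumed inactives that were retrieved
    recall = active_hits / actives if actives else 0.0        # %A / 100 (sensitivity)
    precision = active_hits / hits if hits else 0.0           # %RA / 100
    specificity = (inactives - false_hits) / inactives if inactives else 0.0
    # Enrichment factor (assumed definition): hit-list precision over base rate.
    ef = precision / (actives / total) if actives else 0.0
    # Guner-Henry score (assumed standard form).
    gh = 0.0
    if hits and actives and inactives:
        gh = (active_hits * (3 * actives + hits)) / (4 * hits * actives)
        gh *= 1 - false_hits / inactives
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "EF": ef,
        "GH": gh,
    }


# Hypothetical counts for one hypothesis; only the database size is from the text.
for name, value in screening_metrics(3705, 100, 50, 40).items():
    print(f"{name}: {value:.3f}")
```

Under these definitions, a hypothesis that retrieves no molecule at all receives a precision, EF and GH of zero rather than raising a division error, which is a design choice of this sketch.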

Pooling, pharmacophore-clustering and ligand-similarity search

All the constructed hypotheses of the selected Pf targets were pooled on the basis of pharmacophore features. The pooling was carried out to avoid repetition of features for a particular target. The pooled-hypotheses from each target were then subjected to clustering using the web-interface tool CIMminer [33]. The Euclidean distance method and the complete linkage algorithm were selected for clustering. The equal width binning algorithm was applied to color the distribution of the resultant clusters. To generate the feature-based 2D-clustered image map (CIM), the descriptors (D, A, H, Z, Y, P, N and R) of the pooled-hypotheses, along with their frequencies, were compiled into an [8 × 56] matrix as input (Supplement Table 3). The rationale behind the clustering was to analyze the pharmacophore similarity across the developed models.
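
The idea of clustering hypotheses by feature composition can be sketched in a few lines. The example below is not CIMminer; it is a plain-Python illustration, assuming that each pooled-hypothesis is encoded as a vector of feature counts in the order D, A, H, Z, Y, P, N, R and that clusters are merged by complete linkage on Euclidean distance. The function names are hypothetical, and the feature strings are used only as sample inputs.

```python
from itertools import combinations
from math import dist

FEATURES = "DAHZYPNR"  # descriptor order assumed for the count vector


def feature_counts(hypothesis):
    """Encode a feature string such as 'RHHA' as counts per feature type."""
    return [hypothesis.count(f) for f in FEATURES]


def complete_linkage(vectors, n_clusters):
    """Agglomerative clustering; cluster distance = largest pairwise distance."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = max(dist(vectors[i], vectors[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters


hypotheses = ["RHA", "RHHA", "AHHH", "AHY", "NDAAA"]
vectors = [feature_counts(h) for h in hypotheses]
for cluster in complete_linkage(vectors, n_clusters=2):
    print([hypotheses[i] for i in cluster])
```

Because only feature counts enter the distance, two hypotheses with the same composition but different 3D arrangements are treated as identical, which is the simplification the composition-based approach accepts.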

In addition, a ligand-similarity search was carried out to check the proximity and similarity between the actives and presumed inactives retrieved by a particular hypothesis. To accomplish this, the fingerprint-based Tanimoto similarity coefficient was calculated using the Find Similar Molecules by Fingerprints protocol of Accelrys-DS [29]. The combined clustering and ligand-similarity approach helps to identify and analyze patterns in the arrangement of the pharmacophores and to explain the seemingly random distribution of the inhibitors across the generated models.
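
For readers unfamiliar with the Tanimoto coefficient, the following sketch shows the calculation on fingerprints represented as sets of "on" bit positions. The fingerprints shown are hypothetical toy values, not output of the Accelrys-DS protocol, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention assumed here for two empty fingerprints
    return len(fp_a & fp_b) / union


def max_similarity(active_fps, inactive_fps):
    """Highest Tanimoto value between any mapped active and presumed inactive."""
    return max(
        (tanimoto(a, i) for a in active_fps for i in inactive_fps),
        default=0.0,
    )


# Hypothetical toy fingerprints (bit positions that are set).
actives = [{1, 2, 3, 4}, {2, 3, 9}]
inactives = [{3, 4, 5, 6}, {1, 2, 3, 4, 7}]
print(f"{max_similarity(actives, inactives):.2f}")
```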

Results

Common feature pharmacophore construction

Multicomplex-based pharmacophore models for sixteen Pf enzymes from five protein classes, each crystallized with different inhibitors, were generated using the Common Feature Pharmacophore Generation protocol of Accelrys-DS [29]. Prior to the generation of pharmacophore models, all the protein–ligand complexes of a particular enzyme were superimposed using a representative protein as the reference (Table 1). The rationale behind the superimposition was to transform the 3D coordinates of the different protein–ligand complexes to a common frame (Fig. 2). Subsequently, the pharmacophores were constructed from the bioactive conformations of the crystallized ligands of each enzyme. A total of 153 multicomplex-based pharmacophore models were generated (Supplement Table 4 and Supplement Fig. 1) and subsequently screened against the prepared focussed database. The main objective was to analyze the distribution pattern of the inhibitors across the Pf enzymatic proteome. The screening pattern showed that the molecules have affinity for off-targets (Fig. 3, Supplement Table 5). Consequently, it appears nearly impossible for a model to retrieve its own experimental actives without also mapping a large number of presumed inactives.

Fig. 2

Superimposed protein–ligand complexes of selected druggable proteins of five enzymatic classes of Plasmodium falciparum

Fig. 3

Graphical representation of the number of inhibitors mapped by a particular pharmacophore model from the focussed dataset

For instance, hypothesis-3 of l-lactate-dehydrogenase (LD) and all the hypotheses of orotidine-5′-phosphate-decarboxylase (OPD) retrieved 100% of their actives together with 88.58% and ≤ 1.95% of the presumed inactives, respectively (Supplement Table 6). However, a few models, namely model-2 of 1-deoxy-d-xylulose 5-phosphate reductoisomerase (DXR), all models of M17-leucyl aminopeptidase (MLA), cell division control protein (CCP) and triosephosphate isomerase (TP), and five models of purine nucleotide phosphorylase (PNP), failed to retrieve even a single active molecule from the database. One potential way to overcome this problem is to impose the necessary constraints on the generated models so as to obtain stringent pharmacophores. With this in view, the current study was carried out on the Pf enzymatic proteome and thus expands the application domain of pharmacophore modeling.

Validation metrics

The main objective of pharmacophore-based virtual screening is to increase the probability of retrieving from a database those molecules that are likely to be active in experiments. Accordingly, virtual screening (VS) is considered a promising technique for filtering large sets of molecules, and the selection of the model used for VS becomes an essential step. To evaluate the quality of the hypotheses, various metrics such as EF, %A, %RA, specificity, sensitivity, GH and AUC-ROC were used. The common practice for obtaining these parameters is to use a validation set comprising experimentally confirmed inhibitors (actives) and experimentally reported non-inhibitors (inactives) of the particular target. In contrast, the present work exploited a focussed dataset made up of the experimental actives of the 16 selected enzymes. The aim was not to validate the pharmacophores statistically but rather to check the real-world distribution of the inhibitors and their affinity for the off-targets. Thus, all 153 pharmacophores were initially assessed using a test set comprising the actives and presumed inactives. This method was used to estimate the sensitivity, specificity and AUC-ROC of all the hypotheses. In general, sensitivity and specificity indicate the ability of the model to identify the actives and to exclude the inactives from a dataset, respectively, whereas AUC-ROC reflects the ability of the model to rank actives ahead of presumed inactives. Supplement Table 6 shows that a wide range of values was obtained for these parameters. Except for the hypotheses obtained for the enzymes Plasmepsin-2 (PL-2) and OPD, all displayed discrepancies in terms of sensitivity, specificity and AUC.
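
The interpretation of AUC-ROC as the ability to rank actives ahead of presumed inactives can be made concrete with a short sketch. It uses the rank-based (Mann–Whitney) formulation, which is an assumption about how to compute the area and not necessarily the procedure used by Accelrys-DS; the function name `roc_auc` and the fit scores are hypothetical.

```python
def roc_auc(active_scores, inactive_scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    A pair counts 1 if the active scores higher, 0.5 if the scores tie.
    """
    n_pairs = len(active_scores) * len(inactive_scores)
    if n_pairs == 0:
        raise ValueError("need at least one active and one inactive score")
    correct = 0.0
    for a in active_scores:
        for i in inactive_scores:
            if a > i:
                correct += 1.0
            elif a == i:
                correct += 0.5
    return correct / n_pairs


# Hypothetical pharmacophore fit scores for mapped molecules.
actives = [0.9, 0.8, 0.4]
presumed_inactives = [0.7, 0.3]
print(f"{roc_auc(actives, presumed_inactives):.3f}")
```

A value near 1 means nearly every active outranks every presumed inactive, whereas a value near 0.5 corresponds to a ranking no better than random.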

In addition, the EF and its descriptors (%A and %RA) were calculated. EF quantifies how well the actives are recognized relative to the presumed inactives, and a higher value corresponds to better performance and reliability of the model. Supplement Table 6 shows that the hypotheses of the selected targets displayed a diverse range of EF values (0–48.85). Good EF values were obtained for all the hypotheses of the PL-2 (4.16–4.54) and OPD (40.26–48.85) enzymes. This behavior was consistent with the sensitivity, specificity and AUC values of the hypotheses of PL-2 and OPD. In general, an ideal model displays sensitivity and specificity ≈ 1, high EF and AUC values, and a steep slope of the ROC curve. The models obtained for OPD fit well within these criteria and thus can be used for VS (Supplement Table 6 and Supplement Fig. 2). However, the hypotheses of the other targets showed affinity for diverse inhibitors, which necessitates the use of stringent pharmacophore models. To address this, pharmacophore-clustering and Tanimoto-based ligand-similarity studies were conducted.

Cluster and similarity analysis

The similarity between the pharmacophore models of the selected protein targets was assessed by pharmacophore clustering. The key objective was to study the distinct patterns and their relation to the specificity of the developed models. Instead of comparing the 3D coordinates of the feature types, the composition of features was compared. This approach stems from the literature, where 3D-pharmacophore-based clustering was performed to understand intermolecular interactions [34].

Starting from the lower hierarchy of the dendrogram (Fig. 4), the pooled-hypotheses of PL-2 (5 pools) formed two clusters comprising two and three hypotheses, respectively. None of these hypotheses clustered with the hypotheses of other targets, indicating a distinct pattern of feature types. All the clustered hypotheses comprise six features; however, the hypotheses of the lower hierarchical cluster differ by one feature, while those of the higher hierarchical cluster differ by two features. Notably, 71.33–75.08% of the PL-2 inhibitors were retrieved from the database by the clustered hypotheses (Table 2), which indicates the selectivity of these models toward their own inhibitors. Despite this significant specificity, the six-feature pharmacophore models also extracted a small number of inhibitor molecules of off-targets such as aminopeptidase (AM1), dihydrofolate reductase (DHFR), dihydroorotate dehydrogenase (DHODH), enoyl-acyl carrier reductase (ENR), DXR, CCP, LD, MLA and TP (Table 2). This non-specificity might be resolved by incorporating either a P or an N feature into the six-feature hypothesis. The next lower hierarchical clustered target in the cladogram was fatty acid synthesis protein (FAS). All the hypotheses pertaining to this target were grouped into two pools, and the hypothesis from each pool clustered with DHODH and DHFR, respectively. Owing to their simple pharmacophoric requirements (3 features), these clustered hypotheses retrieved inhibitors of all the targets except OPD (Table 2). Specificity/sensitivity demonstrated with 3-feature hypotheses may therefore not be a sufficient indicator for recognizing molecules in pharmacophore-based VS [25]. Traversing forward from FAS, the four pooled-hypotheses of DHODH assembled into three clusters with FAS, DHFR and ENR.
The DHODH hypotheses clustered with FAS consist of 3 (RHA) and 4 (RHHA) features and differ by one H feature. The pooled-hypotheses clustered with DHFR and ENR comprise AHHH and AHY features, respectively. All the clustered hypotheses retrieved 78.12–96.50% of their own inhibitors along with a significant contribution from all the chosen targets except OPD (Table 2). The lack of sensitivity can be attributed to the simple pharmacophoric requirements. From the cladogram (Fig. 4), it is expected that the removal of A and the addition of R to the clustered hypotheses may increase both the specificity and the sensitivity. A recent study has shown the importance of such features in enhancing the specificity of the pharmacophores of PfDHODH [24]. Similarly, the pooled-hypotheses of DHFR clustered with DHODH, FAS, ENR and AM1 and thereby formed four clusters (Fig. 4). The retrieved 5- and 4-feature hypotheses differed by an H feature. Akin to DHODH, high specificity (76.43–84.08%) and low sensitivity were prevalent in all the pooled-hypotheses of DHFR. However, the dendrogram suggests that the presence of N/P in conjunction with R may enhance the sensitivity of the clustered pharmacophore models.

Fig. 4

Dendrogram representing the feature-based 2D-clustered image map of pooled-pharmacophores based on the composition of the features

Table 2 List of the number of inhibitors retrieved by the pooled-hypotheses from the prepared focussed dataset

For the sake of brevity, the non-meaningful pooled-hypotheses are not discussed. These include all the pooled-hypotheses of AM1, CCP and spermidine synthase (SS), pooled-hypothesis-3 of LD and pooled-hypotheses 1, 2 and 3 of deoxyuridine-5′-triphosphate nucleotidohydrolase (DUTPase). For such targets, it is imperative to prioritize a limited number of chemical features (typically 3–7) to construct a practical hypothesis for VS experiments [14]. All the meaningful hypotheses of LD were grouped in the same cluster and differed by one feature type. The results show that the N feature has a dramatic effect on both the sensitivity and the specificity of the models (Table 2). However, owing to the simple pharmacophoric requirements, the sensitivity was very low compared to the specificity. We expect that the insertion of H and D features alongside the existing features may balance the sensitivity and specificity of the models for this target. For the adjoining target ENR, one 4-feature and three 3-feature pooled-hypotheses were grouped into three different clusters. The two lower hierarchical pooled-hypotheses clustered with DHODH, FAS and DHFR, whereas the upper hierarchical hypotheses did not cluster with any other hypotheses. Akin to the above-mentioned targets, the models of ENR were not able to discriminate between the actives and inactives, which resulted in insignificant specificity and sensitivity (Table 2). However, the clustering analysis suggests that the addition of H and N to the obtained pharmacophores may enhance the sensitivity of the ENR models.

Likewise, the clustering of pharmacophores displayed a single cluster of the pooled-hypotheses of PNP. Each hypothesis comprises seven features. Despite significant specificity, these models showed insignificant sensitivity, ranging from 0 to 17.33%. The likely reason is the complexity of the 7-feature hypotheses, and the removal of an A, D or R feature may increase the sensitivity. In contrast, the 7-feature pooled-hypotheses of TK formed three small clusters with SS only. The sheer dominance of A and D features has made these hypotheses highly specific. Correspondingly, these features were also accountable for the low sensitivity, ranging from 13.64 to 19.70%. The removal of the additional D/A features may improve the sensitivity of these models.

All the significant pooled-hypotheses of DUTPase comprise 7 features and formed three clusters. The first two clusters from the lower hierarchy consist solely of the significant and non-significant pharmacophores of the same target, whereas the last cluster comprises one significant hypothesis each of DUTPase and DXR. The significant hypotheses have notably excluded the inactives and actives (Table 2). The obtained clustering pattern suggests the omission of excessive D/A features in order to increase the sensitivity of the significant pharmacophores. Adjacent to this cluster, the six-feature pharmacophores of DXR formed two clusters, one with the aforementioned target and the other with OPD. The simple pharmacophore requirement of the lower hierarchical clustered hypothesis was accountable for its insignificant sensitivity and specificity. However, the presence of the N feature drastically increased the specificity of these models (Table 2). It is therefore expected that the insertion of an R feature in place of A may balance the sensitivity and specificity of these models.

Similarly, all the pooled-hypotheses of MLA formed a single cluster among themselves only. Owing to the presence of a group of complicated features, none of the clustered hypotheses was able to extract any molecule from the focussed database. Thus, we expect that the removal of a single N and P feature from the clustered hypotheses may enhance the sensitivity of the models. However, the increase in sensitivity may come at the cost of a decrease in the specificity of these pharmacophores. Traversing forward, the single pooled-hypothesis obtained for OPD clustered with DXR. The hypothesis consists of NDAAA features and showed marked discrimination between the experimental actives and inactives. Therefore, this hypothesis can be employed for VS experiments. The last hierarchical cluster comprises three 7-feature pooled-hypotheses of TP, none of which was able to retrieve a single active molecule from the database. The likely reason is the presence of excessive N features. Thus, we expect that the omission of the excess N features may enhance the sensitivity of the models.

Beyond the pharmacophore similarity, the other reason that may account for the low sensitivity and specificity of the pooled-hypotheses is the similarity among the inhibitors. The results (Table 3) show that the fingerprint similarity between the actives and inactives recognized by the pooled-hypotheses ranges from 0 to 100%. Most of the pooled-hypotheses of the chosen targets showed > 50% similarity between the mapped actives and inactives. This necessitates the identification of structurally diverse candidates for effective inhibition. However, the inhibitors of some targets such as MLA, CCP, PNP, SS, OPD and TP did not show considerable similarity (0–15%) between the mapped actives and inactives. The primary reason seems to be the inability of the pooled-hypotheses to retrieve their own inhibitors, which in turn can be related to the intricacy of the pharmacophoric features of the generated models. In particular, the pooled-hypotheses of PNP and SS retrieved very few of their corresponding inhibitors and therefore displayed very low similarity between the actives and inactives.

Table 3 Percentage similarity between the mapped inhibitors of the same and different enzymatic proteins retrieved by the generated pooled-hypotheses of the Plasmodium falciparum proteome

Discussion

Distribution of Pf structural data

A pharmacophore model can be considered statistically reliable only if there is a balance in the pharmacophoric requirements, an adequate number of inhibitors in the dataset and a satisfactory number of co-crystallized protein targets for the model construction [25]. Therefore, the selected Pf targets were classified into three groups, viz. highly explored, moderately explored and least explored, based on the number of inhibitors and the number of crystallized protein–ligand complexes. The thresholds for the categorization were chosen manually based on the overall distribution.

Classification on the basis of number of inhibitors

Targets for which more than 500 inhibitors have been reported in the literature were categorized as highly explored (AM1, MLA and PL-2). Similarly, targets with 100–500 and fewer than 100 reported inhibitors were categorized as moderately explored (DHFR, DHODH, DXR and ENR) and least explored (most of the targets), respectively (Fig. 5). For the last two categories, attention has to be paid to the synthesis of diverse inhibitors.

Fig. 5

Graphical representation of the number of inhibitors and number of crystallized protein–ligand complexes of the respective enzymes of Plasmodium falciparum. Green, blue and red colors represent highly, moderately and least explored targets, respectively

Classification on the basis of number of crystallized protein–ligand complexes

Targets with more than 10 crystallized protein–ligand complexes were classified as highly explored, whereas targets crystallized with 5–9 and fewer than 5 complexes were classified as moderately explored and least explored, respectively (Fig. 5). The majority of the targets have been crystallized with more than 10 different inhibitors and thus belong to the highly explored category. On the other hand, PNP, TK and FAS fall within the moderately explored category, and DUTPase, MLA and CCP within the least explored category.
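
The two categorization rules can be written down compactly, as in the sketch below. The thresholds are those stated in the text; the text does not assign a category to a target with exactly 10 complexes, so treating that case as highly explored is an assumption made here, and the function names and example counts are hypothetical.

```python
def category_by_inhibitors(n_inhibitors):
    """Categorize a target by the number of reported inhibitors."""
    if n_inhibitors > 500:
        return "highly explored"
    if n_inhibitors >= 100:
        return "moderately explored"
    return "least explored"


def category_by_complexes(n_complexes):
    """Categorize a target by the number of crystallized complexes.

    The text leaves exactly 10 complexes unassigned; it is treated
    as highly explored here (an assumption of this sketch).
    """
    if n_complexes >= 10:
        return "highly explored"
    if n_complexes >= 5:
        return "moderately explored"
    return "least explored"


# Hypothetical counts for an illustrative target.
print(category_by_inhibitors(620), "/", category_by_complexes(4))
```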

Overall, PL-2 and AM1 qualified in terms of both the number of inhibitors and the number of crystallized protein–ligand complexes. However, owing to its simple pharmacophoric requirements, the AM1 models were not able to discriminate between actives and inactives. On the other hand, DUTPase and CCP are systems of concern because of their deficiency in both protein–ligand complexes and inhibitors. Resolving these deficiencies is a challenge and of utmost concern for the generation of multicomplex-based pharmacophore models that reflect the quantitative structure–activity relationship.

Conclusions

In summary, the multicomplex-based pharmacophore modeling approach was used to construct pharmacophore models of the 16 selected Pf targets. A total of 153 hypotheses were generated and subsequently screened against a focussed database made up of experimental actives, in which the inhibitors of all other targets were presumed to be inactives. Most of the inhibitors showed affinity for off-targets. Therefore, various statistical parameters were calculated and correlated with the robustness of the generated models. Subsequently, all the generated models were pooled and clustered to analyze the pharmacophore similarity across the selected Pf targets. The essential features accountable for specificity were prioritized, and the rationale behind the non-specificity was highlighted. Both pharmacophore and ligand similarities were found to be accountable for the observed distribution of actives and inactives, with the latter dominating. Based on pharmacophore clustering and ligand similarities, solutions were offered to reduce the off-target affinities. Additionally, the promising targets and the valid pharmacophores that can be employed for virtual screening were highlighted. Overall, the study emphasizes the need for the construction of stringent pharmacophore models and the synthesis of structurally diverse inhibitor molecules. Despite recent advances in combating malaria, only limited wet-lab and molecular modeling efforts have been devoted to the development of efficient inhibitors. We expect that the present contribution will be helpful for the construction of stringent pharmacophore hypotheses of the selected targets, which can be exploited as an efficient pharmaceutical filter and a coherent inhibitor design strategy.