Keywords

1 Introduction

Virtual Screening (VS) methods can be divided into structure-based (SBVS) and ligand-based (LBVS) methods [4, 12, 14] This work focuses on similarity LBVS methods [2, 16]. In these techniques, the starting point is a source drug whose shape, electrostatic potential, or other descriptor is known. This source ligand or crystal will be the target, and the virtual screening methods try to find the more similar molecules in an extensive database or chemolibrary. When calculating the electrostatic similarity between the target and a compound in the database, the more used methodology in the literature consists of optimizing in terms of shape by using Rapid Overlay of Chemical Structures (ROCS) [11], selecting a number N of compounds with the highest shape similarity values, and then evaluate them in terms of electrostatic similarity. Based on the assumption that a more realistic description of the compound bioactivity during the optimization procedure may help to obtain better predictions, a new version of OptiPharm was implemented in [10], which involves the direct optimization of the electrostatic similarity. As the results showed, the new methodology provided better predictions in electrostatic potential than the classical ones. In this work, we go a step forward and propose new improvements in OptiPharm aimed to reach even better predictions. To do so, we firstly include the flexibility of the molecules in the optimization procedure.

Protein flexibility is necessary for metabolism, transport, and function biological effects. Except for simple molecules such as \(O_2\), both ligands and receptors are flexible molecules, which means that there is not a single three-dimensional representation of these molecules, but many. The conformational richness increases exponentially with the molecule’s size, i.e., the more atoms (and therefore bonds, angles, and torsions) it possesses, the more degrees of freedom there are. These degrees of freedom are not additive but multiplicative, giving rise to many possible conformational states (see Fig. 1).

Fig. 1.
figure 1

A rigid molecule (a) can generate different conformations (b). An example for the DB00331 target from the DrugBank database is shown here. The Target structure has been painted green for both figures. (Color figure online)

For this reason, most of the studies are based on ligands where flexibility is considered to assume the protein to be almost rigid or with partial flexibility, so that they only rotate a maximum of the possible rotatable bonds [1, 5, 7]. In some cases, the solution is to perform the Virtual Screening considering the molecule as rigid, and then apply a process where the flexibility is studied for the number of rotatable bonds allowed by the algorithm [5]. This process sometimes consists of varying in a discrete way each of the angles of the rotatable bonds to find the best solution. Following any of these methods, the computational time is reduced, but many solutions keep unexplored. What we propose in this work is a previous analysis of the molecules. First, many descriptors are calculated, and the most representatives are selected to compute the difference between the target and the molecules in the database. Then best molecules are filtered and selected as flexible molecules and are applied a conformational generation process. This methodology allows exploring all the conformation of each compound widely but saves time by discarding uninteresting molecules.

Apart from considering the flexibility of the molecules, the new OptiPharm incorporates mechanisms of interest that enhance the search and helps to reduce the computational cost. As the results will show, all these improvements help provide better predictions in the molecules.

The rest of the paper is organized as follows. Section 2 describes the scoring function considered in this study and resumes the main ideas of OptiPharm, focusing on the new procedures and strategies developed. Section 3 summarizes the computational and scientific context taken into consideration for the experiments. Finally, Sects. 4 and 5 show the main results and conclusions inferred.

2 Methods

2.1 Electrostatic Similarity Scoring Function

The electrostatic similarities are obtained by numerical solution of the Poisson equation [3], viz:

$$\begin{aligned} \nabla \{\epsilon (r)\nabla \phi (r)\} = - \rho _{mol}(r) \end{aligned}$$
(1)

where \(\phi (r)\) is the electrostatic potential, \(\epsilon (r)\) is the dielectric constant, and \(\rho _{mol}(r)\) is the molecular charge distribution. Electrostatic similarity between two compounds is compared by determining \(E_{AB}\):

$$\begin{aligned} E_{AB} = \int {\phi ^A(r)\phi ^B(r)\varTheta ^A(r)\varTheta ^B(r)\mathbf {dr}} \approx h^3\sum _{ijk}\phi ^A_{ijk}\phi ^B_{ijk}\varTheta ^A_{ijk}\varTheta ^B_{ijk} \end{aligned}$$
(2)

where \(\varTheta \) is a masking function to ensure potentials interior to the compound are not considered part of the comparison. The integral appearing in (2) is a volume integral, computed using a grid-spacing parameter, h.

Notice that the accuracy obtained from (2) depends on the number of atoms in the two compared molecules. To measure the similarity between compounds, regardless of the number of atoms that they are composed of and the descriptor used, the Tanimoto Similarity [6] value is computed as follows:

$$\begin{aligned} Tc_{E} = \frac{E_{AB}}{E_{AA}+E_{BB}-E_{AB}} \end{aligned}$$
(3)

where \(E_{AB}\) is the A molecule overlaid onto B molecule. \(E_{AA}\) and \(E_{BB}\) is the overlap of the molecules A and B, respectively.

2.2 OptiPharm Algorithm

OptiPharm is a recent software designed explicitly for LBVS problems. It implements a global evolutionary optimizer capable of calculating the similarity between two compounds, a target and a query. To do so, it uses different methods in the optimization process to gradually adjust the position of the query while the target fixes its position. The interested reader is referred to [9, 10] for an in-depth description of the original algorithm. In this work, we present a new version of OptiPharm. In the following, we briefly describe the new contributions.

To explore the solution space, OptiPharm works with a user-defined population of size M, which applies reproduction, selection, and improvement methods to each member of the population. A member or solution of this population represents the rotation and translation of the query molecule. Originally ten parameters were used to represent this modification, which means to work in a 10-dimensional search space. This paper presents a new version of OptiPharm, where the search space is reduced to 6 dimensions. The main change consists of replacing the use of quaternions with a semi-sphere parametrization, which simplifies the definition of the rotation axis. Consequently, searchability is enhanced due to the reduction of the search space dimension. Nevertheless, not only that, this new system avoids the repetition of the same rotation axis already explored.

This new mechanism provides improved freedom of exploration. In addition to reducing input parameters, the new version incorporates some problem knowledge, such as a mechanism to keep the angular variables between 0 and \(2\pi \) in a continuous circular. So, if during the search an angle \(\alpha \) takes a value greater than \(2\pi \), it is updated to the \(\alpha -2\pi \) value. In the previous version of OptiPharm, this value was updated to the maximum value of \(2\pi \).

Fig. 2.
figure 2

Procedure to rank rigid molecules.

2.3 Methodology

Procedure for Rigid Molecules

The process is trivial when working with rigid molecules and will be referred to throughout this paper under the name Rigid. As explained in the previous section, OptiPharm allows getting the best overlapping between two molecules, the target, and the query, to maximize the electrostatic similarity score. Consequently, when this procedure is repeated for each molecule in the database, their similarity score can be known. After that, the last step sorted the molecules by their similarity value. This procedure returns the most similar ones of interest since they can be successful potential drugs because they are the most similar to the target. Figure 2 shows this process to obtain a ranked list of compounds.

Procedure for Molecular Conformations

Working with flexible molecules with some rotational bonds implies modifying the methodology to obtain the similarity between a target molecule and a query molecule in the database. Multiple alternative conformers of this molecule are constructed by modifying the rotatable bonds with various rotation angles. This procedure simulates the flexibility in the rotatable bonds of a given molecule. However, the number of molecules in the database grows dramatically, and consequently, so does time.

A solution to this problem is to first discard those not promising compounds in the database and then generate conformations to the remaining molecules. This process, which we have called Flexible throughout the document, is explained in Fig. 3. In this figure, first, the descriptors are calculated for each molecule in the database. In this work, more than 4,000 different descriptors are obtained. However, many are not relevant or are repetitive, so different machine learning metrics are applied to discard them. In particular, variance and correlation are applied, and those descriptors whose values are 0 in most cases have been removed. This filter reduces the number of descriptors to a more limited number.

Fig. 3.
figure 3

Descriptor selection procedure.

The obtained descriptors are then used to filter the molecules in the database. As can be seen in Fig. 4, the Euclidean distance between the query molecule and each of the targets is calculated. Once they are available, the compounds are ordered from smallest to largest by the distance value, and the best M compounds are selected based on an empirical cutoff value. Finally, several conformations are generated for the selected query compounds and the target.

Fig. 4.
figure 4

Molecule filtering process based on Euclidean distance and subsequent generation of conformations of selected molecules.

After generating multiple conformations for target and query, an optimization procedure is run using OptiPharm. Figure 5 shows such a procedure using an example where only three conformations have been generated for both the target and the query.

Fig. 5.
figure 5

Procedure for obtaining maximum similarity when working with conformations of molecules.

As shown in Fig. 5, an extensive comparison is performed, which involves running \(nt\,\times \,nq\) times OptiPharm algorithm instead of just one for rigid molecules. In this exhaustive comparison, nt represents the number of conformations of the target molecule, and nq is the number of conformations of the query molecule. Once the maximum similarity of each of the comparisons has been calculated, the algorithm searches for the highest value and provides it as the final similarity result between the two flexible molecules.

Once the flexible target has been compared with all molecules in the database, they are ordered based on their conformation computed similarity value.

Consequently, new query compounds with a high similarity value can be identified while they are not detected when working with rigid molecules.

3 Materials

Hardware. The experiments of this work have been carried out using a cluster of 8x Bull Sequana X440-A5: 2 AMD EPYC Rome 7642 (48 cores) and 512 GB of RAM memory and 240 GB SSD.

FDA Database. The database used in this work was obtained from Drugbank, v.5.0.1 [15]. Specifically, a subset of 1,751 molecules validated by the Food and Drug Administration (FDA) has been used. The FDA is a federal agency of the U.S. Department of Health and Human Services responsible for protecting and promoting public health by controlling, among other things, prescription and over-the-counter pharmaceutical drugs (medicines). The original database has been downloaded from https://go.drugbank.com.

Software. The new version of OptiPharm, described in Sect. 2.2 is the optimization algorithm used to find the maximum similarity between two compounds. It has been configurated to consider the hydrogen atoms of each molecule. In addition, all the heavy atom radii have been set to 1.7Å. Furthermore, all compound pairs are centered and aligned. Consequently, the molecule centroids have been located at the coordinates center of the search space. Finally, each molecule has been aligned so that its longest axis has been oriented at X-axis and the shortest along the Z-axis. The input parameter set used in OptiPharm have been: \(N=200,000\) function evaluations, \(M=5\) starting poses, \(t_{max}=5\) iterations, and \(R = 1\) as the smallest possible radius.

Additionally, software OMEGA [8] has been the generator selected to obtain the conformations of targets and queries in the database. The maximum number of conformations for a given compound was limited to 500, though the obtained number was smaller in many compounds due to a small number of rotatable bonds.

Finally, Dragon (6.0.38) has been used to calculated 4885 descriptors for each molecule in the database.

4 Results

In this section, we will show the results obtained by the new methodology, and we will compare them with the ones obtained for the original OptiPharm in [10]. For illustration, we will only depict the outcomes obtained for the molecules DB00381 and DB00876.

Table 1. Top-10 most similar compounds in electrostatic to the target DB00381.

Following our methodology, we first calculate the 4885 descriptors for all molecules in the database. Subsequently, this group is reduced to 757 by applying the statistical metrics. With these numbers, it is computed the Euclidean distance between each molecule and the target. Later, the molecules are shorted according to that distance in ascending order. Only those molecules ranked within 10% of the shortest distance were selected. For the 175 molecules that remain, several conformations are generated. In particular, the sub-database obtained for each target consists of 47,983 and 38,833 conformations for the targets DB00381 and DB00876, respectively. In addition, target DB00381 resulted in 383 conformations, and DB00876 in 154.

Fig. 6.
figure 6

Maximum similarity solution for Target DB00381 when working with rigid molecules. Figures (a) and (b) represent the target and query compounds and their electrostatic fields. Figure (c) represents the optimal overlapping between the two compounds where \(Tc_E\) is maximum.

Fig. 7.
figure 7

Maximum similarity solution for Target DB00381 when working with flexible molecules. Figures (a) and (b) represent the more similar conformations of target and query compounds and their electrostatic fields. Figure (c) represents the optimal overlapping between the two compounds where \(Tc_E\) is maximum.

Tables 1 and 2 compare the main results obtained for the methodologies Rigid and Flexible for Targets DB00381 and DB00876. In particular, they show the 10 queries with the greatest similarity provided for both versions. More precisely, for Rigid methodology, we provide its name, Query, and the corresponding similarity value \(Tc_{E}\). For Flexible, we indicate the pair target-query that has obtained the best match, i.e. we identify those two molecules by also indicating their corresponding conformation number. This information is depicted in columns Target conformation and Query conformation. Finally, we show at column \(Tc_{E}\) the scoring function value obtained for each match. For the sake of comparison, we also indicate in column \(Rk_{R}\) the position that the Query conformation occupies in the list obtained by Rigid, and its corresponding scoring value in column \(Tc_E^R\).

The results show an improvement in the quality of the solutions. As can be seen in Table 1, the most similar compound found following the Flexible methodology (0.762) improves twice the value of the Rigid (0.377). Moreover, the most similar compounds in the rankings are different, i.e., DB00630 for the Rigid and DB09237 for the Flexible. Additionally, this result can be seen graphically in Figs. 6 and 7 where the molecules, their optimal overlapping, and electrostatic fields are represented using VIDA [13].

If the last two columns of Flexible are analyzed, it is clear that there is no good overlapping when rigid molecules are used. As seen in the \(Rk_R\) column, most of the top molecules in Flexible are below the position 1000th in the Rigid list.

Table 2. Top-10 most similar compounds in electrostatic to the target DB00876.
Fig. 8.
figure 8

Maximum similarity solution for Target DB00876 working with rigid molecules. Figures (a) and (b) represent the target and query compounds and their electrostatic fields. Figure (c) represents the optimal overlapping between the two compounds where \(Tc_E\) is maximum.

Fig. 9.
figure 9

Maximum similarity solution for Target DB00876 working with flexible molecules. Figures (a) and (b) represent the more similar conformations of target and query compounds and their electrostatic fields. Figure (c) represents the optimal overlapping between the two compounds where \(Tc_E\) is maximum.

Table 2 shows results along the same line as Table 1. In this study case, the top solution improves from 0.532 to 0.861 finding different compound as well. Figures 8 and 9 show graphically the most similar compounds for both methods.

The improvement obtained in the previous results is due to the solutions, including flexibility. However, as we previously stated, this increases the computational time considerably. However, the time has been ostensibly reduced thanks to the Flexible methodology employed in this work. The average time for an optimization process is 29 s for these target molecules in the current hardware. If we focus on the target DB00381 with a database of 38, 833 conformations from the initial 175 molecules, each molecule generated on average 221 conformations, although the maximum could be 500 for each one). If this value is extrapolated to the whole database, \((1751*221=) 386,971\) conformations could be generated. So, the time saved with the filter applied by discarding unpromising compounds is 116 days per target, and whether 500 conformations are generated, 264 computation days would be saved.

5 Conclusions and Future Work

In this work, we have improved the software OptiPharm by considering molecule flexibility. Apart from that, the new version includes several mechanisms to reduce the computational effort. In particular, we have reduced the number of optimization parameters and the range of freedom in some of them. Consequently, the search space decreases, and the number of function evaluations needed to find the optimal similarity drops. Besides, we have analyzed and applied descriptors to filter the initial database. We have used statistical metrics such as variance, correlation, or Euclidean distance.

The results have shown that the new OptiPharm can obtain solutions with higher scoring values than the original one, meaning that new query compounds with a high similarity value can be identified. These compounds are not detected when working with rigid molecules. In addition, the descriptor filter allows to drastically reduce the run time, saving for the study at hand up to 264 days.

Future work proposes implementing a conformation generation algorithm as an internal procedure of OptiPharm and including new scoring functions.