Introduction

The innovation process in the pharmaceutical industry is driven by the release of new drugs onto the market. This process often involves modifying the structure of a known drug in an attempt to produce a new drug that is as active as, or more active than, the original towards a target receptor [1]. First, pharmaceutical companies search for “hits”; a hit is a compound that shows activity towards a specific target under study. Hits are evaluated by performing biological and toxicological assays and structure–activity relationship (SAR) studies, after which a promising hit becomes a lead compound. Even though investment in the innovation process has increased, the availability of new drugs on the market has not followed suit. This is partly because the regulatory requirements associated with the approval of a new drug have also increased in order to prevent pharmacological accidents (such as the birth defects resulting from thalidomide use by pregnant women). Furthermore, the high cost of the biological assays and the other methodologies used is a limiting factor, especially for startups. This makes it critical to develop new approaches to identifying lead compounds [2].

The most common strategies used to identify active compounds are analog design and systematic screening [3–6]. Analog design is a strategy widely used by research groups. It involves synthesizing analogs of active compounds that are currently on the market; these analogs are known as “me-too compounds.” Since proposed me-too compounds have very similar structures to active compounds, they have a good chance of also being active towards the desired target. This biological activity can be improved by optimizing the compound. However, analog design is generally considered to produce only incremental innovation. Amoxicillin is a good example of a drug obtained using this approach: it shows improved bioavailability compared to penicillin, which permits it to be administered orally. Overall, the analog design approach is ligand-focused.

On the other hand, systematic screening involves searching for evidence that a particular molecule or set of molecules, which may be natural or synthetic, has/have significant biological activity [7]. This is an exhaustive and time-consuming pharmacological investigative process. It is repeated until a compound with biological activity is identified. Recently, this methodology was mechanized. Such a high-throughput screening (HTS) process can screen several thousand compounds simultaneously using 30–50 different biochemical assays. When a hit compound is found, it can be submitted to a lead optimization process in the hope that it can ultimately be used as a lead compound. This approach has a good rate of success. However, this methodology can only be implemented by pharmaceutical companies or consolidated research groups. Efavirenz, delavirdine, and nevirapine are examples of antiviral drugs obtained by HTS that inhibit the reverse transcriptase of HIV. In this approach, drug discovery is target-focused [8, 9].

Even though HTS helps to get new drugs on the market, this approach is limited by its high cost, which has motivated the development of new technologies based on high-throughput screening and combinatorial synthesis. Considering the large number of biological targets available in the Protein Data Bank (PDB) [10] and the diverse compound libraries, such as ZINC [11], that can be used to generate new drugs, structure-based virtual screening has shown itself to be a useful new strategy for identifying novel bioactive substances via molecular docking [12, 13]. Docking is an in silico method that is employed to identify hit compounds for a receptor of interest whose three-dimensional structure is known. Docking programs estimate the affinities of small molecules (ligands) for a molecular target by calculating their interaction energies with the target [14]. In addition, visualization software can be used to show the intermolecular interactions responsible for molecular recognition, such as those associated with the complex between the ligand and the receptor. As a result, docking can identify the most promising hits for biological assays and decrease the cost of drug development.

Docking approaches can be applied in two different contexts. First, a library of ligands normally obtained from analog design can be submitted to docking simulation to find the best hit for a specific molecular target. This approach is known as virtual screening (VS) [15]. In contrast, the pharmacological activity of a specific ligand such as a new natural compound can be searched for by performing docking simulations with a set of targets. This approach is called inverse virtual screening (IVS) [16, 17]. IVS, for instance, helped identify dorzolamide as a carbonic anhydrase inhibitor that could be used in cases of glaucoma [9].

Molecular docking programs were first used in the early 1990s. Back then, there were high expectations that this approach would support the development of new drugs. However, after a few years of use, the community noticed that the ranking functions used in these programs did not accurately predict the free energy of binding. In the last decade, these tools have made use of advances in technology and changes in docking methodologies [18]. These improvements in technology, including better processors and software, have permitted the implementation of IVS with a staggered docking methodology and a set of molecular targets. This approach is called virtual high-throughput screening (vHTS), and it simulates HTS experiments but is faster and more affordable [1, 7]. vHTS motivated the development of the Octopus software described in the present paper.

Octopus is an in-house automated workflow management tool that performs vHTS. It integrates MOPAC2016 [19], MGLTools [13], PyMOL [20], and AutoDock Vina [21] in order to perform molecular docking through a user-friendly interface. Unlike other platforms, such as Raccoon2 [22] and PyRx (http://pyrx.scripps.edu), Octopus can simulate the molecular docking of an unlimited number of ligands against an unlimited number of molecular targets. Further, neither Raccoon2 nor PyRx permits the refinement of ligands using MOPAC2016. In addition, Octopus includes a databank of 42 molecular targets (called the Our Own Molecular Targets Data Bank, OOMT [23]) against malaria, dengue, and cancer. These targets have been parameterized in the Protein Data Bank Partial Charge (Q) & Atom Type (T) (PDBQT) format [23]. OOMT was validated based on the root-mean-square deviation (RMSD) and the area under the ROC curve (AUC) using two different docking methodologies: AutoDock Vina and DOCK 6 [24].

Search algorithms in virtual screening software

Most of the software used for molecular docking can be categorized based on the analysis of ligand flexibility and the search process strategy used (including systematic searches or random searches or those based on simulation [12]).

In a systematic search or incremental algorithm, a set of values is determined for each degree of freedom. The goal is to apply a combinatorial method for all molecular degrees of freedom through incremental ligand construction at the receptor site. Thus, the algorithm probes for different conformations of the same molecule [25]. In incremental algorithms, the ligand is fragmented and one of its fragments is positioned at the binding site of the molecular target during docking. The fragments are successively added until the molecule is completely rebuilt. Conformational ensembles comprise tools that use a molecular motion database which stores a set of conformations for each molecule and submits it to the docking process. During docking, each conformation is considered static. These approaches are effective at exploring the conformational space, but they can converge to a local minimum rather than the global minimum [21].
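To make the combinatorial nature of a systematic search concrete, the following minimal sketch enumerates every combination of torsion angles on a regular grid and keeps the lowest-energy one. The energy function, the 60° step, and all names are illustrative assumptions; no docking program discussed here is claimed to work exactly this way.

```python
import itertools
import math


def toy_energy(torsions_deg):
    """Hypothetical stand-in for a docking score (lower is better).

    Each torsion prefers a staggered angle: minima at 60, 180, and 300 degrees.
    """
    return sum(1.0 + math.cos(math.radians(3 * t)) for t in torsions_deg)


def systematic_search(n_torsions, step_deg=60):
    """Evaluate every combination of torsion values on a regular grid."""
    grid = range(0, 360, step_deg)
    best = None
    for combo in itertools.product(grid, repeat=n_torsions):
        energy = toy_energy(combo)
        if best is None or energy < best[1]:
            best = (combo, energy)
    return best


best_combo, best_energy = systematic_search(n_torsions=3, step_deg=60)
# The number of evaluations grows as (360 / step) ** n_torsions
n_evaluated = (360 // 60) ** 3
```

With three rotatable bonds and a 60° step the search already needs 216 evaluations, which illustrates why incremental construction and conformational ensembles are used to limit the number of combinations.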

In contrast, deterministic algorithms do not rely on random data. Thus, the result is predetermined by the input data. The simulation methods of molecular dynamics and energy minimization are examples of the deterministic search algorithms used in docking. These methods have a high computational cost. In molecular dynamics simulations, atoms and molecules interact for a predetermined time, and we observe if they continue to interact after a particular time has elapsed or if the interaction is lost. Energy minimization algorithms apply an energy minimization strategy to an initial conformation of a molecule to find its minimum-energy conformation during the docking process.
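As a rough illustration of a deterministic search, the sketch below applies steepest-descent energy minimization to a one-dimensional toy energy surface with two minima. The energy function, learning rate, and starting points are assumptions chosen for illustration only; real docking programs minimize over many coordinates.

```python
def toy_energy(x):
    """Hypothetical double-well energy with a global minimum near x = -1.04."""
    return (x**2 - 1.0) ** 2 + 0.3 * x


def toy_gradient(x):
    """Analytical derivative of toy_energy."""
    return 4.0 * x * (x**2 - 1.0) + 0.3


def minimize(start, learning_rate=0.01, tol=1e-8, max_iter=10000):
    """Steepest descent: no random numbers, so the result depends only on the input."""
    x = start
    for _ in range(max_iter):
        g = toy_gradient(x)
        if abs(g) < tol:
            break
        x -= learning_rate * g
    return x


x_right = minimize(2.0)   # ends in the nearest minimum, which is only local
x_left = minimize(-2.0)   # ends in the global minimum
```

The same starting conformation always yields the same final conformation, and the minimizer stops in the nearest minimum, which is why the choice of the initial conformation matters for these methods.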

Random strategies used in molecular docking include genetic algorithms and Monte Carlo methods. Genetic algorithms are implemented as a computer simulation where a population of abstract representations is mutated to search for better solutions. Each individual represents a possible solution to the problem. For each new generation, the adaptation of each solution is evaluated. Thus, some individuals are selected for the next generation, and they are recombined or mutated to generate new individuals. This process is repeated to find better solutions until a stopping criterion is met. A Monte Carlo method uses a statistical methodology based on a large set of random samples to get results that approximate reality [26]. Thus, Monte Carlo methods perform a sufficiently high number of successive simulations to allow probabilities to be calculated heuristically. When used as docking methods, Monte Carlo methods randomly generate an initial conformation of the ligand and calculate its binding energy. Based on this initial conformation, a new configuration is generated. If the binding energy for the new configuration is less (i.e., more negative) than that for the initial conformation, then it is automatically accepted as the reference for the next iteration. Otherwise, another evaluation is performed to check whether it should be used as the reference. This process is repeated until the desired number of iterations is reached.
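The Monte Carlo procedure described above can be sketched as follows. The text does not say which test is applied to configurations with a higher energy; this sketch assumes the commonly used Metropolis criterion, and the one-dimensional energy function, step size, and kT value are illustrative assumptions.

```python
import math
import random


def toy_energy(x):
    """Hypothetical one-dimensional 'pose' energy with a local and a global minimum."""
    return (x**2 - 1.0) ** 2 + 0.3 * x


def metropolis_accept(delta_e, kt, rng):
    """Accept downhill moves always; uphill moves with probability exp(-dE/kT)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / kt)


def monte_carlo_search(start, n_steps, step_size=0.5, kt=0.6, seed=0):
    """Random-walk search that remembers the best configuration visited."""
    rng = random.Random(seed)
    current, current_e = start, toy_energy(start)
    best, best_e = current, current_e
    for _ in range(n_steps):
        trial = current + rng.uniform(-step_size, step_size)
        trial_e = toy_energy(trial)
        if metropolis_accept(trial_e - current_e, kt, rng):
            current, current_e = trial, trial_e
            if current_e < best_e:
                best, best_e = current, current_e
    return best, best_e


best_x, best_e = monte_carlo_search(start=2.0, n_steps=5000)
```

Because uphill moves are occasionally accepted, the walk can leave a local minimum, which a purely downhill search cannot do.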

In general, however, most virtual screening software packages utilize a combination of these approaches. Table 1 summarizes the strategies used for selected docking tools.

Table 1 Search algorithms used in docking software (adapted from [12] and [25])

Scoring functions in virtual screening software

Docking software packages use scoring functions to estimate the strength of noncovalent interactions between a ligand and a molecular target via mathematical methods [52]. Scoring functions are one of the most important elements of structure-based drug design. However, despite their widespread use, estimating the strength of interaction between a ligand and a molecular target remains a major challenge in docking methods.

Scoring functions have three main applications in molecular docking. The first is the determination of the binding site and the binding conformation for a molecular target and a ligand. The second is the prediction of the binding affinity between a protein and a ligand. Finally, they can also be used to identify potential drugs for a given protein from large databases.

There are three types of scoring function [12, 25, 52, 53]: force-field, empirical, and knowledge-based. Force-field (FF) scoring functions are calculated based on the intermolecular interactions between the atoms of the ligand and those of the molecular target, such as van der Waals, electrostatic, and bond stretching/bending/torsional forces. FF scoring functions are usually based on experimental data and follow the principles of quantum physics [12]. However, these methods do not consider the solvent in their calculations. They also lack a physical model that describes entropic contributions, which leads to imprecision in the results generated by the scoring function.
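A force-field score of the kind described above can be sketched as a sum of pairwise van der Waals and electrostatic terms. The sketch assumes a 12–6 Lennard-Jones form, a single pair of Lennard-Jones parameters for all atom pairs, and a constant dielectric; these simplifications, the parameter values, and the function names are illustrative and do not reproduce any specific docking program.

```python
import math

COULOMB_CONSTANT = 332.06  # kcal*Angstrom/(mol*e^2)


def pair_energy(r, epsilon, sigma, q1, q2, dielectric=4.0):
    """Lennard-Jones plus Coulomb energy (kcal/mol) for two atoms r Angstrom apart."""
    lj = 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)
    coulomb = COULOMB_CONSTANT * q1 * q2 / (dielectric * r)
    return lj + coulomb


def force_field_score(ligand_atoms, target_atoms, epsilon=0.15, sigma=3.5):
    """Sum the pair energies over all ligand-target atom pairs.

    Each atom is a tuple (x, y, z, partial_charge).
    """
    total = 0.0
    for lx, ly, lz, lq in ligand_atoms:
        for tx, ty, tz, tq in target_atoms:
            r = math.dist((lx, ly, lz), (tx, ty, tz))
            total += pair_energy(r, epsilon, sigma, lq, tq)
    return total


# Two neutral atoms at the Lennard-Jones minimum distance, r = 2**(1/6) * sigma
r_min = 2 ** (1 / 6) * 3.5
neutral = force_field_score([(0.0, 0.0, 0.0, 0.0)], [(r_min, 0.0, 0.0, 0.0)])
# The same geometry with opposite partial charges is more favorable
charged = force_field_score([(0.0, 0.0, 0.0, 0.4)], [(r_min, 0.0, 0.0, -0.4)])
```

Note that nothing in this sum accounts for the solvent or for entropy, which is the limitation pointed out in the text.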

Empirical scoring functions estimate the binding free energy as a weighted sum of structural parameters, with the weights adjusted to reproduce the experimentally determined binding constants of a set of complexes [53]. This requires a training dataset of protein–ligand complexes with known affinities [12]. Linear regression is then performed on this dataset [52], and the resulting constants, known as weights, are used as coefficients for the terms of the equation. Each term of the function describes a type of physical event involved in the formation of the ligand–receptor complex. Thus, hydrogen bonding as well as ionic, nonpolar, desolvation, and entropic effects are all considered.
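The fitting step of an empirical scoring function can be illustrated with a two-term model solved by ordinary least squares. The descriptors (hydrogen-bond count and nonpolar contact area), the synthetic training values, and the "true" weights used to generate them are all assumptions made for illustration; they are not experimental data.

```python
def fit_two_weights(x1, x2, y):
    """Least-squares fit of y ~ w1*x1 + w2*x2 via the 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("descriptors are collinear; weights are not unique")
    w1 = (t1 * s22 - t2 * s12) / det
    w2 = (s11 * t2 - s12 * t1) / det
    return w1, w2


def empirical_score(n_hbonds, nonpolar_area, weights):
    """Weighted sum of the structural terms (kcal/mol)."""
    w_hb, w_np = weights
    return w_hb * n_hbonds + w_np * nonpolar_area


# Synthetic training set generated from assumed weights (not measured affinities)
hbonds = [1, 2, 3, 0, 4]
areas = [100.0, 150.0, 120.0, 200.0, 80.0]
affinities = [-1.2 * h - 0.03 * a for h, a in zip(hbonds, areas)]

weights = fit_two_weights(hbonds, areas, affinities)
predicted = empirical_score(2, 110.0, weights)
```

Real empirical functions use more terms and many more complexes, but the principle is the same: the weights are whatever values best reproduce the known affinities of the training set.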

In knowledge-based scoring functions, the binding affinity is calculated using the sum of the interactions between the ligand atoms and target atoms [53]. These functions obtain statistical data (i.e., the frequencies of specific intermolecular ligand–receptor interactions) from large databases (such as the Protein Data Bank). For example, if a hydrogen bond is present in 90% of the relevant cases, this bond is weighted more heavily in the scoring equation. They use pairwise energy potentials extracted from known target–ligand complexes to obtain a generic scoring function, and generally assume that interactions between atoms or functional groups that are observed more frequently are more likely to contribute favorably to the binding affinity. The final score is given as a sum of the scores of all individual interactions. Table 2 summarizes the types of scoring functions used in various docking tools.
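One common way to turn observed interaction frequencies into pairwise potentials is the inverse Boltzmann relation, sketched below. The text does not state which relation a given program uses, so this choice, the distance bins, and the counts are assumptions for illustration only.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol*K)
KT = GAS_CONSTANT * 298.15  # thermal energy at room temperature


def pair_potential(observed_counts, reference_counts):
    """Inverse Boltzmann potential per distance bin: -kT * ln(p_obs / p_ref).

    All counts must be positive; bins seen more often than in the reference
    state receive a negative (favorable) energy.
    """
    obs_total = sum(observed_counts)
    ref_total = sum(reference_counts)
    potential = []
    for obs, ref in zip(observed_counts, reference_counts):
        p_obs = obs / obs_total
        p_ref = ref / ref_total
        potential.append(-KT * math.log(p_obs / p_ref))
    return potential


def knowledge_based_score(pair_bins, potential):
    """Sum the potential over the distance bin of every ligand-target pair."""
    return sum(potential[b] for b in pair_bins)


# Hypothetical counts for one atom-pair type in two distance bins
potential = pair_potential(observed_counts=[90, 10], reference_counts=[50, 50])
score = knowledge_based_score([0, 0, 1], potential)
```

Here the contact found in 90% of the hypothetical cases receives a favorable energy, and the final score is the sum over all individual pair contributions.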

Table 2 Scoring functions used in docking software (adapted from [12, 25, 52, 53])

Octopus

Octopus is software for virtual high-throughput screening (vHTS) developed in Shell Script, Python, HTML, and CSS. It offers fast and user-friendly docking simulations. It integrates MOPAC2016 [54], PyMOL [20], MGLTools [13], and AutoDock Vina [21] via an interface that is intuitive and self-assessing (i.e., Octopus takes the output of MOPAC and prepares it automatically for use as input to other programs).

In general, docking software is suitable for carrying out a simulation of one ligand docking into a specific molecular target. However, Octopus can automatically perform virtual high-throughput screening (vHTS) of N ligands docking into M molecular targets, i.e., it can perform simulations of an unlimited number of compounds docking into a set of molecular targets.
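The N ligands by M targets idea amounts to a loop over every ligand–target pair. The sketch below is not the Octopus implementation; the file names are hypothetical, and the docking step is passed in as a function so that the example runs without any docking program installed.

```python
from itertools import product


def screen(ligands, targets, dock):
    """Run dock(ligand, target) for every pair and collect the binding energies."""
    return {(lig, tgt): dock(lig, tgt) for lig, tgt in product(ligands, targets)}


def fake_dock(ligand, target):
    """Placeholder for a real docking call; returns a made-up energy in kcal/mol."""
    return -float(len(ligand) + len(target)) / 10.0


ligands = ["lig_a.pdbqt", "lig_b.pdbqt", "lig_c.pdbqt"]   # N = 3
targets = ["target_1.pdbqt", "target_2.pdbqt"]            # M = 2
results = screen(ligands, targets, fake_dock)
```

The number of docking runs is N × M, which is why removing manual steps from each run matters as the ligand and target sets grow.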

The main advantages of Octopus relative to MOPAC2016, PyMOL, MGLTools, and AutoDock Vina are its automation, ease of use, speed, and error reduction. If Octopus is not used, each of the four programs mentioned must be managed by a human operator. Also, the output of one program must be used as input for the next program, which often requires user action, introducing delays into the screening process and the possibility of user-generated errors. Therefore, there is also a need to check for human error at each step that requires user action. Also, the steps necessitating user action must be performed for each ligand–target combination. However, in Octopus, as soon as one of the programs is completed, the next is executed automatically without user intervention. Consequently, Octopus reduces the possibility of user-generated error because it reduces human interactions.

In the Octopus protocol, MOPAC2016 refines the ligands, PyMOL visualizes the ligand geometry, MGLTools determines the rotatable bonds and assigns net atomic Gasteiger–Marsili charges, and AutoDock Vina performs the molecular docking. Finally, the results are compiled and presented as binding energies for ligand–receptor combinations (Fig. 1). The protocol used by Octopus is summarized in the “Methods” section.

Fig. 1 Octopus vHTS results in HTML format. The yellow row shows reference values obtained from the redocking of crystallographic ligands. All values are in kcal/mol
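A results table like the one in Fig. 1 can be assembled by reading the best binding energy from each AutoDock Vina output file. The sketch below relies on the `REMARK VINA RESULT:` lines that Vina writes for each pose; the ligand and target names and the output fragment are hypothetical, and this is not the code used by Octopus.

```python
import csv
import io


def best_vina_energy(pdbqt_text):
    """Return the first (best-ranked) binding energy in kcal/mol, or None."""
    for line in pdbqt_text.splitlines():
        if line.startswith("REMARK VINA RESULT:"):
            return float(line.split()[3])
    return None


def results_to_csv(energies, ligands, targets):
    """Write one row per ligand and one column per target."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["ligand"] + list(targets))
    for ligand in ligands:
        writer.writerow([ligand] + [energies.get((ligand, t), "") for t in targets])
    return buffer.getvalue()


# Hypothetical output fragment, for illustration only
example_output = (
    "MODEL 1\n"
    "REMARK VINA RESULT:    -8.6      0.000      0.000\n"
    "ENDMDL\n"
    "MODEL 2\n"
    "REMARK VINA RESULT:    -8.1      1.532      2.874\n"
    "ENDMDL\n"
)
energies = {("ligand_A", "target_1"): best_vina_energy(example_output)}
table = results_to_csv(energies, ["ligand_A"], ["target_1", "target_2"])
```

Pairs that have not been docked yet are left as empty cells, so a partially finished screen can still be tabulated.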

Methods

In this section, we describe the steps performed in the Octopus protocol:

  1. First, directories of ligands and targets are chosen (all the ligands and all the targets must be placed in separate directories). For instance, the ligand directory could be derived from the ZINC platform, as shown in Fig. 2.

    Fig. 2 Library of ligands obtained from the ZINC platform

  2. When choosing the target directory (Fig. 3), a previously parameterized molecular target databank called Our Own Molecular Targets (OOMT) [23] that is included in Octopus can be utilized. The OOMT databank comprises various receptors from the Protein Data Bank (PDB), and it includes specific targets for cancer, dengue, and malaria. The main objective of the OOMT databank is to facilitate virtual screening studies using molecular docking at specific molecular targets. Appropriate biological assays can then be performed based on the results of the molecular docking. The OOMT databank has a configuration file with X, Y, and Z coordinates, and a grid box size delimiting the region for molecular docking simulations and the reference binding energy according to the crystallographic ligand.

    Fig. 3 Selection of molecular targets from the OOMT

  3. As mentioned before, the 3D structures of ligands can be obtained from ZINC [11]. If the ligands come from a known public database, then we can proceed to step 4. Otherwise, if the ligand has been generated using the MarvinSketch program [56] or another application, then Octopus will carry out ligand refinement using the run_MOPAC software developed in Python. This software reads the net atomic charges of all the ligands stored in PDB format in the ligand folder. Next, all of the ligands are refined by the semi-empirical parametric method 7 (PM7) [55] implemented in MOPAC2016 using the eigenvector-following (EF) minimum-search routine [19]. The user is asked to check how many alpha and beta electrons are present in each molecular orbital after energy minimization of the ligands. This reduces the likelihood of accepting incorrect structures (i.e., free radicals) for subsequent calculations. This process can also be applied to structures from the ZINC databank that are available only in SMILES format, after converting them to PDB format with the Babel software using the keyword gen3d. The automated workflow of run_MOPAC is shown in Fig. 4.

    Fig. 4 The automated workflow of run_MOPAC

  4. In this step, ligands are converted from PDB to PDBQT format. During the conversion, the rotatable bonds and the Gasteiger–Marsili net atomic charges [56] are assigned, and only the hydrogens on polar atoms (oxygen and nitrogen) are retained; the other hydrogens are removed [13].

  5. Visual inspection of the geometries of the ligands is then performed using PyMOL [20].

  6. In this step, the ligands in PDBQT format are submitted to molecular docking by AutoDock Vina [21], which executes until all of the ligands have been docked into the targets. Configuration files follow the AutoDock Vina protocol, with exhaustiveness set to 24 [57].

  7. Finally, the binding energy results for each molecular target are generated in CSV or HTML format. This makes it simple for the user to determine whether the ligand is capable of interacting with a specific molecular target. Figure 1 shows an example of the results obtained by Octopus in HTML. First, complementary information about the experiments (number of ligands, number of targets, date and time of experiment) is shown. The default crystallographic values for the binding energies between the ligands and targets are also displayed.
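The conversion and docking steps above can be sketched as a configuration file plus a list of command lines. The configuration keys are standard AutoDock Vina options and the exhaustiveness of 24 comes from step 6; the file names, box values, and the exact `obabel` and `prepare_ligand4.py` invocations are assumptions, since the text does not list the commands that Octopus issues. The commands are only built here, not executed.

```python
def vina_config(receptor, ligand, out, center, size, exhaustiveness=24):
    """Build the text of an AutoDock Vina configuration file."""
    cx, cy, cz = center
    sx, sy, sz = size
    lines = [
        f"receptor = {receptor}",
        f"ligand = {ligand}",
        f"out = {out}",
        f"center_x = {cx}",
        f"center_y = {cy}",
        f"center_z = {cz}",
        f"size_x = {sx}",
        f"size_y = {sy}",
        f"size_z = {sz}",
        f"exhaustiveness = {exhaustiveness}",
    ]
    return "\n".join(lines) + "\n"


def pipeline_commands(ligand_stem, config_path):
    """Command lines for SMILES-to-3D conversion, PDBQT preparation, and docking."""
    return [
        # Open Babel: generate 3D coordinates from a SMILES file (assumed invocation)
        ["obabel", f"{ligand_stem}.smi", "-O", f"{ligand_stem}.pdb", "--gen3d"],
        # MGLTools: assign charges and rotatable bonds, write PDBQT (assumed invocation)
        ["prepare_ligand4.py", "-l", f"{ligand_stem}.pdb", "-o", f"{ligand_stem}.pdbqt"],
        # AutoDock Vina: dock using the configuration file
        ["vina", "--config", config_path],
    ]


# Hypothetical target, ligand, and grid box values
config_text = vina_config(
    receptor="target_1.pdbqt",
    ligand="ligand_A.pdbqt",
    out="ligand_A_target_1_out.pdbqt",
    center=(10.0, 12.5, -3.0),
    size=(20, 20, 20),
)
commands = pipeline_commands("ligand_A", "conf.txt")
```

Each command list could be passed to `subprocess.run` once the corresponding program is installed; keeping the box center and size in one place per target mirrors the role of the OOMT configuration file described in step 2.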

In addition, the entire process can be repeated while storing the previous results. A summary of the Octopus algorithm is presented as a six-step workflow in Fig. 5.

Fig. 5 Automated workflow of Octopus

The interface of Octopus

Octopus has a user-friendly interface. Figure 6 shows the start interface of Octopus. Five selection options are available: (1) inverse virtual screening without run_MOPAC; (2) inverse virtual screening with run_MOPAC; (3) run_MOPAC; (4) tutorials; and (5) install software.

Fig. 6 User-friendly interface of Octopus: the main menu

Inverse virtual screening without run_MOPAC must be used when the PDB file is downloaded from a public databank. Steps 1, 3, 4, 5, and 6 of Octopus are performed in this protocol (Fig. 5). Inverse virtual screening with run_MOPAC must be used when the PDB file is generated with the MarvinSketch program. In this case, all six steps of Octopus presented in Figure 5 are performed (run_MOPAC refines a set of ligands when they are generated by the user using a tool such as MarvinSketch). Tutorials on manual installation and the use of all applications associated with Octopus are available. Install software is used to install other applications available in Octopus.

Octopus can perform IVS in automatic or manual mode. In automatic mode, the entire experiment is performed without user intervention after choosing the ligand and molecular target directories. PyMOL is not executed in this case. In manual mode, user intervention is required after every step shown in Figure 5. To test the Octopus process, we used it to perform two case studies examining the metalloprotease activities of (phenylamino)urea derivatives and the antimalarial activity of a pyrazole derivative [58] (see the next section).

The docking approach is limited by its treatment of flexibility: the receptor is generally considered to be rigid, and fixed bond angles and lengths are generally assumed for the ligands. Consequently, improper results can be obtained for molecular targets that bind ligands through an induced-fit mechanism. This issue can be addressed by using an ensemble of protein structures or flexible docking [22]. These tools are complementary to docking methods as they reduce computational costs. Even though Octopus uses rigid receptors from the OOMT databank, all molecular targets are evaluated based on RMSD and AUC values to gauge the accuracy that can be achieved. In addition, explicit water molecules that participate in two hydrogen bonds were retained in the docking simulation, whereas the other water molecules in the molecular targets were removed [24]. Docking using receptors with flexible side chains will be considered in subsequent versions of the program.

Results

This section discusses two successful applications of Octopus. In the first case study, the IVS process was applied to determine the metalloprotease activities of (phenylamino)urea derivatives. In the second (which has been reported previously), the process was applied to check whether a particular pyrazole derivative possesses antimalarial activity.

Successful case study 1: metalloprotease activities of (phenylamino)urea derivatives

A set of 22 (phenylamino)urea derivatives (“LSO&ME” compounds) was submitted to Octopus. Docking results from the IVS approach suggested that, among the 40 molecular targets studied, the metalloproteinases were feasible targets. The matrix metalloproteinases (MMPs) are zinc-dependent enzymes that have collagen (present in the extracellular matrix) as one of their substrates. They participate in the tissue remodeling process. Moreover, they are involved in tumor metastasis because they are overexpressed in some types of tumors. The IVS methodology showed that binding energies with the metalloproteinase with PDB code 1GKC ranged from −8.0 kcal/mol to −9.5 kcal/mol [59]. The corresponding crystallographic binding energy was −6.6 kcal/mol. 1GKC recognized LSO&ME007 through interactions including hydrogen bonds, van der Waals interactions, and intermolecular π-stacking. This molecular target is a metalloprotease involved in cancer pathology; it was evaluated previously based on the RMSD and the ROC curve [24], yielding values of 0.55 Å and 0.60, respectively. RMSD values of <2.0 Å imply good pose fidelity [21], while AUC values of >0.5 indicate that the methodology distinguishes true-positive from false-positive compounds better than random selection. Even so, docking predictions for this system should be confirmed by a corresponding experimental study.
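The two validation metrics mentioned above can be computed as in the following sketch. It assumes that the atoms of the docked and crystallographic poses are listed in the same order and that a lower (more negative) score means a better-ranked compound; the coordinates and scores are made-up values, not data from the OOMT validation.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation (Angstrom) between two matched atom lists."""
    if len(coords_a) != len(coords_b):
        raise ValueError("poses must contain the same number of atoms")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


def roc_auc(active_scores, decoy_scores):
    """Probability that a random active scores better (lower) than a random decoy.

    Ties count as one half; 0.5 corresponds to random ranking, 1.0 to perfect ranking.
    """
    wins = 0.0
    for active in active_scores:
        for decoy in decoy_scores:
            if active < decoy:
                wins += 1.0
            elif active == decoy:
                wins += 0.5
    return wins / (len(active_scores) * len(decoy_scores))


# Hypothetical poses: the docked pose is shifted by 1 Angstrom along y
crystal_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
docked_pose = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
pose_rmsd = rmsd(crystal_pose, docked_pose)

# Hypothetical docking scores in kcal/mol
perfect_auc = roc_auc([-9.0, -8.0], [-6.0, -5.0])
random_like_auc = roc_auc([-9.0, -5.0], [-8.0, -6.0])
```

On this reading, the reported RMSD of 0.55 Å indicates that redocking reproduces the crystallographic pose closely, while an AUC of 0.60 is only modestly above the random value of 0.5.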

Figure 7 summarizes the intermolecular interactions between 1GKC and two ligands in the form of 2D diagrams. L-Valinamide (Fig. 7a) and LE&007 (Fig. 7b) present similar molecular interactions in terms of van der Waals and hydrogen bonds, although LE&007 shows additional intermolecular interactions, such as π–π stacking and T-shaped stacking. In addition, the interaction (at a distance of 2.39 Å) between the zinc atom of 1GKC and the lone pairs of the carbonyl moiety of LE&007 is highlighted in the figure. These additional molecular interactions with LE&007 help to explain the binding energies of the (phenylamino)urea derivatives with the metalloproteinases. The compounds of interest were studied in a biological assay, and LE&007 was found to inhibit 80% of the enzymatic activity of the metalloproteinase 1GKC.

Fig. 7 2D diagrams of the binding between 1GKC and two ligands, as visualized using Discovery Studio Visualizer 4.5 [60]: (a) the crystallographic ligand L-valinamide; (b) LE&007

Following the IVS experiments, the effects of the (phenylamino)urea derivatives (LSO&ME compounds) on the proteolytic activities of MMP gelatinases were measured by gelatin zymography performed according to a previous report [61]. The samples were dissolved in dimethyl sulfoxide (DMSO) at 6 mg/mL, and 10 μL were applied to a well of gel containing the substrate-rich MMPs: saliva (20 U of protein) in sample buffer (SDS 2.5 wt% and saccharose 1 wt%). The same quantity of saliva was used as the standard, and this represented 100% of the active enzymes. Electrophoresis (PROTEAN II, Bio-Rad, Hercules, CA, USA) was conducted under reducing conditions (0.025 M Tris, 0.192 M glycine, and 0.1% SDS, pH 8.5) at 70 V and 4 °C for 3.5 h.

After electrophoresis, the gels were washed for 1 h with Triton X-100 (2.5 g%) to remove the SDS, and then submerged (with stirring) in an activation buffer (Tris–HCl 0.05 M, CaCl2 0.6 g%, pH 8.0) for 16 h at room temperature. Next, the gels were stained (0.25% Coomassie Blue R-250, methanol 45%, and acetic acid 10%) for 1 h and then bleached (using 30% ethanol/10% acetic acid) for another hour.

The compounds LSO&ME005, LSO&ME004, and LSO&ME007 suppressed the activity of MMP-9 by approximately 80%, and partial inhibition was observed when LSO&ME004 was applied. LSO&ME005 and LSO&ME028 suppressed the activity of MMP-2 by approximately 55%. Inhibition of gelatinase activity was measured by comparing the decrease in the amount of undigested bound substrate in solutions containing MMPs and the LSO&ME compounds with the decrease in the amount of undigested bound substrate observed in solutions of MMPs that did not contain the LSO&ME compounds.

Successful case study 2: antimalarial activity of a pyrazole derivative

Several reports have shown that pyrazole derivatives possess biological activities (e.g., [62]). Hence, our group performed IVS of the pyrazole derivative Tx001. Octopus showed that this compound can complex with a model of Plasmodium falciparum ATP6 (PfATP6) [63], with a binding energy of −8.6 kcal/mol (as compared to −7.7 kcal/mol for thapsigargin (TG), a natural compound that is an inhibitor of PfATP6) calculated for docking into the hydrophobic cavity of this model. The complex Tx001–PfATP6 was then evaluated by molecular simulation utilizing an implicit solvent model, and the system was observed to reach equilibrium in 30 ns. The potential energy of the system decreased during the simulation to approximately −5500 kcal/mol. The main ligand–PfATP6 interactions were van der Waals, electrostatic, and hydrogen bonding between the guanidinium moiety of Tx001 and Ile752 of PfATP6. Finally, Tx001 was evaluated for antimalarial activity, and it presented an inhibitory concentration (IC50) of 8.2 μM. Although this activity is weaker than that of chloroquine (IC50 = 0.38 μM), a widely used antimalarial drug, the result motivated us to optimize this ligand. Second-generation derivatives of Tx001 are currently being evaluated [58].

Conclusions

Drug development is a difficult task for small academic groups. Thus, applying a theoretical approach can increase the “hit” rate, and these hits have the potential to become lead compounds for new therapies. This motivated us to develop Octopus as a tool for the vHTS of multiple compounds against a set of molecular targets. It can also reduce the number of biological assays needed to determine a pharmacological mechanism. It is limited principally by the time to draw the structures of the ligands as well as the choice of desired targets. The entire Octopus protocol can run automatically, although computational chemists are still needed to visually inspect the intermolecular interactions.

In this manuscript, we also showed two successful examples of the application of Octopus to find molecular targets. Octopus identified a new hit compound, LE&007, that can be optimized to generate a new lead compound for antineoplastic drugs, and it was also used to determine the antimalarial activity of the pyrazole derivative Tx001. Neither LE&007 nor Tx001 was originally identified as a lead candidate for these diseases. Thus, Octopus provides a second chance to find a use for these compounds as lead compounds.

Finally, Octopus provides a user-friendly Linux-based interface for MOPAC2016, PyMOL, and AutoDock Vina. Work to enhance Octopus by adding a new molecular dynamics simulation code is also in progress. Octopus can be obtained from www.drugdiscovery.com.br