1 Introduction

1.1 Translational Bioinformatics

Translational bioinformatics extends conventional in silico science, which deals with the storage, analysis, and extraction of knowledge from voluminous genomic, proteomic, sequence, and structural data. It covers research into novel techniques for integrating clinical and biological data, which serve as input to purpose-built algorithms, and it includes the methodology for transforming biological observations into knowledge that benefits scientists, clinicians, and patients, as this chapter shows. The complicated biological network mechanisms of disease and the structures of the molecules involved pose several experimental challenges for drug discovery. Further complications arise from the independent operation of the different parties involved in drug development, with little interaction between clinical practitioners, academic institutions, and the pharmaceutical industry (Portela and Soares-da-Silva 2015). Specifically, drug development research is purpose specific and is performed by highly specialized scientists in their respective fields, with few inputs from clinicians and medical practitioners when strategies for future therapies are designed (Portela and Soares-da-Silva 2015). Translational research is a road map in which experimental discoveries are linked with computational techniques to deliver novel therapies that meet clinical needs. Theoretical and computational techniques offer valuable insight into experimental discoveries and into pharmacological and pathophysiological mechanisms, and they support the virtual development of new prospects for designing and synthesizing novel and better molecular entities in a time- and cost-effective way (Raza 2006).

2 Supporting Resources

2.1 Online Database

Sequence databases such as those of NCBI, EMBL, and UniProt contribute enormously to disease research, diagnosis, and the drug development industry. Structure databases such as the Protein Data Bank hold structures determined by X-ray crystallography, NMR, and hybrid methods and play a key role in structural bioinformatics (Berman 2008). SCOP (Hubbard et al. 1999) and CATH (Orengo et al. 1997) classify structures on the basis of structural and domain features, whereas PDBsum provides a graphical overview of each deposited 3D structure (Laskowski et al. 1997).

A database that covers reactions and kinetics among genes, proteins, enzymes, and chemical components, together with their signaling activity, is known as a metabolic pathway database. MetaCyc (http://metacyc.org) holds experimentally identified biochemical pathways that can be used as a reference data set for metabolism design and analysis (Zhang et al. 2005). KEGG (http://www.genome.jp/kegg/) is a database for understanding complex functions of biological systems such as the cell, the organism, and the ecosystem by combining genomic and molecular information. KEGG provides a computational representation of the biological system as a wiring diagram (system information) built from the molecular building blocks of genes and proteins (genomic information) and chemical substances (chemical information) (Kanehisa et al. 2002). The BioCyc collection contains organism-specific pathway/genome databases (PGDBs), which provide a reference to the genomes and metabolic pathways of a few thousand organisms (Caspi et al. 2011). BRENDA (BRaunschweig ENzyme DAtabase) is an enzyme database established in 1987 at the Helmholtz Centre for Infection Research, formerly known as the German National Research Centre for Biotechnology, and is currently maintained by the Department of Bioinformatics and Biochemistry at TU Braunschweig. BRENDA holds enzyme-specific data classified by biochemical reaction (Scheer et al. 2011). Other available databases include Panther (Thomas et al. 2003), Reactome (Croft et al. 2010), HumanCyc (Miles et al. 2010), and MINT (Licata et al. 2012).

2.2 Small Chemical Structure Database

Freely accessible online chemical databases help the scientific community identify previously reported experimental and nonexperimental chemical entities that can be tested further for similar or different therapeutic applications. Publicly available small chemical structure databases such as PubChem (Bolton et al. 2008), DrugBank (Wishart et al. 2006), ZINC (Irwin and Shoichet 2005), and eMolecules (https://www.emolecules.com/), listed in Table 2.1, regularly share their information. Thousands of structures are deposited in these public databases every year, and millions of compounds have been tested for known or unknown activities (http://depth-first.com/articles/2011/10/12/sixty-four-free-chemistry-databases/).

Table 2.1 List of chemical structure database

3 Chemical Data Mining Strategies

Fast, exhaustive algorithms are used to identify structurally similar compounds. Methods such as structural similarity searching and clustering of small molecules play an important role in screening for compounds that share an identical or common scaffold in drug discovery pipelines. The ability to search, analyze, and assemble diverse compounds from public databases is critical to making full use of existing resources. However, most of the software in this area is available only commercially, and open-source tools with good accuracy and precision are in high demand. The long-term goal of the ChemmineR project is to narrow this resource gap by providing free access to a flexible and expandable open-source framework for the analysis of small molecule data from chemical genomics, agrochemical, and drug discovery screens (Cao et al. 2008). Based on screening data from the PubChem BioAssay database, Pouliot et al. combined reported adverse event data with experimental molecular data and generated a logistic regression model to correlate and predict post-marketing adverse drug reactions (ADRs) (Shah and Tenenbaum 2012; Pouliot et al. 2011). In a similar way, an existing data mining algorithm was enhanced with molecular fingerprints, which encode structural features or functional groups, to strengthen the adverse drug event (ADE) signals generated from adverse event reports (Shah and Tenenbaum 2012; Vilar et al. 2011).
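Structural similarity searching over fingerprints can be illustrated with a small sketch. The Tanimoto coefficient used here is a common similarity measure for fingerprints, but it is an assumption of this example and is not named in the studies cited above; the fingerprints, compound names, and the function `tanimoto` are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints.

    Each fingerprint is a set holding the indices of its "on" bits.
    """
    if not fp_a and not fp_b:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical fingerprints (not real compounds)
query = {1, 4, 7, 9}
library = {
    "cmpd_A": {1, 4, 7, 8},
    "cmpd_B": {2, 3, 5},
    "cmpd_C": {1, 4, 7, 9},
}

# Rank library compounds from most to least similar to the query
ranked = sorted(library, key=lambda name: tanimoto(query, library[name]), reverse=True)
print(ranked)
```

In practice the fingerprints would come from a cheminformatics toolkit rather than hand-written sets; the ranking step stays the same.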

The National Cancer Institute (NCI) database was one of the first public efforts to distribute large data sets with bioactivity information in a searchable database format for the cancer and HIV research communities (Voigt et al. 2001; Ihlenfeldt et al. 2002; Couzin 2003). ChemBank, PubChem, ZINC, and other public databases have since followed, allowing searches by structural similarity and biological activity. Online and open-source software are useful resources for cheminformatics software development (Girke et al. 2005).

Liu et al. (2012) demonstrated the ability to predict adverse drug reactions (ADRs) by integrating chemical, biological, and phenotypic properties of drugs. They showed that data fusion approaches are promising for large-scale ADR predictions in both preclinical and post-marketing phases (Shah 2012).

4 Genomic Technologies

The completion of human (Homo sapiens) and mouse (Mus musculus) genome sequence projects has increased the number of gene annotations and made it possible for bioinformaticians to develop new approaches that help experimental researchers tackle biological problems (Jin et al. 2004).

The microarray technique, also known as the chip-based technique, was introduced in the early 1990s and allowed scientists to monitor the expression of many genes concurrently. It became a powerful, gold-standard tool for analyzing and understanding the expression and regulation of many genes in parallel (Tavera-Mendoza et al. 2006). Analyzing multiple genes at the same time reveals detailed genomic and proteomic information, which may lay the foundation for identifying a novel target or receptor. The output of microarray analysis thus strengthens translational research in drug discovery and development. Microarrays have been used to dissect nuclear receptor functions in normal and disease states, in tissues, and in cell models. Numerous studies of nuclear receptor gene regulation have been carried out to identify downstream signaling pathways (Tavera-Mendoza et al. 2006). In one such experiment, activation of PPAR was studied in a high-cholesterol context, followed by microarray studies, and yielded a potential target gene for triglyceride-lowering drugs (Tavera-Mendoza et al. 2006; Frederiksen et al. 2004).

4.1 Next-Generation Sequencing (NGS)

Sequencing technology is used to read out an organism's genetic information, with applications that include molecular cloning, gene identification, comparative studies, and evolutionary studies. Sequencing the human genome with the first-generation “Sanger sequencing” method in the Human Genome Project (HGP) has been estimated to have cost US$2.7 billion, whereas the same procedure costs only US$1.5 million with the next-generation sequencing (NGS) method (Morini et al. 2015).

In the past few years, NGS-based procedures have grown rapidly in both use and range of application. Technological advancement and increased automation in benchtop and high-throughput sequencing have also decreased the cost and made sequencing technology accessible to laboratories of all sizes, in studies ranging from plants to human diseases (Benjamin 2015). NGS refers to the DNA sequencing methods that came after capillary-based Sanger sequencing (first generation), beginning in 2005. Current next-generation DNA and RNA sequencing companies include Illumina (TruSeq, HiSeq), Life Technologies (Ion Torrent, SOLiD), Complete Genomics (DNA nanoball sequencing), 454 Sequencing (pyrosequencing), and Oxford Nanopore Technologies (GridION) (Carlson 2012).

4.2 NGS and Personalized Medicine

Sudden cardiac death (SCD) is commonly defined as a natural death from unexplained cardiac causes. Young athletes are the group most affected by SCD. The most common trigger identified for arrhythmias and SCD is adrenergic stress during competitive sports activity in the presence of an inherited cardiac disease such as cardiomyopathy, primary arrhythmia syndrome, or vascular disease. Hence, molecular analysis of cardiac channelopathies and cardiomyopathies would allow early diagnosis and prevention of SCD in a significant percentage of young individuals. To obtain reliable results, one should design an appropriate and well-defined NGS diagnostic protocol and verify in a validation phase that mutations previously identified in a group of individuals by Sanger sequencing are also detected by the new sequencing technique. Conversely, novel variants identified by NGS must be confirmed by Sanger sequencing to evaluate the reproducibility of the NGS approach (Fig. 2.1) (Morini et al. 2015).

Fig. 2.1 NGS protocol for sudden cardiac death conditions
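The two-way check described for the validation phase (known Sanger variants must be recovered by NGS, and NGS-only variants must go back to Sanger for confirmation) can be sketched with set operations. This is a minimal illustration, not the protocol of Morini et al.; the function `compare_calls` and the variant labels are placeholders.

```python
def compare_calls(sanger, ngs):
    """Compare variant calls from Sanger sequencing and NGS on the same samples."""
    confirmed = sanger & ngs   # variants found by both methods
    missed = sanger - ngs      # known variants the NGS protocol failed to detect
    novel = ngs - sanger       # NGS-only calls that still need Sanger confirmation
    sensitivity = len(confirmed) / len(sanger) if sanger else float("nan")
    return {
        "confirmed": confirmed,
        "missed": missed,
        "novel": novel,
        "sensitivity": sensitivity,
    }


# Placeholder variant labels (hypothetical, not real variants)
sanger_calls = {"variant_1", "variant_2", "variant_3"}
ngs_calls = {"variant_1", "variant_2", "variant_3", "variant_4"}

report = compare_calls(sanger_calls, ngs_calls)
print(report["missed"], report["novel"], report["sensitivity"])
```

An empty `missed` set corresponds to the validation requirement, while every entry in `novel` would be queued for Sanger confirmation.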

Research published in Nature Medicine reports that NGS sequencing has revealed genomic alterations directly associated with clinically available therapeutics or a relevant clinical trial of a targeted therapy in 72% of 24 non-small cell lung cancer (NSCLC) tumors and in 52.5% of 40 colorectal cancer (CRC) tumors. Two novel gene fusions, KIF5B-RET in NSCLC and C2orf44-ALK in CRC, were among the alterations that might be treated by drugs. The fusion of C2orf44 and ALK produces an overexpression of anaplastic lymphoma kinase (ALK), the target of crizotinib (Xalkori), approved for the treatment of ALK-positive NSCLC, which suggests the possibility that ALK-positive CRC patients may respond to ALK-inhibitor treatment (Fig. 2.2) (Carlson 2012).

Fig. 2.2 Test result for genomic alterations (Carlson 2012)

5 Structure-Based Drug Discovery

In recent years, structure-based drug discovery (SBDD) has become a rapidly growing methodology in the drug discovery and development industry. The boom in genomic, proteomic, and related structural data has delivered a number of novel targets and future prospects for lead discovery. In the early 1980s, rational drug design based on protein structure was still unfamiliar territory to structural biologists. The first success stories of SBDD were published in the early 1990s, and it has since become an integral and major subject of inquiry in many research and academic organizations (Amy 2003; Roberts et al. 1990; Erickson et al. 1990; Dorsey et al. 1994).

The iterative process of SBDD begins with the identification, cloning, purification, and 3D structure determination of the target protein or nucleic acid by one of the following methods: X-ray crystallography, NMR, homology modeling, or various hybrid technologies. Known or predicted active sites are located by computer-based algorithms and targeted with known and unknown 3D chemical compounds, ligands, or drugs drawn from private and public databases by industrial, academic, and research groups. The generated complexes are ranked on the basis of binding energy (Eq. 2.1), pharmacophoric interaction points, and types of interaction such as hydrogen bonding, electrostatic interaction, and van der Waals interaction. The best-ranked complexes are then tested in a suitable biochemical assay, and the results feed into further evaluation. A compound that shows at least micromolar inhibition in vitro indicates to scientists that it can be optimized to increase its potency. Repeated cycles of design, synthesis, testing, and evaluation applied to a lead compound may produce a marketable product with optimized binding and specificity for the target (Fig. 2.3) (Amy 2003).

Fig. 2.3 Diagrammatic representation of a structure- and ligand-based drug discovery pipeline

Binding energy:

$$ \Delta G=\left({V}_{\mathrm{bound}}^{L-L}-{V}_{\mathrm{unbound}}^{L-L}\right)+\left({V}_{\mathrm{bound}}^{P-P}-{V}_{\mathrm{unbound}}^{P-P}\right)+\left({V}_{\mathrm{bound}}^{P-L}-{V}_{\mathrm{unbound}}^{P-L}+\Delta {S}_{\mathrm{conf}}\right) $$
(2.1)

where P refers to the protein, L refers to the ligand, V represents the pairwise evaluations mentioned above, and ΔS_conf denotes the loss of conformational entropy upon binding (Ruth et al. 2007).
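As a worked illustration of Eq. 2.1, the sketch below adds up the three bracketed contributions. The function name `binding_free_energy` and all numerical values are hypothetical and serve only to show how the terms combine; they are not output from any docking program.

```python
def binding_free_energy(v_ll, v_pp, v_pl, ds_conf):
    """Evaluate Eq. 2.1.

    v_ll, v_pp, v_pl are (bound, unbound) pairs of pairwise energy terms for
    ligand-ligand, protein-protein, and protein-ligand interactions.
    ds_conf is the conformational entropy term, in the same units.
    """
    ligand_term = v_ll[0] - v_ll[1]
    protein_term = v_pp[0] - v_pp[1]
    complex_term = v_pl[0] - v_pl[1] + ds_conf
    return ligand_term + protein_term + complex_term


# Hypothetical energies in kcal/mol
delta_g = binding_free_energy(
    v_ll=(-1.0, -0.5),    # ligand intramolecular energy, bound vs unbound
    v_pp=(-10.0, -10.0),  # rigid protein: no change on binding
    v_pl=(-6.0, 0.0),     # no protein-ligand contact when unbound
    ds_conf=1.5,          # entropic penalty
)
print(delta_g)
```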

In comparative docking analysis of known and unknown compounds against a common target, the generated ligand poses (conformations) that are closest to the experimental or known conformation should ideally be ranked highest. This can be assessed by quantifying the similarity between the native ligand and a generated ligand pose, using the root-mean-square deviation (RMSD) between the two ligand structures (Raschka 2014):

$$ \mathrm{RMSD}\left(a,b\right)=\sqrt{\frac{1}{n}\sum_{i=1}^n\left[{\left({a}_{ix}-{b}_{ix}\right)}^2+{\left({a}_{iy}-{b}_{iy}\right)}^2+{\left({a}_{iz}-{b}_{iz}\right)}^2\right]} $$
(2.2)

where a_i refers to the atoms of molecule 1, b_i to the atoms of molecule 2, and n is the number of atom pairs. The subscripts x, y, and z denote the x-y-z coordinates of every atom.
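Eq. 2.2 translates directly into a few lines of code. The sketch below assumes that both structures list the same atoms in the same order and that the pose is already in the receptor frame, so no superposition is performed; the function name `rmsd` and the coordinates are illustrative.

```python
import math


def rmsd(a, b):
    """Root-mean-square deviation between two matched lists of (x, y, z) atoms."""
    if len(a) != len(b) or not a:
        raise ValueError("both structures need the same, nonzero number of atoms")
    total = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(a, b)
    )
    return math.sqrt(total / len(a))


# Hypothetical two-atom ligand: the pose is shifted by 1 unit along z
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pose = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(native, pose))
```

Ranking docking poses then amounts to sorting them by their RMSD to the native ligand, lowest first.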

5.1 Molecular Docking

Molecular docking is a computational technique for modeling the interaction between a protein macromolecule, known as the receptor or target, and a small chemical entity (ligand or drug molecule) or another protein macromolecule, depending on the type of study. It elucidates the behavior of a ligand in the active site of a receptor protein and the underlying biochemical process. The docking process involves two basic steps: prediction of the ligand conformation within the active site of the receptor protein, followed by assessment of the binding energy (Meng et al. 2011; McConkey et al. 2002).

Fischer originally proposed the lock-and-key model of ligand-receptor binding, in which the ligand fits into the receptor as a key fits into a lock. Early docking studies were based on this theory and treated receptor and ligand as rigid bodies. Koshland put forward the “induced-fit” theory, which takes the lock-and-key model a step further and suggests that the conformation of the receptor protein changes continuously as a result of its interaction with the ligand. The theory implies that both ligand and receptor should be treated as flexible during docking, which can describe binding events more accurately than a rigid treatment (Fischer 1894; Kuntz et al. 1982; Koshland 1963; Hammes 2002).

Site-specific docking strategies significantly increase docking efficiency. In many cases, however, the binding site is unknown. A putative binding site can be predicted using commercial software such as SYBYL-X Suite (SYBYL 8.0), SiteMap – Schrodinger (Halgren 2007), BioPredicta – VLife Molecular Design Suite (MDS) (www.vlifesciences.com), Discovery Studio (Dassault Systèmes BIOVIA 2015), FLEXX (Rarey et al. 1996), Molegro Virtual Docker System (Thomsen and Christensen 2006), and ICM-Pro – Molsoft (An et al. 2005). The prediction can also be performed using online servers and related tools, e.g., CASTp (Dundas et al. 2006), GRID (Goodford 1985; Kastenholz et al. 2000), POCKET (Levitt and Banaszak 1992), SurfNet (Laskowski 1995; Glaser et al. 2006), PASS (Brady and Stouten 2000), and MMC (Mezei 2003). Docking without any assumption about the binding site is called blind docking.

The main application of molecular docking lies in the structure-based virtual screening for identification of new active compounds for a particular target protein. Molecular docking technique takes a path of translational science and combines the computational output and experimental data in analyzing various biochemical reactions and interactions to study the biological system (Kubinyi 2006; Kroemer 2007; Venhorst et al. 2003; Williams et al. 2003; Meng et al. 2009).

High-throughput screening (HTS) had a low success rate in identifying novel inhibitors of DNA gyrase. Boehm et al. applied a de novo design methodology instead and successfully obtained several new inhibitors (Boehm et al. 2000). Initially, the 3D complex structures of DNA gyrase with the known inhibitors ciprofloxacin and novobiocin were analyzed, and the common residue interaction patterns were identified. Both inhibitors donate one hydrogen bond to Asp 73 and accept one hydrogen bond from a conserved water molecule. In addition, lipophilic fragments are required in the molecule for lipophilic interaction with the receptor protein. Based on this knowledge, LUDI and CATALYST were employed to search for similar chemical structures in the Available Chemicals Directory (ACD) and the Roche Compound Inventory (RCI), resulting in 600 compounds. Close structural analogs of these compounds were then considered, and 3000 compounds were tested using biased screening. One hundred fifty compounds were selected and clustered into 14 classes, of which 7 classes proved to be novel, true inhibitors. Subsequent hit optimization relied strongly on 3D structures of the binding site and generated a potent DNA gyrase inhibitor (Boehm et al. 2000).

Retinoblastoma (RB), a cancer of the eye, occurs in young children. Researchers have reported that fatty acid synthase (FASN) is a promising diagnostic/prognostic and therapeutic target for retinoblastoma. Three inhibitors that target various domains of FASN and are potential anticancer drugs (cerulenin, triclosan, and orlistat) were considered in previous studies (Vandhana et al. 2011; Kuhajda et al. 1994; Steven et al. 2004). The experimental data for cerulenin, triclosan, and orlistat gave IC50 values of 3.54 μg/ml, 7.29 μg/ml, and 145.25 μM, respectively, with a dose-dependent decrease in the viability of retinoblastoma cancer cells (Deepa et al. 2010). The crystal structure of the KS-MAT didomain of human FASN [PDB ID: 3HHD] was used for docking with cerulenin (Pappenberger et al. 2010) and gave a binding energy of −5.82 kcal/mol. As no data are available for the enoyl reductase (ER) domain of human FASN in public databases, the crystal structure of the ER domain [PDB ID: 2VZ8] was used as a template to model the human ER domain. This model was then docked with triclosan and exhibited a binding energy of −5.73 kcal/mol (Deepa et al. 2010). Pemble et al. used the crystallized 3D complex structure of the human TE domain with orlistat [PDB ID: 2PX6] in their experiment. Based on the crystal structure data, re-docking was performed using AutoDock, and the binding energy was found to be −2.97 kcal/mol. All these findings indicate the predictive accuracy of the in silico methods adopted (Pemble et al. 2007).

6 Ligand-Based Drug Discovery

Identifying a new lead molecule from millions of compounds by the traditional approach is time consuming and very costly. Since the 1960s, scientists from diverse life science backgrounds have put enormous effort into identifying the quantitative parameters that determine biological activity, in what are known as QSAR/QSPR studies (Nantasenamat et al. 2009). QSAR originated as far back as 1863, when Cros proposed, in the field of toxicology, a relationship between the toxicity of primary aliphatic alcohols and their water solubility (Nantasenamat et al. 2009). Crum-Brown and Fraser hypothesized a relationship between chemical constitution and physiological action in 1868 (Crum-Brown and Fraser 1868). Separately, Richet (1893), Meyer (1899), and Overton (1901) showed a linear correlation between lipophilicity (e.g., oil-water partition coefficients) and biological effects (e.g., narcotic effects and toxicity) (Nantasenamat et al. 2009). Hammett (1935, 1937) presented a method to account for substituent effects on reaction mechanisms through an equation with two parameters, namely, (i) the substituent constant and (ii) the reaction constant (Nantasenamat et al. 2009; Crum-Brown and Fraser 1868).

Hammett quantified the effect of substituents on any reaction by defining an empirical electronic substituent parameter (σ), which is derived from the acidity constants, Ka’s of substituted benzoic acids (Fig. 2.4) (https://web.viu.ca/krogh/chem331/LFER%20Hammett%202012.pdf).

Fig. 2.4

The Hammett equation (Eq. 2.3) relates the relative magnitude of the equilibrium constants to a reaction constant ρ and a substituent constant σ:

$$ \log \left(\frac{K_X}{K_H}\right)=\rho \sigma \quad \mathrm{or}\quad \mathrm{p}{K}_H-\mathrm{p}{K}_X=\rho \sigma $$
(2.3)

For the ionization of benzoic acid in pure water at 25 °C (the reference reaction), the constant ρ is defined as 1.00. Thus, the electronic substituent parameter (σ) is defined as

$$ \sigma =\log \left(\frac{K_X}{K_H}\right) $$
(2.4)

The reaction constant ρ is a measure of how sensitive a particular reaction is to changes in the electronic effects of substituent groups. It depends on the nature of the chemical reaction as well as on the reaction conditions (solvent, temperature, etc.). Both the sign and the magnitude of the reaction constant indicate the extent of charge buildup as the reaction progresses. Reactions with ρ > 0 are favored by electron-withdrawing groups (i.e., the stabilization of negative charge). Those with ρ < 0 are favored by electron-donating groups (i.e., the stabilization of positive charge). The greater the magnitude of ρ, the more sensitive the reaction is to electronic substituent effects (Nantasenamat et al. 2009).
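The two uses of the Hammett relation can be shown with a short calculation: first σ is obtained from benzoic acid pKa values with ρ = 1.00, and then that σ predicts the effect of the same substituent on a different reaction. The pKa values and the reaction constant below are illustrative numbers chosen for this sketch, not data from the chapter, and the function names are hypothetical.

```python
def sigma_from_pka(pka_h, pka_x):
    """Substituent constant from the reference reaction (rho = 1.00): sigma = pK_H - pK_X."""
    return pka_h - pka_x


def log_k_ratio(rho, sigma):
    """Hammett prediction log(K_X / K_H) = rho * sigma for a reaction with constant rho."""
    return rho * sigma


# Illustrative pKa values: unsubstituted acid and an acid with an
# electron-withdrawing substituent (the lower pKa means a stronger acid)
sigma = sigma_from_pka(pka_h=4.20, pka_x=3.44)

# Hypothetical reaction that is twice as sensitive as benzoic acid ionization
rho = 2.0
ratio = 10 ** log_k_ratio(rho, sigma)
print(f"sigma = {sigma:.2f}, K_X/K_H = {ratio:.1f}")
```

A positive σ combined with a positive ρ gives a ratio above 1, which matches the statement that reactions with ρ > 0 are favored by electron-withdrawing groups.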

In 1956 Taft proposed an approach for separating the polar, steric, and resonance effects of substituents in aliphatic compounds (Nantasenamat et al. 2009). In 1964 Hansch and Fujita put forward their linear Hansch equation, which built on the contributions of Hammett and Taft and became a mechanistic basis for QSAR/QSPR development. In the late 1960s Hansch et al. identified the nonlinear (parabolic) dependence of biological activity on log P and gave the following equation:

$$ \log \left(1/C\right)=a\log P-b{\left(\log P\right)}^2+c $$
(2.5)

where 1/C = measure of biological activity, log P = log of octanol-water partition coefficient, and a, b, and c = regression coefficients (Nantasenamat et al. 2009; Corwin and Toshio 1964).
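Reading the quadratic term of Eq. 2.5 as (log P) squared, the parabola has a maximum where its derivative a − 2b·log P vanishes, that is, at log P = a/(2b). The sketch below evaluates the model and this optimum; the coefficients are hypothetical and the function names are illustrative.

```python
def hansch_activity(log_p, a, b, c):
    """Parabolic Hansch model: log(1/C) = a*logP - b*(logP)^2 + c."""
    return a * log_p - b * log_p ** 2 + c


def optimal_log_p(a, b):
    """logP that maximizes the model, from d/d(logP) = a - 2*b*logP = 0."""
    if b <= 0:
        raise ValueError("b must be positive for the parabola to have a maximum")
    return a / (2 * b)


# Hypothetical regression coefficients
a, b, c = 1.0, 0.25, 2.0
best = optimal_log_p(a, b)
print(best, hansch_activity(best, a, b, c))
```

Compounds that are either much less or much more lipophilic than the optimum are predicted to be less active, which is the behavior the parabolic term was introduced to capture.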

6.1 Quantitative Structure-Activity Relationship (QSAR)

The discovery of clinically relevant inhibitors is a challenging task, and the quantitative structure-activity relationship (QSAR) methodology has become a convenient and widely used technique for ligand-based drug design and discovery. More than 1000 2D and 3D molecular descriptors have been defined by the scientific community; a few are listed here: Individual (Mol.Wt, Volume, H-AcceptorCount, H-DonorCount, RotatableBondCount, XlogP, slogp, smr, polarizabilityAHC, and polarizabilityAHP), Retention Index (chi), Atomic valence connectivity index (chiv), Path Count, Chi Chain, Chiv Chain, Chain Path Count, Cluster, Path Cluster, Kappa, Element Count, Dipole Moment, Electrostatic, Distance Based Topological, Estate Numbers, Estate Contributions, Information Theory Index, Semi Empirical, Hydrophobicity XlogpA, Hydrophobicity XlogpK, Hydrophobicity SlogpA, Hydrophobicity SlogpK, and Polar Surface Area (http://www.vlifesciences.com/support/QSAR_Descriptor_Definations_faqs_Answer.php).

6.1.1 Model Development

QSAR is among the most extensively used computational techniques for ligand-based design, and Bohari et al. have recently reviewed the application of a variety of molecular descriptors, including quantum chemical, molecular mechanics, conceptual density functional theory (DFT), and molecular docking-based descriptors, for predicting biological activity (Bohari et al. 2011). A summary of the relevant data analysis methods, regression analysis, and model validation process is provided below along with some examples.

6.1.2 Data Analysis Method

Principal components analysis (PCA) and cluster analysis are two widely used methods in 2D and 3D QSAR data analysis. PCA was introduced by Karl Pearson in 1901 and is one of the most popular data reduction techniques. It transforms high-dimensional data into a low-dimensional representation, a step known as data reduction (Pearson 1901; http://www.doc.ic.ac.uk/~dfg/ProbabilisticInference/IDAPILecture15.pdf). Cluster analysis is used to partition the data set (with typical molecular properties) into classes and categories.
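The data reduction step can be sketched with NumPy by centering the descriptor matrix and taking its singular value decomposition. This is one standard way to compute PCA and is shown only as an illustration; the descriptor values are made up, and in practice descriptors on different scales are usually standardized first.

```python
import numpy as np


def pca(X, n_components):
    """PCA via SVD of the mean-centered data matrix (rows = compounds)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # center each descriptor
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]             # principal axes
    scores = Xc @ components.T                 # compounds in the reduced space
    explained = (S ** 2) / np.sum(S ** 2)      # fraction of variance per axis
    return scores, components, explained[:n_components]


# Hypothetical descriptor matrix: 5 compounds x 3 descriptors,
# where the second descriptor is exactly twice the first
X = [
    [1.0, 2.0, 0.5],
    [2.0, 4.0, 0.4],
    [3.0, 6.0, 0.6],
    [4.0, 8.0, 0.5],
    [5.0, 10.0, 0.5],
]
scores, components, explained = pca(X, n_components=1)
print(scores.shape, explained)
```

Because two of the three descriptors are perfectly correlated, a single component carries almost all of the variance, which is the situation in which data reduction pays off.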

6.1.3 Regression Method

Regression analysis is a statistical process for estimating the relationships between dependent and independent variables, using modeling techniques that can handle several variables at once.

Partial least squares (PLS) regression is used when the number of descriptors (independent variables) is greater than the number of compounds (data points) and/or when other factors lead to correlations between variables (Martens and Naes 1989; Höskuldsson 1988; Eriksson et al. 2001).

Multiple linear regression (MLR) yields an easily interpretable mathematical expression and is the primary method for constructing QSAR/QSPR models, but it often fails on highly correlated data sets. Several methods have been built on MLR, such as best multiple linear regression (BMLR), the heuristic method (HM), genetic algorithm-based multiple linear regression (GA-MLR), stepwise MLR, and factor analysis MLR. Machine learning algorithms have also been developed to fit the data, such as the neural network (NN) and the support vector machine (SVM) with its variants: least squares support vector machine (LS-SVM), grid search support vector machine (GS-SVM), potential support vector machine (P-SVM), and genetic algorithm support vector machine (GASVM) (Liu and Long 2009).

6.1.4 2D QSAR (Girgis et al. 2015)

Girgis and his team synthesized a total of 19 dispiro[3H-indole-3,2′-pyrrolidine-3′,3″-piperidines] (Fig. 2.5), of which analogs 11–19 were screened against the HeLa (cervical carcinoma) cell line. Compounds 13, 14, and 16 showed higher potency (IC50 = 4.87, 5.75, and 7.25 μM, respectively) against the HeLa cell line than the standard reference cisplatin (IC50 = 7.71 μM), which is used clinically against cervical carcinoma. See Table 2.2.

Fig. 2.5 Synthesized dispiro[3H-indole-3,2′-pyrrolidine-3′,3″-piperidines] derivatives (Girgis et al. 2015)

Table 2.2 Antitumor properties of the synthesized compounds 11–19 (tested against HeLa)

Structure–activity relationships (SAR) based on the experimental antitumor activity against HeLa (cervical carcinoma) reveal that the nature of the substituent attached to the phenyl group at C-4′ and consequently the exocyclic olefinic linkage seem to be a controlling factor governing the antitumor properties. Substitution of this phenyl group by fluorine atom enhances the observed antitumor properties more than two chlorine atoms, as exhibited in pairs 11, 13 (IC50 = 16.69, 4.87 μM, respectively) and 12, 14 (IC50 = 12.71, 5.75 μM, respectively) (Tables 2.3 and 2.4).

Table 2.3 Observed and predicted values of training set compounds 11, 13, 15–17, and 19–44 according to the multi-linear QSAR models
Table 2.4 Observed and predicted values of external test set compounds 12, 14, and 18 according to the multi-linear QSAR models

The basic idea behind QSAR is to establish a relationship between the chemical structure of an organic compound and its physicochemical properties. Because the data set of pharmacologically active compounds in the present study is limited, external data points were also considered. These are spiro-alkaloids with a similar scaffold whose biological activities had been determined earlier by the same standard technique followed in the present study.

For QSAR model development, compounds 11, 13, 15–17, and 19 from Table 2.2 were considered in addition to compounds 20–44 from Table 2.3, giving 31 spiro-alkaloid derivatives as the training set. The test set (external data set for validation) comprised the synthesized analogs 12, 14, and 18, which represent high- and low-potency antitumor agents (Table 2.2). The geometry of the selected compounds was optimized using the molecular mechanics force field (MM+), followed by the semiempirical AM1 method as implemented in HyperChem. A total of 728 two-dimensional molecular descriptors were calculated using CODESSA-Pro software, including constitutional, topological, geometrical, charge-related, semiempirical, molecular-type, atomic-type, and bond-type descriptors, for the training set (Table 2.3) and test set (Table 2.4) data. Log property (1/log) and biological activity/IC50 values against the HeLa (cervical) cell line were considered for all training and test set compounds in the present QSAR modeling.

Best multi-linear regression (BMLR) was used. It performs a stepwise search for the best n-parameter regression equations (where n stands for the number of descriptors used), based on the highest R² (squared correlation coefficient), R²cvOO (squared cross-validation “leave-one-out (LOO)” coefficient), R²cvMO (squared cross-validation “leave-many-out (LMO)” coefficient), Fisher statistical significance criteria (F) values, and standard deviation (S²). Statistical characteristics of the QSAR models are presented in Table 2.5.
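The statistics used to rank candidate equations can be illustrated on synthetic data. The sketch below fits an ordinary multi-linear regression with NumPy and reports R² together with a leave-one-out coefficient computed as 1 − PRESS/SS_tot, which is one common definition and may differ from the one CODESSA-Pro uses. It does not reproduce the BMLR descriptor search, and the data and function names are hypothetical.

```python
import numpy as np


def fit_mlr(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... ; returns the coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef


def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def loo_r_squared(X, y):
    """Refit with each compound left out once and score its prediction."""
    preds = np.empty_like(y)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        coef = fit_mlr(X[keep], y[keep])
        preds[i] = predict(coef, X[i:i + 1])[0]
    return r_squared(y, preds)


# Synthetic data: 20 "compounds", 3 "descriptors", small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = 1.0 + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.05, size=20)

r2 = r_squared(y, predict(fit_mlr(X, y), X))
q2 = loo_r_squared(X, y)
print(f"R2 = {r2:.3f}, LOO R2 = {q2:.3f}")
```

With this definition the cross-validated value cannot exceed the fitted R² for a least-squares model, so a large gap between the two is a warning sign of overfitting.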

Table 2.5 Descriptor of the best multi-linear QSAR model for the HeLa (cervical) tumor cell line active agents

The descriptors listed in the table are the chief contributors to the model. Above all, the molecular-type descriptor Min # HA and # HD, which characterizes the bioactive agent as a hydrogen-bond acceptor/donor, is important in governing the QSAR model: it has the highest t-criterion (9.200) and the lowest coefficient (0.247). The second largest contributor is FNSA-2, the fractional PNSA (PNSA-2/TMSA), a charge-related descriptor with a t-criterion of 5.546 and the highest coefficient (0.596) in the QSAR model. It is given by

$$ \mathrm{FNSA}2=\frac{\mathrm{PNSA}2}{\mathrm{TMSA}} $$
(2.6)

The third and last molecular descriptor of the HeLa QSAR model, HASA-2/SQRT(TMSA), is also charge related. It has a t-criterion of 4.424 and, based on its coefficient (0.426), is the second most effective parameter controlling the QSAR model. The area-weighted surface charge of hydrogen-bonding acceptor atoms (HASA2) is determined by

$$ \mathrm{HASA}2=\sum_A\frac{q_A\sqrt{S_A}}{\sqrt{S_{tot}}},\kern0.5em A\in {X}_{\mathrm{H\text{-}acceptor}} $$
(2.7)
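
Equations 2.6 and 2.7 can be evaluated directly once per-atom quantities are available. The sketch below is illustrative only: the partial charges, atomic surface areas, and acceptor flags are hypothetical inputs, the function names are ours, and S_tot is assumed to be the total molecular surface area.

```python
import math


def fnsa2(pnsa2, tmsa):
    """Eq. 2.6: FNSA2 = PNSA2 / TMSA."""
    return pnsa2 / tmsa


def hasa2(charges, areas, is_acceptor, total_area):
    """Eq. 2.7: sum of q_A * sqrt(S_A) / sqrt(S_tot) over H-bond acceptor atoms."""
    return sum(
        q * math.sqrt(s) / math.sqrt(total_area)
        for q, s, acceptor in zip(charges, areas, is_acceptor)
        if acceptor
    )


# Hypothetical three-atom example; only atoms 1 and 3 are flagged as acceptors.
charges = [-0.4, 0.1, -0.3]        # partial charges q_A
areas = [16.0, 9.0, 4.0]           # atomic surface areas S_A
is_acceptor = [True, False, True]
value = hasa2(charges, areas, is_acceptor, total_area=100.0)
print(round(value, 3))
```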

6.1.5 QSAR Model Validation

The reliability and statistical validity of a QSAR model depend on internal and external validation procedures. In the present QSAR model, internal validation was carried out in CODESSA-Pro using both leave-one-out (LOO) and leave-many-out (LMO) cross-validation. The resulting correlations are R²cvOO = 0.738 and R²cvMO = 0.776. The squared correlation coefficient of the QSAR model is R² = 0.815, the standard deviation of the regression is s² = 0.008, and the Fisher test value is F = 39.615, which reflects the ratio of the variance explained by the model to the variance due to its errors.

The most potent synthesized analog in the training set, compound 13, had a predicted IC50 of 5.94 μM in the HeLa QSAR model against an experimental value of 4.87 μM, an error of 1.07. For the other training set compounds 16, 20–29, 31, 33–35, 38, and 42, compared with cisplatin, the standard reference used clinically against cervical carcinoma (IC50 = 7.71 μM), the predicted values deviated from the experimental values by 0.06–1.12. Compounds 32 and 39, also considered potent analogs against cervical carcinoma (IC50 = 5.55 and 5.51), had predicted values of IC50 = 7.78 and 7.84, with larger errors of 2.23 and 2.33, respectively. Among the mildly active antitumor agents against the HeLa cell line, compounds 15, 30, 37, 41, 43, and 44 (IC50 range = 8.64–10.89 μM) showed predicted potencies (IC50 range = 6.27–10.77 μM) with a relatively larger error range (0.41–2.47) than the highly potent analogs. Among the low-potency analogs against the HeLa cell line, compounds 11, 17, 19, 36, and 40 (IC50 range = 11.20–24.36 μM) showed large deviations in predicted potency (IC50 range = 6.53–26.07 μM), with an error range of 1.64–5.99 (Table 2.5). From these observations, the HeLa QSAR model can be considered a better predictive model for highly potent HeLa antitumor hits than for those of mild or low potency.
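
The reported Fisher value can be cross-checked against R² with the textbook relation for a regression with an intercept. The sketch below assumes the model has three descriptors (the three discussed above) and 31 training compounds. Whether CODESSA-Pro computes F in exactly this way is an assumption.

```python
def f_statistic(r2, n_compounds, n_descriptors):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1)) for a model with an intercept."""
    k = n_descriptors
    return (r2 / k) / ((1.0 - r2) / (n_compounds - k - 1))


f_value = f_statistic(r2=0.815, n_compounds=31, n_descriptors=3)
print(round(f_value, 2))  # about 39.65
```

The result is close to the reported F = 39.615. A difference of this size is consistent with R² having been rounded to three decimals.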

Compounds 12, 14, and 18 were selected to validate the model and examine its predictive ability. The selected test set compounds show experimentally high or low potency against the tested cell line. Table 2.4 lists the experimental and predicted IC50 values of the test set. Compound 14, considered highly potent against the HeLa cell line relative to the standard reference (cisplatin), had an experimental IC50 of 5.75 μM and a predicted IC50 of 5.64 μM, the smallest error (0.11). However, compounds 12 and 18, considered to have low potency against the HeLa cell line, had experimental IC50 values of 12.71 and 10.76 μM and predicted IC50 values of 8.99 and 23.70 μM, with much greater errors of 3.72 and 12.94, respectively.
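
The test set errors quoted above follow directly from the experimental and predicted values. The short sketch below recomputes them and adds a mean absolute error as a single summary figure. The summary statistic is our addition and is not reported in the study.

```python
# (experimental IC50, predicted IC50) in μM, as quoted in the text for the test set
test_set = {12: (12.71, 8.99), 14: (5.75, 5.64), 18: (10.76, 23.70)}

# Absolute prediction error per compound, rounded to two decimals
errors = {cid: round(abs(pred - exp), 2) for cid, (exp, pred) in test_set.items()}
mean_abs_error = sum(errors.values()) / len(errors)

for cid, err in errors.items():
    print(f"compound {cid}: absolute error = {err:.2f}")
print(f"mean absolute error = {mean_abs_error:.2f}")
```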

7 Pharmacokinetic and Pharmacodynamic (PKPD) Simulation (Nielsen and Friberg 2013)

Rowland and Tozer (2011) define pharmacokinetics (PK) as “how the body handles the drug” and pharmacodynamics (PD) as “how the drug affects the body.” PK and PD are vital components of the modern drug development process. Characterizing the PKPD relationship makes it possible to compute the concentration that produces the desired effects with the fewest side effects, together with an appropriate dose regimen.

7.1 Pharmacokinetics

Being a central part of clinical pharmacology, PK describes the link between drug dosing and the drug concentration-time profile in the body. The drug concentration (C) in plasma and its change from an initial concentration (C0) with respect to time (t) are given by an exponential function:

$$ C(t)={C}_0\ast {e}^{-{k}_e\ast t} $$
(2.8)

Equation 2.8 describes a simple PK model in which the concentration declines in a single phase. For such a system, the rate of change at any time point is directly proportional to the concentration or amount remaining in the system. The proportionality constant is the elimination rate constant (ke), which is first order and has units of per time (h⁻¹):

$$ \frac{dC}{dt}=-{k}_e\ast C $$
(2.9)

where ke is the parameter to be estimated from the data and is inversely related to the half-life (t1/2) of the drug. From Eqs. 2.8 and 2.9, it follows that once ke is known, the drug concentration can be predicted at any time point for a given C0.
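
The step from Eq. 2.9 to Eq. 2.8, and the inverse relation between the elimination rate constant and the half-life, can be written out as a short derivation sketch. Separating variables in Eq. 2.9 and integrating from time 0 to t gives

$$ \int_{C_0}^{C(t)}\frac{dC}{C}=-{k}_e\int_0^t d{t}^{\prime}\kern0.5em \Rightarrow \kern0.5em \ln \frac{C(t)}{C_0}=-{k}_e\ast t\kern0.5em \Rightarrow \kern0.5em C(t)={C}_0\ast {e}^{-{k}_e\ast t} $$

Setting the concentration at the half-life equal to half of the initial concentration then yields

$$ {t}_{1/2}=\frac{\ln 2}{k_e} $$

so doubling the elimination rate constant halves the half-life.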

ke is determined by the apparent volume of distribution (Vd) and by clearance (CL), which describes the elimination capacity and is typically governed by liver and kidney function. For a drug with immediate distribution and a CL value independent of concentration, ke can be described as

$$ {k}_e=\frac{CL}{V_d} $$
(2.10)
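
As a minimal sketch of how Eqs. 2.8 and 2.10 combine, the example below predicts plasma concentrations from clearance and volume of distribution. It assumes an intravenous bolus, so that the initial concentration equals the dose divided by the volume of distribution. The parameter values and function names are hypothetical.

```python
import math


def elimination_rate(cl, vd):
    """Eq. 2.10: k_e = CL / V_d (units of 1/h when CL is in L/h and V_d in L)."""
    return cl / vd


def concentration(dose, cl, vd, t):
    """Eq. 2.8 with C0 = dose / V_d, assuming an IV bolus and immediate distribution."""
    c0 = dose / vd
    return c0 * math.exp(-elimination_rate(cl, vd) * t)


# Hypothetical drug: 500 mg dose, CL = 5 L/h, V_d = 50 L
dose, cl, vd = 500.0, 5.0, 50.0
ke = elimination_rate(cl, vd)
t_half = math.log(2) / ke

for t in (0.0, t_half, 2 * t_half, 24.0):
    print(f"t = {t:5.2f} h  C = {concentration(dose, cl, vd, t):.3f} mg/L")
```

After one half-life the predicted concentration is half of the initial value, and after two half-lives it is a quarter.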

Often the behavior of a drug is more complex because its distribution within the body is not immediate, owing to the influence of its surrounding environment. The concentration-time course is then better described by two or more compartments. The differential equations for a two-compartment model can be written as

$$ \frac{dA_c}{dt}=-\frac{CL}{V_c}\ast {A}_c-\frac{Q}{V_c}\ast {A}_c+\frac{Q}{V_p}\ast {A}_p $$
(2.11)
$$ \frac{dA_p}{dt}=-\frac{Q}{V_p}\ast {A}_p+\frac{Q}{V_c}\ast {A}_c $$
(2.12)

where Ac and Ap are the amounts in the central and peripheral compartments and Vc and Vp are the corresponding volumes of distribution. Q represents the intercompartmental clearance. An intravenously administered dose is given into the central compartment.
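
Equations 2.11 and 2.12 have no single-exponential solution, but they are straightforward to integrate numerically. The sketch below uses a classical fourth-order Runge-Kutta step, which is our choice of integrator, with an intravenous bolus placed in the central compartment at time zero. All parameter values and function names are hypothetical.

```python
def derivatives(a_c, a_p, cl, q, v_c, v_p):
    """Right-hand sides of Eqs. 2.11 and 2.12."""
    da_c = -(cl / v_c) * a_c - (q / v_c) * a_c + (q / v_p) * a_p
    da_p = -(q / v_p) * a_p + (q / v_c) * a_c
    return da_c, da_p


def simulate_two_compartment(dose, cl, q, v_c, v_p, t_end, dt):
    """Integrate with classical RK4; returns a list of (t, A_c, A_p) tuples."""
    a_c, a_p = dose, 0.0  # IV bolus: the whole dose starts in the central compartment
    profile = [(0.0, a_c, a_p)]
    n_steps = int(round(t_end / dt))
    for step in range(1, n_steps + 1):
        k1 = derivatives(a_c, a_p, cl, q, v_c, v_p)
        k2 = derivatives(a_c + 0.5 * dt * k1[0], a_p + 0.5 * dt * k1[1], cl, q, v_c, v_p)
        k3 = derivatives(a_c + 0.5 * dt * k2[0], a_p + 0.5 * dt * k2[1], cl, q, v_c, v_p)
        k4 = derivatives(a_c + dt * k3[0], a_p + dt * k3[1], cl, q, v_c, v_p)
        a_c += dt / 6.0 * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0])
        a_p += dt / 6.0 * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1])
        profile.append((step * dt, a_c, a_p))
    return profile


# Hypothetical parameters: dose in mg, clearances in L/h, volumes in L, time in h
v_c = 20.0
profile = simulate_two_compartment(dose=100.0, cl=5.0, q=10.0, v_c=v_c, v_p=40.0,
                                   t_end=24.0, dt=0.05)
for t, a_c, a_p in profile[::96]:
    print(f"t = {t:5.1f} h  C_central = {a_c / v_c:6.3f} mg/L  A_p = {a_p:6.2f} mg")
```

The central concentration falls quickly at first while drug moves into the peripheral compartment, then more slowly once the two compartments approach equilibrium.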

The total exposure is often described as the area under the concentration-time curve (AUC). AUC is obtained by integrating the drug concentration-time profile and can also be computed as the systemically available dose divided by CL. The bioavailability, F, is the fraction of an extravascular dose that reaches the systemic circulation and is thereby a measure of the extent of absorption. The rate of absorption is often characterized by a first-order rate constant, ka.
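
The two routes to AUC described above can be compared numerically. The sketch below integrates a one-compartment profile with the linear trapezoidal rule and checks it against the available dose divided by CL. It assumes an intravenous bolus with F = 1, and all parameter values are hypothetical.

```python
import math


def auc_trapezoid(times, concs):
    """Linear trapezoidal rule over paired time and concentration samples."""
    area = 0.0
    for i in range(1, len(times)):
        area += (times[i] - times[i - 1]) * (concs[i] + concs[i - 1]) / 2.0
    return area


# Hypothetical drug: 500 mg IV bolus, F = 1, CL = 5 L/h, V_d = 50 L
dose, f_bio, cl, vd = 500.0, 1.0, 5.0, 50.0
ke = cl / vd

times = [0.1 * i for i in range(1001)]  # 0 to 100 h in 0.1 h steps
concs = [(f_bio * dose / vd) * math.exp(-ke * t) for t in times]

auc_numeric = auc_trapezoid(times, concs)
auc_formula = f_bio * dose / cl  # systemically available dose over CL
print(f"trapezoid: {auc_numeric:.2f} mg*h/L, dose/CL: {auc_formula:.2f} mg*h/L")
```

The numerical value approaches the formula only when sampling continues long enough for the concentration to become negligible. Truncating the profile early underestimates AUC.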

7.2 Pharmacodynamics

Pharmacodynamics (PD) describes the relationship between drug concentration and both the desired and undesired effects of a given drug. The mathematical function describing the PKPD relationship is a sigmoidal Emax model given by

$$ E(t)={E}_0+\frac{E_{max}\ast C{(t)}^{\gamma}}{EC_{50}^{\gamma}+C{(t)}^{\gamma}} $$
(2.13)

where Emax is the maximum effect that can be achieved by the drug in the investigated system and EC50 is the drug concentration that results in half of the maximum effect. EC50 is inversely related to potency. γ is the Hill or sigmoidicity factor, which determines the steepness of the relationship but in many cases is not statistically significantly different from 1.
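
A minimal sketch of Eq. 2.13 makes the roles of the parameters concrete. The parameter values are hypothetical, and the function name is ours.

```python
def sigmoid_emax(c, e0, emax, ec50, gamma=1.0):
    """Eq. 2.13: E = E0 + Emax * C^gamma / (EC50^gamma + C^gamma)."""
    return e0 + emax * c ** gamma / (ec50 ** gamma + c ** gamma)


# Hypothetical system: baseline 10, maximum drug effect 80, EC50 = 5 mg/L
e0, emax, ec50 = 10.0, 80.0, 5.0

for gamma in (1.0, 3.0):
    effects = [sigmoid_emax(c, e0, emax, ec50, gamma) for c in (0.0, 2.5, 5.0, 10.0, 50.0)]
    print(f"gamma = {gamma}: " + ", ".join(f"{e:.1f}" for e in effects))
```

Whatever the value of γ, the effect at C = EC50 is E0 plus half of Emax. A larger γ only steepens the transition around that point.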

However, sufficiently high concentrations often cannot be achieved to estimate Emax, and the model can then be simplified so that fewer parameters are estimated. When C ≪ EC50, the Emax model collapses to a linear model (γ = 1) or a power function (γ ≠ 1) with the coefficient Slope, as shown below:

$$ E(t)={E}_0+\mathrm{Slope}\ast C{(t)}^{\gamma} $$
(2.14)
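
The coefficient Slope in Eq. 2.14 can be related to the parameters of Eq. 2.13. As a brief sketch: when the concentration is far below EC50, the concentration term in the denominator of Eq. 2.13 is negligible next to the EC50 term, so

$$ E(t)\approx {E}_0+\frac{E_{max}}{EC_{50}^{\gamma}}\ast C{(t)}^{\gamma},\kern0.5em \mathrm{Slope}=\frac{E_{max}}{EC_{50}^{\gamma}} $$

In this regime only the ratio that defines Slope can be estimated from the data, not Emax and EC50 separately, which is why the simplified model has fewer parameters.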

The underlying E0 is not always constant over the study period. For example, the effect variable may vary because of an underlying disease, such as glucose fluctuations in diabetes, or because of a diurnal rhythm, as in blood pressure.

8 Conclusion

Translational science in bioinformatics and drug discovery provides a powerful approach, especially when used as one tool within an armamentarium for discovering new targets, drug leads, and novel approaches to diagnosis and treatment for the betterment of society. Genomic technologies and NGS methods have proven to be the keystone of advanced research. Identifying the role of genes in diseases and disorders makes it possible to design personalized medicine approaches in which a single gene or a few genes are targeted or serve as biomarkers in diagnosis and treatment. Data from public domain chemical libraries, selected for an appropriate target by structure-based and ligand-based discovery, can yield very promising leads that may continue to clinical trials. Simulation of the pharmacokinetic and pharmacodynamic behavior of a chemical compound helps estimate concentrations and doses computationally, which can significantly reduce the effects of excessive concentration and dosing. As bioinformatics develops further, genomics, proteomics, drug discovery, and computational power are expected to keep advancing rapidly, with new therapeutic applications, so that new targets and leads may be brought to the marketplace more rapidly each year.