1 Introduction

The human genome encodes approximately 568 protein kinases and 156 protein phosphatases, which play an important role in indispensable biological processes such as differentiation, proliferation and apoptosis. Protein kinases are activated or deactivated in different ways, such as binding of activator or inhibitor proteins to the kinase itself, autophosphorylation, or dimerization-induced cis-phosphorylation. Under physiological conditions, their expression and activation are precisely regulated inside cells. However, like pulling the strings of an intricately woven net, deregulation of kinase activity changes the spatio-temporal landscape of gene expression, which in turn leads to various disease conditions, including the development of tumors. With advances in science and instrumentation, the mechanistic defects that cause disease are now much better understood, which has created immense scope for the development of new drugs. Enormous sums of money are spent every year on realizing these remedies, but much of the effort is wasted when nine out of ten drugs fail between phase 1 trials and regulatory approval, creating a huge gap between demand and delivery. Computational technology could be explored to fill this gap. In the past few years, Artificial Intelligence (AI) has become a pertinent topic in drug discovery, with the aim of reducing time, research expenses and failure rates in clinical trials. The availability of large life-science datasets and the rapid evolution of machine learning (ML) algorithms have led many AI based companies to focus on drug discovery [1]. AI has a wide range of applications in the medical sector, from hospitals to clinical research, to cut costs and improve patient outcomes. Many pharmaceutical companies have been using AI in the development of drugs. Figure 1 shows the market share of the top ten companies across the world [141].
The global drug discovery market is segmented into four regions: North America, Europe, Asia-Pacific countries (APAC), and Rest of the World [142]. North America, comprising the US, Canada and Mexico, is the fastest growing AI based drug discovery market, with the US as the major contributor. Figure 2 shows the percentage contribution of the four regions to AI aided drug discovery. Applying AI together with molecular dynamics simulations could automate and speed up the drug discovery process.

Fig. 1 Countries with the largest pharmaceutical market share in the world

Fig. 2 Percentage contributions of the four regions to AI based drug discovery

1.1 Motivation of the Study

Protein kinases play a vital role in cellular activation processes. Phosphorylation of protein kinases is a critical process that regulates different cellular activities, including the cell cycle, growth, motility, proliferation, differentiation and apoptosis. Recent advances in our understanding of the fundamental mechanisms of cell signalling have shown that deregulation of kinase activity leads to oncogenesis. Identification and characterization of new diseases and their causative defects have created huge scope for the development of new drugs for therapeutic intervention. Traditional medicinal pipelines are time consuming and costly, and alone cannot close the gap between demand and delivery; AI methods have therefore been brought in to address this problem. The advent of AI in drug discovery and development has substantially lowered the time, and hence the cost, required to bring a new drug to the market.

1.2 Contribution of the Study

In this review, we systematically survey recent research trends in AI based drug discovery, covering the application of AI in the different phases of drug design and development: i) identification of target, ii) drug screening for hit identification, iii) lead identification, iv) clinical trial, and v) drug repurposing. First, the article discusses different AI techniques used to extract patterns for the identification of drug targets that are difficult for humans to identify through traditional methods alone. Second, it discusses the application of AI to virtually screen targets against millions of compounds, which significantly reduces the cost and time required in wet labs by reducing the number of compounds that must be screened experimentally. Third, it covers the generation and assessment of optimized structures by AI models for lead optimization. Fourth, AI applications in clinical trials and drug repurposing that have shown promising results are discussed. Different AI based tools available online for predicting the 3D structure of proteins and ligand binding sites are also covered. Lastly, we discuss the critical issues and limitations associated with each stage of the drug discovery process, along with future directions.

2 Drug Discovery Process

The overall process of drug discovery is time consuming, complex and dependent on numerous factors. It starts with identification of the biological target, i.e. the cause of the disease, followed by identification of the first chemical compound that shows activity against the given target; this first compound is called a ‘hit’. Hits are found by screening chemical libraries or by isolation from natural sources such as bacteria, plants and fungi [2]. The next step is to identify the lead compound, i.e. the compound that shows promising potential for development into a drug against the given target. The selected lead is further modified to enhance its specificity and potency even at lower concentrations; this process is known as lead optimization. Clinical trials are then conducted to determine the effects of the drug. The overall phases of drug discovery and development are shown in Figure 3.

Fig. 3 Overall drug discovery process

3 AI, Machine Learning and Deep Learning

AI enables a system to perform tasks that normally require human intelligence. In the process, information is acquired, rules are developed for using the information, approximate or definite conclusions are drawn, and self-correction is then performed [3]. The main advantage of AI approaches is that they learn from examples and can develop a model even if our understanding of the underlying process is limited [4]. AI has various applications in health care, including the diagnosis and treatment of many diseases. ML is the subfield of AI in which systems improve automatically with experience. ML makes predictions using computational statistics, and its methods can be classified into unsupervised, supervised and reinforcement learning. In unsupervised learning, hidden patterns in the data are extracted and used to group the data into meaningful clusters. Disease target identification by feature-based clustering can be done with unsupervised ML. In supervised learning, the model is trained on input data with associated outputs, and the resulting model is then used to predict the output for unseen input data. Classification and regression methods are used to develop predictive models from labelled data. Supervised learning algorithms can be used for disease diagnosis and for clinical and medical research [5]. A reinforcement learning system learns by interacting with its environment and using feedback from its actions and experiences. Deep learning (DL) is the subset of ML in which multi-layer neural networks learn features with little human intervention, including from unstructured and unlabelled data. AI has applications in various fields, including speech recognition [129], health care [124], gaming [125], automobiles [126], social media [128] and agriculture [127]. Figure 4 shows the application of AI in different sectors. K. Das et al. [120] proposed a model for the treatment of epilepsy based on electroencephalogram (EEG) signals. 
The proposed framework consists of three modules: the first performs feature extraction (current maxima, lower threshold and target point selection), the second performs pattern matching (segment and domain matching), and the third detects epileptic seizures from EEG signals. The accuracy and F1 score of the proposed model were reported as 92.66% and 94.86%, respectively. D. D. Chakladar et al. [123] proposed a framework for classifying the cognitive state of a user using the Filter Bank Common Spatial Pattern (FBCSP) method and a Long Short-Term Memory (LSTM) based deep ensemble. For the experiments, the EEG signals were divided into multiple equal-sized frequency bands, and features were extracted by the Common Spatial Pattern (CSP) algorithm. The proposed deep ensemble model consisted of multiple LSTM networks connected in parallel. An accuracy of 87% was obtained, and the model was able to estimate the cognitive state in a low computing environment. D. D. Chakladar et al. [122] presented a hybrid model based on bidirectional long short-term memory (BLSTM) and LSTM for classifying workload during multitasking mental activities. The “STEW” dataset, which consists of two tasks, “no task” and “simultaneous capacity (SIMKAP) based multitasking activity”, was used for experimentation. The model attained classification accuracies of 86.33% and 82.57% for “no task” and “SIMKAP-based multitasking activity”, respectively. S. Mukherjee et al. [121] proposed a DL based model for the automatic detection of four classes of diseases in plant leaves. They adopted GoogleNet to classify disease types, and an accuracy of 85.04% was reported. The literature thus shows that AI has a wide range of applications. 
AI can be employed in the drug discovery and development process to improve decision making by exploiting abundant, high-quality data, thus promoting data-driven decisions and reducing failure rates in drug discovery [4].
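The distinction between supervised and unsupervised learning described in this section can be illustrated with a minimal Python sketch. It is not taken from any of the cited studies: the two-dimensional feature vectors, the labels and the function names are hypothetical, a nearest-centroid classifier stands in for supervised learning, and k-means (Lloyd's algorithm) stands in for unsupervised clustering.

```python
from math import dist

def nearest_centroid_fit(X, y):
    """Supervised step: compute one centroid per label from labelled examples."""
    groups = {}
    for point, label in zip(X, y):
        groups.setdefault(label, []).append(point)
    return {label: tuple(sum(c) / len(pts) for c in zip(*pts))
            for label, pts in groups.items()}

def nearest_centroid_predict(centroids, point):
    """Predict the label whose centroid is closest to an unseen point."""
    return min(centroids, key=lambda label: dist(centroids[label], point))

def k_means(X, initial_centroids, iterations=10):
    """Unsupervised step: group unlabelled points around k centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in X:
            idx = min(range(len(centroids)),
                      key=lambda i: dist(centroids[i], point))
            clusters[idx].append(point)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [tuple(sum(c) / len(pts) for c in zip(*pts)) if pts
                     else centroids[i]
                     for i, pts in enumerate(clusters)]
    return centroids, clusters

# Toy, made-up feature vectors (not real biomedical data).
X = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
y = ["healthy", "healthy", "healthy", "disease", "disease", "disease"]

model = nearest_centroid_fit(X, y)                      # uses the labels
prediction = nearest_centroid_predict(model, (4.5, 4.7))
centroids, clusters = k_means(X, initial_centroids=[X[0], X[3]])  # ignores labels
```

The supervised model needs the labels in `y` to be trained, whereas `k_means` recovers the two groups from the feature vectors alone; in practice the studies cited above use far richer models and data.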

Fig. 4 Application of AI in different sectors

4 AI in Drug Discovery Process

4.1 Application of AI in Target Disease Gene Prediction

AI has gained recognition in the past few years and has been successfully applied in various stages of drug discovery. The Human Genome Project was completed in 2003; since then, numerous updates of the draft assemblies have been made available to academia and industry for understanding the human genome and the genomic factors associated with disease. The 1000 Genomes Project and the HapMap project have also provided a wealth of information on alternate loci and alternate genotypes (i.e. insertions, deletions, substitutions, etc.). Screening of clinical cases and comparison with healthy controls led to the identification of risk alleles (i.e. an alternate genotype at a certain locus with high correlation with the disease phenotype). Massive genome wide association studies (GWAS) further identified the causative genotypes from the correlated milieu, subject to the tradeoff between stringency and sensitivity in the statistical methods used. Interestingly, after the development and clinical validation of drugs, variable responses among patients prompted genotyping, which led to the identification of certain germline variants and somatic mutations. Structural and biophysical investigations of the target proteins and drug molecules further provided atomistic explanations for the variable responses, including the identification of nonsynonymous variations that cause sensitivity or resistance to therapy per se. ML methods have been used in the identification of various disease targets. ML classifiers can be trained on vast genetic data, usually in excess of gigabytes, and on Gene Ontology annotations for predicting disease-gene associations [6]. 
Decision tree based classifiers have been trained on morbid and druggable gene datasets, using network attributes such as metabolic and transcriptional interactions, protein–protein interactions, tissue expression and sub-cellular localization, to predict genes that are associated with morbidity and may be druggable. Based on these applications, researchers have concluded that plasma membrane localization and transcription factors are vital factors for druggability and morbidity. In a recent study, a Random Forest (RF) based model outperformed other ML classification methods, such as Naive Bayes (NB) and linear and radial support vector machines (SVM), with an accuracy of 80% for the prediction of Autism Spectrum Disorder (ASD) genes [7]. In 2019, a manifold learning-based method [8] was proposed, which assumes that the distance between a gene and its associated disease is shorter than that between non-associated disease-gene pairs. The resulting model was able to identify new disease-gene associations when applied to lung cancer and bladder cancer. A deep neural network based technique has also been developed to identify associations between infectious diseases and host genes using sequence and protein interaction network features [9]. Of the 100 highest-ranked infectious disease-gene associations, 73 were verified experimentally. A model was developed to track the changes that occur in human muscle with age [10]. Several supervised ML methods were compared, of which linear-kernel SVM and deep feature selection models were best suited for the identification of ageing biomarkers. The authors concluded that ageing biomarkers could be used for anti-ageing therapy. E. Ferrero et al. [11] trained different ML classifiers, such as SVM, RF, gradient boosting machine and neural network, to explore disease-gene associations. The neural network based classifier was reported to achieve an accuracy of > 71%. 
They used the developed model to predict 1431 novel targets. Jeon et al. [12] proposed an SVM based model to classify cancer drug targets and non-cancer drug targets for pancreatic cancer (PaCa), ovarian cancer (OvCa) and breast cancer (BrCa). Thirteen biological and network features were identified, from which the relevant ones were selected using the SVM-recursive feature elimination method; the key features thus obtained were mRNA expression, gene essentiality, somatic mutation pattern, DNA copy number and protein–protein interaction. In total, 257 antibody targets were identified, of which 30 affected all three cancer types, whereas 88, 28 and 53 were specific to PaCa, BrCa and OvCa, respectively. They also identified 345 peptide targets. DGLinker [135] is a web server for predicting candidate genes for human diseases. It provides a user-friendly interface, draws on biomedical information from various biological and phenotypic databases, and uses ML based techniques to predict new disease-associated genes. Over the past few decades, ML based algorithms have been applied in bioinformatics for the prediction of disease-gene associations [136, 137].
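The assumption behind distance-based disease-gene prediction, namely that a gene lies closer to its associated disease than to unrelated diseases in some learned space, can be sketched in a few lines of Python. This is only an illustration of that assumption, not the manifold learning method of [8]: the embedding coordinates, gene names and function name below are hypothetical, and in a real setting the coordinates would be produced by the embedding model.

```python
from math import dist

def rank_candidate_genes(disease_vector, gene_vectors):
    """Return gene names sorted from nearest to farthest from the disease."""
    return sorted(gene_vectors,
                  key=lambda gene: dist(gene_vectors[gene], disease_vector))

# Hypothetical 2D embedding coordinates (placeholders, not real data).
disease = (0.0, 0.0)
genes = {
    "GENE_A": (0.1, 0.2),
    "GENE_B": (2.0, 1.5),
    "GENE_C": (0.5, 0.4),
}

# Genes nearest to the disease are the strongest candidates under the assumption.
ranking = rank_candidate_genes(disease, genes)
```

Candidate genes at the top of `ranking` would then be prioritized for experimental validation.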

4.1.1 Critical Issues and Future Directions

Different databases from different sources are available for target identification. The major issue is managing the heterogeneity among these databases: their data are collected under different experimental conditions and recorded in dissimilar formats. To address this problem, integrated or curated databases have been created, including DisGeNet [109], Therapeutic Target Database (TTD) [110], STRING [111], LinkedOmics [112], Open-Target platform [113], DepMap portal [114], HMDD [115] and the Comparative Toxicogenomics Database (CTD) [116]. Curated databases have major limitations, such as a lack of validation for target-disease associations. Some cite a number of publications as supporting evidence but lack a direct correlation with the potency of target modification. Another limitation is that they lack target druggability information. Target druggability software, including TractaViewer [117] and Drug Targetor [118], has been developed to assess molecular ligandability and potential safety risks. Moreover, curated databases lack programmatic accessibility; improving programmatic access would accelerate and increase their usage.

4.2 Application of AI in Drug Screening

To reduce R&D expenditure in drug discovery, AI techniques can be explored for the identification of target-specific molecules, which involves screening large compound libraries. Selecting drug candidates with desirable properties for particular targets is an important step in the drug discovery process. The physical and chemical properties of a compound must be considered, as they substantially affect the bioavailability, toxicity and bioactivity of drug molecules. A virtual ligand-based or structure-based design approach is applied to the available data for profile selection, which substantially reduces the final size of the compound library for in-vitro validation.

4.2.1 AI in Predicting Physical Properties

Physical properties greatly affect the biological properties of a drug by modifying its solubility, stability, protein binding and absorption. A drug’s physical properties, i.e. hydrophobicity, pKa and solubility, affect its bioavailability. The concept of inverse design starts from the desired properties of the compound and then searches for probable molecules. Generative ML models have been employed [13] to learn the joint probability distribution of molecular representation and physical properties in order to perform inverse design. AI based tools represent molecules using molecular fingerprints, Coulomb matrices, potential energy functions, bags of molecular bonds and fragments, electron densities, atom and bond weighted graphs, 3D atomic charge associations and SMILES strings.
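To make the idea of a molecular representation concrete, the following Python sketch turns SMILES strings into fixed-length count vectors that an ML model could consume. It is a simplified illustration under stated assumptions: the token pattern covers only a small part of the SMILES grammar, the function names are hypothetical, and production work would normally rely on a cheminformatics toolkit rather than a hand-written tokenizer.

```python
import re
from collections import Counter

# Simplified token pattern (an assumption for illustration): bracket atoms,
# two-letter halogens, single-letter atoms, ring-closure digits and common
# bond/branch symbols. It does not implement the full SMILES grammar.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z]|\d|[=#()+\-/\\@.%]")

def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, rejecting unsupported characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in SMILES: {smiles!r}")
    return tokens

def bag_of_tokens(smiles, vocabulary):
    """Count how often each vocabulary token occurs in the SMILES string."""
    counts = Counter(tokenize_smiles(smiles))
    return [counts[token] for token in vocabulary]

molecules = {"ethanol": "CCO", "acetyl chloride": "CC(=O)Cl"}

# Shared vocabulary so that every molecule maps to a vector of equal length.
vocabulary = sorted({t for s in molecules.values() for t in tokenize_smiles(s)})
vectors = {name: bag_of_tokens(s, vocabulary) for name, s in molecules.items()}
```

A bag-of-tokens vector discards atom order and connectivity, which is why the graph-based and fingerprint representations listed above are generally preferred when structure matters.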

4.2.2 AI in Predicting Bioactivity

Traditional ML techniques, namely gradient boosting machines (GBMs) [14], deep neural networks (DNNs) [15] and RF [16], have been applied to interpolate the transformations made to drug compounds by retrosynthesis. In recent years, matched molecular pair (MMP) analysis, i.e. analysis of the impact on bioactivity and molecular properties of introducing a single chemical transformation into a drug compound [17], has been widely used in de novo design. DNNs coupled with MMP perform better than GBMs and RF in predicting the bioactivity of compounds [18]. With the availability of large datasets in the public domain, ML together with MMP has been used to predict bioactivity-related properties, namely absorption, distribution, metabolism and excretion (ADME) [19], intrinsic clearance [20], oral exposure [21] and mode of action.

4.2.3 AI in Predicting Toxicity

Drug toxicity and side effects are an important issue in the regulatory clearance of drugs. Traditional in vitro and in vivo tests are performed to scrutinize drug safety. The “organ on a chip”, an in vitro model [22], has been developed in recent years to reduce cost, but this approach is still time-consuming and costly. Computational methods have shown considerable advantages over experimental methods: they are fast and inexpensive, are reported to be accurate, and can be applied before compounds are synthesized [23]. In recent years, various ML based methods, including probabilistic neural networks [24], SVM [25] and NB [26], have been used to predict the chemical carcinogenicity of compounds. DeepTox [27] is a DL based tool for predicting toxicity. The tool first normalizes the chemical representation of the compound; chemical descriptors are then computed and used as the input to ML methods. The descriptors are classified into two categories: static and dynamic. Static descriptors include surface areas, atom counts and the absence or presence of a predefined substructure in a compound, whereas dynamic descriptors are calculated from the chemical structure, and a potentially unlimited number of distinct dynamic features can be generated. The DeepTox algorithm predicts the toxicity of compounds with good accuracy. S. Jain et al. [138] developed a model using RF, deep neural network, and conventional and graph convolutional neural network approaches to predict the toxicity of small molecules, using the ChemIDplus dataset of > 80,000 compounds with measurements against 59 acute toxicity endpoints. They were able to predict 36 of the 59 endpoints. Some of the tools used for predicting toxicity, chemical synthesis and molecular properties of compounds are listed in Table 1.

Table 1 Tools for predicting toxicity, chemical synthesis and molecular properties

4.2.4 Critical Issues and Future Directions

The major issue is the quality and quantity of data available for drug screening. Only a small amount of data is available, and it is dispersed across the literature and often ambiguous. Curated databases (e.g. MoleculeNet [133]) can help solve this problem. Transfer learning can also be explored in this area to obtain effective results. ML based models rely on intrinsic features and have low interpretability [134]; this can be addressed by building data driven feature generation models. Another challenge in ligand based property prediction is the activity cliff, i.e. compounds with similar structures that exhibit markedly different properties. Solving this problem requires information beyond the structure of the compound, which is quite challenging.
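The notion of an activity cliff can be made operational with a small Python sketch: flag pairs of compounds whose structural similarity is high but whose measured activities differ strongly. The sketch assumes that fingerprints are given as sets of "on" bit indices and that activity is expressed on a logarithmic scale such as pIC50; the compound names, fingerprints, activity values and both thresholds are hypothetical choices made for illustration.

```python
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0

def find_activity_cliffs(compounds, min_similarity=0.8, min_activity_gap=2.0):
    """Return pairs that are structurally similar yet differ strongly in activity."""
    cliffs = []
    for (name_1, (fp_1, act_1)), (name_2, (fp_2, act_2)) in combinations(
            compounds.items(), 2):
        similar = tanimoto(fp_1, fp_2) >= min_similarity
        large_gap = abs(act_1 - act_2) >= min_activity_gap
        if similar and large_gap:
            cliffs.append((name_1, name_2))
    return cliffs

# Hypothetical compounds: (fingerprint bits, activity on a pIC50-like scale).
compounds = {
    "cmpd_1": ({1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, 8.5),
    "cmpd_2": ({1, 2, 3, 4, 5, 6, 7, 8, 9, 11}, 5.9),   # near-identical, weaker
    "cmpd_3": ({20, 21, 22, 23}, 6.0),                  # structurally unrelated
}

cliffs = find_activity_cliffs(compounds)
```

Here `cmpd_1` and `cmpd_2` share 9 of 11 distinct bits (similarity of about 0.82) but differ by 2.6 log units, so they are reported as a cliff; a model that sees only the fingerprints has little basis for separating them, which is the difficulty described above.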

4.3 Application of AI in Lead Optimization

Identifying drug-like molecules for specific targets involves extensive virtual screening of compound libraries. Once a drug candidate is identified, it is further refined or modified to make it more target specific and effective. This involves a two-step process: the first step is retrosynthesis, i.e. recursively transforming the drug molecule into smaller fragments, and the second step is finding the organic reactions that will transform the fragments into the target compound. Finding a suitable organic reaction is cumbersome, as it requires scanning a large number of reactions. AI techniques can be explored to pick the most feasible reaction. Previously, expert systems were used for reaction prediction and retrosynthesis, but these techniques were not widely adopted by chemists because the algorithms required human intervention, as the datasets used lacked molecular context knowledge. A deep neural network based model [28] has been reported to resolve the reaction conflicts that occur in earlier rule-based systems; the model was trained on 3.5 million reactions and attained an accuracy of 95% (top-10) in retrosynthesis and 97% in reaction prediction. AI and Monte Carlo tree search (MCTS) [29] have been combined for the synthesis planning of organic molecules by retrosynthesis. The performance of MCTS, neural best-first search (BFS) and heuristic BFS was studied for 497 different molecules; MCTS solved 92% of the test set, whereas neural BFS and heuristic BFS solved 71% and 4%, respectively. The search problem in retrosynthesis is complex, and deep reinforcement learning [30] has been used to identify reactions at each step of the retrosynthesis. Neural networks have been trained to estimate the cost of a molecule on the basis of its molecular structure. 
AI techniques can be trained on available datasets to predict the probability of selecting a reaction for the transformation of molecules from one stage to another, linking each transformation with its predecessor and also considering the yield and cost of the transformation. Auto in-silico ligand directing evolution (AILDE) [139] has been developed for lead optimisation. In this framework, compound libraries were constructed, molecular dynamics simulation was used for conformational sampling, and fragment growing was used for ligand modification. However, the method assumes that the binding mode does not change, and AILDE did not perform well in the case of activity cliffs. AI can be employed for the prediction of optimal and feasible retrosynthesis routes for drug molecules.
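Retrosynthesis as a search problem can be sketched with a toy best-first search in Python. This is not the MCTS or neural-guided search of [29]: the priority used here is simply the accumulated cost of the steps taken so far, whereas the published methods rank expansions with learned policies. The molecule names, the rule table, the step costs and the set of purchasable building blocks are all hypothetical.

```python
import heapq
from itertools import count

def best_first_retrosynthesis(target, rules, building_blocks, max_expansions=1000):
    """Search for the cheapest route that reduces the target to building blocks.

    rules maps a molecule to a list of (precursors, step_cost) options.
    Returns (total_cost, route) or None if no route is found.
    """
    tie_breaker = count()  # keeps heap entries comparable when costs are equal
    frontier = [(0.0, next(tie_breaker), (target,), [])]
    expansions = 0
    while frontier and expansions < max_expansions:
        cost, _, open_molecules, route = heapq.heappop(frontier)
        pending = [m for m in open_molecules if m not in building_blocks]
        if not pending:
            return cost, route  # every remaining molecule is purchasable
        expansions += 1
        molecule = pending[0]
        rest = list(open_molecules)
        rest.remove(molecule)  # remove one occurrence of the expanded molecule
        for precursors, step_cost in rules.get(molecule, []):
            heapq.heappush(frontier, (
                cost + step_cost,
                next(tie_breaker),
                tuple(rest) + tuple(precursors),
                route + [(molecule, tuple(precursors))],
            ))
    return None

# Hypothetical disconnection rules: product -> [(precursors, cost of the step)].
rules = {
    "target": [(("A", "B"), 1.0), (("C",), 5.0)],
    "A": [(("bb1", "bb2"), 1.0)],
    "B": [(("bb3",), 2.0)],
}
building_blocks = {"bb1", "bb2", "bb3", "C"}

result = best_first_retrosynthesis("target", rules, building_blocks)
```

With these toy values the three-step route through `A` and `B` (total cost 4.0) is preferred over the single expensive step through `C` (cost 5.0), which mirrors how yield and cost of each transformation can steer route selection.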

4.3.1 Critical Issues and Future Directions

Although the emergence of AI based methods in retrosynthesis looks promising, there are critical issues associated with these methods. The first issue is that a mostly similarity based approach is used to predict the next step in a reaction from existing reaction knowledge, so the automation of retrosynthesis rests on an empirical approach. Because the model is restricted to operating and making suggestions on the basis of the data provided to it, its suggestions carry significant uncertainty when it extrapolates outside its training data. The second issue is the lack of high quality data. The major gaps in the data concern the kinetics of the reactions and the order of catalysts and reagents. There is a need to develop standardized metrics and high-quality data sets with common benchmarks to obtain favourable results from AI based models.

4.4 Application of AI in Clinical Trial

It has been reported that bringing a single drug to market costs about 1.5–2.0 billion USD [31]. Clinical trials take about half of the time of the whole drug development process. A failed trial wastes not only time but also the money spent in the preclinical phases of drug development. The success of a clinical trial depends on various factors, including recognition of the disease, identification of the target and determination of the effect of the drug molecule in the patient. The ability of AI techniques to automatically identify patterns in large datasets could be explored to reduce the time required in clinical trials. An AI platform (AiCure) has been used on mobile devices to measure medication adherence in a phase II trial in patients with schizophrenia [32]. In a comparison of the AI platform with modified directly observed therapy (mDOT), the mean cumulative adherence was 89.7% for the AI platform and 71.9% for mDOT among patients receiving ABT-126.

4.4.1 Critical Issues and Future Directions

The major issue in applying AI to clinical trials is the need for a high volume of labelled data for training models. Another issue is the need to regulate the relevant ethical issues (patient privacy, data security, confidentiality) when healthcare data are used for AI. To explore the numerous steps of clinical trials effectively, data scientists and medical scientists need to work together. Data should be collected in such a way that they capture the correlation between trial design features and trial performance.

4.5 Application of AI in Drug Repurposing

Drug repurposing is the process in which the reuse of existing drugs is explored and implemented for new medical therapies. The advantage of drug repurposing is that already approved drugs can skip phase I clinical trials and toxicity testing, which reduces development time and risk. A deep neural network (DNN) model [33] has been proposed that uses transcriptional data to classify the therapeutic categories of different drugs. The data set consisted of 433, 454 and 308 drugs for the PC3, MCF7 and A549 cell lines, respectively. The proposed model was able to classify drugs based on their toxicity and therapeutic use. Li et al. [34] suggested a DL based drug repurposing approach based on chemical structure and transcriptome expression data. They reported the repurposing of pimozide, used in the treatment of Tourette’s disorder, for the treatment of non-small cell lung cancer. Zeng et al. [35] proposed deepDR, a DL based approach for drug repositioning that predicts new drug-disease associations. The model used a heterogeneous set of networks: seven drug-drug networks and one network each for drug-side-effect, drug-disease and drug-target associations. deepDR first constructed positive pointwise mutual information (PPMI) matrices for each network; features were then extracted by a multimodal deep autoencoder and finally used by a collective variational autoencoder (cVAE) to predict drug-disease associations. The accuracy obtained by deepDR for predicting drug-disease associations was 82.6%. A deep generative adversarial autoencoder (AAE) and a variational autoencoder (VAE) were implemented and compared by Kadurin et al. [36] for the identification of molecules that had known anticancer properties. 
The performance of the AAE and VAE was compared in three experiments. In the first, the models were compared by reconstruction error, and the AAE was found to be better, with a reconstruction error of 9.52 compared with 14.60 for the VAE. In the second, the VAE and AAE were compared by their ability to generate molecular vectors, and the VAE performed better than the AAE in terms of coverage. In the third, the two models were compared in terms of feature extraction, where both performed well. Drug-disease associations can be used for drug repurposing. Zhang et al. [37] represented drug-disease associations as bipartite networks. They proposed a similarity-based inference method (NTSIM) for predicting unknown drug-disease associations and a similarity-based classification method (NTSIM-C) for classifying therapeutic associations. Moghadam et al. [38] presented the scored mean kernel fusion (SMKF) method to predict drug candidates by considering six features: drug chemical structure, drug side effects, the drug’s receptor phenotype, protein–protein interactions, drug sequence alignment with the receptor protein, and disease phenotype. The model was developed to assess the effect of disease and drug features on the prediction of drug-disease associations.
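The PPMI matrices mentioned for deepDR follow the standard definition PPMI(i, j) = max(0, log(p(i, j) / (p(i) p(j)))), where the probabilities are estimated from a matrix of co-occurrence counts. The Python sketch below applies this definition to a small count matrix. It is a generic illustration rather than the deepDR implementation: how the co-occurrence counts are derived from each network is not covered here, and the toy matrices and the function name are hypothetical.

```python
from math import log

def ppmi_matrix(counts):
    """Positive pointwise mutual information for a non-negative count matrix."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, c in enumerate(row):
            if c == 0:
                out_row.append(0.0)  # log(0) is undefined; PPMI is set to 0
            else:
                # p(i,j) / (p(i) p(j)) simplifies to c * total / (row * col)
                pmi = log(c * total / (row_sums[i] * col_sums[j]))
                out_row.append(max(0.0, pmi))  # negative associations clipped
        result.append(out_row)
    return result

# Toy co-occurrence counts (hypothetical): two nodes that only co-occur with
# themselves, and a matrix in which all co-occurrences are equally likely.
diagonal = ppmi_matrix([[2, 0], [0, 2]])
uniform = ppmi_matrix([[1, 1], [1, 1]])
```

In the first matrix the diagonal entries equal log 2, reflecting co-occurrence twice as often as expected by chance, while in the uniform matrix every entry is 0 because no pair co-occurs more often than chance.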

4.5.1 Critical Issues and Future Directions

A major issue in drug repurposing is intellectual property. Legal issues related to patenting new medical uses for existing drugs impede drug repurposing. Electronic health records (EHR) have been used to overcome limitations in data availability for drug repurposing. There is a need to advance technology for the integration and extraction of heterogeneous large scale data. From an intellectual property perspective, options including patent pools and open licensing should be explored for rare and neglected diseases to enhance drug repurposing.

5 AI in Predicting the 3D Structure of Proteins and Protein–Ligand Binding Sites

5.1 AI in Predicting the 3D Structure of Protein

Proteins are complex macromolecules in our cells that regulate physiology. Knowing the 3D structure of a protein from its sequence is vital for drug discovery, as it helps to determine the protein's function, topology and druggable pockets, and to find drug molecules that will bind to it. Determining protein structure experimentally is a complex, time consuming and tedious task. AI based techniques are implemented in this area to increase the accuracy and efficiency of structure prediction. AlphaFold, an AI based system [39], combines three neural networks for protein structure prediction based on predicted distances between residue pairs; candidate structures were assessed by the Global Distance Test (GDT_TS), and the protein structure was then generated. The system was able to predict structures with high accuracy because it was based on the distances between residues and the angles between peptide bonds, and it learned probability distributions over these quantities to generate the protein structure. FoldRec [140] is a model for protein fold recognition that incorporates the interactions among proteins. In this model, recognition is performed by combining a cluster-to-cluster model with a protein similarity network. Some of the tools for 3D structure prediction of proteins are listed in Table 2.
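The GDT_TS score mentioned above is conventionally defined as the average, over distance cutoffs of 1, 2, 4 and 8 Å, of the percentage of residues whose predicted position lies within the cutoff of the reference structure. The Python sketch below computes this average from a list of per-residue distances. It makes one simplifying assumption that should be stated clearly: the distances are taken from a single, already computed superposition, whereas the full GDT procedure searches over superpositions to maximize the fraction at each cutoff. The example distances and the function name are hypothetical.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average percentage of residues within each distance cutoff (in angstrom).

    distances: per-residue deviations between predicted and reference
    positions for one fixed superposition (a simplification of real GDT).
    """
    if not distances:
        raise ValueError("at least one residue distance is required")
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)

# Hypothetical deviations for a four-residue toy example, in angstrom.
score = gdt_ts([0.5, 1.5, 3.0, 9.0])
```

For the toy distances, 25%, 50%, 75% and 75% of residues fall within the four cutoffs, giving a score of 56.25; a perfect prediction would score 100.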

Table 2 Tools for predicting 3D structure of proteins

5.2 AI in Protein–Ligand Binding Site Prediction

Oncoproteins and other disease marker proteins with aberrant activity must bind other biomolecules or ions through specific interactions to carry out their functions. These biomolecules or ions are called ligands, and the specific positions or key amino acid residues where a ligand binds are called ligand binding sites (LBSs). Identifying LBSs helps to explore the mechanisms behind the pathogenesis of diseases and thus supports drug design and development. With advances in computer technology in recent years, AI algorithms have been used not only for ligand binding site prediction but also for binding affinity prediction. Identifying protein–ligand interactions has an extensive impact on drug discovery, as it helps both to identify lead hits and to reposition drugs. The emergence of large collections of protein–ligand complexes complemented by binding data, such as PDBbind and BindingMOAD, has created new opportunities for parameterizing and evaluating scoring functions. With such large data collections, it becomes feasible to fit scoring functions in a QSAR style, i.e., by defining protein–ligand interaction descriptors and analyzing them with modern ML methods [40]. Some of the ML and DL based LBS prediction and protein–ligand prediction methods are listed in Table 3.
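As an illustration of what "defining protein–ligand interaction descriptors" can look like in practice, the sketch below counts protein–ligand element pairs that lie within a distance cutoff, in the spirit of contact-count scoring functions. The atom representation, the cutoff value, the list of pair types and the toy coordinates are all assumptions made for this example; a real pipeline would parse atoms from structure files and use a much larger set of atom types.

```python
import math
from collections import Counter


def contact_descriptors(protein_atoms, ligand_atoms, cutoff=12.0):
    """Count protein-ligand element pairs closer than `cutoff` angstroms.

    Each atom is given as (element, (x, y, z)). This representation is an
    assumption of the sketch, not a standard file format.
    """
    counts = Counter()
    for p_elem, p_xyz in protein_atoms:
        for l_elem, l_xyz in ligand_atoms:
            if math.dist(p_xyz, l_xyz) < cutoff:
                counts[(p_elem, l_elem)] += 1
    return counts


def to_feature_vector(counts, pair_types):
    """Arrange the counts in a fixed order so every complex yields a vector
    of the same length, as an ML model requires."""
    return [counts.get(pair, 0) for pair in pair_types]


# Hypothetical toy complex: three protein atoms and two ligand atoms.
protein = [("C", (0.0, 0.0, 0.0)), ("N", (4.0, 0.0, 0.0)), ("O", (20.0, 0.0, 0.0))]
ligand = [("C", (1.0, 0.0, 0.0)), ("O", (3.0, 0.0, 0.0))]
pair_types = [("C", "C"), ("C", "O"), ("N", "C"), ("N", "O"), ("O", "C"), ("O", "O")]

counts = contact_descriptors(protein, ligand, cutoff=5.0)
features = to_feature_vector(counts, pair_types)
print(features)
```

Feature vectors of this kind, paired with experimentally measured affinities from a collection such as PDBbind, would form the training table on which a regression model is fitted.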

Table 3 Machine learning and deep learning based LBS prediction and protein–ligand prediction methods

Protein–ligand binding site prediction is a dichotomous (binary) classification problem with class imbalance, because binding residues are far fewer than non-binding residues. Many ML algorithms have been implemented to predict protein–ligand binding sites, including linear regression, Support Vector Machine (SVM), Naïve Bayes classifier, Random Forest (RF) and k-nearest neighbours (KNN); some of them are illustrated in Fig. 5.

Fig. 5 Diagrammatic representation of various artificial intelligence techniques used in the field of drug discovery and development: a Random Forest, b neural network, c convolutional neural network, d KNN and e Support Vector Machine

Linear regression is simple to implement, but its accuracy is poor because of under-fitting. The Naïve Bayes classifier is a simple, fast and effective algorithm for classification problems, but it requires prior probabilities and is not effective for data with correlations between samples. Although the KNN algorithm is quick, simple and cheap to train, it performs poorly on class-imbalanced problems. The RF algorithm is built on decision trees, which perform poorly under class imbalance. The Support Vector Machine (SVM) has excellent classification accuracy, high generalization ability and an exceptional ability to classify high-dimensional data with small sample sizes; it has recently been used for LBS prediction and protein–ligand interaction prediction. Some of the published research using SVM is discussed below.

In 2009, Chauhan et al. [45] developed the ATPint web server to identify ATP-binding residues in proteins. Two SVM-based models were developed: the first uses the primary sequence of the protein, and the second uses a position-specific scoring matrix (PSSM) generated by PSI-BLAST. The first model attained a maximum Matthews correlation coefficient (MCC) of 0.33 with an accuracy of 66.25%, while the second achieved a better MCC of 0.5. In 2011, Si et al. [51] developed the SVM-based MetaDBSite server, which predicts protein–DNA binding residues from sequence information. MetaDBSite integrates the results of six predictive tools: BindN-rf [82], DNABindR [83], BindN [84], DISIS [85], DBS-PRED [86] and DP-Bind [87]. The inputs of the SVM model are the results of DNABindR, BindN, DISIS and BindN-rf, while DBS-PRED and DP-Bind provide auxiliary parameters. MetaDBSite outperforms every single prediction model; on the test set it reported accuracy (ACC), specificity (Spe) and sensitivity (Sen) of 0.77 and an MCC of 0.32. In 2012, Chen et al. [52] proposed NsitePred, an SVM-based model for predicting binding residues for ATP, ADP, AMP, GTP and GDP from protein sequences. First, secondary structure, dihedral angles and relative solvent accessibility are extracted; then the PSSM profile of the given protein sequence is computed, and a feature vector describing each residue is generated. NsitePred performs better than ATPint [45] and GTPbinder [98]. In 2013, Yang Zhang et al. [55] presented COACH, an SVM-based method that combines the sequence-based and template-based methods S-SITE and TM-SITE with the prediction results of three further methods, FINDSITE [99], ConCavity [100] and COFACTOR [101], to train the SVM. COACH performs better than classical prediction algorithms, with MCC = 0.54 and Pre = 0.59.
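The studies above report ACC, Sen, Spe, Pre and MCC. The sketch below shows how these measures follow from the four confusion-matrix counts, and why MCC is often preferred to accuracy for class-imbalanced binding-site data. The counts in the example are invented for illustration, and returning 0 when a denominator is zero is a common convention that is assumed here.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts.

    Assumption: when a denominator is zero, the metric is reported as 0.0.
    """
    total = tp + fp + tn + fn
    acc = (tp + tn) / total
    sen = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity (recall)
    spe = tn / (tn + fp) if (tn + fp) else 0.0  # specificity
    pre = tp / (tp + fp) if (tp + fp) else 0.0  # precision
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sen": sen, "Spe": spe, "Pre": pre, "MCC": mcc}


# Hypothetical protein set with 1000 residues, 50 of which bind the ligand.
# Predictor A labels every residue as non-binding.
always_negative = binary_metrics(tp=0, fp=0, tn=950, fn=50)
# Predictor B finds 30 of the 50 binding residues at the cost of 40 false alarms.
informative = binary_metrics(tp=30, fp=40, tn=910, fn=20)

for name, result in (("A", always_negative), ("B", informative)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

Predictor A reaches an accuracy of 0.95 without identifying a single binding residue, and its MCC is 0. Predictor B has a slightly lower accuracy of 0.94 but a clearly positive MCC, which reflects that it actually separates the two classes.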

Since the early 2010s, DL has surpassed conventional ML in various fields such as speech recognition [102], image recognition [103], image segmentation [104], semantic modeling [105] and text classification [106]. DL can solve complex problems even when the dataset is very large, interconnected and unstructured, which makes it well suited to drug discovery and the medical domain, for example image-based diagnosis of diseases, prediction of the chemical activity of compounds, design of chemical structures and protein LBS prediction. In the past few years, researchers have used DL for protein ligand binding site prediction. Some of the DL based LBS prediction methods are discussed below.

Durrant et al. [49] proposed a scoring function based on a neural network trained on 4141 protein–ligand complexes from the Protein Data Bank. A neural network with one hidden layer and five neurodes was trained to determine how the training set size and the network architecture influence the accuracy and robustness of the network output. Wen et al. [68] proposed DeepDTIs, a DL-based deep belief network (DBN), for effective prediction of drug–target interactions. They tested their model on an independent test set and compared it with other algorithms: RF, Bernoulli Naïve Bayesian (BNB) and Decision Tree (DT). The dataset was taken from the DrugBank database and contained 1412 drugs, 1520 targets and 2,146,240 drug–target pairs (DTPs). Drugs and targets were represented by Extended Connectivity Fingerprints (ECFP) and protein sequence composition descriptors (PSC). The DBN performed better than BNB and DT. Testing was done on external EDTIs data extracted from the DrugBank database containing 1412 drugs, 1520 targets and 2,146,240 DTPs. The area under the ROC curve (AUC), accuracy, sensitivity and specificity of the model were 91.58%, 85.88%, 82.27% and 89.53%, respectively. KDEEP [71], a 3D convolutional neural network, predicts binding affinity and was compared with other ML approaches. The PDBbind (v.2016) dataset, containing 13,308 protein–ligand complexes and their corresponding experimentally determined binding affinities, was used. The 3D convolutional neural network was compared with RF-Score, X-Score and cyScore, and achieved a Pearson's correlation coefficient (R) of 0.82 and an RMSE of 1.27 pK units between experimental and predicted affinities. Öztürk et al. [70] considered only sequence information of both target and drug for binding affinity prediction using a DL-based model. Their CNN-based model, DeepDTA, was developed on two kinase datasets, Davis and KIBA.
The Concordance Index (CI) was used to measure the performance of the model, which was compared with approaches based on Kronecker Regularized Least Squares (KronRLS) and SimBoost. Nested cross-validation was used to select the best parameters for each test set. Lee et al. [72] proposed a drug–target interaction (DTI) prediction model based on convolutional neural networks that performs convolution on amino acid subsequences of various lengths. The model was trained on data from three databases, DrugBank, KEGG and IUPHAR, after removing duplicates. The dataset contained 11,950 compounds, 3,675 proteins and 32,568 DTIs. The negative DTI dataset necessarily had to be generated randomly; the resulting bias was reduced by building ten sets from the positive dataset. A validation dataset was created from the MATADOR dataset, excluding all DTIs observed in the training dataset. The model was evaluated on two independent test datasets from the PubChem BioAssay database and ChEMBL KinaseSARfari. Hyperparameters such as the learning rate and window sizes were tuned during cross-validation to increase the performance of the model: the learning rate was determined first, then the activation function and regularization parameters were selected, and a grid search was used to optimize the remaining neural network hyperparameters. In DeepCSeqSite, proposed by Cui et al. [75], several convolutional layers are stacked to extract hierarchical features; the convolutional kernels combine the extracted features, and softmax is used for prediction. In 2020, Zhang et al. proposed DeepBindPoc, a DL-based model developed by incorporating information about the binding pocket and its associated ligand. The model contains 16 densely connected layers outputting 100 units.
The ReLU activation function is used for the hidden layers and the sigmoid activation function for the output layer. One advantage of DNNs is that they learn high-level, abstract features from very complex data. BiteNet [80], a DL-based model, identifies binding sites by recognising spatiotemporal features. It represents the protein structure as a 3D image with channels corresponding to atomic densities. It takes about 0.1 s to analyse a single conformation and about 1.5 min to analyse an MD trajectory of 1000 frames, each containing about 2000 atoms. SSnet [130], a deep neural network-based model, predicts protein–ligand interactions using secondary structure information and the torsion of the protein backbone. SSnet was found not to be biased towards any specific conformation and was able to extract the information needed for protein–ligand interaction prediction. DEELIG [131], a DL-based model, uses spatial relationships in the data to predict the affinity and binding of proteins. Multi-PLI [132], a DL-based model, was developed to overcome the generalizability and heterogeneous-data issues that occur in structure-based models. It uses classification to determine whether a protein–ligand pair binds and regression to estimate the binding affinity. The model was also able to identify the amino acids essential for ligand binding.
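The affinity-prediction studies above are compared using the Pearson correlation coefficient (R), the RMSE and the Concordance Index (CI). The sketch below gives plain-Python definitions of the three measures. The affinity values are invented for illustration, and counting tied predictions as half-concordant is one common convention that is assumed here rather than taken from the cited papers.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted affinities."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_t * var_p)


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order.

    Pairs with equal true affinity are skipped; pairs with equal predictions
    count as half-concordant (assumed convention).
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            if y_true[i] == y_true[j]:
                continue
            comparable += 1
            true_diff = y_true[i] - y_true[j]
            pred_diff = y_pred[i] - y_pred[j]
            if pred_diff == 0:
                concordant += 0.5
            elif (true_diff > 0) == (pred_diff > 0):
                concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical measured and predicted affinities in pK units.
measured = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.9, 7.4, 7.2]

print("RMSE:", round(rmse(measured, predicted), 3))
print("R:   ", round(pearson_r(measured, predicted), 3))
print("CI:  ", round(concordance_index(measured, predicted), 3))
```

In this toy example only the last two complexes are ranked in the wrong order, so five of the six comparable pairs are concordant and the CI is 5/6. RMSE penalises the size of each error, whereas CI depends only on the ranking, which is why the two measures are usually reported together.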

5.3 Critical Issues and Future Directions

Tools have been developed for protein structure prediction, but many issues remain. The energy functions needed for prediction are approximated for computational efficiency, which makes it challenging to balance polar and non-polar interactions accurately at the interface. Only a subset of ordered water molecules can be modelled, and even this is computationally costly. Robust techniques are needed for predicting both protein structure and conformation. Analysis of the energy landscape together with molecular dynamics trajectories can be exploited to capture the dynamics of flexible systems. Binding site prediction depends on structural information about the protein; as protein structure databases grow, they will open up opportunities to improve the prediction of binding sites and functions.

6 Open Research Challenges and Opportunities in Drug Discovery

The major challenges that pharmaceutical companies face in developing a new drug are the cost and time required. AI technologies have been successfully applied in various fields, such as natural language processing, signal processing, computer vision and agriculture, and have the potential to reduce the time and cost of developing a new drug. Many researchers have shown that the future of drug discovery is very promising, as covered in our review. Still, applying AI to drug discovery is very challenging. First, drug discovery is a very complex process that requires knowledge of several fields (chemistry, biology and medicine). Second, reliability and safety are major issues in the decision-making process, as it directly affects public health. Third, high-quality data is a main concern. Data labelling in drug discovery is very complicated. Moreover, the data available is small compared with the large amount of information held in records, because open data sharing is not common among pharmaceutical companies, and the data that is available is not in a uniform format. One solution is to start an initiative to share data for the betterment of public health. To deal with the heterogeneity of the data, a "one-shot learning" algorithm has been developed by researchers at Stanford University [119]. Fourth, lead optimization is a challenging phase in developing effective drugs with the desired properties, because the parameters involved are sometimes incompatible or independent of one another, which makes the process very complicated. This problem can be addressed by optimizing each parameter individually and improving the model. Finally, companies using AI in drug discovery have to undergo a rigorous process to obtain copyright for their work, as most countries do not grant patents for these inventions.

7 Discussion and Conclusion

Drug discovery and development is a complex process that typically costs billions of USD and takes about 10–12 years to bring a drug to market. Different AI-based techniques have been explored to address this problem. Drug screening is one example: online databases hold millions of drug-like compounds, and traditional laboratory screening costs 60–100 USD per compound and takes several months for a significant batch, so screening all available compounds remains infeasible even with high-throughput robotics. With AI in the drug screening process, billions of compounds can be screened in a few days. Many AI-based tools and methods have been developed to facilitate the different phases of the drug development process, from 3D structure prediction, target disease gene prediction, protein–ligand binding site prediction, drug screening, prediction of physical properties, toxicity and bioactivity, lead optimization and clinical trials to drug repurposing. In the past few years, AI computational methods have had a great impact on drug discovery, which has led many pharmaceutical companies to invest in AI-based R&D programs and to collaborate with AI start-ups and academic institutions. Takeda Pharmaceutical Company and MIT's School of Engineering, for example, have collaborated to explore the application of AI in healthcare and drug discovery. However, there are still some challenges to overcome. Whereas organic chemists have over the years adopted a universal nomenclature for chemical compounds, a universal format for health care records and for the curation of patient metadata is still to be achieved. The enormous body of experimental screening information in research journal archives does not follow a common format and thus often requires manual reformatting before it can be fed into an AI algorithm, which in itself limits the efficient use of AI in this field.
Even when AI algorithms successfully identify the genes implicated in disease development, the structural details of their protein products, the druggable pockets and the key target amino acid residues, there is still no integrated workflow that addresses this process end to end, and such a platform needs to be developed. Proteins are drug targets and are dynamic in nature, so excessive reliance on their experimentally available structures alone can bias the results of AI-based drug discovery; conformational ensembles generated through molecular dynamics simulations could therefore be included in such a procedure to boost the identification of novel compound scaffolds for intellectual property. By and large, we feel it is safe to say that AI already has a strong foothold in drug discovery and development and that, as its strength is recognised, the associated issues will be addressed.