1 Introduction

Proteins are essential biological molecules whose interactions underpin many biological systems, including the immune and nervous systems. These interactions can occur between proteins of any size, i.e., any protein oligomer is a potential interaction candidate. Because the interactions are highly specific, a mutation at an interaction interface is more threatening than one in a non-interface region [1]. Interfacial mutations are reported to trigger diseases like cancer [2]. Aggregation of proteins causes diseases like Alzheimer’s disease and Parkinson’s disease, for which no cure has yet been found. This underlines the importance of interaction studies on proteins. Structural biology has gained popularity in studying protein interactions because protein function and structure are highly interrelated. The advent of experimental techniques like X-ray crystallography, NMR (Nuclear Magnetic Resonance) spectroscopy, and cryo-EM (cryo-Electron Microscopy) has brought radical changes to interaction studies. X-ray crystallography is comparatively simple and inexpensive and can generate high-resolution structures, while NMR can capture interaction dynamics. On the negative side, the former demands crystallizable samples and hence cannot account for possible conformational changes, while the latter describes structures in terms of observed distances, which makes it difficult to apply to large proteins; structure determination of proteins larger than about 50 kDa is nearly impossible using NMR. With the introduction of cryo-EM, better-quality structures can be obtained even for large molecules. Both hardware and software improvements contribute to its performance: it exploits the enhanced computational power of GPUs and improvements in image processing techniques, together with better electron microscopes and detectors, to deliver better structures. Yip et al. recently reported a high-quality cryo-EM structure of apoferritin at 1.25 Å resolution [3]. The technique is, however, appropriate only for samples of high molecular weight. Apart from the aforementioned limitations, the experimental techniques are laborious and expensive. There is a considerable gap between the number of individual protein structures and the number of protein complex structures available in the PDB. Computational techniques therefore become a viable alternative for structure prediction tasks.

Computational techniques for protein complex structure prediction are broadly classified into template-based and template-free methods. In template-based methods, accurate structures can be generated only when suitable templates are available; a substantial improvement in the quality of predictions is obtained by applying refinement techniques. A template-free method, on the other hand, starts the prediction procedure from scratch, with no reference structure available. Protein–protein docking is one such approach, predicting the near-native orientations of interacting proteins. Input to a docking algorithm can be any oligomeric protein; the largest of the input proteins is designated as the receptor and the smallest as the ligand, for computational convenience. Predictions from these techniques are used to delineate the characteristics of interactions, the role of mutations in interaction patterns, and binding affinities; this information can aid drug discovery.

Another useful classification of docking techniques is based on the flexibility of the interacting proteins. Rigid-body techniques assume that, upon complex formation, the individual proteins preserve their internal geometry, which is not necessarily the case: changes can occur in side-chain atoms and/or at backbone positions. Methods such as Molecular Dynamics (MD), Monte Carlo (MC) methods, and Normal Mode Analysis (NMA) help account for such conformational changes. A recent review of flexible docking methods [4] observes that technological advancements have driven the field forward considerably, but scope for improvement remains.

We know that the 20 amino acids can be connected in different combinations and permutations through peptide bonds to generate different proteins. In 1963, Ramachandran et al. [5] inspected the internal geometry of known structures and proposed the Ramachandran plot, which delineates the range of torsion angles observed in near-native protein structures. It defines probability contours of torsion angles (shown in green in Fig. 1) based on reference to already known structures. In Fig. 1, the left side shows the plot corresponding to a plausible structure, where all angles except a few outliers fall within the contour. In contrast, in the figure on the right side the angles are dispersed over the allowed and prohibited regions, indicating a distorted structure. A Ramachandran plot adhering to the torsion-angle criterion is necessary but not sufficient to judge the quality of the structure under consideration: the plot cannot identify clashes between residues that are far apart in the sequence but too close in the structure, as such residues may still satisfy the angle criteria. Additional conditions, like a negative interaction potential, should be checked to identify plausible near-native structures.

Fig. 1

Ramachandran plot of a plausible structure and distorted structure of a sample protein

2 Protein Representation Schemes

A string of amino acids is designated as a polypeptide or a protein, depending on its length. The largest protein in the human body, titin, contains roughly 27,000 to 33,000 amino acids, each containing at least four atoms. Computations involving such gigantic proteins in an all-atom representation demand long execution times and high-end resources. In such situations, the adoption of a good representation technique can help, and this drives experiments with reduced representation schemes. Different computational techniques assume different levels of granularity—from all-atom to residue level—in representing input proteins. Recent studies show that a coarse-grained approach improves execution time while maintaining accuracy [6].

Zacharias [7] presented a coarse-grained model in which every amino acid except Glycine is represented by at least two pseudo atoms—one pseudo atom for the C\(\alpha \) atom and one or two pseudo atoms for the side-chain heavy atoms, depending on the size of the amino acid. Results show that the minimum-energy structure obtained by docking proteins in this reduced representation resembles the experimental structure. The scheme has been used effectively in many docking techniques [8,9,10].

Koliński [11] proposed a reduced protein representation, the \(\hbox {C}_{\alpha }\)\(\hbox {C}_{\beta }\) Side group (CABS) model, where four atoms represent a residue: the \(\hbox {C}_{\alpha }\) atom, the \(\hbox {C}_{\beta }\) atom, and two pseudo atoms—one representing the center of two consecutive \(\hbox {C}_{\alpha }\) atoms and the other the center of mass of the side-chain atoms. This lattice-based model redefines the force-field parameters based on an extensive statistical analysis of available protein structures. The approach is reported to be widely used in applications related to protein folding and structure prediction [12, 13], protein interactions [14], and protein docking [15].

The UNited RESidue (UNRES) model proposed by Khalili et al. [16, 17] reduces the number of representative sites to two—a \(\hbox {C}_{\alpha }\) atom representing the backbone and the center of mass of the side-chain atoms—with support for extended simulation periods. The coarse-grained UNRES model is utilized in UNRES-DOCK [18], which is designed for protein–protein and protein–peptide docking.

AWSEM (Associative memory, Water mediated, Structure and Energy Model) [19] uses three atoms to represent a residue: \(\hbox {C}_{\alpha }\), \(\hbox {C}_{\beta }\), and O. The model accommodates both direct and water-mediated interactions, and it finds application in the prediction of dimerization interfaces of protein–protein complexes [20]. An extensive study on coarse-grained representations can be found in [21].

With advancements in problem-solving strategies, there has been a paradigm shift in docking methods. Blind docking techniques, which demand no a priori information other than the structures of the interacting proteins, usher in integrative modeling, which has become the new order of the day [22]. Recently, deep learning techniques have also been gaining ground. This work aims to throw light on the challenges and opportunities in docking techniques, particularly rigid-body docking, how existing methods handle them, and the scope for developing promising new concepts.

3 Protein–Protein Docking: A General Pipeline

This section gives an overview of the docking procedure, which may be conducive to understanding its intricacies. The onerous task of protein–protein docking, in general, involves two steps: pose generation and scoring. A schematic diagram is given in Fig. 2. A preprocessing step that precedes the algorithm execution may involve cleaning the bound structures of the input proteins, adding missing atom or residue information (commonly done with MODELLER [23]), extracting surface information, and choosing an appropriate representation scheme for the input proteins.

Fig. 2

A general pipeline in protein–protein docking. Here sampling and scoring phases are shown separately for the ease of representation. Usually both stages happen in concert

3.1 Pose Generation

In molecular docking, the pose generation phase is critical: the exclusion of near-native poses at this step profoundly affects the efficiency of the whole method. The basic operations in this stage are translation and rotation of the interacting proteins, which can be performed using an exhaustive or a stochastic method; both approaches have advantages and disadvantages. Exhaustive searching, also called systematic searching, reduces the propensity of an algorithm to omit near-native structures by considering all possible transformations of ligand and receptor at a given interval. On the other hand, it drastically increases the execution time and memory requirements, hence the demand for improved sampling methods in which the search is confined to regions that are likely to lead to near-native structures. The introduction of constraints like shape complementarity and electrostatic complementarity helps in finding potential binding sites. Stochastic methods can also lead to accurate prediction of near-native structures, but they place a heavier burden on the scoring stage to be accurate. This strategy is well suited to metaheuristic algorithms that begin execution from a random set of solutions.
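To make the exhaustive strategy concrete, the sketch below enumerates rigid-body poses of a ligand over a coarse translational grid and a fixed set of rotations. The grid spacing, number of rotations, and the `score_pose` callback are illustrative assumptions, not the settings of any particular published tool.

```python
# A minimal sketch of exhaustive (systematic) rigid-body sampling.
# Grid spacing, rotation sampling, and the scoring call are illustrative assumptions.
import numpy as np
from scipy.spatial.transform import Rotation


def exhaustive_search(receptor_xyz, ligand_xyz, score_pose,
                      translation_step=3.0, n_rotations=100, box=30.0):
    """Enumerate ligand poses on a translational grid with coarsely sampled rotations."""
    rotations = Rotation.random(n_rotations, random_state=0)   # coarse rotational sampling
    grid_1d = np.arange(-box, box + translation_step, translation_step)
    ligand_centered = ligand_xyz - ligand_xyz.mean(axis=0)

    best_score, best_pose = np.inf, None
    for rot in rotations:
        rotated = rot.apply(ligand_centered)
        for tx in grid_1d:
            for ty in grid_1d:
                for tz in grid_1d:
                    pose = rotated + receptor_xyz.mean(axis=0) + np.array([tx, ty, tz])
                    s = score_pose(receptor_xyz, pose)          # lower score = better pose
                    if s < best_score:
                        best_score, best_pose = s, pose
    return best_score, best_pose
```

Even this toy version makes the cost of systematic searching obvious: the number of evaluated poses grows as the product of the translational grid size and the rotational sample count, which motivates the constrained and stochastic alternatives mentioned above.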

3.2 Scoring

This phase involves assigning a score to different poses based on various criteria. The effectiveness of a scoring function may depend on the characteristics of the proteins involved in the interaction. Scoring functions are classified as force-field-based, knowledge-based, empirical, consensus, and machine learning-based, depending on the parameters used for score calculation. Force-field-based scoring functions [24,25,26] take advantage of non-bonded terms (van der Waals potential and electrostatic potential), combined with bonded terms like bond angle, bond length, and dihedral angle. In the case of rigid-body docking, only the non-bonded terms are applicable. Flexible docking gives more refined results by considering atomic vibrations, and such algorithms also use the bonded terms for more accurate calculations. Empirical scoring functions [27] utilize intermolecular interactions and changes in the accessible surface area of the proteins to calculate the score; the total score may be attributed to factors like hydrogen bonds, hydrophobicity, and hydrophilicity of residues. Knowledge-based scoring functions [28, 29] use existing knowledge on protein interactions to derive statistical potentials of mean force. Consensus-based scoring [30] operates on a combination of parameters, considering the different aspects of interactions. With the application of AI techniques, machine learning-based scoring functions [31] have also been developed.
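As an illustration of a force-field-based score restricted to the non-bonded terms relevant to rigid-body docking, the sketch below sums a Lennard-Jones (van der Waals) term and a Coulomb (electrostatic) term over inter-protein atom pairs. The epsilon, sigma, dielectric, and cutoff values are placeholders rather than parameters of any specific force field.

```python
# Hedged sketch of a non-bonded, force-field-style score for a rigid-body pose.
# epsilon, sigma, dielectric, and cutoff are illustrative placeholders.
import numpy as np
from scipy.spatial.distance import cdist


def nonbonded_score(rec_xyz, rec_charge, lig_xyz, lig_charge,
                    epsilon=0.2, sigma=3.5, dielectric=80.0, cutoff=12.0):
    """Sum Lennard-Jones and Coulomb terms over inter-protein atom pairs within a cutoff."""
    d = cdist(rec_xyz, lig_xyz)                      # pairwise inter-protein distances (Angstrom)
    mask = (d > 0) & (d < cutoff)
    qq = np.outer(rec_charge, lig_charge)[mask]
    d = d[mask]

    lj = 4.0 * epsilon * ((sigma / d) ** 12 - (sigma / d) ** 6)   # van der Waals term
    coulomb = 332.0 * qq / (dielectric * d)                       # electrostatic term (kcal/mol units)
    return lj.sum() + coulomb.sum()
```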

The best structures chosen after the scoring phase may be subjected to refinement, where the side-chain or even backbone atoms undergo conformational changes to obtain minimum-energy structures. A mindful selection of strategies is advised in designing docking tools, as both steps play a decisive role in accurately predicting structures.

4 Algorithmic Approaches in Protein–Protein Docking

The enormous computational complexity of docking, owing to the innumerable possibilities that must be considered, can be tackled by skillfully exploiting the key provisions of different algorithmic approaches. From the introduction of the Fast Fourier Transform (FFT) into exhaustive searching techniques onwards, the field has evolved through many dramatic improvements in solution strategies. Techniques that exploit the geometric, energetic, and topological aspects of protein interactions are discussed in this section.

4.1 Computational Geometric Algorithms

Among the many successful approaches to docking, geometry-based techniques are among the most prominent because of the shape complementarity exhibited at the interface regions of proteins. However, the constraint is not as strict as in protein–ligand docking, where a lock-and-key strategy applies. The dominance of shape complementarity largely depends on the nature of the interacting proteins: near-exact complementarity is expected for rigid-body targets, while a relaxed strategy works for antigen-antibody targets [32].

Delaunay tessellation [33], alpha shapes [34], convex hulls [35, 36], and geometric hashing are some of the geometric concepts widely used in surface generation, interface identification, and pose generation. A general pipeline of geometry-based methods is shown in Fig. 3. The procedure begins with the extraction of surface information from the interacting proteins. The usual practice is to roll a sphere of a specific radius over the protein and render the path traced by its surface [37]. The generated surface is fragmented into convex, concave, and flat regions based on the curvature values of its points, and pose generation proceeds by aligning these curved regions. An indispensable entity in protein alignment is a shape descriptor [38, 39] that satisfies the following criteria [40].

Fig. 3

Geometry based docking. In this figure, the first column shows the ribbon structures of the input proteins. In the second column, different surface patches are shown in different colors. The solid surface in the third column shows the receptor protein; some of the possible positions of the ligand around the receptor are shown as ribbon structures

Translation and Rotation Invariance: The input proteins may have different orientations and may occupy distant regions of coordinate space. The alignment of their complementary regions must not be affected by these factors; the descriptors must therefore be invariant under rotation and translation of the proteins so that each patch can be aligned with its counterpart.

Statistical Independence: Descriptors of different points must be statistically independent. It indicates the need for a compact representation.

Reliable and Fast: There is a trade-off between correctness and speed of execution. The descriptor must be accurate, and its calculation must not induce a performance bottleneck.

The critical points for alignment may be the center of a surface patch [41] or a point equidistant from the majority of points in a patch [42]. The curvature value of a point can also qualify it as a critical point. The alignment of surface patches is pivoted on these critical points. Multipatch alignment can be adopted, instead of single- or two-patch assembly, to ensure that the proteins are arranged based on shape complementarity.

Topological complementarity, though not addressed rigorously, was considered in the pioneering work on automated protein–protein interaction prediction. Wodak et al. [43] applied exhaustive searching with a coarse-grained representation of residues to identify the interacting residues, in combination with the free energy of dissociation. The work laid the foundation for further studies in interaction prediction.

A breakthrough in exhaustive searching methods came in 1992, when Katchalski-Katzir et al. [44] introduced the FFT to speed up the search. They formulated a correlation function that can be applied to a discretized molecular representation. The method, based solely on geometric complementarity, was successful for rigid structures. This revolutionary technique inspired many modifications aimed at improving performance [45,46,47,48].
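The essence of this approach is that, for a fixed ligand orientation, the complementarity scores for all translations can be obtained at once as a correlation of two discretized grids computed via the FFT. The sketch below assumes the receptor and ligand have already been mapped onto 3D grids of equal shape, with surface and core cells weighted as in the original formulation; grid construction and the scan over rotations are omitted.

```python
# Sketch of FFT-accelerated translational scanning for one ligand orientation.
# receptor_grid and ligand_grid are assumed to be pre-discretized 3D numpy arrays
# of equal shape, with surface/interior weighting already applied.
import numpy as np


def fft_translation_scan(receptor_grid, ligand_grid):
    """Return the correlation score for every relative translation of the ligand."""
    rec_ft = np.fft.fftn(receptor_grid)
    lig_ft = np.fft.fftn(ligand_grid)
    # Correlation theorem: correlation = inverse FFT of conj(FFT(receptor)) * FFT(ligand)
    return np.fft.ifftn(np.conj(rec_ft) * lig_ft).real


def best_translation(corr):
    """Pick the grid offset with the highest complementarity score."""
    shift = np.unravel_index(np.argmax(corr), corr.shape)
    return shift, corr[shift]
```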

The majority of FFT-based techniques work in 3D translational space. They require a new grid for each rotated position, demand a separate FFT calculation for each term in the correlation function, and have difficulty accommodating restraints. Padhorny et al. [49] applied FFT in 5D rotational space to circumvent these limitations. By compromising on energy calculations and offloading the responsibility for accurate prediction to the clustering stage, the proposed representation in SO(3) contributes largely to fast execution. Cluster centers, chosen as predictions, are selected based on cluster size, which balances the inaccuracy in the energy calculation.

Geometric hashing is a computer vision technique that identifies a match between two regions under consideration. It assigns each point an invariant representation with respect to a selected reference frame and compares these representations for similarity; agreement in the representations of data points indicates a match between the regions. The technique has been successfully used in docking algorithms to find complementary regions on proteins during pose generation [50, 51]. The problems introduced by the non-uniform distribution of protein data during hashing can be mitigated by rehashing techniques [52] or self-organizing maps [53]. Venkatraman et al. [54] proposed an alternative hashing strategy for the same problem, which uses a kd-tree to search for matching points; the steps to calculate the invariant representation are the same. Recently, Christoffer et al. [55] released the LZerD webserver, capable of pairwise and multiple protein docking. Estrin et al. [56] proposed the SnapDock algorithm, which utilizes hashing in template-based docking. The method calculates hash values of templates in the Dockground [57] and PIFACE [58] datasets; similar processing is applied to the query to find matching interface regions, and the identified regions are then aligned to generate the actual prediction.
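To illustrate the flavor of geometric hashing, the simplified sketch below stores, for each chosen three-point reference frame on one surface, the quantized coordinates of the remaining points expressed in that frame (a rotation- and translation-invariant representation), and then votes for frame pairs that agree on many entries. The triplet selection and grid cell size are illustrative choices; real implementations add curvature-based critical points and verification steps.

```python
# Simplified geometric-hashing sketch: invariant local-frame coordinates plus voting.
# Reference-triplet selection and the cell size are illustrative choices.
import numpy as np
from collections import defaultdict


def local_frame(p0, p1, p2):
    """Orthonormal frame built from three non-collinear points (Gram-Schmidt)."""
    e1 = p1 - p0
    e1 = e1 / np.linalg.norm(e1)
    v = p2 - p0
    e2 = v - np.dot(v, e1) * e1
    e2 = e2 / np.linalg.norm(e2)
    return np.stack([e1, e2, np.cross(e1, e2)])


def build_table(points, triplets, cell=1.5):
    """Hash quantized invariant coordinates of all points under each reference frame."""
    table = defaultdict(list)
    for t in triplets:                                   # t is a tuple of three point indices
        frame = local_frame(*points[list(t)])
        local = (points - points[t[0]]) @ frame.T        # coordinates in the local frame
        for key in map(tuple, np.floor(local / cell).astype(int)):
            table[key].append(t)
    return table


def vote(table, query_points, query_triplets, cell=1.5):
    """Count agreements between query frames and stored frames."""
    votes = defaultdict(int)
    for t in query_triplets:
        frame = local_frame(*query_points[list(t)])
        local = (query_points - query_points[t[0]]) @ frame.T
        for key in map(tuple, np.floor(local / cell).astype(int)):
            for stored in table.get(key, ()):
                votes[(t, stored)] += 1                  # high vote counts suggest matching patches
    return votes
```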

Another geometric technique adopted in docking is triangulation. When the interaction between two atoms is screened by intermediate atoms, calculating a distance-based potential for that pair incurs unjustifiable computational cost. With this assumption, Jafari et al. [59] proposed a Delaunay triangulation-based scoring function that computes the score only for triangulated inter-protein atom pairs within a threshold distance. The method shows superior performance in calculating the potential but underperforms in structure prediction, which may be rectified by proper tuning.
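The idea of restricting a distance-based potential to tessellated neighbors can be sketched with scipy's Delaunay triangulation: only atom pairs that share a tessellation edge, belong to different proteins, and lie within a distance threshold contribute to the score. The `pair_potential` callback and the cutoff are placeholders, not the parameters of the published method.

```python
# Sketch: score only inter-protein atom pairs that are Delaunay neighbors within a cutoff.
# pair_potential is a placeholder for a knowledge-based or physical potential.
import numpy as np
from scipy.spatial import Delaunay
from itertools import combinations


def delaunay_contact_score(rec_xyz, lig_xyz, pair_potential, cutoff=8.0):
    points = np.vstack([rec_xyz, lig_xyz])
    is_ligand = np.arange(len(points)) >= len(rec_xyz)
    tess = Delaunay(points)

    score, seen = 0.0, set()
    for simplex in tess.simplices:                   # each simplex is a tetrahedron of 4 atom indices
        for i, j in combinations(simplex, 2):
            if is_ligand[i] == is_ligand[j]:         # keep only inter-protein edges
                continue
            edge = (min(i, j), max(i, j))
            if edge in seen:
                continue
            seen.add(edge)
            d = np.linalg.norm(points[i] - points[j])
            if d < cutoff:
                score += pair_potential(d)
    return score
```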

4.2 Population Based Metaheuristic Algorithms

Among the many factors determining the interaction between proteins, the van der Waals force, the electrostatic force, and the desolvation potential are prominent. Upon complex formation, these attractive or repulsive forces may induce conformational changes in the proteins due to their fundamental tendency to settle into a stable, minimum-energy state. Structure prediction can thus be modeled as an optimization problem in which the system tries to minimize the energy of the generated structures. Owing to the hardness of the docking problem, a population-based metaheuristic algorithm may be necessary to find a suitable solution. Such algorithms, in each iteration, try to improve the solutions by applying different operators.

Evolutionary algorithms are metaheuristic algorithms that mimic natural evolution. They balance exploration and exploitation of the solution space by applying reproduction, mutation, recombination, and selection operations. The foundation stone of evolutionary algorithms was laid in the 1970s when John Holland [60] proposed the Genetic Algorithm (GA). The algorithm begins by generating a random set of chromosomes, constituting an initial population; in the course of execution, it tries to improve the population in each generation by applying crossover and mutation operations, and, to curb population explosion, only the fittest chromosomes are allowed to survive. In 2001, Gardiner et al. [61] proposed a GA-based solution to protein–protein docking in which each chromosome represents the six degrees of freedom of the ligand. They employed niche restriction [62], which chooses a new solution based on its difference from existing ones so that the search is not confined to a local region. Geometric complementarity of the surfaces, used to compute the fitness of a pose, is not implemented as a strict constraint because of the structural characteristics of proteins. Sunny et al. [63] proposed FPDock, which uses the Flower Pollination Algorithm (FPA) [64]. Pollens, the agents in the algorithm, are quaternions representing the transformation parameters. In each iteration, pollens try to improve their score by combining with the best or with randomly chosen pollens, after which unhealthy pollens are purged to avoid population explosion and maintain the population’s quality. The pollen score depends on its electrostatic and van der Waals potential, based on which the best pollens are identified; these may later be toppled by better new-generation pollens. The algorithm stops after a fixed number of iterations, since no prior information on the stable potential of the target is available. Multimodal output is obtained by splitting the initial population into different islands to which the earlier steps are applied; although this split resembles the island variant of FPA, the migration step is excluded to obtain diverse solutions. All evolutionary algorithms have to deal with a large population, and score calculation can be an execution bottleneck; parallel algorithms, supported especially by the island implementation, can deal with this effectively.
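A bare-bones genetic algorithm over the six rigid-body degrees of freedom (three translations, three rotation angles) is sketched below to make the chromosome/crossover/mutation/selection cycle concrete. Population size, mutation scale, and the `fitness` callback are arbitrary illustrative choices, not the settings of Gardiner et al. or FPDock.

```python
# Minimal GA sketch over 6 rigid-body degrees of freedom (tx, ty, tz, alpha, beta, gamma).
# Hyperparameters and the fitness function are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)


def genetic_docking(fitness, pop_size=50, n_generations=200,
                    trans_range=30.0, mutation_scale=0.5):
    # Each chromosome: 3 translations (Angstrom) + 3 Euler angles (radians).
    pop = np.hstack([rng.uniform(-trans_range, trans_range, (pop_size, 3)),
                     rng.uniform(-np.pi, np.pi, (pop_size, 3))])
    for _ in range(n_generations):
        scores = np.array([fitness(ind) for ind in pop])        # lower = better (energy-like)
        parents = pop[np.argsort(scores)[:pop_size // 2]]       # selection: keep the fittest half

        # Crossover: mix genes from two randomly chosen parents.
        idx_a = rng.integers(0, len(parents), pop_size // 2)
        idx_b = rng.integers(0, len(parents), pop_size // 2)
        mask = rng.random((pop_size // 2, 6)) < 0.5
        children = np.where(mask, parents[idx_a], parents[idx_b])

        # Mutation: small Gaussian perturbation of the children.
        children = children + rng.normal(0.0, mutation_scale, children.shape)

        pop = np.vstack([parents, children])                    # next generation
    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmin(scores)], scores.min()
```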

Swarm optimization algorithms exploit the collective behavior of agents in a swarm to find the optimal solution. Figure 4 shows such a system where ligand swarms are formed around the receptor. In Particle Swarm Optimization (PSO) [65], each particle is associated with a position and a velocity; these values are updated according to the best position achieved by the particle itself and by any particle in the swarm, i.e., the individual best and the swarm best determine a particle’s current position and velocity. A variant, PSO2 [66], uses PSO for a local search around particles selected by basic PSO. The PSO algorithm, together with normal mode analysis, is at the pivot of the SwarmDock algorithm [67]. Particles, which represent ligand transformations, are spread around the receptor following a Gaussian distribution, and their fate is determined by a distance-dependent potential, the sum of the electrostatic and van der Waals potentials. The best positions on the receptor are identified through intensification and diversification of solutions. JabberDock [68] uses the PSO “kick and reseed” strategy, proposed in the POWer environment [69], to explore the potential energy surface of the proteins. Unlike basic PSO, “kick and reseed” randomly reinitializes the particle positions, thereby improving convergence while saving the particles from being trapped in local minima. The docking algorithm starts with an MD simulation followed by PSO sampling. The generated structures are checked for atomic clashes, and those having only negative van der Waals potential are hand-picked for shape complementarity checking. The proposed STID maps prove suitable for devising appropriate scoring functions, leading to improved performance of the tool, which is found to be efficient for transmembrane proteins too [70].
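The particle update at the heart of these PSO-based tools can be summarized in a few lines; the inertia and acceleration coefficients below are common textbook defaults, not the values used in SwarmDock or JabberDock.

```python
# One PSO iteration: velocity and position updates driven by personal and swarm bests.
# w, c1, c2 are common textbook defaults, not tool-specific settings.
import numpy as np

rng = np.random.default_rng(1)


def pso_step(positions, velocities, personal_best, swarm_best, w=0.7, c1=1.5, c2=1.5):
    r1, r2 = rng.random(positions.shape), rng.random(positions.shape)
    velocities = (w * velocities
                  + c1 * r1 * (personal_best - positions)    # pull toward each particle's best
                  + c2 * r2 * (swarm_best - positions))      # pull toward the swarm's best
    return positions + velocities, velocities
```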

Fig. 4

Population-based docking. From the input proteins, surface information is extracted, shown in column 2. Ligand swarms, shown as ribbon-like structures, form different clusters around the receptor. The solid surface in the third column shows the receptor protein

A notable docking method proposed by Jiménez-García et al. [71], LightDock, uses Glowworm Swarm Optimization (GSO). LightDock initializes swarm centers around the receptor, where the swarms then flourish. Each swarm has a center with the highest luciferin value, toward which glowworms in its vicinity are attracted. This multimodal optimization algorithm, which returns swarm centers, accommodates multiple scoring functions: users may design their own scoring function or use the default nine scoring functions in any combination. The risk of getting trapped in local minima is reduced by the multimodal nature of the GSO algorithm. The Anisotropic Network Model (ANM) [72, 73] in the framework supports partial flexibility of backbone atoms. A problem that can arise in a 3D system describing rotation in terms of Euler angles is the gimbal lock problem: after several rounds of rotation, two of the rotation axes may become parallel, and the system consequently loses a degree of freedom. Quaternions offer a solution to this problem and are therefore adopted in this technique. It is observed that the proposed method works better for flexible complexes than for rigid complexes. The blind docking model of LightDock was later revised to a data-driven model that makes predictions based on information about binding sites [74]; this addition drastically increased the success rate of the tool. When combined with HADDOCK refinement [6], LightDock works well for membrane-associated complexes by utilizing the topological information of membranes [75]. The fast execution of LightDock is attributed to the parallel execution supported by the swarms in the system.
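The quaternion representation mentioned above can be handled directly with scipy: the sketch below rotates a set of ligand coordinates by a unit quaternion and composes two rotations without ever constructing Euler angles, which is how gimbal lock is avoided. The example quaternion values and coordinates are arbitrary.

```python
# Quaternion-based rigid rotation of ligand coordinates (no Euler angles, no gimbal lock).
# The example quaternions and coordinates are arbitrary illustrative values.
import numpy as np
from scipy.spatial.transform import Rotation

ligand_xyz = np.random.rand(100, 3) * 20.0           # placeholder ligand coordinates

q1 = Rotation.from_quat([0.0, 0.0, np.sin(np.pi / 8), np.cos(np.pi / 8)])    # 45 deg about z
q2 = Rotation.from_quat([np.sin(np.pi / 12), 0.0, 0.0, np.cos(np.pi / 12)])  # 30 deg about x

rotated = q1.apply(ligand_xyz)                        # apply a single rotation
composed = q2 * q1                                    # compose rotations in quaternion space
rotated_twice = composed.apply(ligand_xyz)
```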

4.3 Monte Carlo Algorithms

Monte Carlo (MC) algorithms are a class of algorithms that estimate quantities through repeated random sampling. They have an inherent element of randomness in their execution, with a small chance of returning an incorrect solution. Nevertheless, these algorithms offer solutions where many others fail, making them attractive for hard problems; for instance, problems involving numerous degrees of freedom of particles can be solved using MC algorithms, which offer a non-deterministic solution to a stochastic problem. The main steps involved are defining a solution space, drawing random inputs from a probability distribution and evaluating them, and aggregating the results. The docking problem has a similar interpretation, where solutions are identified from the conformational space based on their probability of being near-native structures.

RosettaDock [76, 77] uses one such method, where an MC search joins hands with optimization of atom coordinates to obtain proper docking results. Here, rigid-body perturbations, rotamer packing, and minimization of the rigid-body displacement, followed by the scoring phase, determine the quality of the predictions. RosettaDock has a two-stage sampling procedure: coarse-grained rigid-body sampling followed by fine-grained, flexible sampling in which side-chain conformational changes are allowed. The unavoidable computational cost incurred by MC in high-resolution sampling can be alleviated by parallel tempering, also called replica exchange [78, 79]. Parallel tempering, proposed to improve physicochemical simulations, starts by randomly initializing N copies of the system at different temperatures; the replicas are then allowed to exchange their configurations. A replica can sample a large space at high temperature, while local sampling occurs at low temperature. An ensemble-based method involving frequent temperature exchange between hot and cold replicas can avoid convergence to local minima, and a deliberate choice of temperatures and number of replicas helps control the computational cost. Another alternative is Hamiltonian exchange among ensembles [80], where replicas differ in their interactions. The allowable degrees of freedom cause a bottleneck here, as the algorithm needs an overlap in the energy distributions of the ensembles; a well-tempered ensemble (WTE), which uses energy as the collective variable, offers a solution to this problem. Zhang et al. [81] compared standard MC, Temperature Replica Exchange Monte Carlo (REMC), well-tempered ensemble temperature Replica Exchange Monte Carlo (WTE-REMC), and well-tempered ensemble two-dimensional Hamiltonian Replica Exchange Monte Carlo (WTE-H-REMC) to find the best method to combine with RosettaDock; they observe that WTE-H-REMC outperformed the other methods.
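The replica-exchange step described above reduces to a simple Metropolis-style swap criterion between two temperatures. The sketch below shows that criterion in isolation, assuming each replica already carries an energy value from its own MC simulation; energies and temperatures are taken in consistent units with the Boltzmann constant set to one.

```python
# Metropolis-style swap criterion for temperature replica exchange.
# Energies and temperatures are assumed to be in consistent units (k_B = 1 here).
import math
import random


def attempt_swap(energy_i, temp_i, energy_j, temp_j):
    """Return True if the configurations of replicas i and j should be exchanged."""
    delta = (1.0 / temp_i - 1.0 / temp_j) * (energy_i - energy_j)
    # Accept with probability min(1, exp(delta)); always accept when delta >= 0.
    return delta >= 0 or random.random() < math.exp(delta)
```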

Siebenmorgen et al. [82] presented repulsive scaling replica exchange MD (RS-REMD) simulation for complex structure prediction. It assigns an increased van der Waals radius and a reduced van der Waals attraction, modifying the Lennard-Jones (LJ) potential—a functional form used to describe van der Waals interactions—and other potential parameters. Combining this with a repulsive bias helps the system escape from suboptimal binding sites. Existing MD simulation methods require the starting point to be near the actual binding site; RS-REMD can ensure convergence to a near-native solution without such constraints.

4.4 Graph Based Algorithms

For problems involving dependencies between entities, a graphical representation of the data often paves the way for better and easier solution strategies. In the basic definition, a graph \(G=(V,E)\) consists of a set of vertices V connected by edges in the set E. A one-to-one correspondence between two graphs that preserves the edges between nodes is called a graph isomorphism. In an alternative definition, a graph can be expressed as a mapping \(f:X \rightarrow Y\), i.e., \(G=\{(x,f(x)) \vert x \in X\}\). These definitions are applicable to a protein, where atoms connected through bonds are related and spatial proximity between atoms can also be treated as a connection. A residue-level graph likewise holds information about interactions: apart from the proximity data, it conveys physicochemical characteristics as attributes of nodes and edges. Vishveshwara et al. [83] performed a detailed analysis of the construction of protein graphs in various applications. Graph properties like isomorphism and graph spectra are widely used in studies of protein folding, function, and dynamics. A clique, which is a complete subgraph of a graph, has numerous applications in different fields; it is used in protein structure comparison, especially in drug discovery applications [84]. Complementarity finding can be easily mapped to a similarity finding problem if the inside and outside of the protein surface are treated alike. Grindley et al. [85] proposed a maximum common subgraph (MCS) based solution to the similarity finding problem. The method constructs a correspondence graph that contains all possible equivalences between the graphs, generated as follows. Let A and B be two graphs whose MCS needs to be identified. First, generate a set \(C=\{(n_{\mathrm{a}}, n_{\mathrm{b}}) \vert n_{\mathrm{a}} \in A \ and \ n_{\mathrm{b}} \in B\}\). Each element in C becomes a node in the correspondence graph, CG. Add an edge between two nodes in CG if an edge exists between the corresponding nodes in graphs A and B. Cliques in CG can then be interpreted as MCSs of A and B. Gardiner et al. [86] adopted this idea to find regions that satisfy both shape and hydrogen-bonding complementarity: they identified hydrogen-bond donor–acceptor pairs in both proteins and constructed a docking graph in which an edge implies that the distances between these atoms in the two proteins agree within a threshold value. Cliques identified in this graph represent similar regions and can be interpreted as complementary regions. Treating the inner and outer regions alike can, however, also pick out merely similar regions along with the complementary ones, leading to clashes in the docking predictions. Axenopoulos et al. [42] solved the issue by considering the ligand surface and the negative of the receptor surface; the idea was later successfully adopted in SP-Dock [41], a shape complementarity based docking tool.
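The correspondence-graph construction translates almost directly into code. The sketch below builds the node set C and its edges with networkx, using the standard modular-product agreement rule (an edge when the two pairs are either both connected or both unconnected in their respective graphs, a slight generalization of the rule stated above), and reads off maximal cliques as candidate common substructures. The input graphs are assumed to be small, since clique enumeration is exponential in the worst case.

```python
# Correspondence-graph / clique sketch for maximum common subgraph detection.
# A and B are assumed to be small networkx graphs; clique enumeration is exponential in general.
import networkx as nx
from itertools import product


def correspondence_graph(A, B):
    CG = nx.Graph()
    CG.add_nodes_from(product(A.nodes, B.nodes))             # node set C = pairs (n_a, n_b)
    for (a1, b1), (a2, b2) in product(CG.nodes, repeat=2):
        if a1 != a2 and b1 != b2:
            # Agreement rule: connect two pairs when the relationship matches in both graphs.
            if A.has_edge(a1, a2) == B.has_edge(b1, b2):
                CG.add_edge((a1, b1), (a2, b2))
    return CG


def largest_common_substructure(A, B):
    CG = correspondence_graph(A, B)
    cliques = list(nx.find_cliques(CG))                      # maximal cliques = candidate MCSs
    return max(cliques, key=len) if cliques else []
```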

Like the sampling phase, scoring can also take advantage of graph representations. A tree is an acyclic graph, and a kd-tree is an efficient data structure for handling k-dimensional data: it follows the principle of a binary search tree—elements to the left of a node are less than its value and elements to the right are greater than or equal to it—so that searching in k-dimensional space becomes easy. The proficiency of this multidimensional binary search tree in tracking down solutions is exploited by TreeDock [87] in exploring the energy landscape of proteins. It generates a kd-tree from the polar coordinates of the atoms of the two proteins—one fixed (F) and the other movable (M)—enabling fast searches for atoms within a threshold distance of the partner protein. This implementation drastically speeds up the energy calculation step, resulting in an overall improvement in performance. Generating clash-free, low-energy structures is fundamental to a proper docking tool, which TreeDock achieves with its merry-go-round algorithm: the movable protein explores the space around the fixed protein to find the lowest-energy structure. Using the generated kd-tree, the algorithm first finds those M atoms that can be moved closer to F atoms by rotation about the z-axis, then removes those transformations that lead to atomic clashes. Each transformation applied to the movable structure attempts to update the lowest energy value and the corresponding configuration.
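The kd-tree neighbor query underlying this kind of energy evaluation is available off the shelf in scipy. The sketch below finds, for every movable-protein atom, the fixed-protein atoms within a cutoff so that only those pairs enter the energy sum; the cutoff value and the `pair_energy` callback are illustrative assumptions.

```python
# kd-tree sketch: restrict pairwise energy terms to atoms within a distance cutoff.
# The 8 Angstrom cutoff and the pair_energy callback are illustrative assumptions.
import numpy as np
from scipy.spatial import cKDTree


def neighbor_energy(fixed_xyz, movable_xyz, pair_energy, cutoff=8.0):
    tree = cKDTree(fixed_xyz)                                # index the fixed protein once
    total = 0.0
    for m_idx, m_atom in enumerate(movable_xyz):
        for f_idx in tree.query_ball_point(m_atom, cutoff):  # fixed atoms near this movable atom
            d = np.linalg.norm(m_atom - fixed_xyz[f_idx])
            total += pair_energy(m_idx, f_idx, d)
    return total
```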

The random walk graph kernel (RWGK) implemented in GraphRank [88] to measure subgraph similarity was exploited by Borgwardt et al. [89] to predict protein functions, and Geng et al. [90] utilized the same idea for choosing the best docking model. The proposed tool, iScore, is equipped with an SVM classifier to segregate positive and negative models. The interface graph representing a protein–protein interaction is a bipartite graph constructed from interface residues. Each node in this graph is annotated with the evolutionary conservation profile of the residue: a 20x1 vector obtained from a Position Specific Scoring Matrix (PSSM), representing the log-likelihood ratio between the observed probability of an amino acid appearing at a particular position and its expected probability in a random sequence. The RWGK, applied to the two interface graphs, calculates the similarity score between them; this score is integrated with HADDOCK energy terms to boost the performance of the scoring function. The combination of energetic and evolutionary information enables iScore to outperform GraphRank. The main bottlenecks of the tool are the generation of the interface graph and the computation of the RWGK [91], which are addressed by an MPI implementation that offloads tasks to GPUs.

4.5 Machine Learning Based Algorithms

Machine learning is a branch of AI that builds models that learn from training data without explicit coding, with statistics and probability forming the base of these algorithms. During the training phase, the model tries to identify hidden patterns in the training data; based on this knowledge, it can later perform similar tasks on new input. ML techniques can be classified, on the basis of learning strategy, as supervised, unsupervised, semi-supervised, and reinforcement learning. Supervised learning techniques require labeled data for training; once trained, the model can perform classification and regression tasks on unknown data. Classification involves categorizing the data into classes, while regression tasks demand the prediction of numerical values. Unsupervised techniques, which do not require labeled training data, draw inferences from the given data and divide it into different classes after several rounds of refinement based on predefined criteria; this approach is best suited to problems like clustering and dimensionality reduction. Semi-supervised learning supports the use of a mix of labeled and unlabeled data. Reinforcement learning algorithms take feedback from the environment and use it to determine the next step.

The journey towards machine intelligence began in 1943, when Warren McCulloch and Walter Pitts proposed a computational model of the neural network. There were ups and downs in the popularity of the model during its development; the idea of backpropagation, put forth in the 1980s, triggered a second surge of acceptance. With the arrival of voluminous data, there was a paradigm shift from knowledge-based to data-driven methods: the objective changed from learning from rules to learning the rules. The current success of machine learning techniques, particularly deep learning, is attributed to parallel computing and advancements in hardware technologies. A fundamental characteristic of classical machine learning methods is that they demand user-supplied features. Classical ML models include Artificial Neural Networks (ANN), Support Vector Machines (SVM), and decision trees. An SVM uses a kernel function to find the maximally distant hyperplane capable of separating two data classes, as shown in Fig. 5. Classification and regression tasks in protein–protein interaction studies have been successfully implemented using the ML algorithms mentioned above.
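As a concrete toy illustration of the SVM classifier in Fig. 5, the sketch below trains an RBF-kernel SVM on synthetic residue-pair feature vectors labeled as interacting or non-interacting; real interface predictors replace the random features and labels with the sequence and structural descriptors discussed in the following paragraphs.

```python
# Toy SVM sketch for classifying residue pairs as interacting (1) or not (0).
# Features and labels are synthetic stand-ins for real sequence/structure descriptors.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 20))                  # 20-dimensional residue-pair feature vectors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # synthetic "interacting" label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale")    # RBF kernel finds a maximal-margin separator
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```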

Being data-driven, the type and volume of data are primary concerns in developing a good model. Equally important is the number of samples in the different data classes: an imbalance in the data may lead to biased predictions, which is unacceptable. For example, in protein–protein interactions, the number of interacting residue pairs is much smaller than the number of non-interacting pairs, and this imbalance affects the performance of the model. Techniques adopted by existing methods to tackle this issue are discussed in the remainder of the text.

A protein interface is usually defined either by the change in accessible surface area upon complex formation or as those residues having heavy atoms within a distance threshold—usually taken as 6 Å—of the partner protein. Commonly used sequence-based features are the sequence profile, sequence conservation score, and one-hot encoded residue name. Structure-based features include, but are not limited to, the relative accessible surface area, half-sphere exposure, geometric descriptors, hydrophobicity, protrusion index, residue depth, and one-hot encoded secondary structure. Since the properties of interacting residues depend not only on their own features but also on their neighbors, structural and sequential environment features are also considered for prediction.
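The distance-based interface definition above can be implemented in a few lines: a residue is labeled interfacial if any of its heavy atoms lies within 6 Å of any heavy atom of the partner protein. In this sketch, the coordinates of each protein are assumed to be supplied as a list of per-residue heavy-atom coordinate arrays.

```python
# Label interface residues: any heavy atom within 6 Angstrom of the partner protein.
# Each protein is assumed to be a list of per-residue (n_atoms, 3) coordinate arrays.
import numpy as np
from scipy.spatial import cKDTree


def interface_residues(protein_a, protein_b, cutoff=6.0):
    partner_atoms = np.vstack(protein_b)                        # all heavy atoms of the partner
    tree = cKDTree(partner_atoms)
    labels = []
    for residue_atoms in protein_a:
        close = tree.query_ball_point(residue_atoms, cutoff)    # neighbor lists, one per atom
        labels.append(any(len(hits) > 0 for hits in close))
    return np.array(labels)                                     # True = interfacial residue
```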

Earlier, SVM-based techniques were adopted for designing scoring functions. The use of a probabilistic SVM together with structural, biological, and physicochemical properties outperformed many scoring functions [92], and an SVM-based scoring function combined with geometric correlation and amino acid-specific scoring functions also yielded good results [93]. Later, the technique was adopted in interface prediction methods. Unlike many sequence-based methods, PAIRPred [94] uses both sequence and structural data to predict partner-specific binding information: an SVM fed with a pairwise kernel predicts a score for each pair of residues, and a subsequent neighborhood averaging gives the actual interactions. Tests show that bringing in structural features has a positive effect on the results. Das et al. [95] proposed an SVM-based classifier for classification of interfacial residues. They generated a non-redundant dataset of complexes using CD-HIT [96] and BLASTp [97]; the method generates complexes from individual protein structures using PatchDock [51], and the identified interfaces are analysed using PISA [98, 99] for feature extraction. An SVM then performs the classification based on its training. The method predicts the nearness of a predicted interface to a known interface and is available online as the Protein Complex Prediction by Interface Properties web server.

Ensemble methods are a class of machine learning methods that base their decisions on a set of models created during execution; averaging and voting are examples. They often outperform single models due to their collective decision-making ability. A popular ensemble machine learning algorithm is the random forest, which is widely used in studies related to drug discovery [100] and protein–protein interactions [101, 102]. BIPSPI [103] uses Extreme Gradient Boosting (XGBoost) to identify interfacial residues. Like PAIRPred, it works on both sequence and structural data: two consecutive XGBoost classifiers are fed with the feature set and predict interacting pairs, which are finalized by applying a scoring function. Another ensemble-based learning method for protein–protein interaction site prediction is EL-SMURF, proposed by Wang et al. [101]. While DLPred [104], a sequence-based method that uses a simplified LSTM, achieved accuracies of 71.1%, 73.1%, and 71.8% on PDBtestset164, Dtestset72, and Dset186, respectively, EL-SMURF reaches 77.7%, 79.1%, and 77.1% on the same datasets. It is interesting to look into the intricacies of the method: the algorithm works on sequence profile features and evolutionary information (residue evolution rate); oversampling is applied to circumvent the data imbalance problem, followed by a multidimensional scaling algorithm for dimensionality reduction of the features; and a random forest classifier, combined with expert-system voting for data integration, yields good classification results.

Though ML methods can work with a small amount of data, they have some disadvantages. They require human intervention in feature selection, and the relevance of the chosen features influences the overall performance of the model; building an efficient model therefore becomes an esoteric job demanding in-depth domain knowledge. Another potential demerit is their inability to learn beyond a certain level, which limits performance: supplying an enormous amount of data does not always improve an ML model.

4.6 Deep Learning Techniques

A boom in deep learning (DL) techniques accompanied revolutionary changes in hardware technologies. DL is a particular class of machine learning methods that relieves the programmer from hand-picking features: deep neural networks learn their own feature representations, and the supply of abundant data can significantly improve their performance. Equally important is their black-box nature, which demands experience to fine-tune a model to the intended level of accuracy. A schematic diagram showing the difference between ML and DL is given in Fig. 6. A deep learning model uses forward propagation and backward propagation to learn the relation between input and output: the forward pass checks whether the model can generate the intended result, and the backward pass tunes the network parameters. The ability of an ANN to mimic biological neurons forms the backbone of deep learning networks; a basic flow diagram is shown in Fig. 7. It consists of connected nodes triggered based on signals from other nodes. A neural network, in its basic structure, has an input layer, at least one hidden layer, and an output layer. The input layer feeds the hidden layers, and the output of every node except the input-layer nodes is a function of the data from the previous layer, the weights of the edges connecting it to the previous layer, and a bias, which is an additional adjustment to the output. An activation function applied to this result determines whether a particular node is triggered; these functions add nonlinearity, making the network suitable for handling complex relations, and their differentiability supports the backpropagation step. Nodes with values above a threshold pass their data to the next layer, and at the output layer the node with the highest value determines the model’s output. A loss function calculates the difference between the expected and predicted outputs, and the backpropagation step then adjusts the weight matrices in the different layers; repeated application of these steps yields the optimal solution. The choice of loss and activation functions depends on the nature of the problem to be solved. Convolutional Neural Networks (CNN) [105, 106], Recurrent Neural Networks (RNN) [107, 108], Long Short-Term Memory networks (LSTM) [109], generative networks [110, 111], and transformers [112, 113] are some of the most popular deep learning architectures.

Recently, the scientific community witnessed the enormous power of deep learning (DL), amassed through years of research, in predicting the tertiary structure of proteins from amino acid sequences. The groundbreaking solution to this hard problem was proposed by DeepMind [114]. The program, AlphaFold2, utilized an attention mechanism on a graph along with evolutionary information and was reported to be trained on about 170,000 structures publicly available in the Protein Data Bank and other data sources. The performance of AlphaFold in CASP opened many discussions on the effectiveness of deep learning in structure prediction. Baek et al. [115] used a three-track network operating on sequence data, a distance map, and 3D coordinate information to elucidate the working of AlphaFold. In this method, an SE(3) transformer refines the 3D coordinates of atoms generated by a graph transformer that accepts MSA features. In addition to single-protein structures, the model could predict protein complex structures too, albeit at low resolution.

Fig. 5

Support vector machine

Fig. 6

Machine learning vs deep learning techniques. Machine learning needs human intervention to extract the features upon which the model can act; deep learning models are capable of extracting features themselves

Fig. 7

a. Artificial neural network architecture. b. The structure of a single node in ANN. x\(_{\mathrm{1}}\), x\(_{\mathrm{2}}\),...x\(_{\mathrm{n}}\) represent the input values, w\(_{\mathrm{1}}\), w\(_{\mathrm{2}}\),...w\(_{\mathrm{n}}\) represent the weights, b is the bias applied to a neuron, and f is the activation function

Voluminous data in an appropriate representation is required to propel a DL model to its full performance. One notable advantage of the technique is its ability to identify features independently, exempting the user from the burden of providing essential features; at the same time, this black-box nature results in poorly interpretable models. The scientific community has started experimenting with deep learning in tasks like identification of binding sites and interface regions, prediction of protein–protein interactions, implementation of scoring functions, and structure prediction of complexes.

4.6.1 Convolutional Neural Networks (CNN)

One of the breakthroughs in deep learning came in 1998, when Yann LeCun developed CNNs for document recognition [105]. Motivated by the development of the Neocognitron [116] in 1988, LeCun had started working towards convolutional networks; at its inception, the architecture was called LeNet [117, 118]. The features and architecture of Convolutional Neural Networks (CNNs or ConvNets) make them most suitable for handling tensors, and the inclusion of the convolution operation led to remarkable changes in the output of these deep learning models. A CNN mainly contains four types of layers, as shown in Fig. 8. A model can have one or more convolutional layers, which implement the convolution operation: in each convolutional layer, filters are convolved over the multidimensional input data to extract features. Feature extraction is incremental; the first layer extracts only low-level features, and consecutive layers extract higher-level features. The results of the dot product of filters and input are known as feature maps. During the training phase, randomly initialized filters are modified to reduce the loss function value. After a convolutional layer there is a pooling layer: since using too many parameters in a layer may be computationally challenging, aggregation operations on the feature map reduce its dimension, and the pooling layer takes up this responsibility. In the pipeline, a ReLU (Rectified Linear Unit) layer, which acts as an activation function, replaces all negative values in the feature map with zero. The last layer in this architecture is the fully connected layer, which processes the flattened data from the previous layer to predict the output. This architecture outperformed the then state-of-the-art architectures through its ability to treat important aspects of the input specially; in addition, the application of filters gives a reduced representation, making it suitable for handling massive inputs. Inter-protein interactions are orchestrated not only by the directly participating residues: neighboring residues also influence the probability of interaction. A simple neural network cannot cater to this demand, whereas a CNN equipped with the convolution operation is well suited to it.
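A minimal 3D CNN of the kind applied to voxelized protein interfaces is sketched below in PyTorch. The 20-channel atom-type encoding, layer widths, and binary output are illustrative assumptions, not the architecture of any of the tools discussed next.

```python
# Minimal 3D CNN sketch for voxelized protein input (channels = atom-type encoding).
# Channel count, layer widths, and the binary output are illustrative assumptions.
import torch
import torch.nn as nn


class Simple3DCNN(nn.Module):
    def __init__(self, in_channels=20, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),  # convolution extracts local features
            nn.ReLU(),
            nn.MaxPool3d(2),                                       # pooling reduces the spatial size
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                               # collapse to one value per channel
        )
        self.classifier = nn.Linear(64, n_classes)                 # fully connected output layer

    def forward(self, x):                                          # x: (batch, channels, D, H, W)
        h = self.features(x).flatten(1)
        return self.classifier(h)


model = Simple3DCNN()
voxels = torch.randn(4, 20, 24, 24, 24)         # a batch of 4 voxelized interface patches
logits = model(voxels)                          # shape: (4, 2)
```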

Fig. 8

A convolutional neural network architecture. Multidimensional input is passed through convolutional, pooling, and fully connected layers to generate the desired output

Fig. 9

Multilayer graph convolutional neural network [129]

Townshend et al. [119] describe the architecture of SASNet, which uses the spatial information of atoms in proteins to identify interfacial residues. The method does not use any handcrafted features but applies a siamese-like 3D CNN to voxelized input proteins containing atom-type information to generate feature vectors. It shows improved performance compared with BIPSPI [103] and the node-average GCN method [120], owing to the use of a large training dataset, DIPS (Database of Interacting Protein Structures), containing 42,826 binary protein interactions. The tool's performance degrades when it is trained on Docking Benchmark 5 (DB5) because of the smaller number of samples in that dataset; still, the proposed method outperforms the other two tools.

Xie et al. [121] introduced a CNN-based method for finding interaction sites utilizing the binding propensity of residues. Given a sequence of amino acids with associated information like the sequence profile and physicochemical properties, plus structural features like accessible surface area, relative accessible surface area, protrusion index, depth index, and residue hydrophobicity, the algorithm predicts the interaction propensity of the residues. In this implementation, the interface region includes those residues having any atom within a distance of 6 Å from any atom in the partner protein. This definition sometimes causes the inclusion of false-positive residues in the interface, and the authors observed that these false positives in the training data adversely affect the prediction accuracy.

Zhu et al. [122] proposed ConvsPPIS, an ensemble deep convolutional neural network tool to predict interface residues. The method includes evolutionary information in addition to sequence and structural data to generate separate feature graphs. An ensemble predictor built on three deep CNNs trained on these feature graphs exposes the interactability of residues. The model is found to be effective, with an accuracy of 88% on the DBv5-Sel dataset.

Hadarovich et al. [123] attempted to predict the structure of homodimers using a CNN model. The first stage generates a contact map using a deep CNN with binary cross-entropy as the loss function; the map is then fed, along with the 3D structure of the protein in its unbound state, to a gradient descent algorithm. The burden of finding the affine transformation that, applied to the second protein, yields the dimer structure rests on this optimization algorithm. The method did not produce impressive results; nevertheless, the attempt is appreciated and may indicate future directions.

The efficiency of 2D and 3D CNNs [124,125,126] has been proved through many works in image processing. DOcking decoy selection with Voxel-based deep neural nEtwork (DOVE) [127] is the first attempt to test the applicability of 3D CNNs to protein–protein docking. As the name suggests, it voxelizes the decoys, analyses the interfacial residues, and uses a 3D CNN to evaluate the potentials developed upon complex formation in order to separate near-native structures from the rest. The method uses the TMscore [128], which measures the similarity between structures using Eq. (1), to remove redundant poses from the input dataset. The TMscore uses the number of amino acids in the protein, L\(_{\mathrm{N}}\), the number of its residues matching a reference protein, L\(_{\mathrm{T}}\), the distance between the \(\text {i}{\mathrm{th}}\) matching residues, d\(_{\mathrm{i}}\), and a normalization factor, d\(_{\mathrm{o}}\), to calculate the resemblance.

$$\begin{aligned} TMscore = Max \left[ \frac{1}{L_{\mathrm{N}}}\sum _{i=1}^{L_{\mathrm{T}}} \frac{1}{1 + \left( \frac{d_{\mathrm{i}}}{d_{\mathrm{o}}}\right) ^2} \right] \end{aligned}$$
(1)
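Given a set of matched residue pairs and their distances, the sum in Eq. (1) can be evaluated directly. The sketch below uses the commonly published length-dependent form of d\(_{\mathrm{o}}\) (an assumption here, since the text does not specify it) and omits the outer maximization over superpositions.

```python
# Evaluate the TMscore sum of Eq. (1) for one superposition (the outer Max is omitted).
# The length-dependent d0 below is the commonly published choice and is an assumption here.
import numpy as np


def tm_score_term(distances, L_N):
    """distances: array of d_i for the L_T matched residue pairs; L_N: target length."""
    d0 = 1.24 * np.cbrt(L_N - 15) - 1.8          # normalization factor d_0
    return np.sum(1.0 / (1.0 + (distances / d0) ** 2)) / L_N
```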

At the same time, to cope with the problem of unbalanced data, positive samples are generated by reorienting the original structures. It is worth noting that each voxel is associated with total atomic potentials. The performance of the model revealed the power of CNNs in scoring as well.

4.6.2 Graph Neural Networks (GNN) and Graph Convolutional Networks (GCN)

CNN-like models are suitable when dealing with data that lies in an n-dimensional linear space, \(\mathbb {R}^n\). However, not all data can be mapped to \(\mathbb {R}^n\); sometimes graphs replace n-dimensional grids to represent the relationships within the input data. Graphs are better structures for representing the hierarchical relations between a set of entities, and the same applies to graph neural networks. The input to a GNN is a graph representing the relations between the different entities in a problem. Each node embeds information about its neighbors by message passing, which involves an aggregate operation over neighboring node data followed by a combine operation with the node's own features. The computation at the \(l{\mathrm{th}}\) layer can be written mathematically as follows:

$$\begin{aligned} a_{v}^{(l)} = Aggregate^{(l)} \left( \left\{ h_{u}^{(l-1)} \ \forall \, u \in \mathcal {N}(v) \right\} \right) \end{aligned}$$
(2)

where v is the node, \(\mathcal {N}()\) returns its neighborhood, \(a_{v}^{(l)}\) is the aggregated feature of the neighborhood, and \(h_{u}^{(l-1)}\) is the feature of neighboring node u at layer \((l-1)\).

$$\begin{aligned} h_{v}^{l} = Combine^{(l)} \left( h_{v}^{(l-1)}, a_{v}^{(l)} \right) \end{aligned}$$
(3)

Like any other network, stacking layers can improve the model’s performance. Summation of the individual node encodings gives a representation of the whole graph. When the neighbors need to be treated differently, there is a need to fall back on convolution, and graph convolutional networks fill this role. A GCN has a convolution layer, a linear (fully connected) layer, and a non-linear activation layer to carry out the intended task. A multilayer GCN is shown in Fig. 9 [129]. The models are scalable, as changes require only localized modifications to obtain new embeddings. Another variant of the GNN is the graph autoencoder, which encodes graphical data: the encoder generates a latent vector representation, while the decoder regenerates the graph from this encoded vector.
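Equations (2) and (3) correspond to a few lines of code once a node-feature matrix and an adjacency structure are fixed. The sketch below uses mean aggregation and a single weighted combine step, which is one simple choice among many; GCN, GraphSAGE, and attention variants differ precisely in how these two operations are defined.

```python
# One message-passing layer: mean-aggregate neighbor features (Eq. 2), then combine (Eq. 3).
# Mean aggregation and a single linear combine step are illustrative choices.
import numpy as np


def message_passing_layer(H, adjacency, W_self, W_neigh):
    """H: (n_nodes, d) features at layer l-1; adjacency: dict node -> list of neighbor indices."""
    H_next = np.zeros((H.shape[0], W_self.shape[1]))
    for v in range(H.shape[0]):
        neighbors = adjacency[v]
        a_v = H[neighbors].mean(axis=0) if neighbors else np.zeros(H.shape[1])   # Aggregate (Eq. 2)
        H_next[v] = np.tanh(H[v] @ W_self + a_v @ W_neigh)                       # Combine (Eq. 3)
    return H_next
```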

Fout et al. [120] proposed a GCN-based model that classifies residue pairs as interacting or not based on structural and sequence data. It treats the two proteins separately using GCNs, and a final merging operation enables the prediction of interacting residue pairs. The results showed that convolution plays a significant role in improving the predictions. In interface prediction, adding residue binding propensity to separate positive and negative residue pairs was successfully implemented in [121]. This reflects a difference in the nature of residues occurring at the surface versus the interior of proteins, and residues also differ in their preference for interaction owing to their different properties. An integrated graph neural network and CNN-based approach [130], which uses structural, sequential, and high-order interaction data, was proposed recently. High-order interaction data includes details about adjacent residue pairs, as they also have a role in the interactions. The GNN modules generate sequential representations of the individual proteins; these are summed or concatenated to form a tensor that then passes through CNN layers to yield pairwise interaction predictions. The inclusion of in-protein interaction information helps to deal with the imbalance in training labels.

Cao et al. [131] use graph convolutional networks to implement a scoring function. The network accepts both intra- and inter-molecular contact maps as input, and this spatial relationship between residues is used to calculate intra- and inter-protein energy values. The model is trained to reduce the difference between the calculated Gibbs energy and the actual energy of the complex, and the generated score can also be relied on for binding affinity prediction. Improved performance can be expected if the model is trained on data drawn from the same distribution as the test data. Wang et al. [132] presented a GNN-based scoring function in which interfacial information is embedded in a graph. Two graphs representing the covalent and non-covalent interactions are processed by a gated graph attention network, which predicts the probability of a structure being near-native.

4.6.3 Generative Models

Generative models in deep learning are designed to generate new data instances, and their introduction has had a widespread impact on many fields; for instance, data scientists can use them to compensate for low data availability when training deep learning models. In simple terms, a generative model attempts to produce data that follows the distribution of the training data. Strict constraints on the data distribution may make the task impossible, but near-exact compliance can be expected from a successful model. Variational autoencoders (VAE) [133] and generative adversarial networks (GAN) are examples of generative models. An autoencoder has an encoder and a decoder network working hand-in-hand to reconstruct its input at the output: the encoder generates a latent vector of the input, on which the decoder operates to reconstruct the original input. Latent vectors are a reduced representation of the input, and hence the architecture is widely used for dimensionality reduction. Autoencoders with only one hidden layer are vanilla autoencoders; they can generate data from an encoded representation but fail to create variations of the input data. A VAE, which uses Bayesian inference to update the probabilities, has the power of a generative model: instead of encoding the data, it encodes the data distribution, so a sample drawn from this encoding can be used to generate data that follows the distribution.
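A minimal sketch of the idea of encoding a distribution rather than the data itself, using the standard reparameterization trick; the layer sizes and loss weighting are illustrative assumptions, not a description of any cited model.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal variational autoencoder: encode a distribution, sample, decode."""
    def __init__(self, in_dim=128, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 64)
        self.mu = nn.Linear(64, latent_dim)      # mean of q(z|x)
        self.logvar = nn.Linear(64, latent_dim)  # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                 nn.Linear(64, in_dim))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    # Reconstruction error plus KL divergence of q(z|x) from the unit Gaussian prior
    rec = nn.functional.mse_loss(x_rec, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld

# Toy usage on random data
x = torch.randn(8, 128)
x_rec, mu, logvar = VAE()(x)
loss = vae_loss(x, x_rec, mu, logvar)
```

Once trained, sampling z from the unit Gaussian and passing it through the decoder generates new instances that follow the learned data distribution.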

A GAN [134] is similar to a VAE but differs in the training method. Fig. 10 shows the working of a simple GAN. A GAN has a generator network, G, that generates data, which the discriminator, D, classifies as real or fake based on its training. The objective is to find a Nash equilibrium between the two networks, where the generator and discriminator make optimal moves to minimize and maximize the loss, respectively. In short, the model implements a min-max game that works on the following equation:

$$\begin{aligned} \min _{G} \max _{D} V(D,G) = \mathbb {E}_{\mathrm{x}}[\log D(x)] + \mathbb {E}_{\mathrm{z}}[\log (1 - D(G(z)))] \end{aligned}$$
(4)

where \(\mathbb {E}_{\mathrm{z}}\) is the expectation over the noise distribution and \(\mathbb {E}_{\mathrm{x}}\) is the expectation over the real data distribution. The discriminator is first trained with actual data. Data generated from the supplied noise is then fed to the discriminator, and its results are propagated back separately to train the two networks. The generator tries to maximize the probability of the discriminator making a mistake. In each iteration, both networks improve their skills until a balance is attained.
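A sketch of the alternating updates behind Eq. (4), assuming simple fully connected networks and the common non-saturating generator objective; all sizes and hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative generator and discriminator; dimensions are arbitrary assumptions
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 128))
D = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):                       # real: (batch, 128) samples of true data
    batch = real.size(0)
    z = torch.randn(batch, 16)              # noise input to the generator

    # Discriminator step: maximise log D(x) + log(1 - D(G(z)))
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(batch, 1)) + \
             bce(D(G(z).detach()), torch.zeros(batch, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: fool the discriminator (non-saturating form of min over G)
    opt_g.zero_grad()
    loss_g = bce(D(G(z)), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```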

Degiacomi [135] tested the ability of autoencoders to generate protein structures, showing that the decoder part of a well-trained autoencoder suffices to generate a conformational space. Structures generated by MD simulation were used to train the autoencoder, which produces latent vector representations of the structures; the trained model can then generate a conformational space from these latent vectors. Recently, Ramaswamy et al. [136] improved the conformational space generation procedure using 1D convolutions in both encoder and decoder. The combination of force-field parameters and the Mean Squared Error (MSE) loss function helps to generate low-energy structures: while the non-bonded terms (van der Waals and electrostatic potential) guide the transition to a conformation, the bonded terms (bond angle, bond length, and dihedral angles) help refine the structure by making local alterations to atom positions. Results showed that the inclusion of physics-based terms helped to generate biologically relevant conformations. Nguyen et al. [137] made a similar attempt with generative adversarial networks in D3R Grand Challenge 4. Along with generator and discriminator modules, the proposed method accommodates a mathcentre. The generator module is an autodecoder that generates a structure from latent space and noise input. The mathcentre converts this structure to a low-dimensional mathematical representation utilizing algebraic topology, differential geometry, and algebraic graph theory. The discriminator, which is an autoencoder, encodes the mathematical representation into a latent space and checks its quality. While the discriminator tries to maximize the loss function, the generator tries to minimize it, and their adversarial action helps the network find the optimal model parameters.

An image-to-image translation system is designed to accept images from one domain and manipulate them to generate images in another domain that follows a different style. One of the main hurdles in implementing such a system was the requirement for paired training data. CycleGAN [138], an unsupervised learning technique, was introduced to circumvent this demand for supervision. The network has two pairs of generators and discriminators, and one of its notable features is the cycle consistency loss, which ensures forward-backward consistency of the generated data. Mol-CycleGAN [139] utilized the power of CycleGAN to generate optimized molecular structures: given a molecule, it generates a molecule of similar structure but with optimized characteristics on par with the standards. Before CycleGAN is applied, JT-VAE (Junction Tree VAE) [140] is used to generate a latent vector containing a junction-tree scaffold component and a molecular-graph component. The model works on two sets of latent vectors, an active set, Y, and an inactive set, X. Two mapping functions \(G: X \rightarrow Y\) and \(F: Y \rightarrow X\) are defined along with corresponding discriminators D\(_{\mathrm{Y}}\) and D\(_{\mathrm{X}}\), respectively. The discriminators and generators try to outperform their adversaries until convergence is achieved; for instance, D\(_{\mathrm{Y}}\) forces G to focus on the distribution of Y. The cycle consistency loss ensures that the input to one generator and the output of the other generator match, while the identity mapping loss ensures similarity between the input and the generated structure. Once trained, the model has learned the rules for transforming an inactive molecule into its active state. Thus, given an input molecule \(x\in X\), G generates an embedding of x that follows the distribution of Y and is similar to x. The generated embedding can be fed to the decoder of JT-VAE to obtain the optimized molecule. The method finds application in generating molecules that are difficult to synthesize.
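A sketch of the two loss terms just mentioned, written for generic generators G and F acting on latent vectors x ∈ X and y ∈ Y; the L1 formulation and all names are illustrative assumptions, not the Mol-CycleGAN implementation.

```python
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(G, F, x, y):
    # Forward-backward consistency: F(G(x)) should recover x, and G(F(y)) should recover y
    return l1(F(G(x)), x) + l1(G(F(y)), y)

def identity_mapping_loss(G, F, x, y):
    # A generator fed a sample already in its target domain should change it little
    return l1(G(y), y) + l1(F(x), x)
```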

4.6.4 Other Deep Learning Architectures

Though 3D CNNs can handle translation, they have an inherent inability to provide rotation invariance, and hence they may be a poor choice for manipulating point cloud data. Tensor field networks [141] are special networks that can work with such data, as they offer rotation and translation equivariance. This distinct quality removes the need for the data augmentation step, in addition to supporting problems in classical mechanics and molecular structure. Eismann et al. [142] utilized tensor field networks in PAUL, an end-to-end implementation of a scoring function. Unlike other methods that feed in physicochemical properties, this method uses only three atom-level features: the 3D coordinates, the corresponding atom name, and whether the atom belongs to the receptor or the ligand. In the implementation, the convolution operation is carried out by taking the tensor product of the input and a truncated series of spherical harmonics with learnable radial functions to ensure equivariance. Hierarchical learning is achieved by atom- and residue-level aggregations in different layers, finally giving a single feature vector representation for the whole structure. Since physical forces are significant only over a locality, convolution is applied within the neighborhood only. The model is trained for the regression task of Ligand Root Mean Square Deviation (LRMSD) prediction and the classification task of segregating acceptable and unacceptable structures. Each layer of processing reduces the granularity of the input data, and in the last fully connected layer a scalar output is generated. The equivariant network, hierarchical learning, and nearest-neighbor convolutions form the backbone of this model. Results show that augmenting PAUL with other scoring functions promises better prediction.

Fig. 10 Overview of generative adversarial network

Gainza et al. [143] explored the capacity of geometric deep learning to assess the interactability of proteins. The method divides the surface into patches of fixed geodesic radius, to which biophysical and chemical/structural properties are assigned as an encoded descriptor. For a matching pair of surface patches, the distance between the descriptors will be significantly smaller, and such a pair can form the basis of an interaction. The authors describe three key applications of the descriptor: MaSIF-site, MaSIF-ligand, and MaSIF-search. MaSIF-site identifies the interaction site between two proteins, MaSIF-ligand finds the binding pockets of ligands, and MaSIF-search helps filter out the most probable interactions. The method, using bound structures, was compared with PatchDock, ZDOCK [144], and ZDOCK+ZRANK and found to have comparable performance but with much-reduced execution time, because the fragmentation of the surface into patches allows multiple patches to be processed simultaneously. However, the model needs to be improved to obtain good results for unbound structures.

5 Implementation Aspects: Parallel Computing in Docking

The twenty-first century has witnessed many advancements in the hardware domain. This led to evolutionary changes in computation: serial programming gave way to parallel computing, which can effectively utilize the improved hardware. There are provisions to support instruction-level, thread-level, data-level, and transaction-level parallelism. Porting an existing serial system to parallel execution must be handled carefully to get the intended results. An APOD (Assess, Parallelize, Optimize, and Deploy) strategy can be employed to extract the maximum benefit. A proper analysis of the program is required to identify potential candidates for parallelization, and automated profiling of the code is recommended to support the manual analysis. Once profiled, independent steps that dominate execution time can be parallelized to improve efficiency. The use of GPU-accelerated libraries, OpenACC directives, and GPU programming languages helps in accomplishing the task. Code that complies with best practices adds to the optimized performance and can be deployed successfully. OpenMP, MPI, CUDA, OpenACC, and OpenCL are popular platforms that offer parallel programming capabilities. Protein docking, being a computationally intensive job with scope for parallel execution, can benefit from these techniques to a great extent. Some specific works that adopt parallel programming are discussed below.
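As a toy illustration of data-level parallelism in docking, and not a description of any of the tools below, independently scored candidate poses can be distributed across CPU cores; the `score_pose` function here is a hypothetical stand-in for a real energy or correlation evaluation.

```python
from multiprocessing import Pool
import random

def score_pose(pose_id):
    """Hypothetical stand-in scoring function; a real tool would evaluate an energy/correlation term."""
    random.seed(pose_id)                     # deterministic toy score per pose id
    return pose_id, sum(random.random() for _ in range(1000))

if __name__ == "__main__":
    poses = range(10_000)                    # independent candidate orientations
    with Pool() as pool:                     # one worker process per available CPU core
        scored = pool.map(score_pose, poses) # embarrassingly parallel evaluation
    best = min(scored, key=lambda t: t[1])
    print("best pose:", best)
```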

MegaDock 4.0 [145] uses the method proposed by Katchalski-Katzir to compute the score of generated poses. Unlike the originally proposed correlation functions, a single correlation function that accommodates all factors is utilized, making the execution faster. This method offloads all processing to the GPU, including voxelization, ligand transformation, score calculation, and final prediction. The provision for simultaneously utilizing multiple cores and GPUs makes the execution even faster: each of the 420 available compute nodes accommodates 2 Intel Xeon X5670 CPUs and 3 NVIDIA Tesla K20X GPUs. On the software side, a boost is achieved using hybrid CUDA, MPI, and OpenMP parallelization. Shimoda et al. [146] studied the effect of different computational environments, GPU and MIC (Many Integrated Cores), on the working of MEGADOCK, and the GPU proved to be better.

Cell-Dock [147] implemented an array of techniques for faster execution. It parallelizes both the rotation and discretization steps on a PlayStation 3 (PS3) and a Cell BE blade, in addition to the data localization techniques employed. It offers two versions, CELL-256 and CELL-128, of which the former has the better performance. CELL-256 ran on a dual-processor Cell BE-based blade containing two SMT-enabled Cell BE processors at 3.2 GHz with 2 GB DD2.0 XDR RAM (1 GB per processor). The integration of task-level parallelism with data-level parallelism improved its performance compared with FTDock [45], an FFT-based technique that uses shape and electrostatic complementarity to filter the structures.

Sukhwani et al. [148] analyzed ways to accelerate PIPER [46], an FFT-based algorithm. A profile analysis identified the FFT as a good candidate for parallelization. Other steps were also considered because, by Amdahl's law, modifications to the FFT step alone could speed up the whole procedure by a factor of at most 11. Finally, the work compared the performance of FPGAs and GPUs in accelerating docking and observed the dominance of GPUs in protein–protein docking.
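The quoted factor follows directly from Amdahl's law: if a fraction p of the runtime is parallelizable, the overall speedup with N processors is bounded as shown below. A ceiling of about 11 corresponds to p ≈ 0.91; the exact fraction here is our illustrative back-calculation, not a figure reported in [148].

$$\begin{aligned} S(N) = \frac{1}{(1-p) + p/N}, \qquad \lim _{N\rightarrow \infty } S(N) = \frac{1}{1-p} \approx 11 \quad \text {for } p \approx 0.91 \end{aligned}$$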

Deep learning techniques, randomized algorithms, population-based algorithms, and other traditional approaches are all suitable candidates for parallel execution, which can drastically reduce execution time while effectively utilizing the available resources.

6 Critical Assessment of PRediction of Interactions (CAPRI)

Critical Assessment of Structure Prediction (CASP), started in 1994, brings state-of-the-art techniques for predicting protein structures from amino acid sequences to a single platform. It is treated as a world championship in which reputed labs test their tools on unknown targets whose structures are not published anywhere. In 2001, EMBL-EBI started the Critical Assessment of PRediction of Interactions (CAPRI), following the path laid down by CASP, to showcase the performance of scorers and predictors in protein–protein docking. Given the coordinates of the input proteins, predictors generate ten possible near-native structures of the complex; in addition, they may supply a large number of structures for the scorers to check the efficiency of their scoring functions. CAPRI offers separate tracks for servers and human predictors. Targets whose experimentally determined complex structures are yet to be published are chosen in the different CAPRI rounds; notably, the structures of the individual proteins are available in their free form. Ab initio methods were adopted for docking at its inception, and only protein–protein complexes were chosen as targets; later, the arena opened to other protein-bound targets too. In 2014, CASP joined hands with CAPRI for the biennial CASP-CAPRI event. Most of the targets in these events are homo-oligomers, and homology modeling may suffice to predict their complex structures. By this time, predictors relied on template-free methods only when target templates were unavailable. The introduction of clustering techniques improved the methods by effectively choosing appropriate templates from a set of similar structures. Targets having different structural domains posed a real problem for both predictors and scorers [149, 150]. Apart from the changes in input and procedure, CAPRI brought new objectives to the event: in addition to structure prediction of targets, the participants were assigned the tasks of predicting the positions of interfacial water molecules [151] and binding affinity [152,153,154]. The overall performance of four servers that have consistently succeeded in CASP-CAPRI events is shown in Table 1; the values are taken from [149, 150, 155].

Table 1 Comparison of the docking performance of different servers in CASP-CAPRI experiments. Results of each experiment are taken from the corresponding publications [149, 150, 155]

Lensink et al. [155] reported the details of CASP13-CAPRI, held in 2019. The participants were 24 human predictors, eight predictor servers, 14 human scorers, and five scorer servers. In a competition where template structures are available, the general trend among docking servers is to adopt template-based techniques wherever possible. The ClusPro server identified appropriate templates for the given sequence data using HHPred [156]. When no template was available for a complex, ClusPro [48] opted for free docking, which starts by applying the FFT-based PIPER algorithm [46]; the generated structures were scored using van der Waals, electrostatic, and desolvation potentials, and finally the centers of the clustered structures were chosen and cleaned of atomic clashes by van der Waals minimization using CHARMM potentials. MDockPP also uses FFT-based structure generation, followed by an optimization step and scoring with ITScorePP [29]; the final models were selected from the clustered data based on biological information. GalaxyPPDock takes advantage of GalaxyHomomer [157] to segregate the templates identified by HHsearch [158], performs FFT-based ab initio docking, and then refines the generated structures using GalaxyRefineComplex [159]. The PSO-based structure prediction technique uses HHblits [160] to find homologous sequences and generates the individual structures using constricted PSO2 with the DFIRE potential [161]; the standard SwarmDock algorithm is then used to find the optimal binding location, together with a ranking SVM [31] to select the desirable structures. HADDOCK employed ab initio, template-based, and information-driven approaches to predict near-native structures; however, the ab initio method could not generate any successful targets. The LZerD server used a combination of PSI-BLAST and HHpred to find template structures; the application of the LZerD algorithm is therefore limited to cases for which templates could be identified, and the ability to identify the correct templates predominantly affects the performance of the servers. Optimization of the generated structure is encouraged in the case of difficult targets. Among the predictor servers, HADDOCK outperformed all others in quality of prediction and number of targets solved. This analysis clearly shows that all methods apply template-based techniques wherever possible to obtain high-quality results; furthermore, with the addition of more protein–protein complex structures, the tools' performance could improve. Fig. 11 shows the performance of the different docking servers based on the rank and quality of the generated structures, as reported in [155]. The majority of the servers succeeded on almost all easy targets, but the results were not promising for difficult targets.

7 Benchmark Set and Evaluation Criteria

The truest test of docking tools is their ability to predict near-native structures for flexible targets, and there must be ways to compare the performance of different docking techniques. The protein–protein docking benchmark, DOCKGROUND, and PPI4Dock are docking datasets that provide this opportunity.

The protein–protein docking benchmark, proposed by Chen et al. [162] in 2003, was the first of its kind. Special attention was paid to including targets of all difficulty levels (rigid-body, medium, and difficult) while avoiding redundant structures. This benchmark set has been refined many times to produce different versions [163,164,165,166]. The latest version, protein–protein docking benchmark 5.5 [167], contains 162 rigid-body, 60 medium-difficulty, and 35 difficult targets. It should be noted that these benchmark sets contain experimentally determined structures, and the individual structures are available in both bound and unbound forms.

DOCKGROUND [57, 168], a dataset compiled by Douguet et al. in 2006, supports multiple aspects of protein–protein complex structure prediction. Apart from experimental structures, computationally obtained bound and unbound structures are also included. Specifically, the dataset supplies X-ray bound, X-ray unbound, simulated unbound, model-model complexes, X-ray docking decoys, and docking templates. The X-ray bound structures, whose details are stored in a PostgreSQL database, are downloaded directly from the PDB repository, and the X-ray unbound structures are identified with the help of the ProPairs software [169]. Computational unbound structures are generated by Langevin dynamics simulations of the bound structures separated from their partners. DOCKGROUND generates decoy structures of complexes using GRAMM-X [170].

The largest docking dataset to date is PPI4Dock [171], which contains 1417 targets obtained by homology modeling. These targets are classified as very_easy, easy, hard, very_hard, and super_hard, depending on the deviation of the template structure from the original crystal structure. The authors observe that a tool that can generate near-native structures for the very_easy, easy, and hard targets can be considered efficient.

The well-accepted criteria for checking the quality of structures generated by different docking tools are the CAPRI evaluation criteria, which use the LRMSD, IRMSD, and f\(_{\mathrm{nat}}\) values. LRMSD is the root mean square deviation of the ligand backbone atoms when the receptor backbone atoms of the predicted and crystal structures are superposed. Similarly, the Interface Root Mean Square Deviation (IRMSD) measures the deviation of the interface coordinates between the prediction and the original structure. The fraction of native contacts (f\(_{\mathrm{nat}}\)) records the fraction of interface residue contacts in the native structure that are reproduced in the predicted structure. Based on the values of these parameters, generated structures are classified as high, medium, acceptable, or incorrect, as shown in Table 2.

However, these criteria are not robust, as they use three different quantities to qualify the structures. Even a slight change in the value of any one parameter may change the quality class of a generated structure, and a large LRMSD value may be compensated by a good f\(_{\mathrm{nat}}\) value, causing the structure to be counted as acceptable. Basu and Wallner [172] proposed DockQ as an alternative measure that returns a single score between 0 and 1. It uses IRMSD and LRMSD values scaled by an inverse-square scaling technique. The final score is calculated as follows [172]:

$$\begin{aligned} DockQ = (f_{\mathrm{nat}} + LRMSD_{\mathrm{scaled}} + IRMSD_{\mathrm{scaled}})/3 \end{aligned}$$
(5)

where

$$\begin{aligned} RMSD_{\mathrm{scaled}} =\frac{1}{1+ \left( \frac{RMSD}{d_{\mathrm{i}}}\right) ^2} \end{aligned}$$
(6)

The values of d\(_{\mathrm{i}}\) are 8.5 Å and 1.5 Å when scaling LRMSD and IRMSD, respectively. Even the CAPRI competition has started using DockQ for quality analysis [155]. Fig. 12 shows the ranges of DockQ scores used to classify structures into the different quality classes.
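A minimal sketch of Eqs. (5) and (6); the input values are hypothetical, and the quality bands in the comment are the commonly cited DockQ cut-offs rather than values read from Fig. 12.

```python
def scaled_rmsd(rmsd, d_i):
    """Inverse-square scaling of Eq. (6)."""
    return 1.0 / (1.0 + (rmsd / d_i) ** 2)

def dockq(f_nat, lrmsd, irmsd):
    """DockQ score of Eq. (5): mean of f_nat and the scaled LRMSD/IRMSD terms."""
    return (f_nat + scaled_rmsd(lrmsd, 8.5) + scaled_rmsd(irmsd, 1.5)) / 3.0

# Hypothetical prediction: 60% native contacts, LRMSD 4.2 A, IRMSD 1.8 A
score = dockq(0.60, 4.2, 1.8)
# Commonly cited bands: >= 0.80 high, >= 0.49 medium, >= 0.23 acceptable, else incorrect
print(round(score, 3))
```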

Fig. 11 Performance of different docking servers in CASP13-CAPRI in terms of rank and quality of best prediction

Fig. 12 Range of DockQ score

Table 2 CAPRI evaluation criteria [184]

8 Discussion

The majority of docking tools prefer high-quality X-ray crystallographic structures as input, since these reveal the atomic details of proteins. The scarcity of such high-quality X-ray crystallographic data stresses the need to investigate the utility of lower-quality structures; structures obtained through small-angle X-ray scattering (SAXS) and homology modeling may be useful in this regard [173, 174]. The usefulness of augmenting input data with additional theoretical knowledge to compensate for its weaknesses is also a point worth investigating.

A docking algorithm should take in only the information it needs to perform well. One of the most striking examples is the performance of HADDOCK-CG with the Martini force field [6]. A notable feature of the coarse-grained representation is that it smoothens the energy landscape and speeds up the execution; it is also well suited to dealing with slight conformational changes. Thus careful selection of information may contribute to the performance of a method.

Every sensible docking algorithm that aims at improvement must also manage its computational cost. Score calculation and refinement are the main bottlenecks in the majority of docking procedures; the increased execution time is largely attributable to the pairwise energy calculation over inter-protein residue pairs during scoring. Since interaction does not happen in the core region, many docking techniques calculate the pairwise energy only for residues down to a specific depth. Measures that reduce the size of the conformational space can also help to reduce the computational cost.

A water molecule that forms many hydrogen bonds with the protein is called a highly coordinated water molecule; such molecules have a significant role in the interaction [175]. Adding their contribution to the scoring function improves the sampling of the conformational space, and a two-stage approach that explicitly treats water molecules adds to the performance of the scoring stage [176]. Also, an interacting molecule likely displaces water molecules to form associations, and hence water displacement may be used as an indicator of binding sites. The use of explicit water molecules in MD simulation can also improve results [177].

Evolutionary algorithms are used in many protein-related tasks. Their underutilization in docking may be due to the difficulty of fixing convergence criteria; a possible solution is to fix the number of iterations, and combining these algorithms with an SVM may help to fix the parameters [178, 179]. Another problem with these algorithms is their computational complexity, which embarrassingly parallel execution can address.

Dealing with flexibility is a primary concern in protein docking. An efficient docking tool may have to deal with both side-chain and backbone flexibility. Proper employment of refinement techniques is crucial to get near-native structures for flexible targets. These refinement techniques, in their effort to reduce the physical energy, give better structures. Nevertheless, assuming flexibility to targets may not always result in good structure prediction as a few proteins prefer to be in a rigid state, exhibiting limited segmental flexibility [180, 181]. Thus information about the characteristics of input has an important role. The current trend in docking techniques is towards integrative modeling, which combines information from different sources. X-ray crystallography, NMR spectroscopy, Electron microscopy, footprinting, chemical crosslinking, FRET spectroscopy, SAXS, and proteomics are counted as data sources for such techniques.

The participants in CAPRI opted for ab initio methods only when there were no available templates. The current trend shows that template-based methods are gaining popularity. With the addition of more and more complex structures, such methods could perform better.

8.1 Challenges and Future in Deep Learning Techniques

The transition from ML to DL was gradual and was boosted by technological advancements in GPUs. Backed by high computational power, DL models extract features from the training data, provided a large quantity of data is supplied. In practice, however, the claim that DL methods extract all features automatically is somewhat overstated: the input is not always raw data, and a preprocessing step often precedes a deep learning model. The max pooling layer is one instance of a hard-coded feature. Nevertheless, DL models do require fewer hand-crafted features than ML models.

The reduced number of known protein complex structures also forces model designs to rely on handcrafted features. A major hurdle in using deep learning-based approaches for protein docking, or any other protein-based application, is data representation. It is challenging to effectively represent both the positional information of the atoms and the interactions between them. Furthermore, representing the positional details of a protein on a 3-dimensional grid is difficult because of its sparsity, and the uneven distribution of atoms adds to the trouble of selecting a representation scheme. Sparse convolutions offer a solution to the sparsity problem. Another possibility is a graph representation, which graph networks can process. Though graphs capture spatial proximity, they lack implicit information on the location and orientation of atoms/residues, which can instead be added as explicit node features. A novice designer can take the assistance of encoder networks to obtain proper representation schemes: they not only return a reduced representation but also discard less critical features. When dealing only with sequence data, BERT and other transformers are also helpful for finding input embeddings.
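One simple way to obtain such a graph representation, shown here only as a sketch, is to connect residues whose Cα atoms lie within a distance cut-off; the 8 Å threshold and the random coordinates are illustrative assumptions.

```python
import numpy as np

def contact_graph(ca_coords, cutoff=8.0):
    """Adjacency matrix connecting residues whose C-alpha atoms are within `cutoff` Angstrom."""
    coords = np.asarray(ca_coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    adj = (dist < cutoff).astype(float)
    np.fill_diagonal(adj, 0.0)               # no self-loops
    return adj

# Illustrative example: 50 residues with random 3D coordinates
coords = np.random.rand(50, 3) * 40.0
adj = contact_graph(coords)
node_features = coords                        # geometry attached explicitly as node features
```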

As with any deep learning model, overfitting and underfitting can undermine performance. The model may learn the training data in every detail, including the noise, making it unfit for predicting new data; conversely, it may fail to capture the underlying rule of the data. Proper training with adequate data can solve this issue, and the distribution of the data matters here: the training data must cover the relevant data distributions with sufficient size.

Finding suitable parameters is essential for model performance, and the tuning of hyperparameters is equally important. For instance, varying the learning rate during the training phase can improve the learning of the model, while the available computational capability may constrain the selection of batch size.

Deep learning techniques are computationally demanding due to the need to process huge volumes of data and the depth of the networks. Deeper networks may use a huge number of parameters and thus may be stopped by computational limitations. Furthermore, compared to the rapid evolution of algorithmic approaches, hardware improvements lag behind, limiting model performance. Researchers observe that only computationally efficient methods can be sustained in the future as computational limits are fast approaching [182].

A promising deep learning model for protein–protein complex structure prediction is the GAN. The generator and discriminator networks in a GAN ideally work against each other, so it is crucial to find the correct balance between them, and, as with any other model, the selection of the loss function demands utmost care. Since a GAN is designed for data generation, further augmenting the training data may negatively affect the results. Another possibility is tensor field networks, which are helpful when dealing with point cloud data that demands transformation equivariance; this property avoids the need for data augmentation.

Protein complex structure prediction from sequence data is also worth mentioning. Discussions about quaternary structure prediction followed the much-celebrated success of AlphaFold [183] in tertiary structure prediction. The transformer is one of the successful neural network models capable of dealing with sequential data. The crux of the transformer lies in its self-attention module, which computes how strongly information should be routed from one token to another. The transformer suits protein-related tasks well because they depend on the interactions between the amino acids in the sequence. However, it considers the relations between every pair of tokens in the input, which may not be desirable for protein sequences where only local attention is needed; in addition, a fully connected graph over long protein sequences may overburden the computational resources. The recently proposed graph transformers appear to be a solution to this problem: they apply local attention to the input tokens, i.e., they examine only the relationships between connected nodes, making them suitable for interaction prediction tasks, and the idea can be further extended to structure prediction. Unlike RNNs and other sequential models, the transformer performs better because it can be parallelized and because the attention module makes the information flow more concrete.
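A minimal sketch of the scaled dot-product self-attention underlying these models (single head, no masking); the dimensions and random projections are illustrative assumptions.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence of token embeddings."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv                        # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])                 # how strongly token i attends to token j
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # softmax over tokens
    return weights @ v                                      # information routed between tokens

# Toy sequence of 6 residues embedded in 8 dimensions
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
```

A graph-transformer variant would restrict attention to connected nodes, for example by masking the score matrix with a residue contact graph before the softmax.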

The application of deep learning to the structure prediction of protein complexes has other challenges to overcome too. First, training a model for this purpose requires voluminous data; the number of complex structures available in different databases amounts to only a few thousand, which may not be sufficient for a model to learn the rules. Second, any learning procedure has to deal with the data imbalance problem. Third, since a model cannot afford to learn from low-quality data, there is a stringent need for high-quality experimental data. A generative model may be a pathbreaking solution to this problem.

There is immense scope for the application of deep learning techniques to protein-related tasks. Interaction prediction of proteins, interfacial region identification, classification of interfacial residue pairs, inter-protein contact map prediction, implementation of scoring functions, and generation of conformational spaces are actively probed by the research community. In addition, attempts to generate protein complex structures are under development.

9 Conclusion

The study of protein interactions is crucial to understanding the working of biological systems. Computational techniques have overtaken experimental methods in popularity but need much refinement before the former can replace the latter. Conventional techniques for protein docking can work with smaller volumes of data; these methods analyze the physicochemical and geometric properties of proteins to predict probable near-native structures. The addition of more and more protein–protein complex structures to different databases favors template-based docking methods, which are more efficient than ab initio methods provided proper templates are available. Experiments on integrating available knowledge into structure prediction are also increasing owing to the improved performance this brings. The latest trend in problem-solving is the application of deep learning techniques, which demand a vast amount of training data. A full-fledged DL model for protein–protein docking can be expected soon, as both the algorithmic techniques and the size of protein–protein complex datasets are growing.