1 Introduction

Proteins play an essential role in the biological processes of all living organisms. They are the essential building blocks of living organisms and play a significant role in cell development; for example, they carry oxygen through the blood vessels. Analyzing the stability of proteins depends on protein structure prediction, and this information is later used to generate protein secondary and tertiary structures.

Structural bioinformatics studies problems associated with biological systems, such as protein secondary structure prediction and tertiary structure prediction, which are among the most significant problems in the field [1, 2]. These problems have driven the development of new scientific and computational methods.

The identification of protein function depends on how accurately the tertiary structure of a given protein is folded. A misfolded tertiary structure affects drug design, energy drinks, and vaccine design [3, 4]. The formation of the protein tertiary structure depends on the chemical and physical properties of the amino acid sequence and its polypeptide chain. All these properties are encoded in the building blocks of proteins. If a protein folds incorrectly, which is called misfolding, the result is protein breakdown and an incorrect structure. Misfolded proteins are associated with diseases such as diabetes and Alzheimer’s disease [5,6,7].

Protein Structure Prediction (PSP) is a challenging task in bioinformatics because the number of known amino acid sequences exceeds the number of determined tertiary structures. Experimental methods such as X-ray crystallography and nuclear magnetic resonance (NMR) can determine protein structures effectively, but they are expensive and time-consuming and can take months to resolve a single structure. This creates an immense gap between the number of amino acid sequences and the number of identified protein 3-D structures. To close this gap, computational techniques must be used for protein structure determination [8,9,10,11].

In the past, several computational approaches were proposed as solutions to the PSP problem [2]. Based on how they use structural information from the Protein Data Bank, they fall into four groups: (a) first-principle approaches with database knowledge, (b) ab initio or first-principle approaches without database knowledge, (c) comparative modeling approaches, and (d) fold recognition approaches. Each of these methodologies has limitations. Group (b) can generate new protein structures with new folds, but even for a short amino acid sequence the search space is complex and high-dimensional. Group (c) can easily predict a protein structure only when the sequence is similar to a known amino acid sequence with a known structure. Group (d) bases its predictions on the collection of folds available in the Protein Data Bank. Group (a) methods, in contrast, perform well in the Critical Assessment of protein Structure Prediction [12]. Because of this performance, the author has integrated group (a) techniques with machine learning (ML) concepts to improve the protein structure prediction method.

The rest of the paper is structured as follows: Section 2 gives a brief background on protein structure representation, the energy function, and protein templates. Section 3 discusses the related work. Section 4 describes the proposed work. The results and validation of the proposed model are discussed in Section 5. Finally, Section 6 presents the conclusion.

2 Background

To support the rest of the paper, this section gives the background on protein structure, amino acid residues, the energy function, and protein templates.

2.1 Protein structure and amino acid residues

The computed tertiary structure depends not only on the amino acid sequence but also on various other parameters. Solvent, temperature, and many other biological parameters influence the relations among the amino acids in a sequence. These features help in identifying the native protein structure, and the more features are available, the higher the accuracy. The dihedral angle [13] can be used to define the tertiary structure of a given sequence. This representation is possible because the bond lengths along the peptide chain are almost constant.

A protein is a chain of amino acid residues linked by peptide bonds. Every amino acid is made up of an amino group, a carboxyl group, and a side chain with specific physicochemical properties, all attached to a central carbon atom. When the carboxyl group of one residue reacts with the amino group of the next, a peptide bond is formed and a water molecule is released. The torsion angles [14] phi (ϕ) and psi (ψ) of each amino acid can be used to define the polypeptide and protein backbone. These torsion angles can be visualized for a protein structure with the Ramachandran plot [15]. Steric hindrance restricts the rotation around the ϕ and ψ angles. The side chains of a polypeptide have their own dihedral angles (chi angles, χ).
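As an illustration of how a torsion angle such as ϕ or ψ is obtained from atomic coordinates, the following minimal sketch computes the signed dihedral angle defined by four points. The function name `dihedral` and the toy coordinates are hypothetical and are not taken from the paper; for ϕ the four points would be the backbone atoms C(i-1), N(i), Cα(i), C(i), and for ψ they would be N(i), Cα(i), C(i), N(i+1).

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) around the p1-p2 bond."""
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)          # normal of the plane (p0, p1, p2)
    n2 = _cross(b1, b2)          # normal of the plane (p1, p2, p3)
    b1_len = math.sqrt(_dot(b1, b1))
    y = b1_len * _dot(b0, n2)    # sine-like term
    x = _dot(n1, n2)             # cosine-like term
    return math.degrees(math.atan2(y, x))


# Toy coordinates (hypothetical): the central bond lies along the z-axis.
angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

Using `atan2` rather than `acos` keeps the sign of the angle, which is needed to place a residue in the correct region of the Ramachandran plot.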

2.2 Energy function

To minimize an energy function, a protein structure prediction method must change the orientation of the atoms in the protein structure [16]. Different energy functions can be used to assess the quality of the predicted structure. Here the author has incorporated the Rosetta energy function [17]. The Rosetta scoring function contains more than eighteen energy terms, including inter-atomic interactions, hydrogen bonds, Newtonian physics terms, and knowledge-based potentials. According to the Critical Assessment of protein Structure Prediction (CASP), methods based on the Rosetta energy function perform better than those based on other energy functions [18].

2.3 Protein templates

A protein template from the Protein Data Bank (PDB) can be used to reduce the high dimensionality and difficulty of the search space formed during the ab initio process, and this knowledge can then be used to design conformations. In the proposed work, the author has used the Central-Residue-Fragment based technique (CREF) to extract protein templates from the PDB. CREF [19, 20] predicts the protein template from the phi (ϕ) and psi (ψ) torsion angles of amino acid chains acquired from the PDB. Such techniques work well on small fragments. A clustering method is used to identify the templates, and each cluster is then labeled with the conformational state given by the corresponding region of the Ramachandran plot. Finally, the clustering result is used to build conformations through a mapping function.

3 Related work

Li et al. [21] proposed a bio-inspired algorithm that combines an artificial bee colony (ABC) with pigeon-inspired optimization (PIO). The protein 3D structure is computed by hybridizing the two algorithms, and a Cauchy perturbation is then used to improve the local fitness. The experiment was performed on ten short protein sequences. The Backtracking Search Optimization (BSO) and Tabu Search (TS) algorithm [22] predicts the protein 3D structure by combining BSO with TS, where TS is used to escape local optima. The prediction can be further improved by incorporating a path-relinking strategy. In [23], the authors showed that their ResNet model predicts twenty-six of thirty-two target protein folds, whereas without ResNet only eighteen folds are predicted.

In [24], the author used a modified artificial bee colony (Mod-ABC) with PDB structure knowledge to predict the protein tertiary structure. The predicted solution is further refined by a crossover operation between two target solutions. RMSD, distance, and energy functions are then used to analyze all eight protein sequences. In [25], the author used a memetic algorithm with structure knowledge to predict the protein 3D structure from amino acid sequences. During computation, the search space is reduced by including an angle probability list. The accuracy can be further improved by integrating ML techniques with a path-relinking strategy.

Yousef et al. [26] used a hybrid model for tertiary structure prediction that combines a genetic algorithm with an energy function. During the crossover process, however, the computed structure can change because of steric hindrance; this can be checked by energy minimization. The results show that the hybrid model failed to compute the side-chain torsion angles. The prediction can be further improved by including PyRosetta with structure knowledge.

RMSD plays an essential role in protein structure prediction. In [27], the author proposed a method to decrease the RMSD error by combining a feed-forward neural network with an adaptive neuro-fuzzy approach. The computed result is then used in structure prediction. The prediction can be further improved by altering the number of linguistic variables.

In [28], the authors combined a neural network with a particle swarm optimization algorithm to predict the protein 3D structure from amino acid sequences. They extracted features from the protein sequences using three descriptors: hydrophobicity, composition, and amino acid frequency. The results show that the system classifies only the alpha and beta classes and fails to classify the alpha + beta and alpha/beta classes. To overcome this drawback, the authors suggested building a tree classifier at the initial stage.

4 Materials and methods

This section describes the implementation of the proposed MACO-PR algorithm. The steps followed in the proposed approach to predict the protein tertiary structure are as follows:

Step 1: Extraction of amino acid sequences from PDB.

The above steps are shown in Fig. 1. First, amino acid sequences are extracted from the PDB [29]. Next, the extracted amino acid sequence and template are supplied as input to the proposed MACO-PR model. Finally, the predicted protein tertiary structure is validated using RMSD.

Fig. 1 Proposed MACO-PR architecture for protein tertiary structure prediction

In this work, the author developed an enhanced MACO for protein tertiary structure prediction. Here MACO [30] is integrated with the path-relinking (PR) [31] strategy and structure knowledge (SK). Because MACO can become trapped in local minima of the torsion angles, the author has integrated PR with MACO; the resulting method is named MACO-PR. To validate the proposed MACO-PR model, the author tested it on six protein sequences and compared it with basic ACO, Support Vector Machine (SVM) [32], and Neural Network (NN) [33]. ACO goes through three main stages: the construction stage, the local search stage, and the pheromone update stage.

  • Construction stage: Each labor ant starts at a random position in the amino acid sequence and builds a conformation of the given sequence, producing a candidate solution.

  • Local search stage: During this stage, the conformations are further optimized.

  • Pheromone update stage: Based on the conformation energy and the result of the local search stage, each labor ant updates the pheromone matrix.

The working steps of a generic ACO algorithm [30] are illustrated below:

  Step 1: Initialize the pheromone trail values to nil.

  Step 2: Generate a candidate conformation.

  Step 3: Initialize the labor ant search.

  Step 4: Update the pheromone values.

  Step 5: Compute the best path.
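The generic ACO steps can be sketched in a few lines of code. The sketch below is only an illustration under stated assumptions and is not the author's MACO-PR implementation: each sequence position takes one of a few discrete states (for example, torsion-angle bins), the energy is a toy function rather than the Rosetta energy, the local search stage is omitted, and the pheromone trails start at a uniform value of 1.0 so that the sampling weights are well defined. All names (`aco_minimize`, `toy_energy`, `evaporation`) are hypothetical.

```python
import random


def aco_minimize(energy, n_positions, n_states, n_ants=10, n_iters=30,
                 evaporation=0.1, seed=0):
    """Toy ACO: choose one state per position so that `energy` is minimal."""
    rng = random.Random(seed)
    # Step 1: initialize the pheromone trails (uniform value, an assumption).
    pheromone = [[1.0] * n_states for _ in range(n_positions)]
    best, best_e = None, float("inf")
    for _ in range(n_iters):
        # Steps 2-3: every ant samples a candidate conformation,
        # position by position, with probability proportional to pheromone.
        ants = []
        for _ in range(n_ants):
            cand = [rng.choices(range(n_states), weights=pheromone[i])[0]
                    for i in range(n_positions)]
            ants.append((energy(cand), cand))
        # Step 4: evaporate, then reinforce the best candidate of this round.
        it_e, it_best = min(ants)
        for row in pheromone:
            for s in range(n_states):
                row[s] *= 1.0 - evaporation
        for i, s in enumerate(it_best):
            pheromone[i][s] += 1.0
        # Step 5: keep the best path seen so far.
        if it_e < best_e:
            best_e, best = it_e, it_best
    return best, best_e


# Toy energy (hypothetical): number of positions that differ from a target.
target = [0, 2, 1, 0, 1, 2]


def toy_energy(conf):
    return sum(1 for a, b in zip(conf, target) if a != b)


best_conf, best_energy = aco_minimize(toy_energy, n_positions=6, n_states=3)
```

Reinforcing only the best candidate of each iteration is one common design choice; depositing pheromone for every ant in proportion to its energy is an equally valid variant.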

During the path construction stage of the proposed MACO-PR model, each labor ant starts at a random position within the amino acid sequence. Residues are then added one at a time as the sequence is folded in either direction, left or right. A path is generated in this way for each labor ant, resulting in a protein conformation. The sequence position is updated for a given position ‘i’ based on the pheromone values and the direction. The total number of ants is less than or equal to the number of nodes. Each labor ant starts from a random position [34] in search of the final node (ni) with the resource (Rj). The computation is carried out as shown in Eq. (1).

$$PF_{ij} = \frac{\left( T_{ij} \right)^{\alpha} \left( H_{ij} \right)^{\beta}}{\sum \left( T_{ij} \right)^{\alpha} \left( H_{ij} \right)^{\beta}}$$
(1)

Here, ‘Tij’ denotes the pheromone value corresponding to node (ni) and resource (Rj), and ‘Hij’ denotes the corresponding heuristic value. The parameters ‘α’ and ‘β’ weight the relative influence of the pheromone and heuristic values, respectively.
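A small worked example may help in reading Eq. (1). The sketch below assumes that α and β act as exponents on the pheromone and heuristic values, as is standard in ACO, and that the sum in the denominator runs over all candidate choices available to the ant; both points are assumptions, and the function name and the numbers are hypothetical.

```python
def selection_probabilities(pheromone, heuristic, alpha=1.0, beta=2.0):
    """Probability of each candidate: (T^alpha * H^beta) / sum over candidates."""
    weights = [(t ** alpha) * (h ** beta)
               for t, h in zip(pheromone, heuristic)]
    total = sum(weights)
    if total == 0:
        raise ValueError("all candidate weights are zero")
    return [w / total for w in weights]


# Three candidate moves with hypothetical pheromone and heuristic values.
# Weights are 1*1, 1*4 and 2*1, so the probabilities are 1/7, 4/7 and 2/7.
probs = selection_probabilities([1.0, 1.0, 2.0], [1.0, 2.0, 1.0])
```

With β larger than α, the heuristic value dominates the choice, which is why the second candidate is preferred even though the third carries more pheromone.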

PR [35] is a method for exploring the trajectories that connect good solutions generated by heuristic methods. It can be considered an enhanced version of scatter search. PR takes two or more high-quality solutions generated by the original search, builds a path between them, and can also explore beyond these solutions in the search space.
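The idea of path relinking can be sketched as follows. This is a minimal greedy variant under stated assumptions, not the procedure of Algorithm 1: solutions are lists of discrete states, the path moves from a starting solution toward a guiding solution by copying one differing position per step, and the energy function is a toy mismatch count. The names `path_relink` and `mismatch_energy` are hypothetical.

```python
def path_relink(start, guide, energy):
    """Walk from `start` to `guide`, returning the best solution on the path."""
    current = list(start)
    best, best_e = list(current), energy(current)
    # Positions where the two solutions still differ.
    diff = [i for i in range(len(current)) if current[i] != guide[i]]
    while diff:
        def energy_after(i):
            trial = list(current)
            trial[i] = guide[i]
            return energy(trial)

        # Greedy step: copy the guiding value that gives the lowest energy.
        i = min(diff, key=energy_after)
        current[i] = guide[i]
        diff.remove(i)
        e = energy(current)
        if e < best_e:
            best, best_e = list(current), e
    return best, best_e


# Toy energy (hypothetical): mismatches against an unknown optimum.
optimum = [0, 1, 2, 0]


def mismatch_energy(sol):
    return sum(1 for a, b in zip(sol, optimum) if a != b)


# Both end points have energy 2, yet the path between them passes
# through a solution with energy 0.
relinked, relinked_e = path_relink([0, 1, 0, 1], [1, 0, 2, 0], mismatch_energy)
```

The example shows why relinking helps against local minima: an intermediate solution on the path can be better than both of the solutions it connects.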

Algorithm 1 Proposed MACO-PR prediction procedure

Algorithm 1 illustrates the proposed MACO-PR prediction procedure. Initially, the amino acid sequence and the energy function are supplied as input to the proposed model. The pheromone path and the optimal path value are then set to null, as shown in lines 4 and 5. Line 6 defines the sets X and Y, which contain the optimal paths. Next, lines 7–11 compute the best path for sets X and Y using Eq. (1). Lines 12–15 compute and compare the best paths, and the best path is stored in the ‘opt_path’ variable. The path-relinking method is then applied to compare the best paths, as shown in line 16. Path relinking is applied to the latest results OPi and OPi+1, which are randomly selected from Y. It starts with OPi for i = 1, ..., n, and at each stage OPi is replaced by OPi+1, so that all solutions are visited through this process. Here, path relinking is used to reduce the local minima that arise during torsion angle adjustment. Finally, line 18 returns the best path.

5 Results

The proposed work was used to predict the protein tertiary structure. Six protein sequences were used: 2P5K, 1AIL, 1AB1, 1L2Y, 2MTW, and 1WQC. The author has enhanced the ACO algorithm by incorporating a path-relinking strategy for the effective retrieval of the protein tertiary structure. The lowest potential energy was then used in the structural analysis. Furthermore, Root Mean Square Deviation (RMSD) values were computed using PyRosetta [36], as shown in Eq. (2).

$$RMSD\left( x,y \right) = \sqrt{\frac{1}{n}\sum\nolimits_{i = 1}^{n} \left\| r_{xi} - r_{yi} \right\|^{2}}$$
(2)

Here, rxi and ryi are the position vectors of the i-th atom in structures x and y, respectively, and n is the number of atoms. Tables 1 and 2 show the average energy and RMSD values computed for the proposed MACO-PR algorithm and the basic ACO algorithm. In Figs. 2 and 3, the x-axis represents the PDB ID, while the y-axis represents the computed average energy (negative values) or the RMSD (positive values).
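Eq. (2) can be written directly in code. The sketch below assumes that the two structures contain the same atoms in the same order and that they have already been superimposed; it does not perform any alignment and it is not the PyRosetta routine used in the paper. The function name `rmsd` and the coordinates are hypothetical.

```python
import math


def rmsd(coords_x, coords_y):
    """Root mean square deviation between two equally long coordinate lists."""
    if len(coords_x) != len(coords_y) or not coords_x:
        raise ValueError("structures must have the same, non-zero atom count")
    # Sum of squared distances between corresponding atoms.
    total = sum(sum((a - b) ** 2 for a, b in zip(p, q))
                for p, q in zip(coords_x, coords_y))
    return math.sqrt(total / len(coords_x))


# Two atoms; only the second one is displaced, by 2 Å along z.
# The squared distances are 0 and 4, so RMSD = sqrt(4 / 2) = sqrt(2).
value = rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
             [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)])
```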

Table 1 Average energy (kcal/mol) values computed through the proposed MACO-PR algorithm and the ACO algorithm

Fig. 2 Average energy (kcal/mol) values computed through the proposed MACO-PR algorithm and the ACO algorithm

For each of the six protein sequences, the minimum energy is computed using the proposed MACO-PR algorithm, and the result is compared with that of ACO. 2MTW yields the lowest energy, −17.45 kcal/mol using MACO-PR compared with −16.83 kcal/mol using ACO. Next, 2P5K and 1AIL yield −14.5 kcal/mol and −13.0 kcal/mol using MACO-PR, compared with −12.01 kcal/mol and −12.55 kcal/mol using ACO. Similarly, 1AB1 yields an average energy of −16.5 kcal/mol using MACO-PR and −14.22 kcal/mol using ACO. The results show that four of the six PDB IDs reach their lowest energy with the proposed MACO-PR model, whereas the remaining two reach their lowest energy with ACO, as shown in Fig. 2.

Table 2 RMSD (Å) values computed through the proposed MACO-PR algorithm, ACO, SVM [32] and NN [33]

Figure 3 presents the RMSD values calculated using MACO-PR, ACO, SVM [32], and NN [33]. Here, 2MTW produces the lowest RMSD value, 4.16 Å using the proposed MACO-PR compared with 5.61 Å using ACO and 6.5 Å using SVM. Similarly, 1L2Y yields an RMSD of 5.16 Å with MACO-PR, whereas ACO and SVM yield 6.51 Å and 6.1 Å. Further, 2P5K and 1AIL yield RMSD values of 6.81 Å and 6.85 Å with MACO-PR, compared with 8.55 Å and 6.85 Å using ACO and 8.2 Å and 7 Å using SVM. Finally, 1WQC produces an RMSD of 5.84 Å with MACO-PR and 8.00 Å with ACO. The results show that, of the six PDB IDs, 2P5K, 1AIL, 1L2Y, 2MTW, and 1WQC perform well with the proposed MACO-PR model, while ACO gives a good result only for 1AB1; the same holds for SVM.

Fig. 3 RMSD (Å) values computed through the proposed MACO-PR, ACO, SVM and NN algorithms

6 Conclusion and future scope

The study of proteins and their structure is one of the key research areas in computational bioinformatics [1, 37, 38]. In this work, the author presents two different approaches to protein tertiary structure prediction. In the first, basic ACO, SVM, and NN are used, whereas in the second the author has integrated MACO with the PR strategy and SK. In the proposed work, amino acid sequences are extracted from the PDB. Next, the extracted amino acid sequence and template are supplied as input to the proposed MACO-PR model. Here PR is used to overcome the local minima of the torsion angles generated during structure computation, while SK is incorporated to handle the complexity of the search space.

The average energy measured using the proposed MACO-PR model is −14.5 kcal/mol for 2P5K, −16.5 kcal/mol for 1AB1, −13.0 kcal/mol for 1AIL, and −17.45 kcal/mol for 2MTW, compared with −12.01 kcal/mol, −14.22 kcal/mol, −12.55 kcal/mol, and −16.83 kcal/mol using basic ACO. Similarly, the RMSD values computed using the proposed MACO-PR model are 6.81 Å for 2P5K, 6.85 Å for 1AIL, 5.16 Å for 1L2Y, 4.16 Å for 2MTW, and 5.84 Å for 1WQC, whereas ACO gives 8.55 Å for 2P5K, 6.85 Å for 1AIL, 6.51 Å for 1L2Y, 5.61 Å for 2MTW, and 8.00 Å for 1WQC.

The above average energy values indicate that the proposed MACO-PR model performs well for the PDB IDs 2P5K, 1AB1, 1AIL, and 2MTW but not for the remaining two, 1L2Y and 1WQC. In terms of RMSD, MACO-PR performs well for 2P5K, 1AIL, 1L2Y, 2MTW, and 1WQC but fails to predict 1AB1 well. Overall, the results illustrate that MACO-PR outperforms basic ACO, SVM, and NN with respect to the average energy and RMSD measures. As part of future work, the model can incorporate deep learning techniques for more timely and accurate prediction.