Introduction

The distinctive three-dimensional structure of polypeptides gives proteins their characteristic folded forms and underlies their functional properties. The building blocks of proteins are amino acids: amino acid residues, the fundamental units of the polymer chain, are joined by peptide bonds [1]. A protein’s exact order of amino acids can be read from the DNA sequence, since gene expression follows a predetermined sequence encoded in DNA. A gene is a discrete DNA sequence; these loci, located on chromosomes, hold the blueprints for gene products such as RNA molecules and proteins [2]. Discovering and classifying protein structures and functions remains tremendously challenging in bioinformatics, and conventional laboratory processes cannot keep pace with the volume of sequence data now generated. Meticulously classifying proteins into families and subfamilies helps researchers understand their functions in living organisms [3] and makes this problem more tractable.

In traditional machine learning methodologies, protein molecule classification relies on feature extraction, and the effectiveness of manually engineered features depends heavily on the selection technique. Protein function prediction is now commonly achieved using artificial neural networks (ANNs), particularly deep neural networks (DNNs). DNNs, which comprise multiple hidden layers, progressively transform the initial inputs as they pass through the network’s levels. Conventional machine-learning approaches are laborious, time-consuming, and resource-intensive [4], largely because of the exponential growth in the number of unique protein sequences.

Fig. 1

Various protein structures (Primary, Secondary, Tertiary and Quaternary)

Many computational methods have been created to classify proteins and predict the biological functions they fulfil. Figure 1 illustrates the primary, secondary, tertiary, and quaternary levels of protein structure, a visual aid that helps clarify the complicated world of proteins and the groups they belong to [5].

Careful feature selection is essential for classifying proteins. The protein’s surface, the arrangement of its amino acid chains, and its functionalities all contribute to its unique qualities. Classification techniques seek to categorize structural protein molecules efficiently, enabling a deeper understanding of evolutionary changes and their temporal relationships. Statistical analysis techniques have enabled advances in identifying structural stability groups from sequence details [6].

DNN techniques have been widely embraced in modern scientific research, particularly in biomedical studies, to leverage recent advances in computational power. These techniques have been shown to outperform conventional bioinformatics methods in terms of effectiveness. Two areas in which they are commonly employed are the analysis of visual data and the application of machine learning to the processing of natural language. Single-task DNNs, which are responsible for making binary predictions, and multi-task DNNs, which can classify input data into multiple pre-defined classes, are the two categories used to classify DNNs [7]. Diverse classifications of DNNs are available to accomplish distinct functions in protein data modelling and analysis. The neural network architectures that have been discussed include convolutional neural networks, feed-forward neural networks, auto-encoder deep neural networks, deep belief networks, recurrent neural networks, restricted Boltzmann machines, and graph convolutional networks. Developing improved methods for accurately categorizing protein sequences in proteomics research is the main goal of this work [8].

This study tackles two key research questions: (1) overcoming difficulties in protein sequence classification approaches and (2) evaluating feature extraction methods against existing CNN, LSTM, BiLSTM, and ProtICNN-BiLSTM models [9]. Raising the precision of protein sequence classification is necessary for advancing medical research and biological analysis. Within the scope of this investigation, an improved model, ProtICNN-BiLSTM, is presented, pairing attention-based Improved Convolutional Neural Networks with Bidirectional Long Short-Term Memory units. The model uses the Bayesian Optimisation technique to tune its hyperparameters and captures both the local and global interactions within protein sequences, improving the accuracy of protein sequence categorization. The model’s strong performance in comparative experiments underscores its potential impact on medical and biological research [10].

The article is structured as follows: the Related Work section reviews previous research in detail. The Materials and Methods section presents a comprehensive description of the methods and materials used in the investigation. The Experimental Results and Discussion section summarizes the findings and evaluates their significance. Finally, the Conclusion and Future Directions section concludes the study and outlines potential future research areas.

Related work

This section reviews recent deep learning algorithms for protein structure analysis and classification, including hybrid models and attention mechanisms. Paper [1] compared cutting-edge deep-learning approaches for protein sequence generation using several indicators, exploring strategies for synthesizing novel protein sequences from diverse source sequences. The study’s main contribution is a comparison of each approach’s strengths and weaknesses, grounded in a diversified protein sequence library that ensured complete coverage. Some approaches succeeded; however, the evaluation criteria and methodologies were limited, so some protein sequence design concerns may have been missed.

A deep learning-based method for predicting snake toxin proteins using word embeddings is proposed in [2]. The model, Deep-STP, learned to link sequence patterns with toxin characteristics from a collection of annotated snake toxin proteins and proved highly accurate in predicting these proteins. Notwithstanding these encouraging findings, the model’s performance may be affected by the variety and quality of the training dataset as well as particular architectural decisions.

In [3], the authors review the methods and tools available for determining the locations of lysine malonylation sites in protein sequences, with machine learning and deep learning central to these resources. The reviewed algorithms were tested on how well they predicted site positions in published protein sequences. The review’s most essential contribution is an evaluation of the latest methods and tools, highlighting their benefits and drawbacks, with lysine malonylation sites determined for every protein sequence in the collection. Despite finding some promising solutions, the research could not evaluate their generalizability because of issues with annotated data quality and accessibility.

The MaTPIP technique was introduced in [4]. It uses a deep learning architecture with eXplainable artificial intelligence (XAI) to predict sequence-driven, feature-mixed protein-protein interactions.

The algorithm learned from protein sequences and their interactions, using sequence characteristics and other inputs to predict interactions. MaTPIP’s main contribution is its explanation-based AI component, which aids the prediction of protein interactions. The investigation reported good interaction prediction accuracy, although protein interaction complexity and the availability of annotated data may affect model efficiency. Deep learning has also been applied to predict protein-protein interaction sites [5]. The authors showed that their machine-learning model successfully predicted interaction sites relevant to health systems, acknowledging machine learning as the primary component behind this accomplishment. Trained on a set of protein sequences and their interaction sites, the program predicts these sites from sequence information alone. Practical contact site prediction is a significant improvement, and the approach may benefit healthcare systems. The experiments identified exact protein-protein interaction sites, although the study noted several factors that may alter the model’s efficacy.

These variables include the degree of interaction complexity and the availability of annotated data. Six deadly RNA viruses are studied in [6], where feature-engineered protein patterns classify the viruses into several types: human respiratory virus type 3; influenza A, B, and C; HIV-1; and SARS-CoV-2. Linear-complexity measures were used for data classification and analysis, providing reliable categorization, especially for large data sets. The model achieved an impressive average success rate of 99.71% in identifying the six classes in the data set, comparable to a published method and implemented with a high degree of precision, while the SARS-CoV-2 versus HIV-1 binary classification reached a success rate of 99.85%. Additionally, a convolutional neural network (CNN) and a gated recurrent unit (GRU) can be combined with a long short-term memory (LSTM) network to locate DNA-binding proteins.

This is one of several innovative approaches developed in recent years; the combination is referred to as CNN-BiLG. CNN-BiLG collects more data and analyzes the contextual relationships within protein sequences more thoroughly than previous approaches.

This stems from CNN-BiLG’s enhanced capability to capture detailed information. Compared with other deep learning and machine learning predictors, CNN-BiLG demonstrated superior performance in the reported studies, with a validation accuracy of 95%.

According to many investigations, the suggested model outperforms previous methods in efficiency, cost-effectiveness, and classification accuracy.

NLP-based text categorization methods are widely used to classify protein sequences [8]. Deep learning and word embedding have improved text categorization, increasing protein categorization accuracy and opening new options. Word-embedded protein sequence representations nonetheless face challenges in natural language processing (NLP), because amino acid sequences lack the natural word boundaries of ordinary text.

The longer sequences and smaller alphabet of protein data bring additional difficulties for learning models. Pre-training is one established way to boost the effectiveness of machine learning techniques: although initially proposed for computer vision applications [9], it is now used extensively across machine learning, including language-related tasks.

Research shows that pre-trained models offer strong generalization and fast convergence on tasks with limited training data, and pre-training methods like BERT and ELMo remain important despite their computational demands. In data-driven Graph Convolution Networks (GCNs), hidden cell connections are time-delayed; by operating on biological sequences one element at a time, GCNs build memories and integrate knowledge sequentially.

Techniques such as SeqVec and ProtTrans [10] use language models and transformer frameworks to represent protein sequences as embedding vectors, contributing to the understanding of the biophysical characteristics of proteins. Because protein sequence-based classification tasks share pattern characteristics, pre-training can leverage comprehensive labelled datasets and transfer knowledge to smaller data problems, in contrast to deep learning models from natural language processing contexts, which require significant computing power.

Many types of biological data [11] can be used to predict protein functions, including sequences, three-dimensional structures, folding information, protein-protein interactions, variations in gene expression, amino acid families, and their integration. Statistically grounded classification of such data has been accomplished with techniques such as decision trees, Support Vector Machines (SVM) [12], and Neural Networks (NN) [13]. Studies that applied SVM after feature extraction from protein sequences demonstrated its potential for protein classification. Deep learning methods have shown promise in investigations focusing on relatively small groups of proteins and functional categories. Although these methods have not been extensively explored in large-scale protein function prediction pipelines [14], DNN architectures, both single-task and multi-task, have been trained to predict protein functions using various protein characteristics. Table 1 presents a comparison of existing research in protein sequence analysis.

Table 1 Comparison of various existing research in protein sequence analysis

Materials and methods

This section begins by introducing the datasets employed in the model’s development. Subsequently, it elucidates the conceptual framework and testing methodologies. Finally, the model algorithm used in the demonstration is presented.

Proposed hybrid model

ProtICNN-BiLSTM is a proposed hybrid model that combines Bidirectional Long Short-Term Memory (BiLSTM) units with Improved Convolutional Neural Networks (ICNN) and uses Bayesian Optimization [23, 24] to tune the model parameters. Drawing on the strengths of the ICNN and BiLSTM frameworks, the ProtICNN-BiLSTM method efficiently captures both local and global interdependencies in protein sequences, and Bayesian optimization of the hyperparameters yields a further increase in model performance. The operation of every component is explained in depth in this section, together with the pertinent equations [25]. The proposed hybrid model ProtICNN-BiLSTM is architecturally illustrated in Fig. 2. After the input layer receives the protein sequence data, it is sent through a convolutional layer with 64 filters, a 3 × 3 kernel size, a stride of 1, and “same” padding, followed by batch normalization and ReLU activation. Another convolutional layer with 128 filters, a 3 × 3 kernel size, a stride of 1, and “same” padding processes this output, again followed by batch normalization [26].

Fig. 2

Architecture of proposed hybrid model (ProtICNN-BiLSTM)

ICNN model

The ICNN part of the ProtICNN-BiLSTM method is designed to extract local properties from the protein sequences. Convolutional Neural Networks efficiently capture spatial hierarchies in data through convolutional operations [27], and the ICNN applies attention methods to enhance feature extraction. The enhanced CNN architecture of the ProtICNN-BiLSTM model includes several improvements to increase efficiency and capture more intricate features from protein sequences. These improvements include residual connections to address vanishing gradients and improve gradient flow during training. Batch normalization layers are added after each convolutional layer to standardize each layer’s input, further stabilizing and speeding up training [28].

ReLU activations follow batch normalization to introduce non-linearity and increase the expressiveness of the model. Dropout layers with a rate of 0.5 randomly remove input units during training to prevent overfitting. Adaptive pooling layers make handling sequences of varying lengths easier by ensuring a constant output size irrespective of input dimensions. These enhancements collectively increase the overall efficacy of the ProtICNN-BiLSTM model by strengthening the CNN’s extraction of flexible and robust features from protein sequences [29].

Fig. 3

Architecture of improved CNN model

The Improved CNN model’s design is shown in Fig. 3. The revised CNN architecture receives protein sequences as numerical array input for protein sequence analysis; the layered layout of the proposed model is depicted in Fig. 4. The first convolutional layer uses 64 filters of size 3 × 3 to search for local patterns in the sequences. Next, the ReLU activation function introduces non-linearity. A dropout layer with a rate of 0.5 is employed to reduce overfitting, and batch normalization stabilizes the activations [30].

The second convolutional layer has 128 filters, each measuring 3 × 3, and captures more abstract and complex qualities. The ReLU activation function again provides non-linearity, followed by a dropout layer and batch normalization. These stages promote stable, versatile learning. Adaptive pooling resizes feature maps to a preset size, accommodating inputs of many lengths. A 256-unit fully connected layer with Rectified Linear Unit (ReLU) activation combines these characteristics into more sophisticated and abstract representations [31].

To compute the probabilities of the protein classes, the output layer of the multi-class classifier applies the SoftMax activation function. In addition, local and global feature extraction procedures are incorporated into this architectural design [32], along with regularization techniques, to improve the precision and consistency of protein sequence categorization.
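To make the layer stack concrete, the following PyTorch sketch assembles the blocks described above. The filter counts, kernel sizes, dropout rate, adaptive pooling, and 256-unit dense layer follow the description of Figs. 3 and 4; the input shape, pooled size, and four-class output are illustrative assumptions, and the residual and attention refinements are omitted for brevity.

```python
import torch
import torch.nn as nn

class ICNN(nn.Module):
    """Minimal sketch of the improved CNN branch: two conv blocks
    (64 and 128 filters, 3 x 3, ReLU, dropout 0.5, batch norm),
    adaptive pooling, a 256-unit dense layer, and a class output."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1),   # "same" padding
            nn.ReLU(), nn.Dropout(0.5), nn.BatchNorm2d(64),
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
            nn.ReLU(), nn.Dropout(0.5), nn.BatchNorm2d(128),
            nn.AdaptiveAvgPool2d((4, 4)),   # fixed-size maps for any input length
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, n_classes),      # logits; SoftMax applied in the loss
        )

    def forward(self, x):                   # x: (batch, 1, H, W) encoded sequence
        return self.classifier(self.features(x))

print(ICNN()(torch.randn(2, 1, 64, 20)).shape)   # torch.Size([2, 4])
```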

Fig. 4

Layers and parameters of the proposed hybrid model

Convolution operation process

The convolution process extracts local features from the input sequence [33]. The mathematical form of the convolution operation is summarized in Eq. (1) below, where \(Z_{i,j}^{k}\) is the output of the convolution at position (i, j) for filter k, \(b^{k}\) is a bias term, and M and N are the filter dimensions.

$$Z_{i,j}^{k} = \sum_{m=1}^{M}\sum_{n=1}^{N} x_{i+m-1,\; j+n-1}\, W_{m,n}^{k} + b^{k}$$
(1)
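As a minimal illustration of Eq. (1), the following NumPy sketch computes a valid convolution of a 2-D input with a single M × N filter; the function and variable names are illustrative, not part of the paper.

```python
import numpy as np

def conv_single_filter(x, w, b):
    """Direct implementation of Eq. (1) for one filter w with bias b:
    Z[i, j] = sum_{m,n} x[i+m-1, j+n-1] * W[m, n] + b
    (1-indexed in the text, 0-indexed here)."""
    M, N = w.shape
    H, W = x.shape
    z = np.zeros((H - M + 1, W - N + 1))
    for i in range(z.shape[0]):
        for j in range(z.shape[1]):
            z[i, j] = np.sum(x[i:i + M, j:j + N] * w) + b
    return z

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((3, 3))
print(conv_single_filter(x, w, b=0.5))   # 2 x 2 output map
```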
Attention mechanism

An attention technique is implemented to concentrate on the most important features retrieved by the convolutional layers [34]. The attention weights are calculated as given in Eq. (2), where \(\sigma_{i}\) is the attention weight, \(e_{i}\) the energy score, and L the local feature count.

$$\sigma_{i} = \frac{\exp(e_{i})}{\sum_{j=1}^{L}\exp(e_{j})}$$
(2)
Energy score calculation

An energy score \(e_{i}\) is calculated by Eq. (3), where \(W_{a}\) and \(b_{a}\) are learnable parameters and \(h_{i}\) is the hidden state [35].

$$e_{i} = \tanh\left(W_{a} h_{i} + b_{a}\right)$$
(3)
Weighted feature representation

The attention weights and the convolutional features are combined to produce the weighted feature representation \(W_{f}\), as described in Eq. (4), where \(h_{i}\) is the hidden state and L the local feature count.

$$W_{f} = \sum_{i=1}^{L}\sigma_{i} h_{i}$$
(4)
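A compact PyTorch sketch of Eqs. (2) to (4) follows; it assumes the hidden states arrive as a (batch, L, d) tensor, and the module and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Implements Eqs. (2)-(4): energy scores, softmax weights, weighted sum."""
    def __init__(self, d):
        super().__init__()
        self.W_a = nn.Linear(d, 1)           # learnable W_a and b_a of Eq. (3)

    def forward(self, h):                    # h: (batch, L, d) hidden states
        e = torch.tanh(self.W_a(h))          # energy scores e_i, Eq. (3)
        sigma = torch.softmax(e, dim=1)      # attention weights sigma_i, Eq. (2)
        w_f = (sigma * h).sum(dim=1)         # weighted features W_f, Eq. (4)
        return w_f, sigma

w_f, sigma = AdditiveAttention(d=128)(torch.randn(8, 50, 128))
print(w_f.shape, sigma.shape)                # torch.Size([8, 128]) torch.Size([8, 50, 1])
```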

Bidirectional long short-term memory (BiLSTM)

Bidirectional Long Short-Term Memory (BiLSTM) is a more advanced kind of Recurrent Neural Network (RNN) developed to handle sequential data, such as protein sequences, by taking dependencies in both the forward and backward directions into account [36]. BiLSTM operates as follows.

Forward-LSTM (Fw-LSTM)

The forward LSTM layer processes the input sequence from beginning to end, capturing forward dependencies. Each cell in this layer comprises three gates: input, forget, and output [37]. These gates control the flow of information, enabling the network to remember or forget past knowledge depending on the circumstances; the essential formulas are presented in Eqs. (5) to (10). Table 2 presents the key symbols used in the forward LSTM.

Table 2 Key symbols used in the forward LSTM
$$f_{t}^{(f)} = \sigma\left(W_{f}^{(f)}\left[h_{t-1}^{(f)},\, x_{t}\right] + b_{f}^{(f)}\right)$$
(5)
$$\mathit{input}_{t}^{(f)} = \sigma\left(W_{\mathit{input}}^{(f)}\left[h_{t-1}^{(f)},\, x_{t}\right] + b_{\mathit{input}}^{(f)}\right)$$
(6)
$$\mathit{output}_{t}^{(f)} = \sigma\left(W_{\mathit{output}}^{(f)}\left[h_{t-1}^{(f)},\, x_{t}\right] + b_{\mathit{output}}^{(f)}\right)$$
(7)
$$\check{C}_{t}^{(f)} = \tanh\left(W_{C}^{(f)}\left[h_{t-1}^{(f)},\, x_{t}\right] + b_{C}^{(f)}\right)$$
(8)
$$C_{t}^{(f)} = f_{t}^{(f)} \odot C_{t-1}^{(f)} + \mathit{input}_{t}^{(f)} \odot \check{C}_{t}^{(f)}$$
(9)
$$h_{t}^{(f)} = \mathit{output}_{t}^{(f)} \odot \tanh\left(C_{t}^{(f)}\right)$$
(10)
Backward-LSTM (Bw-LSTM)

An additional LSTM layer simultaneously processes the sequence from end to beginning, capturing the backward dependencies. This layer functions analogously to the forward layer, employing the same gate mechanisms to control the flow of information [38].

Cumulative output

The hidden states from the forward and backward LSTM layers are concatenated at every time step. Considering both past and future context for each element, this combination delivers a complete picture of the sequence at every position. For the protein sequence “BHDU”, the forward LSTM processes B→H→D→U, while the reverse LSTM processes U→D→H→B. Combining the hidden states from both directions ensures that every point in the sequence incorporates information from the preceding and following portions [39]. This bidirectional approach makes BiLSTM effective for complex sequential data like protein sequences, where the interactions between components determine analysis and classification.
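In PyTorch, the bidirectional pass and the per-step concatenation described above are provided directly by nn.LSTM; the sizes below are illustrative, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

# bidirectional=True runs forward and backward LSTMs and concatenates their
# hidden states at every time step (the "cumulative output" described above).
bilstm = nn.LSTM(input_size=128, hidden_size=64,
                 num_layers=1, batch_first=True, bidirectional=True)

x = torch.randn(8, 100, 128)            # (batch, sequence length, features)
outputs, (h_n, c_n) = bilstm(x)
print(outputs.shape)                    # torch.Size([8, 100, 128]): 2 x 64 per step
```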

Bayesian optimization method

Bayesian optimization tunes the hyperparameters of complex processes such as deep learning models. Optimized hyperparameter configurations are found by continually updating a probabilistic model while balancing exploration and exploitation. By modifying hyperparameters systematically, Bayesian optimization improves ProtICNN-BiLSTM protein categorization [40]. Bayesian optimization operates in the model as follows.

  • Defining an Objective Function: The objective function f(x) is the performance metric we attempt to maximize for protein sequence analysis. Equation (11) uses classification accuracy; precision or recall are additional possibilities.

$$f(x) = \mathrm{Accuracy}\left(\mathrm{ProtICNN\text{-}BiLSTM}(x)\right)$$
(11)
  • Set up a Gaussian Process: A Gaussian Process (GP) is initialized to approximate the objective function. The GP model uses a mean function µ(x) and a covariance function Cf(x, x′) to forecast the value and uncertainty of the objective function (Eqs. 12 and 13).

$$\mu(x) = \mathit{Mean\_Function}$$
(12)
$$\mathrm{Cf}(x, x') = \mathit{Covariance\_Function}$$
(13)
  • Define an Acquisition Function: The acquisition function selects the next hyperparameter point to examine. The Expected Improvement (EI) function, which balances exploration and exploitation, is a popular option (Eq. 14). Here \(\beta(x)\) is the acquisition function and \(x^{+}\) is the best hyperparameter point found so far.

$$\beta(x) = E\left[\max\left(0,\, f(x) - f(x^{+})\right)\right]$$
(14)
  • Hyperparameter Update: Update the GP model with the new evaluation data, making the GP model’s approximation of the objective function more accurate. A sketch of this loop is shown below.
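The following is a minimal sketch of the loop using scikit-optimize’s gp_minimize, which fits a GP surrogate and selects points by Expected Improvement (Eq. 14). The search space is illustrative, and train_and_score is a hypothetical helper that trains ProtICNN-BiLSTM with the given hyperparameters and returns validation accuracy (Eq. 11).

```python
from skopt import gp_minimize
from skopt.space import Real, Integer

# Illustrative hyperparameter search space.
space = [
    Real(1e-4, 1e-2, prior="log-uniform", name="learning_rate"),
    Integer(32, 256, name="batch_size"),
    Real(0.2, 0.6, name="dropout"),
]

def objective(params):
    lr, batch_size, dropout = params
    # train_and_score is a hypothetical helper returning validation accuracy.
    acc = train_and_score(lr=lr, batch_size=batch_size, dropout=dropout)
    return -acc        # gp_minimize minimizes, so negate the accuracy of Eq. (11)

result = gp_minimize(objective, space, acq_func="EI", n_calls=30, random_state=42)
print("best hyperparameters:", result.x, "best accuracy:", -result.fun)
```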

Algorithm proposed hybrid model

The key steps for the proposed hybrid model are described in Algorithm 1 below.

Algorithm 1: Proposed Hybrid model for protein sequence

Input: Protein dataset

Output: Protein sample categories across the different classes.

Step 1: Import and preprocess the data

1. Import protein sequence data from the Protein Data Bank and other pertinent resources.

2. Perform preprocessing on the patterns, considering features such as amino acid structure, physicochemical characteristics, and structural details.

Step 2: Divide the data

1. Partition the sample among training and testing collections with an 80:20 ratio, guaranteeing that the protein groups are evenly distributed in both sets.

Step 3: Architectural Design

1. Create a hybrid Convolutional Neural Network (CNN)-Bidirectional Long Short-Term Memory (BiLSTM) model to capture short- and long-range relationships in protein patterns.

2. Convolution layers obtain spatial patterns, while the BiLSTM layers retrieve sequential information.

Step 4: Hyperparameter Tuning

1. Specify a range of hyperparameter values, including learning rates, batch size, convolutional filtering size, LSTM units, and dropout rates.

2. Implement Bayesian Optimisation to systematically and efficiently identify the algorithm’s optimal hyperparameter settings.

Step 5: Assessment Criterion:

1. Select the F1-score as the primary evaluation metric for precision and recall, essential in imbalanced protein sequence datasets.

Step 6: Training and evaluating the model:

1. Utilise Bayesian optimization to obtain the optimal hyperparameters, then train the model on the training set (a minimal training-loop sketch follows Algorithm 1).

2. Assess the model’s performance on the test set, considering metrics such as accuracy, precision, recall, F1-score, and any metrics specific to the domain.

Step 7: Analysis and depiction:

1. Observe the model’s performance and examine instances where it incorrectly classified data.

2. Utilise interpretability tools to gain insights into the specific portions of the sequences that have the most significant impact on predictions.
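As referenced in Step 6, a generic PyTorch training loop consistent with Algorithm 1 might look as follows; the ADAM optimizer and cross-entropy loss match the experimental setup reported later, while the data loader and model are assumed to exist and all names are illustrative.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=1500, lr=1e-3, device="cpu"):
    """Step 6 sketch: train with ADAM and cross-entropy; train_loader is assumed
    to yield (encoded sequence batch, class label batch) pairs."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)    # forward pass and loss
            loss.backward()                # backpropagation
            optimizer.step()               # parameter update
    return model
```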

Dataset

This research utilizes the standard online protein dataset PDB-14189 (Protein Data Bank-14189) [31]. The dataset comprises a heterogeneous collection of protein patterns derived from different organisms, including enzymes, antibodies, structural proteins, transport proteins, receptors, and other functional categories. The description of each protein sequence includes details on its components, operation, and biological characteristics, such as secondary structure components, ligand-binding sites, protein class categorization, source organism, the procedure used for structure determination, and the resolution of the experimental structure. The PDB dataset is frequently utilized in bioinformatics and molecular science studies for objectives such as protein structure estimation, functional annotation, protein-protein interaction prediction, drug target recognition, and automated tasks such as classification.

Table 3 Instance count in PDB-14189

Among the 14,189 cases in the PDB-14189 dataset, 7,129 are positive instances of DNA-binding proteins and 7,060 are negative instances of proteins that do not bind DNA. This dataset is frequently utilized in bioinformatics and machine learning research for protein function prediction and structural analysis. Table 3 presents the PDB dataset description.

Data pre-processing

The dataset was initially assembled carefully from two reliable sources: experiments and protein databases. A thorough data-cleaning process followed, including removing duplicate sequences, correcting errors, and meticulously handling missing data, ensuring the integrity and dependability of the dataset for the subsequent analysis [41].

Feature extraction used advanced techniques to convert the protein sequences into numerical representations. One-hot encoding turned every amino acid in a sequence into a binary vector, increasing specificity. A large range of amino acid physicochemical properties, including molecular mass and hydrophobicity, was extracted to enrich the feature set and capture significant characteristics of the proteins. Oversampling techniques alleviated class imbalance, which arose from the unequal numbers of DNA-binding and non-binding proteins; synthetic data points were generated for the minority class to achieve a balanced distribution and remove bias from the machine learning algorithms.
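A minimal sketch of the one-hot encoding step is given below; the 20-letter alphabet is the standard amino acid set, and the maximum length is an illustrative assumption rather than the paper’s setting.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"            # 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence, max_len=500):
    """Encode a protein sequence as a (max_len, 20) binary matrix; sequences
    are truncated or zero-padded to max_len (max_len is illustrative)."""
    x = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for i, aa in enumerate(sequence[:max_len]):
        if aa in AA_INDEX:                      # non-standard residues stay all-zero
            x[i, AA_INDEX[aa]] = 1.0
    return x

print(one_hot_encode("MKVLAA").sum())           # 6 residues -> 6.0
```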

The dataset was then separated into training, validation, and test sets, with the class distribution carefully maintained inside each subset. The training set was used to train the models; the validation set guided model selection and hyperparameter adjustment; and the test set assessed the final model’s performance. Finally, feature scaling techniques such as normalization ensured that all features were on comparable scales, so that the machine learning algorithms could learn from the dataset efficiently. Together, these preprocessing steps ensured that the protein data was optimally prepared for model training and analysis, yielding more precise and trustworthy results.
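A stratified split preserving class proportions can be sketched with scikit-learn as follows; X and y are placeholders for the encoded sequences and labels, and the 80:10:10 ratio is illustrative (the main experiments use an 80:20 train/test split).

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 500 * 20)              # placeholder feature matrix
y = np.random.randint(0, 2, size=1000)          # placeholder binary labels

# 80% train, then split the remaining 20% evenly into validation and test,
# stratifying on y so the class distribution is preserved in each subset.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)
print(len(X_train), len(X_val), len(X_test))    # 800 100 100
```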

Performance measuring parameters

Key parameters evaluated the prediction accuracy of the proposed and existing models. Precision (P), recall (R), F1-score, support, specificity (SPC), sensitivity (SNS), Matthews correlation coefficient (MCC), and accuracy (ACR) are calculated using Eqs. (15) to (19) [42]. Here, TP: true positive, FN: false negative, TN: true negative, and FP: false positive.

Precision

Precision is calculated as the count of true positives divided by the total number of predicted positive cases, as stated in Eq. 15.

$$P = \frac{TP}{TP + FP}$$
(15)

Recall

Divide the number of successfully identified positive observations by the total positive specimens to compute recall as presented by Eq. 16.

$$R = \frac{TP}{TP + FN}$$
(16)

F1-score

In binary classification (and multi-class categorization), the F1-score is the harmonic mean of precision and recall (Eq. 17).

$$FS = 2 \times \frac{P \times R}{P + R}$$
(17)

Specificity

As shown in Eq. 18, SPC is a binary categorization metric that measures a model’s negative case detection accuracy.

$$SPC = \frac{TN}{TN + FP}$$
(18)

Accuracy

Eq. 19 calculates accuracy by dividing the number of correctly predicted instances by the total number of instances.

$$\mathrm{ACR} = \frac{TP + TN}{TP + FP + TN + FN}$$
(19)
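The following sketch computes Eqs. (15) to (19) plus the Matthews correlation coefficient directly from confusion-matrix counts; the counts in the example call are illustrative.

```python
import numpy as np

def classification_metrics(tp, fp, tn, fn):
    """Eqs. (15)-(19) plus MCC from confusion-matrix counts."""
    precision   = tp / (tp + fp)                                   # Eq. (15)
    recall      = tp / (tp + fn)                                   # Eq. (16), sensitivity
    f1          = 2 * precision * recall / (precision + recall)   # Eq. (17)
    specificity = tn / (tn + fp)                                   # Eq. (18)
    accuracy    = (tp + tn) / (tp + fp + tn + fn)                  # Eq. (19)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return precision, recall, f1, specificity, accuracy, mcc

print(classification_metrics(tp=90, fp=10, tn=85, fn=15))
```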

Experimental results and discussion

The proposed and existing models are implemented in Python, and various performance-measuring parameters are calculated to evaluate their efficacy. The study leveraged PyTorch, an openly accessible deep-learning library, while the proposed hybrid model was developed using Keras [43]. All evaluations used the PDB-14189 dataset, divided into an 80% training sample and a 20% testing sample [44,45,46,47].

Hyperparameter specification

Table 4 defines the hyperparameters used in the experimental analysis [48,49,50]. Table 5 details the CNN and BiLSTM parameters of the proposed hybrid model, which determine the shape of every level of the DNN in protein analysis.

Table 4 Hyperparameters used in experiments
Table 5 Parameters details of CNN and BiLSTM in the proposed hybrid model

Results for different parameters

The PDB-14189 dataset was the key dataset in this research. Various performance-measuring parameters were calculated for the existing CNN and CNN-LSTM models and the proposed hybrid model. Figure 5 presents protein sequence patterns for different classes: (a) the frequency of protein sequence attributes, (b) protein sequence length versus the number of sequences, and (c) protein sequence frequency counts.

Fig. 5

(a) Protein sequence frequency of attributes, (b) protein sequence length vs. sequences, and (c) protein sequence frequency count

Experimental results

The protein classes are grouped into four categories: Hydrolase (0), Oxidoreductase (1), Ribosome (2), and Transferase (3).

Table 6 Experimental results for CNN (Base Line Model)

Table 6 describes the experimental results of the CNN model for predicted and actual protein sequence analysis. The CNN model achieved a precision of 82.013% for Hydrolase (0), 90.631% for Oxidoreductase (1), 89.856% for Ribosome (2), and 89.163% for Transferase (3); a recall of 87.236% for Hydrolase (0), 88.104% for Oxidoreductase (1), 89.952% for Ribosome (2), and 87.459% for Transferase (3); an F1-score of 86.761% for Hydrolase (0), 84.039% for Oxidoreductase (1), 83.791% for Ribosome (2), and 83.014% for Transferase (3); and an accuracy of 86.286% for Hydrolase (0), 85.603% for Oxidoreductase (1), 86.786% for Ribosome (2), and 87.492% for Transferase (3).

Table 7 Experimental results for CNN-LSTM

Table 7 describes the experimental results of the CNN-LSTM model for predicted and actual protein sequence analysis. The CNN-LSTM model achieved a precision of 87.949% for Hydrolase (0), 87.298% for Oxidoreductase (1), 91.791% for Ribosome (2), and 88.486% for Transferase (3); a recall of 88.659% for Hydrolase (0), 89.476% for Oxidoreductase (1), 90.856% for Ribosome (2), and 86.870% for Transferase (3); an F1-score of 89.042% for Hydrolase (0), 86.872% for Oxidoreductase (1), 92.963% for Ribosome (2), and 87.365% for Transferase (3); and an accuracy of 90.787% for Hydrolase (0), 89.321% for Oxidoreductase (1), 88.709% for Ribosome (2), and 89.326% for Transferase (3).

Table 8 Experimental results for the proposed hybrid model

Table 8 describes the experimental results of the proposed hybrid model for predicted and actual protein sequence analysis. The proposed hybrid model achieved a precision of 95.371% for Hydrolase (0), 96.908% for Oxidoreductase (1), 95.772% for Ribosome (2), and 93.474% for Transferase (3); a recall of 96.375% for Hydrolase (0), 97.603% for Oxidoreductase (1), 94.667% for Ribosome (2), and 95.187% for Transferase (3); an F1-score of 94.874% for Hydrolase (0), 93.271% for Oxidoreductase (1), 95.337% for Ribosome (2), and 96.375% for Transferase (3); and an accuracy of 96.074% for Hydrolase (0), 94.387% for Oxidoreductase (1), 96.009% for Ribosome (2), and 97.341% for Transferase (3).

Table 9 Experimental results comparison of existing vs. proposed models
Fig. 6

Comparison of existing and proposed models

Table 9 presents a comparative analysis of the experimental results of the existing and proposed models. The existing CNN achieved a specificity of 85.84%, an accuracy of 89.27%, a sensitivity of 89.78%, and an MCC of 81.47%; the existing CNN-LSTM achieved a specificity of 87.37%, an accuracy of 90.17%, a sensitivity of 88.98%, and an MCC of 88.35%; and the proposed hybrid model achieved a specificity of 94.65%, an accuracy of 96.57%, a sensitivity of 95.67%, and an MCC of 96.85%.

Figure 6 compares the existing CNN, CNN-LSTM, and the proposed ProtICNN-BiLSTM model regarding specificity, accuracy, sensitivity, and Matthews correlation coefficient. The proposed model exhibits outstanding performance, achieving higher results on all four parameters, which highlights its strength and effectiveness compared to the existing CNN and CNN-LSTM models.

Results and discussion

The fusion of Improved Convolutional Neural Networks and Bidirectional Long Short-Term Memory models, augmented with amino acid embedding techniques, presents a robust strategy for dissecting protein sequences. By harnessing the capabilities of amino acid embedding, the model can effectively exploit the feature extraction prowess inherent in BiLSTM. Subsequent processing by both ICNN and BiLSTM components enables precise prediction of protein attributes, including structural configurations and functional characteristics. However, the efficacy of this approach hinges upon factors such as dataset quality, size, and specific analytical goals.

Experimental assessments were conducted using the PDB-14189 dataset, contrasting the performance of the established CNN and CNN-LSTM models with the novel ProtICNN-BiLSTM model. Training spanned 1500 epochs with a batch size of 128, optimized via the ADAM optimizer. A meticulous analysis encompassing metrics like specificity, sensitivity, Matthews correlation coefficient, and overall accuracy was undertaken. The Results for Different Parameters section delineates binary and multi-class classification outcomes for the existing and proposed methodologies. Visualization of protein sequence patterns across diverse classes, encompassing attribute frequencies, sequence length distributions, and sequence counts, is depicted in Fig. 5. The results in Table 6 outline the CNN model’s performance, while Table 7 elucidates the CNN-LSTM model’s efficacy. Notably, the proposed ProtICNN-BiLSTM model attains remarkable accuracy, recall, F1-score, and support metrics, peaking at 98.11%.

The discernible superiority of the ProtICNN-BiLSTM model over conventional CNN and LSTM variants can be ascribed to several key factors. Firstly, the fusion of ICNN with BiLSTM engenders a holistic approach to capturing local and long-range dependencies within protein sequences. Moreover, incorporating amino acid embedding techniques facilitates a nuanced representation of proteins as numerical vectors, fostering more robust feature extraction. Integrating an attention mechanism further enhances model performance by dynamically weighting the significance of various protein sequence components. Collectively, these advancements underscore the efficacy of the ProtICNN-BiLSTM model in surpassing traditional CNN and LSTM methodologies.

Conclusion and future directions

The ProtICNN-BiLSTM model demonstrates the value of combining deep learning and molecular biology. The proposed hybrid model surpasses existing approaches with 98.11% accuracy, and Hydrolase, Oxidoreductase, Ribosome, and Transferase achieved similar precision and recall ratings of 87.949-91.791% and 86.870-90.856%, respectively. These results show that the enhanced biological analysis method works. ProtICNN-BiLSTM offers a novel approach to classifying protein sequences: combining CNN and BiLSTM encoding improves accuracy, and its protein structure and activity predictions are accurate enough for sophisticated biological research. The model uses NLP-style feature extraction together with CNN and BiLSTM layers for sequential associations, and its DNA-binding prediction distils complex biology. Though promising, protein sequence categorization still needs more research to increase accuracy and broaden application.

Future research should improve the accuracy and practicality of protein sequence categorization. Employing more complex datasets can improve model performance and flexibility, and incorporating protein-protein interaction and functional annotation data can strengthen model prediction. Advances in deep learning and hybrid models can improve efficiency, and richer protein classifications and databases can further enhance the model. Experimental biologists can verify the model’s biological applicability by validating its predictions.