1 Introduction

Early identification of viruses and diseases has become a major concern since the COVID-19 outbreak, because it can save lives and speed up the design of drugs (Herath et al. 2022; Shadab et al. 2020; Herath et al. 2021a, b). However, identification is difficult because few reference patterns are available for many diseases. The classification of DNA sequences is one possible solution, which is why it plays an important role in computational biology.

GenBank, maintained by the NCBI, is one of the largest collections of genome sequences. It consists of millions of distinct DNA sequences with billions of nucleotide bases (Benson et al. 2010). DNA sequences are the blueprint of an organism. DNA stands for deoxyribonucleic acid and is made up of four nucleotides, namely A, C, G, and T, i.e., adenine, cytosine, guanine, and thymine. The DNA sequence of every disease-causing organism and virus is unique, which makes it possible to extract unique patterns (Momenzadeh et al. 2020). In double-stranded DNA, each nucleotide is bound to its complementary partner (A-T and C-G pairs), as shown in Fig. 1.

Fig. 1 DNA base pair structure with the sugar-phosphate backbone

In this paper, a framework is proposed in which, whenever a patient infected with a virus or another disease visits a doctor, his or her DNA samples are collected and matched against a database to identify the disease. However, identification is difficult because few reference patterns are available for many diseases. This paper presents a solution in which samples collected from NCBI are used for the classification of DNA sequences. The classification in turn yields the patterns of various diseases; these patterns are then compared with the samples of a newly infected person, which can help identify the disease earlier. FASTA-based DNA sequences are large and complex, so they cannot be passed directly to feature extraction and must first be converted to an equivalent numerical form. This paper introduces a new hot vector-based numerical representation in which the position of each nucleotide is preserved using binary 0 or 1. The hot vector matrix is then given as input to a traditional CNN for feature extraction. The models are trained and evaluated on test data. Finally, the model can be applied to unseen data (an unknown DNA sequence) for disease prediction.

Various machine learning (ML) techniques can be used for classification. For our analysis, we took 37,272 COVID samples, 1418 Middle East Respiratory Syndrome (MERS), 6503 Severe Acute Respiratory Syndrome (SARS), 1886 DENGUE, 8226 HEPATITIS, and 10,848 INFLUENZA samples (Fig. 2). These samples are used first to train the model and then to test it. To train models for the various kinds of disease, features are extracted from well-known sequences. In this work, a new neural network approach is proposed that takes a hot vector matrix as input to convolution layers, which extract features from the input data. The position of each nucleotide of the DNA sequence is represented using a hot vector, and the neurons of each convolution layer use the features extracted by the previous layer to extract high-level abstract features.

Fig. 2 Number of samples in a dataset

2 Background

In this paper, Sect. 2.1 describes the DNA sequence in FASTA file format. Section 2.2 introduces classification as the final stage of pattern recognition; the types of classifiers are then introduced in Sect. 3.

2.1 DNA sequence in FASTA file format

In this paper, we have also studied DNA sequences in the FASTA file format, a text-based format used to represent sequences in bioinformatics (Hach et al. 2014; Mohammed et al. 2012). It represents the bases using the single-letter codes shown in Table 1.

Table 1 DNA base pairs

The records appear in a file in the order in which they are stored. The format allows each sequence to be preceded by a sequence name and comments. A FASTA record starts with a one-line description that contains the identifier, followed by the DNA sequence data.

>AR147821.1 Sequence 1 from patent US 6225051
GGCATCTGAGACCAGTGAGAA

The description line begins with the '>' sign, which distinguishes it from the sequence data. The identifier of the sequence is the word immediately after the '>' symbol; the remaining part of the line is an optional description, separated from the identifier by a white space or tab. The sequence data starts on the line after the description line. The next line that begins with a '>' sign marks the end of the sequence and the beginning of another sequence.
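The record layout described above can be illustrated with a minimal parsing sketch. It assumes plain-text input, an identifier that ends at the first white space or tab, and sequences that may span several lines; the function name `parse_fasta` is hypothetical and is not part of the tool used in this work.

```python
from typing import Dict, Iterable, Tuple


def parse_fasta(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each identifier to a (description, sequence) pair."""
    records: Dict[str, Tuple[str, str]] = {}
    identifier = None
    description = ""
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if identifier is not None:
                records[identifier] = (description, "".join(chunks))
            parts = line[1:].strip().split(None, 1)
            identifier = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line.upper())
    if identifier is not None:
        records[identifier] = (description, "".join(chunks))
    return records


example = [
    ">AR147821.1 Sequence 1 from patent US 6225051",
    "GGCATCTGAGACCAGTGAGAA",
]
records = parse_fasta(example)
```

For the example record, `records["AR147821.1"]` holds the description text and the 21-letter sequence.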

2.2 Pattern recognition

For the identification of various kinds of disease, pattern recognition is considered an important factor. It consists of three primary elements which are data perception, extraction of features, and data classification (Fig. 3) (Sathish kumar et al. 2005; Jain et al. 2004).

Fig. 3 Pattern recognition model

2.2.1 Preprocessing

To increase efficiency, the raw dataset should be preprocessed rather than given directly to the selected classifiers. Preprocessing addresses issues such as training overhead, classifier confusion, false alarms, and poor detection rates. The feature spaces must be separated from one another and arranged as vectors. CNN can be applied to textual data problems such as classification and categorization; the difference from image data is that digital pictures are 2-dimensional matrices, whereas textual data is a 1-dimensional sequence of letters. The text therefore has to be converted to an equivalent numerical form before it can be given as input to a CNN.

2.2.2 Features selection

Feature selection is an important factor in any kind of classification, whether supervised or unsupervised, because a large number of features can be monitored and each can take a large variety of possible values, especially continuous features, even for a small network.

2.2.3 Final cluster data

Based on the final updated population of the cluster centers, the proposed work uses only these sessions for training the neural network. This increases the learning capacity of the model, because a large number of patterns from the same class increases the confusion of the system. In other words, training on a small set of clustered input data improves the detection rate of the proposed work.

2.2.4 Classification

The proposed model uses a deep learning neural network for training, because such networks make it possible to find subclasses. Based on this, the neurons of the network adjust their weights. The feature vectors grouped for the different classes during the feature collection steps are matched in the network. Finally, a trained neural network (TN) is obtained. CNN has been shown to give very good results for the classification of text data, so in this work we use CNN for the classification of textual DNA data.

3 Classification techniques

Classification is the process of recognizing, understanding, and grouping ideas and objects into preset categories or "sub-populations". ML uses classification algorithms together with pre-categorized datasets, i.e., training datasets, to classify new data. The training datasets are used to predict which pre-defined category a new sequence falls into. A simple example of classification is filtering mail into spam and non-spam categories; in other words, “It is a form of pattern recognition where classification algorithm is used along with training dataset to check the patterns such as words with similar sentiments etc. for future dataset”.

3.1 Machine learning-based systems

In a machine learning-based text classification system, past experiences and observations are used to make the system learn (Ikonomakis et al. 2005). The first step in training an ML-based system is feature extraction, in which the textual data is converted into an equivalent numerical form and represented as a vector. One such method represents each word by its frequency of occurrence; this approach is known as a “bag of words”. Suppose we have a dictionary of words {I, have, has, dog, cat, a} and want to represent the text “I have a dog” in vector form; the vector representation of the text is then {1, 1, 0, 1, 0, 1}. To generate a classification model, a training dataset consisting of pairs of feature sets and tags is given as input to the ML algorithm (Fig. 4).
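The bag-of-words example above can be reproduced with a few lines of Python. This is only a sketch that assumes white-space tokenization and case-sensitive matching; the function name `bag_of_words` is illustrative.

```python
from typing import List


def bag_of_words(text: str, dictionary: List[str]) -> List[int]:
    """Count how often each dictionary word occurs in the text."""
    tokens = text.split()  # assumption: words are separated by white space
    return [tokens.count(word) for word in dictionary]


dictionary = ["I", "have", "has", "dog", "cat", "a"]
vector = bag_of_words("I have a dog", dictionary)
# vector == [1, 1, 0, 1, 0, 1]
```

Each position of the vector corresponds to one dictionary word, so the words "has" and "cat", which do not occur in the text, receive a count of 0.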

Fig. 4 a Training. b Prediction

Once the ML model has been trained with enough training samples, it is ready to make predictions. The feature extractor is then used to transform unseen text into feature sets, and these feature sets are given as input to the classification model to predict the tags.

Classifying text with ML is generally more accurate and precise than classification with a hand-crafted system, and it makes learning from new examples much easier.

Two families of ML-based text classifiers are described below.

3.2 SVM (support vector machines)

SVM stands for support vector machine, an ML algorithm known for giving precise and fast results without requiring much training data (Shanahan et al. 2003). However, it requires more resources for its computations. An SVM draws a hyperplane that separates the space into two parts: the first part contains the vectors that belong to a group, and the second part contains the remaining vectors, which do not belong to that group. The best hyperplane is the one with the largest distance to the nearest vectors of each tag. Figure 5a shows an example of a hyperplane in 2 dimensions.

Fig. 5 a Hyperplane in 2D. b Classification of vector and tags. c Best hyperplane

Vectors represent the training text and the group represents tags used for tagging texts. As the complexity of data increases, it becomes difficult to categorize/classify vectors and tags into 2 subgroups only.

However, SVM can still give accurate and precise results when the data becomes more complicated, which is a desirable property of any classification technique. Figure 5c shows the best hyperplane in 3 dimensions (drawn as a circle).
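To make the hyperplane idea concrete, the following sketch evaluates on which side of a given hyperplane each vector lies and how far the closest vector is from it (the margin). The weights `w`, the offset `b`, and the sample points are made-up values for illustration; training an actual SVM, which searches for the hyperplane with the largest margin, is not shown here.

```python
import math
from typing import Sequence


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed score w.x + b; its sign tells the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 for one side of the hyperplane and -1 for the other."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin(w: Sequence[float], b: float, points) -> float:
    """Distance from the hyperplane to the closest point."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(decision_value(w, b, x)) / norm for x in points)


# Hypothetical 2-D hyperplane x1 + x2 - 3 = 0 and four sample vectors.
w, b = [1.0, 1.0], -3.0
points = [(1.0, 1.0), (3.0, 3.0), (0.0, 1.0), (2.0, 3.0)]
labels = [predict(w, b, p) for p in points]
smallest_distance = margin(w, b, points)
```

Among all hyperplanes that separate the two groups, the "best" one in the sense of Fig. 5c is the one for which this smallest distance is as large as possible.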

3.3 Deep learning

Deep learning consists of algorithms inspired by how the human brain works, known as neural networks. For text classification, deep learning can provide very accurate results with modest computation. Two well-known deep learning techniques for text classification are:

  • Convolutional neural networks (CNN)

  • Recurrent neural networks (RNN).

3.3.1 RNN (recurrent neural network)

In this type of neural network, the output of the previous step is fed back as an input to the current step, which helps in predicting the output. Each neuron keeps some memory of what it has seen before moving to the next step. One of the most popular applications of RNN is text-to-speech conversion (Ikonomakis et al. 2005).
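The feedback mechanism can be sketched for a single recurrent unit. The weights `w_x`, `w_h`, and `b` below are arbitrary illustrative values and the tanh activation is an assumption; the RNN used in the experiments may differ.

```python
import math
from typing import List


def rnn_step(x: float, h_prev: float, w_x: float, w_h: float, b: float) -> float:
    """One recurrent update: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(inputs: List[float], w_x: float = 0.5, w_h: float = 0.8,
            b: float = 0.0) -> List[float]:
    """Feed the inputs one by one, carrying the hidden state forward."""
    h = 0.0  # initial memory
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states


states = run_rnn([1.0, 0.0, 0.0])
```

Although the second and third inputs are zero, the corresponding states are non-zero, because the hidden state still carries information from the first input.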

3.3.2 CNN (convolution neural network)

A convolutional neural network is a deep learning model built on the key idea of using convolutional layers to extract features from the input data. It was inspired by biological mechanisms in living creatures: the neurons of each convolution layer use the features extracted by the previous layer to extract high-level abstract features (Kassim et al. 2017). It consists of artificial neurons arranged in multiple layers. Each neuron computes the weighted sum of its inputs and gives an activation value as output.
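A single artificial neuron of this kind can be sketched as follows. The sigmoid activation and the example weights are assumptions chosen for illustration only.

```python
import math
from typing import Sequence


def neuron(inputs: Sequence[float], weights: Sequence[float],
           bias: float = 0.0) -> float:
    """Weighted sum of the inputs followed by a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs and weights: 0.5 * 1.0 + (-0.25) * 2.0 = 0.0
activation = neuron([1.0, 2.0], [0.5, -0.25])
```

Here the weighted sum is 0, so the sigmoid returns an activation of 0.5; larger weighted sums push the activation towards 1 and smaller ones towards 0.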

The weights define a neuron’s behavior, and by applying them to pixels, a CNN’s neurons can extract features. When an input image is given to a CNN, each layer generates an activation map that highlights features of the image. Each neuron takes a patch of pixels, multiplies their color values by its weights, sums the products, and passes the result through the activation function. The first layer recognizes simple shapes such as vertical, horizontal, and diagonal lines. Second-layer neurons recognize more complicated shapes, for example curves and line endpoints. The third layer recognizes parts such as the ear, nose, and mouth. Finally, the last layers of neurons recognize the complete human face. One such example is shown in Fig. 6b.

Fig. 6 a Structure of an artificial neuron. b Feature extraction by CNN

3.3.3 K-nearest neighbors

KNN is another popular algorithm for recognizing patterns with the help of a training dataset. For a new sample, it finds the k closest training samples (the nearest neighbors) and places the sample in the category that is most common among them (Lim 2004). For example, if the value of k is 1, the sample is placed in the class of its single nearest neighbor.
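A minimal sketch of this voting rule is given below. It assumes numerical feature vectors, Euclidean distance, and majority voting; the training points and the class labels "A" and "B" are invented for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def knn_predict(train: List[Tuple[Sequence[float], str]],
                query: Sequence[float], k: int = 1) -> str:
    """Return the most common label among the k nearest training samples."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training data with two classes.
train = [
    ((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
    ((1.0, 1.0), "B"), ((0.9, 1.1), "B"), ((1.2, 0.9), "B"),
]
label_k1 = knn_predict(train, (0.2, 0.1), k=1)
label_k3 = knn_predict(train, (1.0, 0.9), k=3)
```

With k = 1 the query simply inherits the label of its closest training sample; with a larger k the label is decided by a vote among several neighbors.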

3.3.4 Decision tree

Another method used for text classification is the decision tree, which arranges the classes in a hierarchy of levels. A decision tree works much like a flow chart: it repeatedly separates the dataset into categories of similar items, from the trunk to the branches to the leaves (Johnson et al. 2002). It allows nested categorization without any supervision by a human. Figure 7 shows an example of a decision tree applied to a sports dataset.

Fig. 7 Decision Tree

4 Literature review

This section surveys related work in the field of DNA sequence classification. Table 2 summarizes the research done by several authors and answers questions such as why DNA sequence classification is needed and which methods can be used for it.

Table 2 Research done in the field of DNA sequence classification

CNN has already shown excellent performance in various fields and is now being applied to DNA sequence classification. Bosco et al. (2016) compared CNN with other machine learning methods, and the results showed that it also gives the best results when applied to nucleotide sequences. Therefore, in this work CNN is used as the basic model for feature extraction and classification.

5 Model

CNN can be applied to textual data problems such as classification and categorization; the difference from image data is that digital pictures are 2-dimensional matrices, whereas textual data is a 1-dimensional sequence of letters. The text therefore has to be converted to an equivalent numerical form before it can be given as input to a CNN. One option is to maintain a lookup table that maps each word to a numerical value, known as a word vector. However, the lookup table has some limitations, so an approach that uses hot vectors to represent words has been proposed. Figure 8 shows an example in which the words of a dictionary D are listed with their hot vector representations.

Fig. 8 Hot vector representation of words

Now suppose the dictionary contains 5 words (I, car, bike, bought, a), each with its hot vector. To represent the sentence “I bought a car” as a 2-D numerical matrix, we take regions of size two and write the hot vectors for each pair of adjacent words. After all words have been converted into hot vectors, we obtain a 2-dimensional matrix that can be given directly as input to a traditional CNN for classification. This model is known as the CNN model for sequences, or sequential convolutional neural network.
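The word-level example can be sketched in Python as follows. The order of the dictionary words determines the position of the 1 in each hot vector and is taken from the text above; concatenating the vectors of two adjacent words is one possible reading of "region of size two" and is an assumption of this sketch.

```python
from typing import List


def one_hot(word: str, dictionary: List[str]) -> List[int]:
    """Vector of zeros with a single 1 at the word's dictionary position."""
    vector = [0] * len(dictionary)
    vector[dictionary.index(word)] = 1
    return vector


dictionary = ["I", "car", "bike", "bought", "a"]
sentence = "I bought a car".split()

# One row per word: a 4 x 5 matrix for the example sentence.
matrix = [one_hot(word, dictionary) for word in sentence]

# Assumed region encoding: concatenate the vectors of adjacent word pairs.
regions = [matrix[i] + matrix[i + 1] for i in range(len(matrix) - 1)]
```

The resulting `matrix` has one row per word of the sentence, and `regions` has one row of length 10 per pair of adjacent words.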

5.1 Proposed method

FASTA-based DNA sequences are in textual format, but they are continuous strings of letters with no spaces, so they contain no notion of a word. We therefore need to convert the sequences into words so that the CNN can be applied in the same way as on text. The conversion is done in such a way that each nucleotide keeps its position in the DNA sequence. The figure below illustrates this concept. We use a fixed-size window of 4 that slides with a fixed stride of 1. At each step, the window reads a continuous stretch of the DNA sequence and treats it as a word. In this way, a sequence of words is derived from the DNA data, and the text-based CNN technique can be applied to it.
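The windowing step can be sketched with a short function. The window size of 4 and the stride of 1 come from the text; the function name `sliding_words` is illustrative, and the input is the example sequence from Sect. 2.1.

```python
from typing import List


def sliding_words(sequence: str, window: int = 4, stride: int = 1) -> List[str]:
    """Split a DNA string into overlapping words of length `window`."""
    last_start = len(sequence) - window
    return [sequence[i:i + window] for i in range(0, last_start + 1, stride)]


words = sliding_words("GGCATCTGAGACCAGTGAGAA")
# words[:4] == ["GGCA", "GCAT", "CATC", "ATCT"]
```

A sequence of length n yields n - 3 words for a window of 4 and a stride of 1, so the 21-letter example produces 18 words.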

The FASTA DNA file format uses only 4 letters, i.e., A, C, G, and T (Garima et al. 2020, 2021a, b). If the chosen window size is 4, then we have …256 distinct words in the dictionary, so each word can be represented by a hot vector of size 256. Finally, a 2-D matrix is generated from the words of the DNA sequence such that the position of every nucleotide is maintained; the resulting matrix is given as input to the CNN for classification.

figure a

In the above example, 4 consecutive nucleotides are picked at each step with a stride of 1, giving words such as GGCA, GCAT, CATC, ATCT, …, which are added to the destination sequence. We now have a dictionary of 256 words (4^4), and each word can be represented by its corresponding hot vector to build a 2-D matrix of numerical values (Fig. 9).

Fig. 9 Dictionary of words

Using this dictionary, the hot vectors for the given DNA sequences are generated (Fig. 10).

Fig. 10 Hot vector representation of FASTA DNA sequence
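The construction of the dictionary and of the hot vector matrix can be sketched as follows. The alphabetical ordering of the dictionary and the all-zero row for words containing letters other than A, C, G, and T are assumptions of this sketch; the ordering actually used in Fig. 9 may differ.

```python
from itertools import product
from typing import Dict, List

NUCLEOTIDES = "ACGT"


def build_dictionary(window: int = 4) -> Dict[str, int]:
    """Map every possible word of length `window` to a column index."""
    words = ("".join(p) for p in product(NUCLEOTIDES, repeat=window))
    return {word: index for index, word in enumerate(words)}


def encode(sequence: str, window: int = 4, stride: int = 1) -> List[List[int]]:
    """One hot vector (row) per sliding-window word of the sequence."""
    index = build_dictionary(window)
    rows = []
    for start in range(0, len(sequence) - window + 1, stride):
        word = sequence[start:start + window]
        row = [0] * len(index)
        if word in index:  # assumption: unknown words stay all-zero
            row[index[word]] = 1
        rows.append(row)
    return rows


dictionary = build_dictionary()
matrix = encode("GGCATCTG")
```

For a window of 4 the dictionary holds 256 words, so an input of n nucleotides is encoded as a matrix with n - 3 rows and 256 columns, each row containing a single 1.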

5.2 Proposed framework

Early identification of deadly viruses such as the coronavirus can help prevent outbreaks like the current one and can speed up the design of drugs. This is why the classification of DNA sequences plays an important role in computational biology. In the proposed framework, whenever a patient infected with a virus or another disease visits a doctor, his or her DNA samples are collected and matched against a database to identify the disease. However, identification is difficult because few reference patterns are available for many diseases. This paper presents a solution in which samples collected from NCBI are used for the classification of DNA sequences. The classification in turn yields the patterns of various diseases; these patterns are then compared with the samples of a newly infected person, which can help identify the disease earlier. FASTA-based DNA sequences are large and complex, so they cannot be passed directly to feature extraction and must first be converted to an equivalent numerical form. This paper introduces a new hot vector-based numerical representation in which the position of each nucleotide is preserved using binary 0 or 1. The hot vector matrix is then given as input to a traditional CNN for feature extraction. The models are trained and evaluated on test data. Finally, the model can be applied to unseen data (an unknown DNA sequence) for disease prediction (Fig. 11).

Fig. 11 Proposed framework

6 Result analysis and discussion

For evaluation, we used a Python-based tool that performs both feature extraction and classification of DNA sequences. The overall architecture used for result analysis is shown in Fig. 12. In this work, we compared 5 well-known classifiers, namely Convolution neural network (CNN), Support Vector Machines (SVM), K-Nearest Neighbor (KNN) algorithm, Decision Trees, and Recurrent Neural Networks (RNN), on several parameters. We used the Kmer feature descriptor with a Kmer size of 3, the K-means method with 3 clusters for clustering the dataset, the Zscore method for feature normalization, and the Chi-square selection method to select 100 features. The following parameters were calculated and compared: sensitivity (Sn), specificity (Sp), accuracy (Acc), Matthews correlation coefficient (MCC), Recall, Precision, F1-score, the area under the receiver operating characteristic (ROC) curve (AUROC), and the area under the Precision-Recall (PRC) curve (AUPRC) (Liu 2017; Song et al. 2018; Liu et al. 2016, 2018; Li et al. 2018).
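As a rough illustration of two of these preprocessing steps, the sketch below computes normalized 3-mer frequencies for a sequence and applies Z-score normalization column by column. It is not the code of the tool used in this work: the alphabetical ordering of the 64 features, the skipping of words with letters other than A, C, G, and T, and the use of the population standard deviation are all assumptions.

```python
from itertools import product
from statistics import mean, pstdev
from typing import List


def kmer_frequencies(sequence: str, k: int = 3) -> List[float]:
    """Relative frequency of every possible k-mer over A, C, G, T."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(sequence) - k + 1):
        word = sequence[i:i + k]
        if word in counts:  # assumption: skip words with other letters
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


def zscore_columns(rows: List[List[float]]) -> List[List[float]]:
    """Normalize each feature (column) to zero mean and unit variance."""
    normalized_columns = []
    for column in zip(*rows):
        mu, sigma = mean(column), pstdev(column)
        normalized_columns.append(
            [(value - mu) / sigma if sigma else 0.0 for value in column]
        )
    return [list(row) for row in zip(*normalized_columns)]


features = kmer_frequencies("ACGTACGT")
normalized = zscore_columns([[1.0, 5.0], [3.0, 5.0]])
```

With k = 3 each sequence is described by 64 features; a constant column, such as the second one in the toy input, is mapped to zeros to avoid a division by zero.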

Fig. 12 The overall architecture used in result visualization

They are defined as

$$ {\text{Sensitivity}} = 1 - \frac{{\text{false negatives}}}{{{\text{true positives}} + {\text{false negatives}}}} $$
(1)
$$ {\text{Specificity}} = 1 - \frac{{\text{false positives}}}{{{\text{true negatives}} + {\text{false positives}}}} $$
(2)
$$ {\text{Accuracy}} = 1 - \frac{{{\text{false negatives}} + {\text{false positives}}}}{{{\text{true positives}} + {\text{true negatives}} + {\text{false positives}} + {\text{false negatives}}}} $$
(3)
$$ {\text{Matthews correlation coefficient}} = \frac{{1 - \left( {\frac{{\text{false negatives}}}{{{\text{true positives}} + {\text{false negatives}}}} + \frac{{\text{false positives}}}{{{\text{true negatives}} + {\text{false positives}}}}} \right)}}{{\sqrt {\left[ {1 + \frac{{{\text{false positives}} - {\text{false negatives}}}}{{{\text{true positives}} + {\text{false negatives}}}}} \right]\left[ {1 + \frac{{{\text{false negatives}} - {\text{false positives}}}}{{{\text{true negatives}} + {\text{false positives}}}}} \right]} }} $$
(4)
$$ {\text{Precision}} = 1 - \frac{{\text{false positives}}}{{{\text{true positives}} + {\text{false positives}}}} $$
(5)
$$ {\text{F1-score}} = 1 - \frac{{{\text{false positives}} + {\text{false negatives}}}}{{2 * {\text{true positives}} + {\text{false positives}} + {\text{false negatives}}}} $$
(6)

Here, the terms true positive, true negative, false positive, and false negative are defined as follows. Correct positive predictions made by the model are termed “true positives”. Correct negative predictions made by the model are termed “true negatives”. Incorrect positive predictions made by the model are termed “false positives”. Incorrect negative predictions made by the model are termed “false negatives”.
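For reference, the sketch below computes these measures from the four counts using the standard confusion-matrix definitions. The function name and the example counts are hypothetical and are not results from this work; the sketch also assumes that no denominator other than that of the MCC is zero.

```python
import math
from typing import Dict


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Standard measures computed from confusion-matrix counts."""
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
        # MCC is set to 0 when its denominator vanishes (a common convention).
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

With these counts the sensitivity is 0.8, the specificity 0.9, and the accuracy 0.85; recall is the same quantity as sensitivity.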

In the case of multiclass classification accuracy is defined as:

$$ {\text{Accuracy}}\left( {\text{i}} \right) = 1 - \frac{{{\text{false negatives}}\left( {\text{i}} \right) + {\text{false positives}}\left( {\text{i}} \right)}}{{{\text{true positives}}\left( {\text{i}} \right) + {\text{true negatives}}\left( {\text{i}} \right) + {\text{false positives}}\left( {\text{i}} \right) + {\text{false negatives}}\left( {\text{i}} \right)}} $$
(7)

Here, false negatives(i), false positives(i), true positives(i), and true negatives(i) denote the corresponding numbers of samples for the ith class. The Naive Bayes algorithm is a popularly used multiclass algorithm. The AUROC value is calculated from the ROC curve and takes values between 0 and 1, while the AUPRC value is calculated from the precision-recall curve. We compared our model with other classification approaches based on these parameters. To estimate and compare the proposed model with the other machine learning models, k-fold cross-validation with k = 4 is used.
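The splitting scheme behind k-fold cross-validation can be sketched as follows for k = 4. The sketch assumes contiguous folds without shuffling or stratification, which may differ from the procedure of the tool used in this work.

```python
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int = 4) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(10, k=4)
```

Each sample is used for testing exactly once and for training in the remaining three folds; the reported score is then typically the average over the four folds.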

6.1 Comparison of evaluation metrics

According to the calculated results, CNN gives an accuracy of 73.5, the Decision tree an average accuracy of 62.5, MLP 78.0, RNN 69.0, SVM 50.0, and the proposed method 93.9, so the proposed method gives the highest accuracy among the compared classification methods (Tables 3, 4, 5, 6, 7, 8).

Table 3 The evaluation result of CNN
Table 4 The evaluation result of the Decision Tree
Table 5 The evaluation result of MLP
Table 6 The evaluation result of the RNN
Table 7 The evaluation result of the SVM
Table 8 The evaluation result of the Proposed Method

6.2 Clustering

Based on the final updated population of the cluster centers, the proposed work uses only these sessions for training the neural network. This increases the learning capacity of the model, because a large number of patterns from the same class increases the confusion of the system. In other words, training on a small set of clustered input data improves the detection rate of the proposed work. The clustering results of all the classification algorithms, along with those of the proposed algorithm, are shown in Fig. 13.

Fig. 13 a CNN clustering. b Decision tree clustering. c RNN clustering. d SVM Clustering. e MLP clustering. f Proposed method

Table 9 shows that the proposed method gives a significant improvement over the previous best results in terms of both precision and accuracy. The improvement in precision is nearly 1.112%, and the improvement in accuracy is about 15.9%. The improvement is large compared to the other methods because we used an improved method for sequence representation (the hot-vector representation) and chose the best-performing feature selection method. The results indicate that the features extracted by the proposed method are very useful for classifying a sequence into its true category (Fig. 14).

Table 9 Comparison with the previous best results
Fig. 14 Average accuracy comparison of classification techniques

7 Conclusion

Since the COVID-19 outbreak, early identification of disease has become a priority for everyone, because it can save more lives. Various machine learning (ML) techniques can be used for classification. This paper presented a solution in which samples collected from NCBI are used for the classification of DNA sequences. The classification in turn yields the patterns of various diseases; these patterns are then compared with the samples of a newly infected person, which can help identify the disease earlier. However, these DNA sequences are large and complex, so they cannot be passed directly to feature extraction and must first be converted to an equivalent numerical form. This paper introduced a new hot vector-based numerical representation in which the position of each nucleotide is preserved using binary 0 or 1. The hot vector matrix is then given as input to a traditional CNN for feature extraction. Several classifiers, including Convolution neural network (CNN), Support Vector Machines (SVM), K-Nearest Neighbor (KNN) algorithm, Decision Trees, Artificial Neural Networks (ANN), and the proposed method, were compared on several parameters, including sensitivity, specificity, accuracy, Matthews correlation coefficient, Recall, Precision, F1-score, the area under the receiver operating characteristic (ROC) curve (AUROC), and the area under the Precision-Recall (PRC) curve (AUPRC). The results show that the proposed method gives an accuracy of 93.9%, the highest among the compared classifiers.