A comprehensive tool for rapid and accurate prediction of disease using DNA sequence classifier

Mathur, Garima; Pandey, Anjana; Goyal, Sachin

doi:10.1007/s12652-022-04099-y

Original Research
Published: 25 June 2022

Volume 14, pages 13869–13885, (2023)
Journal of Ambient Intelligence and Humanized Computing

Abstract

In the current pandemic, the coronavirus is spreading very fast from one human to another. In addition, there are many other viruses, for example Ebola and SARS, that can spread as fast as the coronavirus because of the mobility and globalization of the population, and that are equally deadly. Earlier identification of these viruses can help prevent outbreaks such as the one we are currently facing and can support the earlier design of drugs. Identification of disease at an early stage can be achieved through DNA sequence classification, as DNA carries most of the genetic information about organisms. This is why the classification of DNA sequences plays an important role in computational biology. This paper presents a solution in which samples collected from NCBI are used for the classification of DNA sequences. DNA sequence classification in turn gives the patterns of various diseases; these patterns are then compared with the samples of a newly infected person and can help in the earlier identification of disease. However, feature extraction remains a major issue. In this paper, a machine learning-based classifier and a new technique for extracting features from DNA sequences based on a hot vector matrix are proposed. In the hot vector representation of the DNA sequence, each pair of words is represented using a binary matrix that records the position of each nucleotide in the DNA sequence. The resultant matrix is then given as input to a traditional CNN for feature extraction. The proposed method has been compared with 5 well-known classifiers, namely Convolutional Neural Network (CNN), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Decision Tree, and Recurrent Neural Network (RNN), on several parameters including precision and accuracy. The results show that the proposed method gives an accuracy of 93.9%, the highest among the compared classifiers.

1 Introduction

Earlier identification of viruses and diseases has become a major concern after the COVID-19 outbreak, as early identification can save more lives and can help in the earlier design of drugs (Herath et al. 2022; Shadab et al. 2020; Herath et al. 2021a, b). However, due to the shortage of patterns available for various diseases, it becomes difficult to identify them. The classification of DNA sequences can be considered one solution, which is why it plays an important role in computational biology.

NCBI maintains GenBank, one of the largest collections of genome sequences. It consists of millions of distinct DNA sequences with billions of nucleotide bases (Benson et al. 2010). DNA sequences are the blueprints of any organism. DNA stands for deoxyribonucleic acid, which is made up of four nucleotides, namely A, C, G, and T, i.e. adenine, cytosine, guanine, and thymine. The DNA sequence of every disease and virus is unique, which helps to extract unique patterns (Momenzadeh et al. 2020). In double-stranded DNA, each nucleotide is bound to its complementary partner (A-T and C-G pairs), as shown in Fig. 1.

In this paper, a framework is proposed in which, whenever a patient (infected by a virus or any other disease) visits a doctor, his/her DNA samples are collected and matched against a database to identify the occurrence of the disease. However, due to the shortage of patterns available for various diseases, it becomes difficult to identify them. This paper presents a solution in which samples collected from NCBI are used for the classification of DNA sequences. DNA sequence classification in turn gives the patterns of various diseases; these patterns are then compared with the samples of a newly infected person and can help in the earlier identification of disease. FASTA-based DNA sequences are large and complex, which means this data cannot be given directly for feature extraction. The data first needs to be converted to an equivalent numerical form. In this paper a new hot vector-based numerical representation is introduced, in which the position of each nucleotide is preserved using binary values 0 and 1. This hot vector matrix is then given as input to a traditional CNN for feature extraction. The models are then trained and evaluated on test data. Finally, the model can be applied to unseen data (an unknown DNA sequence) for disease prediction.

Various machine learning (ML) techniques can be used for classification. For our analysis, we have taken 37,272 COVID samples, 1418 Middle East Respiratory Syndrome (MERS), 6503 Severe Acute Respiratory Syndrome (SARS), 1886 DENGUE, 8226 HEPATITIS, and 10,848 INFLUENZA samples. These samples are used first for training our model and then for testing it. To train machine learning models for various kinds of disease, features are extracted from well-known sequences. In this work, a new neural network approach is proposed that takes a hot vector matrix as input to the convolution layers for extracting features from the input data. The position of each nucleotide of the DNA sequence is represented using a hot vector, and the neurons of each convolution layer use the features extracted by the previous layer to extract higher-level abstract features (Fig. 2).

2 Background

Section 2.1 describes DNA sequences in the FASTA file format. Section 2.2 gives a detailed description of classification, the types of classifiers, and a basic introduction to them.

2.1 DNA sequence in FASTA file format

In this paper, we have also studied DNA sequences in the FASTA file format, which is a text-based format used to denote sequences in bioinformatics (Hach et al. 2014; Mohammed et al. 2012). It represents bases using single-letter codes, as shown in Table 1.

Table 1 DNA base pairs

The blueprint of an organism depends on the order in which the bases are stored. The format allows sequence names and comments to precede the sequences. Each sequence in a FASTA file starts with a one-line description containing an identifier, followed by the DNA sequence data.

> AR147821.1 Sequence 1 from patent US 6225051

GGCATCTGAGACCAGTGAGAA

The description line is marked by the '>' sign at its start, which separates it from the sequence data. The identifier of the sequence is the word that follows the '>' symbol; the remainder of the line is an optional description, separated from the identifier by a space or tab. The sequence data starts on the line after the description line. A subsequent line beginning with another '>' sign marks the end of the current sequence and the beginning of the next one.
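The layout described above can be read with a few lines of code. The following is a minimal sketch, not the tool used in this work; the function names `parse_fasta` and `_make_record` are hypothetical, and the sketch assumes that blank lines can be ignored and that sequence data may span several lines.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (identifier, description, sequence)."""
    records = []
    header = None   # description line of the record being read
    chunks = []     # sequence lines collected for that record
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new '>' line closes the previous record and opens the next.
            if header is not None:
                records.append(_make_record(header, chunks))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


def _make_record(header, chunks):
    # The identifier is the first word; the rest is the optional description.
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description, "".join(chunks).upper()


example = "> AR147821.1 Sequence 1 from patent US 6225051\nGGCATCTGAGACCAGTGAGAA\n"
records = parse_fasta(example)
```

For the example record shown earlier, `records` holds a single tuple with the identifier `AR147821.1`, its description, and the 21-base sequence.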

2.2 Pattern recognition

For the identification of various kinds of disease, pattern recognition is considered an important factor. It consists of three primary elements which are data perception, extraction of features, and data classification (Fig. 3) (Sathish kumar et al. 2005; Jain et al. 2004).

2.2.1 Preprocessing

To increase the efficiency of the work, the raw dataset should be pre-processed rather than given directly to the selected classifiers. The raw dataset is pre-processed in different ways to overcome issues such as training overhead, classifier confusion, false alarms, and low detection rates. It is also necessary to separate the feature spaces from one another and arrange them in vectors. CNN can be used for various textual data problems such as classification and categorization; the only difference is that digital pictures are 2-dimensional matrices, whereas textual data is a 1-dimensional sequence of letters. That is why the text needs to be converted to an equivalent numerical form so that it can be given as input to the CNN.

2.2.2 Features selection

Feature selection is an important factor in any kind of classification, whether supervised or unsupervised, since a large number of features can be monitored, each with a wide range of possible values (especially continuous features), even for a small network.

2.2.3 Final cluster data

Only the sessions belonging to the final, updated cluster centers are used for training the neural network. This increases the learning capacity of the system, because a large number of patterns from the same class increases the confusion of the system. In other words, training on a small set of clustered inputs improves the detection rate of the proposed work.

2.2.4 Classification

The proposed model uses a deep learning neural network for training, as such networks make it possible to identify subclasses. During training, the neurons of the network adjust their weights. The feature vectors collected during the feature extraction step are grouped by class and matched in the network. Finally, a trained neural network (TN) is obtained. CNN has been reported to give very good results for the classification of text data, so in this work we use CNN for the classification of textual DNA data.

3 Classification techniques

Classification is the process of recognizing, understanding, and grouping ideas and objects into preset categories or "sub-populations". ML uses several algorithms and pre-categorized datasets, i.e. training datasets, to classify new data. The training datasets are used to learn which sequence falls into which predefined category. A simple example of classification is filtering mail into spam and non-spam categories. In other words, “It is a form of pattern recognition where classification algorithm is used along with training dataset to check the patterns such as words with similar sentiments etc. for future dataset”.

3.1 Machine learning-based systems

In a machine learning-based text classification system, past experiences/observations are used to make the system learn (Ikonomakis et al. 2005). The first step in training an ML-based system is feature extraction, where the textual data is converted into an equivalent numerical form and represented as a vector. One such method represents each word by its frequency of occurrence; this approach is known as a “bag of words”. Suppose we have a dictionary of words {I, have, has, dog, cat, a} and we want to represent the text “I have a dog” in vector form; the vector representation of the text is then {1, 1, 0, 1, 0, 1}. To generate a classification model, a training dataset consisting of pairs of feature sets and tags is given as input to the ML algorithm (Fig. 4).
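The bag-of-words example above can be reproduced with a short sketch. The function name `bag_of_words` is hypothetical, and the sketch assumes simple whitespace tokenization with case-sensitive matching.

```python
def bag_of_words(text, dictionary):
    """Count how often each dictionary word occurs in the text."""
    tokens = text.split()  # assumption: words are separated by whitespace
    return [tokens.count(word) for word in dictionary]


dictionary = ["I", "have", "has", "dog", "cat", "a"]
vector = bag_of_words("I have a dog", dictionary)
print(vector)  # [1, 1, 0, 1, 0, 1]
```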

Once the ML model gets trained with several training data samples it becomes ready to make exact predictions. This feature extractor can now be used in the transformation of unseen text into some feature set and these feature sets are then given as an input to the classification model to get accurate predictions on tags.

Text classification using ML is generally more accurate and precise than hand-crafted rule systems, and it makes learning from new examples much easier.

Text classification algorithms based on ML: two text classifiers are listed below:

3.2 SVM (support vector machines)

SVM stands for support vector machine, which is well known for giving precise and fast results without requiring much training data (Shanahan et al. 2003). However, it requires more computational resources. It constructs a hyperplane that separates the space into two parts: the first part contains the vectors that belong to a given group, and the second part contains the remaining vectors, which do not belong to that group. The best hyperplane is the one with the largest distance (margin) to the nearest vectors of each tag. Figure 5a shows an example of a hyperplane in 2 dimensions.

Vectors represent the training text and the group represents tags used for tagging texts. As the complexity of data increases, it becomes difficult to categorize/classify vectors and tags into 2 subgroups only.

However, even when the data becomes more complicated, SVM can still give accurate and precise results. Figure 5c shows the best hyperplane in 3 dimensions (in circle form).

3.3 Deep learning

Deep learning comprises algorithms inspired by how the human brain works, known as neural networks. For text classification, deep learning is reported to provide highly accurate results. Two well-known deep learning techniques for text classification are:

Convolutional neural networks (CNN)
Recurrent neural networks (RNN).

3.3.1 RNN (recurrent neural network)

In this type of neural network, the output of the previous step is fed back as an input to the current step, which helps in predicting the output. Each neuron keeps some memory of what it has seen before moving to the next step. One of the most popular applications of RNN is text-to-speech conversion (Ikonomakis et al. 2005).

3.3.2 CNN (convolution neural network)

A convolutional neural network is a deep learning model built on the idea of using convolutional layers to extract features from input data. It was inspired by the visual mechanism of living creatures: the neurons of a convolution layer use the features extracted by the previous layer to extract higher-level abstract features (Kassim et al. 2017). It consists of artificial neurons arranged in multiple layers. Each neuron computes the weighted sum of its inputs and outputs an activation value.

The weights define a neuron’s behavior, and by applying them to pixels the CNN’s neurons can extract features. When an input image is given to a CNN, each layer generates an activation map that highlights features of the image. Each neuron takes a patch of pixels, multiplies the pixel values by its weights, adds them up, and passes the sum through the activation function. Simple shapes such as vertical, horizontal, and diagonal lines are recognized by the first layer. More complicated shapes, for example curves and line endpoints, are recognized by second-layer neurons. Parts such as the ear, nose, and mouth are recognized by the third layer. Finally, the complete human face is recognized by the last layers of neurons. One such example is shown in Fig. 6b.
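The weighted-sum-plus-activation step can be illustrated in one dimension. The sketch below is a simplification chosen for brevity and is not the network used in this work: it assumes a 1-dimensional input instead of an image, a single hand-picked kernel, and ReLU as the activation function. The names `relu` and `conv1d` are hypothetical.

```python
def relu(x):
    """Rectified linear activation: negative sums are clipped to zero."""
    return x if x > 0 else 0.0


def conv1d(signal, kernel, bias=0.0):
    """Slide the kernel over the signal and return the activation map."""
    k = len(kernel)
    activation_map = []
    for start in range(len(signal) - k + 1):
        window = signal[start:start + k]
        # Weighted sum of the inputs covered by the kernel, plus a bias.
        weighted_sum = sum(w * x for w, x in zip(kernel, window)) + bias
        activation_map.append(relu(weighted_sum))
    return activation_map


# The kernel [-1, 1] responds only where the signal steps up from 0 to 1.
activation_map = conv1d([0, 0, 1, 1, 0, 0], [-1, 1])
```

Here the activation map is non-zero only at the position of the rising edge, which is the 1-dimensional analogue of a first-layer neuron detecting a simple line.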

3.3.3 K-nearest neighbors

KNN is another popular algorithm used for recognizing patterns with the help of a training dataset. It finds the k closest training samples (nearest neighbors) of the data point to be classified and assigns the point to the category most common among those neighbors (Lim 2004). For example, if the value of k is 1, the data point is placed in the class of its single nearest neighbor.
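A minimal sketch of nearest-neighbor voting is given below. It assumes Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy training points are hypothetical and are not taken from the datasets used in this work.

```python
from collections import Counter
import math


def knn_predict(train, query, k=1):
    """Return the majority label among the k training points nearest to query.

    train is a list of (feature_vector, label) pairs.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


train = [
    ((0.0, 0.0), "A"),
    ((0.1, 0.2), "A"),
    ((1.0, 1.0), "B"),
    ((0.9, 1.1), "B"),
]
label_k1 = knn_predict(train, (0.2, 0.1), k=1)
label_k3 = knn_predict(train, (0.8, 0.9), k=3)
```

With k = 1 the query point simply takes the label of its closest training point; with k = 3 the label is decided by a vote among the three closest points.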

3.3.4 Decision tree

Another method used for text classification is the decision tree, which organizes classes hierarchically. A decision tree works much like a flow chart: it repeatedly splits the dataset into groups of similar items, from the trunk of the tree to its branches and leaves (Johnson et al. 2002). It allows nested categorization without any supervision by a human. Figure 7 shows an example of a decision tree applied to a sports dataset.

4 Literature review

In this section, a survey of related work in the field of DNA sequence classification is presented. Table 2 summarizes the research work done by several authors and addresses questions such as why DNA sequence classification is needed and which methods can be used for classification.

Table 2 Research done in the field of DNA sequence classification

CNN has already shown excellent performance in various fields and is now being applied to DNA sequence classification. Bosco et al. (2016) compared CNN with other machine learning methods, and the results showed that it also gives the best results when applied to nucleotide sequences. Therefore, in this work CNN is used as the basic model for feature extraction and classification.

5 Model

CNN can be used for various textual data problems such as classification and categorization; the only difference is that digital pictures are 2-dimensional matrices, whereas textual data is a 1-dimensional sequence of letters. That is why the text needs to be converted to an equivalent numerical form so that it can be given as input to the CNN. One option is to maintain a lookup table that maps each word to a numerical vector, known as a word vector. However, using a lookup table has some limitations, so an approach that uses hot vectors to represent words has been proposed. Figure 8 shows an example, where dictionary D has some words with their hot vector representation.

Now suppose we have 5 words in a dictionary (I, car, bike, bought, a), each with its own hot vector. To represent the sentence “I bought a car” as a 2D numerical matrix, we take a region of size two and write a hot vector for each pair of words. After converting all words into hot vectors, we obtain a 2-dimensional matrix that can be given directly as input to a traditional CNN for classification. This model is known as the CNN model for sequences, or sequential convolutional neural network.
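One plausible reading of the region-based representation is sketched below. It assumes that the hot vector of a region is the concatenation of the one-hot vectors of its two words, and that the dictionary order is the one listed above; the exact layout shown in Fig. 8 may differ. The names `one_hot` and `region_matrix` are hypothetical.

```python
def one_hot(word, dictionary):
    """Return a vector with a single 1 at the word's dictionary position."""
    vector = [0] * len(dictionary)
    vector[dictionary.index(word)] = 1
    return vector


def region_matrix(sentence, dictionary, region_size=2):
    """Build one row per region of consecutive words (assumed: concatenation)."""
    tokens = sentence.split()
    rows = []
    for start in range(len(tokens) - region_size + 1):
        row = []
        for token in tokens[start:start + region_size]:
            row.extend(one_hot(token, dictionary))
        rows.append(row)
    return rows


dictionary = ["I", "car", "bike", "bought", "a"]
matrix = region_matrix("I bought a car", dictionary)
```

Under these assumptions the four-word sentence yields three regions ("I bought", "bought a", "a car"), so `matrix` has three rows of ten binary values each.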

5.1 Proposed method

FASTA-based DNA sequences are in textual format, but they are continuous sequences of letters with no spaces, which means they have no concept of a word. That is why we need to convert the sequences into words so that a CNN can work on them in the same way as on text. The conversion is done in such a way that each nucleotide keeps its position in the DNA sequence. This concept is better understood with the help of the figure below. Here we use a fixed-size window of 4 that slides with a fixed stride of 1. At each step the window reads a continuous stretch of DNA and treats it as a word. In this way a sequence of words is derived from the DNA data, and the text-based CNN technique can then be applied to it.

In the FASTA DNA file format there are only 4 letters, i.e. A, C, G, and T (Garima et al. 2020, 2021a, b). If the chosen window size is 4 then we have …256 distinct words in a dictionary. That means each word can be represented using a hot vector of size 256. Finally, from the generated words of the DNA sequence, a 2D matrix is built in which the position of every nucleotide is maintained; the resultant matrix is then given as input to the CNN for classification.

In the above example, 4 consecutive nucleotides are picked at each step with a stride of 1, giving words such as GGCA, GCAT, CATC, ATCT …, which are added to the destination sequence. We now have a dictionary of 256 words (4⁴), and each word can be represented by its corresponding hot vector to build a 2D matrix of numerical values (Fig. 9).

Using this dictionary, the hot vector matrix is generated for the given DNA sequences (Fig. 10).
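The word extraction and hot vector steps described in this section can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the lexicographic ordering of the dictionary, the one-row-per-word layout, and the function names `sliding_words`, `build_dictionary`, and `one_hot_matrix` are all assumptions, and only the letters A, C, G, and T are handled.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def sliding_words(sequence, window=4, stride=1):
    """Read overlapping words of `window` nucleotides from the sequence."""
    last_start = len(sequence) - window
    return [sequence[i:i + window] for i in range(0, last_start + 1, stride)]


def build_dictionary(window=4):
    """Map each of the 4**window possible words to an index (assumed order)."""
    words = ("".join(p) for p in product(NUCLEOTIDES, repeat=window))
    return {word: index for index, word in enumerate(words)}


def one_hot_matrix(sequence, window=4, stride=1):
    """Return one binary row per word, with a single 1 at the word's index."""
    dictionary = build_dictionary(window)
    matrix = []
    for word in sliding_words(sequence, window, stride):
        row = [0] * len(dictionary)
        row[dictionary[word]] = 1  # raises KeyError for letters other than ACGT
        matrix.append(row)
    return matrix


sequence = "GGCATCTGAGACCAGTGAGAA"
words = sliding_words(sequence)
matrix = one_hot_matrix(sequence)
```

For the 21-base example sequence, the window of 4 with stride 1 yields 18 words starting with GGCA, GCAT, CATC, and ATCT, so `matrix` has 18 rows of 256 binary values, each row containing exactly one 1.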

5.2 Proposed framework

Earlier identification of deadly viruses such as the coronavirus can help prevent outbreaks such as the one we are currently facing and can support the earlier design of drugs. This is why the classification of DNA sequences plays an important role in computational biology. In this paper, a framework is proposed in which, whenever a patient (infected by a virus or any other disease) visits a doctor, his/her DNA samples are collected and matched against a database to identify the occurrence of the disease. However, due to the shortage of patterns available for various diseases, it becomes difficult to identify them. This paper presents a solution in which samples collected from NCBI are used for the classification of DNA sequences. DNA sequence classification in turn gives the patterns of various diseases; these patterns are then compared with the samples of a newly infected person and can help in the earlier identification of disease. FASTA-based DNA sequences are large and complex, which means this data cannot be given directly for feature extraction. The data first needs to be converted to an equivalent numerical form. In this paper a new hot vector-based numerical representation is introduced, in which the position of each nucleotide is preserved using binary values 0 and 1. This hot vector matrix is then given as input to a traditional CNN for feature extraction. The models are then trained and evaluated on test data. Finally, the model can be applied to unseen data (an unknown DNA sequence) for disease prediction (Fig. 11).

6 Result analysis and discussion

For evaluation purposes, we used a Python-based tool that can perform feature extraction as well as classification of DNA sequences. The overall architecture used for result analysis is shown in Fig. 12. In this work, we compared 5 well-known classifiers, namely Convolutional Neural Network (CNN), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Decision Tree, and Recurrent Neural Network (RNN), on several parameters. The following settings were used: the Kmer feature descriptor with a Kmer size of 3; the K-means method for clustering the dataset, with the number of clusters set to 3; the Zscore method for feature normalization; and the Chi-square method for feature selection, with 100 features selected. Several evaluation metrics were calculated and compared, including sensitivity (Sn), specificity (Sp), accuracy (Acc), Matthews correlation coefficient (MCC), Recall, Precision, F1-score, the area under the receiver operating characteristic (ROC) curve (AUROC), and the area under the Precision-Recall (PRC) curve (AUPRC) (Liu 2017; Song et al. 2018; Liu et al. 2016, 2018; Li et al. 2018).
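To make the Kmer descriptor and Zscore normalization concrete, the sketch below computes 3-mer frequencies and standardizes each feature column. It is not the tool used in this work: the function names `kmer_frequencies` and `zscore_columns` are hypothetical, and the use of relative frequencies and of the population standard deviation are assumptions.

```python
from itertools import product
from statistics import mean, pstdev


def kmer_frequencies(sequence, k=3):
    """Return the relative frequency of every possible k-mer over ACGT."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(sequence) - k + 1
    if total <= 0:
        return [0.0] * len(kmers)
    for i in range(total):
        kmer = sequence[i:i + k]
        if kmer in counts:  # k-mers with other letters (e.g. N) are skipped
            counts[kmer] += 1
    return [counts[kmer] / total for kmer in kmers]


def zscore_columns(rows):
    """Standardize each feature column to zero mean and unit variance."""
    normalised_columns = []
    for column in zip(*rows):
        mu, sigma = mean(column), pstdev(column)
        normalised_columns.append(
            [(value - mu) / sigma if sigma > 0 else 0.0 for value in column]
        )
    return [list(row) for row in zip(*normalised_columns)]


sequences = ["GGCATCTGAGACCAGTGAGAA", "ACGTACGTACGTACGT"]
features = [kmer_frequencies(s) for s in sequences]
scaled = zscore_columns(features)
```

With a Kmer size of 3 each sequence is described by 4³ = 64 frequencies; after normalization every feature column has zero mean across the samples.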

They are defined as

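$$ {\text{Sensitivity}} = 1 - \frac{{\text{false negatives}}}{{{\text{true positives}} + {\text{false negatives}}}} $$

(1)

$$ {\text{Specificity}} = 1 - \frac{{\text{false positives}}}{{{\text{true negatives}} + {\text{false positives}}}} $$

(2)

$$ {\text{Accuracy}} = 1 - \frac{{{\text{false negatives}} + {\text{false positives}}}}{{{\text{true positives}} + {\text{true negatives}} + {\text{false positives}} + {\text{false negatives}}}} $$

(3)

$$ {\text{Matthews correlation coefficient}} = \frac{{1 - \left( {\frac{{\text{false negatives}}}{{{\text{true positives}} + {\text{false negatives}}}} + \frac{{\text{false positives}}}{{{\text{true negatives}} + {\text{false positives}}}}} \right)}}{{\sqrt {\left[ {1 + \frac{{{\text{false positives}} - {\text{false negatives}}}}{{{\text{true positives}} + {\text{false negatives}}}}} \right]\left[ {1 + \frac{{{\text{false negatives}} - {\text{false positives}}}}{{{\text{true negatives}} + {\text{false positives}}}}} \right]} }} $$

(4)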

$$ {\text{Precision}} = 1 - \frac{{\text{false positives }}}{{ {\text{true positives}} + {\text{false positives}}}} $$

(5)

$$ {\text{F1 - score}} = 1 - \frac{{{\text{false positives}} + {\text{false negatives}}}}{{2 * {\text{true positives}} + {\text{false positives}} + {\text{false negatives}}}} $$

(6)

Here, the terms true positive, true negative, false positive, and false negative are defined as follows. Correct positive predictions made by the model are termed “true positives”. Correct negative predictions made by the model are termed “true negatives”. Incorrect positive predictions made by the model are termed “false positives”. Incorrect negative predictions made by the model are termed “false negatives”.
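As an illustration, the metrics can be computed from the four counts as in the sketch below. It uses the standard confusion-matrix forms of these metrics and is not the evaluation code of the tool used in this work; the function name `binary_metrics` and the example counts are hypothetical, and the sketch assumes that every class has at least one sample so that no denominator is zero (apart from the guarded MCC case).

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute standard binary classification metrics from confusion counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),  # also called recall
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
        # MCC is set to 0.0 when its denominator vanishes (a common convention).
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


# Hypothetical counts, chosen only to illustrate the calculation.
scores = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

For these example counts, 85 of the 100 predictions are correct, and 40 of the 50 positive samples are recovered.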

In the case of multiclass classification accuracy is defined as:

$$ {\text{Accuracy}} = 1 - \frac{{{\text{false negatives}}\left( {\text{i}} \right) + {\text{false positives}}\left( {\text{i}} \right)}}{{{\text{true positives}}\left( {\text{i}} \right) + {\text{true negatives}}\left( {\text{i}} \right) + {\text{false positives}}\left( {\text{i}} \right) + {\text{false negatives}}\left( {\text{i}} \right)}} $$

(7)

Here, false negatives(i), false positives(i), true positives(i), and true negatives(i) are the corresponding counts for the ith class. The Naive Bayes algorithm is a popularly used multiclass algorithm. The AUROC value is calculated from the ROC curve and takes values between 0 and 1, while the AUPRC value is calculated from the precision-recall curve. We have compared our model with other classification approaches on test data based on these parameters. For estimation and comparison of the proposed model with other machine learning models, k-fold cross-validation is used, where k = 4.
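The k-fold procedure with k = 4 can be sketched as follows. The sketch assumes contiguous, unshuffled folds and is not the splitting code of the tool used in this work (in practice, samples are usually shuffled or stratified by class first); the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k=4):
    """Split sample indices into k folds; return (train, test) index pairs."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(10, k=4)
for train, test in folds:
    print(len(train), len(test))
```

Each sample appears in exactly one test fold, so every model is evaluated once on each part of the data and trained on the remaining three parts.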

6.1 Comparison of evaluation metrics

From the results in Tables 3, 4, 5, 6, 7, and 8, CNN gives an accuracy of 73.5%, the Decision Tree an average accuracy of 62.5%, MLP 78.0%, RNN 69.0%, SVM 50.0%, and the proposed method 93.9%. The proposed method therefore gives the highest accuracy among the compared classification methods.

Table 3 The evaluation result of CNN

Table 4 The evaluation result of the Decision Tree

Table 5 The evaluation result of MLP

Table 6 The evaluation result of the RNN

Table 7 The evaluation result of the SVM

Table 8 The evaluation result of the Proposed Method

6.2 Clustering

Only the sessions belonging to the final, updated cluster centers are used for training the neural network. This increases the learning capacity of the system, because a large number of patterns from the same class increases the confusion of the system. In other words, training on a small set of clustered inputs improves the detection rate of the proposed work. The clustering results of all the classification algorithms considered, along with the proposed algorithm, are shown below (Fig. 13).

Table 9 shows that the proposed method gives a significant improvement over the previous best results in terms of both precision and accuracy. The improvement in precision is nearly 1.112% and the improvement in accuracy is about 15.9%. The improvement is high compared to other methods because we have used an improved method for sequence representation (hot-vector representation) and have also chosen the best feature selection method. The results indicate that the features extracted by the proposed method are very useful for classifying a sequence into its true category (Fig. 14).

Table 9 Comparison with the previous best results

7 Conclusion

After the COVID-19 outbreak, early identification of disease has become a priority for everyone, as it can save more lives. Various machine learning (ML) techniques can be used for classification. This paper has presented a solution in which samples collected from NCBI are used for the classification of DNA sequences. DNA sequence classification in turn gives the patterns of various diseases; these patterns are then compared with the samples of a newly infected person and can help in the earlier identification of disease. However, these DNA sequences are large and complex, which means the data cannot be given directly for feature extraction and needs to be converted to an equivalent numerical form. In this paper a new hot vector-based numerical representation is introduced, in which the position of each nucleotide is preserved using binary values 0 and 1. This hot vector matrix is then given as input to a traditional CNN for feature extraction. The proposed method has been compared with several classifiers, including Convolutional Neural Network (CNN), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Decision Tree, and Artificial Neural Network (ANN), on several evaluation metrics, including sensitivity, specificity, accuracy, Matthews correlation coefficient, Recall, Precision, F1-score, the area under the receiver operating characteristic (ROC) curve (AUROC), and the area under the Precision-Recall (PRC) curve (AUPRC). The results show that the proposed method gives an accuracy of 93.9%, the highest among the compared classifiers.

Data availability

The datasets generated during and/or analyzed during the current study are not publicly available due to [security reasons] but are available from the corresponding author on reasonable request.

References

Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW (2010) GenBank. Nucleic Acids Research. vol. 38. Supplement 1:46–51
Bosco GL, Di Gangi MA (2016) Deep learning architectures for DNA sequence classification. In: Proceedings of the international workshop on fuzzy logic and applications. Springer, Cham, pp 162–171. https://doi.org/10.1007/978-3-319-52962-2_14
Garima M, Anjana P, Sachin G (2020) Immutable DNA sequence data transmission for next-generation bioinformatics using blockchain technology. In: 2nd international conference on data, engineering, and applications (IDEA)
Garima M, Anjana P, Sachin G (2021a) An approach to compress human genome sequence by delta computation and secure storage by Blockchain. DE. Pp 7130–7144
Garima M, Anjana P, Sachin G (2021b) Blockchain-based healthcare information exchange systems for the security of healthcare data. Turk Online J Qual Inquiry (TOJQI) 12(8):4498–4507
Hach F, Numanagic I, Sahinalp SC (2014) DeeZ: reference-based compression by local assembly. Nat Methods 11:1082–1084
Herath HMKKMB, Karunasena GMKB, Madhusanka BGDA, Priyankara HDNS (2021a). Internet of medical things (IoMT) enabled TeleCOVID system for diagnosis of COVID-19 patients. In: Agrawal R, Mittal M, Goyal LM (eds) Sustainability measures for COVID-19 pandemic. Springer, Singapore
Herath HMKKMB, Karunasena GMKB, Herath HMWT (2021b) Development of an IoT based systems to mitigate the impact of COVID-19 pandemic in smart cities. In: Ghosh U, Maleh Y, Alazab M, Pathan ASK (eds) Machine intelligence and data analytics for sustainable future smart cities. Studies in Computational Intelligence, vol 971. Springer, Cham
Herath HMKKMB, Karunasena GMKB, Madhusanka BGDA (2022) Early detection of COVID-19 pneumonia based on ground-glass opacity (GGO) features of computerized tomography (CT) angiography. 5G IoT and Edge Computing for Smart Healthcare Intelligent Data-Centric Systems, pp 257–277
https://monkeylearn.com/blog/classification-algorithms/
Ikonomakis M, Kotsiantis S, Tampakas V (2005) Text classification using machine learning techniques. WSEAS Trans Comput 4(8):966–974
Jain AK, Duin RPW (2004) Introduction to pattern recognition. In: The Oxford companion to the mind, second edition, Oxford University Press, Oxford, UK, pp 698–703
Johnson DE, Oles FJ, Zhang T, Goetz T (2002) A decision-tree-based symbolic rule induction system for text categorization. IBM Syst J
Kassim NA, Abdullah A (2017) Classification of DNA sequences using convolutional neural network approach. UTM Comput Proc Innov Comput Technol Appl 2:1–6
Levy S, Stormo GD (1997) DNA sequence classification using DAWGs. Struct Logic Comput Sci. https://doi.org/10.1007/3-540-63246-8_21
Li F, Li C, Marquez-Lago TT et al (2018) A comprehensive tool for rapid and accurate prediction of kinase family-specific phosphorylation sites in the human proteome. Bioinformatics 34:4223–4231
Lim H (2004) Improving kNN based text classification with well estimated parameters. LNCS 3316:516–523
Liu B, Fang L, Long R et al (2016) A two-layer predictor for identifying enhancers and their strength by pseudo k tuple nucleotide composition. Bioinformatics 32:362–369
Liu B, Yang F, Huang DS et al (2018) A two-layer predictor for identifying promoters and their types by multi-window-based PseKNC. Bioinformatics 34:33–40
Liu B (2017) BioSeq-analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Brief Bioinform. https://doi.org/10.1093/bib/bbx165
Ma Q, Wang JTL, Shasha D, Wu CH (2001) DNA sequence classification via an expectation maximization algorithm and neural networks: a case study. IEEE Trans Syst 31:468–475. https://doi.org/10.1109/5326.983930
Mohammed MH, Dutta A, Bost T, Chadaram S (2012) DELIMINATE—A fast and efficient method for lossless compression of genomic sequences. Bioinformatics 28:2527–2529
Momenzadeh M, Sehhati M, Rabbani H (2020) Using hidden Markov model to predict recurrence of breast cancer based on sequential patterns in gene expression profiles. J Biomed Inf 111
Müller HM, Koonin SE (2003) Vector space classification of DNA sequences. J Theor Biol 223:161–169. https://doi.org/10.1016/S0022-5193(03)00082-1
Nguyen N, Tran V, Ngo D, Phan D, Lumbanraja F, Faisal M, Abapihi B, Kubo M, Satou K (2016) DNA sequence classification by convolutional neural network. J Biomed Sci Eng 9:280–286. https://doi.org/10.4236/jbise.2016.95021
Ohno-Machado L, Vinterbo S, Weber G (2002) Classification of gene expression data using fuzzy logic. J Intell Fuzzy Syst 12(1):19–24
Ranawana R, Palade V (2005) A neural network-based multi-classifier system for gene identification in DNA sequences. Neural Comput Appl 14:122–131. https://doi.org/10.1007/s00521-004-0447-7
Sathish kumar S, Duraipandian N (2005) Int J Comput Technol 4(2c2):722–730. https://doi.org/10.24297/ijct.v4i2c2.4190
Shadab S, Alam Khan MT, Neezi NA, Adilina S, Shatabda S (2020) DeepDBP: deep neural networks for the identification of DNA-binding proteins. Inf Med Unlock 19:100318
Shanahan J, Roma N (2003) Improving SVM text classification performance through threshold adjustment. LNAI 2837:361–372
Song J, Li F, Takemoto K et al (2018) An integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework. J Theor Biol 443:125–137
Wang JTL, Marr TG, Shasha D, Shapiro BA, Chirn G, Lee TY (1996) Complementary classification approaches for protein sequences. Protein Eng 9(5):381–386
Yang A, Zhang W, Wang J, Yang K, Han Y, Zhang L (2020) Review on the application of machine learning algorithms in the sequence data mining of DNA. Front Bioeng Biotechnol. https://doi.org/10.3389/fbioe.2020.01032

Author information

Authors and Affiliations

Department of Computer Science and Engineering, UIT, RGPV, Bhopal, India
Garima Mathur
Department of Information Technology, UIT, RGPV, Bhopal, India
Anjana Pandey & Sachin Goyal

Authors

Garima Mathur
Anjana Pandey
Sachin Goyal

Corresponding author

Correspondence to Garima Mathur.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mathur, G., Pandey, A. & Goyal, S. A comprehensive tool for rapid and accurate prediction of disease using DNA sequence classifier. J Ambient Intell Human Comput 14, 13869–13885 (2023). https://doi.org/10.1007/s12652-022-04099-y

Received: 10 December 2021
Accepted: 06 June 2022
Published: 25 June 2022
Issue Date: October 2023
DOI: https://doi.org/10.1007/s12652-022-04099-y
