Keywords

1 Introduction

Automatic detection of diverse species of viral pathogens associated with emerging deadly ailments within human populations cannot be over-emphasized as they remain a big threat to both personal and public health. Recent advances in molecular biology, next generation sequencing and online bioinformatics platforms offer a vast computational ecosystem for accurate identification of causative viral pathogens associated with the deadly human diseases. While allowing for extensive analysis, the rapidly growing databases of genomic sequences also provide an avalanche of resources for improved epidemic surveillance, diagnostics and therapeutics towards promoting healthy living. Furthermore, newer digital signal processing-based bioinformatics methods utilize numerical and/or visual encoding of nucleotide sequences collected from laboratory and environmental surveillances for effective non-alignment analysis [1, 2].

The application of digital signal processing techniques to genomic analysis, coined Genomics Signal Processing (GSP), requires that the nucleotide sequences be encoded numerically or graphically for alignment-free sequence comparison [1, 3, 4]. Next, discriminatory genomic features are extracted from the numeric genome representations to improve on species- or genome-level classification, usually based on a machine learning technique [5]. GSP-based techniques provide alignment-free analyses of the genomes to address the problems of unequal lengths of the sequences, the computational speed and large memory requirements encountered during alignment-based analysis [6, 7]. However for an accurate detection, it is necessary to ensure that the numeric or visual encoding of the nucleotide sequences represents the unique and salient characteristic of the genome as desirable.

Unlike other methods, the Chaos Game Representation (CGR) visually expresses the local patterns of the nucleotide sequences and hence the global structure of the genome in a two-dimensional graphical form [2, 8]. CGR is a scale independent representation developed by Jeffrey [9]. It was derived from the chaos theory, which allows the illustration of frequencies of oligonucleotides in the form of images. With CGR, the oligonucleotides of a genome exhibit the main physiognomies of the whole genome [7]. However, in the original form, CGR is not convenient for processing with a computer, hence, another form of CGR named Frequency Chaos Game Representation (FCGR) was introduced [7, 8, 10]. The CGR pattern of the nucleotide sequences of the same genome are found to be similar but differs quantitatively from the CGR patterns of the genome from another specie. This biological attribute makes the unique genomic signature of CGR and the subsequent features extracted from it, an accurate representation for alignment-free analysis suitable for classification, clustering and identification as proposed in many researches reported recently.

Karamichalis et al. [5] investigated the intra-specie and inter-specie variations of the genomic signatures generated by CGR patterns, using six different distance measures. The study validated the hypothesis that the CGR patterns of the nucleotide sequences of the same genome are similar but differs quantitatively from the CGR patterns of the genome from another specie. The CGR-based genomic signatures also accurately classified the genomic DNA sequences of Homo sapiens and Mus musculus genomes at lower taxonomic levels – class and order.

Messaoudi et al. [11] encoded the genomic sequence of Caenorhabditis elegans (C. elegans) with frequency of CGR patterns, otherwise called Frequency Chaos Game Representation (FCGR), for a time-frequency investigation using the Continuous Wavelet Transform (CWT). The complex Morlet wavelet based CWT revealed significant biological characteristics from the genomic signature of the FCGR patterns.

Kari et al. [12] proposed a molecular distance map developed with the unique genomic signature of CGR suitable for defining relationships between species to identify species, clarify taxonomies and related evolutionary history. Multi-Dimensional Scaling (MDS) was applied to the distance metrics computed based on the Structural Dissimilarity Index (DSSIM) to produce the map. The map successfully characterized organisms into several taxonomy levels within the Euclidean space that showed the spatial proximity between the nucleotide sequences.

Tanchotsrinon et al. [13] adopted the CGR and Singular Value Decomposition (SVD) for Human Papillomavirus (HPV) genotyping as an approach to fight cervical cancer. Two classes of features were obtained from the SVD-reduced matrices of the original CGR: ChaosCentroid, which captured the structure of the sequences and ChaosFrequency, which represented relevant statistical distribution of nucleotides in the sequences. Their study demonstrated comparative results with no significant difference between their proposed method and the NCBI viral genotyping tool irrespective of the four classification techniques used i.e. Multi-layer Perceptron, Radial Basis Function, K-Nearest Neighbor and Fuzzy K-Nearest Neighbor.

In the current study, we experimentally explored the applicability of FCGR and its appropriate order for classification of viral pathogens from genomic sequences into the right species. This endeavor is aimed at laying a foundation for the development of an alternative, accurate and in silico genomic viral diagnostic tool, which could help in rapid medical interventions in the event of viral pathogens epidemic.

2 Materials and Methods

2.1 Dataset

As shown in Table 1, we extracted the genome sequences of Ebola virus (N = 249), Enterovirus (N = 632), Dengue virus (N = 390), HepatitisC virus (N = 567) and Zika virus (N = 351) from the Virus Pathogen Database and Analysis Resource (ViPR) corpus. This corpus was developed to provide free access to genomic and proteomic sequences of viral pathogens for research and development of vaccines, therapies and diagnostic tools. The Universal Resource Locator (URL) for ViPR as at the time this study was carried out is https://www.viprbrc.org/brc/home.spg?decorator=vipr. The total sample size of the dataset extracted for this study is 2,189. Although, there is a huge collection of pathogenic viral datasets on the corpus, the five viruses were selected due to their prominence as causative agents of diseases that is currently of concern among researchers on a global scale. These viruses are also specifically featured on the home page of the ViPR corpus and the structural diversity of their genomes provide a good basis to investigate the efficacy of FCGR for viral species classification.

Table 1. Extracted dataset for five pathogenic viruses

2.2 FCGR Computation at Different Orders and Naïve Bayes Classifier

FCGR is a numerical matrix in contrast to CGR, which is a graphical representation. Instead of plotting a CGR first and converting it to a FCGR [7, 8]. Wang et al. [10] posits that FCGR can be derived directly from a sequence. Furthermore, Wang et al. [10] introduced the concept of FCGR order, which provides variants in matrix dimensions when FCGR are derived directly from the sequences. For instance, given a sequence S, with fw representing the frequency of the oligonucleotide w, the matrix structure of a first order FCGR is given as Eq. (1) [10].

$$ FCGR_{1} (S) = \left( {\begin{array}{*{20}c} {f_{C} } & {f_{G} } \\ {f_{A} } & {f_{T} } \\ \end{array} } \right)\,\,\,\, $$
(1)

The FCGR of (k + 1)th order can be computed by substituting each element fx in a kth order FCGR with the four elements

$$ \left( {\begin{array}{*{20}c} {f_{CX} } & {f_{GX} } \\ {f_{AX} } & {f_{TX} } \\ \end{array} } \right)\,.\,\,\, $$
(2)

Therefore, the matrix structure of a second order FCGR is as shown in Eq. (2) and higher order FCGR can be sequentially computed.

$$ FCGR_{2} (S) = \left( {\begin{array}{*{20}c} {f_{CC} } & {f_{GC} } & {f_{CG} } & {f_{GG} } \\ {f_{AC} } & {f_{TC} } & {f_{AG} } & {f_{TG} } \\ {f_{CA} } & {f_{GA} } & {f_{CT} } & {f_{GT} } \\ {f_{AA} } & {f_{TA} } & {f_{AT} } & {f_{TT} } \\ \end{array} } \right)\, $$
(3)

From Eqs. (1) and (3), it can be seen that a k-th order FCGR is a 2k x 2k matrix and it contains 4 k occurrences of the k length oligonucleotides [10, 14]. The direct correspondence of CGR and FCGR, in which a kth order FCGR is equivalent to a CGR of resolution 1/2k was also reported [10]. This makes it possible to observe the major features that are inherent in higher order FCGR (which ordinarily is incomprehensible because of the size) by visual observation of the equivalent CGR. Researchers have also opined that CGR images and correspondingly the FCGR obtained from subsequence of a genome present similar structure as the whole genome [7]. This implies that the CGR image or FCGR of a subsequence is a sufficient genomic signature for species classification rather than the CGR image or FCGR of the whole genome [7, 10]. Therefore, in this study, we ventured to experimentally investigate the efficacy of FCGR at different resolutions or orders (\(1 \le k \le 7 \)) for pathogenic virus species classification. We stopped at 7th order because the huge dimension of the matrix elements at 8th order and beyond is computationally expensive without providing any benefits with respect to classification accuracy.

The accurate classification of the viral species from the FCGR-encoded nucleotide sequences was carried out with the Naïve Bayes (NB) classifier, which is a very popular classifier in bioinformatics [15, 16]. NB classifier utilizes a key statistical assumption of conditional independence of the FCGR features within the same class, to assign a class label to each sequence [15]. The class label in this context refers to the viral species, drawn from Table 1. The NB classifier can be trained using different kernel functions such as uniform, epanechnikov, normal and triangular to detect through classification the most probable viral specie from each encoded sequence.

2.3 Experiments

Experiments were performed in this study to determine the efficacy as well as the appropriate order of FCGR for classifying viral pathogens from genomic sequences. The curated sequences for the five viruses were first converted to their numeric equivalents with 1st to 7th order of FCGR. For each of these orders, the FCGR encoded viral sequences were transmitted to train Naïve Bayes classifier using four different kernel functions, namely; uniform, epanechnikov, normal and triangular. Both the FCGR algorithm and naïve Bayes classifier were implemented in MATLAB R2015a, which also provided the in silico platform to perform all the experiments in this study. The PC on which the experiments were performed contains an Intel Core i5-4210U CPU operating at 2.40 GHz speed, with 8.00 GB RAM and runs 64-bit Windows 8 operating system.

3 Results and Discussion

Figure 1 shows the plot of the number of elements in the computed FCGR matrices against the FCGR order. As illustrated on the graph, a first order FCGR matrix contains 4 elements, a second order contains 16 elements, third order contains 64 elements, fourth order contains 256 elements, fifth order contains 1024 elements, sixth order contains 4096 elements and seventh order contains 16,384 elements. The relationship between the number of elements in the FCGR matrix and the number of FCGR order is clearly an exponential growth which is represented as:

Fig. 1.
figure 1

The number of elements in the FCGR matrix against the FCGR order

$$ y = e^{1.3863x} $$
(4)

where y is the number of elements in the FCGR matrix and x is the FCGR order. The results of the experiments, which are hereafter reported provide an insight on the effects of FCGR order vis-à-vis the number of elements in the corresponding FCGR matrices on pathogenic viral classification accuracy.

Tables 2, 3, 4, 5, 6, 7, and 8 show the results we obtained when the first to seventh order FCGR matrices were respectively utilized to encode the viral genomic sequences. We deemed it expedient to compute the average classification accuracies and average misclassification errors for the four different kernels across the FCGR orders. This provides a compact scheme for the comparison of our results based on the FCGR orders. The average classification results for the first order FCGR is shown in Table 2 (Accuracy = 89.8161%, ME = 0.1018). About 7% increase in performance was obtained for the second order FCGR (Accuracy = 96.6281%, ME = 0.0337) over the first order. Furthermore, increase in performance continued with the third order (Accuracy = 98.3651%, ME = 0.0163) up to the fourth order (Accuracy = 98.1607%, ME = 0.0184). There is however a drastic reduction in the performance results for the fifth order (Accuracy = 94.2098%, ME = 0.0579), which continued for the sixth order (Accuracy = 85.7857%, ME = 0.1422) and the lowest performance results in this study was posted for the seventh order FCGR (Accuracy = 78.0654%, ME = 0.2194). It is clearly apparent that approximately, both the third and fourth FCGR order with 64 and 256 elements respectively, gave the highest performance results (Approximate Accuracy = 98%, Approximate ME = 0.02). The summary of the entire result is graphically represented in Fig. 2.

Table 2. First order FCGR
Table 3. Second order FCGR
Table 4. Third order FCGR
Table 5. Fourth order FCGR
Table 6. Fifth order FCGR
Table 7. Sixth order FCGR
Table 8. Seventh order FCGR
Fig. 2.
figure 2

Summary of the average classification accuracy and misclassification error.

The results in this study clearly agree with the curse of dimensionality philosophy in machine learning, in which too small training data features (first and second order FCGR in this context) may hinder the creation of a reliable classification model for assigning a class to all possible objects in the dataset. Conversely, high dimensions in the training features (sixth and seventh order FCGR in this context) tend to make the contiguity among data points more identical and often lead to lower classification accuracy. Training data with high features has also being reported to lead to high computational cost and memory usage [17], which is the case with the eight order FCGR in this study that motivated its exclusion from the experiments.

Our literature search while undertaking this research yielded few studies that have employed FCGR and other schemes for identification of species ([1, 18,19,20,21]. The study by Vijayan et al. [18] utilized a third order FCGR (64 element vector) to encode Eukaryotic organisms with Probabilistic Neural Network (PNN) as a classifier to obtain a classification accuracy of 92.3%. In codicil, the study reported in [1] where a 15-element real Genomic Cepstral Coefficients (GCC) with Radial Basis Function Neural Network (RBFNN) were utilized for identification of four pathogenic viruses gave an accuracy of 97.3%. Obviously, the current result of 98% classification accuracy for five pathogenic viruses with 64 (third order) and 256 (fourth order) elements FCGR and naïve Bayes classifier is comparable to the highlighted similar results in the literature. However, based on the experimental results obtained in this study, the third order FCGR is recommended as an appropriate genomic feature for pathogenic viral classification. This will appositely culminate in economy of memory space, computational efficiency and acceptable accuracy for viral pathogens classification, which is an important contribution albeit moderate, to GSP and bioinformatics body of knowledge.

4 Conclusion

Thus far, we have been able to achieve the objectives of the current study, which are to determine the efficacy of FCGR and its appropriate order for accurate classification of pathogenic virus from genomic sequences. The 98% classification accuracy obtained with the third order FCGR is clearly promising for developing in silico and accurate diagnostic tool for viral pathogens classification using next generation genomic sequences. In the future, we hope to substantially extend this study by increasing the viral pathogens coverage and further experiment with state-of-the-art machine learning methods like deep learning and hierarchical classifiers.