
1 Introduction

The so-called DNA barcode sequence is a small segment (∼650 bp) of DNA, usually from the mitochondrial cytochrome c oxidase subunit 1 gene (COI) [8, 13]. The sequence is a good DNA marker and is widely used for the identification and taxonomic rank assignment of many species [5].

DNA barcoding is difficult if the biological samples under analysis are degraded: in this case only fragments of the barcode sequence are available. A suitable solution for this problem is studied in [14]: in that work the barcode sequence is analysed in order to find small subsequences that are still useful for identifying the sample species.

We started from a different point of view: we addressed the identification and rank assignment of degraded barcode sequences, usually fragments of about 200 bp, by building a robust classifier based on the spectral representation and a modified version of the General Regression Neural Network (GRNN).

In the spectral representation the DNA barcode sequence is described by the frequencies of very short strings of length \(k = 3,4,\ldots\), called k-mers. This representation is often referred to as k-mer decomposition or, more generally, as alignment-free sequence decomposition. The order of the k-mers in the sequence is discarded and only their counts are considered; if a sequence fragment has a k-mer frequency distribution similar to that of the whole barcode sequence, the two will have similar representations.

The set of the frequencies of the k-mers in a sequence constitutes the representing vector of the sequence in a vector space. The dimension of the representation space is \(4^{k}\), and the distance between these representing vectors can be calculated using the Euclidean norm in \(\mathfrak{R}^{4^{k}}\).
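For illustration, the decomposition can be sketched in a few lines of Python (a minimal sketch; the function name, the normalization to frequencies in the final comparison and the skipping of windows containing undefined nucleotides are our choices, not prescribed by the method):

```python
from itertools import product

import numpy as np

def kmer_spectrum(sequence, k=5):
    """Return the 4**k-dimensional vector of k-mer counts obtained
    by sliding a window of length k over the sequence."""
    index = {"".join(p): i
             for i, p in enumerate(product("ACGT", repeat=k))}
    vector = np.zeros(4 ** k)
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k].upper()
        if kmer in index:  # skip windows containing undefined nucleotides
            vector[index[kmer]] += 1
    return vector

# A fragment with a k-mer distribution similar to the whole sequence
# maps to a nearby vector once the counts are normalized to frequencies.
whole = "ACGTACGTTGCA" * 50          # a toy 600 bp "barcode"
fragment = whole[100:300]            # a 200 bp fragment
v_w, v_f = kmer_spectrum(whole), kmer_spectrum(fragment)
print(np.linalg.norm(v_w / v_w.sum() - v_f / v_f.sum()))
```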

The GRNN is a neural network originally developed for regression and adapted to the classification of DNA sequences in [17]. This modification turned the network into a prototype-based classification tool that classifies a new input by looking at its distance from the memorized training samples. Clearly, different distance models, such as the Euclidean and Manhattan distances, can change the performance of the network, as we found in [17].

In this paper we go further in this study and analyse and compare the performance of other distance models in the GRNN, considering the classification results for both full-length sequences and degraded samples.

With regard to barcode classification, very interesting results have been obtained in the works presented in [10, 15, 20]. In particular, the algorithms described in [15, 20] propose alignment-based methods to classify barcode specimens. In [20], after the training sequences are aligned, a set of logic rules is extracted in the form “if pos35 is G and pos300 is A then the sequence is classified as …”, where posX represents a sequence locus. In [15], first a phylogenetic tree of the input sequences is computed; then, at each branching node, a set of “characteristic attributes” (CAs) is identified for the corresponding leaf nodes. Considering a branching node, CAs are single or multiple nucleotide positions that are shared by only one of the branches descending from that node. An alignment-free approach more similar to our proposed method is the one presented in [10]. There the authors introduce the spectral representation for barcode sequences and use two machine learning algorithms, k-Nearest Neighbour (kNN) [2] and Support Vector Machine (SVM) [18], to train different classifiers. In this paper we compare our GRNN approach with the classifiers proposed in [10] because, differently from [15] and [20], they are alignment-free approaches that also implement the spectral representation. The comparison between our GRNN method and the SVM classifier has already been carried out in [17], where we demonstrated that our method outperforms SVM when dealing with sequence fragments. Therefore in this paper we compare our GRNN method against the k-NN classifier.

2 Methods

Prototype-based classification tools are based on sequence distance; there are many algorithms to evaluate sequence distance besides the evolutionary distance, for example the compression distance used in [11, 12]. The vector space representation is obtained by considering the frequencies of all possible 5-letter substrings (k-mers) in the DNA barcode sequence; these k-mers are obtained by using a sliding window on the sequence. A deeper discussion of this representation can be found in [3, 10]. In the following sections the modified GRNN algorithm is explained and the different distance measures applied are described; moreover, the barcode dataset used is introduced.

2.1 The General Regression Neural Network

Artificial neural networks (ANNs) are a family of algorithms used to approximate functions or to cluster large sets of input values. A neural network usually has a very large set of parameters (the network weights) that are adapted using a set of training examples and a specific learning algorithm (the training algorithm). The training phase is aimed at reducing the error of the network on a specific task, classification or regression, by changing the weight values.

Among neural networks, the GRNN [19] is a network created for regression, i.e. the approximation of the values of a dependent continuous variable y given a set of samples \((\mathbf{x}_{i},y_{i}),\ i = 1,2,\ldots,N\).

In the following we will discuss the one dimensional output case, the extension to an output vector y being straightforward (see [19] for details).

The GRNN does not have a training phase; it is based on the memorization of all the training examples in the hidden layer, with one neural unit for each training sample (see Fig. 1). When a new pattern \(\mathbf{x^{{\prime}}}\) is presented to the network input, the output \(y^{{\prime}}\) is calculated using the following equation:

$$\displaystyle{ y^{{\prime}} = \frac{\sum w_{i} {\ast} y_{i}} {\sum w_{i}}. }$$
(1)

where the weights \(w_{i}\) are obtained from each hidden unit as

$$\displaystyle{ w_{i} =\exp \left \{-\frac{d(\mathbf{x^{{\prime}}},\mathbf{x_{i}})} {2\sigma ^{2}} \right \} }$$
(2)
Fig. 1

The representation of the GRNN neural network. The hidden layer contains all the training patterns and calculates the \(w_{i}\) considering the distance from the input pattern \(\mathbf{x^{{\prime}}}\). These weights are used to calculate the output. On the right is the output layer, composed of three units: the upper one collects all the terms \(w_{i} {\ast} y_{i}\) and the lower one collects the terms \(w_{i}\); these terms are combined in the third unit, which generates the output

The σ value, called the spread factor, is the only parameter of the GRNN network. The weight \(w_{i}\) is considered in part of the literature as the excitation level of the neural unit i in response to the input \(\mathbf{x^{{\prime}}}\).

There are some studies on the optimal value of σ, which can be a single value for the whole network or a specific value for each hidden unit. In [7] a formula is suggested that depends on the maximum distance and the number of patterns in the training set.

The GRNN can be used in classification problems. Consider a set of classification examples \((\mathbf{x}_{i},c_{h})\), where \(\mathbf{x}_{i}\) (with \(i = 1,2,\ldots,N\)) is the input pattern and \(c_{h}\) (with \(h = 1,2,\ldots,H\), H being the number of available classes) is the class assigned to the pattern \(\mathbf{x}_{i}\). It is possible to build a set of training examples for a GRNN network as \((\mathbf{x}_{i},\mathbf{y}_{i})\), where \(\mathbf{y}_{i} = [y_{i,1},y_{i,2},\ldots,y_{i,H}]\) is given by

$$\displaystyle{ y_{i,j} = \left \{\begin{array}{l l} 0&\text{if}\ j\neq h\\ 1 &\text{if} \ j = h \end{array} \right. }$$
(3)

where \(c_{h}\) is the class of the pattern \(\mathbf{x}_{i}\).
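In code, the target vectors of Eq. (3) are simply one-hot rows (a sketch with toy values; the variable names are ours):

```python
import numpy as np

H = 4                                # number of classes (toy value)
classes = np.array([2, 0, 3, 1, 2])  # class index h of each training pattern
Y = np.eye(H)[classes]               # row i is y_i as defined in Eq. (3)
```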

The set of pairs \((\mathbf{x}_{i},\mathbf{y}_{i})\) can be used as a training set for the GRNN, and the class of a new input \(\mathbf{x^{{\prime}}}\) can be calculated as

$$\displaystyle{ c_{h}(\mathbf{x^{{\prime}}}) =\arg \max _{ j}\{y_{j}^{{\prime}}\vert \,j = 1,2,\ldots H\} }$$
(4)

In order to implement our classification tool for DNA sequences, we obtained the vector representation of the DNA sequences using a k-mer decomposition, as shown in [10], in which sequences are coded as fixed-size vectors whose components are the numbers of occurrences of short DNA snippets of fixed length k, called k-mers. Considering k = 5, as proposed in [10], we have vectors of dimension \(4^{5} = 1024\) to represent genomic sequences.
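Putting Eqs. (1), (2) and (4) together, the whole classifier fits in a few lines of Python (a sketch under our own naming conventions; the paper's experiments follow this scheme but we do not reproduce their exact scripts):

```python
import numpy as np

def grnn_predict(x_new, X_train, Y_train, sigma=0.6, dist=None):
    """Classify x_new with the GRNN.

    X_train: (N, 1024) array of k-mer spectra memorized by the hidden layer.
    Y_train: (N, H) one-hot target matrix built as in Eq. (3).
    dist:    distance function d(a, b); defaults to the Euclidean norm.
    """
    if dist is None:
        dist = lambda a, b: np.linalg.norm(a - b)
    # Eq. (2): one weight per memorized training pattern (hidden unit).
    w = np.array([np.exp(-dist(x_new, x_i) / (2.0 * sigma ** 2))
                  for x_i in X_train])
    # Eq. (1), applied component-wise to the one-hot targets.
    y = (w @ Y_train) / w.sum()
    # Eq. (4): the predicted class is the index of the largest output.
    return int(np.argmax(y))
```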

The GRNN has been used with different distance models, in particular some of the \(L_{p}\) norms, the correlation distance and the cosine distance.

2.2 The Distance Models

In this section the \(L_{p}\) norms used are introduced, together with the cosine and correlation distances.

2.2.1 \(L_{p}\) Norms

A norm is a function \(f: \mathbf{x} \rightarrow \mathfrak{R}\) that assigns a non-negative real number to a vector of a vector space and satisfies the following properties:

$$\displaystyle\begin{array}{rcl} f(\alpha \mathbf{x}) = \left \vert \alpha \right \vert f(\mathbf{x})& &{}\end{array}$$
(5)
$$\displaystyle\begin{array}{rcl} f(\mathbf{x} + \mathbf{y}) \leq f(\mathbf{x}) + f(\mathbf{y})& &{}\end{array}$$
(6)
$$\displaystyle\begin{array}{rcl} \mbox{ if }f(\mathbf{x}) = 0\mbox{ then }\mathbf{x}\mbox{ is the vector zero}& &{}\end{array}$$
(7)

The \(L_{p}\) family of norms, or p-norms, is defined as:

$$\displaystyle{ \left \|\mathbf{x}\right \|_{p} = \left (\sum _{i}\left \vert x_{i}\right \vert ^{p}\right )^{\frac{1} {p} }. }$$
(8)

The most common norm is the Euclidean norm, with p = 2, but the p = 1 norm, namely the city-block or Manhattan norm, and the Chebyshev norm, or \(L_{\infty}\), are also used. Although a proper norm requires p ≥ 1, there are also fractional norms with p < 1, which are interesting in the case of high-dimensional spaces.

In the case of high-dimensional data, such as the 1024-dimensional vectors representing DNA sequences, the distances defined by the Euclidean norm tend to concentrate [4]. This means that all pairwise distances between high-dimensional objects appear to be very similar. The authors of [4] also state that the concentration phenomenon is intrinsic to the norm. In order to overcome this phenomenon, fractional norms can be used in place of the Euclidean norm [1, 9]: \(L_{p}\) norms with 0 < p < 1 are called fractional norms, and they induce fractional distances. Moreover, fractional norms are able to deal with non-Gaussian noise [4]. In this work we adopted fractional norms, considering different values of p, in order to compute Eq. (2) and to limit the effects of the curse of dimensionality.
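The concentration effect is easy to reproduce numerically; the following sketch (with synthetic uniform data, not barcode vectors) compares the relative contrast between the nearest and farthest neighbour for different values of p in dimension \(4^{5} = 1024\):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 1024))   # synthetic points in dimension 4**5
q = rng.random(1024)           # a query point

for p in (2.0, 1.0, 0.5):
    d = np.sum(np.abs(X - q) ** p, axis=1) ** (1.0 / p)
    contrast = (d.max() - d.min()) / d.min()
    print(f"p = {p}: relative contrast = {contrast:.3f}")
```

In this setting the contrast typically grows as p decreases, which is the rationale for using fractional distances in high-dimensional spaces.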

If p = 1 in Eq. (8), the norm is called the Manhattan norm, or taxicab norm, and the induced distance is defined as

$$\displaystyle{ L_{1} =\sum _{i}\left \vert x_{i}^{{\prime}}- x_{ i}\right \vert. }$$
(9)

Both names are related to the distance a taxi has to drive in a city with a rectangular street grid.

The Chebyshev distance is obtained from the formula:

$$\displaystyle{ d(\mathbf{x^{{\prime}}},\mathbf{x}_{ i}) =\max _{i}(\left \vert x_{i}^{{\prime}}- x_{ i}\right \vert ). }$$
(10)

This is usually referred to as the \(L_{\infty}\) norm.

2.2.2 Cosine and Correlation Distance

The cosine and correlation distances are both based on the scalar product \(\mathbf{x^{{\prime}}}\cdot \mathbf{x}_{i}\) instead of the difference \(\mathbf{x^{{\prime}}}-\mathbf{x}_{i}\). The cosine distance is defined by the following equation:

$$\displaystyle{ d(\mathbf{x^{{\prime}}},\mathbf{x}_{ i}) = 1 -\frac{\mathbf{x^{{\prime}}}\cdot \mathbf{x}_{i}} {\|\mathbf{x^{{\prime}}}\|\ \|\mathbf{x}_{i}\|} }$$
(11)

where \(\|\cdot \|\) is the Euclidean norm. The correlation distance is defined by:

$$\displaystyle{ d(\mathbf{x^{{\prime}}},\mathbf{x}_{ i}) = 1 -\frac{(\mathbf{x^{{\prime}}}-\overline{\mathbf{x^{{\prime}}}}) \cdot (\mathbf{x}_{i} -\overline{\mathbf{x}_{i}})} {\|\mathbf{x^{{\prime}}}-\overline{\mathbf{x^{{\prime}}}}\|\ \|\mathbf{x}_{i} -\overline{\mathbf{x}_{i}}\|} }$$
(12)

where \(\overline{\mathbf{x^{{\prime}}}}\) is the mean of the components of the input vector \(\mathbf{x^{{\prime}}}\) and \(\overline{\mathbf{x}_{i}}\) is the mean of the components of the training sample \(\mathbf{x}_{i}\).
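For reference, the distance models of this section can be implemented directly from Eqs. (8) and (10)–(12) (a sketch; the function names are ours):

```python
import numpy as np

def lp_distance(a, b, p):
    """Distance induced by the L_p norm of Eq. (8); p < 1 gives the
    fractional distances, p = 1 the city-block (Manhattan) distance."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def chebyshev_distance(a, b):
    """L_infinity distance of Eq. (10)."""
    return np.max(np.abs(a - b))

def cosine_distance(a, b):
    """Eq. (11): one minus the cosine of the angle between a and b."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def correlation_distance(a, b):
    """Eq. (12): cosine distance between the mean-centered vectors."""
    a_c, b_c = a - a.mean(), b - b.mean()
    return 1.0 - (a_c @ b_c) / (np.linalg.norm(a_c) * np.linalg.norm(b_c))
```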

2.3 Barcode Dataset

We downloaded barcode sequences from the Barcode of Life Database (BOLD) [16]. In our study, we considered 10 barcode datasets belonging to different BOLD projects and living organisms. These datasets were selected according to the following criteria: we chose only barcode-compliant datasets, i.e. certified by BOLD as true barcode sequences, with sequence lengths not shorter than 500 bp and not longer than 800 bp. The datasets differ from each other in the number of species and specimens, the sequence length and the sequence quality (in terms of undefined nucleotides). Following these criteria, we collected 2210 sequences. The dataset composition, in terms of the number of different taxa and the number of specimens for each taxon, is summarized in Table 1, where it is possible to note how unbalanced the dataset is.

Table 1 Barcode dataset composition at each taxonomic level

3 Results and Discussion

In this section, we describe the parameter setup for the GRNN algorithm and the adopted training/testing procedure. Then we report classification results in terms of accuracy, precision and recall scores, and finally we discuss those results.

3.1 Experimental Setup

The only parameter of the GRNN algorithm is the spread factor σ (Eq. 2). In our experiments, we tuned the σ value by means of a ten-fold cross-validation procedure, considering as training set the dataset composed of the full-length sequences. This procedure has been carried out for each distance model (see Sect. 2.2) and for values of σ ranging from 0.5 to 0.8 with a step of 0.1. For each value of σ we noticed that the behaviour of the GRNN was substantially the same regardless of the distance model, and the best results, in terms of error rate, were obtained with σ = 0.6. As for the fractional distances, i.e. Eq. (8) with p < 1, we considered three values of p: 0.3, 0.5 and 0.7. All the experiments have been run using Python scripts on a Windows 7 machine equipped with an Intel i7 CPU at 2.8 GHz and 8 GB of RAM. The computational time of the GRNN algorithm is about 1 min for a single experiment.
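The tuning loop can be sketched as follows, reusing grnn_predict from Sect. 2.1 (the helper name, the use of scikit-learn's KFold and the shuffling seed are our assumptions; the paper does not specify its cross-validation implementation):

```python
import numpy as np
from sklearn.model_selection import KFold

def tune_sigma(X, labels, dist, sigmas=(0.5, 0.6, 0.7, 0.8), n_folds=10):
    """Return the sigma with the lowest ten-fold cross-validation error.

    X:      (N, 1024) array of full-length k-mer spectra.
    labels: (N,) array of integer class indices.
    """
    Y = np.eye(labels.max() + 1)[labels]   # one-hot targets, Eq. (3)
    best_sigma, best_err = None, np.inf
    for sigma in sigmas:
        errors = 0
        folds = KFold(n_splits=n_folds, shuffle=True, random_state=0)
        for train, test in folds.split(X):
            for j in test:
                pred = grnn_predict(X[j], X[train], Y[train], sigma, dist)
                errors += int(pred != labels[j])
        if errors < best_err:
            best_sigma, best_err = sigma, errors
    return best_sigma
```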

The classification performance of the GRNN algorithm has been tested considering full-length barcode sequences and sequence fragments of 200 consecutive bp randomly extracted from the original sequences. We want to assess the GRNN's predictive power and its robustness with regard to sequence size. In fact, in the study of environmental species, for example, usually only small portions of the barcode sequences are available.

For each distance model, the training and testing procedures have been carried out in two ways. In the first case, we adopted a ten-fold cross-validation method: in each fold, we trained the GRNN with 90 % of the full-length sequences and used as test set the remaining 10 % of both the full-length sequences and their corresponding 200 bp fragments. In the second case, we trained the GRNN with the whole dataset of full-length sequences and then tested it with all the sequence fragments. In the first scenario, we want to assess the classification performance of the GRNN on full-length sequences and its degree of generalization when it is used to classify sequence fragments whose corresponding original sequence does not belong to the training set. In the second scenario, we suppose the GRNN is used to recognize small random fragments while “knowing” all the original full-length sequences. The comparison with the k-NN classifier has been carried out following the same training and testing procedure. We used the k-NN implementation provided by the Weka Experimenter platform [6], considering k = 1 and k = 3, similarly to what was done in [10].
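The second scenario can be sketched as follows, reusing kmer_spectrum, lp_distance and grnn_predict from the previous sketches (the toy data and the fragment extraction helper are our placeholders; the actual experiments use the BOLD sequences of Sect. 2.3):

```python
import random

import numpy as np

def random_fragment(sequence, length=200):
    """Extract `length` consecutive bp at a random position."""
    start = random.randrange(len(sequence) - length + 1)
    return sequence[start:start + length]

# Toy stand-ins for the BOLD dataset (real data not reproduced here).
random.seed(0)
full_length = ["".join(random.choice("ACGT") for _ in range(650))
               for _ in range(6)]
labels = np.array([0, 0, 1, 1, 2, 2])

# Train on every full-length spectrum, test on the random fragments.
X_full = np.array([kmer_spectrum(s) for s in full_length])
Y_full = np.eye(labels.max() + 1)[labels]
for seq, true_class in zip(full_length, labels):
    frag_vec = kmer_spectrum(random_fragment(seq))
    pred = grnn_predict(frag_vec, X_full, Y_full, sigma=0.6,
                        dist=lambda a, b: lp_distance(a, b, p=0.5))
    print(true_class, pred)
```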

3.2 Classification Results

Classification scores have been evaluated by means of the accuracy, precision and recall performance measures.
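These measures can be computed, for instance, with scikit-learn (a sketch with toy labels; the averaging strategy across taxa is our assumption, as the paper does not state it):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0, 0, 1, 1, 2, 2]   # toy labels, not the paper's data
y_pred = [0, 1, 1, 1, 2, 0]
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred, average="macro", zero_division=0))
print(recall_score(y_true, y_pred, average="macro", zero_division=0))
```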

The accuracy, precision and recall scores are summarized in Tables 2, 3 and 4, respectively. Each table is composed of three parts, according to the adopted training/testing procedure. “Full-length” means the classification results are obtained through a ten-fold cross-validation scheme considering full-length sequences for both training and testing; the scores are averaged over the ten folds. “Full vs. 200-bp” means the classification results are obtained through a ten-fold cross-validation scheme considering full-length sequences for training and 200 bp fragments for testing; once again the scores are averaged over the ten folds. “200-bp” means the classification results are obtained by training with the whole dataset of full-length sequences and testing with all the sequence fragments. In each table, the first column reports the distance model used to train the GRNN, and the second row reports the taxonomic level, from Phylum to Species. The last two rows of each table part show the results obtained with the k-NN classifiers.

Table 2 Accuracy scores at each taxonomic level for the GRNN algorithm, considering each distance model, and for the k-NN classifier
Table 3 Precision scores at each taxonomic level for the GRNN algorithm, considering each distance model, and for the k-NN classifier
Table 4 Recall scores at each taxonomic level for the GRNN algorithm, considering each distance model, and for the k-NN classifier

3.3 Discussion

From the classification results shown in Tables 2, 3 and 4, it is evident that the GRNN and the k-NN algorithms are able to correctly classify full-length barcode sequences, with scores around 100 % at each taxonomic level. The GRNN reaches those scores with all the distance models except the correlation and cosine distances.

Using those distances, the performance of the GRNN drops significantly, reaching only about 62 % accuracy at Phylum level, and about 20 % and 12 % in terms of recall and precision, respectively. This means that distances based on the scalar product of the patterns are not suitable for the GRNN algorithm.

The most interesting results are therefore the ones obtained during the classification evaluation of the sequence fragments. First of all, the performance decreases with the taxonomic level, as is also evident in the chart of Fig. 2. As the taxonomic rank goes down, indeed, the number of categories to classify increases (see Table 1) and, as a consequence, it is more difficult to correctly classify the patterns. Considering the “Full vs. 200-bp” part, the only meaningful scores are provided by the GRNN implementing the fractional and city-block distances. In particular, while the correlation and cosine distances keep giving low scores as in the case of full-length sequences, the Chebyshev and Euclidean distances show a strong drop in performance, with scores of about 40 % for the Euclidean distance and about 20 % for the Chebyshev distance at Phylum level. The same drop in performance also affects the k-NN classifiers, with very similar scores regardless of the value of k. On the other hand, considering the fractional and city-block distances, the GRNN is still able to provide acceptable classification results for sequence fragments, with scores ranging from about 85 % at Phylum level to about 57 % at Species level. These results further confirm that fractional norms counteract the effects of distance concentration. It is important to remember that in the “Full vs. 200-bp” case the GRNN classifies the sequence fragments without “knowing” the corresponding full-length sequences during the training phase. It is interesting to note (see Fig. 2) that the best scores occur at the Family level: this is because the distribution of specimens at the Family level is very unbalanced, with one family collecting about 40 % of the available samples, as reported in Table 1.

Finally, considering the “200-bp” part of Tables 2, 3 and 4, once again only the GRNN implementing the fractional and city-block distances is able to provide a proper classification of sequence fragments. In this last case, the performance scores are higher than in the “Full vs. 200-bp” scenario, because here we carried out a complete training procedure of the GRNN considering all the full-length sequences. Of course, because the spectral representations of full-length sequences and sequence fragments differ from each other, no sequence fragment used in the test set belongs to the training set.

Fig. 2

Accuracy scores at each taxonomic level for the “Full vs. 200-bp” training/testing scheme of the GRNN classifier with different distance models

4 Conclusion

In this work, a modified version of the GRNN algorithm implementing different distance models for barcode sequence classification has been presented. The GRNN classification performance has been assessed with regard to sequence size. Experimental trials have been carried out considering full-length sequences and sequence fragments, which simulate a very common scenario in which only environmental samples are available. In the case of full-length sequences, 6 out of 8 distance models provided near-perfect results in terms of accuracy, precision and recall, with scores ranging between 100 % at Phylum level and 90 % at Species level. The same scores are reached by the k-NN classifier. Only the correlation and cosine distances did not provide acceptable results. In the case of sequence fragments, only the fractional and city-block distances gave meaningful results: in the “Full vs. 200-bp” scenario, accuracy ranged from 85 % at Phylum level to 57 % at Species level; in the “200-bp” scenario, accuracy ranged between 95 and 100 % at Phylum level and between 70 and 79 % at Species level. In both scenarios our GRNN approach outperformed the k-NN classifier. This means the GRNN implementing the fractional and city-block distances was able to correctly predict the similarity between the original full-length sequences and their corresponding fragments. All the other distance models were affected by a strong drop in classification performance.