
1 Introduction

Various approaches have been used for clustering DNA sequences, such as the WNN, which has been applied to construct classification systems. Cathy H. et al. used an artificial neural network to classify DNA sequences [10]. Agnieska et al. proposed a method to classify mitochondrial DNA sequences that combines the WNN with a Self-Organizing Map; the feature vectors of the sequences are constructed using the WNN [11]. Xiu Wen et al. used wavelet packet analysis to extract features of DNA sequences, which are applied to recognize the types of other sequences [12]. C. Wu et al. applied a neural network to classify nucleic acid sequences; their classifier used a three-layer feed-forward network trained with the back-propagation learning algorithm [13]. Since a DNA sequence can be converted into a sequence of digital signals, the feature vector can be built in the time or frequency domain. However, most traditional methods, such as k-tuple and DMK,… models build their feature vectors only in the time domain, i.e., they use direct word sequences [14,15,16,17,18,19].

The construction of neural network structures suffers from several deficiencies when using ANNs: local minima, the lack of efficient constructive methods, and poor convergence efficiency. As a result, researchers introduced the WNN, a class of neural networks that incorporates the wavelet transform. The WNN was presented by Zhang and Benveniste. This approach is used to approximate complex functions with a high rate of convergence [1]. The model has recently attracted extensive attention for its ability to effectively identify nonlinear dynamic systems with incomplete information [1,2,3,4,5]. Satisfactory WNN performance depends on an appropriate determination of the WNN structure, and many methods have been proposed to optimize the WNN parameters. These training methods, such as least squares (used to train the WNN when outliers are present), are applied to reduce a cost function and improve the approximation quality of the wavelet neural network. On the other hand, the WNN has mostly been applied to problems of small dimension [6], because the complexity of the network structure increases exponentially with the input dimension: the number of wavelet functions in the hidden layer grows with the dimension, so building and storing a WNN of large dimension is prohibitively costly. The WNN structure has been studied by several researchers, and considerable research effort has been devoted to this problem over the last decades [6,7,8,9]. Many methods reduce the size of the wavelet neural network in order to solve large-dimensional tasks.
In this study, we use the Least Trimmed Squares (LTS) method to select a small subset of wavelet candidates from an MLWNN when constructing the WNN structure, in order to build a method for classifying a dataset of DNA sequences. This method is used to handle a large number of DNA sequence inputs. The Beta wavelet function is used to build the WNN; this wavelet makes WNN training very efficient because of its adjustable parameters.

The remainder of this paper is organized as follows: Sect. 2 presents the proposed approach, and Sect. 2.6 details the wavelet-network learning procedure used in our method. Section 3 shows the simulation results of our approach, and Sect. 4 concludes.

2 Proposed Approach

This paper presents a new approach based on the wavelet neural network and the power spectrum. The WNN is constructed using a Multi-Library Wavelet Neural Network (MLWNN), and its structure is determined using the LTS method. The power spectrum is used to construct mathematical moments that handle the varying lengths of DNA sequences. Our approach is divided into two stages: approximation of the input signal of each sequence, and clustering of the extracted features of the DNA sequences using the WNN, with Euclidean distances used to classify the extracted features.

2.1 Fourier Transform and Power Spectrum Signal Processing

The proposed approach uses a natural representation of genomic data by binary indicator sequences for each nucleotide (adenine (A), cytosine (C), guanine (G), and thymine (T)). Afterwards, the discrete Fourier transform is applied to these indicator sequences to calculate the spectra of the nucleotides [11,12,13,14,15,16,17,18,19,20]. For example, if x[n] = [T T A A …], the indicator sequence of T is [1 1 0 0 …] and that of A is [0 0 1 1 …]. The indicator sequences can then be manipulated with mathematical methods. The sequence of complex numbers f(k) (1) is obtained using the discrete Fourier transform:

$$ f(k) = \sum\limits_{n = 0}^{N - 1} X_{e}(n)\, e^{-j2\pi kn/N}, \quad k = 0, 1, 2, \ldots, N - 1 $$
(1)

The power spectrum Se[k] (2) at frequencies k = 0, 1, 2, …, N − 1 is defined as

$$ Se[k] = \left| f(k) \right|^{2} $$
(2)

The resulting Se[k] is plotted in Fig. 1.

Fig. 1.
figure 1

Signal of a DNA sequence using Power Spectrum
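The indicator-sequence and power-spectrum construction above can be sketched as follows. This is a minimal illustration using a naive O(N²) DFT for clarity (an FFT would be used in practice); summing the four nucleotide spectra into a single Se[k] is our assumption for illustration, and all names are ours, not from the paper's implementation.

```python
import cmath

def indicator(seq, base):
    """Binary indicator sequence: 1 where the nucleotide equals `base`."""
    return [1 if s == base else 0 for s in seq]

def dft(x):
    """Naive DFT: f(k) = sum_n x[n] e^{-j 2 pi k n / N} (Eq. 1)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def power_spectrum(seq):
    """Se[k] = |f(k)|^2 (Eq. 2), summed here over the four indicators."""
    N = len(seq)
    se = [0.0] * N
    for base in "ACGT":
        se = [s + abs(fk) ** 2 for s, fk in zip(se, dft(indicator(seq, base)))]
    return se

print(indicator("TTAA", "T"))      # [1, 1, 0, 0]
print(power_spectrum("TTAA")[0])   # 8.0 (squared counts of T and A at k = 0)
```

The resulting Se[k] vector plays the role of the genomic signal that feeds the wavelet network in the following sections.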

2.2 Wavelet Neural Network

The wavelet neural network combines the wavelet transform with artificial neural networks [33, 34]. It is composed of three layers, each neuron being connected to the neurons of the following layer, and the weighted outputs are summed. The WNN (Fig. 2) approximates a given signal f by weighting a set of wavelets dilated and translated from one candidate mother wavelet. The response of the WNN is:

Fig. 2.
figure 2

The three layer wavelet network

$$ \hat{y} = \sum\limits_{i = 1}^{{N_{w} }} {w_{i} }\Psi \left( {\frac{{x - b_{i} }}{{a_{i} }}} \right) + \sum\limits_{k = 0}^{{N_{i} }} {a_{k} } x_{k} $$
(3)

where (x1, x2, …, xNi) is the input vector, Nw is the number of wavelets, and ŷ is the output of the network. The output also contains an affine component in the input variables, with coefficients ak (k = 0, 1, …, Ni) (Fig. 2). The mother wavelet is selected from the MLWNN and is parameterized by a dilation ai, which controls the scale, and a translation bi, which controls the position of the function Ψ(x). A WNN is used to approximate an unknown function:

$$ y = f\left( x \right) + \varepsilon $$
(4)

where \( f \) is the regression function and \( \varepsilon \) is the error term.
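As a concrete illustration of the response in Eq. (3), the following sketch evaluates a WNN with a scalar input. The Mexican-hat wavelet stands in for the Beta wavelet (whose closed form is not reproduced in this paper), and all parameter values are invented for the example.

```python
import math

def mexican_hat(t):
    """Stand-in mother wavelet Psi(t); the paper uses the Beta wavelet."""
    return (1.0 - t * t) * math.exp(-t * t / 2.0)

def wnn_output(x, weights, dilations, translations, lin):
    """y_hat = sum_i w_i Psi((x - b_i)/a_i) + affine term (Eq. 3, scalar input)."""
    wavelet_part = sum(w * mexican_hat((x - b) / a)
                       for w, a, b in zip(weights, dilations, translations))
    return wavelet_part + lin[0] + lin[1] * x

y = wnn_output(0.5, weights=[1.0, -0.5],
               dilations=[1.0, 2.0], translations=[0.0, 1.0], lin=[0.1, 0.2])
```

With a vector input, the same structure applies componentwise to the affine term, as in Eq. (3).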

2.3 Multi Library Wavelet Neural Network (MLWNN)

Many methods have been used to construct the wavelet neural network. Zhang constructs the wavelet neural network in two stages [2, 3]. First, discretely dilated and translated versions of the mother wavelet function Ψ are used to build the MLWNN [21, 22]:

$$ W = \left\{ \psi_{i} : \psi_{i}(x) = \alpha_{i}\, \psi\left( \frac{x_{k} - b_{i}}{a_{i}} \right),\; \alpha_{i} = \left( \sum\limits_{k = 1}^{n} \left[ \psi\left( \frac{x_{k} - b_{i}}{a_{i}} \right) \right]^{2} \right)^{-\frac{1}{2}},\; i = 1, \ldots, L \right\}, $$
(5)

where L is the number of wavelets in W and xk is the sampled input. Then the best M mother wavelet functions are selected from the wavelet library W based on the training sets, in order to construct the regression:

$$ f_{M}(x) = \hat{y} = \sum\limits_{i \in I} w_{i}\, \psi_{i}(x), $$
(6)

where M \( \le \) L and I is the index set of the wavelets selected from the library.

Second, the cost function to be minimized is:

$$ J(I) = \min\limits_{w_{i},\, i \in I} \frac{1}{n} \sum\limits_{k = 1}^{n} \left( y_{k} - \sum\limits_{i \in I} w_{i}\, \psi_{i}(x_{k}) \right)^{2}, $$
(7)

Gradient algorithms, such as least mean squares, are used to train the WNN by reducing the mean-squared error:

$$ J(w) = \frac{1}{n} \sum\limits_{i = 1}^{n} \left( y_{i} - \hat{y}_{i}(w) \right)^{2}, $$
(8)

where \( J(w) \) is the mean-squared error between the desired outputs and the outputs of the wavelet neural network. Thanks to the time-frequency locality property of wavelets, a candidate library \( W \) of wavelet bases can be constructed for a given signal \( f \).
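To make Eqs. (5)–(7) concrete, the sketch below builds a small normalized wavelet library on a grid of dilations and translations and picks the single atom that minimizes the fitting cost. The Mexican-hat mother wavelet and the parameter grid are illustrative assumptions, not the paper's choices.

```python
import math

def psi(t):
    """Illustrative mother wavelet (Mexican hat)."""
    return (1.0 - t * t) * math.exp(-t * t / 2.0)

def build_library(xs, dilations, translations):
    """Atoms normalized so that sum_k psi_i(x_k)^2 = 1, as in Eq. (5)."""
    lib = []
    for a in dilations:
        for b in translations:
            vals = [psi((x - b) / a) for x in xs]
            norm = math.sqrt(sum(v * v for v in vals))
            if norm > 0.0:
                lib.append(([v / norm for v in vals], a, b))
    return lib

def best_wavelet(lib, ys):
    """Single-atom version of Eq. (7): with normalized atoms, the optimal
    weight is the inner product w_i = <y, psi_i>."""
    best = None
    for vals, a, b in lib:
        w = sum(y * v for y, v in zip(ys, vals))
        cost = sum((y - w * v) ** 2 for y, v in zip(ys, vals)) / len(ys)
        if best is None or cost < best[0]:
            best = (cost, w, a, b)
    return best

xs = [i / 10.0 for i in range(-20, 21)]
ys = [psi(x) for x in xs]   # target signal: the mother wavelet itself
cost, w, a, b = best_wavelet(build_library(xs, [0.5, 1.0, 2.0], [-1.0, 0.0, 1.0]), ys)
# the (a = 1, b = 0) atom reconstructs this target with near-zero cost
```

Selecting M > 1 atoms proceeds in the same spirit, greedily or jointly, over the subset I of Eq. (6).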

2.4 Wavelet Network Construction Using the LTS Method

The set of training data \( TN = \left\{ {x_{1} ,x_{2} , \ldots ,x_{k} ,f(x_{k} )} \right\}_{k = 1}^{N} \) is used to adjust the weights and the WNN parameters, and the output of the three-layer WNN in Fig. 2 can be expressed via (3). Model selection is used to pick the wavelet candidates from the Multi-Library Wavelet Neural Network (MLWNN); these mother wavelets are used to construct the wavelet neural network structure [37, 38]. In this study, the Least Trimmed Squares (LTS) estimator is proposed to select a small subset of wavelet candidates from the MLWNN, which are used to construct the hidden layer of the WNN [30,31,32, 36]. Furthermore, the gradient algorithm is used to optimize the wavelet neural network parameters. The residual (or error) ei at the ith output of the WNN due to the ith example is defined by:

$$ e_{i} = y_{i} - \hat{y}_{i}, \quad i = 1, \ldots, n $$
(9)

The Least Trimmed Squares estimator selects the WNN weights that minimize the sum of the smallest (trimmed) squared errors:

$$ E_{total} = \frac{1}{2} \sum\limits_{k = 1}^{p} \sum\limits_{i = 1}^{l} e_{ik}^{2} $$
(10)

The gradient algorithm is then used to optimize the parameters (ai, bi, wi) of the WNN.
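The trimming idea behind the LTS criterion can be sketched in a few lines: the squared residuals are sorted and only the h smallest are summed, so large outlier errors do not drive the fit. The trimming count h is a free parameter, and the residual values below are invented for the example.

```python
def lts_cost(residuals, h):
    """Sum of the h smallest squared residuals (cf. Eq. 10 with trimming)."""
    squared = sorted(e * e for e in residuals)
    return 0.5 * sum(squared[:h])

errors = [0.1, -0.2, 0.05, 5.0]   # 5.0 plays the role of an outlier
cost = lts_cost(errors, h=3)      # the outlier is excluded from the sum
```

This robustness to outliers is what makes LTS attractive for selecting wavelet candidates from noisy training data.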

2.5 Approximation of DNA Sequence Signal

The classification of DNA sequences is an NP-complete problem: as soon as the alignment involves more than two DNA sequences, the problem rapidly becomes very complex because the alignment space becomes very large. Recent advances in sequencing technology have produced a considerable number of DNA sequences that can be analyzed. This analysis is used to determine the structure of the sequences in homogeneous groups according to a criterion to be determined. In this paper, the power spectrum is used to process the signal of each DNA sequence. These signals are used by the wavelet neural network (WNN) to extract the signatures of the DNA sequences, which are used to match a test DNA sequence against all the sequences in the training set [17,18,19,20,21,22,23,24,25,26,27,28,29]. Initially, the signatures of the DNA sequences developed by the 1D wavelet network during the learning stage provide the wavelet coefficients used to match test DNA sequences against the sequences in the training set. Then, each test DNA sequence is projected onto the wavelet neural networks of the learning DNA sequences and the coefficients specific to this sequence are computed. Finally, the coefficients of the learning DNA sequences are compared to the coefficients of the test DNA sequences by computing the correlation coefficient. In this stage, Euclidean distances are used to classify the signatures of the DNA sequences [27].

The Euclidean distances between different DNA sequences are measured and used as a similarity measure for these sequences. The pairwise Euclidean distances of the DNA sequences are used to generate a similarity matrix, which can then be used to classify the DNA sequences.
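The similarity matrix described above can be sketched as pairwise Euclidean distances between fixed-length signature vectors (the wavelet coefficients); the toy signatures below are invented for the example.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(signatures):
    """Symmetric matrix of pairwise Euclidean distances between signatures."""
    n = len(signatures)
    return [[euclidean(signatures[i], signatures[j]) for j in range(n)]
            for i in range(n)]

sigs = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]   # toy 2-D signatures
D = distance_matrix(sigs)
print(D[0][1])   # 5.0
```

Any hierarchical or partitional clustering method can then operate on this matrix to produce the classes and the phylogenetic tree of Sect. 2.6.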

2.6 Learning Wavelet Network

In this section, we show how the wavelet library is used to train the wavelet neural network [15, 16, 26, 27].

  • Learning approach

  • Step 1: The dataset of DNA sequences is divided into two groups: a training set and a testing set. These groups are used to train and test the wavelet neural network.

  • Step 2: Convert each DNA sequence to a genomic signal using the binary indicators and power-spectrum signal processing.

  • Step 3: The discretely dilated and translated versions are used to construct the library W. The training data are used to create this wavelet library; the Least Trimmed Squares (LTS) algorithm (9), (10) is then applied to select the optimal mother wavelet functions, choosing from the library the N wavelet candidates that best match the output vector.

  • Step 3.1: Initialize the mother wavelet function library.

  • Step 3.2: Randomly initialize \( {\texttt{w}}_{{{\texttt{jk}}}} \) and \( {\texttt{V}}_{{{\texttt{ij}}}} \) .

  • Step 3.3: For k = 1, …, m:

  • Calculate the predicted output \( \hat{y}_{i} \) via (3).

  • Compute the residuals \( e_{ik} = y_{i} - \hat{y}_{i} \) via (9).

    If the stopping criterion is met, stop; otherwise, go to the next step.

  • Sort the squared residuals so that \( e^{2}_{ik} \le \ldots \le e^{2}_{im} \), and choose the N best mother wavelet functions to initialize the WNN.

    • Step 4: The values of \( w_{ij}^{opt} \), \( a_{i}^{opt} \), and \( b_{i}^{opt} \) are computed using the gradient algorithm; then return to Step 3.3.

  • Clustering using the Euclidean distances

  • Step 1: Generate a similarity matrix from the signatures \( (w_{ij}^{opt}, a_{i}^{opt}, b_{i}^{opt}) \), which can be used to classify the DNA sequences.

  • Step 2: Use the similarity matrix to classify the DNA sequences.

  • Construct a phylogenetic tree of these sequences and generate the classes of the DNA sequences. (The phylogenetic trees constructed from a similarity matrix reflect group (class) information, hierarchical similarity, and the evolutionary relationships of the DNA sequences) (Fig. 3).

    Fig. 3.
    figure 3

    Proposed approach

3 Results and Discussion

This paper uses three datasets, HOG100, HOG200, and HOG300, selected from microbial organisms [23]. Different experiments are used to evaluate the performance of our approach. The dataset of DNA sequences is divided into test and training data. The published empirical and synthetic datasets are selected to perform the comparative clustering analysis [23] (Table 1).

Table 1. Distribution of the available data into training and testing sets of DNA sequences

3.1 Classification Results

Experiments were performed to prove the effectiveness of the proposed approach. The evaluation metrics precision, recall, and accuracy are used to compare our approach with other competitive methods. The classification accuracy Ai of an individual program i depends on the number of samples correctly classified (true positives plus true negatives) and is evaluated by the formula:

$$ A_{i} = \frac{t}{n} \times 100 $$
(11)

where t is the number of sample cases correctly classified, and n is the total number of sample cases.
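A tiny numerical illustration of Eq. (11), together with per-class precision and recall as used in the comparison; the counts are invented and do not come from the paper's experiments.

```python
def accuracy(t, n):
    """A_i = (t / n) * 100 (Eq. 11): t correctly classified out of n cases."""
    return t / n * 100.0

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

print(accuracy(90, 100))   # 90.0
print(precision(45, 5))    # 0.9
print(recall(45, 15))      # 0.75
```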

Table 2 and Fig. 4 show that WNN-PS (our method) outperforms the other models (WFV, k-tuple, and DMK) in terms of classification results and optimal settings. The number of classes obtained by our approach is slightly smaller than with the other methods. The accuracy demonstrates the efficiency of our method; it is increased by using the WNN together with the LTS method, which is applied to optimize the WNN structure.

Table 2. The classification results of WNN-PS (our method) and the other models (WFV, k-tuple, DMK) on different datasets of DNA sequences
Fig. 4.
figure 4

The number of classes obtained using the proposed approach and the other models

3.2 Running Time

Tables 2 and 3 show that the WNN produces very good prediction accuracy. The results of our WNN-PS approach on the datasets show that its accuracy outperforms the other techniques in terms of the percentage of correct species identification. Tables 2 and 3 also show the distribution of correct classifications per class as well as the overall classification rate for all DNA sequences in the validation phase. WNN-PS (our approach) is faster than the other methods; this speed is due to the use of the Least Trimmed Squares (LTS) algorithm, which is a robust estimator.

Table 3. Running time in seconds of each method on all datasets

4 Conclusions

In this study, we have used the LTS method to select a subset of wavelet functions from the multi-library wavelet neural network model. This subset of wavelets is used to build the wavelet neural network (WNN), which approximates the function f(x) of a DNA sequence signal. First, binary codification and the power spectrum are used to process the DNA sequence signal. Second, the wavelet library is constructed, and the LTS method is used to select the best wavelets from the library; these wavelets are used to construct the WNN. Third, the Euclidean distances between DNA signatures are used to group similar DNA sequences according to the chosen criteria. This clustering aims at distributing DNA sequences characterized by p variables X1, X2, …, Xp into a number m of subgroups that are as homogeneous as possible, while every group is well differentiated from the others. The proposed approach helps to classify DNA sequences of organisms into many classes, and these clusters can be used to extract significant biological knowledge.