1 Introduction

Helitrons form an important part of Transposable Elements (TEs) DNA class II in eukaryotic genomes [16, 45]. This specific DNA type transposes by a rolling circle replication mechanism via a single-stranded DNA intermediate. During transposition, helitrons are the only TEs DNA that does not create target site duplications [16]. Helitrons frequently capture miscellaneous host genes, some of which can evolve into novel host genes. In the evolution of host genomes, the helitrons seem to play a major role because some of the host genes become essential for the helitrons transposition [34, 45]. They are also involved in a number of biological processes such as heterochromatin development, gene expression regulation [54] and genome rearrangements [51]. In addition, helitrons have the ability to synthesize new genes by nearby exons capture, also by transcription readthrough and unrelated exons placement into common transcripts [2]. Helitrons were firstly discovered in the nematode Caenorhabditis elegans [15] then in plants (Arabidopsis thaliana and Oryza sativa) [33] and in fruit fly (Drosophila) [4]. At present, they are identified in a diverse range of species like: fungus [11], lucifugus [35], animals [55], vertebrates, specifically in the fish Daniorerio and Sphoeroidesnephelus genomes [34]. In this paper, we focus on the helitrons characterization and classification from a different point of view. Indeed, we harness the power of the signal processing tools to identify these interesting elements in a visual way and we use the support vector machine (SVM) as a classification technique. The SVM solves recognition problems of the two classes and multi-class. It has been widely used in numerous domains due to its inherent discriminating learning and generalization capabilities; it is often applied to solve statistical learning problems [50]. The SVM classification was applied with success in bioinformatics studies [22], DNA [32], molecular genetics [7] and for the identification and characterization of microRNAs and target prediction [12, 28]. In this framework, we propose using this classification strategy to help non-specialists to easily annotate Helitrons. This work was driven by the fact that helitron recognition using the signal processing methods has not been yet addressed.

The paper includes an introduction, a state of the art of the related works, two sections describing the work and a conclusion. In the section proposed method, we provide a description of the DNA coding technique and the analysis methods used to extract features as well as the classification technique. Finally, we give the conclusion of this work. In the section experimental results, we describe how we partitioned the parameters extracted for the C.elegans genome into helitrons and non-helitrons databases. Then, we expose the classification results endorsed by a comparison between results we obtained for two groups of parameters.

2 Related works

Since their discovery, helitrons have attracted widespread attention. To identify and analyze these elements many computational tools were developed: HelitronFinder [5], HelSearch [54], a combination of BLAST search and the hidden Markov models [41] and HelitronScanner [53]. These tools are typically based on the search for canonical helitrons which begin with a 5′T (C/T) and terminate with a CTRR 3′as well as the existence of a hairpin structure (∼ 11 base pairs). In this sense, a comparison of the searched area with the reference helitron is performed by different alignment algorithms [6, 43].

These methods are generally hindered by the lack of information about the reference sequences as well as the need of an enormous memory space [37, 42]. Besides the asymmetric structure of helitrons, their abundance and diversity in genomes, present an enormous identification and annotation problem. Taking into account all these factors, we can comprehend why the structure and the dynamic of the helitron sequences have not been yet well studied. In fact, the available bioinformatics tools aim to identify the presence of helitron on the basis of previous knowledge of its biological features [14] and do not provide a visual tool to detect the presence of this element in a given sequence. Further, despite the availability of several methods of helitron identification, a systematic classification method based on the information (features) revealed by the sequence itself has not been yet taken forward.

As a solution, we propose in this work the combination of the signal processing tools with the SVM-approach (a supervised learning algorithm) to identify helitronic sequences. The main steps of the systematic recognition of helitrons are based on the choice of the classifier, the features extraction methods and the choice of the non-helitrons databases.

3 FCGS Coding technique and Wavelet transform

To present a DNA sequence (a chainlike molecule composed of four bases: A, T, C, and G) into a numerical form we need to transform these bases into a series of numerical values. For this aim, we provide two ways to represent the DNA sequences into:

  • explicit signals when we applied the FCGS coding technique

  • explicit images when we applied the CWT to these signals.

3.1 FCGS2 coding technique

Thanks to the development of DNA coding techniques, different methods have emerged like the binary coding [30], the structural bending trinucleotide coding (PNUC) [31], the Frequency Chaos Game Signal (FCGS) [25, 26, 46], etc. The latter technique is a new one dimensional coding. It is based on assigning the frequency value of each sub-pattern in the chromosome to the same group of nucleotides existing in the sequence. In this work, we consider the Frequency Chaos Game Signal of dinucleotides: FCGS2. It is based on the apparition’s probability of all two successive nucleotides for an entry DNA sequence [25, 26, 46]. The probability of a given dinucleotide (P2nucleotide) is calculated following this equation:

$$ {P}_{2 nucleotide}=\frac{N_{2 nucleotide}}{LN_{Chr}} $$
(1)

N2nucleotide represents the apparition number in the whole sequence of a given dinucleotide. LNChr represents the length of the chromosome in base pairs.

Our goal is to establish a signals database for both helitrons and the repetitive sequences which are considered as non-helitrons. For this, a dinucleotide (i), at a position (k) is replaced by its occurrence’s probability:

$$ {S}_{2 nucleotide}(k)={\sum}_i{P}_{2 nucleotide}\left(i,k\right) $$
(2)

The FCGS2 consists in calculating the sum of all dinucleotide indicators (S2nucleotide):

$$ {\mathrm{FCGS}}_2(k)={\sum}_k{S}_{2 nucleotide} $$
(3)

Therefore, the FCGS2 technique represents the temporal evolution of the dinucleotides frequency along the chromosome which means that it reflects the statistical features of the chromosome itself. To enhance such characteristics, we propose to apply a time- frequency method, which is the wavelet transform.

3.2 Wavelet transform

The Wavelet Transform is widely used in the time-frequency analysis of biomedical and biological signals [1, 13, 23, 29, 47]. In this work, we use the wavelet energy features that we extract from the wavelet coefficients’ matrix to classify helitrons. The coefficients matrix is obtained by applying the continuous wavelet transform to the FCGS2 signal with the Complex Morlet Wavelet is taken as analysis window. After that, we prepare the features database (energy dataset) of all genomic sequences to be passed to the SVM-classifier for the helitrons classification purpose.

The CWT decomposes a given signal into a sum of windows called wavelets. The latters are obtained by shifting and expanding a mother wavelet ψ(t) [9, 24, 27]. The set of wavelet windows is obtained by:

$$ {\uppsi}_{s,u}(t)=\frac{1}{\sqrt{s}}{\uppsi}^{\ast}\left(\frac{t-u}{s}\right),\mathrm{s}>0,\mathrm{u}\in \mathbb{R} $$
(4)

Here * is the complex conjugate.

The Complex Morlet function is expressed by:

$$ {\psi}_{cmor}(t)={\pi}^{-\frac{1}{4}}\left({e}^{i{\omega}_0t}-{e}^{i{\omega}_02}\right){e}^{-\frac{t^2}{2}} $$
(5)

Here ɷ0 is the oscillation’s number.

The continuous wavelet transform is performed by applying this formula:

$$ {W}_{\left(s,u\right)}\left[{\mathrm{FCGS}}_2(t)\right]=\frac{1}{\sqrt{s\ }}{\int}_{-\infty}^{+\infty }{\mathrm{FCGS}}_2(t){\psi}^{\ast}\left(\frac{t-u}{s}\right) dt $$
(6)

In the following, we consider the complex Morlet wavelet transform (CWT) along 64 scales with the parameter ɷ0 fixed at 5.4285.

The final result is a matrix of coefficients which we use to generate the scalogram representation by calculating the absolute value of these coefficients. Here, we use the scalogram presentation as a new way to visualize the DNA sequences. The time-frequency plan allows us to distinguish a DNA class by a specific signature (specific motifs with different periodicities). Thanks to this property, we are able to visually recognize a given helitron class. In the following (Fig. 1), we provide the representation that characterizes each helitron class. Since we will classify helitrons and non helitrons (repetitive DNA), we also give an example of the time-frequency signature of a repetitive DNA sequence containing the microsatellite (CAACG)n. The horizontal axis indicates the DNA position in base pairs while the vertical axis indicates the frequency.

Fig. 1
figure 1

Examples of Helitron and repetitive DNA signatures

Note that these scalograms are the result of the wavelet analysis applied to the FCGS2 of some concatenated helitrons belonging to chromosome I of C.elegans. The scalogram of the microsatellite type in Fig. 1 corresponds however to two concatenated sequences of repetitive (CAACG)n DNA in chromosome I. These scalograms examples present the overall behavior of the considered DNA types. For example, helitrons of type Helitron2_CE, HelitronY2_CE, HelitronY3_CE and Ndnax2_CE have specific signature presented by small repetitive motifs spread over a large frequency band. On the other hand, helitrons of type Helitron1_CE, HelitronY1_CE, HelitronY4_CE and Helitron1A_CE present similar signatures characterized by a pronounced energy around the frequency 0.15 (which is equivalent to periodicity 6). Other similar repetitive motifs are noticed for Helitron1_CE and HelitronY1A_CE around the frequency 0.1 (which corresponds to the periodicity 10). These helitrons present various motifs around different periodicities compared to other helitron families. As for helitrons NDNAX1 and NDNAX3, they have specific repetitive patterns compared to the other helitron classes which renders them very distinctive.

Finally, the microsatellite adopt a different time-frequency signature which facilitates the distinction between this DNA class and helitrons.

From these figures, we can see that the repetitive patterns and the energy concentration around the frequency bands allow to visually differentiate between the helitron classes and the non helitron ones. In the following, we will exploit these results to automatically classify helitrons based on SVM.

4 SVM classification

Our aim is to recognize helitrons from the non-helitron sequences. For this goal, we use the supervised learning model: the Support Vector Machines: SVMs. The SVM classifiers are based on the VapnicChervonenkis (VC) theory and the principle of the structural risk minimization (SRM) [38]. The SVM model was developed by Vapinik in 1998 [50]. The main principle that encompasses this technique, is the structural risk minimization (SRM) [10, 38, 49]. Using the function of Kernel, the original input is set and remodeled to a high-dimensional feature space to achieve optimal classification performance in the new feature space [50]. Maximizing the error margin could give effective discriminate SVMs classifiers. They have also the advantage of being able to deal with samples of a very higher dimensionality. They have been successfully used in different pattern recognition applications like face, EEG signals, speaker and DNA [8, 17, 18, 40, 48, 52]. The SVMs are particularly attractive to the biological sequences analysis due to their ability to handle large dataset and large input spaces [20, 39].

The SVM have better generalization abilities which are achieved through the minimization of the upper bound of the generalization error. It aims to maximize the margin; distance from a separating hyperplan to the closest negative (−1) or positive (+1) sample between classes. One or several hyperplans are constructed in order to separate different classes. Elsewhere, an optimal hyperplan must be found. This optimal hyperplan [3], a linear decision function, has to be with the maximal margin. However, this margin should be between the vectors of the two classes. The hyperplan can be described as:

$$ f(x)={w}^tx+b $$
(7)

Were w is the weight vector, x is the input vector, and b is the bias.

4.1 Kernel functions

The major tricks of SVM are the kernel functions. For the case in which no linear separation is possible, these functions are used. In the case when data are not linearly separable the kernel tricks extend to the class of decision functions. In addition, the kernel function can be explained as a measure of similarity between the input samples xi and xj [18], which allows SVM classifiers to meet the separation rule even with highly divergent and complex boundaries.

In what follows, we focus on finding the best kernel and the kernel function parameters to classify helitrons. Several choices for the kernel function are available, including: linear, polynomial, sigmoid, RBF. In the next paragraphs, we present a quick overview of the most frequently used kernel functions.

  1. 1)

    Polynomial Kernel

The Polynomial kernel, a non-stationary kernel, is well adapted for problems where all the training samples are normalized. Its parameters should be carefully tuned; which are the slope σ, the polynomial degree d and the constant term r.

$$ K\left({x}_i,{x}_j\right)={\left(\upsigma {x}_i^T{x}_j+r\right)}^d,\kern0.5em \upsigma >0 $$
(8)

Here, we consider d = 3 and r = 0.

  1. 2)

    RBF kernel

The Gaussian functions (Radial basis functions: RBF) are a family of kernels that measures the distance between feature vectors smoothed by an exponential function [36]. The carefully chosen parameters (c, σ) can play a major role in the performance of the kernel.

Below, we present the equation of RBF (radial basis function) kernel [21].

$$ K\left({x}_i,{x}_j\right)={\left(\upsigma \left\Vert {x}_i-{x}_j\right\Vert \right)}^2,\kern0.5em \upsigma >0 $$
(9)

The accuracy of the classifier is highly sensitive to the choice of the parameter σ. The latter must be tuned to control the amount of smoothing. In fact, the behaviors of SVM change when σ becomes too small or too large.

In this work, we use the following couples (c, σ):

$$ \upsigma =c=\left[{2}^{-6},{2}^{-5},\dots {2}^9,{2}^{10}\right] $$
(10)
  1. 3)

    Sigmoid kernel

Typically, the kernel must satisfy Mercer’s theorem (the kernel is a positive definite function). Despite its widespread use, sigmoid kernel is not positive semi-definite for certain values of parameters.

$$ K\left({x}_i,{x}_j\right)=\mathit{\tanh}\left(\upsigma {x}_i^T{x}_j+r\right) $$
(11)

Here, gamma is the scale parameter of the input samples and r is the shift parameter that controls the threshold of mapping (r = 0). Hence, the parameters σ and r have to be properly chosen. If this choice is not properly done, it yields to wrong results.

5 Material and Method

This section is devoted to the helitron recognition in C.elegans genome based on the FCGS2 coding technique and the SVM classification.

5.1 Material

For this work, the C.elegans sequences were retrieved from the NCBI public database [44]. Two genomic datasets were then used for the current study: one is the “helitrons” dataset and the other represents the “non-helitrons” dataset. In this context, the non-helitronic sequences we have chosen are also consisting of small repetitive motifs (one or more nucleotides) that frequently appear in the genome. The basic repetitive patterns we used here are of a microsatellite’s type with a ranging length from 2 to 5 base pairs (pb) [7, 28]. Our choice goes to the following patterns: (A)n, (AATAG)n, (ATG)n, (ATGGTG)n, (ATTG)n, (CA)n, (CAA)n, (CAGG)n, (CAGA)n,(CAACG)n, (CAAT)n and (CAATA)n. This choice is justified by the fact that helitrons contain themselves microsatellites sequences; which misleads the helitron recognition rates in most of bioinformatics tools.

The helitron classes, we have investigated here, are of the number of ten: Helitron1_CE (H1), Helitron2_CE (H2), HelitronY1_CE (Y1), HelitronY1A_CE (Y1), HelitronY2_CE (Y2), HelitronY3_CE (Y3), and HelitronY4_CE (Y4), NDNAX1_CE (N1), NDNAX2_CE (N2) and NDNAX3_CE (N3). These families possess complex and variable structures. More of this, the size of the helitronic sequences varies according to the family. Globally, the apparition number of helitrons in the C.elegans genome varies from 77 to 1093 according to their family (Table 1 in the next sub-section). The variability in terms of length, composition and structure makes difficult the identification of these elements.

Table 1 The Helitron occurence number in training and testing databases in C.elegans

5.2 Method

In the following Fig. 2, we provide the flowchart describing this work.

Fig. 2
figure 2

SVM -Helitron recognizer flowchart

The adopted methodology is composed by five steps.

  • The first step consists in extracting all chromosomes data for the C.elegans model [44] and generating the corresponding FCGS2 sequences. In this way, the DNA database will be converted to a 1-D signals database.

  • The second step consists in extracting the helitron and the non-helitron sequences from the NCBI database. Here the non helitronic DNA consists in repetitive sequences that contain microsatellites and do not form helitrons.

  • In the third step, helitron and non-helitron signals database regarding dimmers (sequences of 2 bp length) is prepared. Then, a signal database that corresponds to all helitron types is established as well as another database that do not contain helitrons but repetitive DNA of microsatellite type [44].

  • In the fourth step, we prepared the database which contains the corresponding temporal, spectral and time frequency features of each of these helitronic and non-helitronic sequences. In this way two types of features were extracted: the combination of the spectral and the temporal features and the energy calculated based on the wavelet transform. The ultimate database was splitted into two sub-databases: 70% for training and 30% for testing. For SVM-classification system, the Table 1 specifies the helitrons number which we considered for training and test and thus for each helitron class. In the other hand, we took the same number for the sequences which contain microsatellites.

From Table 1, we can see that the helitron of type helitronY1A_CE has the highest occurrence number in the genome.

  • Finally, the SVM technique was used for classification. The classification step involves separating data into testing and training sets. In order to have accurate results (when the system tends to give better recognition rates), all kernel functions were tested.

5.2.1 Features extraction

For the features extraction task, several methods have been reported in the signal processing field including: the time domain, the frequency domain (like the Fourier transform) and the Time-frequency domain (such as the short-time Fourier and the wavelet transforms). Here, we propose using these methods to extract features from FCGS2 signals to be considered later in the classification step. In the features extraction step, various independent variable values are prepared as input of the classifier to predict the corresponding class to which belongs the independent variable.

  1. a)

    Temporal features

For the temporal features, we use statistical measures including: maximum number of peaks (Picsoccurence), average (μ), standard deviation (Std), Mahalanobis distance (X), variance (Var), median(median), energy (E) and Root Mean Square (RMS).

$$ {Pics}_{occurence}=\frac{number\left(\max FCGS{2}_{Chri}\right)\ }{LN_{Chri}} $$
(12)

Where i is the number of the chromosome (Chr) and LN is the length of the chromosome i. For the chromosomes I, IV and V the Maxpic feature presents the number of apparition of the dinucleotide ‘TT’ and for the other chromosomes it presents the number of the dinucleotide ‘AA’.

  1. b)

    Spectral features

As spectral features, we use the following parameters: mean power spectral density (Smean), Power spectral density (PSD) and the Power root mean square (Prms).

Here, the Fourier transform is used to convert the time-based signal into the frequency domain. The features we extract are:

  • Mean power spectral density (Smean): it is the average Power spectral density. It measures the energy of a signal when distributed in the frequency domain. Its mathematical expression is given by:

$$ Smean= mean\left(\frac{2\ast \left\Vert \frac{X(f)}{N}\right\Vert }{N}\right) $$
(13)

Here N is the signal length and X(f) is the Fourier transform of the signal x.

  • Power spectral density (PSD): the computation of PSD is done by applying the Fast Fourier Transform on the autocorrelation function (rxx(τ)).

  • Power root mean square (Prms): the Prms measures the power of the signal’s magnitude. It is calculated from the following equation:

$$ Prm{s}_{freq}=\sqrt{\sum \mathrm{FCGS}2{(f)}^2}=\sqrt{\sum_{i=1}^{N-1} FFT\left(\mathrm{FCGS}{2}_i^2\right)} $$
(14)

We further use the absolute value of Prms_freq whose expression is as follows:

$$ Prms\_ abs=\left| Prms\_ freq\right| $$
(15)
  1. c)

    Time-frequency features

For the time-frequency features, the genomic sequences are firstly transformed into signals using the FCGS2 technique. The signals are secondly transformed into the time–frequency domain based on the continuous wavelet transform (CWT). Finally, a vector containing the wavelet coefficients and their relative energies are used as features for helitrons and non-helitron classification (Fig. 3).

Fig. 3
figure 3

SVM -Helitron recognizer flowchart based on features extracted from CWT

The procedure consists of calculating the energy at each scale of the wavelet decomposition using the following formula:

$$ {E}_{wavelet}={\sum}_{s=1}^L{\left|{W}_{\left(s,u\right)}\left[{FCGS}_2(t)\right]\right|}^2 $$
(16)

Here, L is the length of the FCGS2 signal. Since we have considered 64 scales for the CWT computation (which means that the wavelet coefficients matrix contains 64 power spectra), we have calculated 64 relative energy values.

The sub-figures (a) and (c) in Fig. 4 provide the scalogram representation (absolute value of the wavelet coefficients) of two helitron examples; While the correspondent energy (energy concentration around frequencies) are illustrated in the sub-figures (b) and (d). For the scalograms, the horizontal axis indicates the helitron’s position in base pairs. As for the energy representation, the horizontal axis gives the energy amplitude.

Fig. 4
figure 4

The scalogram and the relative energy representations of two helitrons

The first example is an HelitronY4_CE which is positioned at [354,726 bp–355,316 bp] in the C.elegans chromosome I. The second example is an NDNAX2_CE which is positioned at [5,640,357 bp- 5,640,991 bp] in the same chromosome.

As it can be seen, the scalogram serves to visualize the signature of helitrons [13]. Regarding the energy plot, the pronounced peaks indicate the presence of repetitive motifs in the correspondent scalogram. For example, in the case of HelitronY4_CE, the highest energy is concentrated around the frequencies 0.05 and 0.15. This translates into a special repetitive motif within the frequencies band [0–0.05] in the scalogram representation. We also note that the most pronounced peak in the spectrum corresponds to the frequency 0.15, which is equivalent to the periodicity 6. In addition, the frequency 0.027 indicates the existence of the periodicity 35 in the DNA sequence.

As for the helitron NDNAX2, particular repetitive motifs with a special energy pattern mark their presence in the frequency band [0–0.2]. The energy plot proves that the highest energy is included into this sub-band. The large energy band has high amplitudes that correspond to the main periodicities: 6, 10 and 20. Finally, Fig. 5 presents the scalogram representation (sub-figure (e)) and the energy-spectrum (sub-figure (f)) of a repetitive DNA sequence of type (CAGA)n. The sequence has the size of 376 bases pair and the position [13,273,486 bp–13,273,861 bp] at the C.elegans chromosome I. This sequence consists of the repetition of the microsatellite ‘CAGA’, which size is obviously of 4 bp. The repetition of this subsequence in the DNA sequence creates a periodicity 4. This periodicity is well translated in the scalogram representation in the form of high energy frequency band around the frequency 0.25. Likely, we can also see the existence of the periodicity 5 in the scalogram. These two periodicities correspond to the peaks with the highest amplitudes in the sub-figure (f).

Fig. 5
figure 5

The scalogram and the relative energy representations of repetitive DNA sequence of type (CAGA)n

Based on this, we can see that the time-frequency is an effective way to represent DNA in such way all periodicities that characterize the considered sequence can be detected.

Next, we will exploit this manner to extract information about DNA to be taken into account by the SVM classification.

  1. I.

    Experimental results

In this work, we mixed helitron sequences with the genomic sequences that do not contain helitrons and whose number of training and testing are equal to the helitrons ones. One third of data is taken then for testing and two-thirds for training.

We used, further, all the temporal, the spectral and the time-frequency features for all the sequences. As for the kernel tricks, we used the Linear, Polynomial, RBF and Sigmoid functions. For the SVM-kernels parameters, we found that using the Cross-Validation function gives the optimal value of two parameters: the kernel width (σ) and the regularization parameter (c) [19]. It is known that the recognition task is roughly divided into two stages: the feature extraction and the recognition. The performance of the recognition system strongly depends on the choice of the feature extraction method. Thus, we tried different methods to extract features from the genomic data. In fact, we based the helitron prediction on two features databases: the first database contains the temporal and the spectral features extracted from the FCGS2 signals; while the second database contains the relative energy obtained by the continuous wavelet transform of the FCGS2 signals. Consequently, a comparison between the accuracy rates of the helitron prediction based on of the two methods was conducted. The first database consists of the combination of 8 temporal features and 3 spectral features. The parameters we considered are as follows:

  • The temporal features are: max, μ, Median, Std, X, Var, E and RMS

  • The spectral features are: Smean, PSD and Prms

As for the kernel tricks, we used Linear, Polynomial, RBF and Sigmoid functions which parameters are varied according to the cross-validation function. For each kernel function we defined and calculated the best parameters (d, c and σ) which give the best accuracy rates for all helitron types’ recognition. In the following table we provide the best recognition accuracy rate for all helitron classes in the C.elegans. The results of the classification are based on the combination of the temporal (8 parameters) and the spectral features (3 parameters).

The prediction performances, illustrated in Table 2, show very good results. In fact, the accuracy values are between 75.88% and 95.65, depending on the helitron family. The best accurate prediction (which is 95.65) is obtained for the helitron NDNAX1 class. In addition, six helitron types were highly predicted (with an accuracy rate greater than 91%) which are: HelitronY1_CE, helitronY1A_CE, HelitronY3_CE, NDNAX1_CE, NDNAX2_CE and NDNAX3_CE. On the other hand, the SVM-Linear performs the better results for HelitronY1_CE (ACC = 92.43%), Helitron1_CE (ACC = 83.5%), Helitron2_CE (ACC = 75.88%) and HelitronY2_CE (ACC = 83.66%). When we used the cross validation, we found that with the Polynomial kernel d = 2 we get the best values of kernel parameter. As for the RBF kernel, different values of c and σ have given best accuracy of some helitrons classes. Given the variability of the helitron’s sequences in composition and size, the wavelet coefficients matrix leads to a set of features which are not balanced in length. However, the SVM method is limited when it is applied at imbalanced datasets. For this reason, we need to apply a reduction method to obtain balanced datasets for the wavelet analysis while conserving the useful information. Therefore, we have chosen the energy measure to balance these features. Based on the energy vector calculated from the wavelet coefficients matrix, the features database was established in a second step for both helitrons and non-helitronic sequences. The experimental results are illustrated in Table 3 which represents the best SVM results.

Table 2 The best SVM results of helitrons identification with different SVM-kernels and using the group of temporal and spectral feautures extracted from FGCS2
Table 3 The best SVM results of helitron identification with different SVM-kernels using the feautures extracted from the continuous Wavelet Transform

This table shows that the high prediction accuracy (average of all accuracy which is equal to 92.27%) of our method is due to the ability of the time-frequency features to capture helitron/non-helitron attributes.

We notice that the best rate for the NDNAX3 class, which is 96.25%, was obtained using the RBF-Kernel with c = 65,536 and σ = 0. 000000015625. Also, the HelitronY3_CE class was recognized with an accuracy rate reaching 92.64% with these parameters.

Overall, the kernel width σ = 0.000000015625, the penalty c = 60 and the SVM-RBF have given best performance in terms of recognition rates for the Helitron1_CE class with a global accuracy: ACC = 95.76%. Six other notable helitron classes were showing high accuracy rates: HelitronY1_CE, HelitronY1A_CE, NDNAX1_CE, NDNAX2_CE, HelitronY4_CE, and Helitron2_CE with the respective values of 95.43%, 95.38%, 93.48%, 92.98%, 91.24 and 85.46%. As it can be noted, with the polynomial kernel width parameter d = 2, we obtained the best global accuracy rate reaching the value of 88.11% for the HelitronY2_CE class.

Based on these results, we can see that introducing a set of time-frequency features reveal interesting results that can categorize the helitrons sequences.

Now, if we compare these results (Table 3) with those obtained when we used the first group of parameters (Table 2), we can clearly see that the SVM classifier better recognizes helitrons when it is based on the features extracted from the wavelet transform. Indeed, the best global accuracy rate using the temporal and the spectral features extracted from the FCGS2 signals reached 88.29%. However, the best global accuracy rate we obtained using the energy wavelet features attained 92.27%.

From the present work, we can conclude that the choice of the features we have to extract from the helitronic signals play a major role in their recognition. It turns that using a time-frequency analysis gives better results than using temporal and spectral analysis in terms of the helitron classification. In addition, the SVM-classifier parameters have shown a great influence on the classification results.

6 Conclusion

In this work, we developed a highly accurate method for predicting the Helitron sequences. A support vector machine classification approach based on all SVM-kernels has been adopted for the helitron recognition in C.elegans. The obtained results revealed very encouraging classification accuracies. The detailed experimental results presented here have shown the great effect of the feature extraction step on the helitrons classification rates. In fact, for the feature extraction, we have proposed two methods. The first method consists in extracting temporal and spectral features from the FCGS2 of helitrons and non-helitrons sequences (a repetitive DNA which contains microsatellites). However, the second method relies on the continuous wavelet transform of the FCGS2 signals of helitrons and non-helitron sequences. To make a balanced features database, we extracted the relative energy from the wavelet coefficients matrix. The classification results have shown the superiority of the time-frequency analysis compared to the temporal and spectral analysis in terms of the helitron classification. Furthermore, we demonstrated that choosing the optimal parameters for the SVM-kernels (d, c, and σ) would greatly help improve the accuracy rates of the helitron prediction.