Introduction

Meiotic recombination plays a pivotal role in the maintenance of sequence diversity in human genomes (Chen et al. 2013; Qiu et al. 2014a). The process is carried out in two steps. In the first step, a cell divides into daughter cells (gametes) that each carry half of the genome and participate in sexual reproduction; this process is referred to as meiosis. In the second step, these diverse gametes join to form new combinations of genetic variants, a process known as recombination. Recombination is therefore crucial to genetic variation and is considered one of its main driving forces. In human chromosomes, it targets very narrow regions, called hotspots and coldspots. A chromosomal region where the frequency of recombination is high is called a hotspot, and a region where the frequency of recombination is low is referred to as a coldspot. The identification of recombination spots is essential for understanding the reproduction and growth of cells. A recent study demonstrated that meiotic recombination events occur within 1–2.5 kilobase regions rather than randomly across the genome. A schematic drawing of the meiotic recombination pathway in a DNA system is illustrated in Fig. 1.

Fig. 1

An illustration of the process of meiotic recombination in a DNA system, adapted from Akbar et al. (2014)

The process of recombination is initiated by a double-strand break (broken DNA ends) (Chou 2001a; Keeney 2008; Liu et al. 2012). Hotspots, coldspots and the patterns formed by these sites provide fundamental in-depth information on the processes of human crossover and gene conversion. Given the rapid growth of available genome sequences, it is highly desirable to develop a precise, consistent, robust and automated system for the timely identification of recombination spots. Considerable progress has been made in this area, yet further improvements in accuracy are still needed. A series of efforts have been reported in the literature (Chen et al. 2013; Qiu et al. 2014a). Initially, recombination spots were predicted using nucleotide composition. However, the main issue with nucleotide composition was that it took only short-range sequence information into account, so some important hereditary information was lost (Liu et al. 2012). Moreover, the number of possible patterns for a DNA sequence is extremely large, which makes it very difficult to incorporate the sequence-order information of such long sequences into a statistical predictor. To address this problem, the concept of pseudo-amino acid composition (PseAAC) was introduced by Chou (2001a). Further, this concept was adopted across almost all fields of computational proteomics, such as predicting protein subcellular localization (Lin et al. 2008, 2009a; Khan et al. 2011; Dehzangi et al. 2015; Mandal et al. 2015), protein structural class (Sahu and Panda 2010), DNA-binding proteins (Fang et al. 2008); identifying bacterial virulent proteins (Nanni et al. 2012); predicting the metalloproteinase family (Beigi et al. 2011), protein folding rate (Guo et al. 2011), GABA(A) receptor proteins (Mohabatkar et al. 2011), protein super secondary structure (Zou et al. 2011), cyclin proteins (Mohabatkar 2010); classifying amino acids (Georgiou et al. 2009); predicting enzyme family class (Zhou et al. 2007); identifying risk types of human papillomaviruses (Esmaeili et al. 2010); predicting allergenic proteins (Mohabatkar et al. 2013); identifying G protein-coupled receptors and their types (Khan 2012); and discriminating outer membrane proteins (Hayat and Khan 2012a), among many others.

As demonstrated in a series of recent publications and comprehensive reviews (Xu et al. 2013a, b, 2014a, b; He et al. 2015; Jia et al. 2015; Liu et al. 2015f), and in compliance with Chou’s 5-step rule (Chou 2011), to establish a really useful sequence-based statistical predictor for a biological system one should follow five guidelines: (a) construct or select a valid benchmark dataset to train and test the predictor; (b) formulate the biological sequence samples with an effective mathematical expression that truly reflects their intrinsic correlation with the target to be predicted; (c) introduce or develop a powerful algorithm (or engine) to operate the prediction; (d) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (e) establish a user-friendly web server for the predictor that is accessible to the public.

In this study, we propose a genetic algorithm (GA)-based ensemble model, iRSpot-GAEnsC, for the identification of DNA recombination hotspots and coldspots. Numerical descriptors are extracted using two powerful sequence representation techniques, namely dinucleotide composition and trinucleotide composition. Various classification algorithms are investigated individually, and their predicted outcomes are then combined to form an ensemble model using majority voting and GA. The leave-one-out (jackknife) test was applied to assess the performance of the proposed model.

Methods and materials

Dataset

To construct a promising computational model, a valid benchmark dataset is needed to train the model effectively. For this purpose, we have used the benchmark dataset S, taken from (Chen et al. 2013; Qiu et al. 2014a). This dataset contains 490 hotspot sequences and 591 coldspot sequences. The dataset S comprising both hotspot and coldspot recombination sequences can be formulated as:

$$S = S^{ + } \cup S^{ - }$$
(1)

where $S^{+}$ is the subset of hotspot recombination sequences, $S^{-}$ is the subset of coldspot recombination sequences, and the symbol “∪” denotes the union of the two subsets.
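As a minimal illustration, the benchmark can be assembled as in the following Python sketch; the file names and the one-sequence-per-line format are hypothetical placeholders, since the benchmark is distributed as supplementary data of the cited papers.

```python
# Sketch of assembling the benchmark dataset S = S+ ∪ S-.
# File names and file format are hypothetical placeholders.
def load_sequences(path):
    """Read one DNA sequence per line from a plain-text file."""
    with open(path) as fh:
        return [line.strip().upper() for line in fh if line.strip()]

hotspots = load_sequences("hotspots.txt")    # 490 hotspot sequences (S+)
coldspots = load_sequences("coldspots.txt")  # 591 coldspot sequences (S-)

sequences = hotspots + coldspots                      # S = S+ ∪ S-
labels = [1] * len(hotspots) + [0] * len(coldspots)   # 1 = hotspot, 0 = coldspot
```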

Feature extraction strategies

Feature extraction is one of the fundamental steps in the machine learning process. In the feature extraction phase, numerical attributes are extracted from biological sequences, because statistical models require numerical descriptors for training. With the explosive growth of biological sequences generated in the post-genomic age, one of the most important but also most difficult problems in computational biology is how to formulate a biological sequence as a discrete model or a vector while still keeping considerable sequence pattern information. This is because all the existing operation engines, such as SVM (Support Vector Machine) and NN (Neural Network), can only handle vectors, not sequence samples, as elaborated in (Chou 2015). However, a vector defined in a discrete model may completely lose the sequence-order information. To avoid losing the sequence-order or pattern information of proteins entirely, the pseudo-amino acid composition (PseAAC) was proposed (Chou 2001a). Ever since the concept was proposed, it has penetrated nearly all areas of computational proteomics (Chen et al. 2015). Because of its success in dealing with protein/peptide sequences, the concept of PseAAC has recently been extended to DNA/RNA sequences in computational genetics and genomics (Chen et al. 2012, 2014c, d, 2015; Feng et al. 2013; Liu et al. 2014, 2015a, b, c, d, e, f). Based on the concept of PseAAC, the “pseudo k-tuple nucleotide composition (PseKNC)” (Chen et al. 2014c, d; Liu et al. 2015c, e) was proposed for genome analysis. PseKNC has demonstrated its effectiveness in predicting nucleosomes (Guo et al. 2014), identifying splicing sites (Chen et al. 2014b), identifying translation sites (Chen et al. 2014a) and origins of replication (Li et al. 2015). Both PseAAC and PseKNC have achieved very exciting results and played important roles in the relevant fields. In this study, we use the concept of pseudo-components to predict recombination spots in DNA. In practical applications, particularly in developing high-throughput tools for predicting various important attributes of biomacromolecules, many different descriptors for representing biological sequence samples have been developed and widely used, such as those based on cellular automata images (Xiao et al. 2009), complexity measure factors (Xiao et al. 2011), and grey dynamic models (Lin et al. 2009b, 2012; Qiu et al. 2014c; Xiao et al. 2015), as well as a long list of relevant references cited in a recent comprehensive review (Chou 2009). Here, two powerful DNA sequence representation approaches are used to extract highly discriminative features.

Dinucleotide composition (DNC)

A DNA sequence is a polymer of four nucleotides, namely adenine (A), cytosine (C), guanine (G) and thymine (T). Consider the following DNA sequence X of length L residues, i.e.,

$$X = N_{1} N_{2} N_{3} N_{4} N_{5} N_{6} N_{7} \ldots N_{L} ,$$
(2)

where $N_{1}$ is the residue at the first position of the DNA sequence, $N_{2}$ the residue at the second position, and $N_{L}$ the residue at the Lth position. The simple nucleotide composition has four values, which represent the occurrence frequencies of the four nucleotides (Chou et al. 2012). It can be represented as:

$$X = \left[ {f\left( A \right),\,f\left( C \right),\,f\left( G \right),\,f\left( T \right)} \right]^{T} ,$$
(3)

where f(A) represents the occurrence frequency of nucleotide A, f(C) that of nucleotide C, f(G) that of nucleotide G and f(T) that of nucleotide T, while the superscript T denotes the transpose operator. However, the main drawback of the simple nucleotide composition is that it does not preserve any sequence-order information. To combine occurrence frequency with local sequence-order information, the dinucleotide composition (DNC) was introduced, in which the relative frequency of each nucleotide pair is computed (Chen et al. 2014b). It can be written as:

$$X = \left[ {f\left( {\text{AA}} \right),\,f\left( {\text{AC}} \right),\,f\left( {\text{AG}} \right),\,f\left( {\text{AT}} \right), \ldots ,\,f\left( {\text{TT}} \right)} \right]^{T}$$
(4)
$$X = \left[ {f_{1} ,\,f_{2} ,\,f_{3} ,\,f_{4} , \ldots ,\,f_{16} } \right]^{T} ,$$
(5)

where f(AA) represents the occurrence frequency of the pair AA, f(AC) that of the pair AC, and f(TT) that of the pair TT. The feature space thus contains 4 × 4 = 16 dimensions.
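The following Python sketch computes the 16-D DNC vector of Eqs. (4)–(5); the helper name is ours, and ambiguous bases (such as N) are simply skipped.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

def dinucleotide_composition(seq):
    """16-D dinucleotide composition (Eqs. 4-5): the relative frequency
    of every overlapping nucleotide pair in the sequence."""
    pairs = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]  # AA ... TT
    counts = dict.fromkeys(pairs, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:          # skip pairs containing ambiguous bases
            counts[pair] += 1
    total = max(sum(counts.values()), 1)   # guard against empty input
    return [counts[p] / total for p in pairs]

# Example: dinucleotide_composition("ACGTACGT") returns 16 relative frequencies.
```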

Trinucleotide composition (TNC)

In dinucleotide composition, only two nucleotides are paired, whereas in trinucleotide composition (TNC) three consecutive nucleotides are combined and the occurrence frequency of each triplet is calculated. It can be formulated as:

$$X = \left[ {f\left( {\text{AAA}} \right),\,f\left( {\text{AAC}} \right),\,f\left( {\text{AAG}} \right),\,f\left( {\text{AAT}} \right), \ldots ,\,f\left( {\text{TTT}} \right)} \right]^{T}$$
(6)
$$X = \left[ {f_{1} ,\,f_{2} ,\,f_{3} ,\,f_{4} , \ldots ,\,f_{64} } \right]^{T} ,$$
(7)

where f(AAA) denotes the occurrence frequency of AAA in the DNA sequence, f(AAC) that of AAC, and so forth (Duda et al. 2012). The corresponding feature space therefore contains 4 × 4 × 4 = 64 triplet components. Equation (6) can be written in generalized form so that the corresponding feature space X has $4^{k}$ components, i.e.,

$$X = \left[ {f_{1}^{k} ,\,f_{2}^{k} ,\,f_{3}^{k} ,\,f_{4}^{k} , \ldots ,\,f_{{4^{k} }}^{k} } \right]^{T}$$
(8)

The above procedure shows that as the number of nucleotides per tuple increases, the number of tuples grows accordingly (Chen et al. 2014a). Local, short-range sequence-order information is thereby gradually included, but the global sequence-order information is still not reflected by this formulation (Qiu et al. 2014a).
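Since Eq. (8) generalizes DNC and TNC to an arbitrary tuple size k, the computation can be written once for any k; this sketch (our naming) yields the 16-D DNC for k = 2 and the 64-D TNC for k = 3.

```python
from itertools import product

def ktuple_composition(seq, k):
    """4^k-D k-tuple nucleotide composition (Eq. 8): k=2 reproduces the
    16-D DNC and k=3 the 64-D TNC used in this study."""
    tuples = ["".join(t) for t in product("ACGT", repeat=k)]
    counts = dict.fromkeys(tuples, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:          # skip tuples containing ambiguous bases
            counts[kmer] += 1
    total = max(sum(counts.values()), 1)
    return [counts[t] / total for t in tuples]
```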

Classification algorithms

Classification is a subfield of data mining and machine learning in which data are categorized into predefined classes. In this study, several supervised classification algorithms are evaluated to select the best one for the identification of hotspots and coldspots.

K-Nearest Neighbor (KNN)

KNN is a simple but widely used algorithm in pattern recognition, machine learning and many other areas (Duda et al. 2012). It is also known as an instance-based (lazy) learning algorithm: it does not build a classifier or model immediately, but stores all the training samples and waits until a new observation needs to be classified. This lazy nature distinguishes it from eager learning, which constructs the classifier before any new observation arrives, and makes KNN well suited to dynamic data that change and update rapidly (Han and Kamber 2006). The KNN algorithm has the following five steps:

Step 1: Provide feature space to KNN algorithm to train the system.

Step 2: Measure the distance using the Euclidean distance formula.

$$E_{\text{dis}} \left( {x_{i} ,x_{j} } \right) = \sqrt {\sum\nolimits_{k = 1}^{n} {\left( {x_{ik} - x_{jk} } \right)^{2} } }$$
(9)

Step 3: Sort the Euclidean distance values so that $d_{i} \le d_{i + 1}$, where i = 1, 2, 3, …, k.

Step 4: Apply voting or means according to the data nature.

Step 5: Choose the number of nearest neighbors (the value of k) according to the nature and volume of the data provided to KNN: for large datasets, k should be large, and for small datasets, k should be small.
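A minimal sketch of these steps using scikit-learn’s KNN implementation; the paper does not state its implementation or the value of k, so both are illustrative, and X and y are assumed to come from the earlier feature extraction sketches.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# X: one 64-D TNC vector per sequence; y: 0/1 hotspot labels (earlier sketches).
X = np.array([ktuple_composition(s, 3) for s in sequences])
y = np.array(labels)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")  # Steps 1-2
knn.fit(X, y)                 # lazy learner: fitting only stores the samples
predictions = knn.predict(X)  # Steps 3-5: sort distances, vote among k neighbors
```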

Probabilistic neural network (PNN)

The probabilistic neural network (PNN) was first introduced by Specht in 1990 (Specht 1990). It is based on Bayes’ theorem and provides an interactive way to interpret the structure of the network in terms of probability density functions (Georgiou et al. 2004). PNN has a structure similar to feed-forward networks, with four layers: the input layer, the pattern layer, the summation layer and the output layer (Khan et al. 2015). The input layer receives the input vector and passes it to the pattern layer. The dimension of the pattern layer equals the number of samples presented to the network: the input and pattern layers are connected by exactly one neuron per training sample. The summation layer has the same dimension as the number of classes in the dataset. Finally, the decision layer assigns a novel sample to one of the predefined classes.
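A minimal Parzen-window PNN sketch in NumPy, under the assumption of a spherical Gaussian kernel with a single smoothing parameter sigma; the paper does not report its PNN settings.

```python
import numpy as np

def pnn_predict(X_train, y_train, X_test, sigma=0.1):
    """Minimal PNN: the pattern layer evaluates one Gaussian kernel per
    training sample, the summation layer sums the kernels per class, and
    the output layer returns the class with the largest summed density."""
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    classes = np.unique(y_train)
    preds = []
    for x in np.asarray(X_test):
        d2 = np.sum((X_train - x) ** 2, axis=1)                   # pattern layer
        kernels = np.exp(-d2 / (2.0 * sigma ** 2))
        scores = [kernels[y_train == c].sum() for c in classes]   # summation layer
        preds.append(classes[int(np.argmax(scores))])             # output layer
    return np.asarray(preds)
```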

Random forest

Random forest (RF) is a well-known ensemble technique proposed by Breiman (Breiman 2001; Lou et al. 2014). It is widely used for pattern classification in bioinformatics (Kumar et al. 2009), and its prediction performance is high (Kumar et al. 2009; Chou et al. 2012). RF also provides variable-importance information for classification (Ebina et al. 2011; Boulesteix et al. 2012; Touw et al. 2013). An RF consists of a large number of decision trees, each producing a classification (Breiman 2001); the final result is obtained by combining the outputs of all the trees by voting (Jiang et al. 2007). In addition, RF selects features randomly: instead of using all the features for a single tree, it distributes subsets of features across different trees and then combines the result of each tree (Jiang et al. 2007).
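A minimal RF sketch with scikit-learn; the tree count is illustrative, as the paper does not report it.

```python
from sklearn.ensemble import RandomForestClassifier

# n_estimators is an illustrative assumption, not a reported setting.
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X, y)                            # each tree votes; the forest aggregates
importances = rf.feature_importances_   # per-feature (per k-tuple) importance
```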

Support vector machine (SVM)

The support vector machine (SVM) is an effective method for supervised pattern classification, first introduced by Vapnik in 1995 (Vapnik 2000; Qiu et al. 2009; Gu et al. 2010) and later updated by Vapnik in 1998 (Hayat and Khan 2011). Originally developed for two-class problems, it was later extended to multiclass problems (Ahmad et al. 2015). In a two-class problem, SVM maps the data to a high-dimensional feature space and then determines the optimal separating hyperplane (Chen et al. 2014b). It is a very good classifier for identifying linear as well as non-linear patterns (Akbar et al. 2014). SVM can use different kernel functions, including but not limited to linear, polynomial, Gaussian (RBF) and sigmoid kernels. In this study, the one-versus-one (‘OVO’) strategy was employed with the popular radial basis function (RBF) kernel, parameterized by γ and C (Qiu et al. 2014a). The regularization parameter C and the kernel width parameter γ were determined via a grid search optimization procedure for the identification of recombination hotspots and coldspots.

$$K\left( {x_{i} ,x_{j} } \right) = \exp \left( { - \gamma \left\| {x_{i} - x_{j} } \right\|^{2} } \right),$$
(10)

where the parameter γ controls the width of the Gaussian function. The values of these RBF parameters are determined via a grid search during the training phase of the SVM model. In our work, the LIBSVM package (Chang and Lin 2011) was used to predict hotspots and coldspots in DNA sequences. This software is free to download at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
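scikit-learn’s SVC is itself built on LIBSVM, so the grid search over C and γ can be sketched as follows; the grid ranges and fold count are illustrative assumptions, since the paper reports only that C and γ were tuned by grid search.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative exponential grid over C and gamma; X, y from earlier sketches.
param_grid = {"C": [2.0 ** p for p in range(-5, 6)],
              "gamma": [2.0 ** p for p in range(-5, 6)]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
grid.fit(X, y)
best_svm = grid.best_estimator_   # RBF-SVM with the tuned C and gamma
```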

Generalized regression neural network

The generalized regression neural network (GRNN) is mostly used for function approximation. Its structure and functionality are similar to those of the PNN: it has four layers, i.e., an input layer, a radial basis layer, a special linear layer and an output layer. The numbers of neurons in the input and output layers of a GRNN equal the dimensions of the input and output vectors, respectively. GRNN is well suited to small- and medium-sized datasets. The overall process is carried out in three steps. First, a set of training data and target data is created. Next, the input data, target data and a spread constant are passed to newgrnn as arguments. Finally, the response of the network is obtained by simulating it on the data provided (Cherian and Sathiyan 2012).
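Functionally, a GRNN prediction is a kernel-weighted average of the training targets. The paper uses MATLAB’s newgrnn; the NumPy sketch below mirrors the same computation under that assumption, with an illustrative spread value.

```python
import numpy as np

def grnn_predict(X_train, y_train, X_test, spread=0.1):
    """Minimal GRNN sketch: each prediction is a Gaussian-kernel-weighted
    average of the training targets (radial basis + special linear layer)."""
    X_train = np.asarray(X_train)
    y_train = np.asarray(y_train, dtype=float)
    preds = []
    for x in np.asarray(X_test):
        d2 = np.sum((X_train - x) ** 2, axis=1)
        w = np.exp(-d2 / (2.0 * spread ** 2))            # radial basis layer
        preds.append(w @ y_train / max(w.sum(), 1e-12))  # weighted average
    return np.asarray(preds)   # for 0/1 labels, threshold at 0.5 to classify
```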

Feed forward neural network

A feed-forward back-propagation neural network (FFBPNN) is an artificial neural network (ANN) consisting of N layers. The first layer is connected to the input vector, and each layer is connected to the subsequent one; the final layer produces the resultant output. The training of an FFBPNN is based on Eqs. (11) and (12):

$$U_{k} \left( t \right) = \sum\nolimits_{j = 1}^{n} {w_{jk} \left( t \right)\,x_{j} \left( t \right)} + b_{0k} \left( t \right)$$
(11)
$$Y_{k} \left( t \right) = \varphi \left( {U_{k} \left( t \right)} \right),$$
(12)

where in Eq. (11), $x_{j}(t)$ is input value j to neuron k at time t, $w_{jk}(t)$ is the weight assigned to that input by neuron k, and $b_{0k}$ is the bias of neuron k at time t. In Eq. (12), $Y_{k}(t)$ is the output of neuron k and φ is the activation function (ALAllaf 2012). A minimal sketch of this forward pass is given below; the FFBPNN also has two specialized variants, described in the following subsections.
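The sketch assumes a tanh activation, an illustrative choice since Eqs. (11)–(12) leave φ unspecified.

```python
import numpy as np

def neuron_forward(x, w_k, b0_k, phi=np.tanh):
    """Single-neuron forward pass: weighted sum plus bias (Eq. 11),
    followed by the activation function phi (Eq. 12)."""
    u_k = np.dot(w_k, x) + b0_k   # Eq. (11)
    return phi(u_k)               # Eq. (12)
```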

Fitting network

The fitting network (FitNet) is a type of FFBPNN used to fit an input–output relationship (ALAllaf 2012). The Levenberg–Marquardt algorithm is the default training algorithm. The feature vectors are randomly divided into three sets: (i) the training data, (ii) the validation data and (iii) the test data. A fitting network with one hidden layer and a sufficient number of neurons can fit any finite input–output mapping.

Pattern recognition network

The pattern recognition neural network (PatternNet) is also a type of FFBPNN, used for solving pattern recognition problems such as DNA sequence classification. It is trained to map feature vectors to their target vectors, using the scaled conjugate gradient algorithm. At each training cycle, the sequences are presented to the network, and the data are divided into three groups: (i) the training set, (ii) the validation set and (iii) the test set. The training process of PatternNet is the same as that of FitNet discussed above; the main difference is the training algorithm, with FitNet using Levenberg–Marquardt and PatternNet using scaled conjugate gradient.

Ensemble classification

Ensemble classification has received considerable attention in recent decades. It has been successfully used to enhance prediction power and has been widely applied to predicting protein subcellular location (Chou and Shen 2007a, b), predicting signal peptides (Chou and Shen 2007c) and enzyme subfamily prediction (Chou 2005). The reported performance of ensemble approaches is generally better than that of individual classifiers: individual classifiers are diverse and make different errors during classification, but when they are combined, the errors can be reduced because the classification error of one algorithm is compensated by another (Hayat and Khan 2012a). Ensemble classification is designed to combine the results of different classification techniques and to reduce the variance caused by anomalies of the single techniques. In this paper, seven classification techniques have been used: GRNN, PNN, KNN, SVM, RF, PatternNet and FitNet. First, the individual classifiers are trained and tested; their individual predictions are then combined to form the ensemble classifier. This can be represented as follows:

$${\text{EnsC}} = {\text{GRNN}} \oplus {\text{PNN}} \oplus {\text{KNN}} \oplus {\text{SVM}} \oplus {\text{RF}} \oplus {\text{PatternNet}} \oplus {\text{FitNet,}}$$
(13)

where EnsC denotes the ensemble classifier and the symbol ⊕ denotes the combination operator. The working of the ensemble classifier EnsC, formed by fusing the seven individual classifiers, can be explained as follows. Suppose the predicted results of the individual classifiers for DNA recombination hotspots and coldspots are:

$$\left\{ {C_{1} ,C_{2} ,C_{3} ,C_{4} ,C_{5} ,C_{6} ,C_{7} } \right\} \in \left\{ {D_{1} ,D_{2} } \right\},$$
(14)

where $\{C_{1}, C_{2}, \ldots, C_{7}\}$ are the individual classifiers and $\{D_{1}, D_{2}\}$ are the two classes, DNA recombination hotspots and coldspots (Hayat et al. 2012). The vote count for class $D_{j}$ is then

$$Y_{j} = \sum\nolimits_{i = 1}^{7} {\delta \left( {C_{i} ,D_{j} } \right)} ,\quad {\text{where }}j = 1,\,2,$$
(15)

where

$$\delta \left( {C_{i} ,D_{j} } \right) = \left\{ \begin{aligned} 1,\quad {\text{if}}\;C_{i} \in D_{j} \hfill \\ 0,\quad {\text{otherwise}} \hfill \\ \end{aligned} \right..$$
(16)
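A minimal sketch of this unweighted vote (Eqs. 14–16), assuming the per-classifier predictions are stored as integer class labels:

```python
import numpy as np

def majority_vote(predictions, n_classes=2):
    """predictions: integer array of shape (n_classifiers, n_samples).
    votes[j] realizes Y_j of Eq. (15): the number of classifiers whose
    prediction falls in class j (the delta of Eq. 16). The class with
    the most votes wins for each sample."""
    votes = np.zeros((n_classes, predictions.shape[1]), dtype=int)
    for j in range(n_classes):
        votes[j] = (predictions == j).sum(axis=0)
    return np.argmax(votes, axis=0)
```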

The output of the ensemble classifier using GA is obtained as:

$${\text{GAEnsC}} = {\text{Max}}\left\{ {w_{1} y_{1} ,w_{2} y_{2} , \ldots ,w_{7} y_{7} } \right\},$$
(17)

where GAEnsC is the classification output of the GA-based ensemble, Max selects the maximum score, and $w_{1}, w_{2}, \ldots, w_{7}$ are the optimized weights of the individual classifiers. The majority voting-based ensemble is a simple approach in which each classifier is assigned an equal weight. However, the predictions of all classifiers do not favor all classes equally: some classifiers are good for one class while others are good for another. In such situations, the success rate of a majority voting-based ensemble is modest. In contrast, the GA-based ensemble automatically determines an appropriate weight for each classifier, effectively finding proper weights for all the eligible classes depending on prediction confidence. Initially, a random weight is assigned to each classifier; the weights are then optimized on the basis of prediction confidence, so that the outcomes of classifiers with high confidence are given more importance.
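A toy sketch of the GA-weighted ensemble follows; the paper does not specify its GA operators or parameters, so the selection and mutation choices below are purely illustrative.

```python
import numpy as np

def weighted_vote(weights, predictions, n_classes=2):
    """Weighted counterpart of Eq. (17): each class score is the sum of the
    weights of the classifiers voting for it; predict the argmax."""
    weights = np.asarray(weights, dtype=float)
    scores = np.zeros((n_classes, predictions.shape[1]))
    for j in range(n_classes):
        scores[j] = weights @ (predictions == j)    # weight * vote indicator
    return np.argmax(scores, axis=0)

def ga_optimize_weights(predictions, y, pop_size=30, generations=100, seed=0):
    """Toy GA: evolve weight vectors, keeping those whose weighted vote
    best matches the training labels. Operator and parameter choices are
    illustrative assumptions, not the paper's reported settings."""
    rng = np.random.default_rng(seed)
    n_clf = predictions.shape[0]
    population = rng.random((pop_size, n_clf))
    for _ in range(generations):
        fitness = np.array([(weighted_vote(w, predictions) == y).mean()
                            for w in population])
        elite = population[np.argsort(fitness)[-pop_size // 2:]]   # selection
        parents = elite[rng.integers(len(elite), size=pop_size - len(elite))]
        children = parents + 0.1 * rng.standard_normal(parents.shape)  # mutation
        population = np.vstack([elite, np.clip(children, 0.0, None)])
    fitness = np.array([(weighted_vote(w, predictions) == y).mean()
                        for w in population])
    return population[np.argmax(fitness)]
```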

Framework of the proposed model

Given the importance of recombination spots, the iRSpot-GAEnsC model is proposed for the identification of hotspots and coldspots. Two powerful feature extraction methods, dinucleotide composition (DNC) and trinucleotide composition (TNC), were used to extract features from dataset S. The extracted features were passed to seven classification algorithms: GRNN, KNN, PNN, SVM, RF, PatternNet and FitNet. The best results of the individual classifiers were recorded, and their predictions were combined to form an ensemble model using both simple majority voting and the GA-based optimization approach. The proposed model was trained on the 64 TNC features. iRSpot-GAEnsC yields higher performance than the existing methods reported in the literature so far. The proposed ensemble model is shown in Fig. 2.

Fig. 2

Framework of iRSpot-GAEnsC predictor

Performance measures

Several performance measures are used to assess machine learning algorithms in classification. A confusion matrix records the correct and incorrect predictions for each class. The performance measures used here are given below.

I. Accuracy

$${\text{Acc}} = \frac{{{\text{TP}}\,{ + }\,{\text{TN}}}}{{{\text{TP}}\,{ + }\,{\text{FP}}\,{ + }\,{\text{TN}}\,{ + }\,{\text{FN}}}}\, \times \,100\,\%$$
(18)
II. Sensitivity

$${\text{Sen}}\,{ = }\,\frac{\text{TP}}{{{\text{TP}}\,{ + }\,{\text{FN}}}}\, \times 100\,\%$$
(19)
III. Specificity

$${\text{Spe}} = \frac{\text{TN}}{{{\text{TN}} + {\text{FP}}}} \times 100\,\%$$
(20)
IV. Matthews correlation coefficient (MCC)

$${\text{MCC}} = \frac{{{\text{TP}} \times {\text{TN}} - {\text{FP}} \times {\text{FN}}}}{{\sqrt {[{\text{TP}} + {\text{FP}}][{\text{TP}} + {\text{FN}}][{\text{TN}} + {\text{FP}}][{\text{TN}} + {\text{FN}}]} }},$$
(21)

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives and FN the number of false negatives.

V. F-measure

The F-measure, the harmonic mean of precision and recall, is used for the evaluation of statistical methods. It is calculated as:

$${\text{F-measure}} = 2 \times \frac{{{\text{Precision}}\, \times \,{\text{Recall}}}}{{{\text{Precision}}\, + \,{\text{Recall}}}}$$
(22)

The F-measure depends on two quantities, precision and recall, where

$${\text{Precision}}\, = \,\frac{\text{TP}}{{{\text{TP}}\, + \,{\text{FP}}}}$$
(23)
$${\text{Recall}}\, = \,\frac{\text{TP}}{{{\text{TP}}\, + \,{\text{FN}}}}$$
(24)

The best possible F-measure value is 1 and the worst is 0.

VI. G-mean

The G-mean is defined in terms of two quantities, sensitivity (Sen) and specificity (Spe), and is calculated as:

$${\text{G-mean}}\, = \,\sqrt {{\text{Sen}}\, \times \,{\text{Spe}}} .$$
(25)

Sensitivity shows the performance on the positive class, whereas specificity shows the performance on the negative class. The G-mean thus captures the balanced performance of a learning algorithm across the positive and negative classes.

VII. Q-statistics

The Q-statistic measures the diversity between two classification algorithms. The Q-statistic of any two classifiers $C_{m}$ and $C_{n}$ is computed with the following formula:

$$Q_{m,n} = \frac{{cw - ab}}{{cw + ab}},$$
(26)

where c is the number of samples both classifiers predict correctly and w the number both predict wrongly; a is the number of samples the first classifier predicts correctly but the second predicts wrongly, and b the number the second classifier predicts correctly but the first predicts wrongly.
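A minimal sketch of the pairwise Q-statistic of Eq. (26), computed from boolean per-sample correctness vectors of the two classifiers:

```python
import numpy as np

def q_statistic(correct_m, correct_n):
    """Q-statistic of Eq. (26) from boolean 'was this sample classified
    correctly?' vectors of classifiers C_m and C_n."""
    m, n = np.asarray(correct_m, bool), np.asarray(correct_n, bool)
    c = float(np.sum(m & n))    # both correct
    w = float(np.sum(~m & ~n))  # both wrong
    a = float(np.sum(m & ~n))   # only C_m correct
    b = float(np.sum(~m & n))   # only C_n correct
    denom = c * w + a * b
    return (c * w - a * b) / denom if denom else 0.0
```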

Although the four metrics (Eqs. 18, 19, 20, 21) are often used in the literature to measure the quality of a prediction method, they lack intuitiveness and are not easy for most biologists to understand, particularly the MCC (Matthews correlation coefficient). To avoid this problem, we have adopted the following formulations proposed in recent publications (Chou 2001b; Chou et al. 2011; Xu et al. 2013a; Qiu et al. 2014a, c):

$${\text{Acc}}\, = \,1 - \frac{{N_{ - }^{ + } + N_{ + }^{ - } }}{{N^{ + } + N^{ - } }}$$
(27)
$${\text{Sp}} = 1 - \frac{{N_{ + }^{ - } }}{{N^{ - } }}$$
(28)
$${\text{Sn}} = 1 - \frac{{N_{ - }^{ + } }}{{N^{ + } }}$$
(29)
$${\text{Mcc}} = \frac{{1 - \left( {\frac{{N_{ - }^{ + } + N_{ + }^{ - } }}{{N^{ + } + N^{ - } }}} \right)}}{{\sqrt {\left( {1 + \frac{{N_{ + }^{ - } - N_{ - }^{ + } }}{{N^{ + } }}} \right)\left( {1 + \frac{{N_{ - }^{ + } - N_{ + }^{ - } }}{{N^{ - } }}} \right)} }}.$$
(30)

In these metrics, $N^{+}$ is the total number of hotspot samples investigated, $N_{-}^{+}$ the number of hotspots incorrectly predicted as coldspots, $N^{-}$ the total number of coldspot samples, and $N_{+}^{-}$ the number of coldspots incorrectly predicted as hotspots. The metrics given in Eqs. (27–30) are valid only for single-label systems. For multi-label systems, whose existence has become more frequent in systems biology (Chou et al. 2011) and systems medicine (Xiao et al. 2013b), a completely different set of metrics, as defined in (Chou 2013), is needed.

Results

Statistical methods are used to evaluate the prediction performance of the classifiers. Three cross-validation tests are commonly used to examine classifier performance: the independent dataset test, the subsampling test and the jackknife test. Among these, the jackknife test is most widely applied because it always produces a unique result for a given dataset (Qiu et al. 2014a; Hayat and Tahir 2015). The jackknife test has therefore been increasingly adopted by investigators to test the power of various predictors (Ding et al. 2009, 2012, 2014; Hayat and Khan 2012b; Zhang et al. 2012; Lin et al. 2013; Yuan et al. 2013; Lu et al. 2014), and we likewise use jackknife cross-validation to examine the power of our method. The performance of the two feature spaces is compared below.
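A minimal jackknife evaluation sketch using scikit-learn’s LeaveOneOut splitter; any of the classifiers sketched above can be passed as model.

```python
from sklearn.model_selection import LeaveOneOut

def jackknife_accuracy(model, X, y):
    """Jackknife (leave-one-out) success rate: each sample is predicted
    by a model retrained on all the remaining samples."""
    hits = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        model.fit(X[train_idx], y[train_idx])
        hits += int(model.predict(X[test_idx])[0] == y[test_idx][0])
    return hits / len(y)

# Example: jackknife_accuracy(best_svm, X, y)
```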

Prediction performance of classifiers using DNC

The success rates of the individual and ensemble classifiers using the DNC feature space are listed in Table 1. Among the individual classifiers, RF achieved the highest accuracy. SVM and PatternNet obtained similar results, and KNN and PNN also yielded relatively similar accuracies, whereas GRNN achieved the worst results. The individual classifier predictions were then combined through majority voting and the GA optimization technique. The outcome of the majority voting-based ensemble was not encouraging; in contrast, the GA-based ensemble model obtained better results than both the individual classifiers and the majority voting ensemble. Besides accuracy, sensitivity, specificity and MCC, further measures, namely F-measure, G-mean and Q-statistics, are reported to demonstrate the strength of the proposed model; the Q-statistics show the diversity among the individual classifiers. The accuracy of iRSpot-GAEnsC is shown in Fig. 3.

Table 1 Success rate of individual and ensemble classification algorithms using DNC
Fig. 3

The performance of iRSpot-GAEnsC using DNC

Prediction performance of classifiers using TNC

The success rates of the individual and ensemble classifiers using the TNC feature space are reported in Table 2. Among the individual classifiers, RF achieved the highest accuracy. SVM and PatternNet obtained comparable results, and KNN and PNN also yielded relatively similar accuracies, whereas GRNN achieved the worst results. The predicted outcomes of the individual classifiers were then combined through majority voting and the GA optimization technique. The outcome of the majority voting-based ensemble was not encouraging; in contrast, the GA-based ensemble model obtained better results than both the individual classifiers and the majority voting ensemble. The outcome of the GA-based ensemble model is shown in Fig. 4.

Table 2 Success rate of individual and ensemble classification algorithms using TNC
Fig. 4

The performance of iRSpot-GAEnsC using TNC

Comparison of iRSpot-GAEnsC with existing methods

A comparison between the proposed model and existing methods in the literature is reported in Table 3. The pioneering work on this dataset was carried out by Chen et al. (2013), who introduced the iRSpot-PseDNC predictor for the identification of recombination hotspots and coldspots. More recently, Qiu et al. (2014a) developed the iRSpot-TNCPseAAC model for the same task. In contrast, our proposed iRSpot-GAEnsC model achieves quite promising results compared to these methods. The empirical results demonstrate that the performance of the GA-based ensemble model is quite promising, an achievement ascribed to the highly discriminative TNC features and the optimization-based ensemble classification.

Table 3 Performance comparison of iRSpot-GAEnsC with existing methods

Discussion

In this study, a high-throughput computational model has been developed for the identification of DNA recombination hotspots and coldspots. Two feature extraction methods, dinucleotide composition and trinucleotide composition, were used to extract highly discriminative features from DNA sequences. The performance of both feature spaces was evaluated using seven classification algorithms of different natures: GRNN, KNN, PNN, SVM, RF, PatternNet and FitNet. After examining the performance of the individual classifiers, their predicted outcomes were combined through simple majority voting and the GA-based optimization approach. The genetic algorithm-based ensemble model achieved quite promising results, higher than the performance of the individual classifiers and of the majority voting ensemble. Its performance also exceeds that of the methods reported in the literature so far. This achievement is ascribed to the highly discriminative TNC features and the ensemble strength of the GA optimization. The proposed model might also be helpful in drug-related applications. As demonstrated in a series of recent publications (Xiao et al. 2013a; Ding et al. 2014; Qiu et al. 2014b; Xu et al. 2014b; Jia et al. 2015; Liu et al. 2015e, f) on developing new prediction methods, user-friendly and publicly accessible web servers enhance their impact (Chou 2015); in future work, we will therefore make efforts to provide a web server for the prediction of recombination hotspots and coldspots.