Introduction

The cell is the basic unit of all living organisms and may be prokaryotic or eukaryotic. It carries out functions such as reproduction, respiration, transport of molecules, and identity maintenance. A cell comprises a nucleus, Golgi complex, mitochondria, endoplasmic reticulum, ribosomes, etc. The nucleus is a membrane-enclosed organelle containing genetic material in the form of long DNA molecules (Athey et al. 1990; Mavrich et al. 2008a, c). DNA is organized in a supercoiled structure known as chromatin. The nucleosome, composed of histone proteins and DNA, is considered the basic unit of eukaryotic chromatin (Thoma et al. 1979). The core histones comprise four subunits, namely H2A, H2B, H3, and H4, whereas the linker histone is H1. Chromatin DNA is of two types: core DNA, a double-helical strand of about 146 bp that coils around the core histones in a left-handed superhelix, and linker DNA (Berbenetz et al. 2010; Schwartz et al. 2009). Linker DNA is a short sequence of 20–60 bp that connects adjacent nucleosomes (Athey et al. 1990; Mavrich et al. 2008a, b). Thus, in a nucleosome, the final length of DNA becomes 166–167 bp, which may form two full turns (Thoma et al. 1979); this structure is known as the chromatosome. The packaging of DNA around the histone octamer plays significant roles in biological processes, namely RNA splicing, DNA replication, repair mechanisms, and transcriptional control (Schwartz et al. 2009; Berbenetz et al. 2010; Yasuda et al. 2005). Various traditional methods such as nuclear magnetic resonance (NMR), filter binding assays, and X-ray crystallography have been used for the recognition of DNA and proteins (Gabdank et al. 2010; Chen et al. 2014; Xi et al. 2010; Eddy 1996; Segal et al. 2006; Field et al. 2008). Owing to the limited number of available genomic and proteomic structures, time constraints, and the lack of laboratory equipment, the traditional methods have remained insufficient.
In addition, a huge number of biological sequences have been deposited in databases owing to rapid technological advancement in the post-genomic era. However, the identification of these unprocessed data is a challenging task for researchers in bioinformatics and proteomics. Given the limitations of traditional approaches, investigators have turned their attention to computational methods based on contemporary machine learning (Field et al. 2008). Nucleosome positioning in genomes has been investigated in various studies (Peckham et al. 2007; Satchwell et al. 1986; Yuan et al. 2005; Goñi et al. 2008; Tahir and Hayat 2016; Yuan and Liu 2008; Tolstorukov et al. 2008; Nikolaou et al. 2010). A hidden Markov model (HMM) was applied to capture the central patterns in the provided data (Stolz and Bishop 2010). Segal et al. introduced a probabilistic model that calculates the probabilities of nucleotides and higher order dependencies among nucleotides (Thoma et al. 1979). Several k-mer methods were used by Kaplan et al. (2009) and Field et al. (2008) to improve the success rates of the developed models (Goñi et al. 2008; Isami et al. 2015). Likewise, Xi et al. introduced a novel duration hidden Markov model (dHMM) that incorporates the linker DNA length as well as nucleosome positions to collect nucleosome positioning information (Nikolaou et al. 2010). In addition, Satchwell et al. introduced di-nucleotide and tri-nucleotide composition for the identification of nucleosome positioning in genomes (Awazu 2017). Furthermore, SVM in combination with sequence-based features was used by Peckham et al. to analyze oligo-nucleotides implicated in nucleosome formation and exclusion (Satchwell et al. 1986; Liu et al. 2015a).

The “iNuc-PseKNC” predictor was developed by Guo et al. for predicting nucleosome positioning in genomes (Peckham et al. 2007). Its pseudo k-tuple nucleotide composition used six local structural physicochemical properties of DNA to represent DNA sequences (Peckham et al. 2007).

The notion of pseudo-amino acid (PseAA) composition has been broadly implemented in various computational models. It was further extended to DNA representation, leading to several tools, namely repDNA (Li et al. 2015), Pse-in-One (YongE and GaoShan 2015), and iDNA-KACC (Xiang et al. 2016). In addition, predictors such as iRSpot-EL (Dong et al. 2016) and iDHS-EL (Xiao et al. 2013) were established by Liu et al. The concept of PseKNC has been successfully applied to RNA/DNA problems, namely identifying nucleosomes (Liu et al. 2015d), predicting splicing sites, identifying translation initiation sites (Che et al. 2016), predicting recombination spots (Liu et al. 2015d; Luo et al. 2016; Tian et al. 2015), predicting promoters (Liu et al. 2015d), identifying origins of replication (Li et al. 2015), identifying RNA and DNA modifications (Yong and GaoShan 2015; Xiang et al. 2016), and others (Dong et al. 2016). According to previous studies (Guo et al. 2014; Xiao et al. 2013; Chen et al. 2013; Liu et al. 2014a; Qiu et al. 2014; Xu et al. 2013a, b), a precise, reliable, and efficient predictor for a biological system can be established by following Chou’s 5-step rule. The steps are as follows: (1) choose or design a valid dataset to train and test the model effectively; (2) mathematically express the samples in such a way that they truly represent the motif of the target class; (3) develop or introduce an efficient algorithm as the operational engine; (4) apply a cross-validation test to evaluate the outcome of the model; and (5) develop a web predictor for the model that is easily accessible to the public.

The rest of the paper is structured as follows: the next section describes the materials and methods, the “Results” section presents the performance of the supervised algorithms, followed by the “Discussion” section, and the conclusion is reported at the end of the paper.

Methods

Datasets

In this study, we targeted three species: D. melanogaster, C. elegans, and H. sapiens. The benchmark datasets for these species were taken from Guo et al. (2014). These datasets can be mathematically expressed as

$${S_1}=S_{1}^{+} \cup S_{1}^{ - },$$
(1)
$${S_2}=S_{2}^{+} \cup S_{2}^{ - },$$
(2)
$${S_3}=S_{3}^{+} \cup S_{3}^{ - }.$$
(3)

In the above equations, \({S_1}\), \({S_2}\) and \({S_3}\) represent the benchmark datasets for C. elegans, D. melanogaster, and H. sapiens, respectively. The \({S_1}\) benchmark dataset contains 4573 samples, of which 2273 belong to \(S_{1}^{+}\) (nucleosome-forming samples) and 2300 to \(S_{1}^{ - }\) (nucleosome-inhibiting samples). The \({S_2}\) benchmark dataset contains 5175 samples, of which 2567 belong to \(S_{2}^{+}\) (nucleosome forming) and 2608 to \(S_{2}^{ - }\) (nucleosome inhibiting). Similarly, \({S_3}\) represents the third benchmark dataset comprising 5750 samples, of which 2900 belong to \(S_{3}^{+}\) (nucleosome forming) and 2850 to \(S_{3}^{ - }\) (nucleosome inhibiting). The symbol \(\cup\) denotes the union of two sets. To remove redundant samples from the benchmark datasets, the CD-HIT software was applied with a cutoff threshold of 80% (Guo et al. 2014).

Feature extraction techniques

Suppose S is the sequence of DNA with L nucleic acid residues as shown below:

$$S={N_1}{N_2}{N_3}{N_4} \ldots {N_L}.$$
(4)

In the above equation, \({N_1}\) denotes the nucleic acid residue at the first position in the sequence, \({N_2}\) denotes the residue at the second position, and \({N_L}\) denotes the last residue of the DNA sequence, at position L (Ioshikhes et al. 1996). These nucleotides are expressed as

$${N_i} \in \left\{ {G\,({\text{guanine}}),\,C\,({\text{cytosine}}),\,A\,({\text{adenine}}),\,T\,({\text{thymine}})} \right\},$$

where the value of i = 1, 2, …, L.

A DNA sequence can be numerically expressed by computing the frequency of each nucleotide, known as the nucleic acid composition (NAC). It can be presented as below:

$$S={\left[ {f(A),f(C),f(T),f(G)} \right]^T}.$$
(5)

In the above equation, ƒ(A) indicates the frequency of adenine, ƒ(C) the frequency of cytosine, and so on in the DNA sequence, while the T symbol indicates the transpose operator. Conventional NAC is a simple discrete method, but it does not retain information about the sequence order of nucleotides. Consequently, correlation factors among nucleotides are completely ignored. Given the significance of correlation factors and local information, the idea of pseudo-amino acid (PseAA) composition was introduced and has been adopted in nearly all fields of computational proteomics and genomics (Chou 2001a, 2005; Cao et al. 2013; Liu et al. 2014b; Chen and Li 2013). Subsequently, the PseAA composition idea was extended to handle RNA/DNA sequences in the form of PseKNC.
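As a minimal sketch of the NAC in Eq. (5), the following Python function counts the four nucleotides and normalizes by the number of valid residues. The function name and the choice to normalize (rather than use raw counts) are assumptions made for illustration; the text does not specify either.

```python
from collections import Counter

# Component order follows Eq. (5): f(A), f(C), f(T), f(G).
NUCLEOTIDES = "ACTG"

def nucleic_acid_composition(seq):
    """Return [f(A), f(C), f(T), f(G)] as fractions of the valid residues."""
    counts = Counter(seq.upper())
    total = sum(counts[n] for n in NUCLEOTIDES)
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T residues")
    # Assumption: frequencies are normalized counts, so the vector sums to 1.
    return [counts[n] / total for n in NUCLEOTIDES]

example = nucleic_acid_composition("AACGT")

# Two different sequences with the same composition get the same vector,
# which illustrates the loss of sequence-order information.
same_nac = nucleic_acid_composition("ACGT") == nucleic_acid_composition("TGCA")
```

Because only residue counts enter the vector, any permutation of a sequence yields the same NAC, which is the limitation the pseudo-composition methods aim to address.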

In this article, we applied two discrete feature extraction methods, namely PseDNC and PseTNC, to collect diverse and prominent numerical descriptors from DNA sequences.

Pseudo-di-nucleotide composition

PseDNC expresses a DNA sequence by pairing adjacent nucleotides and then calculating the frequency of each pair. Suppose \({N_1}{N_2}\) is the first di-nucleotide, \({N_2}{N_3}\) is the second di-nucleotide, and finally \({N_{L - 1}}{N_L}\) is the last di-nucleotide. Accordingly, a 4 × 4 = 16-dimensional (16D) feature vector is formed. It can be numerically represented as follows:

$$S={\left[ {f(AA),f(AC),f(AG),f(AT), \ldots ,f(TT)} \right]^T},$$
(6)
$$S={\left[ {f_{1}^{{di}},f_{2}^{{di}},f_{3}^{{di}}, \ldots ,f_{{16}}^{{di}}} \right]^T}.$$
(7)

In the above equations, the T symbol represents the transpose operator, \(f_{1}^{{di}}=f(AA)\) is the frequency of the AA pair, \(f_{2}^{{di}}=f(AC)\) is the frequency of the AC pair, \(f_{4}^{{di}}=f(AT)\) is the frequency of the AT pair in the DNA sequence, and so on.
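The di-nucleotide frequencies described above can be sketched as follows. This covers only the frequency components as described in this section; the function name, the normalization by the number of overlapping windows, and the skipping of windows containing non-ACGT characters are illustrative assumptions.

```python
from itertools import product

# Alphabet order A, C, G, T gives AA, AC, AG, AT, ..., TT, which agrees with
# the stated indices f1 = f(AA), f2 = f(AC) and f4 = f(AT).
ALPHABET = "ACGT"
DINUCLEOTIDES = ["".join(p) for p in product(ALPHABET, repeat=2)]

def dinucleotide_composition(seq):
    """Return the 16-D vector of overlapping di-nucleotide frequencies."""
    seq = seq.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    windows = 0
    # Slide a window of width 2: N1N2, N2N3, ..., N(L-1)N(L).
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # assumption: ambiguous bases such as N are skipped
            counts[pair] += 1
            windows += 1
    if windows == 0:
        raise ValueError("sequence has no valid di-nucleotides")
    return [counts[d] / windows for d in DINUCLEOTIDES]

features = dinucleotide_composition("AACGT")  # pairs: AA, AC, CG, GT
```

A sequence of length L contributes L − 1 overlapping pairs, so each of the four pairs in the toy sequence receives a frequency of 0.25.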

Pseudo-tri-nucleotide composition

PseTNC expresses a DNA sequence by grouping three consecutive nucleotides and then computing the occurrence frequency of each tri-nucleotide. For example, \({N_1}{N_2}{N_3}\) is the first tri-nucleotide component, \({N_2}{N_3}{N_4}\) is the second, and so on, while the last tri-nucleotide component is \({N_{L - 2}}{N_{L - 1}}{N_L}\); accordingly, a feature vector of 4 × 4 × 4 = 64 dimensions (64D) is generated. The PseTNC is mathematically expressed as

$$S={\left[ {f(AAA),f(AAT),f(AAC),f(AAG), \ldots ,f(TTT)} \right]^T},$$
(8)
$$S={\left[ {f_{1}^{{3{\text{-tuple}}}},f_{2}^{{3{\text{-tuple}}}},f_{3}^{{3{\text{-tuple}}}},f_{4}^{{3{\text{-tuple}}}}, \ldots ,f_{{64}}^{{3{\text{-tuple}}}}} \right]^T},$$
(9)

where \(f_{1}^{{3{\text{-tuple}}}}\) = \(f(AAA)\) is the frequency of AAA component, \(f_{4}^{{3{\text{-tuple}}}}\) = \(f(AAG)\) is the frequency of AAG component, while \(f_{{64}}^{{3{\text{-tuple}}}}\) = \(f(TTT)\) is the frequency of TTT component in the sequence of DNA.
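The same sliding-window idea generalizes to any tuple length. The sketch below computes k-tuple frequencies and is used here with k = 3 to obtain a 64-D vector. The function name and normalization are assumptions, and the lexicographic component order over A, C, G, T is an illustrative convention that may differ from the ordering used in Eq. (8).

```python
from itertools import product

def k_tuple_composition(seq, k=3, alphabet="ACGT"):
    """Return (labels, frequencies) for all overlapping k-tuples in seq."""
    seq = seq.upper()
    labels = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {label: i for i, label in enumerate(labels)}
    counts = [0] * len(labels)
    windows = 0
    # Windows are N1..Nk, N2..N(k+1), ..., N(L-k+1)..N(L).
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # assumption: windows with non-ACGT symbols are skipped
            counts[j] += 1
            windows += 1
    if windows == 0:
        raise ValueError("sequence has no valid k-tuples")
    return labels, [c / windows for c in counts]

labels, tri = k_tuple_composition("AAAGT")  # triplets: AAA, AAG, AGT
```

Calling the same function with `k=2` reproduces a 16-D di-nucleotide vector, so one routine can serve both feature spaces.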

Framework of proposed predictor

In this research, a novel predictor, namely iNuc-ext-PseTNC, is introduced for the prediction of nucleosome positioning in genomes. Two feature extraction methods, PseDNC and PseTNC, are used for the numerical representation of DNA sequences. Three classifiers of different nature, namely K-nearest neighbor (KNN), probabilistic neural network (PNN), and support vector machine (SVM), are applied. The predicted outcomes of the individual classifiers are then fused to develop the ensemble model “iNuc-ext-PseTNC”. The developed model shows higher performance than the current state-of-the-art methods reported in the literature. The framework of the proposed ensemble prediction model is shown in Fig. 1.

Fig. 1 The framework of iNuc-GA-PseTNC

Classification algorithms

In pattern recognition and machine learning, classification is a supervised learning task in which a new observation is assigned to one of the pre-defined target classes on the basis of a training dataset. The process of classification is accomplished in two steps: training and testing. In the training step, the pattern of the pre-defined classes is learned from the provided data. In the testing step, a new observation is identified on the basis of the learned pattern. In this study, we applied the KNN, PNN, and SVM classification algorithms (Guo et al. 2014; Tahir and Hayat 2016; Hayat and Khan 2012; Kabir and Hayat 2016).
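To make the training and testing steps concrete, here is a minimal pure-Python sketch of two of the classifier families named above: a K-nearest-neighbor rule and a probabilistic neural network in its usual Parzen-window form with a Gaussian kernel. The function names, the toy data, and the parameter values are hypothetical; this is not the implementation used in the study, and SVM is omitted because it needs a numerical optimizer.

```python
import math
from collections import Counter

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_predict(train_x, train_y, query, k=3):
    """Label a query by majority vote among its k nearest training samples."""
    order = sorted(range(len(train_x)),
                   key=lambda i: squared_distance(train_x[i], query))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def pnn_predict(train_x, train_y, query, spread=1.0):
    """Label a query by the class with the largest mean Gaussian kernel value."""
    sums, sizes = Counter(), Counter()
    for x, y in zip(train_x, train_y):
        sums[y] += math.exp(-squared_distance(x, query) / (2.0 * spread ** 2))
        sizes[y] += 1
    return max(sums, key=lambda label: sums[label] / sizes[label])

# Hypothetical 2-D feature vectors: +1 = nucleosome forming, -1 = inhibiting.
train_x = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
train_y = [-1, -1, -1, 1, 1, 1]

knn_label = knn_predict(train_x, train_y, [0.95, 1.0], k=3)
pnn_label = pnn_predict(train_x, train_y, [0.05, 0.1], spread=0.5)
```

In both cases "training" amounts to storing the labeled samples; the `spread` argument plays the role of the kernel width tuned for PNN.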

Ensemble classification

In the last few decades, researchers have shifted their attention from individual classifiers to ensemble classification to reduce prediction error. Ensemble methods have been broadly used for signal peptide prediction (Chou and Shen 2007c), protein subcellular location prediction (Chou and Shen 2007a), enzyme subfamily prediction (Chou 2005), and subcellular location prediction (Chou and Shen 2007b; Zhang et al. 2015b, 2017; Li et al. 2016). During classification, the predictions of the individual classifiers vary and can yield different errors. However, when the predictions are merged, the classification errors are reduced because the error of one classifier is compensated by another (Hayat and Khan 2012; Zhang et al. 2012, 2015, 2016). Ensemble classification fuses the predictions of various classifiers and tries to minimize the variance introduced by the individual classifiers. In this study, the KNN, PNN, and SVM classifiers are used. First, each classifier is trained and its prediction is recorded. The predictions of the classifiers are then fused to develop the ensemble model (Kabir and Hayat 2016). It can be mathematically expressed as below:

$${\text{EnsC}}={\text{KNN}} \oplus {\text{SVM}} \oplus {\text{PNN.}}$$
(10)

In the above equation, the ensemble model is represented by EnsC and the symbol \(\oplus\) represents the combination operator.

$$\{ {C_1},{C_2},{C_3}\} \in \{ {S_1},{S_2}\} ,$$
(11)

where \({C_1}\), \({C_2}\) and \({C_3}\) are the individual classifiers; here \({S_1}\) and \({S_2}\) represent the two classes, nucleosome forming and nucleosome inhibiting, rather than the benchmark datasets.

$${Y_j}=\sum\limits_{{i=1}}^{3} {\delta ({C_i},{S_j})} ,\quad j=1,2,$$
(12)

where

$$\delta ({C_i},{S_j})=\left\{ \begin{gathered} 1\quad {\text{if}}\,{C_i} \in {S_j}, \hfill \\ 0\quad {\text{otherwise}}. \hfill \\ \end{gathered} \right.$$
(13)
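The vote count in Eqs. (12) and (13) can be sketched as below: each classifier contributes one vote to the class it predicts, and the class with the most votes is returned. The function name and the string labels are hypothetical and chosen only for readability.

```python
from collections import Counter

def majority_vote(predictions):
    """Return (winning label, vote tally) for one sample.

    predictions holds one predicted class label per classifier, so the tally
    for a class is the number of classifiers that assigned the sample to it.
    """
    tally = Counter(predictions)
    winner = tally.most_common(1)[0][0]
    return winner, dict(tally)

# Hypothetical outputs of KNN, PNN and SVM for a single sequence.
label, tally = majority_vote(["forming", "inhibiting", "forming"])
```

With three classifiers and two classes a tie cannot occur, which is one practical reason for using an odd number of base learners.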

The outcome of the GA-based ensemble model is generated as

$${\text{GAEnsC}}={\text{Max}}\{ {x_1}{y_1},{x_2}{y_2},{x_3}{y_3}\} ,$$
(14)

where GAEnsC is the outcome of the ensemble model, Max represents the maximum output, and \({x_1}\), \({x_2}\), and \({x_3}\) are the optimum weights of the individual classifiers.
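The text does not fully specify how the weights enter the final decision, so the following is only one possible reading of Eq. (14): each classifier's vote is scaled by its weight, the weighted votes are summed per class, and the class with the largest total wins. The function name and the weight values are hypothetical; in the study the weights would be found by the GA, which is not reproduced here.

```python
def weighted_vote(predictions, weights):
    """Return (winning label, per-class weighted score) for one sample."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    scores = {}
    for label, weight in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + weight
    # Assumption: the ensemble output is the class with the maximum score.
    return max(scores, key=scores.get), scores

# Hypothetical predictions from KNN, PNN and SVM with illustrative weights.
winner, scores = weighted_vote(
    ["inhibiting", "forming", "inhibiting"],
    [0.2, 0.7, 0.1],
)
```

In this toy case the single heavily weighted classifier overrides the unweighted majority, which is the behavior that distinguishes a weighted ensemble from simple voting.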

Metrics for measuring prediction performance

In statistical prediction, a fundamental task is the partition of the available data into training and testing subsets. In the literature, cross-validation tests are extensively applied to evaluate the quality and effectiveness of a developed model. Sub-sampling (K-fold), self-consistency, independent dataset, and jackknife tests are the common types of validation test. Here, a six-fold cross-validation test is applied, in which the data are divided into six folds; one fold is used for testing and the remaining folds are used for training. The process is repeated six times, so that each fold serves once as the test set, and the final outcome is the average over the six runs. The metrics for measuring the prediction performance are mathematically expressed as (Manavalan et al. 2018; Liu et al. 2015c, 2016a, 2017c, 2018; Hayat and Tahir 2015; Ahmad et al. 2017; Ehsan et al. 2018; Feng et al. 2018; Cheng et al. 2017a, b, c, d, 2018; Xiao et al. 2017, 2018)
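The six-fold procedure can be sketched with a small index generator. The function names, the fixed random seed, and the plain (non-stratified) shuffling are assumptions for illustration; the text does not state whether the folds were stratified by class.

```python
import random

def k_fold_indices(n_samples, k=6, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(n_samples, evaluate, k=6, seed=0):
    """Average evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)

splits = list(k_fold_indices(20, k=6))
# Toy evaluation function: report the test-fold size instead of an accuracy.
mean_test_size = cross_validate(12, lambda train, test: len(test))
```

In practice `evaluate` would train a classifier on the `train` indices and return its accuracy on the `test` indices; every sample is then tested exactly once.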

$${\text{Specificity}}=\frac{{{\text{TN}}}}{{{\text{FP}}+{\text{TN}}}} \times 100,$$
(15)
$${\text{Sensitivity}}=\frac{{{\text{TP}}}}{{{\text{FN}}+{\text{TP}}}} \times 100,$$
(16)
$${\text{Accuracy}}=\frac{{{\text{TN}}+{\text{TP}}}}{{{\text{FP}}+{\text{TN}}+{\text{TP}}+{\text{FN}}}} \times 100,$$
(17)
$${\text{MCC}}=\frac{{{\text{TN}} \times {\text{TP}} - {\text{FN}} \times {\text{FP}}}}{{\sqrt {\left( {{\text{TN}}+{\text{FP}}} \right)\left( {{\text{TP}}+{\text{FN}}} \right)\left( {{\text{TN}}+{\text{FN}}} \right)\left( {{\text{TP}}+{\text{FP}}} \right)} }}.$$
(18)

Equations (15–18) are widely used to evaluate the predictions of classifiers; however, in some cases these equations are not intuitive for biologists. In this study, we have used the following equations to address this issue (Schwartz et al. 2009; Xu et al. 2013a, 2014; Chou 2001b; Chen et al. 2013a, 2016, 2017; Lin et al. 2014; Jia et al. 2016; Zhang et al. 2016; Liu et al. 2016c, 2017a,b; Feng et al. 2017):

$${\text{Sensitivity}}=1 - \frac{{Z_{ - }^{+}}}{{{Z^+}}},$$
(19)
$${\text{Specificity}}=1 - \frac{{Z_{+}^{ - }}}{{{Z^ - }}},$$
(20)
$${\text{Accuracy}}=1 - \frac{{Z_{+}^{ - }+Z_{ - }^{+}}}{{{Z^ - }+{Z^+}}},$$
(21)
$${\text{MCC}}=\frac{{1 - \left( {\frac{{Z_{ - }^{+}}}{{{Z^+}}}+\frac{{Z_{+}^{ - }}}{{{Z^ - }}}} \right)}}{{\sqrt {\left( {1+\frac{{Z_{+}^{ - } - Z_{ - }^{+}}}{{{Z^+}}}} \right)\left( {1+\frac{{Z_{ - }^{+} - Z_{+}^{ - }}}{{{Z^ - }}}} \right)} }}.$$
(22)

In the above equations, \({Z^ - }\) denotes the total number of true nucleosome-inhibiting samples and \({Z^+}\) the total number of true nucleosome-forming samples, whereas \(Z_{+}^{ - }\) represents the number of nucleosome-inhibiting samples incorrectly predicted as nucleosome forming and \(Z_{ - }^{+}\) the number of nucleosome-forming samples incorrectly predicted as nucleosome inhibiting.
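The definitions above map onto the usual confusion-matrix counts as TP = Z⁺ − Z⁺₋, FN = Z⁺₋, TN = Z⁻ − Z⁻₊ and FP = Z⁻₊. The sketch below computes sensitivity, specificity and accuracy directly in the Z notation and obtains MCC by applying Eq. (18) to the mapped counts. The function and argument names are hypothetical, and the counts are illustrative.

```python
import math

def metrics_from_z(z_pos, z_neg, z_pos_missed, z_neg_missed):
    """Metrics in Z notation (fractions, not percentages).

    z_pos, z_neg : totals of true forming / true inhibiting samples
    z_pos_missed : forming samples wrongly predicted as inhibiting
    z_neg_missed : inhibiting samples wrongly predicted as forming
    """
    sensitivity = 1.0 - z_pos_missed / z_pos
    specificity = 1.0 - z_neg_missed / z_neg
    accuracy = 1.0 - (z_neg_missed + z_pos_missed) / (z_neg + z_pos)
    # Map to confusion-matrix counts and reuse the classical MCC of Eq. (18).
    tp, fn = z_pos - z_pos_missed, z_pos_missed
    tn, fp = z_neg - z_neg_missed, z_neg_missed
    denominator = math.sqrt((tn + fp) * (tp + fn) * (tn + fn) * (tp + fp))
    mcc = (tn * tp - fn * fp) / denominator if denominator else 0.0
    return sensitivity, specificity, accuracy, mcc

# Hypothetical counts: 100 forming and 100 inhibiting samples.
sn, sp, acc, mcc = metrics_from_z(100, 100, 10, 20)
```

Apart from the factor of 100, the first three values agree with Eqs. (15) to (17) for the same confusion matrix, which is a quick consistency check when switching between the two notations.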

Results

The success rates of two feature spaces are empirically analyzed and performance comparisons have been drawn as well.

Performance comparison of classifiers using PseDNC feature space

Tables 1, 2 and 3 present the experimental results of the individual and ensemble classifiers for the three datasets \({S_1}\), \({S_2}\), and \({S_3}\). Among the individual classifiers, PNN obtained the best result for dataset \({S_1}\) at spread = 4.51, SVM yielded the highest outcomes for dataset \({S_2}\) with cost c = 1.33 and gamma \(g\) = 0.0025, and PNN again achieved the best result for dataset \({S_3}\) at spread = 2. The predictions of the individual classifiers (learner hypotheses) were then combined through the GA optimization technique. The GA-based ensemble model achieved better outcomes than the individual classifiers. Accuracy, specificity, sensitivity, and MCC are used to illustrate the strength of GAEnsC. The accuracy of GAEnsC using PseDNC is shown in Figs. 2, 3 and 4.

Table 1 Success rates of classification algorithms on PseDNC and PseTNC using dataset S1
Table 2 Success rates of classification algorithms on PseDNC and PseTNC using dataset S2
Table 3 Success rates of classification algorithms on PseDNC and PseTNC using dataset S3
Fig. 2 The performance of GAEnsC PseDNC using S1
Fig. 3 The performance of GAEnsC PseDNC using S2
Fig. 4 The performance of GAEnsC PseDNC using S3

Performance comparison of classifiers using PseTNC feature space

Tables 1, 2 and 3 show the experimental results of the individual and ensemble classifiers for the three datasets \({S_1}\), \({S_2}\), and \({S_3}\) using the PseTNC feature space. SVM obtained promising results for all three datasets \({S_1}\), \({S_2}\), and \({S_3}\) with cost c = 1.25 and gamma g = 0.0035. The success rate of the GA-based ensemble model is higher than that of the individual classifiers. The accuracy of GAEnsC using the PseTNC feature space is illustrated in Figs. 5, 6 and 7.

Fig. 5 The performance of GAEnsC PseTNC using S1
Fig. 6 The performance of GAEnsC PseTNC using S2
Fig. 7 The performance of GAEnsC PseTNC using S3

Performance comparison with other methods

Our proposed predictor is also compared with existing methods, namely 3LS (Awazu 2017), iNuc-STNC (Tahir and Hayat 2016), and iNuc-PseKNC (Guo et al. 2014), on the same benchmark datasets. Table 4 shows that the proposed iNuc-ext-PseTNC model obtained better outcomes than the existing methods. The experimental outcomes indicate that the success rates of the GA-based ensemble model are higher. This success is attributed to the optimization-based ensemble classification and the highly discriminative features of PseTNC.

Table 4 Comparison of the iNuc-ext-PseTNC predictor with other methods

Discussion

In this article, a predictor, “iNuc-ext-PseTNC”, is proposed for the identification of nucleosome positioning. The patterns are extracted from DNA sequences using PseDNC and PseTNC. Contemporary machine learning algorithms are applied to identify nucleosome positioning in genomes. The empirical results showed that the pairs of two nucleotides (PseDNC) did not discern the pattern of nucleosome positioning as clearly as the triplets of nucleotides (PseTNC). This suggests that sequence order information is significant in identifying the motif of nucleosome positioning in genomes. Despite the substantial results of SVM in combination with the PseTNC feature space, the desired outcomes were not achieved. To obtain them, the notion of ensemble classification was introduced. The ensemble process is carried out through a bio-inspired evolutionary approach, the genetic algorithm (GA). After combining the predicted outcomes of the individual learners through GA, the results obtained are higher not only than those of the individual learners but also than those of the existing models in the literature.

Conclusion

In this study, the iNuc-ext-PseTNC predictor is proposed for the prediction of nucleosome positioning in genomes. In this predictor, two discrete feature extraction methods, namely PseDNC and PseTNC, are used for the formulation of DNA sequences. The extracted feature spaces are provided to different classifiers, namely KNN, SVM, and PNN, to learn the pattern of nucleosome positioning in genomes. After analyzing the success rates of the individual prediction models, the results of the single classifiers are fused through the GA optimization approach. The GA-based ensemble predictor achieved better outcomes than the individual classifiers. This success is attributed to the highly discriminative features of PseTNC and the GA-based optimization method. The “iNuc-ext-PseTNC” model might be helpful in drug-related applications. As demonstrated in several recent papers (Guo et al. 2014; Liu et al. 2015a, c; Lin et al. 2014; Levitsky 2004; Chen et al. 2015), user-friendly and publicly accessible web servers represent the future direction for developing practically more useful models. Therefore, we shall make efforts in our future work to provide a web server for the computational method presented in this paper, since doing so will significantly enhance its impact, as discussed in two comprehensive review papers (Chou 2015, 2017).