1 Introduction

Nucleosomes are the basic unit of eukaryotic chromatin, and each one is constructed by a histone octamer that wrapped tightly by a DNA sequence with 147 base pair (bp). Adjacent nucleosomes are connected by linker DNA. As shown in Fig. 1, the histone octamer is constructed by two-molecular H2A, H2B, H3 and H4. Due to the influence of H1 histone, nucleosomes form a stable structure. Position of nucleosome is related to various biological processes, such as DNA replication and RNA splicing (Yasuda et al. 2005; Berbenetz et al. 2010). Besides, position of nucleosome is always dynamically changing in these processes. Therefore, nucleosome positioning is necessary to have an in-depth understanding about biological processes.

Fig. 1
figure 1

Nucleosomes are the basic unit of eukaryotic chromatin, and each one is constructed by a histone octamer that wrapped tightly by a DNA sequence with 147 base pair (bp)

Earlier, Satchwell et al. (1986) found a 10-bp interval repetition of AA/TT/TA that appeared in the region of core DNA. Besides, some sequences of the core DNA appeared periodically (Ioshikhes et al. 1996, 2006). Afterward, it was found that nucleosome deficiency appeared in poly (dA:dT) fragments (Segal and Widom 2009).The above findings were considered as important nucleosome positioning signal and indicated that nucleosome positioning was sequence dependent to some extent.

With the development of high-throughput techniques, high-resolution nucleosome positioning maps of many species have been obtained (Lee et al. 2007; Schones et al. 2008). Therefore, various methods and tools of nucleosome positioning were proposed. Many predictors were constructed based on frequencies information of nucleotide sequences combinations. Ioshikhes et al. (2006) analyzed characteristics of TATA box that occurred in core DNA and linker DNA, respectively and then applied these characteristics to nucleosome positioning. Peckham et al. (2007) predicted position of nucleosome based on the characteristics of gene sequences in promoter regions. Afterward, Kaplan et al. (2009) analyzed DNA encode of nucleosome organization in eukaryotic genome and then predicted position of nucleosome based on it. Afterward, Xi et al. (2010) used hidden markov model (HMM) in nucleosome positioning. Polishko et al. (2012) applied a modified Gaussian mixture model to nucleosome positioning. Struhl and Segal (2013) predicted position of nucleosome based on characteristics of gene sequences. Besides, Freeman et al. (2014) employed molecular models of DNA and proteins to elucidate various aspects of nucleosome positioning. Recently, Chen et al. (2016) have used deformation energy to analyze nucleosome positioning in saccharomyces cerevisiae genomes and Tahir and Hayat (2016) constructed a sequence predictor iNuc-STNC for nucleosome positioning. Core DNA sequences can be identified based on the above model; however, the key factors influencing nucleosome positioning remained unclear.

Afterward, Awazu (2017) constructed a linear regression model based on incorporation of frequencies and distributions for nucleotide sequences with different length, and applied it to nucleosome positioning. The position of nucleosome and the key factors influencing nucleosome positioning can be determined. However, the variables of model were chosen based on the stepwise forward selection method, which took a lot of time. Zhang et al. (2018a, b) studied nucleosome positioning using improved convolutional neural networks. Compared with other methods, Acc obtained by this method was higher than that obtained by other proposed method. However, this method was based on deep learning, which took a lot of time and required higher technical support.

Besides, combining DNA sequence information and DNA structural properties, some predictors were proposed. For example, based on the information of DNA sequences and physical structure, Chen et al. (2012) and Guo et al. (2014) proposed predictors iNuc-PhysChem and iNuc-PseKNC, respectively. Furthermore, Flores and Orozco (2011), Tolstorukov et al. (2008) and Woo et al. (2013) developed some tools for nucleosome positioning.

Information entropy is an abstract concept that is often used to measure the degree of confusion of a system. It includes relative entropy, cross-entropy and mutual information and is widely used in various fields (Zhang and Wu 2008, 2011; Yudong et al. 2015). Relative Entropy (RE) is an important method that described dissimilarity between two different probability distributions, which is widely used in various fields.

Based on RE, a new distance measure method to distinguish different gene sequences was proposed (Benson 2002; Vernikos and Parkhill 2006). Besides, Magliery and Regan (2005) applied RE to identify unconceived hypervariable positions, and Wang and Samudrala (2006) applied RE to search for conserved positions of gene sequences. Vacic et al. (2007) developed a tool for discovery and visualization differences of amino acid composition. Afterward, Astrovskaya et al. (2011) proposed a method to infer viral quasispecies spectra and utilized relative entropy to measure prediction quality. Chen and Zhou (2012) proposed a relative entropy approach in group decision making. Besides, Beigi and Gohari (2014) achieved quantum achievability proof via collision relative entropy. Gibb and Strimmer (2015) proposed an approach for identifying differentially expressed proteins using binary discriminant analysis based on relative entropy. Recently, Sarosi and Ugajin (2016) have studied the relative entropy and the trace square distance in two-dimensional conformal field theories. Shao et al. (2017) studied quantum coherence quantifiers based on α relative entropy.

Many predictors were proposed for nucleosome positioning in various species. However, the sequence predicted as core DNA sequences in one species may be predicted as linker DNA sequences in another species, which showed that function of the gene sequence in assisting or inhibiting nucleosome formation was dependent on species. Besides, it took a lot of time to find the key factors influencing nucleosome positioning. Furthermore, biological common and unique characteristics in nucleosome positioning were ignored in these methods. Therefore, GRE method was proposed to extract information of DNA sequences; meanwhile, the key factors influencing nucleosome positioning were determined based on random forest (RF).

The rest of paper was organized as follows. Section 2 introduces materials and methods, including sources of data and construction of benchmark datasets, generalized relative entropy, construction processes of prediction model and evaluation metrics of model performance; Sect. 3 introduces detailed experimental processes and analyzed experimental results; Sect. 4 summarizes the contents above.

2 Materials and methods

2.1 Benchmark datasets of core DNA and linker DNA sequences

In this paper, benchmark datasets of human, worm, fly were constructed by Guo et al. (2014) and benchmark dataset of yeast was constructed by Chen et al. (2015). In order to construct benchmark datasets, entire genome data and nucleosome positioning data were provided. Detailedly, benchmark dataset of human was constructed by Guo et al. (2014), entire genome sequences were available at UCSC genome database (http://hgdownload.cse.ucsc.edu), where 18hg version was used for human genome. Due to huge data, chromosome 20 was extracted as entire genome data of human (Liu et al. 2011). Nucleosome positioning data were from Schones et al. (2008) Detailedly, data were available from http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/hgtcellnucleosomes.aspx. Each of DNA fragments was assigned a nucleosome formation score to reflect its propensity to form nucleosome. The higher the score was, the more likely the fragment would be in forming a nucleosome. Thus, those fragments with the highest scores were chosen as core DNA sequences, while those with the lowest scores were chosen as linker DNA sequences. In order to eliminate the influence of redundant data on experimental results, the CD-HIT software was used to remove data with high sequences similarity and the threshold was set at 80% (Fu et al. 2012). Finally, benchmark dataset of human was obtained.

Similarly, benchmark datasets of worm and fly were constructed by Guo et al. (2014). Entire genome sequences of worm and fly were downloaded from UCSC database (http://hgdownload.cse.ucsc.edu), WS 170/ ce 4 version and BDGP Release 5 version were used for worm and fly genomes, respectively. Compared nucleosome positioning data were available at http://hgdownload.cse.ucsc.edu and http://atlas.bx.psu.edu, respectively. Detailedly, nucleosome positioning data of fly were from Mavrich et al. (2008). Using the same strategy as human, benchmark datasets of these species were obtained.

Besides, Chen et al. (2015) constructed benchmark dataset of yeast. Entire genome sequences of yeast were downloaded from http://www.yeastgenome.org/ and compared nucleosome positioning data came from Lee et al. (2007). With the same strategy, benchmark dataset of yeast was obtained.

Benchmark datasets are defined as Eq. 1

$$ S_{k} = S_{k}^{ + } + S_{k}^{ - } $$
(1)

where k ranges from 1 to 4, and \( S_{1} \), \( S_{2} \), \( S_{3} \) and \( S_{4} \) represent human, worm, fly and yeast benchmark datasets, respectively. Dataset \( S_{1}^{ + } \) involves 2273 core DNA sequences, and \( S_{1}^{ - } \) involves 2300 linker DNA sequences; dataset \( S_{2}^{ + } \) involves 2567 core DNA sequences, and \( S_{2}^{ - } \) involves 2608 linker DNA sequences; dataset \( S_{3}^{ + } \) involves 2900 core DNA sequences, and \( S_{3}^{ - } \) involves 2850 linker DNA sequences; dataset \( S_{4}^{ + } \) involves 1880 core DNA sequences, and \( S_{4}^{ - } \) involves 1740 linker DNA sequences. The sequence length of human, worm and fly genomes is 147 bp, and sequence length of yeast genome is 150 bp. Benchmark datasets of \( S_{1} \),\( S_{2} \) and \( S_{3} \) are given in supplementary data of references Guo et al. (2014), and benchmark dataset of \( S_{4} \) is given in supporting information of references Chen et al. (2016).

2.2 Generalized relative entropy

In probability theory, relative entropy is used to measure dissimilarity for two kinds of distributions. The smaller the relative entropy is, the more similar the two kinds of distribution are. In particular, the relative entropy of two identical distributions is zero. Relative entropy of discrete random variable is defined as Eq. 2

$$ {\text{RE}}\left( {X,Y} \right) = \sum\limits_{i = 1}^{s} {p_{x} \left( i \right)} \cdot \log \frac{{p_{x} \left( i \right)}}{{p_{y} \left( i \right)}} $$
(2)

where \( p_{x} \left( i \right) \) and \( p_{y} \left( i \right) \) denote two kinds of discrete probability distributions, \( s \) denotes the number of states in the state space, \( i \) is a random variable in the state space, and the value of \( i \) ranges from 1 to \( s \).

Relative entropy is not a distance metric, and it does not a finite upper bound. To measure the differences of two vectors in the high-dimensional space, GRE is proposed. It is defined as Eq. 3

$$ \begin{aligned} d\left( {X,Y} \right) & = \sum\limits_{i = 1}^{s} {\left( {p_{x} \left( i \right) \cdot \log \frac{{k \cdot p_{x} \left( i \right)}}{{\left( {k - 1} \right)p_{x} \left( i \right) + p_{y} \left( i \right)}}} \right)} \\ & \quad + \sum\limits_{i = 1}^{s} {\left( {p_{y} \left( i \right) \cdot \log \frac{{k \cdot p_{y} \left( i \right)}}{{p_{x} \left( i \right) + \left( {k - 1} \right)p_{y} \left( i \right)}}} \right)} \\ & \quad + r \cdot \log \left( {1 + \frac{1}{k - 1}} \right)^{2} \\ \end{aligned} $$
(3)

where \( k \ge 1,r = 0 \) when \( X = Y \); otherwise, \( r = 1 \) when \( X \ne Y \). Parameter k denotes the weight of the probability of occurrence of each state. The meaning of i and s is same as that in Eq. 2.

The Markov model is a statistical model that can be used to describe the Markov process. The generation of DNA sequences can be seen as a Markov process. Assuming that the sequence that appeared in the next state depends on the previous state, here we use a first-order 4 Short form of author list Markov chain to represent the DNA sequence. Therefore, the generation probability of DNA sequence can be expressed by Eq. 4

$$ p_{n} = p\left( {s_{0} } \right) * a_{ij}^{n - 1} $$
(4)

where \( p_{n} \) represents generation probability of DNA sequence with length n, s0 represents initial state, \( p\left( {s_{0} } \right) \) represents occurrence probability of initial state, and n represents sequence length of DNA. A single-nucleotide A C G T is represented by 1 2 3 4, respectively. i and j are two random variables, and the values range from 1 to 4. Therefore, \( a_{ij} \) represents state transition probability between different nucleotides.

Assuming that the length of the DNA sequence is m and m is far less than n, the generation probability of DNA sequence can be expressed by Eq. 5

$$ p_{m} = t * p\left( {s_{0} } \right) * a_{ij}^{m - 1} $$
(5)

where t belongs to Q and Q represents rational number. If DNA sequences have self-similarity, there must be a constant that satisfies Eq. 6

$$ \frac{{p_{n} }}{{p_{m} }} = k $$
(6)

where k is a constant. Because the ratio of \( p_{n} \) to \( p_{m} \) is \( a_{ij}^{n - m} /t \), DNA sequences satisfy self-similarity. Based on self-similarity of DNA sequences and GRE method, core DNA sequences can be identified.

2.3 Prediction model

In order to predict nucleosome positioning, generalized relative entropy (GRE) was proposed. The model was constructed by the following steps.

Step 1 Calculate all types of di-nucleotide frequency in core DNA sequences and obtain their distribution \( X_{2} = \left\{ {x_{1} ,x_{2} , \ldots ,x_{16} } \right\} \), and \( x_{i} \) denotes frequency of the ith di-nucleotide. Using the same methodology, calculate all types of di-nucleotide frequency in linker DNA sequences and obtain their distribution \( X_{2}^{\prime } = \left\{ {x_{1}^{\prime } ,x_{2}^{\prime } , \ldots ,x_{16}^{\prime } } \right\} \).

Step 2 Similarly, calculate all types of di-nucleotide frequency in all core DNA sequences and linker DNA sequences and then obtain their distribution \( Y_{2} = \left\{ {y_{1} ,y_{2} , \ldots ,y_{16} } \right\} \).

Step 3 Calculate generalized relative entropy of each DNA sequence according to Eq. 3, which constructs two-dimensional feature vectors \( d_{2} = {\text{GRE}}\left( {d_{2}^{ + } ,d_{2}^{ - } } \right) \), where \( d_{2}^{ + } = d\left( {X_{2} ,Y_{2} } \right) \) and \( d_{2}^{ - } = d\left( {X_{2}^{\prime } ,Y_{2} } \right) \).

Step 4 Following steps 1 to 3, calculate the generalized relative entropy of other types of DNA fragments, of which trinucleotide, four-nucleotide, five-nucleotide and six-nucleotide. It constructs ten-dimensional feature vectors \( d_{10} = {\text{GRE}}\left( {d_{2}^{ + } ,d_{2}^{ - } ,d_{3}^{ + } ,d_{3}^{ - } ,d_{4}^{ + } ,d_{4}^{ - } ,d_{5}^{ + } ,d_{5}^{ - } ,d_{6}^{ + } ,d_{6}^{ - } } \right) \).

Step 5 Calculate classification importance score based on random forest and search for crucial feature vectors in nucleosome positioning according to classification importance score.

Step 6 Put those feature vectors obtained in step 5 into SVM to recognize core DNA sequences.

Step 7 Use jackknife test and tenfold cross-validation to examine prediction performance of GRE method.

The detailed processing process of nucleosome positioning is shown in Fig. 2.

Fig. 2
figure 2

Nucleosome positioning flowchart

2.4 Machine learning algorithm

Machine learning (ML) is a technology that studies the process of making machines intelligent through learning of past experience. In recent years, due to the development of artificial intelligence (AI), machine learning algorithms have been widely used (Zhang et al. 2015; Petralia et al. 2015; Ide et al. 2016; Lin et al. 2017; Ismail et al. 2017; Karlekar and Gomathi 2018; Sinoquet 2018).

BP neural network can solve different classification problems by simulating the structure of human brain. However, because BP neural network is an optimization method of local search, the algorithm is easy to fall into the local optimal solution, which can lead to training failure (Meng et al. 2018; Zhang et al. 2017).

Support vector machine (SVM) is a method for both linear and nonlinear data classifications, and the main idea of SVM is to map nonlinear data in a low-dimensional space into a high-dimensional space and then search for the optimal hyperplane using a kernel function to separate data in one class from another class. It has been widely used in the field of bioinformatics (Bhasin and Raghava 2004; Wan et al. 2013).

Random forest (RF) is an important class cation algorithm which can separate two different types of data; meanwhile, it provides the importance scores of feature attributes for classification. It has been widely used in the field of bioinformatics (Petralia et al. 2015; Ide et al. 2016; Zhang et al. 2016; Rahman et al. 2017; Taherzadeh et al. 2017; Ismail et al. 2017; Sinoquet 2018; Fabris et al. 2018).

Therefore, in this paper, due to its powerful nonlinear mapping capabilities, SVM was selected to be a classifier to distinguish core DNA and linker DNA. Meanwhile, RF was used to search for key factors in nucleosome positioning because it can provide the importance scores of feature attributes for classification.

Besides, the software package LIBSVM 3.22 was used to be as an implementation of SVM, which was available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/. Radial basis function was selected as the kernel function. The parameters c and g in the training model were determined by a grid search approach.

2.5 Metrics for performance evaluation

In this paper, prediction performance of GRE model was evaluated by jackknife test and tenfold cross-validation. Detailedly, jackknife test was used to examine prediction performance of GRE model in human, worm and fly genomes. Meanwhile, tenfold cross-validation was used to examine prediction performance of GRE model in yeast genome. These methods were widely applied to evaluate prediction quality of the previously proposed model (Chen et al. 2012; Guo et al. 2014; Chen et al. 2016; Tahir and Hayat 2016; Awazu 2017).

Here, Eqs. 710 are defined to evaluate performance of model. Detailedly, TP is defined as the number that core DNA sequences correctly predicted core DNA sequences, FP is defined as the number that linker DNA sequences incorrectly predicted core DNA sequences, FN is defined as the number that core DNA sequences incorrectly predicted linker DNA sequences, and TN is defined as the number that linker DNA sequences correctly predicted linker DNA sequences. Using jackknife test and tenfold cross-validation, the following metrics are obtained:

$$ S_{n} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}} $$
(7)
$$ S_{p} = \frac{\text{TN}}{{{\text{TN}} + {\text{FP}}}} $$
(8)
$$ {\text{Acc}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FP}} + {\text{FN}} + {\text{TN}}}} $$
(9)
$$ {\text{Mcc}} = \frac{{{\text{TP}} * {\text{TN}} - {\text{FP}} * {\text{FN}}}}{{\sqrt {\left( {{\text{TP}} + {\text{FN}}} \right)\left( {{\text{TP}} + {\text{FP}}} \right)\left( {{\text{TN}} + {\text{FN}}} \right)\left( {{\text{TN}} + {\text{FP}}} \right)} }} $$
(10)

where \( S_{n} ,S_{p} ,{\text{Acc}} \) and \( {\text{Mcc}} \) represent sensitive, specificity, accuracy and Mathew’s correlation coefficient, respectively.

3 Results and discussion

3.1 Parameter optimization

As shown in Eq. 3, our proposed method is based on a parameter k, where k is the weight factor and reflects distributions of core DNA sequences and linker DNA sequences in the entire genome. Furthermore, an appropriate k can reflect distributions of core DNA and linker DNA in high accuracy so that we can separate core DNA from linker DNA. Thus, searching for the optimal values of parameter k is necessary.

In order to record results of search, \( k_{1}^{*} \),\( k_{2}^{*} \) and \( k_{3}^{*} \) were defined as the optimal parameter values obtained in the first, second and third parameter optimizations, respectively, and \( k^{*} \) was defined as the optimal parameter values used to construct GRE model. Due to \( k \ge 1 \), in order to obtain the optimal parameter values and, meanwhile, reduce computational time, the following strategy was adopted.

Step 1 Search for the optimal parameter values according to Eq. 11:

$$ \begin{array}{*{20}l} {2 \le k \le 10,} \hfill & {{\text{with}}\;{\text{step}}\;\Delta = 1} \hfill \\ {10 \le k \le 200,} \hfill & {{\text{with}}\;{\text{step}}\;\Delta = 10} \hfill \\ \end{array} $$
(11)

Step 2 Based on Eq. 11, \( k_{1}^{*} \) was obtained. Then, the optimal parameter values were searched according to Eq. 12.

$$ \begin{aligned} 2 \le k_{1}^{*} \le 10,\quad {\text{with}}\;{\text{step}}\;\Delta = 0.1 \hfill \\ 10 \le k_{1}^{*} \le 200,\quad {\text{with}}\;{\text{step}}\;\Delta = 1 \hfill \\ \end{aligned} $$
(12)

Step 3 Based on Eq. 12, \( k_{2}^{*} \) was obtained. Then, the optimal parameter values were searched according to Eq. 13.

$$ \begin{array}{*{20}l} {2 \le k_{2}^{*} \le 10,} \hfill & {{\text{with}}\;{\text{step}}\;\Delta = 0} \hfill \\ {10 \le k_{2}^{*} \le 200,} \hfill & {{\text{with}}\;{\text{step}}\;\Delta = 0.1} \hfill \\ \end{array} $$
(13)

In Eq. 13, step with 0 meant that we stopped search and obtained the optimal parameter values \( k^{*} \).

In this paper, jackknife test was used to search for the optimal parameter values in human, worm and fly genomes, respectively. Similarly, tenfold cross-validation was used to determine the optimal parameter values in yeast genome. In our experiment, prediction accuracy of fly genome declined obviously when parameter k ranged from 110 to 200. Thus, in order to reduce computational time, optimal parameter values were searched in a range of 2 to 100. Besides, prediction accuracy of yeast genome reached 1 when parameter k was equal to 60, 70, 80, 90 and 100. Therefore, the optimal parameter values of yeast genome were obtained. As for human and worm genomes, the optimal parameter values were searched in a range of 2–200.

Detailedly, the optimal parameter values were determined by the method “shortening the interval length gradually” and precision of the optimal parameter values was set at 0.1. In the processes of searching for the optimal parameter values, search was terminated when prediction accuracy reached 1 or precision of the optimal parameter values reached 0.1. The processes of parameter optimization in human genome were as follows.

Step 1 Calculate prediction accuracy when parameter k varied from 2 to 200, respectively. Then, choose that parameter k with the highest prediction accuracy as \( k_{1}^{*} \). As shown in Table 1, \( k_{1}^{*} = 190 \).

Table 1 The optimal parameter values in four species

Step 2 Search for those values adjacent to \( k_{1}^{*} \) and choose smaller value as starting point of the second parameter optimization and bigger value as terminal point of the second parameter optimization. Thus, the range of the second parameter optimization was determined. Then, prediction accuracy was calculated when parameter k varied from 180 to 200 with a step of 1. Among these k values, the parameter k with the highest prediction accuracy was chosen \( k_{2}^{*} \), where \( k_{2}^{*} = 184 \).

Step 3 Determine the range of parameter optimization using the same way as step 2. Then, search for the optimal parameter values in the scope between 183 and 185 with a step of 0.1. Compared with different results obtained by different parameters k, the parameter k with the highest prediction accuracy was chosen as \( k_{3}^{*} \), where \( k_{3}^{*} = 184.0 \). The precision of the optimal parameter value reached 0.1. Hence, the parameter value used to construct GRE model was obtained, where \( k^{*} = k_{3}^{*} = 184.0 \).

Using the same strategy, the optimal parameter values of worm, fly and yeast genomes were obtained. As shown in Table 1, the optimal parameter values of four species were obtained (\( k^{ * } = 184.0, = 145.8, = 8.5 \) and \( = 60, = 70, = 80, = 90, = 100 \) for human, worm, fly and yeast genomes, respectively). In the second parameter optimization, the precision of the optimal parameter value in fly genome reached 0.1. Therefore, the optimal parameter value in GRE model was obtained. Besides, prediction accuracy of yeast genome reached 1, when k was equal to 60, 70, 80, 90 and 100. As for yeast genome, the optimal parameter values were obtained in the first parameter optimization. The parameter optimization processes and the optimal parameter values of four species are shown in Figs. 3, 4, 5, 6 and Table 1, respectively. As shown in Table 1, \( k_{1}^{*} \),\( k_{2}^{*} \) and \( k_{3}^{*} \) represented the optimal parameter values obtained in the first, second, third parameter optimizations, respectively. \( k^{ * } \) was the optimal parameter value in RGE model. “-” represented null value. As for fly genome, the optimal parameter value was obtained in the second parameter optimization. Therefore, the value of \( k_{3}^{*} \) was null value. Similarly, the optimal parameter values of yeast genome were obtained in the first parameter optimization. Therefore, both \( k_{2}^{*} \) and \( k_{3}^{*} \) were null value.

Fig. 3
figure 3

Parameter optimization processes in human genome

Fig. 4
figure 4

Parameter optimization processes in worm genome

Fig. 5
figure 5

Parameter optimization processes in fly genome

Fig. 6
figure 6

Parameter optimization processes in yeast genome

In order to analyze the effect of k values on the accuracy of SVM in different species, Table 2 is shown. In Table 2, maximum and minimum were used to represent Acc under the best case and the worst case, respectively. Meanwhile, \( k_{\hbox{max} } \) and \( k_{\hbox{min} } \) were used to represent k values corresponding to the maximum accuracy Nucleosome positioning based on generalized relative entropy 7 and the minimum accuracy, respectively. – represented null value.

Table 2 The effect of k value on accuracy in different species

In order to analyze the relation between parameter k and distribution of core DNA and linker DNA, Figs. 7, 8, 9, 10 and 11 were provided. X-axis represented DNA sequences. Detailedly, values of X-axis represented core DNA sequences when \( 0 \le x \le 1779 \), and values of X-axis represented linker DNA sequences when \( 1880 \le x \le 3619 \). Y-axis represented the feature values obtained by GRE model. For simplicity, 2+, 2−, 3+, 3−, 4+, 4−, 5+, 5−, 6+ and 6− in legend denoted positive di-nucleotide, negative di-nucleotide, positive trinucleotide, negative trinucleotide, positive four-nucleotide, negative four-nucleotide, positive five-nucleotide, negative five-nucleotide, positive six-nucleotide and negative six-nucleotide, respectively. As shown in Figs. 7, 8, 9, 10 and 11, distributions of feature values obtained by positive nucleotide sequences and negative nucleotide sequences were different in core DNA and linker DNA. Furthermore, the feature values obtained by positive nucleotide sequences were smaller than those obtained by negative nucleotide sequences in core DNA, while the feature values obtained by negative nucleotide sequences were smaller than those obtained by positive nucleotide sequences in linker DNA. Thus, distributions of feature values in core DNA and linker DNA were different. Based on these, core DNA sequences can be recognized in high accuracy. The above results indicated that prediction results obtained by GRE model were consistent with the real distributions of core DNA and linker DNA in yeast genome when the optimal parameter values were equal to 60, 70, 80, 90 and 100. Therefore, the optimal parameter values can reflect real distribution of core DNA and linker DNA to some extent.

Fig. 7
figure 7

Feature values distribution (k = 60)

Fig. 8
figure 8

Feature values distribution (k = 70)

Fig. 9
figure 9

Feature values distribution (k = 80)

Fig. 10
figure 10

Feature values distribution (k = 90)

Fig. 11
figure 11

Feature values distribution (k = 100)

3.2 Feature selection

Random forest was an important classification algorithm. It can separate two different types of data and, meanwhile, provide the importance of feature attributes for classification. Therefore, random forest was used to find the crucial factors in nucleosome positioning. First, two parameters needed to be set, including the number of trees in forest and the number of node split attributes. In this paper, the number of trees in forest was set as 50, 100,150, 200, 250, 300, 350, 400, 450 and 500, respectively, because ten-dimensional feature vectors were obtained based on GRE method and the number of node split attributes was generally set as root mean square of the number of classification attribute. Therefore, the number of node split attributes was set as 3. Next, under the optimal parameters of GRE model, calculate the importance score of ten feature vectors when the number of trees in forest was set as 50, 100, 150, 200, 250, 300, 350, 400, 450 and 500, respectively. The feature vectors whose score exceeded 100 are shown in Table 3. In Table 3, v1, v2, v3, v4, v5, v6, v7, v8, v9 and v10 presented those feature vectors obtained by positive di-nucleotide, negative di-nucleotide, positive trinucleotide, negative trinucleotide, positive four-nucleotide, negative four nucleotides, positive five-nucleotide, negative five-nucleotide, positive six-nucleotide and negative six-nucleotide, respectively. Then, set threshold of score as 100 to search for those feature vectors whose score threshold exceeded 100. Finally, search for the common feature vectors whose score threshold exceeded 100 in four species. Those feature vectors were considered as the key factors affecting nucleosome positioning. As shown in Table 3, v10, v9, v8, v5 and v7 were the common feature vectors of four species. Therefore, positive four-nucleotide, positive five-nucleotide, negative five-nucleotide, positive six-nucleotide and negative six-nucleotide sequences were considered as the key factors affecting nucleosome positioning.

Table 3 The optimal parameter values in four species

In order to analyze the effect of the number of trees on Acc in different species, Table 4 was provided. As shown in Table 4, maximum and minimum were used to represent Acc under the best case and the worst case, respectively. Meanwhile, \( {\text{tree}}_{\hbox{max} } \) and \( {\text{tree}}_{\hbox{min} } \) were used to represent the number of trees corresponding to the maximum accuracy and the minimum accuracy. Besides, - represented null value.

Table 4 The effect of the number of trees on accuracy in different species

3.3 Prediction results for human, worm and fly genomes

Using jackknife test, Sn, Sp, Acc and Mcc were obtained. Acc (= 0.8778, = 0.8798 and = 0.8336 for human, worm and fly genomes, respectively) obtained by our proposed model was higher than those obtained by iNuc-PseKNC (Guo et al. 2014) for human, worm and fly genomes, higher than those obtained by iNuc-PseSTNC (Tahir and Hayat 2016) for human and fly genomes, and higher than those obtained by 3LS model (Awazu 2017) for worm genomes (Tables 5, 6, 7). Compared with those methods provided by Guo and Tahir, our proposed method can find the key factors influencing nucleosome positioning with lower time complexity. Although the method proposed by Awazu can find key factors influencing nucleosome positioning, its time complexity was high. Therefore, our proposed method was an effective method in nucleosome positioning.

Table 5 The prediction performance for human genome
Table 6 The prediction performance for worm genome
Table 7 The prediction performance for fly genome

3.4 Prediction results for yeast genomes

Based on tenfold cross-validation, prediction results of yeast genome was obtained (Acc = 1, Sn = 1, Sp = 1 and Mcc = 1). As shown in Table 8, results obtained by GRE model were same as results obtained by TNS model (Awazu 2017), higher than those obtained by iNuc-PhysChem (Chen et al. 2012) (Acc = 0.967), DNA energy (Chen et al. 2016) (Acc = 0.981). Besides, prediction accuracy of our model can reach 1 when parameter k was equal to 60, 70, 80, 90 and 100. It was shown that the distribution of core DNA and linker DNA obtained by GRE model was consistent with the real distribution of core DNA and linker DNA. Compared with those methods provided by Chen, our proposed method can find the key factors influencing nucleosome positioning with lower time complexity. Although the method proposed by Awazu can find key factors influencing nucleosome positioning, its time complexity was high, and our proposed method was an effective method in nucleosome positioning.

Table 8 The prediction performance for yeast genome

3.5 Analysis of factors affecting nucleosome positioning

Using GRE model, core DNA of human, worm, fly and yeast genomes can be recognized in high accuracy. Furthermore, the crucial factors of nucleosome positioning have been found.

Parameter k can reflect the distribution of core DNA and linker DNA to some extent. Besides, compared to the distribution of core DNA and linker DNA obtained by non-optimal parameter values, the distribution obtained by the optimal parameter values was more similar to the real distribution of core DNA and linker DNA in the entire genome. It indicated that parameter k played important roles in nucleosome positioning. Besides, the optimal parameter values were different in human, worm, fly and yeast genomes. It indicated that features of nucleosome positioning were species dependent.

Furthermore, GRE models of four species contained common feature vectors (positive four-nucleotide, positive five-nucleotide, negative five-nucleotide, positive six-nucleotide and negative six-nucleotide sequences). Removing those common feature vectors from ten-dimensional feature vectors, Acc was 0.7881, 0.8400 and 0.7663 for human, worm and fly genomes, respectively. Similarly, Acc was 0.9994, 0.9994, 0.9992, 0.9994 and 0.9994 in yeast genome when parameter k equaled 60, 70, 80, 90 and 100, respectively. It showed that Acc obtained by the common feature vectors was higher than those obtained by those feature vectors which were obtained by removing common feature vectors from all feature vectors. Therefore, these common feature vectors were key factors affecting nucleosome positioning.

Although the position of nucleosomes of four species and crucial factors of nucleosome positioning can be determined based on our proposed model, some additional factors, such as the flexibility of DNA fragments, can also influence nucleosome positioning (Sangaiah et al. 2017; Lu et al. 2017; Zhang et al. 2018a, b). Therefore, in the future, we will combine the information of DNA sequences with their physicochemical properties in nucleosome positioning.

From simulation results, we can get the following trends in nucleosome positioning.

  • Nucleosome positioning was species dependent.

  • Some common factors, including five-nucleotide and six-nucleotide, were important factors in nucleosome positioning of different species.

  • The values of parameter k with the highest prediction Acc were different in different species. It indicated that some unique factors can also affect the position of nucleosomes in different species.

3.6 Analysis of the significance of the parameter k

The parameter k reflected the weight of two different distributions and had important biological significance. For searching for the significance of parameter, prediction results obtained by RE and GRE methods were compared. Besides, the optimal k value of the model was studied when a genetic mutation occurred. Details were as follows.

In order to find out the importance of weight distribution, prediction results obtained by GRE and RE methods were compared in human, worm, fly and yeast genomes, respectively. The parameter k of GRE method was based on the optimal k value. The comparison results are shown in Tables 9, 10, 11 and 12.

Table 9 Comparison results obtained by GRE and RE methods in the human genome
Table 10 Comparison results obtained by GRE and RE methods in the worm genome
Table 11 Comparison results obtained by GRE and RE methods in the fly genome
Table 12 Comparison results obtained by GRE and RE methods in the yeast genome

As shown in Tables 9, 10, 11 and 12, we can draw a conclusion that the predictive performance of GRE method was better than that of the RE method. Therefore, in terms of prediction accuracy, the adjustment of the weights of the two distributions was very important. The parameter k played an important role in weight adjustment.

Furthermore, the optimal parameters values in the GRE model were studied when a gene mutation occurred at a certain site of the DNA sequences. In order to ensure that the length of the DNA sequences was unchanged, the gene mutation only considered the replacement of the base pair. It was found that the optimal parameters values in the GRE model were stable. Therefore, the parameter k was conservative in the same species.

4 Conclusion

In this paper, based on relative entropy, a novel nucleosome positioning method was proposed. Using this method, core DNA of human, worm, fly and yeast was recognized by their sequences. In order to evaluate the quality of model, different nucleosome positioning methods were compared with same benchmark datasets. Experimental results showed that our proposed model was an effective nucleosome positioning method. Besides, five-nucleotide and six-nucleotide sequences were considered as crucial factor in nucleosome positioning. Because some additional factors, such as the flexibility of DNA fragments, can also influence nucleosome positioning. Therefore, in the future, we will combine the information of DNA sequences with their physicochemical properties in nucleosome positioning. Besides, we will apply the generalized relative entropy to comparison of text similarity, the allocation of weight indicators in multi-index evaluation systems and pattern recognition.