
1 Introduction

Transcription of genic regions into RNA molecules is the first step of the biological central dogma, and is dynamically controlled by various transcription factors (TFs) [1]. A TF regulates a gene’s transcription through its dynamic binding to a short (5–20 bps) DNA sequence upstream of the regulated gene. This DNA sequence is the TF’s binding site (TFBS), which is usually highly specific to this TF and is called a motif [2]. Mutations within TFBSs change the host’s transcription regulatory network, and may lead to species-specific phenotypes or genetic diseases [3].

There are two major high-throughput strategies to screen the binding sites of a TF in the host genome. Firstly, various high-throughput experimental techniques were developed to screen the TFBSs under the given cell culture conditions, including DNase I footprinting [4], electrophoretic mobility shift assay [5], ChIP-on-chip [6] and ChIP-Seq [7], etc. The dynamic landscape of the transcription regulatory network may be elucidated through these screening techniques. But they are usually costly and labor-intensive, and can only detect the binding sites of one TF under one cell culture condition at a time. Considering the 2,886 transcription factors curated in the human DNA-binding domain (DBD) database [8], and the dynamic nature of transcription regulation, it can be anticipated that the transcription regulatory landscape is significantly under-estimated.

Computational TFBS screening techniques have been used to infer the comprehensive list of TFBSs. The majority of in silico TFBS screening techniques assume that the binding sites of a given TF have a fixed length, and calculate the similarity score of a query DNA sequence compared with the local oligo-nucleotide frequency patterns in the known TFBSs [9]. The computational techniques include the position weight matrix (PWM) [10], WebLogo [11], and position specific pairwise score [12], etc. The introduction of a TF’s structural information can greatly reduce the false positive rates, as demonstrated by Facelli [13] and Saito et al. [14]. But there are only 300 unique human TF structures in the PDB database [ref], and the limited availability of experimentally determined TF structures restricts the extensive application of these methods [15].

This study hypothesizes that positions contribute differently to the motif scoring based on their nucleotide frequency patterns, and formulates the position contribution as a weight for the position. The vector of weights for the different motif positions is randomly mutated by an evolutionary algorithm, with the optimization goal of maximizing the overall accuracy. The prediction performance suggests that our algorithm performs similarly to or better than the position specific scoring strategies.

2 Materials and Methods

2.1 Data Resources

The proposed algorithm is applied to the following seven transcription factors (TFs), i.e. Ebox, Myc, P53, Q6MAZ, Q601MAZ, V_SREBP_Q3-SREBP (abbreviated as Q3), and V_SREBP2_Q6-SREBP2 (abbreviated as Q6). The known binding sites of these seven transcription factors were manually collected from the database TRANSFAC in August 2012 [16]. Only those binding sites without an “N” letter were kept for further analysis. The target gene sequences and their promoter regions were extracted from the database ENSEMBL [17].

2.2 Motif Screening Problem

The mathematical model of the transcription factor binding site (TFBS) screening problem (sTFBS) is formulated as follows. For a given transcription factor (TF), its known fixed-length binding sites are defined to be the positive dataset P = {M 1, M 2, …, M n }, where |M i | = L. A negative dataset N = {B 1, B 2, …, B m } is randomly extracted from the promoter regions of the genes regulated by the given TF, where |B j | = L, B j has no “N” letters and B j does not overlap with M i . Considering the promoter region is much larger than a TFBS, we set m = 10 × n. A TFBS screening model is denoted as the classification function f(X) ∈ {P, N}, where X ∈ P ∪ N.
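The construction of the negative dataset can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `sample_negatives`, the half-open `(start, end)` site coordinates, and sampling with replacement are all assumptions made for the example.

```python
import random

def sample_negatives(promoter, sites, L, m, seed=0, max_tries=100000):
    """Draw up to m length-L windows from `promoter` that contain no 'N'
    and do not overlap any known binding site.

    `sites` holds half-open (start, end) intervals in promoter coordinates
    (an assumed representation). Returns a list of (start, window) tuples;
    fewer than m are returned if max_tries is exhausted.
    """
    if len(promoter) < L:
        raise ValueError("promoter is shorter than the window length L")
    rng = random.Random(seed)
    negatives = []
    tries = 0
    while len(negatives) < m and tries < max_tries:
        tries += 1
        start = rng.randrange(0, len(promoter) - L + 1)
        end = start + L
        window = promoter[start:end]
        if "N" in window:
            continue  # skip windows with ambiguous nucleotides
        if any(start < s_end and s_start < end for s_start, s_end in sites):
            continue  # skip windows overlapping a known binding site
        negatives.append((start, window))
    return negatives
```

With n known sites, calling the function with `m = 10 * n` reproduces the 10:1 ratio of negative to positive entries described above.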

Firstly, a similarity score between two fixed-length DNA fragments \(V = \{v_{1}, v_{2}, \ldots, v_{L}\}\) and \(U = \{u_{1}, u_{2}, \ldots, u_{L}\}\) is defined to be \(Score(V, U) = w_{1} \times S(v_{1}, u_{1}) + w_{2} \times S(v_{2}, u_{2}) + \cdots + w_{L} \times S(v_{L}, u_{L})\), where the weight vector \(W = \langle w_{1}, w_{2}, \ldots, w_{L} \rangle\) is the pre-calculated combination pattern, and \(w_{i} \in [0, 1]\). The nucleotide similarity score \(S(v_{i}, u_{i})\) is defined to be 2 if \(v_{i} = u_{i}\), 1 for A versus G or C versus T, and −1 for the other pairs [18]. The combination pattern W will be optimized by an evolutionary algorithm, as described in Sect. 15.2.4.
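The weighted similarity score can be written in a few lines. The sketch below follows the definition above; the function names `nucleotide_similarity` and `weighted_score` are hypothetical and chosen only for this example.

```python
def nucleotide_similarity(a, b):
    """S(a, b): 2 for identical nucleotides, 1 for A/G or C/T pairs, -1 otherwise."""
    if a == b:
        return 2
    if {a, b} == {"A", "G"} or {a, b} == {"C", "T"}:
        return 1
    return -1

def weighted_score(v, u, w):
    """Score(V, U): sum over positions of w_i * S(v_i, u_i)."""
    if not (len(v) == len(u) == len(w)):
        raise ValueError("V, U and W must all have length L")
    return sum(wi * nucleotide_similarity(a, b) for a, b, wi in zip(v, u, w))
```

For example, two identical fragments of length 4 with all weights equal to 1 score 4 × 2 = 8.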

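This study chose the simple nearest neighbor algorithm as the classification model f(X), denoted as SNN(X) in the following sections. A query sequence X is assigned to the class of the training sequence that has the highest similarity score to X.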
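A nearest neighbor classifier built on the weighted similarity score might look like the sketch below. The source does not specify tie-breaking, so assigning ties to the positive class is an assumption of this example, as are the names `score` and `nn_predict`.

```python
def score(v, u, w):
    """Weighted similarity: 2 for a match, 1 for A/G or C/T, -1 otherwise."""
    total = 0.0
    for a, b, wi in zip(v, u, w):
        if a == b:
            s = 2
        elif {a, b} in ({"A", "G"}, {"C", "T"}):
            s = 1
        else:
            s = -1
        total += wi * s
    return total

def nn_predict(query, positives, negatives, w):
    """Return 'P' or 'N' according to the most similar training sequence.

    Ties are assigned to 'P' (an assumption; the source does not say).
    """
    best_pos = max(score(query, p, w) for p in positives)
    best_neg = max(score(query, b, w) for b in negatives)
    return "P" if best_pos >= best_neg else "N"
```

Because the weights enter only through the score, changing W changes which training sequence counts as the nearest neighbor, which is what the later optimization exploits.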

The Position Weight Matrix (PWM) algorithm assumes that the positions in a fixed-length motif are independent of each other, and calculates how similar a query sequence is to the set of known motif occurrences [10, 19]. Firstly, a position conservation factor \(M_{i}\) is calculated as \(M_{i} = \sum\nolimits_{b \in \{A,T,C,G\}} (f_{i}(b)/N - P_{0}(b))^{2} / P_{0}(b),\; i = 1, 2, \ldots, L,\) where \(f_{i}(b)\) is the observed frequency (count) of nucleotide b at position i in the set of known motif occurrences, N is the number of known motif occurrences (in this subsection N does not denote the negative dataset), and \(P_{0}(b)\) is the background frequency of nucleotide b. Then the position probability matrix (PPM) is calculated as:

$$PPM = \left( {\begin{array}{*{20}c} {P_{1} (A)} & {P_{2} (A)} & \cdots & {P_{L} (A)} \\ {P_{1} (T)} & {P_{2} (T)} & \cdots & {P_{L} (T)} \\ {P_{1} (C)} & {P_{2} (C)} & \cdots & {P_{L} (C)} \\ {P_{1} (G)} & {P_{2} (G)} & \cdots & {P_{L} (G)} \\ \end{array} } \right),$$

where \(P_{j} (b) = \left\{ {f_{j} (b) + s(b)} \right\}/\left\{ {N + \sum\nolimits_{{b \in \{ A,T,C,G\} }} {s(b)} } \right\}\), and \(s(b) = P_{0} (b)\sqrt N\) is a smoothing factor.

Then the position weight matrix (PWM) is calculated as

$$PWM = \left( {\begin{array}{*{20}c} {w_{1} (A)} & {w_{2} (A)} & \cdots & {w_{L} (A)} \\ {w_{1} (T)} & {w_{2} (T)} & \cdots & {w_{L} (T)} \\ {w_{1} (C)} & {w_{2} (C)} & \cdots & {w_{L} (C)} \\ {w_{1} (G)} & {w_{2} (G)} & \cdots & {w_{L} (G)} \\ \end{array} } \right),$$

where \(w_{i} (b) = \ln \left\{ {P_{i} (b)/P_{0} (b)} \right\}.\)

The standardized similarity score of a query sequence Q is defined to be

$$S(Q) = \frac{{\sum\nolimits_{i = 1}^{L} {M_{i} w_{i} (Q_{i} )} - \sum\nolimits_{i = 1}^{L} {M_{i} \hbox{min} \{ w_{i} (b)\} } }}{{\sum\nolimits_{i = 1}^{L} {M_{i} \hbox{max} \{ w_{i} (b)\} } - \sum\nolimits_{i = 1}^{L} {M_{i} \hbox{min} \{ w_{i} (b)\} } }},$$

where \(Q_{i}\) is the i-th nucleotide in Q, and b ∈ {A, T, C, G}. For a given cutoff \(S_{0}\), Q is predicted to be a binding motif of the transcription factor only if \(S(Q) \ge S_{0}\).
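The PWM scoring steps can be sketched in plain Python as follows. The uniform background of 0.25 per nucleotide and the function names `build_pwm` and `standardized_score` are assumptions made for this example; the source does not state which background frequencies were used.

```python
import math

BASES = "ATCG"

def build_pwm(sites, background=None):
    """Return (conservation, pwm): the factors M_i and per-position
    log-odds weights w_i(b) computed from equal-length known sites."""
    background = background or {b: 0.25 for b in BASES}  # assumed uniform
    n_sites, L = len(sites), len(sites[0])
    smooth = {b: background[b] * math.sqrt(n_sites) for b in BASES}  # s(b)
    smooth_total = sum(smooth.values())
    conservation, pwm = [], []
    for i in range(L):
        counts = {b: sum(1 for s in sites if s[i] == b) for b in BASES}
        conservation.append(sum(
            (counts[b] / n_sites - background[b]) ** 2 / background[b]
            for b in BASES))
        pwm.append({
            b: math.log(((counts[b] + smooth[b]) / (n_sites + smooth_total))
                        / background[b])
            for b in BASES})
    return conservation, pwm

def standardized_score(query, conservation, pwm):
    """S(Q), rescaled to [0, 1] between the worst and best possible sequence."""
    raw = sum(m * col[q] for m, col, q in zip(conservation, pwm, query))
    low = sum(m * min(col.values()) for m, col in zip(conservation, pwm))
    high = sum(m * max(col.values()) for m, col in zip(conservation, pwm))
    if high == low:
        raise ValueError("no position is informative; S(Q) is undefined")
    return (raw - low) / (high - low)
```

A query is then called a binding motif when `standardized_score(...)` is at least the chosen cutoff. The guard for `high == low` covers the degenerate case in which every conservation factor is zero.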

2.3 Prediction Performance Measurements and Evaluation

Given the positive dataset P = {M 1, M 2, …, M n }, and the negative dataset N = {B 1, B 2, …, B m }, where |M i | = |B j | = L, M i is a true positive if SNN(M i ) = P and a false negative if SNN(M i ) = N, whereas B j is a true negative if SNN(B j ) = N and a false positive if SNN(B j ) = P. For the classification model SNN(X), the numbers of true positives, false negatives, true negatives and false positives are abbreviated as TP, FN, TN and FP, respectively. The classification performance of the model is measured by sensitivity (Sn), specificity (Sp), accuracy (Ac) and Matthews correlation coefficient (MCC) [20, 21], which are defined as follows: Sn = TP/(TP + FN), Sp = TN/(TN + FP), Ac = (Sn + Sp)/2, and \(MCC = (TP \times TN - FP \times FN)/\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}\). Note that Ac is the average of Sn and Sp, so it is not dominated by the ten-fold larger negative dataset.
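The four measurements can be computed directly from the confusion counts. In the sketch below, returning MCC = 0 when the denominator is zero is a common convention and an assumption of this example, not something stated in the source; the function name `performance` is also hypothetical.

```python
import math

def performance(tp, fn, tn, fp):
    """Return Sn, Sp, Ac and MCC as defined in the text."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    ac = (sn + sp) / 2  # average of Sn and Sp, as defined above
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # assumed convention
    return {"Sn": sn, "Sp": sp, "Ac": ac, "MCC": mcc}
```

For instance, TP = 8, FN = 2, TN = 90 and FP = 10 give Sn = 0.8, Sp = 0.9 and Ac = 0.85.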

A line plot will be generated for the evolutionarily optimized combination pattern W = 〈w 1, w 2, …, w L 〉 for the comparison with the WebLogo plot. TFBS screening algorithms usually use the visual technique WebLogo to demonstrate the DNA compositions at each position in the TFBS, and a higher plotted position suggests a larger information content [11]. An initial weight vector \(W^{0} = \langle w_{1}^{0} ,w_{2}^{0} , \ldots ,w_{L}^{0} \rangle\) is generated from a transcription factor’s WebLogo plot, by scaling the information content at position i to [0, 1] as \(w_{i}^{0}.\)
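One way to derive an initial weight vector from per-position information content is sketched below. It assumes the uncorrected information content 2 − H bits for DNA (without WebLogo's small-sample correction) and scales it by the 2-bit maximum; the source does not specify the scaling, so both choices and the name `initial_weights` are assumptions.

```python
import math

def initial_weights(sites):
    """Return one weight in [0, 1] per motif position, computed as the
    information content (2 - entropy, in bits) divided by 2 bits."""
    n, L = len(sites), len(sites[0])
    weights = []
    for i in range(L):
        entropy = 0.0
        for b in "ACGT":
            p = sum(1 for s in sites if s[i] == b) / n
            if p > 0:
                entropy -= p * math.log2(p)
        weights.append((2.0 - entropy) / 2.0)
    return weights
```

A fully conserved position receives weight 1, and a position where all four nucleotides are equally frequent receives weight 0.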

Two validation strategies are adopted to evaluate the prediction performance of the classification algorithm SNN. Firstly, the algorithm SNN is investigated for its leave-one-out (LOO) cross validation performance, i.e. iteratively choosing one data entry and investigating its prediction by the classification model trained on the remaining data. The LOO validation strategy has been widely used to measure how a TFBS or other functional element prediction algorithm performs [22, 23]. To further investigate the dataset dependency of the proposed SNN algorithm, this study also uses a 3-fold cross validation (3FCV) strategy [24–26]. The basic idea is to randomly split the positive and negative datasets into 3 equal-size subsets {P 1, P 2, P 3} and {N 1, N 2, N 3}, respectively. The prediction results are iteratively investigated for {P i , N i } using the SNN trained on P\P i and N\N i , where i = 1, 2, and 3. A self validation (denoted as Self) is also used to evaluate the self consistency, i.e. how a classification model performs on its own training dataset.
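The LOO and 3FCV strategies can be sketched generically for any classifier. The `predict(query, train_pos, train_neg)` callback signature, the stride-based fold assignment after shuffling, and the function names are assumptions of this example.

```python
import random

def _tally(counts, truth, predicted):
    """Increment the confusion count for one (truth, prediction) pair."""
    key = {("P", "P"): "TP", ("P", "N"): "FN",
           ("N", "N"): "TN", ("N", "P"): "FP"}[(truth, predicted)]
    counts[key] += 1

def leave_one_out(positives, negatives, predict):
    """Predict each entry with a model trained on all the other entries."""
    pos, neg = list(positives), list(negatives)
    counts = {"TP": 0, "FN": 0, "TN": 0, "FP": 0}
    for i, q in enumerate(pos):
        _tally(counts, "P", predict(q, pos[:i] + pos[i + 1:], neg))
    for j, q in enumerate(neg):
        _tally(counts, "N", predict(q, pos, neg[:j] + neg[j + 1:]))
    return counts

def k_fold(positives, negatives, predict, k=3, seed=0):
    """Shuffle, split both datasets into k folds, and test on each fold
    with a model trained on the other k - 1 folds."""
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    counts = {"TP": 0, "FN": 0, "TN": 0, "FP": 0}
    for fold in range(k):
        train_pos = [s for idx, s in enumerate(pos) if idx % k != fold]
        train_neg = [s for idx, s in enumerate(neg) if idx % k != fold]
        for q in pos[fold::k]:
            _tally(counts, "P", predict(q, train_pos, train_neg))
        for q in neg[fold::k]:
            _tally(counts, "N", predict(q, train_pos, train_neg))
    return counts
```

The returned counts can be passed to any routine that computes Sn, Sp, Ac and MCC. Self validation corresponds to calling `predict` with the full datasets as the training data for every query.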

2.4 Evolutionary Optimization Algorithm

This study proposes an evolutionary optimization algorithm to search for the weight vector with the best overall accuracy Ac of the algorithm SNN, as shown in Fig. 15.1. The basic idea of an evolutionary optimization algorithm (EOA) is to simulate the natural selection process [27, 28]. Each generation of individuals produces children through the operations of crossover and mutation from a pair of parents. A fitness function is defined to describe how well each child fits the natural selection pressure. A better fitness leads to a higher chance of surviving into the next generation. The population size is usually fixed to a constant value [11, 29–37]. The initial population W consists of PopSize individual weight vectors, i.e. \(W^{i}\), where i ∈ {1, 2, …, PopSize}. Each individual \(W^{i}\) is an L-dimensional vector \(W^{i} = \langle W_{1}^{i} ,W_{2}^{i} , \ldots ,W_{L}^{i} \rangle\), where \(W_{j}^{i}\) is a random value between 0 and 1.

Fig. 15.1 Procedure of the evolutionary optimization algorithm. The 5 weight vectors with the best accuracies Ac are output

MaxGen generations of natural mutation and selection are conducted to find the fittest weight vectors. For a given weight vector W i, an SNN classification model is built, and the overall classification accuracy Ac with the 4-fold cross validation is defined to be the fitness function Ac(W i), as used in step 5. For the population of weight vectors W, Top5(W) consists of 5 weight vectors with the best fitness in the population. The final top 5 weight vectors together with the performance measurements of their classification models are output.
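The overall loop can be sketched as below. The source does not give the selection scheme, crossover operator or mutation rate, so this example assumes truncation selection (the better half survives), uniform crossover, and Gaussian mutation clipped to [0, 1]; these operators, the parameter defaults, and the name `evolve` are illustrative assumptions rather than the authors' procedure.

```python
import random

def evolve(fitness, L, pop_size=100, max_gen=3000,
           mutation_rate=0.1, mutation_sd=0.1, seed=0):
    """Return the 5 fittest weight vectors after max_gen generations.

    `fitness` maps a weight vector (list of L floats in [0, 1]) to a
    number to maximize, e.g. the cross-validated accuracy Ac.
    """
    if pop_size < 2:
        raise ValueError("pop_size must be at least 2")
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(L)] for _ in range(pop_size)]
    for _ in range(max_gen):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:max(2, pop_size // 2)]  # assumed truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: each weight comes from either parent.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            # Gaussian mutation, clipped so weights stay inside [0, 1].
            child = [min(1.0, max(0.0, g + rng.gauss(0.0, mutation_sd)))
                     if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = parents + children  # population size stays constant
    return sorted(population, key=fitness, reverse=True)[:5]
```

Because the parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next. In practice the fitness would be cached per individual, since evaluating a cross-validated accuracy is far more expensive than the toy functions used for testing.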

3 Results and Discussion

3.1 Best Parameters for EOA

There are two parameters for the evolutionary algorithm EOA, i.e. the population size PopSize and the generation number MaxGen. Previous studies suggested that PopSize = 100 performs well for evolutionary optimization problems with an individual vector size of ~10 [38]. We therefore first fix PopSize = 100, and investigate how the optimization goal, Ac, changes with an increasing number of generations, i.e. MaxGen. The parameter MaxGen is varied between 0 and 5,000 with a step size of 100. Q6MAZ and Q3 quickly reach the peak Ac value of 1.00 after just MaxGen = 200 generations of optimization, as shown in Fig. 15.2a. The TFs Ebox, Myc and P53 also reach very high Ac values (>97 %) at just MaxGen = 200. If we take the Ac value at MaxGen = 5,000 as the final result, all the six investigated TFs reach this peak value by MaxGen = 3,000, as shown in Fig. 15.2a.

Fig. 15.2 Distributions of overall classification accuracy, Ac, for different generation numbers. The population sizes PopSize are fixed to (a) 100, (b) 60 and (c) 140, respectively

We further investigate how the parameter PopSize impacts the optimization performance of EOA, as shown in Fig. 15.2 and Supplementary Figure S1. By choosing PopSize ∈ {20, 40, 60, 80, 100, 120, 140, 160, 200}, the overall accuracy Ac is calculated for generation G ∈ {0, 100, 200, …, 4,900, 5,000} of EOA on each of the six TFs. Figure 15.2 shows that the TFBS prediction problem of Q6 is the most difficult to optimize, reaching its peak values at generations 3,800, 3,000 and 2,600 for PopSize = 60, 100 and 140, respectively. All the other five TFs reach their peak Ac values before generation 3,000. Similar patterns can be observed for the other population sizes, as in Supplementary Figure S1.

Considering that the running time of the evolutionary algorithm EOA increases linearly with the product PopSize × MaxGen, and the above data, this study will set PopSize = 100 and MaxGen = 3,000 for the following experiments.

3.2 Comparison of PWM and SNN(W0)

We first compare the widely used PWM algorithm with the SNN algorithm. WebLogo is also widely used to demonstrate the information content or conservation at each position of a motif [11]. The taller the letter stack at a position, the larger the information content of this position, as shown in Fig. 15.3. The binding sites of all the seven TFs do show significant patterns in the information content of some motif positions. So we hypothesize that the information content from the WebLogo plot may represent well the weight of each motif position for the SNN algorithm, and the resulting weight vector is denoted as W 0.

Fig. 15.3 WebLogo plots for the TFs. (a) Ebox, (b) Myc, (c) P53, (d) Q6MAZ, (e) Q601MAZ, (f) Q3 and (g) Q6. The line plot is for the evolutionarily optimized weight vector by the SNN + EOA algorithms for each TF

Both PWM and SNN score the similarity of a query DNA sequence to the known TFBSs, and this study chooses the cutoff score with Sn ~ Sp for the comparison. In general, the SNN(W 0) algorithm performs similarly well or slightly worse compared with the PWM algorithm, as shown in Table 15.1. Both algorithms produce ~90 % or larger overall accuracy Ac for the TFBS motif screening problem, and the TF Q3 even receives 100 % accurate separation of the positive and negative data entries from both algorithms under the two validation strategies. The biggest difference between the two algorithms is for the TFBS motif screening problem of Myc, where SNN(W 0) performs 5.01 and 5.48 % worse in Ac than PWM using the LOO and 3FCV validations, respectively. So our first hypothesis about the usage of W 0 is reasonable but may need further optimization.
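Choosing a cutoff with Sn ~ Sp can be done by scanning the observed scores. The sketch below picks the cutoff that minimizes |Sn − Sp|, which is one plausible reading of the criterion; the exact selection rule is not given in the source, and the function name `balanced_cutoff` is hypothetical.

```python
def balanced_cutoff(pos_scores, neg_scores):
    """Return (cutoff, Sn, Sp) for the cutoff minimizing |Sn - Sp|.

    A sequence is predicted positive when its score is >= cutoff.
    Candidate cutoffs are the observed scores; the smallest cutoff
    wins among equally balanced candidates.
    """
    best = None
    for cutoff in sorted(set(pos_scores) | set(neg_scores)):
        sn = sum(s >= cutoff for s in pos_scores) / len(pos_scores)
        sp = sum(s < cutoff for s in neg_scores) / len(neg_scores)
        gap = abs(sn - sp)
        if best is None or gap < best[0]:
            best = (gap, cutoff, sn, sp)
    return best[1], best[2], best[3]
```

Balancing Sn and Sp in this way puts both algorithms at a comparable operating point before their accuracies are compared.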

Table 15.1 Prediction performances of the algorithms PWM and SNN(W 0)

3.3 Comparison of PWM and SNN + EOA

The next hypothesis is that there may exist a weight vector W = 〈w 1, w 2, …, w L 〉 with increased Ac value for the SNN algorithm. Besides the position independent measurements, e.g. PWM or WebLogo, there is no available knowledge about how to optimize the weight vector. So we choose to use the evolutionary optimization algorithm to search for a weight vector with optimal overall accuracy Ac by just random mutations in the weight vectors, as described in Sect. 15.2.4.

After the optimization over MaxGen = 3,000 generations of PopSize = 100 individuals (weight vectors), the motif screening algorithm SNN matches or outperforms the PWM algorithm in every performance measurement for all the seven TFs, as shown in Table 15.2. The PWM algorithm achieves 100 % accuracy for the LOO validation of Q6MAZ and both the LOO and 3FCV validations of Q3, and the SNN + EOA algorithm also achieves such perfect classification. For the other transcription factors, SNN + EOA outperforms PWM by 0.97–7.83 % in overall accuracy Ac. The measurement MCC ∈ [−1, 1] evaluates how well the prediction results match the positive and negative datasets, and a larger MCC means a better prediction. Except for the two TFs Q6MAZ and Q3, on which both algorithms perform equally well, SNN + EOA improves the MCC of the PWM algorithm by 0.0327–0.2026. The PWM algorithm does not perform well on the dataset of the well-known tumor suppressor P53, as in Table 15.2. It only achieves Sn = 84.78 % and Sp = 96.74 % for the LOO validation of P53, and the overall accuracy is only 90.76 %. SNN + EOA achieves a slightly better specificity (Sp = 97.17 %) and a much better sensitivity (Sn = 100 %). A similar improvement is also achieved by SNN + EOA for the 3FCV validation of P53.

Table 15.2 Prediction performances of the algorithms PWM and SNN + EOA

It is also interesting to observe that the weight vector achieving the best prediction performance does not match the position independent measurement WebLogo, as shown in Fig. 15.3. For the tumor suppressor P53, the optimized weight vector does not agree with WebLogo at positions 4, 5 and 9, as shown in Fig. 15.3c. The information content at position 4 is larger than that at position 5, but the optimized weight vector ranks the two positions in the reverse order. And although the information content at position 9 only ranks 8th, position 9 has the second largest weight. Similar discrepancies exist for all the seven investigated TFs, as in Fig. 15.3, and suggest that a concerted weighting of different positions is necessary for motif screening and other similar problems.