1 Introduction

Finding genes in DNA strings is a problem of critical importance to genomic research. The large quantities of new DNA sequences being continually produced all over the world demand fast, accurate methods for automatic analysis. One of the most important challenges in the analysis of new DNA sequences is determining whether or not they contain any genes and, if so, exactly where those genes are. A number of methodologies have been put forward for solving this problem, including neural networks [12, 13, 27, 29, 31], decision trees [25], genetic programming [9], hidden Markov models [2, 8, 10, 11] and fuzzy logic [16, 30]. The main goal of learning in classification problems is generalization; that is, for the purposes of this paper, inducing a concept that accurately classifies genes not included in the training set. The difficulty in achieving good generalization is that a learner cannot directly measure its generalization ability; instead, it relies on its inductive bias to produce an accurate classifier.

Neural networks have been used extensively in bioinformatics, beginning with the use of the Perceptron model to find ribosome binding sites in nucleotide sequences [28]. Recent examples include the PredictProtein program and the GRAIL genefinder. Snyder and Stormo [27] successfully used the Perceptron algorithm to find Escherichia coli translational initiation sites. Multi-layer neural networks trained with the back-propagation algorithm have also been used extensively in bioinformatics; recent examples include the systems developed by Ma et al. [12] for DNA-sequence classification and data mining.

E. coli promoters are located immediately before an E. coli gene, so successful localization of the promoter leads directly to identification of the gene. We propose a neural network based multi-classifier system for promoter recognition (MultiNNProm, Fig. 1) to identify these promoter sites. The proposed system contains four neural networks, to which the same DNA sequence is presented using four different encoding methods. The outputs of the individual neural networks are passed through a probability function and finally combined by an aggregation function, which produces the answer as to whether the presented sequence is an E. coli promoter or not.

Fig. 1

General scheme of the neural network based multi-classifier for promoter recognition

Genetic algorithms are able to search a large space of candidate hypotheses to identify the best one, and have proven useful in situations where an evaluation method for calculating the fitness of a candidate hypothesis is available. The combination (aggregation) function used in MultiNNProm combines the classification results of the individual classifiers by calculating the entropy of each output and then aggregating them through a weighted sum. The aggregating weights were obtained by initializing a random population of weights and running a genetic algorithm, using the classification accuracy of the combined classifier as the fitness value, to obtain an optimal set of weights.

2 E. coli promoter recognition

DNA, deoxyribonucleic acid, is found in long chains, with each link called a nucleotide. Each nucleotide in DNA consists of a sugar molecule called deoxyribose that bonds to a phosphate molecule and to a nitrogen-containing compound called a base. DNA uses four bases in its structure: adenine (A), cytosine (C), guanine (G) and thymine (T). The order of the bases in a DNA molecule, the genetic code, determines the amino acid sequence of a protein. Genes are sections of a cell's DNA. While every gene is made up of DNA bases, an individual's entire DNA is not made up of genes alone: remarkably, only two to three percent of human DNA consists of genes (protein coding regions).

The Bioinformatics Initiative lists a set of reasons why E. coli is currently the organism of choice for such studies, both as a reference organism for studies on prokaryotic systems and as a source of information on proteins and metabolic pathways that are shared by eukaryotes as well.

A promoter is a region of DNA where transcription initiation takes place. In prokaryotes, the sequence of a promoter is recognized by the sigma (σ) factor of the RNA polymerase. E. coli promoters contain two binding sites, the locations to which the RNA polymerase binds in order to begin transcription. The two binding sites are located at the points known as the −35 hexamer box and the −10 hexamer box, names derived from the statistical fact that these sites usually occur 35 and 10 positions, respectively, upstream of the transcriptional start site (Fig. 2) [6, 18].

Fig. 2

The E. coli promoter, showing the locations of the −35 and −10 hexamer boxes

The spacer between the −10 hexamer box and the transcriptional initiation site has a variable length, the most probable length being 7 bases. The spacer between the −10 site and the −35 site is also variable, ranging between 15 and 21 bases. It is this variation that makes the recognition of these promoters difficult for traditional methodologies.

The E. coli promoter can have one of four σ factors, listed in Table 1 together with the consensus sequences for the recognition of these sites. A promoter should contain an element that is very close, or identical, to the consensus sequence. In E. coli, the RNA polymerase holoenzyme (with σ factor) contains all the information needed to accurately initiate transcription, given proper promoter signals (the −35 recognition sequence plus the −10 Pribnow box); a small illustration of consensus-based scanning is sketched below Table 1.

Table 1 E. coli σ factors and the consensus sequence for the recognition of these promoters
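
To make the role of the consensus sequences concrete, the sketch below scans a sequence for an approximate σ70-style promoter, assuming the widely cited consensus hexamers TTGACA (−35) and TATAAT (−10) and the 15–21 base spacer described above. The mismatch threshold, function names and demo sequence are illustrative choices of ours, not part of the system described in this paper.

```python
# Minimal sketch: scan a sequence for an approximate sigma-70 promoter,
# assuming the canonical consensus hexamers and a 15-21 base spacer.
CONSENSUS_35 = "TTGACA"
CONSENSUS_10 = "TATAAT"

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_promoter_candidates(seq: str, max_mismatch: int = 2):
    """Yield (pos35, pos10, mismatches) for hexamer pairs whose combined
    mismatch count against the consensus is within the threshold."""
    seq = seq.upper()
    for i in range(len(seq) - 5):
        d35 = hamming(seq[i:i + 6], CONSENSUS_35)
        if d35 > max_mismatch:
            continue
        # The -10 box starts 15-21 bases after the end of the -35 box.
        for spacer in range(15, 22):
            j = i + 6 + spacer
            if j + 6 > len(seq):
                break
            d10 = hamming(seq[j:j + 6], CONSENSUS_10)
            if d35 + d10 <= max_mismatch:
                yield i, j, d35 + d10

# Example: a sequence containing near-consensus boxes 17 bases apart.
demo = "GCGC" + "TTGACA" + "A" * 17 + "TATGAT" + "GCGC"
print(list(find_promoter_candidates(demo)))  # [(4, 27, 1)]
```
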

Ma et al. [12] used the software available at http://www-lecb.ncifcrf.gov/~toms/delia.html to display the sequence logos of 438 E. coli promoters aligned according to their transcriptional start sites (Fig. 3). This software uses Shannon entropy to measure the non-randomness of each position independently. From Fig. 3, it can be seen that strong motifs exist at positions −35 and −10, and weak motifs at positions +1, −22, −29 and −44.

Fig. 3

The sequence logos of 438 E. coli promoters
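
As a rough illustration of the entropy measure underlying such sequence logos, the sketch below computes the per-position information content (2 − H bits) of a set of aligned sequences. The toy alignment is invented for demonstration and is not the promoter data used in the paper.

```python
import math
from collections import Counter

def position_information(aligned_seqs):
    """Per-position information content (2 - H bits) for aligned DNA
    sequences, the quantity plotted in a sequence logo.  Positions with
    strong motifs (low Shannon entropy H) score close to 2 bits."""
    length = len(aligned_seqs[0])
    info = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_seqs)
        total = sum(counts.values())
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        info.append(2.0 - h)  # maximum entropy over 4 bases is 2 bits
    return info

# Toy alignment: the first position is conserved, the second is random-like.
toy = ["TA", "TC", "TG", "TT"]
print(position_information(toy))  # [2.0, 0.0]
```
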

Because an E. coli promoter lies immediately before an E. coli gene, successfully localizing the promoter identifies the position of the gene. Though the variability of E. coli promoter characteristics adds to the difficulty of recognition, the strong and weak motifs mentioned above can be exploited to make recognition possible.

3 Multiple classifiers

3.1 Why use multiple classifiers?

Individual classifier models are increasingly being challenged by combined pattern recognition systems, which often show better performance than any individual classifier. The individual classifiers within a multi-classifier system complement each other, yielding a better overall result. Additionally, when a system incorporates classifiers based on different methodologies, the inadequacies of one methodology may be offset by the advantages of another.

Thus, genes not recognized by one classifier may be recognized by another, and DNA falsely classified as a gene by one classifier may be rejected by the others (Fig. 4). In Fig. 4, the shaded region represents the true positives, and the ellipses represent the space classified by each classifier as positive.

Fig. 4

A classification problem with different classifiers being used for the same data set

The challenge in building such a system is combining the results provided by each classifier so as to arrive at the optimal overall result. Under suitable combination schemes, the performance of the entire system can be proven to be no worse than that of the best individual expert. Dietterich [5] indicated that an ensemble of classifiers can achieve better recognition than a single classifier when (1) the recognition rate of each individual classifier in the ensemble is greater than 0.5, and (2) the errors made by the individual classifiers are uncorrelated.
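
A small Monte Carlo sketch makes these two conditions tangible: with independent classifiers that are each correct with probability above 0.5, majority voting beats any single classifier. The parameter values below are arbitrary illustrative choices.

```python
import random

def simulate(n_classifiers=5, p_correct=0.7, trials=100_000, seed=0):
    """Monte Carlo illustration of the ensemble conditions: independent
    classifiers, each correct with probability p_correct > 0.5, combined
    by majority vote."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        correct_votes = sum(rng.random() < p_correct
                            for _ in range(n_classifiers))
        wins += correct_votes > n_classifiers / 2
    return wins / trials

print(simulate())  # roughly 0.84, versus 0.7 for a single classifier
```
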

In a situation where several classifiers are available, the simplest approach is to select the best performing classifier and use it for the classification task [17]. This approach, although simple and easy to implement, does not guarantee good performance [22]: it is highly probable that a combination of classifiers would outperform a single classifier [24]. On the other hand, adding different but worse performing classifiers might only increase the complexity of the problem and produce results worse than those of the worst individual classifier. It is therefore well known that, for a multi-classifier system to be successful, the constituent classifiers should have good individual performances and be sufficiently different from each other [26]. However, neither individual performance [21, 34] nor diversity [24] provides an ideal measure of how successful the combination will be.

As explained in [24], the core element of classifier selection is the selection criterion. The most natural choice is the combined performance, which is also the criterion for selecting the combiner. The drawback of this methodology is the exponential complexity of testing all possible combinations of a given set of classifiers.

It has been shown that, for certain classification problems, using an ensemble of neural networks can improve classification performance compared with single networks. Rost and Sander [23] and Riis and Krogh [20] provide results on protein secondary structure prediction, and Baldi and Brunak [1] give an overview of applications in molecular biology.

Among the methodologies used for combining multiple neural network classifiers, majority voting, neural networks, Bayesian inference and the Dempster–Shafer theory have proven to be the most popular [15, 32, 33]. The Dempster–Shafer method has proven successful, but depends considerably on the function used for the assignment of probability. For the numerical outputs produced by neural networks, posterior class-conditional probabilities can be calculated, and this calculation is relatively simple when the number of output classes is small.

In the next section, we present two methods for combining the results of multiple neural network classifiers: the linear average predictor (LAP) and the logarithmic opinion pool (LOP), both introduced in [7]. We propose a modification of the LOP method, LOP2, which proved more suitable and produced better results when combined with genetic algorithms.

3.2 The LAP and LOP methods for combining classifiers

The method we used to combine our four classifiers is a variation of the LOP method [7], which we call LOP2; we compare it with the more commonly used LAP method. The LOP method is a general ensemble method which shows how to calculate the error of an ensemble from the individual errors of the constituent classifiers (via the ensemble ambiguity). Hansen and Krogh [7] claimed that such ensembles always improve the average performance.

An ensemble consists of $M$ predictors $f_i$, which are combined into the combined predictor $F$. Each predictor produces an output in the form of a probability vector $\{f_i^1, \ldots, f_i^N\}$, where $f_i^j$ $(i = 1, \ldots, M)$ is the estimated probability that the input $\vec x$ belongs to the class $c_j$, $j = 1, \ldots, N$ ($N$ being the number of classes). Associated with the ensemble is a vector of coefficients $\{\alpha_1, \ldots, \alpha_M\}$, where $\sum_{i=1}^{M} \alpha_i = 1$ and $\alpha_i \geq 0$.

3.2.1 The LAP method

The combined predictor for the LAP method is defined as:

$$ F_{{\text{LAP}}}^j = \sum\limits_{i = 1}^M {\alpha _i f_i^j } $$
(1)
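
A minimal NumPy rendering of Eq. (1), assuming probs holds one probability vector per classifier (shape M × N) and alphas holds the M coefficients; the function name is ours.

```python
import numpy as np

def lap(probs, alphas):
    """Eq. (1): linear average predictor.  probs holds one probability
    vector per classifier (shape M x N); alphas are the M coefficients,
    non-negative and summing to one."""
    return np.asarray(alphas, dtype=float) @ np.asarray(probs, dtype=float)

# Two classifiers, two classes; alphas weight the first more heavily.
print(lap([[0.9, 0.1], [0.6, 0.4]], alphas=[0.7, 0.3]))  # [0.81 0.19]
```
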

3.2.2 The LOP method

The combined predictor for the LOP method is defined by (2), where Z is the normalization factor defined by (3). This method is non-linear and asymmetric in comparison to the LAP method. A comparison of the results obtained using the two methods is given in Sect. 5.

$$ F_{{\text{LOP}}}^j = \frac{1} {Z}\exp \left( {\sum\limits_{i = 1}^M {\alpha _i \log f_i^j } } \right) $$
(2)
$$ Z = \sum\limits_{j = 1}^N {\exp \left( {\sum\limits_{i = 1}^M {\alpha _i \log f_i^j } } \right)} $$
(3)
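
Equations (2) and (3) amount to a renormalized weighted geometric mean of the classifiers' probability vectors. A minimal sketch under the same assumptions as the LAP example above:

```python
import numpy as np

def lop(probs, alphas):
    """Eqs. (2)-(3): a weighted geometric mean of the classifiers'
    probability vectors, renormalized over the N classes by Z."""
    log_probs = np.log(np.asarray(probs, dtype=float))   # shape (M, N)
    unnorm = np.exp(np.asarray(alphas, dtype=float) @ log_probs)
    return unnorm / unnorm.sum()                         # division by Z

print(lop([[0.9, 0.1], [0.6, 0.4]], alphas=[0.7, 0.3]))
```
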

3.2.3 The LOP2 method

The LOP2 method modifies LOP by allowing the sum of the coefficients $\alpha_1, \ldots, \alpha_M$ to differ from 1, with individual $\alpha_i$ also allowed to take negative values. We found that this small modification allowed the system to cancel out small errors made by the more successful classifiers.

Genetic algorithms are used to search for suitable coefficients $\alpha_i$ for aggregating the classification results of the individual classifiers, as described in Sects. 4 and 5. Using the LOP method, where the coefficients always sum to one, would make the genetic operators (e.g. crossover) very difficult to implement and would restrict the search to a small part of the search space (the surface on which the coefficients sum to one).
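
Computationally, LOP2 has the same form as LOP; the only difference is that the coefficient vector handed to it is unconstrained. A self-contained sketch, with a hypothetical GA-style weight vector for illustration:

```python
import numpy as np

def lop2(probs, alphas):
    """Identical in form to LOP (Eqs. 2-3), but the coefficients are
    unconstrained: they may be negative and need not sum to one."""
    log_probs = np.log(np.asarray(probs, dtype=float))
    unnorm = np.exp(np.asarray(alphas, dtype=float) @ log_probs)
    return unnorm / unnorm.sum()

# Hypothetical GA-evolved weights, for illustration only:
print(lop2([[0.9, 0.1], [0.4, 0.6], [0.8, 0.2], [0.7, 0.3]],
           alphas=[1.8, -0.4, 0.9, 0.3]))
```
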

4 Description of the MultiNNProm system

4.1 Overview of the system

As shown in Fig. 5, the system is a neural network based multi-classifier. We developed four neural networks, called NNE1, NNE2, NNC2 and NNC4, to which the same DNA sequence is input using four different encoding methods: E1, E2, C2 and C4 (see Table 2), respectively. The outputs of the individual neural networks are passed to a probability builder function, which assigns probabilities as to whether the presented sequence is an E. coli promoter or not. Finally, the outputs of the probability functions are aggregated by a result combiner, which produces the final answer, of the 'yes' or 'no' form, as to whether the given sequence is an E. coli promoter.

Fig. 5

MultiNNProm architecture

Table 2 The input encoding methods used in the implementation of the system

4.2 Data set

The entire data set consisted of 483 E. coli promoter sequences and 600 non-promoter sequences. The positive sequences were obtained from the PromEC Web site. Each E. coli sequence was 101 nucleotides long and aligned at the transcriptional start site, which appeared at position 76 of the string. Thus, each positive sequence consisted of 101 nucleotides, starting at position −75 (75 nucleotides upstream of the transcriptional start site) and ending at position +25 (25 nucleotides downstream of it). The negative data set was obtained from the database compiled by the E. coli Genome Project at the University of Wisconsin-Madison; it consists of E. coli genes with the preceding promoter region deleted.

From the available data, we constructed three data sets: a training set, a positive test set and a negative test set. The training set consisted of 324 positive and 429 negative sequences. The remaining data were divided into a positive test set of 159 sequences and a negative test set of 171 sequences.

4.3 The encoding methods

As mentioned in Sect. 4.1, four encoding methods (E1, E2, C2 and C4) were used; they are described in Table 2. The networks were trained to produce an output of −1 if the DNA sequence was deemed not to be a promoter, and an output of 1 if it was.
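
Table 2's exact definitions of E1, E2, C2 and C4 are not reproduced in the text, so the sketch below shows two standard DNA encodings of the general kind such schemes use: a 4-bit one-hot code and a 2-bit binary code. Whether these correspond to the paper's C4 and C2 is an assumption on our part.

```python
import numpy as np

# Hypothetical illustrations only: the paper's Table 2 defines the actual
# E1, E2, C2 and C4 schemes, which are not reproduced here.
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
           "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}
TWO_BIT = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}

def encode(seq, table):
    """Flatten a DNA string into one numeric input vector."""
    return np.array([bit for base in seq.upper() for bit in table[base]])

print(encode("ACGT", ONE_HOT))  # 16 inputs for 4 bases
print(encode("ACGT", TWO_BIT))  # 8 inputs for 4 bases
```
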

Using different encoding methods yielded slightly different results with respect to the genes correctly classified, as shown in Sect. 5. The underlying conclusion was that each encoding method helped its network specialize in different types of promoters present in the training data set.

4.4 The individual neural network classifiers

We developed four neural networks, each formatted to accept one of the encoding methods mentioned above. Each network was trained on the training set using batch training with resilient backpropagation [19]. Different hidden-layer configurations were considered. It was observed that, for each configuration (number of layers), the accuracy of the system on the testing set did not increase with the number of neurons in a hidden layer.

It was also observed that the average accuracy improved as the number of hidden layers increased. Thus, we concluded that, while the accuracy of the system does not increase beyond a certain threshold as neurons are added to a hidden layer, it does improve as hidden layers are added. We attributed this improvement with network depth to the complexity of the DNA strings: networks with more hidden layers were able to recognize higher-order relationships embedded in the DNA string than networks with one or two hidden layers. That is, they were able to learn and classify second- and third-order relationships among the bases, which in turn led to better generalization and more genes being correctly classified.

As a compromise between the size of the neural network and the time required to train it, we concluded that the most suitable configuration in terms of accuracy was a 3-hidden-layer network. The number of neurons in the input layer is equal to the number of parameters presented to the system, in this case determined by the number of nucleotides in one presented sequence. This design was used in each of the four neural networks implementing the encoding methods given in Table 2; the configurations are listed in Table 3. Each network was trained until the mean squared error (MSE) fell below $10^{-7}$. Starting from random weights, ten runs were performed on each network, and the network with the best performance was selected. Each trained network exhibited perfect performance (100% accuracy) on both the positive and negative training data.

Table 3 Neural network configurations
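
As a sketch of what one such classifier might look like, the snippet below builds a 3-hidden-layer network trained full-batch with resilient backpropagation (torch.optim.Rprop). The input width (404 = 101 nucleotides × 4 bits under a one-hot code) and the hidden-layer sizes are hypothetical, since Table 3's configurations are not reproduced here; the tanh output matches the −1/+1 training targets.

```python
import torch
import torch.nn as nn

# Hypothetical configuration: 404 inputs, hidden sizes 64/32/16.
net = nn.Sequential(
    nn.Linear(404, 64), nn.Tanh(),
    nn.Linear(64, 32), nn.Tanh(),
    nn.Linear(32, 16), nn.Tanh(),
    nn.Linear(16, 1), nn.Tanh(),   # output in [-1, 1]
)
optimizer = torch.optim.Rprop(net.parameters())  # resilient backpropagation
loss_fn = nn.MSELoss()

def train(x, y, epochs=1000, tol=1e-7):
    """Full-batch training until the MSE drops below the tolerance."""
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(net(x), y)
        loss.backward()
        optimizer.step()
        if loss.item() < tol:
            break
```
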

Another remarkable observation concerned the values of the weights on the input layer. Figure 6 displays the distribution of the weights between the input layer and the first hidden layer of the neural network trained on the E1 encoding scheme. The distribution corresponds closely to the sequence logo graph generated by Ma et al. [12] (Fig. 3) in terms of where the emphasis on the inputs is largest. This means the network automatically learns the consensus sequences, so we do not need to manually embed the motif information in the system.

Fig. 6

The weights between the first and the second hidden layer in a neural network trained on the E1 encoding scheme

4.5 The probability function

Because the LOP method requires its inputs in probability vector form (see Sect. 3.2), we need to convert the output of each neural network into such a form. The probability function assigns a probability as to whether the given sequence is an E. coli promoter or not; in our case N=2, as we have two classes, 'promoter' and 'non-promoter'. The probability functions used to calculate the probabilities for positive and negative data are defined in (4) and (5), where y is the value of the network output.

$$ Positive(y) = \left\{ \begin{aligned} 0.999 & \quad y \geqslant 1 \\ 0.001 & \quad y \leqslant - 1 \\ \frac{{y + 1}} {2} & \quad {\text{else}} \\ \end{aligned} \right. $$
(4)
$$ Negative(y) = \left\{ \begin{aligned} 0.001 & \quad y \geqslant 1 \\ 0.999 & \quad y \leqslant - 1 \\ \frac{{1 - y}} {2} & \quad {\text{else}} \\ \end{aligned} \right. $$
(5)

If the output of the neural network is 1 or greater, we deem the given sequence to be an E. coli promoter; if it is −1 or less, we conclude that it is not. In these cases we assign a probability of 0.999 rather than 1 (and 0.001 rather than 0) to the 'Yes' or 'No' value because of the logarithm used in the result combiner.
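
A direct transcription of Eqs. (4) and (5); the clipping to 0.999/0.001 keeps the logarithm in the combiner finite.

```python
def positive(y: float) -> float:
    """Eq. (4): probability that the sequence is a promoter, clipped away
    from 0 and 1 so that the logarithm in the combiner stays finite."""
    if y >= 1:
        return 0.999
    if y <= -1:
        return 0.001
    return (y + 1) / 2

def negative(y: float) -> float:
    """Eq. (5): probability of the non-promoter class; positive(y) and
    negative(y) sum to 1 for every y."""
    if y >= 1:
        return 0.001
    if y <= -1:
        return 0.999
    return (1 - y) / 2
```
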

4.6 Result aggregation

The outputs of the four neural networks were combined using the LOP2 method described in Sect. 3.2. The system was tested with three combination methods: classic majority voting, LAP and LOP2. The results showed that the LOP2 method provided a much better recognition rate on the test data, both positive and negative. The comparison between these three aggregation methods is given in Sect. 5.

The LOP2 method was implemented as follows. Let the outputs of the four neural networks be denoted by $O_i$, $1 \le i \le 4$, and let $\alpha_i$, $1 \le i \le 4$, be the four coefficients. The combined predictor is then defined by (6)–(8).

$$ O.Positive = \frac{1} {Z}\exp \left( {\sum\limits_{i = 1}^4 {\alpha _i \left( {\log \left( {O_i .Positive} \right)} \right)} } \right) $$
(6)
$$ O.Negative = \frac{1} {Z}\exp \left( {\sum\limits_{i = 1}^4 {\alpha _i \left( {\log \left( {O_i .Negative} \right)} \right)} } \right) $$
(7)
$$ Z = \exp \left( {\sum\limits_{i = 1}^4 {\alpha _i \left( {\log \left( {O_i .Positive} \right)} \right)} } \right) + \exp \left( {\sum\limits_{i = 1}^4 {\alpha _i \left( {\log \left( {O_i .Negative} \right)} \right)} } \right) $$
(8)

O.Positive is the probability that the given sequence is a promoter, as calculated by the aggregation function of the combined system, and O.Negative is the probability that it is a non-promoter; by construction, O.Positive + O.Negative = 1. The final conclusion on whether the given sequence is an E. coli promoter is reached using (9).

$$ Conclusion(O.Positive,O.Negative) = \left\{ \begin{aligned} & {\text{Yes}}\quad O.Positive > O.Negative \\ & {\text{No}}\quad {\text{Otherwise}} \\ \end{aligned} \right. $$
(9)

So, if O.Positive is greater than O.Negative, the given sequence is classified as a promoter, otherwise as a non-promoter. The methodology used to determine the values of the aggregating coefficients is described in the next section.
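
A self-contained sketch of Eqs. (6)–(9), building the per-network probabilities with Eqs. (4)–(5); the raw outputs and weights in the example are invented for illustration.

```python
import math

def classify(outputs, alphas):
    """Eqs. (6)-(9): combine the four raw network outputs with the LOP2
    weights and answer 'Yes' if the promoter probability dominates."""
    pos = [min(max((y + 1) / 2, 0.001), 0.999) for y in outputs]  # Eq. (4)
    neg = [1 - p for p in pos]                                    # Eq. (5)
    s_pos = math.exp(sum(a * math.log(p) for a, p in zip(alphas, pos)))
    s_neg = math.exp(sum(a * math.log(q) for a, q in zip(alphas, neg)))
    z = s_pos + s_neg                           # normalization, Eq. (8)
    o_positive, o_negative = s_pos / z, s_neg / z      # Eqs. (6)-(7)
    return "Yes" if o_positive > o_negative else "No"  # Eq. (9)

# Hypothetical raw outputs and weights, for illustration only:
print(classify([0.8, -0.2, 0.95, 0.6], alphas=[1.2, -0.3, 0.9, 0.5]))
```
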

4.7 Determination of the aggregating coefficients using genetic algorithms

Genetic algorithms (GAs) are especially effective in searching large and difficult solution spaces where other search methods, such as gradient based methods, fail. The space of coefficients for combining the individual classifiers in our problem, though not extremely large, is well suited to a GA search.

We initialized ten populations, each containing 60 random chromosomes with real values ranging from −10 to 10. Each chromosome was a vector of the four weights used by the aggregating function. Each population was evolved by a genetic algorithm for 60 generations. The six best performing chromosomes of each population were then extracted to create a new population, which was evolved for a further 60 generations. The best performing chromosome of the resulting population was selected to implement the aggregating function.

The fitness function used for chromosome selection was based on the classification error exhibited by the multi-classifier system on the testing dataset, calculated as the percentage of erroneous identifications with respect to the total number of test sequences presented. Thus, the genetic algorithm attempted to minimize the total error displayed by the system on the test data. The algorithm used an elitist replacement scheme in which 90% of each generation was newly created and the six best performing chromosomes were carried over unchanged, together with a crossover probability of 0.7, a mutation probability of 0.1 and roulette-wheel selection.
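
The sketch below outlines one population run of this kind of search: real-valued chromosomes, roulette-wheel selection, crossover, mutation and elitism for the six best. The specific operators (single-point crossover, random-reset mutation) and the fitness wiring are our assumptions; the paper's full procedure additionally evolves ten populations and re-evolves their pooled elites.

```python
import random

def evolve(fitness, pop_size=60, n_genes=4, generations=60,
           low=-10.0, high=10.0, elite=6, p_cross=0.7, p_mut=0.1, seed=0):
    """One GA run over real-valued weight vectors.  `fitness` should
    return a value to maximize, e.g. the classification accuracy of the
    combined system on the test set."""
    rng = random.Random(seed)
    pop = [[rng.uniform(low, high) for _ in range(n_genes)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        nxt = [c[:] for c in scored[:elite]]            # elitism
        weights = [max(fitness(c), 1e-9) for c in pop]  # roulette weights
        while len(nxt) < pop_size:
            a, b = rng.choices(pop, weights=weights, k=2)
            child = a[:]
            if rng.random() < p_cross:                  # single-point crossover
                cut = rng.randrange(1, n_genes)
                child = a[:cut] + b[cut:]
            if rng.random() < p_mut:                    # random-reset mutation
                child[rng.randrange(n_genes)] = rng.uniform(low, high)
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)
```
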

5 Experimental results

5.1 Evaluation of results

To evaluate the predictive accuracy of a promoter-finding program, we compare the promoters predicted by the program with the actual promoters and, from this comparison, calculate accuracy measures over the test sequences.

Let us denote the following counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN):

$N_{\text{TP}}$: the number of promoter sequences predicted as promoters,

$N_{\text{TN}}$: the number of non-promoter sequences predicted as non-promoters,

$N_{\text{FP}}$: the number of non-promoter sequences predicted as promoters,

$N_{\text{FN}}$: the number of promoter sequences predicted as non-promoters.

Then, the measure $S_{\text{n}}$ (10) is the proportion of negative sequences that are correctly rejected, and $S_{\text{p}}$ (11) is the proportion of positive sequences that are correctly predicted as promoters.

$$ S_{\text{n}} = \left( {1 - \frac{{N_{{\text{FP}}} }} {{N_{{\text{NG}}} }}} \right) $$
(10)
$$ S_{\text{p}} = \frac{{N_{{\text{TP}}} }} {{N_{{\text{PO}}} }} $$
(11)

Here, $N_{\text{NG}}$ and $N_{\text{PO}}$ signify the total number of negative and positive test sequences, respectively. Sensitivity and specificity are widely used measures for evaluating the accuracy of gene prediction programs; both $S_{\text{n}}$ and $S_{\text{p}}$ range independently over [0, 1], with perfect prediction occurring only when both equal 1. Another measure used to evaluate the performance of the networks is the precision (12), where C and N denote the number of test sequences correctly classified and the total number of sequences presented, respectively.

$$ P = \frac{C} {N} $$
(12)
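
A direct rendering of Eqs. (10)–(12) at the sequence level; the counts in the example are hypothetical, not the paper's results.

```python
def metrics(n_tp, n_tn, n_fp, n_fn):
    """Eqs. (10)-(12): n_tp + n_fn positive and n_tn + n_fp negative
    sequences were presented."""
    n_po = n_tp + n_fn                  # total positive sequences
    n_ng = n_tn + n_fp                  # total negative sequences
    s_n = 1 - n_fp / n_ng               # Eq. (10)
    s_p = n_tp / n_po                   # Eq. (11)
    p = (n_tp + n_tn) / (n_po + n_ng)   # Eq. (12): correct / total
    return s_n, s_p, p

# Hypothetical counts, for illustration only:
print(metrics(n_tp=150, n_tn=165, n_fp=6, n_fn=9))
```
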

5.2 Performance evaluation of individual classifiers

Each neural network performed perfectly on the training data set, displaying a recognition rate of 100% after training. The test sequences were then run through the four neural networks; the results are listed in Table 4 and summarized in Fig. 7. From these results we calculated the specificity, sensitivity and precision of the four classifiers, tabulated in Table 5, which show that the best performance is exhibited by NNC4 and the worst by NNE2.

Table 4 The performances of the individual neural network classifiers on the testing dataset
Fig. 7

True positives and true negatives recognized by the individual neural networks

Table 5 Comparison of the specificity, sensitivity and precision of the four neural network classifiers

5.3 Performance evaluation of the combined system

The classification performance of the combined system using the LOP2 method, including its specificity, sensitivity and precision, is listed in Tables 6 and 7. These results are compared, in Fig. 8, with those of the individual neural networks. The combination of the four networks clearly provides a better recognition rate in terms of precision, specificity and sensitivity; the performance of the system is thus improved by a considerable margin.

Table 6 The performance of the combined system on the testing dataset
Table 7 The specificity, sensitivity, and precision of the combined system on the testing dataset
Fig. 8

Comparison of the specificity, sensitivity and precision of the four neural networks and the combined system

The outputs of the four neural networks were also combined using majority voting and the LAP method. Table 8 compares the specificity, sensitivity and precision of these two methods and the LOP2 method. We did not run simulations of the original LOP method, owing to the complexity of maintaining the condition that the coefficients always sum to one. The results indicate that, although combining the four networks by any of these methods gives better results than the individual networks alone, the LOP2 method outperforms the other two aggregation methods.

Table 8 Comparison of the specificity, sensitivity and precision using different aggregating methods

Table 9 compares the results we obtained using the approach described in this paper with previous work on the same problem. Our system thus represents a considerable improvement over recent research on the recognition of E. coli promoters.

Table 9 Comparison of our results with previous work on the same problem

6 Conclusion

In this paper, we presented a novel approach to the recognition of E. coli promoters in strings of DNA. The proposed system showed a substantial improvement in the recognition rate of these promoters, specifically in the rejection of non-promoters (true negatives). As a result, our system displays a far higher specificity than the other systems developed thus far.

We consider the reason for this improvement to be twofold. First, we used larger neural networks than had been used previously, which led to better recognition and generalization. Second, we used multiple neural networks, each accepting the same input in a different form. We observed that the false positives and false negatives of one network could be wiped out by the correct classifications of the other networks. Thus, combining the opinions of several classifiers yielded a system that performs much better than its individual components. Our results substantiate Dietterich's conclusion [5].

One of the major obstacles encountered during the design and implementation of the system was the difficulty of finding optimal configurations for the neural networks. Future work on improving this system will involve developing a framework for the optimal design of neural networks geared towards gene and promoter recognition, optimizing both the network configuration and the encoding methods used.