1 Introduction

Recent advances in genome sequencing techniques have provided a wealth of base sequence information, from which the coding and regulatory sequences need to be identified. While experimental as well as in silico tools are available for identifying coding sequences, locating regulatory sequences such as promoters remains a great challenge, and the currently available methods are not very efficient. Promoter identification is essential for two reasons: annotating genomic regions in order to understand genome architecture, and understanding gene regulatory networks. Promoters can be identified on the whole genome scale using experimental techniques such as binding assays, ChIP-chip and ChIP-seq, but these are costly, labor intensive and time consuming. Hence, it may not be feasible to characterize all genomes in detail experimentally. Alternatively, computational methods are available to identify promoters as well as coding regions. Several Promoter Prediction Programs (PPPs) are available, which use different features or statistical models and identify either transcription start sites (TSSs) or promoter regions. In this chapter, we briefly describe the architecture of Eukaryotic promoters and the different kinds of promoter prediction algorithms currently available.

2 Eukaryotic Promoter Architecture

A promoter region is generally defined as any genomic DNA where the transcription machinery assembles and initiates transcription. The promoter region consists of protein binding regions along with the transcription start site (TSS). Promoter architecture in Prokaryotes and Eukaryotes differs in complexity. In Prokaryotes, a single RNA polymerase transcribes all types of RNAs and the promoter regions are characterized by the presence of −35 and −10 elements and, in some cases, the UP element as well. Overall, in Prokaryotes, the regulatory region is located within 100 base pairs relative to the TSS. In Eukaryotes, promoter structure is more complex, with the complexity increasing from single-celled yeast to mammals. Eukaryotes have several different types of RNA polymerases (usually three), each responsible for the production of a different subset of RNAs. RNA polymerase II is responsible for the synthesis of all mRNAs and is well studied compared to the other RNA polymerases. Hence, only features corresponding to promoters of genes transcribed by RNA polymerase II are discussed below.

In Eukaryotes, the promoter regions are broadly classified as core promoters, proximal promoters and distal promoters. The core promoter region, where the actual basal transcription machinery assembles, is 30–100 nucleotides in length. These regions are characterized by the presence of sequence motifs such as the TATA box and the Inr element. They may also contain downstream elements like DPE, MTE (in humans) along with the associated TSS (Juven-Gershon et al. 2008; Thomas and Chiang 2006). The proximal promoter regions are the sequences located within 500 base pairs relative to the TSS and contain certain proximal promoter elements, which include the GC box, the CAAT box, cis-regulatory modules (CRM) (Lenhard and Sandelin 2012), etc. Distal promoter elements include enhancers, insulators and silencers. The distal promoter region does not have a well-defined length and can extend up to 10 kb from the TSS in upstream as well as downstream regions. Distal promoters interact with transcription activators to increase the rate of transcription. In vertebrates, it is known that 5 % of the genes code for specific transcription activators, which interact with proximal and distal promoter regions.

Along with the transcription factor binding elements, mammalian promoter regions also contain CpG islands. In humans, it is known that 60 % of promoters belong to the CpG island-containing class. Figure 4.1 shows a schematic representation of different promoter elements and their activators in Eukaryotes. Recent studies have shown that in Eukaryotes, especially in humans, each promoter is associated with many TSSs, which are spread over 50–100 nucleotides (referred to as transcriptionally active regions) (Carninci et al. 2006). Promoters can also be bidirectional (Xu et al. 2009). For detailed reviews on Eukaryotic promoters refer to Juven-Gershon et al. (2008), Lenhard and Sandelin (2012), Sandelin et al. (2007) and Thomas and Chiang (2006). The current understanding of vertebrate promoters is that, although promoters differ in their motif content (most of them lack consensus motifs) and in their GC content (lower Eukaryotes being AT rich and mammals being GC rich), some properties, such as a nucleosome-free region and epigenetic features around TSSs, are quite common (Valen and Sandelin 2011).
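As a rough illustration of how a CpG island-like region can be recognized from sequence alone, the sketch below computes the GC fraction and the observed/expected CpG ratio of a DNA string. The thresholds (length of at least 200 bp, GC fraction above 0.5, observed/expected ratio above 0.6) are commonly quoted criteria and are an assumption here, not values taken from this chapter; the function names are hypothetical.

```python
def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n
    expected = c * g / n  # expected CpG count if C and G were independent
    obs_exp = cpg / expected if expected > 0 else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5, min_obs_exp: float = 0.6) -> bool:
    """Apply the assumed length, GC and observed/expected thresholds."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp > min_obs_exp


# Toy sequences, for illustration only.
print(looks_like_cpg_island("CG" * 100))  # CpG dense
print(looks_like_cpg_island("AT" * 100))  # no CpG at all
```

In practice such a test would be applied in a sliding window over a genomic sequence rather than to a whole fragment at once.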

Fig. 4.1

A schematic representation of Eukaryotic RNA polymerase II promoter elements and basal transcription machinery. Promoter regions are divided into three classes, namely, core promoters, proximal promoters and distal promoters. Core promoter elements bind to basal transcription factors like TFIID. Proximal and distal promoter elements bind to transcription activators and increase the rate of transcription

3 Experimental Methods of Promoter Identification

Experimental methods for promoter identification and characterization generally identify TSSs or DNA sequences that bind to proteins such as TFs and RNAPII (Lenhard and Sandelin 2012; Sandelin et al. 2007). Earlier methods such as nuclease protection and primer extension identify promoters on a gene-by-gene basis and cannot be used for whole genome promoter identification. Current high-throughput methods measure either the products of transcription (mRNA) or promoter activity across the whole genome. They provide a snapshot of all transcribed regions or DNA-protein interactions in the genome for given experimental conditions. Recent advancements in promoter region identification consist of sequencing methods and hybridization methods (Sandelin et al. 2007). Sequencing methods such as RACE, 5′-tag sequencing and 5′-3′ paired-end sequencing provide information about the mRNA or cDNA sequences. All these methods use reverse transcription to obtain cDNA. The cDNA is then fragmented, and the fragments are amplified and sequenced from the 5′-end. The sequenced fragments are mapped to the genomic DNA sequence to locate the TSS. Hybridization methods, instead of sequencing, use short oligonucleotides to hybridize with target DNA. Two widely used methods are tiling arrays and ChIP-chip, which characterize TSSs and promoter elements respectively. Oligonucleotide tiling arrays are designed to cover contiguous regions of a sequenced genome or sometimes even whole genomes. They can provide information about the whole transcriptome along with the location of TSSs. The ChIP-chip method is an application of tiling arrays to identify protein-bound regions of genomic DNA. It uses chromatin immunoprecipitation (ChIP) to isolate promoter-associated proteins together with their bound DNA, and the bound DNA is then identified using tiling arrays (Sandelin et al. 2007).

4 In silico Methods for Promoter Identification

The computational methods for identification of promoter regions are mostly based on the basic premise that promoter regions have distinct sequences when compared to other genomic regions. Promoter Prediction Programs (PPPs) use experimentally identified promoter regions aligned with respect to TSSs, or transcription factor binding site information from databases (TRANSFAC (Wingender et al. 2000), EPD (Schmid et al. 2004) and DBTSS (Suzuki et al. 2002)) as a training dataset, to derive principles that differentiate promoters from non-promoter regions. PPPs can be broadly classified into three types based on the information used for promoter characterization. They are ab initio, hybrid and homology based algorithms.

Ab initio or de novo methods use only DNA sequence information for promoter identification. Ab initio methods are further classified (as shown in Fig. 4.2) as search-by-signal, search-by-content and search-by-structure algorithms based on features used for modeling (Zeng et al. 2009). Some current algorithms integrate two or more features for efficient promoter prediction.

Fig. 4.2

Classification of Promoter Prediction Programs (PPPs) based on the information used for prediction

Hybrid methods use sequence information with other accessory information such as epigenetic features, nucleosome occupancy and gene expression data. Homology based PPPs use orthologous gene information to identify promoter elements. Here, we will focus on ab initio PPPs in detail and also provide an introduction to other methods. Detailed information on the history, feature selection, model design and performance assessment of these PPPs is available in several excellent reviews (Abeel et al. 2009; Bajic et al. 2004; Bajic et al. 2006; Fickett and Hatzigeorgiou 1997; Ohler and Niemann 2001; Pedersen 1999; Zeng et al. 2009; Zeng 2011).

4.1 Ab initio Methods

Ab initio algorithms use only DNA sequence information to predict promoter regions. They identify either putative TSSs or promoter regions or, in some cases, both. Ab initio methods may use three different kinds of features: biological signals such as core promoter elements and TFBSs, sequence context information such as oligonucleotide composition, or DNA structural features. Along with feature selection, they use different statistical and machine learning methods such as weight matrices (Bucher 1990), artificial neural networks (Reese 2001; Wang and Ungar 2007), Markov chains (Audic and Claverie 1997), quadratic discriminant analysis (Davuluri and Grosse 2001), genetic algorithms (Levitsky and Katokhin 2003), principal component analysis (Li et al. 2008) and kernel methods which employ support vector machines (Abeel et al. 2008b; Gangal and Sharma 2005).

Search-by-signal algorithms search for biological signal features of core promoter elements, for example, the TATA box, the initiator element (Inr), the DPE (Downstream Promoter Element), specific TFBSs and CpG islands (in mammals). Generally, these algorithms either predict core promoter elements or, in some cases, give the TSS position along with the distance between the binding site and the TSS. These models first derive consensus signals from experimentally identified TSSs or promoter elements. They then use different statistical methods such as weight matrices, artificial neural networks and discriminant models to discriminate between promoter regions and their neighbouring sequences. Typical examples of this class of PPPs include PWMs (Bucher 1990), NNPP (Reese 2001), CpGProD (Ponger and Mouchiroud 2002), CpG-promoter (Ioshikhes and Zhang 2000), FirstEF (Davuluri and Grosse 2001) and Eponine (Down and Hubbard 2002). Search-by-signal PPPs are considered to be first generation methods. Earlier published PPPs did not use CpG islands and their prediction efficiency was low, whereas more recent algorithms for predicting promoters in mammalian genomes do use CpG islands (Ioshikhes and Zhang 2000; Ponger and Mouchiroud 2002).
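The weight-matrix idea behind many signal-based predictors can be sketched as follows. This is a minimal illustration, not the implementation of any program named above: the aligned training sites, the uniform background of 0.25 and the pseudocount are all assumptions made for the example.

```python
import math

BASES = "ACGT"


def build_log_odds_pwm(sites, background=0.25, pseudocount=1.0):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pwm


def scan(seq, pwm):
    """Score every window of the sequence; return (start, score) pairs."""
    width = len(pwm)
    return [
        (start, sum(pwm[k][seq[start + k]] for k in range(width)))
        for start in range(len(seq) - width + 1)
    ]


# Hypothetical TATA-like training sites and a toy target sequence.
toy_sites = ["TATAAA", "TATATA", "TATAAA", "TATAAG"]
pwm = build_log_odds_pwm(toy_sites)
hits = scan("GGCGCTATAAAGGCGC", pwm)
best_start, best_score = max(hits, key=lambda h: h[1])
print(best_start, round(best_score, 2))
```

A real predictor would compare each window score with a threshold derived from training data and would usually combine several such signals.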

  1. FirstEF: FirstEF (Davuluri and Grosse 2001), which uses CpG islands, is not a pure promoter prediction program. It identifies first exons along with putative promoter regions (Bucher 1990). The developers of this PPP observed that the CpG distribution in the vicinity of TSSs is bimodal, so first exons fall into two classes: CpG-containing and non-CpG-containing ones. It uses a probabilistic model to identify potential first exons (splice donor sites) for both classes of promoter regions. It considers the upstream promoter region and downstream splice donor sites (GT) and checks whether the intermediate region is an exon or not. The algorithm is optimized to find potential first donor sites along with CpG-related and non-CpG-related promoter regions.

  2. CpGProD: CpGProD (CpG Island Promoter Detection) uses CpG islands to identify mammalian promoter regions in large genomic sequences (Ponger and Mouchiroud 2002). Although it is strictly dedicated to this particular promoter class, which corresponds to 50 % of the genes in humans, it exhibits a higher sensitivity and specificity than the other tools used for promoter prediction.

  3. Eponine: Eponine (Down and Hubbard 2002) is one of the best algorithms and uses sequence motif signals for locating the TSS. It combines weight matrices with discrete probability distributions of differently positioned constraints. The Eponine DNA weight matrix model for any signal is represented by the following equation:

    $$\phi(i;S)=\log\sum_{j=-\infty}^{+\infty}P(j)\,W(a+i+j;S) \tag{4.1}$$

    P(j) is a discrete probability distribution; W(x;S) is the weight matrix score, aligning the first column to position x on sequence S; a is the center position of the distribution, relative to the TSS; and i is the position of the true TSS. These PWM models were chosen for a set of four constraint elements in 599 mammalian promoter regions. They are:

    i. a diffuse preference for CpG enrichment downstream of the TSS.

    ii. a TATAAA motif with a focused distribution centered at position −30 relative to the TSS.

    iii & iv. two GC-rich matrices (GCGCG and GC) closely flanking the TATA box and positioned upstream and downstream respectively (Fig. 4.3).

To derive an efficient model, the model was trained using a relevance vector machine (RVM) algorithm with a Monte Carlo sampling process.
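The positioned-constraint score of Eq. 4.1 can be sketched in a few lines. This is only an illustration under stated assumptions: W(x;S) is taken to be the product of column probabilities, P(j) is a truncated discretized Gaussian, and the matrix values, the offset a = −30 and the toy sequence are hypothetical. It does not reproduce the trained Eponine model.

```python
import math


def weight_matrix_score(seq, x, matrix):
    """W(x; S): product of column probabilities, first column aligned at x."""
    if x < 0 or x + len(matrix) > len(seq):
        return 0.0  # windows falling off the sequence contribute nothing
    score = 1.0
    for k, column in enumerate(matrix):
        score *= column.get(seq[x + k], 0.0)
    return score


def discrete_gaussian(width, half_range):
    """P(j) for j in [-half_range, half_range], normalised to sum to 1."""
    raw = {j: math.exp(-0.5 * (j / width) ** 2)
           for j in range(-half_range, half_range + 1)}
    total = sum(raw.values())
    return {j: v / total for j, v in raw.items()}


def positioned_constraint(seq, i, a, matrix, p):
    """phi(i; S) = log sum_j P(j) * W(a + i + j; S), with finite support for P."""
    total = sum(pj * weight_matrix_score(seq, a + i + j, matrix)
                for j, pj in p.items())
    return math.log(total) if total > 0 else float("-inf")


# Hypothetical TATAAA matrix: 0.85 for the consensus base, 0.05 otherwise.
tata = [{b: (0.85 if b == c else 0.05) for b in "ACGT"} for c in "TATAAA"]
p = discrete_gaussian(width=2.0, half_range=5)

# Toy sequence with TATAAA starting at index 20, so a TSS is expected at 50.
seq = "C" * 20 + "TATAAA" + "C" * 24 + "A" + "C" * 20
scores = {i: positioned_constraint(seq, i, -30, tata, p) for i in range(30, len(seq))}
best_i = max(scores, key=scores.get)
print(best_i)
```

The full model would sum several such terms, one per constraint element, with weights learned during training.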

Fig. 4.3

A schematic representation of the Eponine core promoter model, showing the four constraint element distributions that were used for a weight-matrix consensus (Down and Hubbard 2002).

4.1.1 Search-by-content Algorithms

Search-by-content algorithms are considered to be more advanced than the earlier approaches, as they achieve greater sensitivity and specificity. These algorithms are inspired by linguistics. The basic principle underlying all search-by-content methods is that promoter and non-promoter regions differ in their grammar and can be differentiated using certain threshold values. Context features are generally oligonucleotides represented by a set of k-tuples (or k-mers). Promoter and non-promoter regions differ in their tuple statistics, and this characteristic statistical property of oligonucleotide composition can be used to discriminate promoters from non-promoter regions. Typical examples of PPPs that use this feature include PromFind (Hutchinson 1996), Promoter2.0 (Knudsen 1999), PromoterInspector (Scherf et al. 2000) and PCAHPR (Li et al. 2008). This class of algorithms was shown to be more discriminative than search-by-signal algorithms. These PPPs may differ in their statistical models, but all of them discriminate promoters from non-promoters using k-mer (\(k=2,3,\ldots,6\)) frequencies.
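The k-mer principle described above can be sketched with a simple log-odds table. This is a generic illustration rather than the model of any listed program: the training sequences are toy data, and the pseudocount and the choice of k = 2 are assumptions.

```python
from collections import Counter
from itertools import product
import math


def kmer_frequencies(seqs, k, pseudocount=1.0):
    """Relative frequency of every possible k-mer, smoothed with a pseudocount."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in all_kmers) + pseudocount * len(all_kmers)
    return {m: (counts[m] + pseudocount) / total for m in all_kmers}


def log_odds_table(promoters, background, k):
    """log(f_promoter / f_background) for each k-mer."""
    fp = kmer_frequencies(promoters, k)
    fb = kmer_frequencies(background, k)
    return {m: math.log(fp[m] / fb[m]) for m in fp}


def window_score(window, table, k):
    """Average log-odds of the k-mers in a window; positive favours 'promoter'."""
    n = len(window) - k + 1
    return sum(table.get(window[i:i + k], 0.0) for i in range(n)) / max(n, 1)


# Toy training data, for illustration only.
promoters = ["GCGCGGCGCC", "CCGCGCGGGC"]
background = ["ATATTAATAT", "TTATAAATTA"]
table = log_odds_table(promoters, background, k=2)
print(window_score("GCGCGCGC", table, 2) > 0)
print(window_score("ATATATAT", table, 2) < 0)
```

A window would be called a promoter when its score exceeds a threshold chosen on held-out data.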

  1. PromoterInspector: PromoterInspector uses discriminant functions to identify promoters and was considered the best PPP at one time (Scherf et al. 2000). It was trained using a brute-force algorithm to discover a set of sequence motifs overrepresented in promoter regions. Its models introduce IUPAC words by incorporating wildcards in multiple positions of an oligomer, except at the start and end of words (AGCNGCA, AGCNNGCA). Using a certain threshold, it classifies IUPAC words into promoter-related and non-promoter-related candidates. Using these pre-derived threshold values, PromoterInspector scans the target genome with a sliding window to identify promoter regions. The predictions are not strand-specific and do not provide information about the TSS. This tool was developed for mammalian genomes.
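Matching wildcard-containing words such as AGCNGCA in a sliding window can be sketched as below. This only illustrates the word-matching step and is not PromoterInspector's actual classifier; the word list, window size, step and hit threshold are hypothetical.

```python
import re


def iupac_word_to_regex(word):
    """Turn a word with N wildcards into a regex; lookahead allows overlaps."""
    return re.compile("(?=" + word.replace("N", "[ACGT]") + ")")


def count_word(seq, word):
    """Number of (possibly overlapping) occurrences of the word in seq."""
    return sum(1 for _ in iupac_word_to_regex(word).finditer(seq))


def promoter_like_windows(seq, words, window=20, step=10, min_hits=1):
    """Return (start, end, hits) for windows with at least min_hits word matches."""
    calls = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        hits = sum(count_word(chunk, w) for w in words)
        if hits >= min_hits:
            calls.append((start, start + len(chunk), hits))
    return calls


# Toy sequence with one AGCNGCA match starting at index 20.
toy = "T" * 20 + "AGCTGCA" + "T" * 13
print(promoter_like_windows(toy, ["AGCNGCA", "AGCNNGCA"]))
```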

4.1.2 Search-by-property Algorithms

It is known that DNA structural features play a role in DNA-protein recognition (Pedersen 1998). The biological significance of different DNA structural properties in promoter regions is described in the accompanying chapter 13. These structural features are more conserved than sequence features. Search-by-property algorithms use DNA structural features such as flexibility/bendability, curvature, base stacking and free energy to predict promoter regions. These algorithms are more recent than the methods described above and use one or more structural features to derive principles of learning. Generally, these kinds of models use simple statistical methods (Abeel et al. 2009; Rangannan and Bansal 2010) or advanced machine-learning approaches such as support vector machines (Abeel et al. 2008b) and are applicable across genomes, though genome-based cut-offs may have to be specified. McPromoter (Ohler 2000), Prostar (Goni et al. 2007), EP3 (Abeel et al. 2008a), PromPredict (Rangannan and Bansal 2010) and ProSOM (Abeel et al. 2008b) are examples of these types of methods. Some of these algorithms (Abeel et al. 2008b) cluster sequences using structural profiles and use these clusters to classify unknown sequences into different promoter classes. Others use derived threshold property values to distinguish promoters from non-promoter regions (Abeel et al. 2009; Rangannan and Bansal 2010). If a given genomic sequence has a feature score in a defined window that is greater or smaller (depending on the property) than the pre-derived threshold, it is classified as a promoter. These algorithms generally identify promoter regions rather than giving TSS positions.

  1. PromPredict: PromPredict (Rangannan and Bansal 2010) uses the dinucleotide free energy values obtained from the differential melting stability of DNA duplexes as a predictor of promoters (SantaLucia 1998). The idea behind using DNA duplex stability is that promoter regions should be less stable than neighbouring regions, so that they melt easily at the time of transcription initiation. Compared to other structural features, stability (or base stacking) is found to be the most prevalent feature in the promoter region (Abeel et al. 2008a). Although it was developed for bacterial promoter prediction, it also works well for Eukaryotes (Morey et al. 2011). The program takes an input genome or a fragment of a sequence along with a defined window (100 or 50) and gives the start and end of predicted promoter regions as well as the position of the least stable nucleotide. PromPredict can be applied to any genome and also to fragments of genomic sequences, independent of their size or GC composition.

  2. EP3: EP3 (Abeel et al. 2008a) is similar to PromPredict; it uses a base-stacking property to distinguish promoter regions from other regions. For a given sequence of DNA, it calculates inverted base-stacking values over a window size of 400 base pairs in a non-overlapping fashion and calls a region a promoter when the structural feature value crosses the threshold score, which is genome specific.
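The threshold-on-a-structural-property idea shared by these programs can be sketched with a sliding-window average of dinucleotide free energies. This is a simplified illustration, not the procedure of PromPredict or EP3. The energy values are nearest-neighbour parameters (kcal/mol) as recalled from the SantaLucia (1998) unified set and should be checked against the original table; the window size, step and threshold are hypothetical.

```python
# Nearest-neighbour free energies in kcal/mol (assumed values, see lead-in).
NN_FREE_ENERGY = {
    "AA": -1.00, "TT": -1.00, "AT": -0.88, "TA": -0.58,
    "CA": -1.45, "TG": -1.45, "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28, "GA": -1.30, "TC": -1.30,
    "CG": -2.17, "GC": -2.24, "GG": -1.84, "CC": -1.84,
}


def mean_free_energy(window: str) -> float:
    """Average free energy over all dinucleotide steps in the window."""
    steps = [NN_FREE_ENERGY[window[i:i + 2]] for i in range(len(window) - 1)]
    return sum(steps) / len(steps)


def low_stability_windows(seq, window=100, step=1, threshold=-1.2):
    """Return (start, end, energy) for windows less stable than the threshold.

    Less negative energy means a less stable duplex, so a window is reported
    when its mean energy is greater than the threshold.
    """
    seq = seq.upper()
    calls = []
    for start in range(0, len(seq) - window + 1, step):
        energy = mean_free_energy(seq[start:start + window])
        if energy > threshold:
            calls.append((start, start + window, energy))
    return calls


# Toy sequence: an AT-rich stretch flanked by GC-rich stretches.
toy = "GC" * 10 + "AT" * 10 + "GC" * 10
calls = low_stability_windows(toy, window=20, step=10, threshold=-1.2)
print([(start, end) for start, end, _ in calls])
```

As noted above, a genome-specific threshold would normally replace the fixed value used here.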

4.1.3 Integrated Algorithms

For ab initio promoter prediction, it is important to choose the most discriminatory features along with the discriminative model (statistical model). Some programs integrate different features to achieve better prediction (Zeng et al. 2010). ARTS (Sonnenburg et al. 2006), CoreBoost (Zhao et al. 2007), PromoterExplorer (Xie et al. 2006) and SCS (Zeng et al. 2010) are a few examples of such new-generation algorithms, which use two or more features to predict promoters. Other PPPs, such as MetaProm (Wang and Ungar 2007), integrate many algorithms to predict promoters. The integrated algorithms are generally better discriminators of promoter regions than the algorithms described earlier.

4.2 Hybrid Methods

Hybrid PPPs have been developed very recently. Along with the intrinsic features of promoter sequences, they use experimental information such as gene expression and histone modification data (Wang et al. 2012). CoreBoost_HM (Wang et al. 2009) and a method using ChIP-seq Pol-II enrichment data (Gupta et al. 2010) belong to the class of hybrid PPPs. CoreBoost_HM integrates specific histone modification profiles and DNA sequence features (core promoter elements, TFBSs, flexibility) to predict human Pol II promoters. Similarly, another recent method integrates gene expression data from ChIP-seq and CAGE methods (average and maximum tag counts per million) as well as DNA sequence features (10 sequence composition variables and 22 property variables) to predict promoter regions in humans. Both these methods have outperformed earlier methods in terms of sensitivity and specificity.

4.3 Homology Based

The idea behind using DNA sequence homology for promoter prediction is that, like coding regions, regulatory regions are under evolutionary selective pressure and accumulate relatively few mutations, whereas non-regulatory, non-coding regions can accumulate mutations freely. Phylogenetic footprinting (Fickett and Wasserman 2000) is one of the methods used in this type of PPP. These methods are only applicable to identifying promoter regions of orthologous genes. PromH (Solovyev and Shahmuradov 2003) is one PPP that uses orthologous gene information to predict promoter regions. PromH checks the conservation of TATA boxes in the upstream region, the conservation of nucleotide sequences around the TSS and the conservation of regulatory motifs in the upstream and downstream regions of the TSS, and then uses a discriminator function to identify conserved promoter regions in pairs of orthologous genes. The program was developed specifically for testing human and rodent orthologous pairs. These kinds of algorithms are not applicable to whole genome promoter identification.
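The conservation signal that homology based methods rely on can be illustrated with a windowed identity profile over two sequences that have already been aligned. This is a minimal sketch, not PromH's discriminator function: the alignment, window size, step and identity cut-off are hypothetical, and gap characters are assumed to be written as "-".

```python
def windowed_identity(aln_a, aln_b, window=10, step=5):
    """Fraction of identical, non-gap columns in each alignment window."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for start in range(0, len(aln_a) - window + 1, step):
        pairs = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        profile.append((start, matches / window))
    return profile


def conserved_blocks(aln_a, aln_b, window=10, step=5, min_identity=0.8):
    """Windows whose identity reaches the (assumed) conservation cut-off."""
    return [(start, start + window, ident)
            for start, ident in windowed_identity(aln_a, aln_b, window, step)
            if ident >= min_identity]


# Toy pre-aligned upstream regions from two hypothetical orthologous genes.
a = "ACGTACGTAC" + "TATAAAGGCC"
b = "TTTTTTTTTT" + "TATAAAGGCC"
print(windowed_identity(a, b))
print(conserved_blocks(a, b))
```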

5 Conclusions and Future Perspectives

In silico identification of promoters is a great challenge in computational biology. A large number of promoter prediction programs are available, and they differ in the features used for discriminating promoter regions from the large mass of genome sequence information. Search-by-structure or integrated algorithms appear to be promising as they are applicable to different model systems, whereas hybrid algorithms are generally efficient but are restricted to the systems for which accessory experimental information (such as epigenetic features and CAGE tag counts) is available. With the rapid development of high-throughput technologies, which provide genome-wide information about transcription, our understanding of promoter features is changing.

The current notion about vertebrate promoters is that, while promoter regions differ in their GC and motif content, some common properties are present, such as the nucleosome-free region near the TSS and epigenetic features. Future algorithms can use this information along with other features to design new PPPs. There is always scope for the development of better algorithms based on new features and high-throughput data. Most of the current PPPs focus on promoter regions of protein-coding genes. With the increasing importance of non-coding RNAs in gene regulation, it is essential to analyze their promoters as well, and new algorithms are needed to identify the promoter regions of these non-coding genes. Promoter prediction is required even when experimental promoter data are available, as statistical models are needed to understand and explain promoter architecture. Up- and down-regulation of genes and interactions between genes are carried out through the inherent features of promoter regions. Promoter identification, and the characterization of promoters as weak or strong, can therefore serve as an important input for a better understanding of the systems biology of diverse organisms.