
15.1 Introduction

For the last 15 years, genomic sequence analysis has probably offered the widest variety of problems on pattern statistics. This variety is due to the huge length of the sequences and to their heterogeneous composition and structure, but also to the complexity of the functional motifs. These motifs are involved in fundamental molecular processes like chromosome maintenance or gene transcription, but few of them have been completely identified (i.e. their sequence of letters is known). Moreover, they are rarely conserved through species, leading to a very challenging area of DNA motif discovery. This chapter presents the statistical approach used to predict candidate functional motifs. Indeed, many known functional motifs are characterized by an exceptional behavior of their occurrences. Some of them are extremely frequent along the entire genome (or along a particular DNA strand), others are avoided because their occurrences are lethal for the chromosome, and some are preferred in particular genomic regions. Thus, two main quantities have been widely studied from a probabilistic and statistical point of view: the number of occurrences of a motif in a random sequence and the distances (cumulated or not) between occurrences of a motif. To avoid a huge list of references, we recommend Chapters 6 and 7 from Lothaire (2005) for technical expositions and Robin et al. (2005) for a more applied exposition. In this chapter, we have chosen to present the main statistical results that are really used in practice to help identify functional DNA motifs. Many biological examples will then be given to illustrate the usefulness of the approaches. Most will be devoted to the question of detecting words with an exceptional frequency in a given sequence. The distribution of a word count in Markovian sequences will be studied in Section 15.2. We will also consider the related problem of comparing the exceptionality of a word frequency between two independent sequences. Functional motifs can indeed be specific to known parts of the chromosome (or to some particular chromosomes). In this case, the word occurrences themselves are modeled and a statistical test is derived from the two count processes (Section 15.3). However, when one looks for regions significantly enriched with (or devoid of) a given word, the quantity of interest becomes the distance between occurrences. Section 15.3 also presents results on the distance distribution when the occurrences are modeled by a compound Poisson process. Other results on distances and waiting times can be found in Stefanov (2009) when the sequence is Markovian. Section 15.4 addresses the generalization to more complex patterns, namely degenerate patterns and structured motifs. Finally, we end with some ongoing works and open problems.

15.2 Words with Exceptional Frequency

Many functional DNA motifs are extremely over-represented in complete genomes, or in specific genomic regions, whatever compositional level of the biological sequence is taken into account. This statistical property reveals a strong constraint on the DNA sequence. For instance, if we look for the two most over-represented 9-letter words in the complete genome of the bacterium Haemophilus influenzae (1830140 letters long), we find the two reverse complementary oligonucleotides aagtgcggt and accgcactt, which occur 740 and 731 times, respectively. As an illustration, Table 15.1 gives the expected count of these two words when fitting the composition of the sequence in smaller words.

Table 15.1 Expected counts of aagtgcggt and accgcactt in random sequences having on average the same composition as the H. influenzae complete genome.

These two 9-letter words are well known to biologists: they are the two DNA uptake sequences involved in discriminating self from foreign entering DNA during competence in this bacterium.

Another example is the word gctggtgg, which is the “crossover hotspot instigator” (Chi) motif in the bacterium Escherichia coli and is involved in chromosome maintenance. Chi is among the five most over-represented 8-letter words in the E. coli genome (4638858 letters long). This example will be detailed in Section 15.2.5.

In contrast, many restriction sites (generally 6-letter words) are strongly under-represented along bacterial genomes, which is not surprising because they induce a double-strand break of the bacterial DNA. The aim of this section is precisely to show how to assess the significance of over- and under-representations.

When we want to analyze the distribution of a word along a sequence or when we want to know if a word occurs significantly more often in one sequence compared to another one (Section 15.3), it is relevant to model the occurrences themselves in order to fit the observed frequencies of this word. However, if the problem is precisely to know if a given word occurs in a DNA sequence with a frequency that seems either too low or too high, one needs to compare it to an expected frequency. Usually, we compare the observation with what one would expect in random sequences sharing common properties with the DNA sequence. Under classical sequence models (Section 15.2.1), we can analytically calculate the moments of the count (Section 15.2.2) and sometimes obtain its distribution or some approximations (Section 15.2.3), leading to p-values (Section 15.2.4). We will end this section by presenting how the Chi motif of Staphylococcus aureus was predicted, because of its exceptional frequency, before being experimentally validated [Halpern et al. (2007)].

15.2.1 Sequence models

The commonly used sequence models have the property to fit the letter composition of the observed sequence and more generally its composition in small words of a given length. For instance, it is common to fit the 3-letter word composition of coding DNA sequences because the letters of these sequences are read 3 by 3 by the ribosome, which translates each disjoint triplet into amino acids to form a protein. The most intuitive model is therefore the permutation model (or shuffling model), consisting in shuffling the letters of the observed sequence so that the composition remains exactly the same. Preserving exactly the letter composition is an easy task, but it is more difficult for 2-letter words or longer words, from both algorithmic and probabilistic points of view. In that respect, stationary Markov chains are particularly interesting if one accepts fitting the composition on average rather than exactly. Moreover, if one wants to take some periodicity or a heterogeneous composition along the sequence into account, permutation models become very complicated to manipulate.

In our discussion, we will consider a random sequence \(\mathbf{S} = {X}_{1}{X}_{2}\mathrel{\cdots }{X}_{n}\) on the 4-letter DNA alphabet, i.e. \({X}_{i} \in \mathcal{A} :=\{ \mathtt{a},\mathtt{c},\mathtt{g},\mathtt{t}\}\).

The permutation models assume that random sequences are uniformly drawn from the set \({\mathcal{S}}_{m}\) of sequences having exactly the same counts of words of length 1 up to m as the observed DNA sequence, for a given integer m ≥ 1. The probability of a sequence S is then \(1/\vert {\mathcal{S}}_{m}\vert \). For m = 1 or m = 2, for instance, we have

$$\begin{array}{rcl} \vert {\mathcal{S}}_{1}\vert & =& \frac{n!} {{N}_{\mbox{ obs}}(\text{ a})! \times {N}_{\mbox{ obs}}(\text{ c})! \times {N}_{\mbox{ obs}}(\text{ g})! \times {N}_{\mbox{ obs}}(\text{ t})!} \\ \vert {\mathcal{S}}_{2}\vert & =& \prod _{a\in \mathcal{A}} \frac{{N}_{\mbox{ obs}}(a+)!} { \prod _{b\in \mathcal{A}}{N}_{\mbox{ obs}}(ab)!} \times {H}_{{X}_{n},{X}_{1}}(\mathcal{S}), \\ \end{array}$$

where \({N}_{\mbox{ obs}}(\cdot )\) denotes the count in the observed sequence \({\mathbf{S}}_{\mbox{ obs}}\), \({N}_{\mbox{ obs}}(a+) := \sum \limits_{b}{N}_{\mbox{ obs}}(ab)\) and \({H}_{{X}_{n},{X}_{1}}(\mathcal{S})\) is the cofactor corresponding to row \({X}_{n}\) and column \({X}_{1}\) of the matrix \({\bigl (\mbox{1I}\{a = b\} - {N}_{\mbox{ obs}}(ab)/{N}_{\mbox{ obs}}(a+)\bigr )}_{a,b\in \mathcal{A}}\) [Whittle (1955)]. Note that the constraint for \(\mathbf{S} \in {\mathcal{S}}_{2}\) to have the same letter composition as \({\mathbf{S}}_{\mbox{ obs}}\) is equivalent to starting (resp. ending) with the first (resp. last) letter of \({\mathbf{S}}_{\mbox{ obs}}\). Indeed, we have \({N}_{\mbox{ obs}}(a+) = {N}_{\mbox{ obs}}(a)\) for all \(a \in \mathcal{A}\) except for the last nucleotide of \({\mathbf{S}}_{\mbox{ obs}}\), for which the two counts differ by 1. Knowing the letter composition in addition to the dinucleotide composition therefore determines the last letter \({X}_{n}\) of the sequences \(\mathbf{S} \in {\mathcal{S}}_{2}\). The same reasoning applies to the first letter \({X}_{1}\) by using the numbers \({N}_{\mbox{ obs}}(+b)\) of dinucleotides that end with b.
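
As a small illustration of the combinatorics involved, the following sketch (in Python, with illustrative function names) computes \(\vert {\mathcal{S}}_{1}\vert \) directly from the multinomial formula above; it only assumes a sequence given as a string over the DNA alphabet.

```python
from collections import Counter
from math import factorial

def n_sequences_same_letter_composition(seq):
    """|S_1|: number of sequences sharing exactly the letter counts of seq,
    i.e. n! / prod_a N_obs(a)!  (multinomial coefficient)."""
    counts = Counter(seq)
    size = factorial(len(seq))
    for c in counts.values():
        size //= factorial(c)
    return size

# toy example on the DNA alphabet
print(n_sequences_same_letter_composition("aacgttag"))   # 1680
```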

Working with these permutation models requires a lot of combinatorics.

Let us consider the first order stationary Markov model, denoted by M1. This means that the random letters X i are not independent and satisfy the following Markov property:

$$\mathbb{P}({X}_{i} = b\;\vert \;{X}_{1},{X}_{2}, \ldots ,{X}_{i-1}) = \mathbb{P}({X}_{i} = b\;\vert \;{X}_{i-1}),\quad \forall b \in \mathcal{A}.$$

The transition probabilities will be denoted as follows:

$$\pi (a,b) = \mathbb{P}({X}_{i} = b\;\vert \;{X}_{i-1} = a),\quad \forall a,b \in \mathcal{A};$$

\(\Pi = {(\pi (a,b))}_{a,b\in \mathcal{A}}\) will denote the transition matrix. Moreover, all the \({X}_{i}\)'s have the same distribution, namely the stationary distribution μ, which satisfies the relation μ = μΠ.

The transition probabilities are estimated by their maximum likelihood estimators (MLEs), i.e.

$$\widehat{\pi }(a,b) = \frac{N(ab)} {N(a+)},\quad a,b \in \mathcal{A}, $$
(15.1)

where N( ⋅) denotes the number of occurrences in the sequence \(\mathbf{S} = {X}_{1}{X}_{2}\mathrel{\cdots }{X}_{n}\). Moreover, the letter probability μ(a) is usually estimated by \(\widehat{\mu }(a) = \frac{N(a)} {n}.\)

An important consequence of this estimation is that the plug-in estimator of the expected number of ab in model M1 is approximately equal to the observed count of ab in the DNA sequence. Indeed, we will see in Section 15.2.2 that \(\mathbb{E}[N(ab)] = (n - 1)\mu (a)\pi (a,b)\), which leads to

$$\widehat{\mathbb{E}}[N(ab)] := (n - 1)\widehat{\mu }(a)\widehat{\pi }(a,b) \simeq N(ab).$$

In other words, model M1 fits on average the 2-letter word composition of the observed sequence.
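
The following minimal sketch (Python, hypothetical helper names) estimates the M1 parameters by Equation (15.1) from an observed sequence and checks numerically that the plug-in expected dinucleotide count is close to the observed one, as stated above.

```python
from collections import Counter

def m1_estimates(seq):
    """MLEs of the M1 parameters: pi_hat(a,b) = N(ab)/N(a+), mu_hat(a) = N(a)/n."""
    n = len(seq)
    letters = Counter(seq)
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    mu_hat = {a: letters[a] / n for a in letters}
    pi_hat = {}
    for a in letters:
        n_a_plus = sum(pairs[a + b] for b in letters)        # N(a+)
        pi_hat[a] = {b: (pairs[a + b] / n_a_plus if n_a_plus else 0.0)
                     for b in letters}
    return mu_hat, pi_hat, pairs

seq = "acgtacgtaacggtttacgacgt"
mu_hat, pi_hat, pairs = m1_estimates(seq)
n = len(seq)
# plug-in expected count of "ac" under M1; approximately the observed N(ac)
print((n - 1) * mu_hat["a"] * pi_hat["a"]["c"], pairs["ac"])
```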

Similarly, the stationary m-th order Markov chain model (Mm) fits on average the (m + 1)-letter word composition of the observed sequence. In practice, the choice of the order m of the model Mm is important because it defines the set of reference sequences and, as we will see in Section 15.2.5, this choice often has a strong influence on the statistical results. This influence can already be observed in Table 15.1: the expected counts vary with respect to the chosen model.

Since model Mm on the \(\mathcal{A}\) alphabet can be considered as a model M1 on the larger alphabet \({\mathcal{A}}^{m}\), we will focus on first order Markov chains in this chapter.

The interest in considering phased Markov chains comes from the analysis of coding DNA sequences. Such sequences are split into adjacent 3-letter words called codons, each of which is translated into an amino acid to form a protein. The succession of codons ensures the reading frame for the translation. The nucleotides of a coding DNA sequence are then successively the first letter of a codon, the second letter of a codon, the third letter of a codon, and so on. The phase of a nucleotide is its position with respect to the codons; a letter can then be in three different phases in a coding sequence. The three positions of a codon do not have the same importance. First of all, an amino acid is often determined by the first two letters of a codon according to the genetic code. Moreover, the 3D structure of the protein usually implies constraints on the succession of amino acids. It is therefore important to take the phase of the nucleotides into account when modeling coding DNA sequences.

In a phased Markov chain of order 1, the transition probability from letter a to letter b depends on the phase ϕ ∈ { 1, 2, 3} of the nucleotide b. We then have the three following transition probabilities:

$${\pi }_{\phi }(a,b) = \mathbb{P}({X}_{3i+\phi } = b\:\vert \:{X}_{3i+\phi -1} = a),a,b \in \mathcal{A}.$$

We can also define the distributions \({\mu }_{\phi }\) of the letters on each phase ϕ ∈ { 1, 2, 3}. They satisfy \({\mu }_{1} = {\mu }_{3}{\Pi }_{1}\), \({\mu }_{2} = {\mu }_{1}{\Pi }_{2}\) and \({\mu }_{3} = {\mu }_{2}{\Pi }_{3}\).

When estimating these parameters by the maximum likelihood method, we can fit on average the composition of the coding DNA sequence in ab’s on phase 1, in ab’s on phase 2 and ab’s on phase 3, for all \(a,b \in \mathcal{A}\).

With an appropriate change of alphabet, the phased Markov model on the \(\mathcal{A}\) alphabet can be considered as a model M1 on \(\mathcal{A}\times \{ 1,2,3\}\). It suffices to rewrite the sequence S over the alphabet \(\mathcal{A}\times \{ 1,2,3\}\) by defining \({X}_{i}^{\star } = ({X}_{i},i\text{ modulo 3})\). The transition probability from (a, ϕ′) to (b, ϕ) is then equal to \({\pi }_{\phi }(a,b)\) if \(\phi = \phi ' + 1\) modulo 3, and 0 otherwise.
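
As a sketch of this change of alphabet (Python, illustrative names), each letter is simply paired with its phase; phase 0 is relabeled 3 so that phases live in {1, 2, 3} as above.

```python
def phase_recode(seq):
    """Rewrite the sequence over A x {1,2,3}: X*_i = (X_i, i modulo 3),
    with 1-based positions and phase 0 relabeled as phase 3."""
    return [(x, (i % 3) or 3) for i, x in enumerate(seq, start=1)]

print(phase_recode("atgaaacgc"))
# [('a', 1), ('t', 2), ('g', 3), ('a', 1), ('a', 2), ('a', 3), ...]
```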

Some entire chromosomes have been completely sequenced for several years, and it was quickly noticed that their composition is more or less heterogeneous. There may be many reasons for this heterogeneity: genes are more constrained than intergenic regions because they have to code for functional proteins, bacteria can exchange genomic regions (horizontal transfers) but they all have their own signature in terms of composition, etc. It is thus natural to use heterogeneous Markov models. Usually the heterogeneity is considered like a piecewise homogeneity, i.e. homogeneous regions alternate along the genome. If the heterogeneity is known in advance (for instance genes/intergenic regions), one may then use piecewise homogeneous Markov models. When the aim is precisely to recover the heterogeneous structure, then the most popular models in genome analysis are hidden Markov models. Note that a hidden Markov chain with a hidden state space \(\mathcal{Q}\) and an observation space \(\mathcal{A}\) can be considered as a Markov chain on \(\mathcal{A}\times \mathcal{Q}\).

15.2.2 Mean and variance for the count

The derivation of the expectation and the variance of a word count under the permutation model based on \({\mathcal{S}}_{2}\) can be found in Cowan (1991) and Prum et al. (1995) [see Schbath (1995b) and Robin et al. (2005) for the letter permutation model].

In this section, we assume that the sequence \(\mathbf{S} = {X}_{1}{X}_{2}\mathrel{\cdots }{X}_{n}\) is a first order stationary Markov chain (model M1) with nonzero transition probabilities.

The number of occurrences N(w) of an h-letter word \(\mathbf{w} = {w}_{1}{w}_{2}\mathrel{\cdots }{w}_{h}\) in the sequence \(\mathbf{S} = {X}_{1}{X}_{2}\mathrel{\cdots }{X}_{n}\) can be simply defined by

$$N(\mathbf{w}) = \sum \limits_{i=1}^{n-h+1}{Y }_{ i}(\mathbf{w}), $$
(15.2)

where Y i (w) equals 1 if and only if an occurrence of w starts at position i in the sequence and 0 otherwise. Therefore, to get the mean and variance of the count, we need to study the distribution of the random indicators Y i (w)’s, namely their expectation, variance and covariances.

The position of an occurrence of w is defined by the position of its first letter w 1. We define the random indicator Y i (w) of an occurrence of w at position i, \(1 \leq i \leq n - h + 1\), in S by

$${ Y }_{i}(\mathbf{w}) = \left \{\begin{array}{ll} 1&\text{ if }({X}_{i},{X}_{i+1}, \ldots ,{X}_{i+h-1}) = ({w}_{1},{w}_{2}, \ldots {w}_{h}),\\ 0 &\text{ otherwise.} \end{array} \right.$$

It is a random Bernoulli variable with parameter \(\mathbb{P}({Y }_{i}(\mathbf{w}) = 1)\) given by

$$\begin{array}{rcl} \mathbb{P}({Y }_{i}(\mathbf{w}) = 1)& =& \mathbb{P}({X}_{i} = {w}_{1}, \ldots ,{X}_{i+h-1} = {w}_{h}) \\ & =& \mu ({w}_{1}) \times \pi ({w}_{1},{w}_{2}) \times \mathrel{\cdots } \times \pi ({w}_{h-1},{w}_{h})\end{array}$$

For convenience, μ(w) will denote the probability for the word w to appear at a given position in the sequence. The Y i (w)’s are then Bernoulli variables with expectation μ(w) and variance \(\mu (\mathbf{w})[1 - \mu (\mathbf{w})]\), with

$$\mu (\mathbf{w}) = \mu ({w}_{1}) \times \prod _{j=2}^{h}\pi ({w}_{ j-1},{w}_{j}). $$
(15.3)

However, these random indicators Y i (w) are not independent, not only because the sequence is Markovian, but most importantly because occurrences of a given word may overlap in a sequence. Consequently, their sum over the positions \(i \in \{ 1, \ldots ,n - h + 1\}\) (namely the number of occurrences, or count, of the word) is not distributed according to a binomial distribution.

Occurrences of a given word may overlap in a sequence. For instance, w = aataa occurs four times in the sequence given in Figure 15.1, at positions i = 2, 11, 15 and 18. The third occurrence overlaps both the second and the fourth occurrences, leading to a clump of three overlapping occurrences of aataa starting at position 11.

Fig. 15.1

Four occurrences of aataa in sequence S leading to two clumps of aataa, the first one of size 1 and the second one of size 3.

The overlapping structure of a word can be described by two equivalent quantities: the overlapping indicators or the periods.

The overlapping indicator \({\epsilon }_{u}(\mathbf{w})\), for \(1 \leq u \leq h\), is equal to 1 if two occurrences of w can overlap on u letters, meaning that the last u letters of w are identical to its first u letters, and 0 otherwise:

$${ \epsilon }_{u}(\mathbf{w}) = \left \{\begin{array}{ll} 1&\text{ if }({w}_{h-u+1},{w}_{h-u+2}, \ldots ,{w}_{h}) = ({w}_{1},{w}_{2}, \ldots ,{w}_{u}),\\ 0 &\text{ otherwise.} \end{array} \right.$$

By definition, \({\epsilon }_{h}(\mathbf{w}) = 1\). A non-overlapping word w is such that \({\epsilon }_{u}(\mathbf{w}) = 0\) for all \(1 \leq u \leq h - 1\).

An integer \(p \in \{ 1, \ldots ,h - 1\}\) is said to be a period of w if and only if two occurrences of w can start at a distance p apart (\({\epsilon }_{h-p}(\mathbf{w}) = 1\)). It implies the following periodicity: \({w}_{j} = {w}_{j+p}\) for all \(j \in \{ 1, \ldots ,h - p\}\).

We denote by \(\mathcal{P}(\mathbf{w})\) the set of periods of the word w. For instance, \(\mathcal{P}(\text{ aataataa}) =\{ 3,6,7\}\). Periods that are not a strict multiple of the smallest period are said to be principal since they will be more important, as we will see later. \(\mathcal{P}'(\mathbf{w})\) denotes the set of the principal periods of w; for instance, \(\mathcal{P}'(\text{ aataataa}) =\{ 3,7\}\).
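
A short sketch (Python, illustrative names) of the computation of these period sets; it reproduces \(\mathcal{P}(\text{ aataataa}) =\{ 3,6,7\}\) and \(\mathcal{P}'(\text{ aataataa}) =\{ 3,7\}\).

```python
def periods(w):
    """Set of periods of w: p is a period iff w can overlap itself
    when shifted by p letters, i.e. w[p:] == w[:h-p]."""
    h = len(w)
    return {p for p in range(1, h) if w[p:] == w[:h - p]}

def principal_periods(w):
    """Periods that are not a strict multiple of the smallest period."""
    ps = periods(w)
    if not ps:
        return set()
    p0 = min(ps)
    return {p for p in ps if p == p0 or p % p0 != 0}

print(sorted(periods("aataataa")))            # [3, 6, 7]
print(sorted(principal_periods("aataataa")))  # [3, 7]
```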

In the rest of our discussion, we will use the periods rather than the overlapping indicators because this simplifies formulas. We will denote by w p w the word composed of two overlapping occurrences of w starting at a distance p apart:

$${\mathbf{w}}^{p}\mathbf{w} = {w}_{ 1}\mathrel{\cdots }{w}_{p}{w}_{1}\mathrel{\cdots }{w}_{h}.$$

The variables Y i (w) and Y i + d (w), d > 0, are not independent. Their covariance is defined by

$$\begin{array}{rcl} \mathbb{C}[{Y }_{i}(\mathbf{w}),{Y }_{i+d}(\mathbf{w})]& =& \mathbb{E}[{Y }_{i}(\mathbf{w}) \times {Y }_{i+d}(\mathbf{w})] - \mathbb{E}[{Y }_{i}(\mathbf{w})] \times \mathbb{E}[{Y }_{i+d}(\mathbf{w})] \\ & =& \mathbb{P}({Y }_{i}(\mathbf{w}) = 1,\;{Y }_{i+d}(\mathbf{w}) = 1) - {[\mu (\mathbf{w})]}^{2}. \end{array}$$
(15.4)

To calculate the probability \(\mathbb{P}({Y }_{i}(\mathbf{w}) = 1,\;{Y }_{i+d}(\mathbf{w}) = 1)\), we distinguish two cases: \(1 \leq d < h\) (two overlapping occurrences) and \(d \geq h\) (two disjoint occurrences).

  • The probability that w occurs both at positions i and i + d, 1 ≤ d < h, is different from 0 only if d is a period of w. In this case, it is equal to μ(w d w).

  • The probability that w occurs both at positions i and i + d with \(d \geq h\), i.e. that two disjoint occurrences of w are separated by \(d - h\) letters, is given by \(\mu (\mathbf{w}){\pi }^{d-h+1}({w}_{h},{w}_{1})\mu (\mathbf{w})/\mu ({w}_{1})\), where \({\pi }^{\mathcal{l}}(\cdot ,\cdot )\) denotes the \(\mathcal{l}\)-step transition probability in S.

The covariance between two random indicators of occurrence is thus

$$\mathbb{C}[{Y }_{i}(\mathbf{w}),{Y }_{i+d}(\mathbf{w})] = \left \{\begin{array}{ll} - {[\mu (\mathbf{w})]}^{2} &\text{ if }0 < d < h,\,d\notin \mathcal{P}(\mathbf{w}), \\ \mu ({\mathbf{w}}^{d}\mathbf{w}) - {[\mu (\mathbf{w})]}^{2} &\text{ if }d \in \mathcal{P}(\mathbf{w}), \\ {\left [\mu (\mathbf{w})\right ]}^{2}\left [\frac{{\pi }^{d-h+1}({w}_{h},{w}_{1})} {\mu ({w}_{1})} - 1\right ]&\text{ if }d \geq h. \end{array} \right. $$
(15.5)

Finally, we get the following expression for the expectation and the variance of N(w):

$$\begin{array}{rcl} \mathbb{E}[N(\mathbf{w})]& =& \sum \limits_{i=1}^{n-h+1}\mathbb{E}[{Y }_{ i}(\mathbf{w})] = (n - h + 1)\mu (\mathbf{w})\end{array}$$
(15.6)
$$\begin{array}{rcl} \mathbb{V}[N(\mathbf{w})]& =& \sum \limits_{i=1}^{n-h+1}\mathbb{V}[{Y }_{ i}(\mathbf{w})] + 2 \sum \limits_{i=1}^{n-h+1} \sum \limits_{j=i+1}^{n-h+1}\mathbb{C}[{Y }_{ i}(\mathbf{w}),{Y }_{j}(\mathbf{w})] \\ & =& (n - h + 1)\mu (\mathbf{w}){\bigl (1 - \mu (\mathbf{w})\bigr )} + 2 \sum \limits_{i=1}^{n-h+1} \sum \limits_{d=1}^{n-h-i+1}\mathbb{C}[{Y }_{ i}(\mathbf{w}),{Y }_{i+d}(\mathbf{w})], \\ \end{array}$$
(15.7)

where μ(w) is given by Equation (15.3), and the covariance term is given by Equation (15.5).
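
To make these formulas concrete, here is a minimal sketch (Python with numpy, illustrative names) of Equations (15.3) and (15.5)–(15.7) for a word on the DNA alphabet, assuming the stationary distribution μ and the transition matrix Π are given; the d-step transition probabilities are obtained by successive matrix products.

```python
import numpy as np

ALPHABET = "acgt"
IDX = {a: i for i, a in enumerate(ALPHABET)}

def mu_word(w, mu, Pi):
    """Occurrence probability of w at a given position, Equation (15.3)."""
    p = mu[IDX[w[0]]]
    for x, y in zip(w, w[1:]):
        p *= Pi[IDX[x], IDX[y]]
    return p

def periods(w):
    h = len(w)
    return {p for p in range(1, h) if w[p:] == w[:h - p]}

def count_mean_var(w, n, mu, Pi):
    """Expectation (15.6) and variance (15.7) of N(w) in a stationary M1
    sequence of length n; covariances taken from Equation (15.5)."""
    h = len(w)
    m = n - h + 1                       # number of possible positions
    muw = mu_word(w, mu, Pi)
    pers = periods(w)
    mean = m * muw
    var = m * muw * (1.0 - muw)
    # powers of Pi needed for the disjoint-occurrence terms (d >= h)
    max_d = m - 1
    Pi_pow = {1: Pi.copy()}
    for ell in range(2, max_d - h + 2):
        Pi_pow[ell] = Pi_pow[ell - 1] @ Pi
    for d in range(1, max_d + 1):
        n_pairs = m - d                 # number of pairs (i, i+d)
        if d < h:
            if d in pers:
                cov = mu_word(w[:d] + w, mu, Pi) - muw ** 2   # mu(w^d w)
            else:
                cov = -muw ** 2
        else:
            pi_step = Pi_pow[d - h + 1][IDX[w[-1]], IDX[w[0]]]
            cov = muw ** 2 * (pi_step / mu[IDX[w[0]]] - 1.0)
        var += 2.0 * n_pairs * cov
    return mean, var

# toy example: uniform stationary distribution and uniform transition matrix
mu = np.full(4, 0.25)
Pi = np.full((4, 4), 0.25)
print(count_mean_var("aataa", 1000, mu, Pi))
```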

15.2.3 Word count distribution

We will now focus on the statistical distribution of the count N(w). Several methods have been proposed to derive the exact distribution of N(w) in a sequence of independent letters (model M0) or in model M1. Most of them use pattern matching principles or language theory (see for instance Chapter 7 from Lothaire (2005)). The most probabilistic approach is probably the one that uses the following duality principle: \(\mathbb{P}(N(\mathbf{w}) \geq j) = \mathbb{P}({T}_{j} \leq n)\), where T j denotes the position of the j-th occurrence of the word w along a random sequence S of length n. The distribution of T j can be obtained via the distribution of the distance between two successive occurrences of w [see Robin and Daudin (1999)]. However, all these methods are tedious to implement, with many technical limitations as soon as the sequence is long, or if the order of the Markov model is greater than 1, or if the motif is complex. In practice, approximate distributions are used. In this section, we will present two approximations of the word count distribution that have been theoretically proved under some asymptotic framework: the Gaussian approximation, which is valid if the expected count is far enough from zero, and a compound Poisson approximation, which is adapted for the count of rare and clumping events. The quality of these approximations has been studied in Robin and Schbath (2001) and in Nuel (2006). No theoretical result exists so far on the binomial approximation that would result from neglecting the dependence between the occurrences.

15.2.3.1 Gaussian approximation

Recall that N(w) is a sum of \((n - h + 1)\) random Bernoulli variables Y i (w) with mean μ(w) and variance \(\mu (\mathbf{w})[1 - \mu (\mathbf{w})]\).

If the Bernoulli variables \({Y }_{i}(\mathbf{w})\) were independent, then the classical central limit theorem would ensure the convergence in distribution of the standardized count to a Gaussian variable. But the \({Y }_{i}(\mathbf{w})\)'s are not independent for two reasons: the occurrences of w can overlap, and the letters of the sequence are not independent. Nonetheless, by using a central limit theorem for Markov chains, the asymptotic normality of the count can be established:

$$\frac{N(\mathbf{w}) - \mathbb{E}[N(\mathbf{w})]} {\sqrt{\mathbb{V}[N(\mathbf{w} )]}} \stackrel{\mathcal{D}}{\mathrel{\rightarrow }}\mathcal{N}(0,1)\text{ as }n \rightarrow +\infty. $$
(15.8)

In the previous convergence, both the expectation and variance of the count depend on the model parameters, which are not known in practice. Let us estimate the expected count by its plug-in estimator, i.e. by replacing the transition probabilities π(a, b) by their MLEs \(\widehat{\pi }(a,b) = N(ab)/N(a+)\) and the probability μ(w 1) by \(\widehat{\mu }({w}_{1}) = N({w}_{1})/n\) in Equation (15.6). We then consider the following estimator:

$$\widehat{\mathbb{E}}[N(\mathbf{w})] = \frac{N({w}_{1}{w}_{2}) \times \mathrel{\cdots } \times N({w}_{h-1}{w}_{h})} {N({w}_{2}) \times \mathrel{\cdots } \times N({w}_{h-1})}. $$
(15.9)

Because the estimator \(\widehat{\mathbb{E}}[N(\mathbf{w})]\) is expressed as a function of several asymptotically Gaussian counts, the δ-method ensures that there exists a constant \({v}^{2}(\mathbf{w})\) such that

$$\frac{N(\mathbf{w}) -\widehat{ \mathbb{E}}[N(\mathbf{w})]} {\sqrt{(n - h + 1){v}^{2 } (\mathbf{w} )}}\stackrel{\mathcal{D}}{\mathrel{\rightarrow }}\mathcal{N}(0,1)\text{ as }n \rightarrow +\infty.$$
(15.10)

However, since \(\widehat{\mathbb{E}}[N(\mathbf{w})]\) is random, the variance of \(N(\mathbf{w}) -\widehat{\mathbb{E}}[N(\mathbf{w})]\) is different from \(\mathbb{V}[N(\mathbf{w})]\), and \((n - h + 1){v}^{2}(\mathbf{w})\) should therefore not be confused with \(\mathbb{V}[N(\mathbf{w})]\).

Several approaches have been used to derive the asymptotic variance \((n - h + 1){v}^{2}(\mathbf{w})\). The first one is the δ-method used in Lundstrom (1990): it uses the fact that \({n}^{-1/2}\{N(\mathbf{w}) -\widehat{ \mathbb{E}}[N(\mathbf{w})]\}\) is a function of the asymptotically Gaussian vector \(\bigl (N(\mathbf{w}),N({w}_{1}{w}_{2}), \ldots ,N({w}_{h-1}{w}_{h}),N({w}_{2}), \ldots ,N({w}_{h-1})\bigr )\) from (15.8). However, both the function and the size of this vector depend on the length and on the 2-letter composition of w, so this approach does not give a unified formula for the asymptotic variance.

Prum et al. (1995) proposed a second method: they showed that the estimator \(\widehat{\mathbb{E}}[N(\mathbf{w})]\) is asymptotically equivalent to \(\mathbb{E}[N(\mathbf{w})\:\vert \:{\mathcal{S}}_{2}]\), the expected count of N(w) under the 2-letter word permutation model, and that v 2(w) is the limit of \({n}^{-1}\mathbb{V}[N(\mathbf{w})\:\vert \:{\mathcal{S}}_{2}]\). They obtained

$$\begin{array}{rcl} {v}^{2}(\mathbf{w})& =& \mu (\mathbf{w}) + 2 \sum \limits_{p\in \mathcal{P}(\mathbf{w}),\,p<h-1}\mu ({\mathbf{w}}^{p}\mathbf{w}) \\ & & +\,{[\mu (\mathbf{w})]}^{2}\left [ \sum \limits_{a}\frac{{[{N}_{\mathbf{w}}(a+)]}^{2}} {\mu (a)} - \sum \limits_{a,b}\frac{{[{N}_{\mathbf{w}}(ab)]}^{2}} {\mu (ab)} + \frac{1 - 2{N}_{\mathbf{w}}({w}_{1}+)} {\mu ({w}_{1})} \right ], \\ \end{array}$$
(15.11)

where N w ( ⋅) stands for the count inside the word w. The overlaps of w on two or more letters explicitly appear in this formula (p < h − 1). The overlap on a unique letter is taken into account in the [μ(w)]2 term.

Since model M1 allows more variability than the corresponding permutation model, one expects the variance \((n - h + 1){v}^{2}(\mathbf{w})\) to be smaller than the variance \(\mathbb{V}[N(\mathbf{w})]\). This is not difficult to show in the Bernoulli model (m = 0); for higher models, it has been numerically verified.

Generalizations to m > 1 and to phased models can be found in Schbath et al. (1995) and Schbath (1995b). When \(m = h - 2\), i.e. in the Markov chain model fitting the counts of all the (h − 1)-letter words (we call this model the maximal model regarding the analysis of h-letter words), a third approach can be used to derive the asymptotic variance. This approach is based on martingale theory and provides a simpler expression for the asymptotic variance [see Prum et al. (1995) or Reinert et al. (2000)].

15.2.3.2 Compound Poisson approximation

Poisson approximations can also be used for the count of rare events, i.e. when \(\mathbb{E}[N(\mathbf{w})] = O(1)\). Note that this condition implies that \(\log n = O(h)\) (long enough words). In this section, we will assume the rare event condition but also assume that \(h = o(n)\).

A nice method to establish Poisson approximations of counts is the Chen–Stein method [see Arratia et al. (1990) for an introduction and Barbour et al. (1992b) for a more general presentation]. This method gives a bound on the total variation distance between the distribution of a sum of dependent Bernoulli variables and the Poisson distribution with the same expectation. The lower the dependence, the better the Poisson approximation quality. Unfortunately, the local dependence between occurrences of an overlapping word w is too strong, and a Poisson approximation of the distribution of N(w) generally does not hold. One can clearly show that the bound provided by the Chen–Stein method does not converge to zero [it is of order \(\mu ({\mathbf{w}}^{{p}_{0}}\mathbf{w})\) with \({p}_{0}\) the minimal period of w, see Schbath (1995a)]. But one can also show that a geometric distribution (discrete version of the exponential distribution) does not fit the distribution of the distance between two successive occurrences of an overlapping word [Robin and Daudin (1999)].

The solution is to take advantage of the clump structure (clumps do not overlap) and to use the following relations between the number of occurrences N(w) and the clumps (size and count). Indeed we have

$$N(\mathbf{w}) = \sum \limits_{i=1}^{\widetilde{N}(\mathbf{w})}{K}_{ i}(\mathbf{w}), $$
(15.12)

where \(\widetilde{N}(\mathbf{w})\) is the number of clumps of w and \({K}_{i}(\mathbf{w})\) is the size of the i-th clump, but we also have

$$N(\mathbf{w}) = \sum \limits_{k>0}k\widetilde{{N}}_{k}(\mathbf{w}), $$
(15.13)

where \(\widetilde{{N}}_{k}(\mathbf{w})\) is the number of clumps of w of size k in S. Since a compound Poisson variable is defined either as \(\sum \limits_{k>0}k{Z}_{k}\) where the \({Z}_{k}\)'s are independent Poisson variables, or as \(\sum \limits_{i=1}^{Z}{C}_{i}\) with Z a Poisson variable and the \({C}_{i}\)'s independent and identically distributed (i.i.d.) variables, the Poisson approximation of the number of clumps (of any size or of size k) is the core of the compound Poisson approximation of the word count. In the remainder of this section, we will explicitly define the clumps and give some of their probabilistic properties.

A clump of a word w in a sequence S is a maximal succession of overlapping occurrences of w. The size of a clump is the number of occurrences of w of which the clump is composed. For instance, in Figure 15.1, there are two clumps of aataa: one of size 1 starting at position 2, the other one of size 3 starting at position 11. The position of a clump of w in the sequence is defined by the position (start) of the first occurrence of w in the clump. Let us define \(\widetilde{{Y }}_{i}(\mathbf{w})\) as the random indicator that a clump of w starts at position i in S. A clump of w occurs at position i if and only if an occurrence of w occurs at position i without overlapping a previous occurrence of w. Therefore, if we neglect end effects (i.e. when i < h), we can write

$$\widetilde{{Y }}_{i}(\mathbf{w}) = {Y }_{i}(\mathbf{w})[1 - {Y }_{i-1}(\mathbf{w})] \times \mathrel{\cdots } \times [1 - {Y }_{i-h+1}(\mathbf{w})]. $$
(15.14)

(End effects are corrected by considering an infinite sequence.) Now, an occurrence of w which overlaps a previous occurrence of w is necessarily preceded by a prefix \({w}_{1}\mathrel{\cdots }{w}_{p}\) of w, where p is a period of w. If we restrict ourselves to principal periods, this is a necessary and sufficient condition [Schbath (1995a)]. For instance, an occurrence of aataataa overlaps a previous occurrence of aataataa if and only if it is preceded either by aat (prefix of size 3) or by aataata (prefix of size 7). If it were preceded by aataat (prefix of size 6), it would also be preceded by aat.

Therefore, we have

$$\widetilde{{Y }}_{i}(\mathbf{w}) = \prod _{p\in \mathcal{P}'(\mathbf{w})}[1 - {Y }_{i-p}({w}_{1}\mathrel{\cdots }{w}_{p})] \times {Y }_{i}(\mathbf{w}).$$

Let us denote by \(\widetilde{\mu }(\mathbf{w})\) the probability that a clump of w occurs at a given position, i.e. \(\widetilde{\mu }(\mathbf{w}) = \mathbb{E}[\widetilde{{Y }}_{i}(\mathbf{w})]\). The previous equation gives

$$\widetilde{\mu }(\mathbf{w}) = [1 - a(\mathbf{w})] \times \mu (\mathbf{w}), $$
(15.15)

where a(w) is the probability that an occurrence of w overlaps a previous occurrence of w and is given by

$$a(\mathbf{w}) = \sum \limits_{p\in \mathcal{P}'(\mathbf{w})} \prod _{j=1}^{p}\pi ({w}_{ j},{w}_{j+1}). $$
(15.16)

Symmetrically, the probability that an occurrence of w overlaps a next occurrence of w is also equal to a(w). Therefore, a(w) will be simply called the probability of self-overlap of w. Note that a(w) = 0 if and only if w is a non-overlapping word (we assumed that all transition probabilities were nonzero). In that case we also have \(\widetilde{{Y }}_{i}(\mathbf{w}) = {Y }_{i}(\mathbf{w})\) and \(\widetilde{\mu }(\mathbf{w}) = \mu (\mathbf{w})\).
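
A minimal sketch of Equations (15.15) and (15.16) (Python, illustrative names), assuming a transition matrix with nonzero entries given as a dictionary of dictionaries keyed by letters; it reuses the period computation sketched earlier but is kept self-contained here.

```python
def periods(w):
    return {p for p in range(1, len(w)) if w[p:] == w[:len(w) - p]}

def principal_periods(w):
    ps = periods(w)
    if not ps:
        return set()
    p0 = min(ps)
    return {p for p in ps if p == p0 or p % p0 != 0}

def overlap_probability(w, pi):
    """a(w), Equation (15.16): sum over principal periods p of
    pi(w_1,w_2) ... pi(w_p,w_{p+1})."""
    a = 0.0
    for p in principal_periods(w):
        prod = 1.0
        for j in range(p):
            prod *= pi[w[j]][w[j + 1]]
        a += prod
    return a

def clump_probability(w, mu, pi):
    """mu_tilde(w) = [1 - a(w)] mu(w), Equation (15.15)."""
    muw = mu[w[0]]
    for x, y in zip(w, w[1:]):
        muw *= pi[x][y]
    return (1.0 - overlap_probability(w, pi)) * muw

# toy example with uniform transition probabilities:
pi = {a: {b: 0.25 for b in "acgt"} for a in "acgt"}
mu = {a: 0.25 for a in "acgt"}
print(overlap_probability("aataa", pi))   # 0.25**3 + 0.25**4 here
print(clump_probability("aataa", mu, pi))
```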

Let us define the number of clumps of w by \(\widetilde{N}(\mathbf{w}) := \sum \limits_{i=1}^{n-h+1}\widetilde{{Y }}_{i}(\mathbf{w})\). The mean number of clumps is then equal to \((n - h + 1)\widetilde{\mu }(\mathbf{w}) = [1 - a(\mathbf{w})]\mathbb{E}[N(\mathbf{w})]\) from (15.15). The Poisson approximation of \(\widetilde{N}(\mathbf{w})\) follows from a direct application of the Chen–Stein method to the Bernoulli variables \(\widetilde{{Y }}_{i}(\mathbf{w})\) [Schbath (1995a)]. The error bound is indeed of order \(({\rho }^{h} + h\mu (\mathbf{w}))\), where 0 < ρ < 1 is the second largest eigenvalue (in modulus) of the transition matrix Π. Recall that \(n\mu (\mathbf{w}) = O(1)\) from the rare event condition and that \(h = o(n)\).

The exact distribution of the number of clumps of w in model M1 has recently been derived through its generating function [Stefanov et al. (2007)] and compared to the Poisson distribution; the conclusion was that the smaller the expected count of the word, the better the Poisson approximation.

A clump is of size k if and only if the first occurrence of w in the clump overlaps a second occurrence from the right (probability a(w)), the second occurrence of w in the clump overlaps a third occurrence (probability a(w)), …, the (k − 1)-th occurrence overlaps a k-th occurrence of w (probability a(w)), and this k-th occurrence of w does not overlap a next occurrence (probability 1 − a(w)). Thus, if we denote by K i (w) the size of the i-th clump of w in the sequence, the random variable K i (w) is geometrically distributed:

$$\mathbb{P}({K}_{i}(\mathbf{w}) = k) = [1 - a(\mathbf{w})] \times {[a(\mathbf{w})]}^{(k-1)}. $$
(15.17)

As previously stated, the Poisson approximations of the number of clumps of any size, and more particularly of size k for k ≥ 1, are the key ingredients for the compound Poisson approximation of N(w). Indeed, let us denote by \(\mathcal{C}\mathcal{P}({\lambda }_{k},k \geq 1)\) the compound Poisson distribution of \(\sum \limits_{k>0}k{Z}_{k}\) with \({Z}_{k} \sim \mathcal{P}({\lambda }_{k})\). Since \(N(\mathbf{w}) = \sum \limits_{k>0}k\widetilde{{N}}_{k}(\mathbf{w})\), the total variation distance properties give

$${d}_{\text{ TV}}(\mathcal{L}(N(\mathbf{w})),\mathcal{C}\mathcal{P}(\mathbb{E}[\widetilde{{N}}_{k}(\mathbf{w})],k \geq 1)) \leq {d}_{\text{ TV}}(\mathcal{L}(\widetilde{{N}}_{k}(\mathbf{w}),k \geq 1),\otimes \mathcal{P}(\mathbb{E}[\widetilde{{N}}_{k}(\mathbf{w})])).$$

The joint Poisson approximation of \((\widetilde{{N}}_{k}(\mathbf{w}),k \geq 1)\) is more involved to obtain than the one for \(\widetilde{N}(\mathbf{w})\) [Schbath (1995a)], but the error bound is of the same order and

$$\mathbb{E}[\widetilde{{N}}_{k}(\mathbf{w})] = {[1 - a(\mathbf{w})]}^{2}{[a(\mathbf{w})]}^{(k-1)}\mathbb{E}[N(\mathbf{w})].$$

The above formula means that the limiting compound Poisson distribution \(\mathcal{C}\mathcal{P}(\mathbb{E}[\widetilde{{N}}_{k}(\mathbf{w})],k \geq 1)\) is in fact a Pólya–Aeppli distribution (also called a geometric-Poisson distribution) with parameters \((\mathbb{E}[\widetilde{N}(\mathbf{w})],a(\mathbf{w}))\) [Johnson et al. (1992)].
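
As a sketch (Python, illustrative names, not the algorithm of Nuel (2008) mentioned later), the geometric-Poisson probabilities can be computed directly by summing over the number of clumps, since the total size of c geometric clumps follows a shifted negative binomial distribution; the parameters are the expected number of clumps λ and the overlap probability a, with 0 < a < 1 assumed.

```python
from math import exp, log, lgamma

def polya_aeppli_pmf(n, lam, a):
    """P(N = n) for the geometric-Poisson (Polya-Aeppli) distribution:
    N = K_1 + ... + K_Z, Z ~ Poisson(lam), P(K = k) = (1-a) a^(k-1), k >= 1.
    Assumes 0 < a < 1 (for a = 0 the count N is simply Poisson(lam))."""
    if n == 0:
        return exp(-lam)
    total = 0.0
    for c in range(1, n + 1):                      # c = number of clumps
        log_poisson = -lam + c * log(lam) - lgamma(c + 1)
        # P(K_1 + ... + K_c = n) = C(n-1, c-1) (1-a)^c a^(n-c)
        log_nbinom = (lgamma(n) - lgamma(c) - lgamma(n - c + 1)
                      + c * log(1.0 - a) + (n - c) * log(a))
        total += exp(log_poisson + log_nbinom)
    return total

def polya_aeppli_tail(n_obs, lam, a):
    """P(N >= n_obs), computed as 1 minus the cumulated pmf."""
    return 1.0 - sum(polya_aeppli_pmf(n, lam, a) for n in range(n_obs))

# hypothetical parameters: 2 expected clumps, overlap probability 0.25
print(polya_aeppli_tail(8, lam=2.0, a=0.25))
```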

Direct compound Poisson approximation methods exist and can be alternatively applied to the word count [Erhardsson (1999), Erhardsson (2000)]. Their advantage is that they provide better error bounds, but they give the same limiting compound Poisson distribution as above [see Lothaire (2005), Chapter 6].

As in the Gaussian approximation, the generalization to the phased Markov model of order 1 is done by rewriting the sequence with the new alphabet \(\mathcal{A}\times \{ 1,2,3\}\) (see Section 15.2.1). However, note that the occurrence of a single word w in sequence S corresponds to the occurrence of a word family composed of three phased words in the new sequence. Therefore, one has to use the compound Poisson approximation for the count of a set of words in M1 (see Section 15.4.1).

When one changes the alphabet (see Section 15.2.1) to generalize the compound Poisson approximation in model M1 to model Mm, m > 1, one must be very careful with the word overlaps. Indeed, there is no one-to-one transformation between clumps of w in S and clumps of \({\mathbf{w}}^{\star }\) (word w written on \({\mathcal{A}}^{m}\)) in the new sequence \({\mathbf{S}}^{\star }\). Let us take an example with m = 2. Set \(\mathbf{w} = \text{ aataa}\) and let S be the following sequence on the \(\mathcal{A}\) alphabet:

$$\mathbf{S} = \text{ g}\underline{\text{ aataa}}\text{ tgag}\underline{\text{ aataaataataa}}\text{ g}.$$

S contains four occurrences of w and two clumps of w (one of size 1, the other one of size 3). Now, we write the word and the sequence in the new alphabet \({\mathcal{A}}^{2}\). For this, we set ga = γ, aa = α, at = β, ta = τ, tg = δ, ag = κ. We have

$${\mathbf{w}}^{\star } = \alpha \beta \tau \alpha \quad \text{ and }\quad {\mathbf{S}}^{\star } = \gamma \underline{\alpha \beta \tau \alpha }\beta \delta \gamma \kappa \gamma \underline{\alpha \beta \tau \alpha }\,\underline{\alpha \beta \tau \alpha \beta \tau \alpha }\kappa.$$

We can see that the word \({\mathbf{w}}^{\star }\) still appears four times in the sequence \({\mathbf{S}}^{\star }\) (N(w) is equal to the count of \({\mathbf{w}}^{\star }\) in \({\mathbf{S}}^{\star }\)) but there are now three clumps of \({\mathbf{w}}^{\star }\) in \({\mathbf{S}}^{\star }\) (two of size 1 and one of size 2). This is due to the fact that \({\mathbf{w}}^{\star }\) has just one unique period (\(\mathcal{P}(\alpha \beta \tau \alpha ) =\{ 3\}\)), whereas w has two periods (\(\mathcal{P}(\text{ aataa}) =\{ 3,4\}\)). Therefore, when the results for the word \({\mathbf{w}}^{\star }\) in M1 are “translated” into the alphabet \(\mathcal{A}\), some overlaps will not appear explicitly in the formulas. In Mm, only the overlaps on m letters or more will be taken into account since the principal periods of \({\mathbf{w}}^{\star }\) are the periods of w that are less than or equal to \((h - m)\). The word \({\mathbf{w}}^{\star }\) is non-overlapping as soon as w is not sufficiently self-overlapping.

15.2.4 p-values and scores of exceptionality

The significance of the over-representation of a word w in a given DNA sequence is measured by the p-value p(w):

$$p(\mathbf{w}) = \mathbb{P}\{N(\mathbf{w}) \geq {N}_{\text{ obs}}(\mathbf{w})\},$$

where \({N}_{\text{ obs}}(\mathbf{w})\) is the observed count of w in the DNA sequence. If p(w) is close to 0, then the word is exceptionally frequent: it is very unlikely to observe it so many times in random sequences. On the other hand, the significance of an under-representation is measured by the p-value \(p'(\mathbf{w}) = \mathbb{P}\{N(\mathbf{w}) \leq {N}_{\text{ obs}}(\mathbf{w})\}\). If p′(w) is close to 0, then w is exceptionally rare under the model: it is very unlikely for w to occur so rarely in random sequences. Since the exact distribution of the count N(w) is rarely available in practice, approximate p-values are calculated to detect exceptional words and are usually converted into scores of exceptionality.

A natural way of approximating p-values is to use an approximate distribution of N(w); for instance, a Gaussian distribution for highly expected words or a compound Poisson distribution for rarely expected words, as we have seen in Section 15.2.3. Calculating approximate p-values only requires us to compute the tail of the Gaussian or compound Poisson distribution. An efficient algorithm to compute tails of geometric-Poisson distributions has been proposed by Nuel (2008).

For exceptional words, i.e. words whose count strongly deviates from what is expected, large deviation theory is probably the most accurate way to approximate p-values. This approach has been studied in Nuel (2004). Since it requires sophisticated numerical analysis and longer computation times, this method should be restricted to the most exceptional words (filtered from Gaussian or compound Poisson approximations for instance).

In practice, it is often more convenient to manipulate scores from \(\mathbb{R}\) than probabilities of the form \(p(\mathbf{w}) = \mathbb{P}\{N(\mathbf{w}) \geq {N}_{\text{ obs}}(\mathbf{w})\}\), especially when the probabilities of interest are very close to 0 or very close to 1. For symmetry reasons, we prefer to use the probit transformation rather than the − log transformation. Therefore, to each probability p(w) we associate the score u(w) such that

$$\mathbb{P}\{\mathcal{N}(0,1) \geq u(\mathbf{w})\} = p(\mathbf{w}).$$

Therefore, words with a high positive score are exceptionally frequent, whereas words with a negative score of large absolute value are exceptionally rare in the observed sequence.

The Gaussian approximation of N(w) has a great practical advantage: it allows us to directly calculate the score of exceptionality u(w) without calculating the associated p-value. Indeed, if we set

$$u(\mathbf{w}) = \frac{N(\mathbf{w}) -\widehat{ \mathbb{E}}[N(\mathbf{w})]} {\sqrt{\widehat{{\sigma }}^{2 } (\mathbf{w} )}} , $$
(15.18)

where \(\widehat{\mathbb{E}}[N(\mathbf{w})]\) is the estimator of the expected count given by Equation (15.9), and \(\widehat{{\sigma }}^{2}(\mathbf{w})\) is a plug-in estimator of \((n - h + 1){v}^{2}(\mathbf{w})\) (cf. Equation (15.11)), namely

$$\begin{array}{rcl} \widehat{{\sigma }}^{2}(\mathbf{w})& =& \widehat{\mathbb{E}}[N(\mathbf{w})] + 2 \sum \limits_{p\in \mathcal{P}(\mathbf{w}),\,p<h-1}\widehat{\mathbb{E}}[N({\mathbf{w}}^{p}\mathbf{w})] \\ & & +\,\{\widehat{\mathbb{E}}{[N(\mathbf{w})]\}}^{2}\left [ \sum \limits_{a}\frac{{[{N}_{\mathbf{w}}(a+)]}^{2}} {N(a)} - \sum \limits_{a,b}\frac{{[{N}_{\mathbf{w}}(ab)]}^{2}} {N(ab)} + \frac{1 - 2{N}_{\mathbf{w}}({w}_{1}+)} {N({w}_{1})} \right ], \\ \end{array}$$
(15.19)

then we have

$$\mathbb{P}\{N(\mathbf{w}) \geq {N}_{\text{ obs}}(\mathbf{w})\} \simeq \mathbb{P}\{\mathcal{N}(0,1) \geq u(\mathbf{w})\}.$$
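
To make the computation concrete, here is a minimal sketch (Python with scipy, illustrative names) of the plug-in estimator (15.9) and of the score/p-value correspondence above; the variance estimate \(\widehat{{\sigma }}^{2}(\mathbf{w})\) of Equation (15.19) is assumed to be available and is passed here as a hypothetical value.

```python
from collections import Counter
from scipy.stats import norm

def expected_count_m1(w, seq):
    """Plug-in estimator of E[N(w)] in model M1, Equation (15.9):
    product of observed 2-letter counts over product of inner 1-letter counts."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    letters = Counter(seq)
    num = 1.0
    for x, y in zip(w, w[1:]):
        num *= pairs[x + y]
    den = 1.0
    for x in w[1:-1]:
        den *= letters[x]
    return num / den

def exceptionality_score(n_obs, expected, sigma2):
    """Score u(w) of Equation (15.18) and the Gaussian approximation
    of the over-representation p-value P{N(w) >= N_obs(w)}."""
    u = (n_obs - expected) / sigma2 ** 0.5
    return u, norm.sf(u)

seq = "acgtgctggtggttacgctggtggca"
w = "gctggtgg"
n_obs = sum(seq[i:i + len(w)] == w for i in range(len(seq) - len(w) + 1))
# sigma2 = 0.5 is a hypothetical value standing in for Equation (15.19)
print(exceptionality_score(n_obs, expected_count_m1(w, seq), sigma2=0.5))
```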

15.2.5 Example of DNA motif discovery

Chi motifs have been identified in several bacterial genomes, and they are not conserved through species. Their identification in a new species is still a challenge. They are involved in the repair of double-strand DNA breaks by homologous recombination. More precisely, they interact specifically with an enzyme that travels along the DNA and degrades it (exonuclease activity). When the enzyme encounters a Chi site, its exonuclease activity is strongly reduced and altered, but it still continues to separate the two DNA strands, forming a substrate for homologous pairing and repair of the deleted DNA parts. Since Chi motifs protect the bacterial genome from degradation and stimulate its repair, it seems important that these motifs appear as frequently as possible along the bacterial genome. Biologists expect them to be significantly over-represented.

Moreover, Chi activity is strongly orientation dependent. The Chi motif is only recognized when the enzyme enters a double-strand DNA molecule from the right side of the motif. In many bacteria for which the Chi motif has been identified, the Chi orientation is correlated with the direction of DNA replication, meaning that it occurs preferentially on the leading strand [El Karoui et al. (1999), Halpern et al. (2007)]. The over-representation of Chi should then be important on the leading strands. Biologists classically measure the strand asymmetry of a motif by calculating its skew. The skew of a motif w is simply the ratio \(N(\mathbf{w})/N(\overline{\mathbf{w}})\), where \(\overline{\mathbf{w}}\) is the reverse complement of the word w; in other words, \(N(\overline{\mathbf{w}})\) is simply the count of w in the complementary strand. Therefore, biologists expect Chi to be relatively skewed, i.e. with a skew much greater than one.
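
A minimal sketch (Python, illustrative names) of the skew computation; the count on the complementary strand is obtained, as stated above, by counting the reverse complement on the given strand.

```python
COMPLEMENT = {"a": "t", "c": "g", "g": "c", "t": "a"}

def reverse_complement(w):
    """Reverse complement of a DNA word."""
    return "".join(COMPLEMENT[x] for x in reversed(w))

def skew(seq, w):
    """Skew N(w) / N(w_bar), the classical measure of strand asymmetry."""
    def count(word):
        return sum(seq[i:i + len(word)] == word
                   for i in range(len(seq) - len(word) + 1))
    return count(w) / count(reverse_complement(w))

print(reverse_complement("gctggtgg"))   # ccaccagc
```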

The Chi motif of E. coli has been known for a long time: it is the 8-letter word gctggtgg. If we study the statistical properties of the Chi frequency along the E. coli genome, we note some significant characteristics. First of all, its 762 occurrences in the complete genome (concatenation of both leading strands, \(n = 4.6\,1{0}^{6}\)) are significantly high whatever model we choose. In other words, its high frequency cannot be explained by the genome composition. As we can see in Table 15.2, Chi has very high over-representation scores and is always among the five most exceptionally frequent 8-letter words. Second, if we restrict the analysis to the E. coli backbone (\(n = 3.7\,1{0}^{6}\)), Chi becomes the most exceptionally frequent 8-letter word in five models, especially in the maximal model M6 (see Table 15.2). Analyzing only the backbone therefore seems to reduce the noise produced by the regions which are either highly variable or specific to one or a few strains (mobile elements). Indeed, there is a priori no biological reason for Chi to occur in such regions.

Table 15.2 Statistics of gctggtgg in the complete genome (left) and in the backbone genome (right) of E. coli K12 under various models Mm. The rank is obtained by sorting the 65,536 scores in decreasing order.

The choice of the model does not seem to affect the significance of the Chi frequency (it is always exceptional), but this is not the general picture. Note that, when the order of the Markov model increases, the model better fits the sequence composition and fewer exceptional words are found. This is illustrated by the boxplots of Figure 15.2. Moreover, in a high order model we have a more accurate knowledge of the sequence composition than in a low order model: the significance of a word frequency then has no reason to be the same. This point is illustrated by the plot of Figure 15.2 which compares scores in models M1 and M6. We recognize the Chi motif, which is clearly outside the cloud, but let us take the case of the word ggcgctgg. It occurs 761 times in the E. coli backbone, and it has a significantly high score of 62.4 in model M1 (it is the second most exceptional word) but a score of 0.8 in model M6 (rank 17100). It simply means that its high frequency can be explained by the composition of 7-letter words; indeed, it is expected about 749 times in M6.

Fig. 15.2

Exceptionality scores for the 65,536 8-letter words in the E. coli backbone. Left: Boxplots of the scores under models M0 to M6. Right: Scores under models M1 (x-axis) and M6 (y-axis).

The third characteristic of Chi in the E. coli backbone is that it is significantly skewed. Its skew is equal to 3.20, and the method described in Section 15.4.1 to assess skew significance gives a score of 6.53 in M6 (p-value of \(3.3\,1{0}^{-11}\)).

We will describe here the strategy used in Halpern et al. (2007) to identify the Chi motif in the bacterium S. aureus. The first step was to extract the backbone of the S. aureus genome by comparing the genomes of six strains of the bacterium. The obtained backbone contains about \(2.44\,1{0}^{6}\) letters.

The second step was to search for motifs which are frequent enough, exceptionally frequent and relatively skewed. They started by analyzing 8-letter words (as for E. coli) but none of the most over-represented and skewed motifs were frequent enough to be retained as potential Chi candidates. They thus focused on 7-letter words. Scores of exceptionality were calculated with the Gaussian approximation and in the maximal model, namely model M5. Six motifs have an exceptionality score greater than 11 (see Table 15.3 or Figure 15.3 for a global view). Two of them have a negative skew score, so they were not retained. A biological experiment was then performed to test for S. aureus Chi activity of the four candidates: gaaaatg, ggattag, gaagcgg and gaattag. The conclusion was that gaagcgg is necessary and sufficient to confer Chi activity in S. aureus. This strategy has also been successfully used to predict and validate the Chi motif of three species of the Streptococcus genus [Halpern et al. (2007)].

Table 15.3 The 10 most exceptionally frequent 7-letter words under model M5 in the S. aureus complete genome. Columns correspond respectively to the word, its observed count, its estimated expected count, its normalizing factor, its score of over-representation under model M5, its observed skew and its skew score under model M0.
Fig. 15.3

Over-representation scores under M5 and skew scores under M0 for the most over-represented 7-letter words (over-representation scores greater than 5) in the complete genome of S. aureus. The four best candidates (motifs A to D) are indicated. Motif C (gaagcgg) is the functional Chi site of S. aureus.

15.3 Words with Exceptional Distribution

The way the occurrences of a given motif w are spread along a sequence or among different sequences or subsequences may provide functional information. When the motif (and its functional properties) is known, this gives us hints about the function of the regions where it occurs (or where it is avoided). Conversely, new interesting motifs may be discovered by comparing their relative frequencies in different well-defined sequences or subsequences (e.g. regions of a genome).

15.3.1 Compound Poisson process

For both problems, we need a probabilistic model describing the motif occurrences process to assess the significance of the observed results. In this section, we will focus on the (compound) Poisson process, which is simple and provides a surprisingly good approximation of the distribution of the word count [Robin and Schbath (2001)].

In this model, the sequence is viewed as a continuous line. To account for possible overlaps between occurrences, the word is assumed to occur in clumps along the sequence. We assume that the counting process of the clumps \(\{C{(x)\}}_{x\geq 0}\) is a homogeneous Poisson process with intensity λ (in all of Section 15.3, we will avoid indexing the quantities by (w) because there will be no ambiguity). Each clump contains a random number of occurrences, referred to as the clump size. The clump sizes \(\{{K}_{1},{K}_{2}, \ldots \,\}\) are supposed to be i.i.d. with distribution p(k). The counting process {N(x)} x ≥ 0 is hence the compound Poisson process defined as

$$N(x) = \sum \limits_{c=1}^{C(x)}{K}_{c}.$$

In the case of a single fixed word, the clump size has a geometric distribution: \(p(k) = (1 - a){a}^{k-1}\), where a stands for the overlapping probability of the word (see Section 15.2.3). In the case of more complex motifs, p(k) may have a more complicated form [Robin (2002)]. The estimates of parameters λ and a depend on the biological question: empirical estimates will fit the observed word frequency (and clumping), while estimates based on a Markov chain model will account for the sequence composition.

15.3.2 Words significantly unbalanced between two sequences

We first consider the detection of motifs having different frequencies between two sequences S 1 and S 2. To avoid artifacts and spurious detections, the testing procedure must account for the different lengths and composition of the sequences, and for the fact that the word may have an unexpected frequency in one or both of them.

We only consider the non-overlapping case (i.e. a = 0). In sequence S i (i = 1, 2), the count N i of w is supposed to have a Poisson distribution

$${N}_{i} \sim \mathcal{P}({\lambda }_{i}),\qquad {\lambda }_{i} = {k}_{i}{\mathcal{l}}_{i}{\mu }_{i},$$

where \({\mathcal{l}}_{i}\) is the length of \({\mathbf{S}}_{i}\), \({\mu }_{i} = {\mu }_{i}(\mathbf{w})\) is the occurrence probability of w under a Markov model fitted to the composition of \({\mathbf{S}}_{i}\) (see Section 15.2.2) and \({k}_{i}\) is the exceptionality coefficient of w in \({\mathbf{S}}_{i}\). This framework is described in Robin et al. (2007).

Our purpose is to test if the counts of w in both sequences deviate from their expected values in the same way. We hence want to test the hypothesis \({H}_{0} :\{ {k}_{1} = {k}_{2}\}\) versus \(\{{k}_{1}\neq {k}_{2}\}\). A test procedure can be derived from the following property: for two independent Poisson variables \({N}_{1}\) and \({N}_{2}\) with respective means λ1 and λ2, the conditional distribution of \({N}_{1}\) given the sum \({N}_{1} + {N}_{2}\) is binomial \(\mathcal{B}({N}_{1} + {N}_{2},{\lambda }_{1}/({\lambda }_{1} + {\lambda }_{2}))\). Hence, we have under H 0:

$${N}_{1}\vert ({N}_{1} + {N}_{2}) \sim \mathcal{B}\left ({N}_{1} + {N}_{2},{\mathcal{l}}_{1}{\mu }_{1}/[{\mathcal{l}}_{1}{\mu }_{1} + {\mathcal{l}}_{2}{\mu }_{2}]\right ).$$

The distribution of the counts of overlapping words is characterized by two parameters (λ and a). For such words, the frequency comparison must be stated in terms of both parameters. Assuming that the overlapping probability is the same in the two sequences leads us to define the same binomial test procedure as above on the number of clumps (rather than the number of occurrences itself), which is supposed to have a Poisson distribution (see Section 15.2.3).

To illustrate this procedure, we consider the occurrences of the Chi motif w = gctggtgg in the genome of E. coli. This genome can be split into a very conserved part (called the backbone) that is common to various strains of E. coli and a remaining part (called variable segments) that is specific to the strain under study: K12. The occurrences of Chi actually never overlap in the whole genome; the number of clumps is the number of occurrences. Chi occurs 691 times in the backbone and 66 times in the variable segments, while the expected numbers of clumps \({\mathcal{l}}_{i}\widetilde{{\mu }}_{i}\) under model M1 are 73.6 and 11.3, respectively, so \({\mathcal{l}}_{1}{\mu }_{1}/({\mathcal{l}}_{1}{\mu }_{1} + {\mathcal{l}}_{2}{\mu }_{2}) = 86.7\%\). Chi therefore seems more frequent in the backbone than in the variable segments. To assess the significance of this difference, we calculate the p-value \(\Pr \{\mathcal{B}(757,86.7\%) \geq 691\} = 5.12\,1{0}^{-5}\), which shows that Chi is significantly more frequent in the most conserved regions of the genome, consistent with its beneficial function.
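
The numbers above can be plugged into a short sketch of the conditional binomial test (Python with scipy); the constant 86.7% comes from the M1 estimates reported in the text.

```python
from scipy.stats import binom

n1, n2 = 691, 66   # clumps of Chi in the backbone / in the variable segments
p0 = 0.867         # l1*mu1 / (l1*mu1 + l2*mu2) under H0, from model M1
# one-sided p-value P{ B(n1+n2, p0) >= n1 }
print(binom.sf(n1 - 1, n1 + n2, p0))   # about 5 10^-5, as reported above
```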

Testing the equality of the two overlapping probabilities (H 0 : { a 1 = a 2}) leads to a hypergeometric test [see Robin et al. (2007)].

15.3.3 Detecting regions significantly enriched with or devoid of a word

We now want to detect genome regions where the occurrences of a given word w are unexpectedly frequent (or rare). The standard strategy in such a situation is to use scan statistics, i.e. distances between successive occurrences. This strategy was first proposed in a genomic context by Karlin and Macken (1991). In this setting, the occurrences are supposed to occur according to a homogeneous Poisson process, which actually corresponds to a non-overlapping word.

Overlapping words can be studied in the compound Poisson model. Since the clump size has a geometric distribution, the distance D between two successive occurrences is either (i) equal to 0 (if the two occurrences belong to the same clump) or (ii) exponentially distributed (if they belong to two successive clumps). Case (i) occurs with probability a and case (ii) with probability (1 − a). The cumulative distribution function (cdf) of D is hence \(F(y) = 1 - (1 - a){e}^{-\lambda y}.\) The analogous exact distribution is derived in Robin and Daudin (2001) in the Markov chain model. Because the occurrence process is a renewal process, the cdf \({F}_{r}\) of the r-scan, i.e. of the cumulated distance \({D}^{r}\) between the i-th occurrence and the (i + r)-th, is simply the r times self-convolution of F: \({F}_{r} = {F}^{\otimes r}\).

Let \({D}_{1}^{r},{D}_{2}^{r},\ldots \) denote the successive r-scans. The richest region in terms of occurrences is characterized by the smallest one, \({D}_{\min }^{r} {=\min }_{i}{D}_{i}^{r}\). To check if the observed minimum distance \({d}_{\min }^{r}\) is significantly small, we need to evaluate \(\Pr \{{D}_{\min }^{r} \leq {d}_{\min }^{r}\}\). A Poisson approximation strategy is proposed by Dembo and Karlin (1992):

$$\Pr \{{D}_{\min }^{r} \leq {d}_{\min }^{r}\} \approx 1 -\exp [-(N - r){F}_{ r}({d}_{\min })],$$

where N is the total number of occurrences. Chen–Stein bounds for this approximation are provided. These results can be applied for both the compound Poisson process [Robin (2002)] and Markov chain [Robin and Daudin (2001)] frameworks.
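
For a non-overlapping word (a = 0), the clump process is a plain Poisson process, F is an exponential cdf and \({F}_{r}\) is therefore a Gamma(r) cdf; the sketch below (Python with scipy, hypothetical numbers) applies the Dembo–Karlin approximation in that simplified case.

```python
from math import exp
from scipy.stats import gamma

def rscan_pvalue(d_min, r, lam, n_occ):
    """Dembo-Karlin approximation of P{ min_i D_i^r <= d_min } when the
    occurrences follow a homogeneous Poisson process of intensity lam:
    F_r is the cdf of a Gamma(shape=r, scale=1/lam) distribution."""
    f_r = gamma.cdf(d_min, a=r, scale=1.0 / lam)
    return 1.0 - exp(-(n_occ - r) * f_r)

# hypothetical example: 100 occurrences, intensity 0.5 occurrence per kb,
# smallest cumulated distance over r = 3 consecutive gaps equal to 0.4 kb
print(rscan_pvalue(0.4, r=3, lam=0.5, n_occ=100))
```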

As an illustration, we consider the occurrences of the Chi motif in the genome of Haemophilus influenzae, and study their distribution using 3-scans (see Section 15.2.5 for the description of the Chi motif). The x-axis of Figure 15.4 gives the positions in Mbps, and the y-axis gives the intensity \(3/{D}^{3}\) multiplied by \(1{0}^{3}\) (in log scale); peaks correspond to rich regions. We observe several peaks, the highest one being near the center, i.e. near the terminus of replication. Chi motifs are expected to be frequent here because this region is crucial in the replication mechanism of the cell. The four horizontal lines give, in ascending order, the theoretical mean intensity, the lower bound of the Chen–Stein approximation, the Chen–Stein threshold and the upper bound. We see that several peaks are significant under the M1 model, but the mean intensity of the occurrence process is highly underestimated by this model. Using MLEs, the compound Poisson model fits the observed mean intensity. In this model, even the highest peak is no longer significant.

Fig. 15.4 Significance of the intensity peaks for the occurrences of the Chi site of H. influenzae.

15.4 More Sophisticated Patterns

Biological motifs are not always exact and simple words. They often contain some uncertainties (degenerated motifs), like the Chi motif gntggtgg of H. influenzae (the n stands for any of the four DNA letters). In this case, we have to consider the occurrences of a set of words rather than a single word. In the case of transcription factor binding sites, we have to deal with several (exact or not) words that should occur at a constrained distance from each other (structured motifs). In Section 15.4.1, we present the main extensions required to generalize the results on simple words from the previous sections to a set of words. Then, we present some results for structured motifs in Section 15.4.2.

15.4.1 Family of words

Let \(\mathcal{W}\) be a set (family) of r words: \(\mathcal{W} =\{{ \mathbf{w}}_{1}, \ldots ,{\mathbf{w}}_{r}\}\). To simplify the exposition, we will assume that all r words have the same length h. In the general case, one just assumes that no word of the family is part of another word of the family, and the results generalize easily.

The number of occurrences of the word family, denoted by \(N(\mathcal{W})\), is simply the sum of the counts of each word taken from \(\mathcal{W}\):

$$N(\mathcal{W}) = \sum \limits_{j=1}^{r}N({\mathbf{w}}_{ j}).$$

The expected count \(\mathbb{E}[N(\mathcal{W})]\) is then simply the sum of the r expected counts \(\mathbb{E}[N(\mathbf{w}_{j})]\), \(j = 1, \ldots ,r\). For the variance, we have \(\mathbb{V}[N(\mathcal{W})] = \sum \limits_{j=1}^{r}\mathbb{V}[N({\mathbf{w}}_{j})] + 2 \sum \limits_{j<j'}\mathbb{C}[N({\mathbf{w}}_{j}),N({\mathbf{w}}_{j'})]\), so we just need to derive the covariance between two word counts (see below). The Gaussian approximation of \(N(\mathcal{W})\) is immediate, and it is easy to derive a score of exceptionality for any family of words. The compound Poisson approximation is much more involved. A first strategy could be to approximate the clumps of each word separately, and then to combine the associated Poisson variables [Reinert and Schbath (1998)]. Unfortunately, words from \(\mathcal{W}\) can overlap each other, and this leads to a poor approximation for overlapping families. The alternative is to consider clumps of the word family itself, i.e. clumps composed of overlapping occurrences of \(\mathcal{W}\) [Roquain and Schbath (2007)]. This leads to a compound Poisson distribution whose parameters are derived from an overlapping probability matrix \((A(\mathbf{w}_{j},\mathbf{w}_{j'}))_{1 \leq j,j' \leq r}\), but which is not a geometric Poisson distribution. Tails of general compound Poisson distributions can be calculated using the algorithm from Barbour et al. (1992a).
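The Gaussian score mentioned above can be assembled directly from these quantities. The following sketch (ours) assumes the expected counts and the covariance matrix of the individual counts have already been computed under the Markov model, and returns the score together with the one-sided p-value for over-representation.

```python
# Hedged sketch: Gaussian score of exceptionality for a word family, given the
# expected counts and the full covariance matrix of the individual counts.
import numpy as np
from scipy.stats import norm

def family_score(observed_count, expected_counts, covariance_matrix):
    """covariance_matrix[j, k] = C[N(w_j), N(w_k)]; diagonal terms are variances."""
    expected = np.sum(expected_counts)
    variance = np.sum(covariance_matrix)   # sum of variances plus twice the covariances
    score = (observed_count - expected) / np.sqrt(variance)
    return score, norm.sf(score)           # one-sided p-value for over-representation
```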

Now consider two different words w and w′ of length h. The covariance \(\mathbb{C}[N(\mathbf{w}),N(\mathbf{w'})]\) is given by

$$\mathbb{C}[N(\mathbf{w}),N(\mathbf{w'})] = -\mathbb{E}[N(\mathbf{w})]\,\mathbb{E}[N(\mathbf{w'})] + \sum \limits_{i\mathrel{\not =}j}\mathbb{E}[{Y }_{i}(\mathbf{w}){Y }_{j}(\mathbf{w'})].$$

Because of symmetry, let us restrict ourselves to the calculation of \(\mathbb{E}[{Y }_{i}(\mathbf{w}){Y }_{i+d}(\mathbf{w'})]\) for d > 0. If 0 < d < h, an occurrence of w′ at position i + d would overlap an occurrence of w at position i. We then need to introduce the possible lags between an occurrence of w and a following overlapping occurrence of w′.

Let \(\mathcal{P}(\mathbf{w},\mathbf{w'})\) be the set of these possible lags, namely

$$p \in \mathcal{P}(\mathbf{w},\mathbf{w'})\mathrel{\Leftrightarrow }{w'}_{j} = {w}_{j+p},\quad \forall j \in \{ 1, \ldots ,h - p\}.$$

Overlaps are not necessarily symmetric so \(\mathcal{P}(\mathbf{w},\mathbf{w'})\mathrel{\not =}\mathcal{P}(\mathbf{w'},\mathbf{w})\). For instance, atcg can be overlapped from the right by cgct after a lag of 2 (\(\mathcal{P}(\texttt{ atcg},\texttt{ cgct}) =\{ 2\}\)), whereas cgct cannot be overlapped from the right by atcg (\(\mathcal{P}(\texttt{ cgct},\texttt{ atcg}) = \emptyset\)).
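Computing \(\mathcal{P}(\mathbf{w},\mathbf{w'})\) is straightforward; here is a minimal sketch (ours) that reproduces the atcg/cgct example.

```python
# Hedged sketch: the lag set P(w, w') of positions after which an occurrence
# of w' can overlap an occurrence of w from the right (words of equal length).
def overlap_lags(w, wp):
    h = len(w)
    return {p for p in range(1, h) if all(wp[j] == w[j + p] for j in range(h - p))}

print(overlap_lags("atcg", "cgct"))   # {2}
print(overlap_lags("cgct", "atcg"))   # set()
```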

If \(p \in \mathcal{P}(\mathbf{w},\mathbf{w'})\), let w p w′ be the word composed of two overlapping occurrences of w and w′: \({\mathbf{w}}^{p}\mathbf{w'} = {w}_{1}\mathrel{\cdots }{w}_{p}{w'}_{1}\mathrel{\cdots }{w'}_{h}\).

By analogy with Equation (15.5), one can show that

$$\mathbb{E}[{Y }_{i}(\mathbf{w})\,{Y }_{i+d}(\mathbf{w'})] = \left \{\begin{array}{ll} 0 &\text{ if }0 \leq d < h,\ d \notin \mathcal{P}(\mathbf{w},\mathbf{w'}), \\ \mu ({\mathbf{w}}^{d}\mathbf{w'}) &\text{ if }d \in \mathcal{P}(\mathbf{w},\mathbf{w'}), \\ \mu (\mathbf{w})\,\mu (\mathbf{w'})\,\dfrac{{\pi }^{d-h+1}({w}_{ h},{w'}_{1})} {\mu ({w'}_{1})} &\text{ if }d \geq h, \end{array} \right.$$

which finally leads to the following expression for the covariance:

$$\begin{array}{rcl} \mathbb{C}[N(\mathbf{w}),N(\mathbf{w'})]& =& -\:\mathbb{E}[N(\mathbf{w})]\,\mathbb{E}[N(\mathbf{w'})] +\: \sum \limits_{p\in \mathcal{P}(\mathbf{w},\mathbf{w'})}(n - h - p + 1)\mu ({\mathbf{w}}^{p}\mathbf{w'}) \\ & & +\: \sum \limits_{p\in \mathcal{P}(\mathbf{w'},\mathbf{w})}(n - h - p + 1)\mu ({\mathbf{w'}}^{p}\mathbf{w}) \\ & & +\:\mu (\mathbf{w})\mu (\mathbf{w'}) \\ & & \times \sum \limits_{t=1}^{n-2h+1}(n - 2h - t + 2)\left [\frac{{\pi }^{t}({w}_{ h},{w'}_{1})} {\mu ({w'}_{1})} + \frac{{\pi }^{t}({w'}_{h},{w}_{1})} {\mu ({w}_{1})} \right ]\end{array}$$
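As an illustration of how this covariance can be evaluated in practice, the following sketch (ours, not the chapter's code) implements the formula above under model M1 for two words of the same length, assuming the transition matrix π and the stationary distribution μ over the four letters are given.

```python
# Hedged sketch: covariance C[N(w), N(w')] of two word counts of equal length h
# in a sequence of length n under the stationary M1 model, following the
# formula above.  'pi' is the 4x4 transition matrix and 'mu' the stationary
# distribution, both indexed by the letters a, c, g, t.
import numpy as np

LETTERS = "acgt"
IDX = {b: i for i, b in enumerate(LETTERS)}

def word_prob(w, pi, mu):
    """mu(w) = mu(w_1) * prod_j pi(w_j, w_{j+1}): occurrence probability under M1."""
    p = mu[IDX[w[0]]]
    for x, y in zip(w, w[1:]):
        p *= pi[IDX[x], IDX[y]]
    return p

def overlap_lags(w, wp):
    """Lag set P(w, w'), as in the sketch above."""
    h = len(w)
    return [p for p in range(1, h) if all(wp[j] == w[j + p] for j in range(h - p))]

def count_covariance(w, wp, n, pi, mu):
    h = len(w)
    mu_w, mu_wp = word_prob(w, pi, mu), word_prob(wp, pi, mu)
    cov = -(n - h + 1) ** 2 * mu_w * mu_wp            # - E[N(w)] E[N(w')]
    for p in overlap_lags(w, wp):                     # w overlapped from the right by w'
        cov += (n - h - p + 1) * word_prob(w[:p] + wp, pi, mu)
    for p in overlap_lags(wp, w):                     # w' overlapped from the right by w
        cov += (n - h - p + 1) * word_prob(wp[:p] + w, pi, mu)
    pit = np.eye(4)                                   # pi^t, updated iteratively
    tail = 0.0                                        # non-overlapping occurrences (d >= h)
    for t in range(1, n - 2 * h + 2):
        pit = pit @ pi
        tail += (n - 2 * h - t + 2) * (
            pit[IDX[w[-1]], IDX[wp[0]]] / mu[IDX[wp[0]]]
            + pit[IDX[wp[-1]], IDX[w[0]]] / mu[IDX[w[0]]]
        )
    return cov + mu_w * mu_wp * tail
```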

Note that it is also possible to calculate the asymptotic variance of \(N(\mathcal{W}) - \sum \limits_{j}\widehat{\mathbb{E}}[N({\mathbf{w}}_{j})]\) by using the conditional covariances of \((N(\mathbf{w}_{j}),N(\mathbf{w}_{\ell}))\) in the permutation model [see Schbath et al. (1995)].

As we have seen in Section 15.2.5, biologists may be interested in the statistical significance of the skew of a word w. The skew is defined as the ratio \(N(\mathbf{w})/N(\overline{\mathbf{w}})\), where \(\overline{\mathbf{w}}\) is the reverse complementary word of w (for instance, if w = gctggtgg then \(\overline{\mathbf{w}} = \text{ccaccagc}\)). To calculate the significance of the skew, one then has to compute (or approximate) the following p-value:

$$\mathbb{P}\left (\frac{N(\mathbf{w})} {N(\overline{\mathbf{w}})} \geq b\right ),$$

where b is the observed skew. This requires at least the joint distribution of \((N(\mathbf{w}),N(\overline{\mathbf{w}}))\).

If we assume that \((N(\mathbf{w}),N(\overline{\mathbf{w}}))\) can be approximated by a Gaussian vector with mean \((\widehat{\mathbb{E}}[N(\mathbf{w})],\widehat{ \mathbb{E}}[N(\overline{\mathbf{w}})])\) and covariance matrix Σ, the above p-value can be approximated by

$$\mathbb{P}\left (\mathcal{N}(0,1) \geq \frac{b\widehat{\mathbb{E}}[N(\overline{\mathbf{w}})] -\widehat{ \mathbb{E}}[N(\mathbf{w})]} {\sqrt{{\Sigma }_{11 } - 2b{\Sigma }_{12 } + {b}^{2 } {\Sigma }_{22}}}\right ).$$

The right-hand term of the preceding inequality is then used as a score to measure the significance of the skew. Typically, \(\Sigma_{11}\) and \(\Sigma_{22}\) are given by Equation (15.19), and \(\Sigma_{12}\) can be obtained similarly from the conditional covariances between counts.
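A minimal sketch (ours) of this skew score, taking the estimated expected counts and the 2×2 covariance matrix Σ as inputs:

```python
# Hedged sketch: Gaussian score and p-value for the skew N(w)/N(wbar), given
# the estimated expected counts and the 2x2 covariance matrix Sigma.
import numpy as np
from scipy.stats import norm

def skew_score(b, e_w, e_wbar, sigma):
    """b = observed skew; sigma[0,0] = V[N(w)], sigma[1,1] = V[N(wbar)], sigma[0,1] = covariance."""
    score = (b * e_wbar - e_w) / np.sqrt(
        sigma[0, 0] - 2 * b * sigma[0, 1] + b ** 2 * sigma[1, 1]
    )
    return score, norm.sf(score)   # Pr{ N(0,1) >= score }
```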

If N(w) and \(N(\overline{\mathbf{w}})\) are better approximated by (compound) Poisson distributions, no solution exists for now. If w and \(\overline{\mathbf{w}}\) do not overlap each other, their counts can be approximated by two independent geometric Poisson variables [Reinert and Schbath (1998)], but this does not help to derive an asymptotic distribution for the skew.

Because of the possible overlaps between words of the family, the distribution of the inter-site distance between two occurrences of the word family depends on which word actually occurs first and which word occurs next [Robin (2002)]. Therefore, in the general case, the occurrences of a set of words do not constitute a renewal process, and the methodology described in Section 15.3.3 cannot be used to get the r-scan distribution. In the Markov chain framework, the occurrences of a set of words turn out to form a semi-Markov process.

15.4.2 Structured motifs

A structured motif is composed of several words which should occur in a given order and at some distance from each other. Let us consider the simple case of two fixed words u and v. We define a structured motif m as a pattern of which u is a prefix and v is a suffix, and whose length is \(\vert \mathbf{u}\vert + d + \vert \mathbf{v}\vert \), d ≥ 0. Moreover, we impose \(d_1 \leq d \leq d_2\). Since \(d_1\) can be large (typically 12 to 20 for transcription factor binding sites), it is not reasonable to view a structured motif as a set of words (i.e. a very degenerated word). Dedicated methods should then be provided. The two main questions related to structured motif occurrences are: (i) What is the probability that a random sequence contains at least one occurrence of a given structured motif? (ii) Is this structured motif more over-represented in front of genes than along the whole chromosome? For the first question, an approximate probability has been derived by assuming that the random indicator of occurrence \(Y_i(\mathbf{m})\) depends only on \(Y_{i-1}(\mathbf{m})\) [Robin et al. (2002)]. More recently, the generating function of the waiting time for the first occurrence of a structured motif was proposed [Stefanov et al. (2007); see also Stefanov (2009)]. For the second question, one can use the test described in Section 15.3.2, which only requires us to compute \(\mu (\mathbf{m}) = \mathbb{E}[{Y }_{i}(\mathbf{m})]\), the occurrence probability of m. An application to transcription factor binding site discovery can be found in Touzain et al. (2008).

The probability for m to occur at a given position in a random sequence \({X}_{1},{X}_{2}, \ldots ,{X}_{n}\) (model M1) is given by

$$\mu (\mathbf{m}) = \mu (\mathbf{u}) \sum \limits_{d={d}_{1}}^{{d}_{2} }\mathbb{P}({D}_{\mathbf{u},\mathbf{v}} = d)\mu (\mathbf{v})/\mu ({v}_{1}),$$

where \(D_{\mathbf{u},\mathbf{v}}\) is the random distance between an occurrence of u and the next occurrence of v, and \(v_1\) is the first letter of v. The distribution of \(D_{\mathbf{u},\mathbf{v}}\) is given in Robin and Daudin (2001) [see also Stefanov (2008)].
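A minimal sketch (ours) of this computation, assuming the distribution of the gap \(D_{\mathbf{u},\mathbf{v}}\) has already been obtained (e.g. following Robin and Daudin (2001)) and is supplied as a dictionary:

```python
# Hedged sketch: occurrence probability mu(m) of the structured motif u..v,
# assuming gap_dist[d] = Pr{ D_{u,v} = d } is available (computed elsewhere).
def structured_motif_prob(mu_u, mu_v, mu_v1, gap_dist, d1, d2):
    """mu_v1 is the stationary probability of the first letter of v."""
    gap_mass = sum(gap_dist.get(d, 0.0) for d in range(d1, d2 + 1))
    return mu_u * gap_mass * mu_v / mu_v1
```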

15.5 Ongoing Research and Open Problems

Multiple testing problems immediately arise in motif detection studies: looking for exceptional 8-letter words amounts to performing thousands of tests at the same time. The control of the false discovery rate [Benjamini and Hochberg (1995)] has received huge attention in the last few years in the gene expression context, but it is still neglected in most motif statistics studies. The main difficulty comes from the dependency between the counts of all the words under study, and hence between the tests. Under the null (Markov) model, all word counts are correlated, since they are observed on the same sequence. The covariance between any pair of counts is actually known (see Section 15.4.1), but it is difficult to account for in multiple testing procedures, partly because of high dimensionality problems.
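For reference, the Benjamini–Hochberg step-up procedure itself is simple to apply to a vector of word p-values; the sketch below (ours) implements it, keeping in mind that its validity under the strong dependence between word counts is precisely the open question raised above.

```python
# Hedged sketch: Benjamini-Hochberg step-up procedure applied to word p-values.
# Its FDR guarantee assumes independence (or positive dependence), which is
# exactly what is questionable for correlated word counts.
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()     # largest i with p_(i) <= i * alpha / m
        rejected[order[: k + 1]] = True
    return rejected                        # boolean mask of rejected hypotheses
```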

Many genomes, e.g. bacterial ones, can be characterized in terms of their oligonucleotide composition. This phenomenon is often referred to as the “genome signature.” Several new genomic approaches aim at classifying sequences with similar origins: comparative genomics looks for similarities between complete genomes, typically from an evolutionary perspective; meta-genome analysis considers sets of hundreds of species living in the same environment (soil, human intestine) and deals with mixtures of subsequences coming from these different species.

As seen before, the Mm Markov chain model accounts for the composition of a sequence in (m + 1)-letter words. Mixture models [McLachlan and Peel (2000)] provide a natural framework to classify objects into unknown groups. Such a model assumes that the sequences actually come from Q groups, each characterized by one transition matrix; sequence i coming from group q is a random path with transition matrix \(\Pi_q\). The expectation-maximization (EM) algorithm is the standard way to estimate both the group proportions and the matrices \(\Pi_q\), which amounts to \((Q - 1) + 3Q{4}^{m}\) independent parameters. However, mixture models generally raise model selection problems, typically the choice of the unknown number of groups Q. In the case of sequences, this problem turns out to be very complex because of the different sequence lengths: long sequences tend to discriminate very easily from each other, while short sequences have almost no influence on the global model. Combinatorial arguments are needed to evaluate the number of “efficient” parameters, i.e. the number of transition probabilities for which some information can actually be derived from the data.
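A minimal sketch (ours, for order m = 1, and ignoring the stationary term for the first letter of each sequence) of one EM iteration for such a mixture of Markov chains:

```python
# Hedged sketch (order m = 1): one EM iteration for a mixture of Q Markov
# chains.  Matrices and proportions are re-estimated from transition counts
# weighted by the posterior group probabilities tau.
import numpy as np

LETTERS = "acgt"
IDX = {b: i for i, b in enumerate(LETTERS)}

def transition_counts(seq):
    counts = np.zeros((4, 4))
    for x, y in zip(seq, seq[1:]):
        counts[IDX[x], IDX[y]] += 1
    return counts

def em_step(sequences, proportions, matrices):
    counts = [transition_counts(s) for s in sequences]
    # E-step: posterior probability tau[i, q] that sequence i comes from group q
    log_tau = np.array([[np.log(pq) + np.sum(c * np.log(Pq))
                         for pq, Pq in zip(proportions, matrices)]
                        for c in counts])
    log_tau -= log_tau.max(axis=1, keepdims=True)
    tau = np.exp(log_tau)
    tau /= tau.sum(axis=1, keepdims=True)
    # M-step: group proportions and row-normalized weighted transition counts
    new_proportions = tau.mean(axis=0)
    new_matrices = []
    for q in range(len(proportions)):
        cq = sum(t * c for t, c in zip(tau[:, q], counts))
        new_matrices.append(cq / cq.sum(axis=1, keepdims=True))
    return new_proportions, new_matrices, tau
```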

New high-throughput sequencing technology is likely to be used in many biological experiments in the next decade, typically in place of micro-arrays. It consists in sequencing a huge number (around 40 million) of small DNA fragments (25 nucleotides long) in one run. It can be used to count the number of copies of the transcripts of a given gene in order to evaluate its expression level, or to explore the meta-genome of a given ecosystem. Dealing with such large datasets is an open problem. Markov models and motif statistics can probably help to organize all this information, but we admit that we still do not really know how.