
16.1 Introduction

Patterns and motifs on finite alphabets are of interest in many applied areas, such as computational molecular biology, computer science, communication theory, and reliability theory. A word on an alphabet is called a single pattern, and a set of distinct single patterns (words) is called a compound pattern. The strings (texts) of letters can be generated either by independent and identically distributed multinomial trials, or by general discrete-time or continuous-time models (Markov chains or semi-Markov processes). The main interest, from a probabilistic/statistical point of view, is in finding practicable closed-form expressions for the distributions of the following quantities: the waiting time until the first occurrence of a pattern (single or compound) or motif, the intersite distances between consecutive occurrences of such a pattern or motif, and the counts of occurrences of patterns or motifs within a finite time horizon. Motifs are special cases of compound patterns which usually contain a huge number of distinct single patterns.

The theory of pattern occurrence has attracted a variety of methodological tools. For example, the following methodologies have been widely used in the literature: combinatorial methods and classical probabilistic methods based on conditioning arguments, Markov chain embeddings, Markov renewal embeddings, exponential families, martingale techniques, and automata theory. The usefulness of these methodologies to the area is well illustrated in the sources which follow.

Runs are the simplest patterns. Feller (1950) showed how recurrent event theory can be used to solve problems about success runs. For a comprehensive account of the literature on runs see Balakrishnan and Koutras (2002). The key to handling complex patterns was provided by Conway’s leading numbers, which account for the overlapping structure of a pattern. Guibas and Odlyzko (1981) derived results applying elementary methods, and Chryssaphinou and Papastavridis (1990) extended them to more general models [see also Robin and Daudin (1999, 2001), Rukhin (2002, 2006), Han and Hirano (2003), and Inoue and Aki (2007)]. Li (1980) introduced martingale techniques to the area, and Gerber and Li (1981) combined the latter with a relevant Markov chain embedding. Martingale tools have also been used in Pozdnyakov et al. (2005), Glaz et al. (2006), and Pozdnyakov (2008).

Markov chain embeddings have been widely used in the area for treating problems on pattern occurrence; a few relevant sources are Fu (1996), Chadjiconstantinidis, Antzoulakos, and Koutras (2000), Antzoulakos (2001), Fu and Chang (2002), and Fu and Lou (2003). Blom and Thorburn (1982) made connections with Markov renewal theory, and this was systematically exploited by Biggins and Cannings (1987) and Biggins (1987). Stefanov and Pakes (1997) introduced exponential family methodology, combined with a minimal Markov chain embedding, and Stefanov (2000) extended it in combination with suitable Markov renewal embeddings to handle some special compound patterns (sets of runs).

Nicodème, Salvy, and Flajolet (2002) used automata theory comprehensively. Nuel (2008) combined automata theory with Markov chain embeddings and elaborated on a route which leads, for any given pattern(s), to a minimal embedding Markov chain. Reinert, Schbath, and Waterman (2000) provided a survey on some probabilistic tools used in the theory of patterns, and Szpankowski (2001) treated problems on pattern occurrence associated with average case analysis of string searching algorithms. The first exact distributional results on structured motifs are found in Stefanov, Robin, and Schbath (2007) [cf. also Robin et al. (2002), Nuel (2008), and Pozdnyakov (2008)].

In this chapter, results are discussed which provide explicit, closed-form solutions for the distributions of the aforementioned random quantities associated with the occurrence of patterns and structured motifs. These results are derived using predominantly simple probabilistic tools. Also, for a given alphabet, they require a preliminary (easy) evaluation of a few basic characteristics, and then each pattern case is covered in an automated way.

In Sections 16.2 and 16.3 we discuss single patterns. The strings are generated by discrete- or continuous-time semi-Markov processes. The exact distribution of the waiting time until the first occurrence of a pattern, given that any (fixed) portion of it has been reached, is found. Joint distributional results are also discussed. The method relies on knowledge of basic characteristics associated with the underlying model used to generate the strings. These basic characteristics are the probability generating functions (pgf's) of the waiting times to reach one letter of the alphabet from another. In other words, we need to know only the pgf's of the waiting times until the simplest special patterns, consisting of a single letter from the alphabet, are first reached. These pgf's can be evaluated using well-known analytical results if the underlying model is a discrete- or continuous-time finite-state semi-Markov process. In terms of these basic characteristics, simple recurrence relations are provided; these lead to exact evaluation of the relevant pgf's for any pattern. The results on single patterns, as provided in Sections 16.2 and 16.3, lead to an easy solution for compound patterns consisting of a small to moderate number of distinct single patterns. This is discussed in Subsection 16.4.1. The distribution of the count, and more generally the weighted count, of a compound pattern within a finite time horizon is discussed in Subsection 16.4.2. A neat explicit expression is derived for this distribution in terms of the aforementioned waiting time distributions. The result in Subsection 16.4.2 has not appeared in the literature before. Structured motifs are covered in Subsection 16.4.3. It is shown that results on compound patterns consisting of only two single patterns are enough to derive exact distributional results on structured motifs.

16.2 Patterns: Discrete-Time Models

In this section we explain how to derive a closed-form expression for the pgf of the waiting time to reach a pattern (word) starting from either a given letter or an already-achieved portion of the pattern. The strings of letters are generated by a finite-state discrete-time Markov chain whose state space and states are also called alphabet and letters, respectively.

Let \(\{X{(n)\}}_{n\geq 0}\) be an ergodic finite-state Markov chain with discrete-time parameter, state space \(\{1,2, \ldots ,N\},\) and one-step transition probabilities \({p}_{i,j}\), \(i,j = 1,2, \ldots ,N.\) Denote by \({g}_{i,j}(t)\) the pgf of the waiting time, \({\tau }_{i,j}\), to reach state j from state i, that is, \({g}_{i,j}(t) = E({t}^{{\tau }_{i,j}}),\) and

$${\tau }_{i,j} =\inf \{ n : X(n) = j\vert X(0) = i\}.$$

We assume \({\tau }_{i,i} = 0\), and therefore \({g}_{i,i}(t) = 1\), for each i. The first return time to state i is denoted by \(\tilde{{\tau }}_{i,i},\) that is,

$$\tilde{{\tau }}_{i,i} =\inf \{ n > 0 : X(n) = i\vert X(0) = i\},$$

and its pgf is denoted by \(\tilde{{g}}_{i,i}(t)\).

The pattern of interest is \({\mathbf{w}}_{k} = {w}_{1}{w}_{2} \ldots {w}_{k},\) where \(1 \leq {w}_{i} \leq N,\;i = 1,2, \ldots ,k.\) For j < k, the subpattern \({\mathbf{w}}_{j}\) is also called a prefix of \({\mathbf{w}}_{k}\). For each \(j,\ j = 2,3, \ldots ,k - 1,\) and r < j, and each \(n,\ n = 1,2, \ldots ,N,\) denote by \({I}_{r,j,n}\) the indicator function which is equal to one if and only if none of the strings \({w}_{i}{w}_{i+1} \ldots {w}_{j}n\) for \(i = 2,3, \ldots ,r\) is a prefix of \({\mathbf{w}}_{k}\) but \({w}_{r+1}{w}_{r+2} \ldots {w}_{j}n\) is. Also, the indicator function \({I}_{j,j,n}\) is equal to one if and only if none of the strings \({w}_{i}{w}_{i+1} \ldots {w}_{j}n\) for \(i = 2,3, \ldots ,j\) is a prefix of \({\mathbf{w}}_{k}\).
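
These indicator functions are determined entirely by the pattern and can be computed mechanically. The following minimal sketch in Python (the helper name and the representation of the pattern as a tuple of letters are our own choices, not the source's) makes the definition concrete:

```python
def indicator(w, r, j, n):
    """I_{r,j,n}: for r < j it equals 1 iff none of the strings w_i ... w_j n,
    i = 2, ..., r, is a prefix of the full pattern w, but w_{r+1} ... w_j n is;
    I_{j,j,n} equals 1 iff none of the strings w_i ... w_j n, i = 2, ..., j,
    is a prefix of w.  The pattern w is a tuple of letters; indices are 1-based."""
    def is_prefix(i):
        candidate = list(w[i - 1:j]) + [n]          # the string w_i ... w_j n
        return candidate == list(w[:len(candidate)])
    if r < j:
        return int(all(not is_prefix(i) for i in range(2, r + 1)) and is_prefix(r + 1))
    return int(all(not is_prefix(i) for i in range(2, j + 1)))

# Example: for w = (1, 2, 1, 2), j = 3 and n = 2, one finds I_{1,3,2} = 0 and I_{2,3,2} = 1,
# since the string w_3 n = 1 2 is a prefix of w but w_2 w_3 n = 2 1 2 is not.
print(indicator((1, 2, 1, 2), 1, 3, 2), indicator((1, 2, 1, 2), 2, 3, 2))
```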

Denote by \({G}_{j}^{(s)}(t)\) (\(\tilde{{G}}_{j}^{(s)}(t)\)), \(j = 1,2, \ldots ,k,\) the pgf of the waiting time to reach the pattern \({\mathbf{w}}_{j}\) from state s, allowing (not allowing) the initial state s to contribute to the pattern. Also, denote by \({G}_{j}^{({\mathbf{w}}_{r})}(t),1 \leq r \leq j,\) the pgf of the waiting time to reach the pattern \({\mathbf{w}}_{j}\), given that the pattern \({\mathbf{w}}_{r}\) has already been reached (note that \({G}_{j}^{({\mathbf{w}}_{j})}(t) = 1\)). The following theorem provides a simple route for evaluating these pgf's knowing the pgf's, \({g}_{i,j}(t)\), of the transition times between the states of the original Markov chain X(n). The expressions for the pgf's \({g}_{i,j}(t)\) are easily recoverable from well-known analytical results [see Theorem 2.19 on page 81 of Kijima (1997)], for any given finite-state Markov chain with not too large a state space.
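
For a chain with a small state space, these basic pgf's can also be obtained directly by first-step analysis: conditioning on the first transition gives the linear system \({g}_{i,j}(t) = {p}_{i,j}t + \sum_{n\neq j}{p}_{i,n}t\,{g}_{n,j}(t)\). The sketch below (our own code; function and variable names are not from the source) solves this system symbolically with sympy; note that the equation with i = j yields the first-return pgf \(\tilde{{g}}_{j,j}(t)\) rather than the conventional \({g}_{j,j}(t) = 1\).

```python
import sympy as sp

def first_passage_pgfs(P, j):
    """Solve g_{i,j}(t) = sum_n p_{i,n} t * (1 if n == j else g_{n,j}(t)) for all i.
    P is an N x N transition matrix (exact numbers or sympy expressions), states 0..N-1.
    For i != j the solution is the first-passage pgf g_{i,j}(t); for i == j it is the
    first-return pgf (denoted \tilde g_{j,j}(t) in the text)."""
    N = len(P)
    t = sp.symbols('t')
    g = sp.symbols(f'g0:{N}')
    eqs = [sp.Eq(g[i], sum(P[i][n] * t * (1 if n == j else g[n]) for n in range(N)))
           for i in range(N)]
    sol = sp.solve(eqs, list(g), dict=True)[0]
    return t, [sp.simplify(sol[g[i]]) for i in range(N)]

# Example: two-letter alphabet with p_{1,1} = p_{1,2} = 1/2, p_{2,1} = 1/3, p_{2,2} = 2/3
# (0-based labels in the code); the pgf of the waiting time to reach letter 2 from
# letter 1 comes out as t/(2 - t).
P = [[sp.Rational(1, 2), sp.Rational(1, 2)],
     [sp.Rational(1, 3), sp.Rational(2, 3)]]
t, pgfs = first_passage_pgfs(P, 1)
print(pgfs[0])
```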

Theorem 16.2.1.

Let the pattern of interest be \({\mathbf{w}}_{k}\). The following recurrence relations hold for each \(j,\ j = 1,2, \ldots ,k - 1,\) and each \(r,\ r = 1,2, \ldots ,j\) (with the convention \(\sum \limits_{i=1}^{0} = 0\)):

$$\begin{array}{rcl} \tilde{{G}}_{j+1}^{(s)}(t) = \frac{{p}_{{w}_{j},{w}_{j+1}}t\tilde{{G}}_{j}^{(s)}(t)} {1 - \sum \limits_{n=1,\;n\mathrel{\not =}{w}_{j+1}}^{N}{p}_{{ w}_{j},n}t\left ( \sum \limits_{i=1}^{j-1}{I}_{ i,j,n}{G}_{j}^{({\mathbf{w}}_{j-i+1})}(t) + {I}_{ j,j,n}{G}_{j}^{(n)}(t)\right )}\;,& & \\ {G}_{j+1}^{({\mathbf{w}}_{r})}(t) = \frac{{p}_{{w}_{j},{w}_{j+1}}t{G}_{j}^{({\mathbf{w}}_{r})}(t)} {1 - \sum \limits_{n=1,\;n\mathrel{\not =}{w}_{j+1}}^{N}{p}_{{ w}_{j},n}t\left ( \sum \limits_{i=1}^{j-1}{I}_{ i,j,n}{G}_{j}^{({\mathbf{w}}_{j-i+1})}(t) + {I}_{ j,j,n}{G}_{j}^{(n)}(t)\right )}\;,& & \\ \end{array}$$

where

$$\begin{array}{rcl} \tilde{{G}}_{j+1}^{(s)}(t)& =& {G}_{j+1}^{(s)}(t),\;\;\;\;\;\;\;\mathrm{if}\;\;s\mathrel{\not =}{w}_{1}, \\ \tilde{{G}}_{j+1}^{({w}_{1})}(t)& =& \tilde{{g}}_{{w}_{1},{w}_{1}}(t){G}_{j+1}^{({w}_{1})}(t), \\ {G}_{1}^{(s)}(t)& =& {g}_{s,{w}_{1}}(t), \\ \tilde{{G}}_{1}^{({w}_{1})}(t)& =& \tilde{{g}}_{{w}_{1},{w}_{1}}(t) = \sum \limits_{n=1}^{N}{p}_{{w}_{1},n}t{g}_{n,{w}_{1}}(t), \\ \end{array}$$

and the \({g}_{i,j}(t)\) and the indicator functions \({I}_{i,j,n}\) are as above.

The pgf of the intersite distance between consecutive occurrences of the pattern \({\mathbf{w}}_{k}\) is given by \({G}_{k}^{({\mathbf{w}}_{j})}(t),\) where j is the largest integer such that \({\mathbf{w}}_{j}\) is a proper prefix as well as a suffix of the pattern \({\mathbf{w}}_{k}\). Also, the pgf of the waiting time until the r-th occurrence of the pattern \({\mathbf{w}}_{k}\), given the initial state i, is equal to \({G}_{k}^{(i)}(t){\left ({G}_{k}^{({\mathbf{w}}_{j})}(t)\right )}^{r-1},\) where j has the same property as above.
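
The recurrence relations of Theorem 16.2.1 translate directly into symbolic code. The following sketch (our own illustration; the function name, the 0-based state labels, and the calling conventions are not from the source) takes the transition matrix, the pattern, and the basic pgf's \({g}_{s,{w}_{1}}(t)\), and returns \({G}_{k}^{(s)}(t)\) for every letter s together with \({G}_{k}^{({\mathbf{w}}_{r})}(t)\) for every prefix:

```python
import sympy as sp

def pattern_waiting_pgfs(P, w, g_to_w1):
    """Iterate the recurrence of Theorem 16.2.1.
    P        : N x N one-step transition probabilities p_{i,j} (states labelled 0..N-1).
    w        : the pattern as a tuple of states, (w_1, ..., w_k), in 0-based labels.
    g_to_w1  : g_to_w1[s] = pgf g_{s,w_1}(t) of the waiting time to reach the first letter
               of the pattern from letter s (equal to 1 for s = w_1 by convention);
               these can be produced, e.g., by the first-passage sketch above.
    Returns (t, Gs, Gw) with Gs[s] = G_k^{(s)}(t) and Gw[r] = G_k^{(w_r)}(t), r = 1..k."""
    N, k = len(P), len(w)
    t = sp.symbols('t')

    def is_prefix(i, j, n):                    # is w_i ... w_j n a prefix of w?  (1-based)
        cand = list(w[i - 1:j]) + [n]
        return cand == list(w[:len(cand)])

    def indicator(r, j, n):                    # the I_{r,j,n} of the text
        if r < j:
            return int(all(not is_prefix(i, j, n) for i in range(2, r + 1))
                       and is_prefix(r + 1, j, n))
        return int(all(not is_prefix(i, j, n) for i in range(2, j + 1)))

    Gs = {s: g_to_w1[s] for s in range(N)}     # G_1^{(s)}(t); initial letter may contribute
    Gw = {1: sp.Integer(1)}                    # G_1^{(w_1)}(t) = 1
    for j in range(1, k):                      # extend the prefix w_j to w_{j+1}
        wj, wj1 = w[j - 1], w[j]
        D = 1 - sum(P[wj][n] * t *
                    (sum(indicator(i, j, n) * Gw[j - i + 1] for i in range(1, j))
                     + indicator(j, j, n) * Gs[n])
                    for n in range(N) if n != wj1)
        factor = P[wj][wj1] * t / D
        Gs = {s: sp.simplify(factor * Gs[s]) for s in range(N)}
        Gw = {r: sp.simplify(factor * Gw[r]) for r in range(1, j + 1)}
        Gw[j + 1] = sp.Integer(1)
    return t, Gs, Gw

# Example: two equally likely letters and the pattern 11 (0-based: letters 0 and 1,
# pattern (1, 1)); here g_{s,w_1}(t) is t/(2 - t) from letter 0 and 1 from letter 1.
t0 = sp.symbols('t')
P = [[sp.Rational(1, 2), sp.Rational(1, 2)], [sp.Rational(1, 2), sp.Rational(1, 2)]]
_, Gs, Gw = pattern_waiting_pgfs(P, (1, 1), [t0 / (2 - t0), sp.Integer(1)])
print(sp.simplify(Gs[0]))    # equivalent to t**2/(4 - 2*t - t**2)
```

The "not allowing contribution" versions are then available from the relations in the theorem, namely \(\tilde{{G}}_{k}^{(s)}(t) = {G}_{k}^{(s)}(t)\) for \(s \neq {w}_{1}\) and \(\tilde{{G}}_{k}^{({w}_{1})}(t) = \tilde{{g}}_{{w}_{1},{w}_{1}}(t){G}_{k}^{({w}_{1})}(t)\). In the printed example the derivative of the returned pgf at t = 1 equals 6, the familiar mean waiting time for two consecutive occurrences of a given letter in fair coin tossing.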

The proof of Theorem 16.2.1 is based on the following simple idea. Let \({\tau }_{{\mathbf{w}}_{j}\vert \bar{{\mathbf{w}}}_{j+1}}\) be the waiting time for the first return (strictly positive) from pattern \({\mathbf{w}}_{j}\) to itself given that the pattern \({\mathbf{w}}_{j+1}\) is not achieved. Of course, the pattern \({\mathbf{w}}_{j+1}\) is not achieved if the first state visited is not state \({w}_{j+1}\). Therefore, the pgf of \({\tau }_{{\mathbf{w}}_{j}\vert \bar{{\mathbf{w}}}_{j+1}}\) is equal to

$${g}_{{\tau }_{{\mathbf{w}}_{ j}\vert \bar{{\mathbf{w}}}_{j+1}}}(t) = \sum \limits_{n=1,\;n\mathrel{\not =}{w}_{j+1}}^{N} \frac{{p}_{{w}_{j},n}t} {1 - {p}_{{w}_{j},{w}_{j+1}}} \left ( \sum \limits_{i=1}^{j-1}{I}_{ i,j,n}{G}_{j}^{({\mathbf{w}}_{j-i+1})}(t) + {I}_{ j,j,n}{G}_{j}^{(n)}(t)\right ).$$

Then, the waiting time to reach pattern \({\mathbf{w}}_{j+1}\) starting from state s is equal to one plus a geometric sum of independent random variables, \({Y }_{1},{Y }_{2}, \ldots ,\) say, such that \({Y }_{1}\) has the distribution of the waiting time to reach subpattern \({\mathbf{w}}_{j}\) from state s and the remaining \({Y }_{n}\) have the distribution of \({\tau }_{{\mathbf{w}}_{j}\vert \bar{{\mathbf{w}}}_{j+1}}.\) This implies that

$$\tilde{{G}}_{j+1}^{(s)}(t) = \frac{{p}_{{w}_{j},{w}_{j+1}}t\tilde{{G}}_{j}^{(s)}(t)} {1 - \sum \limits_{n=1,\;n\mathrel{\not =}{w}_{j+1}}^{N}{p}_{{ w}_{j},n}t\left ( \sum \limits_{i=1}^{j-1}{I}_{ i,j,n}{G}_{j}^{({\mathbf{w}}_{j-i+1})}(t) + {I}_{ j,j,n}{G}_{j}^{(n)}(t)\right )}\;.$$
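
To spell out the geometric-sum step, write \(p = {p}_{{w}_{j},{w}_{j+1}}\) and let M denote the number of failed excursions from \({\mathbf{w}}_{j}\) back to itself before the transition to \({w}_{j+1}\) occurs; M is geometric, and each failed excursion has pgf \({g}_{{\tau }_{{\mathbf{w}}_{j}\vert \bar{{\mathbf{w}}}_{j+1}}}(t)\). Summing over the values of M gives

$$\tilde{{G}}_{j+1}^{(s)}(t) =\tilde{{G}}_{j}^{(s)}(t) \sum \limits_{m=0}^{\infty }{(1 - p)}^{m}\,p\,t\,{\left ({g}_{{\tau }_{{\mathbf{w}}_{j}\vert \bar{{\mathbf{w}}}_{j+1}}}(t)\right )}^{m} = \frac{p\,t\,\tilde{{G}}_{j}^{(s)}(t)} {1 - (1 - p)\,{g}_{{\tau }_{{\mathbf{w}}_{j}\vert \bar{{\mathbf{w}}}_{j+1}}}(t)}\;,$$

and substituting the expression for \({g}_{{\tau }_{{\mathbf{w}}_{j}\vert \bar{{\mathbf{w}}}_{j+1}}}(t)\) displayed above yields the recurrence relations of Theorem 16.2.1.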

A detailed proof of Theorem 16.2.1 is found in Stefanov (2003).

16.3 Patterns: General Discrete-Time and Continuous-Time Models

In this section, extensions of the result from the preceding section are presented. Finite-state semi-Markov processes, with either discrete- or continuous-time parameters, are the underlying models for generating the strings. Also, joint distributions of the waiting time to reach a pattern, together with the associated counts of occurrences of each letter, are of interest.

16.3.1 Waiting times

The notation of the preceding section is carried over to the corresponding quantities. For example, \({g}_{i,j}(t)\) will again denote the pgf of the waiting time to reach state j from state i in the more general discrete- or continuous-time model considered here.

Let \(\{X{(u)\}}_{u\geq 0}\) (the time parameter u may be either discrete or continuous) be a semi-Markov process whose associated embedded discrete-time Markov chain has a finite state space \(\{1,2, \ldots ,N\}\) and one-step transition probabilities \({p}_{i,j}\), \(i,j = 1,2, \ldots ,N.\) For a formal definition of a semi-Markov process see Çinlar (1975). Denote by \({\phi }_{i,j}(t)\) the pgf of the holding (sojourn) time in state i, given that the next state to be visited is state j (if the holding time distributions are discrete, then the time parameter is discrete). We denote by \({g}_{i,j}(t)\) the pgf of the waiting time, \({\tau }_{i,j}\), to reach state j from state i; that is, \({g}_{i,j}(t) = E({t}^{{\tau }_{i,j}}),\) where

$${\tau }_{i,j} =\inf \{ u : X(u) = j\vert X(0) = i\}.$$

We assume \({\tau }_{i,i} = 0\), and therefore \({g}_{i,i}(t) = 1\), for each i. The first return time to state i is denoted by \(\tilde{{\tau }}_{i,i}\) and its pgf by \(\tilde{{g}}_{i,i}(t)\). Of course, if X(u) is a discrete-time Markov chain,

$$\tilde{{\tau }}_{i,i} =\inf \{ u > 0 : X(u) = i\vert X(0) = i\},$$

and if X(u) is a continuous-time Markov chain,

$$\tilde{{\tau }}_{i,i} =\inf \{ u > 0 : X(u) = i,X(u-)\mathrel{\not =}i\vert X(0) = i\}.$$

If X(u) is a general semi-Markov process, then \(\tilde{{\tau }}_{i,i}\) is understood to be the waiting time to reach state i from itself given that at least one transition has been made in the associated embedded discrete-time Markov chain. This clarifies the interpretation of \(\tilde{{\tau }}_{i,i}\) in case one-step transitions are allowed from a state to itself in the embedded discrete-time Markov chain.

Again, as in the preceding section, the pattern of interest is denoted by \({\mathbf{w}}_{k}\). Denote by \({G}_{j}^{(s)}(t)\) (\(\tilde{{G}}_{j}^{(s)}(t)\)), \(j = 1,2, \ldots ,k,\) the pgf of the waiting time to reach the pattern \({\mathbf{w}}_{j}\) from state s, allowing (not allowing) the initial state s to contribute to the pattern. Also denote by \({G}_{j}^{({\mathbf{w}}_{r})}(t),1 \leq r \leq j,\) the pgf of the waiting time to reach the pattern \({\mathbf{w}}_{j}\), given that the pattern \({\mathbf{w}}_{r}\) has already been reached (note that \({G}_{j}^{({\mathbf{w}}_{j})}(t) = 1\)). The following theorem provides a simple route for evaluating these pgf's in terms of the following characteristics of the original semi-Markov process X(u): the pgf's, \({g}_{i,j}(t)\), of the transition times between the states, the pgf's, \({\phi }_{i,j}(t)\), of the holding time distributions, and the transition probabilities, \({p}_{i,j}\), of the embedded discrete-time Markov chain.

Theorem 16.3.1.

Let the pattern of interest be \({\mathbf{w}}_{k}\). The following recurrence relations hold for each \(j,\ j = 1,2, \ldots ,k - 1,\) and each \(r,\ r = 1,2, \ldots ,j\) (with the convention \(\sum \limits_{i=1}^{0} = 0\)):

$$\begin{array}{rcl} & \tilde{{G}}_{j+1}^{(s)}(t) = \frac{{p}_{{w}_{j},{w}_{j+1}}{\phi }_{{w}_{j},{w}_{j+1}}(t)\tilde{{G}}_{j}^{(s)}(t)} {1- \sum \limits_{\begin{array}{c}n=1, \\ n\mathrel{\not =}{w}_{j+1}\end{array}}^{N}{p}_{{ w}_{j},n}{\phi }_{{w}_{j},n}(t)\left ( \sum \limits_{i=1}^{j-1}{I}_{ i,j,n}{G}_{j}^{({\mathbf{w}}_{j-i+1})}(t) + {I}_{ j,j,n}{G}_{j}^{(n)}(t)\right )}\;, & \\ & {G}_{j+1}^{({\mathbf{w}}_{r})}(t) = \frac{{p}_{{w}_{j},{w}_{j+1}}{\phi }_{{w}_{j},{w}_{j+1}}(t){G}_{j}^{({\mathbf{w}}_{r})}(t)} {1- \sum \limits_{\begin{array}{c}n=1, \\ n\mathrel{\not =}{w}_{j+1}\end{array}}^{N}{p}_{{ w}_{j},n}{\phi }_{{w}_{j},n}(t)\left ( \sum \limits_{i=1}^{j-1}{I}_{ i,j,n}{G}_{j}^{({\mathbf{w}}_{j-i+1})}(t) + {I}_{ j,j,n}{G}_{j}^{(n)}(t)\right )}\;,& \\ \end{array}$$

where

$$\begin{array}{rcl} \tilde{{G}}_{j+1}^{(s)}(t)& =& {G}_{ j+1}^{(s)}(t),\;\;\;\;\;\;\;\;\mathrm{if}\;\;s\mathrel{\not =}{w}_{ 1}, \\ \tilde{{G}}_{j+1}^{({w}_{1})}(t)& =& \tilde{{g}}_{{ w}_{1},{w}_{1}}(t){G}_{j+1}^{({w}_{1})}(t), \\ {G}_{1}^{(s)}(t)& =& {g}_{ s,{w}_{1}}(t), \\ \tilde{{G}}_{1}^{({w}_{1})}(t)& =& \tilde{{g}}_{{ w}_{1},{w}_{1}}(t) = \sum \limits_{n=1}^{N}{p}_{{ w}_{1},n}{\phi }_{{w}_{1},n}(t){g}_{n,{w}_{1}}(t)\end{array}$$

The proof is based on the same idea as that used to prove Theorem 16.2.1. Similarly to the preceding section, denote by \({\tau }_{{\mathbf{w}}_{j}\vert \bar{{\mathbf{w}}}_{j+1}}\) the waiting time to reach \({\mathbf{w}}_{j}\) from itself given that the pattern \({\mathbf{w}}_{j+1}\) is not achieved. Then one may notice that the waiting time to reach pattern \({\mathbf{w}}_{j+1}\) starting from state s is equal to the sum of two independent random variables, where the first has a pgf which equals \({\phi }_{{w}_{j},{w}_{j+1}}(t)\) and the second one is a geometric sum of independent random variables, \({Y }_{1},{Y }_{2}, \ldots ,\) say, such that \({Y }_{1}\) has the distribution of the waiting time to reach subpattern \({\mathbf{w}}_{j}\) from state s and the remaining \({Y }_{n}\) have the distribution of \({\tau }_{{\mathbf{w}}_{j}\vert \bar{{\mathbf{w}}}_{j+1}}.\)
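
Two familiar special cases indicate how the holding time pgf's enter (these are standard facts, recorded here only for orientation). If X(u) is the discrete-time Markov chain of Section 16.2, every holding time equals one, so \({\phi }_{i,j}(t) = t\) and Theorem 16.3.1 reduces to Theorem 16.2.1. If X(u) is a continuous-time Markov chain whose holding time X in state i is exponentially distributed with rate \({\lambda }_{i}\), then

$${\phi }_{i,j}(t) = E\left ({t}^{X}\right ) = E\left ({e}^{X\ln t}\right ) = \frac{{\lambda }_{i}} {{\lambda }_{i} -\ln t}\;,\;\;\;\;\;0 < t \leq 1,$$

and this expression is what is substituted into the recurrence relations of Theorem 16.3.1.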

16.3.2 Joint generating functions associated with waiting times

In this subsection we consider the same general semi-Markov model X(u) that was introduced in the preceding subsection. Recall that its embedded discrete-time Markov chain has N states. Throughout this subsection these states will be called 'symbols'. Again, the notation of the preceding subsections is carried over to the corresponding quantities (such as \({G}_{j}^{(s)}(\cdot)\), etc.). Note that basic quantities of the underlying model, such as \({\tau }_{i,j}\) and \({\phi }_{i,j}\), have the same meaning as in the preceding subsection.

Let \({C}_{i}(u)\) be the count of occurrences of symbol i up to time u, and let \({g}_{i,j}(\underline{\mathbf{t}}),\) where \(\underline{\mathbf{t}} = ({t}_{0},{t}_{1}, \ldots ,{t}_{N}),\) be the joint pgf of \(({\tau }_{i,j},{C}_{1}({\tau }_{i,j}), \ldots ,{C}_{N}({\tau }_{i,j})),\) where the \({\tau }_{i,j}\) have been introduced in the preceding subsection. Likewise, let \(\tilde{{g}}_{i,i}(\underline{\mathbf{t}})\) be the joint pgf of \((\tilde{{\tau }}_{i,i},{C}_{1}(\tilde{{\tau }}_{i,i}), \ldots ,{C}_{N}(\tilde{{\tau }}_{i,i})),\) where again the \(\tilde{{\tau }}_{i,i}\) have been introduced in the preceding subsection. Note that \({g}_{i,i}(\underline{\mathbf{t}}) = 1\). Denote by \({\nu }_{j}^{(s)}\) the waiting time to reach the pattern \({\mathbf{w}}_{j}\) from state s. Let \({G}_{j}^{(s)}(\underline{\mathbf{t}})\) \((\tilde{{G}}_{j}^{(s)}(\underline{\mathbf{t}}))\) be the joint pgf of \(({\nu }_{j}^{(s)},{C}_{1}({\nu }_{j}^{(s)}), \ldots ,{C}_{N}({\nu }_{j}^{(s)})),\) allowing (not allowing) the first symbol to contribute to the pattern. Further, let \({\nu }_{j}^{({\mathbf{w}}_{r})}\) be the waiting time to reach the pattern \({\mathbf{w}}_{j}\) from the already-reached prefix \({\mathbf{w}}_{r}\), and let \({G}_{j}^{({\mathbf{w}}_{r})}(\underline{\mathbf{t}})\) be the joint pgf of \(({\nu }_{j}^{({\mathbf{w}}_{r})},{C}_{1}({\nu }_{j}^{({\mathbf{w}}_{r})}), \ldots ,{C}_{N}({\nu }_{j}^{({\mathbf{w}}_{r})})).\) Note that the methodology introduced in Stefanov (2000; see Section 3) yields explicit expressions for the pgf's \({g}_{i,j}(\underline{\mathbf{t}})\) associated with any given semi-Markov process whose embedded discrete-time Markov chain has a relatively small number of states. Therefore, the recurrence relations in the following theorem provide a simple route for explicit evaluation of the joint pgf's of the waiting time to reach, or the intersite distance between two consecutive occurrences of, a pattern and the associated counts of occurrences of the corresponding symbols (letters).

Theorem 16.3.2.

Let the pattern of interest be \({\mathbf{w}}_{k}\). The following recurrence relations hold for each \(j,\ j = 1,2, \ldots ,k - 1,\) and each \(r,\ r = 1,2, \ldots ,j\):

$$\begin{array}{rcl} & \tilde{{G}}_{j+1}^{(s)}(\underline{\mathbf{t}}) = \frac{{p}_{{w}_{j},{w}_{j+1}}{t}_{{w}_{j+1}}{\phi }_{{w}_{j},{w}_{j+1}}({t}_{0})\tilde{{G}}_{j}^{(s)}(\underline{\mathbf{t}})} {1- \sum \limits_{\begin{array}{c}n=1, \\ n\mathrel{\not =}{w}_{j+1}\end{array}}^{N}{p}_{{ w}_{j},n}{t}_{n}{\phi }_{{w}_{j},n}({t}_{0})\left ( \sum \limits_{i=1}^{j-1}{I}_{ i,j,n}{G}_{j}^{({\mathbf{w}}_{j-i+1})}(\underline{\mathbf{t}}) + {I}_{ j,j,n}{G}_{j}^{(n)}(\underline{\mathbf{t}})\right )}, & \\ & {G}_{j+1}^{({\mathbf{w}}_{r})}(\underline{\mathbf{t}}) = \frac{{p}_{{w}_{j},{w}_{j+1}}{t}_{{w}_{j+1}}{\phi }_{{w}_{j},{w}_{j+1}}({t}_{0}){G}_{j}^{({\mathbf{w}}_{r})}(\underline{\mathbf{t}})} {1- \sum \limits_{\begin{array}{c}n=1, \\ n\mathrel{\not =}{w}_{j+1}\end{array}}^{N}{p}_{{ w}_{j},n}{t}_{n}{\phi }_{{w}_{j},n}({t}_{0})\left ( \sum \limits_{i=1}^{j-1}{I}_{ i,j,n}{G}_{j}^{({\mathbf{w}}_{j-i+1})}(\underline{\mathbf{t}}) + {I}_{ j,j,n}{G}_{j}^{(n)}(\underline{\mathbf{t}})\right )},& \\ \end{array}$$

where

$$\begin{array}{rcl} \tilde{{G}}_{j+1}^{(s)}(\underline{\mathbf{t}})& =& {G}_{ j+1}^{(s)}(\underline{\mathbf{t}}),\;\;\;\;\;\;\;\mathrm{if}\;\;s\mathrel{\not =}{w}_{ 1}, \\ \tilde{{G}}_{j+1}^{({w}_{1})}(\underline{\mathbf{t}})& =& \tilde{{g}}_{{ w}_{1},{w}_{1}}(\underline{\mathbf{t}}){G}_{j+1}^{({w}_{1})}(\underline{\mathbf{t}}), \\ {G}_{1}^{(s)}(\underline{\mathbf{t}})& =& {g}_{ s,{w}_{1}}(\underline{\mathbf{t}}), \\ \tilde{{G}}_{1}^{({w}_{1})}(\underline{\mathbf{t}})& =& \tilde{{g}}_{{ w}_{1},{w}_{1}}(\underline{\mathbf{t}}) = \sum \limits_{n=1}^{N}{p}_{{ w}_{1},n}{t}_{n}{\phi }_{{w}_{1},n}({t}_{0}){g}_{n,{w}_{1}}(\underline{\mathbf{t}})\end{array}$$

The proof of this theorem is found in Stefanov (2003).
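
Note that setting \({t}_{1} = \cdots = {t}_{N} = 1\) and \({t}_{0} = t\) in Theorem 16.3.2 recovers the recurrence relations of Theorem 16.3.1, whereas setting \({t}_{0} = 1\) isolates the joint pgf of the letter counts accumulated up to the corresponding waiting time; this provides a quick consistency check for any implementation of the relations.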

16.4 Compound Patterns

Throughout this section we assume that the strings are generated by discrete-time Markov chains.

16.4.1 Compound patterns containing a small number of single patterns

Denote by \(\mathbf{W}\) a compound pattern which consists of k distinct single patterns, \({\mathbf{w}}^{(1)},{\mathbf{w}}^{(2)}, \ldots ,{\mathbf{w}}^{(k)}.\) The latter may have different lengths, and it is assumed that none of them is a proper substring of any of the others. Let \(\mathbf{a}\) be an arbitrary pattern; in particular, if \(\mathbf{a}\) has length 1, that is, it is equal to a particular letter, s say, then we will denote \(\mathbf{a}\) by s. Introduce the following quantities.

\({T}_{\mathbf{a},\mathbf{W}}\) — the waiting time, starting from pattern \(\mathbf{a}\), to reach for the first time the compound pattern \(\mathbf{W}\); if \(\mathbf{a}\) equals one of the \({\mathbf{w}}^{(i)}\), then this waiting time is assumed to be greater than 0;

\({T}_{\mathbf{a},\mathbf{W}\vert {\mathbf{w}}^{(j)}}\) — the waiting time, starting from pattern \(\mathbf{a}\), to reach for the first time the compound pattern \(\mathbf{W}\), given that \(\mathbf{W}\) is reached via \({\mathbf{w}}^{(j)}\);

\({T}_{\mathbf{a},\mathbf{b}}\) — the waiting time to reach pattern \(\mathbf{b}\) starting from pattern \(\mathbf{a}\);

\({X}_{i,j}\) — the interarrival time between two consecutive occurrences of pattern \(\mathbf{W}\), given that the starting pattern is \({\mathbf{w}}^{(i)}\) and the reached pattern is \({\mathbf{w}}^{(j)}\);

\({r}_{i,j}\) — the probability that the first reached pattern from \(\mathbf{W}\) is \({\mathbf{w}}^{(j)}\), given that the starting pattern is \({\mathbf{w}}^{(i)}\).

Of course, \({X}_{i,j} = {T}_{{\mathbf{w}}^{(i)},\mathbf{W}\vert {\mathbf{w}}^{(j)}}.\) Introduce the following pgf’s:

$${G}_{\mathbf{a},\mathbf{W},j}(t) = \sum \limits_{n=1}^{\infty }P\left ({T}_{\mathbf{ a},\mathbf{W}} = {T}_{\mathbf{a},\mathbf{W}\vert {\mathbf{w}}^{(j)}} = n\right ){t}^{n},\;\;\;\;\;j = 1,2, \ldots ,k,$$

and recall that by \({G}_{Y}(t)\) we denote the pgf of a random variable Y. Clearly,

$${r}_{i,j} = P\left ({T}_{{\mathbf{w}}^{(i)},\mathbf{W}} = {T}_{{\mathbf{w}}^{(i)},\mathbf{W}\vert {\mathbf{w}}^{(j)}}\right ) = {G}_{{\mathbf{w}}^{(i)},\mathbf{W},j}(1).$$

Also, it is easy to see that

$${G}_{{X}_{i,j}}(t) = \frac{{G}_{{\mathbf{w}}^{(i)},\mathbf{W},j}(t)} {{r}_{i,j}} \;.$$

Therefore, both the \({r}_{i,j}\) and the pgf's \({G}_{{X}_{i,j}}(t)\) can be recovered from the pgf's \({G}_{{\mathbf{w}}^{(i)},\mathbf{W},j}(t).\) The following theorem [see Chryssaphinou and Papastavridis (1990) and Gerber and Li (1981)] provides, for each pattern \(\mathbf{a}\), a system of linear equations from which one can recover the pgf's \({G}_{\mathbf{a},\mathbf{W},j}(t)\) and \({G}_{{T}_{\mathbf{a},\mathbf{W}}}(t)\) in terms of the pgf's \({G}_{{T}_{{\mathbf{w}}^{(i)},{\mathbf{w}}^{(j)}}}(t).\) The \({G}_{{T}_{{\mathbf{w}}^{(i)},{\mathbf{w}}^{(j)}}}(t)\) are derived from the results in Section 16.2.

Theorem 16.4.1.

The following identities hold:

$$\begin{array}{rcl} {G}_{{T}_{\mathbf{a},\mathbf{W}}}(t)& =& \sum \limits_{j=1}^{k}{G}_{\mathbf{a},\mathbf{W},j}(t), \\ {G}_{{T}_{\mathbf{a},{\mathbf{w}}^{(i)}}}(t)& =& {G}_{\mathbf{a},\mathbf{W},i}(t) + \sum \limits_{j=1,\;j\mathrel{\not =}i}^{k}{G}_{{T}_{{\mathbf{w}}^{(j)},{\mathbf{w}}^{(i)}}}(t){G}_{\mathbf{a},\mathbf{W},j}(t),\;\;\;\;\;i = 1,2, \ldots ,k.\end{array}$$

In particular, we get the following explicit expressions for the \({G}_{{\mathbf{w}}^{(i)},\mathbf{W},j}(t)\) in terms of the \({G}_{{T}_{{\mathbf{w}}^{(i)},{\mathbf{w}}^{(j)}}}(t)\) if the compound pattern \(\mathbf{W} = \{{\mathbf{w}}^{(1)},{\mathbf{w}}^{(2)}\}\) consists of two patterns. For brevity, \({G}_{{T}_{i,j}}\) below stands for \({G}_{{T}_{{\mathbf{w}}^{(i)},{\mathbf{w}}^{(j)}}}(t).\)

$$\begin{array}{rcl}{ G}_{{\mathbf{w}}^{(1)},\mathbf{W},1}(t)& =& \frac{{G}_{{T}_{1,1}} - {G}_{{T}_{1,2}}{G}_{{T}_{2,1}}} {1 - {G}_{{T}_{1,2}}{G}_{{T}_{2,1}}} \;, \\ {G}_{{\mathbf{w}}^{(1)},\mathbf{W},2}(t)& =& \frac{{G}_{{T}_{1,2}} - {G}_{{T}_{1,1}}{G}_{{T}_{1,2}}} {1 - {G}_{{T}_{1,2}}{G}_{{T}_{2,1}}} \;, \\ {G}_{{\mathbf{w}}^{(2)},\mathbf{W},1}(t)& =& \frac{{G}_{{T}_{2,1}} - {G}_{{T}_{2,1}}{G}_{{T}_{2,2}}} {1 - {G}_{{T}_{1,2}}{G}_{{T}_{2,1}}} \;, \\ {G}_{{\mathbf{w}}^{(2)},\mathbf{W},2}(t)& =& \frac{{G}_{{T}_{2,2}} - {G}_{{T}_{1,2}}{G}_{{T}_{2,1}}} {1 - {G}_{{T}_{1,2}}{G}_{{T}_{2,1}}} \;.\end{array}$$
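
These expressions follow by solving the two-by-two linear system of Theorem 16.4.1 for the unknowns \({G}_{\mathbf{a},\mathbf{W},1}(t)\) and \({G}_{\mathbf{a},\mathbf{W},2}(t)\); a small symbolic check (our own sketch, not code from the source) is:

```python
import sympy as sp

# G11, G12, G21, G22 stand for G_{T_{w^(i), w^(j)}}(t); y1, y2 for G_{a,W,1}(t), G_{a,W,2}(t).
G11, G12, G21, G22 = sp.symbols('G11 G12 G21 G22')
y1, y2 = sp.symbols('y1 y2')

def solve_via_pgfs(lhs1, lhs2):
    """Solve lhs_i = G_{a,W,i} + sum_{j != i} G_{T_{w^(j), w^(i)}} G_{a,W,j}, i = 1, 2."""
    eqs = [sp.Eq(lhs1, y1 + G21 * y2),    # equation for the target pattern w^(1)
           sp.Eq(lhs2, G12 * y1 + y2)]    # equation for the target pattern w^(2)
    sol = sp.solve(eqs, [y1, y2], dict=True)[0]
    return sp.simplify(sol[y1]), sp.simplify(sol[y2])

# a = w^(1): the left-hand sides are G_{T_{1,1}} and G_{T_{1,2}}
print(solve_via_pgfs(G11, G12))   # equivalent to the first two displayed expressions
# a = w^(2): the left-hand sides are G_{T_{2,1}} and G_{T_{2,2}}
print(solve_via_pgfs(G21, G22))   # equivalent to the last two displayed expressions
```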

16.4.2 Weighted counts of compound patterns

A quantity of interest is the count of occurrences of a compound pattern, \(\mathbf{W}\) say (as introduced in Subsection 16.4.1), within a finite time horizon. A more general quantity is the weighted count of pattern occurrences, which attaches a weight, \({h}_{i}\) say, to each occurrence of a single pattern, \({\mathbf{w}}^{(i)}\), from \(\mathbf{W}\). More specifically, introduce

$${H}_{\mathbf{W}}(t) = \sum \limits_{i=1}^{k}{h}_{ i}{N}_{{\mathbf{w}}^{(i)}}(t),$$

where \({N}_{{\mathbf{w}}^{(i)}}(t)\) is the count of occurrences of pattern \({\mathbf{w}}^{(i)}\) within a time interval of length t. Recall the meaning of the \({r}_{i,j}\), \({X}_{i,j}\), and \({T}_{s,\mathbf{W}\vert {\mathbf{w}}^{(j)}}\) introduced in Subsection 16.4.1. Of course, the occurrence of \(\mathbf{W}\) can be modelled by a k-state semi-Markov process, where an entry to state i identifies an occurrence of pattern \({\mathbf{w}}^{(i)}\). The one-step transition probabilities of the embedded discrete-time Markov chain of this semi-Markov process are the \({r}_{i,j}\). The holding time at state i, given that the next state to be visited is state j, is identified by the random variable \({X}_{i,j}\). For each initial letter, s say, we augment this semi-Markov process with one initial state, 0 say, and relevant one-step transition probabilities and holding times as follows (we denote the probability to move from state 0 to state j by \({r}_{0,j}\)):

$${r}_{0,0} = 0,\;\;\;\;{r}_{0,j} = {G}_{s,\mathbf{W},j}(1),\;\;\;\;\;j = 1,2, \ldots ,k,$$

and the holding time at state 0, given that the next state to be visited is state j, is identified by \({T}_{s,\mathbf{W}\vert {\mathbf{w}}^{(j)}},\) where the latter and \({G}_{s,\mathbf{W},j}(t)\) are introduced in Subsection 16.4.1. Now consider the semi-Markov process, \({Y }_{t}\) say, derived from that above as follows. The state space has \({(k + 1)}^{2}\) states, identified by the pairs \((i,j),\ i,j = 0,1, \ldots ,k.\) The process \({Y }_{t}\) enters state (i, j) if pattern \({\mathbf{w}}^{(i)}\) is reached, given that the next occurrence of \(\mathbf{W}\) is via pattern \({\mathbf{w}}^{(j)}\). The initial states are the states (0, j) for \(j = 1,2, \ldots ,k,\) and the initial probabilities are the \({r}_{0,j}\). Clearly, the holding time distributions for this new semi-Markov process do not depend on the next state visited. Also, the holding time in state (i, j) is identified by the random variable \({X}_{i,j}\), and that in state (0, j) by \({T}_{s,\mathbf{W}\vert {\mathbf{w}}^{(j)}}.\) Then the weighted count \({H}_{\mathbf{W}}(t)\), introduced above, is equal to

$${H}_{\mathbf{W}}(t) = \sum \limits_{i=0}^{k} \sum \limits_{j=0}^{k}{h}_{ i}{N}_{(i,j)}(t),$$

where \({N}_{(i,j)}(t)\) counts the number of visits of \({Y }_{t}\) to state (i, j) within a time interval of length t. Denote by \({\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})}\) the first passage time of \({Y }_{t}\) from state \(({i}_{1},{j}_{1})\) to state \(({i}_{2},{j}_{2})\) and by \({L}_{{H}_{\mathbf{W}}}^{{\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})}}({s}_{1},{s}_{2})\) the joint Laplace transform of the random variables \({\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})}\) and \({H}_{\mathbf{W}}\left ({\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})}\right )\), that is,

$${L}_{{H}_{\mathbf{W}}}^{{\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})} }({s}_{1},{s}_{2}) = E\left (\exp \left (-{s}_{1}{\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})} - {s}_{2}{H}_{\mathbf{W}}({\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})})\right )\right ).$$

Closed-form expressions for the \({L}_{{H}_{\mathbf{W}}}^{{\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})}}({s}_{1},{s}_{2})\) are derivable in terms of the \({r}_{i,j}\) and the Laplace transforms of the \({X}_{i,j}\) and the \({T}_{s,\mathbf{W}\vert {\mathbf{w}}^{(j)}}\), as explained in Stefanov (2006) for general reward functions on semi-Markov processes. Let

$${L}_{t,{H}_{\mathbf{W}}}^{(s)}({s}_{1},{s}_{2}) = \int _{0}^{\infty } \int _{0}^{\infty }{e}^{-{s}_{1}t-{s}_{2}x}P\left ({H}_{\mathbf{W}}(t) \leq x\vert \mbox{ the initial letter is}\;s\right )\,dx\,dt.$$

The following theorem follows from a general result on reward functions for semi-Markov processes [see Theorem 2.1 in Stefanov (2006)]. It provides an explicit, closed-form expression for the Laplace transform, \({L}_{t,{H}_{\mathbf{W}}}^{(s)}({s}_{1},{s}_{2}),\) of the weighted count of \(\mathbf{W}\) occurrences within a time interval of length t, in terms of the \({r}_{i,j}\), the Laplace transforms, \(\mathcal{L}[{X}_{i,j}](\cdot ),\) of the interarrival times \({X}_{i,j}\) of the compound pattern \(\mathbf{W}\), and the Laplace transforms, \(\mathcal{L}[{T}_{s,\mathbf{W}\vert {\mathbf{w}}^{(j)}}](\cdot ),\) of the waiting time to reach \(\mathbf{W}\) from an initial letter s, for \(s = 1,2, \ldots ,N.\)

Theorem 16.4.2.

The following identity holds for the Laplace transform \({L}_{t,{H}_{\mathbf{W}}}^{(s)}\!\! :\)

$${L}_{t,{H}_{\mathbf{W}}}^{(s)}({s}_{ 1},{s}_{2}) = \sum \limits_{m=1}^{k}{r}_{ 0,m} \sum \limits_{i,j=1}^{k}\frac{\left (1 -\mathcal{L}[{X}_{i,j}]({s}_{1} + {s}_{2}{h}_{i})\right ){L}_{{H}_{\mathbf{W}}}^{{\nu }_{(0,m),(i,j)} }({s}_{1},{s}_{2})} {{s}_{2}({s}_{1} + {s}_{2}{h}_{i})\left (1 - {L}_{{H}_{\mathbf{W}}}^{{\nu }_{(i,j),(i,j)} }({s}_{1},{s}_{2})\right )} \;,$$

where the joint Laplace transforms \({L}_{{H}_{\mathbf{W}}}^{{\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})} }({s}_{1},{s}_{2})\) have been introduced above.
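
The distribution whose Laplace transform appears in Theorem 16.4.2 can be cross-checked by brute force; the following crude Monte Carlo sketch (our own illustration, with names of our choosing) estimates the mean weighted count directly from simulated strings of a discrete-time Markov chain:

```python
import random

def simulate_weighted_count(P, start, patterns, weights, horizon, reps=10000, seed=0):
    """Estimate E[H_W(horizon)]: generate `horizon` letters of the chain with transition
    matrix P (0-based letters) started from `start`, and add weights[i] every time
    patterns[i] is completed; overlapping occurrences count."""
    rng = random.Random(seed)
    letters = list(range(len(P)))
    total = 0.0
    for _ in range(reps):
        state, text, h = start, [start], 0.0
        for _ in range(horizon):
            state = rng.choices(letters, weights=P[state])[0]
            text.append(state)
            for pat, wt in zip(patterns, weights):
                if len(text) >= len(pat) and text[-len(pat):] == list(pat):
                    h += wt
        total += h
    return total / reps

# Example: fair two-letter chain, W = {11, 22} in 0-based labels, unit weights, 20 letters.
P = [[0.5, 0.5], [0.5, 0.5]]
print(simulate_weighted_count(P, 0, [(1, 1), (0, 0)], [1.0, 1.0], horizon=20))
```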

16.4.3 Structured motifs

Structured motifs are special compound patterns, usually containing a huge number of single patterns. In this subsection we consider both the waiting time until the first occurrence, and the intersite distance between consecutive occurrences, of a structured motif. The interest in these waiting times is due to the biological challenge of identifying promoter motifs along genomes. A structured motif is composed of several patterns separated by a variable distance. If the number of patterns is n, then the structured motif is said to have n boxes. The formal definition of a structured motif with 2 boxes follows. Let \({\mathbf{w}}^{(1)}\) and \({\mathbf{w}}^{(2)}\) be two patterns of length \({k}_{1}\) and \({k}_{2}\), respectively. The alphabet size equals N, and the strings are generated by the Markov chain introduced in Section 16.2. A structured motif \(\mathbf{m}\) formed by the patterns \({\mathbf{w}}^{(1)}\) and \({\mathbf{w}}^{(2)}\), and denoted by \(\mathbf{m} ={ \mathbf{w}}^{(1)}({d}_{1} : {d}_{2}){\mathbf{w}}^{(2)},\) is a string with the following property. Pattern \({\mathbf{w}}^{(1)}\) is a prefix and pattern \({\mathbf{w}}^{(2)}\) is a suffix of the string, and the number of letters between the two patterns is not smaller than \({d}_{1}\) and not greater than \({d}_{2}\). Also, it is assumed that patterns \({\mathbf{w}}^{(1)}\) and \({\mathbf{w}}^{(2)}\) appear only once in the string. The pgf's of both the waiting time, \({\tau }_{\mathbf{m}}^{(s)}\), to reach for the first time the structured motif \(\mathbf{m}\) from state s, and the intersite distance, \({\tau }_{\mathbf{m}}^{(intersite)},\) between two consecutive occurrences of \(\mathbf{m}\), are of interest.

Let \(\mathbf{W} =\{{ \mathbf{w}}^{(1)},{\mathbf{w}}^{(2)}\}\) be a compound pattern consisting of two patterns. For brevity, denote by \({T}_{i,j}\), \(i,j \in \{ 1,2\}\), the waiting time to reach pattern \({\mathbf{w}}^{(j)}\) from pattern \({\mathbf{w}}^{(i)}\), and by \({T}_{j}^{(s)}\) the waiting time to reach pattern \({\mathbf{w}}^{(j)}\) from state s. The quantities \({r}_{i,j}\) and \({X}_{i,j}\), \(i,j \in \{ 1,2\}\), are introduced in Subsection 16.4.1. Let

$${a}_{i,j}(x) = P({X}_{i,j} = x).$$

In order to reach the structured motif \(\mathbf{m}\), we first need to reach the pattern \({\mathbf{w}}^{(1)}\) and, from this occurrence of \({\mathbf{w}}^{(1)}\), to reach the pattern \({\mathbf{w}}^{(2)}\) in such a way that \({d}_{1} + {k}_{2} \leq {X}_{1,2} \leq {d}_{2} + {k}_{2}\). Introduce the following random variables:

$$\begin{array}{rcl}{ F}_{12}& =& ({X}_{1,2}\:\vert \:{X}_{1,2} < {d}_{1} + {k}_{2}\;\;\mathrm{or}\;\;{X}_{1,2} > {d}_{2} + {k}_{2}), \\ {S}_{12}& =& ({X}_{1,2}\:\vert \:{d}_{1} + {k}_{2} \leq {X}_{1,2} \leq {d}_{2} + {k}_{2})\end{array}$$

\({F}_{12}\) corresponds to an occurrence of \({\mathbf{w}}^{(2)}\) that fails to achieve the structured motif, whereas for \({S}_{12}\), \({\mathbf{w}}^{(2)}\) achieves the structured motif. One may notice that the pgf's of \({F}_{12}\) and \({S}_{12}\) are given by

$$\begin{array}{rcl}{ G}_{{F}_{12}}(t)& =& \left ({G}_{{X}_{12}}(t) - \sum \limits_{x={d}_{1}+{k}_{2}}^{{d}_{2}+{k}_{2} }{a}_{1,2}(x){t}^{x}\right ){\left (1 - {q}_{ S}\right )}^{-1} \\ {G}_{{S}_{12}}(t)& =& \left ( \sum \limits_{x={d}_{1}+{k}_{2}}^{{d}_{2}+{k}_{2} }{a}_{1,2}(x){t}^{x}\right ){q}_{ S}^{-1}, \\ \end{array}$$

where \({q}_{S}\) is the probability of 'success' (\({\mathbf{w}}^{(2)}\) achieves the structured motif), i.e., the probability that \({d}_{1} + {k}_{2} \leq {X}_{1,2} \leq {d}_{2} + {k}_{2}\). Namely, we have

$${q}_{S} = \sum \limits_{x={d}_{1}+{k}_{2}}^{{d}_{2}+{k}_{2} }{a}_{1,2}(x).$$

The following theorem provides explicit and calculable expressions for the pgf's of both the waiting time to reach for the first time the structured motif \(\mathbf{m} ={ \mathbf{w}}^{(1)}({d}_{1} : {d}_{2}){\mathbf{w}}^{(2)}\) from state s, and the intersite distance between two consecutive occurrences of \(\mathbf{m}\).

Theorem 16.4.3.

The pgf, \({G}_{\mathbf{m}}^{(s)}(t),\) of the waiting time to reach for the first time a structured motif \(\mathbf{m}\) starting from state s, and the pgf, \({G}_{\mathbf{m}}^{(intersite)}(t),\) of the intersite distance between two consecutive occurrences of \(\mathbf{m}\), admit the following explicit expressions:

$${G}_{\mathbf{m}}^{(s)}(t) = \frac{{r}_{1,2}\,{q}_{S}\,{G}_{{T}_{1}^{(s)}}(t)\,{G}_{{S}_{12}}(t)} {(1 - (1 - {r}_{1,2}){G}_{{X}_{1,1}}(t))\left (1 - (1 - {q}_{S})\left (\frac{{r}_{1,2}\,{G}_{{T}_{2,1}}(t)\,{G}_{{F}_{12}}(t)} {1-(1-{r}_{1,2}){G}_{{X}_{1,1}}(t)} \right )\right )}\;,$$
$${G}_{\mathbf{m}}^{(intersite)}(t) = \frac{{r}_{1,2}\,{q}_{S}\,{G}_{{T}_{2,1}}(t)\,{G}_{{S}_{12}}(t)} {\left (1 - (1 - {r}_{1,2}){G}_{{X}_{1,1}}(t)\right )\left (1 - (1 - {q}_{S})\left (\frac{{r}_{1,2}\,{G}_{{T}_{2,1}}(t)\,{G}_{{F}_{12}}(t)} {1-(1-{r}_{1,2}){G}_{{X}_{1,1}}(t)} \right )\right )}\;,$$

where \({G}_{{F}_{12}}(t),{G}_{{S}_{12}}(t),\) and \({q}_{S}\) are given above.

The proof of this theorem is found in Stefanov, Robin, and Schbath (2007). Note that, in view of this theorem, the availability of the pgf’s \({G}_{{X}_{i,j}}(t),\;i,j = 1,2,\) is enough to calculate explicit, closed-form expressions for \({G}_{\mathbf{m}}^{(s)}(t)\) and \({G}_{\mathbf{m}}^{(intersite)}(t).\) Explicit expressions for the \({G}_{{X}_{i,j}}(t),\) in terms of the \({G}_{{T}_{{\mathbf{ w}}^{(i)},{\mathbf{w}}^{(j)}}}(t),\) are derived from the identities at the end of Subsection 16.4.1. Also, recall that the \({G}_{{T}_{{\mathbf{ w}}^{(i)},{\mathbf{w}}^{(j)}}}(t)\) are calculated from Theorem 16.2.1 in Section 16.2.
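
In view of this, the pgf \({G}_{\mathbf{m}}^{(s)}(t)\) can be assembled mechanically once the ingredient pgf's are available. The following sketch (our own code, with argument names of our choosing) extracts the probabilities \({a}_{1,2}(x)\) from a series expansion of \({G}_{{X}_{1,2}}(t)\), forms \({q}_{S}\), \({G}_{{S}_{12}}(t)\), and \({G}_{{F}_{12}}(t)\), and then evaluates the first formula of Theorem 16.4.3:

```python
import sympy as sp

def motif_pgf(t, G_T1_s, G_X11, G_X12, G_T21, r12, d1, d2, k2):
    """Assemble G_m^{(s)}(t) for m = w^(1)(d1:d2)w^(2) from Theorem 16.4.3.
    The arguments are sympy expressions in t: G_T1_s = pgf of the waiting time to reach
    w^(1) from letter s, G_X11 and G_X12 = pgf's of the interarrival times X_{1,1} and
    X_{1,2}, G_T21 = pgf of T_{2,1}, together with r_{1,2} and the motif parameters."""
    # coefficients a_{1,2}(x) of G_{X_{1,2}}(t) for x = d1 + k2, ..., d2 + k2
    series = sp.series(G_X12, t, 0, d2 + k2 + 1).removeO()
    window = sum(series.coeff(t, x) * t**x for x in range(d1 + k2, d2 + k2 + 1))
    qS = window.subs(t, 1)                      # probability of 'success'
    G_S12 = window / qS
    G_F12 = (G_X12 - window) / (1 - qS)
    inner = r12 * G_T21 * G_F12 / (1 - (1 - r12) * G_X11)
    return sp.simplify(r12 * qS * G_T1_s * G_S12
                       / ((1 - (1 - r12) * G_X11) * (1 - (1 - qS) * inner)))
```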

Neat closed-form expressions for the relevant pgf’s associated with structured motifs with n boxes are found in Stefanov, Robin, and Schbath (2009).