Abstract
Patterns and motifs on finite alphabets are of interest in many applied areas, such as computational molecular biology, computer science, communication theory, and reliability theory. The exact distribution theory associated with occurrences of patterns (single or compound) and motifs in random strings of letters is treated in this chapter. The strings are generated by a Markov source and, in the case of single patterns, by more general discrete-time or continuous-time models. Here, the interest is in finding closed-form expressions for the distributions of the following quantities: (i) the waiting time until the first occurrence of a pattern (motif), (ii) the intersite distances between consecutive occurrences of such patterns, and (iii) the count of occurrences of a pattern, or more generally, the weighted count of occurrences of a compound pattern, both within a finite time horizon. General exact distribution results are discussed. Also, a brief guide to the various methodological tools used in the area is provided in the Introduction.
16.1 Introduction
Patterns and motifs on finite alphabets are of interest in many applied areas, such as computational molecular biology, computer science, communication theory, and reliability theory. A word on an alphabet is called a single pattern, and a set of distinct single patterns (words) is called a compound pattern. The strings (texts) of letters can be generated either by independent and identically distributed multinomial trials, or by general discrete-time or continuous-time models (Markov chains or semi-Markov processes). The main interest, from a probabilistic/statistical point of view, is in finding practicable closed-form expressions for the distributions of the following quantities: the waiting time until the first occurrence of a pattern (single or compound) or motif, the intersite distances between consecutive occurrences of such, and the counts of occurrences of patterns or motifs within a finite time horizon. Motifs are special cases of compound patterns which usually contain a huge number of distinct single patterns.
The theory on pattern occurrence has attracted a variety of methodological tools. For example, the following methodologies have been widely used in the literature: combinatorial methods and classical probabilistic methods based on conditioning arguments, Markov chain embeddings, Markov renewal embeddings, exponential families, martingale techniques, and automata theory. The usefulness of these methodologies to the area is well illustrated in the sources that follow.
Runs are the simplest patterns. Feller (1950) showed how recurrent event theory can be used to solve problems about success runs. For a comprehensive account of the literature on runs see Balakrishnan and Koutras (2002). The key to handling complex patterns was provided by Conway’s leading numbers, which account for the overlapping structure of a pattern. Guibas and Odlyzko (1981) derived results applying elementary methods, and Chryssaphinou and Papastavridis (1990) extended them to more general models [see also Robin and Daudin (1999, 2001), Rukhin (2002, 2006), Han and Hirano (2003), and Inoue and Aki (2007)]. Li (1980) introduced martingale techniques to the area, and Gerber and Li (1981) combined the latter with a relevant Markov chain embedding. Martingale tools have also been used in Pozdnyakov et al. (2005), Glaz et al. (2006), and Pozdnyakov (2008).
Markov chain embeddings have been widely used in the area for treating problems on pattern occurrence; a few relevant sources are Fu (1996), Chadjiconstantinidis, Antzoulakos, and Koutras (2000), Antzoulakos (2001), Fu and Chang (2002), and Fu and Lou (2003). Blom and Thorburn (1982) made connections with Markov renewal theory, and this was systematically exploited by Biggins and Cannings (1987) and Biggins (1987). Stefanov and Pakes (1997) introduced exponential family methodology, combined with a minimal Markov chain embedding, and Stefanov (2000) extended it in combination with suitable Markov renewal embeddings to handle some special compound patterns (sets of runs).
Nicodème, Salvy, and Flajolet (2002) used automata theory comprehensively. Nuel (2008) combined automata theory with Markov chain embeddings and elaborated on a route which leads, for any given pattern(s), to a minimal embedding Markov chain. Reinert, Schbath, and Waterman (2000) provided a survey on some probabilistic tools used in the theory of patterns, and Szpankowski (2001) treated problems on pattern occurrence associated with average case analysis of string searching algorithms. The first exact distributional results on structured motifs are found in Stefanov, Robin, and Schbath (2007) [cf. also Robin et al. (2002), Nuel (2008), and Pozdnyakov (2008)].
In this chapter, results are discussed which provide explicit, closed-form solutions for the distributions of the aforementioned random quantities associated with the occurrence of patterns and structured motifs. These results are derived using predominantly simple probabilistic tools. Also, for a given alphabet, they require a preliminary (easy) evaluation of a few basic characteristics, and then each pattern case is covered in an automated way.
In Sections 16.2 and 16.3 we discuss single patterns. The strings are generated by discrete- or continuous-time semi-Markov processes. The exact distribution of the waiting time until the first occurrence of a pattern, given any (fixed) portion of it has been reached, is found. Also joint distributional results are discussed. The method relies on the knowledge of basic characteristics associated with the underlying model used to generate the strings. These basic characteristics are the probability generating functions (pgf’s) of the waiting times until another letter of the alphabet is reached. In other words, we need to know only the pgf’s of the waiting times until the simplest special patterns consisting of a single letter from the alphabet are first reached. These pgf’s can be evaluated using well-known analytical results if the underlying model is a discrete- or continuous-time finite-state semi-Markov process. In terms of these basic characteristics, simple recurrence relations are provided; these lead to exact evaluation of the relevant pgf’s for any pattern. The results on single patterns, as provided in Sections 16.2 and 16.3, lead to an easy solution for compound patterns, which consist of a small to moderate number of distinct single patterns. This is discussed in Subsection 16.4.1. The distribution of the count, and more generally the weighted count, of a compound pattern within a finite time horizon is discussed in Subsection 16.4.2. A neat explicit expression is derived for this distribution in terms of the aforementioned waiting time distributions. The result in Subsection 16.4.2 has not appeared in the literature before. Structured motifs are covered in Subsection 16.4.3. It is shown that results on compound patterns, consisting of only two single patterns, are enough to derive exact distribution results on structured motifs.
16.2 Patterns: Discrete-Time Models
In this section we explain how to derive a closed-form expression for the pgf of the waiting time to reach a pattern (word) starting from either a given letter or an already-achieved portion of the pattern. The strings of letters are generated by a finite-state discrete-time Markov chain whose state space and states are also called alphabet and letters, respectively.
Let \(\{X(n)\}_{n\geq 0}\) be an ergodic finite-state Markov chain with discrete-time parameter, state space \(\{1,2, \ldots ,N\},\) and one-step transition probabilities p i, j , \(i,j = 1,2, \ldots ,N.\) Denote by g i, j (t) the pgf of the waiting time, τ i, j , to reach state j from state i, that is \({g}_{i,j}(t) = E({t}^{{\tau }_{i,j}}),\) and \(\tau_{i,j} = \inf\{n \geq 0 : X(n) = j\},\) given \(X(0) = i.\)
We assume τ i, i = 0, and therefore g i, i (t) = 1, for each i. The first return time to state i is denoted by \(\tilde{\tau}_{i,i},\) that is, \(\tilde{\tau}_{i,i} = \inf\{n \geq 1 : X(n) = i\},\) given \(X(0) = i,\) and its pgf is denoted by \(\tilde{g}_{i,i}(t).\)
The pattern of interest is \({\mathbf{w}}_{k} = {w}_{1}{w}_{2} \ldots {w}_{k},\) where \(1 \leq {w}_{i} \leq N,\;i = 1,2, \ldots ,k.\) For j < k, the subpattern w j is also called a prefix of w k . For each \(j,\ j = 2,3, \ldots ,k - 1,\) and r < j, and each \(n,\ n = 1,2, \ldots ,N,\) denote by I r, j, n the indicator function which is equal to one if and only if none of the strings \({w}_{i}{w}_{i+1} \ldots {w}_{j}n\) for \(i = 2,3, \ldots ,r\) is a prefix of w k but \({w}_{r+1}{w}_{r+2} \ldots {w}_{j}n\) is. Also, the indicator function I j, j, n is equal to one if and only if none of the strings \({w}_{i}{w}_{i+1} \ldots {w}_{j}n\) for \(i = 2,3, \ldots ,j\) is a prefix of w k .
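The indicator functions I r, j, n and the prefix–suffix overlaps used later for intersite distances are purely combinatorial and can be computed mechanically. The following sketch illustrates both (Python, with hypothetical helper names and 0-indexed strings, so that \({w}_{i} \ldots {w}_{j}\) corresponds to `w[i-1:j]`); it is an illustration of the definitions above, not code from the chapter.

```python
def longest_border(w):
    """Largest r < len(w) such that the prefix w_1...w_r is also a suffix of w
    (the 'j' used for intersite distances later in this section)."""
    k = len(w)
    for r in range(k - 1, 0, -1):
        if w[:r] == w[k - r:]:
            return r
    return 0

def indicator(w, r, j, n):
    """I_{r,j,n} from the text: 1 iff none of w_i...w_j n (i = 2..r) is a
    prefix of w but w_{r+1}...w_j n is; for r == j only the first condition."""
    fails = any(w.startswith(w[i - 1:j] + n) for i in range(2, r + 1))
    if fails:
        return 0
    if r == j:                      # I_{j,j,n}: no shifted extension is a prefix
        return 1
    return 1 if w.startswith(w[r:j] + n) else 0
```

For w = ABAB, for instance, `longest_border` returns 2 (the border AB), and I 2, 3, B = 1 since w 2 w 3 B = BAB is not a prefix of ABAB while w 3 B = AB is.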
Denote by \(G_j^{(s)}(t)\) (\(\tilde{G}_j^{(s)}(t)\)), \(j = 1,2, \ldots ,k,\) the pgf of the waiting time to reach the pattern w j from state s, allowing (not allowing) the initial state s to contribute to the pattern. Also, denote by \({G}_{j}^{({\mathbf{w}}_{r})}(t),1 \leq r \leq j,\) the pgf of the waiting time to reach the pattern w j , given that the pattern w r has already been reached (note that \({G}_{j}^{({\mathbf{w}}_{j})}(t) = 1\)). The following theorem provides a simple route for evaluating these pgf’s knowing the pgf’s, g i, j (t), of the transition times between the states of the original Markov chain X(n). The expressions for the pgf’s g i, j (t) are easily recoverable from well-known analytical results [see Theorem 2.19 on page 81 of Kijima (1997)], for any given finite-state Markov chain with not too large a state space.
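As a numerical complement to the analytical route via Kijima (1997), the pgf’s g i, j (t) satisfy the standard first-step recursion \(g_{i,j}(t) = t\,\bigl(p_{i,j} + \sum_{l\neq j} p_{i,l}\, g_{l,j}(t)\bigr)\), which, for any fixed t, can be solved by fixed-point iteration. A minimal sketch (hypothetical helper `hitting_pgf`; not the chapter’s method):

```python
def hitting_pgf(P, j, t, iters=2000):
    """Solve g_{i,j}(t) = t*(p_{i,j} + sum_{l != j} p_{i,l} g_{l,j}(t)) for
    all i by fixed-point iteration.  (For i = j this recursion returns the
    first-return pgf rather than the convention g_{j,j}(t) = 1.)"""
    N = len(P)
    g = [0.0] * N
    for _ in range(iters):
        g = [t * (P[i][j] + sum(P[i][l] * g[l] for l in range(N) if l != j))
             for i in range(N)]
    return g

# Two-state example: the mean hitting time of state 2 from state 1 is 1/p_{1,2}.
P = [[0.7, 0.3], [0.4, 0.6]]
h = 1e-6
mean_12 = (hitting_pgf(P, 1, 1.0)[0] - hitting_pgf(P, 1, 1.0 - h)[0]) / h
# mean_12 approximates g'_{1,2}(1) = 1/0.3
```

The numerical derivative at t = 1 recovers the mean waiting time; higher moments follow similarly.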
Theorem 16.2.1.
Let the pattern of interest be w k. The following recurrence relations hold for each \(j,\ j = 1,2, \ldots ,k - 1,\) and each \(r,\ r = 1,2, \ldots ,j\) (with the convention \(\sum \limits_{i=1}^{0} = 0\)) :
where
and the g i,j (t) and the indicator functions I i,j,n are as above.
The pgf of the intersite distance between consecutive occurrences of the pattern w k is given by \({G}_{k}^{({\mathbf{w}}_{j})}(t),\) where j is the largest integer such that w j is a proper prefix as well as a suffix of the pattern w k . Also, the pgf of the waiting time until the r-th occurrence of the pattern w k , given the initial state i, is equal to \({G}_{k}^{(i)}(t){\left ({G}_{k}^{({\mathbf{w}}_{j})}(t)\right )}^{r-1},\) where j has the same property as above.
The proof of Theorem 16.2.1 is based on the following simple idea. Let \({\tau }_{{\mathbf{w}}_{j}\vert \bar{{\mathbf{w}}}_{j+1}}\) be the waiting time for the first return (strictly positive) from pattern w j to itself given that the pattern w j + 1 is not achieved. Of course, the pattern w j + 1 is not achieved if the first state visited is not state w j + 1. Therefore, the pgf of \({\tau }_{{\mathbf{w}}_{j}\vert \bar{{\mathbf{w}}}_{j+1}}\) is equal to
Then, the waiting time to reach pattern w j + 1 starting from state s is equal to one plus a geometric sum of independent random variables, \({Y }_{1},{Y }_{2}, \ldots ,\) say, such that Y 1 has the distribution of the waiting time to reach subpattern w j from state s and the remaining Y n have the distribution of \({\tau }_{{\mathbf{w}}_{j}\vert \bar{{\mathbf{w}}}_{j+1}}.\) This implies that
A detailed proof of Theorem 16.2.1 is found in Stefanov (2003).
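The geometric-sum construction above can be checked by simulation in the simplest setting: i.i.d. fair-coin letters and the pattern ‘11’, whose mean waiting time is classically 6. A Monte Carlo sketch (Python; `wait_for_pattern` is a hypothetical helper, not from the chapter):

```python
import random

def wait_for_pattern(pattern, rng, p=0.5):
    """Generate i.i.d. letters '1' (prob p) / '0' until `pattern` first
    occurs; return the number of letters generated."""
    s, n = "", 0
    while not s.endswith(pattern):
        s = (s + ("1" if rng.random() < p else "0"))[-len(pattern):]
        n += 1
    return n

rng = random.Random(42)
reps = 100_000
est = sum(wait_for_pattern("11", rng) for _ in range(reps)) / reps
# the classical mean waiting time for '11' with a fair coin is 6
```

The estimate agrees with the value obtained by differentiating the pgf of the waiting time at t = 1.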
16.3 Patterns: General Discrete-Time and Continuous-Time Models
In this section, extensions of the result from the preceding section are presented. Finite-state semi-Markov processes, with either discrete- or continuous-time parameters, are the underlying models for generating the strings. Also, joint distributions of the waiting time to reach a pattern, together with the associated counts of occurrences of each letter, are of interest.
16.3.1 Waiting times
The notation from the preceding section is further used here for identifying the counterparts of similar quantities. For example, g i, j (t) will again denote the pgf of the waiting time to reach state j from state i in the more general discrete- or continuous-time model considered here.
Let \(\{X(u)\}_{u\geq 0}\) (the time parameter u may be either discrete or continuous) be a semi-Markov process whose associated embedded discrete-time Markov chain has a finite state space \(\{1,2, \ldots ,N\}\) and one-step transition probabilities p i, j , \(i,j = 1,2, \ldots ,N.\) For a formal definition of a semi-Markov process see Çinlar (1975). Denote by ϕ i, j (t) the pgf of the holding (sojourn) time in state i, given that the next state to be visited is state j (if the holding time distributions are discrete, then the time parameter is discrete). We denote by g i, j (t) the pgf of the waiting time, τ i, j , to reach state j from state i; that is, \({g}_{i,j}(t) = E({t}^{{\tau }_{i,j}}),\) where \(\tau_{i,j} = \inf\{u \geq 0 : X(u) = j\},\) given \(X(0) = i.\)
We assume τ i, i = 0, and therefore g i, i (t) = 1, for each i. The first return time to state i is denoted by \(\tilde{\tau}_{i,i}\) and its pgf by \(\tilde{g}_{i,i}(t).\) Of course, if X(u) is a discrete-time Markov chain,
and if X(u) is a continuous-time Markov chain,
If X(u) is a general semi-Markov process, then \(\tilde{{\tau }}_{i,i}\) is understood to be the waiting time to reach state i from itself given that at least one transition has been made in the associated embedded discrete-time Markov chain. This clarifies the interpretation of \(\tilde{\tau}_{i,i}\) in case one-step transitions are allowed from a state to itself in the embedded discrete-time Markov chain.
Again, as in the preceding section, the pattern of interest is denoted by w k . Denote by \(G_j^{(s)}(t)\) (\(\tilde{G}_j^{(s)}(t)\)), \(j = 1,2, \ldots ,k,\) the pgf of the waiting time to reach the pattern w j from state s, allowing (not allowing) the initial state s to contribute to the pattern. Also denote by \({G}_{j}^{({\mathbf{w}}_{r})}(t),1 \leq r \leq j,\) the pgf of the waiting time to reach the pattern w j , given that the pattern w r has already been reached (note that \({G}_{j}^{({\mathbf{w}}_{j})}(t) = 1\)). The following theorem provides a simple route for evaluating these pgf’s in terms of the following characteristics of the original semi-Markov process X(u): the pgf’s, g i, j (t), of the transition times between the states, the pgf’s, ϕ i, j (t), of the holding time distributions, and the transition probabilities, p i, j , of the embedded discrete-time Markov chain.
Theorem 16.3.1.
Let the pattern of interest be w k. The following recurrence relations hold for each \(j,\ j = 1,2, \ldots ,k - 1,\) and each \(r,\ r = 1,2, \ldots ,j\) (with the convention \(\sum \limits_{i=1}^{0} = 0\)) :
where
The proof is based on the same idea as that used to prove Theorem 16.2.1. Similarly to the preceding section, denote by \({\tau }_{{\mathbf{w}}_{j}\vert \bar{{\mathbf{w}}}_{j+1}}\) the waiting time to reach w j from itself given that the pattern w j + 1 is not achieved. Then one may notice that the waiting time to reach pattern w j + 1 starting from state s is equal to the sum of two independent random variables, where the first has a pgf which equals \({\phi }_{{w}_{j},{w}_{j+1}}(t)\) and the second one is a geometric sum of independent random variables, \({Y }_{1},{Y }_{2}, \ldots ,\) say, such that Y 1 has the distribution of the waiting time to reach subpattern w j from state s and the remaining Y n have the distribution of \({\tau }_{{\mathbf{w}}_{j}\vert \bar{{\mathbf{w}}}_{j+1}}.\)
16.3.2 Joint generating functions associated with waiting times
In this subsection we consider the same general semi-Markov model X(u) that has been introduced in the preceding subsection. Recall that its embedded discrete-time Markov chain has N states. Throughout this subsection these states will be called ‘symbols’. Again the notation from the preceding subsections is further used in this subsection for identifying the counterparts of similar quantities (such as \(G_j^{(s)}(\cdot)\), etc.). Note that basic quantities of the underlying model, such as τ i, j and ϕ i, j , have the same meaning as that in the preceding subsection.
Let \(C_i(u)\) be the count of occurrences of symbol i up to time u, and let \(g_{i,j}(\underline{\mathbf{t}}),\) where \(\underline{\mathbf{t}} = ({t}_{0},{t}_{1}, \ldots ,{t}_{N}),\) be the joint pgf of \(({\tau }_{i,j},{C}_{1}({\tau }_{i,j}), \ldots ,{C}_{N}({\tau }_{i,j})),\) where the τ i, j have been introduced in the preceding subsection. Likewise, let \(\tilde{g}_{i,i}(\underline{\mathbf{t}})\) be the joint pgf of \((\tilde{{\tau }}_{i,i},{C}_{1}(\tilde{{\tau }}_{i,i}), \ldots ,{C}_{N}(\tilde{{\tau }}_{i,i})),\) where again the \(\tilde{\tau}_{i,i}\) have been introduced in the preceding subsection. Note that \(g_{i,i}(\underline{\mathbf{t}}) = 1.\) Denote by \(\nu_j^{(s)}\) the waiting time to reach the pattern w j from state s. Let \(G_j^{(s)}(\underline{\mathbf{t}})\) \((\tilde{{G}}_{j}^{(s)}(\underline{\mathbf{t}}))\) be the joint pgf of \({\nu }_{j}^{(s)},{C}_{1}({\nu }_{j}^{(s)}), \ldots ,{C}_{N}({\nu }_{j}^{(s)}),\) allowing (not allowing) the first symbol to contribute to the pattern. Further, let \(\nu_j^{({\mathbf{w}}_{r})}\) be the waiting time to reach the pattern w j from the already-reached prefix w r , and let \(G_j^{({\mathbf{w}}_{r})}(\underline{\mathbf{t}})\) be the joint pgf of \({\nu }_{j}^{({\mathbf{w}}_{r})},{C}_{1}({\nu }_{j}^{({\mathbf{w}}_{r})}), \ldots ,{C}_{N}({\nu }_{j}^{({\mathbf{w}}_{r})}).\) Note that the methodology introduced in Stefanov (2000; see Section 3) yields explicit expressions for the pgf’s \(g_{i,j}(\underline{\mathbf{t}})\) associated with any given semi-Markov process whose embedded discrete-time Markov chain has a relatively small number of states. Therefore, the recurrence relations in the following theorem provide a simple route for explicit evaluation of the joint pgf’s of the waiting time to reach, or the intersite distance between two consecutive occurrences of, a pattern and the associated counts of occurrences of the corresponding symbols (letters).
Theorem 16.3.2.
Let the pattern of interest be w k. The following recurrence relations hold for each \(j,\ j = 1,2, \ldots ,k - 1,\) and each \(r,\ r = 1,2, \ldots ,j\) :
where
The proof of this theorem is found in Stefanov (2003).
16.4 Compound Patterns
Throughout this section we assume that the strings are generated by discrete-time Markov chains.
16.4.1 Compound patterns containing a small number of single patterns
Denote by W a compound pattern which consists of k distinct single patterns, \({\mathbf{w}}^{(1)},{\mathbf{w}}^{(2)}, \ldots ,{\mathbf{w}}^{(k)}.\) The latter may have different lengths, and it is assumed that none of them is a proper substring of any of the others. Let a be an arbitrary pattern; in particular, if a has length 1, that is, it is equal to a particular letter, s say, then we will denote a by s. Introduce the following quantities.
T a, W — the waiting time, starting from pattern a, to reach for the first time the compound pattern W; if a equals one of the w (i), then this waiting time is assumed to be greater than 0;
\({T}_{\mathbf{a},\mathbf{W}\vert {\mathbf{w}}^{(j)}}\) — the waiting time, starting from pattern a, to reach for the first time the compound pattern W, given that W is reached via w (j);
T a, b — the waiting time to reach pattern b starting from pattern a;
X i, j — the interarrival time between two consecutive occurrences of pattern W, given that the starting pattern is w (i) and the reached pattern is w (j);
r i, j — the probability that the first reached pattern from W is w (j), given that the starting pattern is w (i).
Of course, \({X}_{i,j} = {T}_{{\mathbf{w}}^{(i)},\mathbf{W}\vert {\mathbf{w}}^{(j)}}.\) Introduce the following pgf’s:
and recall that by G Y (t) we denote the pgf of a random variable Y. Clearly,
Also, it is easy to see that
Therefore, both the r i, j and the pgf’s \({G}_{{X}_{i,j}}(t)\) can be recovered from the pgf’s \({G}_{{\mathbf{w}}^{(i)},\mathbf{W},j}(t).\) The following theorem [see Chryssaphinou and Papastavridis (1990) and Gerber and Li (1981)] provides, for each pattern a, a system of linear equations from which one can recover the pgf’s G a, W, j (t) and \({G}_{{T}_{\mathbf{a},\mathbf{W}}}(t)\) in terms of the pgf’s \({G}_{{T}_{{\mathbf{ w}}^{(i)},{\mathbf{w}}^{(j)}}}(t).\) The \({G}_{{T}_{{\mathbf{ w}}^{(i)},{\mathbf{w}}^{(j)}}}(t)\) are derived from the results in Section 16.2.
Theorem 16.4.1.
The following identities hold:
In particular, we get the following explicit expressions for the \({G}_{{\mathbf{w}}^{(i)},\mathbf{W},j}(t)\) in terms of the \({G}_{{T}_{{\mathbf{ w}}^{(i)},{\mathbf{w}}^{(j)}}}(t)\) if the compound pattern W = { w (1), w (2)} consists of two patterns. For brevity, \({G}_{{T}_{i,j}}\) below stands for \({G}_{{T}_{{\mathbf{ w}}^{(i)},{\mathbf{w}}^{(j)}}}(t).\)
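For a compound pattern with two single patterns, the analogous cold-start probabilities (which pattern from W is reached first from an empty string) can be checked by simulation. For i.i.d. fair-coin letters and W = {HHT, THH}, Conway’s leading numbers give odds 3:1 that THH is reached first, i.e., probability 3/4. A Monte Carlo sketch (hypothetical helper; not code from the chapter):

```python
import random

def first_to_occur(patterns, rng):
    """Generate i.i.d. fair-coin letters until one pattern in `patterns`
    occurs; return the winner.  Assumes no pattern is a substring of another."""
    s, m = "", max(len(p) for p in patterns)
    while True:
        s = (s + rng.choice("HT"))[-m:]
        for p in patterns:
            if s.endswith(p):
                return p

rng = random.Random(0)
reps = 100_000
frac = sum(first_to_occur(["HHT", "THH"], rng) == "THH" for _ in range(reps)) / reps
# Conway's leading numbers give P(THH before HHT) = 3/4 for a fair coin
```

The exact value 3/4 follows from (L_AA − L_AB)/(L_BB − L_BA) with A = HHT, B = THH, where the L’s are Conway’s leading numbers mentioned in the Introduction.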
16.4.2 Weighted counts of compound patterns
A quantity of interest is the count of occurrences of a compound pattern, W say (as introduced in Subsection 16.4.1), within a finite time horizon. A more general quantity is the weighted count of pattern occurrences which attaches a weight, h i say, to each occurrence of a single pattern, w (i), from W. More specifically, introduce
where \({N}_{{\mathbf{w}}^{(i)}}(t)\) is the count of occurrences of pattern w (i) within a time interval of length t. Recall the meaning of the r i, j , X i, j , and \({T}_{s,\mathbf{W}\vert {\mathbf{w}}^{(j)}}\) which are introduced in Subsection 16.4.1. Of course, the occurrence of W can be modelled by a k-state semi-Markov process, where an entry to state i identifies an occurrence of pattern w (i). The one-step transition probabilities of the embedded discrete-time Markov chain of this semi-Markov process are the r i, j . The holding time at state i, given that the next state to be visited is state j, is identified by the random variable X i, j . For each initial letter, s say, we augment this semi-Markov process with one initial state, 0 say, and relevant one-step transition probabilities and holding times as follows (we denote the probability to move from state 0 to state j by r 0, j ):
and the holding time at state 0, given that the next state to be visited is state j, is identified by \({T}_{s,\mathbf{W}\vert {\mathbf{w}}^{(j)}},\) where the latter and G s, W, j (t) are introduced in Subsection 16.4.1. Now consider the semi-Markov process, Y t say, derived from that above as follows. The state space has \((k + 1)^2\) states, identified by the pairs \((i,j),\ i,j = 0,1, \ldots ,k.\) The process Y t enters state (i, j) if pattern w (i) is reached, given that the next occurrence of W is via pattern w (j). The initial states are the states (0, j) for \(j = 1,2, \ldots ,k,\) and the initial probabilities are the r 0, j . Clearly, the holding time distributions for this new semi-Markov process do not depend on the next state visited. Also, the holding time in state (i, j) is identified by the random variable X i, j , and that in state (0, j) by \({T}_{s,\mathbf{W}\vert {\mathbf{w}}^{(j)}}.\) Then the weighted count H W (t), introduced above, is equal to
where N (i, j)(t) counts the number of visits of Y t to state (i, j) within a time interval of length t. Denote by \({\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})}\) the first passage time of Y t from state (i 1, j 1) to state (i 2, j 2) and by \({L}_{{H}_{\mathbf{W}}}^{{\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})} }({s}_{1},{s}_{2})\) the joint Laplace transform of the random variables \({\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})}\) and \({H}_{\mathbf{W}}\left ({\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})}\right )\), that is,
Closed-form expressions for the \({L}_{{H}_{\mathbf{W}}}^{{\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})} }({s}_{1},{s}_{2})\) are derivable in terms of the r i, j and the Laplace transforms of the X i, j and the \({T}_{s,\mathbf{W}\vert {\mathbf{w}}^{(j)}}\), as explained in Stefanov (2006) for general reward functions on semi-Markov processes. Let
The following theorem follows from a general result on reward functions for semi-Markov processes [see Theorem 2.1 in Stefanov (2006)]. It provides an explicit, closed-form expression for the Laplace transform, \({L}_{t,{H}_{\mathbf{W}}}^{(s)}({s}_{1},{s}_{2}),\) of the weighted count of W occurrences within a time interval of length t, in terms of the r i, j , the Laplace transforms, \(\mathcal{L}[{X}_{i,j}](\cdot ),\) of the interarrival times X i, j of the compound pattern W, and the Laplace transforms, \(\mathcal{L}[{T}_{s,\mathbf{W}\vert {\mathbf{w}}^{(j)}}](\cdot ),\) of the waiting time to reach W from an initial letter s, for \(s = 1,2, \ldots ,N.\)
Theorem 16.4.2.
The following identity holds for the Laplace transform \({L}_{t,{H}_{\mathbf{W}}}^{(s)}\!\! :\)
where the joint Laplace transforms \({L}_{{H}_{\mathbf{W}}}^{{\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})} }({s}_{1},{s}_{2})\) have been introduced above.
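As an elementary check on the weighted count H W (t), its expectation in the simplest i.i.d. setting is available directly: for fair-coin letters, each of the n − 1 positions in a string of length n carries an occurrence of HH (or of TT) with probability 1/4, so with weights h HH = 1, h TT = 2 and n = 20 one gets E H W = 3 · 19/4 = 14.25. A Monte Carlo sketch (hypothetical helper; counts are of overlapping occurrences; not code from the chapter):

```python
import random

def weighted_count(text, weights):
    """H_W for string `text`: each (overlapping) occurrence of pattern p
    contributes weight weights[p]."""
    return sum(h * sum(text[i:i + len(p)] == p
                       for i in range(len(text) - len(p) + 1))
               for p, h in weights.items())

rng = random.Random(1)
n, reps = 20, 50_000
weights = {"HH": 1, "TT": 2}
mean = sum(weighted_count("".join(rng.choice("HT") for _ in range(n)), weights)
           for _ in range(reps)) / reps
# theoretical mean: (1 + 2) * (n - 1) / 4 = 14.25
```

Theorem 16.4.2 refines this first-moment check to the full distribution via the Laplace transform above.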
16.4.3 Structured motifs
Structured motifs are special compound patterns, usually containing a huge number of single patterns. In this subsection we consider both the waiting time until the first occurrence, and the intersite distance between consecutive occurrences, of a structured motif. The interest in these waiting times is due to the biological challenge of identifying promoter motifs along genomes. A structured motif is composed of several patterns separated by a variable distance. If the number of patterns is n, then the structured motif is said to have n boxes. The formal definition of a structured motif with 2 boxes follows. Let w (1) and w (2) be two patterns of length k 1 and k 2, respectively. The alphabet size equals N, and the strings are generated by the Markov chain introduced in Section 16.2. A structured motif m formed by the patterns w (1) and w (2), and denoted by \(\mathbf{m} ={ \mathbf{w}}^{(1)}({d}_{1} : {d}_{2}){\mathbf{w}}^{(2)},\) is a string with the following property. Pattern w (1) is a prefix and pattern w (2) is a suffix of the string, and the number of letters between the two patterns is not smaller than d 1 and not greater than d 2. Also, it is assumed that patterns w (1) and w (2) appear only once in the string. The pgf’s of both the waiting time, τ m (s), to reach for the first time the structured motif m from state s, and the intersite distance, \({\tau }_{\mathbf{m}}^{(intersite)},\) between two consecutive occurrences of m, are of interest.
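As an illustration of the definition, a naive scan locating the first occurrence of a 2-box structured motif \(\mathbf{m} ={ \mathbf{w}}^{(1)}({d}_{1} : {d}_{2}){\mathbf{w}}^{(2)}\) in a given string can be sketched as follows (Python; hypothetical helper, and for simplicity the uniqueness condition on w (1), w (2) inside the motif is not enforced):

```python
def find_structured_motif(text, w1, w2, d1, d2):
    """Return the end position (exclusive) of the earliest-ending occurrence
    of the structured motif w1(d1:d2)w2 in text, or None if there is none."""
    k1, k2 = len(w1), len(w2)
    best = None
    for i in range(len(text) - k1 + 1):
        if text[i:i + k1] != w1:
            continue
        for g in range(d1, d2 + 1):       # g = gap length between the boxes
            j = i + k1 + g
            if j + k2 <= len(text) and text[j:j + k2] == w2:
                end = j + k2
                if best is None or end < best:
                    best = end
    return best
```

For example, in the string XXABXYZCDXX the motif AB(2:4)CD occurs with gap XYZ (length 3) and ends at position 9, while AB(2:3)CD does not occur in ABXCD because the gap there has length 1.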
Let \(\mathbf{W} =\{{ \mathbf{w}}^{(1)},{\mathbf{w}}^{(2)}\}\) be a compound pattern consisting of two patterns. For brevity, denote by T i, j , i, j ∈ { 1, 2}, the waiting time to reach pattern w (j) from pattern w (i), and by T j (s) the waiting time to reach pattern w (j) from state s. The quantities r i, j and X i, j , i, j ∈ { 1, 2}, are introduced in Subsection 16.4.1. Let
In order to reach the structured motif m, we need to reach first the pattern w (1) and, from this occurrence of w (1), to reach the pattern w (2) such that \({d}_{1} + {k}_{2} \leq {X}_{1,2} \leq {d}_{2} + {k}_{2}\). Introduce the following random variables:
F 12 corresponds to an occurrence of w (2) that fails to achieve the structured motif, whereas for S 12, w (2) achieves the structured motif. One may notice that the pgf’s of F 12 and S 12 are given by
where q S is the probability of ‘success’ (w (2) achieves the structured motif), i.e., the probability that \({d}_{1} + {k}_{2} \leq {X}_{1,2} \leq {d}_{2} + {k}_{2}\). Namely, we have
The following theorem provides explicit and calculable expressions for the pgf’s of both the waiting time to reach for the first time the structured motif \(\mathbf{m} ={ \mathbf{w}}^{(1)}({d}_{1} : {d}_{2}){\mathbf{w}}^{(2)}\) from state s, and the intersite distance between two consecutive occurrences of m.
Theorem 16.4.3.
The pgf, \({G}_{\mathbf{m}}^{(s)}(t),\) of the waiting time to reach for the first time a structured motif m starting from state s, and the pgf, \({G}_{\mathbf{m}}^{(intersite)}(t),\) of the intersite distance between two consecutive occurrences of m, admit the following explicit expressions:
where \({G}_{{F}_{12}}(t),{G}_{{S}_{12}}(t),\) and q S are given above.
The proof of this theorem is found in Stefanov, Robin, and Schbath (2007). Note that, in view of this theorem, the availability of the pgf’s \({G}_{{X}_{i,j}}(t),\;i,j = 1,2,\) is enough to calculate explicit, closed-form expressions for \({G}_{\mathbf{m}}^{(s)}(t)\) and \({G}_{\mathbf{m}}^{(intersite)}(t).\) Explicit expressions for the \({G}_{{X}_{i,j}}(t),\) in terms of the \({G}_{{T}_{{\mathbf{ w}}^{(i)},{\mathbf{w}}^{(j)}}}(t),\) are derived from the identities at the end of Subsection 16.4.1. Also, recall that the \({G}_{{T}_{{\mathbf{ w}}^{(i)},{\mathbf{w}}^{(j)}}}(t)\) are calculated from Theorem 16.2.1 in Section 16.2.
Neat closed-form expressions for the relevant pgf’s associated with structured motifs with n boxes are found in Stefanov, Robin, and Schbath (2008).
References
Antzoulakos, D. L. (2001). Waiting times for patterns in a sequence of multistate trials. Journal of Applied Probability, 38, 508–518.
Balakrishnan, N. and Koutras, M. (2002). Runs and Scans with Applications. Wiley, New York.
Biggins, J. D. (1987). A note on repeated sequences in Markov chains. Advances in Applied Probability, 19, 739–742.
Biggins, J. D. and Cannings, C. (1987). Markov renewal processes, counters and repeated sequences in Markov chains. Advances in Applied Probability, 19, 521–545.
Blom, G. and Thorburn, D. (1982). How many random digits are required until given sequences are obtained? Journal of Applied Probability, 19, 518–531.
Chadjiconstantinidis, S., Antzoulakos, D. L. and Koutras, M. V. (2000). Joint distributions of successes, failures and patterns in enumeration problems. Advances in Applied Probability, 32, 866–884.
Chryssaphinou, O. and Papastavridis, S. (1990). The occurrence of a sequence of patterns in repeated dependent experiments. Theory of Probability and Its Applications, 35, 167–173.
Çinlar, E. (1975). Introduction to Stochastic Processes. Prentice-Hall, Englewood Cliffs, NJ.
Feller, W. (1950). An Introduction to Probability Theory and Its Applications, Vol. 1. Wiley, New York.
Fu, J. C. (1996). Distribution theory of runs and patterns associated with a sequence of multistate trials. Statistica Sinica, 6, 957–974.
Fu, J. C. and Chang, Y. M. (2002). On probability generating functions for waiting time distributions of compound patterns in a sequence of multistate trials. Journal of Applied Probability, 39, 70–80.
Fu, J. C. and Lou, W. Y. W. (2003). Distribution Theory of Runs and Patterns and Its Applications. World Scientific, Hackensack, NJ.
Gerber, H. and Li, S-Y. R. (1981). The occurrence of sequence patterns in repeated experiments and hitting times in a Markov chain. Stochastic Processes and Their Applications, 11, 101–108.
Glaz, J., Kulldorff, M., Pozdnyakov, V. and Steele, J. M. (2006). Gambling teams and waiting times for patterns in two-state Markov chains. Journal of Applied Probability, 43, 127–140.
Guibas, L. J. and Odlyzko, A. M. (1981). String overlaps, pattern matching, and nontransitive games. Journal of Combinatorial Theory. Series A, 30, 183–208.
Han, Q. and Hirano, K. (2003). Sooner and later waiting time problems for patterns in Markov dependent trials. Journal of Applied Probability, 40, 73–86.
Inoue, K. and Aki, S. (2007). On generating functions of waiting times and numbers of occurrences of compound patterns in a sequence of multistate trials. Journal of Applied Probability, 44, 71–81.
Kijima, M. (1997). Markov Processes for Stochastic Modeling. Chapman & Hall, London.
Li, S-Y. R. (1980). A martingale approach to the study of occurrence of sequence patterns in repeated experiments. Annals of Probability, 8, 1171–1176.
Nicodème, P., Salvy, B. and Flajolet, P. (2002). Motif statistics. Theoretical Computer Science, 287, 593–617.
Nuel, G. (2008). Pattern Markov chains: optimal Markov chain embedding through deterministic finite automata. Journal of Applied Probability, 45, 226–243.
Pozdnyakov, V. (2008). A note on occurrence of gapped patterns in i.i.d. sequences. Discrete Applied Mathematics, 156, 93–102.
Pozdnyakov, V., Glaz, J., Kulldorff, M. and Steele, J. M. (2005). A martingale approach to scan statistics. Annals of the Institute of Statistical Mathematics, 57, 21–37.
Reinert, G., Schbath, S. and Waterman, M. (2000). Probabilistic and statistical properties of words: an overview. Journal of Computational Biology, 7, 1–46.
Robin, S. and Daudin, J. (1999). Exact distribution of word occurrences in a random sequence of letters. Journal of Applied Probability, 36, 179–193.
Robin, S. and Daudin, J. (2001). Exact distribution of the distances between any occurrences of a set of words. Annals of the Institute of Statistical Mathematics, 36, 895–905.
Robin, S., Daudin, J., Richard, H., Sagot, M.-F. and Schbath, S. (2002). Occurrence probability of structured motifs in random sequences. Journal of Computational Biology, 9, 761–773.
Rukhin, A. (2002). Distribution of the number of words with a prescribed frequency and tests of randomness. Advances in Applied Probability, 34, 775–797.
Rukhin, A. (2006). Correlation matrices of chains for Markov sequences, and testing for randomness. (Russian) Teoriya Veroyatnostei i ee Primeneniya, 51, 712–731.
Stefanov, V. T. (2000). On some waiting time problems. Journal of Applied Probability, 37, 756–764.
Stefanov, V. T. (2003). The intersite distances between pattern occurrences in strings generated by general discrete- and continuous-time models: an algorithmic approach. Journal of Applied Probability, 40, 881–892.
Stefanov, V. T. (2006). Exact distributions for reward functions on semi-Markov and Markov additive processes. Journal of Applied Probability, 43, 1053–1065.
Stefanov, V. T. and Pakes, A. G. (1997). Explicit distributional results in pattern formation. Annals of Applied Probability, 7, 666–678.
Stefanov, V. T., Robin, S. and Schbath, S. (2007). Waiting times for clumps of patterns and for structured motifs in random sequences. Discrete Applied Mathematics, 155, 868–880.
Stefanov, V. T., Robin, S. and Schbath, S. (2008). Occurrence of structured motifs in random sequences: arbitrary number of boxes. (in preparation).
Szpankowski, W. (2001). Average Case Analysis of Algorithms on Sequences. John Wiley & Sons, New York.
© 2009 Birkhäuser Boston, a part of Springer Science+Business Media, LLC
Stefanov, V. (2009). Occurrence of Patterns and Motifs in Random Strings. In: Glaz, J., Pozdnyakov, V., Wallenstein, S. (eds) Scan Statistics. Statistics for Industry and Technology. Birkhäuser Boston. https://doi.org/10.1007/978-0-8176-4749-0_16
Print ISBN: 978-0-8176-4748-3
Online ISBN: 978-0-8176-4749-0