Abstract
Patterns and motifs on finite alphabets are of interest in many applied areas, such as computational molecular biology, computer science, communication theory, and reliability theory. The exact distribution theory associated with occurrences of patterns (single or compound) and motifs in random strings of letters is treated in this chapter. The strings are generated by a Markov source and, in the case of single patterns, by more general discrete-time or continuous-time models. Here, the interest is in finding closed-form expressions for the distributions of the following quantities: (i) the waiting time until the first occurrence of a pattern (motif), (ii) the intersite distances between consecutive occurrences of such patterns, and (iii) the count of occurrences of a pattern, or more generally, the weighted count of occurrences of a compound pattern, both within a finite time horizon. General exact distribution results are discussed. Also, a brief guide to the various methodological tools used in the area is provided in the Introduction.
16.1 Introduction
Patterns and motifs on finite alphabets are of interest in many applied areas, such as computational molecular biology, computer science, communication theory, and reliability theory. A word on an alphabet is called a single pattern, and a set of distinct single patterns (words) is called a compound pattern. The strings (texts) of letters can be generated either by independent and identically distributed multinomial trials, or by general discrete-time or continuous-time models (Markov chains or semi-Markov processes). The main interest, from a probabilistic/statistical point of view, is in finding practicable closed-form expressions for the distributions of the following quantities: the waiting time until the first occurrence of a pattern (single or compound) or motif, the intersite distances between consecutive occurrences of such, and the counts of occurrences of patterns or motifs within a finite time horizon. Motifs are special cases of compound patterns which usually contain a huge number of distinct single patterns.
The theory on pattern occurrence has attracted a variety of methodological tools. For example, the following methodologies have been widely used in the literature: combinatorial methods and classical probabilistic methods based on conditioning arguments, Markov chain embeddings, Markov renewal embeddings, exponential families, martingale techniques, and automata theory. The usefulness of these methodologies to the area is well illustrated in the sources that follow.
Runs are the simplest patterns. Feller (1950) showed how recurrent event theory can be used to solve problems about success runs. For a comprehensive account of the literature on runs see Balakrishnan and Koutras (2002). The key to handling complex patterns was provided by Conway’s leading numbers, which account for the overlapping structure of a pattern. Guibas and Odlyzko (1981) derived results applying elementary methods, and Chryssaphinou and Papastavridis (1990) extended them to more general models [see also Robin and Daudin (1999, 2001), Rukhin (2002, 2006), Han and Hirano (2003), and Inoue and Aki (2007)]. Li (1980) introduced martingale techniques to the area, and Gerber and Li (1981) combined the latter with a relevant Markov chain embedding. Martingale tools have also been used in Pozdnyakov et al. (2005), Glaz et al. (2006), and Pozdnyakov (2008).
Markov chain embeddings have been widely used in the area for treating problems on pattern occurrence; a few relevant sources are Fu (1996), Chadjiconstantinidis, Antzoulakos, and Koutras (2000), Antzoulakos (2001), Fu and Chang (2002), and Fu and Lou (2003). Blom and Thorburn (1982) made connections with Markov renewal theory, and this was systematically exploited by Biggins and Cannings (1987) and Biggins (1987). Stefanov and Pakes (1997) introduced exponential family methodology, combined with a minimal Markov chain embedding, and Stefanov (2000) extended it in combination with suitable Markov renewal embeddings to handle some special compound patterns (sets of runs).
Nicodème, Salvy, and Flajolet (2002) used automata theory comprehensively. Nuel (2008) combined automata theory with Markov chain embeddings and elaborated on a route which leads, for any given pattern(s), to a minimal embedding Markov chain. Reinert, Schbath, and Waterman (2000) provided a survey on some probabilistic tools used in the theory of patterns, and Szpankowski (2001) treated problems on pattern occurrence associated with average case analysis of string searching algorithms. The first exact distributional results on structured motifs are found in Stefanov, Robin, and Schbath (2007) [cf. also Robin et al. (2002), Nuel (2008), and Pozdnyakov (2008)].
In this chapter, results are discussed which provide explicit, closed-form solutions for the distributions of the aforementioned random quantities associated with the occurrence of patterns and structured motifs. These results are derived using predominantly simple probabilistic tools. Also, for a given alphabet, they require a preliminary (easy) evaluation of a few basic characteristics, and then each pattern case is covered in an automated way.
In Sections 16.2 and 16.3 we discuss single patterns. The strings are generated by discrete- or continuous-time semi-Markov processes. The exact distribution of the waiting time until the first occurrence of a pattern, given any (fixed) portion of it has been reached, is found. Also joint distributional results are discussed. The method relies on the knowledge of basic characteristics associated with the underlying model used to generate the strings. These basic characteristics are the probability generating functions (pgf’s) of the waiting times until another letter of the alphabet is reached. In other words, we need to know only the pgf’s of the waiting times until the simplest special patterns consisting of a single letter from the alphabet are first reached. These pgf’s can be evaluated using well-known analytical results if the underlying model is a discrete- or continuous-time finite-state semi-Markov process. In terms of these basic characteristics, simple recurrence relations are provided; these lead to exact evaluation of the relevant pgf’s for any pattern. The results on single patterns, as provided in Sections 16.2 and 16.3, lead to an easy solution for compound patterns, which consist of a small to moderate number of distinct single patterns. This is discussed in Subsection 16.4.1. The distribution of the count, and more generally the weighted count, of a compound pattern within a finite time horizon is discussed in Subsection 16.4.2. A neat explicit expression is derived for this distribution in terms of the aforementioned waiting time distributions. The result in Subsection 16.4.2 has not appeared in the literature before. Structured motifs are covered in Subsection 16.4.3. It is shown that results on compound patterns, consisting of only two single patterns, are enough to derive exact distribution results on structured motifs.
16.2 Patterns: Discrete-Time Models
In this section we explain how to derive a closed-form expression for the pgf of the waiting time to reach a pattern (word) starting from either a given letter or an already-achieved portion of the pattern. The strings of letters are generated by a finite-state discrete-time Markov chain whose state space and states are also called alphabet and letters, respectively.
Let \(\{X(n)\}_{n\geq 0}\) be an ergodic finite-state Markov chain with discrete-time parameter, state space \(\{1,2, \ldots ,N\},\) and one-step transition probabilities p i, j , \(i,j = 1,2, \ldots ,N.\) Denote by g i, j (t) the pgf of the waiting time, τ i, j , to reach state j from state i, that is \({g}_{i,j}(t) = E({t}^{{\tau }_{i,j}}),\) and \(\tau_{i,j} = \inf\{n \geq 0 : X(n) = j\},\) given \(X(0) = i.\)
We assume τ i, i = 0, and therefore g i, i (t) = 1, for each i. The first return time to state i is denoted by \(\tilde{\tau}_{i,i},\) that is, \(\tilde{\tau}_{i,i} = \inf\{n \geq 1 : X(n) = i\},\) given \(X(0) = i,\) and its pgf is denoted by \(\tilde{g}_{i,i}(t).\)
The pattern of interest is \({\mathbf{w}}_{k} = {w}_{1}{w}_{2} \ldots {w}_{k},\) where \(1 \leq {w}_{i} \leq N,\;i = 1,2, \ldots ,k.\) For j < k, the subpattern w j is also called a prefix of w k . For each \(j,\ j = 2,3, \ldots ,k - 1,\) and r < j, and each \(n,\ n = 1,2, \ldots ,N,\) denote by I r, j, n the indicator function which is equal to one if and only if none of the strings \({w}_{i}{w}_{i+1} \ldots {w}_{j}n\) for \(i = 2,3, \ldots ,r\) is a prefix of w k but \({w}_{r+1}{w}_{r+2} \ldots {w}_{j}n\) is. Also, the indicator function I j, j, n is equal to one if and only if none of the strings \({w}_{i}{w}_{i+1} \ldots {w}_{j}n\) for \(i = 2,3, \ldots ,j\) is a prefix of w k .
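The indicator functions I r, j, n and the prefix–suffix overlaps used later for intersite distances are purely combinatorial and can be computed mechanically. The following sketch illustrates both (Python, with hypothetical helper names and 0-indexed strings, so that \({w}_{i} \ldots {w}_{j}\) corresponds to `w[i-1:j]`); it is an illustration of the definitions above, not code from the chapter.

```python
def longest_border(w):
    """Largest r < len(w) such that the prefix w_1...w_r is also a suffix of w
    (the 'j' used for intersite distances later in this section)."""
    k = len(w)
    for r in range(k - 1, 0, -1):
        if w[:r] == w[k - r:]:
            return r
    return 0

def indicator(w, r, j, n):
    """I_{r,j,n} from the text: 1 iff none of w_i...w_j n (i = 2..r) is a
    prefix of w but w_{r+1}...w_j n is; for r == j only the first condition."""
    fails = any(w.startswith(w[i - 1:j] + n) for i in range(2, r + 1))
    if fails:
        return 0
    if r == j:                      # I_{j,j,n}: no shifted extension is a prefix
        return 1
    return 1 if w.startswith(w[r:j] + n) else 0
```

For w = ABAB, for instance, `longest_border` returns 2 (the border AB), and I 2, 3, B = 1 since w 2 w 3 B = BAB is not a prefix of ABAB while w 3 B = AB is.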
Denote by \(G_j^{(s)}(t)\) (\(\tilde{G}_j^{(s)}(t)\)), \(j = 1,2, \ldots ,k,\) the pgf of the waiting time to reach the pattern w j from state s, allowing (not allowing) the initial state s to contribute to the pattern. Also, denote by \({G}_{j}^{({\mathbf{w}}_{r})}(t),1 \leq r \leq j,\) the pgf of the waiting time to reach the pattern w j , given that the pattern w r has already been reached (note that \({G}_{j}^{({\mathbf{w}}_{j})}(t) = 1\)). The following theorem provides a simple route for evaluating these pgf’s knowing the pgf’s, g i, j (t), of the transition times between the states of the original Markov chain X(n). The expressions for the pgf’s g i, j (t) are easily recoverable from well-known analytical results [see Theorem 2.19 on page 81 of Kijima (1997)], for any given finite-state Markov chain with not too large a state space.
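As a numerical complement to the analytical route via Kijima (1997), the pgf’s g i, j (t) satisfy the standard first-step recursion \(g_{i,j}(t) = t\,\bigl(p_{i,j} + \sum_{l\neq j} p_{i,l}\, g_{l,j}(t)\bigr)\), which, for any fixed t, can be solved by fixed-point iteration. A minimal sketch (hypothetical helper `hitting_pgf`; not the chapter’s method):

```python
def hitting_pgf(P, j, t, iters=2000):
    """Solve g_{i,j}(t) = t*(p_{i,j} + sum_{l != j} p_{i,l} g_{l,j}(t)) for
    all i by fixed-point iteration.  (For i = j this recursion returns the
    first-return pgf rather than the convention g_{j,j}(t) = 1.)"""
    N = len(P)
    g = [0.0] * N
    for _ in range(iters):
        g = [t * (P[i][j] + sum(P[i][l] * g[l] for l in range(N) if l != j))
             for i in range(N)]
    return g

# Two-state example: the mean hitting time of state 2 from state 1 is 1/p_{1,2}.
P = [[0.7, 0.3], [0.4, 0.6]]
h = 1e-6
mean_12 = (hitting_pgf(P, 1, 1.0)[0] - hitting_pgf(P, 1, 1.0 - h)[0]) / h
# mean_12 approximates g'_{1,2}(1) = 1/0.3
```

The numerical derivative at t = 1 recovers the mean waiting time; higher moments follow similarly.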
Theorem 16.2.1.
Let the pattern of interest be w k. The following recurrence relations hold for each \(j,\ j = 1,2, \ldots ,k - 1,\) and each \(r,\ r = 1,2, \ldots ,j\) (with the convention \(\sum \limits_{i=1}^{0} = 0\)) :
where
and the g i,j (t) and the indicator functions I i,j,n are as above.
The pgf of the intersite distance between consecutive occurrences of the pattern w k is given by \({G}_{k}^{({\mathbf{w}}_{j})}(t),\) where j is the largest integer such that w j is a proper prefix as well as a suffix of the pattern w k . Also, the pgf of the waiting time until the r-th occurrence of the pattern w k , given the initial state i, is equal to \({G}_{k}^{(i)}(t){\left ({G}_{k}^{({\mathbf{w}}_{j})}(t)\right )}^{r-1},\) where j has the same property as above.
The proof of Theorem 16.2.1 is based on the following simple idea. Let \({\tau }_{{\mathbf{w}}_{j}\vert \bar{{\mathbf{w}}}_{j+1}}\) be the waiting time for the first return (strictly positive) from pattern w j to itself given that the pattern w j + 1 is not achieved. Of course, the pattern w j + 1 is not achieved if the first state visited is not state w j + 1. Therefore, the pgf of \({\tau }_{{\mathbf{w}}_{j}\vert \bar{{\mathbf{w}}}_{j+1}}\) is equal to
Then, the waiting time to reach pattern w j + 1 starting from state s is equal to one plus a geometric sum of independent random variables, \({Y }_{1},{Y }_{2}, \ldots ,\) say, such that Y 1 has the distribution of the waiting time to reach subpattern w j from state s and the remaining Y n have the distribution of \({\tau }_{{\mathbf{w}}_{j}\vert \bar{{\mathbf{w}}}_{j+1}}.\) This implies that
A detailed proof of Theorem 16.2.1 is found in Stefanov (2003).
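The geometric-sum construction above can be checked by simulation in the simplest setting: i.i.d. fair-coin letters and the pattern ‘11’, whose mean waiting time is classically 6. A Monte Carlo sketch (Python; `wait_for_pattern` is a hypothetical helper, not from the chapter):

```python
import random

def wait_for_pattern(pattern, rng, p=0.5):
    """Generate i.i.d. letters '1' (prob p) / '0' until `pattern` first
    occurs; return the number of letters generated."""
    s, n = "", 0
    while not s.endswith(pattern):
        s = (s + ("1" if rng.random() < p else "0"))[-len(pattern):]
        n += 1
    return n

rng = random.Random(42)
reps = 100_000
est = sum(wait_for_pattern("11", rng) for _ in range(reps)) / reps
# the classical mean waiting time for '11' with a fair coin is 6
```

The estimate agrees with the value obtained by differentiating the pgf of the waiting time at t = 1.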
16.3 Patterns: General Discrete-Time and Continuous-Time Models
In this section, extensions of the result from the preceding section are presented. Finite-state semi-Markov processes, with either discrete- or continuous-time parameters, are the underlying models for generating the strings. Also, joint distributions of the waiting time to reach a pattern, together with the associated counts of occurrences of each letter, are of interest.
16.3.1 Waiting times
The notation from the preceding section is further used here for identifying the counterparts of similar quantities. For example, g i, j (t) will again denote the pgf of the waiting time to reach state j from state i in the more general discrete- or continuous-time model considered here.
Let \(\{X(u)\}_{u\geq 0}\) (the time parameter u may be either discrete or continuous) be a semi-Markov process whose associated embedded discrete-time Markov chain has a finite state space \(\{1,2, \ldots ,N\}\) and one-step transition probabilities p i, j , \(i,j = 1,2, \ldots ,N.\) For a formal definition of a semi-Markov process see Çinlar (1975). Denote by ϕ i, j (t) the pgf of the holding (sojourn) time in state i, given that the next state to be visited is state j (if the holding time distributions are discrete, then the time parameter is discrete). We denote by g i, j (t) the pgf of the waiting time, τ i, j , to reach state j from state i; that is, \({g}_{i,j}(t) = E({t}^{{\tau }_{i,j}}),\) where \(\tau_{i,j} = \inf\{u \geq 0 : X(u) = j\},\) given \(X(0) = i.\)
We assume τ i, i = 0, and therefore g i, i (t) = 1, for each i. The first return time to state i is denoted by \(\tilde{\tau}_{i,i}\) and its pgf by \(\tilde{g}_{i,i}(t).\) Of course, if X(u) is a discrete-time Markov chain,
and if X(u) is a continuous-time Markov chain,
If X(u) is a general semi-Markov process, then \(\tilde{{\tau }}_{i,i}\) is understood to be the waiting time to reach state i from itself given that at least one transition has been made in the associated embedded discrete-time Markov chain. This clarifies the interpretation of \(\tilde{\tau}_{i,i}\) in case one-step transitions are allowed from a state to itself in the embedded discrete-time Markov chain.
Again, as in the preceding section, the pattern of interest is denoted by w k . Denote by \(G_j^{(s)}(t)\) (\(\tilde{G}_j^{(s)}(t)\)), \(j = 1,2, \ldots ,k,\) the pgf of the waiting time to reach the pattern w j from state s, allowing (not allowing) the initial state s to contribute to the pattern. Also denote by \({G}_{j}^{({\mathbf{w}}_{r})}(t),1 \leq r \leq j,\) the pgf of the waiting time to reach the pattern w j , given that the pattern w r has already been reached (note that \({G}_{j}^{({\mathbf{w}}_{j})}(t) = 1\)). The following theorem provides a simple route for evaluating these pgf’s in terms of the following characteristics of the original semi-Markov process X(u): the pgf’s, g i, j (t), of the transition times between the states, the pgf’s, ϕ i, j (t), of the holding time distributions, and the transition probabilities, p i, j , of the embedded discrete-time Markov chain.
Theorem 16.3.1.
Let the pattern of interest be w k. The following recurrence relations hold for each \(j,\ j = 1,2, \ldots ,k - 1,\) and each \(r,\ r = 1,2, \ldots ,j\) (with the convention \(\sum \limits_{i=1}^{0} = 0\)) :
where
The proof is based on the same idea as that used to prove Theorem 16.2.1. Similarly to the preceding section, denote by \({\tau }_{{\mathbf{w}}_{j}\vert \bar{{\mathbf{w}}}_{j+1}}\) the waiting time to reach w j from itself given that the pattern w j + 1 is not achieved. Then one may notice that the waiting time to reach pattern w j + 1 starting from state s is equal to the sum of two independent random variables, where the first has a pgf which equals \({\phi }_{{w}_{j},{w}_{j+1}}(t)\) and the second one is a geometric sum of independent random variables, \({Y }_{1},{Y }_{2}, \ldots ,\) say, such that Y 1 has the distribution of the waiting time to reach subpattern w j from state s and the remaining Y n have the distribution of \({\tau }_{{\mathbf{w}}_{j}\vert \bar{{\mathbf{w}}}_{j+1}}.\)
16.3.2 Joint generating functions associated with waiting times
In this subsection we consider the same general semi-Markov model X(u) that has been introduced in the preceding subsection. Recall that its embedded discrete-time Markov chain has N states. Throughout this subsection these states will be called ‘symbols’. Again the notation from the preceding subsections is further used in this subsection for identifying the counterparts of similar quantities (such as \(G_j^{(s)}(\cdot)\), etc.). Note that basic quantities of the underlying model, such as τ i, j and ϕ i, j , have the same meaning as that in the preceding subsection.
Let \(C_i(u)\) be the count of occurrences of symbol i up to time u, and let \(g_{i,j}(\underline{\mathbf{t}}),\) where \(\underline{\mathbf{t}} = ({t}_{0},{t}_{1}, \ldots ,{t}_{N}),\) be the joint pgf of \(({\tau }_{i,j},{C}_{1}({\tau }_{i,j}), \ldots ,{C}_{N}({\tau }_{i,j})),\) where the τ i, j have been introduced in the preceding subsection. Likewise, let \(\tilde{g}_{i,i}(\underline{\mathbf{t}})\) be the joint pgf of \((\tilde{{\tau }}_{i,i},{C}_{1}(\tilde{{\tau }}_{i,i}), \ldots ,{C}_{N}(\tilde{{\tau }}_{i,i})),\) where again the \(\tilde{\tau}_{i,i}\) have been introduced in the preceding subsection. Note that \(g_{i,i}(\underline{\mathbf{t}}) = 1.\) Denote by \(\nu_j^{(s)}\) the waiting time to reach the pattern w j from state s. Let \(G_j^{(s)}(\underline{\mathbf{t}})\) \((\tilde{{G}}_{j}^{(s)}(\underline{\mathbf{t}}))\) be the joint pgf of \({\nu }_{j}^{(s)},{C}_{1}({\nu }_{j}^{(s)}), \ldots ,{C}_{N}({\nu }_{j}^{(s)}),\) allowing (not allowing) the first symbol to contribute to the pattern. Further, let \(\nu_j^{({\mathbf{w}}_{r})}\) be the waiting time to reach the pattern w j from the already-reached prefix w r , and let \(G_j^{({\mathbf{w}}_{r})}(\underline{\mathbf{t}})\) be the joint pgf of \({\nu }_{j}^{({\mathbf{w}}_{r})},{C}_{1}({\nu }_{j}^{({\mathbf{w}}_{r})}), \ldots ,{C}_{N}({\nu }_{j}^{({\mathbf{w}}_{r})}).\) Note that the methodology introduced in Stefanov (2000; see Section 3) yields explicit expressions for the pgf’s \(g_{i,j}(\underline{\mathbf{t}})\) associated with any given semi-Markov process whose embedded discrete-time Markov chain has a relatively small number of states. Therefore, the recurrence relations in the following theorem provide a simple route for explicit evaluation of the joint pgf’s of the waiting time to reach, or the intersite distance between two consecutive occurrences of, a pattern and the associated counts of occurrences of the corresponding symbols (letters).
Theorem 16.3.2.
Let the pattern of interest be w k. The following recurrence relations hold for each \(j,\ j = 1,2, \ldots ,k - 1,\) and each \(r,\ r = 1,2, \ldots ,j\) :
where
The proof of this theorem is found in Stefanov (2003).
16.4 Compound Patterns
Throughout this section we assume that the strings are generated by discrete-time Markov chains.
16.4.1 Compound patterns containing a small number of single patterns
Denote by W a compound pattern which consists of k distinct single patterns, \({\mathbf{w}}^{(1)},{\mathbf{w}}^{(2)}, \ldots ,{\mathbf{w}}^{(k)}.\) The latter may have different lengths, and it is assumed that none of them is a proper substring of any of the others. Let a be an arbitrary pattern; in particular, if a has length 1, that is, it is equal to a particular letter, s say, then we will denote a by s. Introduce the following quantities.
T a, W — the waiting time, starting from pattern a, to reach for the first time the compound pattern W; if a equals one of the w (i), then this waiting time is assumed to be greater than 0;
\({T}_{\mathbf{a},\mathbf{W}\vert {\mathbf{w}}^{(j)}}\) — the waiting time, starting from pattern a, to reach for the first time the compound pattern W, given that W is reached via w (j);
T a, b — the waiting time to reach pattern b starting from pattern a;
X i, j — the interarrival time between two consecutive occurrences of pattern W, given that the starting pattern is w (i) and the reached pattern is w (j);
r i, j — the probability that the first reached pattern from W is w (j), given that the starting pattern is w (i).
Of course, \({X}_{i,j} = {T}_{{\mathbf{w}}^{(i)},\mathbf{W}\vert {\mathbf{w}}^{(j)}}.\) Introduce the following pgf’s:
and recall that by G Y (t) we denote the pgf of a random variable Y. Clearly,
Also, it is easy to see that
Therefore, both the r i, j and the pgf’s \({G}_{{X}_{i,j}}(t)\) can be recovered from the pgf’s \({G}_{{\mathbf{w}}^{(i)},\mathbf{W},j}(t).\) The following theorem [see Chryssaphinou and Papastavridis (1990) and Gerber and Li (1981)] provides, for each pattern a, a system of linear equations from which one can recover the pgf’s G a, W, j (t) and \({G}_{{T}_{\mathbf{a},\mathbf{W}}}(t)\) in terms of the pgf’s \({G}_{{T}_{{\mathbf{ w}}^{(i)},{\mathbf{w}}^{(j)}}}(t).\) The \({G}_{{T}_{{\mathbf{ w}}^{(i)},{\mathbf{w}}^{(j)}}}(t)\) are derived from the results in Section 16.2.
Theorem 16.4.1.
The following identities hold:
In particular, we get the following explicit expressions for the \({G}_{{\mathbf{w}}^{(i)},\mathbf{W},j}(t)\) in terms of the \({G}_{{T}_{{\mathbf{ w}}^{(i)},{\mathbf{w}}^{(j)}}}(t)\) if the compound pattern W = { w (1), w (2)} consists of two patterns. For brevity, \({G}_{{T}_{i,j}}\) below stands for \({G}_{{T}_{{\mathbf{ w}}^{(i)},{\mathbf{w}}^{(j)}}}(t).\)
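For a compound pattern with two single patterns, the analogous cold-start probabilities (which pattern from W is reached first from an empty string) can be checked by simulation. For i.i.d. fair-coin letters and W = {HHT, THH}, Conway’s leading numbers give odds 3:1 that THH is reached first, i.e., probability 3/4. A Monte Carlo sketch (hypothetical helper; not code from the chapter):

```python
import random

def first_to_occur(patterns, rng):
    """Generate i.i.d. fair-coin letters until one pattern in `patterns`
    occurs; return the winner.  Assumes no pattern is a substring of another."""
    s, m = "", max(len(p) for p in patterns)
    while True:
        s = (s + rng.choice("HT"))[-m:]
        for p in patterns:
            if s.endswith(p):
                return p

rng = random.Random(0)
reps = 100_000
frac = sum(first_to_occur(["HHT", "THH"], rng) == "THH" for _ in range(reps)) / reps
# Conway's leading numbers give P(THH before HHT) = 3/4 for a fair coin
```

The exact value 3/4 follows from (L_AA − L_AB)/(L_BB − L_BA) with A = HHT, B = THH, where the L’s are Conway’s leading numbers mentioned in the Introduction.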
16.4.2 Weighted counts of compound patterns
A quantity of interest is the count of occurrences of a compound pattern, W say (as introduced in Subsection 16.4.1), within a finite time horizon. A more general quantity is the weighted count of pattern occurrences which attaches a weight, h i say, to each occurrence of a single pattern, w (i), from W. More specifically, introduce
where \({N}_{{\mathbf{w}}^{(i)}}(t)\) is the count of occurrences of pattern w (i) within a time interval of length t. Recall the meaning of the r i, j , X i, j , and \({T}_{s,\mathbf{W}\vert {\mathbf{w}}^{(j)}}\) which are introduced in Subsection 16.4.1. Of course, the occurrence of W can be modelled by a k-state semi-Markov process, where an entry to state i identifies an occurrence of pattern w (i). The one-step transition probabilities of the embedded discrete-time Markov chain of this semi-Markov process are the r i, j . The holding time at state i, given that the next state to be visited is state j, is identified by the random variable X i, j . For each initial letter, s say, we augment this semi-Markov process with one initial state, 0 say, and relevant one-step transition probabilities and holding times as follows (we denote the probability to move from state 0 to state j by r 0, j ):
and the holding time at state 0, given that the next state to be visited is state j, is identified by \({T}_{s,\mathbf{W}\vert {\mathbf{w}}^{(j)}},\) where the latter and G s, W, j (t) are introduced in Subsection 16.4.1. Now consider the semi-Markov process, Y t say, derived from that above as follows. The state space has \((k + 1)^2\) states, identified by the pairs \((i,j),\ i,j = 0,1, \ldots ,k.\) The process Y t enters state (i, j) if pattern w (i) is reached, given that the next occurrence of W is via pattern w (j). The initial states are the states (0, j) for \(j = 1,2, \ldots ,k,\) and the initial probabilities are the r 0, j . Clearly, the holding time distributions for this new semi-Markov process do not depend on the next state visited. Also, the holding time in state (i, j) is identified by the random variable X i, j , and that in state (0, j) by \({T}_{s,\mathbf{W}\vert {\mathbf{w}}^{(j)}}.\) Then the weighted count H W (t), introduced above, is equal to
where N (i, j)(t) counts the number of visits of Y t to state (i, j) within a time interval of length t. Denote by \({\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})}\) the first passage time of Y t from state (i 1, j 1) to state (i 2, j 2) and by \({L}_{{H}_{\mathbf{W}}}^{{\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})} }({s}_{1},{s}_{2})\) the joint Laplace transform of the random variables \({\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})}\) and \({H}_{\mathbf{W}}\left ({\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})}\right )\), that is,
Closed-form expressions for the \({L}_{{H}_{\mathbf{W}}}^{{\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})} }({s}_{1},{s}_{2})\) are derivable in terms of the r i, j and the Laplace transforms of the X i, j and the \({T}_{s,\mathbf{W}\vert {\mathbf{w}}^{(j)}}\), as explained in Stefanov (2006) for general reward functions on semi-Markov processes. Let
The following theorem follows from a general result on reward functions for semi-Markov processes [see Theorem 2.1 in Stefanov (2006)]. It provides an explicit, closed-form expression for the Laplace transform, \({L}_{t,{H}_{\mathbf{W}}}^{(s)}({s}_{1},{s}_{2}),\) of the weighted count of W occurrences within a time interval of length t, in terms of the r i, j , the Laplace transforms, \(\mathcal{L}[{X}_{i,j}](\cdot ),\) of the interarrival times X i, j of the compound pattern W, and the Laplace transforms, \(\mathcal{L}[{T}_{s,\mathbf{W}\vert {\mathbf{w}}^{(j)}}](\cdot ),\) of the waiting time to reach W from an initial letter s, for \(s = 1,2, \ldots ,N.\)
Theorem 16.4.2.
The following identity holds for the Laplace transform \({L}_{t,{H}_{\mathbf{W}}}^{(s)}\!\! :\)
where the joint Laplace transforms \({L}_{{H}_{\mathbf{W}}}^{{\nu }_{({i}_{1},{j}_{1}),({i}_{2},{j}_{2})} }({s}_{1},{s}_{2})\) have been introduced above.
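As an elementary check on the weighted count H W (t), its expectation in the simplest i.i.d. setting is available directly: for fair-coin letters, each of the n − 1 positions in a string of length n carries an occurrence of HH (or of TT) with probability 1/4, so with weights h HH = 1, h TT = 2 and n = 20 one gets E H W = 3 · 19/4 = 14.25. A Monte Carlo sketch (hypothetical helper; counts are of overlapping occurrences; not code from the chapter):

```python
import random

def weighted_count(text, weights):
    """H_W for string `text`: each (overlapping) occurrence of pattern p
    contributes weight weights[p]."""
    return sum(h * sum(text[i:i + len(p)] == p
                       for i in range(len(text) - len(p) + 1))
               for p, h in weights.items())

rng = random.Random(1)
n, reps = 20, 50_000
weights = {"HH": 1, "TT": 2}
mean = sum(weighted_count("".join(rng.choice("HT") for _ in range(n)), weights)
           for _ in range(reps)) / reps
# theoretical mean: (1 + 2) * (n - 1) / 4 = 14.25
```

Theorem 16.4.2 refines this first-moment check to the full distribution via the Laplace transform above.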
16.4.3 Structured motifs
Structured motifs are special compound patterns, usually containing a huge number of single patterns. In this subsection we consider both the waiting time until the first occurrence, and the intersite distance between consecutive occurrences, of a structured motif. The interest in these waiting times is due to the biological challenge of identifying promoter motifs along genomes. A structured motif is composed of several patterns separated by a variable distance. If the number of patterns is n, then the structured motif is said to have n boxes. The formal definition of a structured motif with 2 boxes follows. Let w (1) and w (2) be two patterns of length k 1 and k 2, respectively. The alphabet size equals N, and the strings are generated by the Markov chain introduced in Section 16.2. A structured motif m formed by the patterns w (1) and w (2), and denoted by \(\mathbf{m} ={ \mathbf{w}}^{(1)}({d}_{1} : {d}_{2}){\mathbf{w}}^{(2)},\) is a string with the following property. Pattern w (1) is a prefix and pattern w (2) is a suffix of the string, and the number of letters between the two patterns is not smaller than d 1 and not greater than d 2. Also, it is assumed that patterns w (1) and w (2) appear only once in the string. The pgf’s of both the waiting time, τ m (s), to reach for the first time the structured motif m from state s, and the intersite distance, \({\tau }_{\mathbf{m}}^{(intersite)},\) between two consecutive occurrences of m, are of interest.
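As an illustration of the definition, a naive scan locating the first occurrence of a 2-box structured motif \(\mathbf{m} ={ \mathbf{w}}^{(1)}({d}_{1} : {d}_{2}){\mathbf{w}}^{(2)}\) in a given string can be sketched as follows (Python; hypothetical helper, and for simplicity the uniqueness condition on w (1), w (2) inside the motif is not enforced):

```python
def find_structured_motif(text, w1, w2, d1, d2):
    """Return the end position (exclusive) of the earliest-ending occurrence
    of the structured motif w1(d1:d2)w2 in text, or None if there is none."""
    k1, k2 = len(w1), len(w2)
    best = None
    for i in range(len(text) - k1 + 1):
        if text[i:i + k1] != w1:
            continue
        for g in range(d1, d2 + 1):       # g = gap length between the boxes
            j = i + k1 + g
            if j + k2 <= len(text) and text[j:j + k2] == w2:
                end = j + k2
                if best is None or end < best:
                    best = end
    return best
```

For example, in the string XXABXYZCDXX the motif AB(2:4)CD occurs with gap XYZ (length 3) and ends at position 9, while AB(2:3)CD does not occur in ABXCD because the gap there has length 1.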
Let \(\mathbf{W} =\{{ \mathbf{w}}^{(1)},{\mathbf{w}}^{(2)}\}\) be a compound pattern consisting of two patterns. For brevity, denote by T i, j , i, j ∈ { 1, 2}, the waiting time to reach pattern w (j) from pattern w (i), and by T j (s) the waiting time to reach pattern w (j) from state s. The quantities r i, j and X i, j , i, j ∈ { 1, 2}, are introduced in Subsection 16.4.1. Let
In order to reach the structured motif m, we need to reach first the pattern w (1) and, from this occurrence of w (1), to reach the pattern w (2) such that \({d}_{1} + {k}_{2} \leq {X}_{1,2} \leq {d}_{2} + {k}_{2}\). Introduce the following random variables:
F 12 corresponds to an occurrence of w (2) that fails to achieve the structured motif, whereas for S 12, w (2) achieves the structured motif. One may notice that the pgf’s of F 12 and S 12 are given by
where q S is the probability of ‘success’ (w (2) achieves the structured motif), i.e., the probability that \({d}_{1} + {k}_{2} \leq {X}_{1,2} \leq {d}_{2} + {k}_{2}\). Namely, we have
The following theorem provides explicit and calculable expressions for the pgf’s of both the waiting time to reach for the first time the structured motif \(\mathbf{m} ={ \mathbf{w}}^{(1)}({d}_{1} : {d}_{2}){\mathbf{w}}^{(2)}\) from state s, and the intersite distance between two consecutive occurrences of m.
Theorem 16.4.3.
The pgf, \({G}_{\mathbf{m}}^{(s)}(t),\) of the waiting time to reach for the first time a structured motif m starting from state s, and the pgf, \({G}_{\mathbf{m}}^{(intersite)}(t),\) of the intersite distance between two consecutive occurrences of m, admit the following explicit expressions:
where \({G}_{{F}_{12}}(t),{G}_{{S}_{12}}(t),\) and q S are given above.
The proof of this theorem is found in Stefanov, Robin, and Schbath (2007). Note that, in view of this theorem, the availability of the pgf’s \({G}_{{X}_{i,j}}(t),\;i,j = 1,2,\) is enough to calculate explicit, closed-form expressions for \({G}_{\mathbf{m}}^{(s)}(t)\) and \({G}_{\mathbf{m}}^{(intersite)}(t).\) Explicit expressions for the \({G}_{{X}_{i,j}}(t),\) in terms of the \({G}_{{T}_{{\mathbf{ w}}^{(i)},{\mathbf{w}}^{(j)}}}(t),\) are derived from the identities at the end of Subsection 16.4.1. Also, recall that the \({G}_{{T}_{{\mathbf{ w}}^{(i)},{\mathbf{w}}^{(j)}}}(t)\) are calculated from Theorem 16.2.1 in Section 16.2.
Neat closed-form expressions for the relevant pgf’s associated with structured motifs with n boxes are found in Stefanov, Robin, and Schbath (2008).
References
Antzoulakos, D. L. (2001). Waiting times for patterns in a sequence of multistate trials. Journal of Applied Probability, 38, 508–518.
Balakrishnan, N. and Koutras, M. (2002). Runs and Scans with Applications. Wiley, New York.
Biggins, J. D. (1987). A note on repeated sequences in Markov chains. Advances in Applied Probability, 19, 739–742.
Biggins, J. D. and Cannings, C. (1987). Markov renewal processes, counters and repeated sequences in Markov chains. Advances in Applied Probability, 19, 521–545.
Blom, G. and Thorburn, D. (1982). How many random digits are required until given sequences are obtained? Journal of Applied Probability, 19, 518–531.
Chadjiconstantinidis, S., Antzoulakos, D. L. and Koutras, M. V. (2000). Joint distributions of successes, failures and patterns in enumeration problems. Advances in Applied Probability, 32, 866–884.
Chryssaphinou, O. and Papastavridis, S. (1990). The occurrence of a sequence of patterns in repeated dependent experiments. Theory of Probability and Its Applications, 35, 167–173.
Çinlar, E. (1975). Introduction to Stochastic Processes. Prentice-Hall, Englewood Cliffs, NJ.
Feller, W. (1950). An Introduction to Probability Theory and Its Applications, Vol. 1. Wiley, New York.
Fu, J. C. (1996). Distribution theory of runs and patterns associated with a sequence of multistate trials. Statistica Sinica, 6, 957–974.
Fu, J. C. and Chang, Y. M. (2002). On probability generating functions for waiting time distributions of compound patterns in a sequence of multistate trials. Journal of Applied Probability, 39, 70–80.
Fu, J. C. and Lou, W. Y. W. (2003). Distribution Theory of Runs and Patterns and Its Applications. World Scientific, Hackensack, NJ.
Gerber, H. and Li, S-Y. R. (1981). The occurrence of sequence patterns in repeated experiments and hitting times in a Markov chain. Stochastic Processes and Their Applications, 11, 101–108.
Glaz, J., Kulldorff, M., Pozdnyakov, V. and Steele, J. M. (2006). Gambling teams and waiting times for patterns in two-state Markov chains. Journal of Applied Probability, 43, 127–140.
Guibas, L. J. and Odlyzko, A. M. (1981). String overlaps, pattern matching, and nontransitive games. Journal of Combinatorial Theory. Series A, 30, 183–208.
Han, Q. and Hirano, K. (2003). Sooner and later waiting time problems for patterns in Markov dependent trials. Journal of Applied Probability, 40, 73–86.
Inoue, K. and Aki, S. (2007). On generating functions of waiting times and numbers of occurrences of compound patterns in a sequence of multistate trials. Journal of Applied Probability, 44, 71–81.
Kijima, M. (1997). Markov Processes for Stochastic Modeling. Chapman & Hall, London.
Li, S-Y. R. (1980). A martingale approach to the study of occurrence of sequence patterns in repeated experiments. Annals of Probability, 8, 1171–1176.
Nicodème, P., Salvy, B. and Flajolet, P. (2002). Motif statistics. Theoretical Computer Science, 287, 593–617.
Nuel, G. (2008). Pattern Markov chains: optimal Markov chain embedding through deterministic finite automata. Journal of Applied Probability, 45, 226–243.
Pozdnyakov, V. (2008). A note on occurrence of gapped patterns in i.i.d. sequences. Discrete Applied Mathematics, 156, 93–102.
Pozdnyakov, V., Glaz, J., Kulldorff, M. and Steele, J. M. (2005). A martingale approach to scan statistics. Annals of the Institute of Statistical Mathematics, 57, 21–37.
Reinert, G., Schbath, S. and Waterman, M. (2000). Probabilistic and statistical properties of words: an overview. Journal of Computational Biology, 7, 1–46.
Robin, S. and Daudin, J. (1999). Exact distribution of word occurrences in a random sequence of letters. Journal of Applied Probability, 36, 179–193.
Robin, S. and Daudin, J. (2001). Exact distribution of the distances between any occurrences of a set of words. Annals of the Institute of Statistical Mathematics, 36, 895–905.
Robin, S., Daudin, J., Richard, H., Sagot, M.-F. and Schbath, S. (2002). Occurrence probability of structured motifs in random sequences. Journal of Computational Biology, 9, 761–773.
Rukhin, A. (2002). Distribution of the number of words with a prescribed frequency and tests of randomness. Advances in Applied Probability, 34, 775–797.
Rukhin, A. (2006). Correlation matrices of chains for Markov sequences, and testing for randomness. (Russian) Teoriya Veroyatnostei i ee Primeneniya, 51, 712–731.
Stefanov, V. T. (2000). On some waiting time problems. Journal of Applied Probability, 37, 756–764.
Stefanov, V. T. (2003). The intersite distances between pattern occurrences in strings generated by general discrete- and continuous-time models: an algorithmic approach. Journal of Applied Probability, 40, 881–892.
Stefanov, V. T. (2006). Exact distributions for reward functions on semi-Markov and Markov additive processes. Journal of Applied Probability, 43, 1053–1065.
Stefanov, V. T. and Pakes, A. G. (1997). Explicit distributional results in pattern formation. Annals of Applied Probability, 7, 666–678.
Stefanov, V. T., Robin, S. and Schbath, S. (2007). Waiting times for clumps of patterns and for structured motifs in random sequences. Discrete Applied Mathematics, 155, 868–880.
Stefanov, V. T., Robin, S. and Schbath, S. (2008). Occurrence of structured motifs in random sequences: arbitrary number of boxes. (in preparation).
Szpankowski, W. (2001). Average Case Analysis of Algorithms on Sequences. John Wiley & Sons, New York.
© 2009 Birkhäuser Boston, a part of Springer Science+Business Media, LLC
Stefanov, V. (2009). Occurrence of Patterns and Motifs in Random Strings. In: Glaz, J., Pozdnyakov, V., Wallenstein, S. (eds) Scan Statistics. Statistics for Industry and Technology. Birkhäuser Boston. https://doi.org/10.1007/978-0-8176-4749-0_16
Print ISBN: 978-0-8176-4748-3
Online ISBN: 978-0-8176-4749-0