1 Introduction

Periodicity is a structural property of DNA sequences, expressed as a tendency of nucleotides, or words of nucleotides, to reappear at specific distances. It is worth noting that, in DNA analysis, periodicity refers to a tendency of letters or words to reappear at certain distances, in contrast with the formal mathematical definition of periodicity. Repetitive patterns in DNA often cause human diseases, and algorithmic techniques have been applied to detect such patterns using statistically based criteria [5]. There also exist structural types of periodicity in DNA that are not linked to diseases. Mainly, two such types of periodic behaviour have been observed in DNA. The first was discovered in 1980 [26] and concerns the signal of nucleosomes contained in the nucleus. The authors observed that certain dinucleotides in the DNA of chromatin tend to appear approximately every 10 to 11 bases. Subsequent studies suggested that the period of chromatin sequences converges to 10.4 bases [10]. A more recent study, which investigated the genome of three organisms (A. thaliana, C. elegans and H. sapiens), suggested that the dinucleotide AA has almost perfect 10.5-base periodic behaviour in sequences of these organisms [22]. One explanation for this type of periodicity is that the distance of 10.5 bases is the “step” of the double strand, which allows the long DNA sequence to be packed into the area of the nucleus [14]. Previous studies have used the Fourier transform and spectral density analysis as the main tools for exploring periodic behaviour in DNA sequences [7,8,9, 24, 31]. The second type of periodicity has been observed in areas of the genome that are transcribed and later translated into proteins, called coding regions. Previous studies, using similar methods, have shown that in coding regions there is a tendency for certain nucleotides to reappear every 3 bases [32]. 
Moreover, this type of periodicity has only been observed in coding regions, while no similar periodic behaviour has been found in non-coding regions [12, 25, 27]. As each amino acid is encoded by a triplet of nucleotides (codon) and some specific amino acids are more abundant than others, the authors concluded that the periodic behaviour exists due to the abundance of certain amino acids, and that the period of 3 bases is due to the triplet nature of the genetic code [2]. As the whole genome sequence of an organism frequently comprises several billion bases and the coding regions constitute only a small part of the DNA, information about the periodic behaviour of coding regions could be helpful for detecting these regions and distinguishing between protein coding and non-coding regions [20]. Also, other well-known and highly accurate probabilistic algorithms use hidden Markov models in order to predict the different gene structures inside DNA [6]. Markov chains have previously been used in the analysis of letter and DNA sequences; some of these models can be found in the book of Waterman [29] and also in [1, 3, 13, 23, 30]. In this paper we consider that a DNA sequence is described by a semi-Markov chain (SMC) \(X_t\), with discrete state space \(S = \lbrace A,C,G,T \rbrace \), where t denotes the index position inside the sequence and \(\boldsymbol{C}(m)= \lbrace c_{i,j}(m) \rbrace \) is the core matrix of the SMC. Previously, in [19] a similar modelling was examined to derive distributions of the word location and frequency of occurrences. The applied semi-Markov model had a discrete finite state space S whose elements were specific words, i.e. finite combinations of letters taken from the alphabet with known length and non-overlapping occurrences. An overview of probabilistic and statistical properties of words, as occurrences in biological sequences, is provided in [21]. 
Semi-Markov chains are a generalization of Markov chains and allow the sojourn time between transitions to follow arbitrary distributions. An overview of the basic theory of homogeneous semi-Markov chains can be found in the book of Howard [15]. Further theory and applications of semi-Markov modelling can be found in [4, 11, 16,17,18].

In Sect. 9.2.1, a recursive equation of the homogeneous semi-Markov model is constructed, which could be used as an identification tool for regions that have strong or weak d-periodic behaviour. In Sect. 9.2.2 the previous theoretical results are generalized to the non-homogeneous case, considering the triplet nature of the DNA and assuming that each coding position corresponds to a different transition matrix \(\boldsymbol{P}(k)\). In Sect. 9.3, the case of quasiperiodicity of a state is included. This tool accounts for the fact that the chain may lose its periodic behaviour for a number of cycles, or that the state may appear not exactly after a period of d positions but within a radius of \(d\pm \varepsilon \) positions. This may be due to genetic mutations that shift the way the sequence is read. In Sect. 9.4, we present illustrations with data from both synthetic and real DNA sequences, regarding the 3-base periodicity. In the final section, conclusions are provided.

2 The Basic Framework

We assume that the DNA sequence is a realization of a semi-Markov chain \( X_t \) with state space the four nucleotides \( S = \lbrace A,C,G,T\rbrace \). The semi-Markov chain is described by a sequence of Markov transition matrices \( \lbrace \boldsymbol{P}(t) \rbrace _{t=0}^{\infty }\) and a sequence of conditional holding time matrices \(\lbrace \boldsymbol{H}(m) \rbrace _{m=1}^{\infty }\), such that \( \boldsymbol{P}(t)= \lbrace p_{i,j}(t) \rbrace , i,j \in S, t \in \mathcal {N}, \) where

$$\begin{aligned} \begin{array}{ll} p_{i,j}(t)=Prob[&\text {the SMC will make its next transition to state}\, j \\ &\text {/ the SMC entered state}\, i\, \text {at time}\, t], \end{array} \end{aligned}$$

with \(p_{i,j}(t)\ge 0 , \; \forall i,j\in S, \; t \in \mathcal {N}\) and \({\displaystyle \sum _{j\in S}} p_{i,j}(t) = 1, \; \forall i, \; t \in \mathcal {N}\) and

$$\begin{aligned}&\boldsymbol{H}(m)=\lbrace h_{i,j}(m) \rbrace , \qquad i,j \in S, \; m \in \mathcal {N}, \\&\begin{array}{c} h_{i,j}(m) =Prob[\text {The SMC will stay in state}\, i\, \text {for}\, m\, \text {time units} \\ \text {before moving to state}\, j]. \end{array} \end{aligned}$$

We define the waiting time probabilities \(w_{i}(t,m)\), which are the probabilities for the SMC to hold for m time units in state i before making its next transition, given that it entered state i at time t, to be \( w_{i}(t,m)= \sum _{j\in S} p_{i,j}(t)h_{i,j}(m). \) Also, the survival function of the waiting time is \( ^{>}w_{i}(t,n)= {\displaystyle \sum _{m=n+1}^{\infty }} w_{i}(t,m)={\displaystyle \sum _{m=n+1}^{\infty } \sum _{j\in S}} p_{i,j}(t)h_{i,j}(m). \) The basic parameter of the SMC is the core matrix, defined as \( \boldsymbol{C}(t,m)= \lbrace c_{i,j}(t,m) \rbrace _{i,j \in S} = \boldsymbol{P}(t) \circ \boldsymbol{H}(m), \) where the operator \(\lbrace \circ \rbrace \) denotes the element-wise product of matrices (Hadamard product). Also, we define the interval transition probabilities \(q_{i,j}(t,n)\), which are the probabilities for the SMC to be in state j after n time units, given that it entered state i at time t, to be

$$\begin{aligned} \begin{array}{l} \boldsymbol{Q}(t,n) = \lbrace q_{i,j}(t,n) \rbrace _{i,j \in S} \\ = \,^{>}\boldsymbol{W}(t,n)+{\displaystyle \sum _{m=0}^{n}}[\boldsymbol{P}(t)\circ \boldsymbol{H}(m)]\boldsymbol{Q}(t+m,n-m), \end{array} \end{aligned}$$
(9.1)

where \(^{>}\boldsymbol{W}(t,n)=diag \lbrace \,^{>}w_{i}(t,n)\rbrace \). The elements of the matrix \(\boldsymbol{Q}(t,n)\) are

$$\begin{aligned} q_{i,j}(t,n)=\,\delta _{i,j} \;^{>}w_{i}(t,n)+ \sum _{r \in S} \sum _{m=1}^{n}c_{i,r}(t,m)q_{r,j}(t+m,n-m), \; i,j \in S, \; t,n\in \mathcal {N}. \end{aligned}$$
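The recursion above is straightforward to implement. The following is a minimal numerical sketch in the homogeneous case (so the argument t is dropped); the embedded matrix P and the holding-time matrices H(m) are hypothetical illustrative values, not estimates taken from the paper.

```python
import numpy as np

# Minimal sketch of the interval transition recursion (homogeneous case):
# Q(n) = >W(n) + sum_{m=1}^{n} [P ∘ H(m)] Q(n-m), with Q(0) = I.
# P and H(m) below are hypothetical illustrative values.
P = np.array([[0.0, 0.2, 0.8, 0.0],
              [0.4, 0.0, 0.5, 0.1],
              [0.1, 0.5, 0.0, 0.4],
              [0.3, 0.7, 0.0, 0.0]])
H = {1: np.full((4, 4), 0.7), 2: np.full((4, 4), 0.3)}  # sum_m h_ij(m) = 1
M_MAX = 2  # largest holding time with positive probability

def core(m):
    """Core matrix C(m) = P ∘ H(m) (Hadamard product)."""
    return P * H.get(m, np.zeros_like(P))

def waiting_survival(n):
    """Survival function >w_i(n) = sum_{m>n} sum_j c_ij(m)."""
    tot = np.zeros(len(P))
    for m in range(n + 1, M_MAX + 1):
        tot += core(m).sum(axis=1)
    return tot

def Q(n):
    """Interval transition probabilities Q(n), computed recursively."""
    if n == 0:
        return np.eye(len(P))
    out = np.diag(waiting_survival(n))
    for m in range(1, n + 1):
        out = out + core(m) @ Q(n - m)
    return out
```

Every row of Q(n) sums to one, which gives a quick consistency check on an implementation.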

2.1 The Homogeneous Case

In the following, we consider the DNA sequence to be a homogeneous semi-Markov chain, so that \( p_{i,j}(t)=p_{i,j}, \forall t \in \mathcal {N}. \) Furthermore, we assume that DNA sequences do not contain virtual transitions; subsequent appearances of the same state count as holding and \(p_{i,i}(t) = 0, \;\forall i\in S, \; t \in \mathcal {N} \). In the present setting, the time parameter indicates the position in the sequence, since the evolution of a DNA sequence depends on the index position of every letter. In order to study the d-periodic behaviour of a DNA sequence, we examine the probability of a letter reappearing after d positions. For a sequence with strong d-periodic behaviour, it is expected that, for every periodic state, the frequency of state appearances every kd positions would be high. Therefore, an interesting question is whether the chain is in the same state, not only for the first cycle of length d, but also for a number of n successive cycles of the same length. Thus, we define the following probabilities.

Definition 1

Let \(p_{i}(1,d)\) be the probability that the SMC will be in state i in position d, given that in the initial position it was observed to be in state i, that is

$$\begin{aligned} p_{i}(1,d)=Prob[&\text {the SMC will be in state}\, i \, \text {in position}\, d / \\&\text {the initial state was observed to be}\, i]. \end{aligned}$$

Similarly, we define the probability that the SMC will be in state i every d positions for n cycles, given that in the initial position it was observed to be in state i, as follows

$$\begin{aligned} p_{i}(n,d)=Prob[&\text {the SMC will be in state}\, i\, \text {every}\, d\, \text {positions for}\, n\, \text {cycles} /\\&\text {the initial state was observed to be}\, i]. \end{aligned}$$

It is important to note that for a given DNA sequence, we do not know if the initial position is due to a letter transition or reappearance of the same letter, therefore we have to include both cases in order to calculate the probability above. If we observed the process to be in state i in the initial position, it would be unlikely that upon the first observation the SMC had just entered this state. On the other hand, it would be more plausible to think that we started to observe the process in a position, where the entrance to a state has already been achieved. As a result, the process will stay in state i for the remaining positions and then make a transition to state j. The basic parameters of the SMC under random starting concern only the behaviour of the process until the first transition. Hence, let us denote by \(_{r}p_{i,j}(\cdot )\) the transition probabilities under random starting and \(_{r}h_{i,j}(\cdot )\) the distributions of the holding positions under random starting. A more detailed specification of the SMC under random starting could be found in the book of Howard [15].

Lemma 1

Let \(\boldsymbol{P}(1,d)\) and \(\boldsymbol{P}(n,d)\) be the \((N\times 1)\) vectors, which consist of the probabilities \(p_{i}(1,d)\) and \(p_{i}(n,d),\; i \in S\) respectively, following Definition 1. Then,

$$\begin{aligned} (a)\qquad \boldsymbol{P}(1,d)=\, \Big [ ^{>}\!_{r}\!\boldsymbol{W}(d)+\sum _{x=1}^{d}\, \boldsymbol{I} \circ \big [ _{r}\!\boldsymbol{C}(x)[\boldsymbol{Q}(d-x) \circ \boldsymbol{(U-I)}] \big ] \Big ] \cdot \boldsymbol{1}. \end{aligned}$$
(9.2)
$$\begin{aligned} (b)\qquad \boldsymbol{P}(n,d)=\boldsymbol{P}(n-1,d)\circ \boldsymbol{P}(1,d), \end{aligned}$$
(9.3)

where \(\boldsymbol{I}\) is the identity matrix, \(^{>}\!_{r}\!\boldsymbol{W}(d)=diag\lbrace ^{>}\!_{r}\!w_{i}(d) \rbrace \) denotes the survival function of the waiting time distribution under random starting, \(_{r}\!\boldsymbol{C}(x)=\lbrace _{r}\!c_{i,j}(x) \rbrace \) denotes the core matrix of the SMC under random starting, which consists of the elements \( _{r}\!c_{i,j}(x)= {}_{r}{p}_{i,j} \cdot {}_{r}{h}_{i,j}(x)\), \(\boldsymbol{U} = \lbrace u_{i,j} \rbrace \), where \(u_{i,j}=1\) for every \(i,j \in S\), and \(\boldsymbol{1} = [1,1,\cdots , 1]^{T}\).

Proof

Let \( S_x = \underset{x\text {-times}}{\underbrace{i\,i\,i \,i \, \cdots \, i}} \,j \, u \, u \cdots \,u \,i \) be the sequence of states of length d, where \(x = 1,2,...,d\), j denotes any state different from i, and u denotes any state from the state space S. For a given sequence, let us now consider the following mutually exclusive and exhaustive events:

$$\begin{aligned} S_1&= i\,j\,u \,u \, \cdots \,u \,i\\ S_2&= i\,i\,j \,u \,u \cdots \,u\,i\\ S_3&= i\,i\,i \,j\,u\,u \,\cdots \,u\,i\\&\,\,\,\vdots \\ S_{d-2}&= i\,i\ \, \cdots \,i\,j\,u\,i \\ S_{d-1}&= \,i \,i\,i\ \, \cdots \,i\,j\,i \\ S_{d}&= \,i\,i\,i\,i\,i \, \cdots \,i \end{aligned}$$

According to the previous, the semi-Markov chain, with initial observed state i, will be in state i after d positions if either it holds for more than d steps in the initial state, or it makes a transition to a different state j at position x before the end of the cycle, occupying in any case state i in the final position. Thus, using a probabilistic argument and summing over all possible states and holding times, we arrive at the equation \( p_{i}(1,d)=\, ^{>}\!_{r}\!w_{i}(d)+ {\displaystyle \sum _{j\ne i}^{N} \sum _{x=1}^{d}} \,_{r}c_{i,j}(x)q_{j,i}(d-x). \) Let the element of the ith row of the vector \(\boldsymbol{P}(1,d)\) be the probability \(p_{i}(1,d)\). The matrix notation in Eq. 9.2 can immediately be deduced by keeping only the non-diagonal elements, i.e. multiplying by the matrix \([\boldsymbol{U-I}]\). Similarly, concerning Eq. 9.3, let the elements of the vector \(\boldsymbol{P}(n,d)\) be the probabilities \(p_{i}(n,d)\). Hence, in order for the SMC to be in the same state after n successive cycles of length d, we have \( p_{i}(n,d)=\, \big [ ^{>}\!_{r}\!w_{i}(d)+ {\displaystyle \sum _{j\ne i}^{N} \sum _{x=1}^{d}}\,_{r}c_{i,j}(x)q_{j,i}(d-x) \big ]^{n}. \) The matrix form follows immediately from the result above.       \(\square \)
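The computation in Lemma 1 can be sketched numerically. In the sketch below, the random-starting quantities \({}_{r}p\), \({}_{r}h\) are taken equal to the ordinary ones, a simplifying assumption rather than the paper's general setting, and the example chain is hypothetical. With no virtual transitions (\(c_{i,i}(m)=0\)), \(p_i(1,d)\) coincides with the diagonal element \(q_{i,i}(d)\), which the code can verify.

```python
import numpy as np

# Sketch of Lemma 1: p_i(1,d) = >w_i(d) + sum_{j≠i} sum_{x=1}^{d} c_ij(x) q_ji(d-x)
# and p_i(n,d) = p_i(1,d)^n.  Random-starting quantities are approximated by
# the ordinary ones (simplifying assumption); P and H are hypothetical.
P = np.array([[0.0, 0.2, 0.8, 0.0],
              [0.4, 0.0, 0.5, 0.1],
              [0.1, 0.5, 0.0, 0.4],
              [0.3, 0.7, 0.0, 0.0]])
H = {1: np.full((4, 4), 0.7), 2: np.full((4, 4), 0.3)}
M_MAX, N = 2, 4

def core(m):
    return P * H.get(m, np.zeros_like(P))

def w_surv(n):
    tot = np.zeros(N)
    for m in range(n + 1, M_MAX + 1):
        tot += core(m).sum(axis=1)
    return tot

def Q(n):
    """Interval transition probabilities, recursive as in Eq. (9.1)."""
    if n == 0:
        return np.eye(N)
    out = np.diag(w_surv(n))
    for m in range(1, n + 1):
        out = out + core(m) @ Q(n - m)
    return out

def p_one_cycle(d):
    """Vector of p_i(1,d): same state again after d positions."""
    p = w_surv(d).copy()
    for x in range(1, d + 1):
        Cx, Qr = core(x), Q(d - x)
        for i in range(N):
            for j in range(N):
                if j != i:
                    p[i] += Cx[i, j] * Qr[j, i]
    return p

def p_n_cycles(n, d):
    """Vector of p_i(n,d), iterating Eq. (9.3)."""
    return p_one_cycle(d) ** n
```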

Remark 1

For the interval transition probability matrix \(\boldsymbol{Q}(n)\), instead of using the recursive formula 9.1, one can apply the closed analytic form, as proposed by Vassiliou and Papadopoulou [28]

$$\begin{aligned} \begin{array}{l} \boldsymbol{Q}(n) = \,^{>}\boldsymbol{W}(n) + \boldsymbol{C}(n) + {\displaystyle \sum _{j=2}^{n}} \lbrace \boldsymbol{C}(j-1)+{\displaystyle \sum _{k=1}^{j-2}} \boldsymbol{S}_j(k,m_k) \rbrace \\ \times \lbrace ^{>}\boldsymbol{W}(n-j+1)+ \boldsymbol{C}(n-j+1) \rbrace , \end{array} \end{aligned}$$
(9.4)

where \( \boldsymbol{S}_j(k,m_k) = {\displaystyle \sum _{m_k=2}^{j-k}\sum _{m_{k-1}=1+m_k}^{j-k+1} \cdots \sum _{m_1=1+m_2}^{j-1}\prod _{r=-1}^{k-1}} \boldsymbol{C}(m_{k-r-1}-m_{k-r}) \) for \(j\geqslant k+2\), while if \(j\leqslant k+2\) we have \(\boldsymbol{S}_j(k,m_k)=0\).

2.2 The Case of Partial Non Homogeneity

The partial non-homogeneous semi-Markov chain (PNHSMC) is constructed based on the fact that every amino acid consists of three nucleotides (codon). Using this information, we can create three discrete coding positions \(k = \lbrace 1,2,3\rbrace \) and for the PNHSMC, we have three stochastic matrices \(\boldsymbol{P}(k),\; k = 1,2,3\) for the embedded Markov chain. Similar to the homogeneous case, it would be of interest to find the probability for the PNHSMC to be in the same state after a length of d positions and also for n successive cycles of length d.

Definition 2

Let us define the quantity \(p_{i}(k,1,d)\) to be the probability that the PNHSMC will be in state i in position d, given that in the initial position it was observed to be in state i, in coding position k, that is

$$\begin{aligned} p_{i}(k,1,d)=Prob[&\text {the PNHSMC will be in state}\, i\, \text {in position}\, d\, /\\&\text {the initial state was observed to be}\, i\, \text {in coding position}\, k]. \end{aligned}$$

Furthermore, we define the quantity \(p_{i}(k,n,d)\) to be the probability that the PNHSMC will be in state i every d positions for n cycles, given that in the initial position it was observed to be in state i, in coding position k, that is

$$\begin{aligned} \begin{gathered} p_{i}(k,n,d)=Prob[\text {the PNHSMC will be in state}\, i\, \text {every}\, d\, \text {positions}\\ \text {for}\, n\, \text {cycles} /\, \text {the initial state was observed to be}\, i\, \text {in coding position}\, k]. \end{gathered} \end{aligned}$$

Lemma 2

Let \(\boldsymbol{P}(k,1,d)\) and \(\boldsymbol{P}(k,n,d)\) be \((N\times 1)\) vectors, consisting of the probabilities \(p_{i}(k,1,d)\) and \(p_{i}(k,n,d),\; i \in S\) respectively, following Definition 2. Then

$$\begin{aligned} (a)\qquad \boldsymbol{P}(k,1,d)= \Big [ ^{>}\!_{r}\!\boldsymbol{W}(k,d)+{\displaystyle \sum _{x=1}^{d}}\, \boldsymbol{I} \circ \big [ _{r}\!\boldsymbol{C}(k,x) [\boldsymbol{Q}((k+x) \bmod s,\,d-x) \circ \boldsymbol{(U-I)}] \big ] \Big ] \cdot \boldsymbol{1} \end{aligned}$$
(9.5)
$$\begin{aligned} (b)\qquad \boldsymbol{P}(k,n,d)=\boldsymbol{P}(k,n-1,d)\circ \boldsymbol{P}(k,1,d), \end{aligned}$$
(9.6)

where \(^{>}\!_{r}\!\boldsymbol{W}(k,d)=diag\lbrace ^{>}\!_{r}\!w_{i}(k,d) \rbrace \) denotes the survival function of the waiting time distribution of the PNHSMC under random starting, \(_{r}\!\boldsymbol{C}(k,x)=\lbrace _{r}\!c_{i,j}(k,x) \rbrace \) denotes the core matrix of the PNHSMC under random starting, which consists of the elements \( _{r}c_{i,j}(k,x)= {}_{r}{p}_{i,j}(k) \cdot {}_{r}{h}_{i,j}(x)\), and \(\boldsymbol{U} = \lbrace u_{i,j} \rbrace \), where \(u_{i,j}=1\) for every \(i,j \in S\).

Proof

Let

$$ S_x = \underset{x-times}{\underbrace{i_{k}\,i_{k+1}\, \cdots \, i_{k+x-1 \bmod s}}} \,j_{k+x \bmod s} \, u_{k+x+1 \bmod s} \cdots \,u_{k+d-1 \bmod s} \,i_{k+d \bmod s} $$

be the sequence of states of length d, where \(x = 1,2,...,d\), j denotes any state different from \(i, \;\) u denotes any state from the state space \(S, \;\) k denotes the coding position and s denotes the total number of different coding positions. For a given sequence, let us define the following mutually exclusive and exhaustive events:

$$\begin{aligned} S_1&= i_{k} \, j_{k+1} \,u_{k+2} \,u_{k+3} \,u_{k+4} \, \cdots \,u_{k+d-1 \bmod s} \,i_{k+d \bmod s}\\ S_2&= i_{k} \,i_{k+1} \,j_{k+2} \,u_{k+3} \,u_{k+4} \cdots \,u_{k+d-1\bmod s} \,i_{k+d\bmod s} \\ S_3&= i_{k} \,i_{k+1}\,i_{k+2} \,j_{k+3}\,u_{k+4}\,\cdots \,u_{k+d-1\bmod s} \,i_{k+d\bmod s}\\&\,\,\,\vdots \\ S_{d-2}&= i_{k}\,i_{k+1}\,i_{k+2} \, \cdots \,j_{k+d-2\bmod s} \,u_{k+d-1\bmod s}\,i_{k+d\bmod s} \\ S_{d-1}&= \,i_{k} \,i_{k+1}\,i_{k+2}\,i_{k+3} \, \cdots \,j_{k+d-1\bmod s}\,i_{k+d\bmod s} \\ S_{d}&= \,i_{k}\,i_{k+1}\,i_{k+2}\,i_{k+3}\,i_{k+4}\,i_{k+5} \, \cdots \,i_{k+d\bmod s}. \end{aligned}$$

The PNHSMC, with initial observed state i in coding position k, will be in state i after d positions if either it holds for more than d positions in the initial state, or it moves to a different state j at position \((k+x) \bmod s\) before the end of the cycle, occupying in any case state i in the final position. Thus, using a probabilistic argument and summing over all possible states and holding positions, we obtain

$$\begin{aligned} p_{i}(k,1,d)=\,^{>}\!_{r}\!w_{i}(k,d)+ \sum _{j\ne i}^{N} \sum _{x=1}^{d}\,_{r}\!c_{i,j}(k,x)q_{j,i}((k+x) \bmod s,d-x). \end{aligned}$$

Let the element of the ith row of the vector \(\boldsymbol{P}(k,1,d)\) be the probability \(p_{i}(k,1,d)\). The matrix notation in Eq. (9.5) can be deduced immediately by multiplying with the matrix \([\boldsymbol{U-I}]\). Similarly, concerning Eq. (9.6), let the elements of the vector \(\boldsymbol{P}(k,n,d)\) be the probabilities \(p_{i}(k,n,d)\). In order for the PNHSMC to be in the same state after n successive cycles of length d, we have

$$\begin{aligned} p_{i}(k,n,d)=\, \big [ ^{>}\!_{r}\!w_{i}(k,d)+ \sum _{j\ne i}^{N} \sum _{x=1}^{d}\,_{r}c_{i,j}(k,x)q_{j,i}((k+x) \bmod s,\, d-x) \big ]^{n}. \end{aligned}$$

The matrix form in (9.6) is deduced immediately by applying the Hadamard product over n matrices of the form \(\boldsymbol{P}(k,1,d)\).       \(\square \)
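The non-homogeneous computation differs from the homogeneous one only in the cycling of the coding position modulo s. The sketch below uses one embedded matrix per coding position \(k = 0, 1, 2\) (shifted to 0-based indexing), holding times shared across k for brevity, and the random-starting quantities approximated by the ordinary ones; all numerical values are hypothetical.

```python
import numpy as np

# Sketch of Lemma 2 under partial non-homogeneity: one embedded matrix P(k)
# per coding position k = 0, 1, 2.  Simplifying assumptions: holding times
# are identical across k, and random-starting quantities equal ordinary
# ones.  All numerical values are hypothetical.
N, s, M_MAX = 4, 3, 2
rng = np.random.default_rng(0)

def random_embedded():
    M = rng.random((N, N))
    np.fill_diagonal(M, 0.0)                  # no virtual transitions
    return M / M.sum(axis=1, keepdims=True)

P = [random_embedded() for _ in range(s)]     # P(k), k = 0, 1, 2
H = {1: 0.7, 2: 0.3}                          # h_ij(m), same for all i, j

def core(k, m):
    return P[k] * H.get(m, 0.0)

def w_surv(k, n):
    tot = np.zeros(N)
    for m in range(n + 1, M_MAX + 1):
        tot += core(k, m).sum(axis=1)
    return tot

def Q(k, n):
    """Q(k,n): interval transitions when state was entered at coding position k."""
    if n == 0:
        return np.eye(N)
    out = np.diag(w_surv(k, n))
    for m in range(1, n + 1):
        out = out + core(k, m) @ Q((k + m) % s, n - m)
    return out

def p_one_cycle(k, d):
    """p_i(k,1,d): same state after d positions, starting at coding position k."""
    p = w_surv(k, d).copy()
    for x in range(1, d + 1):
        Cx, Qr = core(k, x), Q((k + x) % s, d - x)
        for i in range(N):
            for j in range(N):
                if j != i:
                    p[i] += Cx[i, j] * Qr[j, i]
    return p
```

Since \(c_{i,i}(k,m)=0\) (no virtual transitions), \(p_i(k,1,d)\) again coincides with the diagonal of \(\boldsymbol{Q}(k,d)\), a convenient consistency check.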

Remark 2

For the interval transition probability matrix \(\boldsymbol{Q}(k,n)\), instead of using the recursive formula, we can apply the closed analytic form [28]

$$\begin{aligned} \begin{gathered} \boldsymbol{Q}(k,n) = \,^{>}\boldsymbol{W}(k,n) + \boldsymbol{C}(k,n) + \sum _{j=2}^{n} \lbrace \boldsymbol{C}(k,j-1)+\sum _{x=1}^{j-2}\boldsymbol{S}_j(x,k,m_x) \rbrace \\ \times \lbrace ^{>}\boldsymbol{W}(k+j-1,n-j+1)+ \boldsymbol{C}(k+j-1,n-j+1) \rbrace , \end{gathered} \end{aligned}$$
(9.7)

where \( \boldsymbol{S}_j(x,k,m_x) = {\displaystyle \sum _{m_x=2}^{j-x}\sum _{m_{x-1}=1+m_x}^{j-x+1} \cdots \sum _{m_1=1+m_2}^{j-1}\prod _{r=-1}^{x-1}} \boldsymbol{C}(k+m_{x-r}-1,m_{x-r-1}-m_{x-r}) \) for \(j\geqslant x+2\), while if \(j\leqslant x+2\) we have \(\boldsymbol{S}_j(x,k,m_x)=0\).

3 Quasiperiodicity

The previous results, for both the homogeneous and the non-homogeneous case, correspond to the probability of a state i reappearing after d positions and for n successive cycles. However, for the model to be more coherent, we also have to include the event that the periodicity is not strict and state i does not appear exactly after d positions, but in the interval \((d-\varepsilon , d+\varepsilon )\). Also, we are interested in the quasiperiodic behaviour of the SMC, not only for a cycle of length d, but also for a number of n successive cycles. For simplicity we assume that \(\varepsilon =1\), although the results for \(\varepsilon >1\) are straightforward. For this purpose, let us define the entrance probabilities under random starting \(_{r}\!e_{i,j}(n)\), which are the probabilities that the SMC will enter state j at position n, given that, in the initial position, the SMC was observed to be in state i [15]. The equation for calculating these probabilities is \( _{r} e_{i,j}(n)=\delta _{i,j} \delta (n)+{\displaystyle \sum _{l \in S} \sum _{m=0}^{n}} \; _{r} c_{i,l}(m)\, e_{l,j}(n-m). \) Furthermore, let us define the first passage time probabilities \(f_{i,j}(n)\), which are the probabilities that the SMC will enter state j for the first time after n positions, given that it had entered state i in the initial position [15]. The recursive formula for the probabilities \(f_{i,j}(n)\) is given by \( f_{i,j}(n)={\displaystyle \sum _{r \ne j}^{N} \sum _{m=0}^{n}} p_{i,r} h_{i, r}(m) f_{r,j}(n-m)+p_{i,j} h_{i,j}(n). \)
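Both recursions are easy to implement once the core matrix is available. In the sketch below the random-starting core matrix is approximated by the ordinary one, and the example chain is hypothetical; memoization keeps the recursions tractable for large n.

```python
import numpy as np
from functools import lru_cache

# Sketch of the two recursions used for quasiperiodicity: the entrance
# probabilities e_ij(n) and the first passage probabilities f_ij(n).
# Simplifying assumption: random-starting core matrix = ordinary core
# matrix.  The example chain (P, H) is hypothetical.
P = np.array([[0.0, 0.2, 0.8, 0.0],
              [0.4, 0.0, 0.5, 0.1],
              [0.1, 0.5, 0.0, 0.4],
              [0.3, 0.7, 0.0, 0.0]])
H = {1: np.full((4, 4), 0.7), 2: np.full((4, 4), 0.3)}
N = 4

def core(m):
    """C(m) = P ∘ H(m)."""
    return P * H.get(m, np.zeros_like(P))

@lru_cache(maxsize=None)
def entrance(n):
    """E(n): e_ij(n) = δ_ij δ(n) + sum_l sum_{m=1}^{n} c_il(m) e_lj(n-m)."""
    if n == 0:
        return np.eye(N)
    E = np.zeros((N, N))
    for m in range(1, n + 1):
        E = E + core(m) @ entrance(n - m)
    return E

@lru_cache(maxsize=None)
def first_passage(n):
    """F(n): f_ij(n) = sum_{r≠j} sum_{m=1}^{n-1} c_ir(m) f_rj(n-m) + c_ij(n)."""
    F = core(n).copy()
    for m in range(1, n):
        Cm, Fn = core(m), first_passage(n - m)
        for j in range(N):
            for r in range(N):
                if r != j:
                    F[:, j] = F[:, j] + Cm[:, r] * Fn[r, j]
    return F
```

For an irreducible chain, the cumulative first passage probabilities \(\sum _{n} f_{i,j}(n)\) approach one for every pair, which is a useful sanity check.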

Definition 3

Let us define the quantity \(_{\varepsilon }p_{i}(1,d)\), assuming \(\varepsilon = 1\), to be the probability that the SMC will be in state i at least once in the position interval \(d\pm \varepsilon \), given that in the initial position the SMC was observed to be in state i. Also, let us define \(_{\varepsilon }p_{i}(n,d)\) to be the probability that the SMC will be in state i in one of the positions \(d-1\), d, \(d+1\) of each cycle, for n successive cycles, that is

$$\begin{aligned} (a)\qquad \begin{array}{l} _{\varepsilon }p_{i}(1,d)=Prob[ \text {the SMC will be in state}\, i\, \text {either in position}\, d-1, \\ \quad \text {or}\, d,\, \text {or}\, d+1\, /\, \text {the initial state was observed to be}\, i] \end{array} \end{aligned}$$
(9.8)
$$\begin{aligned} (b)\qquad \begin{array}{l} _{\varepsilon }p_{i}(n,d)= Prob[\text {the SMC will be in state}\, i\, \text {either in position}\, d-1, \\ \quad \text {or}\, d,\, \text {or} \,d+1,\, \text {for}\, n\, \text {cycles} /\, \text {the initial state was observed to be}\, i] \end{array} \end{aligned}$$
(9.9)

Theorem 1

Let \(_{\varepsilon }\boldsymbol{P}(1,d)\) and \(_{\varepsilon }\boldsymbol{P}(n,d)\) be \((N\times 1)\) vectors, consisting of the probabilities \(_{\varepsilon }p_{i}(1,d)\) and \(_{\varepsilon }p_{i}(n,d), i \in S\) respectively, following Definition 3. Then

$$\begin{aligned} (a)\qquad \begin{array}{l} _{\varepsilon }\boldsymbol{P}(1,d) = \boldsymbol{P}(1,d-1)+\\ \Big [ {\displaystyle \sum _{m=1}^{d-1}} \, \boldsymbol{I}\circ \Big [_{r}\boldsymbol{E}(m) \big [[\boldsymbol{F}(d-m)+\boldsymbol{F}(d+1-m)] \circ \boldsymbol{(U-I)}\big ] \Big ] \cdot \boldsymbol{1} \Big ] \end{array} \end{aligned}$$
(9.10)
$$\begin{aligned} (b)\qquad \begin{array}{l} _{\varepsilon }\boldsymbol{P}(n,d) = {}_{\varepsilon }\boldsymbol{P}(n-1,d) \circ \\ \Big [ \boldsymbol{P}(1,d-1)+\Big [ {\displaystyle \sum _{m=1}^{d-1}}\, \boldsymbol{I}\circ \Big [_{r}\boldsymbol{E}(m) \big [[\boldsymbol{F}(d-m)+\boldsymbol{F}(d+1-m)] \circ \boldsymbol{(U-I)}\big ] \Big ] \cdot \boldsymbol{1} \Big ] \Big ], \end{array} \end{aligned}$$
(9.11)

where \(_{r}\!\boldsymbol{E}(\cdot ) = \lbrace _{r}e_{i,j}(\cdot ) \rbrace \) is the matrix which consists of the entrance probabilities under random starting and \(\boldsymbol{F}(\cdot ) = \lbrace f_{i,j}(\cdot ) \rbrace \) is the matrix with the first passage time probabilities.

Proof

Let us define the events \(A_0\), \(A_1\), \(A_2\) as

$$\begin{aligned} A_0 =&\, [\text {the SMC is in state}\, i \, \text {in position}\, d-1\, /\, \text {the initial state was}\\ &\;\;\text {observed to be}\, i\,]. \\ A_1=&\, [\text {the SMC is in state}\, i \, \text {in position}\, d\, \text {and in state}\, r\ne i\, \text {in position}\, d-1\, /\\ &\;\;\text {the initial state was observed to be}\, i\,]. \\ A_2=&\, [\text {the SMC is in state}\, i \, \text {in position}\, d+1\, \text {and in state}\, r \ne i\, \text {in positions}\\ &\;\; d-1\, \text {and}\, d\, /\, \text {the initial state was observed to be}\, i\,]. \end{aligned}$$

Schematically, we can visualize the events defined above, as the following sequences

$$\begin{aligned} A_0&= i\,u\,u \,u \, \cdots \,u \,i\\ A_1&= i\,u\,u \,u \, \cdots \,u\,r \,i\\ A_2&= \underset{d-1}{\underbrace{i\,u\,u \,u \, \cdots \,u\,r}} \,r \,i \; , \end{aligned}$$

where u denotes any state from state space S and r denotes a state different from i. It is obvious that the events are mutually exclusive, therefore \(Prob[A_0\cup A_1 \cup A_2] = Prob[A_0] + Prob[A_1] + Prob[A_2]\). The probability for the event \(A_0\) is defined as

$$\begin{aligned} Prob[A_0]=p_{i}(1,d-1)=Prob[&\text {the SMC will be in state}\, i \, \text {in position}\, d-1\, / \\ &\text {the initial state was observed to be}\, i ]. \end{aligned}$$

For the event \(A_{1}\) to happen, the SMC must be in a state \(r \ne i\) in position \(d-1\) and transition to state i in position d. Therefore, the SMC could have entered state \(r \ne i\) at a position \(m \le d-1\) and then transitioned to state i for the first time after the remaining \(d-m\) positions. Using a probabilistic argument and summing over all the different positions and states, we deduce the equation \( Prob[A_1]={\displaystyle \sum _{r\ne i}^{N} \sum _{m=0}^{d-1}}\,_{r} e_{i,r}(m)f_{r,i}(d-m). \) Similarly, we deduce the probability of the event \(A_2\), \( Prob[A_2]={\displaystyle \sum _{r\ne i}^{N} \sum _{m=0}^{d-1}} \,_{r} e_{i,r}(m)f_{r,i}(d+1-m). \) For the sum of the probabilities of the three events we derive the following expression

$$\begin{aligned} _{\varepsilon }p_{i}(1,d)&= Prob[A_0] + Prob[A_1] + Prob[A_2] \\&=p_{i}(1,d-1)+ \sum _{r\ne i}^{N} \sum _{m=0}^{d-1}\,_{r} e_{i,r}(m)f_{r,i}(d-m) + \sum _{r\ne i}^{N} \sum _{m=0}^{d-1}\,_{r}e_{i,r}(m)f_{r,i}(d+1-m)\\&=p_{i}(1,d-1)+ \sum _{r\ne i}^{N} \sum _{m=0}^{d-1}\,_{r} e_{i,r}(m)[f_{r,i}(d-m)+f_{r,i}(d+1-m)]. \end{aligned}$$

This expression can be written in matrix form as

$$\begin{aligned} _{\varepsilon }\boldsymbol{P}(1,d) =&\, \boldsymbol{P}(1,d-1) + \\&\Big [ \sum _{m=1}^{d-1}\, \boldsymbol{I}\circ \Big [_{r}\boldsymbol{E}(m)\big [ [\boldsymbol{F}(d-m)+\boldsymbol{F}(d+1-m)] \circ \boldsymbol{(U-I)} \big ] \Big ] \cdot \boldsymbol{1} \Big ]. \end{aligned}$$

Finally, by applying Lemmas 1 and 2, we derive the corresponding equations for the probabilities \(_{\varepsilon }p_{i}(n,d)\), which in matrix notation are as follows

$$\begin{aligned} _{\varepsilon }\boldsymbol{P}(n,d) =&\, {}_{\varepsilon }\boldsymbol{P}(n-1,d) \circ \\&\Big [\boldsymbol{P}(1,d-1)+\sum _{m=1}^{d-1}\, \boldsymbol{I}\circ \Big [_{r}\boldsymbol{E}(m)\big [[\boldsymbol{F}(d-m)+\boldsymbol{F}(d+1-m)] \circ \boldsymbol{(U-I)}\big ] \Big ] \boldsymbol{1} \Big ]. \end{aligned}$$

      \(\square \)
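The quasiperiodicity probabilities can also be checked by simulation. The sketch below simulates a hypothetical semi-Markov chain (embedded matrix P, holding times of 1 or 2 positions; all values are illustrative assumptions) and estimates the probability that, starting in state i, the chain occupies state i in at least one of the positions \(d-\varepsilon, \ldots, d+\varepsilon \).

```python
import random
random.seed(42)

# Monte Carlo sanity check for the quasiperiodicity probabilities, under a
# hypothetical semi-Markov chain: embedded matrix P (rows are assumptions)
# and holding times of 1 position (prob. 0.7) or 2 positions (prob. 0.3).
STATES = "ACGT"
P = {"A": [0.0, 0.2, 0.8, 0.0],
     "C": [0.4, 0.0, 0.5, 0.1],
     "G": [0.1, 0.5, 0.0, 0.4],
     "T": [0.3, 0.7, 0.0, 0.0]}
HOLD = [1] * 7 + [2] * 3   # holding time: 1 w.p. 0.7, 2 w.p. 0.3

def simulate(start, length):
    """One realization of the SMC, emitted position by position."""
    path, state = [], start
    while len(path) < length:
        path.extend([state] * random.choice(HOLD))   # hold in the state
        state = random.choices(STATES, weights=P[state])[0]
    return path[:length]

def quasi_prob(i, d, eps=1, trials=20000):
    """Estimate of the probability that state i recurs within d ± eps."""
    hits = 0
    for _ in range(trials):
        path = simulate(i, d + eps + 1)
        if i in path[d - eps:d + eps + 1]:
            hits += 1
    return hits / trials
```

By construction, the estimate with \(\varepsilon =1\) dominates the strict-period estimate (\(\varepsilon =0\)), mirroring the fact that \(_{\varepsilon }p_{i}(1,d) \ge p_{i}(1,d)\).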

4 Illustrations of Real and Synthetic Data

For the illustrations of the homogeneous semi-Markov model, synthetic DNA sequences as well as real genomic and mRNA sequences were used. The coding sequence used was human dystrophin mRNA and the non-coding sequence used for comparison was the human β-nerve growth factor (BNGF) gene. These sequences have already been examined using spectral density analysis by Tsonis [27]. We assumed that each of the sequences can be described by a homogeneous semi-Markov chain \(\lbrace X_t \rbrace _{t=0}^{\infty }\), with state space \(S = \lbrace A,C,G,T\rbrace \), where the index t denotes the position of each nucleotide inside the sequence. The basic parameters \(p_{i,j}\) and \(h_{i,j}(m)\) of the SMC were estimated using the empirical estimators \(\widehat{p}_{i,j}(k) = {\displaystyle \frac{N(i(k)\rightarrow j)}{{\displaystyle \sum _{x\in S}}N(i(k)\rightarrow x)}}\) and \(\widehat{h}_{i,j}(m) = {\displaystyle \frac{N(i\rightarrow j,m)}{{\displaystyle \sum _{x\in S}} N(i\rightarrow x,m)}}\), where \(N(i(k)\rightarrow j)\) denotes the number of transitions from state i to state j starting from coding position k, and \(N(i\rightarrow j,m)\) denotes the number of transitions from state i to state j after the SMC remained in state i for m positions. In order to estimate the initial condition, i.e. the probabilities of the vector \(\boldsymbol{P}(1,d)\), the first 10 cycles of length 3 have been used and the basic parameters \(\boldsymbol{P}\) and \(\boldsymbol{H}(m)\) have been estimated. After that, for each cycle n, the core matrix \(\boldsymbol{C}(m)\) has been estimated using the letters of the sequence up to position \( 30 + n \cdot d \). This procedure corrects the estimations, since in the current application the length of each period is small (\(d=3\)), resulting in an inadequate sample size for each cycle. 
However, if we were interested in examining the periodic behaviour for larger periods, this correction procedure would not be necessary. Finally, the probability for the chain to be in the same state every \(n\cdot d\) positions has been calculated using the recursive equation for \(\boldsymbol{P}(n,d)\). Let us define the ratio \( \boldsymbol{R}(n)= \big [ [\boldsymbol{P}(n-1,d)\boldsymbol{1}] \circ \boldsymbol{I} \big ]^{-1} \cdot \boldsymbol{P}(n,d), \) where \(\boldsymbol{1} = [ 1,1, ..., 1]\). The quantity \(\boldsymbol{R}(n)\) is a \((N \times 1)\) vector whose ith element is the ratio of the probability \(p_{i}(n,d)\) over \(p_{i}(n-1,d)\) for every n; it illustrates the variations between the probabilities \(p_{i}(n,d)\) and \(p_{i}(n-1,d)\), in order to investigate the periodicity over a number of cycles. It is obvious that the probabilities \(p_{i}(n,d)\) converge to zero, as they are a product of n probabilities. The most important quantities in the periodicity investigation are the initial probability vector \(\boldsymbol{P}(1,d)\), which contains the probabilities for the chain to be in the same state after d positions, and the ratio \(\boldsymbol{R}(n)\), which measures the relationship between the probabilities of the current cycle and the previous one using the correction procedure. For higher values of \(\boldsymbol{R}(n)\), the probabilities \(p_{i}(n,d)\) decrease at a slower rate, while for lower values of \(\boldsymbol{R}(n)\), the probabilities \(p_{i}(n,d)\) converge to zero faster.
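The empirical estimators above amount to counting embedded transitions and holding times after run-length encoding the sequence (so that subsequent repeats of a letter count as holding, not as virtual transitions). A minimal sketch, ignoring the coding-position split for brevity:

```python
import numpy as np

# Empirical estimators P-hat and H-hat(m) from one realization of the
# sequence, obtained by run-length encoding (repeats = holding, so the
# estimated embedded chain has a zero diagonal).  The coding-position
# argument k is omitted here for brevity.
STATES = "ACGT"
IDX = {s: i for i, s in enumerate(STATES)}

def estimate(seq):
    """Return (P_hat, H_hat) where H_hat[m, i, j] estimates h_ij(m)."""
    # run-length encode the sequence into (state, holding length) pairs
    runs, cur, length = [], seq[0], 1
    for ch in seq[1:]:
        if ch == cur:
            length += 1
        else:
            runs.append((cur, length))
            cur, length = ch, 1
    runs.append((cur, length))

    n = len(STATES)
    m_max = max(m for _, m in runs)
    trans = np.zeros((n, n))
    hold = np.zeros((m_max + 1, n, n))
    for (a, m), (b, _) in zip(runs, runs[1:]):
        trans[IDX[a], IDX[b]] += 1       # count N(i -> j)
        hold[m, IDX[a], IDX[b]] += 1     # count N(i -> j, m)
    P_hat = trans / np.maximum(trans.sum(axis=1, keepdims=True), 1)
    H_hat = hold / np.maximum(hold.sum(axis=0), 1)   # normalise over m
    return P_hat, H_hat
```

Note that H_hat here is normalised over the holding time m for each pair (i, j), one common convention; the paper's displayed estimator normalises over the target state instead, so this is an illustrative variant rather than a verbatim transcription.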

4.1 DNA Sequences of Synthetic Data

Example 1

(Comparison between random and periodic DNA sequences) Let L be a DNA sequence of length \(N=1000\) of the form \(L=\lbrace U,U,U,...,U\rbrace \), where the letter U corresponds to any nucleotide drawn from a uniform distribution. Thus,

$$\begin{aligned} Prob[U = A]=Prob[U=C]=Prob[U=G]=Prob[U=T]= 1/4. \end{aligned}$$

This kind of sequence should not exhibit any periodic behaviour; nevertheless, the probability matrix \(\boldsymbol{P}(n,d)\), for \(d=3\), is estimated for comparison. The estimate of the embedded Markov matrix is \( \boldsymbol{P}= {\scriptstyle \begin{pmatrix} 0 & 0.2 & 0.8 & 0 \\ 0.375 & 0 & 0.5 & 0.125 \\ 0.125 & 0.5 & 0 & 0.375 \\ 0.25 & 0.75 & 0 & 0 \end{pmatrix} }\) and the core matrix \(\boldsymbol{C}(m)\) is given by \( \boldsymbol{C}(1)= {\scriptstyle \begin{pmatrix} 0 & 0 & 0.8 & 0 \\ 0.375 & 0 & 0.5 & 0.125 \\ 0.125 & 0.375 & 0 & 0.375 \\ 0.25 & 0.5 & 0 & 0 \end{pmatrix} } ,\) \( \boldsymbol{C}(2)= {\scriptstyle \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0.125 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} }\) while the only non-zero element of \(\boldsymbol{C}(3)\) is \(c_{4,2}(3) = 0.25\). The initial condition is \( \boldsymbol{P}(1,3)= {\scriptstyle \begin{pmatrix} 0.32\\ 0.34\\ 0.42\\ 0.27 \end{pmatrix}.} \)
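Generating such a uniform sequence and estimating its embedded Markov matrix can be sketched as follows (illustrative code with hypothetical helper names, not the study's implementation; the jump chain counts transitions between *different* letters only, so the diagonal is zero by construction):

```python
import random
import numpy as np

def embedded_matrix(seq, states="ACGT"):
    """Maximum-likelihood estimate of the embedded Markov matrix of the
    jump chain: count transitions between different letters, then
    normalize each row."""
    idx = {s: k for k, s in enumerate(states)}
    counts = np.zeros((len(states), len(states)))
    prev = seq[0]
    for ch in seq[1:]:
        if ch != prev:                      # embedded chain moves only on a jump
            counts[idx[prev], idx[ch]] += 1
            prev = ch
    return counts / counts.sum(axis=1, keepdims=True)

# uniform synthetic sequence, as in Example 1 (seed chosen arbitrarily)
rng = random.Random(1)
seq = "".join(rng.choice("ACGT") for _ in range(1000))
P = embedded_matrix(seq)
```

Each row of `P` sums to one and the diagonal is identically zero, as in the embedded matrices reported above.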

Figure 9.1 visualizes the ratio \(\boldsymbol{R}(n)\) for the whole sequence. We observe that, as expected, there exists no clear tendency for any state to exhibit stronger periodic behaviour compared to the other states.

Fig. 9.1

R(n) for the synthetic DNA sequence of a uniform distribution

Now, let L be a DNA sequence of length \(N=1000\) of the form: \( L=\lbrace A,U,U,A,U,U,...\rbrace , \) where the letter A corresponds to adenine and the letter U corresponds to any nucleotide from a uniform distribution, therefore

$$\begin{aligned} Prob[U = A]=Prob[U=C]=Prob[U=G]=Prob[U=T]= 1/4. \end{aligned}$$

We will investigate the periodic behaviour of period \(d=3\). One can notice that the letter A can have a non-zero waiting time probability \(w_{A}(m)\) for every m. On the other hand, for the other three letters C, G and T, the waiting time probabilities are zero whenever m exceeds two, since in every window of three letters the letter A appears at least once. The estimated embedded Markov transition matrix is \( \boldsymbol{P} = {\scriptstyle \begin{pmatrix} 0 & 0.30 & 0.30 & 0.40 \\ 0.73 & 0 & 0.15 & 0.12 \\ 0.69 & 0.17 & 0 & 0.14 \\ 0.70 & 0.14 & 0.16 & 0 \end{pmatrix} } \) and the core matrix is \( \boldsymbol{C}(1)= {\scriptstyle \begin{pmatrix} 0 & 0.19 & 0.16 & 0.27 \\ 0.60 & 0 & 0.15 & 0.13 \\ 0.56 & 0.17 & 0 & 0.15 \\ 0.50 & 0.14 & 0.16 & 0 \end{pmatrix} } \), \( \boldsymbol{C}(2)= {\scriptstyle \begin{pmatrix} 0 & 0.08 & 0.11 & 0.09 \\ 0.13 & 0 & 0 & 0 \\ 0.13 & 0 & 0 & 0 \\ 0.20 & 0 & 0 & 0 \end{pmatrix} } \), while the matrices \(\boldsymbol{C}(m)\) for \(m>2\) have non-zero elements only in the first row, that is, for the letter A. The initial condition is \( \boldsymbol{P}(1,3)= {\scriptstyle \begin{pmatrix} 0.83\\ 0.18\\ 0.20\\ 0.25 \end{pmatrix}}. \) The probability for the chain to be in state A every \(d=3\) positions, starting from state A, is greater than for the other three states, as expected. This is also confirmed by the ratio, presented in Fig. 9.2, which shows that state A exhibits higher values compared to the other states.
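The construction of this sequence, together with a check of the waiting-time property noted above, can be sketched as follows (illustrative code, not the authors' implementation; the function name and seed are ours):

```python
import random
import re

def periodic_sequence(n, d=3, seed=0):
    """Synthetic sequence {A, U, U, A, U, U, ...}: the letter A is planted
    at every d-th position and U is drawn uniformly from {A, C, G, T}."""
    rng = random.Random(seed)
    return "".join("A" if t % d == 0 else rng.choice("ACGT")
                   for t in range(n))

seq = periodic_sequence(1000)
# A appears at least once in every window of 3 consecutive letters,
# so the sojourn times of C, G and T can never exceed 2
longest_non_A_run = max((len(r) for r in re.findall(r"[CGT]+", seq)),
                        default=0)
```

The planted A at every third position caps every run of the letters C, G, T at length two, which is exactly why \(w_{C}(m)=w_{G}(m)=w_{T}(m)=0\) for \(m>2\).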

Fig. 9.2

R(n) for the synthetic DNA sequence with 3-base periodicity of adenine

Example 2

(Detection of periodic regions inside a sequence) Let L be a DNA sequence of length \(N=5000\) of the form \(L=\lbrace U,U,U,...,U\rbrace \), where the letter U corresponds to a nucleotide drawn uniformly at random. In the position intervals 1500–2000 and 3000–3500, which correspond to the 3-base cycles 500–666 and 1000–1166 respectively, the letter U has been substituted with the letter A, starting from the first position of each interval and at every 3 positions thereafter. Figure 9.3 shows the values of the ratio R(n) for the letter A, where the green regions mark the 3-base cycles in which R(n) is increasing, while the red regions mark the cycles in which R(n) decreases. We observe that the regions in which the periodic behaviour of the letter A has been synthetically added exhibit an increasing ratio R(n) for A, indicating stronger periodic behaviour of A in these regions.
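The construction of this example can be sketched as follows (illustrative code; the helper name and seed are ours). The fractions computed at the end show why the in-phase positions inside the planted regions favour the letter A:

```python
import random

def sequence_with_periodic_regions(n=5000,
                                   regions=((1500, 2000), (3000, 3500)),
                                   d=3, seed=0):
    """Uniform random sequence in which the letter A is planted at every
    d-th position inside the given half-open position intervals."""
    rng = random.Random(seed)
    seq = [rng.choice("ACGT") for _ in range(n)]
    for start, end in regions:
        for t in range(start, end, d):
            seq[t] = "A"
    return "".join(seq)

seq = sequence_with_periodic_regions()
# fraction of in-phase positions occupied by A, inside vs outside a region
inside = range(1500, 2000, 3)
outside = range(0, 1500, 3)
frac_in = sum(seq[t] == "A" for t in inside) / len(inside)
frac_out = sum(seq[t] == "A" for t in outside) / len(outside)
```

Inside a planted region every in-phase position is A, while outside the regions roughly a quarter of them are, which is the contrast the ratio R(n) picks up in Fig. 9.3.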

Fig. 9.3

R(n) of the letter A of the synthetic sequence with periodicity in the cycles 500-666 and 1000-1166

4.2 DNA Sequences of Real Data

The information about the periodic behaviour of the coding regions of the genome could possibly be used to distinguish these regions within a DNA sequence of great length. For the coding sequence of real DNA, the human dystrophin mRNA has been used, while for the non-coding region, the human b-nerve growth factor gene has been used. These sequences have lengths greater than 5000 bases and have already been studied for periodic behaviour [27]. One can notice from Fig. 9.4 that, for the human dystrophin mRNA sequence, the nucleotide A has a higher probability of reappearing every 3 positions, while the other nucleotides behave almost identically. The ratio for the nucleotide A is higher compared to the other three states for the human dystrophin mRNA sequence, indicating stronger periodic behaviour for adenine. In contrast, Fig. 9.5 indicates that for the human b-nerve growth factor gene, which consists of more than \(90\%\) intronic sequence, the results are similar to those of the random sequence created in the first example.

Fig. 9.4

R(n) for the human dystrophin mRNA sequence

Fig. 9.5

R(n) for the human b-nerve growth factor sequence

5 Conclusion

In the present paper, a method is developed to investigate attributes related to the periodicity of DNA sequences. The applied model is a semi-Markov chain with a discrete and finite state space and discrete time, where the elements of the state space are the four nucleotides, i.e. \(S = \lbrace A,C,G,T \rbrace \), and time denotes the index position in the sequence. The purpose of the model is to describe the periodic behaviour of a given DNA sequence, which could possibly discriminate between coding and non-coding regions. It is known that the coding regions of the genome have a different structure from the non-coding regions, as they exhibit a characteristic tendency of certain nucleotides to repeat every 3 bases. Based on this fact, and by modelling a DNA sequence as a semi-Markov chain, a recursive equation is constructed that can be used as an identification tool for regions with strong or weak d-periodic behaviour. The corresponding probabilities are calculated in closed analytic form in terms of the basic parameters of the model. The theoretical results are also generalized to the non-homogeneous case, accounting for the triplet nature of DNA by assuming that each coding position corresponds to a different transition matrix \(\boldsymbol{P}(k)\). In addition, the case of quasiperiodicity of a state is examined. This extension accounts for the fact that small perturbations in the cycle of the period may appear, such as a shift of the position of a letter due to genetic mutations, which cause the chain to lose its periodic behaviour for a number of cycles. In that case, the state reappears not exactly after a period of d positions, but within a radius of \(d \pm \epsilon \) positions.
The numerical results from applying the model to real data confirm previous studies: periodic behaviour is a characteristic of the coding segments, whereas the non-coding segments did not show similar behaviour. For the estimation of the parameters, a correction procedure was applied, due to the short duration of the period (\(d =3\)) in this specific application. The approach could potentially serve as an initial method for investigating periodicity in any DNA sequence, and it could also be used to distinguish two DNA segments in terms of their periodic behaviour. Although the examples produced satisfactory results, they should be interpreted with caution, due to the complexity of the structure of DNA and its various peculiarities. For example, additional parameters could be included in the model, such as the sequence length, the frequencies of each nucleotide, the open reading frames (ORFs), the target organism, and specific mutations.