1 Introduction

Stochastic models with a background continuous time Markov chain (CTMC) are widely used in stochastic modeling. Phase type (PH) distributions and Markovian arrival processes (MAPs) exemplify the flexibility and the ease of application of such models. In this work we deal with terminating stochastic processes [1]. Phase type distributions are defined by a terminating (also referred to as transient) background Markov chain, but they generate exactly one event. A transient Markovian arrival process (TMAP) is a point process with a finite number of possibly correlated inter-event times, governed by a terminating background Markov chain [8]. Basic properties of TMAPs, such as the distribution of the number of generated arrivals or the time until the last arrival, are presented in [8]; further properties and a moments based characterization are discussed in [6]. TMAPs can be used in a wide range of application fields, from traffic modeling of computer systems to risk analysis and population dynamics in biological systems. For instance, TMAPs are applied to the modeling of women’s lifetimes in several countries in [5].

In this work we consider the parameter estimation of TMAPs from experimental data sets based on the EM method. The EM method has been used successfully for the parameter estimation of several models with background Markov chains, e.g., for PH distributions [3], for PH distributions with structural restriction [10], for MAPs [4], and for MAPs with structural restrictions [7, 9]. The experience from these previous research results indicates that the inherent redundancy of stochastic models with background Markov chains makes the parameter estimation of the general models inefficient. In this work we therefore do not implement the EM based estimation of general TMAPs, but immediately apply a structural restriction similar to the one which turned out to be efficient in the case of PH distributions [10] and MAPs [7, 9]. The formulas of the EM method for TMAP fitting are similar to the ones for MAP fitting in [7], but the handling of the termination of the background process involves intricate details which require a non-trivial reconsideration of the expectation and maximization steps of the method.

Apart from the algorithmic description of the EM method for TMAP fitting, we pay attention to efficient implementations for both traditional computing devices (CPUs) and graphics processing units (GPUs). Both platforms required various implementation optimizations for the efficient computation of the steps of the fitting method. Together with the fitting results and the related computation times we present the applied implementation optimizations and the related considerations.

The rest of the paper is organized as follows. The next section summarizes the basic properties of TMAPs. Section 3 presents the theoretical foundation of the EM method for TMAP fitting and the high level procedural description of the method. Section 4 discusses several implementation versions for CPU as well as for GPU-based computation. Numerical results are provided in Sect. 5 and the paper is concluded in Sect. 6.

2 Transient Markovian Arrival Processes

Transient Markovian arrival processes (TMAPs) are continuous time terminating point processes in which the inter-arrival times are governed by a background Markov chain, hence they can be correlated.

TMAPs can be characterized by an initial probability vector, \(\alpha \), holding the initial state distribution of the background Markov chain at time 0 (\(\alpha \mathbbm {1}=1\), where \(\mathbbm {1}\) is the column vector of ones), and two matrices, \({{\varvec{D}}}_\mathbf{0}\) and \({{\varvec{D}}}_\mathbf{1}\). Matrix \({{\varvec{D}}}_\mathbf{0}\) contains the rates of the internal transitions that are not accompanied by an arrival, and matrix \({{\varvec{D}}}_\mathbf{1}\) consists of the rates of those transitions that generate an arrival. However, contrary to non-terminating MAPs, the generator matrix of the background Markov chain of a TMAP, \({{\varvec{D}}}={{\varvec{D}}}_\mathbf{0}+{{\varvec{D}}}_\mathbf{1}\), is transient, that is, \({{\varvec{D}}}\mathbbm {1}\ne 0\), and the non-negative vector \(d=-{{\varvec{D}}}\mathbbm {1}\) describes the termination rates of the background Markov chain. Based on practical considerations we assume that the termination is an observed event (an arrival), which means that a TMAP generates at least one arrival. (If only the “arrival events” are known, which is commonly the case in practice, the TMAPs which do not generate any arrival are not observed. Without knowing how many TMAPs terminated without generating any arrival event, there is no information to estimate the parameters of those invisible cases.) It also means that the TMAPs considered here are special cases of the ones defined in [8], since we assume that \(d_0=0\) and our vector d equals vector \(d_1\) in [8]. The fact that the background Markov chain is transient ensures that the number of events generated by the process is finite. The Markov chain representing the number of arrivals and the state of the background process is depicted in Fig. 1.

Fig. 1. The structure of the Markov chain representing the number of arrivals and the state of the background process.

Matrix \({{\varvec{P}}}=(-{{\varvec{D}}}_\mathbf{0})^{-1}{{\varvec{D}}}_\mathbf{1}\) describes the state transition probabilities embedded at arrival instants. \({{\varvec{P}}}\) holds the state transition probabilities of a transient discrete time Markov chain (DTMC) with termination vector \(p=\mathbbm {1}-{{\varvec{P}}}\mathbbm {1}\). Note that \({{\varvec{P}}}\) is a sub-stochastic matrix (it has non-negative elements and \({{\varvec{P}}}\mathbbm {1}\le \mathbbm {1}\)), and \(({{\varvec{I}}}-{{\varvec{P}}})^{-1}p=\mathbbm {1}\) holds.

In the case of TMAPs not only the statistical quantities related to the inter-arrival times are of interest, but also the ones related to the number of generated arrivals.

The number of arrivals \(\mathcal {K}\) is characterized by a discrete phase-type (DPH) distribution with initial vector \(\alpha \) and transition probability matrix \({{\varvec{P}}}\). Hence, the mean number of arrivals is given by

$$\begin{aligned} E\!\left( \mathcal {K}\right) = \sum _{k=1}^\infty k\, \alpha {{\varvec{P}}}^{k-1} p = \alpha ({{\varvec{I}}}-{{\varvec{P}}})^{-2} p = \alpha ({{\varvec{I}}}-{{\varvec{P}}})^{-1} \mathbbm {1}. \end{aligned}$$
(1)
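As a small illustration (with arbitrarily chosen rates, purely for demonstration and not taken from any measured data), consider the two-state TMAP with \(\alpha =(1,0)\) and

$$\begin{aligned} {{\varvec{D}}}_\mathbf{0}=\begin{pmatrix} -3 &{} 1\\ 0 &{} -2 \end{pmatrix}, \quad {{\varvec{D}}}_\mathbf{1}=\begin{pmatrix} 1 &{} 0.5\\ 0.5 &{} 1 \end{pmatrix}, \quad d=-({{\varvec{D}}}_\mathbf{0}+{{\varvec{D}}}_\mathbf{1})\mathbbm {1}=\begin{pmatrix} 0.5\\ 0.5 \end{pmatrix}\!. \end{aligned}$$

Here \({{\varvec{P}}}=(-{{\varvec{D}}}_\mathbf{0})^{-1}{{\varvec{D}}}_\mathbf{1}=\begin{pmatrix} 5/12 &{} 1/3\\ 1/4 &{} 1/2 \end{pmatrix}\), \(p=\mathbbm {1}-{{\varvec{P}}}\mathbbm {1}=(1/4,1/4)^T\), and (1) gives \(E\!\left( \mathcal {K}\right) =\alpha ({{\varvec{I}}}-{{\varvec{P}}})^{-1}\mathbbm {1}=4\), i.e., this TMAP generates 4 arrivals on average.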

If the inter-arrival times are denoted by \(\mathcal {X}_1,\mathcal {X}_2,\dots \), then the joint density function of the inter-arrival times is

$$\begin{aligned} f(x_1,x_2,\dots ,x_k)&=\lim _{\varDelta \rightarrow 0}\frac{1}{\varDelta }P(\mathcal {X}_1\in (x_1,x_1+\varDelta ),\dots ,\mathcal {X}_k\in (x_k,x_k+\varDelta )) \nonumber \\&=\alpha e^{{{\varvec{D}}}_\mathbf{0}x_1}{{\varvec{D}}}_1e^{{{\varvec{D}}}_\mathbf{0}x_2}{{\varvec{D}}}_1\cdots e^{{{\varvec{D}}}_\mathbf{0}x_k}({{\varvec{D}}}_1\mathbbm {1}+d). \end{aligned}$$
(2)

If it exists, the nth moment of \(\mathcal {X}_{k+1}\) is

$$\begin{aligned} E\!\left( \mathcal {X}_{k+1}^n|\mathcal {X}_{k+1}<\infty \right) =\frac{E\!\left( \mathcal {X}_{k+1}^n \mathcal{I}_{\{\mathcal {X}_{k+1}<\infty \}}\right) }{P(\mathcal {X}_{k+1}<\infty )}= \frac{n! \alpha {{\varvec{P}}}^{k} (-{{\varvec{D}}}_\mathbf{0})^{-n} \mathbbm {1}}{\alpha {{\varvec{P}}}^{k} \mathbbm {1}}. \end{aligned}$$
(3)

The mean of the inter-arrival times \(E\!\left( \mathcal {X}\right) \) is not as easy to express as for ordinary MAPs; it is obtained from \(E\!\left( \mathcal {X}\right) = E\!\left( \sum _{k=1}^{\mathcal {K}} \mathcal {X}_k\right) /E\!\left( \mathcal {K}\right) \), where the numerator is derived as

$$\begin{aligned} E\!\left( \sum _{k=1}^{\mathcal {K}} \mathcal {X}_k\right)&= \sum _{\kappa =1}^\infty E\!\left( \mathcal {I}_{\{\mathcal {K} = \kappa \}} ~\sum _{k=1}^{\mathcal {K}} \mathcal {X}_k \right) = \sum _{\kappa =1}^\infty \sum _{i=0}^{\kappa -1} \alpha {{\varvec{P}}}^{i} {{\varvec{U}}} {{\varvec{P}}}^{\kappa -1-i} p \nonumber \\&=\sum _{i=0}^{\infty } \sum _{\kappa =0}^\infty \alpha {{\varvec{P}}}^{i} {{\varvec{U}}} {{\varvec{P}}}^{\kappa } p=\alpha ({{\varvec{I}}}-{{\varvec{P}}})^{-1} {{\varvec{U}}} ({{\varvec{I}}}-{{\varvec{P}}})^{-1} p, \end{aligned}$$
(4)

where \({{\varvec{U}}}=(-{{\varvec{D}}}_\mathbf{0})^{-1}\), and the denominator is given by (1). As a result the mean inter-arrival time is

$$\begin{aligned} E\!\left( \mathcal {X}\right) = \frac{E\!\left( \sum _{k=1}^{\mathcal {K}} \mathcal {X}_k\right) }{E\!\left( \mathcal {K}\right) } = \frac{\alpha ({{\varvec{I}}}-{{\varvec{P}}})^{-1} {{\varvec{U}}} ({{\varvec{I}}}-{{\varvec{P}}})^{-1} p}{\alpha ({{\varvec{I}}}-{{\varvec{P}}})^{-2} p}= \frac{\alpha ({{\varvec{I}}}-{{\varvec{P}}})^{-1} {{\varvec{U}}} \mathbbm {1}}{\alpha ({{\varvec{I}}}-{{\varvec{P}}})^{-1} \mathbbm {1}}. \end{aligned}$$
(5)
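For the illustrative two-state example introduced after (1), \({{\varvec{U}}}=(-{{\varvec{D}}}_\mathbf{0})^{-1}=\begin{pmatrix} 1/3 &{} 1/6\\ 0 &{} 1/2 \end{pmatrix}\), and (5) evaluates to \(E\!\left( \mathcal {X}\right) =\frac{\alpha ({{\varvec{I}}}-{{\varvec{P}}})^{-1} {{\varvec{U}}} \mathbbm {1}}{\alpha ({{\varvec{I}}}-{{\varvec{P}}})^{-1} \mathbbm {1}}=\frac{2}{4}=0.5\).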

To discuss the correlation of the inter-arrival times we introduce the notation \(\hat{\mathcal {X}}_k = \mathcal {X}_k~|~\mathcal {X}_k\!<\!\infty \). Note that \(\hat{\mathcal {X}}_1 = \mathcal {X}_1\) due to the modeling assumption of at least one arrival. With this notation, from (3) we have

$$\begin{aligned} E\!\left( \hat{\mathcal {X}}_{k+1}^n\right) =\displaystyle \frac{n! \alpha {{\varvec{P}}}^{k} {{\varvec{U}}}^{n} \mathbbm {1}}{\alpha {{\varvec{P}}}^{k} \mathbbm {1}}. \end{aligned}$$

The expectation of the product of two inter-arrival times is

$$\begin{aligned} E\!\left( \mathcal {X}_1 \hat{\mathcal {X}}_{k+1}\right)&= \frac{E\!\left( \mathcal {X}_1 \mathcal {X}_{k+1} \mathcal {I}_{\{\mathcal {X}_{k+1}<\infty \}}\right) }{P(\mathcal {X}_{k+1}<\infty )} \\&= \frac{\alpha (-{{\varvec{D}}}_\mathbf{0})^{-2} {{\varvec{D}}}_\mathbf{1} {{\varvec{P}}}^{k-1} (-{{\varvec{D}}}_\mathbf{0})^{-2} ({{\varvec{D}}}_\mathbf{1} \mathbbm {1}+ d)}{\alpha (-{{\varvec{D}}}_\mathbf{0})^{-1} {{\varvec{D}}}_\mathbf{1}{{\varvec{P}}}^{k-1} (-{{\varvec{D}}}_\mathbf{0})^{-1} ({{\varvec{D}}}_\mathbf{1} \mathbbm {1}+ d)} = \frac{\alpha {{\varvec{U}}} {{\varvec{P}}}^{k} {{\varvec{U}}} \mathbbm {1}}{\alpha {{\varvec{P}}}^{k} \mathbbm {1}},\nonumber \end{aligned}$$
(6)

where we used that \((-{{\varvec{D}}}_\mathbf{0})^{-1}({{\varvec{D}}}_\mathbf{1} \mathbbm {1}+ d)= \mathbbm {1}\), due to \({{\varvec{D}}}_\mathbf{0}\mathbbm {1}+{{\varvec{D}}}_\mathbf{1} \mathbbm {1}+ d= 0\). Based on the joint expectation the correlation is

$$\begin{aligned} Corr(\mathcal {X}_1,\hat{\mathcal {X}}_{k+1})= \frac{E\!\left( \mathcal {X}_1 \hat{\mathcal {X}}_{k+1}\right) -E\!\left( \mathcal {X}_1\right) E\!\left( \hat{\mathcal {X}}_{k+1}\right) }{\sqrt{E\!\left( \mathcal {X}_1^2\right) -E^2\!\left( \mathcal {X}_1\right) } \sqrt{E\!\left( \hat{\mathcal {X}}_{k+1}^2\right) -E^2\!\left( \hat{\mathcal {X}}_{k+1}\right) } }. \end{aligned}$$
(7)

3 An EM Algorithm for TMAPs

In this section an EM algorithm is presented to create a TMAP from measurement data. The measurement data is given by samples \(X=(x^{(\ell )}_{k},~k=1,\dots ,K_\ell ,~\ell =1,\dots ,L)\). We refer to the set of dependent samples belonging to a given \(\ell \) as the \(\ell \)th run; the \(\ell \)th run is composed of \(K_\ell \) samples. The aim of the EM algorithm is to find \(\varTheta =(\alpha , {{\varvec{D}}}_\mathbf{0}, {{\varvec{D}}}_\mathbf{1})\) for which the likelihood of the observations,

$$\begin{aligned} \mathcal {L}(\varTheta |X) = \prod _{\ell =1}^{L} \alpha e^{{{\varvec{D}}}_\mathbf{0} x^{(\ell )}_{1}} {{\varvec{D}}}_\mathbf{1} \cdots e^{{{\varvec{D}}}_\mathbf{0} x^{(\ell )}_{K_\ell }} d, \end{aligned}$$
(8)

is maximized. Introducing the run-dependent forward likelihood (row) vectors recursively as

$$\begin{aligned} a^{(\ell )}[k]&= {\left\{ \begin{array}{ll} \alpha , &{} k=0, \\ a^{(\ell )}[k-1] e^{{{\varvec{D}}}_\mathbf{0}x^{(\ell )}_{k}}{{\varvec{D}}}_\mathbf{1}, &{} k>0, \end{array}\right. } \end{aligned}$$
(9)

for \(\ell =1,\dots ,L\) and \(k=0,\dots ,K_\ell -1\), and backward likelihood (column) vectors as

$$\begin{aligned} b^{(\ell )}[k]&={\left\{ \begin{array}{ll} e^{{{\varvec{D}}}_\mathbf{0}x^{(\ell )}_{k}}{{\varvec{D}}}_\mathbf{1} b^{(\ell )}[k+1], &{} k<K_\ell , \\ e^{{{\varvec{D}}}_\mathbf{0}x^{(\ell )}_{K_\ell }}d, &{} k=K_\ell , \end{array}\right. } \end{aligned}$$
(10)

for \(\ell =1,\dots ,L\) and \(k=K_\ell ,\dots ,1\), the likelihood can be obtained as

$$\begin{aligned} \mathcal {L}(\varTheta |X) = \prod _{\ell =1}^{L} a^{(\ell )}[k_\ell ]\cdot b^{(\ell )}[k_\ell +1], \end{aligned}$$
(11)

for every \(k_\ell =0,\dots ,K_\ell -1\).
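The product in (11) is indeed independent of the choice of \(k_\ell \): by (9) and (10), for \(k_\ell \ge 1\),

$$\begin{aligned} a^{(\ell )}[k_\ell ]\cdot b^{(\ell )}[k_\ell +1] = a^{(\ell )}[k_\ell -1]\, e^{{{\varvec{D}}}_\mathbf{0}x^{(\ell )}_{k_\ell }}{{\varvec{D}}}_\mathbf{1}\, b^{(\ell )}[k_\ell +1] = a^{(\ell )}[k_\ell -1]\cdot b^{(\ell )}[k_\ell ], \end{aligned}$$

so every choice yields \(\alpha \cdot b^{(\ell )}[1]\), the likelihood of run \(\ell \) in (8).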

The forward and backward likelihood vectors play an important role in the presented EM algorithm. However, computing the matrix exponential terms is numerically demanding. To reduce the computational complexity we apply the same structural restriction as in [7, 9, 10]: we introduce a special TMAP structure composed of a number of Erlang distributed branches. When a given branch is selected, the inter-arrival time is Erlang distributed with the parameters (order and rate) of the selected Erlang branch, and after each arrival event a sub-stochastic transition probability matrix determines which Erlang branch generates the next inter-arrival, given the branch generating the current arrival (see Fig. 2). Due to the applied structural restriction, the computations of matrix exponential terms, e.g., in (8), are replaced by computations of scalar exponential terms of the form (12).

Fig. 2. The special TMAP structure used for fitting.

In the proposed special structure the inter-arrival times are generated by one of the R Erlang branches. The order and the intensity parameters of the branches are denoted by \(r_i,\lambda _i, \text{ for } i\in \{1,\dots ,R\}\), respectively. The density of the inter-arrival times generated by branch i is

$$\begin{aligned} f_{i}(x)=\frac{(\lambda _i x)^{r_i-1}}{(r_i-1)!}\lambda _i e^{-\lambda _i x}. \end{aligned}$$
(12)

After branch i generates an arrival event, the next one will be generated by branch j with probability \(\pi _{i,j}\). The matrix of size \(R \times R\) holding these branch switching probabilities is denoted by \({\varvec{\varPi }}=[\pi _{i,j}]\). Since TMAPs generate a finite number of events we have \({\varvec{\varPi }}\mathbbm {1}<\mathbbm {1}\). Observe that the TMAP with the applied structural restriction is uniquely characterized by parameters \(\varTheta =\{\alpha _i,r_i,\lambda _i,\pi _{i,j}, \text{ for } i,j\in \{1,\dots ,R\}\}\).

With this special TMAP structure the forward and backward likelihood vectors can be obtained without computing matrix exponentials, since

$$\begin{aligned} a_{i}^{(\ell )}[k]&= \sum _{j=1}^R a_{j}^{(\ell )}[k-1] f_{j}(x_{k}^{(\ell )}) \pi _{j,i}, \end{aligned}$$
(13)
$$\begin{aligned} b_{j}^{(\ell )}[k]&= \sum _{i=1}^R f_{j}(x_{k}^{(\ell )}) \pi _{j,i} b_{i}^{(\ell )}[k+1]. \end{aligned}$$
(14)
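As a minimal sketch of how (12)–(14) can be turned into code (plain C++ with illustrative names, not the implementation used in this paper), the recursions can be written as follows; the boundary cases \(a^{(\ell )}[0]=\alpha \) and \(b_{j}^{(\ell )}[K_\ell ]=f_{j}(x^{(\ell )}_{K_\ell })(1-\sum _{i}\pi _{j,i})\) follow from (9), (10) and the branch structure. No scaling is applied here, so long runs may under- or overflow (see Sect. 4.2).

    #include <cmath>
    #include <vector>

    // Parameters of the special TMAP structure of Sect. 3 (R Erlang branches).
    // All names are illustrative.
    struct ErlangTmap {
        std::vector<double> alpha;           // initial branch probabilities, size R
        std::vector<int>    r;               // Erlang orders r_i
        std::vector<double> lambda;          // Erlang rates lambda_i
        std::vector<std::vector<double>> Pi; // switching probabilities pi_{i,j}, R x R, sub-stochastic
    };

    // Erlang(r_i, lambda_i) density, Eq. (12).
    double erlang_pdf(const ErlangTmap& m, int i, double x) {
        double f = m.lambda[i] * std::exp(-m.lambda[i] * x);
        for (int n = 1; n < m.r[i]; ++n)
            f *= m.lambda[i] * x / n;        // accumulates (lambda x)^(r_i-1)/(r_i-1)!
        return f;
    }

    // Forward and backward likelihood vectors of one run, Eqs. (13)-(14).
    // a[k][i] = a_i[k] for k = 0..K-1, b[k][i] = b_i[k] for k = 1..K (index 0 unused).
    void likelihood_vectors(const ErlangTmap& m, const std::vector<double>& x,
                            std::vector<std::vector<double>>& a,
                            std::vector<std::vector<double>>& b) {
        const int R = (int)m.alpha.size(), K = (int)x.size();
        a.assign(K, std::vector<double>(R, 0.0));
        b.assign(K + 1, std::vector<double>(R, 0.0));
        a[0] = m.alpha;                                  // a^(l)[0] = alpha, Eq. (9)
        for (int k = 1; k < K; ++k)
            for (int i = 0; i < R; ++i)
                for (int j = 0; j < R; ++j)              // Eq. (13), x_k is x[k-1]
                    a[k][i] += a[k - 1][j] * erlang_pdf(m, j, x[k - 1]) * m.Pi[j][i];
        for (int j = 0; j < R; ++j) {                    // terminal condition at k = K
            double exit = 1.0;
            for (int i = 0; i < R; ++i) exit -= m.Pi[j][i];
            b[K][j] = erlang_pdf(m, j, x[K - 1]) * exit;
        }
        for (int k = K - 1; k >= 1; --k)
            for (int j = 0; j < R; ++j)
                for (int i = 0; i < R; ++i)              // Eq. (14)
                    b[k][j] += erlang_pdf(m, j, x[k - 1]) * m.Pi[j][i] * b[k + 1][i];
    }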

The EM algorithm assumes that the data X available for fitting is incomplete, and that there is hidden data Y. In our case, the hidden data \(y_k^{(\ell )}\in Y\) is an integer identifying the Erlang branch that generated the kth inter-arrival time of run \(\ell \), i.e., \(x_k^{(\ell )}\). If the hidden data were known, the logarithm of the likelihood would be easy to express as

$$\begin{aligned} \log \mathcal {L}(\varTheta |X,Y) = \sum _{\ell =1}^{L}\sum _{k=1}^{K_\ell } \log \big (f_{y_k^{(\ell )}}(x_k^{(\ell )})\big ). \end{aligned}$$
(15)

Maximizing (15) with respect to \(\lambda _i\) gives

$$\begin{aligned} \hat{\lambda }_i&= \frac{\sum _{\ell =1}^{L}\sum _{k=1}^{K_\ell } r_i \cdot I_{\{y_k^{(\ell )}=i\}}}{\sum _{\ell =1}^{L}\sum _{k=1}^{K_\ell } x_k^{(\ell )} I_{\{y_k^{(\ell )}=i\}}}, \end{aligned}$$
(16)

where \(\hat{\lambda }_i\) (and, similarly, the subsequent hatted quantities) denotes the optimum assuming that Y is known. From [2] we have

$$\begin{aligned} \hat{\pi }_{i,j}&= \frac{\sum _{\ell =1}^{L}\sum _{k=1}^{K_\ell -1} I_{\{y_k^{(\ell )}=i,y_{k+1}^{(\ell )}=j\}}}{\sum _{\ell =1}^{L}\sum _{k=1}^{K_\ell } I_{\{y_k^{(\ell )}=i\}}}. \end{aligned}$$
(17)

Note that the summation over k in the denominator runs up to \(K_\ell \), while the one in the numerator runs up to \(K_\ell -1\), thus matrix \(\hat{\varvec{\varPi }}\) is sub-stochastic, reflecting the terminating behavior of TMAPs. The maximum likelihood estimation for the initial vector is

$$\begin{aligned} \hat{\alpha }_{i}=\frac{1}{L}\sum _{\ell =1}^{L} I_{\{y_1^{(\ell )}=i\}}. \end{aligned}$$
(18)
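For instance, (16) follows by differentiating (15) with respect to \(\lambda _i\): since \(\log f_{i}(x)=r_i\log \lambda _i+(r_i-1)\log x-\log (r_i-1)!-\lambda _i x\), we have

$$\begin{aligned} \frac{\partial }{\partial \lambda _i}\log \mathcal {L}(\varTheta |X,Y) =\sum _{\ell =1}^{L}\sum _{k=1}^{K_\ell } I_{\{y_k^{(\ell )}=i\}}\left( \frac{r_i}{\lambda _i}-x_k^{(\ell )}\right) , \end{aligned}$$

and setting this derivative to zero yields (16).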

The hidden data is, however, unknown. The marginal distribution of the hidden data \(y_k^{(\ell )}\) can be derived from the forward and backward likelihood vectors leading to

$$\begin{aligned} q_{i}^{(\ell )}[k]&= P(y_k^{(\ell )}=i|X,\varTheta ) = \frac{P(y_k^{(\ell )}=i,X|\varTheta )}{P(X|\varTheta )} \nonumber \\&= \frac{\left( a_{i}^{(\ell )}[k-1]\cdot b_{i}^{(\ell )}[k]\right) \prod _{m\ne \ell } a^{(m)}[0]\cdot b^{(m)}[1]}{\prod _{m=1}^L a^{(m)}[0]\cdot b^{(m)}[1]} \\&=\frac{a_{i}^{(\ell )}[k-1]\cdot b_{i}^{(\ell )}[k]}{\alpha \cdot b^{(\ell )}[1]}, ~~~k=1,\dots ,K_\ell ,\nonumber \end{aligned}$$
(19)

where we used \(a^{(\ell )}[0]=\alpha \).

To characterize the joint distribution of the branches generating two consecutive inter-arrival times we also need the probabilities

$$\begin{aligned} q^{(\ell )}_{i,j}[k]&=P(y^{(\ell )}_k=i,y^{(\ell )}_{k+1}=j|X,\varTheta ) = \frac{P(y^{(\ell )}_k=i,y^{(\ell )}_{k+1}=j,X|\varTheta )}{P(X|\varTheta )} \nonumber \\&=\frac{a^{(\ell )}_{i}[k-1]\cdot f_{i}(x^{(\ell )}_k)\cdot \pi _{i,j}\cdot b^{(\ell )}_{j}[k+1]}{\alpha \cdot b^{(\ell )}[1]}. \end{aligned}$$
(20)

The calculation of \(q_{i}^{(\ell )}[k]\) and \(q^{(\ell )}_{i,j}[k]\) forms the E-step of the algorithm.

In the M-step new estimates for \(\varTheta \) are obtained based on the distributions of the hidden data. For \(\lambda _i\) from (16) and (19) we get

$$\begin{aligned}&\lambda _i = \frac{\sum _{\ell =1}^L\sum _{k=1}^{K_\ell } r_i\cdot q_{i}^{(\ell )}[k]}{\sum _{\ell =1}^{L}\sum _{k=1}^{K_\ell } x_k^{(\ell )} q_{i}^{(\ell )}[k]} = \frac{\sum _{\ell =1}^{L}\frac{\sum _{k=1}^{K_\ell } r_i a_{i}^{(\ell )}[k-1] b_{i}^{(\ell )}[k]}{\alpha \cdot b^{(\ell )}[1]}}{\sum _{\ell =1}^{L}\frac{\sum _{k=1}^{K_\ell } x_{k}^{(\ell )} a_{i}^{(\ell )}[k-1] b_{i}^{(\ell )}[k]}{\alpha \cdot b^{(\ell )}[1]}}. \end{aligned}$$
(21)

Similarly, the new estimates for the branch switching probabilities are obtained from (17) and (20) as

$$\begin{aligned} \pi _{i,j}&=\frac{\sum _{\ell =1}^{L}\sum _{k=1}^{K_\ell -1} q_{i,j}^{(\ell )}[k]}{\sum _{\ell =1}^{L}\sum _{k=1}^{K_\ell } q_i^{(\ell )}[k]}= \frac{\sum _{\ell =1}^{L} \frac{\sum _{k=1}^{K_\ell -1} a_{i}^{(\ell )}[k\!-\!1] f_{i}(x_{k}^{(\ell )}) \pi _{i,j} b_{j}^{(\ell )}[k\!+\!1]}{\alpha \cdot b^{(\ell )}[1]}}{\sum _{\ell =1}^{L} \frac{\sum _{k=1}^{K_\ell } a_{i}^{(\ell )}[k\!-\!1] b_{i}^{(\ell )}[k]}{\alpha \cdot b^{(\ell )}[1]}}. \end{aligned}$$
(22)

Finally, probabilities \(\alpha _i\) are derived from (18) and (19), yielding

$$\begin{aligned} \alpha _i =\frac{1}{L}\sum _{\ell =1}^{L}q_i^{(\ell )}[1] =\frac{1}{L}\sum _{\ell =1}^{L} \frac{ \alpha _{i} b_{i}^{(\ell )}[1]}{\alpha \cdot b^{(\ell )}[1]}. \end{aligned}$$
(23)
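To make the E- and M-steps concrete, the following sketch (plain C++ with illustrative names, not the implementation evaluated in this paper) accumulates (19)–(23) over the runs, using the unnormalized likelihood vectors as computed, e.g., by the snippet after (14); the actual implementations work with the scaled quantities of Sect. 4.2.

    #include <vector>

    using Vec = std::vector<double>;
    using Mat = std::vector<Vec>;

    // One M-step, Eqs. (21)-(23), written with the unnormalized likelihood vectors.
    // Per run l: x[l][k-1] is the kth inter-arrival time, fx[l][k-1][i] = f_i(x_k),
    // a[l][k][i] = a_i[k] (k = 0..K-1), b[l][k][i] = b_i[k] (k = 1..K, index 0 unused).
    // alpha, lambda and Pi hold the current estimates and are overwritten.
    void m_step(const std::vector<Mat>& a, const std::vector<Mat>& b,
                const std::vector<Mat>& fx, const std::vector<Vec>& x,
                Vec& alpha, const std::vector<int>& r, Vec& lambda, Mat& Pi) {
        const int R = (int)alpha.size(), L = (int)a.size();
        Vec lam_num(R, 0.0), lam_den(R, 0.0), pi_den(R, 0.0), new_alpha(R, 0.0);
        Mat pi_num(R, Vec(R, 0.0));
        for (int l = 0; l < L; ++l) {
            const int K = (int)x[l].size();
            double lhood = 0.0;                                      // alpha * b[1]
            for (int i = 0; i < R; ++i) lhood += alpha[i] * b[l][1][i];
            for (int i = 0; i < R; ++i)                              // Eq. (23)
                new_alpha[i] += alpha[i] * b[l][1][i] / lhood;
            for (int k = 1; k <= K; ++k)
                for (int i = 0; i < R; ++i) {
                    double q = a[l][k - 1][i] * b[l][k][i] / lhood;  // q_i[k], Eq. (19)
                    lam_num[i] += r[i] * q;                          // Eq. (21), numerator
                    lam_den[i] += x[l][k - 1] * q;                   // Eq. (21), denominator
                    pi_den[i]  += q;                                 // Eq. (22), denominator
                }
            for (int k = 1; k <= K - 1; ++k)
                for (int i = 0; i < R; ++i)
                    for (int j = 0; j < R; ++j)                      // q_{i,j}[k], Eq. (20)
                        pi_num[i][j] += a[l][k - 1][i] * fx[l][k - 1][i] * Pi[i][j]
                                        * b[l][k + 1][j] / lhood;
        }
        for (int i = 0; i < R; ++i) {
            lambda[i] = lam_num[i] / lam_den[i];                     // Eq. (21)
            alpha[i]  = new_alpha[i] / L;                            // Eq. (23)
            for (int j = 0; j < R; ++j)
                Pi[i][j] = pi_num[i][j] / pi_den[i];                 // Eq. (22)
        }
    }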

4 Details of the Numerical Algorithm

The EM algorithm presented in Sect. 3 is not straightforward to implement efficiently. While the special structure proposed for fitting does reduce the computational demand of the procedure significantly, the naive implementation (shown in Fig. 1) still contains many numerical pitfalls.

Our aim is to develop an implementation that enables the practical application of the algorithm, thus

  • the execution time must be reasonable with large data sets (containing millions of samples),

  • the implementation must be insensitive to the order of magnitude of the input data,

  • the implementation should exploit the parallel processing capabilities of modern hardware.

These items are addressed in the subsections below.

4.1 Initial Guess for \({\alpha }\), \({\lambda _{i}}\) and \({\varPi }\)

We use the following randomly generated initial parameters. \(\alpha \) is a random probability vector (composed of R uniform pseudo-random numbers in (0, 1) divided by the sum of the R numbers). The mean run length of the data set is computed as \(\bar{K}=\sum _{\ell =1}^L K_\ell /L\), and based on that each row of matrix \({\varvec{\varPi }}\) is a random probability vector multiplied by \(1-1/\bar{K}\) (that is, initially the exit probability is the same, \(1/\bar{K}\), in each Erlang branch). The initial values for \(\lambda _i\) are computed from the mean inter-arrival time \(\bar{T}=\frac{\sum _{\ell =1}^L \sum _{k=1}^{K_\ell } x_k^{(\ell )} }{\sum _{\ell =1}^L K_\ell } \) as \(\lambda _i=r_i/\bar{T}\). Let \(x_{max} = \max _{\ell ,k} x_k^{(\ell )}\) and \(\lambda _{max}= \max _{i} \lambda _i\). In order to avoid underflow during the computation of \(e^{-x_k^{(\ell )}\lambda _i}\) in (12), we re-scale this initial guess according to the representation limits of single precision floating point numbers, whose 8-bit exponent field limits the representable magnitude to about \(2^{2^7} \approx e^{88}\). Accordingly, if \(x_{max} \lambda _{max} >60\) then we re-scale the initial intensity values to \(\lambda _i = \frac{60 \lambda _i}{x_{max} \lambda _{max}}\), where 60 is a heuristic choice that keeps the exponents far enough from the representation limit (which is 88).
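A possible realization of this initialization is sketched below (plain C++ with illustrative names, not the code used for the experiments in Sect. 5).

    #include <algorithm>
    #include <random>
    #include <vector>

    // Random initial guess of Sect. 4.1 (illustrative names, not the paper's code).
    // runs[l] holds the inter-arrival times of the l-th run, r[i] the Erlang orders.
    void initial_guess(const std::vector<std::vector<double>>& runs,
                       const std::vector<int>& r,
                       std::vector<double>& alpha, std::vector<double>& lambda,
                       std::vector<std::vector<double>>& Pi, std::mt19937& rng) {
        const int R = (int)r.size();
        std::uniform_real_distribution<double> U(0.0, 1.0);
        auto random_prob_vector = [&](double scale) {
            std::vector<double> v(R);
            double s = 0.0;
            for (double& vi : v) { vi = U(rng); s += vi; }
            for (double& vi : v) vi *= scale / s;
            return v;
        };
        alpha = random_prob_vector(1.0);               // random probability vector
        double samples = 0.0;
        for (const auto& run : runs) samples += (double)run.size();
        double K_bar = samples / (double)runs.size();  // mean run length
        Pi.assign(R, std::vector<double>());
        for (auto& row : Pi)                           // row sums are 1 - 1/K_bar
            row = random_prob_vector(1.0 - 1.0 / K_bar);
        double sum_x = 0.0, x_max = 0.0;
        for (const auto& run : runs)
            for (double x : run) { sum_x += x; x_max = std::max(x_max, x); }
        double T_bar = sum_x / samples;                // mean inter-arrival time
        lambda.assign(R, 0.0);
        for (int i = 0; i < R; ++i) lambda[i] = r[i] / T_bar;
        double lam_max = *std::max_element(lambda.begin(), lambda.end());
        if (x_max * lam_max > 60.0)                    // keep exp(-x*lambda) representable
            for (double& li : lambda) li = 60.0 * li / (x_max * lam_max);
    }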

4.2 Improving Numerical Stability of the Forward and Backward Likelihood Vectors Computation

Computing vectors \(a^{(\ell )}[k]\) and \(b^{(\ell )}[k]\) by applying recursions (13) and (14) directly can lead to numerical under- or overflow. To overcome this difficulty we express these vectors in the normal form

$$\begin{aligned} a_{i}^{(\ell )}[k] = \dot{a}_{i}^{(\ell )}[k]\cdot 2^{\ddot{a}^{(\ell )}[k]},\nonumber \\ b_{i}^{(\ell )}[k] = \dot{b}_{i}^{(\ell )}[k]\cdot 2^{\ddot{b}^{(\ell )}[k]}, \end{aligned}$$
(24)

where \(\ddot{a}^{(\ell )}[k]\) and \(\ddot{b}^{(\ell )}[k]\) are integer numbers and the values \(\dot{a}_{i}^{(\ell )}[k]\), \(\dot{b}_{i}^{(\ell )}[k]\) are such that \(0.5 \le \dot{a}^{(\ell )}[k]\mathbbm {1} <1\) and \(0.5\le \mathbbm {1}^T\dot{b}^{(\ell )}[k] < 1\). For a given vector \(a_{i}^{(\ell )}[k]\), \(\dot{a}_{i}^{(\ell )}[k]\) and \(\ddot{a}^{(\ell )}[k]\) can be obtained from

$$\begin{aligned} \dot{a}_{i}^{(\ell )}[k] = \frac{a_{i}^{(\ell )}[k]}{2^{\left\lceil \log _{2} \left( a^{(\ell )} [k]\mathbbm {1} \right) \right\rceil }}, ~~ \ddot{a}^{(\ell )}[k] = \left\lceil \log _{2}( a^{(\ell )}[k]\mathbbm {1}) \right\rceil \!. \end{aligned}$$
(25)

To avoid the calculation of \(a_{i}^{(\ell )}[k]\) (that can under- or overflow), it is possible to modify the recursion (13) to work with \(\dot{a}_{i}^{(\ell )}[k]\) and \(\ddot{a}^{(\ell )}[k]\) directly, leading to

$$\begin{aligned} \tilde{a}_{i}^{(\ell )}[k]&= \sum _{j=1}^{R} \dot{a}_{j}^{(\ell )}[k-1] f_{j}\left( x_{k}^{(\ell )}\right) \pi _{j,i}, \nonumber \\ \dot{a}_{i}^{(\ell )}[k]&= \frac{\tilde{a}_{i}^{(\ell )}[k]}{2^{\left\lceil \log _{2} ( \tilde{a}^{(\ell )} [k]\mathbbm {1} ) \right\rceil }}, ~~ \ddot{a}^{(\ell )}[k] = \ddot{a}^{(\ell )}[k-1] + \left\lceil \log _{2}( \tilde{a}^{(\ell )}[k]\mathbbm {1}) \right\rceil \!. \end{aligned}$$
(26)

Hence, in the first step \(\tilde{a}_{i}^{(\ell )}[k]\) is computed, from which in the second step the normalized quantity is derived and the exponent is incremented by the appropriate magnitude. To obtain the normal form of \(\dot{a}_{i}^{(\ell )}[0]\) and \(\ddot{a}^{(\ell )}[0]\), we can apply (25). The treatment of the normal form of the backward likelihood vectors \(\dot{b}_{i}^{(\ell )}[k]\) follows the same pattern.
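The scaled forward recursion (26) can be sketched as follows (plain C++ with illustrative names; std::frexp returns a significand in [0.5, 1) together with the corresponding base-2 exponent, which matches the normal form (24)–(25) up to the treatment of exact powers of two).

    #include <cmath>
    #include <vector>

    // Scaled forward recursion of Eq. (26) for one run (illustrative names).
    // fx(k, j) must return f_j(x_k) for the k-th inter-arrival time (1-based k).
    // On return, a_dot[k] is the normalized vector and a_exp[k] its base-2 exponent.
    template <typename DensityFn>
    void scaled_forward(int K, int R, const std::vector<std::vector<double>>& Pi,
                        const std::vector<double>& alpha, DensityFn fx,
                        std::vector<std::vector<double>>& a_dot,
                        std::vector<int>& a_exp) {
        a_dot.assign(K, std::vector<double>(R, 0.0));
        a_exp.assign(K, 0);
        double s = 0.0;                                 // k = 0: normalize alpha, Eq. (25)
        for (double ai : alpha) s += ai;
        int e;
        std::frexp(s, &e);                              // s = m * 2^e with m in [0.5, 1)
        for (int i = 0; i < R; ++i) a_dot[0][i] = std::ldexp(alpha[i], -e);
        a_exp[0] = e;
        for (int k = 1; k < K; ++k) {
            double sum = 0.0;
            for (int i = 0; i < R; ++i) {
                double t = 0.0;                         // tilde a_i[k], Eq. (26)
                for (int j = 0; j < R; ++j)
                    t += a_dot[k - 1][j] * fx(k, j) * Pi[j][i];
                a_dot[k][i] = t;
                sum += t;
            }
            std::frexp(sum, &e);                        // exponent of the row sum
            for (int i = 0; i < R; ++i)
                a_dot[k][i] = std::ldexp(a_dot[k][i], -e);
            a_exp[k] = a_exp[k - 1] + e;                // accumulate exponents
        }
    }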

The parameter estimation formulas using the normal form of the forward and backward likelihood vectors are

$$\begin{aligned} \lambda _{i}&= \frac{\sum _{\ell =1}^{L} \frac{1}{ \alpha \dot{b}^{(\ell )}[1] } \sum _{k=1}^{K_\ell } r_i \dot{a}_{i}^{(\ell )}[k-1] \dot{b}_{i}^{(\ell )}[k] 2^{\ddot{a}^{(\ell )}[k-1] + \ddot{b}^{(\ell )}[k] - \ddot{b}^{(\ell )}[1]} }{\sum _{\ell =1}^{L} \frac{1}{ \alpha \dot{b}^{(\ell )}[1] } \sum _{k=1}^{K_\ell } x_{k}^{(\ell )} \dot{a}_{i}^{(\ell )}[k-1] \dot{b}_{i}^{(\ell )}[k] 2^{\ddot{a}^{(\ell )}[k-1] + \ddot{b}^{(\ell )}[k] - \ddot{b}^{(\ell )}[1]} }, \end{aligned}$$
(27)
$$\begin{aligned} \pi _{i,j}&= \frac{\sum \limits _{\ell =1}^{L} \frac{1}{ \alpha \dot{b}^{(\ell )}[1] } \sum \limits _{k=1}^{K_\ell -1} \dot{a}_{i}^{(\ell )}[k-1] f_{i}\left( x_{k}^{(\ell )} \right) \pi _{i,j} \dot{b}_{j}^{(\ell )}[k+1] 2^{\ddot{a}^{(\ell )}[k-1] + \ddot{b}^{(\ell )}[k+1] - \ddot{b}^{(\ell )}[1]} }{\sum _{\ell =1}^{L} \frac{1}{\alpha \dot{b}^{(\ell )}[1] } \sum _{k=1}^{K_\ell } \dot{a}_{i}^{(\ell )}[k-1] \dot{b}_{i}^{(\ell )}[k] 2^{\ddot{a}^{(\ell )}[k-1] + \ddot{b}^{(\ell )}[k] - \ddot{b}^{(\ell )}[1]} }, \end{aligned}$$
(28)
$$\begin{aligned} \alpha _{i}&= \frac{1}{L} \sum _{\ell =1}^{L} \frac{\alpha _{i} \dot{b}_{i}^{(\ell )}[1]}{\alpha \dot{b}^{(\ell )}[1]}. \end{aligned}$$
(29)

Observe that the exponent of 2 in (27)–(29) involves only the combination \(\ddot{a}^{(\ell )}[k-1]+\ddot{b}^{(\ell )}[k]-\ddot{b}^{(\ell )}[1]\), which remains small for every k because \(a^{(\ell )}[k-1]\cdot b^{(\ell )}[k]\) does not depend on k (cf. (11)); thus multiplications and divisions with huge numbers are avoided.

Finally, the log-likelihood of the whole trace data can be computed as

$$\begin{aligned} \log \mathcal {L}(\varTheta |X) = \log \left( \prod _{\ell =1}^{L} \alpha b^{(\ell )}[1]\right) = \sum _{\ell =1}^{L} \left( \log \left( \alpha \dot{b}^{(\ell )}[1] \right) + \ddot{b}^{(\ell )}[1] \log 2 \right) \!. \end{aligned}$$
(30)
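In code, (30) then reduces to a few lines (a sketch with illustrative names, where b_dot1[l] and b_exp1[l] hold \(\dot{b}^{(\ell )}[1]\) and \(\ddot{b}^{(\ell )}[1]\)).

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Total log-likelihood, Eq. (30), from the normalized backward vectors.
    double log_likelihood(const std::vector<double>& alpha,
                          const std::vector<std::vector<double>>& b_dot1,
                          const std::vector<int>& b_exp1) {
        double logL = 0.0;
        for (std::size_t l = 0; l < b_dot1.size(); ++l) {
            double s = 0.0;                      // alpha * b_dot^(l)[1]
            for (std::size_t i = 0; i < alpha.size(); ++i) s += alpha[i] * b_dot1[l][i];
            logL += std::log(s) + b_exp1[l] * std::log(2.0);
        }
        return logL;
    }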

4.3 Serial Implementations

For accuracy and performance comparison we have implemented three versions of the algorithm shown in Fig. 1 (with the discussed modifications for numerical stability):

  • Java implementation using double precision floating point numbers,

  • C++ implementation using double precision floating point numbers,

  • C++ implementation using single precision floating point numbers.

4.4 Parallel Implementation

We have adapted the presented algorithm to be executed on GPUs (graphics processing units) using the CUDA library. GPUs provide computing power cheaply; however, their computing cores are much simpler than the ones of a CPU. Therefore, to fully utilize the hardware, low level technical details have to be considered, such as thread grouping, the multi-level memory hierarchy, and reducing the number of conditional jumps and memory operations.

The entry part of the algorithm (shown in Fig. 2) is executed in the host environment (i.e., processed by the CPU), from which the so-called kernels (shown in Figs. 3 and 4) are invoked to be executed on the GPU device. Upon kernel launch, the number of threads per block, the number of blocks in the grid and the amount of shared memory (in bytes) to be allocated for every block have to be specified. After the kernel launch the host process waits until all the threads have been processed by the kernel, and then resumes.

Kernel-a, shown in Fig. 3, computes the normalized likelihood vectors \(\dot{a}^{(\ell )}[k]\), \(\dot{b}^{(\ell )}[k]\) and their respective exponents \(\ddot{a}^{(\ell )}[k]\), \(\ddot{b}^{(\ell )}[k]\). The number of threads per block and the grid size can be chosen freely, so as to utilize the specific capabilities of the GPU. However, the threads should be assigned a similar amount of work in order not to waste computing resources.

Kernel-b, shown in Fig. 4, computes the new parameter estimates \(\varTheta = (\lambda _i, \pi _{i,j}, \alpha _i)\). Synchronization between threads is necessary before computing the actual parameter estimates, once the numerator and denominator values have been computed. Since thread synchronization is possible only within a block, the number of blocks is determined by the number of \(\pi _{i,j}\) estimates, i.e., \(R^2\). The thread count per block can be chosen freely.

Note that the workload distribution of the two kernels is different. For Kernel-a the data runs are allocated to the threads of the whole grid, while for Kernel-b all the runs are allocated among the threads of every block.

Evenly allocating the runs to the threads is a complex problem. A simple greedy solution is to take the runs in descending order of their number of inter-arrival time samples and to assign each run to the thread that has been assigned the smallest number of inter-arrivals so far.
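A sketch of this greedy allocation (plain C++ with illustrative names): the runs are taken longest first and each run is assigned to the currently least loaded thread.

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    // Greedy balancing of runs over threads: longest runs first, each run goes to
    // the currently least loaded thread. Returns assignment[l] = thread of run l.
    std::vector<int> allocate_runs(const std::vector<std::size_t>& run_lengths,
                                   int num_threads) {
        std::vector<std::size_t> order(run_lengths.size());
        for (std::size_t l = 0; l < order.size(); ++l) order[l] = l;
        std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
            return run_lengths[a] > run_lengths[b];    // descending run length
        });
        using Load = std::pair<std::size_t, int>;      // (assigned samples, thread id)
        std::priority_queue<Load, std::vector<Load>, std::greater<Load>> heap;
        for (int t = 0; t < num_threads; ++t) heap.push(Load(0, t));
        std::vector<int> assignment(run_lengths.size(), -1);
        for (std::size_t l : order) {
            Load least = heap.top();                   // least loaded thread so far
            heap.pop();
            assignment[l] = least.second;
            least.first += run_lengths[l];
            heap.push(least);
        }
        return assignment;
    }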

Accessing global GPU memory is slower than accessing shared memory. It is common practice to load frequently used data from global memory into shared memory and to write the results back into global memory after the calculations. In our case the parameter estimates as well as the structure parameters are uploaded into shared memory.

Additionally, previously computed likelihood vector values are cached for the computation of the next ones, and the Erlang branch densities are computed and stored in shared memory just before they are used in the subsequent calculations.

Shared memory can also be used for communication, since it is visible to all the threads within a block. In Kernel-b the threads perform the summation across their assigned runs, the intermediate results are written to shared memory, and one designated thread loads them to compute the final estimate value.


5 Numerical Experiments

We start the section with a general note on the applied special structure. One would naturally expect the fitting result (in terms of likelihood) obtained with the special structure to be worse than the one obtained with the general TMAP class of the same size; however, similar to related results in the literature [7, 9], our numerical experience is just the opposite. The general TMAP class is redundant [6], and the EM algorithm wanders back and forth between different representations of almost equivalent TMAPs. Our special TMAP class has far fewer parameters, and the benefit of optimizing fewer parameters dominates the drawback of the reduced flexibility of the special structure.

In the following we compare the behavior of the four implementations of the presented EM algorithm: the three serial ones (Java (double), C++ (double), C++ (single)) and the GPU one. All numerical experiments were run on an average PC with an Intel Core 2 CPU clocked at 2112 MHz with 32 KB L1 cache and 4096 KB L2 cache, and an ASUS GeForce GTX 560 Ti graphics card with a GPU clocked at 900 MHz, 1 GB of RAM and 384 CUDA cores. For the GPU implementation the first kernel is launched with 64 blocks of 32 threads each, the second one with 9 (\(R^2\)) blocks of 192 threads each.

Two data sets are considered: the first one contains 1000000 runs with 8824586 inter-arrival times in total, and the second one contains 2000000 runs with 14503248 inter-arrival times in total.

Table 1. Execution times and log likelihoods of different implementations

Based on the experience in [7] we adopt three Erlang branches (\(R=3\)) with 1, 2 and 3 states (\(r_1=1,r_2=2,r_3=3\)). For a fair comparison we have run 30 iterations of the EM algorithm in all cases (after which the algorithm seemed to have converged), and the compared results are always initialized with the same initial guesses. The run times and log-likelihoods are compared in Table 1.

The time necessary to allocate/deallocate arrays is not included in the run time because of the different C++ and Java memory management policies. However, the time for data allocation/deallocation on the GPU device is included. The distribution of the trace data samples among the threads is done in advance, and thus it is not included in the run time either.

After every iteration the log-likelihood was computed using double precision floating point variables. The Java (double) and C++ (double) implementations gave identical log-likelihoods, which are shown in Figs. 3 and 4. The log-likelihoods obtained by the C++ (single) implementation are very close to the ones obtained by the C++ (double) implementation. Therefore, it is more convenient to plot the difference of the log-likelihoods, C++ (single) minus C++ (double). The same applies to the results obtained by the CUDA (single) implementation.

Fig. 3. Log-likelihood of the 1000000 sample trace data obtained by the C++ (double) procedure.

Fig. 4. Log-likelihood of the 2000000 sample trace data obtained by the C++ (double) procedure.


6 Conclusions

An EM procedure to estimate the parameters of special structure TMAPs was developed, and four of its implementations were tested by fitting reasonably large data sets. The C++ implementations of the fitting procedure indicated that both the single and the double precision floating point versions are stable and converge to similar limits. Since the likelihood vectors of independent runs can be computed independently, the parallel implementation on the GPU can speed up the procedure significantly. The log-likelihoods obtained by the CUDA implementation are close to the ones obtained by the serial implementations on the CPU.