1 Introduction

Discrete-valued time series can broadly be classified into two categories, namely count time series and categorical time series. Categorical time series can further be of ordinal or nominal type. Some examples of count time series are the annual counts of hurricanes, the number of patients treated each day in an emergency department, or the daily counts of swine flu cases in Mexico. Sleep status in successive minutes is an example of an ordinal categorical time series. On the other hand, a sequence of rainfall data in which successive days are recorded as “wet” or “dry” is an example of a nominal categorical time series.

This paper is concerned with the coherent forecasting of discrete-valued time series, i.e., of data which are discrete in nature. By coherent forecasting, we mean that the forecast values are themselves integers or categories. In the count time series context, very few works are available on modeling as well as on coherent forecasting. Freeland and McCabe (2004) discussed some methods of coherent forecasting for the thinning-operator-based Poisson integer-valued autoregressive model of order 1 [denoted by PINAR(1)], which was introduced in McKenzie (1985) and Al-Osh and Alzaid (1987). Later, this thinning-based INAR(1) model was extended to INAR(\(p\)), INMA(\(q\)) and INARMA(\(p,q\)) models by McKenzie (1988) and Alzaid and Al-Osh (1990). Although the \(h\)-step ahead conditional mean used for \(h\)-step ahead forecasting can be derived without knowing the exact \(h\)-step ahead forecasting distribution, in general this conditional mean is not an integer and hence is not coherent. Moreover, for nominal categorical time series, where one cannot assign numerical values to the categories, the conditional mean does not make any sense and hence cannot be used for forecasting purposes. However, many authors obtained the exact expression for the \(h\)-step ahead forecasting distribution and used its median and mode, which are coherent by nature, to carry out \(h\)-step ahead coherent forecasting. Later, Jung and Tremayne (2006), Bu and McCabe (2008) and Silva et al. (2009) used the same methods to study coherent forecasting in more general setups. In general, however, these models are not applicable to modeling categorical time series with a finite number of categories.

Jacobs and Lewis (1978a, b, c), in a series of papers, introduced a simple method for obtaining a stationary sequence of dependent random variables with a specified marginal distribution and a correlation structure chosen independently. It was perhaps the first attempt to obtain a general class of simple models for discrete-variate time series, including categorical processes. These models are structurally based on the well-known autoregressive-moving-average processes and are referred to as DARMA models. However, the most well-known approach to fitting categorical time series data is perhaps the mixture transition distribution (MTD) model, a class of models based on time-homogeneous higher-order Markov chains, proposed by Raftery (1985); it has since been modified and generalized by Berchtold and Raftery (2002) and the references therein. In contrast, Pegram (1980) used a very special kind of Markovian model for fitting discrete-valued time series, especially categorical time series. It is important to note that the model proposed by Pegram (1980) is equivalent to the DAR(\(p\)) model considered by Jacobs and Lewis (1978c, 1983). In particular, the DAR(1) process in Jacobs and Lewis (1978c) is exactly the same as Pegram’s AR(1) process. More recently, Biswas and Song (2009) extended Pegram’s autoregressive model of order \(p\), denoted by PAR\((p)\), to a more general setup, Pegram’s autoregressive and moving-average model [denoted by PARMA(\(p\),\(q\))], which is equivalent to the NDARMA(\(p\), \(q\)) model of Jacobs and Lewis (1983); see also the alternative representation of the model in Weiß and Göb (2008). A regression model for categorical time series was also developed and applied to sleep status data by Fokianos and Kedem (2003).

In this article, we derive the exact \(h\)-step ahead coherent forecasting distributions of three discrete time series models, namely the PARMA\((p,q)\) model, the MTD model of order \(p\) or MTD(\(p\)), and the logistic regression model of order \(p\) or Logistic(\(p\)). It is important to note that, if a categorical time series has \(k+1\) categories, then the number of parameters to be estimated in the PARMA(\(p,q\)) model is only \(\left( k+p+q\right) \), whereas it is \(\left( k(k+1)+p-1\right) \) for the MTD(\(p\)) model and \(pk^{2}\) for the Logistic(\(p\)) model. In other words, the PARMA models involve far fewer parameters than the other two models for sufficiently large values of \(k\) and \(p\). In addition, the PARMA models exhibit the classical Yule–Walker serial dependence structure and enjoy simple stochastic properties such as stationarity and ergodicity. However, the model has one notable disadvantage: it is suitable only for time series exhibiting long runs of a certain value. In spite of this limitation, the PARMA models are more flexible in terms of the range of correlation and the ease of interpretation. Therefore, in this article, the forecasting study for the PARMA(\(p\),\(q\)) model is carried out in detail alongside the MTD and logistic models. Different methods of coherent forecasting for ordinal and nominal categorical time series, e.g., median and mode predictors, are discussed. To study the forecasting performance, different measures of forecasting accuracy are considered; the list includes the percentage of true prediction and the Kolmogorov–Smirnov, Euclidean and maximum absolute distances between the true and predicted distributions. In addition, we introduce a different notion of interval forecasting based on the highest predicted probability (HPP), namely the \(100(1-\alpha )\,\%\) HPP set, and study its performance using simulation. All these methods are illustrated using one real dataset of ordinal categorical time series, namely the infant sleep status data.

The rest of the article is organized as follows. In Sect. 2, different methods of coherent forecasting with some measures of forecasting accuracy are discussed to study the forecasting performance. Coherent forecasting for PAR\((p)\), PMA\((q)\) and PARMA\((p,q)\) models is presented in Sect. 3. Coherent forecasting for MTD\((p)\) and Logistic\((p)\) models is discussed in Sects. 4 and 5, respectively. Some extensive simulation results are presented in Sect. 6. In Sect. 7, a practical categorical data, namely infant sleep status data, are analyzed to illustrate the proposed methods. Section 8 concludes. All technical proofs are relegated to the Appendix.

2 Coherent forecasting

It is important to note that forecasting, which is an integral part of time series analysis, has received very little attention in the discrete-valued time series literature, especially in categorical time series analysis. In the context of count time series, Freeland and McCabe (2004) introduced some coherent methods of \(h\)-step ahead forecasting; the list includes the nearest integer of the mean predictor, the median predictor and the mode predictor. If the time series data are categorical, then the nearest integer of the mean predictor cannot be used since moments are not defined there. To use the median predictor for categorical time series, an ordering of the categories is mandatory, and hence the median predictor can only be used for ordinal/ordered categorical time series. The mode predictor, however, does not depend on the order of the categories, and hence can always be used to obtain the \(h\)-step ahead coherent forecast.

On the other hand, to examine the forecasting accuracy for time series of real-valued data, one can always use popular measures like the predicted root mean squared error (PRMSE) or the predicted mean absolute error (PMAE), which can be defined as follows. Let \(\{Y_{t}\}, \; t=1,2,\ldots ,N\) be a time series and let us denote \(\mathcal {Y}_{n}=\{Y_{n}, Y_{n-1}, \ldots , Y_{1}\}\); then

$$\begin{aligned} \text{ PRMSE }(h)&= \sqrt{E\left( \left( Y_{n+h} - \widehat{Y}_{n+h}\right) ^{2}|\mathcal {Y}_{n}\right) }; \qquad h=1,2,\ldots \\&{\hat{=}} \sqrt{\dfrac{1}{M}\displaystyle \sum _{i=1}^{M}\left( \widehat{y}_{(n+h)i} - {y}_{(n+h)i}\right) ^{2}}, \end{aligned}$$
$$\begin{aligned} \text{ PMAE }(h)&= E\left( \left| Y_{n+h} - \widehat{Y}_{n+h}\right| |\mathcal {Y}_{n}\right) ; \qquad h=1,2,\ldots \\&{\hat{=}}\dfrac{1}{M}\displaystyle \sum _{i=1}^{M}\left| \widehat{y}_{(n+h)i} - {y}_{(n+h)i}\right| . \end{aligned}$$

where \(y_{(n+h)i}\) is the true \(i\)th observation at time point \((n+h)\), \(\widehat{y}_{(n+h)i}\) is the predicted observation at the same time point obtained by some forecasting method, and \(M\) is the number of replications.

Unlike for time series of real-valued data, the PRMSE and PMAE cannot be computed for nominal categorical time series. For an ordinal categorical process, although these measures can be computed after assigning some numbers to the categories, they may lead to wrong conclusions since there is no unique way to assign numbers to ordinal categories (as discussed earlier). However, to examine the forecasting accuracy for count and categorical data, we can always use a measure like the percentage of true prediction (PTP), which is defined as

$$\begin{aligned} \text{ PTP }(h)&= E\left( I(Y_{n+h} = \widehat{Y}_{n+h})|\mathcal {Y}_{n}\right) \times 100; \qquad h=1,2,\ldots \\&{\hat{=}}\dfrac{1}{M}\displaystyle \sum _{i=1}^{M}I({y}_{(n+h)i} = \widehat{y}_{(n+h)i}) \times 100. \end{aligned}$$

In addition, we propose some popular distance functions between the true and predicted distributions as measures of forecasting accuracy for categorical time series analysis. The list includes the (discrete) Kolmogorov–Smirnov distance (KSD), the Euclidean distance (ED) (see, e.g., Carruth et al. 2012), and the maximum absolute difference (MAD), which are defined as follows.

Let \(\{Y_{t}\}, \; t=1,2,\ldots ,N\) be a time series of categorical data with \((k+1)\) many categories \(\{C_{0}, C_{1}, \ldots , C_{k}\}\), and let us assume that \(\mathbf {p}_{h}=\left( p_{h}(0), p_{h}(1), \ldots , p_{h}(k)\right) \) denotes the \(h\)-step ahead true distribution of \(Y_{n+h}\) given \(\mathcal {Y}_{n}\) with \(\displaystyle \sum \nolimits _{i=0}^{k}p_{h}(i)=1\), where \(p_{h}(i)\) denotes the probability mass function of \(Y_{n+h}\) at \(C_{i}\) given \(\mathcal {Y}_{n}\). Let \(\widehat{\mathbf {p}}_{h}\) denote the \(h\)-step ahead forecasting distribution, then KSD, ED and MAD functions can be defined as

$$\begin{aligned} \text{ KSD } ({\mathbf {p}}_{h}, \widehat{\mathbf {p}}_{h}) = \underset{0 \le j \le k}{\max } \left| \displaystyle \sum _{i=0}^{j} \left( {p}_{h}(i) - \widehat{p}_{h}(i)\right) \right| , \end{aligned}$$

$$\begin{aligned} \text{ ED } ({\mathbf {p}}_{h}, \widehat{\mathbf {p}}_{h}) = \sqrt{ \displaystyle \sum _{j=0}^{k} \left( {p}_{h}(j) - \widehat{p}_{h}(j)\right) ^{2}}, \end{aligned}$$

and

$$\begin{aligned} \text{ MAD } ({\mathbf {p}}_{h}, \widehat{\mathbf {p}}_{h}) = \underset{0 \le j \le k}{\max } \left| {p}_{h}(j) - \widehat{p}_{h}(j)\right| . \end{aligned}$$

It is important to mention that, unlike the KSD, the ED and MAD can be applied to any type of categorical time series, nominal or ordinal. The KSD, being the maximum absolute difference between cumulative distribution functions, depends on the ordering of the categories. Thus, when there is a natural ordering of the data, the KSD is recommended, while the ED and MAD are more reliable and more easily interpreted when there is no natural ordering (or only a partial order). In the context of goodness of fit for categorical data, a comparison study between the ED and KSD is available in Carruth et al. (2012).
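To make these measures concrete, the following is a minimal Python sketch (ours, not part of the original development), assuming \(\mathbf {p}_{h}\) and \(\widehat{\mathbf {p}}_{h}\) are available as probability vectors over the \(k+1\) categories and that PTP is estimated over \(M\) replications:

```python
import numpy as np

def ksd(p, p_hat):
    # Kolmogorov-Smirnov distance: maximum absolute difference between the
    # two CDFs; meaningful only when the categories have a natural ordering.
    return np.max(np.abs(np.cumsum(p) - np.cumsum(p_hat)))

def ed(p, p_hat):
    # Euclidean distance between the two probability mass functions.
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(p_hat)) ** 2))

def mad(p, p_hat):
    # Maximum absolute difference between the two pmfs; unlike the KSD,
    # it is invariant to any reordering of the categories.
    return np.max(np.abs(np.asarray(p) - np.asarray(p_hat)))

def ptp(y_true, y_pred):
    # Percentage of true prediction across M replications.
    return 100.0 * np.mean(np.asarray(y_true) == np.asarray(y_pred))
```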

As far as interval forecasting for categorical time series is concerned, especially for nominal time series, it is not feasible to obtain the usual prediction interval of \(Y_{n+h}\) given \(\mathcal {Y}_{n}\). However, we can use a notion of prediction set in place of a prediction interval, e.g., the highest predicted probability (HPP) set, which is defined as follows:

Definition

A \(100(1-\alpha )\,\%\) HPP set of \(Y_{n+h}\) given \(\mathcal {Y}_{n}\), denoted by \(\mathcal {S}_{h}\), is defined as

$$\begin{aligned} \mathcal {S}_{h}=\{C_{j}, \;j \in J: \; p_{h}(j) \ge k_{\alpha }\} \end{aligned}$$

where \(J=\{0,1,\ldots ,k\}\) and \(k_{\alpha }\) is the largest number such that

$$\begin{aligned} P(Y_{n+h} \in \mathcal {S}_{h}|\mathcal {Y}_{n})= \displaystyle \sum _{\{j: C_{j} \in \mathcal {S}_{h}\}} p_{h}(j) \; \ge (1-\alpha ). \end{aligned}$$

Based on the above definition, we can obtain the \(100(1-\alpha )\,\%\) HPP set \(\mathcal {S}_{h}\) of \(Y_{n+h}\) given \(\mathcal {Y}_{n}\). It is important to notice that \(\mathcal {S}_{h}\) does not depend on the nature of the categories, and the usual length of \(\mathcal {S}_{h}\) (like the length of a prediction interval) does not make sense here. Therefore, we introduce a notion of length of \(\mathcal {S}_{h}\), namely its cardinality, denoted by \(n(\mathcal {S}_{h})\), which gives the number of elements in the set, and study its behavior against \(h\) using simulation in later sections to assess the interval forecasting accuracy.
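Operationally, the HPP set is obtained by accumulating categories in decreasing order of predicted probability until the accumulated mass reaches \(1-\alpha \); this realizes the threshold \(k_{\alpha }\) of the definition. A minimal Python sketch (ours):

```python
import numpy as np

def hpp_set(p_h, alpha=0.2):
    # p_h: h-step ahead predicted distribution over categories 0, ..., k.
    p_h = np.asarray(p_h)
    order = np.argsort(p_h)[::-1]               # categories by decreasing p_h(j)
    cum = np.cumsum(p_h[order])
    m = np.searchsorted(cum, 1.0 - alpha) + 1   # smallest prefix with mass >= 1 - alpha
    S = sorted(order[:m].tolist())              # the HPP set S_h
    return S, len(S)                            # the set and its cardinality n(S_h)

# e.g. hpp_set([0.05, 0.60, 0.25, 0.10], alpha=0.2) returns ([1, 2], 2):
# categories C_1 and C_2 already carry 85 % of the predicted mass.
```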

3 Coherent forecasting for Pegram’s operator-based ARMA(\(p\), \(q\)) models

3.1 Pegram’s operator

Pegram’s operator \(*\), when applied to two random variables \(U\) and \(V\), say, defines a new random variable \(Z\) as a mixture of \(U\) and \(V\) with mixing coefficients \(\phi \) and \(1-\phi \). This is written as

$$\begin{aligned} Z = (U,\phi ) *(V, 1 - \phi ), \end{aligned}$$
(3.1)

where the marginal probability function of \(Z\) is given by

$$\begin{aligned} P(Z = j) = \phi P(U = j) + (1 -\phi ) P(V = j), ~~j = 0, 1,\ldots \end{aligned}$$

The mixing operator \(*\) can easily be extended to handle more than two discrete variables. Pegram’s (1980) construction was extended to an ARMA(\(p\),\(q\)) model by Biswas and Song (2009) and Biswas and Guha (2009); the extension is equivalent to the NDARMA model of Jacobs and Lewis (1983), and an alternative representation of the NDARMA model is available in Weiß and Göb (2008). The key advantage of Pegram’s operator is that it provides a flexible mixing operation for defining a mixture among a finite number of probability distributions of categorical random variables. It may be noted that in this model the value of the variable of interest at time \(t\) depends on its value at time \((t-1)\) only through the probability of being equal to it, as pointed out by Raftery (1985), who argued that the dependence patterns of such models are therefore restricted.

3.2 Pegram’s operator-based AR(\(p\)) model

Based on the above mixing operator \(*\), Pegram (1980) constructed a stationary AR(\(p\)) process. Let \(\left\{ Y_{t}\right\} \) denote the response series with \((k+1)\) categories \(\{C_{0}, C_{1}, \ldots ,C_{k}\}\). Then the process \(\left\{ Y_{t}\right\} \) is defined as

$$\begin{aligned} Y_t=(I(Y_{t-1}),\phi _1) *(I(Y_{t-2}),\phi _2) *\cdots *(I(Y_{t-p}),\phi _p) *(\epsilon _t,1-\phi _1-\phi _2-\cdots -\phi _p), \end{aligned}$$
(3.2)

which is a mixture of \((p+1)\) discrete distributions, where \(P(\epsilon _t = C_{i}) = p_{i}\), \(i= 0,1,\ldots ,k\), and it is denoted by \(\epsilon _{t} \sim D((C_{i},p_{i}),i=0,1,\ldots ,k)\), with respective mixing weights being \(\phi _1, \ldots ,\phi _p \) with \(\phi _{i} \in (0,1)\), \(i=1,\ldots ,p\), and \(\displaystyle \sum \nolimits _{i=1}^{p} \phi _{i} \in (0,1)\). For every \(t= 0,\pm 1,\pm 2,\ldots \) the conditional probability function takes the form

$$\begin{aligned} P(Y_{t}&= C_{i}|Y_{t-1}=C_{i_1},\ldots ,Y_{t-p}=C_{i_p}) = \phi _{1}I({i_1}={i})+\cdots +\phi _{p}I({i_p}={i}) \nonumber \\&+(1-\phi _{1}-\phi _{2}-\cdots -\phi _{p})p_{i}, \end{aligned}$$
(3.3)

where \(\phi _{j}\), \(j=1,\ldots ,p\), are chosen such that the polynomial equation \(1-\phi _1 z-\cdots -\phi _{p} z^{p} =0\) has all its roots lying outside the unit disc. Here \(I(\cdot )\) is the indicator function, with \(I(A)=1\) if \(A\) occurs and \(I(A)=0\) otherwise.

Taking expectations on both sides of (3.3), we observe that if \(P(Y_{t-h}=C_{i})=p_{i}\) for \(h=1,\ldots ,p\), then \(P(Y_{t}=C_{i})=p_{i}\), which implies marginal stationarity, i.e., marginally \(Y_t \sim D((C_{i},p_{i}),i=0,1,\ldots ,k)\) for all \(t\).
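The mixture representation (3.2) also yields a direct simulation recipe: at each time point, copy the value at lag \(j\) with probability \(\phi _{j}\), and otherwise draw a fresh innovation from the marginal distribution. A minimal Python sketch (ours; the burn-in length is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_par(n, phi, p, burn=200):
    # phi: mixing weights (phi_1, ..., phi_p) with 0 < sum(phi) < 1;
    # p:   marginal distribution over the k+1 categories 0, ..., k.
    phi, p = np.asarray(phi), np.asarray(p)
    w = np.append(phi, 1.0 - phi.sum())      # weights of the p+1 mixture components
    order = len(phi)
    y = list(rng.choice(len(p), size=order, p=p))   # start from the marginal
    for _ in range(n + burn):
        comp = rng.choice(order + 1, p=w)    # pick one component of the mixture (3.2)
        if comp < order:
            y.append(y[-1 - comp])           # copy the value at lag comp + 1
        else:
            y.append(rng.choice(len(p), p=p))   # innovation from D((C_i, p_i))
    return np.array(y[order + burn:])

# e.g. a path from model M1 of Sect. 6: simulate_par(500, [0.8], [0.2, 0.2, 0.5, 0.1])
```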

For a stationary PAR(1) model, the following simple theorem was proved in Biswas and Song (2009).

Theorem 1

For \(h\ge 1\), we have

$$\begin{aligned} P(Y_{t+h}=C_{i}|Y_t=C_{j}) = \phi ^h I(j=i) + (1-\phi ^h) p_{i}. \end{aligned}$$
(3.4)

A more general result for the NDARMA(\(p\), \(q\)) model, which is equivalent to the PARMA(\(p\), \(q\)) model, was derived by Weiß and Göb (2008, Section 5), although the transition probability distribution for \(h>1\) was not derived there.

It is important to mention that, if the time series is categorical, and especially nominal categorical, where one cannot assign numerical values to the categories, moments and the autocorrelation function cannot be defined. Although the autocorrelation function is not defined, measures of serial association can always be defined for such processes. Weiß and Göb (2008) proposed several measures of association in the context of modeling categorical time series; the list includes popular measures like Goodman and Kruskal’s \(\tau \), Goodman and Kruskal’s \(\lambda \), Cramér’s \(v\), Cohen’s \(\kappa \) and many others (see Weiß and Göb 2008 for details). These measures can also be used to select the order of the models. Even if the categories are of ordinal type, where one can assign some ordered numerical scalings, the above measures can be used as alternatives to the autocorrelation, because different numerical scalings yield different values of the moments and autocorrelation for the same categorical time series. Based on these measures, a detailed numerical study is carried out in later sections.

Now, to study the different notions of \(h\)-step ahead coherent forecasting and the measures of forecasting accuracy discussed in Sect. 2, we derive the following results.

Theorem 2

For a stationary PAR(\(p\)) model, the \(h\)-step ahead forecasting distribution of \(Y_{n+h}\) given \(\mathcal {Y}_{n}\) is given by

$$\begin{aligned} \begin{array}{lcl} p_h(i;\varvec{\phi }) &{}=&{} P(Y_{n+h}=C_{i}|\mathcal {Y}_{n}) \\ &{}=&{} \eta _{h1}I(Y_{n}=C_{i})+\cdots +\eta _{hp}I(Y_{n-p+1}=C_{i})+(1-\eta _{h1}-\cdots -\eta _{hp}) p_{i}\\ &{}=&{} {\varvec{\eta }}_{h}^{T} {\varvec{e}} + \left( 1-{\varvec{\eta }}_{h}^{T}{\varvec{1}}\right) p_{i}, \end{array} \end{aligned}$$
(3.5)

where \({\varvec{e}}=\left( I(Y_{n}=C_{i}), I(Y_{n-1}=C_{i}), \ldots , I(Y_{n-p+1}=C_{i})\right) ^{T}\) and the vector of \(h\)-step ahead parameters \({\varvec{\eta }}_{h}=\left( \eta _{h1},\eta _{h2},\ldots , \eta _{hp}\right) ^{T}\) is given by

$$\begin{aligned} {\varvec{\eta }}_{h}={\varvec{\Phi }}^{h-1}{\varvec{\phi }}, \end{aligned}$$
(3.6)

with

$$\begin{aligned} {\varvec{\Phi }}={\begin{pmatrix} \phi _{1} &{}\quad 1 &{}\quad 0 &{}\quad \cdots &{}\quad 0 \\ \phi _{2} &{}\quad 0 &{}\quad 1 &{}\quad \cdots &{}\quad 0 \\ \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ \phi _{p-1} &{}\quad 0 &{}\quad 0 &{}\quad \cdots &{}\quad 1 \\ \phi _{p} &{}\quad 0 &{}\quad 0 &{}\quad \cdots &{}\quad 0 \end{pmatrix}}_{p\times p}, \end{aligned}$$

and \({\varvec{\Phi }}^{h-1}=\underbrace{ {\varvec{\Phi }} \times {\varvec{\Phi }} \times \cdots \times {\varvec{\Phi }}}_{h-1}\).

Proof

See Appendix A. \(\square \)
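Computationally, Theorem 2 reduces \(h\)-step ahead forecasting to a matrix power. A minimal Python sketch (ours), implementing (3.5)–(3.6) with the matrix \({\varvec{\Phi }}\) displayed above:

```python
import numpy as np

def par_forecast(h, past, phi, p):
    # past: the last p observed categories (Y_n, Y_{n-1}, ..., Y_{n-p+1}),
    # coded as integers in 0, ..., k; returns p_h(i) for i = 0, ..., k.
    phi, p = np.asarray(phi, float), np.asarray(p, float)
    d = len(phi)
    Phi = np.zeros((d, d))
    Phi[:, 0] = phi                                  # first column: (phi_1, ..., phi_p)
    Phi[np.arange(d - 1), np.arange(1, d)] = 1.0     # ones on the superdiagonal
    eta = np.linalg.matrix_power(Phi, h - 1) @ phi   # eta_h = Phi^{h-1} phi, eq. (3.6)
    e = np.zeros((d, len(p)))
    e[np.arange(d), past] = 1.0                      # e_j(i) = I(Y_{n-j+1} = C_i)
    return eta @ e + (1.0 - eta.sum()) * p           # eq. (3.5)
```

For \(p=1\) this reproduces Theorem 1, since \({\varvec{\Phi }}^{h-1}{\varvec{\phi }}=\phi ^{h}\).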

From the above theorem, the ergodicity of the process can be established as follows.

Proposition 1

Under the above setup, it can be obtained that

$$\begin{aligned} \lim _{h \rightarrow \infty }P(Y_{n+h}=C_{i}|\mathcal {Y}_{n})=p_{i}, \end{aligned}$$

that is, the predicted distribution reduces to the marginal one if one predicts sufficiently far ahead.

Proof

See Appendix B. \(\square \)

Although this property was already discussed in Pegram (1980), here we have proved the result using Theorem 2. An equivalent result was also provided in Jacobs and Lewis (1978c) for the DAR(\(p\)) process; in fact, a generalized result for the NDARMA process is available in Jacobs and Lewis (1983).

3.3 PMA(\(q\)) model

Based on Pegram’s operator, Biswas and Song (2009) proposed a stationary MA(\(q\)) process, denoted by PMA(\(q\)), in the context of discrete time series analysis; it is defined as

$$\begin{aligned} Y_t=(\epsilon _t,\theta _0)*(I(\epsilon _{t-1}),\theta _1)*\cdots *(I(\epsilon _{t-q}),\theta _q), \end{aligned}$$

which implies that for every \(t= 0,\pm 1,\pm 2,\ldots \), the conditional probability function takes the form

$$\begin{aligned} P(Y_t=C_{i}|\epsilon _{t},\epsilon _{t-1},\ldots ,\epsilon _{t-q})&= \theta _{0}I(\epsilon _t=C_{i})+\theta _{1}I(\epsilon _{t-1}=C_{i})\\&+\,\cdots +\theta _q I(\epsilon _{t-q}=C_{i}), \end{aligned}$$

where \(\theta _{i} \ge 0\) for all \(i\), and \(\displaystyle \sum \nolimits _{i=0}^q\theta _{i}=1\). It is easy to see that marginally \(Y_{t} \sim D\{(C_{i},p_{i}),i=0,1,\ldots ,k\}\) for all \(t\). It is to be noted that the PMA(\(q\)) process of Biswas and Song (2009) is indeed equivalent to the DMA(\(q\)) model proposed by Jacobs and Lewis (1978a, b).

3.3.1 Coherent forecasting

Consider a stationary PMA(1) model, then the \(h\)-step ahead forecasting distribution can be obtained as follows:

For \(h=1\),

$$\begin{aligned} p_{1}(i)&= P(Y_{n+1}=C_{i}|\mathcal {Y}_{n})\\&= P(Y_{n+1}=C_{i}|Y_{n}) \\&= \theta _0\theta _1\{I(Y_{n}=C_{i})-p_{i}\} + p_{i}, \end{aligned}$$

and for \(h>1\),

$$\begin{aligned} p_{h}(i)=P(Y_{n+h}=C_{i}|Y_{n}) =p_{i}. \end{aligned}$$

In general, for a stationary PMA(\(q\)) model, the \(h\)-step ahead forecasting distribution has the following somewhat complicated representation. For \( 1\le h \le q\) and \(l=q-1\),

$$\begin{aligned} \begin{array}{lcl} p_{h}(i) &{}=&{} P(Y_{n+h}=C_{i}|Y_{n}=C_{i_{0}},\ldots ,Y_{n-l}=C_{i_{l}}) \\ &{} = &{} \frac{\displaystyle \sum \nolimits _{r_h=0}^q \displaystyle \sum \nolimits _{r_0=0}^q \cdots \displaystyle \sum \nolimits _{r_{l}=0}^q \theta _{r_h} \theta _{r_0} \cdots \theta _{r_{l}}P(\epsilon _{n+h-r_{h}}=C_{i},\epsilon _{n-r_{0}}=C_{i_{0}},\ldots ,\epsilon _{n-l-r_{l}}=C_{i_{l}})}{\displaystyle \sum \nolimits _{r_{0}=0}^q \cdots \displaystyle \sum \nolimits _{r_{l}=0}^q\theta _{r_{0}} \cdots \theta _{r_{l}}P(\epsilon _{n-r_{0}}=C_{i_{0}},\ldots ,\epsilon _{n-l-r_{l}}=C_{i_{l}})}, \end{array} \end{aligned}$$
(3.7)

and for \(h>q\), \(p_{h}(i)=P(Y_{n+h}=C_{i}|\mathcal {Y}_{n})=p_{i}.\)

An explicit expression of the \(h\)-step ahead forecasting distribution for the PMA(2) model is derived in Appendix C.

Thus, the expression for the \(h\)-step ahead forecasting distribution of \(Y_{n+h}\) given the observed values \(Y_1,\ldots ,Y_n\) is quite cumbersome for \(h\ge 2\). To avoid such complicated expressions, we suggest using the following alternative, namely the \(h\)-step ahead forecasting distribution of \(Y_{n+h}\) given only the present observed value \(Y_n\), to obtain the \(h\)-step ahead coherent forecast. The advantage of this conditional distribution is that it has a simple closed form for all \(h\). Specifically, for \(0<h\le q\), we have

$$\begin{aligned} P(Y_{n+h}=C_{i}|Y_n=C_{j})&= \dfrac{P(Y_{n+h}=C_{i},Y_{n}=C_{j})}{P(Y_{n}=C_{j})}\nonumber \\&= \left( \displaystyle \sum _{r=0}^{q-h}\theta _{r} \theta _{r+h}\right) \left\{ I(i=j)-p_{i}\right\} + p_{i}, \end{aligned}$$
(3.8)

and \(P(Y_{n+h}=C_{i}|Y_n)=p_{i}\) for \(h>q\).
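The simple closed form of (3.8) is immediate to implement; a minimal Python sketch (ours):

```python
import numpy as np

def pma_cond_forecast(h, j, theta, p):
    # Conditional distribution of Y_{n+h} given Y_n = C_j under PMA(q),
    # following (3.8); theta = (theta_0, ..., theta_q) with sum(theta) = 1.
    theta, p = np.asarray(theta, float), np.asarray(p, float)
    q = len(theta) - 1
    if h > q:
        return p.copy()                              # beyond lag q: marginal
    c = np.sum(theta[:q - h + 1] * theta[h:])        # sum_{r=0}^{q-h} theta_r theta_{r+h}
    return c * (np.eye(len(p))[j] - p) + p
```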

To study the difference between the conditional distribution of \(Y_{n+h}\) given \(Y_{n}\) presented in (3.8) and the true forecasting distribution given in (3.7), we carried out a simulation study for the PMA(2) process with different choices of the model parameters. We report the results based on \(n=500\) with model parameters \((\theta _{0}, \theta _{1}, \theta _{2})=(0.2,0.6,0.2)\) and marginal distribution \({\mathbf p}=(0.2, 0.1, 0.5, 0.15, 0.05)\) defined on the state space \(S=\{0,1,2,3,4\}\). Based on the simulated data, we obtained the exact forecasting distribution using the formula given in Appendix C and the conditional distribution (3.8) of \(Y_{n+h}\) given the present observation \(Y_{n}\). The fitted forecasting distribution and the fitted conditional distribution are presented in Fig. 1. As one can see, no significant difference is visible. Therefore, one can use the conditional distribution in Eq. (3.8) as an alternative to the actual forecasting distribution in Eq. (3.7), whose expression is quite cumbersome to handle when making coherent forecasts.

Fig. 1
\(h\)-step ahead forecasting and conditional distributions for the PMA(2) process with \((\theta _{0}, \theta _{1}, \theta _{2})=(0.2,0.6,0.2)\) and marginal distribution \(\mathbf {p}=(0.2, 0.1, 0.5, 0.15, 0.05)\)

3.4 PARMA(\(p,q\)) model

Pegram’s operator-based ARMA(\(p,q\)) model, denoted by PARMA(\(p\),\(q\)) and due to Biswas and Song (2009) (equivalent to the NDARMA model of Jacobs and Lewis 1983), can be constructed by combining the PAR(\(p\)) and PMA(\(q\)) models as follows:

$$\begin{aligned} Y_t=(I(Y_{t-1}),\phi _1)*\cdots *(I(Y_{t-p}),\phi _p)*(\epsilon _t,\theta _0)*(I(\epsilon _{t-1}), \theta _{1})*\cdots *(I(\epsilon _{t-q}),\theta _q), \end{aligned}$$

which implies that for every \(t=0,\pm 1,\pm 2,\ldots \), the conditional distribution takes the form

$$\begin{aligned}&P(Y_t=C_{j}|Y_{t-1},\ldots ,Y_{t-p},\epsilon _t,\ldots ,\epsilon _{t-q})\\&\quad = \phi _1I(Y_{t-1}=C_{j})+\cdots +\phi _{p}I(Y_{t-p}=C_{j})\\&\qquad +\,\,\theta _{0} I(\epsilon _t=C_{j})+\cdots +\theta _{q} I(\epsilon _{t-q}=C_{j}), \end{aligned}$$

with \(\theta _j\ge 0\) for all \(j\), \(\phi _i\ge 0\) for all \(i\), and \(\displaystyle \sum _{i=1}^p\phi _i +\displaystyle \sum _{j=0}^q\theta _j=1\).

In particular, the PARMA(1,1) model takes the form

$$\begin{aligned} Y_{t}=\left( I(Y_{t-1}),\phi _1\right) *\left( \epsilon _t,\theta _0\right) *\left( I(\epsilon _{t-1}),\theta _1\right) , \end{aligned}$$

with \(\phi _1,\theta _0,\theta _1 \ge 0\) and \(\phi _1+\theta _0+\theta _1=1\). Marginal stationarity is guaranteed.

It is easy to obtain the \(h\)-step ahead forecasting distribution for the PARMA(1,1) model. For \(h=1\), it is given by

$$\begin{aligned} P(Y_{n+1}=C_{i}|Y_{n}=C_{j})=\phi _{1}I(j=i)+\theta _{0}p_{i}+\theta _{1}\dfrac{\{\theta _{0}I(j=i)+(1-\theta _{0})p_{j}\}p_{i}}{p_j}, \end{aligned}$$

and for \(h>1\),

$$\begin{aligned} P(Y_{n+h}=C_{i}|Y_{n}=C_{j})=\phi _{1}^{h} I(j=i) + (1-\phi _{1}^{h}) p_i. \end{aligned}$$

The forecasting distribution for the PARMA(\(p\),1) model can similarly be obtained as

$$\begin{aligned} p_{1}(i)&= P(Y_{n+1}=C_{i}|Y_{n}=C_{i_{0}},\ldots ,Y_{n-p+1}=C_{i_{p-1}}) \\&= \phi _{1} I(i_{0}=i)+\cdots +\phi _{p} I(i_{p-1}=i)+\theta _{0} p_{i}\\&+\,\,\theta _{1}\dfrac{\{\theta _{0} I(i_{0}=i)+(1-\theta _{0})p_{i_{0}}\}p_{i}}{p_{i_{0}}}\\&= {\varvec{\phi }}^{T} \mathbf {e} +\theta _{0} p_{i} +\theta _{1}\dfrac{\{\theta _{0} I(i_{0}=i)+(1-\theta _{0})p_{i_{0}}\}p_{i}}{p_{i_{0}}}, \end{aligned}$$

where \(\mathbf {e}=\left( I(i_{0}=i), I(i_{1}=i), \ldots , I(i_{p-1}=i)\right) ^{T}\) and for \(h>1\),

$$\begin{aligned} \begin{array}{lcl} p_{h}(i) &{}=&{} \eta _{h1}I(i_0=i)+\cdots +\eta _{hp}I(i_{p-1}=i) +(1-\eta _{h1}-\cdots -\eta _{hp})p_i \\ &{}=&{} {\varvec{\eta }}_{h}^{T}\mathbf {e} + \left( 1-{\varvec{\eta }}_{h}^{T} {\varvec{1}}\right) p_{i}, \end{array} \end{aligned}$$

where the \(h\)-step ahead parameter \({\varvec{\eta }}_{h}\) is given in (3.6). Similarly, for the PARMA(\(p\),2) model and for \(h=1\) we have,

$$\begin{aligned} p_1(i)&= \phi _{1} I(i_{0}=i)+\cdots +\phi _{p} I(i_{p-1}=i)+\theta _{0} p_{i}\\&+\theta _{1}\dfrac{\{\theta _{0} I(i_{0}=i) + (1-\theta _{0}) p_{i_{0}}\}p_{i}}{p_{i_{0}}} \\&+\theta _2\dfrac{\{\theta _0I(i_1=i)+(1-\theta _0)p_{i_1}\}p_i}{p_{i_1}}\\&= {\varvec{\phi }}^{T}\mathbf {e}+ \theta _{0} p_{i}+\theta _{1}\dfrac{\{\theta _{0} I(i_{0}=i) + (1-\theta _{0}) p_{i_{0}}\}p_{i}}{p_{i_{0}}}\\&+\theta _2\dfrac{\{\theta _0I(i_1=i)+(1-\theta _0)p_{i_1}\}p_i}{p_{i_1}}, \end{aligned}$$

and for \(h=2\),

$$\begin{aligned} \begin{array}{lcl} p_2(i)&{}=&{} \phi _1p_1(i)+\phi _2I(i_1=i)+\cdots +\phi _pI(i_{p-1}=i)\\ &{} &{} +\theta _0p_i+\theta _1p_i+\theta _2\dfrac{\{\theta _0I(i_0=i)+(1-\theta _0)p_{i_0}\}p_i}{p_{i_0}}, \end{array} \end{aligned}$$

and for \(h>2\),

$$\begin{aligned} \begin{array}{lcl} p_h(i)&{}=&{}\eta _{h1}I(i_{0}=i)+\cdots +\eta _{hp}I(i_{p-1}=i)+(1-\eta _{h1}-\cdots -\eta _{hp})p_i \\ &{}=&{} {\varvec{\eta }}_{h}^{T}\mathbf {e} + \left( 1-{\varvec{\eta }}_{h}^{T} {\varvec{1}}\right) p_{i}. \end{array} \end{aligned}$$

It can be further extended for the PARMA(\(p\),\(q\)) model.
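As an illustration, the PARMA(1,1) forecasting distribution above admits a direct implementation; a minimal Python sketch (ours; assumes all \(p_{j}>0\)):

```python
import numpy as np

def parma11_forecast(h, j, phi1, theta0, theta1, p):
    # Distribution of Y_{n+h} given Y_n = C_j for the PARMA(1,1) model,
    # with phi1 + theta0 + theta1 = 1 and marginal p over categories 0, ..., k.
    p = np.asarray(p, float)
    ind = np.eye(len(p))[j]                          # the indicator I(j = i)
    if h == 1:
        return (phi1 * ind + theta0 * p
                + theta1 * (theta0 * ind + (1 - theta0) * p[j]) * p / p[j])
    return phi1 ** h * ind + (1 - phi1 ** h) * p     # for h > 1
```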

4 Coherent forecasting for the MTD model

4.1 MTD model

The MTD model was introduced by Raftery (1985); it bypasses the problem of an exponentially increasing number of free parameters in a higher-order Markov chain by specifying the conditional probability of \(Y_t\) given the past as a linear combination of contributions from \(Y_{t-1}, Y_{t-2},\ldots , Y_{t-p}\). More precisely, the MTD(\(p\)) model assumes that

$$\begin{aligned} P(Y_t=C_{i}|Y_{t-1}=C_{i_1},\ldots ,Y_{t-p}=C_{i_p})&=\displaystyle \sum _{j=1}^p\lambda _j P(Y_t=C_{i}|Y_{t-j}=C_{i_j}) \nonumber \\&=\displaystyle \sum _{j=1}^p\lambda _j q_{i_{j}i}, \end{aligned}$$
(4.1)

where \(i, i_1,\ldots ,i_p\in \{0,1,\ldots ,k\}\), the \(q_{i_{j}i}\) are elements of the \((k+1)\times (k+1)\) transition probability matrix \(Q\), and the vector of lag parameters \({\varvec{\lambda }}=(\lambda _1,\ldots ,\lambda _p)^T\) satisfies \(\displaystyle \sum _{j=1}^p\lambda _j=1\), \(\lambda _j\ge 0\) for all \(j\), so that the right-hand side of (4.1) lies between 0 and 1.

4.2 \(h\)-step ahead forecasting distribution

One-step ahead forecasting distribution follows from the model itself, that is

$$\begin{aligned} p_{1}(i)=P(Y_{n+1}=C_{i}|Y_{n}=C_{i_1},\ldots ,Y_{n-p+1}=C_{i_p})=\displaystyle \sum _{l=1}^p\lambda _l q_{i_l i}, \end{aligned}$$
(4.2)

Two-step ahead forecasting distribution is given by

$$\begin{aligned} p_{2}(i)=P(Y_{n+2}=C_{i}|\mathcal {Y}_{n}) =\lambda _{1}\displaystyle \sum _{j=0}^{k} p_{1}(j)\, q_{ji} + \displaystyle \sum _{l=2}^{p}\lambda _{l}\, q_{i_{l-1}i}. \end{aligned}$$
(4.3)

Similarly, the three-step ahead forecasting distribution is obtained by conditioning once more on the intermediate observations, and in the same fashion the result can be extended to any general \(h\), as in the sketch below. In practice, however, it is customary to use this forecasting distribution only for \(h\) less than or equal to 4; beyond that it behaves essentially like the marginal distribution, and its expression also becomes increasingly cumbersome.
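Alternatively, since the MTD(\(p\)) process is a \(p\)th-order Markov chain, the exact \(h\)-step ahead forecasting distribution can be computed numerically for any \(h\) by embedding the process into a first-order chain on \(p\)-tuples of categories. A Python sketch (ours; feasible only for small \(p\) and few categories, since the embedded state space has \((k+1)^{p}\) elements):

```python
import numpy as np
from itertools import product

def mtd_forecast(h, past, lam, Q):
    # past: (Y_n, Y_{n-1}, ..., Y_{n-p+1}) as integers in 0, ..., k;
    # lam:  lag weights (lambda_1, ..., lambda_p); Q: (k+1) x (k+1) tpm.
    lam, Q = np.asarray(lam, float), np.asarray(Q, float)
    p_ord, k1 = len(lam), Q.shape[0]
    states = list(product(range(k1), repeat=p_ord))   # tuples (Y_t, ..., Y_{t-p+1})
    idx = {s: m for m, s in enumerate(states)}
    T = np.zeros((len(states), len(states)))
    for s in states:
        probs = sum(lam[j] * Q[s[j]] for j in range(p_ord))   # model (4.1)
        for i in range(k1):
            T[idx[s], idx[(i,) + s[:-1]]] = probs[i]          # shift the tuple
    dist = np.zeros(len(states)); dist[idx[tuple(past)]] = 1.0
    dist = dist @ np.linalg.matrix_power(T, h)
    # marginalize over the most recent coordinate: P(Y_{n+h} = C_i | past)
    return np.array([dist[[idx[s] for s in states if s[0] == i]].sum()
                     for i in range(k1)])
```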

5 Coherent forecasting for logistic regression model

5.1 Logistic regression model

Some of the inconsistencies associated with standard time series models for count/binary data can be resolved elegantly by logistic time series regression: standard time series models use a simple linear regression on the lagged values, whereas logistic regression uses a generalized linear regression on the lagged values, though stationarity may not be retained. In the context of categorical time series analysis, Fokianos and Kedem (2003) applied the same idea to build regression models for categorical time series. Here, we provide a brief description of the multinomial logistic regression model with the lagged values as covariates, discuss the estimation of the associated parameters, and then derive the \(h\)-step ahead forecasting distribution and its theoretical confidence interval.

Let \(\{Y_t\}, \; t=1,2,\ldots , N\) be a categorical time series with \((k+1)\) categories. In other words, for each \(t\), the possible values of \(Y_{t}\) are \(C_{0}, C_{1}, C_{2}, \ldots , C_{k}\). As mentioned earlier, the assignment of integer values to the categories is a matter of convenience and hence it is not unique.

To reduce the amount of arbitrariness incurred by the assignment of numbers to categories, it is helpful to note that the \(t\)-th observation of any categorical time series regardless of the measurement scale can be expressed by the vector \(\mathbf {Y}_{t} = (Y_{t0},\ldots ,Y_{tq})\) where \(q=k-1\) with elements

$$\begin{aligned} Y_{tj}= {\left\{ \begin{array}{ll} 1, &{} \text{ if } \text{ the } j \text{ th } \text{ category } \text{ is } \text{ observed } \text{ at } \text{ time } t,\\ 0, &{} \text{ otherwise }, \end{array}\right. } \end{aligned}$$
(5.1)

for \(t=1,2,\ldots , N\) and \(j=0,1,\ldots , q\). Let us denote by \(\varvec{\pi }_{t}=(\pi _{t0}, \pi _{t1}, \ldots , \pi _{tq})\) the vector of conditional probabilities given \(\mathcal {F}_{t-1}\), where

$$\begin{aligned} \pi _{tj}=P(Y_{t}=C_{j}|\mathcal {F}_{t-1}), \quad j=0,1,\ldots , q \end{aligned}$$

for every \(t=1,2, \ldots , N\). At times, we refer to the \(\pi _{tj}\) as “transition probabilities”. Define \(Y_{tk}=1-\displaystyle \sum \nolimits _{j=0}^{q}Y_{tj}\) and \(\pi _{tk}=1-\displaystyle \sum \nolimits _{j=0}^{q}\pi _{tj}\).

The multinomial logit model (see Agresti 2002) is given by

$$\begin{aligned} \pi _{tj}(\varvec{\beta })=\dfrac{\exp (\varvec{\beta }_{j}^{T}\mathbf {z}_{t-1})}{1+\displaystyle \sum \nolimits _{l=0}^{q}\exp (\varvec{\beta }_{l}^{T}{\mathbf {z}}_{t-1})}, \qquad j=0,1,\ldots ,q, \end{aligned}$$

and

$$\begin{aligned} \pi _{tk}(\varvec{\beta })=\dfrac{1}{1+\displaystyle \sum \nolimits _{l=0}^{q}\exp (\varvec{\beta }_{l}^{T}{\mathbf {z}}_{t-1})}. \end{aligned}$$

Here \(\varvec{\beta }_{j}, \; j=0,1,\ldots ,q\), are \(d\)-dimensional regression parameters, \(\mathbf {z}_{t-1}\) is the corresponding \(d\)-dimensional vector of stochastic time-dependent covariates independent of \(j\), and \(\varvec{\beta }=(\varvec{\beta }_{0}^{T}, \ldots , \varvec{\beta }_{q}^{T})^{T}\) denotes the \((q+1)d\)-dimensional vector of parameters. A typical vector of covariates \(\mathbf {z}_{t-1}=(1, \mathbf {Y}_{t-1})^{T} =(1, Y_{(t-1)0}, Y_{(t-1)1},\ldots ,Y_{(t-1)q})^{T}\) has dimension \(d=q+2\).

To obtain the maximum partial likelihood estimates (MPLE), we maximize the log partial likelihood function which is given by

$$\begin{aligned} \log PL(\varvec{\beta }) = \displaystyle \sum _{t=1}^{N}\displaystyle \sum _{j=0}^{k} {y_{tj}} \log \pi _{tj}(\varvec{\beta }), \end{aligned}$$
(5.2)

and hence

$$\begin{aligned} \widehat{\varvec{\beta }}_{mple}= \arg \underset{\varvec{\beta } \in \varTheta }{\max } \;\log PL(\varvec{\beta }). \end{aligned}$$
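Numerically, the MPLE can be obtained by minimizing the negative of (5.2) with a general-purpose optimizer. A minimal Python sketch (ours), assuming one-hot responses with the reference category in the last column and the lagged covariates stacked row-wise:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_pl(beta_flat, Y, Z):
    # Y: N x (k+1) one-hot responses (reference category in the last column);
    # Z: N x d covariates z_{t-1}; beta_flat stacks beta_0, ..., beta_q.
    B = beta_flat.reshape(Y.shape[1] - 1, Z.shape[1])
    eta = np.column_stack([Z @ B.T, np.zeros(len(Z))])   # reference: eta = 0
    eta -= eta.max(axis=1, keepdims=True)                # numerical stability
    pi = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)
    return -np.sum(Y * np.log(pi))                       # negative of (5.2)

# beta_hat = minimize(neg_log_pl, np.zeros(k * d), args=(Y, Z), method="BFGS").x
```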

5.2 Coherent forecasting

To obtain the \(h\)-step ahead forecast for categorical time series with \(h>1\), we extend the idea given in Fokianos and Kedem (2003). The 1-step ahead predicted response is obtained by the following rule:

$$\begin{aligned} Y_{n+1} = C_{i} \Leftrightarrow \underset{j}{\max }\;{\pi }_{(n+1)j}(\widehat{\varvec{\beta }})={\pi }_{(n+1)i}(\widehat{\varvec{\beta }}). \end{aligned}$$

Recursively, in the second step we plug this predicted observation into the covariate vector \(\mathbf {z}_{n+1}\), obtain \(\widehat{\pi }_{(n+2)j}, \; j=0,1,\ldots ,k\), and use the above rule to obtain the two-step ahead forecast \(Y_{n+2}\); repeating this process for \(h=3, 4,\ldots \) yields the \(h\)-step ahead forecast values. Note that the \(h\)-step ahead forecasting distribution is simply \(p_{h}(i)=\pi _{(n+h)i}(\varvec{\beta }),\; i=0,1,\ldots ,k\), which can be used to compute the forecasting measures KSD\((\mathbf {p}_{h},\,\widehat{ \mathbf {p}}_{h})\), ED\((\mathbf {p}_{h},\,\widehat{ \mathbf {p}}_{h})\) and MAD\((\mathbf {p}_{h},\,\widehat{ \mathbf {p}}_{h})\) defined in Sect. 2.
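The recursion just described can be sketched as follows (Python; ours). The fitted coefficient vectors are stacked row-wise, and the predicted modal category is fed back into the covariate vector at each step:

```python
import numpy as np

def logistic_forecast(h, y_n, B):
    # B:   k x d matrix with rows beta_0, ..., beta_q (reference category last);
    # y_n: one-hot coding of Y_n of length d - 1 (all zeros for the reference).
    z = np.concatenate(([1.0], y_n))
    for _ in range(h):
        eta = np.append(B @ z, 0.0)              # reference category: eta = 0
        pi = np.exp(eta - eta.max()); pi /= pi.sum()
        j = int(np.argmax(pi))                   # modal (mode-predictor) category
        y = np.zeros(len(z) - 1)
        if j < len(y):                           # reference category codes as zeros
            y[j] = 1.0
        z = np.concatenate(([1.0], y))           # update covariates with prediction
    return j, pi                                 # h-step prediction and p_h(i)
```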

5.3 Confidence interval for the \(h\)-step ahead forecasting distribution

The \(h\)-step ahead forecasting distribution \(p_h(i ;\varvec{\beta })\) is a function of \(\varvec{\beta }\). Using the delta method, the 95 % confidence interval for \(p_h(i ; \varvec{\beta })\) is given by \(p_h(i ;\widehat{\varvec{\beta }})\pm 1.96\,\sigma _h(i;\widehat{\varvec{\beta }})\), where

$$\begin{aligned} \sigma _h^2(i;\varvec{\beta })=(\nabla p_h(i;\varvec{\beta }))^T \{G^{-1}(\varvec{\beta })\}(\nabla p_h(i ;\varvec{\beta })) \quad \text{ and }\quad \varvec{\beta }^T=(\varvec{\beta }_{0}^T,\ldots ,\varvec{\beta }_{q}^T). \end{aligned}$$

Fokianos and Kedem (2003) also suggested a consistent estimator for \(G(\varvec{\beta })\) given by \(\displaystyle \sum \nolimits _{t=2}^{N}\mathbf {Z}_{t-1}\Sigma _t(\varvec{\beta })\mathbf {Z}_{t-1}^T\), where \(\mathbf {Z}_{t-1}\) is the design matrix formed from the covariate vector \(\mathbf {z}_{t-1}\) and \(\Sigma _t(\varvec{\beta })\) is the conditional covariance matrix of \(\mathbf {Y}_{t}\) given \(\mathcal {F}_{t-1}\).

6 Simulation study

To study the finite-sample behavior of the proposed forecasting measures PTP, KSD, ED and MAD and of the cardinality of the prediction set defined in Sect. 2, and to facilitate model comparison through the Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the above forecasting measures, we carried out some simulation studies based on samples generated from the following three categorical time series models with four categories \(\{C_{0}, C_{1}, C_{2}, C_{3}\}\).

(M1):

PAR(1) model with \(\phi =0.8\) and \(\mathbf {p}=(0.2,0.2,0.5,0.1)\),

(M2):

MTD(1) model with the transition probability matrix

$$\begin{aligned} \mathbf {Q}={\begin{pmatrix} 0.85 &{}\quad 0.01 &{}\quad 0.05 &{}\quad 0.09 \\ 0.25 &{}\quad 0.20 &{}\quad 0.35 &{}\quad 0.20 \\ 0.05 &{}\quad 0.10 &{}\quad 0.80 &{}\quad 0.05 \\ 0.05 &{}\quad 0.05 &{}\quad 0.20 &{}\quad 0.70 \end{pmatrix}},\quad \text{ and } \end{aligned}$$
(M3):

Logistic regression model of order 1 with covariates \(\mathbf {z}_{t-1} = (1, \mathbf {Y}_{t-1})^{T} = (1, Y_{(t-1)1}, Y_{(t-1)2}, Y_{(t-1)3})^{T}\) and \(\varvec{\beta }_{0}=(6.80, 5.00, 3.30, 3.90)^{T}\), \(\varvec{\beta }_{1}=(2.45, 4.80, 4.05, 3.90)^{T}\), \(\varvec{\beta }_{2}=(4.05, 5.35, 6.25, 5.50)^{T}\).

To begin with, we generated samples of different sizes from the above three cases, namely M1, M2 and M3, and present the results in Table 1. Five sample sizes are explored: samples of sizes 100 and 300 are used to study the small sample properties, samples of sizes 500 and 1,000 give an idea about the moderate sample properties, and samples of size 5,000 are used to study the large sample properties. For a fixed sample size \(n\), we repeated the process 1,000 times and recorded the percentage of times AIC and BIC select a particular model from the three models under comparison. Table 1 summarizes the results based on the data generated from M1, M2 and M3. As expected, in almost all cases AIC and BIC selected the true data-generating model most of the time, the exception being the second case, M2. In the case of M2, for the small sample size (100), BIC selected the PAR(1) model as the true model in \(10\,\%\) of the replications, although the true data-generating mechanism was MTD(1). This is because the MTD model involves a large number of parameters, which is penalized by BIC.

Table 1 Percentage of times AIC and BIC select the correct model where data are generated from M1, M2, M3

In the second study, samples of size 150 were generated from all three cases M1, M2 and M3. Then, for each case, we fitted all three models under comparison and obtained the forecasting measures PTP, KSD, ED and MAD for varying \(h\). The results based on 5,000 replications are reported in Table 2. As we can see from Table 2, for all three cases the measures KSD, ED and MAD increase as \(h\) increases. This means that forecasting accuracy decreases as one forecasts further ahead, as far as the KSD, ED and MAD are concerned, which is expected. On the other hand, as expected, for all cases the PTP measure decreases as \(h\) increases (see Table 2). Another important observation is that when the data were generated from M1, PAR(1) outperformed the others with respect to all four forecasting measures, whereas MTD(1) and Logistic(1) performed best when data were generated from M2 and M3, respectively, which is also expected. Therefore, we may say that in all these cases the above forecasting measures played a significant role in detecting the true model.

Table 2 Values of forecasting measures PTP, KSD, ED and MAD for varying \(h\) where the data-generating model are M1, M2, M3

In another study, we repeated the previous exercise: we simulated samples of size 150 from all three cases M1, M2 and M3 to study the forecasting accuracy using the HPP set \(\mathcal {S}_{h}\). For each data-generating mechanism, we obtained the \(100(1-\alpha )\,\%\) HPP set \(\mathcal {S}_{h}\) for \(h=1,\ldots ,6\) with \(\alpha =0.2\), using the true data-generating models PAR(1), MTD(1) and Logistic(1). The results based on all three data-generating models are presented in Table 3. As we can see, for all three cases the cardinality of \(\mathcal {S}_{h}\) increases with \(h\), which implies that, to capture the same percentage of true observations, one needs a larger HPP set the further ahead one forecasts. Therefore, the HPP set is also a sensible tool for studying interval forecasting accuracy in discrete-valued time series analysis, especially for categorical time series.

Table 3 \(100(1-\alpha )\,\%\) HPP set of \(Y_{n+h}\) given \(Y_{n}\) for varying \(h\), with the cardinality of the set, where data are generated from M1, M2, M3 and \(\alpha =0.2\)

7 Real data example: infant sleep status data

Stoffer et al. (1988) reported a collection of 24 categorical time series of infant sleep status from an EEG study, divided into two groups of 12 each according to the mother’s drinking habit during pregnancy (one group of mothers abstained from drinking alcohol throughout their pregnancy, and the other group used alcohol moderately and consistently throughout their pregnancy). Each of these 24 time series is observed for 128 min. In this section, we consider one single time series from the first group.

During minute \(t\), the infant’s sleep status was recorded in six categories, namely “qt” being ‘quiet sleep’ with trace alternate, “qh” being ‘quiet sleep’ with high voltage, “tr” being ‘transitional sleep’, “al” being ‘active sleep’ with low voltage, “ah” being ‘active sleep’ with high voltage, and “aw” being ‘awake’. Note that the number of parameters to be estimated is 6 for the PAR(1) model and 30 for the MTD(1) model, which is quite large relative to the series length of 128. On the other hand, since the number of categories is 6, if we want to fit the logistic regression model, \(\mathbf {Y}_{t}\) has 5 components, i.e., \(\mathbf {Y}_{t}=(Y_{t1}, Y_{t2}, Y_{t3}, Y_{t4}, Y_{t5})\). Therefore, the number of parameters to be estimated to fit the logistic regression model of order 1 with covariates \(\mathbf {z}_{t-1}=(1, \mathbf {Y}_{t-1})=(1,Y_{(t-1)1}, Y_{(t-1)2}, Y_{(t-1)3}, Y_{(t-1)4}, Y_{(t-1)5})\) is also 30, estimation being carried out by the partial likelihood method given in Eq. (5.2). Partial likelihood estimates of 30 parameters based on a series of length 128 may not be reliable. Therefore, to bypass the problem, we reduced the number of categories from 6 to 4 by combining the quiet states and the active states, as suggested in Stoffer et al. (2000). Hence the number of parameters to be estimated for the PAR(1) model becomes 4, and it is 12 for both the MTD(1) and Logistic(1) models. After combining the quiet states and the active states, the new labels of the categories are given by

$$\begin{aligned} \text{ qt }\equiv C_{0},\;\; \text{ qh }\equiv C_{0},\;\; \text{ tr }\equiv C_{1},\;\; \text{ al }\equiv C_{2},\;\; \text{ ah }\equiv C_{2},\;\; \text{ aw }\equiv C_{3}. \end{aligned}$$
(7.1)

The proportions of time spent by the infant in the combined sleep states \( C_{0},C_{1},C_{2} \text{ and } C_{3}\) given in (7.1) are 0.414, 0.008, 0.539, and 0.039, respectively. This indicates that the infant spent the most time in active sleep. The combined data are plotted in Fig. 2.

Fig. 2
Plot of the infant sleep status data after combining some states

It is important to mention that, although the infant sleep status data are ordinal in nature, it may not be appropriate to use the ACF and PACF plots to choose the correct order. This is because the values of the ACF and PACF depend on the actual numerical scaling of the categories and change from one scaling to another. In practice, there is no unique numerical scaling for such ordinal categories; we may at most say that the four scale values should satisfy \(C_0<C_1<C_2<C_3\), but cannot specify the values of \(C_0,C_1,C_2,C_3\). Hence some alternative measures of serial association, which do not depend on the numerical scaling of the categories, should be used to select the order of the process. Weiß and Göb (2008) established a theorem giving an empirical justification of the adequacy of the NDARMA(\(p\), \(q\)) model for observed categorical data (see Theorem 5.2 in Weiß and Göb 2008). The theorem says that the estimates \(\widehat{\kappa }(h)\) of Cohen’s \(\kappa \), \(\widehat{v}(h)\) of Cramér’s \(v\), and the square root of the estimate \(\widehat{A}_{\nu }^{(\tau )}(h)\) of Goodman and Kruskal’s \(\tau \) will be approximately equal across lags \(h\) if the NDARMA(\(p\), \(q\)) model is adequate for the data. The formulae for these measures and their estimates are given in detail in Weiß and Göb (2008) and Weiß (2011, 2013). To select the order of the NDARMA(\(p\), \(q\)) model, they proposed to examine the usual PACF \(\rho _{\text {p}}(h)\) computed from the estimates \(\widehat{\kappa }(h)\) of Cohen’s \(\kappa \) in place of the ACF \(\rho (h)\).
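For reference, one common form of the lag-\(h\) Cohen’s \(\kappa \) (as used in Weiß and Göb 2008) is \(\kappa (h)=\bigl (\sum _{j} p_{jj}(h)-\sum _{j} p_{j}^{2}\bigr )/\bigl (1-\sum _{j} p_{j}^{2}\bigr )\), where \(p_{jj}(h)=P(Y_{t}=C_{j}, Y_{t-h}=C_{j})\); it can be estimated directly from the observed series. A minimal Python sketch (ours):

```python
import numpy as np

def cohen_kappa(y, h):
    # y: categorical series coded 0, ..., k; h: lag (h >= 1).
    y = np.asarray(y)
    p = np.bincount(y, minlength=y.max() + 1) / len(y)   # marginal proportions
    agree = np.mean(y[h:] == y[:-h])                     # estimates sum_j p_jj(h)
    return (agree - np.sum(p ** 2)) / (1.0 - np.sum(p ** 2))
```

For a PAR(1) process this quantity equals \(\phi ^{h}\), in line with the derivation used in the simulation study below.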

For the infant sleep status data, we obtained the values of these measures for various lags and present the results in Table 4. As we can see, the three measures, namely Cohen’s \(\kappa \), Cramér’s \(v\) and the square root of Goodman and Kruskal’s \(\tau \), are approximately equal at each lag, which supports fitting the data by a PAR(\(p\)) process. On the other hand, the Cohen’s \(\kappa \)-based PACF estimates \(\widehat{\rho }_{\text {p}}(h)\) are approximately 0 for \(h>1\). Therefore, a PAR(1) model, which is the same as the DAR(1) model, is an appropriate fit to the data.

Table 4 Estimated values of \(\kappa (h)\), \(v(h)\), \(A_{\nu }^{(\tau )}(h)\) and Cohen’s \(\kappa \)-based partial autocorrelation \(({\rho }_{\text {p}}(h))\) for the infant sleep status data

In addition, to study the effectiveness of Cohen’s \(\kappa \) as a measure, we derived it for the PAR(1) model, for which it equals \(\phi ^{h}\) and hence decreases as the lag \(h\) increases. Based on this result, we performed a simulation study. We generated samples of sizes \(n=200\), \(1{,}000\) and \(10{,}000\) from the PAR(1) model with four categories, mixing parameter \(\phi =0.4,0.6,0.8\) and common marginal distribution \(\mathbf {p}=(0.414,0.008, 0.539, 0.039)\). Figure 3 displays the theoretical \(\kappa (h)\) (in black) together with the empirical \(\kappa (h)\) (in gray) for varying \(h\). We see that as the sample size increases, the empirical \(\kappa (h)\) coincides with the theoretical \(\kappa (h)\). Based on this observation, we fitted the PAR(1) model to the infant sleep status data, obtained the empirical and theoretical values of \(\kappa (h)\) for various \(h\), and present them in Fig. 4. As we can see from Fig. 4, the empirical \(\kappa (h)\) obtained from the data agrees with that of the fitted PAR(1) model.

Fig. 3
Theoretical and empirical values of Cohen’s \(\kappa \) for various lags \(h\). Samples are generated from the PAR(1) model with four categories, mixing parameter \(\phi =0.4,0.6,0.8\) and sample sizes \(n=200\), \(1{,}000\), \(10{,}000\), with common marginal distribution \(\mathbf {p}=(0.414,0.008, 0.539, 0.039)\)

Fig. 4
Plot of Cohen’s \(\kappa \) for varying lag values for the infant sleep data

The transition probabilities for the MTD(1) model are obtained through sample proportions, whereas the parameters of the PAR(1) and order-1 logistic regression models are estimated by the partial likelihood method. The estimated value of the mixing parameter \(\phi \) of the PAR(1) model is 0.78, which indicates that a large number of paired observations \((Y_t, Y_{t-1})\) with \(Y_t=Y_{t-1}\) are present in the data; the PAR(1) model is therefore a strong candidate for the data. The other parameter associated with the PAR(1) model is the marginal distribution \(\mathbf {p}\), which is estimated as \((0.414, 0.008, 0.539, 0.039)\). Similarly, the transition probability matrix (tpm) \(\mathbf {Q}\) associated with the MTD(1) model is estimated as

$$\begin{aligned} {\begin{pmatrix} 0.869 &{}\quad 0.019 &{}\quad 0.115 &{}\quad 0\\ 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 \\ 0.087 &{}\quad 0 &{}\quad 0.898 &{}\quad 0.014\\ 0.200 &{}\quad 0 &{}\quad 0 &{}\quad 0.800\\ \end{pmatrix}}. \end{aligned}$$

To fit the logistic regression model, we used the setup discussed in Eq. (5.1) in Sect. 5. Note that, after combining the states, the data have four categories and hence \(\mathbf {Y}_{t}\) has three components, i.e., \(\mathbf {Y}_{t}=(Y_{t1}, Y_{t2}, Y_{t3})^{T}\). Based on this multivariate representation, we plotted the sample autocorrelation and cross-correlation functions in Fig. 5. As one can see, there is a decreasing pattern in the first and last plots of Fig. 5, which indicates that \(Y_{t}\) depends only on its lagged values, with no periodic term (e.g., a sinusoidal term) among its covariates. Therefore, we fitted the logistic regression model with covariates \(\mathbf {z}_{t-1}=(1, \mathbf {Y}_{t-1})^{T}=(1,Y_{(t-1)1}, Y_{(t-1)2}, Y_{(t-1)3})^{T}\) (we call it the Logistic(1) model with intercept). The parameters associated with the model were estimated as

$$\begin{aligned}&\varvec{\beta _{0}}=(6.80, 5.00, 3.30, 3.90)^{T}, \quad \varvec{\beta _{1}}=(2.45,4.80,4.05,3.90)^{T},\quad \text{ and } \\&\varvec{\beta _{2}}=(4.05,5.35,6.25,5.50)^{T}. \end{aligned}$$

After fitting the above models, we obtained the AIC and BIC for all three models and present them in Table 5. As we can see, the PAR(1) model has the lowest AIC and BIC values. In addition, we obtained the PTP measure by dividing the data into two parts: the training part, consisting of the first 110 observations, was used to fit the models under comparison, and the PTP measure was then computed on the remaining 18 observations; it is also presented in Table 5. As we can see, the PAR(1) model outperforms the MTD(1) and Logistic(1) models in predicting the true observations. Hence, overall, the PAR(1) model fits the data best among the three competing models.

Fig. 5
Sample autocorrelation and cross-correlation functions for the infant sleep status data

Table 5 Infant sleep status data analysis

8 Concluding remarks

The basic objective of the present paper is to study different methods of coherent forecasting and their forecasting accuracy, based on the forecasting measures defined in Sect. 2, including interval forecasting, in the context of discrete-valued time series, especially categorical data. Theoretical results and some simulation studies, together with a real data analysis of infant sleep status, have illustrated the proposed methods.

Note that when the time series data are categorical, popular measures of forecasting accuracy like the PRMSE and PMAE cannot be used. Therefore, to study the forecasting accuracy for categorical time series, we have defined different measures, namely PTP, KSD, ED and MAD. Through some extensive simulation studies, the efficacy of these measures has been checked. In addition, we have introduced a different notion of interval forecasting for categorical time series analysis, whose efficacy has also been checked using simulation. Hence, we can say that these measures can be used in practice for the analysis of categorical time series data.

On the other hand, a comparison study has been performed using these forecasting methods. Note that Pegram’s operator-based AR(\(p\)), MA(\(q\)) and ARMA(\(p\),\(q\)) models are applicable to both count and categorical data (see, e.g., Biswas and Song 2009; Biswas and Guha 2009). However, the MTD model due to Raftery (1985) and the logistic regression model due to Fokianos and Kedem (2003) have a serious drawback: the number of parameters to be estimated is very large when the number of categories exceeds three, which makes them difficult to implement. In addition, as observed in the simulation study, even when the data are generated from the MTD model, its BIC may exceed that of Pegram’s AR model because of the large number of parameters in the MTD model; as a result, BIC may select a competing model as the true model even though the data-generating mechanism is the MTD model. The logistic regression models, on the other hand, lack stationarity unless the parameters are appropriately adjusted. Pegram’s ARMA model is structurally simple, stationary, and involves fewer parameters than the MTD and logistic models; it also has many elegant theoretical properties. Hence, it can be a good choice in many practical situations.