1 Introduction

Claude Shannon’s seminal 1948 article “A Mathematical Theory of Communication” introduced a definition for entropy as a well-motivated measure of randomness [1]. He further identified entropy rate \(h_\mu \) as a measure of the minimal coding cost of a series of potentially correlated symbols in his celebrated first theorem. In 1989, Crutchfield and Young identified statistical complexity \(C_\mu \) as the entropy of causal states [2], which are the minimal sufficient statistics of prediction [3]. Said simply, \(h_\mu \) is a measure of a process’ intrinsic randomness and \(C_\mu \) a measure of process structure. Both entropy rate and statistical complexity have given insight into many disparate complex systems, from chaotic crystallography [4], biomolecule dynamics [5,6,7], neural spike trains [8], and animal behavior [9] to stochastic resonance [10], geomagnetic volatility [11], hydrodynamic flows [12, 13], and fluid and atmospheric turbulence [14, 15].

These two measures of complexity are unified by the Kolmogorov–Chaitin complexity of a discrete object, which is the size of the minimal Universal Turing Machine program that produces the object [16, 17]. Specifically, the expected Kolmogorov–Chaitin complexity \(\langle K(x_\ell )\rangle \) of a (discrete-time, discrete-symbol) time series \(x_\ell \) of length \(\ell \) grows at the Shannon entropy rate and has an offset determined by the statistical complexity: \(\langle K(x_\ell )\rangle \propto _{\ell \rightarrow \infty } C_\mu + \ell h_\mu \), when these quantities exist [18].

Perhaps somewhat surprisingly, estimators of the entropy rate and statistical complexity of continuous-time, discrete-symbol processes are lacking. This is unfortunate since these processes are encountered very often in the physical, chemical, biological, and social sciences as sequences of discrete events consisting of an event type and an event duration or magnitude. An example critical to infrastructure design occurs in the geophysics of crustal plate tectonics, where the event types are major earthquakes tagged with their duration, the time between occurrences, and an approximate or continuous Richter magnitude [19]. Understanding these processes’ randomness and structure bears directly on averting human suffering. Another example is revealed in the history of reversals of the earth’s geomagnetic field [11], which shields the planet’s life from exposure to damaging radiation. An example from physical chemistry is found in single-molecule spectroscopy, which reveals molecular dynamics as hops between conformational states that persist for randomly distributed durations [6, 7]. The structure of these conformational transitions is implicated in biomolecular functioning and so is key to life processes. A common example from neuroscience is found in the spike trains generated by neurons, consisting of spike-no-spike event types separated by continuous interspike intervals. The structure and randomness of spike trains are key to delineating how tissues support essential information processing in nervous systems and brains [8]. Finally, a growing set of these processes appears in the newly revitalized quantitative social sciences, in which human communication events and their durations are monitored as signals of emergent coordination or competition [20].

Here, we provide new closed-form expressions for the entropy rate and statistical complexity of continuous-time, discrete-symbol processes by first identifying the causal states of stochastic processes generated by continuous-time unifilar hidden semi-Markov models, a restricted class of the models described in Ref. [21]. In these processes, successive dwell times for the symbols (or events) are drawn based on the process’ current “hidden” state. Transitions from hidden state to hidden state then follow a rule that mimics transitions in discrete-time unifilar hidden Markov models. The resulting output process consists of the symbols emitted during the dwell times. Identifying the process’ causal states leads to new expressions for entropy rate, generalizing one of the results given in Sect. 7.4 of Ref. [22], and for statistical complexity. The hidden semi-Markov process class we analyze here is sufficiently general that we conjecture our results yield universal estimators of the entropy rate and statistical complexity of continuous-time, discrete-event processes.

To start, we define unifilar hidden semi-Markov processes and their generators, determine their causal states, and use the causal states to calculate their entropy rate and statistical complexity. We conclude by describing a method for using the expressions given here to estimate the entropy rate and statistical complexity of continuous-time, discrete-event time series.

2 Hidden Semi-Markov Processes and Their Unifilar Generators

The continuous-time, discrete-symbol process \(\ldots ,(X_{-1},\mathcal {T}_{-1}),(X_0,\mathcal {T}_0),(X_1,\mathcal {T}_1),\ldots \) has realizations \(\ldots (x_{-1}, \tau _{-1}), (x_0,\tau _0), (x_1,\tau _1)\ldots \). Events are symbol-duration pairs \((x_i,\tau _i)\) that occur sequentially in a process. For symbols, we demand that \(x_i\ne x_{i+1}\) to enforce a unique description of the process. In other words, the discrete event symbol \(x_i\in \mathcal {A}\) appears for a total time of \(\tau _i\). The present is located almost surely during the emission of \(x_0\), and we denote the time since last emission as \(\tau _{0^+}\) and the time to next emission as \(\tau _{0^-}\). We also, for clarity, denote the last-appearing symbol as \(x_{0^+}\) and the next-appearing symbol as \(x_{0^-}\); though obviously \(x_{0^+}=x_{0^-}\) almost surely. (It follows that \(\tau _{0^+}+\tau _{0^-}=\tau _0\).) The past \(\ldots ,(X_{-1},\mathcal {T}_{-1}),(X_0,\mathcal {T}_{0^+})\) is denoted \( \smash {\overleftarrow{(X,\mathcal {T})}} \) and the future \((X_0,\mathcal {T}_{0^-}),(X_1,\mathcal {T}_1),\ldots \) is denoted as \( \smash {\overrightarrow{(X,\mathcal {T})}} \), with realizations denoted \( \smash {\overleftarrow{(x,\tau )}} \) and \( \smash {\overrightarrow{(x,\tau )}} \), respectively.

A continuous-time, discrete-symbol process’ causal states \( \mathcal {S} \) are the equivalence classes of pasts defined by the relation:

$$\begin{aligned} \smash {\overleftarrow{(x,\tau )}} \sim _\epsilon { \smash {\overleftarrow{(x,\tau )}} }^{\prime }\iff \Pr \bigg ( \smash {\overrightarrow{(X,\mathcal {T})}} \bigg | \smash {\overleftarrow{(X,\mathcal {T})}} = \smash {\overleftarrow{(x,\tau )}} \bigg ) = \Pr \bigg ( \smash {\overrightarrow{(X,\mathcal {T})}} \bigg | \smash {\overleftarrow{(X,\mathcal {T})}} = { \smash {\overleftarrow{(x,\tau )}} }^{\prime }\bigg ). \end{aligned}$$
(1)

This mimics the relation for the causal states of discrete-time processes [3]. A process’ statistical complexity is the entropy of these causal states: \(C_\mu = {\text {H}}[ \mathcal {S} ]\). A process’ prescient states are the classes of any finer-grained refinement of the causal-state partition, as for discrete-time processes [3]. Thus, there is a fundamental distinction between a process’ observed or emitted symbols and its internal states. In this way, we consider general processes as hidden processes.

Causal states can be discrete random variables, continuous random variables, or mixed random variables. For discrete-event, continuous-time processes, the last of these is the likeliest. The entropy of discrete random variables and the differential entropy of continuous random variables are both given in Ref. [23], while the entropy of mixed random variables is described in Ref. [24]; all are represented as \({\text {H}}[X]\), where X is the random variable. Conditional entropy \({\text {H}}[X|Y]\) is then defined as \(\langle {\text {H}}[X|Y=y] \rangle _y\). The mutual information between random variables X and Y is given by \({\text {I}}[X;Y] = {\text {H}}[X]-{\text {H}}[X|Y]\), and numerous other definitions are given in Ref. [23].

A hidden semi-Markov process (HSMP) is a continuous-time, discrete-symbol process generated by a hidden semi-Markov model (HSMM). A HSMM is described via a hidden-state random variable \(\mathcal {G}\) with realization \(g\), an emission probability \(T^{(x)}_{g}\), which is the probability of emitting \(x\) when in hidden state \(g\), and a dwell-time distribution \(\phi _{g}(\tau )\). In other words, in hidden state \(g\), symbol \(x\) is emitted with probability \(T^{(x)}_{g}\) for time \(\tau \) drawn from \(\phi _{g}(\tau )\). For reasons that will become clear, we focus on a restricted form—the unifilar HSMM (uHSMM). For these, the present hidden state \(g_0\) is uniquely determined by the past emitted symbols \(x_{-\infty :0} = \ldots ,x_{-2},x_{-1}\). See Fig. 1. In an abuse of notation, Eq. (1) determines a function \(\epsilon (x_{-\infty :0})\) that takes the past emitted symbol sequence \(x_{-\infty :0}\) to the underlying hidden state \(g_0\). (The abuse comes from suppressing the dependence on durations.) This is the analog appropriate to this setting of the familiar definition of unifilarity in discrete-time models [3]—that symbol and state uniquely determine next state.
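To make the generative procedure concrete, the following minimal sketch (in Python) simulates a uHSMM of this kind. The two hidden states, the gamma dwell-time distributions, the emission probabilities, and the transition rule below are hypothetical choices made only for illustration, arranged so that consecutive emitted symbols differ, in keeping with the convention \(x_i\ne x_{i+1}\); they are not the model of Fig. 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-state uHSMM (not the model of Fig. 1), chosen so that
# consecutive emitted symbols always differ, per the convention x_i != x_{i+1}.
#   T[g][x]     -- emission probability T_g^{(x)} of symbol x from hidden state g
#   phi[g]      -- sampler for the dwell-time distribution phi_g
#   eps[(g, x)] -- next hidden state, uniquely determined by (g, x) (unifilarity)
T   = {"A": {0: 1.0}, "B": {1: 0.5, 2: 0.5}}
phi = {"A": lambda: rng.gamma(shape=2.0, scale=1.0),
       "B": lambda: rng.gamma(shape=5.0, scale=0.5)}
eps = {("A", 0): "B", ("B", 1): "A", ("B", 2): "A"}

def generate(g, n_events):
    """Emit n_events (symbol, dwell-time) pairs, starting from hidden state g."""
    events = []
    for _ in range(n_events):
        symbols = list(T[g])
        x = symbols[rng.choice(len(symbols), p=list(T[g].values()))]
        events.append((x, phi[g]()))   # symbol x is held for a random dwell time
        g = eps[(g, x)]                # unifilar transition on (state, symbol)
    return events

print(generate("A", 5))
```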

3 Causal Architecture

The challenge now is to identify a given HSMP’s causal states. With the causal states in hand, we can calculate the process’ statistical complexity and entropy rate rather straightforwardly. Theorem 1 makes the identification.

Theorem 1

A unifilar hidden semi-Markov process’ causal states are the triple \((g_0,x_{0^+},\tau _{0^+})\), under weak assumptions specified below.

Fig. 1 Generative model for a unifilar hidden semi-Markov process. The notation \(p(x)|x,\tau \sim \phi _g\) means that x is emitted with probability p(x) from hidden state g with emission time drawn from \(\phi _g\)

Proof

To identify causal states, we need to simplify the conditional probability distribution:

$$\begin{aligned} p \bigg ( \smash {\overrightarrow{(x,\tau )}} \bigg | \smash {\overleftarrow{(x,\tau )}} \bigg ) = \Pr \bigg ( \smash {\overrightarrow{(X,\mathcal {T})}} = \smash {\overrightarrow{(x,\tau )}} \bigg | \smash {\overleftarrow{(X,\mathcal {T})}} = \smash {\overleftarrow{(x,\tau )}} \bigg ). \end{aligned}$$

When the process is generated by a unifilar hidden semi-Markov model, this conditional probability distribution simplifies to:

$$\begin{aligned} p \bigg ( \smash {\overrightarrow{(x,\tau )}} \bigg | \smash {\overleftarrow{(x,\tau )}} \bigg )&= p \big ((x_{0^-},\tau _{0^-}),(x_1,\tau _1),\ldots \big |\ldots ,(x_{-1},\tau _{-1}),(x_{0^+},\tau _{0^+}) \big ) \nonumber \\&= p \big ((x_1,\tau _1),\ldots \big |\ldots ,(x_0,\tau _0) \big ) p \big (\tau _{0^-} \big |\ldots ,(x_{-1},\tau _{-1}),(x_{0^+},\tau _{0^+}),x_{0^-} \big ) \nonumber \\&\quad \times p \big (x_{0^-} \big |\ldots ,(x_{-1},\tau _{-1}),(x_{0^+},\tau _{0^+}) \big ) , \end{aligned}$$
(2)

where we have \(\tau _0 = \tau _{0^+} + \tau _{0^-}\). Almost surely, we have:

$$\begin{aligned} p \big (x_{0^-}\big | \ldots ,(x_{-1},\tau _{-1}),(x_{0^+},\tau _{0^+}) \big ) = \delta _{x_{0^+},x_{0^-}} \end{aligned}$$
(3)

and, due to the unifilarity constraint:

$$\begin{aligned} p \big ( \tau _{0^-} \big | \ldots ,(x_{-1},\tau _{-1}),(x_{0^+},\tau _{0^+}),x_{0^-} \big )&= p \big (\tau _{0^-} \big |\tau _{0^+},g_0 = \epsilon (x_{-\infty :0}) \big ). \end{aligned}$$
(4)

Together, Eqs. (2)–(4) imply that the triple \((g_0,x_{0^+},\tau _{0^+})\) is a prescient statistic.

These states are causal (minimal and prescient) when three conditions hold: first, when the number of hidden states \(g\) is the smallest possible; second, when \(\phi _{g}(\tau )\) takes neither the “eventually Poisson” nor the “eventually \(\Delta \)-Poisson” form described in Ref. [25]; and third, when no two of the “conveyor belts” shown in Fig. 3 merge.

Each of these conditions is worth spelling out.

The first condition states that of all such unifilar hidden semi-Markov models which generate the given process, we choose the model with the smallest number of hidden states. Two hidden states can be considered equivalent when they have identical future morphs, \(p( \smash {\overrightarrow{(x,\tau )}} |g,x_{0^+},\tau _{0^+})\). Here, Eqs. (2)–(4) aid in detecting such identical hidden states.

As for the second condition, to avoid an eventually Poisson dwell-time distribution, we demand that there be no \(\lambda >0\) and \(T\ge 0\) such that \(\phi _{g}(t) = \phi _{g}(T) e^{-\lambda (t-T)}\) for all \(t\ge T\). To avoid an eventually \(\Delta \)-Poisson dwell-time distribution, we demand that \(\phi _{g}(t)\) cannot be written as:

$$\begin{aligned} \phi _{g}(t) = \phi _{g} \big (T+(t-T) \!\!\!\! \mod \Delta \big ) e^{-\lambda \lfloor (t-T)/\Delta \rfloor } , \end{aligned}$$

for any \(0<\Delta ,\lambda ,T<\infty \). Almost all naturally occurring dwell-time distributions take neither of these forms, though a Poisson neuron with a refractory period generates an eventually Poisson renewal process.
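To make the excluded form concrete, consider the Poisson neuron with a refractory period just mentioned: if the neuron fires at rate \(\lambda \) once an absolute refractory period of length \(T\) has elapsed, its interspike-interval density is

$$\begin{aligned} \phi (t) = {\left\{ \begin{array}{ll} 0 &{}\quad t< T \\ \lambda e^{-\lambda (t-T)} &{}\quad t\ge T \end{array}\right. } , \end{aligned}$$

which satisfies \(\phi (t) = \phi (T) e^{-\lambda (t-T)}\) for all \(t\ge T\): beyond \(t=T\), the time since the last spike carries no additional predictive information.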

Alternatively, and this is the third condition, it is possible that \(\epsilon (g_0,x_0) = \epsilon (g_0',x_0)\) and \(p(\tau _{0^-}|g_0,\tau _{0^+}) = p(\tau _{0^-}|g_0',\tau _{0^+}')\) for all \(\tau _{0^-}\). Here, the notation is overloaded: \(\epsilon (g_0,x_0)\) is the hidden state to which we transition given that we are in hidden state \(g_0\) and observe symbol \(x_0\). In this case, the conditional probability distributions in Eq. (2) are equal, despite the different hidden states and potentially different times since last event. Such an equality would engender merging of the renewal process-like conveyor belts in Fig. 3. And so, we stipulate that there does not exist a pair \(g_0\ne g_0'\) and \(\tau _{0^+}\) and \(\tau _{0^+}'\) such that \(p(\tau _{0^-}|g_0,\tau _{0^+}) = p(\tau _{0^-}|g_0',\tau _{0^+}')\) for all \(\tau _{0^-}\). This condition, described in more detail below, is shown in Fig. 2 and is almost always met for the applications cited earlier. Thus, we say that the typical unifilar hidden semi-Markov process’ causal states are given by the triple \((g_0,x_{0^+},\tau _{0^+})\).

The statistics described in Theorem 1 are still prescient statistics—i.e., sufficient statistics of prediction—even when the weak assumptions are not met. In discrete-time applications, this would imply that the entropy of those prescient states upper-bounds the statistical complexity. In continuous-time applications, prescient states and causal states are often mixed random variables and, so, no such upper bound can be stated generally.

As a result, it is well worth spelling out the least intuitive of these weak assumptions. The third weak assumption stipulated above is best illustrated by an example. To recapitulate, “typically” causal states are the triple \((g_0,x_0,\tau _{0^+})\): hidden state, observed symbol, and time since last symbol. Usually, \((g_0,x_0,\tau _{0^+})\) and \((g_0',x_0',\tau _{0^+}')\) are not predictively equivalent when \(g_0\ne g_0'\) and the uHSMM has the minimal number of states possible (the first weak assumption). However, to engineer such an equivalence without violating this first weak assumption, one first needs the two observed symbols to be identical, \(x_0=x_0'\). Next, one needs the hidden state to which we transition to be identical, so that:

$$\begin{aligned} p((x_1,\tau _1)|g_0,x_0,\tau _0) = p(x_1|\epsilon (g_0,x_0)) \phi _{\epsilon (g_0,x_0)}(\tau _1). \end{aligned}$$

And, finally, one needs:

$$\begin{aligned} p(\tau _{0^-}|g_0,\tau _{0^+}) = \frac{\phi _{g_0}(\tau _{0^+}+\tau _{0^-})}{\Phi _{g_0}(\tau _{0^+})} \end{aligned}$$

to be equal to \(p(\tau _{0^-}|g_0',\tau _{0^+}')\), for all \(\tau _{0^-}\ge 0\). This last equality can be obtained in myriad ways. For instance, in the example of Fig. 2, we choose:

$$\begin{aligned} \phi _C(t) = {\left\{ \begin{array}{ll} \left( \int _0^T \phi _A(t') dt'\right) \delta \left( t-\frac{T}{2}\right) &{}\quad t\le T \\ \phi _A(t) &{}\quad t>T \end{array}\right. } . \end{aligned}$$
(5)

Then, the triple (A, 0, t) is predictively equivalent to (C, 0, t) for \(t>T\) and the originally distinct conveyor belts corresponding to t “merge”.
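To spell out the merging, note that Eq. (5) leaves the tail of the dwell-time distribution and its normalization untouched, so that

$$\begin{aligned} \Phi _C(t) = \int _t^{\infty } \phi _C(t') dt' = \int _t^{\infty } \phi _A(t') dt' = \Phi _A(t) \quad \text {for } t > T, \end{aligned}$$

and hence, for \(t>T\) and all \(\tau _{0^-}\ge 0\),

$$\begin{aligned} p(\tau _{0^-}|C,t) = \frac{\phi _C(t+\tau _{0^-})}{\Phi _C(t)} = \frac{\phi _A(t+\tau _{0^-})}{\Phi _A(t)} = p(\tau _{0^-}|A,t) . \end{aligned}$$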

Fig. 2 With \(\phi _C(t)\) given by Eq. (5), the triple (A, 0, t) is predictively equivalent to (C, 0, t) for \(t>T\)

Calculating information measures of a time series exactly often requires finding the probability distribution \(p(g_0,x_{0^+},\tau _{0^+})\) over causal states. For instance, to find a process’ statistical complexity \(C_\mu \), we must determine \({\text {H}}[ \mathcal {S} ]\) which, in turn, entails finding the probability distribution \(p(g_0,x_{0^+},\tau _{0^+})\). Implicitly, we are deriving labeled transition operators as in Ref. [25]. We start by decomposing:

$$\begin{aligned} p(g_0,x_{0^+},\tau _{0^+}) = p(g_0) p(x_{0^+}|g_0) p(\tau _{0^+}|g_0,x_{0^+}). \end{aligned}$$

Since the dwell-time distribution depends only on the hidden state \(g_0\) and not on the emitted symbol \(x_{0^+}\), we find that:

$$\begin{aligned} p(\tau _{0^+}|g_0,x_{0^+}) = p(\tau _{0^+}|g_0). \end{aligned}$$

As in Sect. IV A of Ref. [25], observing a time \(\tau _{0^+}\) since the last event requires a total dwell time of at least \(\tau _{0^+}\), so that:

$$\begin{aligned} p(\tau _{0^+}|g_0)&= \int _{\tau _{0^+}}^{\infty } p(\tau _0|g_0) d\tau _0 \nonumber \\&= \mu _{g_0} \Phi _{g_0}(\tau _{0^+}) , \end{aligned}$$
(6)

where \(\Phi _{g_0}(\tau _{0^+}) := \int _{\tau _{0^+}}^{\infty } \phi _{g_0}(t) dt\) will be called the survival distribution, and:

$$\begin{aligned} \mu _{g_0} = 1 \big / \int _0^{\infty } t\phi _{g_0}(t) dt \end{aligned}$$

is the inverse mean interevent interval from hidden state \(g_0\). From the setup, we also have:

$$\begin{aligned} p(x_{0^+}|g_0) = T_{g_0}^{(x_{0^+})}. \end{aligned}$$
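As a simple special case of this decomposition (used here only for illustration), take an exponential dwell time in hidden state \(g_0\), \(\phi _{g_0}(\tau ) = \lambda e^{-\lambda \tau }\). Then

$$\begin{aligned} \Phi _{g_0}(\tau _{0^+}) = e^{-\lambda \tau _{0^+}} , \qquad \mu _{g_0} = 1 \big / \int _0^{\infty } t \lambda e^{-\lambda t} dt = \lambda , \qquad p(\tau _{0^+}|g_0) = \lambda e^{-\lambda \tau _{0^+}} , \end{aligned}$$

so the time since the last event is distributed exactly as a fresh dwell time, as expected for a memoryless dwell-time distribution.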

Finally, to calculate \(p(g_0)\), we consider all ways in which probability can flow from \((g',x',\tau ')\) to \((g,x,0)\):

$$\begin{aligned} p(g,x,0)&= \sum _{g',x'} \int _0^{\infty } p(g',x',\tau ') p((g',x',\tau ')\rightarrow (g,x,0)) d\tau '. \end{aligned}$$
(7)

The transition probability \(p((g',x',\tau ')\rightarrow (g,x,0))\) is:

$$\begin{aligned} p((g',x',\tau ')\rightarrow (g,x,0)) = \delta _{g,\epsilon (g',x')} T_{g}^{(x)} \frac{\phi _{g'}(\tau ')}{\Phi _{g'}(\tau ')}. \end{aligned}$$
(8)

The factor \(\delta _{g,\epsilon (g',x')}\) enforces that one can transition to \(g\) from \(g'\) only if the emitted symbol \(x'\) allows it. The factor \(T_{g}^{(x)}\) is the probability of emitting symbol \(x\) from the newly-transitioned-to hidden state \(g\). And \(\phi _{g'}(\tau ') / \Phi _{g'}(\tau ')\) is the probability density that \(x'\) is emitted for a total time of exactly \(\tau '\), given that \(x'\) has already been emitted for time at least \(\tau '\). Combining Eq. (8) with Eq. (7) gives:

$$\begin{aligned} p(g) T_{g}^{(x)} \mu _{g}&= \sum _{g',x'} \int _0^{\infty } p(g') T_{g'}^{(x')} \mu _{g'} \Phi _{g'}(\tau ') \delta _{g,\epsilon (g',x')} T_{g}^{(x)} \frac{\phi _{g'}(\tau ')}{\Phi _{g'}(\tau ')} \\ p(g)&= \frac{1}{\mu _{g}} \sum _{g',x'} \mu _{g'} p(g') T_{g'}^{(x')} \delta _{g,\epsilon (g',x')} \\&= \sum _{g'} \frac{\mu _{g'}}{\mu _{g}} \left( \sum _{x'} T_{g'}^{(x')} \delta _{g,\epsilon (g',x')}\right) p(g'). \end{aligned}$$

We therefore see that \(p(g)\) is the eigenvector (appropriately normalized, \(\sum _{g} p(g) = 1\)) associated with eigenvalue 1 of a transition matrix given by:

$$\begin{aligned} T_{g'\rightarrow g} := \frac{\mu _{g'}}{\mu _{g}} \left( \sum _{x'} T_{g'}^{(x')} \delta _{g,\epsilon (g',x')}\right) . \end{aligned}$$
(9)
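To illustrate how Eq. (9) is used, the following sketch assembles this matrix for a small hypothetical uHSMM and extracts the appropriately normalized eigenvalue-1 eigenvector numerically. The state labels, emission probabilities, transition rule, and inverse mean dwell times below are illustrative assumptions, not quantities taken from the figures.

```python
import numpy as np

# Hypothetical three-state uHSMM (for illustration only).
states = ["A", "B", "C"]
T   = {"A": {0: 1.0}, "B": {1: 0.5, 2: 0.5}, "C": {0: 1.0}}         # T_g^{(x)}
eps = {("A", 0): "B", ("B", 1): "A", ("B", 2): "C", ("C", 0): "B"}  # epsilon(g, x)
mu  = {"A": 2.0, "B": 1.0, "C": 0.5}   # inverse mean dwell times 1 / <tau>_g

# Matrix of Eq. (9):  T_{g' -> g} = (mu_{g'} / mu_g) * sum_{x'} T_{g'}^{(x')} [epsilon(g', x') = g]
n = len(states)
M = np.zeros((n, n))
for j, gp in enumerate(states):          # column index: source state g'
    for i, g in enumerate(states):       # row index: destination state g
        M[i, j] = (mu[gp] / mu[g]) * sum(prob for x, prob in T[gp].items() if eps[(gp, x)] == g)

# p(g) is the eigenvalue-1 eigenvector of this matrix, normalized so that sum_g p(g) = 1.
vals, vecs = np.linalg.eig(M)
p = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
p = p / p.sum()
print(dict(zip(states, np.round(p, 4))))
```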

Recalling computation theory [26], we see that the \(\epsilon \)-machine of such a process takes on the form of connected counters, albeit in continuous time. See Fig. 3.

Fig. 3 Continuous-time \(\epsilon \)-machine for the hidden semi-Markov process generated by the uHSMM of Fig. 1, as determined by Theorem 1. Continuous-time causal states \( \mathcal {S} _A^+\), \( \mathcal {S} _B^+\), and \( \mathcal {S} _C^+\) track the times \(\tau _0\) or \(\tau _1\) since last event—the times obeying distributions \(\phi _A\), \(\phi _{B,0}\), \(\phi _{B,1}\), and \(\phi _C\) associated with uHMM states A, B, and C, respectively. Each renewal subprocess is depicted as a semi-infinite vertical line and is isomorphic with the positive real line. If no event is seen, probability flows towards increasing time since last event, as described in Eq. (9). Otherwise, the surfaces leaving \( \mathcal {S} _A^+\), \( \mathcal {S} _B^+\), and \( \mathcal {S} _C^+\) indicate allowed transitions back to the next reset state or 0 node located at the nonarrow end of the associated \( \mathcal {S} ^+\), denoting that a new event occurred associated with the next state A, B, and C, as appropriate. Note that when leaving state B there are two distinct diffusion processes on \( \mathcal {S} _B^+\) associated with emitting either 0 or 1. The domains of these diffusions are depicted with two separate semi-infinite lines, denoted \( \mathcal {S} _{B,0}^+\) and \( \mathcal {S} _{B,1}^+\), respectively. Figure 1’s generative HSMM is displayed underneath for reference

A process’ statistical complexity is defined as the entropy of its causal states. To calculate this, we need to find the probability distribution over the triple \((g_0,x_{0^+},\tau _{0^+})\), from which we can compute the statistical complexity as stated in Proposition 1. In the proposition, we overload notation: \({\text {H}}[p(g)] = -\sum _{g} p(g) \log p(g)\).

Proposition 1

The statistical complexity of a unifilar hidden semi-Markov process, under weak assumptions, is given by:

$$\begin{aligned} C_\mu = {\text {H}}[p(g)] - \sum _{g} p(g) \left( \sum _{x} T^{(x)}_{g} \log T_{g}^{(x)}\right) - \sum _{g} p(g) \int _0^{\infty } \left( \mu _{g} \Phi _{g}(\tau )\right) \log \left( \mu _{g} \Phi _{g}(\tau )\right) d\tau , \end{aligned}$$

where, as above, \(p(g)\) is the normalized right eigenvector of eigenvalue 1 of the matrix of Eq. (9).

Proof

Altogether, we find that the statistical complexity is:

$$\begin{aligned} C_\mu&= {\text {H}}[\mathcal {G},X,\mathcal {T}] \\&= {\text {H}}[\mathcal {G}] + {\text {H}}[X|\mathcal {G}] + {\text {H}}[\mathcal {T}|\mathcal {G},X] \\&= {\text {H}}[p(g)] - \sum _{g} p(g) \left( \sum _{x} T^{(x)}_{g} \log T_{g}^{(x)}\right) - \sum _{g} p(g) \int _0^{\infty } \left( \mu _{g} \Phi _{g}(\tau )\right) \log \left( \mu _{g} \Phi _{g}(\tau )\right) d\tau , \end{aligned}$$

where \(p(g)\) is given above.

As with continuous-time renewal processes, this is the statistical complexity of a mixed random variable and, hence, is not always an upper bound on the excess entropy \(\mathbf{E}:= {\text {I}}\bigg [ \smash {\overleftarrow{(X,\mathcal {T})}} ; \smash {\overrightarrow{(X,\mathcal {T})}} \bigg ]\).
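To indicate how Proposition 1 can be evaluated in practice, the following sketch computes the three terms by numerical quadrature for a hypothetical alternating two-state uHSMM with gamma dwell-time distributions; the model, its parameters, and the use of natural logarithms (nats) are illustrative assumptions of ours. For such an alternating chain, Eq. (9) reduces to \(p(g) \propto 1/\mu _{g}\).

```python
import numpy as np
from scipy import integrate, stats

# Hypothetical alternating two-state uHSMM: A always emits 0, B always emits 1,
# epsilon(A, 0) = B and epsilon(B, 1) = A, with gamma-distributed dwell times.
phi = {"A": stats.gamma(a=2.0, scale=1.0), "B": stats.gamma(a=3.0, scale=0.5)}
T   = {"A": {0: 1.0}, "B": {1: 1.0}}
mu  = {g: 1.0 / phi[g].mean() for g in phi}            # inverse mean dwell times

# For this alternating chain, Eq. (9) gives p(g) proportional to the mean dwell time 1/mu_g.
w = np.array([1.0 / mu[g] for g in ("A", "B")])
p = dict(zip(("A", "B"), w / w.sum()))

def plogp(z):
    return z * np.log(z) if z > 0 else 0.0

# Proposition 1, term by term (natural logarithms, so C_mu is in nats).
H_state  = -sum(plogp(p[g]) for g in p)                                    # H[p(g)]
H_symbol = -sum(p[g] * sum(plogp(q) for q in T[g].values()) for g in p)    # emission term
H_dwell  = -sum(p[g] * integrate.quad(lambda t, g=g: plogp(mu[g] * phi[g].sf(t)),
                                      0, np.inf)[0] for g in p)            # dwell-time term
C_mu = H_state + H_symbol + H_dwell
print(f"C_mu = {C_mu:.4f} nats")
```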

4 Informational Architecture

Finally, we use the causal-state identification to calculate a uHSMP’s entropy rate. Entropy rates are defined via:

$$\begin{aligned} h_\mu = \lim _{T\rightarrow \infty } \frac{{\text {H}}\big [ \smash {\overrightarrow{(X,\mathcal {T})}} ^T \big ]}{T}, \end{aligned}$$

but can be calculated using:

$$\begin{aligned} h_\mu = \lim _{\delta \rightarrow 0} \frac{{\text {H}}\big [ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta } \big | \smash {\overleftarrow{(X,\mathcal {T})}} \big ]}{\delta }. \end{aligned}$$

Starting there and recalling the definition of causal states, we immediately have:

$$\begin{aligned} h_\mu = \lim _{\delta \rightarrow 0} \frac{{\text {H}}\big [ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta } \big | \mathcal {S} \big ]}{\delta }. \end{aligned}$$
(10)

We find that the entropy rate is as stated in Theorem 2. We could have just as well conditioned on prescient states [3], and so the weak assumptions of Theorem 1 are irrelevant to Theorem 2. The resulting expression takes the form of a weighted sum of differential entropies of the dwell times from the hidden states.

Theorem 2

The entropy rate of a unifilar hidden semi-Markov process is given by:

$$\begin{aligned} h_\mu = -\sum _{g} p(g) \int _0^{\infty } \mu _{g} \phi _{g}(\tau )\log \phi _{g}(\tau ) d\tau , \end{aligned}$$

where \(p(g)\) is the normalized right eigenvector associated with eigenvalue 1 of the matrix in Eq. (9).

Proof

As just noted, the entropy rate can be calculated via:

$$\begin{aligned} h_\mu = \lim _{\delta \rightarrow 0} \frac{{\text {H}}\big [ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta } \big | \mathcal {S} \big ]}{\delta } , \end{aligned}$$

where \( \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta }\) are trajectories of time length \(\delta \). Following Ref. [25], we consider the random variable \(X_{\delta }\) to be 0 when the emitted symbol is constant throughout the trajectory of length \(\delta \), 1 when there is one switch from one emitted symbol to another in this trajectory of length \(\delta \), and so on. Basic information formulae give:

$$\begin{aligned} {\text {H}}\left[ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta }\big | \mathcal {S} \right] = {\text {H}}[X_{\delta }| \mathcal {S} ] + {\text {H}}\left[ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta }\big |X_{\delta }, \mathcal {S} \right] . \end{aligned}$$
(11)

The key reason that we condition on \(X_{\delta }\) is that two switches are highly unlikely to happen relative to one switch. Furthermore, if no switches in emitted symbols occur, then the trajectory is entirely predictable and does not contribute to the entropy rate. In particular:

$$\begin{aligned} \Pr (X_{\delta }=1| \mathcal {S} =(g,x,\tau ))&= \int _0^{\delta } \frac{\phi _g(\tau +s)}{\Phi _g(\tau +s)} ds \\&\approx \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} \delta + O(\delta ^2) \quad \text {and}\\ \Pr (X_{\delta }=k| \mathcal {S} =(g,x,\tau ))&= O(\delta ^k) \end{aligned}$$

and so:

$$\begin{aligned} \Pr (X_{\delta }=0| \mathcal {S} =(g,x,\tau ))&= 1-\Pr (X_{\delta }\ge 1| \mathcal {S} =(g,x,\tau )) \\&= 1 - \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} \delta + O(\delta ^2). \end{aligned}$$

In a straightforward way, it follows that:

$$\begin{aligned} {\text {H}}[X_{\delta }| \mathcal {S} =(g,x,\tau )]&= -\left( 1 - \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} \delta \right) \log \left( 1 - \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} \delta \right) \nonumber \\&\quad - \left( \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} \delta \right) \log \left( \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} \delta \right) + O(\delta ^2\log \delta ) \nonumber \\&= \frac{\phi _g(\tau )}{\Phi _{g}(\tau )}\delta - \left( \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} \log \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} \right) \delta - \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )}\delta \log \delta + O(\delta ^2\log \delta ) , \end{aligned}$$
(12)

after several Taylor approximations; e.g., \(\log (1+x)=x + O(x^2)\).

Now, consider the second term in Eq. (11):

$$\begin{aligned}&{\text {H}}\left[ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta }\big |X_{\delta }, \mathcal {S} =(g,x,\tau )\right] \\&\quad = \Pr (X_{\delta }=0| \mathcal {S} =(g,x,\tau )) {\text {H}}\left[ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta }\big |X_{\delta }=0, \mathcal {S} =(g,x,\tau )\right] \\&\qquad + \Pr (X_{\delta }=1| \mathcal {S} =(g,x,\tau )) {\text {H}}\left[ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta }\big |X_{\delta }=1, \mathcal {S} =(g,x,\tau )\right] \\&\qquad + \sum _{k=2}^{\infty } P(X_{\delta }=k| \mathcal {S} =(g,x,\tau )) {\text {H}}\left[ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta }\big |X_{\delta }=k, \mathcal {S} =(g,x,\tau )\right] . \end{aligned}$$

If \(X_{\delta }=0\), the trajectory is completely determined by \( \mathcal {S} =(g,x,\tau )\), and hence:

$$\begin{aligned} {\text {H}}\left[ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta }\big |X_{\delta }=0, \mathcal {S} =(g,x,\tau )\right] = 0. \end{aligned}$$

If \(X_{\delta }=1\), then the trajectory is completely determined by one time—that at which emitted symbols switch. As in Ref. [25], the distribution of switching time is roughly uniform over the interval, and so:

$$\begin{aligned} {\text {H}}\left[ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta }\big |X_{\delta }=1, \mathcal {S} =(g,x,\tau )\right] = \log \delta + O(\delta ). \end{aligned}$$

Finally, from maximum entropy arguments:

$$\begin{aligned} P(X_{\delta }=k| \mathcal {S} =(g,x,\tau )) {\text {H}}\left[ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta }\big |X_{\delta }=k, \mathcal {S} =(g,x,\tau )\right] \end{aligned}$$

is at most of order \(\delta ^k (\log \delta )^k\). In particular, we noted earlier that \(P(X_{\delta }=k| \mathcal {S} =(g,x,\tau ))\) was \(O(\delta ^k)\) and that k emissions over a time interval of no more than \(\delta \) yield differential entropy of no more than \((\log \delta )^k\). In addition, we have:

$$\begin{aligned} {\text {H}}\left[ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta }\big |X_{\delta }= k\right] \ge {\text {H}}\left[ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta }\big |X_{\delta }=1\right] . \end{aligned}$$

That is, given a trajectory with a single transition, one can construct trajectories with more than one transition that approximate it arbitrarily closely. And so, \(\left| {\text {H}}\big [ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta }\big |X_{\delta }= k\big ]\right| \) is at least \(O(|\log \delta |)\) and at most \(O(|\log \delta |^k)\). Hence:

$$\begin{aligned} \sum _{k=2}^{\infty } P(X_{\delta }=k| \mathcal {S} =(g,x,\tau )) {\text {H}}\big [ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta }\big |X_{\delta }=k, \mathcal {S} =(g,x,\tau )\big ] = O(\delta ^2(\log \delta )^2). \end{aligned}$$

Altogether, we have:

$$\begin{aligned} {\text {H}}\left[ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta }\big | \mathcal {S} =(g,x,\tau )\right]&= \frac{\phi _g(\tau )}{\Phi _{g}(\tau )}\delta - \left( \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} \log \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} \right) \delta - \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )}\delta \log \delta \\ {}&\quad + \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} \delta \log \delta + O(\delta ^2(\log \delta )^2) \\&= \frac{\phi _g(\tau )}{\Phi _{g}(\tau )}\delta - \left( \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} \log \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} \right) \delta + O(\delta ^2(\log \delta )^2) \end{aligned}$$

and, thus:

$$\begin{aligned} {\text {H}}\left[ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta }| \mathcal {S} \right]&= \left\langle {\text {H}}\big [ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta }\big | \mathcal {S} =(g,x,\tau )\big ] \right\rangle _{g,x,\tau }\\&= \left\langle \frac{\phi _g(\tau )}{\Phi _{g}(\tau )} - \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} \log \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} \right\rangle _{g,x,\tau } \delta + O(\delta ^2(\log \delta )^2). \end{aligned}$$

And so, from Eq. (10), we find the entropy rate:

$$\begin{aligned} h_\mu&= \lim _{\delta \rightarrow 0} \frac{{\text {H}}\left[ \smash {\overrightarrow{(X,\mathcal {T})}} ^{\delta }\big | \mathcal {S} \right] }{\delta } \\&= \left\langle \frac{\phi _g(\tau )}{\Phi _{g}(\tau )} - \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} \log \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} \right\rangle _{g,x,\tau } \\&= \sum _{g,x} p(g) T^{(x)}_{g} \int _0^{\infty } \mu _{g} \Phi _{g}(\tau )\frac{\phi _g(\tau )}{\Phi _{g}(\tau )} d\tau - \sum _{g,x} p(g) T^{(x)}_{g} \int _0^{\infty } \mu _{g} \Phi _{g}(\tau ) \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} \log \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} d\tau \\&= \sum _{g,x} p(g) T^{(x)}_{g} \left( \int _0^{\infty } \mu _{g} \phi _{g}(\tau )d\tau - \int _0^{\infty } \mu _{g} \phi _{g}(\tau ) \log \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} d\tau \right) . \end{aligned}$$

We directly have:

$$\begin{aligned} \int _0^{\infty } \mu _{g} \phi _{g}(\tau )d\tau = \mu _{g} \end{aligned}$$

and

$$\begin{aligned} \int _0^{\infty } \mu _{g} \phi _{g}(\tau ) \log \frac{\phi _{g}(\tau )}{\Phi _{g}(\tau )} d\tau = \int _0^{\infty } \mu _{g} \phi _{g}(\tau )\log \phi _{g}(\tau ) d\tau - \int _0^{\infty } \mu _{g} \phi _{g}(\tau )\log \Phi _{g}(\tau ) d\tau . \end{aligned}$$

The second term simplifies upon substituting \(u=\Phi _{g}(\tau )\):

$$\begin{aligned} \int _0^{\infty } \mu _{g} \phi _{g}(\tau )\log \Phi _{g}(\tau ) d\tau&= - \int _1^0 \mu _{g} \log u du \\&= \mu _{g} \left( u\log u-u\right) |_0^1 \\&= -\mu _{g}. \end{aligned}$$

Altogether, we find:

$$\begin{aligned} h_\mu = -\sum _{g} p(g) \int _0^{\infty } \mu _{g} \phi _{g}(\tau )\log \phi _{g}(\tau ) d\tau . \end{aligned}$$

Theorem 2 generalizes a recent result about the entropy rate of semi-Markov processes [22] to unifilar hidden semi-Markov processes. In this way, Theorem 2 demonstrates the practical use of identifying causal states.
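As a small numerical check on Theorem 2 used as a plug-in formula, the following sketch evaluates the expression by quadrature for a hypothetical alternating two-state uHSMM with exponential dwell times, for which \(-\int _0^{\infty } \mu _{g} \phi _{g}(\tau ) \log \phi _{g}(\tau ) d\tau = \lambda _{g} (1-\log \lambda _{g})\) in closed form. The model, its rates, and the use of natural logarithms are illustrative assumptions.

```python
import numpy as np
from scipy import integrate, stats

# Hypothetical alternating two-state uHSMM with exponential dwell times.
lam = {"A": 2.0, "B": 0.5}
phi = {g: stats.expon(scale=1.0 / lam[g]) for g in lam}
mu  = {g: lam[g] for g in lam}          # exponential dwell: inverse mean equals the rate

# Eq. (9) for the alternating chain: p(g) is proportional to the mean dwell time 1/mu_g.
w = np.array([1.0 / mu[g] for g in ("A", "B")])
p = dict(zip(("A", "B"), w / w.sum()))

# Theorem 2: h_mu = -sum_g p(g) int_0^infty mu_g phi_g(tau) log phi_g(tau) dtau
h_quad = -sum(p[g] * mu[g] * integrate.quad(lambda t, g=g: phi[g].pdf(t) * phi[g].logpdf(t),
                                            0, np.inf)[0] for g in p)

# Closed form for exponential dwell times: lambda_g * (1 - log lambda_g).
h_exact = sum(p[g] * lam[g] * (1.0 - np.log(lam[g])) for g in p)
print(f"h_mu = {h_quad:.6f} nats per unit time (quadrature), {h_exact:.6f} (closed form)")
```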

5 Conclusion

Proposition 1 and Theorem 2 provide new expressions for the statistical complexity and entropy rate of continuous-time, discrete-event processes generated by a restricted class of hidden semi-Markov models, what we called “unifilar hidden semi-Markov models”. Their causal architecture reminds one of a discrete-time \(\epsilon \)-machine crossed with the \(\epsilon \)-machine of a continuous-time renewal process [25]. Pursuing this, straightforward extensions of the techniques in Ref. [25] allowed us to derive new expressions for statistical complexity and entropy rate.

These expressions can be used as new plug-in estimators for the statistical complexity and entropy rate of continuous-time, discrete-event processes. It might seem that using them requires an accurate estimate of \(\phi _g(\tau )\), which in turn might require unreasonable amounts of data. However, a method introduced by Kozachenko and Leonenko and popularized by Victor [27] exploits the fact that we need only calculate scalar functions of \(\phi _g(\tau )\) and not \(\phi _g(\tau )\) itself; see the sketch below. A second concern arises from the super-exponential growth in the number of discrete-time, discrete-alphabet \(\epsilon \)-machines with the number of hidden states [28]. How do we know the underlying topology? Here, as a first try, we suggest taking Ref. [29]’s approach, replacing the hidden states in these general models with the last k symbols. Further research is required, though, to determine when the chosen k is too small or too large. And, finally, we hope this work stimulates new research on continuous-time, discrete-alphabet quantum machines such as those found in Ref. [30].
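To indicate the kind of estimator we have in mind, the following sketch implements a one-dimensional Kozachenko–Leonenko nearest-neighbor estimate of the differential-entropy functional \(-\int _0^{\infty } \phi _g(\tau ) \log \phi _g(\tau ) d\tau \) appearing in Theorem 2, computed directly from sampled dwell times without estimating \(\phi _g\) itself. The synthetic gamma-distributed dwell times and the first-nearest-neighbor setting are illustrative assumptions, not prescriptions from Ref. [27].

```python
import numpy as np
from scipy import stats
from scipy.special import digamma

def kl_entropy_1d(samples):
    """Kozachenko-Leonenko (first-nearest-neighbor) estimate, in nats, of the
    differential entropy of a one-dimensional sample, e.g. the dwell times
    observed while the process sat in a single hidden state."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    gaps = np.diff(x)
    rho = np.empty(n)                      # distance from each point to its nearest neighbor
    rho[0], rho[-1] = gaps[0], gaps[-1]
    rho[1:-1] = np.minimum(gaps[:-1], gaps[1:])
    rho = np.maximum(rho, 1e-15)           # guard against exact ties
    # H ~= psi(N) - psi(1) + log(2) + (1/N) sum_i log rho_i    (d = 1, k = 1)
    return digamma(n) - digamma(1) + np.log(2.0) + np.mean(np.log(rho))

# Synthetic check against a dwell-time distribution with known differential entropy.
rng = np.random.default_rng(0)
tau = rng.gamma(shape=2.0, scale=1.0, size=5000)
print(kl_entropy_1d(tau), stats.gamma(a=2.0, scale=1.0).entropy())
```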