1 Introduction

A major conceptual contribution of statistical mechanics is that it refocused attention from phenomenological concepts such as heat and energy flow, which dominated thermodynamics in the second half of the nineteenth century, to the question of how the underlying microscopic dynamics occupies the space of potential configurations. This, in turn, made it possible to understand why a simple, compact, macroscopic description is so efficient in describing microscopically diverse thermodynamic systems [1,2,3,4,5]. The key mathematical concept responsible for such a “miraculous” simplicity of the description is known as the typical set [6,7,8,9]. In fact, large-scale systems at equilibrium obey conventional thermodynamics because they lie in microscopic configurations (or states) that are typical. Typical states are the fraction of the possible states that together carry total probability close to one; the set of these states is called the typical set.

Crucially, the set of typical states comprises only a small fraction of the total number of possible states, and yet typical sets alone are sufficient to describe the macroscopic behavior of a given system. Consequently, the concept of typicality allows for a drastic reduction of the degrees of freedom needed for a system’s statistical description. A quantitative definition of typical states in weakly interacting systems is most easily provided by information theory [6]. For continuous random variables, the concept of typical sets is also studied in the framework of measure theory, where it is tantamount to the concentration of measure phenomenon [10,11,12]. The latter was popularized in the context of (multi)fractals by B. Mandelbrot, who dubbed the phenomenon curdling [13]. It is, therefore, the existence of the typical set of micro-states that allows the heat and energy flow considerations (underlying phenomenological thermodynamics) to be understood in terms of the occupancy of the state space in statistical mechanics [14]. Furthermore, the equivalence between the microcanonical and canonical ensemble descriptions in the thermodynamic limit is, again, a direct consequence of the existence of a typical set of equal-energy microstates in the canonical ensemble [15].

In Shannon’s information theory, partitioning a set of states or sequences into those that are typical and those that are atypical is possible due to the Asymptotic Equipartition Property (AEP) or Shannon–McMillan–Breiman theorem [6], which states that all the sequences in the typical set have almost the same probability of occurring. There, the typical set and the AEP are instrumental in proving the main results on channel capacity and noiseless coding, as well as in providing a sound mathematical basis for various information compression strategies. Shannon’s original proof of the AEP for independent and identically distributed (i.i.d.) sequences of random variables (as well as subsequent extensions to weakly dependent random variables) uses the weak law of large numbers. In such a context, Shannon entropy emerges as a natural tool for characterizing typical probabilities and, by extension, as a quantifier of the cardinality of the typical set. Moreover, the typical set contains sequences with a sample entropy (an analog of the Boltzmann entropy) that is close to the Gibbs–Shannon entropy. This is an information-theoretic equivalent of Einstein’s celebrated entropic principle [16] (i.e., the reversal of Boltzmann’s entropic formula) (Fig. 1).

Fig. 1

A Different underlying microscopic hypotheses: (Top) the standard i.i.d. assumption, or its physical analogue, the Stosszahlansatz or molecular chaos hypothesis. (Bottom) A stochastic process whose sampling space grows in time, thereby violating the assumptions on which standard statistical mechanics is built. B (Top) A physical system in equilibrium whose microscopic dynamics obeys the molecular-chaos hypothesis and (Bottom) a toy representation of the collective behavior of a physical system with increasing sampling space. C (Top) The standard assumptions of equilibrium statistical mechanics lead naturally to the concept of the typical set. (Bottom) More complex dynamics may also have typical behaviors, albeit more complex to identify or characterize. D (Top) The existence of the typical set in equilibrium configurations gives rise to the Shannon entropy as the natural functional accounting for the typical occupation of the sampling space. (Bottom) More complex dynamics still leading to typical behaviors may give rise to generalized forms of entropy and other functionals

When the underlying dynamics satisfies appropriate boundary conditions, e.g. it is weakly interacting, the conventional mathematical structure of equilibrium thermodynamics is directly implied by the concentration of measure phenomenon. This fact will be illustrated shortly with the example of a simple coin toss at different “temperatures”. Weakly interacting systems with (almost) independent constituents are epitomized, for instance, in the celebrated Stosszahlansatz hypothesis [17] or Bogoliubov’s no-correlation initial conditions [18]. The question naturally arises as to whether the concentration of measure and the related typicality can be applied to more general systems, such as complex dynamical systems, and if so, what mathematical structure can be expected in the ensuing generalized thermodynamics. The motivation for such a question is clear: the existence of typical behaviors implies a massive reduction of degrees of freedom and, in turn, triggers the emergence of macroscopic, interrelated functionals that allow us to characterize and define macroscopic observables. If typical sets and associated macroscopic functionals could be identified in systems with higher underlying microscopic complexity, this would open up the possibility of characterizing them (i.e. establishing predictive principles) in a way that would mimic equilibrium thermodynamics. It is the aim of this paper to address this issue and put forward some of our preliminary results.

The paper is organized as follows. In the next section, we analyze the consequences of the concentration of measure and typicality in simple coin tossing systems. In particular, we discuss both the microcanonical and canonical ensemble descriptions of such systems, and the role of temperature in the occupation of the state space. In Sect. 3, we extend our discussion beyond the conventional Shannon paradigm and derive the conditions for typicality using both Rényi and Tsallis entropies. Interestingly, we find that the ensuing typical sets are well defined, allowing the occupation of the sampling space to be mapped into well-known functionals. In particular, typicality arising from Rényi entropies naturally involves the emergence of free energy, whereas for Tsallis entropies the typical set bounds are phrased in terms of the partition function. We end with Sect. 4, in which we briefly discuss a possible general framework in which the assumptions on the underlying dynamics are relaxed, leading to general forms of entropy characterizing the sampling space occupation even in cases with very complex dynamics and/or unstable sampling spaces. Conclusions and perspectives are finally summarized in Sect. 5. For the sake of clarity, some more technical considerations are relegated to the appendix.

2 Concentration of measure: tossing a coin

In this section, we provide the characterization of a fairly simple stochastic process—coin tossing. In particular, we show how the key observable quantities from information theory and statistical mechanics can be interpreted in terms of measure concentration and typicality, that is, in terms of state space occupation.

2.1 A coin’s “microcanonical” ensemble

Suppose we have a system that can only be in two states \(\{0,1\}\). In the following, we will refer to this system as a coin. To start, let us assume that after running (or observing) our system N times, it shows exactly \(m_0\) 0’s and \(m_1\) 1’s, with \(N=m_0+m_1\). This coin tossing process represents a two-valued discrete-time stochastic process known as a Bernoulli (binary) process. Let \(\sigma _{\mathrm g}(m_0,m_1)\) be a generic sequence of 1’s and 0’s, i.e.

$$\begin{aligned} \sigma _{\mathrm g}(m_0,m_1) \ = \ (1,0,0,1,1,0,\ldots ,0,1), \end{aligned}$$
(1)

containing exactly \(m_0\) 0’s and \(m_1\) 1’s. We refer to the set of all such sequences as \(\Omega ^N(m_0, m_1)\). The cardinality of such a set is given by the binomial coefficient:

$$\begin{aligned} |\Omega ^N (m_0,m_1)| = \binom{N}{m_0}\,. \end{aligned}$$
(2)

We note that all sequences of the form (1) are equally likely. The Boltzmann-like entropy of the macrostate corresponding to the set \(\Omega (m_0,m_1)\) is defined as

$$\begin{aligned} S(\Omega (m_0,m_1)) \ = \ \log |\Omega (m_0,m_1)|\,. \end{aligned}$$
(3)

By using Stirling’s approximation

$$\begin{aligned} \log (n!) \ = \ n \log n \ - \ n \ + \ \mathcal {O}(\log n), \end{aligned}$$
(4)

which is valid for sufficiently large values of n, we might rewrite (3) as

$$\begin{aligned} \log |\Omega (m_0,m_1)| \ \approx \ N H(\theta )\,. \end{aligned}$$
(5)

Here, the entropy per toss \(H(\theta )\) is nothing but the Shannon entropy associated with the Bernoulli random variable \(\theta \), i.e. a variable that takes two values (0 and 1), with respective probabilities

$$\begin{aligned} & p(\theta = 0) \ \equiv \ p(0) \ = \ \frac{m_0}{N}, \nonumber \\ & p(\theta = 1)\ \equiv \ p(1) \ = \ \frac{m_1}{N}, \end{aligned}$$
(6)

so that

$$\begin{aligned} H(\theta ) \ = \ -\sum _{\theta \in \{0,1\}}p(\theta )\log p(\theta )\,. \end{aligned}$$
(7)

In passing, we might note that \(H(\theta )\) is maximized by the uniform distribution, with maximum value \(\log 2\); thus

$$\begin{aligned} H(\theta )\ \le \ \log 2\,. \end{aligned}$$
(8)

One can interpret the above results as a toy representation of the microcanonical ensemble.
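
To make the Stirling estimate (5) tangible, the following minimal Python sketch (ours; the split \(m_0/N=0.7\) is an arbitrary illustrative choice) compares the exact log-cardinality of Eqs. (2)–(3) with the approximation \(NH(\theta )\):

```python
# A minimal numerical check of Eq. (5): the exact log-cardinality of the
# "microcanonical" set Omega^N(m0, m1) versus the Stirling estimate N*H(theta).
# The split m0/N = 0.7 is an arbitrary illustrative choice, not from the paper.
import math

def shannon_entropy(p0: float) -> float:
    """Shannon entropy (in nats) of a Bernoulli variable, Eq. (7)."""
    p1 = 1.0 - p0
    return -sum(p * math.log(p) for p in (p0, p1) if p > 0)

for N in (10, 100, 1000, 10000):
    m0 = int(0.7 * N)                     # number of 0's in the macrostate
    exact = math.log(math.comb(N, m0))    # log|Omega^N(m0, m1)|, Eqs. (2)-(3)
    approx = N * shannon_entropy(m0 / N)  # N*H(theta), Eq. (5)
    print(f"N={N:6d}  exact={exact:12.3f}  N*H={approx:12.3f}  "
          f"rel. error={(approx - exact) / exact:8.2%}")
```

The relative error of the approximation decays as the \(\mathcal {O}(\log n)\) corrections in (4) become negligible against the extensive terms.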

2.2 A coin’s “canonical” ensemble

Let us now consider a sequence of N coin tosses for which we know only the prior probabilities of occurrence of 0 or 1, i.e. p(0) or p(1), respectively. Instead of a specific sequence of 0’s and 1’s with fixed \(m_0\) and \(m_1\), we now have an i.i.d. sequence of random variables \(\theta _1,\theta _2,...,\theta _N\) following the stationary distribution \(\{p(0),p(1)\}\). We denote the state space of all possible sequences of length N as \(\Omega ^N\), so that

$$\begin{aligned} \Omega ^N \ = \ \{0,1\}^N. \end{aligned}$$
(9)

Clearly, the cardinality of this state space is

$$\begin{aligned} \left| \Omega ^N\right| \ = \ 2^{N}\,. \end{aligned}$$
(10)

Let us observe how the measure concentration phenomenon arises in this case as a consequence of the law of large numbers. Indeed, let us denote by \(\langle \theta _N\rangle \) the expected value of the sum of outcomes after N trials. Then for any N we have

$$\begin{aligned} \langle \theta _N\rangle \ = \ Np(1). \end{aligned}$$
(11)

At this stage we might employ, e.g. Hoeffding’s version of the law of large numbers (or Hoeffding’s inequality) [19], which states that for all \(\delta >0\)

$$\begin{aligned} & \mathbb {P}\left\{ \frac{1}{N}\left( \sum _{k=1}^N\theta _k- \langle \theta _N\rangle \right) \ge \delta \right\} \nonumber \\ & \quad = \ \mathbb {P}\left\{ \left( \frac{1}{N}\sum _{k=1}^N\theta _k- p(1)\right) \ge \delta \right\} \ \le \ e^{-2\delta ^2N}.~~~~~ \end{aligned}$$
(12)

We see that most of the weight in long sequences is carried by sequences whose arithmetic average is close to p(1), and that deviations from this behavior are extremely rare in long sequences. But can we know more? Apart from the inequality (12), it is important to know what the expected sequences look like. Obviously, not all \(2^N\) sequences of length N have the same arithmetic mean. This leads to the concept of typicality and the associated concept of the typical set.
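
As an aside, the concentration in (12) is easy to visualize by simulation. The following short Monte Carlo sketch (ours; the choices \(p(1)=0.5\) and \(\delta =0.1\) are purely illustrative) compares the empirical probability on the left-hand side of (12) with the Hoeffding bound:

```python
# A quick Monte Carlo illustration (not from the paper) of Eq. (12): the
# empirical probability that the sample mean of N Bernoulli trials exceeds
# p(1) by at least delta, next to the Hoeffding bound exp(-2 delta^2 N).
import math
import random

random.seed(0)
p1, delta, trials = 0.5, 0.1, 20000

for N in (10, 50, 100, 200):
    hits = sum(
        (sum(random.random() < p1 for _ in range(N)) / N - p1) >= delta
        for _ in range(trials)
    )
    bound = math.exp(-2 * delta**2 * N)
    print(f"N={N:4d}  empirical={hits / trials:8.4f}  Hoeffding bound={bound:8.4f}")
```

The empirical deviation probability stays below the bound and decays exponentially in N.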

2.2.1 The typical set

Let \(p(\theta _1,\theta _2, \ldots ,\theta _N)\) be the probability of observing the sequence \(\theta _1,\theta _2, \ldots ,\theta _N\) of i.i.d. random variables; then in the large N limit the following form of the AEP holds

$$\begin{aligned} -\frac{1}{N}\log p(\theta _1,\theta _2, \ldots ,\theta _N)\ \rightarrow \ H(\theta ), \end{aligned}$$
(13)

where the convergence is in probability. This can be proved in various ways [6, 20]. For instance, if we employ the fact that the \(\theta _k\)’s are i.i.d., we have

$$\begin{aligned} -\frac{1}{N}\log p(\theta _1,\theta _2, \ldots ,\theta _N) \ = \ -\frac{1}{N} \sum _{k=1}^N \log p(\theta _k),\nonumber \\ \end{aligned}$$
(14)

and by Hoeffding’s inequality

$$\begin{aligned} & \mathbb {P}\left\{ \left( -\frac{1}{N}\sum _{k=1}^N\log p(\theta _k) + \langle \log p(\theta ) \rangle \right) \ge \delta \right\} \nonumber \\ & \quad =\ \mathbb {P}\left\{ \left( -\frac{1}{N}\sum _{k=1}^N\log p(\theta _k) - H(\theta ) \right) \ge \delta \right\} \nonumber \\ & \quad \le \ e^{-2\delta ^2N}. \end{aligned}$$
(15)

This directly implies (13). The aforementioned AEP has important consequences for the understanding of the state space structure. In fact, let \(A_\epsilon ^N\) be the set of sequences of length N, \(\sigma ^N_1, \ldots, \sigma ^N_m, \ldots \), with a generic member \(\sigma ^N_{\mathrm g}\) satisfying

$$\begin{aligned} e^{-N[H(\theta ) +\epsilon ]} \ \le \ p(\sigma ^N_{\mathrm g}) \ \le \ e^{-N[H(\theta )-\epsilon ]}, \end{aligned}$$
(16)

(for \(\epsilon > 0\)). For reasons to be seen shortly, the set \(A_\epsilon ^N\) is known as a typical set. Equation (16) implies that

$$\begin{aligned} \left| -\frac{1}{N}\log p(\sigma ^N_{\mathrm g}) - H(\theta )\right| \ \le \ \epsilon , \end{aligned}$$
(17)

or in probability

$$\begin{aligned} & \mathbb {P}\left\{ \left| -\frac{1}{N}\log p(\theta _1,\theta _2, \ldots ,\theta _N) - H(\theta ) \right| < \epsilon \right\} \nonumber \\ & \quad =\mathbb {P}\left\{ \theta _1,\theta _2, \ldots ,\theta _N \in A_\epsilon ^N\right\} \ > \ 1 - 2e^{-2\epsilon ^2 N}. \nonumber \\ \end{aligned}$$
(18)

On the last line, we used the inequality (15) with \(\epsilon \) instead of \(\delta \). Equation (18) shows that the probability of obtaining a sequence that belongs to the typical set \(A_{\epsilon }^N\) converges to one in the large N limit. The result (18) directly implies that

$$\begin{aligned} 1 - 2e^{-2\epsilon ^2 N}< & \mathbb {P}\left\{ \theta _1,\theta _2, \ldots ,\theta _N \in A_\epsilon ^N\right\} \nonumber \\\le & \sum _{\theta _1,\theta _2, \ldots ,\theta _N \in A_\epsilon ^N} e^{-N[H(\theta )-\epsilon ]}\nonumber \\= & \left| A_\epsilon ^N\right| e^{-N[H(\theta ) -\epsilon ]}, \end{aligned}$$
(19)

where \(\left| A_\epsilon ^N\right| \) is the cardinality of the set \(A_\epsilon ^N\). In the derivation, we used the defining relation (16). Similarly, one can easily see (cf. e.g. [6]) that

$$\begin{aligned} 1= & \sum _{\theta _1,\theta _2, \ldots ,\theta _N \in \Omega ^N} p(\theta _1,\theta _2, \ldots ,\theta _N)\nonumber \\\ge & \sum _{\theta _1,\theta _2, \ldots ,\theta _N \in A_\epsilon ^N} p(\theta _1,\theta _2, \ldots ,\theta _N)\nonumber \\\ge & \sum _{\theta _1,\theta _2, \ldots ,\theta _N \in A_\epsilon ^N} e^{-N[H(\theta ) + \epsilon ]}\nonumber \\= & \left| A_\epsilon ^N\right| e^{-N[H(\theta ) +\epsilon ]}. \end{aligned}$$
(20)

Thus, the cardinality of \(A_\epsilon ^N\) is constrained as follows

$$\begin{aligned} (1 - 2e^{-2\epsilon ^2 N})e^{N[H(\theta ) -\epsilon ]} \ \le \ \left| A_\epsilon ^N\right| \ \le \ e^{N[H(\theta ) +\epsilon ]}\,.\nonumber \\ \end{aligned}$$
(21)

From Eq. (10), we get that the relative size of the typical set with respect to the cardinality of the state space \(\Omega ^N\) is bounded as

$$\begin{aligned} \frac{\left| A_{\epsilon }^N\right| }{\left| \Omega ^N\right| } \ \le \ e^{-N\delta }, \end{aligned}$$
(22)

with \(\delta = [\log 2 -H(\theta )] - \epsilon \). From Eq. (8), we can conclude that, outside the special case of equiprobability, the relative size of the typical set in relation to the state space decays at least exponentially fast with N, so

$$\begin{aligned} \frac{\left| A_{\epsilon }^N\right| }{\left| \Omega ^N\right| } \ \rightarrow \ 0. \end{aligned}$$
(23)

This is a hallmark of typical sets, namely their relative cardinality is very small, but they carry almost all of the probability. In our case, we may notice that the measure becomes more and more concentrated around the tiny region of the state space where the sequences have (almost) constant probability \(e^{-NH(\theta )}\). Consequently for any \(\sigma ^N_{\mathrm g}\in A_{\epsilon }^N\)

$$\begin{aligned} p(\sigma ^N_{\mathrm g})\ \sim \ e^{-NH(\theta )}, \end{aligned}$$
(24)

where the symbol \(\sim \) denotes asymptotic equivalence to the first order in the exponent, that is \(a\sim b\) if

$$\begin{aligned} \frac{\log a}{\log b}\ \rightarrow \ 1. \end{aligned}$$
(25)

(if \(b=1\), then \(a \sim 1\) is satisfied by definition). Therefore, as the sequences get closer and closer to the typical ones, they tend to become equi-distributed.

In passing we might observe that Eq. (21) allows us to write the cardinality of the typical set as

$$\begin{aligned} \left| A_{\epsilon }^N\right| \ \sim \ e^{NH(\theta )}. \end{aligned}$$
(26)

This could alternatively be rewritten as

$$\begin{aligned} \frac{\log \left| A_{\epsilon }^N\right| }{N} \ \rightarrow \ H(\theta ). \end{aligned}$$
(27)

The latter means that the Shannon entropy is the logarithm of the cardinality of \(A_{\epsilon }^N\) per particle. Interestingly, in the large N limit the cardinality of \(A_{\epsilon }^N\) approaches the cardinality of the set of sequences containing exactly \(m_0=Np(0)\) zeros and \(m_1=Np(1)\) ones, i.e. the cardinality of \(\Omega (m_0,m_1)\) discussed in the previous subsection.

From the foregoing coin tossing system, we can conclude that: (a) the typical set concentrates around the subset of states from the state space that represent the “microcanonical” ensemble states from Sect. 2.1, (b) the measure is distributed in such a way that, as N increases, it gets closer and closer to the equi-distribution, i.e. the “microcanonical” ensemble distribution \(1/|\Omega ^N (m_0,m_1)| \). This latter property is typically referred to as the AEP [6], and it is a particular case of the measure concentration phenomenon. Finally, (c) in the coin toss example, we have seen that entropy is just a (logarithmic) measure that quantifies the occupation of the state space and characterizes the concentration of measure as long as N is large.
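
These conclusions can be verified exhaustively for small N. The sketch below (ours; \(p(0)=0.7\) and \(\epsilon =0.1\) are illustrative choices) enumerates all \(2^N\) binary sequences, applies the membership condition (17), and reports the probability mass and the relative size of \(A_\epsilon ^N\):

```python
# An exhaustive check (illustrative, parameters are ours) of the typical-set
# bounds: for small N we enumerate all 2^N binary sequences, apply the
# membership condition (16)-(17), and report the mass and relative size of
# A_eps^N. Runtime grows as 2^N, so keep N modest.
import itertools
import math

p = {0: 0.7, 1: 0.3}                       # assumed Bernoulli distribution
H = -sum(q * math.log(q) for q in p.values())
eps = 0.1

for N in (8, 12, 16, 20):
    mass, card = 0.0, 0
    for seq in itertools.product((0, 1), repeat=N):
        logp = sum(math.log(p[s]) for s in seq)
        if abs(-logp / N - H) <= eps:      # condition (17)
            mass += math.exp(logp)
            card += 1
    print(f"N={N:3d}  P(A_eps)={mass:6.4f}  |A_eps|/2^N={card / 2**N:8.5f}")
```

The mass approaches one while the relative cardinality shrinks, in line with (18) and (23).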

2.2.2 “Temperature” and occupation of the state space

Let us now “thermalize” the above Bernoulli binary scheme. To this end, we consider the following scenario: First, we take the process described above as the reference process \(\theta \), following \(\{p(0), p(1)\}\), and formally associate a unit temperature with this process. Without loss of generality, we assume \(p(0)> p(1)\) (the special case when \(p(0)= p(1) = 1/2\) will be discussed separately). Second, we deform the process \(\theta \) with a single deformation parameter \(\beta \) so that

$$\begin{aligned} \left\{ p(0), p(1)\right\} \;\rightarrow \; \left\{ p_\beta (0), p_\beta (1)\right\} , \end{aligned}$$
(28)

where \(p_\beta (k)\) is the escort transformation of the probability distribution, i.e.

$$\begin{aligned} p_\beta (k) \ = \ \frac{p^\beta (k)}{Z_\beta }\,, \end{aligned}$$
(29)

with the normalization factor \(Z_\beta \) defined as

$$\begin{aligned} Z_\beta \ = \ \sum _{k\in \{0,1\}}p^\beta (k)\,. \end{aligned}$$
(30)

We might note that

$$\begin{aligned} \lim _{\beta \rightarrow \infty }p_\beta (0)\ = \ 1, \quad \lim _{\beta \rightarrow \infty }p_\beta (1) \ = \ 0, \end{aligned}$$
(31)

i.e., the process is “frozen” at high \(\beta \)’s, meaning that only a single result will materialize in repeated tosses. On the other hand

$$\begin{aligned} \lim _{\beta \rightarrow 0}p_\beta (0) \ = \ \frac{1}{2},\quad \lim _{\beta \rightarrow 0}p_\beta (1) \ = \ \frac{1}{2}, \end{aligned}$$
(32)

i.e., at low \(\beta \)’s the process gets closer and closer to a fair random coin. Note that if we had started with the reference process where \(p(0) = p(1) = 1/2\), then the escort transformation would not change this distribution; in other words, the fair coin distribution is a fixed point of the escort transformation. However, the latter fixed point is unstable, as any small deviation from the fair coin rule will cause the process to “freeze” in the \(\beta \rightarrow \infty \) limit. Consequently, \(\beta \) behaves like the inverse temperature: the higher the temperature, the higher the randomness. In terms of state space occupation, it is easy to check that

$$\begin{aligned} \lim _{\beta \rightarrow \infty } \left| A_{\epsilon }^N(\beta )\right| \ \sim \ 1, \end{aligned}$$
(33)

that is, the effective size of the state space is reduced to a single state, the sequence \((0,0,\ldots ,0)\), in which the system is frozen. The latter state plays a role analogous to that of the pure state in quantum mechanics. On the other hand

$$\begin{aligned} \lim _{\beta \rightarrow 0} \left| A_{\epsilon }^N(\beta )\right| \ \ \sim \ \left| \Omega ^N\right| \ = \ e^{N\log 2}\,, \end{aligned}$$
(34)

that is, the whole space of all possible binary sequences of length N. It is not difficult to see that for a generic \(\beta \) we have

$$\begin{aligned} \left| A_{\epsilon }^N(\beta )\right| \ \ \sim \ e^{NH(\theta (\beta ))}, \end{aligned}$$
(35)

where

$$\begin{aligned} H(\theta (\beta )) \ = \ -\sum _{i\in \{0,1\}} p_\beta (i)\log p_\beta (i), \end{aligned}$$
(36)

is the Shannon entropy of the “thermalized” coin. It is not difficult to verify that

$$\begin{aligned} \frac{\textrm{d}}{\textrm{d}\beta } H(\theta (\beta )) \ < \ 0, \end{aligned}$$
(37)

from which we can deduce that the cardinality of the typical set decreases monotonically with increasing \(\beta \), and that the actual choice of the reference process is irrelevant for this type of behavior.
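
A minimal numerical sketch (ours; the reference process with \(p(0)=0.7\) is an arbitrary choice) of the escort transformation (29)–(30), showing the monotonic decrease (37) of the entropy (36) with \(\beta \):

```python
# A sketch (ours, following Eqs. (29), (30), (36)) of the "thermalized" coin:
# the escort transformation of a reference process with p(0)=0.7, and the
# entropy H(theta(beta)), which decreases monotonically in beta, Eq. (37).
import math

def escort(p0: float, beta: float):
    """Escort-transformed Bernoulli distribution, Eqs. (29)-(30)."""
    z = p0**beta + (1 - p0)**beta         # partition function Z_beta
    return p0**beta / z, (1 - p0)**beta / z

def entropy(dist) -> float:
    return -sum(q * math.log(q) for q in dist if q > 0)

p0 = 0.7                                   # reference process, our choice
for beta in (0.0, 0.5, 1.0, 2.0, 5.0, 20.0):
    pb = escort(p0, beta)
    print(f"beta={beta:5.1f}  p_beta(0)={pb[0]:6.4f}  H={entropy(pb):6.4f}")
```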

Note that Eq. (33) is valid for arbitrarily large but fixed N. It is interesting to know what happens when the limit \(N \rightarrow \infty \) is performed first. With the help of Eq. (35), we can write

$$\begin{aligned} \lim _{\beta \rightarrow \infty } \lim _{N\rightarrow \infty }\frac{\log \left| A_{\epsilon }^N(\beta )\right| }{N} \ = \ H(\theta (\infty )) \ = \ 0, \end{aligned}$$
(38)

which is reminiscent of the third law of thermodynamics, where in order to get the correct entropy, one must first take the thermodynamic limit and only then the zero-temperature limit [21].

3 Typicality in coin tossing systems—going beyond Shannon’s paradigm

From the preceding discussion, it follows that the concept of typical sets is closely related to the concept of Shannon entropy. It is thus natural to ask how unique the role of Shannon entropy is in determining typical sets. To answer this, we will go back to our \(\theta \) process described by the distribution \(\{p(0),p(1)\}\) and consider two important classes of non-Shannonian entropies, namely the Rényi and Tsallis entropies.

The Rényi entropy of order \(\alpha \) of the process \(\theta \) is defined as [22, 23]

$$\begin{aligned} H_\alpha (\theta ) \ = \ \frac{1}{1-\alpha }\log \Bigg [\sum _{k\in \{0,1\}} p^\alpha (k)\Bigg ]\,, \end{aligned}$$
(39)

where \(\alpha > 0\). By L’Hôpital’s rule, the Rényi entropy converges to the Shannon entropy for \(\alpha \rightarrow 1\), that is:

$$\begin{aligned} \lim _{\alpha \rightarrow 1}H_\alpha (\theta ) \ = \ H(\theta ). \end{aligned}$$
(40)

Similarly, we may introduce the Tsallis entropy of order \(\alpha \) as [24, 25]

$$\begin{aligned} S_\alpha (\theta ) \ = \ \frac{1}{\alpha -1}\Bigg [1- \sum _{k\in \{0,1\}} p^\alpha (k)\Bigg ]\,, \end{aligned}$$
(41)

where again

$$\begin{aligned} \lim _{\alpha \rightarrow 1}S_\alpha (\theta ) \ = \ H(\theta ). \end{aligned}$$
(42)

In this section, we will explore how these two probability functionals characterize the concentration of measure in the Bernoulli binary scheme. In particular, we will see that the characterization of the typical set by the Rényi entropy and the Tsallis entropy leads in a natural way to the equilibrium free energy and the partition function, respectively.
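
As a quick numerical sanity check of the limits (40) and (42) (ours; the distribution \(\{0.7, 0.3\}\) is an illustrative choice), one can evaluate both entropies as \(\alpha \) approaches 1:

```python
# Numerical sanity check (ours) of the alpha -> 1 limits, Eqs. (40) and (42):
# both the Rényi entropy (39) and the Tsallis entropy (41) approach the
# Shannon entropy of the Bernoulli process as alpha approaches 1.
import math

p = (0.7, 0.3)                             # assumed Bernoulli distribution
H = -sum(q * math.log(q) for q in p)

def renyi(alpha: float) -> float:
    return math.log(sum(q**alpha for q in p)) / (1 - alpha)

def tsallis(alpha: float) -> float:
    return (1 - sum(q**alpha for q in p)) / (alpha - 1)

for alpha in (0.5, 0.9, 0.99, 1.01, 1.1, 2.0):
    print(f"alpha={alpha:5.2f}  H_alpha={renyi(alpha):7.4f}  "
          f"S_alpha={tsallis(alpha):7.4f}  (Shannon H={H:7.4f})")
```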

3.1 Typical set from Rényi entropy

Let us again consider a sequence of i.i.d. random variables \(\theta _1, \ldots ,\theta _N\) following the distribution \(\{p(0),p(1)\}\) characterizing the Bernoulli process \(\theta \). We can again use Hoeffding’s inequality to show that

$$\begin{aligned} & \hspace{-10mm}\frac{1}{1-\alpha }\log \left( \frac{1}{N}\sum _{k=1}^Np^{\alpha -1}(\theta _k)\right) \ \rightarrow \ H_\alpha (\theta ), \end{aligned}$$
(43)

where the convergence is understood as the convergence in probability. Indeed, Eq. (43) directly follows from Hoeffding’s inequality

$$\begin{aligned} & \hspace{-3mm}\mathbb {P} \left\{ \left( \frac{1}{N}\sum _{k = 1}^N p^{\alpha -1}(\theta _k) - \langle p^{\alpha -1}(\theta ) \rangle \right) \ge \delta \right\} \nonumber \\ & \le \ e^{-2\delta ^2 N}, \end{aligned}$$
(44)

where

$$\begin{aligned} \langle p^{\alpha -1}(\theta ) \rangle= & \frac{1}{N}\sum _{\theta _1, \ldots , \theta _N } p(\theta _1, \ldots , \theta _N) \sum _{k=1}^N p^{\alpha -1 }(\theta _k)\nonumber \\= & \sum _{l \in \{0,1\}} p^{\alpha }(l). \end{aligned}$$
(45)

Similarly to Shannon entropy, we can associate with the expression (43) a sequence of typical sets, which we will call Rényi-type typical sets. In fact, let \(B^N_{\epsilon }(\alpha )\) be the set of sequences of length N, i.e. \(\sigma _1^N, \ldots , \sigma _m^N, \ldots \) with a generic member \(\sigma _\mathrm{{g}}^N = (\theta _{\mathrm{{g}},1}, \ldots , \theta _{\mathrm{{g}},N})\) which satisfies

$$\begin{aligned} N e^{(1-\alpha )H_\alpha (\theta )-\epsilon }\le & \sum _{k\le N} p^{\alpha -1}(\theta _{\mathrm{{g}},k})\nonumber \\ \le & Ne^{(1-\alpha )H_\alpha (\theta )+\epsilon } , \end{aligned}$$
(46)

(for arbitrary \(\epsilon >0\)). In Appendix A, we use the concept of the Kolmogorov–Nagumo mean [26, 27] to show that this formulation of a typical set represents a natural generalization of the Shannon case. For \(\epsilon \ll 1\), this can be equivalently rewritten as

$$\begin{aligned} \left| \frac{1}{N}\sum _{k=1}^N p^{\alpha -1}(\theta _{\mathrm{{g}},k}) - \langle p^{\alpha -1}(\theta ) \rangle \right| \ \le \ \tilde{\epsilon }, \end{aligned}$$
(47)

where

$$\begin{aligned} \tilde{\epsilon } \ = \ e^{(1-\alpha )H_\alpha (\theta )}\epsilon . \end{aligned}$$
(48)

In probability, we can write (47) as

$$\begin{aligned} & \mathbb {P} \left\{ \left| \frac{1}{N}\sum _{k=1}^N p^{\alpha -1}(\theta _{k}) - \langle p^{\alpha -1}(\theta ) \rangle \right| < \tilde{\epsilon } \right\} \nonumber \\ & \quad =\ \mathbb {P}\left( \theta _1, \theta _2, \ldots , \theta _N \in B^N_{\epsilon }(\alpha )\right) \ > \ 1 \ - \ 2e^{-2 \tilde{\epsilon }^2N}.~~~~ \nonumber \\ \end{aligned}$$
(49)

On the last line, we used (44) and set \(\tilde{\epsilon } = \delta \). From (49) it directly follows that

$$\begin{aligned} \mathbb {P}\left( \theta _1, \theta _2, \ldots , \theta _N\in B^N_{\epsilon }(\alpha )\right) \ \rightarrow \ 1. \end{aligned}$$
(50)

Therefore, similarly to the Shannon case, in the large N limit the set \(B^N_{\epsilon }(\alpha )\) carries almost all the probability—justifying the name typical set.
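
The convergence (50) can be illustrated by simulation. The sketch below (ours; \(\alpha = 2\), \(\epsilon = 0.05\) and \(p(0)=0.7\) are illustrative choices) samples i.i.d. sequences and counts how often the membership condition (46) holds:

```python
# A Monte Carlo illustration (ours) of Eq. (50): the fraction of sampled
# sequences satisfying the Rényi-type membership condition (46) approaches 1.
import math
import random

random.seed(1)
p = (0.7, 0.3)                     # assumed Bernoulli distribution {p(0), p(1)}
alpha, eps, trials = 2.0, 0.05, 2000
H_a = math.log(p[0]**alpha + p[1]**alpha) / (1 - alpha)  # Rényi entropy (39)

for N in (50, 200, 1000):
    lo = N * math.exp((1 - alpha) * H_a - eps)   # lower bound in (46)
    hi = N * math.exp((1 - alpha) * H_a + eps)   # upper bound in (46)
    inside = 0
    for _ in range(trials):
        s = sum((p[0] if random.random() < p[0] else p[1]) ** (alpha - 1)
                for _ in range(N))
        inside += lo <= s <= hi
    print(f"N={N:5d}  fraction of sequences in B_eps^N ~ {inside / trials:6.4f}")
```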

To find bounds on the cardinality of \(B^N_{\epsilon }(\alpha )\), we can follow the strategy from Sect. 2.2.1. In particular, to obtain the lower bound, we can write (for \(\alpha >1\))

$$\begin{aligned} & 1 - 2e^{-2\tilde{\epsilon }^2 N} < \mathbb {P}\left\{ \theta _1,\theta _2, \ldots ,\theta _N \in B^N_{\epsilon }(\alpha )\right\} \nonumber \\ & \quad = \ \sum _{\theta _1,\theta _2, \ldots ,\theta _N \in B^N_{\epsilon }(\alpha )} e^{\frac{N}{\alpha -1} \sum _{k\le N} \frac{1}{N} \log (p^{\alpha -1}(\theta _k))}\nonumber \\ & \quad \le \ \sum _{\theta _1,\theta _2, \ldots ,\theta _N \in B^N_{\epsilon }(\alpha )} e^{\frac{N}{\alpha -1} \log \left[ \sum _{k\le N} \frac{1}{N} (p^{\alpha -1}(\theta _k))\right] } \nonumber \\ & \quad = \ \left| B^N_{\epsilon }(\alpha )\right| e^{-N[H_{\alpha }(\theta ) -\epsilon ]}. \end{aligned}$$
(51)

Here, on the third line, we used Jensen’s inequality for concave functions (in this case, the logarithm), and on the last line, we employed (46). Should we repeat the argument for \(\alpha <1\), we would obtain

$$\begin{aligned} 1 - 2e^{-2\tilde{\epsilon }^2 N}< & \left| B^N_{\epsilon }(\alpha )\right| e^{-N[H_{2- \alpha }(\theta ) -\epsilon ]}. \end{aligned}$$
(52)

As for the upper bound, we can write (\(\alpha >1\))

$$\begin{aligned} 1= & \sum _{\theta _1,\theta _2, \ldots ,\theta _N \in \Omega ^N} p(\theta _1,\theta _2, \ldots ,\theta _N)\nonumber \\\ge & \sum _{\theta _1,\theta _2, \ldots ,\theta _N \in B^N_{\epsilon }(\alpha )} p(\theta _1,\theta _2, \ldots ,\theta _N)\nonumber \\= & \sum _{\theta _1,\theta _2, \ldots ,\theta _N \in B^N_{\epsilon }(\alpha )} e^{\frac{N}{1-\alpha } \sum _{k\le N} \frac{1}{N} \log (p^{1-\alpha }(\theta _k))}\nonumber \\\ge & \ \sum _{\theta _1,\theta _2, \ldots ,\theta _N \in B^N_{\epsilon }(\alpha )} e^{\frac{N}{1-\alpha } \log \left[ \sum _{k\le N} \frac{1}{N} (p^{1-\alpha }(\theta _k))\right] } \nonumber \\= & \left| B^N_{\epsilon }(\alpha )\right| e^{-N[H_{2-\alpha }(\theta ) +\epsilon ]}. \end{aligned}$$
(53)

Similarly, for \(\alpha <1\) we get

$$\begin{aligned} 1 \ \ge \ \left| B^N_{\epsilon }(\alpha )\right| e^{-N[H_{\alpha }(\theta ) -\epsilon ]}. \end{aligned}$$
(54)

Thus, the cardinality of \(B^N_{\epsilon }(\alpha )\) is constrained as follows (\(\alpha > 1\))

$$\begin{aligned} (1 - 2e^{-2\tilde{\epsilon }^2 N})e^{N[H_{\alpha }(\theta ) -\epsilon ]}\le & \left| B^N_{\epsilon }(\alpha )\right| \nonumber \\ \le & e^{N[H_{2-\alpha }(\theta ) +\epsilon ]} , \end{aligned}$$
(55)

and similarly for \(\alpha <1\).

Since the maximum of the Rényi entropy for the Bernoulli binary scheme is \(\log 2\), we get that

$$\begin{aligned} \frac{\left| B^{N}_{\epsilon }(\alpha )\right| }{\left| \Omega ^N\right| } \ \rightarrow \ 0, \end{aligned}$$
(56)

which again shows that the relative cardinality decays at least exponentially with N. In fact, one can prove an even stronger statement [28], namely that the cardinality of the Rényi-type typical set satisfies

$$\begin{aligned} |B^{N}_{\epsilon }(\alpha )| \ \sim \ |A^{N}_{\epsilon /\alpha }|. \end{aligned}$$
(57)

Now, we rename \(\alpha \) to \(\beta \) and recall the definition of the partition function \(Z_\beta \) from (30). Similarly to equilibrium thermodynamics, we can associate with the partition function \(Z_\beta \) the free-energy-like functional

$$\begin{aligned} F_\theta (\beta ) \ = \ \log Z_\beta \,, \end{aligned}$$
(58)

which can be succinctly rewritten as

$$\begin{aligned} F_\theta (\beta ) \ = \ (1-\beta )H_\beta (\theta ). \end{aligned}$$
(59)

In terms of the free energy, the typical set identified through the Rényi entropy can be defined in a more compact way, as the set of sequences \(\sigma _{\mathrm{g}}^N\) satisfying

$$\begin{aligned} e^{F_\theta (\beta ) \ \!- \ \! \epsilon }\le & \frac{1}{N}\sum _{k\le N} p^{\beta -1}(\theta _{\mathrm{{g}},k}) \nonumber \\\le & e^{F_\theta (\beta ) \ \! + \ \! \epsilon }. \end{aligned}$$
(60)

So, the typical set arising from the Rényi entropy gives rise to the free-energy-like functional.

In passing we note that the Shannon entropy of the “thermalized” coin (36) can be rewritten as

$$\begin{aligned} H(\theta (\beta ))= & \beta H(p_\beta ,p) \ + \ (1-\beta )H_\beta (\theta )\nonumber \\= & \beta H(p_\beta ,p) \ + \ F_\theta (\beta ). \end{aligned}$$
(61)

Here the reference process \(\theta \) has temperature 1 and

$$\begin{aligned} H(p_\beta ,p) \ = \ - \sum _{k \in \{0,1\} } p_{\beta }(k) \log p(k), \end{aligned}$$
(62)

is an analogue of the internal energy.

This connection of the Rényi entropy with the free energy through a reference process was first reported in [29] and studied in depth in [30].
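
The identity (61) is straightforward to verify numerically. The following sketch (ours; the reference distribution \(\{0.7, 0.3\}\) is an illustrative choice) evaluates both sides for several values of \(\beta \):

```python
# A numerical check (ours) of the decomposition (61):
# H(theta(beta)) = beta * H(p_beta, p) + F_theta(beta), with F = log Z_beta
# as in Eqs. (58)-(59).
import math

p = (0.7, 0.3)                 # reference process at unit temperature, our choice

for beta in (0.5, 2.0, 5.0):
    Z = sum(q**beta for q in p)                         # partition function (30)
    pb = [q**beta / Z for q in p]                       # escort distribution (29)
    F = math.log(Z)                                     # free energy (58)
    H_escort = -sum(q * math.log(q) for q in pb)        # H(theta(beta)), (36)
    cross = -sum(pb[i] * math.log(p[i]) for i in range(2))  # H(p_beta, p), (62)
    print(f"beta={beta:4.1f}  H(theta(beta))={H_escort:8.5f}  "
          f"beta*H(p_beta,p)+F={beta * cross + F:8.5f}")
```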

3.2 Typical sets from Tsallis entropy

Now, we turn to the Tsallis entropy. Proceeding as above, one can straightforwardly prove that

$$\begin{aligned} \frac{1}{\alpha -1}\left( 1- \sum _{k\le N}\frac{1}{N}\ \! p^{\alpha -1}(\theta _k)\right) \ \rightarrow \ S_\alpha (\theta ), \end{aligned}$$
(63)

where the convergence is again meant in probability. The latter is a simple consequence of Eq. (44).

With the expression (63), we can associate typical sets that we will call Tsallis-type typical sets. In particular, let \(C^N_{\epsilon }(\alpha )\) be the set of sequences of length N, i.e. \(\sigma _1^N, \ldots , \sigma _m^N, \ldots \) with a generic member \(\sigma _\mathrm{{g}}^N = (\theta _{\mathrm{{g}},1}, \ldots , \theta _{\mathrm{{g}},N})\), which satisfies

$$\begin{aligned} & N[1- (\alpha -1)S_\alpha (\theta )-\epsilon ] \ \le \ \sum _{k\le N} p^{\alpha -1}(\theta _{\mathrm{{g}},k})\nonumber \\ & \quad \le \ N[1- (\alpha -1)S_\alpha (\theta )+\epsilon ],~~~~ \end{aligned}$$
(64)

(for arbitrary \(\epsilon >0\)). The set \(C^N_{\epsilon }(\alpha )\) is a typical set because

$$\begin{aligned} \mathbb {P}\left( \theta _1, \theta _2, \ldots , \theta _N\in C^N_{\epsilon }(\alpha )\right) \ \rightarrow \ 1. \end{aligned}$$
(65)

Before proving this relation, let us first motivate the relation (64). To this end, we rewrite the Tsallis entropy in terms of the deformed logarithm as

$$\begin{aligned} S_\alpha (\theta ) \ = \ \sum _{k\in \{0,1\}} p(k) \ln _{\alpha } \left( \frac{1}{p(k)} \right) , \end{aligned}$$
(66)

where the \(\alpha \)-logarithm is defined as

$$\begin{aligned} \ln _{\alpha }(x) \ = \ \int _1^x dt \ \! t^{-\alpha } \ = \ \frac{1}{1-\alpha }\left( x^{1-\alpha } - 1 \right) . \end{aligned}$$
(67)

By rewriting Eq. (16) as

$$\begin{aligned} H(\theta ) +\epsilon\ge & \sum _{k \le N} \frac{1}{N} \ \! \log \!\left( \frac{1}{p(\theta _{\mathrm{{g}},k})} \right) \nonumber \\\ge & H(\theta ) -\epsilon , \end{aligned}$$
(68)

we might propose that the Tsallis-type typical set satisfies the defining relation

$$\begin{aligned} S_{\alpha }(\theta ) + \epsilon\ge & \sum _{k \le N} \frac{1}{N} \ \! \ln _{\alpha } \!\left( \frac{1}{p(\theta _{\mathrm{{g}},k})} \right) \nonumber \\\ge & S_{\alpha }(\theta ) -\epsilon , \end{aligned}$$
(69)

which indeed coincides with (64). Here, the factor \(1-\alpha \) was absorbed into the redefinition of \(\epsilon \), so that the new \(\epsilon \) is still positive. It is quite interesting to note that (64) can also be rewritten in a form that is reminiscent of (46), namely

$$\begin{aligned} N \left[ e_{\alpha }^{S_\alpha (\theta )-\epsilon }\right] ^{1-\alpha }\le & \sum _{k\le N} p^{\alpha -1}(\theta _{\mathrm{{g}},k})\nonumber \\ \le & N\left[ e_{\alpha }^{S_\alpha (\theta )+\epsilon }\right] ^{1-\alpha } , \end{aligned}$$
(70)

where the \(\alpha \)-exponential is defined as [25]

$$\begin{aligned} e_{\alpha }^x \ = \ [1 \ + \ (1-\alpha )x]_+^{{1}/{(1-\alpha )}}, \end{aligned}$$
(71)

with \([z]_+ = \max \{z, 0\}\). Inequality (70) is valid for \(\alpha <1\); the bounds must be reversed for \(\alpha >1\).

Let us now turn back to (65). In order to prove it, we rewrite (64) as

$$\begin{aligned} \epsilon\ge & \left| \frac{1}{N}\sum _{k=1}^N p^{\alpha -1}(\theta _{\mathrm{{g}},k}) - [ 1 + (1-\alpha )S_{\alpha }(\theta )] \right| \nonumber \\= & \left| \frac{1}{N}\sum _{k=1}^N p^{\alpha -1}(\theta _{\mathrm{{g}},k}) - \langle p^{\alpha -1}(\theta ) \rangle \right| , \end{aligned}$$
(72)

which in probability can be written as

$$\begin{aligned} & \mathbb {P} \left\{ \left| \frac{1}{N}\sum _{k=1}^N p^{\alpha -1}(\theta _{k}) - \langle p^{\alpha -1}(\theta ) \rangle \right| < {\epsilon } \right\} \nonumber \\ & \quad =\ \mathbb {P}\left( \theta _1, \theta _2, \ldots , \theta _N \in C^N_{\epsilon }(\alpha )\right) \ > \ 1 \ - \ 2e^{-2 {\epsilon }^2N}.~~~~ \nonumber \\ \end{aligned}$$
(73)

The inequality is a consequence of (44). This concludes our proof of (65).

Let us now turn our attention to the cardinality of \(C^N_{\epsilon }(\alpha )\). Again, we can follow the strategy of Sect. 2.2.1. In particular, to obtain the lower bound, we can write (for \(\alpha > 1\))

$$\begin{aligned} & 1 - 2e^{-2{\epsilon }^2 N} < \mathbb {P}\left\{ \theta _1,\theta _2, \ldots ,\theta _N \in C^N_{\epsilon }(\alpha )\right\} \nonumber \\ & \quad = \ \sum _{\theta _1,\theta _2, \ldots ,\theta _N \in C^N_{\epsilon }(\alpha )} e^{\frac{N}{\alpha -1} \sum _{k\le N} \frac{1}{N} \log (p^{\alpha -1}(\theta _k))}\nonumber \\ & \quad \le \ \sum _{\theta _1,\theta _2, \ldots ,\theta _N \in C^N_{\epsilon }(\alpha )} e^{\frac{N}{\alpha -1} \log \left[ \sum _{k\le N} \frac{1}{N} (p^{\alpha -1}(\theta _k))\right] } \nonumber \\ & \quad = \ \left| C^N_{\epsilon }(\alpha )\right| \left[ e_{\alpha }^{S_{\alpha }(\theta ) -\epsilon }\right] ^{-N}. \end{aligned}$$
(74)

Here, on the last line, we used (72). For \(\alpha <1\), we would need to change \(\alpha \) to \(2-\alpha \). To obtain the upper bound, we can write (for \(\alpha > 1\)), cf. (53)

$$\begin{aligned} 1= & \sum _{\theta _1,\theta _2, \ldots ,\theta _N \in \Omega ^N} p(\theta _1,\theta _2, \ldots ,\theta _N)\nonumber \\\ge & \sum _{\theta _1,\theta _2, \ldots ,\theta _N \in C^N_{\epsilon }(\alpha )} p(\theta _1,\theta _2, \ldots ,\theta _N)\nonumber \\\ge & \sum _{\theta _1,\theta _2, \ldots ,\theta _N \in C^N_{\epsilon }(\alpha )} e^{\frac{N}{1-\alpha } \log \left[ \sum _{k\le N} \frac{1}{N} (p^{1-\alpha }(\theta _k))\right] } \nonumber \\= & \left| C^N_{\epsilon }(\alpha )\right| \left[ e_{\alpha }^{S_{2-\alpha }(\theta ) +\epsilon }\right] ^{-N}. \end{aligned}$$
(75)

Consequently, the cardinality of \(C^N_{\epsilon }(\alpha )\) is constrained as follows (\(\alpha > 1\))

$$\begin{aligned} (1 - 2e^{-2{\epsilon }^2 N})\left[ e_{\alpha }^{S_{\alpha }(\theta ) -\epsilon }\right] ^N\le & \left| C^N_{\epsilon }(\alpha )\right| \nonumber \\ \le & \left[ e_{2-\alpha }^{S_{2-\alpha }(\theta ) +\epsilon }\right] ^N . \end{aligned}$$
(76)

By identifying \(\alpha \) with the “inverse of the temperature” \(\beta \), we obtain for the partition function that

$$\begin{aligned} Z_\beta \ = \ 1-(\beta -1)S_\beta (\theta ). \end{aligned}$$
(77)

As a result, the condition (69) for typical Tsallis-type sequences can be rewritten in the form

$$\begin{aligned} Z_\beta -\epsilon \ \le \ \frac{1}{N}\sum _{k\le N} p^{\beta -1}(\theta _{\mathrm{{g}},k}) \ \le \ Z_\beta +\epsilon . \end{aligned}$$
(78)

Therefore, the identification of the typical sequences from the Tsallis entropy leads to the emergence of the partition function of the “thermalized process”.
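
A short numerical check (ours; \(\beta = 2\) and \(p(0)=0.7\) are illustrative choices) of the identity (77), together with the concentration of the sample average appearing in (78):

```python
# A quick check (ours) of Eq. (77), Z_beta = 1 - (beta - 1) S_beta(theta),
# and of condition (78): the sample average (1/N) sum_k p^(beta-1)(theta_k)
# of a simulated sequence concentrates around Z_beta.
import math
import random

random.seed(2)
p = (0.7, 0.3)                             # assumed Bernoulli distribution
beta = 2.0

Z = sum(q**beta for q in p)                            # Eq. (30)
S = (1 - sum(q**beta for q in p)) / (beta - 1)         # Tsallis entropy (41)
print(f"Z_beta = {Z:.6f},  1 - (beta-1) S_beta = {1 - (beta - 1) * S:.6f}")

for N in (100, 1000, 10000):
    seq = [0 if random.random() < p[0] else 1 for _ in range(N)]
    avg = sum(p[s] ** (beta - 1) for s in seq) / N      # lhs of (78)
    print(f"N={N:6d}  (1/N) sum p^(beta-1) = {avg:.4f}  (Z_beta = {Z:.4f})")
```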

4 Entropy and typicality: going beyond i.i.d.

As we saw in the previous section, the thermodynamic-like structure of a process can be derived from the concentration of measure, under clearly stated underlying dynamical assumptions: (1) a constant state space, i.e. no states appear or disappear through time, and (2) independence of the successive drawings, i.e. no time correlations are present. We then extended the notion of typicality and of the typical set by characterizing the occupation of the state space through different limiting procedures, giving rise to the Rényi and Tsallis entropies. In the latter context, we observed that the corresponding entropies could not be formulated as sample entropies, i.e. logarithms (or deformed logarithms) of the cardinality of the ensuing typical sets, as was possible in the Shannon case. On the other hand, the typical sets obtained were instrumental in defining “equilibrium” thermodynamic functions, namely the free energy and the partition function, without the need to introduce thermalized coins (and the escort transformation).

Going beyond the simple structure of the Bernoulli binary scheme (or, more generally, i.i.d. processes), the question arises: could the foregoing macroscopic picture emerge under more general assumptions? For instance, for stochastic processes that may have growing or shrinking state spaces. Thus, the first task is to define a sufficiently general class of stochastic processes that includes the above cases and, if necessary, also processes violating the i.i.d. condition. We call such a class of processes Compact Stochastic Processes (CSP’s) [9]. We are especially interested in finding a sample entropy of the system that is a functional of a trace-class type, thus generalizing the formula (16). In turn, this would represent the CSP version of Einstein’s celebrated entropic principle [16]. In the following, we will briefly outline a general strategy in this direction. A more detailed discussion can be found in [9].

4.1 Basics: CSP’s

Let us consider a time-discrete stochastic process \(\eta \) [31, 32]. A realization of N steps of the process is denoted as \(\eta (N)\)

$$\begin{aligned} \eta (N) \ = \ \eta _1, \ldots ,\eta _N, \end{aligned}$$
(79)

where \(\eta _1, \ldots ,\eta _N\) are random variables themselves. Note that, in different realizations of N steps of the process, the sequence of random variables can be different, as the process may display path dependence, long-term correlations, or changes of the phase space (either shrinking or expanding). We denote a particular trajectory (or a sample path) of the process as

$$\begin{aligned} x(N) \ \equiv \ x_1, \ldots ,x_N \ \in \ \Omega _\eta (N). \end{aligned}$$
(80)

Here \(\Omega _\eta (N)\) is the set of all possible trajectories of the process \(\eta \) after N steps. We focus on the family of stochastic processes for which there exists (i) a positive, strictly concave and strictly increasing function \(\Lambda \in \mathcal {C}^2\) in the interval \([1,\infty )\), such that \(\Lambda (1)=0\) [9, 33], and (ii) a positive, strictly increasing function \(g\in \mathcal {C}^2\), in the interval \((1,\infty )\), such that

$$\begin{aligned} \lim _{N\rightarrow \infty }\ \! \frac{1}{g(N)}\Lambda \left( \frac{1}{p(\eta (N))}\right) \ = \ 1, \end{aligned}$$
(81)

in probability. Stochastic processes satisfying the above convergence relation are CSP’s [9]. We recall that no assumptions were made about the process beyond the convergence condition (81). In particular, we do not require independence of the successive values \(\ldots ,\eta _{N-1},\eta _N,\eta _{N+1}, \ldots \) or stable state spaces from which the different elements of the sequence of random variables take values. We call the pair of functions \(\Lambda , g\) the compact scale of the process \(\eta \), and note that a CSP can have more than one compact scale.

The convergence condition (81) implies the following asymptotic behavior for \(\Lambda \)

$$\begin{aligned} \lim _{z\rightarrow \infty }\frac{\Lambda (\lambda z)}{\Lambda (z)} \ = \ 1\,, \;\;\; \forall \lambda \in \mathbb {R}^+\,. \end{aligned}$$
(82)

We refer to the set of functions \(\Lambda \) as \(\mathcal {L}\). Typical candidates for \(\Lambda \) are of the form \(\Lambda (z)=c\log ^d(z)\), where c, d are two positive, real-valued constants or, more generally

$$\begin{aligned} \Lambda (z) \ = \ c_1\log ^{d_1}(1+c_2\log ^{d_2}(1+ c_3\log ^{d_3}(\ldots ))),~~~~~ \end{aligned}$$
(83)

where \(c_1, \ldots \) and \(d_1, \ldots \) are positive, real-valued constants. In previous approaches, these constants have been identified with scaling exponents that make it possible to classify the different potential growing dynamics of the phase space [9, 34].
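
As a toy illustration (ours; the constants c, d and the factor \(\lambda \) are arbitrary), one can check numerically that a candidate \(\Lambda (z)=c\log ^d(z)\) satisfies the scaling property (82), albeit with logarithmically slow convergence:

```python
# A toy check (ours) of the scaling property (82) for a candidate
# Lambda(z) = c * log(z)^d: the ratio Lambda(lam*z)/Lambda(z) tends to 1
# for any fixed lam > 0 as z grows. Constants are illustrative.
import math

c, d, lam = 2.0, 1.5, 10.0                 # illustrative constants

def Lam(z: float) -> float:
    return c * math.log(z) ** d

for z in (1e2, 1e4, 1e8, 1e16):
    print(f"z={z:8.0e}  Lambda(lam*z)/Lambda(z) = {Lam(lam * z) / Lam(z):8.5f}")
```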

4.2 The typical set and entropy in CSP’s

If, given a stochastic process \(\eta \), there exists a compact scale \(\Lambda ,g\) for which the convergence condition (81) holds, then there exists a typical set \(A_\epsilon ^N\subseteq \Omega _\eta (N)\) of paths of the process \(\eta \). The typical set carries essentially all the probability in the limit of very large N, which means that although the potential set of paths can be arbitrarily large, only paths belonging to the typical set are expected to be effectively observed. A path \(x(N)=x_1 \ldots x_N\) of the process \(\eta \) belongs to the typical set (or, alternatively, it is a typical path) if its probability of occurring is bounded as

$$\begin{aligned} \frac{1}{\Lambda ^{-1}(g(N)(1+\epsilon ))}\le & p(x(N))\nonumber \\ \le & \frac{1}{\Lambda ^{-1}(g(N)(1-\epsilon ))}, \end{aligned}$$
(84)

where \(\Lambda ^{-1}\) is the inverse of \(\Lambda \), i.e., \((\Lambda ^{-1}\circ \Lambda )(z)=z\), which exists given the assumption that \(\Lambda \) is a monotonically increasing function. In particular, one can prove [9] that for any \(\epsilon >0\) there is an \(N'>0\) such that, if \(N>N'\)

$$\begin{aligned} \mathbb {P}\left( x(N)\in A_\epsilon ^N \right) \ > \ 1-\epsilon . \end{aligned}$$
(85)

In other words, the typical set usurps all probability, and the probability of observing non-typical paths becomes negligible. Note that since there is no assumption beyond the convergence condition (81), we cannot ensure the validity of tighter bounds on the concentration of measure, as we did in Sect. 2.2, where we used Hoeffding’s inequality, see e.g. Eq. (12). From the definition and properties of the typical set, it follows directly that there exists a non-increasing sequence of positive numbers \(\epsilon _1, \ldots ,\epsilon _N, \ldots \) such that \(\epsilon _N\rightarrow 0\) (which we will write as \(\epsilon _N \searrow 0\)), defining a sequence of typical sets

$$\begin{aligned} \ldots , A_{\epsilon _{N-1}}^{N-1}, A_{\epsilon _{N}}^{N}, A_{\epsilon _{N+1}}^{N+1}, \ldots , \end{aligned}$$
(86)

such that:

$$\begin{aligned} \mathbb {P}\left( x(N)\ \in \ A_{\epsilon _N}^N\right) \ \rightarrow \ 1\,, \end{aligned}$$
(87)

i.e., the typical set concentrates all the probability.

If, in the limit of large N, all contributions to the scaling factor g(N) of paths outside the typical set vanish, one can rewrite the scaling factor as a trace-class entropic functional

$$\begin{aligned} S_\Lambda (N) \ = \ \sum _{x(N) \ \in \ \Omega _\eta (N)}p(x(N))\Lambda \left( \frac{1}{p(x(N))}\right) \,, \nonumber \\ \end{aligned}$$
(88)

which satisfies the first three of the four Shannon–Khinchin axioms for the entropic measure [9]. Consequently, the scaling term g(N) can be identified with the generalized entropy \(S_\Lambda (N)\) in simple CSP’s. In turn, by construction, \(S_\Lambda (N)\) converges to the (generalized) logarithm of the cardinality of the typical set, which describes the effective size of the state space. Indeed, for \(\epsilon _N\searrow 0\)

$$\begin{aligned} \frac{\Lambda (|A_{\epsilon _N}^N|)}{S_\Lambda (N)} \ \rightarrow \ 1. \end{aligned}$$
(89)

In addition, we observe that, for any \(x(N)\in A_{\epsilon _N}^N\), with \(\epsilon _N\searrow 0\), the following limit holds

$$\begin{aligned} \frac{\Lambda \left( \frac{1}{p(x(N))}\right) }{S_\Lambda (N)} \ \rightarrow \ 1\,. \end{aligned}$$
(90)

This implies that the probabilities of the paths belonging to the typical set are all equal upon the application of the generalized logarithm \(\Lambda \). This might be viewed as a non-i.i.d. generalization of the conventional AEP.

With the above formalism, we have therefore connected: (1) the microscopic dynamics of the system, (2) the effective increase of the phase space (captured by the evolution of the typical set), and (3) a generalized entropic form \(S_\Lambda \).
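
As a consistency check (ours), the sketch below verifies that for an i.i.d. Bernoulli process the compact scale \(\Lambda = \log \), \(g(N)=NH(\theta )\) turns the CSP typicality condition (84) into the Shannon-case condition (16), with the fraction of typical sampled paths approaching one:

```python
# A sketch (ours) showing that the CSP typicality condition (84) with the
# compact scale Lambda = log, g(N) = N*H reduces to the Shannon-case bounds
# (16): a simulated i.i.d. path almost always satisfies (84) for large N.
import math
import random

random.seed(3)
p = (0.7, 0.3)                             # assumed Bernoulli distribution
H = -sum(q * math.log(q) for q in p)
eps, trials = 0.1, 2000

# With Lambda = log and g(N) = N*H, condition (84) reads
# g(N)(1-eps) <= Lambda(1/p(x(N))) = -log p(x(N)) <= g(N)(1+eps),
# so we work directly with -log p to avoid numerical underflow.
for N in (50, 200, 1000):
    typical = 0
    for _ in range(trials):
        neg_logp = -sum(math.log(p[0] if random.random() < p[0] else p[1])
                        for _ in range(N))
        typical += N * H * (1 - eps) <= neg_logp <= N * H * (1 + eps)
    print(f"N={N:5d}  P(x in A_eps^N) ~ {typical / trials:6.4f}")
```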

5 Discussion and conclusions

Characterizing the occupation of the state space is key to understanding the macroscopic properties of systems composed of many microscopic parts. The existence of the typical set, which is a direct consequence of the concentration of measure phenomenon, allows a massive reduction of degrees of freedom, giving rise to macroscopic functionals that characterize macroscopic configurations (i.e., macrostates). The typical set is thus the key concept that allows a rigorous justification of the reasoning behind statistical mechanics. Parallel to the considerations based on the concentration of measure, considerations on the deviations from the typical behaviors—studied by the so-called large deviations theory—provide very valuable information about the macroscopic behavior of the system [35]. In fact, the two are complementary and thus provide essential information for the possible thermodynamic interpretation of the different features of the sample space occupancy. The question remains whether and how the powerful concept of typicality and the resulting deviations from it can be extended to systems with more complex microscopic dynamics, such as non-i.i.d. systems.

To demonstrate the power of the unifying concept of typicality, we first provided a toy example, namely a “thermalized” coin, to illustrate how entropy, temperature, and the occupation of the state space are interrelated. In this context, we have shown that the typical set can be characterized not only by the Shannon entropy but also by the Rényi and Tsallis entropies. A remarkable observation here is that the characterization via different convergence criteria, using either Rényi- or Tsallis-type typical sets, naturally leads to the free energy and the partition function, respectively. Therefore, in i.i.d. systems, the typical set can be characterized through different entropic functionals, and this gives rise to different thermodynamically relevant quantities. One might naturally expect that similar results will hold if, instead of i.i.d. sequences of random variables, we extend our considerations to weakly dependent random variables. This would naturally also cover a large portion of conventional equilibrium statistical thermodynamics. Beyond equilibrium, a general convergence condition can be postulated, namely the CSP condition. Although CSP’s encompass a broad class of microscopic stochastic dynamics, one can still define the state space occupation and the associated entropic functional. As we have seen in the case of the simple coin toss process, different approaches to the typical set may give rise to different, interrelated functionals representing different thermodynamically interpretable quantities. It remains an open question how, for general CSP’s, different compact scales may relate to (or even characterize) the same process.

In passing, it is interesting to mention a potential connection with the coarse-graining method, i.e. a concept from statistical mechanics introduced more than a century ago by Paul and Tanya Ehrenfest [36] and further developed in the 1960s by Leo Kadanoff [37]. The coarse-graining procedure has proven to be a powerful one in statistical physics, especially when coupled with the concept of the renormalization group and the resulting portfolio of ideas and techniques that allow the systematic study of changes in a physical system as viewed at different length scales [38]. Coarse-graining—or, more explicitly, the possibility of performing such an operation—is often referred to as the key mechanism allowing for the massive collapse of degrees of freedom that leads to the thermodynamic interpretation of statistical ensembles. In this respect, one might argue that the existence of typical behaviors must underlie the success of the coarse-graining strategy in deriving macroscopic behaviors from a deeper microscopic description. However, caution is required: in so-called renormalizable theories, the system at a coarse-grained scale will generally consist of self-similar copies of itself when viewed at a smaller coarse-grained scale, with different parameters describing the components of the system. The components, or fundamental variables, may relate to atoms, elementary particles, atomic spins, etc. No such changes in the system’s components are required when discussing typical sets. There, the reduction in degrees of freedom is achieved purely as a result of the law of large numbers (or related isoperimetric inequalities), with all parameters of the macroscopic system remaining the same, regardless of whether one is working with the full state space or just the typical set. The latter is certainly true for i.i.d. systems. For non-i.i.d. systems, the dynamics and the probability may be non-trivially intertwined, with different system parameters leading to different typical sets, and the renormalization group approach may then facilitate a new understanding of this fact.

In line with the open questions listed above, future work in this direction should address: (1) connections between the Shannon, Rényi and Tsallis entropies, uncovering the relationship between state-space occupancy and thermodynamic (or information-theoretic) functionals—even in the equilibrium case, where these relations are still poorly understood; (2) possible connections between typicality, large deviation theory and the existence of coarse-grained descriptions of the system; and (3) the mathematical structure of a generalized statistical mechanics based on the existence of typicality in generic, more complex dynamics, e.g. in the framework of CSP’s. Crucially, these steps should go hand in hand with the derivation of empirically verifiable macroscopic observables.