
1 Introduction

The concepts of information and complexity seem to be intricately linked. Complexity notions are quantified in information-theoretical terms, and a general principle might say that the more information is needed to describe or build a structure, the more complex it is. That principle, however, needs some qualification. One should distinguish between information about the regularities of a structure or a process, which is usually useful, and information about random details, which is often useless. The question is not only information about what?, but also where is that information?, that is, whether and how it is or can be internally stored in a system with limited capacity, at which level of a process information is needed to predict its continuation, and where it can be found in a distributed system. In the latter case, we should, however, not only look for information that is exclusively located somewhere or that is shared between entities, but should also consider complementary or synergistic information, that is, information that only emerges when several sources are combined.

These lecture notes describe what is currently known about these questions; they develop the underlying theoretical concepts and illustrate them with simple examples. Moreover, once we can quantify complexity concepts, we can also try to optimize the corresponding complexity measures. This will be discussed systematically as well.

These notes are the result of a series of lectures that one of us (JJ) delivered at the Summer School in Como in July, 2018. They present work that we have done jointly during the last few years. JJ thanks Elisa Mastrogiacomo and Sergio Albeverio for organizing a very stimulating school, and the participants and the other lecturers, in particular Luciano Boi, Ivar Ekeland and Frank Riedel, for stimulating discussions.

2 Background: Principles of Information Theory

2.1 Shannon Information

The basic concept is that of the Shannon Information [52] of a random variable X, or equivalently, of a probability distribution p, when the possible values \(x_i\) of X are realized with probabilities \(p_i=p(x_i)\). These probabilities satisfy \(0\le p_i \le 1\) for all i, with the normalization \(\sum _i p_i=1\). The Shannon information or entropy then is

$$\begin{aligned} H(X)=H(p_1,\dots , p_n)=-\sum _i p_i \log _2 p_i \text { (bits)}. \end{aligned}$$
(1)

This is the expected reduction of uncertainty, i.e., the information gain, when we learn which concrete value \(x_i\) of the random variable X with known distribution p, \(p_i=p(x_i)\), is realized.

This is the basic example: when there are two possible events occurring with equal probability 1/2 (an unbiased coin), we gain \(\log _22= 1\) bit of information when we observe the outcome.

A fair die yields \(\log _2 6\) bits of information.
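
As a quick numerical check, here is a minimal Python sketch of these two computations; the helper name entropy_bits is ours.

```python
# Minimal sketch: Shannon entropy H(p) = -sum_i p_i log2 p_i, in bits.
import math

def entropy_bits(p):
    """Entropy of a probability vector p; terms with p_i = 0 contribute 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy_bits([0.5, 0.5]))   # unbiased coin: 1.0 bit
print(entropy_bits([1/6] * 6))    # fair die: about 2.585 bits
print(math.log2(6))               # = log2 6, for comparison
```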

2.2 Mutual and Conditional Information

We now consider the situation where we have an additional random variable Y.

In the example of the die, we could let \(Y=0\) (resp. 1) for an odd (even) result, each with probability 1/2. According to the basic example, we have \(H(Y)=1\) bit. When we know Y, only 3 possibilities for the value of X remain, each with probability 1/3.

This leads us to the concept of conditional information; in this example, the remaining uncertainty about X when knowing Y is

$$\begin{aligned} H(X|Y)=\log _2 3. \end{aligned}$$
(2)

Thus, the uncertainty about the value of X is reduced from \(\log _2 6\) to \(\log _2 3\) bits when knowing Y.

The joint information is related to the conditional information by

$$\begin{aligned} H(X,Y)= H(X) + H(Y|X)=H(Y)+H(X|Y). \end{aligned}$$
(3)

Thus, \( H(X,Y) \le H(X) + H(Y)\), with strict inequality if X and Y are not independent. In the example, we have \(H(X,Y)= H(X)\), since the value of X determines that of Y.

The information gain about X from knowing Y is called the mutual information of X and Y,

$$\begin{aligned} MI(X:Y)=H(X) -H(X|Y). \end{aligned}$$
(4)

In our example \(MI(X:Y)=\log _2 6-\log _2 3=\log _2 2 =1\) bit. From Y, we gain 1 bit of information about X.
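
The die/parity example can also be verified numerically; the following minimal Python sketch (helper names are ours) recovers \(H(X)=\log_2 6\), \(H(X|Y)=\log_2 3\) and \(MI(X:Y)=1\) bit from the joint distribution.

```python
# Minimal sketch of the die/parity example: X is a fair die, Y its parity.
import math
from collections import defaultdict

def H(dist):
    """Entropy (in bits) of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

joint = {(x, x % 2): 1 / 6 for x in range(1, 7)}   # joint distribution p(x, y)

pX, pY = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    pX[x] += p
    pY[y] += p

H_X, H_Y, H_XY = H(pX), H(pY), H(joint)
H_X_given_Y = H_XY - H_Y            # chain rule (3)
MI = H_X - H_X_given_Y              # definition (4)
print(H_X, math.log2(6))            # both about 2.585
print(H_X_given_Y, math.log2(3))    # both about 1.585, cf. (2)
print(MI)                           # 1.0 bit
```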

The mutual information is symmetric,

$$\begin{aligned} MI(X:Y)=MI(Y:X). \end{aligned}$$
(5)

The difference structure is perhaps the most important aspect. In many respects, (4) is more important and fundamental than (1), because we always have some prior knowledge, expressed here through Y, when we observe some X. Thus, the mutual information MI(X : Y) tells us how much we can already infer about X when we know Y. By then observing X, we only gain the additional information H(X|Y).

Summary:

$$\begin{aligned} H(X)= MI(X:Y)+H(X|Y) \end{aligned}$$
(6)

H(X) = how much you learn from observing X

MI(X : Y) = how much you learn about X by observing Y

H(X|Y) = how much you learn from observing X when you already know Y.

We can iterate the conditioning process with another random variable Z, to get the conditional mutual information

$$\begin{aligned} MI(X:Y|Z)=H(X|Z)-H(X|Y,Z). \end{aligned}$$
(7)

MI(X : Y|Z) quantifies how much additional mutual information between X and Y can be gained when we already know Z.

Careful: While always \(H(X|Z)\le H(X)\), we do not necessarily have \(MI(X:Y|Z)\le MI(X:Y)\).

Example: The XOR function (exclusive or):

x   y   z
0   0   0
1   0   1
0   1   1
1   1   0

where X and Y assume their two values independently with probability 1/2 each. Thus, \(MI(X:Y)=MI(X:Z)=MI(Y:Z)=0\), but \(MI(X:Y|Z)=MI(X:Z|Y)=MI(Y:Z|X)=1\) bit, because knowing the values of two of the variables determines that of the third.
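
A minimal Python sketch (helper functions are ours) that verifies these values for the XOR distribution:

```python
# Minimal sketch for the XOR table: X, Y independent fair bits, Z = X xor Y.
import math
from collections import defaultdict

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, idx):
    """Marginal distribution over the coordinates listed in idx."""
    m = defaultdict(float)
    for outcome, p in joint.items():
        m[tuple(outcome[i] for i in idx)] += p
    return m

def MI(joint, a, b):
    return H(marginal(joint, a)) + H(marginal(joint, b)) - H(marginal(joint, a + b))

def cond_MI(joint, a, b, c):
    # MI(A:B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)
    return (H(marginal(joint, a + c)) + H(marginal(joint, b + c))
            - H(marginal(joint, a + b + c)) - H(marginal(joint, c)))

joint = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}   # over (x, y, z)

print(MI(joint, (0,), (1,)), MI(joint, (0,), (2,)), MI(joint, (1,), (2,)))  # all 0
print(cond_MI(joint, (0,), (1,), (2,)))                                     # 1.0 bit
```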

2.3 Maximum Entropy

E. Jaynes’ maximum-entropy principle [24]: Take the least informative estimate compatible with the given information, that is, do not put any information into your model that is not based on the observed data. Look for p with maximal entropy H(p) under the constraint that the expectation values of certain observables \(f_\alpha \) be reproduced,

$$\begin{aligned} E_pf_\alpha =\sum _i f^i_\alpha p_i \text { for }\alpha =1,\dots ,A. \end{aligned}$$
(8)

The solution is an exponential distribution

$$\begin{aligned} p_j=\frac{1}{Z}\exp (\sum _\alpha \lambda _\alpha f^j_\alpha )\quad \text { with } Z=\sum _i \exp (\sum _\alpha \lambda _\alpha f^i_\alpha ). \end{aligned}$$
(9)

In particular, when there are no observations,

$$\begin{aligned} p_j=\frac{1}{n} \text { for }j=1,\dots ,n. \end{aligned}$$
(10)
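
To make the construction (8)-(9) concrete, here is a minimal Python sketch for a single observable \(f(x)=x\) on a six-sided die with prescribed mean 4.5 (the numerical example is ours); the Lagrange multiplier is found by bisection.

```python
# Minimal sketch of the maximum-entropy construction (8)-(9) for one observable
# f(x) = x on a six-sided die with prescribed mean 4.5 (the example is ours).
import math

faces = range(1, 7)
target_mean = 4.5

def maxent_dist(lam):
    """Exponential family distribution (9) for Lagrange multiplier lam."""
    weights = [math.exp(lam * x) for x in faces]
    Z = sum(weights)
    return [w / Z for w in weights]

def mean(p):
    return sum(x * pi for x, pi in zip(faces, p))

# The mean is increasing in lam, so simple bisection finds the right multiplier.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean(maxent_dist(mid)) < target_mean:
        lo = mid
    else:
        hi = mid

p = maxent_dist(lo)
print([round(pi, 4) for pi in p])                        # exponential in the face value
print(round(mean(p), 4))                                 # reproduces the constraint 4.5
print(round(-sum(pi * math.log2(pi) for pi in p), 4))    # entropy of the estimate, in bits
```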

2.4 Kullback-Leibler Divergence

A reference for the information-geometric concepts that will be introduced and used here and in the sequel is [5]. The Kullback-Leibler divergence (KL-divergence for short) or relative entropy for two probability distributions p, q

$$\begin{aligned} D(p\Vert q)={\left\{ \begin{array}{ll}\sum _i p_i \log _2 \frac{p_i}{q_i} &{}\text { if }\mathrm {supp}\ p \subset \mathrm {supp}\ q\\ \quad \infty &{}\text { else} \end{array}\right. } \end{aligned}$$
(11)

is positive (\(D(p\Vert q)> 0\) if \(p\ne q\)), but not symmetric, as in general, \(D(p\Vert q)\ne D(q\Vert p)\).

Example: The mutual information is the KL-divergence between the joint distribution and the product of the marginals,

$$\begin{aligned} MI(X:Y) = D(p(x,y)||p(x) p(y)). \end{aligned}$$
(12)

Among all distributions p(x, y) with the same marginals \(p(x)=\sum _y p(x,y), p(y)=\sum _x p(x,y)\), the product distribution p(x)p(y) has the largest entropy. This is, of course, a special case of Jaynes’ principle. That is, when we only know the marginals, Jaynes’ principle suggests taking the product distribution as our estimate.
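
The identity (12) can be checked numerically; in the following minimal Python sketch, the example joint distribution is our own choice.

```python
# Minimal sketch checking (12): MI equals the KL divergence between the joint
# distribution and the product of its marginals; the example joint is ours.
import math
from collections import defaultdict

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def kl_bits(p, q):
    return sum(pi * math.log2(pi / q[k]) for k, pi in p.items() if pi > 0)

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}   # correlated p(x, y)

pX, pY = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    pX[x] += p
    pY[y] += p
product = {(x, y): pX[x] * pY[y] for x in pX for y in pY}      # the projection pi(p)

mi_direct = H(pX) + H(pY) - H(joint)
mi_as_kl = kl_bits(joint, product)
print(round(mi_direct, 6), round(mi_as_kl, 6))                 # the two values agree
```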

Example: The space of all probability distributions on two binary variables is a 3-dimensional simplex. It contains the 2-dimensional subfamily of product distributions. The extreme points of the simplex are the Dirac measures \(\delta ^{(x,y)}, x,y =0,1\). Maximization of the distance from the family of product distributions leads to distributions with support cardinality two (perfect correlation or anticorrelation) [4].

The formal way of expressing Jaynes’ principle is to project a given distribution onto the family \(\mathcal E\) of product distributions; the projection \(\pi (p)\) maximizes entropy while preserving the marginals,

$$\begin{aligned} D(p \, \Vert \, {\mathcal E}) := \inf _{q \in {\mathcal E}} D(p \, \Vert \, q) = D(p \, \Vert \, \pi (p)) = H_{\pi (p)}(X,Y)-H_p(X,Y). \end{aligned}$$
(13)

3 Complexity

In this section, we want to introduce and discuss complexity concepts. But what is complexity? Some possible answers (see [5, 6] for a systematic discussion): Complexity is

  1. the minimal effort or the minimal resources needed to describe or generate an object. Examples of such complexity concepts include algorithmic complexity (Kolmogorov [31], Chaitin [14], Solomonoff [57]); computational complexities; entropy (Shannon [52]); and entropy rate (Kolmogorov [31], Sinai [56]).

  2. the minimal effort or the minimal resources needed to describe or generate the regularities or the structure of an object. Examples of such complexity concepts include Kolmogorov minimal sufficient statistics and related notions, stochastic complexity (Rissanen [46]), effective complexity (Gell-Mann and Lloyd [18]), excess entropy [53], also known as effective measure complexity [21], and forecasting complexity [64], also introduced as statistical complexity by Crutchfield, Young, and Shalizi [15, 51].

  3. the extent to which an object, as a whole, is more than the sum of its parts (Aristotle [1]), that is, the extent to which the whole cannot be understood by the analysis of the parts of the system in isolation, but only by also considering their interactions.

In order to systematically explore these aspects, we start with the most basic concept, that of algorithmic complexity [14, 31, 57] (see [33] for a systematic exposition). This concept expresses 1) in its purest form.

3.1 Algorithmic Complexity

The algorithmic complexity of an object, such as a number or a piece of text, is the length of the shortest computer program that generates the object as output. Typically, one cannot compute this complexity, but only bound it from above by exhibiting some program that produces the object, without knowing whether a shorter one exists.

From a conceptual perspective, the basic premise is that irregular or random structures have the highest algorithmic complexity, because they do not admit a short description. In other words, we want to characterize the complexity of a structure by the difficulty of its description. That is, we ask the question: How much can the description of a structure be simplified by utilizing regularities?

  • Very simple structures need not be simplified any further.

  • Random structures cannot be simplified.

  • Computational complexity (see for instance the expositions in [38, 39]): the running time of the shortest computer program that can generate the structure. A simple structure is produced quickly, whereas for a random one, everything has to be written explicitly in the program, so that it does not need to run for a long time either.

  • Random structures are not of interest for themselves, but only as members of an ensemble; it therefore suffices to describe the latter (Gell-Mann and Lloyd [18]).

3.2 External and Internal Complexity

The question that arises from the above concept of algorithmic complexity is how to compute it, that is, how to find the shortest description of a given structure. Quite apart from the fact that this depends on the choice of the device we use to evaluate it (in theory: some universal Turing machine, and the choice of that Turing machine then introduces an additive constant), in practice, we have only bounded means to represent a structure. Thus: What do we want to know? We want to

  1. know a rich and complex structure,

  2. but represent it most efficiently.

More formally, we want to

  1. maximize external complexity,

  2. but minimize internal complexity.

This perspective was introduced in [25]. For an application in pattern classification, see for instance [3].

3.3 Optimization Principles

Organisms live in and interact with a complex environment, see for instance [61] (for a measure theoretical approach, see [7]), and need to maintain their own autopoiesis [37]. A modern society consists of several complex subsystems that follow their own rules, but need to interact with each other [35, 36]. With the concept of Shannon information, we can formulate some abstract principles that either maximize or minimize some kind of complexity (we follow [26] here). The basic versions, however, lead to trivial results, as we shall now see.

  1. Gain as much information as possible: look at random patterns.

  2. Avoid surprises: look at a blank screen.

  3. Try to predict future sensory inputs as accurately as possible on the basis of the current ones (and perhaps try to bring yourself into a state where this is possible [17]).

  4. Try to manipulate the environment such that the results of one's own actions are as accurately predictable as possible [32].

  5. Maximize

     $$\begin{aligned} MI(S_{t+1}:E_t) - MI(S_{t+1}:E_t|S_t) = H(S_{t+1})-H(S_{t+1}|E_t) -H(S_{t+1}|S_t)+H(S_{t+1}|E_t,S_t) \end{aligned}$$
     (14)

     to establish the strongest possible correlation between the current state \(E_t\) of the environment and future sensory data \(S_{t+1}\), but such that this correlation can already be predicted from the current input \(S_t\) [9].

To proceed further, let us discuss some questions.

  1. Q: Why should a system model an external probability distribution?

     A: To make predictions on the basis of regularities.

  2. Q: How can this be achieved in an environment that is vastly more complex than the system itself?

     A: Detect regularities.

  3. Q: How to detect regularities?

     A: Because of 2), the system is forced to compress.

These answers have some consequences in various fields:

  • Psychology: Use heuristics [19, 54, 55]

  • Cognition: External versus internal complexity [25]

  • Statistics: Avoid overfitting

  • Statistical learning theory: Start with models with few parameters and gradually increase their number as you learn (Vapnik-Chervonenkis) [59, 60].

3.4 Correlations in Time Series

We can also use the information theoretical notions to evaluate the complexity of a time series in terms of the correlations that it exhibits. A time series \(X_t, t\in \mathbb {N}\) could possibly have

  • No regularities: \(H(X_t|X_{t-1})=H(X_t)\)

  • the Markov property: \(H(X_t|X_{t-1}, X_{t-2},\dots )=H(X_t|X_{t-1})\), or

  • Long term correlations, as in texts, genetic sequences, ...

To evaluate this, we quantify how much new information is gained when one already knows n consecutive symbols and then sees the \((n+1)\)st (Grassberger [21]).

For which n is this largest? When n is small, one perhaps cannot predict much, and if n is large, one may be able to guess the rest anyway.

The larger this n, the more complex the sequence.

For genetic sequences, \(n \sim 14\) [47], for amino acid sequences (proteins) \(n\sim 5\).
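
In practice, such conditional entropies are estimated from n-gram counts. The following minimal Python sketch (the simulated binary Markov chain is our stand-in for a real sequence) estimates \(h_n = H(X_{n+1}|X_1,\dots,X_n)\); for the Markov chain it saturates already at \(n=1\), whereas for texts or genomes the decay continues to larger n, and undersampling of long blocks limits how far the estimate can be trusted.

```python
# Minimal sketch: estimate h_n = H(X_{n+1} | X_1, ..., X_n) from empirical
# (n+1)-gram frequencies. The simulated binary Markov chain is our stand-in
# for a real sequence; for it, h_n saturates already at n = 1.
import math
import random
from collections import Counter

def block_entropy(seq, n):
    counts = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def h(seq, n):
    """Conditional entropy of the next symbol given the previous n symbols."""
    return block_entropy(seq, n + 1) - block_entropy(seq, n) if n > 0 else block_entropy(seq, 1)

# Simulate a binary Markov chain with p(1|0) = 0.1 and p(0|1) = 0.4 (our choice).
random.seed(0)
seq, state = [], 0
for _ in range(200_000):
    state = (random.random() < 0.1) if state == 0 else (random.random() >= 0.4)
    seq.append(int(state))

for n in range(6):
    print(n, round(h(seq, n), 4))
# For large n the estimates are biased downward, since long blocks are undersampled.
```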

In literature analysis, such a principle can be used to evaluate the complexity of language [16].

A more sophisticated concept is the genon concept of molecular biology [30, 48, 49].

3.5 Complementarity

Instead of trying to predict the environment, one can also let the environment do the computation itself (see [26]).

If you want to catch a ball, you do not use Newtonian mechanics to compute the trajectory, but simply run so that the ball appears at a constant angle. The environment computes the trajectory, and you only need to sample it. This outsourcing of computation represents one mechanism for the compression mentioned in Sect. 3.3.

More generally, embodied cognition has emerged as a new paradigm in robotics [43].

3.6 Hierarchical Models and Complexity Measures

In this section, we follow [5, 6]. Returning to Jaynes’ approach, we could maximize entropy while preserving marginals among subsets of variables. For instance, for a distribution on 3 variables, we could prescribe all single and pairwise marginals.

Assume that we have a state set V that consists of the possible values of N variables. We then consider the hierarchy

$$\begin{aligned} \mathfrak {S}_1 \; \subseteq \; \mathfrak {S}_2 \; \subseteq \; \dots \; \subseteq \; \mathfrak {S}_{N-1} \; \subseteq \; \mathfrak {S}_{N} \; := \; 2^V, \end{aligned}$$
(15)

where \( \mathfrak {S}_k\) is the family of subsets of V with \(\le k\) elements, from which we get the set of probability distributions \({\mathcal E}_{\mathfrak {S}_k}\) with dependencies of order \(\le k\). For instance, \({\mathcal E}_{\mathfrak {S}_1}\) is the family of distributions that are simply the products of their marginals. In particular, for a probability distribution in this family, there are no correlations between any two or more of the variables. In \( {\mathcal E}_{\mathfrak {S}_2}\), we then allow for pairwise correlations, but no triple or higher order ones.

We point out that one can also consider other families of subsets of V and the corresponding probability distributions. For instance, when V is the ordered set of integers \(\{1,\dots ,N\}\), one could consider the family of those subsets that consist of uninterrupted strings of length \(\le k\). This will be our choice when we discuss the excess entropy below.

We let \(\pi _{\mathfrak {S}_k}\) be the projection on \({\mathcal E}_{\mathfrak {S}_k}\), \(p^{(k)}:=\pi _{\mathfrak {S}_k}(p)\). For instance, \(p^{(1)}\) is the product distribution with the same marginals as p.

We have the important Pythagorean relation

$$\begin{aligned} D(p^{(l)} \, \Vert \, p^{(m)}) \; = \; \sum _{k = m}^{l - 1} D(p^{(k + 1)} \, \Vert \, p^{(k)}), \end{aligned}$$
(16)

for \(1\le m < l \le N\), where \(p^{(N)}=p\). In particular,

$$\begin{aligned} D(p \, \Vert \, p^{(1)}) \; = \; \sum _{k = 1}^{N - 1} D(p^{(k + 1)} \, \Vert \, p^{(k)}). \end{aligned}$$
(17)

Weighting the contributions of the different orders of dependency with a weight vector \(\alpha = (\alpha _1,\dots ,\alpha _{N-1}) \in {\mathbb R}^{N-1}\), we get the complexity measure [6]

$$\begin{aligned} C_{\alpha }(p) := \sum _{k = 1}^{N-1} \alpha _k \, D(p \, \Vert \, p^{(k)}) \end{aligned}$$
(18)
$$\begin{aligned} = \sum _{k = 1}^{N-1} \beta _k \, D(p^{(k + 1)} \, \Vert \, p^{(k)}), \end{aligned}$$
(19)

with \(\beta _k := \sum _{l = 1}^{k} \alpha _l\).

\(p^{(k)}\) is the distribution of highest entropy among all those with the same correlations of order \(\le k\) as p.

Thus, we consider a weighted sum of the higher order correlation structure.

Examples:

  • Tononi-Sporns-Edelman complexity [58]: \(\alpha _k=\frac{k}{N}\) addresses the issue of the interplay between differentiation and integration in complex systems (for an analysis of system differentiation from an information theoretical perspective, see also [28])

  • Stationary stochastic process \(X_n\): Conditional entropy

    $$ h_p(X_n) \; {:=} \; H_p(X_{n} \, | \, X_1,\dots ,X_{n-1}). $$

    Entropy rate or Kolmogorov–Sinai entropy [31, 56]

    $$\begin{aligned} h_p(X) \; := \; \lim _{n \rightarrow \infty } h_p(X_n) \; = \; \lim _{n \rightarrow \infty } \frac{1}{n} \, H_p(X_1,\dots ,X_n), \end{aligned}$$
    (20)

    Excess entropy (Grassberger [21])

    $$\begin{aligned} E_p(X) := \lim _{n\rightarrow \infty } \sum _{k = 1}^n \big( h_p(X_k) - h_p(X) \big) = \lim _{n\rightarrow \infty } \big( H_p(X_1,\dots ,X_n) - n\, h_p(X) \big) \end{aligned}$$
    (21)
    $$\begin{aligned} = \lim _{n \rightarrow \infty } \underbrace{\sum _{k=1}^{n-1} \frac{k}{n-k} \, D\big({p_n}^{(k+1)} \, \Vert \, {p_n}^{(k)}\big)}_{=: E_p(X_n)}, \end{aligned}$$
    (22)

    where we choose \( \mathfrak {S}_k\) as the family of sequences of consecutive integers \(j+1, j+2, \dots , j+\ell \) with \(\ell \le k\). The excess entropy measures the non-extensive part of the entropy, i.e., the amount by which the entropy of a block of symbols exceeds what the entropy rate accounts for.
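
As a small numerical illustration, the following Python sketch estimates the excess entropy of a binary Markov chain from block entropies, as in (21); for a stationary first-order Markov chain, \(H_p(X_1,\dots,X_n)=H_p(X_1)+(n-1)\,h_p(X)\), so the exact value is \(H_p(X_1)-h_p(X)\). The chain parameters are our choice.

```python
# Minimal sketch: excess entropy (21) of a binary Markov chain, estimated as
# H(X_1,...,X_n) - n * h from empirical block entropies. For a stationary
# first-order Markov chain, H(X_1,...,X_n) = H(X_1) + (n-1) h exactly, so the
# limit is H(X_1) - h. All parameter choices are ours.
import math
import random
from collections import Counter

a, b = 0.1, 0.4                    # p(1|0) = a, p(0|1) = b
pi1 = a / (a + b)                  # stationary probability of state 1

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_rate = (1 - pi1) * H([a, 1 - a]) + pi1 * H([b, 1 - b])   # entropy rate (20)
E_exact = H([pi1, 1 - pi1]) - h_rate                       # exact excess entropy

random.seed(1)
seq, state = [], 0
for _ in range(500_000):
    state = (random.random() < a) if state == 0 else (random.random() >= b)
    seq.append(int(state))

def block_entropy(seq, n):
    counts = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

for n in (2, 4, 8):
    print(n, round(block_entropy(seq, n) - n * h_rate, 4))
print("exact:", round(E_exact, 4))
```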

3.7 Interactions Between Levels

The question of emergence, that is, how a higher level can arise that is (at least partially) autonomous from the lower levels, occurs in many disciplines. For example, classical mechanics arises from an underlying quantum structure, but the laws of classical mechanics are causally closed, in the sense that for computing trajectories of Newtonian particles, we do not need information from the quantum level. Likewise, human genetics rests on the laws of Mendel and does not need to consider the underlying biochemical level. In other fields, however, it is often not so clear to what extent laws operate autonomously at a certain level without needing permanent or at least regular access to some lower level. For instance, does it suffice for understanding macroeconomic processes to consider relations between macroeconomic variables, or is input from the microeconomic level essential? Or can one understand social dynamics without access to the psychic and mental states of the participating individuals? For a general discussion of the issue of emergence from the perspective developed in the present contribution, see for instance [29].

Here, we describe the approach of [41, 42] (and refer to [41] for references to earlier work). We consider a structure with basic level \(X,X'\) and higher level \(\widehat{X},\widehat{X}'\), where an arrow \(Y\rightarrow Y'\) represents a discrete time step. \(X, X'\) form a Markov process, with transition kernel \(\phi \), which can be observed at the higher level \(\widehat{X}, \widehat{X}'\) in a lossy fashion.

The higher level could result from averaging or aggregating the lower level. Think of \(\widehat{X}\) as a coarse-graining of X given by an observation map \(\pi \).

We can propose several criteria for the upper process being closed in the sense that it depends on the lower process only through some initialization.

I:

Informational closure: The higher process is informationally closed, i.e. there is no information flow from the lower to the higher level. Knowledge of the microstate will not improve predictions of the macrostate.

$$\begin{aligned} MI(\widehat{X}' : X | \widehat{X}) = 0 \, \end{aligned}$$
(23)

where the conditional mutual information

$$\begin{aligned} MI(\widehat{X}' : X | \widehat{X})= H(\widehat{X}'|\widehat{X})-H(\widehat{X}'|X) \end{aligned}$$
(24)

measures the reduction in uncertainty about \(\widehat{X}'\) when knowing X instead of only \(\widehat{X}\).

II:

Observational commutativity: It makes no difference whether we perform the aggregation first and then observe the upper process, or we observe the process on the microstate level, and then lump together the states.

Formally, the Kullback-Leibler divergence between the lower and the upper transition kernel from X to \(\widehat{X}'\) is 0, for some initial distribution on X.

$$\begin{aligned} I \Rightarrow II, \text { and in the deterministic case also } II \Rightarrow I. \end{aligned}$$
(25)

(In I, the probabilities are taken at \(\widehat{X}\); in II, at X.)

III:

Commutativity: There exists a transition kernel \(\psi \) such that the diagram commutes (Görnerup-Jacobi, 2010)

$$\begin{aligned} II \Rightarrow III, \text { and in the deterministic case also } III \Rightarrow II. \end{aligned}$$
(26)

II: Transition kernels satisfy \(\Psi = \Pi \Phi \Pi ^T\)

III: Transition kernels satisfy \(\Psi \Pi = \Pi \Phi \)
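
As an illustration of criterion III, the following Python sketch checks strong lumpability of a toy Markov chain: within each macro state, the rows of \(\Phi \), aggregated over the partition, must coincide, and the common row then defines \(\Psi \) so that the diagram commutes. The example kernel, the partition, and the row-stochastic convention are our choices.

```python
# Minimal sketch of criterion III for a toy chain: check strong lumpability of a
# row-stochastic micro kernel Phi with respect to a partition of the micro states.
# The example kernel, the partition, and the conventions are our choices.
import numpy as np

Phi = np.array([
    [0.5, 0.2, 0.2, 0.1],
    [0.3, 0.4, 0.1, 0.2],
    [0.1, 0.2, 0.3, 0.4],
    [0.2, 0.1, 0.4, 0.3],
])
partition = [0, 0, 1, 1]                  # macro label of each micro state
n_macro = max(partition) + 1

# Pi[i, a] = 1 if micro state i belongs to macro state a.
Pi = np.zeros((Phi.shape[0], n_macro))
for i, a in enumerate(partition):
    Pi[i, a] = 1.0

rows_to_macro = Phi @ Pi                  # probability of each macro block from micro state i

# Lumpability: within each macro block these rows must coincide; the common row
# is then the corresponding row of the macro kernel Psi, and the diagram commutes.
lumpable = True
Psi = np.zeros((n_macro, n_macro))
for a in range(n_macro):
    block = rows_to_macro[[i for i, lab in enumerate(partition) if lab == a]]
    if not np.allclose(block, block[0], atol=1e-12):
        lumpable = False
    Psi[a] = block[0]

print("lumpable:", lumpable)
if lumpable:
    print(Psi)
```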

IV:

Markovianity: \(\widehat{X}, \widehat{X}'\) forms again a Markov process (Shalizi-Moore, 2003).

$$\begin{aligned} I \Rightarrow IV, \text { but } IV \nRightarrow III. \end{aligned}$$
(27)
V:

Predictive efficiency: A more abstract formulation is that an emergent level corresponds to an efficiently predictable process, that is, one that can be predicted in its own terms, without permanent recourse to a lower level.

3.7.1 A Test Case: The Tent Map

We now evaluate the preceding concepts with the example of the tent map, following [40] (see also [2] for background).

$$ T(x) = {\left\{ \begin{array}{ll} 2x &{} \text {if } 0 \le x \le 1/2 \\ 2-2x &{} \text {else } \end{array}\right. } $$

The tent map is a basic example of a chaotic dynamical iteration, because at every step differences between values can get doubled, and therefore, after several steps, even very tiny differences between initial values can become macroscopically large. The folding at \(x= 1/2\) ensures that nevertheless the unit interval is mapped to itself. Thus, some differences also get reduced. Understanding this interplay between amplification and reduction of differences is surprisingly subtle, as one may also see in the following.

For a threshold value \(\alpha \in \left[ 0, 1 \right] \) we define the symbolic dynamics

$$\begin{aligned} \phi _\alpha :&X \rightarrow \hat{X} = \lbrace 0,1 \rbrace \\ \phi _\alpha (x)&:= {\left\{ \begin{array}{ll} 0 &{} \text {if } 0 \le x < \alpha \\ 1 &{} \text {else } \end{array}\right. } \end{aligned}$$

The sequence \(x_n = T^{n}(x)\), for an initial value \(x \in X\), yields the derived symbol dynamics \(s_n = \phi _\alpha (x_n) \in \lbrace 0,1 \rbrace \).

The probability of finding \(s_n\) in the state 0 is the probability that \(x_n\) lies in the interval \(\left[ 0, \alpha \right) \), which is \(\alpha \) for the tent map, since its invariant measure is the uniform distribution on [0, 1].

We consider the symbolic dynamics derived from consecutive time steps

$$ (s_{n+m}, s_{n+m-1}, \ldots , s_{n}) \, , $$

where, for \(k \in \mathbb {N}\),

$$ s_k(x) = \left\{ \begin{array}{ll} 0 &{} \text {if }T^{k}(x) < \alpha \\ 1 &{} \text {if }T^{k}(x) \ge \alpha \end{array} \right. \, . $$

For comparison, we take a random sequence \(\xi _n \in [0,1]\) (uniformly, i.i.d.), and consider the corresponding symbolic dynamics

$$\begin{aligned} \sigma _n:= {\left\{ \begin{array}{ll} 0 &{}\text { if } \xi _n <\alpha \\ 1&{} \text { if } \xi _n \ge \alpha . \end{array}\right. } \end{aligned}$$

The question now is: Are there systematic differences between the symbolic sequence \(s_n\) derived from iterations of the tent map and \(\sigma _n\)?

For \(\alpha =1/2\), they look the same (in fact, we simply have a Bernoulli sequence: the values 0 and 1 occur with equal probability 1/2; \(p(0)=p(1)=1/2\)). If we don’t know x, \(s_n\) looks as random as \(\sigma _n\). The transition probabilities are

$$\begin{aligned} p(0|0)=p(1|0)=p(0|1)=p(1|1)=1/2 . \end{aligned}$$

We next consider \(\alpha =2/3\). Put \(x_n:=T^n(x)\).

\(\sigma _n=0\) and \(\sigma _n=1\) occur independently with probabilities 2/3 and 1/3.

When \(s_n=1\), that is, \(x_n \ge 2/3\), then \(x_{n+1}=2-2x_n\le 2/3\), so that (apart from the boundary case \(x_n=2/3\)) \(s_{n+1}=0\). Thus, there is no transition from 1 to 1. For the state \(s_n=0\), both transitions are equally likely: when \(0\le x_n \le 1/3\), we have \(0\le x_{n+1} \le 2/3\), that is, \(s_{n+1}=0\), while for \(1/3< x_n< 2/3\), we get \(s_{n+1}=1\). Thus, for \(s_n\),

$$\begin{aligned} p(0|0)=p(1|0)=1/2,\ p(0|1)=1,\ p(1|1)=0 \end{aligned}$$

while for \(\sigma _n\)

$$\begin{aligned} p(0|0)=p(0|1)=2/3,\ p(1|0)=p(1|1)=1/3. \end{aligned}$$

This leads us to the concept of forbidden sequences. For the threshold \(\alpha =1/2\), the symbolic dynamics of the tent map cannot be distinguished from that of a random sequence and is Markovian. In contrast, for the threshold \(\alpha =2/3\), the sequence 11 does not occur, and the symbolic dynamics is different from a random one, but still Markovian.
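
These transition probabilities, and the absence of the word 11, can be checked numerically. One subtlety is that iterating the tent map directly in floating-point arithmetic collapses to 0 (doubling just shifts binary digits out); in the sketch below we therefore iterate the conjugate logistic map \(L(x)=4x(1-x)\) and map the orbit back via \(y=(2/\pi)\arcsin\sqrt{x}\). This workaround and the numerical parameters are ours.

```python
# Minimal sketch: transition statistics of the tent-map symbolic dynamics for
# alpha = 2/3. Direct floating-point iteration of T collapses to 0 (doubling
# shifts binary digits out), so we iterate the conjugate logistic map
# L(x) = 4x(1-x) and pull the orbit back via y = (2/pi) * arcsin(sqrt(x)).
# This workaround and all numerical choices are ours.
import math
from collections import Counter

def tent_orbit(n, seed=0.123456789):
    x, ys = seed, []
    for _ in range(n):
        x = 4.0 * x * (1.0 - x)                              # logistic step
        ys.append(2.0 / math.pi * math.asin(math.sqrt(x)))   # back to the tent map
    return ys

alpha = 2.0 / 3.0
symbols = [0 if y < alpha else 1 for y in tent_orbit(200_000)]

pairs = Counter(zip(symbols, symbols[1:]))
starts = Counter(symbols[:-1])
for s in (0, 1):
    for t in (0, 1):
        print(f"p({t}|{s}) = {pairs[(s, t)] / starts[s]:.3f}")
# Expect p(0|0), p(1|0) near 1/2, p(0|1) near 1, and p(1|1) = 0: "11" never occurs.
```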

For other thresholds, we can also get longer forbidden sequences and non-Markovian symbolic dynamics.

Even from a random sequence \(\xi _n\), we can derive non-Markovian symbolic dynamics.

Let \(x^1, x^2 \in [0,1]\); we consider the symbolic rule

$$\begin{aligned} s(x^1,x^2)={\left\{ \begin{array}{ll}0&{} \text { if } x^1 \le x^2\\ 1&{} \text { if } x^2 < x^1. \end{array}\right. } \end{aligned}$$

For our random sequence, take \(x^1=\xi _n, x^2=\xi _{n+1}\). Thus, we draw the points \(x^1, x^2\) randomly and independently.

The state probabilities are again \(p(0)=p(1)=1/2\), but the transition probabilities now depend on the history. The more 1s we have seen, the less likely it is to see another 1, because then \(\xi _n\) is expected to be very small, hence most likely, \(\xi _{n+1} > \xi _n\).

We now analyze the information flow of this example. The information flow between the micro-level corresponding to state \(x_n\) and the coarse-grained level \(s_n\) is the conditional mutual information

$$ MI(s_{n+1}:x_n|s_n)=H(s_{n+1}|s_n)-H(s_{n+1}|s_n,x_n)\, . $$

Since \(s_{n+1}\) is fully determined by \(x_n\), the second term vanishes,

$$ MI(s_{n+1}:x_n|s_n)=H(s_{n+1}|s_n) \;, $$

i.e., the information flow equals the conditional entropy on the coarse-grained level, which has a local minimum at \(\alpha = 2/3\).

Instead of drawing information from below, the upper level system relies on its memory.
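
The following Python sketch estimates \(H(s_{n+1}|s_n)\) from a long orbit for a range of thresholds; the dip to \(2/3\) bits at \(\alpha=2/3\), below the values at nearby thresholds, should be visible. The same logistic-map workaround as above is used, and the choice of thresholds is ours.

```python
# Minimal sketch: estimate the information flow MI(s_{n+1} : x_n | s_n), which
# here equals H(s_{n+1} | s_n), from a long tent-map orbit for several thresholds.
# The value dips to 2/3 bits at alpha = 2/3, below its values at nearby thresholds.
# The same logistic-map workaround as above is used; all choices are ours.
import math
from collections import Counter

def tent_orbit(n, seed=0.123456789):
    x, ys = seed, []
    for _ in range(n):
        x = 4.0 * x * (1.0 - x)
        ys.append(2.0 / math.pi * math.asin(math.sqrt(x)))
    return ys

orbit = tent_orbit(300_000)

def cond_entropy(alpha):
    s = [0 if y < alpha else 1 for y in orbit]
    total = len(s) - 1
    pairs = Counter(zip(s, s[1:]))
    marg = Counter(s[:-1])
    H_joint = -sum((c / total) * math.log2(c / total) for c in pairs.values())
    H_marg = -sum((c / total) * math.log2(c / total) for c in marg.values())
    return H_joint - H_marg          # H(s_{n+1}, s_n) - H(s_n)

for alpha in [0.50, 0.55, 0.60, 0.65, 2 / 3, 0.70, 0.75, 0.80]:
    print(f"alpha = {alpha:.3f}: H(s'|s) = {cond_entropy(alpha):.3f} bits")
```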

4 Information Decomposition

We finally turn to the concept of information decomposition. To motivate it, we start with the transfer entropy [50]

$$\begin{aligned} TE(Z\rightarrow X):= MI(X_+:Z_-|X_-) \end{aligned}$$
(28)

where the subscript \(-\) refers to the past and \(+\) to the future. \(TE(Z\rightarrow X)\) quantifies the amount of information contained in the past of Z about the future of X that cannot be obtained from X's own past.

Problem: \(X_+=\mathrm {XOR} (X_-,Z_-)\):

Here, the information in \(Z_-\) is only useful together with that of \(X_-\). The transfer entropy cannot distinguish this situation from one where \(X_-\) does not contribute and \(Z_-\) determines \(X_+\) by itself.
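
This indistinguishability is easy to verify numerically: in the following Python sketch (toy distributions and helper names are ours), both the synergistic XOR case and the case where \(Z_-\) alone determines \(X_+\) yield a transfer entropy of 1 bit.

```python
# Minimal sketch: the transfer entropy TE(Z -> X) = MI(X_+ : Z_- | X_-) for two
# toy distributions over (x_past, z_past, x_future): the synergistic XOR case and
# a pure-copy case where z_past alone determines x_future. Both give 1 bit.
import math
from collections import defaultdict

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, idx):
    m = defaultdict(float)
    for outcome, p in joint.items():
        m[tuple(outcome[i] for i in idx)] += p
    return m

def cond_MI(joint, a, b, c):
    return (H(marginal(joint, a + c)) + H(marginal(joint, b + c))
            - H(marginal(joint, a + b + c)) - H(marginal(joint, c)))

# Outcomes are triples (x_past, z_past, x_future); the inputs are independent fair bits.
xor_case  = {(x, z, x ^ z): 0.25 for x in (0, 1) for z in (0, 1)}
copy_case = {(x, z, z): 0.25 for x in (0, 1) for z in (0, 1)}

for name, joint in [("XOR", xor_case), ("copy", copy_case)]:
    te = cond_MI(joint, (2,), (1,), (0,))   # MI(X_+ : Z_- | X_-)
    print(name, round(te, 3), "bit")
```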

This problem is addressed by information decomposition. It was started by Williams and Beer [63] (but their measure \(I_{\mathrm {min}}\) of shared information does not distinguish whether different random variables carry the same information or just the same amount of information), and continued by Harder, Salge, Polani [23], Griffith and Koch [22], Bertschinger, Rauh, Olbrich, Ay, Banerjee, Jost [12, 13, 44, 45], and taken up by many other people (see for instance the references in [34]), with applications in different fields, like neuroscience [62]. There is no optimal solution, but that of Bertschinger, Rauh, Olbrich, Jost, Ay [13] (called the BROJA decomposition in the community) is currently the most widely accepted.

To describe our approach, we consider three random variables \(X_1 , X_2\) and S. The (total) mutual information \(MI(S: X_1 , X_2)\) quantifies the total information that is gained about S if the outcomes of \(X_1\) and \(X_2\) are known. How do \(X_1\) and \(X_2\) contribute to this information? For two explanatory variables, we expect four contributions to \(MI(S : X_1, X_2 )\):

$$\begin{aligned} MI(S : X_1, X_2 ) =\; & SI(S : X_1 ; X_2 ) \quad \text {shared information} \\ & +UI(S : X_1 \backslash X_2 ) \quad \text {unique information of 1} \\ & +UI(S : X_2 \backslash X_1 ) \quad \text {unique information of 2} \\ & +CI(S : X_1 ; X_2 ) \quad \text {complementary or synergistic information}. \end{aligned}$$

Here, \(U I(S : X_1 \backslash X_2 )\) is the information that \(X_1\) has, but \(X_2\) does not have, \(SI(S : X_1 ; X_2 )\) is the information that both of them have individually. Perhaps the most interesting term is the last, \(CI(S : X_1 ; X_2)\), the information that only emerges if \(X_1\) and \(X_2\) pool their knowledge. This term is best illustrated in the XOR example discussed below.


We consider some examples.

AND:

\(x_1\)   \(x_2\)   s   \(p(x_1,x_2,s)\)
0   0   0   1/4
1   0   0   1/4
0   1   0   1/4
1   1   1   1/4

Here, \(x_1\) and \(x_2\) jointly determine s, but cannot be fully recovered from s.

When 1 has the value \(x_1=0\), she can exclude \(s=1\), and analogously for 2.

Thus, when they both see 0, they share the information that \(s=0\).

The mechanism loses some information. When \(X_1, X_2\) are independent and uniformly distributed,

$$\begin{aligned} H(X_1,X_2) =2 \text { bits}, \end{aligned}$$

but

$$\begin{aligned} H(S)=MI(S:X_1,X_2)=- \frac{1}{4}\log \frac{1}{4} - \frac{3}{4}\log \frac{3}{4}\approx .811 \text { bits}. \end{aligned}$$

In general, we may have both correlations between the input variables and relations created by the mechanism that computes S.

We next recall XOR from Sect. 2.2:

\(x_1\)   \(x_2\)   s
0   0   0
1   0   1
0   1   1
1   1   0

Neither 1 nor 2 can determine the value of S by herself; each also needs the value of the other. This is a clear case of purely synergistic information.
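
Numerically, the XOR example gives \(MI(S:X_1,X_2)=1\) bit while \(MI(S:X_1)=MI(S:X_2)=0\). Since, in the decomposition above, \(MI(S:X_i)\) is the sum of the shared term and the corresponding unique term, a nonnegative decomposition must assign the full bit to the synergistic term CI. A minimal Python sketch (helper names are ours) computing these quantities:

```python
# Minimal sketch for the XOR example: the total mutual information is 1 bit,
# while the two individual mutual informations vanish.
import math
from collections import defaultdict

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, idx):
    m = defaultdict(float)
    for outcome, p in joint.items():
        m[tuple(outcome[i] for i in idx)] += p
    return m

def MI(joint, a, b):
    return H(marginal(joint, a)) + H(marginal(joint, b)) - H(marginal(joint, a + b))

joint = {(x1, x2, x1 ^ x2): 0.25 for x1 in (0, 1) for x2 in (0, 1)}   # (x1, x2, s)

print(MI(joint, (2,), (0, 1)))   # MI(S : X_1, X_2) = 1 bit
print(MI(joint, (2,), (0,)))     # MI(S : X_1) = 0
print(MI(joint, (2,), (1,)))     # MI(S : X_2) = 0
```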

Our approach: Unique and shared information should only depend on the marginal distribution of the pairs \((S,X_1)\) and \((S,X_2)\). This idea can be explained from an operational interpretation of unique information: Namely, if \(X_1\) has unique information about S (with respect to \(X_2\)), then there must be some way to exploit this information. More precisely, there must be a situation in which \(X_1\) can use this information to perform better at predicting the outcome of S.

In this interpretation, 1 possesses unique information about S compared with 2, if there exists a reward function for which 1 can achieve a higher expected reward based on her value \(x_1\) and her knowledge of the conditional distribution \(p(s|x_1)\) than if she knew and utilized instead the conditional distribution of 2.

Thus, unique and shared information depend only on pairwise marginals. Only the synergistic information includes higher order dependencies. Synergy thereby becomes a measure of higher order interactions in the sense of information geometry.

From a conceptual perspective, and independently of the way the different terms in the decomposition are quantified, it is important to understand synergy, in order to clarify discussions that have become quite sterile, like the relative importance of genes and environment in biology. For a perspective in this direction, see [27].