Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 What Is Algorithmic Statistics?

The laws of celestial mechanics allow astronomers to predict the observed motion of planets in the sky with very high precision. This was a great achievement of modern science—but could we expect to find equally precise models for all other observations? Probably not. Thousands of gamblers spent all theirs lives and their fortunes trying to discover the laws of roulette (coin tossing, other games of chance) in the same sense—but failed. Modern science abandoned these attempts. It says modestly that all we can say about the coin tossing is the statistical hypothesis (model): all trials are independent and (for a fair coin) both head and tail have probability 1/2. The task of mathematical statistics therefore is to find an appropriate model for experimental data. But what is “appropriate” in this context?

To simplify the discussion, let us assume that experimental data are presented as a bit string (say, a sequence of zeros and ones corresponding to heads and tails in the coin-tossing experiment). We also assume that a model is presented as a probability distribution on some finite set of binary strings. For example, a fair coin hypothesis for N coin tossings is a set of all strings of length N where all elements have the same probability \(2^{-N}\). Restricting ourselves to the simplest case when a hypothesis is some set A of strings with uniform distribution on it, we repeat our question:

Assume that a bit string x (data) and a set A containing x (a model) are given; when do we consider A a good “explanation” for x?

Some examples show that this question cannot be answered in the framework of classical mathematical statistics. Consider a sequence x of 100 bits (the following example is derived from random tables [18]):

01111 10001 11110 10010 00001 00011 00001 10010 00010 11101

10111 11110 10000 11100 00111 00000 01111 01100 11011 01011

Probably you would agree that the statistical hypothesis of a fair coin (the set \(A=\mathbb {B}^{100}\) of all 100-bit sequences) looks like an adequate explanation for this sequence. On the other hand, you probably will not accept the set A as a good explanation for the sequence y:

00000 00000 00000 00000 00000 00000 00000 00000 00000 00000

00000 00000 00000 00000 00000 00000 00000 00000 00000 00000

but will suggest a much better explanation \(B=\{y\}\) (the coin that always gives heads). On the other hand, set \(C=\{x\}\) does not look like a reasonable explanation for x. How can we justify this intuition?

One could say that A is not an acceptable statistical hypothesis for y since the probability of y according to A is negligible (\(2^{-100}\)). However, the probability of x for this hypothesis is the same, so why is A acceptable for x then? And if B looks like an acceptable explanation for y, why does C not look like an acceptable explanation for x?

Classical statistics, where x and y are just two equiprobable elements of A, cannot answer these questions. Informally, the difference is that x looks like a “random” element of A while y is “very special.” To capture this difference, we need to use the basic notion of algorithmic information theory, Kolmogorov complexity,Footnote 1 and say that x has high complexity (cannot be described by a program that is much shorter than x itself) while y has low complexity (one can write a short program that prints a long sequence of zeros). This answers our first question and explains why A could be a good model for x but not for y.

We asked another question: why is B an acceptable explanation for y while C is not an acceptable explanation for x? Here we need to look at the complexity of the model itself: C has high complexity (because x is complex) while B is simple (because y is simple).

Now let us consider different approaches to measuring the “quality” of statistical models; they include several parameters and a trade-off between them arises. In this way for every data string x we get a curve that reflects this trade-off. There are different ways to introduce this curve, but they are all equivalent with \(O(\log n)\) precision for n-bit strings. The goal of this chapter is to describe these approaches and equivalence results.

2 \((\alpha ,\beta )\)-Stochastic Objects

Let us start with the approach that most closely follows the scheme described above. Let x be a string and let A be a finite set of strings that contains x. The “quality” of A as a model (explanation) for x is measured by two parameters:

  • the Kolmogorov complexity \({\textit{C}}(A)\) of A;

  • the randomness deficiency \(d(x|A)\) of x in A.

The second parameter measures how “non-typical” x is in A (small values mean that x looks like a typical element of A) and is defined as

$$ d(x|A)= \log \#A - {\textit{C}}(x|A). $$

Here \(\log \) stands for binary logarithm, \(\#A\) is the cardinality of A and \({\textit{C}}(u|v)\) is the conditional complexity of u given v. Using A as the condition, we assume that A is presented as a finite list of strings (say, in lexicographical order). The motivation for this definition: for all \(x\in A\) we have \({\textit{C}}(x|A)\le \log \#A+O(1)\), since every \(x\in A\) is determined by its ordinal number in A; for most \(x\in A\) the complexity \({\textit{C}}(x|A)\) is close to \(\log \#A\) since the number of strings whose complexity is much less than \(\log \#A\) is negligible compared to \(\#A\). So the deficiency is large for strings that are much simpler than most elements of A.Footnote 2

According to this approach, a good explanation A for x should make both parameters small: A should be simple and x should be typical in A. It may happen that these two goals cannot be achieved simultaneously, and a trade-off arises. Following Kolmogorov, we say that x is \((\alpha ,\beta )\) -stochastic if there exists A containing x such that \({\textit{C}}(A)\le \alpha \) and \(d(x|A)\le \beta \). In this way we get an upward closed set

$$ S (x) =\{ \langle \alpha ,\beta \rangle \mid x \, \mathrm{is} \, (\alpha ,\beta )\, \mathrm{-stochastic}\}. $$

If x is a string of length n, the set A of all n-bit strings can be used as a description; it gives us the pair \((O(\log n), n-{\textit{C}}(x)+O(\log n))\) in S(x). Indeed, we can describe A using \(O(\log n)\) bits and the deficiency is \(n-{\textit{C}}(x|A)=n-{\textit{C}}(x|n)=n-{\textit{C}}(x)+O(\log n)\). On the other hand, there is a set \(A\ni x\) of complexity \({\textit{C}}(x)+O(1)\) and deficiency O(1) (namely, \(A=\{x\}\)). So the boundary of the set S(x) starts below the point \((0,n-{\textit{C}}(x))\) and decreases to \(({\textit{C}}(x),0)\) for an arbitrary n-bit string x, if we consider S(x) with \(O(\log n)\) precision.Footnote 3

The boundary line of S(x) can be called a stochasticity profile of x. As we will see, the same curve appears in several other situations.

3 Minimum Description Length Principle

Another way to measure the “quality” of a model starts from the following observation: if x is an element of a finite set A, then x can be described by providing two pieces of information:

  • the description of A;

  • the ordinal number of x in A (with respect to some ordering fixed in advance).

This gives us the inequality

$$ {\textit{C}}(x)\le {\textit{C}}(A)+\log \#A, $$

which is true with precision \(O(\log n)\) for strings x of length at most n.Footnote 4

The quality of the hypothesis A is then measured by the difference

$$ \delta (x,A)={\textit{C}}(A)+\log \#A-{\textit{C}}(x) $$

between the sides of this inequality. We may call it the “optimality deficiency” of A, since it shows how much we lose in the length of the description if we consider a two-part description based on A instead of the best possible one. For a given string x we can then consider the set O(x) of pairs \(\langle \alpha ,\beta \rangle \) such that x has a model of complexity at most \(\alpha \) and optimality deficiency at most \(\beta \).

Theorem 17.1

For every string x of length at most n the sets S(x) and O(x) coincide with \(O(\log n)\)-precision: each of them is contained in the \(O(\log n)\)-neighborhood of the other one.

Speaking about neighborhoods, we assume some standard distance on \(\mathbb {R}^2\) (the exact choice does not matter, since we measure the distance up to a constant factor).

Let us note now that in one direction the inclusion is straightforward. A simple computation shows that the randomness deficiency is always less than the optimality deficiency of the same model (and the difference between them equals \({\textit{C}}(A|x)\), where A is this model).

The opposite direction is more complicated: a model with small randomness deficiency may have large optimality deficiency. This may happens when \({\textit{C}}(A|x)\) is large.Footnote 5 However, in this case we can find another model and decrease the optimality deficiency as needed: for every string x and every model A for x (a finite set A that contains x) there exists another model \(A'\) for x such that \(\log \#(A')=\log \#A\) and \({\textit{C}}(A')\le {\textit{C}}(A)-{\textit{C}}(A|x)+O(\log n)\), where n is the length of x. This result looks surprising at first, but note that if \({\textit{C}}(A|x)\) is large, then there are many sets \(A'\) that are models of the same quality (otherwise A can be reconstructed from x by exhaustive search). These sets can be used to find \(A'\) with the required properties.

The definition of the set O(x) goes back to Kolmogorov [10], also refer [4, 6]; however, he used a slightly different definition: instead of O(x) he considered the function

$$ h_{x}(\alpha ) = \min _{A} \{\log \# A : x\in A,\; {\textit{C}}(A) \le \alpha \}, $$

now called the Kolmogorov structure function. Both O(x) and \(h_x\) are determined by the set of all pairs \(({\textit{C}}(A),\log \#A)\) for finite sets A containing x, though in a slightly different way (since the inequality \(\delta (x,A)\le \beta \) in the definition of O(x) combines \({\textit{C}}(A)\) and \(\log \#A\)). One can show, however, that the following statement is true with \(O(\log n)\)-precision for each n-bit string x: the pair \((\alpha ,\beta )\) is in O(x) if and only if \(h_x(\alpha )\le \beta +{\textit{C}}(x)-\alpha \). So the graph of \(h_x\) is just the boundary of O(x) in different coordinates (Fig. 17.1).

Fig. 17.1
figure 1

The pair \((\alpha ,\beta )\) lies on the boundary of O(x) since the point \((\alpha ,{\textit{C}}(x)-\alpha +\beta )\) lies on the graph of \(h_x\)

4 Lists of Simple Strings

We have seen two approaches that describe the same trade-off between the complexity of a model and its quality: for every x there is some curve (defined up to \(O(\log n)\)-precision) that shows how good a model with bounded complexity can be. Both approaches gave the same curve with logarithmic precision; in this section we give one more description of the same curve.

Let m be some integer. Consider the list of strings of complexity at most m. It can be generated by a simple algorithm: just try in parallel all programs of length at most m and enumerate all their outputs (without repetitions). This algorithm is simple (of complexity \(O(\log m)\)) since we only need to know m.

There may be several simple algorithms that enumerate all strings of complexity at most m, and they can generate them in different orders. For example, two algorithms may start by listing all the strings of length \(m-O(1)\) (they all have complexity at most m), but one does this in alphabetical order and the other uses reverse alphabetical order. So the string \(00\ldots 00\) is the first in one list and has number \(2^{m-O(1)}\) in the other. But the distance from the end of the list is much more invariant:

Theorem 17.2

Consider two programs of complexity \(O(\log m)\) that both enumerate all strings of complexity at most m. Let x be one of these strings. If there are at least \(2^k\) strings after x in the first list, then there are at least \(2^{k-O(\log m)}\) strings after x in the second list.

In this theorem we consider two algorithms that enumerate the same strings in different orderings. However, the Kolmogorov complexity function depends on the choice of the optimal decompressor (though at most by an O(1) additive term), so one could ask what happens if we enumerate the strings of bounded complexity for two different versions of the complexity function. A similar result (with similar proof) says that changing the optimal decompressor used to define Kolmogorov complexity can be compensated by an \(O(\log m)\)-change in the threshold m.

Now for every m fix an algorithm of complexity at most \(O(\log m)\) that enumerates all strings of complexity at most m. Consider a binary string x; it appears in these lists for all \(m\ge {\textit{C}}(x)\). Consider the logarithm of the number of strings that follow x in the mth list. We get a function that is defined for all \(m\ge {\textit{C}}(x)\) with \(O(\log m)\) precision. The following result shows that this function describes the stochasticity profile of x in different coordinates.

Theorem 17.3

Let x be a string of length at most n.

 

(a):

Assume that x appears in the list of strings of complexity at most m and there are at least \(2^k\) strings after x in the list. Then the pair \(((m-k)+O(\log n),m-{\textit{C}}(x))\) belongs to the set O(x).

(b):

Assume that the pair \((m-k,m-{\textit{C}}(x))\) belongs to O(x). Then x appears in the list of strings of complexity at most \(m+O(\log n)\) and there are at least \(2^{k-O(\log n)}\) strings after it.

 

By Theorem 17.1 the same statement holds for the set S(x) in place of O(x).

Ignoring the logarithmic correction and taking into account the relation between O(x) and \(h_x\), one can illustrate the statement of Theorem 17.3 by Fig. 17.2.

Fig. 17.2
figure 2

To find how many strings appear after x in the list of all strings of complexity at most m, we draw a line starting at (0, m) with slope \(-1\) and intersect it with the graph of \(h_x\); if the second coordinate of the intersection point is k, there are about \(2^{k}\) strings after x in this list

5 Time-Bounded Complexity and Busy Beavers

There is one more way to get the stochasticity profile curve. Let us bound the computation time (number of steps) in the definition of Kolmogorov complexity and define \({\textit{C}}^t(x)\) as the minimal length of a program that produces x in at most t steps. Evidently, \({\textit{C}}^t(x)\) decreases as t increases, and ultimately reaches \({\textit{C}}(x)\).Footnote 6 However, the convergence speed may be quite different for different x of the same complexity. It is possible that for some x the programs of minimal length produce x rather fast, while other x can be compressible only if we allow very long computations. Informally, the strings of the first type have some simple internal structure that allows us to encode them efficiently with a fast decoding algorithm, while the strings of the second type have “deep” internal structure that is visible only if the observer has a lot of computational power.

We use the so-called “busy beaver numbers” as landmarks for measuring the computation time. Let \(\textit{BB}(n)\) be the maximal running time of all programs of length at most n (we use the programming language that defines Kolmogorov complexity, and some fixed interpreter for it).Footnote 7 One can show that numbers \(\textit{BB}(n)\) have an equivalent definition in terms of Kolmogorov complexity: \(\textit{BB}(n)\) is the maximal integer that has complexity at most n. (More precisely, if B(n) is the maximal integer that has complexity at most n, then \(B(n-c)\le \textit{BB}(n)\le B(n+c)\) for some c and all n, and we ignore O(1)-changes in the argument of the busy beaver function.)

Now for every x we may consider the decreasing function \(i\mapsto {\textit{C}}^{\textit{BB}(i)}(x) - {\textit{C}}(x)\) (it decreases fast for “shallow” x and slowly for “deep” x; note that it becomes close to 0 when \(i={\textit{C}}(x)\), since then every program of length at most \({\textit{C}}(x)\) terminates in \(\textit{BB}({\textit{C}}(x))\) steps). The graph of this function is (with logarithmic precision) just a stochasticity profile, i.e., the set of points above the graph coincides with O(x) up to an \(O(\log n)\) error term:

Theorem 17.4

Let x be a string of length n.  

(a):

If a pair \((\alpha ,\beta )\) is in O(x), then

$$ {\textit{C}}^{\textit{BB}(\alpha +O(\log n))}(x)\le {\textit{C}}(x)+\beta +O(\log n). $$
(b):

If \({\textit{C}}^{\textit{BB}(\alpha )}(x)\le {\textit{C}}(x)+\beta \), then the pair \((\alpha +O(\log n),\beta +O(\log n))\) is in O(x).

 

By Theorem 17.1 the same statement holds for the set S(x) in place of O(x).

6 What Can the Stochasticity Profile Look Like?

We have seen four different definitions that lead to the same (with logarithmic precision) notion of stochasticity profile. We see now that not only can finite objects (strings) have different complexities, but also the strings with the same complexity can be classified according to their stochasticity profiles.

However, we do not know yet that this classification is non-trivial: what if all strings of given complexity have the same stochasticity profile? The following result answers this question by showing that every simple decreasing function appears as the complexity profile of some string.

Theorem 17.5

Assume that some integers n and \(k\le n\) are given, and h is a non-increasing function mapping \(\{0,1,\ldots ,k\}\) to \(\{0,1,\ldots ,n-k\}\). Then there exists a string x of length \(n+O(\log n)+O({\textit{C}}(h))\) and complexity \(k+O(\log n)+O({\textit{C}}(h))\) for which the set O(x) (and hence the set S(x)) coincides with the upper-graph of h (the set \(\{\langle i,j\rangle \mid j\ge h(i) \text { or } i\ge k\}\)) with \(O(\log n+{\textit{C}}(h))\) accuracy.

Note that the error term depends on the complexity of h. If we consider simple functions h, this term is absorbed by our standard error term \(O(\log n)\). In particular, this happens in two extreme cases: for the function \(h\equiv 0\) and the function h that is equal to \(n-k\) everywhere. In the first case it is easy to find such a “shallow” x: just take an incompressible string of length k and add \(n-k\) trailing zeros to get an n-bit string. For the second case we do not know a better example than the one obtained from the proof of Theorem 17.5.

Let us say informally that a string x of length n is “stochastic” if its stochasticity profile S(x) is close to the maximal possible set (achieved by the first example) with logarithmic precision, i.e., x is \((O(\log n),O(\log n))\)-stochastic. We know now that non-stochastic objects exist in the mathematical sense; a philosopher could ask whether they appear in “real life.” Is it possible that some experiment gives us data that do not have any adequate statistical model? This question is quite philosophical since given an object and a model we cannot say for sure whether the model is adequate in terms of algorithmic statistics. For example, the current belief is that coin tossing data are described adequately by a fair coin model. Still it is possible that future scientists will discover some regularities in the very same data, thus making this model unsuitable.

We discuss the properties of stochastic objects in the next section. For now let us note only that this notion remains essentially the same if we consider probability distributions (and not finite sets) as models. Let us explain what this means.

Consider a probability distribution P on a finite set of strings with rational values. This is a constructive object, so we can define the complexity of P using some computable encoding. The conditional complexity \({\textit{C}}(\cdot |P)\) can be defined in the same way. Let us modify the definition of stochasticity and say that a string x is “\((\alpha ,\beta )\)-p-stochastic” if there exists a distribution P of the described type such that

  • \({\textit{C}}(P)\) is at most \(\alpha \);

  • \(d(x|P)\), defined as \(-\log P(x) - {\textit{C}}(x|P)\), does not exceed \(\beta \).

This is indeed a generalization: if P is a uniform distribution, then the complexity of P is (up to O(1)) the complexity of its support A, the value of \(-\log P(x)\) is \(\log \#A\), and using P and A as conditions gives the same complexity up to O(1). On the other hand, this generalization leads to only a logarithmic change in the parameters:

Theorem 17.6

If some string x of length n is \((\alpha ,\beta )\)-p-stochastic, then the string x is also \((\alpha +O(\log n),\beta +O(\log n))\)-stochastic.

Since all our statements are made with \(O(\log n)\)-precision, we may identify stochasticity with p-stochasticity (as we do in the sequel) .

7 Canonical Models

Let \(\varOmega _m\) denote the number of strings of complexity at most m. Consider its binary representation, i.e., the sum

$$\varOmega _m=2^{s_1}+2^{s_2}+\ldots +2^{s_t}, \,\mathrm{where}\,\, s_1>s_2>\ldots >s_t.$$

According to this decomposition, we may split the list itself into groups: first \(2^{s_1}\) elements, next \(2^{s_2}\) elements, etc.Footnote 8 If x is a string of complexity at most m, it belongs to some group, and this group can be considered to be a model for x.

We may consider different values of m (starting from \({\textit{C}}(x)\)). In this way we get different models of this type for the same x. Let us denote by \(B_{m,s}\) the group of size \(2^s\) that appears in the mth list. Note that \(B_{m,s}\) is defined only for s that correspond to ones in the binary representation of \(\varOmega _m\). The models \(B_{m,s}\) are called canonical models in the sequel. The parameters of \(B_{m,s}\) are easy to compute: the size is \(2^s\) by definition, and the complexity is \(m-s+O(\log m)\).

Theorem 17.7

 

(a):

Every canonical model for a string x lies on the boundary of O(x) (i.e., its parameters cannot be improved by more than \(O(\log n)\) where n is the length of x).

(b):

For every point in O(x) there exists a canonical model that has the same or better parameters (with \(O(\log n)\) precision).

 

The second part of this theorem says that for every model A for x we can find a canonical model \(B_{m,s}\) that has the same (or smaller) optimality deficiency, and \({\textit{C}}(B_{m,s})\le {\textit{C}}(A)\) with logarithmic precision. In fact, the second part of this statement can be strengthened: not only \({\textit{C}}(B_{m,s})\le {\textit{C}}(A)\), but also \({\textit{C}}(B_{m,s}|A)=O(\log n)\).

This result shows that (in a sense) we may restrict ourselves to canonical models. This raises the question: what are these models? What information do they contain? The answer is a bit confusing: the information in models \(B_{m,s}\) depends on \(m-s\) only and is the same as the information in \(\varOmega _{m-s}\), the number of strings of complexity at most \(m-s\):

Theorem 17.8

For all models \(B_{m,s}\) both conditional complexities \({\textit{C}}(B_{m,s}|\varOmega _{m-s})\) and \({\textit{C}}(\varOmega _{m-s}|B_{m,s})\) are \(O(\log m)\).

One could note also that the information in \(\varOmega _k\) is a part of the information in \(\varOmega _l\) for \(l\ge k\) (i.e., \({\textit{C}}(\varOmega _k|\varOmega _l)=O(\log l)\)).Footnote 9

Now it seems that finding a good model for x does not provide any specific information about x: all we get (if we succeed) is information about the number of terminating programs of bounded length, which has nothing to do with x and is the same for all x.

It is not clear how this philosophical collision between our goals and our achievements can be resolved. One approach is to consider total conditional complexity. This approach still leaves many questions open, but let us briefly describe it nevertheless.

We have said that “strings a and b contain essentially the same information” if both \({\textit{C}}(a|b)\) and \({\textit{C}}(b|a)\) are small. This, however, does not guarantee that the properties of a and b are the same. For example, if \(x^*\) is the shortest program for some highly non-stochastic string x, the string \(x^*\) itself is perfectly stochastic.

To avoid this problem, we can consider the total conditional complexity \({\textit{CT}}(a|b)\), defined as the minimal length of a total program p such that \(p(b)=a\). Here p is called total if \(p(b')\) halts for all \(b'\), not only for b.Footnote 10 This total conditional complexity can be much bigger than the standard conditional complexity \({\textit{C}}(a|b)\). It has the following property: if both \({\textit{CT}}(a|b)\) and \({\textit{CT}}(b|a)\) are negligible, there exists a computable permutation of low complexity that maps b to a, and therefore the sets O(a) and O(b) are close to each other. (See [15] for more details.)

Using this notion, we may consider a set A to be a “strong” model if it is close to the boundary of O(x) and at the same time the total complexity \({\textit{CT}}(A|x)\) is small. The second condition is far from trivial: one can prove that for some strings x such strong models do not exist at all (except for the trivial model \(\{x\}\) and models of very small complexity) [22]. But if strong models exist, they have some nice properties: for example, the stochasticity profile of every strong sufficient statistic for x is close to the profile of the string x itself [23]. (A model is called a sufficient statistic for x if the optimality deficiency is small, i.e., the sum of its complexity and log-cardinality is close to \({\textit{C}}(x)\).) The class of all sufficient statistics for x does not have this property (for some x).

Returning to the stochasticity profile, let us mention one more non-existence result. Imagine that we want to find a place when the set O(x) touches the horizontal coordinate line. To formulate a specific task, consider for a given string of length n two numbers. The first, \(\alpha _1\), is the maximal value of \(\alpha \) such that \((\alpha ,0.1n)\) does not belong to O(x); the second, \(\alpha _2\), is the minimal value of \(\alpha \) such that \((\alpha ,10\log n)\) belongs to O(x). (Of course, the constant 10 is chosen just to avoid additional quantifiers, any sufficiently large constant would work.) Imagine that we want, given x and \({\textit{C}}(x)\), to find some point in the interval \((\alpha _1,\alpha _2)\), or even in a slightly bigger one (say, adding a margin of size 0.1n in both directions). One can prove that there is no algorithm that fulfills this task [24].

8 Stochastic Objects

The philosophical questions about non-stochastic objects in the “real world” motivate several mathematical questions. Where do they come from? can we obtain a non-stochastic object by applying some (simple) algorithmic transformation to a stochastic one? Can non-stochastic objects appear (with non-negligible probability) in a (simple) random process? What are the special properties of non-stochastic objects?

Here are several results answering these questions.

Theorem 17.9

Let f be a computable total function. If string x of length n is \((\alpha ,\beta )\)-stochastic, then f(x) is \((\alpha +{\textit{C}}(f)+O(\log n),\beta +{\textit{C}}(f)+O(\log n))\)-stochastic.

Here \({\textit{C}}(f)\) is the complexity of the program that computes f.

An important example: let f be the projection function that maps every pair \(\langle x,y\rangle \) (its encoding) to x. Then we have \({\textit{C}}(f)=O(1)\), so we conclude that each component of an \((\alpha ,\beta )\)-stochastic pair is \((\alpha +O(\log n),\beta +O(\log n))\)-stochastic.

A philosopher would interpret Theorem 17.9 as follows: a non-stochastic object cannot appear in a simple total algorithmic process (unless the input was already non-stochastic). Note that the condition of totality is crucial here: for every x, stochastic or not, we may consider its shortest program p. It is incompressible (and therefore stochastic), and x is obtained from p by a simple program (decompressor).

If a non-stochastic object cannot be obtained by a (simple total) algorithmic transformation from a stochastic one, can it be obtained (with non-negligible probability) in a (simple computable) random process? If P is a simple distribution on a finite set of strings with rational values, then P can be used as a statistical model, so only objects x with high randomness deficiency \(d(x|P)\) can be non-stochastic, and the set of all x that have \(d(x|P)\) greater than some d has negligible P-probability (an almost direct consequence of the deficiency definition).

So for computable probabilistic distributions the answer is negative for trivial reasons. In fact, a much stronger (and surprising) statement is true. Consider a probabilistic machine M without input that, being started, produces some string and terminates, or does not terminate at all (and produces nothing). Such a machine determines a semimeasure on the set of strings (we do not call it a measure since the sum of probabilities of all strings may be less than 1 if the machine hangs with positive probability). The following theorem says that a (simple) machine of this type produces non-stochastic objects with negligible probability.

Theorem 17.10

There exists some constant c such that the probability of the event

M produces a string of length at most n that is not \((d+{\textit{C}}(M)+c\log n, c\log n)\)-stochastic”

is bounded by \(2^{-d}\) for every machine M of the described type and for arbitrary integers n and d.

The following results partially explain why this happens. Recall that algorithmic information theory defines mutual information in two strings x and y as \({\textit{C}}(x)+{\textit{C}}(y)-{\textit{C}}(x,y)\); with \(O(\log n)\) precision (for strings of length at most n) this expression coincides with \({\textit{C}}(x)-{\textit{C}}(x|y)\) and \({\textit{C}}(y)-{\textit{C}}(y|x)\). Recall that by \(\varOmega _n\) we denote the number of strings of complexity at most n.

Theorem 17.11

There exists a constant c such that for every n, for every string x of length at most n and for every threshold d the following holds: if a string x of length n is not \((d+c \log n, c\log n)\)-stochastic, then

$$I(x:\varOmega _n)\ge d-c\log n.$$

This theorem says that all non-stochastic objects have a lot of information about a specific object, the string \(\varOmega _n\). This explains why they have small probability of appearing in a (simple) randomized process, as the following result shows. It guarantees that for every fixed string w the probability of getting (in a simple random process) some object that contains significant information about w is negligible.

Theorem 17.12

There exists a constant c such that for every n, for every probabilistic machine M, for every string w of length at most n and for every threshold d the probability of the event

M outputs a string x of length at most n such that \(I(x:w)>{\textit{C}}(M)+d+c\log n\)

is at most \(2^{-d}\).

The last result of this section shows that stochastic objects are “representative” if we are interested only in the complexity of strings and their combinations: for every tuple of strings one can find a stochastic tuple that is indistinguishable from the first one by the complexities of its components.

Theorem 17.13

For every k there exists a constant c such that for every n and for every k-tuple \(\langle x_1,\ldots ,x_k\rangle \) of strings of length at most n, there exists another k-tuple \(\langle y_1,\ldots , y_k\rangle \) that is \((c\log n, c\log n)\)-stochastic and for every \(I\subseteq \{1,2,\ldots ,n\}\) the difference between \({\textit{C}}(x_I\)) and \({\textit{C}}(y_I)\) is at most \(c\log n\).

Here \(x_I\) is a tuple made of strings \(x_i\) with \(i\in I\); the same for \(y_I\).

This result implies, for example, that every linear inequality for complexities that is true for stochastic tuples is true for arbitrary ones.

However, there are some results that are known for stochastic tuples but still are not proven for arbitrary ones. See [16] for details .

9 Restricted Classes: Hamming Distance and Balls as Descriptions

Up to now we considered arbitrary sets as statistical models. However, sometimes we have some external information that suggests a specific class of models (and it remains to choose the parameters that define some model in this class). For example, if the data string is a message sent through a noisy channel that can change some bits, we consider Hamming balls as models, and the parameters are the center of this ball (the original message) and its radius (the maximal number of changed bits).

So let us consider some family \(\mathcal {B}\) of finite sets. To get a reasonable theory, we need to assume some properties of this family:

  • The family \(\mathcal {B}\) is computably enumerable: there exists an algorithm that enumerates all elements of \(\mathcal {B}\) (finite sets are here considered as finite objects, encoded as lists of their elements).

  • For each n the set of all n-bit strings belongs to \(\mathcal {B}\).

  • There exists a polynomial p such that the following property holds: for every \(B\in \mathcal {B}\), for every positive integer n and for every \(c<\#B\) the set of all n-bit strings in B can be covered by \(p(n)\#B\) \(/\) \(c\) sets from \(\mathcal {B}\) and each of the covering sets has cardinality at most c.

Here \(\#B\) stands for the cardinality of B. A counting argument shows that in the last condition we need at least \(\#B\) \(/\) \(c\) covering sets; the condition says that a polynomial overhead is enough here.

One can show (using simple probabilistic arguments) that the family of all Hamming balls (for all string lengths, centers, and radii) has all three properties. This family is a main motivating example for our considerations.

Now we can define the notion of a \(\mathcal {B}\)-\((\alpha ,\beta )\)-stochastic object: a string x is \(\mathcal {B}\) -\((\alpha ,\beta )\)-stochastic if there exists a set \(B\in \mathcal {B}\) containing x such that \({\textit{C}}(B)\le \alpha \) and \(d(x|B)\le \beta \). (The original notion of \((\alpha ,\beta )\)-stochasticity corresponds to the case when \(\mathcal {B}\) contains all finite sets.) For every x we get a set \(S_\mathcal {B}(x)\) of pairs \((\alpha ,\beta )\) for which x is \(\mathcal {B}\)-\((\alpha ,\beta )\)-stochastic. We can also define the set \(O_\mathcal {B}(x)\) using optimality deficiency instead of randomness deficiency. The \(\mathcal {B}\)-version of Theorem 17.1 is still true (though the proof needs a much more ingenious construction):

Theorem 17.14

Let \(\mathcal {B}\) be the family of finite sets that has the properties listed above. Then for every string x of length at most n the sets \(S_\mathcal {B}(x)\) and \(O_\mathcal {B}(x)\) coincide up to an \(O(\log n)\) error term.

The proof is more difficult (compared to the proof of Theorem 17.1) since we now need to consider sets in \(\mathcal {B}\) instead of arbitrary finite sets. So we cannot construct the required model for a given string x ourselves and have to select it among the given sets that cover x. This can be done by a game-theoretic argument.

It is interesting to note that a similar argument can be used to obtain the following result about stochastic finite sets (Epstein–Levin theorem):

Theorem 17.15

If a finite set X is \((\alpha ,\beta )\)-stochastic and the total probability

$$\sum _{x\in X} 2^{-{ K }(x)}$$

of its elements exceeds \(2^{-k}\), then X contains some element x such that

$${ K }(x)\le k+{ K }(k)+\log { K }(k)+\alpha +O(\log \beta )+O(1).$$

Here \({ K }(u)\) stands for the prefix complexity of u (see, e.g., [14] for the definition). To understand the meaning of this theorem, let us recall one of the fundamental results of algorithmic information theory: the (prefix) complexity of a string x equals the binary logarithm of its a priori probability. If we consider a set X of strings instead of one string x, we can consider the a priori probability of X (expressing how difficult it is to get some element of x in a random process) and the minimal complexity of elements of X (saying how difficult it is to specify an individual element in X). The fundamental result mentioned above says that for singletons these two measures are closely related; for arbitrary finite sets this is no longer the case, but Theorem 17.15 guarantees this for the case of stochastic finite sets.

Returning to our main topic, let us note that for Hamming balls the boundary curve of \(O_\mathcal {B}(x)\) has a natural interpretation. To cover x of length n with a ball B with center y having cardinality \(2^\beta \) and complexity at most \(\alpha \) means (with logarithmic precision) finding a string y of complexity at most \(\alpha \) in the r-neighborhood of x, where r is chosen in such a way that balls of radius r have about \(2^\beta \) elements. So this boundary curve represents a trade-off between the complexity of y and its distance to x.

Again one can ask what kind of boundary curves may appear. As in Theorem 17.5, we can get an essentially arbitrary non-increasing function. However, here the precision is worse: the \(O(\log n)\) term is now replaced by \(O(\sqrt{n\log n})\).

Theorem 17.16

Assume that some integers n and \(k\le n\) are given, and h is a non-increasing function mapping \(\{0,1,\ldots ,k\}\) to \(\{0,1,\ldots ,n-k\}\). Then there exists a string x of length \(n+O(\sqrt{n\log n})+O({\textit{C}}(h))\) and complexity \(k+O(\sqrt{n\log n})+O({\textit{C}}(h))\) for which the set \(O_\mathcal {B}(x)\) coincides with the upper-graph of h (the set \(\{\langle i,j\rangle \mid j\ge h(i) \text { or } i\ge k\}\)) with \(O(\sqrt{n\log n}+{\textit{C}}(h))\)-precision.

Unlike the general case where non-stochastic objects (for which the curve is far from zero) exists but are difficult to describe, for the case of Hamming balls one can give more explicit examples. Consider some explicit error correction code that has distance d. Then every string that differs in at most d/2 positions from some codeword x has almost the same complexity as x (since x can be reconstructed from it by error correction). So balls of radius less than \(d\) \(/\) \(2\) containing some codeword have almost the same complexity as the codeword itself (and the ball of zero radius containing it).

Let x be a typical codeword of this binary code (its complexity is close to the logarithm of the number of codewords). For values of \(\alpha \) slightly less than \({\textit{C}}(x)\) we need a large \(\beta \) (at least the logarithm of the cardinality of a ball of radius \(d\) \(/\) \(2\)) to make such a codeword \((\alpha ,\beta )\)-stochastic.

10 Historical Remarks

The notion of \((\alpha ,\beta )\)-stochasticity was mentioned by Kolmogorov in his talks at the seminar he initiated at the Moscow State University in the early 1980s (see [21]). The equivalence between this notion and the optimality deficiency (Theorem 17.1) was discovered in [24].

The connections between the existence of adequate models and the position in the list of strings of bounded complexity was discovered by Gács, Tromp, and Vitányi in [7], though this paper considered only the position of x in the list of strings of complexity at most \({\textit{C}}(x)\). Theorems 17.2 and 17.3 appeared in [24]. The paper [7] also considered canonical models (called “nearly sufficient statistics” in this paper) for the case \(m={\textit{C}}(x)\). In the general case canonical models were considered in [24] (Sect. V, Realizing the structure function), where Theorems 17.7 and 17.8 were proven.

The minimal description length principle goes back to Rissanen [19]; as he wrote in that paper, “If we work with a fixed family of models, \(\ldots \) the cost of the complexity of a model may be taken as the number of bits it takes to describe its parameters. Clearly now, when adding new parameters to the model, we must balance their own cost against the reduction they permit in the ideal code length, \(-\log P(x|\theta )\), and we get the desired effect in a most natural manner. If we denote the total number of bits required to encode the parameters \(\theta \) by \(L(\theta )\), then we can write the total code length as \(L(x,\theta )=-\log P(x|\theta )+L(\theta )\), which we seek to minimize over \(\theta \).” The set denoted by O(x) in our survey was considered in 1974 by Kolmogorov (see [10], also refer [4, 6]); later it also appeared in the literature also under the names of “sophistication” and “snooping curves.”

The notion of sophistication was introduced by Koppel in [11]. Let \(\beta \) be a natural number; the \(\beta \) -sophistication of a string x is the minimal length of a total program p such that there is a string y with \(p(y)=x\) and \(|p|+|y|\le {\textit{C}}(x)+\beta \). In our terms p defines a model that consists of all p(y) for all strings y of a given length. It is not hard to see that with logarithmic precision we get the same notion: the \(\beta \)-sophistication of x is at most \(\alpha \) if and only if the pair \((\alpha ,\beta )\) is in the set O(x).

The notion of a snooping curve \(L_x(\alpha )\) of x was introduced by V’yugin in [27]. In that paper he considered strategies that read a bit sequence from left to right and for each next bit provide a prediction (a rational-valued probability distribution on the set \(\{0,1\}\) of possible outcomes). After the next bit appears, the loss is computed depending on the prediction and actual outcome. The goal of the predictor is to minimize the total loss, i.e., the sum of losses at all n stages (for an n-bit sequence). V’yugin considered different loss functions, and for one of them, called the logarithmic loss function, we get a notion equivalent to O(x). For the logarithmic loss function, we account for loss \(-\log p\) if the predicted probability of the actual outcome was p. It is easy to see that for a given x the following statement is true (with logarithmic precision): there exists a strategy of complexity at most \(\alpha \) with loss at most l if and only if \(l\ge h_x(\alpha )\). (Indeed, prediction strategies are just a bit-by-bit representation of probability distributions on the set of n-bit strings in terms of conditional probabilities.)

Theorem 17.4 (Sect. 17.5) is due to Bauwens [2]. The idea to consider the difference between the time-bounded complexity of x and the unbounded one goes back to Chaitin [3]. Later the subject was studied by Bennett who introduced the notion of logical depth: the depth of x at significance level \(\beta \) is the minimal time t such that \({\textit{C}}^t(x)\le {\textit{C}}(x)+\beta \). The string is called \((\beta ,t)\)-deep if its depth at significance level \(\beta \) is larger than t. A closely related notion of computational depth was introduced in [1]: the computational depth of x with time bound t is \({\textit{C}}^t(x)-{\textit{C}}(x)\). Obviously, the computational depth of x with time bound t is more than \(\beta \) if and only if x is \((\beta ,t)\)-deep. Theorem 17.4 relates both notions of depth to the stochasticity profile (with logarithmic precision): a string is \((\beta ,B(\alpha ))\)-deep if and only if the pair \((\alpha ,\beta )\) is outside the set O(x).

Theorem 17.5 was proved in [24]. Long before this paper (in 1987) V’yugin established that the set S(x) can assume all possible shapes (within the obvious constraints) but only for \(\alpha = o(|x|)\). Also, according to Levin [12]: “Kolmogorov told me about \(h_x (\alpha )\) and asked how it could behave. I proved that \(h_x (\alpha ) + \alpha +O(\log \alpha )\) is monotone but otherwise arbitrary within \(\pm O(p \log \alpha )\) accuracy where p is the number of “jumps” of the arbitrary function imitated; it stabilizes on \({\textit{C}}(x)\) when \(\alpha \) exceeds \(I(\chi : x)\) [the information in the characteristic sequence \(\chi \) of the “halting problem” about x]. The expression for accuracy was reworded by Kolmogorov to \(O(\sqrt{\alpha \log \alpha })\) [square root accuracy]; I gave it in the above, less elegant, but equivalent, terms. He gave a talk about these results at a meeting of Moscow Mathematical Society [9].” This claim of Levin implies Theorem 17.11, which was published in [24].

Theorem 17.6 (mentioned in [21]) is easy and Theorem 17.9 easily follows from Theorem 17.5.

The existence of non-\((\alpha ,\beta )\)-stochastic strings (for small \(\alpha , \beta \)) was mentioned in [21]. Then V’yugin [26] and Muchnik [17] showed that their a priori measure is about \(2^{-\alpha }\), a direct corollary of which is our Theorem 17.10.

Theorems 17.11 and 17.12 are essentially due to Levin (see [12, 13]).

Theorem 17.13 is easy to prove using A. Romashchenko’s “typization” trick (see [8, 20]).

Theorems 17.14 and 17.16 appeared in [25]; Theorem 17.15 appeared in [5].