Abstract
We combine Solomonoff’s approach to universal prediction with algorithmic statistics and suggest using the computable measure that provides the best “explanation” for the observed data (in the sense of algorithmic statistics) for prediction. In this way we keep the expected sum of squares of prediction errors bounded (as it is for Solomonoff’s predictor) and, moreover, guarantee that the sum of squares of prediction errors is bounded along every Martin-Löf random sequence. An extended abstract of this paper was presented at the 16th International Computer Science Symposium in Russia (CSR 2021) (Milovanov 2021).
1 Introduction
We consider probability distributions (or measures) on the binary tree, i.e., non-negative functions \(P: \{0,1\}^* \rightarrow \mathbb {R}\) such that \(P(\text {empty word}) = 1\) and \(P(x0) + P(x1) = P(x)\) for every string x. We assume that all the values P(x) are rational; P is called computable if there exists an algorithm that on input x outputs P(x).
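For concreteness, the measure axioms are easy to check mechanically. Here is a small Python sketch (the helper name `bernoulli_half` is ours, not from the paper) of the simplest computable measure on the binary tree, with rational values as assumed above:

```python
from fractions import Fraction
from itertools import product

def bernoulli_half(x: str) -> Fraction:
    """P(x) = 2^{-|x|}: the uniform (Bernoulli(1/2)) measure on the binary tree."""
    return Fraction(1, 2 ** len(x))

# Check the measure axioms: P(empty word) = 1 and P(x0) + P(x1) = P(x).
assert bernoulli_half("") == 1
for n in range(5):
    for bits in product("01", repeat=n):
        x = "".join(bits)
        assert bernoulli_half(x + "0") + bernoulli_half(x + "1") == bernoulli_half(x)
```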
Consider the following prediction problem. Imagine a black box that generates bits according to some unknown computable distribution P on the binary tree. Let \(x=x_1\ldots x_n\) be the current output of the black box. The predictor’s goal is to guess the probability that the next bit is 1, i.e., the ratio \(P(1\hspace{0.55542pt}|\hspace{0.55542pt}x)=P(x1)/P(x)\).
Ray Solomonoff suggested using the universal semi-measure \(\textrm{M}\) (also called the a priori probability) for prediction. Recall that a semi-measure S on the binary tree (a continuous semi-measure) is a non-negative function \(S: \{0,1\}^* \rightarrow \mathbb {R}\) such that \(S(\text {empty word}) \leqslant 1\) and \(S(x0) + S(x1) \leqslant S(x)\) for every string x. Semi-measures correspond to probabilistic processes that output a bit sequence but can hang forever, so an output may be some finite string x; the probability of this event is \(S(x)-S(x0)-S(x1)\). A semi-measure S is called lower semi-computable, or enumerable, if the set \(\{(x,r) : r<S(x)\}\) is (computably) enumerable. Here x is a string and r is a rational number. Finally, a lower semi-computable semi-measure \(\textrm{M}\) is called universal if it is maximal among all semi-measures up to a constant factor, i.e., if for every lower semi-computable semi-measure S there exists \(c>0\) such that \(\textrm{M}(x) \geqslant c S(x)\) for all x. Such a universal semi-measure exists [7, 11, 12] (see Note 1).
Solomonoff suggested using the ratio \(\textrm{M}(1\hspace{0.55542pt}|\hspace{0.55542pt}x):= \textrm{M}(x1)/\textrm{M}(x)\) to predict \(P(1\hspace{0.55542pt}|\hspace{0.55542pt}x)\) for an unknown computable measure P. He proved the following bound on the prediction errors.
Theorem 1
([13]) For every computable distribution P and for every \(b \in \{0,1\}\) the following sum over all binary strings is finite:
\[ \sum_{x\in\{0,1\}^*} P(x)\bigl(\textrm{M}(b\,|\,x)-P(b\,|\,x)\bigr)^2. \qquad (1) \]
Moreover, this sum is bounded by \(O(\textrm{K}(P))\), where \(\textrm{K}(P)\) is the prefix complexity of the computable measure P (the minimal length of a prefix-free program corresponding to P).
Note that for a semi-measure the predicted probabilities of 0 and 1 do not sum up to 1, so the statements for \(b=0\) and \(b=1\) are not equivalent (but both are true).
The sum from Theorem 1 can be rewritten as the expected value of the function D on the infinite binary sequences with respect to P, where \(D(\omega )\) is defined as
\[ D(\omega) = \sum_{x\text{ is a prefix of }\omega} \bigl(\textrm{M}(b\,|\,x)-P(b\,|\,x)\bigr)^2. \]
This expectation is finite, therefore for P-almost all \(\omega \) the value \(D(\omega )\) is finite and
\[ \textrm{M}(b\,|\,x) - P(b\,|\,x) \rightarrow 0 \]
when x is an increasing prefix of \(\omega \). One would like to have this convergence for all Martin-Löf random sequences \(\omega \) (with respect to measure P), but this is not guaranteed, since the null set provided by the argument above may not be an effectively null set. An example from [5] shows that this is indeed the case.
Theorem 2
([5]) There exist a specific universal semi-measure \(\textrm{M}\), a computable distribution P and a Martin-Löf random (with respect to P) sequence \(\omega \) such that
\[ \textrm{M}(1\,|\,x) - P(1\,|\,x) \not\rightarrow 0 \]
for increasing prefixes x of \(\omega \).
Lattimore and Hutter generalized Theorem 2 by proving the same statement for a wide class of universal semi-measures [8].
Trying to overcome this problem and get good predictions for all Martin-Löf random sequences, we suggest the following approach to prediction. For a finite string x we find a distribution Q on the binary tree that is the best (in some sense) explanation of x. The probabilities of the next bit are then predicted as \(Q(0\hspace{0.55542pt}|\hspace{0.55542pt}x)\) and \(Q(1\hspace{0.55542pt}|\hspace{0.55542pt}x)\).
This approach combines two advantages. The first is that a series of type (1) still converges, though the upper bound for it (at least the one that we are able to prove) is much greater than \(O(\textrm{K}(P))\). The second is that the prediction error (defined as in Theorem 2) converges to zero along every Martin-Löf random sequence.
Let us give formal definitions. The quality of a computable distribution Q on the binary tree, considered as an “explanation” for a given string x, is measured by the value \(3\textrm{K}(Q) - \log Q(x)\): the smaller this quantity is, the better the explanation. One can rewrite this expression as the sum
\[ 2\textrm{K}(Q) + \bigl[\textrm{K}(Q) - \log Q(x)\bigr]. \]
Here the expression in the square brackets can be interpreted as the length of a two-part description of x using Q (first we specify the hypothesis Q by its shortest prefix-free program, and then, knowing Q, we specify x using arithmetic coding; the second part requires about \(-\log Q(x)\) bits). The first term \(2\textrm{K}(Q)\) is added to give extra preference to simple hypotheses; the factor 2 is needed for technical reasons (in fact, any constant greater than 1 will work, see Note 2).
For a given x we select the best explanation that makes this quality minimal.
Formally, we find
\[ Q_x := \mathop{\arg\min}_{Q\text{ computable}}\ \bigl(3\textrm{K}(Q) - \log Q(x)\bigr). \]
So, \(Q_x\) is the best explanation for string x (or one of the best explanations if there are several).
Then we predict the probability that the next bit after x is b:
\[ H(b\,|\,x) := Q_x(b\,|\,x) = Q_x(xb)/Q_x(x). \]
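Since the prefix complexity \(\textrm{K}\) is uncomputable, any executable illustration must replace it by a stand-in. The following Python sketch (a toy candidate family with made-up complexity values, purely illustrative) mimics the selection rule: pick the Q minimizing \(3\textrm{K}(Q)-\log Q(x)\) and predict with its conditional probability.

```python
import math

# Toy hypothesis class: Bernoulli(p) measures, each with a hypothetical
# stand-in "complexity" in bits (prefix complexity K is uncomputable).
CANDIDATES = [
    (0.5, 1.0),   # (parameter p, stand-in complexity in bits)
    (0.25, 3.0),
    (0.75, 3.0),
    (0.9, 5.0),
]

def log_q(x: str, p: float) -> float:
    """log_2 Q(x) for the Bernoulli(p) measure."""
    ones = x.count("1")
    zeros = len(x) - ones
    return ones * math.log2(p) + zeros * math.log2(1 - p)

def predict_one(x: str) -> float:
    """H(1|x): pick Q minimizing 3*K(Q) - log Q(x), then output Q(1|x)."""
    best_p, _ = min(CANDIDATES, key=lambda c: 3 * c[1] - log_q(x, c[0]))
    return best_p  # for a Bernoulli measure, Q(1|x) = p for every x

print(predict_one("1" * 30))   # a long run of ones favors the p = 0.9 hypothesis
print(predict_one("0101" * 8)) # a balanced string favors the simple p = 0.5 hypothesis
```

Note how the complexity penalty works: on a short run of ones the simple hypothesis p = 0.5 still wins, and only with more evidence does the more complex p = 0.9 hypothesis take over.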
In this paper we prove the following results:
Theorem 3
For every computable distribution P the following sum over all binary strings x is finite:
\[ \sum_{x\in\{0,1\}^*} P(x)\bigl(H(0\,|\,x) - P(0\,|\,x)\bigr)^2. \]
Theorem 4
Let P be a computable measure and let \(\omega \) be a Martin-Löf random sequence with respect to P. Then
\[ H(0\,|\,x) - P(0\,|\,x) \rightarrow 0 \]
for prefixes x of \(\omega \) as the length of the prefix goes to infinity.
We speak about the probabilities of zeros, but both P and Q are measures, so this implies the same results for the probabilities of ones.
In fact we prove the stronger quantitative bound
\[ \sum_{x\text{ is a prefix of }\omega}\bigl(H(0\,|\,x) - P(0\,|\,x)\bigr)^2 = O\bigl((\textrm{K}(P)+D)\cdot 2^{\frac{3\textrm{K}(P)+D}{2}}\bigr), \]
where D is the randomness deficiency of \(\omega\) with respect to P (Theorem 7); this is a strengthening of Theorem 4.
Remark 1
Note that the function H, as well as Solomonoff’s predictor, is uncomputable. Moreover, no computable predictor works: for every computable predictor H there exists a computable distribution P such that \(|H(0| x) - P(0| x)| \geqslant \frac{1}{2}\) for every x. Of course, such a predictor does not satisfy Theorem 3.
In [3] Hutter suggested a similar approach but without the coefficient 3 for \(\textrm{K}(Q)\) (see also [2, 4]). For this approach he proved an analogue of Theorem 3 using a different proof technique.
In [5] the existence of a semi-computable measure satisfying Theorem 4 was proved. In [6] the authors prove an analogue of Theorem 3 for the Hellinger distance, with a bound that is double exponential in \(\textrm{K}(P)\).
In the next section we prove Theorem 4.
In Section 3 we prove Theorem 3.
Finally, in Section 4 we consider the case when we know some information about P. More precisely, we know that P belongs to some enumerable set of computable measures. We suggest a similar prediction method for this case and prove analogues of Theorems 4 and 3 (Theorems 9 and 10) for it. We achieve better (polynomial in the complexity of P) error bounds in these theorems.
2 Prediction on Martin-Löf Random Sequences
Recall the Schnorr–Levin theorem [11, ch.5] that says that a sequence \(\omega \) is random with respect to a computable probability measure P if and only if the ratio \(\textrm{M}(x)/P(x)\) is bounded for x that are prefixes of \(\omega \).
The same result can be reformulated in the logarithmic scale. Let us denote by \(\textrm{KM}(x)\) the a priori complexity of x, i.e., \(\lceil -\log \textrm{M}(x)\rceil \) (the rounding is chosen in this way to ensure upper semicomputability of \(\textrm{KM}\)). We have
\[ \textrm{KM}(x) \leqslant -\log P(x) + O(1) \]
for every computable probability measure P, where O(1) depends on P but not on x. Indeed, since \(\textrm{M}\) is maximal, the ratio \(P(x)/\textrm{M}(x)\) is bounded. Moreover, since P(x) can be included in the mix for \(\textrm{M}(x)\) with coefficient \(2^{-\textrm{K}(P)}\), we have
\[ \textrm{KM}(x) \leqslant \textrm{K}(P) - \log P(x) + O(1) \]
with a constant in O(1) that does not depend on P (or on x). As we have discussed in the previous section, the right-hand side includes the length of the two-part description of x.
Let us call
\[ d(x\,|\,P) := -\log P(x) - \textrm{KM}(x) \]
the randomness deficiency of a string x with respect to a computable measure P. (There are several notions of deficiency, but we need only this one.) Then we get
\[ d(x\,|\,P) \geqslant -O(1), \]
so the deficiency is almost non-negative. The Schnorr–Levin theorem characterizes Martin-Löf randomness in terms of deficiency:
Theorem 5
(Schnorr–Levin)
(a) If a sequence \(\omega \) is Martin-Löf random with respect to a computable distribution P, then \(d(x\hspace{0.55542pt}|\hspace{0.55542pt}P)\) is bounded for all prefixes x of \(\omega \).
(b) Otherwise (if \(\omega \) is not random with respect to P), \(d(x\hspace{0.55542pt}|\hspace{0.55542pt}P)\rightarrow \infty \) as the length of a prefix x of \(\omega \) increases.
Note that there is a dichotomy: the values \(d(x\hspace{0.55542pt}|\hspace{0.55542pt}P)\) for prefixes x of \(\omega \) are either bounded or converge to infinity (as the length of x goes to infinity). We can define the randomness deficiency of an infinite sequence \(\omega \) as
\[ d(\omega\,|\,P) := \sup_{x\text{ is a prefix of }\omega} d(x\,|\,P); \]
it is finite if and only if \(\omega \) is random with respect to P.
Let us also recall the following result of Vovk:
Theorem 6
([15]) Let P and Q be two computable distributions. Let \(\omega \) be a Martin-Löf random sequence with respect to both P and Q. Then
\[ P(0\,|\,x) - Q(0\,|\,x) \rightarrow 0 \]
for prefixes x of \(\omega \) as the length of the prefix goes to infinity.
We will prove this theorem (and even more exact statement) in the next section.
Proof of Theorem 4
Now we have a sequence \(\omega \) that is Martin-Löf random with respect to some computable measure P, so \(D=d(\omega \hspace{0.55542pt}|\hspace{0.55542pt}P)\) is finite. For each prefix x of \(\omega \) we take the best explanation Q that makes the expression
\[ 3\textrm{K}(Q) - \log Q(x) \]
minimal. Note that P is among the candidates for Q, so this expression should not exceed
\[ 3\textrm{K}(P) - \log P(x). \]
Since \(\omega \) is random with respect to P and x is a prefix of \(\omega \), the Schnorr–Levin theorem guarantees that the latter expression is at most
\[ \textrm{KM}(x) + O(1), \]
where the constant in O depends on P and \(\omega \) but not on x. On the other hand, the inequality
\[ \textrm{KM}(x)\leqslant \textrm{K}(Q)-\log Q(x)+O(1) \qquad (2) \]
implies that
\[ 3\textrm{K}(Q) - \log Q(x) \geqslant \textrm{KM}(x) + 2\textrm{K}(Q) - O(1). \]
So measures Q with large \(\textrm{K}(Q)\) cannot compete with P, and there is only a finite list of candidate measures for the best explanation Q. For some of these Q the sequence \(\omega \) is random with respect to Q, so one can use Vovk’s theorem to get the convergence of the predicted probabilities when these measures are used.
Still we may have some “bad” Q in the list of candidates for which \(\omega \) is not Q-random. However, the Schnorr–Levin theorem guarantees that for a bad Q we have
\[ -\log Q(x) - \textrm{KM}(x) \rightarrow \infty \]
if x is a prefix of \(\omega \) of increasing length. So the difference between the two sides of (2) goes to infinity as the length of x increases, so Q loses to P for large enough x (it becomes a worse explanation of x). Therefore, only good Q are used for prediction after sufficiently long prefixes, and this finishes the proof of Theorem 4.\(\square \)
3 On the Expectation of Squares of Errors
In this section we prove Theorem 3. First we prove a strengthening of Theorem 6.
Lemma 1
Let P and Q be computable distributions, and let \(\textrm{M}\) be a universal semi-measure. Assume that for a string \( x=x_1 \ldots x_n\) and \(C > 0\) it holds that \(P(x),Q(x) \geqslant \textrm{M}(x)/C\). Then:
\[ \sum_{i=1}^{n}\bigl(P(x_i\,|\,x_1\ldots x_{i-1}) - Q(x_i\,|\,x_1\ldots x_{i-1})\bigr)^2 = O(\textrm{K}(P,Q) + \log C). \]
Proof of Theorem 6 from Lemma 1
According to one of the definitions of Martin-Löf randomness, the values \(\textrm{M}(x) / P(x) \) and \(\textrm{M}(x) / Q(x)\) are bounded by a constant. It remains to apply Lemma 1.\(\square \)
Proof of Lemma 1
Denote
\[ p_i := P(x_i\,|\,x_1\ldots x_{i-1}), \qquad q_i := Q(x_i\,|\,x_1\ldots x_{i-1}). \]
Note that
\[ P(x) = \prod_{i=1}^{n} p_i \qquad\text{and}\qquad Q(x) = \prod_{i=1}^{n} q_i. \]
Now consider the “intermediate” measure R for which the probability of 0 (or 1) after some x is the average of the same conditional probabilities for P and Q:
\[ R(b\,|\,x) := \frac{P(b\,|\,x) + Q(b\,|\,x)}{2}. \]
The corresponding \(r_i=R(x_i\hspace{0.55542pt}|\hspace{0.55542pt}x_1\ldots x_{i-1})\) are equal to \((p_i+q_i)/2\).
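A quick numerical sanity check (helper names are ours) of this construction: R averages the conditional probabilities, so in general R(x) differs from the naive mixture \((P(x)+Q(x))/2\).

```python
from fractions import Fraction

def measure_from_conditionals(cond):
    """Build P(x) as the product of conditional probabilities cond(prefix, bit)."""
    def P(x: str) -> Fraction:
        prob, prefix = Fraction(1), ""
        for bit in x:
            prob *= cond(prefix, bit)
            prefix += bit
        return prob
    return P

# Two toy measures given by their conditionals (Bernoulli 1/3 and 3/4).
p_cond = lambda prefix, bit: Fraction(1, 3) if bit == "1" else Fraction(2, 3)
q_cond = lambda prefix, bit: Fraction(3, 4) if bit == "1" else Fraction(1, 4)
# The intermediate measure R averages the *conditionals*, not the measures.
r_cond = lambda prefix, bit: (p_cond(prefix, bit) + q_cond(prefix, bit)) / 2

P = measure_from_conditionals(p_cond)
Q = measure_from_conditionals(q_cond)
R = measure_from_conditionals(r_cond)

# r_i = (p_i + q_i)/2 holds for every conditional ...
assert R("01") == (p_cond("0", "1") + q_cond("0", "1")) / 2 * R("0")
# ... but R multiplies the averaged conditionals, so R(x) != (P(x)+Q(x))/2 in general.
print(R("0110"), (P("0110") + Q("0110")) / 2)
```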
Probability distribution R is computable and \(\textrm{K}(R) \leqslant \textrm{K}(P,Q) + O(1)\). Hence, it holds that \(R(x) \leqslant 2^{\textrm{K}(P,Q)} \textrm{M}(x) \leqslant 2^{\textrm{K}(P,Q)} \cdot C\cdot P(x)\). A similar inequality holds for distribution Q. Therefore:
\[ \prod_{i=1}^{n} \frac{p_i+q_i}{2} \leqslant 2^{\textrm{K}(P,Q)}\cdot C\cdot \prod_{i=1}^{n} p_i \]
and
\[ \prod_{i=1}^{n} \frac{p_i+q_i}{2} \leqslant 2^{\textrm{K}(P,Q)}\cdot C\cdot \prod_{i=1}^{n} q_i. \]
Multiplying we obtain:
\[ \left(\prod_{i=1}^{n} \frac{p_i+q_i}{2}\right)^{2} \leqslant C^2\cdot 2^{2\textrm{K}(P,Q)}\cdot \prod_{i=1}^{n} p_i q_i. \qquad (3) \]
These two inequalities show that the product of the arithmetic means of \(p_i\) and \(q_i\) is not much bigger than the product of their geometric means, and this is possible only if \(p_i\) is close to \(q_i\) (the logarithm is a strictly concave function).
To make the argument precise, recall the following bound for the logarithm function:
Lemma 2
For \(p,q\in (0,1]\) we have
\[ \log\frac{p+q}{2} - \frac{\log p + \log q}{2} \geqslant \frac{(p-q)^2}{8\ln 2}. \]
Proof
Let us replace the binary logarithms by natural ones; then the factor \(\ln 2\) disappears. Note that the left-hand side remains the same if p and q are multiplied by some factor \(c\geqslant 1\), while the right-hand side can only increase. So it is enough to prove the inequality for \(p=1-h\) and \(q=1+h\) for some \(h\in (0,1)\), and in this case it reads
\[ -\frac{\ln(1-h)+\ln(1+h)}{2} \geqslant \frac{h^2}{2}, \]
which holds because \(\ln (1-h)+\ln (1+h)=\ln (1-h^2)\leqslant -h^2\).\(\square \)
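Assuming the natural-log form of this bound, \(\ln\frac{p+q}{2}-\frac{\ln p+\ln q}{2}\geqslant \frac{(p-q)^2}{8}\), a brute-force grid check in Python:

```python
import math

def gap(p: float, q: float) -> float:
    """ln((p+q)/2) - (ln p + ln q)/2: the arithmetic-geometric mean gap in natural logs."""
    return math.log((p + q) / 2) - (math.log(p) + math.log(q)) / 2

# Lemma 2 in natural-log form (our reconstruction): gap(p, q) >= (p - q)^2 / 8
# for all p, q in (0, 1].  Check on a grid:
for i in range(1, 100):
    for j in range(1, 100):
        p, q = i / 100, j / 100
        assert gap(p, q) >= (p - q) ** 2 / 8
```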
For the product of n terms we get the following bound:
Lemma 3
If for \(p_1,\ldots ,p_n,q_1,\ldots ,q_n\in (0,1]\) we have
\[ \left(\prod_{i=1}^{n}\frac{p_i+q_i}{2}\right)^{2} \leqslant c\cdot \prod_{i=1}^{n} p_i q_i, \]
then \(\sum _i (p_i-q_i)^2 \leqslant O(\log c)\), with some absolute constant hidden in the \(O(\cdot )\)-notation.
Proof
Taking logarithms, we get
\[ 2\sum_{i=1}^{n}\log\frac{p_i+q_i}{2} \leqslant \log c + \sum_{i=1}^{n}\bigl(\log p_i + \log q_i\bigr), \]
and therefore
\[ \sum_{i=1}^{n}\left(\log\frac{p_i+q_i}{2} - \frac{\log p_i + \log q_i}{2}\right) \leqslant \frac{\log c}{2}. \]
It remains to use Lemma 2 to get the desired inequality.\(\square \)
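Tracing the explicit constant through Lemma 2’s bound gives \(\sum_i(p_i-q_i)^2\leqslant 4\ln 2\cdot\log c\) (our reading of the \(O(\log c)\) bound), which can be checked numerically:

```python
import math
import random

random.seed(0)
p = [random.uniform(0.05, 1.0) for _ in range(200)]
q = [random.uniform(0.05, 1.0) for _ in range(200)]

# log_2 of the smallest c for which the hypothesis of Lemma 3 holds:
#   (prod (p_i+q_i)/2)^2 <= c * prod p_i q_i.
log_c = sum(2 * math.log2((pi + qi) / 2) - math.log2(pi) - math.log2(qi)
            for pi, qi in zip(p, q))

# Conclusion of Lemma 3 with the explicit constant 4 ln 2 from Lemma 2's bound.
assert log_c >= 0  # each term is non-negative by the AM-GM inequality
assert sum((pi - qi) ** 2 for pi, qi in zip(p, q)) <= 4 * math.log(2) * log_c
```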
To complete the proof of Lemma 1, it remains to apply Lemma 3 to inequality (3) with \(c:=C^2 \cdot 2^{2\textrm{K}(P,Q)}\).\(\square \)
Now we prove a strengthening of Theorem 4.
Theorem 7
Let P be a computable measure and let \(\omega \) be a Martin-Löf random sequence with respect to P such that \(d(\omega |P)= D\). Then
\[ \sum_{x\text{ is a prefix of }\omega}\bigl(H(0\,|\,x) - P(0\,|\,x)\bigr)^2 = O\bigl((\textrm{K}(P)+D)\cdot 2^{\frac{3\textrm{K}(P)+D}{2}}\bigr). \]
Proof
Consider \(Q=Q_x\) (the best distribution) for some \(x =x_1\ldots x_n\). Then
\[ 3\textrm{K}(Q) - \log Q(x) \leqslant 3\textrm{K}(P) - \log P(x). \qquad (4) \]
Since \(d(\omega |P)=D\) we obtain that
\[ -\log P(x) \leqslant \textrm{KM}(x) + D. \qquad (5) \]
Therefore,
\[ 3\textrm{K}(Q) - \log Q(x) \leqslant 3\textrm{K}(P) + \textrm{KM}(x) + D. \]
We want to estimate \(\sum _{i=1}^n (Q(0| x_1 \ldots x_i)- P(0| x_1 \ldots x_i))^2\) by Lemma 1. We can use this lemma for \(C=2^{3\textrm{K}(P)+ D}\): indeed, \(P(x)\geqslant \textrm{M}(x)\cdot 2^{-D-O(1)}\) by (5), and \(Q(x)\geqslant \textrm{M}(x)\cdot 2^{-3\textrm{K}(P)-D-O(1)}\) by the previous inequality.
From (4), (5) and \(\textrm{KM}(x)\leqslant -\log Q(x)+\textrm{K}(Q)+O(1)\) it follows that
\[ 2\textrm{K}(Q) \leqslant 3\textrm{K}(P) + D + O(1). \qquad (6) \]
Therefore by Lemma 1 we obtain
\[ \sum_{i=1}^{n-1}\bigl(Q(0\,|\,x_1\ldots x_i) - P(0\,|\,x_1\ldots x_i)\bigr)^2 = O(\textrm{K}(P,Q) + \log C) = O(\textrm{K}(P)+D). \]
In fact, we cannot use this lemma for the last term \((Q(0| x) - P(0| x))^2\); this term we simply bound by 1.
So, every probability distribution that is the best for some x “contributes” \(O(\textrm{K}(P) + D)\) to the sum \(\sum _{x \text { is a prefix of }\omega } (H(0\hspace{0.55542pt}|\hspace{0.55542pt}x) - P(0 \hspace{0.55542pt}|\hspace{0.55542pt}x))^2\).
There are at most \(2^{\frac{3 \textrm{K}(P) + D + O(1)}{2}}\) such distributions by (6), so we obtain the required estimate. \(\square \)
Recall the following well-known statement.
Proposition 8
Let P be a computable distribution. Then the P-measure of all sequences \(\omega \) such that \(d( \omega \hspace{0.55542pt}|\hspace{0.55542pt}P) \geqslant D\) is not greater than \(2^{-D}\).
Proof of Theorem 3
Denote by \(\Omega \) the set of all infinite sequences of zeros and ones. Note that
\[ \sum_{x\in\{0,1\}^*} P(x)\bigl(H(0\,|\,x)-P(0\,|\,x)\bigr)^2 = \int_{\Omega} \sum_{x\text{ is a prefix of }\omega}\bigl(H(0\,|\,x)-P(0\,|\,x)\bigr)^2 \, dP(\omega). \]
By Theorem 7 we can estimate the inner sum for a sequence \(\omega \) with \(d(\omega |P)=D\) as \(O((\textrm{K}(P) + D)\cdot 2^{\frac{3 \textrm{K}(P) + D + O(1)}{2}})\). By Proposition 8 the measure of sequences with such randomness deficiency is at most \(2^{-D}\). So we can estimate the integral as
\[ \sum_{D=0}^{\infty} 2^{-D}\cdot O\bigl((\textrm{K}(P)+D)\cdot 2^{\frac{3\textrm{K}(P)+D}{2}}\bigr) = O\bigl(\textrm{K}(P)\cdot 2^{\frac{3\textrm{K}(P)}{2}}\bigr). \]
(Recall that the P-measure of the sequences that are not Martin-Löf random with respect to P is equal to 0, so they do not affect the integral.)\(\square \)
4 Prediction for Enumerable Classes of Hypotheses
Assume that we have some information about the distribution P. We know that P belongs to some enumerable set \(\mathcal {A}\) of computable distributions (i.e., there is an algorithm that enumerates programs computing the distributions from \(\mathcal {A}\)). For this case it is natural to consider the following measure of complexity in \(\mathcal {A}\):
\[ \textrm{K}_{\mathcal{A}}(P) := \textrm{K}(i_P), \]
where \(i_P\) is an index of P in the enumeration of \(\mathcal{A}\). If P appears multiple times in the enumeration, we select \(i_P\) with the lowest complexity. (This definition depends on the choice of a computable enumeration, but this dependence is limited to an additive constant.) Clearly, \( \textrm{K}_\mathcal {A}(P) \geqslant \textrm{K}(P) - O(1)\).
We can now extend our prediction method. To predict the next bit of x, we choose \(Q \in \mathcal {A}\) with the lowest value of \(3 \textrm{K}_\mathcal {A}(Q) - \log Q(x)\) and base our next-bit prediction on Q:
\[ H_{\mathcal{A}}(b\,|\,x) := Q(b\,|\,x). \]
In this section we show that if the set \(\mathcal {A}\) has certain favorable properties, then analogues of the previous theorems hold; moreover, we can achieve an improved error estimate.
We assume that the enumerable set \(\mathcal {A}\) has the following property: if \(P_1, \ldots , P_k \in \mathcal {A}\), then their mixture \(\frac{P_1 + \ldots + P_k}{k}\) belongs to \(\mathcal {A}\). Moreover, there exists an algorithm that, given the numbers of \(P_1, \ldots , P_k\) (in some enumeration of \(\mathcal {A}\)), outputs the number of their mixture.
In what follows, \(\mathcal {A}\) is an enumerable set of computable distributions with this property.
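The closure property is straightforward to illustrate (toy representation of measures as Python functions; helper names ours): the uniform mixture of measures on the binary tree is again such a measure.

```python
from fractions import Fraction

def bernoulli(p: Fraction):
    """Measure on the binary tree with independent bits, Pr[1] = p."""
    def P(x: str) -> Fraction:
        prob = Fraction(1)
        for bit in x:
            prob *= p if bit == "1" else 1 - p
        return prob
    return P

def mixture(*measures):
    """Uniform mixture (P_1 + ... + P_k)/k; a measure whenever each P_i is."""
    k = len(measures)
    return lambda x: sum(P(x) for P in measures) / k

M = mixture(bernoulli(Fraction(1, 3)), bernoulli(Fraction(1, 2)),
            bernoulli(Fraction(4, 5)))

# The mixture still satisfies M(empty word) = 1 and M(x0) + M(x1) = M(x).
assert M("") == 1
for x in ["", "0", "1", "01", "110"]:
    assert M(x + "0") + M(x + "1") == M(x)
```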
Remark 2
Consider the following example of a set \(\mathcal {A}\): the set of all provably computable distributions on the binary tree (in some proof system). For every program \(p \in \mathcal {A}\) there exists a proof that p(x) halts for every x, that \(p(x) = p(x0) + p(x1)\), and that \(p(\text {empty word}) = 1\). We expect that all computable distributions used in practice are provably computable, so, in some sense, we get the better error estimate “almost for free”. This discussion of practice might look out of place because our prediction method is not computable; however, the method can be considered as an idealized version of MDL-based prediction, which is approximated in practice.
Theorem 9
Let \(P \in \mathcal {A}\) be a computable measure and let \(\omega \) be a Martin-Löf random sequence with respect to P such that \(d(\omega |P)= D\). Then
\[ \sum_{x\text{ is a prefix of }\omega}\bigl(H_{\mathcal{A}}(0\,|\,x) - P(0\,|\,x)\bigr)^2 = \text{poly}\bigl(\textrm{K}_{\mathcal{A}}(P)+D\bigr). \]
Theorem 10
For every computable distribution \(P \in \mathcal {A}\) the following sum over all binary strings x is finite:
\[ \sum_{x\in\{0,1\}^*} P(x)\bigl(H_{\mathcal{A}}(0\,|\,x) - P(0\,|\,x)\bigr)^2. \]
The proofs of these theorems are in general the same as the proofs of Theorems 7 and 3; however, some new tools are needed. The difference is that we can get a better estimate of the number of possible best explanations for prefixes of a given sequence.
Lemma 4
Let x be a finite string. Assume that there are \(2^k\) probability distributions \(Q_1, \ldots, Q_{2^k} \in \mathcal {A}\) such that for every i it holds that \(\textrm{K}_\mathcal {A}(Q_i) \leqslant a\) and \(Q_i(x) \geqslant 2^{-b}\). Then there is a probability distribution \(Q \in \mathcal {A}\) such that \(\textrm{K}_\mathcal {A}(Q) \leqslant a - k + O(\log (a + k))\) and \(Q(x) \geqslant 2^{-b-k}\).
Note that \(3 \textrm{K}_\mathcal {A}(Q) - \log Q(x) \leqslant 3 (a - k) + O(\log (a + k)) + b + k < 3a + b\) for big enough k. This means that a string x cannot have many “best” explanations.
Proof of Lemma 4
Let us enumerate all distributions of \(\mathcal {A}\) with complexity at most a in groups of size \(2^{k-1}\) (the last group can be incomplete). The number of such groups is \(O(2^{a-k})\). The complexity of every group is at most \(a - k + O(\log (a + k))\). Indeed, to describe a group we need its ordinal number in the enumeration together with a description of the enumeration itself (we need to know k, a and some enumeration of \(\mathcal {A}\)).
One of the complete groups contains some \(Q_i\). Define Q as the mixture of the distributions in this group. Since the group has complexity at most \(a - k + O(\log (a + k))\), the same estimate holds for the complexity of Q. Since some \(Q_i\) belongs to the mixture, it holds that \(Q(x) \geqslant 2^{-b-k+1}\). Recall that Q belongs to \(\mathcal {A}\) because every mixture of distributions from \(\mathcal {A}\) belongs to \(\mathcal {A}\).\(\square \)
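The counting step — a uniform mixture of \(2^{k-1}\) distributions, one of which gives x probability at least \(2^{-b}\), gives x probability at least \(2^{-b-k+1}\) — can be checked on a toy instance:

```python
from fractions import Fraction

# Toy instance of the mixture bound in Lemma 4: even if all other components
# assign probability 0 to x, the uniform mixture of 2^(k-1) distributions,
# one of which assigns Q_i(x) >= 2^(-b), assigns x at least 2^(-b-k+1).
k, b = 5, 3
group_size = 2 ** (k - 1)
values = [Fraction(0)] * (group_size - 1) + [Fraction(1, 2 ** b)]
mixture_value = sum(values) / group_size
assert mixture_value >= Fraction(1, 2 ** (b + k - 1))
```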
We also need the following lemma.
Lemma 5
Let a string s be a prefix of a string h and let P be a computable distribution such that \(d(s \hspace{0.55542pt}|\hspace{0.55542pt}P) =D\). Then \(d(h \hspace{0.55542pt}|\hspace{0.55542pt}P) \geqslant D - 2 \log D + O(1)\).
(So, a prefix of a string that has small deficiency has (almost as) small deficiency.)
In fact the proof of this lemma is the same as the proof of Theorem 124 in [11].
Proof of Lemma 5
For each k consider the enumerable set of all finite sequences that have deficiency greater than k. All the infinite continuations of these sequences form an open set \(S_k\), and the P-measure of this set does not exceed \(2^{-k}\). Now consider the measure \(P_k\) on \(\Omega \) that is zero outside \(S_k\) and is equal to \(2^kP\) inside \(S_k\). That means that for every set U the value \(P_k(U)\) is defined as \(2^k P(U \cap S_k)\). Actually, \(P_k\) is not a probability distribution according to our definition, since \(P_k(\Omega )\) is not equal to 1. However, \(P_k\) can be considered as a lower semi-computable semi-measure, if we change it a bit and let \(P_k(\Omega )=1\) (this means that the difference between 1 and the former value of \(P_k(\Omega )\) is assigned to the empty string).
It is clear that \(P_k\) is lower semi-computable since P is computable and \(S_k\) is enumerable.
Now consider the sum
\[ S := \sum_{k=1}^{\infty} \frac{P_k}{k(k+1)}. \]
It is a lower semi-computable semi-measure again. Then we have
\[ S(x) \geqslant \frac{2^k\, P(x)}{k(k+1)} \]
for every string x that has a prefix with deficiency greater than k. Since S does not exceed the a priori probability (up to an O(1)-factor), we get the desired inequality.\(\square \)
Proof of Theorem 9
Part 1. We claim that there are only \(\text {poly}(D + \textrm{K}_{\mathcal {A}}(P))\) different distributions that are the best for some prefix of \(\omega \).
Let x be a prefix of \(\omega \) and let Q be the best distribution for x. As in the proof of Theorem 7 (see (6)) we obtain
\[ 2\textrm{K}_{\mathcal{A}}(Q) \leqslant 3\textrm{K}_{\mathcal{A}}(P) + D + O(1) \qquad (7) \]
and hence
\[ d(x\,|\,Q) \leqslant 3\textrm{K}_{\mathcal{A}}(P) + D + O(1). \qquad (8) \]
Let \(Q_1, \ldots , Q_m\) be the different distributions that are best for prefixes \(x_1, \ldots , x_m\) of \(\omega \). We need to prove that \(m=\text {poly}(D + \textrm{K}_{\mathcal {A}}(P))\).
Fix some natural a and b. Consider all \(Q_i\) such that \(\textrm{K}_{\mathcal{A}}(Q_i) = a\) and
\[ d(x_i\,|\,Q_i) = b. \]
It is enough to prove that there are only \(\text {poly}(D + \textrm{K}_{\mathcal{A}}(P))\) best distributions with such complexity and randomness deficiency. Indeed, the resulting estimate of m gets multiplied by \(\text {poly}(D + \textrm{K}_{\mathcal{A}}(P))\), because these parameters are bounded by polynomials in \(D + \textrm{K}_{\mathcal{A}}(P)\) by (7) and (8). So, in what follows we consider only \(Q_i\) with these parameters.
Let \(x_i\) be the shortest prefix among \(x_1, \ldots, x_m\).
By Lemma 5 every \(Q_j\) is a “rather good” distribution for \(x_i\): \(d (x_i \hspace{0.55542pt}|\hspace{0.55542pt}Q_j) \leqslant b + O(\log b)\) and hence \(Q_j(x_i) \geqslant Q_i(x_i) \cdot 2^{- O(\log b)}\). Suppose there are \(2^k\) such distributions \(Q_j\). By Lemma 4 there exists a distribution \(R \in \mathcal {A}\) such that
\[ \textrm{K}_{\mathcal{A}}(R)\leqslant a-k+O(\log(a+k)) \quad\text{and}\quad R(x_i)\geqslant Q_i(x_i)\cdot 2^{-k-O(\log b)}. \]
Since \(Q_i\) is not a worse explanation for \(x_i\) than R, we have:
\[ 3a - \log Q_i(x_i) \leqslant 3\textrm{K}_{\mathcal{A}}(R) - \log R(x_i). \]
Therefore:
\[ 2k \leqslant O(\log(a+k)) + O(\log b), \]
so \(2^k = \text{poly}(a+b)\). This proves our claim.
Part 2. To complete the proof we proceed in the same way as in the proof of Theorem 7.
If \(x=x_1\ldots x_n\) is a prefix of \(\omega \) and Q is the best distribution for x, then by Lemma 1 (applied as in the proof of Theorem 7)
\[ \sum_{i=1}^{n-1}\bigl(Q(0\,|\,x_1\ldots x_i) - P(0\,|\,x_1\ldots x_i)\bigr)^2 = O(\textrm{K}(P,Q)+\log C) = O(\textrm{K}_{\mathcal{A}}(P)+D). \]
(In the last equation we use \(\textrm{K}(P,Q) = O(\textrm{K}(P)+\textrm{K}(Q)) = O(\textrm{K}_{\mathcal {A}}(P)+\textrm{K}_{\mathcal {A}}(Q))\).) So, every probability distribution that is the best for some x “contributes” \(O(\textrm{K}_{\mathcal {A}}(P) + D)\) to the sum \(\sum _{x \text{ is a prefix of }\omega } (H_{\mathcal {A}}(0| x) - P(0| x))^2\). There are \(\text {poly}(D + \textrm{K}_{\mathcal {A}}(P))\) such distributions, so we obtain the required estimate. \(\square \)
Proof of Theorem 10
The proof is the same as the proof of Theorem 3, but using Theorem 9 instead of Theorem 7.\(\square \)
5 Open Questions
A natural question arises: can we get a better estimate in the last theorem than \(O(\textrm{K}(P) 2^{\frac{3 \textrm{K}(P)}{2}} )\)? The exponential (in \(\textrm{K}(P)\)) estimate arises from our attempt to bound the number of distributions that are optimal for some x. However, the author is not aware of an example of a P-random sequence \(\omega \) such that there are exponentially many (in terms of \(\textrm{K}(P)\) and \(d(\omega |P)\)) different best distributions for prefixes of \(\omega \).
Algorithmic statistics [1, 11, 14] studies good explanations for strings among distributions on finite sets. There exists a family of “standard statistics” that covers all the best distributions for finite strings. It would be interesting to know whether the same holds for distributions on the binary tree.
Notes
1. One may even require that the probabilities of finite outputs, i.e., the differences \(S(x)-S(x0)-S(x1)\), are maximal, but we do not require this.
2. One can consider a similar prediction method with factor \(1 + \alpha \) for an arbitrary positive \(\alpha \) instead of the factor 2. However, it does not give a significant improvement, so we do not do it.
References
Gács, P., Tromp, J., Vitányi, P.M.B.: Algorithmic statistics. IEEE Trans. Inf. Theory 47(6), 2443–2463 (2001)
Hutter, M., Poland, J.: Asymptotics of Discrete MDL for Online Prediction. IEEE Trans. Inf. Theory 51(11), 3780–3795 (2005)
Hutter, M.: Discrete MDL predicts in total variation. Advances in Neural Information Processing Systems 22 (NIPS 2009), pp. 817–825 (2009)
Hutter, M.: Sequential predictions based on algorithmic complexity. J. Comput. Syst. Sci. 72, 95–117 (2006)
Hutter, M., Muchnik, A.: Universal convergence of semimeasures on individual random sequences. In: Ben-David, S., Case, J., Maruoka, A. (eds.) ALT 2004. LNCS (LNAI), vol. 3244, pp. 234–248. Springer, Heidelberg (2004)
Hutter, M., Muchnik, A.: On semimeasures predicting Martin-Löf random sequences. Theor. Comput. Sci. 382(3), 247–261 (2007). https://www.sciencedirect.com/science/article/pii/S0304397507002393
Li M., Vitányi, P.: An Introduction to Kolmogorov complexity and its applications, 3rd ed., Springer, 2008 (1 ed., 1993; 2 ed., 1997), xxiii+790 pp. ISBN 978-0-387-49820-1
Lattimore, T., Hutter, M.: On Martin-Löf Convergence of Solomonoff’s Mixture. In: Chan TH.H., Lau L.C., Trevisan L. (eds.) Theory and Applications of Models of Computation. TAMC 2013. Lecture Notes in Computer Science, vol 7876. Springer, Berlin, Heidelberg (2013)
Milovanov, A.: Algorithmic Statistics and Prediction for Polynomial Time-Bounded Algorithms, in: Sailing Routes in the World of Computation. Springer, 2018 p. 287–296. https://springerlink.bibliotecabuap.elogim.com/chapter/10.1007/978-3-319-94418-0_29
Milovanov, A.: Predictions and Algorithmic Statistics for Infinite Sequences. In: Computer Science - Theory and Applications: 16th International Computer Science Symposium in Russia (CSR 2021) Proceedings, pp. 283–295. Springer, (2021)
Shen, A., Uspensky, V., Vereshchagin, N.: Kolmogorov Complexity and Algorithmic Randomness, ACM, (2017)
Solomonoff, R.J.: A formal theory of inductive inference: Parts 1 and 2. Inf. Control 7(1), 1–22; 7(2), 224–254 (1964)
Solomonoff, R.J.: Complexity-based induction systems: Comparisons and convergence theorems. IEEE Trans. Inf. Theory 24(4), 422–432 (1978)
Vereshchagin, N.K., Shen, A.: Algorithmic statistics: forty years later. In: Computability and Complexity, pp. 669–737 (2017)
Vovk, V.G.: On a criterion for randomness. Dokl. Akad. Nauk SSSR 294(6), 1298–1302 (1987)
Acknowledgements
I would like to thank Alexander Shen and Nikolay Vereshchagin for discussions, advice and remarks. I am grateful to the anonymous referees for their thorough reading and numerous suggestions for improving the paper.
Funding
Open access funding provided by FCT|FCCN (b-on). Funded by the European Union (ERC, HOFGA, 101041696). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. Also supported by FCT through the LASIGE Research Unit, ref. UIDB/00408/2020 and ref. UIDP/00408/2020.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Milovanov, A. Prediction and MDL for infinite sequences. Theory Comput Syst (2024). https://doi.org/10.1007/s00224-024-10180-0