1 Introduction

Neural networks and associative memories have been a highly active research area in computer science, physics and probability theory for more than thirty years. The standard model of an associative memory was developed in the seminal paper [6] by Hopfield. His model is based on N neurons, each of which can only take the values \(\pm 1\). Each pair of these neurons is connected and thus interacts. We want to store a set of input data \((\xi ^\mu )_{\mu =1}^M\), so-called patterns or images, in the model, where M may and will depend on the system size N. Each of the patterns \(\xi ^\mu \) is itself a bit string of length N, hence \(\xi ^\mu =(\xi _i^\mu )_{i=1}^N\) where \(\xi _i^\mu \in \{-1, +1\}\) for all i and \(\mu \). The strength at which two neurons i and j interact depends on the images and is given by the so-called synaptic efficacy

$$\begin{aligned} J_{ij}= \sum _{\mu =1}^M \xi _i^\mu \xi _j^\mu . \end{aligned}$$

With this set of \((J_{ij})\)’s we associate a dynamics or updating rule \(T=(T_i)_{i=1}^N\) on \(\{-1,+1\}^N\) such that

$$\begin{aligned} T_i(\sigma ):= \mathrm {sgn}\left( \sum _{j=1}^N J_{ij} \sigma _j \right) \qquad \sigma =(\sigma _i) \in \{-1,+1\}^N \end{aligned}$$

and the indices i are updated either uniformly at random or in a given order. One of the patterns \((\xi ^\mu )\) is considered to be stored if and only if it is stable under the (retrieval) dynamics T, i.e. if and only if \(T_i(\xi ^\mu )=\xi _i^\mu \) for all \(i=1, \ldots , N\).
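To make the storage criterion concrete, here is a minimal numerical sketch (our own illustration in Python/NumPy; all function names are ours): it builds the couplings \(J_{ij}\), applies one sweep of T, and checks whether every stored pattern is a fixed point.

```python
import numpy as np

def hopfield_couplings(patterns):
    """Hebbian synaptic efficacies J_ij = sum_mu xi_i^mu xi_j^mu."""
    return patterns.T @ patterns              # patterns has shape (M, N)

def update(J, sigma):
    """One sweep of T_i(sigma) = sgn(sum_j J_ij sigma_j), with sgn(0) := +1."""
    return np.where(J @ sigma >= 0, 1, -1)

def is_stored(J, pattern):
    """A pattern is stored iff it is a fixed point of T."""
    return np.array_equal(update(J, pattern), pattern)

rng = np.random.default_rng(0)
N, M = 500, 20                                # M roughly N / (4 log N)
patterns = rng.choice([-1, 1], size=(M, N))
J = hopfield_couplings(patterns)
print(all(is_stored(J, p) for p in patterns))  # True with high probability at these sizes
```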

The central question now is: How many patterns can be stored in the above model? Of course, this sensitively depends on the way we choose these patterns. Much in agreement with the choice of messages in information theory, in most test scenarios for associative memories the patterns are chosen independent and identically distributed (even though other choices may be considered as well, see e.g. [9,10,11] or [12]). More precisely, it is assumed that

$$\begin{aligned}{\mathbb {P}}(\xi _i^\mu =1)= {\mathbb {P}}(\xi _i^\mu =-1)=\frac{1}{2} \quad \text{for all } i \text{ and } \mu \end{aligned}$$

and that all \((\xi _i^\mu )_{i,\mu }\) are i.i.d. Under these assumptions it was shown in [13] that the Hopfield model described above can store \(M= C \frac{N}{\log N}\) patterns with \(C<\frac{1}{2}\), if we require that a fixed (but arbitrary) pattern is stable, and with \(C <\frac{1}{4}\) if we ask for stability of all patterns simultaneously (also see [3] for a matching upper bound). However, non-rigorous computations involving the so-called replica trick suggest that we may even achieve a capacity of \(M= \alpha N\), if we allow for a small fraction of errors in the retrieval and \(\alpha < 0.138\) (see [1, 2]). This prediction was mathematically confirmed (yet with smaller values of \(\alpha \)) in [8, 14] and [15].

In a very recent contribution Krotov and Hopfield suggest a generalized version of the Hopfield model (see [7]). There, the authors replace the updating dynamics \(T=(T_i)\) by a more general, still asynchronous one, namely:

$$\begin{aligned} T_i(\sigma ) := \mathrm {sgn}\Bigl [ \sum \limits _{\mu =1}^M \bigl ( F\bigl ( 1 \cdot \xi ^\mu _i + \sum \limits _{j\ne i} \xi ^\mu _j \sigma _j\bigr ) - F\bigl ( (-1) \cdot \xi ^\mu _i + \sum \limits _{j \ne i} \xi ^\mu _j \sigma _j\bigr ) \bigr ) \Bigr ] \end{aligned}$$
(1)

where \(F: {\mathbb {R}}\rightarrow {\mathbb {R}}\) is some smooth function. The case \(F(x) = x^2\) reduces to the standard Hopfield model introduced above. Indeed, in this case the argument of the sign function is given by the difference

$$\begin{aligned}&\sum \limits _{\mu =1}^M F\bigl ( 1 \cdot \xi ^\mu _i + \sum \limits _{j\ne i} \xi ^\mu _j \sigma _j\bigr ) - F\bigl ( (-1) \cdot \xi ^\mu _i + \sum \limits _{j \ne i} \xi ^\mu _j \sigma _j\bigr )\\&\quad = \sum \limits _{\mu =1}^M 1+ 2\sum \limits _{j\ne i} \xi _i^\mu \xi ^\mu _j \sigma _j+ \left( \sum \limits _{j\ne i} \xi ^\mu _j \sigma _j \right) ^2- 1+ 2\sum \limits _{j\ne i} \xi _i^\mu \xi ^\mu _j \sigma _j-\left( \sum \limits _{j\ne i} \xi ^\mu _j \sigma _j \right) ^2\\&\quad = 4\sum \limits _{\mu =1}^M\sum \limits _{j\ne i} \xi _i^\mu \xi ^\mu _j \sigma _j \end{aligned}$$

and its sign is of course the same as that of \(\sum \limits _{\mu =1}^M\sum \limits _{j\ne i} \xi _i^\mu \xi ^\mu _j \sigma _j\); therefore the two dynamics coincide.
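This equivalence can also be checked numerically. The following sketch (our own illustration, not code from [7]) implements the generalized rule (1) for an arbitrary interaction function F and compares one sweep of it with the standard rule; the diagonal term \(j=i\) is removed from the couplings so that both rules use the same sums over \(j \ne i\).

```python
import numpy as np

def generalized_update(patterns, sigma, F):
    """Rule (1): neuron i takes the sign of
    sum_mu [ F(+xi_i^mu + S_i^mu) - F(-xi_i^mu + S_i^mu) ],
    where S_i^mu = sum_{j != i} xi_j^mu sigma_j."""
    M, N = patterns.shape
    overlaps = patterns @ sigma                    # <xi^mu, sigma> for every mu
    new_sigma = np.empty(N, dtype=int)
    for i in range(N):
        S = overlaps - patterns[:, i] * sigma[i]   # partial sums excluding j = i
        delta = F(patterns[:, i] + S) - F(-patterns[:, i] + S)
        new_sigma[i] = 1 if delta.sum() >= 0 else -1
    return new_sigma

def standard_update(patterns, sigma):
    """Standard rule with J_ij = sum_mu xi_i^mu xi_j^mu and zero diagonal."""
    J = patterns.T @ patterns - len(patterns) * np.eye(patterns.shape[1])
    return np.where(J @ sigma >= 0, 1, -1)

rng = np.random.default_rng(1)
patterns = rng.choice([-1, 1], size=(5, 60))
sigma = rng.choice([-1, 1], size=60)
print(np.array_equal(generalized_update(patterns, sigma, lambda x: x ** 2),
                     standard_update(patterns, sigma)))   # True: F(x) = x^2 case
```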

The reason for this more general choice of the interaction function F is the following insight: in the standard Hopfield model there is an energy associated with the dynamics T (i.e. the energy of a configuration decreases under an application of T). This energy of a spin configuration \(\sigma \) is given by \(H(\sigma )= -\frac{1}{N}\sum _\mu \sum _{i,j} \xi _i^\mu \xi _j^\mu \sigma _i \sigma _j\), and it decreases “too slowly” as the configuration \(\sigma \) approaches a pattern to allow for a superlinear storage capacity.

The authors in [7] therefore in particular analyze what they call “polynomial interaction” (or, as they put it, energy) functions, i.e. \(F(x)=x^n\). Krotov and Hopfield state the following assertion

Theorem 1

([7], formulas (5) and (6))

  1.

    In the generalized Hopfield model with interaction function \(F(x)=x^n\) one can store up to \(M=\alpha _n N^{n-1}\) patterns, if small errors in the retrieval are tolerated.

  2.

    In the same model, one can store \(M= \frac{N^{n-1}}{c_n \log N}\) patterns for \(c_n > 2 \,(2n-3)!!\), if one wants a fixed pattern to be a fixed point of the dynamics T introduced in (1) with a probability converging to 1.

A proof of this theorem could probably be rather involved. This is due to the fact that the energy function of the model described in Theorem 1 is of a general polynomial form. As a matter of fact, up to normalization, the energy of the model in these cases is \(H(\sigma )= \sum _\mu P(m^\mu )\), where \(m^\mu := \sum _i \xi _i^\mu \sigma _i\) is the overlap of the configuration \(\sigma \) with the \(\mu \)-th pattern and P is a polynomial, or, in other words, with \(F(x)=x^n\) and n even

$$\begin{aligned} F\left( 1 \cdot \xi ^\mu _i + \sum \limits _{j\ne i} \xi ^\mu _j \sigma _j\right) - F\left( (-1) \cdot \xi ^\mu _i + \sum \limits _{j \ne i} \xi ^\mu _j \sigma _j\right) \end{aligned}$$

consists of many summands of the form \(\xi _i^\mu \left( \sum \limits _{j \ne i} \xi ^\mu _j \sigma _j \right) ^m\) where m is an odd number smaller than n (for \(n=4\), for instance, the difference equals \(8\,\xi _i^\mu \bigl (\sum _{j\ne i}\xi _j^\mu \sigma _j\bigr )^3 + 8\,\xi _i^\mu \sum _{j\ne i}\xi _j^\mu \sigma _j\)). There is, however, a closely related model where the above statement can be proven. To this end consider the dynamics \(\widehat{T}=(\widehat{T}_i)\) on \(\{-1,+1\}^N\) defined by

$$\begin{aligned} \widehat{T}_i(\sigma ):= \mathrm {sgn}\left( \sum _{j_1, \ldots , j_{n-1}=1}^N \sigma _{j_1} \cdots \sigma _{j_{n-1}} W_{i,j_1 \ldots j_{n-1}}\right) \end{aligned}$$
(2)

with

$$\begin{aligned} W_{i_1, \ldots , i_n}= \frac{1}{N^{n-1}} \sum _{\mu =1}^M \xi _{i_1}^\mu \xi _{i_2}^\mu \cdots \xi _{i_n}^\mu . \end{aligned}$$
(3)
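Since the couplings (3) factorize over \(\mu \), the sum in (2) can be rewritten as \(\widehat{T}_i(\sigma ) = \mathrm {sgn}\bigl ( N^{-(n-1)} \sum _{\mu } \xi _i^\mu \bigl (\sum _j \xi _j^\mu \sigma _j\bigr )^{n-1}\bigr )\); this identity is also used at the beginning of the proof of Theorem 2 below. A minimal numerical sketch (our own illustration) exploits it so that the full n-index tensor W never has to be stored:

```python
import numpy as np

def tensor_update(patterns, sigma, n):
    """One sweep of the dynamics (2)-(3): because the couplings (3) factorize,
    the i-th local field equals N**-(n-1) * sum_mu xi_i^mu * <xi^mu, sigma>**(n-1)."""
    N = sigma.size
    overlaps = patterns @ sigma                        # m^mu = <xi^mu, sigma>
    fields = patterns.T @ (overlaps ** (n - 1)) / N ** (n - 1)
    return np.where(fields >= 0, 1, -1)

rng = np.random.default_rng(2)
N, n = 200, 3
M = N ** (n - 1) // (12 * int(np.log(N)))              # safely below N^(n-1)/(c_3 log N), c_3 > 6
patterns = rng.choice([-1, 1], size=(M, N))
print(M, np.array_equal(tensor_update(patterns, patterns[0], n), patterns[0]))
# expected: the first pattern is a fixed point with high probability at these sizes
```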

Then the following theorem can be shown

Theorem 2

  1.

    In the generalized Hopfield model with dynamics \(\widehat{T}\) defined in (2) and (3) one can store up to \(M=\alpha _n N^{n-1}\) patterns, if small errors in the retrieval are tolerated.

  2.

    In the same model, one can store \(M= \frac{N^{n-1}}{c_n \log N}\) patterns for \(c_n > 2 \,(2n-3)!!\), if one wants a fixed pattern to be a fixed point of the dynamics \(\widehat{T}\) with a probability converging to 1.

While part 1 of the above theorem was proved in [14], we will give a proof of the second part (including a computation of the constants \(c_n\)) in Sect. 2 below. Note that the thermodynamics of this model was analyzed in [4].

More interesting than these polynomial models is, however, the question of what happens if we formally send n to infinity. One could conjecture from the above that this would lead to an interaction function of the form \(e^{x}\), on the one hand, and to a super-polynomial storage capacity, on the other. This is indeed what we will show. Actually, we will even prove slightly more: In general, one could imagine that an increase in capacity comes at the expense of associativity, such that in the extreme case one could store \(2^N\) patterns but none of them has a positive radius of attraction. We will show that this is not the case for our model: The dynamics is even able to repair a number of random errors of order N.

Theorem 3

Consider the generalized Hopfield model with the dynamics described in (1) and interaction function F given by \(F(x)= e^{x}\). For a fixed \(0< \alpha < \log (2)/2\) let \(M=\exp \left( \alpha N \right) +1\) and let \(\xi ^1, \ldots , \xi ^M\) be M patterns chosen uniformly at random from \(\{-1,+1\}^N\). Moreover, fix \(\varrho \in [0, 1/2)\). For any \(\mu \) and any \(\widetilde{\xi }^\mu \) taken uniformly at random from the Hamming sphere with radius \(\varrho N\) centered at \(\xi ^\mu \), \(\mathcal {S}(\xi ^\mu ,\varrho N)\), where \(\varrho N\) is assumed to be an integer, it holds that

$$\begin{aligned}{\mathbb {P}}\left( \exists \mu \; \exists i : T_i\left( \widetilde{\xi }^\mu \right) \ne \xi ^\mu _i \right) \rightarrow 0,\end{aligned}$$

if \(\alpha \) is chosen in dependence of \(\varrho \) such that

$$\begin{aligned} \alpha < \frac{I(1-2\varrho )}{2} \end{aligned}$$

with \(I:x \mapsto \frac{1}{2}\left( (1+x) \log (1+x) + (1-x) \log (1-x)\right) \).

Remark 1

Note that Theorem 3 in particular implies that

$$\begin{aligned} {\mathbb {P}}\left( \exists \mu \; \exists i : T_i\left( \xi ^\mu \right) \ne \xi ^\mu _i \right) \rightarrow 0 \end{aligned}$$

as \(N \rightarrow \infty \), i.e. with a probability converging to 1, all the patterns are fixed points of the dynamics. In this case we can even take \(\alpha < \frac{I(1)}{2}\).

Remark 2

Theorem 3 can be proven analogously if the configuration \(\widetilde{\xi }^\mu \) is drawn uniformly at random from the Hamming ball \(B(\xi ^{\mu }, \varrho N)\). Indeed, the estimate on the probability of failing to correct a configuration chosen from the sphere \(S(\xi ^\mu , \varrho N)\) serves as a bound for the corresponding probability on spheres of smaller radius.
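Theorem 3 and the remarks above are easy to probe numerically for moderate N. The following sketch (our own illustration; the parameters are chosen for a quick run, not optimized) stores \(M=\exp (\alpha N)+1\) random patterns, flips \(\varrho N\) bits of the first one, and checks whether one sweep of the dynamics (1) with \(F(x)=e^x\) restores it; it also prints the bound \(I(1-2\varrho )/2\) from Theorem 3.

```python
import numpy as np

def I(x):
    """Rate function I(x) = ((1+x) log(1+x) + (1-x) log(1-x)) / 2 from Theorem 3."""
    return ((1 + x) * np.log1p(x) + (1 - x) * np.log1p(-x)) / 2

def exp_update(patterns, sigma):
    """One sweep of rule (1) with interaction function F(x) = exp(x)."""
    N = sigma.size
    overlaps = patterns @ sigma
    new_sigma = np.empty(N, dtype=int)
    for i in range(N):
        S = overlaps - patterns[:, i] * sigma[i]       # sums over j != i
        delta = np.exp(patterns[:, i] + S) - np.exp(-patterns[:, i] + S)
        new_sigma[i] = 1 if delta.sum() >= 0 else -1
    return new_sigma

rng = np.random.default_rng(3)
N, alpha, rho = 40, 0.15, 0.1
M = int(np.exp(alpha * N)) + 1                         # about 404 patterns for N = 40
patterns = rng.choice([-1, 1], size=(M, N))
corrupted = patterns[0].copy()
flip = rng.choice(N, size=int(rho * N), replace=False)
corrupted[flip] *= -1                                  # rho*N random errors
print("alpha =", alpha, " bound I(1-2 rho)/2 =", round(I(1 - 2 * rho) / 2, 3))
print("pattern recovered:", np.array_equal(exp_update(patterns, corrupted), patterns[0]))
```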

Theorem 2 and Theorem 3 will be proven in the following section.

2 Proofs

We start with the proof of Theorem 2.

Proof of Theorem 2

As already mentioned, the first part of the theorem was proven in [14].

We are interested in bounding the following probability

$$\begin{aligned} {\mathbb {P}}(\exists i\le N: \; \widehat{T}_i(\xi ^1) \ne \xi ^1_i)&= {\mathbb {P}}\Bigl (\exists i\le N: - \sum \limits _{\mu = 2}^M \xi _i^1 \xi _i^\mu \; \bigl (\sum \limits _{j} \xi _{j}^1 \xi _{j}^\mu \bigr )^{n-1} > N^{n-1} \Bigr ). \end{aligned}$$

With a union bound over i, the exponential Chebyshev inequality, and the independence of the random variables \((\xi _i^\mu )\) for different \(\mu \), it follows for any \(t>0\) that

$$\begin{aligned} {\mathbb {P}}(\exists i\le N: \; \widehat{T}_i(\xi ^1) \ne \xi ^1_i) \le N e^{-t N^{n-1}} {\mathbb {E}}\Bigl [\exp \bigl (-t \, \xi ^1_i \xi ^2_i \bigl (\sum \limits _{j} \xi ^1_j \xi ^2_j \bigr )^{n-1}\bigr ) \Bigr ]^{M-1}. \end{aligned}$$

For the moment generating function we condition on the values of \(\xi ^1_i \xi ^2_i\) to get the upper bound

$$\begin{aligned}&{\mathbb {P}}(\exists i\le N: \; \widehat{T}_i(\xi ^1) \ne \xi ^1_i) \le N e^{-t N^{n-1}} {\mathbb {E}}\left[ \cosh \left( t \left( \sum \limits _{j} \xi ^1_j \xi ^2_j\right) ^{n-1} \right) \right] ^{M-1}. \end{aligned}$$

The random variables \((\xi ^1_j \xi ^2_j)_{j}\) are i.i.d. and distributed like \(\xi ^1_1\). Define \(m = \frac{1}{\sqrt{N}} \sum _{j} \xi ^1_j \xi ^2_j\) and write the expectation as the sum over all possible values x of m:

$$\begin{aligned} {\mathbb {E}}\left[ \cosh \left( t \left( \sum \limits _{j} \xi ^1_j \xi ^2_j\right) ^{n-1} \right) \right]&= \sum \limits _{x} \cosh \Big (t N^{\frac{n-1}{2}} x^{n-1} \Big ) \cdot {\mathbb {P}}( m = x). \end{aligned}$$

The sum is over all possible values of m, i.e. over \(x\in \{-\sqrt{N}, -\sqrt{N}+\frac{2}{\sqrt{N}}, \ldots , \sqrt{N}\}\). First we want to eliminate the tail events, using the fact that the corresponding probability vanishes fast enough once x is bounded away from zero. To this end we fix \(\beta > \frac{1}{2}\) and split the sum at \(|x| = \log (N)^\beta \). Additionally observe that \(|x|\) cannot exceed \(\sqrt{N}\), and set \(t = a_n/M\) for some \(a_n > 0\) independent of N. Thus

$$\begin{aligned}&\sum \limits _{x: \log (N)^\beta < |x| \le \sqrt{N}} \cosh \left( t N^{\frac{n-1}{2}} x^{n-1} \right) \cdot {\mathbb {P}}(m = x)\\&\quad \le 2 \cosh \left( t N^{n-1} \right) {\mathbb {P}}(m > \log (N)^\beta ) \le 2 \; \exp \left( t N^{n-1} \right) \exp \left( -\frac{1}{2} \log (N)^{2\beta } \right) \\&\quad = 2 \; \exp \left( \left[ a_n c_n -\frac{1}{2} \log (N)^{2\beta - 1} \right] \log (N) \right) . \end{aligned}$$

Here we used a standard large deviations estimate and \(\cosh (z) \le \exp (|z|)\). This part of the moment generating function converges to zero for \(N \rightarrow \infty \), because for N large enough the term in square brackets is bounded by a negative constant times \(\log (N)^{2\beta -1}\); in particular, this contribution decays faster than any power of N.

For the critical values of m we use the inequality \(\cosh (z) \le \exp (\frac{z^2}{2})\) for all z and write the exponential function in its Taylor expansion:

$$\begin{aligned}&\sum \limits _{x: |x| \le \log (N)^\beta } \cosh \left( t N^{\frac{n-1}{2}} x^{n-1} \right) {\mathbb {P}}(m = x) \le \sum \limits _{x:|x| \le \log (N)^\beta } e^{\frac{t^2}{2} N^{n-1} x^{2(n-1)}}{\mathbb {P}}(m = x) \\&\quad = \sum \limits _{x: |x| \le \log (N)^\beta } \left( 1 + \frac{t^2}{2} N^{n-1} x^{2(n-1)} + \sum \limits _{k=2}^\infty \frac{1}{2^k} \frac{(t^2 N^{n-1} x^{2(n-1)})^k}{k!} \right) {\mathbb {P}}(m = x). \end{aligned}$$

The distribution of m converges to a standard normal distribution, and for every N its even moments are bounded by the corresponding moments of the latter. For \(l\in {\mathbb {N}}\) let \(\kappa _{2l}\) be the 2l-th moment of a standard normal distribution. The sum of the probabilities can be bounded by one. So for N large enough, \(t= a_n/M\), and \(M = \mathrm {const.} \frac{N^{n-1}}{\log N}\), we derive the following upper bound for the moment generating function:

$$\begin{aligned}&\sum \limits _{x: |x| \le \log (N)^\beta } \cosh \left( t N^{\frac{n-1}{2}} x^{n-1} \right) {\mathbb {P}}(m = x) \\&\quad \le 1+ \frac{t^2}{2} N^{n-1} \kappa _{2(n-1)} + \sum \limits _{x:\; |x| \le \log (N)^\beta } \sum \limits _{k=2}^\infty \frac{1}{2^k} \frac{\left( t^2 N^{n-1} x^{2(n-1)}\right) ^k}{k!} \cdot {\mathbb {P}}(m = x) \\&\quad \le 1+ \frac{t^2}{2} N^{n-1} \kappa _{2(n-1)}+ \sum \limits _{k=2}^\infty \frac{1}{2^k} \frac{(t^2 N^{n-1} \log (N)^{2 \beta (n-1)})^k}{k!} \\&\quad \le 1+ \frac{t^2}{2} N^{n-1} \kappa _{2(n-1)}+ \frac{t^4}{4} N^{2n-2} \log (N)^{4 \beta (n-1)}(e-2)\\&\quad \le \exp \left( \frac{t^2}{2} N^{n-1} \kappa _{2(n-1)} + t^4 N^{2(n-1)} \log (N)^{4\beta (n-1)} \right) \end{aligned}$$

Inserting \(t= a_n/M\) into this result we obtain

$$\begin{aligned}&{\mathbb {P}}(\exists i\le N: \; \widehat{T}_i(\xi ^1) \ne \xi ^1_i) \\&\quad \le \exp \left( \log (N)-t N^{n-1}\right) \exp \left( \frac{t^2}{2} N^{n-1} \kappa _{2(n-1)} M + t^4 N^{2(n-1)} \log (N)^{4\beta (n-1)} M \right) \\&\quad = \exp \left( \left[ 1 - a_n c_n \left( 1 - \frac{a_n \, \kappa _{2(n-1)}}{2} \right) \right] \log (N) \right) \cdot (1+o(1)). \end{aligned}$$

for our choice of t and M.

The even moments of the standard normal distribution are given by \(\kappa _{2(n-1)} = (2n-3)!!\) for all \(n\in {\mathbb {N}}\). Choose \(a_n = (\kappa _{2(n-1)})^{-1}\). The term in brackets can be bounded by a negative value if \(c_n\) satisfies

$$\begin{aligned} 1- a_n c_n \left( 1 - \frac{ a_n \, \kappa _{2(n-1)}}{2} \right) < 0 \end{aligned}$$

which is the case if and only if \(c_n>2 (2n-3)!!\). This is exactly the memory capacity proposed by Krotov and Hopfield in Theorem 1. \(\square \)
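The two ingredients of this computation, namely that \(\kappa _{2(n-1)}=(2n-3)!!\) for a standard Gaussian and that the even moments of \(m = \frac{1}{\sqrt{N}}\sum _j \xi _j^1\xi _j^2\) stay below these values, can be checked with a quick Monte Carlo experiment (our own sketch; the sample sizes are arbitrary):

```python
import numpy as np

def double_factorial(k):
    """k!! with the convention (-1)!! = 1, so double_factorial(2n-3) = (2n-3)!!."""
    return 1 if k <= 0 else k * double_factorial(k - 2)

rng = np.random.default_rng(4)
N, samples = 1000, 200_000
g = rng.standard_normal(samples)
m = (2 * rng.binomial(N, 0.5, size=samples) - N) / np.sqrt(N)   # normalized +-1 sums

for n in (2, 3, 4):
    kappa = double_factorial(2 * n - 3)          # Gaussian moment E[g^(2n-2)]
    print(n,
          round(np.mean(g ** (2 * n - 2)), 2),   # Monte Carlo estimate of kappa
          round(np.mean(m ** (2 * n - 2)), 2),   # empirical (2n-2)-th moment of m
          "c_n >", 2 * kappa)                    # threshold from Theorem 2
```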

We continue with the proof of our central result.

Proof of Theorem 3

We start by slightly reformulating the dynamics of the model. Indeed, an (almost) equivalent formulation is to say that a neuron i will remain unchanged after an application of the update rule if

$$\begin{aligned} \Delta _i E(\sigma ) := \sum \limits _{\mu =1}^M \left( F\left( \sigma _i \xi ^\mu _i + \sum \limits _{j\ne i} \xi ^\mu _j \sigma _j\right) - F\left( -\sigma _i \xi ^\mu _i + \sum \limits _{j \ne i} \xi ^\mu _j \sigma _j\right) \right) > 0 \end{aligned}$$
(4)

and the spin of neuron i will be changed if \(\Delta _i E(\sigma )\) is less than zero. In the limit \(N\rightarrow \infty \) the case \(\Delta _i E(\sigma ) = 0\) is negligible for our later purposes.

Starting in one of the patterns (without loss of generality the first one, \(\xi ^1\)) we want to show that it is an attractive fixed point of the update dynamics, i.e. we need to show that \(T_i(\xi ^1)=\xi _i^1\) for all i, and we want the model to correct up to \(\varrho N\) random errors by updating each of the neurons once. To this end we take a corrupted version \(\widetilde{\xi }^1\) uniformly at random from \(S(\xi ^1, \varrho N)\) for a fixed \(\varrho \in [0, \frac{1}{2})\) and fix a neuron i. Then we can interpret the summand for \(\mu = 1\) in (4):

$$\begin{aligned} E_\text {signal} = F\left( \sum \limits _{j=1}^N \xi ^1_j \widetilde{\xi }^1_j \right) - F\left( -2 \, \xi ^1_i \widetilde{\xi }^1_i + \sum \limits _{j=1}^N \xi ^1_j \widetilde{\xi }^1_j \right) \end{aligned}$$

as signal term and the rest of the sum in (4):

$$\begin{aligned} E_\text {noise} = \sum \limits _{\mu =2}^M \left( F\left( \sum \limits _{j=1}^N \xi ^\mu _j \widetilde{\xi }^1_j\right) - F\left( -2\widetilde{\xi }^1_i \xi ^\mu _i + \sum \limits _{j =1}^N \xi ^\mu _j \widetilde{\xi }^1_j\right) \right) \end{aligned}$$

as noise term. As we will show, in order for neuron i not to be updated correctly, the noise term must outweigh the signal term in (4). We want to show that the probability of this event vanishes as \(N \rightarrow \infty \).

We need to distinguish two cases: first the neuron i can be correct, i.e. \(\xi _i^1= \widetilde{\xi }_i^1\), and we want the associative memory not to change this value (this means \(\Delta _i E(\widetilde{\xi }^1) > 0\)). In the other case the neuron is wrong, i.e. \(\xi _i^1 = -\widetilde{\xi }_i^1\), and the network needs to correct this neuron (this means \(\Delta _i E(\widetilde{\xi }^1)<0\)). In both cases the signal term supports the desired behavior. Indeed, we have:

$$\begin{aligned}&F\left( \sum _{j=1}^N \xi ^1_j \widetilde{\xi }^1_j \right) - F\left( \sum _{j=1}^N \xi ^1_j \widetilde{\xi }^1_j -2 \right) > 0 \text { on the one hand, and }\\&F\left( \sum _{j=1}^N \xi ^1_j \widetilde{\xi }^1_j \right) - F\left( \sum _{j=1}^N \xi ^1_j \widetilde{\xi }^1_j +2 \right) < 0 \text { on the other } \end{aligned}$$

depending on whether neuron i is correct or incorrect (since \(\sum _{j=1}^N \xi ^1_j \widetilde{\xi }^1_j = (1-2\varrho )N> 2\) for N large enough and \(F=\exp \)). This means that in order for neuron i to be updated correctly, it must hold that \(\mathrm {sgn}\left( E_\text {signal}+E_\text {noise}\right) =\mathrm {sgn}\left( E_\text {signal}\right) \), which is fulfilled as soon as \( |E_\text {noise}| < |E_\text {signal}|\). Thus, a necessary condition for the event \(\{T_i(\widetilde{\xi }^1) \ne \xi ^1_i\}\) is that \( | E_\text {noise} | \ge | E_\text {signal} |\). Therefore:

$$\begin{aligned}&{\mathbb {P}}\left( \exists \mu \; \exists i : T_i\bigl (\widetilde{\xi }^\mu \bigr ) \ne \xi ^\mu _i \right) \le N \cdot M \cdot {\mathbb {P}}\bigl (T_i(\widetilde{\xi }^1) \ne \xi ^1_i \bigr ) \\&\quad \le N \cdot M \cdot {\mathbb {P}}\bigl (|E_\text {noise}| \ge |E_\text {signal}|\bigr ). \end{aligned}$$

By using the straightforward fact that \(|e^{a\pm 1}-e^{a\mp 1}| = [1-e^{-2}]\, e \cdot e^{a}\), together with \(e^{a}\le e\cdot e^{\langle \xi ^\mu | \widetilde{\xi }^1 \rangle }\) for \(a=\sum _{j\ne i} \xi ^\mu _j \widetilde{\xi }^1_j\), we obtain:

$$\begin{aligned} |E_\text {noise}|&\le \sum \limits _{\mu =2}^M \left| \exp (\xi ^\mu _i \xi ^1_i + \sum \limits _{j\ne i} \xi ^\mu _j \widetilde{\xi }^1_j)- \exp (-\xi ^\mu _i \xi ^1_i + \sum \limits _{j\ne i} \xi ^\mu _j \widetilde{\xi }^1_j)\right| \\&\le [1-e^{-2}]e^2\sum \limits _{\mu =2}^M \exp \left( \langle \xi ^\mu | \widetilde{\xi }^1 \rangle \right) \end{aligned}$$

where \(\langle x | y \rangle \) is the inner product on \(\{\pm 1\}^N\). At the same time

$$\begin{aligned}| E_\text {signal} | > e^{N(1-2\varrho )}[1-e^{-2}].\end{aligned}$$

It follows that:

$$\begin{aligned} {\mathbb {P}}\left( \exists \mu \; \exists i : T_i\bigl (\widetilde{\xi }^\mu \bigr ) \ne \xi ^\mu _i \right)&\le N \cdot M \cdot {\mathbb {P}}\bigl (T_i(\widetilde{\xi }^1) \ne \xi ^1_i \bigr ) \nonumber \\&\le N \cdot M \cdot {\mathbb {P}}\left( \sum \limits _{\mu =2}^M e^2\exp (\langle \xi ^\mu | \widetilde{\xi }^1 \rangle ) > e^{N(1-2\varrho )} \right) . \end{aligned}$$
(5)

We will use two standard estimates from the theory of large deviations [5]: for a binomially distributed random variable \(\bar{S}_{m,p}\) with parameters m and p and for \(\varepsilon >0\), we have:

$$\begin{aligned} {\mathbb {P}}\left( \bar{S}_{m,p} \ge m(p+\varepsilon ) \right) \le \exp \left( -m \frac{\varepsilon ^2}{2(p+\varepsilon )} \right) \end{aligned}$$
(6)

and for a sum \(S_m\) of m i.i.d. random variables \(X_i\) with \({\mathbb {P}}(X_1=1)={\mathbb {P}}(X_1=-1)=\frac{1}{2}\) and \(x\in (0,1)\) we have

$$\begin{aligned} {\mathbb {P}}\left( S_m \ge m x\right) \le \exp \left( - m I(x) \right) \end{aligned}$$
(7)

as well as

$$\begin{aligned} \lim \limits _{m\rightarrow \infty } \frac{1}{m} \log ({\mathbb {P}}\left( S_m > mx\right) ) = - I(x) \end{aligned}$$
(8)

where again \( I(x)=\frac{1}{2} \left( (1+x) \log (1+x) +(1-x) \log (1-x) \right) \). In fact, (8) is nothing but Cramér's theorem for fair \(\pm 1\)-Bernoulli random variables.
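For fair \(\pm 1\) signs the bound (7) and the rate in (8) can be compared directly with the exact binomial tail, since \({\mathbb {P}}(S_m \ge mx) = {\mathbb {P}}(\mathrm {Bin}(m,1/2) \ge m(1+x)/2)\). A small sketch of such a check (our own illustration, using SciPy for the exact tail):

```python
import numpy as np
from scipy.stats import binom

def I(x):
    """I(x) = ((1+x) log(1+x) + (1-x) log(1-x)) / 2."""
    return ((1 + x) * np.log1p(x) + (1 - x) * np.log1p(-x)) / 2

m = 2000
for x in (0.1, 0.3, 0.5):                 # chosen so that m(1+x)/2 is an integer
    k = round(m * (1 + x) / 2)            # S_m >= m x  <=>  number of +1's >= k
    tail = binom.sf(k - 1, m, 0.5)        # exact P(S_m >= m x)
    print(x,
          round(-np.log(tail) / m, 4),    # empirical exponent, slightly above I(x)
          round(I(x), 4))                 # so the tail is indeed below exp(-m I(x))
```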

Now let \(\alpha < \frac{1}{2}I(1-2\varrho )\), \(M = \exp \left( \alpha N\right) + 1\) and \(\beta _o\) be such that \(I(\beta _o)=\alpha \). By continuity of I there exists an \(\varepsilon >0\), such that for all \(x\in (1-2\varrho -\varepsilon ,1-2\varrho ]\) we have that \(\alpha <\frac{1}{2}I(x)\le \frac{1}{2}I(1-2\varrho )\). Again by continuity of I we can choose \(\beta <\beta _o\) such that:

$$\begin{aligned}\alpha -\frac{\varepsilon }{2}=I(\beta _o)-\frac{\varepsilon }{2}<I(\beta )< I(\beta _o)=\alpha . \end{aligned}$$

Let us define

$$\begin{aligned} A= & {} \{ \mu \in \{2 \ldots M \} | \langle \xi ^\mu | \widetilde{\xi }^1\rangle \ge \beta N \}, \\ p= & {} {\mathbb {P}}\left( \langle \xi ^2 |\widetilde{\xi }^1\rangle \ge \beta N\right) . \end{aligned}$$

By (7), we have that: \(p < \exp (-N I(\beta ) )\). On the other hand, by (8) we can conclude that for \(\eta = \frac{1}{2} (\alpha -I(\beta )) > 0\) and N sufficiently large we have \(p> \exp (-N(I(\beta )+\eta ))\).

Now we bound the probability in (5) by singling out those patterns \(\xi ^\mu \) that have a significant overlap with \(\widetilde{\xi }^1\) (i.e. \(\mu \in A\)).

$$\begin{aligned}&{\mathbb {P}}\left( \sum \limits _{\mu =2}^M e^2\exp (\langle \xi ^\mu | \widetilde{\xi }^1 \rangle )> e^{N(1-2\varrho )} \right) \\&\quad = \sum _{X\subset \{2 \ldots M \}}{\mathbb {P}}\left( \left( \sum \limits _{\mu =2}^M \exp (\langle \xi ^\mu | \widetilde{\xi }^1 \rangle )> e^{N(1-2\varrho )-2}\right) \cap (A=X)\right) \\&\quad \le \sum _{k=0}^{M-1}\sum _{X\in P_k (\{2 \ldots M \})}p^k (1-p)^{M-1-k}\cdot \\&{\mathbb {P}}\left( \sum \limits _{\mu \in X}\exp (\langle \xi ^\mu | \widetilde{\xi }^1 \rangle ) + (M-1-k) e^{\beta N} > e^{N(1-2\varrho )-2} | A=X\right) \end{aligned}$$

where \(P_k(\{2 \ldots M \})\) denotes the collection of subsets of \(\{2, \ldots , M \}\) of size k. Additionally we used that, on the event \(\{A=X\}\), the term \(\exp (\langle \xi ^\mu | \widetilde{\xi }^1 \rangle )\) of every \(\mu \notin X\) can be bounded by \(\exp (\beta N)\). By the identical distribution of the patterns this is equal to

$$\begin{aligned}&= \sum _{k=0}^{M-1}\left( {\begin{array}{c}M-1\\ k\end{array}}\right) p^k (1-p)^{M-1-k}\cdot \\&\quad {\mathbb {P}}\left( \sum \limits _{\mu =2}^{k+1} \exp (\langle \xi ^\mu | \widetilde{\xi }^1 \rangle ) > e^{N(1-2\varrho ) -2} -(M-1-k) e^{\beta N} | A=\{2, \ldots , k+1\}\right) . \end{aligned}$$

If a sum of k terms exceeds a threshold, its maximal summand exceeds \(\frac{1}{k}\) times that threshold; together with a standard union bound and the identical distribution of the patterns we thus arrive at the following term

$$\begin{aligned}&\le \sum _{k=0}^{M-1}\left( {\begin{array}{c}M-1\\ k\end{array}}\right) p^k (1-p)^{M-1-k}\cdot \\&\quad {\mathbb {P}}\left( \max \limits _{\mu \in \{2,\ldots , k+1\}} \exp (\langle \xi ^\mu | \widetilde{\xi }^1 \rangle )> \frac{1}{k}( e^{N(1-2\varrho )-2} -(M-1-k) e^{\beta N})| A=\{2, \ldots , k+1\}\right) \\&\le \sum _{k=0}^{M-1}k\left( {\begin{array}{c}M-1\\ k\end{array}}\right) p^k (1-p)^{M-1-k}\cdot \\&\quad {\mathbb {P}}\left( \exp (\langle \xi ^2 | \widetilde{\xi }^1 \rangle ) > \frac{1}{k}( e^{N(1-2\varrho )-2} -(M-1-k) e^{\beta N})| A=\{2, \ldots , k+1\}\right) . \end{aligned}$$

Denote by

$$\begin{aligned} r = r(k)= {\mathbb {P}}\left( \exp \left( \langle \xi ^2 | \widetilde{\xi }^1 \rangle \right) \ge \frac{1}{k}( e^{N(1-2\varrho )-2} -(M-1-k) e^{\beta N})\right) . \end{aligned}$$

We then arrive at

$$\begin{aligned}&{\mathbb {P}}\left( \left( \exp (\langle \xi ^2 | \widetilde{\xi }^1 \rangle )> \frac{1}{k}( e^{N(1-2\varrho )-2} -(M-1-k) e^{\beta N})\right) | A=\{2, \ldots , k+1\}\right) \\&= \frac{{\mathbb {P}}\left( \left( \exp (\langle \xi ^2 | \widetilde{\xi }^1 \rangle ) > \frac{1}{k}( e^{N(1-2\varrho )-2} -(M-1-k) e^{\beta N})\right) ; 2 \in A \right) }{{\mathbb {P}}\left( 2 \in A \right) } \le \frac{r(k)}{p} \end{aligned}$$

because \({\mathbb {P}}\left( 2 \in A\right) = p\).

Thus, for a suitable constant \(c>0\), an upper bound is:

$$\begin{aligned}&{\mathbb {P}}\Bigl ( \sum \limits _{\mu =2}^M \exp (-\xi _i^\mu \xi ^1_i + \sum \limits _{j\ne i} \xi _j^\mu \widetilde{\xi }^1_j)- \exp (\xi _i^\mu \xi ^1_i + \sum \limits _{j\ne i} \xi _j^\mu \widetilde{\xi }^1_j)> c e^{N(1-2\varrho )}\Bigr )\\&\quad \le \sum _{k=0}^{M-1}k\left( {\begin{array}{c}M-1\\ k\end{array}}\right) r(k) p^{k-1}(1-p)^{M-1-k}. \end{aligned}$$

Now let us split this sum into two parts: the first one for \(k \in \{0, \ldots , \lfloor 2p(M-1) \rfloor \}\) and the second one the remaining part. We start with the second part. By using the identity \(k\left( {\begin{array}{c}M-1\\ k\end{array}}\right) =(M-1)\left( {\begin{array}{c}M-2\\ k-1\end{array}}\right) \) and the trivial fact that \(r(k)\le 1\), we get:

$$\begin{aligned}&\sum _{k=\lfloor 2p(M-1)\rfloor +1}^{M-1} k\left( {\begin{array}{c}M-1\\ k\end{array}}\right) r(k) p^{k-1}(1-p)^{M-1-k}\\&\quad \le (M-1)\sum _{k=\lfloor 2p(M-1)\rfloor +1}^{M-1}\left( {\begin{array}{c}M-2\\ k-1\end{array}}\right) p^{k-1} (1-p)^{M-2-(k-1)}\\&\quad = (M-1) {\mathbb {P}}\left( \bar{S}_{M-2,p}\ge \lfloor 2p(M-1)\rfloor \right) \le (M-1) {\mathbb {P}}\left( \bar{S}_{M-2,p}\ge \frac{3}{2}p(M-2)\right) . \end{aligned}$$

We used the bound \(\lfloor 2p(M-1)\rfloor > \frac{3}{2}p(M-2)\), which is a consequence of the fact that \(p(M-1)\) goes to infinity. This will be shown at the end of the proof. Then, by using (6), with \(\varepsilon =\frac{p}{2}\) we obtain:

$$\begin{aligned} \sum _{k=\lfloor 2p(M-1)\rfloor +1}^{M-1} k\left( {\begin{array}{c}M-1\\ k\end{array}}\right) r(k) p^{k-1}(1-p)^{M-1-k} \le (M-1) \exp \left( -\frac{p(M-2)}{12}\right) . \end{aligned}$$

We consider now the first part. Since:

$$\begin{aligned} \frac{1}{k}( e^{N(1-2\varrho )-2} -(M-1-k) e^{\beta N})= \frac{1}{k}( e^{N(1-2\varrho )-2} - e^{ (\alpha +\beta ) N} ) + e^{\beta N} \end{aligned}$$

we clearly see that r(k) is increasing in k if \(\alpha +\beta <1-2\varrho \) is fulfilled. This condition will be proven later on. Thus:

$$\begin{aligned} \sum _{k=0}^{\lfloor 2p(M-1)\rfloor }&\left( {\begin{array}{c}M-1\\ k\end{array}}\right) kr(k) p^{k-1}(1-p)^{M-1-k}\\&\le \frac{1}{p}\max _{k\in \{0,\ldots , \lfloor 2p(M-1)\rfloor \}} kr(k)\\&\le 2(M-1) \cdot r( 2p(M-1)). \end{aligned}$$

Let us examine this last term: since \(p< e^{-N I(\beta )}\), we observe

$$\begin{aligned} r( 2p(M-1))&= {\mathbb {P}}\Bigl ( \exp (\langle \xi ^2 | \widetilde{\xi }^1 \rangle )> \frac{1}{2p}\bigl ( e^{N(1-2\varrho -\alpha )-2} - e^{\beta N}\bigr )+e^{\beta N}\Bigr )\\&\le {\mathbb {P}}\Bigl ( \exp (\langle \xi ^2 | \widetilde{\xi }^1 \rangle ) > \frac{1}{2}\bigl ( e^{N(1-2\varrho -\alpha +I(\beta ))-2} - e^{(\beta +I(\beta )) N}\bigr ) \Bigr ). \end{aligned}$$

First of all let us show that \(1-2\varrho -\alpha +I(\beta )>\beta +I(\beta )\), so that the first term dominates the second term: indeed, this is equivalent to proving that \(\alpha + \beta < 1-2\varrho \). By concavity we have for \(x\in (0,1)\):

$$\begin{aligned} I(x)\le \log \left( \frac{(1+x)^2}{2} +\frac{(1-x)^2}{2}\right) = \log ( 1+x^2) \le x^2 \le x. \end{aligned}$$

From this we obtain

$$\begin{aligned}\alpha +\beta <\alpha +\beta _o = \alpha +I(\alpha ) \le 2\alpha \le I(1-2\varrho ) \le 1-2\varrho . \end{aligned}$$

This proves that \(1-2\varrho -\alpha +I(\beta )>\beta +I(\beta )\) and also settles the claim that r(k) is increasing (see above). Now take \(\gamma \) such that \(1-2\varrho -\varepsilon<\gamma <1-2\varrho -\frac{\varepsilon }{2}\), with \(\varepsilon > 0\) as chosen above. Then

$$\begin{aligned}1-2\varrho -\varepsilon<\gamma <1-2\varrho -\alpha +I(\beta ). \end{aligned}$$

For N sufficiently large we get:

$$\begin{aligned}r( 2p(M-1)) \le {\mathbb {P}}\left( \exp (\langle \xi ^2 | \widetilde{\xi }^1 \rangle ) > e^{\gamma N}\right) . \end{aligned}$$

By applying (7) and for N sufficiently large one sees that:

$$\begin{aligned} r( 2p(M-1)) \le {\mathbb {P}}\left( \exp (\langle \xi ^2 | \widetilde{\xi }^1 \rangle ) > e^{\gamma N}\right) \le \exp \left( -N I(\gamma )\right) . \end{aligned}$$

Now if we bring everything together, we finally arrive at:

$$\begin{aligned}&{\mathbb {P}}\Bigl (\exists \mu \; \exists i : T_i\bigl (\widetilde{\xi }^\mu \bigr ) \ne \xi ^\mu _i \Bigr ) \\&\quad \le N \cdot M \cdot {\mathbb {P}}\Bigl (\sum \limits _{\mu =2}^M \exp \bigl ( -\xi ^\mu _i \xi ^1_i + \sum \limits _{j\ne i} \xi ^\mu _j \widetilde{\xi }^1_j \bigr ) - \exp \bigl ( \xi ^\mu _i \xi ^1_i + \sum \limits _{j\ne i} \xi ^\mu _j \widetilde{\xi }^1_j \bigr ) > ce^{N(1-2\varrho )} \Bigr )\\&\quad \le N\cdot M\Bigl ( 2(M-1)\exp \bigl ( -NI(\gamma )\bigr ) + (M-1) \exp \bigl ( -\frac{p(M-2)}{12} \bigr )\Bigr )\\&\quad \le 2N\frac{M}{M-1}\exp \left( -N(I(\gamma )-2\alpha )\right) +N\frac{M}{M-1}\exp \Bigl ( 2\alpha N -\frac{M-2}{M-1}\cdot \frac{p(M-1)}{12}\Bigr ). \end{aligned}$$

But by the definition of \(\gamma \), we have \(I(\gamma )>2\alpha \). So the first term clearly tends to 0. For the second term, by using the lower bound on p, we have that:

$$\begin{aligned} p(M-1)\ge \exp ( N(\alpha -I(\beta )-\eta )). \end{aligned}$$

But we know that \(I(\beta )<I(\beta _o)=\alpha \) and \(\eta = \frac{1}{2} (\alpha - I(\beta ))\), so \(\alpha -I(\beta )-\eta >0\). Therefore the second term also converges to 0. A straightforward consequence of this bound is that \(p(M-1)\) goes to infinity, which establishes the still open fact that \(\lfloor 2p(M-1)\rfloor > \frac{3}{2}p(M-2)\) used above. So if the condition \(\alpha < \frac{I(1-2\varrho )}{2}\) is fulfilled, we obtain

$$\begin{aligned} {\mathbb {P}}\left( \exists \mu \; \exists i : T_i\left( \widetilde{\xi }^\mu \right) \ne \xi ^\mu _i \right) \rightarrow 0 \end{aligned}$$

This finishes the proof.\(\square \)