1 Introduction

In cryptography, outputs of block ciphers, stream ciphers, hash functions, and key generators should be indistinguishable from the outputs of random generators. To decide whether a generator is distinguishable from a random generator or not, a common approach is to compare certain statistics of the output of the generator with the theoretical expected values computed for the sample space consisting of all strings of the same length.

As in the case of stream ciphers, when the outputs of a given generator are long strings (106 or more terms), decision on randomness is based on a sample output (key stream). Under such circumstances where one deals with long strings, asymptotic computations and approximate values of theoretical constants can be employed safely.

On the other hand, outputs of block ciphers, hash functions, and key generators are in general short binary strings (128 to 4096 bits) and statistical randomness tests of such generators are based on a large collection of their output strings. By concatenating all the strings in that collection, one obtains a long string which can be evaluated as if it were the output of a stream cipher. However, a more natural approach is to consider each output string individually and then to evaluate the entire collection. In this case, since we deal with relatively short strings, it is not safe to employ approximations or asymptotic formulations. Main difficulty in this approach is not only to obtain the probability distribution functions, but also to express these functions in such a form that their evaluation is feasible.

In the present article, we design a family of statistical randomness tests for a collection of short binary strings by defining a number of random variables and deriving recursive formulae with computationally feasible probability distribution functions.

Whether a given string is “like” a random string is conventionally decided by means of Golomb’s pseudorandomness postulates R-1, R-2 and R-3 [1], which we now state for the sake of completeness albeit the first and the last will be beyond the scope of this work:

  1. R-1

    The difference between the number of 1’s and the number of 0’s should be at most one.

  2. R-2

    The number of runs of length one should be at least half of the total number of runs, while the number of runs of length two should be at least one-fourth of the total number of runs, and the number of runs of length three should be at least one-eighth of the total number of runs and so on. Moreover the number of runs of ones and the number of runs of zeros should not differ more than one for each fixed run length.

  3. R-3

    The autocorrelation function should be two-valued.

The first postulate considers the frequencies of ones and zeros. The second one is about the number of runs in a string, where a run is defined as an uninterrupted maximal strings of identical bits. Lastly, the third postulate reveals information about the amount of similarities between the string and its shifted versions. A two-valued autocorrelation function is required to assure that the amount of mentioned similarity remains constant for all shifted versions of the string.

Almost all test suites include a test function based on R-1, under the name frequency test or weight test, (see for instance [2, 3, 6, 7]). Similarly, most of the test suites include a test considering certain properties of runs, however none of these tests correspond R-2 completely, [3, 6, 7]. Up to our knowledge, a test function for R-3 is not included in any test suite. The main purpose of this study is to design a family of randomness tests that depend on the R-2 requirements.

In the context of runs, randomness test suites generally consider the following quantities:

  • the number of all runs,

  • length of the longest run,

  • the number of runs of a specified length.

The distribution function related with the number of all runs involves considerably straightforward computations and it is included in almost all test suites. For instance, NIST test suite has “Runs Test”, TESTU01 has “Run and gap tests”, see [3, 7] respectively. Similarly, length of the longest run test appears in both NIST test suite and TESTU01.

However, obtaining a feasible formulation of the distribution function related with the number of runs with a specified length is a difficult task (especially when specified length is enlarged). Mood obtained the distribution of runs of any length using a combinatorial approach [4]. Later, following similar methods, Doğanaksoy et al. designed randomness tests based on number of runs of lengths one, two and three, [8]. These explicit formulations heavily depend on binomial coefficients and hence as the length grows, calculations become more complex, and the time required for these calculations grows drastically. As a result, these tests are limited to strings of length 256 bits (for runs of length three) and 512 bits (for runs of length at most two).

By making use of Markov chains and transition matrices, Fu and Koutras proposed the statistics of runs of any length and also the length of the longest run [5]. Their work considers the general case where the probability of occurrences of 1’s and 0’s are not necessarily equal. On the other hand, this method naturally depends on computations performed on a transition matrix dimension of which gets very large depending on the lengths of strings and runs (for instance, for the strings of length 4096 and runs of length 2, dimension of the required transition matrix is 5459). This method is particularly useful for relatively short sequences.

In this work, we use an alternative method which depend on generating functions to obtain simple recursive formulation of related distribution functions. More specifically, we propose a family of test functions depending on the following probabilities:

  • the probability that a random string of length n has r runs,

  • the probability that a random string of length n has exactly k runs of length a,

  • the probability that a random string of length n has exactly k runs of length at least a,

  • the probability that a random string of length n has no runs of length longer than L,

where \(n, r, a, L\) are positive integers and k is a non-negative integer. Rather than complex explicit expressions, we focus specifically on getting simple recursive relations for the probability functions mentioned above. Different from the previous approaches which use binomial coefficients or transition matrices, all the relations we obtain are linear, thus have low complexities. Employing these relations, we obtained exact probabilities for sufficiently (from cryptographical point of view) long strings such as \(2^{14}\) bits.

The organization of this paper is as follows. In Section 2, compositions and certain restrictions on them are defined, and a correspondence between their numbers and the number of runs is explained. The notations about the number of compositions of certain types and their generating functions are also introduced in this section. In Section 3, generating functions for the number of all compositions of n with exactly r parts and for the number of all compositions of n and finally for the number of all compositions of n whose parts are from a predefined set \({\Gamma }\) are computed. Corresponding generating functions are given in examples for some specific choices of \({\Gamma }\). The generating functions for the last three restrictions in Table 1 are given in Theorems 1, 2, and 3. The results obtained in this sections are listed in the Table 2. In Section 4, an alternative method to compute recursions defining the number of certain compositions are introduced. Theorems 4, 5, and 6 are the ones analogous to the theorems given in Section 3. In Theorem 7, the recursion to calculate the number of compositions of n without parts exceeding L is derived. In Section 5, basic probabilities on which our tests depend are introduced in terms of compositions, and generating functions. Moreover these probability values are derived and then recursions for these probability values are obtained. Theorems 8, 9, 10, and 11 are the main results, and they form the basis of the new randomness tests. In Section 6, expected values and variances for the random variables are computed and they may be used to define new randomness tests. In Section 7.1 we defined the outline of the proposed randomness test. We give a pseudo code for the Algorithm 2 of R-2 Composition Test and an example to evaluate the outputs of AES algorithm with respect to the randomness. We have illustrated the p-values obtained for different types of runs in the Table 5. Section 8 contains several applications of the proposed randomness test to different type of binary string collections. Finally, in Section 9, we give a conclusion and a future work plan.

Table 1 Notations for the restricted compositions of the positive integer n
Table 2 Generating functions for the number of restricted compositions of the positive integer \(n \)

2 Compositions and restricted compositions

Definition 1

A composition of a positive integer n is an ordered array of one or more positive integers whose sum is n.

For example, the number 4 has 8 different compositions:

$$(4) (3,1) (1,3) (2,2) (2,1,1) (1,2,1) (1,1,2) (1,1,1,1) $$

Each term in a composition is called a part. As shown below, the number of compositions of a positive integer n is \( c(n)= 2^{n-1}\) and \(c(0)= 1\) by convention [9].

Let the numbers \(l_{1},{\ldots } ,l_{r}\) denote the lengths of the runs of a given binary string of length n. Obviously, the sum of these lengths is equal to the length of the string, thus the ordered array \((l_{1},{\ldots } ,l_{r})\) is a composition of n. The lengths of runs of a binary string of length n corresponds to a unique composition of n. Conversely, to each composition of n, there corresponds two strings (one starting with a one; the other starting with a zero) so that lengths of runs are equal to the corresponding parts of the composition. Due to this \(2-1\) correspondence, any restriction or any problem on the compositions can be visualized as a restriction or a problem on the number of runs of a binary string.

There are numerous studies on compositions with certain restrictions on the parts such as compositions with no occurrence of k (by Chinn et al. [10]), compositions whose parts are from a predefined set (Heubach et al. [11]), compositions with distinct parts (Richmond et al. [12]), compositions with no parts exceeding L (M. E. Malandro [13] and so on [14, 15]. Almost all studies result in recursive relations; Jaklic et al. [16] obtained some results on closed form formulas of restricted compositions.

We now list our notation for the numbers of compositions under certain restrictions:

  • \(c(n)\): number of all compositions of n,

  • \(c(n,r)\): number of compositions of n with r parts,

  • \(c_{\Gamma }(n)\): number of compositions of n with all parts in a given set \({\Gamma }\subset Z\),

  • \(c_{L}(n)\): number of compositions of n with no part larger than L,

  • \(c_{a}(n,r,k)\): number of compositions of n with r parts and k of these parts equal to a,

  • \(c_{a}(n,k)\): number of compositions of n with k parts equal to a,

  • \(c_{\bar {a}}(n,k)\): number of compositions of n with k parts larger than or equal to a.

Table 1 below summarizes these notations and introduces the corresponding generating functions.

First two cases in the table are basic cases and can be found in many text books on combinatorics or discrete mathematics (for instance, Lint [9]). \(c_{\Gamma }(n)\) is given in [11] and \(c_{a}(n,k)\) is studied in [10] for only \( k = 0\).

The main contribution of this paper is the introduction of recursions for \(c_{a}(n,k,r) \), \(c_{a}(n,k)\) and \(c_{\bar {a}}(n,k)\) for \(k>0\) and giving the related generating functions.

The convention \(c(0)= 1\) leads to the conventions \( c(0,0)= c_{A}(0)=c_{a}(0,0,0)=c_{a}(0,0)=c_{\bar {a}}(0,0)= 1\). In addition to these conventions, in order to introduce some ease in computations, we extend the domain of all counting functions (mentioned in the list) to all nonnegative integers by setting the value as 0 whenever variables are not in the natural domain of the function. For instance, we assume that \(c_{a}(n,k)= 0\) when \(n<ka\).

3 Generating functions

There are several ways to approach the problem of finding the number of compositions [16]. We firstly concentrate on generating functions. Recall that, for positive integers \(n,r\), to find the number of nonnegative integer solutions of the equation \(x_{1}+x_{2}+{\cdots } +x_{r}=n\), after associating the formal sum \(1+z+z^{2}+{\cdots } \) with each summand (xi), it is sufficient to compute the coefficient of \(z^{n}\) in \( (1+z+z^{2}+{\cdots } )^{r}\). Any restriction on summands has a natural reflection to the corresponding factors. For instance, the number of solutions of \(x_{1}+x_{2}+{\cdots } +x_{8}= 30\) subject to the condition \( 3\le x_{i}\le 10\) is given by the coefficient of \(z^{30}\) in \((z^{3}+z^{4}+{\cdots } +z^{10})^{8}\). Similarly, coefficient of \(z^{30} \) in \((1+z^{2}+z^{4}+{\cdots } )^{4}(z+z^{3}+z^{5}+{\cdots } )^{4}\) is the number of solutions of the same equation, if x1,x2,x3 and x4 are to be even and the others to be odd.

If we return to the compositions again, we see that \(c(n,r)\), the number of compositions of the positive integer n with \(r\ge 1\) parts is the number of positive integer solutions of the equation

$$ x_{1}+x_{2}+{\cdots} +x_{r}=n $$
(1)

and this number is given by the coefficient of \(\displaystyle z^{n}\) in \(\displaystyle (z+z^{2}+{\cdots } )^{r}=\left (\frac {z}{1-z}\right )^{r}\) or equivalently, \(c(n,r)\) is the coefficient of znxr in \( \displaystyle 1+\sum \limits _{r = 1}^{\infty }{\left (\frac {z}{1-z}\right )^{r}x^{r}}=\frac {1}{1-\left (\frac {z}{1-z}\right )x}\) which means

$$C(z,x)=\frac{1-z}{1-z-zx}. $$

On the other hand, as \( \displaystyle \frac {z^{r}}{(1-z)^{r}}=z^{r}\sum \limits _{i = 0 }^{\infty }{\binom {-r}{i}(-z)^{i}}= z^{r}\sum \limits _{i = 0 }^{\infty }{\binom {r+i-1}{r-1}z^{i}}\) by taking \(i=n-r,\) the coefficient of \(z^{n}\) in \( (z+z^{2}+{\cdots } )^{r}\) can be obtained as

$$c(n,r)=\binom{n-1}{r-1}. $$

It follows that the recursion satisfied by \(c(n,r)\) is the Pascal’s identity

$$c(n,r)=c(n-1,r)+c(n-1,r-1). $$

By definition, for \(n \geq 1\) we have \( \displaystyle c(n)=\sum \limits _{r = 1 }^{n}{c(n,r)}\). In other words, including the case for \(n = 0\), \(c(n) \) is the sum of coefficients of \(z^{n}\) in \( \displaystyle 1,\frac {z}{1-z} , \frac {z^{2}}{(1-z)^{2}}\), \({\ldots } \) or equivalently, \( c(n)\) is the coefficient of \(z^{n}\) in \( \displaystyle 1+\sum \limits _{r = 1}^{\infty }{(\frac {z}{1-z})^{r}}=\frac {1}{1-\frac {z}{1-z}}=\frac {1-z}{1-2z}\). It follows that

$$C(z)=\frac{1-z}{1-2z}. $$

Since

$$\begin{array}{@{}rcl@{}} C(z)&=&\frac{1-z}{1-2z} = \frac{1}{1-2z}-\frac{z}{1-2z}= \sum\limits_{n = 0}^{\infty}{2^{n}z^{n}} - \sum\limits_{n = 0}^{\infty}{2^{n}z^{n + 1}}= 1+\sum\limits_{n = 1}^{\infty}{2^{n-1}z^{n}}\\ &=&1+\sum\limits_{n = 1}^{\infty}{c(n)z^{n}}= \sum\limits_{n = 0}^{\infty}{c(n)z^{n}} \end{array} $$

for \(n\ge 1\) we get

$$c(n)= 2^{n-1}. $$

In (1), if we require \( x_{i} \in {\Gamma } , i = 1,{\ldots } ,r\) for some subset \( {\Gamma } \) of nonnegative integers then \({\sum }_{\gamma \in {\Gamma } }{z^{\gamma } }\) is the factor corresponding to each summand. For simplicity, we will denote the sum \({\sum }_{\gamma \in {\Gamma } }{z^{\gamma } }\) by \( z^{{\Gamma } }\). Consequently, the number of compositions with r parts and all parts in \( {\Gamma } \), is the coefficient of \(z^{n}\) in(zΓ)r (for more details see [11]). Then it follows that, the number of compositions with all parts in \( {\Gamma } \), namely \( c_{\Gamma }(n)\), is the coefficient of \(z^{n}\) in \({\sum }_{r = 1}^{\infty }{(z^{{\Gamma } })^{r}}\) so that

$$C_{{\Gamma} }(z)= 1+\sum\limits_{r = 1}^{\infty }{(z^{{\Gamma} })^{r}}= \frac{1}{1-z^{{\Gamma} }}. $$

For the following examples recall that the Fibonacci string \({f_{n}}\) is defined by the recursion \(f_{n}=f_{n-1}+f_{n-2}\) subject to the initial conditions \( f_{0}= 0, f_{1}= 1. \) Generating function of this string is \( \displaystyle f(z)=\frac {z}{1-z-z^{2}}\).

Example 1

\(\displaystyle {\Gamma } = \{ 1,2 \}:\) For the number of compositions of n with each part equal to either 1 or 2 we have \(z^{{\Gamma } }=z+z^{2}\) and generating function is

$$\frac{1}{1-z-z^{2}}-1.$$

Hence the number of such compositions is the Fibonacci number \(f_{n + 1}\).

Example 2

\(\displaystyle {\Gamma } = \mathbb {Z}-\{ 1 \}:\) the number of compositions of n with no part equal to 1, we have \( \displaystyle z^{{\Gamma } }=z^{2}+z^{3}+\cdots =\frac {z^{2}}{1-z}\) and generating function is

$$\frac{1}{1-\left( \frac{z^{2}}{1-z}\right)}-1=\frac{1-z}{1-z-z^{2}}-1.$$

It follows that the number of compositions of n without 1 is the Fibonacci number \(f_{n-1}\).

Example 3

\(\displaystyle {\Gamma } = \mathbb {Z}-2\mathbb {Z}:\) For the number of compositions of n with all parts odd we have \( \displaystyle z^{{\Gamma } }=z+z^{3}+z^{5}+{\cdots } =\frac {z}{1-z^{2}}\) and generating function is

$$\frac{z}{1-z-z^{2}}. $$

Hence the number of such compositions is the Fibonacci number \(f_{n}\).

Example 4

\(\displaystyle {\Gamma } = \mathbb {Z}-\{ a \}:\) For the number of compositions of n without any occurrence of a, \( \displaystyle z^{{\Gamma } }=\sum \limits _{i = 1}^{\infty }{z^{i}-z^{a}}=\frac {z}{1-z}-z^{a} \) so that the associated generating function is

$$\frac{1-z}{1-2z+z^{a}-z^{a + 1}}. $$

Example 5

\(\displaystyle {\Gamma } = \{ 0,1,\ldots , L \}:\) For the number of compositions of n with no part exceeding L, \( \displaystyle z^{\Gamma }=\sum \limits _{i = 1}^{L}{z^{i}}=\frac {z(1-z^{L})}{1-z} \) so that the associated generating function is

$$C_{L}(z)=\frac{1-z}{1-2z+z^{L + 1}}. $$

We give the generating functions for the last three restrictions in Table 1, namely generating functions of \( c_{a}(n,k,r)\), of \(c_{a}(n,k)\) and of \( c_{\bar {a}}(n,k)\) in the following three theorems.

Theorem 1

If a is a fixed positive integer then

$$C_{a}(z,y,x)=\frac{1-z}{1-z(x + 1)+z^{a}x(1-z)(1-y)}. $$

Proof

ca(n,k,r) is the number of positive solutions of the equation \(x_{1}+x_{2}+{\cdots } +x_{r}=n\) such that \( x_{i_{1}}={\cdots } =x_{i_{k}}=a \) for some {i1,…,ik}⊂{1,…,r} and xia for i∉{i1,…,ik} . The subset {i1,…,ik} can be chosen in \( \displaystyle \binom {r}{k}\) distinct ways and once this set is determined, the number of solutions of the equation is given by the coefficient of \(z^{n}\) in

$$(z^{a})^{k}\left( (z+z^{2}+\cdots )-z^{a}\right)^{r-k}=(z^{a})^{k} \left( \frac{z}{1-z}-z^{a} \right)^{r-k}. $$

It follows that \(c_{a}(n,k,r)\) is the coefficient of \(z^{n}y^{k}x^{r} \) in

$$ C_{a}(z,y,x)= 1+\sum\limits_{k = 0}^{\infty }{\sum\limits_{r = 1}^{\infty}{\binom{r}{k}z^{ka}U^{r-k}y^{k}x^{r}}} $$
(2)

where \(\displaystyle U=\frac {z}{1-z}-z^{a}\). The double sum on the right hand side can be separated for the cases \(k = 0\) and \(k = 1,2,{\ldots } \) to obtain

$$\begin{array}{@{}rcl@{}} C_{a}(z,y,x)&=&1+\sum\limits_{r = 1}^{\infty }{U^{r}x^{r}}+\sum\limits_{k = 1}^{\infty }{z^{ka}y^{k}x^{k}\sum\limits_{r = 1}^{\infty }{\binom{r}{k}U^{r-k}x^{r-k}}}\\ &=&\frac{1}{1-Ux}+ \sum\limits_{k = 1}^{\infty }{z^{ka}y^{k}x^{k}\sum\limits_{r = 1}^{\infty }{\binom{r}{k}(Ux)^{r-k}}}\\ &=&\frac{1}{1-Ux}+\frac{1}{1-Ux}\sum\limits_{k = 1}^{\infty }{\left( \frac{z^{a}yx}{1-Ux} \right)^{k} }\\ &=&\frac{1}{1-Ux-z^{a}yx} \end{array} $$

and finally by substituting \(\displaystyle U=\frac {z}{1-z}+z^{a}\), desired expression can be obtained. □

Theorem 2

If a is a fixed positive integer then

$$C_{a}(z,y)=\frac{1-z}{1-2z+z^{a}(1-z)(1-y)}. $$

Proof

From the proof of Theorem 1, we see that \(c_{a}(n,k)\) is the coefficient of \(z^{n}\) in \( \displaystyle \sum \limits _{r = 1}^{\infty }{\binom {r}{k}(z^{a})^{k}U^{r-k}} \) . It follows that ca(n,k) is the coefficient of \(z^{n}y^{k}\) in

$$1+\sum\limits_{k = 0}^{\infty }{\sum\limits_{r = 1}^{\infty }{\binom{r}{k}z^{ka}U^{r-k}y^{k}}}. $$

From (2), we observe that this expression is in fact \( C_{a}(z,y,1)\), thus \( C_{a}(z,y)=C_{a}(z,y,1)\). □

Note that, the coefficient of \(y^{0}\) in \( C_{a}(z,y)\) is \( \displaystyle 1+\sum \limits _{r = 1}^{\infty }{U^{r}}=\frac {1}{1-U}=\frac {1-z}{1-z+z^{a}-z^{a + 1}} \). This expression is the generating function for the number of strings which do not contain \( a\), as we have previously obtained in Example 4.

Theorem 3

If a is a fixed positive integer then

$$C_{\bar{a}}(z,y)=\frac{1-z}{1-2z+z^{a}(1-y)}. $$

Proof

The number of positive solutions of the equation \( x_{1}+x_{2}+{\cdots } +x_{r}=n \) such that \(x_{i_{1}}={\cdots } =x_{i_{k}}\ge a \) for some \(\lbrace i_{1},{\ldots } ,i_{k}\rbrace \subset \{1,{\ldots } ,r \}\) and \( x_{i}<a \) for \(i\notin \{i_{1},{\ldots } ,i_{k} \}\) is given by the coefficient of zn in

$$\binom{r}{k}(z^{a}+z^{a + 1}+{\cdots} )^{k}(z+z^{2}+\cdots +z^{a-1})^{r-k}=\!\binom{r}{k}\left( \frac{z^{a}}{1-z}\right)^{k}z^{r-k}\left( \frac{1-z^{a-1}}{1-z}\right)^{r-k}. $$

Then \(c_{\bar {a}}(n,k)\) is the coefficient of \(z^{n} \) in \( \displaystyle \sum \limits _{r = 1}^{\infty }{\binom {r}{k}\left (\frac {z^{a}}{1-z}\right )^{k}z^{r-k}U^{r-k}y^{k}} \) where \(\displaystyle U=\frac {1-z^{a-1}}{1-z} \). But this means that

$$C_{\bar{a}}(z,y)= 1+\sum\limits_{k = 0}^{\infty }{\sum\limits_{r = 1}^{\infty }{\binom{r}{k}\left( \frac{z^{a}}{1-z}\right)^{k}U^{r-k}z^{r-k}y^{k}}} $$

The double sum on the right hand side can be separated for the cases \( k = 0 \) and \(k = 1,2,{\ldots } \) to obtain

$$\begin{array}{@{}rcl@{}} C_{\bar{a}}(z,y)&=&1+\sum\limits_{r = 1}^{\infty }{U^{r}z^{r}}+\sum\limits_{k = 1}^{\infty }{\left( \frac{z^{a}}{1-z}\right)^{k}y^{k}\sum\limits_{r = 1}^{\infty }{\binom{r}{k}U^{r-k}z^{r-k}}}\\ &=&\frac{1}{1-Uz}+\sum\limits_{k = 1}^{\infty }{ \left( \frac{z^{a}y}{1-z}\right)^{k}} \sum\limits_{r = 1}^{\infty }{\binom{r}{k}\left( Uz \right)^{r-k}}\\ &=&\frac{1}{1-Uz}+ \frac{1}{1-Uz} \sum\limits_{k = 1}^{\infty }{ \left( \frac{z^{a}y}{(1-z)(1-Uz)}\right)^{k}} \\ &=&\frac{1-z}{(1-z)(1-Uz)-z^{a}y} \end{array} $$

and substituting \((1-z)U = 1-z^{a-1}\) we obtain \( \displaystyle C_{\bar {a}}(z,y)=\frac {1-z}{1-2z+z^{a}(1-y)} \). □

We summarize the results in the following table on the number of restricted compositions and generating functions.

All the generating functions we have obtained are rational functions, hence the strings which they correspond satisfy constant coefficient homogeneous recursive relations. Moreover, these relations can be obtained immediately from the denominators of the generating functions. However, we can obtain these recursions by employing some elementary and straightforward counting arguments.

4 Recursions

When we delete the first part, say \( a<n \), of a composition of \(n \) we obtain a composition of \(n-a \). Thus, by deleting the first parts of all compositions of n we obtain the complete list of all compositions of integers \( 1,2,{\ldots } ,n-1 \).

For instance consider the compositions \(4: \)

$$(4) \ (3,1) \ (1,3) \ (2,2) \ (2,1,1) \ (1,2,1) \ (1,1,2) \ (1,1,1,1). $$

If the first part of the composition \((2,1,1) \) of \(4 \) is deleted, we obtain the composition \( (1,1) \) of \(2 \). By deleting the first parts of all compositions of 4 we obtain the following list

figure a

Besides the compositions of \( 1,2,{\ldots } ,n-1 \), this list contains one more object, namely the empty composition which is obtained when the first part of composition \( (n) \) is deleted. We may visualize the empty composition to be the unique composition of \( 0 \) as an interpretation of \( c(0)= 1 \). In a similar manner, by appending the positive integer \( a \), as the first part, to any composition of \( n-a \), we obtain a unique composition of \( n \). Therefore the set of all compositions of \( n \) is equivalent to the set of all compositions of all integers less than \(n, \) including 0, hence for any integer \(n\ge 1 \)

$$ c(n)=c(n-1)+{\cdots} +c(1)+c(0)=\sum\limits_{i = 1}^{n}{c(n-i)}. $$
(3)

Theorem 4

The string\(\lbrace c(n)\rbrace _{n = 0}^{\infty } \)isdetermined with the initial conditions\(c(0)=c(1)= 1\)andthe recurrence relationc(n) = 2c(n − 1) forall integers\( n\ge 2. \)

Proof

By convention we have \(c(0)= 1 \) and thus \(c(1)= 1 \) is obvious. We can write the recursion (3) as

$$c(n-1)=\sum\limits_{i = 1}^{n-1}{c(n-1-i)}=\sum\limits_{i = 2}^{n}{c(n-i)}=\left( \sum\limits_{i = 1}^{n}{c(n-i)} \right) -c(n-1) $$

for \(n\ge 2 \). Comparing this expression to (3), one obtains \( c(n)= 2c(n-1) \) for \(n\ge 2\). □

Notice that, since the generating function of \(c(n) \) is \(\displaystyle C(z)=\sum {c(n)z^{n}}=\frac {1-z}{1-2z} \), from it’s denominator one can read the recursion for \(c(n) \) as \(c(n)= 2c(n-1) \) and this agrees with the recursion obtained above.

A recursion for \(c_{a}(n,k) \) can be obtained in a way analogous to the one used in obtaining (3). First consider the case \( k = 0 \). By deleting the first part of each composition of n which has no part equal to \( a \), we obtain such compositions of all integers less than \( n \), except \( n-a \). Thus, the recursion for \(c_{a}(n,0) \) differs from the recursion for c(a) only by the summand c(na,0), that is

$$ c_{a}(n,0)=\sum\limits_{i = 1}^{n}{c_{a}(n-i,0)}-c_{a}(n-a,0). $$
(4)

Delete the first part of a composition of \(n \) which has \(k\geq 1 \) parts equal to \( a\). If the deleted part is equal to \( a \), then the remaining parts constitute a composition of \(n-a\) with \(k-1 \) parts equal to \( a\). If the deleted term is equal to \( i\) (ia), then the remaining parts form a composition of \(n-i\) with k parts equal to \( a\). Then we can write

$$ c_{a}(n,k)=\sum\limits_{i = 1}^{n}{c_{a}(n-i,k)}-c_{a}(n-a,k)+c_{a}(n-a,k-1). $$
(5)

Theorem 5

Let a be a positive integer. The string \(\lbrace c_{a}(n,k)\rbrace _{n = 0}^{\infty }\) is determined with the initial conditions;

  • for \(k = 0\)

    $$c_{a}(n,0)= \left\{ \begin{array}{llll} 1 & if & n = 0 & \\ 2^{n-1} & if & 1 \leq n < a & \\ 2^{n-1}-1 & if & n=a & \end{array} \right. $$
  • for \(k \geq 1\)

    $$c_{a}(n,k)= \left\{ \begin{array}{llll} 0 & if & n\le ka-1 & \\ 1 & if & n=ka & \end{array} \right. $$

and the recurrence relations are

  • for \(n\ge a + 1:\)

    $$c_{a}(n,0)= 2c_{a}(n-1,0)-c_{a}(n-a,0)+c_{a}(n-1-a,0) $$
    (6)
  • for \(k\ge 1\) and \(n\ge ka+ 1:\)

    $$ c_{a}(n,k)= 2c_{a}(n-1,k)-c_{a}(n-a,k)+c_{a}(n-a-1,k)+c_{a}(n-a,k-1)-c_{a}(n-a-1,k-1). $$
    (7)

Proof

When \( k = 0\), we have \(c_{a}(0,0)= 1\) by convention. If \( n<a\), then no composition of \( n\) contains a as a part, so \(c_{a}(n,0)= 2^{n-1}\). For \(n=a\), only one composition of a contains \( a\), thus \( c_{a}(a,0)= 2^{n-1}-1\).

When \( k\ge 1\), then \(c(0,k)= 0\) by convention. If \( n<ka\), then no composition of \( n\) can contain k parts equal to \( a\), so \( c_{a}(n,k)= 0\) for \(n<ka\). Only one composition of n = ka consists of k parts, each equal to \( a\), thus \(c_{ka}(a,k)= 1\).

For \(n\ge a + 1, \) recursion (4) can be written as

$$\begin{array}{@{}rcl@{}} c_{a}(n-1,0) &=& \sum\limits_{i = 1}^{n-1}{c_{a}(n-1-i,0)}-c_{a}(n-1-a,0)\\ &=& \sum\limits_{i = 2}^{n}{c_{a}(n-i,0)}-c_{a}(n-1-a,0)\\ &=& \sum\limits_{i = 1}^{n}{c_{a}(n-i,0)}-c_{a}(n-1,0)-c_{a}(n-1-a,0). \end{array} $$

Comparing this expression to (4), we obtain (6).

In a similar manner, for \(n\ge ka+ 1\) we write the recursion (5) as

$$\begin{array}{@{}rcl@{}} c_{a}(n-1,k) &=&\sum\limits_{i = 1}^{n-1}{c_{a}(n-1-i,k)-c_{a}(n-1-a,k)+c_{a}(n-1-a,k-1)}\\ &=&\sum\limits_{i = 2}^{n}{c_{a}(n-i,k)-c_{a}(n-1-a,k)+c_{a}(n-1-a,k-1)}\\ &=&\sum\limits_{i = 1}^{n}{c_{a}(n-i,k)-c_{a}(n - 1,k) +c_{a}(n - 1-a,k - 1) +c_{a}(n - 1-a,k - 1)}. \end{array} $$

Comparing this expression to (5), we obtain (7). □

Notice that, since the generating function for \(c_{a}(n,k)\) is \(\displaystyle C_{a}(z,y)=\sum {c_{a}(n,k)z^{n}y^{k}}=\frac {1-z}{1-2z+z^{a}(1-z)(1-y)} = \frac {1-z}{1-2z+z^{a}-z^{a + 1}-z^{a}y+z^{a + 1}y } \) , from it’s denominator one can read the recursion as ca(n,k) = 2ca(n− 1,k)−ca(na,k) + ca(na− 1,k) + ca(na,k− 1)−ca(na− 1,k− 1) and this agrees with recursion we obtained above.

Finally, for \(c_{\bar {a}}(n,k) \) consider the case \( k = 0\). By deleting the first part of each composition of n which has all parts less than a, we obtain such compositions of all integers \(n-1, n-2, {\cdots } , n-a + 1\). Thus, the recursion for \(c_{\bar {a}}(n,0)\) differs from the recursion for \(c(a)\) only by the summand \( c_{\bar {a}}(n-a,0)\), that is

$$ c_{\bar{a}}(n,0)=\sum\limits_{i = 1}^{a-1}{c_{\bar{a}}(n-i,0)}. $$
(8)

Delete the first part of a composition of \(n \) which has \(k>1\) parts equal to \( a\). If the deleted part is \(i\ge a\), then the remaining parts constitute a composition of \(n-i\) with k − 1 parts equal to or larger than \( a\). If the deleted term is \(i<a\), then the remaining parts form a composition of \(n-i\) with k parts equal to or larger than \( a\). Then we can write

$$ c_{\bar{a}}(n,k)=\sum\limits_{i = 1}^{a-1}{c_{\bar{a}}(n-i,k)}+\sum\limits_{i=a}^{n}{c_{\bar{a}}(n-i,k-1)}. $$
(9)

Theorem 6

Leta be a positive integer. Thestring\(\lbrace c_{\bar {a}}(n,k)\rbrace _{n = 0}^{\infty }\)isdetermined with the initial conditions:

  • for \(k = 0\)

    $$c_{\bar{a}}(n,0)=\left\{ \begin{array}{llll} 1 & if & n = 0 & \\ 2^{n-1} & if & a>1 \text{ and } 1\le n\le a-1 & \\ 2^{a-1}-1 & if & n=a & \end{array} \right. $$
  • for \(k\geq 1\)

    $$c_{\bar{a}}(n,k)=\left\{ \begin{array}{llll} 0 & if & n\le ka-1 & \\ 1 & if & n=ka & \end{array} \right. $$

and the recurrence relations

$$ c_{\bar{a}}(n,0)= 2c_{\bar{a}}(n-1,0)-c_{\bar{a}}(n-a,0) $$
(10)

and

$$ c_{\bar{a}}(n,k)= 2c_{\bar{a}}(n-1,k)-c_{\bar{a}}(n-a,k)+c_{\bar{a}}(n-a,k-1) $$
(11)

for\(k\ge 1\)and\(n\ge ka+ 1\).

Proof

Initial conditions are same with the ones given in Theorem 5. For \(n\ge a + 1, \) recursion (8) can be written as \(c_{\bar {a}}(n-1,0)={\sum }_{i = 1}^{a-1}{c_{\bar {a}}(n-1-i,0)}\).

Comparing this expression to (8), we obtain (10).

In a similar manner, for \(n\ge ka+ 1\) we write the recursion (9) as

$$\begin{array}{@{}rcl@{}} c_{\bar{a}}(n-1,k)&=&\sum\limits_{i = 1}^{a-1}{c_{\bar{a}}(n-1-i,k)}+\sum\limits_{i=a}^{n-1}{c_{\bar{a}}(n-1-i,k-1)}\\ &=&\sum\limits_{i = 2}^{a}{c_{\bar{a}}(n-i,k)}+\sum\limits_{i=a + 1}^{n}{c_{\bar{a}}(n-i,k-1)}\\ &=&\sum\limits_{i = 1}^{a-1}{c_{\bar{a}}(n-i,k)}+ c_{\bar{a}}(n-a,k) + \sum\limits_{i=a}^{n}{c_{\bar{a}}(n-i,k-1)} - c_{\bar{a}}(n-a,k-1) \end{array} $$

Comparing this expression to (9), we obtain (11). □

Notice that, since the generating function for \(c_{\bar {a}}(n,k)\) is \(\displaystyle C_{\bar {a}}(z,y)=\sum {c_{\bar {a}}(n,k)z^{n}y^{k}}=\frac {1-z}{1-2z+z^{a}(1-y)}\), from it’s denominator one can read the recursion as \(\displaystyle c_{\bar {a}}(n,k)= 2c_{\bar {a}}(n-1,k)-c_{\bar {a}}(n-a,k)+c_{\bar {a}}(n-a,k-1)\) and this agrees with recursion we obtained above.

Theorem 7

LetL be a fixed positive integer. Thestring\(\lbrace c_{L}(n)\rbrace _{n = 0}^{\infty }\)isdetermined with the initial conditions

$$c_{L}(n)=\left\{ \begin{array}{llll} 1 & if & n = 0 & \\ 2^{n-1} & if & n\le L & \end{array} \right. $$

and the recurrence relation

$$c_{L}(n)= 2c_{L}(n-1)-c_{L}(n-L-1). $$

for\(n\ge L + 1\).

Proof

Call a composition as an L-composition if no part exceeds L. A positive integer n has two type of L-compositions:

  1. Type 1:

    L-Compositions with the first part equal to 1. Each such composition is equivalent to a unique L-composition of \(n-1\). Thus the number of L-compositions of n with the first part equal to 1 is \(c_{L}(n-1)\).

  2. Type 2:

    L-Compositions with the first part larger than 1. If we decrease the first part of such a composition by 1, we obtain a unique \( L\)-composition of \(n-1\). By this way all L-compositions of \( n-1\), except the ones with the first part equal to L, are obtained. But, the set of exceptional compositions is equivalent to the set of L-compositions of \(n-1-L\). Hence the number of Type 2 compositions is \(c_{L}(n-1)-c_{L}(n-L-1)\).

Again notice that, since the generating function for \(c_{L}(n)\) is \(\displaystyle C_{L}(z)=\sum {c_{L}(n)z^{n}}=\frac {1-z}{1-2z+z^{L + 1}}\), from it’s denominator one can read the recursion as \(c_{L}(n)= 2c_{L}(n-1)-c_{L}(n-L-1)\) and this agrees with recursion we obtained above.

5 Restricted compositions statistics

5.1 Generating functions for probability sequences

In this section we compute the basic probabilities on which our tests depend. Let \({\Omega }_{n}\) be the set of binary strings \( \sigma \) of length n. Define the value of nonnegative integer valued random variables \(X, X_{\max }, X_{a}, X_{\bar {a}}\) on \( \sigma \in {\Omega }_{n}\) as

  • \(X(\sigma )=\) number of runs of \(\sigma ,\)

  • \(X_{\max }(\sigma )=\) length of a longest run of \(\sigma ,\)

  • \(X_{a}(\sigma )=\) number of runs of length a of \(\sigma ,\)

  • \(X_{\bar {a}}(\sigma )=\) number of runs of length at least a of σ.

We denote the probability mass functions of these random variables as

  • \(p(n,r)=\textit {probability}(X=r)\),

  • \(p_{L}(n)=\textit {probability}(X_{\max }\le L)\),

  • \(p_{a}(n,r)=\) probability(Xa = r),

  • \(p_{\bar {a}}(n,r)=\) probability\((X_{\bar {a}}=r)\).

After giving the definition of a composition, we have showed that to each binary string of length n there corresponds a unique composition of \( n\) and conversely, to each composition of n there corresponds exactly two binary strings of length 2. Thus, for example, the number of binary strings of length n which have \( r\) runs is twice the number of compositions of n with \(r \) parts. Since the number of all binary strings of length n is \(2^{n}\), we immediately get

$$p(n,r)=\frac{1}{2^{n}} \left( 2c(n,r) \right)= 2^{1-n}c(n,r)$$

and analogously

$$p_{L}(n)= 2^{1-n}c_{L}(n),$$
$$p_{a}(n,k)= 2^{1-n}c_{a}(n,k)$$

and

$$p_{\bar{a}}(n,k)= 2^{1-n}c_{\bar{a}}(n,k).$$

Let n be a fixed nonnegative integer, then the probability generating function \(\displaystyle \sum \limits _{r = 0}^{\infty }{p(n,r)}x^{r}\)of X is the coefficient of \(z^{n} \) in:

$$\begin{array}{@{}rcl@{}} P(z,y)&=&\sum\limits_{n,r = 0}^{\infty }{p(n,r)}z^{n}y^{r} = 2^{1-n}\sum\limits_{n,r = 0}^{\infty }{c(n,r)}z^{n}y^{r}\\ &=&2\sum\limits_{n,r = 0}^{\infty }{c(n,r)}\left( \frac{z}{2}\right)^{n}y^{r} = 2C(z/2,y)\\ &=&\frac{2(2-z)}{2-z-zy}. \end{array} $$

Similarly, probability generating functions of \(X_{a}\) and \( X_{\bar {a}}\) are coefficients of \(z^{n}\) in

$$\begin{array}{@{}rcl@{}} P_{a}(z,y)&=&2C_{a}(z/2,y)=\frac{2-z}{1-z + 2^{-a-1}z^{a}(2-z)(1-y)},\\ P_{\bar{a}}(z,y)&=&2C_{\bar{a}}(z/2,y)=\frac{2-z}{1-z + 2^{-a}z^{a}(1-y)} \end{array} $$

respectively.

5.2 Probability calculations

Theorem 8

The string\(\lbrace p(n,r)\rbrace _{n = 0}^{\infty }\)isdetermined with the initial conditions

$$p(n,r)=\left\{ \begin{array}{llll} \frac{1}{2^{n-1}} & if & r=n & \\ 0 & if & r = 0 \text{ and } n>0 & \\ 0 & if & r>0 \text{ and } r>n & \end{array} \right. $$

and the recursion

$$p(n,r)=\frac{1}{2}(p(n-1,r)+p(n-1,r-1)) $$

for\(n>r>0\).

Proof

We already know that \(\displaystyle c(n,r)=\binom {n-1}{r-1}\). Pascal’s identity \(\displaystyle \binom {n-1}{r-1}=\binom {n-2}{r-1}+\binom {n-2}{r-2}\) leads the desired expression. □

Theorem 9

Let a be a positive integer. The string \(\lbrace p_{a}(n,r)\rbrace _{n = 0}^{\infty }\) is determined with the initial conditions:

for\(r = 0\)

$$p_{a}(n,0)=\left\{ \begin{array}{llll} 1 & if & n \in \{0,\ldots, a-1 \} & \\ 1-2^{1-a} & if & n\in \{a, a + 1\} & \end{array} \right. $$

for \(r \geq 1\)

$$p_{a}(n,r)=\left\{ \begin{array}{llll} 0 & if & n\le ra-1 & \\ 2^{1-n} & if & n\in \{ra,ra+ 1\} & \end{array} \right. $$

and the recurrence relations

  • for \(n\ge a + 2\)

    $$p_{a}(n,0)=p_{a}(n-1,0)-2^{-a}p_{a}(n-a,0)+ 2^{-a-1}p_{a}(n-1-a,0) $$
  • for \(r\ge 1\) and \(n\ge ra+ 2\)

    $$\begin{array}{@{}rcl@{}} p_{a}(n,r)=p_{a}(n-1,r)&-&2^{-a}p_{a}(n-a,r)+ 2^{-a-1}p_{a}(n-a-1,r)\\ &+&2^{-a}p_{a}(n-a,r-1)-2^{-a-1}p_{a}(n-a-1,r-1). \end{array} $$

Proof

By convention, \(p_{a}(0,0)= 1\). Initial conditions can be checked by direct computing. For the other cases, just substitute \(p_{a}(n,r)= 2^{1-n}c_{a}(n,r)\) in Theorem 5. □

Theorem 10

Let a be a positive integer. The string \(\lbrace p_{\bar {a}}(n,r)\rbrace _{n = 0}^{\infty }\) is determined with the initial conditions:

for \(r = 0\)

$$p_{\bar{a}}(n,0)=\left\{ \begin{array}{llll} 1 & if & n\le a-1 & \\ 1-2^{1-n} & if & n=a & \end{array} \right. $$

for \(r>1\)

$$p_{\bar{a}}(n,r)=\left\{ \begin{array}{llll} 0 & if & n\le ra-1 & \\ 2^{1-n} & if & n=ra & \end{array} \right. $$

and the recurrence relations

  • for \(n\ge a + 1\)

    $$p_{\bar{a}}(n,0)=p_{\bar{a}}(n-1,0)-2^{-a}p_{\bar{a}}(n-a,0) $$
  • for \(r\ge 1\) and \(n\ge ra+ 1\)

    $$p_{\bar{a}}(n,r)=p_{\bar{a}}(n-1,r)-2^{-a}p_{\bar{a}}(n-a,r)+ 2^{-a}p_{\bar{a}}(n-a,r-1). $$

Proof

Just substitute \( p_{\bar {a}}(n,r)= 2^{1-n}c_{\bar {a}}(n,r)\) in Theorem 6. □

Theorem 11

Let a be a positive integer. The string \(\lbrace p_{L}(n)\rbrace _{n = 0}^{\infty }\) is determined with the initial conditions \( p_{L}(0)=p_{L}(1)={\cdots } =p_{L}(L)= 1\) and the recurrence relation

$$p_{L}(n)=p_{L}(n-1)-2^{-L-1}p_{L}(n-L-1). $$

Proof

Just substitute \(p_{L}(n)= 2^{1-n}c_{L}(n)\) in Theorem 7. □

As for the complexities, to find the complexity of the computations needed in determination of the value of \(p_{a}(n_{0},r)\) for a specific value of a, using the recursive formula given in theorem 9, first we assume that all of the values \(p_{a}(n,r)\) are computed for \(n \leq n_{0}\) and \(1 \leq r \leq \lfloor \frac {n_{0}}{a} \rfloor \). Then to compute pa(n0,r) we need 4 addition operations for each r. Thus, in total one has to perform \(\displaystyle \sum \limits _{n = 0}^{n_{0}} 4 \frac {n}{a}= \frac {2n(n + 1)}{a}\) additions. This means that the complexity to determine \(p_{a}(n_{0},r)\) is \(\mathcal {O}(n^{2})\). Similar arguments can be given for the formulas stated in the other theorems and the same complexity bound is also valid for them.

6 Expected values and variance

For our purposes we need neither expected values nor variances. On the other hand, these quantities are quite important tools for various similar statistical computations. For the sake of completeness and possible future references here we compute expected values and variances of the random variables X, \(X_{a}\) and \(X_{\bar {a}}\).

It is well known that when n is fixed the expected value and the variance of X appear as the coefficient of \(z^{n}\), respectively in \(\displaystyle \frac {\partial P}{\partial x}\) and \(\displaystyle \frac {\partial ^{2}P}{\partial x^{2}}+\frac {\partial P}{\partial x}-\left (\frac {\partial P}{\partial x}\right )^{2} \) where all partial derivatives are evaluated at \(x = 1\). Similar expressions can be written for \(X_{a}\) and \(X_{\bar {a}}\). After some straightforward, but boring computations we obtain the following:

  • Coefficient of \(z^{n}\) in

    $$\frac{\partial }{\partial x}P(z,x)\vert_{x = 1}=\frac{1}{2}\frac{(2-z)z}{(1-z)^{2}}=\left( z-\frac{z^{2}}{2}\right)\sum\limits_{i = 0}^{\infty }{(i + 1)z^{i}} $$

    is \(\displaystyle \frac {n + 1}{2}\).

  • Coefficient of \(z^{n}\) in

    $$\frac{\partial^{2}}{\partial x^{2}}P(z,x)\vert_{x = 1}=\frac{1}{2}\frac{(2-z)z^{2}}{(1-z)^{3}}=\frac{1}{2}(2-z)z^{2}\sum\limits_{i = 0}^{\infty }{\binom{i + 2}{2}z^{i}} $$

    is \(\displaystyle \frac {(n-1)(n + 2)}{4}\).

  • Coefficient of \(z^{n}\) in

    $$\frac{\partial }{\partial y}P_{a}(z,y)\vert_{y = 1}=\frac{1}{2^{a + 1}}\frac{(2-z)^{2}z^{a}}{(1-z)^{2}}=\frac{1}{2^{a + 1}}(4z^{a}-4z^{a + 1}+z^{a + 2})\sum\limits_{i = 0}^{\infty }{(i + 1)z^{i}} $$

    is \(\displaystyle \frac {n-a + 3}{2^{a + 1}}\).

  • Coefficient of \(z^{n}\) in

    $$\frac{\partial^{2}}{\partial y^{2}}P_{a}(z,y)\vert_{y = 1}=-\frac{1}{2^{2a + 1}}\frac{(2-z)^{3}z^{2a}}{(1-z)^{3}} $$

    is \(\displaystyle \frac {1}{4^{a + 1}}(4a^{2}-4an-18a+n^{2}+ 9n + 14)\).

  • Coefficient of \(z^{n}\) in

    $$\frac{\partial }{\partial y}P_{\bar{a}}(z,y)\vert_{y = 1}=\frac{1}{2^{a}}\frac{(2-z)z^{a}}{(1-z)^{2}} $$

    is \(\displaystyle \frac {n-a + 2}{2^{a}}\).

  • Coefficient of \(z^{n}\) in

    $$\frac{\partial^{2}}{\partial y^{2}}P_{\bar{a}}(z,y)\vert_{y = 1}=\frac{1}{2^{2a-1}}\frac{(2-z)z^{2a}}{(1-z)^{3}} $$

    is \(\displaystyle \frac {4a^{2}-4an-10a+n^{2}+ 5n + 4}{2^{2a}}\).

The following theorem follows immediately.

Theorem 12

Let n be a fixed positive integer. Expected values and variances of the randomvariables\(X, \ X_{a}\)and\(X_{\bar {a}}\)definedon\( {\Omega }_{n}\)are asfollows:

$$ E(X)=\frac{n + 1}{2}, \hspace{1.5cm} Var(X)=\frac{n-1}{4}, $$
$$E(X_{a})=\frac{n + 3-a}{2^{a + 1}}, \hspace{0.8cm} Var(X_{a})=\frac{(2^{a + 1}-2a + 3)n + 3(a^{2}-4a + 2)+ 2^{a + 1}(3-a)}{4^{a + 1}}, $$
$$E(X_{\bar{a}})=\frac{n + 2-a}{2^{a}}, \hspace{0.8cm} Var(X_{\bar{a}})=\frac{(2^{a}-2a + 1)n+(3a-2^{a})+(a-2)}{4^{a}}. $$

7 R-2 composition tests

R-2 composition tests count the number of pre-specified runs in a string. The input of the test is a collection of binary strings with equal length n. We apply the test function to determine the number of occurrences of the restricted runs in each string, and call them as observed values. Afterwards, we apply \(\chi ^{2}\) test and produce p-value using the bin probability tables (as described in [17]). We give the probabilities for \(n\in \{128,256,1024,4096 \} \). It should be noted that the 8 test statistics defined in this paper are not necessarily independent.

7.1 Walkthrough

Tables 101112, and 13 present the number of bins, bin values and the probabilities corresponding to each bins, for n = 128,256,1024 and 4096 respectively. As an example, to test the randomness of a collection of N binary strings of length \(n = 128\), the first row of Table 10, that is the line labeled as “Total Runs” suggests the use of 8 bins, and gives the expected values of the number of total runs to be between 0 and 57 as \(0.106982 \times N\), to be between 58 and 60 as \(0.131978 \times N\), ... , to be between 72 and 128 as \(0.106982 \times N\).

The procedure to test a collection of N binary strings of length n, using the R-2 Composition Tests family can be summarized as follows:

  1. 1.

    Depending on n, use the appropriate bin probability table given in the Appendix (Tables 101112, or 13) to determine the corresponding number of bins for each of the test functions:

    • number of all runs,

    • number of runs of length 1,

    • number of runs of length 2,

    • number of runs of length 3,

    • number of runs of length 4,

    • number of runs of length 5,

    • number of runs of length at least 5,

    • length of the longest run.

  2. 2

    Apply \(\chi ^{2}\) Goodness of Fit Test, that is evaluate

    $$\chi^{2} = \sum\limits_{i = 1}^{B} \frac{(O_{i} - N \cdot p_{i})^{2}}{N \cdot p_{i}}\ \ \ \ \text{and}\ \ \ \ \ \text{\textit{p}-value} = \text{\texttt{igamc}}\left( \frac{B-1}{2},\frac{\chi^{2}}{2}\right) $$

    where \(p_{i}\)’s are obtained from bin probability Tables 101112, and 13.

  3. 3

    Print the p-value.

Pseudocode of the test is given at Algorithm 2 in the Appendix. The source code is available online at http://users.metu.edu.tr/muhid/stats.html.

Example 6

We apply R-2 Composition Test with \(a = 2\) as an example to AES algorithm [18]. First we fix the key as 0. Then, by encrypting the plaintexts corresponding to the numbers from 0 to 100,000 we generate 100,000 many 128-bit ciphertexts. The observed number of runs of length \(a = 2\) in each block are given in Table 3. Using these bin values, we compute the \(\chi ^{2}\) and p-values as follows.

$$\chi^{2} = \sum\limits_{i = 1}^{B} \frac{(O_{i} - N \cdot p_{i})^{2}}{N\cdot p_{i}} = 2.2682 \ \ \ \ \text{and}\ \ \ \ \ \text{\textit{p}-value} = \text{\texttt{igamc}}\left( \frac{8-1}{2},\frac{2.2682 }{2}\right)= 0.9435. $$
Table 3 Expected and observed values for AES, with \(n = 128\), \(a = 2\), \(N = 100,000\)

The complete list of p-values for this case is given in the second column of the following Table 4. The second row in that column shows the p-value for the number of all runs. Next 5 rows of the same column, labeled with numbers “1” to “5” show the p-value for the number of all runs of length a equal to 1 till 5. The row labeled as “≥ 5” is for the p-value of all runs of length a greater than or equal to 5. Finally, the last row shows the p-value for the longest run test.

Table 4 p-values for a, \(N = 100,000\)

8 Application

This section reveals the results obtained from the application of the R-2 Composition Tests to various collections of strings in order to show the sensitivity of the tests. For this purpose, we generate pseudorandom and non-random data sets. The details are as follows.

First, we applied the tests to the data set, generated by using the AES algorithm in the same manner explained in the Example 6 above. For each of the 8 different tests, the p-values obtained are presented in the first column of the Table 5. As a second application, we considered the collection of outputs of SHA-3 algorithm, for the case n = 256. Starting with initial message 0 and recursive application of the algorithm, \(S_{i}=H(S_{i-1})\), we collected the hash values and evaluated this collection with the 8 different tests introduced. The p-values obtained are presented in the first column of the Table 6. Then, we applied the R-2 Composition Tests to the binary expansions of \(\pi \) and \(\sqrt {2}\). In the case of \(\pi \), we produced 100,000 many 1024 bit blocks from its binary expansion and applied the tests for the case \(n = 1024\). Similarly, we applied the test for a collection obtained from the binary expansion of \(\sqrt {2}\), taking \(n = 4096\). The p-values obtained are given in the first columns of the Tables 7 and 8.

Table 5 128 bits
Table 6 256 bits
Table 7 1024 bits
Table 8 4096 bits

Second, we applied the tests to the collection of biased binary strings. By setting each bit \(a_{i}\) as 1 when a randomly generated integer between 0 and 999 is greater than 505, and to 0 otherwise, we produced 100,000 weight-biased binary strings of the lengths 128, 256, 1024, and 4096 with bias 1% (p1 = Pr(ai = 1) = 0.505). We produced another collection with bias 5% (p1 = Pr(ai = 1) = 0.525) using the same idea. The results are presented in the second and third columns of the Tables 567, and 8. It is observed that the bias \(p_{1}= 0.505\) is detected only by the longest runs test for \(n = 128\), \(n = 256\) and \(n = 1024\); and is detected by five of these tests for \(n = 4096\). The bias \(p_{1}= 0.525\) is detected by all the tests.

After getting results for the weight-biased data set, we test the sensitivity of the R-2 Composition Tests on the run-biased binary strings. Since there is no canonical method to introduce a bias to the number of runs, we developed two methods for this purpose. First, we introduced a bias to the number of runs by flipping the last bit of the first run of length 3 in each sequence of the Example 6. We name the collection obtained by this method as DataSet-1. Notice that, this manipulation does not change the total number of runs, whereas it changes the number of runs of length 3. As seen in the Table 9, this bias is not detected by tests in the NIST test suite, namely the runs test and the test for the longest run of ones when default parameters are used. On the other hand, all the tests related with the number of runs of a specified length which are defined in this paper detected this bias. The manipulation is detected by sstringRun test of TestU01. The p-values are given in the Table 9.

Table 9 Results of this work, NIST and TestU01 on manipulated data

The main advantage of the R-2 composition test is that it can evaluate a collection of relatively short strings directly as a collection, while the other tests in the literature can evaluate such a collection only after concatenating them. However, after concatenating the collection, the defects in the individual strings may not be recognizable by the randomness tests. This can be a major problem in testing key generators in cryptographic applications. Moreover, depending on the order they are concatenated may cause different randomness properties. In order to illustrate this scenario we produced a non-random data set called DataSet-2 using Algorithm 1 below. First, we take a random collection generated using the AES algorithm in Example 6. Then, we take a subset of this collection and modify each string in this subset so that each string starts with 1 and ends with 0. Note that, this collection can be flagged as non random easily as the number of total runs deviates from the expected number of total runs. In order make this defect in the individual strings not recognizable after concatenation, we decrease the number of totals runs by a second operation: if the sequence starts with 101, replace it by 111 and if it ends with 010, replace it by 000. We determine the subset of the collection to be modified as every sixth sequence in the collection. In short, for each of the every \(6^{th}\) sequence \(\{s_{i}\}_{i = 0}^{127}\), we replace the first three bits of the sequence as follows

$$000 \rightarrow 100, \ \ \ \ 001 \rightarrow 111, \ \ \ \ 010 \rightarrow 110, \ \ \ \ 011 \rightarrow 111 $$

and do not make any change otherwise, the last three bits of the sequence as follows

$$001 \rightarrow 000, \ \ \ \ 011 \rightarrow 000, \ \ \ \ 101 \rightarrow 100, \ \ \ \ 111 \rightarrow 110 $$

and do not make any change otherwise. Algorithm 1 operates on 100,000 strings to generate DataSet-2.

R-2 Composition Tests detected the manipulation, that is the data set gets relatively low p-values from the runs of length 1 and the total number of runs tests. This manipulation is also detected by the NIST Runs test, but not detected by the test for the longest run of ones. However it is not detected by the run tests in Test U01. The p-values are given in the Table 9.

These experiments show that the new tests can detect some defects that other run tests do not.

figure b

9 Conclusion

In this work, we define a family of randomness tests in the spirit of Golomb’s second randomness postulate. We give recursive formulas that are feasible to compute the exact probabilities of different type of runs, namely number of all runs in a binary segment of length n, that of all runs of length a, all runs of length at least a, and the number of all binary segments whose longest run has length at most L. Moreover, we find the exact distributions for all those run types and define new randomness tests. Afterwards, we apply the family of randomness tests to various collections of strings.

Throughout this work, we naturally consider binary strings where zeros and ones are produced with equal probabilities. As a future work, the same computations can be generalized for strings which are produced with unequal probabilities, which can be used for estimating the entropy of the source.