
1 Introduction

The goal of grammar-based compression is to represent a word w by a small context-free grammar that produces exactly \(\{w\}\). Such a grammar is called a straight-line program (SLP) for w. In the best case, one gets an SLP of size \(\varTheta (\log n)\) for a word of length n, where the size of an SLP is the total length of all right-hand sides of the rules of the grammar. A grammar-based compressor is an algorithm that produces an SLP for a given word w. Various grammar-based compressors can be found in the literature; well-known examples are the classical LZ78 compressor of Lempel and Ziv [21], BISECTION [12] and SEQUITUR [17], just to mention a few. In this paper, we study the class of global grammar-based compressors, which are also called global algorithms. A key concept of these algorithms is the maximal string. A maximal string of an SLP \(\mathbb A\) is a word that has length at least two and occurs at least twice without overlap as a factor of the right-hand sides of the rules of \(\mathbb A\). Further, no strictly longer word appears at least as many times without overlap as a factor of the right-hand sides of \(\mathbb A\). For an input word w, a global grammar-based compressor starts with the SLP that has a single rule \(S\rightarrow w\), where S is the start nonterminal of the grammar. The SLP is then recursively updated by choosing a maximal string \(\gamma \) of the current SLP and replacing a maximal set of pairwise non-overlapping occurrences of \(\gamma \) by a fresh nonterminal X. Additionally, a new rule \(X\rightarrow \gamma \) is introduced. The algorithm stops when the obtained SLP has no maximal string. Probably the best-known example of a global algorithm is \(\mathsf {RePair}\) [13], which selects in each round a most frequent maximal string.
Note that the \(\mathsf {RePair}\) algorithm as proposed in [13] always selects a word of length 2, but in this paper we follow the definition of [8], where the algorithm may select longer words. However, both definitions coincide for the unary input strings considered in this work. Other global algorithms are \(\mathsf {LongestMatch}\) [11], which chooses a longest maximal string in each round, and \(\mathsf {Greedy}\) [2,3,4], which selects a maximal string that minimizes the size of the SLP obtained in the current round. It is again worth mentioning that the \(\mathsf {Greedy}\) algorithm as originally presented in [2,3,4] differs from the version studied in this work as well as in [8]: the original \(\mathsf {Greedy}\) algorithm only considers the right-hand side of the start rule for the choice and the replacement of the maximal string. In particular, all other rules do not change after they are introduced.

In the seminal work of Charikar et al. [8], the worst case approximation ratio of grammar-based compressors is studied. For a grammar-based compressor \(\mathcal {C}\) that computes an SLP \(\mathcal {C}(w)\) for a given word w, one defines the approximation ratio of \(\mathcal {C}\) on w as the quotient of the size of \(\mathcal {C}(w)\) and the size g(w) of a smallest SLP for w. The approximation ratio \(\alpha _{\mathcal {C}}(n)\) is the maximal approximation ratio of \(\mathcal {C}\) among all words of length n. In [8] the authors provide upper and lower bounds for the approximation ratios of several grammar-based compressors (among them all compressors mentioned so far), but for none of the compressors do the lower and upper bounds match. For LZ78 and BISECTION those gaps were closed in [15]. For all global algorithms, the best upper bound on the approximation ratio is \(\mathcal {O}((n/\log n)^{2/3})\) [8], while the best known lower bounds are \(\varOmega (\log n/\log \log n)\) for \(\mathsf {RePair}\) [14], \(\varOmega (\log \log n)\) for \(\mathsf {LongestMatch}\) and \(5/(3\log _3(5))=1.137...\) for \(\mathsf {Greedy}\) [8]. In the context of our work, it is worth mentioning that the lower bound for \(\mathsf {Greedy}\) uses words over a unary alphabet. In general, the achieved bounds “leave a large gap of understanding surrounding the global algorithms”, as the authors of [8] conclude.

We aim to strengthen the understanding of global grammar-based compressors in this paper by studying the behavior of these algorithms on unary inputs, i.e., words of the form \(a^n\) for some symbol a. Grammar-based compression of unary words is strongly related to the field of addition chains, which has been studied for decades (see [16, Chapter 4.6.3] for a survey) and is still an active topic due to the strong connection to public key cryptosystems (see [18] for a review from that point of view). An addition chain for an integer n of size m is a sequence of integers \(1=k_1,k_2,\dots ,k_m=n\) such that for each d (\(2\le d \le m\)), there exist i, j (\(1\le i,j<d\)) such that \(k_i+k_j=k_d\). It is straightforward to compute from an addition chain for an integer n of size m an SLP for \(a^n\) of size \(2m-2\). Conversely, an SLP for \(a^n\) of size m yields an addition chain for n of size m. Hence, this paper can also be interpreted as a study of global algorithms as addition chain solvers. For \(\mathsf {RePair}\) and \(\mathsf {LongestMatch}\), the restriction to unary inputs allows a full understanding of the produced SLPs, and it turns out that for all unary inputs the SLP produced by \(\mathsf {RePair}\) has the same size as the SLP produced by \(\mathsf {LongestMatch}\). In fact, both algorithms are essentially identical to the binary method, which produces an addition chain for n by creating powers of two via repeated squaring and then representing n as the sum of those powers of two that correspond to a one in the binary representation of n. Based on this, we show that for any unary input w the SLPs produced by \(\mathsf {RePair}\) and \(\mathsf {LongestMatch}\) have size at most \(\log _2(3)\cdot g(w)\), and we provide a matching lower bound.
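To make the correspondence concrete, the following Python sketch (our own illustration; the chain for 45 and the naming \(A_d\) are not taken from the literature) converts an addition chain of size m into an SLP for \(a^n\) consisting of \(m-1\) rules of length two, i.e., of size \(2m-2\):

```python
def chain_to_slp(chain):
    """Turn an addition chain 1 = k_1, ..., k_m = n into an SLP for a^n.

    For every element k_d with d >= 2 we pick some earlier pair
    k_i + k_j = k_d and emit a rule A_d -> A_i A_j, where A_1 stands for
    the terminal a.  This yields m - 1 rules of length two, i.e. an SLP
    of size 2m - 2."""
    rules = {}
    for d in range(1, len(chain)):
        # such a pair exists by the definition of an addition chain
        i, j = next((i, j) for i in range(d) for j in range(i, d)
                    if chain[i] + chain[j] == chain[d])
        rules[d] = (i, j)
    return rules

def expand(rules, d):
    # length of the word produced by nonterminal A_d (sanity check)
    if d == 0:
        return 1
    i, j = rules[d]
    return expand(rules, i) + expand(rules, j)

chain = [1, 2, 4, 5, 10, 20, 40, 45]          # an addition chain for n = 45
rules = chain_to_slp(chain)
assert expand(rules, len(chain) - 1) == 45    # the SLP derives a^45
assert 2 * len(rules) == 2 * len(chain) - 2   # SLP size 2m - 2
```

The converse direction works analogously: evaluating the rules of an SLP for \(a^n\) as sums of the lengths they derive yields an addition chain for n.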

Unfortunately, even for unary inputs it is hard to analyze the general behavior of \(\mathsf {Greedy}\) due to the discrete optimization problem in each round of the algorithm. The (probably weak) upper bound that we achieve for the approximation ratio of \(\mathsf {Greedy}\) on unary inputs is \(\mathcal {O}(n^{1/4}/\log n)\). We derive this bound by estimating the size of the SLP obtained by \(\mathsf {Greedy}\) after three rounds, which already indicates room for improvement. For the original \(\mathsf {Greedy}\) algorithm, where only the start rule is compressed, it is a direct consequence of our analysis that the approximation ratio on unary input strings is \(\varTheta (\sqrt{n}/\log n)\). On the positive side, we provide a new lower bound of 1.348... for the approximation ratio of \(\mathsf {Greedy}\) (in the variant where all right-hand sides are considered), which improves the best known lower bound even for inputs over arbitrary alphabets. The key to the new bound is the sequence \(y_k=y_{k-1}^2+1\) with \(y_0=2\), which has been studied in [1] (among other sequences), where it is shown that \(y_k= \lfloor \gamma ^{2^k}\rfloor \) for \(\gamma =2.258...\). In order to prove the lower bound, we show that the SLP produced by \(\mathsf {Greedy}\) on input \(a^{y_k}\) has size \(3\cdot 2^k-1\), while a smallest SLP for \(a^{y_k}\) has size \(3\cdot \log _3(\gamma )\cdot 2^k+o(2^k)\) (this follows from a construction used to prove the lower bound for \(\mathsf {Greedy}\) in [8]).

Related Work. Straight-line programs first appeared in the literature in [6, 9], where they are called word chains (since they generalize addition chains from numbers to words). In [6], Berstel and Brlek prove that the function \(g(k,n) = \max \{ g(w) \mid w \in \{1,\ldots ,k\}^n \}\) is in \(\varTheta (n/\log _k n)\). Recall that g(w) is the size of a smallest SLP for the word w, so g(k, n) measures the worst case SLP-compression over all words of length n over a k-letter alphabet.

The smallest grammar problem is the problem of computing a smallest SLP for a given input word. It is known from [8, 20] that in general no grammar-based compressor can solve the smallest grammar problem in polynomial time unless \(\mathsf {P} = \mathsf {NP}\). Even worse, unless \(\mathsf {P} = \mathsf {NP}\) one cannot compute in polynomial time for a given word w an SLP of size at most \(\frac{8569}{8568} \cdot g(w)\) [8]. One should mention that the constructions to prove those hardness results use alphabets of unbounded size. While in [8] it is remarked that the construction in [20] works for words over a ternary alphabet, Casel et al. [7] argue that this is not clear at all and provide a construction for fixed alphabets of size at least 24. However, for grammar-based compression on unary strings as it is studied in this work (as well as for the problem of computing a smallest addition chain), there is no \(\mathsf {NP}\)-hardness result, so there might be an optimal polynomial-time algorithm even though it is widely believed that there is none.

Other notable systematic investigations of grammar-based compression are provided in [11, 19]. Whereas in [11], grammar-based compressors are used for universal lossless compression (in the information-theoretical sense), it is shown in [19] that the size of so-called irreducible SLPs (that include SLPs produced by global algorithms) can be upper bounded by the (unnormalized) k-th order empirical entropy of the produced string plus some lower order terms.

2 Preliminaries

For \(i,j\in \mathbb {N}\), let \([i,j] = \{i,i+1,\dots ,j\}\) for \(i\le j\) and \([i,j]=\emptyset \) otherwise. For integers m and n, we denote by \(m\;\mathrm {div}\;n\) the integer division of m by n. We denote by \(m\bmod n\) the modulo of m and n, i.e., \(m\bmod n\in [0,n-1]\) and

$$m=(m\;\mathrm {div}\;n)\cdot n+(m\bmod n).$$

If m/n or \(\frac{m}{n}\) is used, then this refers to the standard division over \(\mathbb R\). Note that \(m\;\mathrm {div}\;n=\lfloor m/n \rfloor \) and \((m\;\mathrm {div}\;n)+(m\bmod n)\ge m/n\).
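Both conventions can be checked mechanically; the following snippet (added for illustration) uses Python's floor division and modulo operators, which agree with div and mod for the non-negative arguments used here:

```python
# Check m = (m div n) * n + (m mod n) and (m div n) + (m mod n) >= m / n
# for small non-negative m and positive n; Python's // and % implement
# div and mod on this range.
for m in range(100):
    for n in range(1, 20):
        assert m == (m // n) * n + m % n
        assert m // n + m % n >= m / n
```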

For an alphabet \(\varSigma \), let \(w=a_1\cdots a_n\) (\(a_1,\dots ,a_n\in \varSigma \)) be a word or string over \(\varSigma \). The length |w| of w is n and we denote by \(\varepsilon \) the word of length 0. A unary word is a word of the form \(a^n\) for \(a\in \varSigma \). Let \(\varSigma ^+=\varSigma ^*\setminus \{\varepsilon \}\) be the set of all nonempty words. For \(w \in \varSigma ^+\), we call \(v\in \varSigma ^+\) a factor of w if there exist \(x,y\in \varSigma ^*\) such that \(w=xvy\).

2.1 Straight-Line Programs

A straight-line program, briefly SLP, is a context-free grammar that produces a single word \(w\in \varSigma ^+\). Formally, it is a tuple \(\mathbb A = (N,\varSigma , P, S)\), where N is a finite set of nonterminals with \(N\cap \varSigma = \emptyset \), \(S \in N\) is the start nonterminal, and P is a finite set of productions (or rules) of the form \(A \rightarrow w\) for \(A \in N\), \(w \in (N \cup \varSigma )^+\) such that:

  • For every \(A \in N\), there exists exactly one production of the form \(A \rightarrow w\), and

  • the binary relation \(\{ (A, B) \in N \times N \mid (A \rightarrow w) \in P,\;B \text { occurs in } w \}\) is acyclic.

Every nonterminal \(A \in N\) produces a unique, nonempty word. The word defined by the SLP \(\mathbb A\) is the word produced by the start nonterminal S. The size of the SLP \(\mathbb A\) is \(|\mathbb A| = \sum _{(A \rightarrow w) \in P} |w|\). We denote by g(w) the size of a smallest SLP producing the word \(w\in \varSigma ^+\). We will use the following inequalities that can be found in [8]:

Lemma 1

([8]). For all unary words w of length n, we have

$$3\log _3(n)-3\le g(w)\le 3\log _3(n)+o(\log n).$$

Note that the first inequality also holds when w is a word over an arbitrary alphabet. The proof of the first inequality can be found in Lemma 1 of [8] and the second inequality is shown in the proof of Theorem 11 of [8].
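The upper bound of Lemma 1 comes from a base-3 construction. The following Python sketch implements a simplified Horner-style variant (our own illustration: it only achieves size roughly \(5\log _3 n\), whereas the refined construction of [8] gives the sharper bound \(3\log _3(n)+o(\log n)\)):

```python
import math

def ternary_slp_size(n):
    """Size of an SLP for a^n built by the base-3 Horner scheme
    A_i -> A_{i+1} A_{i+1} A_{i+1} a^{d_i}, where d_L ... d_0 are the
    base-3 digits of n and the topmost rule is A_L -> a^{d_L}."""
    digits = []                      # base-3 digits, least significant first
    m = n
    while m > 0:
        digits.append(m % 3)
        m //= 3
    size = digits[-1]                # rule A_L -> a^{d_L}
    value = digits[-1]
    for d in reversed(digits[:-1]):
        size += 3 + d                # rule A_i -> A_{i+1}^3 a^{d_i}
        value = 3 * value + d
    assert value == n                # the grammar really derives a^n
    return size

# every level contributes at most 3 + 2 symbols, so the total size is
# bounded by roughly 5 * log_3(n) + 2
for n in [5, 26, 677, 10**6]:
    assert ternary_slp_size(n) <= 5 * math.log(n, 3) + 2
```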

Approximation Ratio. A grammar-based compressor \(\mathcal C\) is an algorithm that computes for a nonempty word w an SLP \(\mathcal C(w)\) that produces the word w. The approximation ratio \(\alpha _{\mathcal C}(w)\) of \(\mathcal C\) for an input w is defined as \(|\mathcal C(w)|/g(w)\). The worst-case approximation ratio \(\alpha _{\mathcal C}(k,n)\) of \(\mathcal C\) is the maximal approximation ratio over all words of length n over an alphabet of size k:

$$\begin{aligned} \alpha _{\mathcal C}(k,n)=\max \{ \alpha _{\mathcal C}(w) \mid w \in [1,k]^n \} = \max \{ |\mathcal C(w)|/g(w) \mid w \in [1,k]^n \} \end{aligned}$$

In this paper we are mainly interested in the case \(k=1\), i.e., we are interested in grammar-based compression on unary words.

3 Global Algorithms

For a given SLP \(\mathbb A = (N,\varSigma , P, S)\), a word \(\gamma \in (N \cup \varSigma )^+\) is called a maximal string of \(\mathbb A\) if

  • \(|\gamma |\ge 2\),

  • \(\gamma \) appears at least twice without overlap as a factor of the right-hand sides of \(\mathbb A\),

  • and no strictly longer word appears at least as many times as a factor of the right-hand sides of \(\mathbb A\) without overlap.

A global grammar-based compressor starts on input w with the SLP \(\mathbb A_0=(\{S\},\varSigma , \{S\rightarrow w\}, S)\). In each round \(i\ge 1\), the algorithm selects a maximal string \(\gamma \) of \(\mathbb A_{i-1}\) and updates \(\mathbb A_{i-1}\) to \(\mathbb A_{i}\) by replacing a largest set of pairwise non-overlapping occurrences of \(\gamma \) in \(\mathbb A_{i-1}\) by a fresh nonterminal X. Additionally, the algorithm introduces the rule \(X\rightarrow \gamma \) in \(\mathbb A_i\). The algorithm stops when no maximal string occurs. Note that the replacement is not unique, e.g. the word \(a^5\) has a unique maximal string \(\gamma =aa\), which yields SLPs with rules \(S\rightarrow XXa, X\rightarrow aa\) or \(S\rightarrow XaX, X\rightarrow aa\) or \(S\rightarrow aXX, X\rightarrow aa\). We assume the first variant in this paper, i.e., maximal strings are replaced from left to right.
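The left-to-right convention can be made concrete in Python (an illustration added here): str.replace substitutes non-overlapping occurrences from left to right, matching the convention fixed above.

```python
# One round of a global algorithm on input a^5: the unique maximal string
# is "aa", and str.replace yields the left-to-right replacement variant.
rhs = "aaaaa"
new_rhs = rhs.replace("aa", "X")
assert new_rhs == "XXa"              # start rule S -> XXa

rules = {"S": new_rhs, "X": "aa"}    # together with the new rule X -> aa
assert sum(len(r) for r in rules.values()) == 5
```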

3.1 \(\mathsf {Greedy}\)

The global grammar-based compressor \(\mathsf {Greedy}\) selects in each round \(i\ge 1\) a maximal string of \(\mathbb A_{i-1}\) such that \(\mathbb A_{i}\) has minimal size among all possible choices of maximal strings of \(\mathbb A_{i-1}\).

We start with the main result of this paper, which is a new lower bound for the approximation ratio of \(\mathsf {Greedy}\). The best known lower bound [8] so far is

$$\alpha _\mathsf {Greedy}(k,n)\ge \frac{5}{3\log _3(5)}=1.13767699...$$

for all \(k\ge 1\) and infinitely many n. This bound is achieved using unary input strings. A key ingredient for proving a better lower bound is the sequence \(x_n\) described in the following lemma from [1]:

Lemma 2

([1, Example 2.2]). Let \(x_{n+1}=x_{n}^2 +1\) with \(x_0=1\) and

$$\beta =\exp \left( \sum _{i= 1}^{\infty }\frac{1}{2^i}\log \left( 1+\frac{1}{x_i^2}\right) \right) .$$

We have \(x_n=\left\lfloor \beta ^{2^n}\right\rfloor \).

In this work, we use the shifted sequence \(y_n=x_{n+1}\), i.e., we start with \(y_0=2\). It follows that \(y_n=\left\lfloor \gamma ^{2^n}\right\rfloor \), where \(\gamma =\beta ^2= 2.25851845...\). Additionally, we need the following lemma:
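Both the recurrence and the closed form are easy to check numerically (an added sanity check; the constant is truncated to the digits given above):

```python
# Check the recurrence y_n = y_{n-1}^2 + 1 with y_0 = 2 and the closed
# form y_n = floor(gamma^(2^n)) numerically.
y = [2]
for _ in range(6):
    y.append(y[-1] ** 2 + 1)
assert y[:5] == [2, 5, 26, 677, 458330]

# gamma is recovered as the limit of y_n ** (1 / 2**n); the correction
# terms shrink doubly exponentially, so n = 6 is already very accurate
gamma = y[6] ** (1 / 2 ** 6)
assert abs(gamma - 2.25851845) < 1e-6
```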

Lemma 3

Let \(m\ge 1\) be an integer. Let \(f_m:{\mathbb R}_{>0}\rightarrow \mathbb R\) with

$$f_m(x)=x+\frac{m^2+1}{x}.$$

We have \(f_m(x)>2m\) for all \(x>0\).

Proof

The derivative \(f_m'(x)=1-(m^2+1)/x^2\) vanishes only at \(x=\sqrt{m^2+1}\), where \(f_m\) attains its unique minimum \(f_m(\sqrt{m^2+1})=2\sqrt{m^2+1}\). It follows that \(f_m(x)\ge 2\sqrt{m^2+1}>2\sqrt{m^2}=2m\).

Now we are able to prove the new lower bound for \(\mathsf {Greedy}\):

Theorem 1

For all \(k\ge 1\) and infinitely many n, we have

$$\alpha _\mathsf {Greedy}(k,n)\ge \frac{1}{\log _3(\gamma )}=1.34847194...\;\;.$$

Proof

Let \(\varSigma =\{a\}\) be a unary alphabet. We define \(w_k=a^{y_k}\). By Lemma 2, we have \(|w_k|\le \gamma ^{2^k}\). Applying Lemma 1 yields

$$g(w_k)\le 3\cdot \log _3(\gamma )\cdot 2^k+o(2^k).$$

In the remaining proof we show that on input \(w_k\), \(\mathsf {Greedy}\) produces an SLP of size \(3\cdot 2^k-1\), which directly implies \(\alpha _\mathsf {Greedy}(1,n)\ge 3/(3\log _3(\gamma ))\). We start with the SLP \(\mathbb A_0\) which has the single rule \(S\rightarrow a^{y_k}\). Consider now the first round of the algorithm, i.e., we need to find a maximal string \(a^x\) of \(\mathbb A_0\) such that the grammar \(\mathbb A_1\) with rules

$$X_1\rightarrow a^x,\;\;S\rightarrow X_1^{y_k\;\mathrm {div}\;x}a^{y_k\bmod \, x}$$

has minimal size. We have \(|\mathbb A_1|=x+(y_k\;\mathrm {div}\;x)+(y_k\bmod x)\ge x+y_k/x\). By the definition of \(y_k\) we have \(|\mathbb A_1|\ge x+ (y_{k-1}^2+1)/x\). Applying Lemma 3 yields \(|\mathbb A_1|\ge 2y_{k-1}+1\). Note that for \(x=y_{k-1}\) this minimum is achieved, i.e., we can assume that \(\mathsf {Greedy}\) selects the maximal string \(a^{y_{k-1}}\) and \(\mathbb A_1\) is

$$X_1\rightarrow a^{y_{k-1}},\;S\rightarrow X_1^{y_{k-1}}a.$$
Fig. 1. Three rounds of \(\mathsf {Greedy}\) on input \(a^{y_k}\).

Each maximal string of \(\mathbb A_1\) is either a unary word over \(X_1\) or a unary word over a, i.e., we can analyze the behavior of \(\mathsf {Greedy}\) on both rules independently. The rule \(X_1\rightarrow a^{y_{k-1}}\) is obviously treated in the same way as the initial SLP \(\mathbb A_0\), so we continue with analyzing \(S\rightarrow X_1^{y_{k-1}}a\). Again, the same arguments as above show that \(\mathsf {Greedy}\) introduces a rule \(X_3\rightarrow X_1^{y_{k-2}}\), which yields \(S\rightarrow X_3^{y_{k-2}}X_1a\) as the new start rule. This process can be iterated using the same arguments for the leading unary strings of length \(y_i\) for some \(i\in [1,k]\).

The reader might think of this process as a binary tree, where each node is labelled with a rule (the root is labelled with \(S\rightarrow a^{y_k}\)) and the children of a node are the two rules obtained by \(\mathsf {Greedy}\) when the rule has been processed. We assume that the left child represents the rule for the chosen maximal string and the right child represents the parent rule where all occurrences of the maximal string are replaced by the fresh nonterminal. In Fig. 1 this binary tree is depicted for the steps we discussed above. Note that when a rule is processed, the longest common factor of the two new rules has length 1 (the remainder). More generally, after each round there is no word of length at least two that occurs as a factor in two different rules, since a possibly shared remainder has length 1 and otherwise only fresh nonterminals are introduced. It follows that we can iterate this process independently for each rule until no maximal string occurs. This is the case when each rule starts with a unary string of length \(y_0=2\) or, in terms of the interpretation as a binary tree, when a full binary tree of height k is produced. Each right branch occurring in this tree adds a new remainder to those remainders that already occur in the parent rule and a left branch introduces a new (smaller) instance of the start problem. We show by induction that on level \(i\in [0,k]\) of this full binary tree of height k, there is one rule of size \(y_{k-i}+i\) and \(2^{i-j-1}\) many rules of size \(y_{k-i}+j\) for \(j\in [0,i-1]\). On level 0, this is true since there is only a single rule of size \(y_k+0\). Assuming that our claim is true on level \(i<k\), we derive from each rule on level i two new rules on level \(i+1\): A right branch yields a rule that starts with a leading unary string of size \(y_{k-i-1}\) and adds a new remainder to the parent rule. A left branch yields a rule that contains only a unary string of size \(y_{k-i-1}\). 
If we first consider the left branches, we derive that each of the \(2^i\) many rules on level i adds a rule of size \(y_{k-i-1}\) on level \(i+1\). For the right branches, the single rule of size \(y_{k-i}+i\) on level i yields a rule of size \(y_{k-i-1}+i+1\) on level \(i+1\). Further, each of the \(2^{i-j-1}\) many rules of size \(y_{k-i}+j\) (\(j\in [0,i-1]\)) yields a rule of size \(y_{k-i-1}+j+1\). When we put everything together, we get that on level \(i+1\) there is a single rule of size \(y_{k-i-1}+i+1\) and \(2^{i-j}\) many rules of size \(y_{k-i-1}+j\) for \(j\in [0,i]\). That finishes the induction. It follows that the final SLP (which consists of the rules on level k) has a single rule of size \(y_0+k=2+k\) and \(2^{k-j-1}\) many rules of size \(2+j\) for \(j=0,\dots ,k-1\). This gives a total size of

$$\begin{aligned} 2+k+\sum _{j=0}^{k-1}2^{k-j-1}(2+j)&= 2+k+2^k\sum _{j=0}^{k-1}2^{-j}+2^k\sum _{j=0}^{k-1}2^{-j-1}j \\&= 2+k+2^k(2-2^{-k+1})+2^k(-2^{-k}k-2^{-k}+1)\\&= 2+k+2^{k+1}-2-k-1+2^k\\&= 2^{k+1}+2^k-1 \\&= 3\cdot 2^k -1. \end{aligned}$$
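For small k, the size \(3\cdot 2^k-1\) can be cross-checked by directly simulating \(\mathsf {Greedy}\) (our own sketch: minimizing over all factors that occur at least twice without overlap is equivalent to minimizing over maximal strings, since extending a factor without decreasing its occurrence count never increases the resulting size; ties are broken arbitrarily, which does not affect the final size in these instances):

```python
def count(s, f):
    # maximal number of pairwise non-overlapping occurrences of f in s
    # (greedy left-to-right matching is optimal for a single pattern)
    c = i = 0
    while (j := s.find(f, i)) >= 0:
        c, i = c + 1, j + len(f)
    return c

def greedy(w):
    """Run Greedy on input w and return the final rules of the SLP."""
    rules = {"S": w}
    fresh = iter("XYZUVW")               # fresh nonterminals, enough here
    while True:
        size = sum(len(r) for r in rules.values())
        best = None
        for rhs in rules.values():       # all factors of length >= 2
            for i in range(len(rhs)):
                for j in range(i + 2, len(rhs) + 1):
                    f = rhs[i:j]
                    c = sum(count(r, f) for r in rules.values())
                    if c >= 2:
                        # replacing c occurrences of f saves c * (|f| - 1)
                        # symbols and adds a new rule of length |f|
                        new = size - c * (len(f) - 1) + len(f)
                        if best is None or new < best[0]:
                            best = (new, f)
        if best is None:                 # no maximal string is left
            return rules
        _, f = best
        X = next(fresh)
        rules = {A: r.replace(f, X) for A, r in rules.items()}
        rules[X] = f

# y_0 = 2, y_1 = 5, y_2 = 26: the final SLP has size 3 * 2^k - 1
for k, n in enumerate([2, 5, 26]):
    slp = greedy("a" * n)
    assert sum(len(r) for r in slp.values()) == 3 * 2 ** k - 1
```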

In the remaining part of this section, we prove an upper bound on the size of the SLP produced by \(\mathsf {Greedy}\) on input \(a^n\):

Theorem 2

The SLP produced by \(\mathsf {Greedy}\) (after three rounds) on input \(a^n\) has size \(\mathcal {O}(n^{1/4})\).

Proof

Consider an input \(a^n\) with \(n\ge 4\) (otherwise \(S\rightarrow a^n\) is the final SLP since there is no maximal string). The SLP \(\mathbb A_1\) obtained by \(\mathsf {Greedy}\) after the first round has the form

$$\begin{aligned} X\rightarrow a^x,\;\;S\rightarrow X^{n\;\mathrm {div}\;x}a^{n\,\bmod \, x}, \end{aligned}$$
(1)

where \(a^x\) is the selected maximal string. We first show

$$\frac{1}{3}\sqrt{n}\le x \le 3\sqrt{n},\;\;\frac{1}{3}\sqrt{n}\le n\;\mathrm {div}\;x \le 3\sqrt{n},\;\;n\bmod x<3\sqrt{n}.$$

Assume \(x=\left\lceil {\sqrt{n}}\right\rceil \) in Eq. (1). In this case, the size of the SLP is

$$\left\lceil {\sqrt{n}}\right\rceil +\left\lfloor \frac{n}{\left\lceil {\sqrt{n}}\right\rceil }\right\rfloor +n\bmod \left\lceil {\sqrt{n}}\right\rceil \le 3\sqrt{n}+1.$$

Since the maximal string \(a^x\) is selected greedily such that \(\mathbb A_1\) has minimal size, we have \(|\mathbb A_1|\le 3\sqrt{n}+1\). It follows that \(x\le 3\sqrt{n}\), because otherwise the size of \(\mathbb A_1\) would be at least \(3\sqrt{n}+2\) due to the fact that a maximal string (represented by the nonterminal X) occurs at least twice. It follows that \(n\bmod x<3\sqrt{n}\) and \(n\;\mathrm {div}\;x\ge 1/3 \sqrt{n}\). Further, we have \(n\;\mathrm {div}\;x\le 3 \sqrt{n}\) because otherwise the size of \(\mathbb A_1\) would be at least \(3\sqrt{n}+2\) due to the fact that \(x\ge 2\) (a maximal string has length at least two). It follows that \(x\ge 1/3\sqrt{n}\). A slightly more careful analysis allows sharper bounds for x, \(n\;\mathrm {div}\;x\) and \(n\bmod x\), but for the purposes of this proof it is easier to work with the constants 3 and 1/3.
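These first-round bounds can be checked numerically; the following sketch (our own illustration, with ties broken towards the smallest x) computes by brute force the length x of the maximal string \(a^x\) chosen by \(\mathsf {Greedy}\) in the first round and verifies the bounds above:

```python
import math

def best_x(n):
    """Length x chosen by Greedy in the first round on input a^n: it
    minimizes the resulting size x + (n div x) + (n mod x) over all
    candidates 2 <= x <= n // 2 (ties broken towards smaller x)."""
    return min(range(2, n // 2 + 1), key=lambda x: x + n // x + n % x)

# check the bounds from the proof for a range of inputs
for n in [10, 100, 1234, 10**5, 10**6 + 7]:
    x, s = best_x(n), math.sqrt(n)
    assert s / 3 <= x <= 3 * s
    assert s / 3 <= n // x <= 3 * s
    assert n % x < 3 * s
```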

Now the only maximal strings occurring in \(\mathbb A_1\) are of the form \(X^y\) or \(a^z\) (for integers \(y,z \ge 2\)) since no other factor of the right-hand sides of \(\mathbb A_1\) occurs at least twice. Note that both optimization problems (for \(X^y\) and \(a^z\)) are independent, so we assume that the maximal string chosen in the second round has the form \(X^y\); afterwards we proceed with \(a^z\). Let \(d=n\;\mathrm {div}\;x\), where \(a^x\) is again the maximal string that was selected in the first round. Then the SLP \(\mathbb A_2\) obtained after the second round of \(\mathsf {Greedy}\) has the form

$$\begin{aligned} X\rightarrow a^x,\;\;Y\rightarrow X^y,\;\;S\rightarrow Y^{d\;\mathrm {div}\;y}X^{d\,\bmod \, y}a^{n\,\bmod \, x}. \end{aligned}$$
(2)

Let \(g(x)=x+(n\bmod x)\), which is the size of those parts of \(\mathbb A_2\) that are independent of the choice of the maximal string \(X^y\). Assume \(y=\lceil n^{1/4}\rceil \) in (2) and let n be large enough such that Y occurs at least twice in the start rule. This yields an SLP of size

$$\lceil n^\frac{1}{4}\rceil +\left\lfloor \frac{d}{\lceil n^\frac{1}{4}\rceil }\right\rfloor +d\bmod \lceil n^\frac{1}{4}\rceil +g(x)\le 5n^\frac{1}{4}+1+g(x).$$

The inequality uses \(d=n\;\mathrm {div}\;x\le 3\sqrt{n}\) as argued above. It follows again from the greedy nature of the algorithm that \(|\mathbb A_2|\le 5n^{1/4}+1+g(x)\), and arguments similar to those above show that the exponents y, \(d\;\mathrm {div}\;y\) and \(d\bmod y\) can be bounded from above by \(5n^{1/4}\).

All maximal strings occurring in \(\mathbb A_2\) are again unary words, but since \(a^x\) has length at least \(1/3\sqrt{n}\) and the lengths of all other unary factors over X or Y are bounded by \(5n^{1/4}\), we can assume that (for n large enough) \(\mathsf {Greedy}\) selects \(a^z\) for some integer \(z\ge 2\) as the maximal string in order to achieve an SLP \(\mathbb A_3\) of minimal size. Note that if we had assumed that the maximal string chosen in round two is \(a^z\) instead of \(X^y\), then similar arguments would show that \(X^y\) is selected in round three if n is large enough. Now, let \(e=n\bmod x\); then the SLP \(\mathbb A_3\) obtained after the third round has the form

$$\begin{aligned} \begin{aligned}&Z\rightarrow a^z,\;\;X\rightarrow Z^{x\;\mathrm {div}\;z}a^{x\,\bmod \, z},\;\;Y\rightarrow X^y,\\&S\rightarrow Y^{d\;\mathrm {div}\;y}X^{d\,\bmod \, y}Z^{e\;\mathrm {div}\;z}a^{e\,\bmod \, z}. \end{aligned} \end{aligned}$$
(3)

Let \(h(y)=y+(d\;\mathrm {div}\;y)+(d\bmod y)\), which is the size of those parts of \(\mathbb A_3\) that are independent of the choice of the maximal string \(a^z\). Assume now \(z=\lceil n^{1/4}\rceil \) and let n be large enough such that Z occurs at least twice in the right-hand sides of (3). The obtained SLP has size

$$\lceil n^{1/4}\rceil + \left\lfloor \frac{x}{\lceil n^{1/4}\rceil }\right\rfloor +(x\bmod \lceil n^{1/4}\rceil )+ \left\lfloor \frac{e}{\lceil n^{1/4}\rceil }\right\rfloor +(e\bmod \lceil n^{1/4}\rceil )+h(y).$$

Using \(x\le 3\sqrt{n}\) and \(e=n\bmod x<3\sqrt{n}\) we can upper bound this size by \(9n^{1/4}+1+h(y)\). Note that we have already bounded the size of h(y) by \(5n^{1/4}+1\) in the previous step, so the SLP \(\mathbb A_3\) obtained by \(\mathsf {Greedy}\) after three rounds has size at most \(14n^{1/4}+2\).

The bound on the approximation ratio is now achieved using Lemma 1, which shows that a smallest grammar for \(a^n\) has size \(\varOmega (\log n)\).

Corollary 1

We have \(\alpha _\mathsf {Greedy}(1,n)\in \mathcal {O}(n^{1/4}/\log n)\).

The reader might wonder why our estimation stops after three rounds of \(\mathsf {Greedy}\). The most important reason is that a precise invariant is missing in order to iterate our arguments for a non-constant number of rounds. On the other hand, it seems likely that arguments similar to those in the proof of Theorem 2 can be used to show that the SLP produced by \(\mathsf {Greedy}\) after some more rounds has size \(\mathcal {O}(n^{1/8})\), and maybe after some further rounds has size \(\mathcal {O}(n^{1/16})\). However, further analysis would require more and more case distinctions, since it is no longer clear that the selected maximal string is always unary, as the reader can see in (3), where factors of the form \(Z^*a^*\) can occur more than once on the right-hand sides. It therefore seems that new ideas are necessary in order to improve the upper bound beyond \(\mathcal {O}(n^{1/2^k})\) for some fixed k.

An interesting consequence of the proof of Theorem 2 applies to the originally proposed \(\mathsf {Greedy}\) variant [2,3,4]. Recall from the introduction that in this setting, the algorithm recursively chooses the maximal string depending only on the right-hand side of the start rule and replaces the occurrences of the chosen string only there. In other words, the right-hand side of the start rule is compressed in a greedy way and all other rules do not change after they are introduced. Note that in the first round both variants of \(\mathsf {Greedy}\) (the one studied here and the original one) are identical, because in this case the only rule of the SLP is the start rule \(S\rightarrow a^n\). Hence, our analysis of the first step in the proof of Theorem 2 applies to the original variant as well. We have shown that the selected maximal string in the first round (and thus the right-hand side of the introduced rule) has length \(\varTheta (\sqrt{n})\), and since the original variant does not modify the corresponding rule any further, it follows directly that the SLP produced by the original algorithm has size \(\varOmega (\sqrt{n})\). But since the modified start rule also has size \(\varTheta (\sqrt{n})\) after the first step, it follows that the SLP produced by the original algorithm has size \(\mathcal {O}(\sqrt{n})\) as well (the size of the SLP does not increase later). Together with Lemma 1, it follows that this variant of the \(\mathsf {Greedy}\) algorithm has approximation ratio \(\varTheta (\sqrt{n}/\log n)\) on unary inputs of length n.

3.2 \(\mathsf {RePair}\) and \(\mathsf {LongestMatch}\)

In this section we analyze the global grammar-based compressors \(\mathsf {RePair}\) and \(\mathsf {LongestMatch}\). In each round i, \(\mathsf {RePair}\) selects a most frequent maximal string of \(\mathbb A_{i-1}\) and \(\mathsf {LongestMatch}\) selects a longest maximal string of \(\mathbb A_{i-1}\).

We abbreviate the approximation ratio \(\alpha _\mathsf {LongestMatch}\) by \(\alpha _\mathsf {LM}\) for better readability. We first show that \(\mathsf {RePair}\) and \(\mathsf {LongestMatch}\) produce SLPs of equal size for unary inputs \(a^n\), and we determine the exact size of those SLPs as a function of n. In a second step, we use this information to obtain our results for \(\alpha _\mathsf {RePair}(1,n)\) and \(\alpha _\mathsf {LM}(1,n)\). Fix an integer \(n\ge 2\) and consider the binary representation

$$\begin{aligned} n=\sum _{i=0}^{\lfloor \log _2 n \rfloor } b_{i}\cdot 2^i \end{aligned}$$
(4)

of n, where \(b_i\in \{0,1\}\) for \(i\in [0,\lfloor \log _2 n \rfloor ]\). We denote by \(\nu (n)\) the number of 1’s in the binary representation of n, i.e.,

$$\nu (n)=\sum _{i=0}^{\lfloor \log _2 n \rfloor } b_{i}.$$

For example, we have \(11=1\cdot 2^3+0\cdot 2^2+1\cdot 2^1+1\cdot 2^0\) and thus \(b_0=b_1=b_3=1\), \(b_2=0\) and \(\nu (11)=3\).

Proposition 1

For \(n\ge 2\), let \(\mathbb A\) be the SLP produced by \(\mathsf {RePair}\) on input \(a^n\) and \(\mathbb B\) be the SLP produced by \(\mathsf {LongestMatch}\) on input \(a^n\). We have

$$|\mathbb A|=|\mathbb B|=2\lfloor \log _2 n \rfloor +\nu (n)-1.$$

Proof

If \(n=2\) or \(n=3\) (we only consider \(n\ge 2\)), then \(a^n\) has no maximal string and thus the final SLP of any global algorithm has a single rule \(S\rightarrow a^n\). The reader can easily verify the claimed result for those cases.

We assume \(n\ge 4\) in the following. Let \(m=\lfloor \log _2 n \rfloor -1\). We prove the claim for \(\mathsf {RePair}\) first; afterwards we proceed with \(\mathsf {LongestMatch}\). On input \(a^n\), \(\mathsf {RePair}\) runs for exactly m rounds and creates rules \(X_1\rightarrow aa\) and \(X_i\rightarrow X_{i-1}X_{i-1}\) for \(i\in [2,m]\), i.e., the nonterminal \(X_i\) produces the string \(a^{2^i}\). These rules have total size 2m. After these steps, the start rule is

$$S\rightarrow X_{m}X_{m}X_{m}^{b_{m}}X_{m-1}^{b_{m-1}}\cdots X_1^{b_1}a^{b_0},$$

where the \(b_i\)’s are the coefficients occurring in the binary representation of n, see Eq. (4). In other words, the symbol a only occurs in the start rule if the least significant bit \(b_0=1\), and the nonterminal \(X_i\) (\(i\in [1,m-1]\)) occurs in the start rule if and only if \(b_i=1\). Since \(\mathsf {RePair}\) only replaces words with at least two occurrences, the most significant bit \(b_{m+1}=1\) is represented by \(X_{m}X_{m}\). A third \(X_{m}\) occurs in the start rule if and only if \(b_m=1\). The size of the start rule is \(2+\sum _{i=0}^{m}b_i\). It follows that the total size of the SLP produced by \(\mathsf {RePair}\) on input \(a^n\) is \(2m+2+\sum _{i=0}^{m}b_i\), which together with \(m=\lfloor \log n\rfloor -1\) and \(b_{\lfloor \log n\rfloor }=1\) (the most significant bit is always 1) yields the claimed size.

Now we prove the same result for \(\mathsf {LongestMatch}\). In the first round, the chosen maximal string is \(a^{\lfloor n/2\rfloor }\), which yields rules \(X_1\rightarrow a^{\lfloor n/2\rfloor }\) and \(S\rightarrow X_1X_1a^{b_0}\), i.e., the symbol a occurs in the start rule if and only if n is odd and thus the least significant bit \(b_0=1\). Assuming \(n\ge 8\), this procedure is now repeated for the rule \(X_1\rightarrow a^{\lfloor n/2\rfloor }\) (for \(n<8\) there is no maximal string and the algorithm stops after the first round). This yields \(X_2 \rightarrow a^{\lfloor n/4\rfloor }\), \(X_1\rightarrow X_2X_2a^{b_{1}}\) and \(S\rightarrow X_1X_1a^{b_0}\) (note that \(\lfloor (\lfloor n/2\rfloor ) / 2\rfloor =\lfloor n/4\rfloor \)). After \(m=\lfloor \log n\rfloor -1\) steps, the iteration of that process results in the final SLP with rules \(S\rightarrow X_1X_1a^{b_0}\), \(X_i\rightarrow X_{i+1}X_{i+1}a^{b_i}\) for \(i\in [1,m-1]\) and \(X_m\rightarrow aaa^{b_m}\). The size of this SLP is \(2\cdot (m+1)+\sum _{i=0}^{m}b_i\), which directly implies the claimed result for \(\mathsf {LongestMatch}\).
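The rule shapes derived in the proof immediately give the claimed size; the following sketch (an added illustration that records only rule lengths, not the rules themselves) reconstructs \(\mathsf {RePair}\)'s grammar for \(a^n\) and checks the formula of Proposition 1:

```python
def nu(n):
    # number of 1s in the binary representation of n
    return bin(n).count("1")

def repair_unary_rules(n):
    """Rule lengths of the SLP produced by RePair on a^n (n >= 4):
    X_1 -> aa and X_i -> X_{i-1} X_{i-1} for i in [2, m] with
    m = floor(log2 n) - 1, plus the start rule built from the binary
    representation of n."""
    m = n.bit_length() - 2                    # floor(log2 n) - 1
    bits = [(n >> i) & 1 for i in range(n.bit_length())]   # b_0, b_1, ...
    rules = {f"X{i}": 2 for i in range(1, m + 1)}
    rules["S"] = 2 + sum(bits[:m + 1])        # X_m X_m plus one symbol per b_i = 1
    return rules

for n in range(4, 2000):
    size = sum(repair_unary_rules(n).values())
    assert size == 2 * (n.bit_length() - 1) + nu(n) - 1    # Proposition 1
```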

Using Proposition 1, we prove the matching bounds for \(\alpha _\mathsf {RePair}(1,n)\), \(\alpha _\mathsf {LM}(1,n)\):

Theorem 3

For all n, we have \(\alpha _\mathsf {RePair}(1,n)=\alpha _\mathsf {LM}(1,n)\le \log _2(3)\).

Proof

As a consequence of Proposition 1, \(\mathsf {RePair}\) and \(\mathsf {LongestMatch}\) produce on input \(a^n\) SLPs of size at most \(3\log _2 n\), since \(\nu (n)-1\le \log _2 n\). By Lemma 1, we have \(g(a^n)\ge 3\log _3 n -3\). The equality \(\log _2 n/\log _3 n = \log _2(3)\) finishes the proof.

Theorem 4

For infinitely many n, we have \(\alpha _\mathsf {RePair}(1,n)=\alpha _\mathsf {LM}(1,n)\ge \log _2(3)\).

Proof

Let \(w_k=a^{2^k-1}\). We have \(2^k-1=\sum _{i=0}^{k-1}2^i\) and thus \(\nu (2^k-1)=k\). By Proposition 1, the size of the SLPs produced by \(\mathsf {RePair}\) and \(\mathsf {LongestMatch}\) is \(3k-3\). By Lemma 1, we have

$$g(w_k)\le 3\log _3(2^k-1)+o(\log (2^k-1))\le 3\log _3(2)\cdot k+ o(k).$$

The equality \(1/\log _3(2)=\log _2(3)\) finishes the proof.

4 Future Work

The obvious question concerns the gap between the lower and upper bound for \(\mathsf {Greedy}\). First of all, it might be possible to improve our lower bound by finding a similar sequence such that \(\mathsf {Greedy}\) produces larger remainders in each round, but care has to be taken, since for larger remainders it is no longer true that the rules can be analyzed independently, because the rules could share factors of length greater than 1. Concerning the upper bound, we conjecture that \(\mathsf {Greedy}\) achieves logarithmic compression for all unary inputs and thus that the approximation ratio is constant, but the direct analysis of the algorithm as attempted in Theorem 2 misses a clear invariant for a non-constant number of rounds. For arbitrary alphabets, a non-constant lower bound for \(\mathsf {Greedy}\) as well as an improvement of the upper bound of \(\mathcal {O}((n/\log n)^{2/3})\) for global algorithms seem to be natural starting points for future work.