1 Introduction

Dictionary compression has proved to be an effective tool to exploit the repetitiveness that most of the fastest-growing datasets feature [24]. Lempel-Ziv (LZ77 for short) [23, 33] stands out as the most popular and effective compression method for repetitive texts. Further, it can be run in linear time and even in external memory [18]. LZ77 has the important drawback, however, that accessing random positions of the compressed text requires, essentially, decompressing it from the beginning. Therefore, it is not suitable for use as a compressed data structure that represents the text in little space while simulating direct access to it. Grammar compression [19] is an alternative that offers better guarantees in this sense. The aim is to build a small context-free grammar (or Straight-Line Program, SLP) that generates (only) the text. The smallest SLP generating a text is always larger than its LZ77 parse, but only by a logarithmic factor that is rarely reached in practice. With an SLP we can access any text substring with only an additive logarithmic time penalty [3, 5], which has led to the development of various self-indexes built on SLPs [4, 9, 12, 13, 15, 26]. Many other, richer queries on sequences have also been supported by associating summary information with the nonterminals of the SLP [1, 2, 5, 7, 10, 11]. There are applications in which SLPs are preferable to LZ77 for other reasons, as well; see, e.g., [22, 25].

Although finding the smallest SLP for a text is NP-complete [8, 28], there are several grammar construction algorithms that guarantee at most a logarithmic blowup over the LZ77 parse [8, 16, 17, 28, 29]. In practice, however, they are sharply outperformed by RePair [21], a heuristic that runs in linear time and obtains grammars of size very close to that of the LZ77 parse in most cases. This has made RePair the compressor of choice to build grammar-based compressed data structures [1, 7, 10, 11]. A serious problem with RePair, however, is that, despite running in linear time and space, in practice the constant of proportionality is high and it can be run only on inputs that are about one tenth of the available memory. This significantly hampers its applicability on large datasets.

In this paper we introduce a scalable SLP compression algorithm that uses space very close to that of RePair and can be applied on very large inputs. We prove a constant-factor approximation guarantee with respect to any SLP construction algorithm to which our technique is applied. Our experimental results show that we can compress a very repetitive 50 GB text in less than an hour, using less than 650 MB of RAM and obtaining very competitive compression ratios.

2 Preliminaries

For the sake of brevity, we assume the reader is familiar with SLPs, LZ77, and the links between the two. To prove theoretical bounds for our approach, we consider a variant of LZ77 in which if S[i..j] is a phrase then either \(i = j\) and S[i] is the first occurrence of a distinct character, or S[i..j] occurs in \(S [1..j - 1]\) and \(S [i..j + 1]\) does not occur in S[1..j]. We refer to this variant as LZSS due to its similarity to Storer and Szymanski’s version of LZ77 [30], even though they allow substrings to be stored as raw text and we do not.
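To make the definition concrete, the following is a naive reference implementation of this LZSS variant. It is only an illustrative sketch (it runs in roughly cubic time, whereas far more efficient constructions exist), and the function name and phrase representation are our own.

def lzss_parse(S):
    # Greedy parse following the definition above: each phrase S[i..j] is either
    # a single character that has not occurred before, or satisfies that S[i..j]
    # occurs in S[1..j-1] while S[i..j+1] does not occur in S[1..j].
    n, i, phrases = len(S), 0, []
    while i < n:
        ell = 0
        # extend the phrase while S[i..i+ell] (0-based) also occurs entirely
        # within the prefix ending just before the phrase's current last character
        while i + ell < n and S[i:i + ell + 1] in S[:i + ell]:
            ell += 1
        if ell == 0:
            phrases.append(('char', S[i]))            # first occurrence of this character
            i += 1
        else:
            src = S[:i + ell - 1].find(S[i:i + ell])  # position of an earlier occurrence
            phrases.append(('copy', src, ell))
            i += ell
    return phrases

# e.g. lzss_parse('abracadabra') yields 8 phrases, the last one copying 'abra' from position 0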

The best-known algorithm for building SLPs is probably RePair [21], for which there are many implementations (see [14] and references therein). It works by repeatedly finding the most common pair of symbols and replacing them with a new non-terminal. Although it is not known to have a good worst-case approximation ratio with respect to the size of the LZ77 parse, in practice it outperforms other constructions. RePair uses linear time and space, but the coefficient in the space bound is quite large, so the standard implementations are practical only on small inputs. A more recent and more space-economical alternative to RePair is SOLCA [31], which we will consider in Sect. 5.
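As an illustration of the pairing step just described, here is a naive sketch of RePair. It is quadratic and uses simplified pair counting; the real algorithm achieves linear time with heavy auxiliary structures, which is precisely where the large memory constant comes from. The symbol numbering and the output format are our own choices.

from collections import Counter

def naive_repair(seq, first_nonterminal=256):
    # seq is a list of integer symbols; returns (final sequence, rules), where
    # rules maps each new non-terminal to the pair of symbols it replaces.
    seq, rules, next_sym = list(seq), {}, first_nonterminal
    while True:
        pairs = Counter(zip(seq, seq[1:]))       # count adjacent pairs (overlaps included)
        if not pairs:
            break
        best, freq = pairs.most_common(1)[0]
        if freq < 2:                             # stop when no pair repeats
            break
        rules[next_sym] = best
        out, i = [], 0
        while i < len(seq):                      # replace left to right, non-overlapping
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(next_sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq, next_sym = out, next_sym + 1
    return seq, rules

# e.g. naive_repair(list(b"abababab")) returns ([257, 257], {256: (97, 98), 257: (256, 256)})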

Context-triggered piecewise hashing (CTPH) is a technique for parsing strings into blocks such that long repeated substrings are parsed the same way (except possibly at the beginning or end of the substrings). The name CTPH seems to be due to Kornblum [20], but the ideas go back to Tridgell’s Rsync [32] and Spamsum (https://www.samba.org/ftp/unpacked/junkcode/spamsum/README): “The core of the spamsum algorithm is a rolling hash similar to the rolling hash used in ‘rsync’. The rolling hash is used to produce a series of ‘reset points’ in the plaintext that depend only on the immediate context (with a default context width of seven characters) and not on the earlier or later parts of the plaintext.”

Specifically, in this paper we choose a rolling hash function and a threshold p, run a sliding window of fixed size w over S and end the current block whenever the window contains a triggering substring, which is a substring of length w whose hash is congruent to 0 modulo p. When we end a block, we shift the window ahead w characters so all the blocks are disjoint and form a parse, which we call the Rsync parse. We call the set of distinct blocks the Rsync dictionary: if the input text contains many repetitions, we expect the dictionary to be much smaller than the text.
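The sketch below illustrates this parsing scheme with a simple Karp-Rabin-style rolling hash. The hash function, base, modulus, and the default values of w and p are illustrative assumptions only, not the choices made by spamsum or by our implementation.

def rsync_parse(S, w=7, p=64, base=256, mod=(1 << 61) - 1):
    # Slide a window of fixed size w over S with a rolling hash; a window whose
    # hash is congruent to 0 modulo p is a triggering substring and ends the
    # current block. The window is then restarted, so the next trigger can occur
    # only w or more characters later and consecutive blocks are disjoint.
    bw = pow(base, w, mod)              # base^w, used to drop the outgoing character
    blocks, start = [], 0
    h, filled = 0, 0                    # hash and number of characters in the window
    for i in range(len(S)):
        h = (h * base + ord(S[i])) % mod
        filled += 1
        if filled > w:                  # remove the character that just left the window
            h = (h - ord(S[i - w]) * bw) % mod
            filled = w
        if filled == w and h % p == 0:  # triggering substring: end the block at i
            blocks.append(S[start:i + 1])
            start = i + 1
            h, filled = 0, 0
    if start < len(S):                  # the last block need not end with a trigger
        blocks.append(S[start:])
    dictionary, parse = {}, []          # Rsync dictionary (distinct blocks) and parse
    for b in blocks:
        parse.append(dictionary.setdefault(b, len(dictionary)))
    return dictionary, parse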

3 Algorithms

Given a string S, we can use Rsync parsing to help build an SLP for S with Algorithm 1 (“Rpair”). The final SLP can be viewed as first generating the parse, then replacing each block ID in the parse by the sublist of non-terminals that generates that block, and finally replacing the sublists by the blocks themselves.

Algorithm 1. Rpair (pseudocode figure)
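A quick way to see the parse/dictionary view described above is to expand the parse back into the text. The snippet below (reusing the rsync_parse sketch from Sect. 2) is only a sanity check of that view; in Algorithm 1 itself the expansion of block IDs happens inside the SLP, via one non-terminal per distinct block.

def expand_parse(dictionary, parse):
    # Invert the block -> id map and replace every block id by its block.
    id_to_block = {bid: block for block, bid in dictionary.items()}
    return ''.join(id_to_block[bid] for bid in parse)

# round-trip check on a repetitive string
S = ('abracadabra' * 1000) + 'x' + ('abracadabra' * 1000)
dictionary, parse = rsync_parse(S)
assert expand_parse(dictionary, parse) == S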

Since each separator character appears only once in D and its parse tree, any non-terminal whose expansion includes a separator character also appears only once and is deleted. Since the parse tree of an SLP is binary and each non-terminal we delete appears only once, the number of distinct non-terminals we delete is at least the length of the list of non-terminals at the roots of the maximal remaining subtrees of the parse tree, minus one. Therefore, creating rules to generate the sublists does not cause the number of distinct non-terminals to grow to more than the number in the original SLP for D, plus one.

Algorithm 1 works with any algorithm for building SLPs for D and P. In Sect. 4 we show that, if we choose an algorithm that builds SLPs for D and P at most an \(\alpha \)-factor larger than their LZ77 parses, then we obtain an SLP an \(O (\alpha )\)-factor larger than the LZ77 parse of S. In the process we will refer to Algorithm 2 (“Rparse”), which produces an LZSS-like parse of S but is intended only to simplify our analysis of Algorithm 1 (not to compete with cutting-edge LZ-based compressors). By “LZSS-like” we mean a parse in which each phrase is either a single character that has not occurred before, or a copy of an earlier substring. We note in passing that, if the parse in Step 3 is still too big for a normal construction, then we can apply Algorithm 1 to it. We will show in the full version of this paper that, if we recurse only a constant number of times, then we worsen our compression bounds by only a constant factor.

Algorithm 2. Rparse (pseudocode figure)

4 Analysis

The main advantage of using Rsync parsing to preprocess S is that Rsync parsing is quite easy to parallelize, apply over streamed data, or apply in external memory. The resulting dictionary and parse may be significantly smaller than S, making it easier to apply grammar-based compression. In the full version of this paper we will analyze how much time and workspace Algorithms 1 and 2 use in terms of the total size of the dictionary and parse, but for now we are mainly concerned with the quality of the compression.

Let b be the number of distinct blocks in the Rsync parse of S, and let z be the number of phrases in the LZ77 parse of S. The first block is obviously the first occurrence of that substring and if S[i..j] is the first occurrence of another block, then \(S [i - w..j]\) (i.e., the block extended backward to include the previous triggering substring) is the first occurrence of that substring. Since the first occurrence of any non-empty substring overlaps or ends at a phrase boundary in the LZ77 parse, we can charge S[i..j] to such a boundary in \(S [i - w..j]\). Since blocks have length at least w and overlap by only w characters when extended backwards, each boundary has the first occurrences of at most two blocks charged to it, so \(b = O (z)\).

In Step 5 of Algorithm 2, we discard O(b) of the phrases in the LZSS parses of D and P when mapping to the phrases in the LZSS-like parse of S. Therefore, by showing that the number of phrases in the LZSS-like parse of S is O(z), we show that the total number of phrases in the LZSS parses of D and P is also \(O (z + b) = O (z)\), so the total number of phrases in their LZ77 parses is O(z) as well.

Lemma 1

If the t-th phrase in the LZSS parse of S is \(S [j..j + \ell - 1]\) then the 5t-th phrase resulting from Algorithm 2, if it exists, ends at or after \(S [j + \ell - 1]\).

Proof

Our claim is trivially true for \(t = 1\), since the first phrases in both parses are the single character S[1], so let t be greater than 1 and assume our claim is true for \(t - 1\), meaning the \(5 (t - 1)\)-th phrase in our parse ends at \(S [k - 1]\) with \(k \ge j\). If \(k \ge j + \ell \) then our claim is also trivially true for t, so assume \(j \le k < j + \ell \). We must show that our parse divides \(S [k..j + \ell - 1]\) into at most five phrases, in order to prove our claim for t.

First suppose that \(S [k..j + \ell - 1]\) does not completely contain a triggering substring, so it overlaps at most two blocks. (It can overlap two blocks without containing a triggering substring if and only if a prefix of length less than w lies in one block and the rest lies in the next block.) Let \(S [i..i + \ell - 1]\) be \(S [j..j + \ell - 1]\)’s source and let \(k' = i + k - j\), so in the LZSS parse \(S [k..j + \ell - 1]\) is copied from \(S [k'..i + \ell - 1]\). Since \(S [k'..i + \ell - 1]\) does not completely contain a triggering substring either, it too overlaps at most two blocks.

Without loss of generality (since the other cases are easier), assume \(S [k..j + \ell - 1]\) and \(S [k'..i + \ell - 1]\) each overlap two blocks and they are split differently: \(S [k..k + d - 1]\) lies in one block and \(S [k + d..j + \ell - 1]\) lies in the next, and \(S [k'..k' + d' - 1]\) lies in one block and \(S [k' + d'..i + \ell - 1]\) in the next, with \(d \ne d'\). Assume also that \(d < d'\), since the other case is symmetric. Since \(S [k..k + d - 1]\) is completely contained in a block and occurs earlier completely contained in a block, as \(S [k'..k' + d - 1]\), our parse does not divide it. Similarly, since \(S [k + d..k + d' - 1]\) and \(S [k + d'..j + \ell - 1]\) are each completely contained in a block and occur earlier each completely contained in a block, as \(S [k' + d..k' + d' - 1]\) and \(S [k' + d'..i + \ell - 1]\), respectively, our parse does not divide them. Therefore, our parse divides \(S [k..j + \ell - 1]\) into at most three phrases.

Now suppose the first and last triggering substrings completely contained in \(S [k..j + \ell - 1]\) are \(S [x..x + w - 1]\) and \(S [y..y + w - 1]\) (possibly with \(x = y\)). By the arguments above, our parse divides \(S [k..x + w - 1]\) into at most three phrases. Since \(S [x + w..y + w - 1]\) is a sequence of complete blocks that have occurred earlier (in \(S [k'..i + \ell - 1]\)), our parse does not divide it unless \(S [k..x + w - 1]\) is a complete block that has occurred before as a complete block, in which case it may divide \(S [k..y + w - 1]\) once between \(S [x + w]\) and \(S [y + w - 1]\). Since \(S [y + w..j + \ell - 1]\) is completely contained in a block and occurs earlier completely contained in a block (in \(S [k'..i + \ell - 1]\)), our parse does not divide it. Therefore, our parse divides \(S [k..j + \ell - 1]\) into at most five phrases.    \(\square \)

We note that we can quite easily reduce the five in Lemma 1, at the cost of complicating our algorithm slightly. We leave a detailed analysis for the full version of this paper.

Corollary 1

Algorithm 2 yields an LZSS-like parse of S with at most five times as many phrases as its LZSS parse.

Proof

If the LZSS parse has t phrases then the t-th phrase ends at S[n] so, by Lemma 1, Algorithm 2 yields a parse with at most 5t phrases.    \(\square \)

Theorem 1

Algorithm 2 yields an LZSS-like parse of S with O(z) phrases.

Proof

It is well known that the LZSS parse of S has at most twice as many phrases as its LZ77 parse (since dividing each LZ77 phrase into a prefix with an earlier occurrence and a mismatch character yields an LZSS-like parse with at most twice as many phrases, and the LZSS parse has the fewest phrases of any LZSS-like parse). Therefore, by Corollary 1, Algorithm 2 yields a parse with O(z) phrases.    \(\square \)

Corollary 2

The LZ77 parses of D and P have O(z) phrases.

Proof

Immediate, from Theorem 1, the fact that the LZ77 parse is no larger than the LZSS parse, and inspection of Algorithm 1.    \(\square \)

Let A be any algorithm that builds an SLP at most an \(\alpha \)-factor larger than the LZ77 parse of its input. For example, with Rytter’s construction [28] we have \(\alpha = O (\log (|S| / z))\).

By Corollary 2, applying A to D—Step 2b in Algorithm 1—yields an SLP for D with \(O (\alpha z)\) rules. As explained in Sect. 3, Steps 2c to 2g then increase the number of rules by at most one while modifying the SLP such that, for each block in the dictionary, there is a non-terminal whose expansion is that block.

Similarly, applying A to P—Step 3—yields an SLP for P with \(O (\alpha z)\) rules. Replacing the terminals in the SLP by the non-terminals generating the blocks and then combining the two SLPs—Steps 4 and 5—yields an SLP for S with \(O (\alpha z)\) rules. This gives us our main result of this section:

Theorem 2

Using A in Steps 2b and 3 of Algorithm 1 yields an SLP for S with \(O (\alpha z)\) rules.

5 Experiments

We use two genome collections in our experiments: cN consists of N concatenated variants of the human chromosome chr19, of about 59 MB each; sN consists of N concatenated variants of salmonella genomes, of widely different sizes.

The chr19 collection was downloaded from the 1000 Genomes Project. Each chr19 sequence was derived by using the bcftools consensus tool to combine the haplotype-specific (maternal or paternal) variant calls for an individual with the chr19 sequence in the GRCh37 human reference. The salmonella genomes were downloaded from NCBI (BioProject PRJNA183844) and preprocessed by assembling each individual sample with IDBA-UD [27], setting kMaxShortSequence to 1024 per public advice from the author to accommodate the longer paired-end reads that modern sequencers produce. More details of the collections are available in previous work [6, Sec. 4].

We compare two grammar compressors: RePair [21] produces the best known compression ratios but uses a lot of main memory, whereas SOLCA [31] aims at optimizing main memory usage. Their versions combined with parallelized CTPH parsing are BigRePair and BigSOLCA. RePair could be run only on the smaller collections. Our experiments ran on an Intel(R) i7-4770 @ 3.40 GHz machine with 32 GB of memory using 8 threads; currently only the CTPH parsing takes advantage of the multiple threads.

For RePair we use Navarro’s implementation for large files, at http://www.dcc.uchile.cl/gnavarro/software/repair.tgz, letting it use 10 GB of main memory, whereas the implementation of SOLCA is at https://github.com/tkbtkysms/solca. To measure their compression ratios in a uniform way, we consider the following encodings of their output: if RePair produces r (binary) rules and an initial rule of length c, we account 2r bits to encode the topology of the pruned parse tree (where the nonterminal ids become the preorder ranks of their internal nodes in this tree) and \((r+c) \lceil \log _2 r \rceil \) bits to encode the leaves of the tree and the initial rule. SOLCA is similar, with \(c=1\). Our code is available at https://gitlab.com/manzai/bigrepair.
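For concreteness, the accounting just described can be computed as follows. The helper names and the sample figures are ours, chosen only to show the units used in Table 1.

import math

def accounted_bits(r, c):
    # 2r bits for the pruned parse-tree topology plus (r + c) * ceil(log2 r)
    # bits for the tree leaves and the initial rule, as described above
    return 2 * r + (r + c) * math.ceil(math.log2(r))

def compression_ratio_percent(r, c, n):
    # compressed size as a percentage of the n-byte uncompressed input
    return 100 * accounted_bits(r, c) / (8 * n)

# hypothetical example: 10^6 rules and an initial rule of length 10^5
# on a 1 GB input give a compression ratio of 0.3%
print(compression_ratio_percent(10**6, 10**5, 10**9))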

Table 1. Performance of the compressors. File sizes are expressed in GB, compression ratios as a percentage of the compressed over the uncompressed file size, compression times in seconds per input GB, and main memory usage during compression in MB per input GB.

Table 1 shows the results in terms of compression ratio, time, and space in RAM. On the more repetitive chr19 genomes, BigRePair is clearly the best choice for large files. It loses to RePair in compression ratio, but RePair takes 11 h just to process 5.5 GB, so it is not an option for larger files. Instead, BigRePair processes 55 GB in about 20 min and 6.5 GB of RAM. Similarly, SOLCA obtains better compression than BigSOLCA but takes more compression time, though the latter uses more space. Comparing the two compressor families shows that BigRePair outperforms both SOLCA and BigSOLCA in both compression ratio (reaching nearly half the compressed size of SOLCA on the largest files) and time (half the time of BigSOLCA). Still, SOLCA uses much less space: it compresses 55 GB in 3.6 h, but using less than 750 MB.

The results start similarly on the less compressible salmonella collection, but, as the size of the input grows, there are significant differences. The time of BigRePair on chr19 was stable at around 2 GB per minute, but on salmonella it is not: when moving from 10 GB to 20 GB of input data, the time per processed GB of BigRePair jumps by a factor of 3.6, and when moving from 20 GB to 50 GB it jumps by a factor of more than 10. To process the largest 53 GB file, BigRePair requires more than 37 h and over 15 GB of RAM. SOLCA, instead, handles this file in nearly 9 h and less than 5 GB, and BigSOLCA in less than 2 h and 11 GB, being the fastest. What happens is that, since this collection is less compressible, the output of the CTPH parsing is still too large for RePair, which slows down drastically as soon as it cannot fit its structures in main memory. The much lower memory footprint of SOLCA, instead, pays off on these large and less compressible files, though its compression ratio is worse than that of BigRePair. In the full version of this paper we will investigate applying BigRePair and BigSOLCA recursively, following the strategy mentioned at the end of Sect. 3.