Abstract
Data compression is a powerful tool for managing massive but repetitive datasets, especially schemes such as grammar-based compression that support computation over the data without decompressing it. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is essentially useless unless it can be built on the original dataset reasonably quickly while keeping the dataset on disk. In this paper we show how we can preprocess such datasets with context-triggered piecewise hashing such that afterwards we can apply RePair and other grammar-based compressors more easily. We first give our algorithm, then show how a variant of it can be used to approximate the LZ77 parse, then leverage that to prove theoretical bounds on compression, and finally give experimental evidence that our approach is competitive in practice.
Partially funded with Basal Funds FB0001, Conicyt, Chile. Partially funded by JST CREST Grant Number JPMJCR1402 (TI, HS, YT), KAKENHI Grant Numbers 19K20213 (TI), 17H01791 (HS), 18K18111 (YT). Partially funded by PRIN Grant Number 2017WR7SHH and by the LSBC_19-21 Project from the University of Eastern Piedmont (GM).
1 Introduction
Dictionary compression has proved to be an effective tool to exploit the repetitiveness that most of the fastest-growing datasets feature [24]. Lempel-Ziv (LZ77 for short) [23, 33] stands out as the most popular and effective compression method for repetitive texts. Further, it can be run in linear time and even in external memory [18]. LZ77 has the important drawback, however, that accessing random positions of the compressed text requires, essentially, to decompress it from the beginning. Therefore, it is not suitable to be used as a compressed data structure that represents the text in little space while simulating direct access to it. Grammar compression [19] is an alternative that offers better guarantees in this sense. The aim is to build a small context-free grammar (or Straight-Line Program, SLP) that generates (only) the text. The smallest SLP generating a text is always larger than its LZ77 parse, but only by a logarithmic factor that is rarely reached in practice. With an SLP we can access any text substring with only an additive logarithmic time penalty [3, 5], which has led to the development of various self-indexes building on SLPs [4, 9, 12, 13, 15, 26]. Many other richer queries on sequences have also been supported by associating summary information with the nonterminals of the SLP [1, 2, 5, 7, 10, 11]. There are applications in which SLPs are preferable to LZ77 for other reasons, as well; see, e.g., [22, 25].
Although finding the smallest SLP for a text is NP-complete [8, 28], there are several grammar construction algorithms that guarantee at most a logarithmic blowup on the LZ77 parse [8, 16, 17, 28, 29]. In practice, however, they are sharply outperformed by RePair [21], a heuristic that runs in linear time and obtains grammars of size very close to that of the LZ77 parse in most cases. This has made RePair the compressor of choice to build grammar-based compressed data structures [1, 7, 10, 11]. A serious problem with RePair, however, is that, despite running in linear time and space, in practice its constant of proportionality is high, so it can be run only on inputs that are about one tenth of the available memory. This significantly hampers its applicability to large datasets.
In this paper we introduce a scalable SLP compression algorithm that uses space very close to that of RePair and can be applied on very large inputs. We prove a constant-factor approximation guarantee with respect to any SLP construction algorithm to which our technique is applied. Our experimental results show that we can compress a very repetitive 50 GB text in less than an hour, using less than 650 MB of RAM and obtaining very competitive compression ratios.
2 Preliminaries
For the sake of brevity, we assume the reader is familiar with SLPs, LZ77, and the links between the two. To prove theoretical bounds for our approach, we consider a variant of LZ77 in which if S[i..j] is a phrase then either \(i = j\) and S[i] is the first occurrence of a distinct character, or S[i..j] occurs in \(S [1..j - 1]\) and \(S [i..j + 1]\) does not occur in S[1..j]. We refer to this variant as LZSS due to its similarity to Storer and Szymanski’s version of LZ77 [30], even though they allow substrings to be stored as raw text and we do not.
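For concreteness, this LZSS variant can be computed naively as follows. This is a quadratic-time sketch for illustration only (practical parsers use suffix-array or index-based methods); note that the earlier occurrence of a phrase may overlap the phrase itself.

```python
def lzss_parse(s: str):
    # LZSS variant described above: each phrase S[i..j] is either the first
    # occurrence of a distinct character, or the longest substring starting
    # at i that also occurs entirely inside S[1..j-1] (so the copy starts
    # strictly before i but may overlap the phrase).
    phrases, i, n = [], 0, len(s)
    while i < n:
        ell = 0
        # grow the phrase while s[i:i+ell+1] still occurs inside s[:i+ell]
        while i + ell < n and s.find(s[i:i + ell + 1], 0, i + ell) != -1:
            ell += 1
        if ell == 0:
            phrases.append(('char', s[i]))          # new distinct character
            i += 1
        else:
            src = s.find(s[i:i + ell], 0, i + ell - 1)
            phrases.append(('copy', src, ell))      # copy of s[src:src+ell]
            i += ell
    return phrases
```

Decoding proceeds left to right, copying one character at a time so that self-overlapping sources (as in a run of equal characters) expand correctly.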
The best-known algorithm for building SLPs is probably RePair [21], for which there are many implementations (see [14] and references therein). It works by repeatedly finding the most common pair of symbols and replacing them with a new non-terminal. Although it is not known to have a good worst-case approximation ratio with respect to the size of LZ77 parsing, in practice it outperforms other constructions. RePair uses linear time and space but the coefficient in the space bound is quite large and so the standard implementations are practical only on small inputs. A more recent and more space economical alternative to RePair is SOLCA [31] that we will consider in Sect. 5.
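RePair's core loop can be sketched in a few lines. This is only a quadratic-time illustration of the pairing idea, not the linear-time priority-queue implementation of [21]:

```python
from collections import Counter

def repair(seq):
    # Repeatedly replace the most frequent adjacent pair of symbols with a
    # fresh non-terminal until no pair occurs twice. Terminals are byte
    # values; non-terminals get codes above the byte alphabet.
    rules, next_nt = {}, 256
    seq = [ord(c) if isinstance(c, str) else c for c in seq]
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        rules[next_nt] = pair
        out, i = [], 0
        while i < len(seq):          # replace left to right, non-overlapping
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq, next_nt = out, next_nt + 1
    return seq, rules
```

The returned pair is the final (initial-rule) sequence together with the binary rules created along the way.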
Context-triggered piecewise hashing (CTPH) is a technique for parsing strings into blocks such that long repeated substrings are parsed the same way (except possibly at the beginning or end of the substrings). The name CTPH seems to be due to Kornblum [20] but the ideas go back to Tridgell’s Rsync [32] and Spamsum (https://www.samba.org/ftp/unpacked/junkcode/spamsum/README): “The core of the spamsum algorithm is a rolling hash similar to the rolling hash used in ‘rsync’. The rolling hash is used to produce a series of ’reset points’ in the plaintext that depend only on the immediate context (with a default context width of seven characters) and not on the earlier or later parts of the plaintext.”
Specifically, in this paper we choose a rolling hash function and a threshold p, run a sliding window of fixed size w over S and end the current block whenever the window contains a triggering substring, which is a substring of length w whose hash is congruent to 0 modulo p. When we end a block, we shift the window ahead w characters so all the blocks are disjoint and form a parse, which we call the Rsync parse. We call the set of distinct blocks the Rsync dictionary: if the input text contains many repetitions, we expect the dictionary to be much smaller than the text.
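A minimal sketch of this parsing, using a Karp-Rabin rolling hash. The window size w, threshold p, and hash constants below are illustrative choices, not those of our actual implementation:

```python
def rsync_parse(s, w=4, p=8, base=256, mod=(1 << 31) - 1):
    # A window s[i-w+1..i] is "triggering" when its Karp-Rabin hash is
    # 0 mod p; the current block then ends at position i and the window
    # restarts, so consecutive triggers are at least w characters apart
    # and the blocks partition s.
    blocks, start = [], 0
    h, filled = 0, 0                   # rolling hash and window fill count
    drop = pow(base, w - 1, mod)       # weight of the oldest window character
    for i, c in enumerate(s):
        if filled == w:                # slide: remove the oldest character
            h = (h - ord(s[i - w]) * drop) % mod
            filled -= 1
        h = (h * base + ord(c)) % mod
        filled += 1
        if filled == w and h % p == 0:
            blocks.append(s[start:i + 1])
            start = i + 1
            h, filled = 0, 0           # shift the window past the trigger
    if start < len(s):
        blocks.append(s[start:])       # last block may end without a trigger
    return blocks
```

The Rsync dictionary is then the set of distinct blocks, e.g. `list(dict.fromkeys(rsync_parse(s)))`, and the Rsync parse is the sequence of their IDs.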
3 Algorithms
Given a string S, we can use Rsync parsing to help build an SLP for S with Algorithm 1 (“Rpair”). The final SLP can be viewed as first generating the parse, then replacing each block ID in the parse by the sublist of non-terminals that generate each block, and finally replacing the sublists by the blocks themselves.
Since each separator character appears only once in D and its parse tree, any non-terminal whose expansion includes a separator character also appears only once and is deleted. Since the parse tree of an SLP is binary and each non-terminal we delete appears only once, the number of distinct non-terminals we delete is at least the length of the list of non-terminals at the roots of the maximal remaining subtrees of the parse tree, minus one. Therefore, creating rules to generate the sublists does not cause the number of distinct non-terminals to grow to more than the number in the original SLP for D, plus one.
Algorithm 1 works with any algorithm for building SLPs for D and P. In Sect. 4 we show that, if we choose an algorithm that builds SLPs for D and P at most an \(\alpha \)-factor larger than their LZ77 parses, then we obtain an SLP an \(O (\alpha )\)-factor larger than the LZ77 parse of S. In the process we will refer to Algorithm 2 (“Rparse”), which produces an LZSS-like parse of S but is intended only to simplify our analysis of Algorithm 1 (not to compete with cutting-edge LZ-based compressors). By “LZSS-like” we mean a parse in which each phrase is either a single character that has not occurred before, or a copy of an earlier substring. We note in passing that, if the parse in Step 3 is still too big for a normal construction, then we can apply Algorithm 1 to it. We will show in the full version of this paper that, if we recurse only a constant number of times, then we worsen our compression bounds by only a constant factor.
4 Analysis
The main advantage of using Rsync parsing to preprocess S is that Rsync parsing is quite easy to parallelize, apply over streamed data, or apply in external memory. The resulting dictionary and parse may be significantly smaller than S, making it easier to apply grammar-based compression. In the full version of this paper we will analyze how much time and workspace Algorithms 1 and 2 use in terms of the total size of the dictionary and parse, but for now we are mainly concerned with the quality of the compression.
Let b be the number of distinct blocks in the Rsync parse of S, and let z be the number of phrases in the LZ77 parse of S. The first block is obviously the first occurrence of that substring and if S[i..j] is the first occurrence of another block, then \(S [i - w..j]\) (i.e., the block extended backward to include the previous triggering substring) is the first occurrence of that substring. Since the first occurrence of any non-empty substring overlaps or ends at a phrase boundary in the LZ77 parse, we can charge S[i..j] to such a boundary in \(S [i - w..j]\). Since blocks have length at least w and overlap by only w characters when extended backwards, each boundary has the first occurrences of at most two blocks charged to it, so \(b = O (z)\).
In Step 5 of Algorithm 2, we discard O(b) of the phrases in the LZSS parses of D and P when mapping to the phrases in the LZSS-like parse of S. Therefore, by showing that the number of phrases in the LZSS-like parse of S is O(z), we show that the total number of phrases in the LZSS parses of D and P is also \(O (z + b) = O (z)\), so the total number of phrases in their LZ77 parses is O(z) as well.
Lemma 1
If the t-th phrase in the LZSS parse of S is \(S [j..j + \ell - 1]\) then the 5t-th phrase resulting from Algorithm 2, if it exists, ends at or after \(S [j + \ell - 1]\).
Proof
Our claim is trivially true for \(t = 1\), since the first phrases in both parses are the single character S[1], so let t be greater than 1 and assume our claim is true for \(t - 1\), meaning the \(5 (t - 1)\)st phrase in our parse ends at \(S [k - 1]\) with \(k \ge j\). If \(k \ge j + \ell \) then our claim is also trivially true for t, so assume \(j \le k < j + \ell \). We must show that our parse divides \(S [k..j + \ell - 1]\) into at most five phrases, in order to prove our claim for t.
First suppose that \(S [k..j + \ell - 1]\) does not completely contain a triggering substring, so it overlaps at most two blocks. (It can overlap two blocks without containing a triggering substring if and only if a prefix of length less than w lies in one block and the rest lies in the next block.) Let \(S [i..i + \ell - 1]\) be \(S [j..j + \ell - 1]\)’s source and let \(k' = i + k - j\), so in the LZSS parse \(S [k..j + \ell - 1]\) is copied from \(S [k'..i + \ell - 1]\). Since \(S [k'..i + \ell - 1]\) does not completely contain a triggering substring either, it too overlaps at most two blocks.
Without loss of generality (since the other cases are easier), assume \(S [k..j + \ell - 1]\) and \(S [k'..i + \ell - 1]\) each overlap two blocks and they are split differently: \(S [k..k + d - 1]\) lies in one block and \(S [k + d..j + \ell - 1]\) lies in the next, and \(S [k'..k' + d' - 1]\) lies in one block and \(S [k' + d'..i + \ell - 1]\) in the next, with \(d \ne d'\). Assume also that \(d < d'\), since the other case is symmetric. Since \(S [k..k + d - 1]\) is completely contained in a block and occurs earlier completely contained in a block, as \(S [k'..k' + d - 1]\), our parse does not divide it. Similarly, since \(S [k + d..k + d' - 1]\) and \(S [k + d'..j + \ell - 1]\) are each completely contained in a block and occur earlier each completely contained in a block, as \(S [k' + d..k' + d' - 1]\) and \(S [k' + d'..i + \ell - 1]\), respectively, our parse does not divide them. Therefore, our parse divides \(S [k..j + \ell - 1]\) into at most three phrases.
Now suppose the first and last triggering substrings completely contained in \(S [k..j + \ell - 1]\) are \(S [x..x + w - 1]\) and \(S [y..y + w - 1]\) (possibly with \(x = y\)). By the arguments above, our parse divides \(S [k..x + w - 1]\) into at most three phrases. Since \(S [x + w..y + w - 1]\) is a sequence of complete blocks that have occurred earlier (in \(S [k'..i + \ell - 1]\)), our parse does not divide it unless \(S [k..x + w - 1]\) is a complete block that has occurred before as a complete block, in which case it may divide \(S [k..y + w - 1]\) once between \(S [x + w]\) and \(S [y + w - 1]\). Since \(S [y + w..j + \ell - 1]\) is completely contained in a block and occurs earlier completely contained in a block (in \(S [k'..i + \ell - 1]\)), our parse does not divide it. Therefore, our parse divides \(S [k..j + \ell - 1]\) into at most five phrases. \(\square \)
We note that we can quite easily reduce the five in Lemma 1, at the cost of complicating our algorithm slightly. We leave a detailed analysis for the full version of this paper.
Corollary 1
Algorithm 2 yields an LZSS-like parse of S with at most five times as many phrases as its LZSS parse.
Proof
If the LZSS parse has t phrases then the t-th phrase ends at S[n] so, by Lemma 1, Algorithm 2 yields a parse with at most 5t phrases. \(\square \)
Theorem 1
Algorithm 2 yields an LZSS-like parse of S with O(z) phrases.
Proof
It is well known that the LZSS parse of S has at most twice as many phrases as its LZ77 parse (since dividing each LZ77 phrase into a prefix with an earlier occurrence and a mismatch character yields an LZSS-like parse with at most twice as many phrases, and the LZSS parse has the fewest phrases of any LZSS-like parse). Therefore, by Corollary 1, Algorithm 2 yields a parse with O(z) phrases. \(\square \)
Corollary 2
The LZ77 parses of D and P have O(z) phrases.
Proof
Immediate, from Theorem 1, the fact that the LZ77 parse is no larger than the LZSS parse, and inspection of Algorithm 1. \(\square \)
Let A be any algorithm that builds an SLP at most an \(\alpha \)-factor larger than the LZ77 parse of its input. For example, with Rytter’s construction [28] we have \(\alpha = O (\log (|S| / z))\).
By Corollary 2, applying A to D—Step 2b in Algorithm 1—yields an SLP for D with \(O (\alpha z)\) rules. As explained in Sect. 3, Steps 2c to 2g then increase the number of rules by at most one while modifying the SLP such that, for each block in the dictionary, there is a non-terminal whose expansion is that block.
Similarly, applying A to P—Step 3—yields an SLP for P with \(O (\alpha z)\) rules. Replacing the terminals in the SLP by the non-terminals generating the blocks and then combining the two SLPs—Steps 4 and 5—yields an SLP for S with \(O (\alpha z)\) rules. This gives us our main result of this section:
Theorem 2
Using A in Steps 2b and 3 of Algorithm 1 yields an SLP for S with \(O (\alpha z)\) rules.
5 Experiments
We use two genome collections in our experiments: cN consists of N concatenated variants of the human chromosome chr19, of about 59 MB each; sN consists of N concatenated variants of salmonella genomes, of widely different sizes.
The chr19 collection was downloaded from the 1000 Genomes Project. Each chr19 sequence was derived by using the bcftools consensus tool to combine the haplotype-specific (maternal or paternal) variant calls for an individual with the chr19 sequence in the GRCh37 human reference. The salmonella genomes were downloaded from NCBI (BioProject PRJNA183844) and preprocessed by assembling each individual sample with IDBA-UD [27] setting kMaxShortSequence to 1024 per public advice from the author to accommodate the longer paired end reads that modern sequencers produce. More details of the collections are available in previous work [6, Sec. 4].
We compare two grammar compressors: RePair [21] produces the best known compression ratios but uses a lot of main memory, whereas SOLCA [31] aims at optimizing main memory usage. Their versions combined with parallelized CTPH parsing are BigRePair and BigSOLCA. RePair could be run only on the smaller collections. Our experiments ran on an Intel(R) i7-4770 @ 3.40 GHz machine with 32 GB of memory using 8 threads; currently only the CTPH parsing takes advantage of the multiple threads.
For RePair we use Navarro’s implementation for large files, at http://www.dcc.uchile.cl/gnavarro/software/repair.tgz, letting it use 10 GB of main memory, whereas the implementation of SOLCA is at https://github.com/tkbtkysms/solca. To measure their compression ratios in a uniform way, we consider the following encodings of their output: if RePair produces r (binary) rules and an initial rule of length c, we account 2r bits to encode the topology of the pruned parse tree (where the nonterminal ids become the preorder of their internal node in this tree) and \((r+c) \lceil \log _2 r \rceil \) bits to encode the leaves of the tree and the initial rule. SOLCA is similar, with \(c=1\). Our code is available at https://gitlab.com/manzai/bigrepair.
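This accounting can be expressed as a small helper; `encoded_bits` is just a hypothetical name for illustration:

```python
import math

def encoded_bits(r: int, c: int) -> int:
    # 2r bits for the topology of the pruned parse tree of r binary rules,
    # plus (r + c) * ceil(log2 r) bits for the tree leaves and the initial
    # rule of length c (for SOLCA, c = 1).
    return 2 * r + (r + c) * math.ceil(math.log2(r))
```

The compression ratio is then this quantity relative to the size of the input.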
Table 1 shows the results in terms of compression ratio, time, and space in RAM. On the more repetitive chr19 genomes, BigRePair is clearly the best choice for large files. It loses to RePair in compression ratio, but RePair takes 11 h just to process 5.5 GB, so it is not an option for larger files. Instead, BigRePair processes 55 GB in about 20 min using 6.5 GB of RAM. Similarly, SOLCA obtains better compression than BigSOLCA but takes longer, though the latter uses more space. The comparison between the two compressors shows that BigRePair performs better than both SOLCA and BigSOLCA in both compression ratio (reaching nearly half the compressed size of SOLCA on the largest files) and time (half the time of BigSOLCA). Still, SOLCA uses much less space: it compresses 55 GB in 3.6 h, but using less than 750 MB.
The results start similarly on the less compressible salmonella collection, but, as the size of the input grows, there are significant differences. The time of BigRePair on chr19 was stable at around 2 GB per minute, but on salmonella it is not: when moving from 10 GB to 20 GB of input data, the time per processed GB of BigRePair jumps by a factor of 3.6, and when moving from 20 GB to 50 GB it jumps by more than 10. To process the largest 53 GB file, BigRePair requires more than 37 h and over 15 GB of RAM. SOLCA, instead, handles this file in nearly 9 h and less than 5 GB, and BigSOLCA in less than 2 h and 11 GB, being the fastest. What happens is that, being less compressible, the output of the CTPH parse is still too large for RePair, which slows down drastically as soon as it cannot fit its structures in main memory. The much lower memory footprint of SOLCA, instead, pays off on these large and less compressible files, though its compression ratio is worse than that of BigRePair. In the full version of this paper we will investigate applying BigRePair and BigSOLCA recursively, following the strategy mentioned at the end of Sect. 3.
References
Abeliuk, A., Cánovas, R., Navarro, G.: Practical compressed suffix trees. Algorithms 6(2), 319–351 (2013)
Bannai, H., Gagie, T., I, T.: Online LZ77 parsing and matching statistics with RLBWTs. In: CPM, pp. 7:1–7:12 (2018)
Belazzougui, D., Cording, P.H., Puglisi, S.J., Tabei, Y.: Access, rank, and select in grammar-compressed strings. In: ESA, pp. 142–154 (2015)
Bille, P., Ettienne, M.B., Gørtz, I.L., Vildhøj, H.W.: Time-space trade-offs for Lempel-Ziv compressed indexing. In: CPM, pp. 16:1–16:17 (2017)
Bille, P., Landau, G.M., Raman, R., Sadakane, K., Rao, S.S., Weimann, O.: Random access to grammar-compressed strings and trees. SIAM J. Comput. 44(3), 513–539 (2015)
Boucher, C., Gagie, T., Kuhnle, A., Manzini, G.: Prefix-free parsing for building big BWTs. In: WABI, pp. 2:1–2:16 (2018)
Brisaboa, N., Gómez-Brandón, A., Navarro, G., Paramá, J.: Gract: a grammar-based compressed index for trajectory data. Inf. Sci. 483, 106–135 (2019)
Charikar, M., et al.: The smallest grammar problem. IEEE Trans. Inf. Theory 51(7), 2554–2576 (2005)
Christiansen, A.R., Ettienne, M.B.: Compressed indexing with signature grammars. In: Bender, M.A., Farach-Colton, M., Mosteiro, M.A. (eds.) LATIN 2018. LNCS, vol. 10807, pp. 331–345. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-77404-6_25
Claude, F., Fariña, A., Martínez-Prieto, M., Navarro, G.: Universal indexes for highly repetitive document collections. Inf. Sys. 61, 1–23 (2016)
Claude, F., Munro, J.I.: Document listing on versioned documents. In: Kurland, O., Lewenstein, M., Porat, E. (eds.) SPIRE 2013. LNCS, vol. 8214, pp. 72–83. Springer, Cham (2013). https://doi.org/10.1007/978-3-319-02432-5_12
Claude, F., Navarro, G.: Self-indexed grammar-based compression. Fund. Inf. 111(3), 313–337 (2010)
Claude, F., Navarro, G.: Improved grammar-based compressed indexes. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds.) SPIRE 2012. LNCS, vol. 7608, pp. 180–192. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34109-0_19
Furuya, I., Takagi, T., Nakashima, Y., Inenaga, S., Bannai, H., Kida, T.: MR-RePair: grammar compression based on maximal repeats. In: DCC, pp. 508–517 (2019)
Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: LZ77-based self-indexing with faster pattern matching. In: Pardo, A., Viola, A. (eds.) LATIN 2014. LNCS, vol. 8392, pp. 731–742. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-54423-1_63
Jeż, A.: Approximation of grammar-based compression via recompression. Theor. Comput. Sci. 592, 115–134 (2015)
Jeż, A.: A really simple approximation of smallest grammar. Theor. Comput. Sci. 616, 141–150 (2016)
Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Lempel-Ziv parsing in external memory. In: DCC, pp. 153–162 (2014)
Kieffer, J.C., Yang, E.-H.: Grammar-based codes: a new class of universal lossless source codes. IEEE Trans. Inf. Theory 46(3), 737–754 (2000)
Kornblum, J.D.: Identifying almost identical files using context triggered piecewise hashing. Digit. Invest. 3, 91–97 (2006)
Larsson, J., Moffat, A.: Off-line dictionary-based compression. Proc. IEEE 88(11), 1722–1732 (2000)
Lasch, R., Oukid, I., Dementiev, R., May, N., Demirsoy, S.S., Sattler, K.-U.: Fast & strong: The case of compressed string dictionaries on modern CPUs. In: DaMoN, pp. 4:1–4:10 (2019)
Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Trans. Inf. Theory 22(1), 75–81 (1976)
Navarro, G.: Indexing highly repetitive collections. In: Arumugam, S., Smyth, W.F. (eds.) IWOCA 2012. LNCS, vol. 7643, pp. 274–279. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35926-2_29
Nevill-Manning, C.G., Witten, I.H.: Identifying hierarchical structure in sequences: a linear-time algorithm. J. Artif. Intell. Res. 7, 67–82 (1997)
Nishimoto, T., I, T., Inenaga, S., Bannai, H., Takeda, M.: Dynamic index, LZ factorization, and LCE queries in compressed space. CoRR, abs/1504.06954 (2015)
Peng, Y., Leung, H.C.M., Yiu, S.M., Chin, F.Y.L.: IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28(11), 1420–1428 (2012)
Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci. 302(1–3), 211–222 (2003)
Sakamoto, H.: A fully linear-time approximation algorithm for grammar-based compression. J. Discr. Algorithm 3(2–4), 416–430 (2005)
Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. J. ACM 29(4), 928–951 (1982)
Takabatake, Y., I, T., Sakamoto, H.: A space-optimal grammar compression. In: ESA, pp. 67:1–67:15 (2017)
Tridgell, A.: Efficient Algorithms for Sorting and Synchronization. Ph.D. thesis, The Australian National University (1999)
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory IT–23(3), 337–349 (1977)
© 2019 Springer Nature Switzerland AG
Gagie, T., I, T., Manzini, G., Navarro, G., Sakamoto, H., Takabatake, Y. (2019). Rpair: Rescaling RePair with Rsync. In: Brisaboa, N., Puglisi, S. (eds) String Processing and Information Retrieval. SPIRE 2019. Lecture Notes in Computer Science(), vol 11811. Springer, Cham. https://doi.org/10.1007/978-3-030-32686-9_3