1 Introduction

Dictionary compression has proved to be an effective tool to exploit the repetitiveness that most of the fastest-growing datasets feature [24]. Lempel-Ziv (LZ77 for short) [23, 33] stands out as the most popular and effective compression method for repetitive texts. Further, it can be run in linear time and even in external memory [18]. LZ77 has the important drawback, however, that accessing random positions of the compressed text requires, essentially, decompressing it from the beginning. Therefore, it is not suitable for use as a compressed data structure that represents the text in little space while simulating direct access to it. Grammar compression [19] is an alternative that offers better guarantees in this sense. The aim is to build a small context-free grammar (or Straight-Line Program, SLP) that generates (only) the text. The smallest SLP generating a text is always larger than its LZ77 parse, but only by a logarithmic factor that is rarely reached in practice. With an SLP we can access any text substring with only an additive logarithmic time penalty [3, 5], which has led to the development of various self-indexes built on SLPs [4, 9, 12, 13, 15, 26]. Many other, richer queries on sequences have also been supported by associating summary information with the nonterminals of the SLP [1, 2, 5, 7, 10, 11]. There are applications in which SLPs are preferable to LZ77 for other reasons, as well; see, e.g., [22, 25].

Although finding the smallest SLP for a text is NP-complete [8, 28], there are several grammar construction algorithms that guarantee at most a logarithmic blowup over the LZ77 parse [8, 16, 17, 28, 29]. In practice, however, they are sharply outperformed by RePair [21], a heuristic that runs in linear time and obtains grammars of size very close to that of the LZ77 parse in most cases. This has made RePair the compressor of choice to build grammar-based compressed data structures [1, 7, 10, 11]. A serious problem with RePair, however, is that, despite running in linear time and space, in practice the constant of proportionality is high and it can be run only on inputs that are about one tenth of the available memory. This significantly hampers its applicability on large datasets.

In this paper we introduce a scalable SLP compression algorithm that uses space very close to that of RePair and can be applied on very large inputs. We prove a constant-factor approximation guarantee with respect to any SLP construction algorithm to which our technique is applied. Our experimental results show that we can compress a very repetitive 50 GB text in less than an hour, using less than 650 MB of RAM and obtaining very competitive compression ratios.

2 Preliminaries

For the sake of brevity, we assume the reader is familiar with SLPs, LZ77, and the links between the two. To prove theoretical bounds for our approach, we consider a variant of LZ77 in which if S[i..j] is a phrase then either \(i = j\) and S[i] is the first occurrence of a distinct character, or S[i..j] occurs in \(S [1..j - 1]\) and \(S [i..j + 1]\) does not occur in S[1..j]. We refer to this variant as LZSS due to its similarity to Storer and Szymanski’s version of LZ77 [30], even though they allow substrings to be stored as raw text and we do not.
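To make the definition concrete, the following is a naive reference implementation of this LZSS variant. It is only an illustrative sketch (it runs in roughly cubic time, whereas far more efficient constructions exist), and the function name and phrase representation are our own.

def lzss_parse(S):
    # Greedy parse following the definition above: each phrase S[i..j] is either
    # a single character that has not occurred before, or satisfies that S[i..j]
    # occurs in S[1..j-1] while S[i..j+1] does not occur in S[1..j].
    n, i, phrases = len(S), 0, []
    while i < n:
        ell = 0
        # extend the phrase while S[i..i+ell] (0-based) also occurs entirely
        # within the prefix ending just before the phrase's current last character
        while i + ell < n and S[i:i + ell + 1] in S[:i + ell]:
            ell += 1
        if ell == 0:
            phrases.append(('char', S[i]))            # first occurrence of this character
            i += 1
        else:
            src = S[:i + ell - 1].find(S[i:i + ell])  # position of an earlier occurrence
            phrases.append(('copy', src, ell))
            i += ell
    return phrases

# e.g. lzss_parse('abracadabra') yields 8 phrases, the last one copying 'abra' from position 0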

The best-known algorithm for building SLPs is probably RePair [21], for which there are many implementations (see [14] and references therein). It works by repeatedly finding the most common pair of symbols and replacing them with a new non-terminal. Although it is not known to have a good worst-case approximation ratio with respect to the size of the LZ77 parse, in practice it outperforms other constructions. RePair uses linear time and space, but the coefficient in the space bound is quite large, so the standard implementations are practical only on small inputs. A more recent and more space-economical alternative to RePair is SOLCA [31], which we will consider in Sect. 5.
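As an illustration of the pairing step just described, here is a naive sketch of RePair. It is quadratic and uses simplified pair counting; the real algorithm achieves linear time with heavy auxiliary structures, which is precisely where the large memory constant comes from. The symbol numbering and the output format are our own choices.

from collections import Counter

def naive_repair(seq, first_nonterminal=256):
    # seq is a list of integer symbols; returns (final sequence, rules), where
    # rules maps each new non-terminal to the pair of symbols it replaces.
    seq, rules, next_sym = list(seq), {}, first_nonterminal
    while True:
        pairs = Counter(zip(seq, seq[1:]))       # count adjacent pairs (overlaps included)
        if not pairs:
            break
        best, freq = pairs.most_common(1)[0]
        if freq < 2:                             # stop when no pair repeats
            break
        rules[next_sym] = best
        out, i = [], 0
        while i < len(seq):                      # replace left to right, non-overlapping
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(next_sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq, next_sym = out, next_sym + 1
    return seq, rules

# e.g. naive_repair(list(b"abababab")) returns ([257, 257], {256: (97, 98), 257: (256, 256)})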

Context-triggered piecewise hashing (CTPH) is a technique for parsing strings into blocks such that long repeated substrings are parsed the same way (except possibly at the beginning or end of the substrings). The name CTPH seems to be due to Kornblum [20], but the ideas go back to Tridgell’s Rsync [32] and Spamsum (https://www.samba.org/ftp/unpacked/junkcode/spamsum/README): “The core of the spamsum algorithm is a rolling hash similar to the rolling hash used in ‘rsync’. The rolling hash is used to produce a series of ‘reset points’ in the plaintext that depend only on the immediate context (with a default context width of seven characters) and not on the earlier or later parts of the plaintext.”

Specifically, in this paper we choose a rolling hash function and a threshold p, run a sliding window of fixed size w over S and end the current block whenever the window contains a triggering substring, which is a substring of length w whose hash is congruent to 0 modulo p. When we end a block, we shift the window ahead w characters so all the blocks are disjoint and form a parse, which we call the Rsync parse. We call the set of distinct blocks the Rsync dictionary: if the input text contains many repetitions, we expect the dictionary to be much smaller than the text.
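The sketch below illustrates this parsing scheme with a simple Karp-Rabin-style rolling hash. The hash function, base, modulus, and the default values of w and p are illustrative assumptions only, not the choices made by spamsum or by our implementation.

def rsync_parse(S, w=7, p=64, base=256, mod=(1 << 61) - 1):
    # Slide a window of fixed size w over S with a rolling hash; a window whose
    # hash is congruent to 0 modulo p is a triggering substring and ends the
    # current block. The window is then restarted, so the next trigger can occur
    # only w or more characters later and consecutive blocks are disjoint.
    bw = pow(base, w, mod)              # base^w, used to drop the outgoing character
    blocks, start = [], 0
    h, filled = 0, 0                    # hash and number of characters in the window
    for i in range(len(S)):
        h = (h * base + ord(S[i])) % mod
        filled += 1
        if filled > w:                  # remove the character that just left the window
            h = (h - ord(S[i - w]) * bw) % mod
            filled = w
        if filled == w and h % p == 0:  # triggering substring: end the block at i
            blocks.append(S[start:i + 1])
            start = i + 1
            h, filled = 0, 0
    if start < len(S):                  # the last block need not end with a trigger
        blocks.append(S[start:])
    dictionary, parse = {}, []          # Rsync dictionary (distinct blocks) and parse
    for b in blocks:
        parse.append(dictionary.setdefault(b, len(dictionary)))
    return dictionary, parse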

3 Algorithms

Given a string S, we can use Rsync parsing to help build an SLP for S with Algorithm 1 (“Rpair”). The final SLP can be viewed as first generating the parse, then replacing each block ID in the parse by the sublist of non-terminals that generates that block, and finally replacing the sublists by the blocks themselves.

Algorithm 1. Rpair (pseudocode figure)
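A quick way to see the parse/dictionary view described above is to expand the parse back into the text. The snippet below (reusing the rsync_parse sketch from Sect. 2) is only a sanity check of that view; in Algorithm 1 itself the expansion of block IDs happens inside the SLP, via one non-terminal per distinct block.

def expand_parse(dictionary, parse):
    # Invert the block -> id map and replace every block id by its block.
    id_to_block = {bid: block for block, bid in dictionary.items()}
    return ''.join(id_to_block[bid] for bid in parse)

# round-trip check on a repetitive string
S = ('abracadabra' * 1000) + 'x' + ('abracadabra' * 1000)
dictionary, parse = rsync_parse(S)
assert expand_parse(dictionary, parse) == S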

Since each separator character appears only once in D and its parse tree, any non-terminal whose expansion includes a separator character also appears only once and is deleted. Since the parse tree of an SLP is binary and each non-terminal we delete appears only once, the number of distinct non-terminals we delete is at least the length of the list of non-terminals at the roots of the maximal remaining subtrees of the parse tree, minus one. Therefore, creating rules to generate the sublists does not cause the number of distinct non-terminals to grow to more than the number in the original SLP for D, plus one.

Algorithm 1 works with any algorithm for building SLPs for D and P. In Sect. 4 we show that, if we choose an algorithm that builds SLPs for D and P at most an \(\alpha \)-factor larger than their LZ77 parses, then we obtain an SLP an \(O (\alpha )\)-factor larger than the LZ77 parse of S. In the process we will refer to Algorithm 2 (“Rparse”), which produces an LZSS-like parse of S but is intended only to simplify our analysis of Algorithm 1 (not to compete with cutting-edge LZ-based compressors). By “LZSS-like” we mean a parse in which each phrase is either a single character that has not occurred before, or a copy of an earlier substring. We note in passing that, if the parse in Step 3 is still too big for a normal construction, then we can apply Algorithm 1 to it. We will show in the full version of this paper that, if we recurse only a constant number of times, then we worsen our compression bounds by only a constant factor.

Algorithm 2. Rparse (pseudocode figure)

4 Analysis

The main advantage of using Rsync parsing to preprocess S is that Rsync parsing is quite easy to parallelize, apply over streamed data, or apply in external memory. The resulting dictionary and parse may be significantly smaller than S, making it easier to apply grammar-based compression. In the full version of this paper we will analyze how much time and workspace Algorithms 1 and 2 use in terms of the total size of the dictionary and parse, but for now we are mainly concerned with the quality of the compression.

Let b be the number of distinct blocks in the Rsync parse of S, and let z be the number of phrases in the LZ77 parse of S. The first block is obviously the first occurrence of that substring and if S[i..j] is the first occurrence of another block, then \(S [i - w..j]\) (i.e., the block extended backward to include the previous triggering substring) is the first occurrence of that substring. Since the first occurrence of any non-empty substring overlaps or ends at a phrase boundary in the LZ77 parse, we can charge S[i..j] to such a boundary in \(S [i - w..j]\). Since blocks have length at least w and overlap by only w characters when extended backwards, each boundary has the first occurrences of at most two blocks charged to it, so \(b = O (z)\).

In Step 5 of Algorithm 2, we discard O(b) of the phrases in the LZSS parses of D and P when mapping to the phrases in the LZSS-like parse of S. Therefore, by showing that the number of phrases in the LZSS-like parse of S is O(z), we show that the total number of phrases in the LZSS parses of D and P is also \(O (z + b) = O (z)\), so the total number of phrases in their LZ77 parses is O(z) as well.

Lemma 1

If the t-th phrase in the LZSS parse of S is \(S [j..j + \ell - 1]\) then the 5t-th phrase resulting from Algorithm 2, if it exists, ends at or after \(S [j + \ell - 1]\).

Proof

Our claim is trivially true for \(t = 1\), since the first phrases in both parses are the single character S[1], so let t be greater than 1 and assume our claim is true for \(t - 1\), meaning the \(5 (t - 1)\)-th phrase in our parse ends at \(S [k - 1]\) with \(k \ge j\). If \(k \ge j + \ell \) then our claim is also trivially true for t, so assume \(j \le k < j + \ell \). We must show that our parse divides \(S [k..j + \ell - 1]\) into at most five phrases, in order to prove our claim for t.

First suppose that \(S [k..j + \ell - 1]\) does not completely contain a triggering substring, so it overlaps at most two blocks. (It can overlap two blocks without containing a triggering substring if and only if a prefix of length less than w lies in one block and the rest lies in the next block.) Let \(S [i..i + \ell - 1]\) be \(S [j..j + \ell - 1]\)’s source and let \(k' = i + k - j\), so in the LZSS parse \(S [k..j + \ell - 1]\) is copied from \(S [k'..i + \ell - 1]\). Since \(S [k'..i + \ell - 1]\) does not completely contain a triggering substring either, it too overlaps at most two blocks.

Without loss of generality (since the other cases are easier), assume \(S [k..j + \ell - 1]\) and \(S [k'..i + \ell - 1]\) each overlap two blocks and they are split differently: \(S [k..k + d - 1]\) lies in one block and \(S [k + d..j + \ell - 1]\) lies in the next, and \(S [k'..k' + d' - 1]\) lies in one block and \(S [k' + d'..i + \ell - 1]\) in the next, with \(d \ne d'\). Assume also that \(d < d'\), since the other case is symmetric. Since \(S [k..k + d - 1]\) is completely contained in a block and occurs earlier completely contained in a block, as \(S [k'..k' + d - 1]\), our parse does not divide it. Similarly, since \(S [k + d..k + d' - 1]\) and \(S [k + d'..j + \ell - 1]\) are each completely contained in a block and occur earlier each completely contained in a block, as \(S [k' + d..k' + d' - 1]\) and \(S [k' + d'..i + \ell - 1]\), respectively, our parse does not divide them. Therefore, our parse divides \(S [k..j + \ell - 1]\) into at most three phrases.

Now suppose the first and last triggering substrings completely contained in \(S [k..j + \ell - 1]\) are \(S [x..x + w - 1]\) and \(S [y..y + w - 1]\) (possibly with \(x = y\)). By the arguments above, our parse divides \(S [k..x + w - 1]\) into at most three phrases. Since \(S [x + w..y + w - 1]\) is a sequence of complete blocks that have occurred earlier (in \(S [k'..i + \ell - 1]\)), our parse does not divide it unless \(S [k..x + w - 1]\) is a complete block that has occurred before as a complete block, in which case it may divide \(S [k..y + w - 1]\) once between \(S [x + w]\) and \(S [y + w - 1]\). Since \(S [y + w..j + \ell - 1]\) is completely contained in a block and occurs earlier completely contained in a block (in \(S [k'..i + \ell - 1]\)), our parse does not divide it. Therefore, our parse divides \(S [k..j + \ell - 1]\) into at most five phrases.    \(\square \)

We note that we can quite easily reduce the five in Lemma 1, at the cost of complicating our algorithm slightly. We leave a detailed analysis for the full version of this paper.

Corollary 1

Algorithm 2 yields an LZSS-like parse of S with at most five times as many phrases as its LZSS parse.

Proof

If the LZSS parse has t phrases then the t-th phrase ends at S[n] so, by Lemma 1, Algorithm 2 yields a parse with at most 5t phrases.    \(\square \)

Theorem 1

Algorithm 2 yields an LZSS-like parse of S with O(z) phrases.

Proof

It is well known that the LZSS parse of S has at most twice as many phrases as its LZ77 parse (since dividing each LZ77 phrase into a prefix with an earlier occurrence and a mismatch character yields an LZSS-like parse with at most twice as many phrases, and the LZSS parse has the fewest phrases of any LZSS-like parse). Therefore, by Corollary 1, Algorithm 2 yields a parse with O(z) phrases.    \(\square \)

Corollary 2

The LZ77 parses of D and P have O(z) phrases.

Proof

Immediate, from Theorem 1, the fact that the LZ77 parse is no larger than the LZSS parse, and inspection of Algorithm 1.    \(\square \)

Let A be any algorithm that builds an SLP at most an \(\alpha \)-factor larger than the LZ77 parse of its input. For example, with Rytter’s construction [28] we have \(\alpha = O (\log (|S| / z))\).

By Corollary 2, applying A to D—Step 2b in Algorithm 1—yields an SLP for D with \(O (\alpha z)\) rules. As explained in Sect. 3, Steps 2c to 2g then increase the number of rules by at most one while modifying the SLP such that, for each block in the dictionary, there is a non-terminal whose expansion is that block.

Similarly, applying A to P—Step 3—yields an SLP for P with \(O (\alpha z)\) rules. Replacing the terminals in the SLP by the non-terminals generating the blocks and then combining the two SLPs—Steps 4 and 5—yields an SLP for S with \(O (\alpha z)\) rules. This gives us our main result of this section:

Theorem 2

Using A in Steps 2b and 3 of Algorithm 1 yields an SLP for S with \(O (\alpha z)\) rules.

5 Experiments

We use two genome collections in our experiments: cN consists of N concatenated variants of the human chromosome chr19, of about 59 MB each; sN consists of N concatenated variants of salmonella genomes, of widely different sizes.

The chr19 collection was downloaded from the 1000 Genomes Project. Each chr19 sequence was derived by using the bcftools consensus tool to combine the haplotype-specific (maternal or paternal) variant calls for an individual with the chr19 sequence in the GRCh37 human reference. The salmonella genomes were downloaded from NCBI (BioProject PRJNA183844) and preprocessed by assembling each individual sample with IDBA-UD [27], setting kMaxShortSequence to 1024 per public advice from the author to accommodate the longer paired-end reads that modern sequencers produce. More details of the collections are available in previous work [6, Sec. 4].

We compare two grammar compressors: RePair [21] produces the best known compression ratios but uses a lot of main memory, whereas SOLCA [31] aims at optimizing main memory usage. Their versions combined with parallelized CTPH parsing are BigRePair and BigSOLCA. RePair could be run only on the smaller collections. Our experiments ran on an Intel(R) i7-4770 @ 3.40 GHz machine with 32 GB of memory using 8 threads; currently only the CTPH parsing takes advantage of the multiple threads.

For RePair we use Navarro’s implementation for large files, at http://www.dcc.uchile.cl/gnavarro/software/repair.tgz, letting it use 10 GB of main memory, whereas the implementation of SOLCA is at https://github.com/tkbtkysms/solca. To measure their compression ratios in a uniform way, we consider the following encodings of their output: if RePair produces r (binary) rules and an initial rule of length c, we account 2r bits to encode the topology of the pruned parse tree (where the nonterminal ids become the preorder ranks of their internal nodes in this tree) and \((r+c) \lceil \log _2 r \rceil \) bits to encode the leaves of the tree and the initial rule. SOLCA is similar, with \(c=1\). Our code is available at https://gitlab.com/manzai/bigrepair.
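For concreteness, the accounting just described can be computed as follows. The helper names and the sample figures are ours, chosen only to show the units used in Table 1.

import math

def accounted_bits(r, c):
    # 2r bits for the pruned parse-tree topology plus (r + c) * ceil(log2 r)
    # bits for the tree leaves and the initial rule, as described above
    return 2 * r + (r + c) * math.ceil(math.log2(r))

def compression_ratio_percent(r, c, n):
    # compressed size as a percentage of the n-byte uncompressed input
    return 100 * accounted_bits(r, c) / (8 * n)

# hypothetical example: 10^6 rules and an initial rule of length 10^5
# on a 1 GB input give a compression ratio of 0.3%
print(compression_ratio_percent(10**6, 10**5, 10**9))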

Table 1. Performance of the compressors. File sizes are expressed in GB, compression ratios as a percentage of the compressed over the uncompressed file size, compression times in seconds per input GB, and main memory usage during compression in MB per input GB.

Table 1 shows the results in terms of compression ratio, time, and space in RAM. On the more repetitive chr19 genomes, BigRePair is clearly the best choice for large files. It loses to RePair in compression ratio, but RePair takes 11 h just to process 5.5 GB, so it is not an option for larger files. Instead, BigRePair processes 55 GB in about 20 min and 6.5 GB of RAM. Similarly, SOLCA obtains better compression than BigSOLCA but takes more compression time, though the latter uses more space. Comparing the two compressor families shows that BigRePair outperforms both SOLCA and BigSOLCA in both compression ratio (reaching nearly half the compressed size of SOLCA on the largest files) and time (half the time of BigSOLCA). Still, SOLCA uses much less space: it compresses 55 GB in 3.6 h, but using less than 750 MB.

The results start similarly on the less compressible salmonella collection, but, as the size of the input grows, there are significant differences. The time of BigRePair on chr19 was stable at around 2 GB per minute, but on salmonella it is not: when moving from 10 GB to 20 GB of input data, the time per processed GB of BigRePair jumps by a factor of 3.6, and when moving from 20 GB to 50 GB it jumps by a factor of more than 10. To process the largest 53 GB file, BigRePair requires more than 37 h and over 15 GB of RAM. SOLCA, instead, handles this file in nearly 9 h and less than 5 GB, and BigSOLCA in less than 2 h and 11 GB, being the fastest. What happens is that, since this collection is less compressible, the output of the CTPH parsing is still too large for RePair, which slows down drastically as soon as it cannot fit its structures in main memory. The much lower memory footprint of SOLCA, instead, pays off on these large and less compressible files, though its compression ratio is worse than that of BigRePair. In the full version of this paper we will investigate applying BigRePair and BigSOLCA recursively, following the strategy mentioned at the end of Sect. 3.